Festival is intended to be a framework within which you can experiment with new synthesis techniques. One of the major directions we are moving in is the automatic training of models from databases of natural speech. This chapter describes some of the functions Festival provides to make data extraction and model building from such databases easy.
This chapter is split into four sections: collecting databases, labelling, extracting features, and building models from the data.
Getting the right type of data is important when we are going to use its content to build models. Future work in Festival is likely to allow building of waveform synthesizers from databases of natural speech from a single speaker. These unit selection techniques capture the properties of the database itself, so it is important that the database is of the right form. A database of isolated words may provide clearly articulated synthesized speech, but it is going to sound like isolated words, even when used for the synthesis of continuous speech. Therefore if you wish your synthesizer to read news stories it is better if your database contains the same sort of data, i.e. the reading of news stories.
It is not just the acoustic properties of the database that are important; the prosodic properties matter too. If you are going to use your synthesizer to produce a wide range of intonational tunes, your database should contain a reasonable number of examples of these if you wish any model to be properly trained from the data. Prosodic rules may be hand written, or trained models hand-modified, to cover phenomena that are rare or absent in your database (e.g. phrase final rises); but if unit selection synthesis is being used, such prosodic models will be predicting variation that is not present in the database, and hence unit selection will not be optimal. Thus in unit selection synthesis prosodic models should, where possible, be trained (or at least parameterized) from the same database that is used for unit selection.
The following specific points are worth considering when designing a database. Ideally the database should be tuned to the requirements of the synthesizer voice: it should cover the phonetic and prosodic phenomena the voice will be expected to produce.
Overall, around an hour of speech should be sufficient. Pruning methods are likely to allow this to be reduced for unit selection, while the whole database can still be used for training prosodic models.
Many of the techniques used to join and manipulate the units selected from a database require pitch marking, so it is best if a laryngograph is used during the recording session. A laryngograph records impedance across the vocal folds, which can be used to find the pitch marks within a signal more accurately. A head mounted microphone should also be used. Because of the amount of resources required to record, label and tune a database to build a speech waveform synthesizer, care should be taken to record the highest quality signal in the best surroundings. We are still some way from getting people to talk into a cheap far-field microphone on their PC while the TV plays in the background and then successfully building a high quality synthesizer from it.
In order for Festival to use a database it is most useful to build utterance structures for each utterance in the database. As discussed earlier, utterance structures contain streams of ordered items, with relations between these items. Given such a structure we can easily read in the full utterance and access it, dumping information in a normalised way that allows for easy building and testing of models.
Of course the level of labelling that exists, or that you are willing to do by hand or with some automatic tool, will vary from database to database. For many purposes you will at least need phonetic labelling. Hand labelled data is still better than auto-labelled data, though that could change. The size and consistency of the data are important too; further issues regarding these are dealt with in the next chapter.
In all, for this example we will need labels for: segments, syllables, words, phrases, intonation events and pitch targets. Some of these can be derived; others need to be labelled.
These files are assumed to be stored in a directory `festival/relations/' with a file extension identifying which utterance relation they represent. They should be in Entropic's Xlabel format, though it is fairly easy to convert any reasonable format to the Xlabel format.
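For illustration, an Xlabel file consists of an optional header terminated by a line containing only a hash, followed by one label per line giving an end time (in seconds), a colour number and the label itself. A hypothetical fragment of a `.Segment' file (the times and phones here are invented) might look like

    separator ;
    nfields 1
    #
        0.3980  26  pau
        0.5120  26  h
        0.5730  26  @
        0.6410  26  l
        0.7200  26  ou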
Once these files are created, an utterance file can be automatically built from the above data. Note it is pretty easy to get the simple list relations right, but building the relations between these simple lists is a little harder. Firstly, labelling is rarely exact, so small windows of error must be allowed to ensure things line up properly. Secondly, some label files identify point type information (IntEvent and Target) while others identify segments (e.g. Segment, Word etc.). The relation-building code has to know this in order to get the linking right. For example, it is not correct to link every syllable between two IntEvents to the latter IntEvent; an IntEvent should be linked only to the syllable within which it falls.
The script `make_utts' automatically builds the utterance files from the labelled files described above.
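Exact arguments vary between versions of the script, so check its usage message; as a hypothetical example, an invocation driven by the Segment label files might look like

    make_utts festival/relations/Segment/*.Segment

with the resulting `.utt' files written to a directory such as `festival/utts/'.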
This script will generate an utterance file for each example file in the database, which can be loaded into Festival and used either to do "natural synthesis" or to extract training and test data for building models.
The easiest way to extract features from a labelled database of the form described in the previous section is to load in each of the utterance structures and dump the desired features.
Using the same mechanism to extract the features as will eventually be used by models built from those features has the important advantage of avoiding spurious errors that are easily introduced when collecting data. For example, a feature such as n.accent in a Festival utterance will be defined as 0 when there is no next accent. Extracting all the accents and using an external program to calculate the next accent may make a different decision, so that when the generated model is used a different value for this feature will be produced. Such mismatches between training and actual use are unfortunately common, so using the same mechanism to extract data for training as for actual use is worthwhile.
The Festival function utt.features takes as arguments an utterance, a relation name and a list of desired features. This function can be used to dump the desired features for each item in the given relation in each utterance.
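For example, from the Festival prompt (the utterance filename here is invented):

    festival> (set! utt1 (utt.load nil "festival/utts/news001.utt"))
    festival> (utt.features utt1 'Segment '(segment_duration name p.name n.name))

This returns the requested feature values for each item in the Segment relation of the loaded utterance.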
The script `festival/examples/dumpfeats' gives Scheme code to dump features. It takes arguments identifying the relation to be dumped, the desired features, and the utterances to dump them from. The results may be dumped into a single file or into a set of files based on the names of the utterances. Arbitrary other Scheme code may also be included to define new features.
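A typical invocation might look like the following (the feature list and file names are illustrative; check the script's usage message for the exact options):

    dumpfeats -relation Segment \
       -feats '(segment_duration name p.name n.name)' \
       -output feats/%s.feats \
       festival/utts/*.utt

Here `%s' in the output name is filled in with each utterance's file id, giving one feature file per utterance; a plain filename instead would collect everything into a single file.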
This section describes how to build models from data extracted from databases as described in the previous section. It uses the CART building program `wagon', which is distributed with the Edinburgh Speech Tools Library, though the data is also suitable for many other model building techniques, such as linear regression or neural networks.
Wagon is described in the Speech Tools Library Manual, though we will cover simple use here. To use Wagon you need a datafile and a data description file.
A datafile consists of a number of vectors, one per line, each containing the same number of fields. This, not coincidentally, is exactly the format produced by the `dumpfeats' script described in the previous section. The data description file describes the fields in the datafile and their ranges. Fields may be of any of the following types: class (a list of symbols), float, or ignored. Wagon will build a classification tree if the first field (the predictee) is of type class, or a regression tree if the first field is a float. An example data description file would be
( ( segment_duration float )
  ( name # @ @@ a aa ai au b ch d dh e e@ ei f g h i i@ ii jh k l m n ng
         o oi oo ou p r s sh t th u u@ uh uu v w y z zh )
  ( n.name # @ @@ a aa ai au b ch d dh e e@ ei f g h i i@ ii jh k l m n ng
         o oi oo ou p r s sh t th u u@ uh uu v w y z zh )
  ( p.name # @ @@ a aa ai au b ch d dh e e@ ei f g h i i@ ii jh k l m n ng
         o oi oo ou p r s sh t th u u@ uh uu v w y z zh )
  ( R:SylStructure.parent.position_type 0 final initial mid single )
  ( pos_in_syl float )
  ( syl_initial 0 1 )
  ( syl_final 0 1 )
  ( R:SylStructure.parent.R:Syllable.p.syl_break 0 1 3 )
  ( R:SylStructure.parent.syl_break 0 1 3 4 )
  ( R:SylStructure.parent.R:Syllable.n.syl_break 0 1 3 4 )
  ( R:SylStructure.parent.R:Syllable.p.stress 0 1 )
  ( R:SylStructure.parent.stress 0 1 )
  ( R:SylStructure.parent.R:Syllable.n.stress 0 1 ) )
The script `COURSEDIR/bin/make_wgn_desc' goes some way towards helping you build a Wagon description file. Given a datafile and a file containing the field names, it will construct an approximation of the description file. This file should still be edited by hand, as all fields are treated as type class by `make_wgn_desc' and you may want to change some of them to float.
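Hypothetically, assuming the script takes the datafile, the field name file and the output description file in that order (check the script itself for its exact calling convention), it might be run as

    make_wgn_desc dur.data dur.featnames dur.desc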
The datafile must be a single file, although we created a number of feature files by the process described in the previous section. From a list of file ids select, say, 80% of them as training data and cat the corresponding feature files into a single datafile. The remaining 20% may be catted together as test data.
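A minimal sketch of this in shell, assuming the per-utterance feature files live in `feats/' and the file ids have already been split into `train.list' and `test.list':

    # concatenate the per-utterance feature files into single train/test datafiles
    for f in `cat train.list`; do cat feats/$f.feats; done > dur.train.data
    for f in `cat test.list`;  do cat feats/$f.feats; done > dur.test.data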
To build a tree use a command like
wagon -desc DESCFILE -data TRAINFILE -test TESTFILE
The minimum cluster size (default 50) may be reduced using the command line option -stop plus a number.
Varying the features and stop size may improve the results.
Also you can try -stepwise, which will look for the best features incrementally, testing the result on the test set.
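For example, with the files built above (the names are illustrative), a smaller stop value combined with stepwise feature selection could be tried as

    wagon -desc dur.desc -data dur.train.data -test dur.test.data \
          -stop 20 -stepwise -output dur.tree

where -output names the file the resulting tree is written to.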
Building the models and getting good figures is only one part of the process; you must integrate the model into Festival if it is going to be of any use. In the case of CART trees generated by Wagon, Festival supports these directly. In the case of CART trees predicting zscores, or factors by which to modify duration averages, such trees can be used as is.
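As a sketch, a duration tree predicting zscores might be installed from the Festival prompt like this (the tree filename is invented, and per-phone duration means and standard deviations must also have been set up for the zscores to be converted back to durations):

    festival> (set! duration_cart_tree (car (load "dur.tree" t)))
    festival> (Parameter.set 'Duration_Method 'Tree_ZScores)

Loading a file with load's second argument set to t returns its contents unevaluated, so the car here is the tree itself.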
Other parts of the distributed system use CART trees and linear regression models that were trained using the processes described in this chapter. Some other parts of the distributed system use CART trees which were written by hand and might be improved by properly applying these processes.
These exercises ideally require a labelled database, but an example one is provided in `COURSEDIR/gsw/' to make the exercises possible without requiring hours of phonetic labelling.
Dump the following features

segment_duration
name
p.name
n.name
R:SylStructure.parent.position_type
pos_in_syl
syl_initial
syl_final
R:SylStructure.parent.R:Syllable.p.syl_break
R:SylStructure.parent.syl_break
R:SylStructure.parent.R:Syllable.n.syl_break
R:SylStructure.parent.R:Syllable.p.stress
R:SylStructure.parent.stress
R:SylStructure.parent.R:Syllable.n.stress

for each segment in the database. Split this data into train and test data, build a CART tree from the training data, and test it against the test data. Add any other features you think useful (especially the ph_ features) to see if you can get better results.
Use wagon for this. wagon requires a description file and a datafile. The datafile is as dumped by dumpfeats. The description file must be written by hand, or you can use `speech_tools/bin/make_wgn_desc' to give an approximation.
To run `wagon' use a command like
wagon -desc DESCFILE -data TRAINDATA -test TESTDATA

The default minimum cluster size is 50; try varying it using the option -stop to get better results.