

8 Building models from databases

Festival is intended to be a framework within which you can experiment with new synthesis techniques. One of the major directions we are pursuing is the automatic training of models from data in databases of natural speech. This chapter describes some of the functions Festival provides to make data extraction and model building from databases easy.

This chapter is split into four sections: collecting databases, labelling, extracting features, and building models from the data.

8.1 Collecting databases

Getting the right type of data is important when we are going to be using its content to build models. Future work in Festival is likely to allow building of waveform synthesizers from databases of natural speech from a single speaker. These unit selection techniques capture the properties of the database itself, so it is important that the database is of the right form. A database of isolated words may provide clearly articulated synthesized speech, but it is going to sound like isolated words, even when used for the synthesis of continuous speech. Therefore if you wish your synthesizer to read news stories it is better if your database contains the same sort of data, i.e. readings of news stories.

It is not just the acoustic properties of the database that are important; the prosodic properties matter too. If you are going to use your synthesizer to produce a wide range of intonational tunes, your database should have a reasonable number of examples of these if you wish any model to be properly trained from the data. Prosodic rules may be hand written, or trained models hand-modified, to cover phenomena that are rare or absent in your database (e.g. phrase final rises), but if unit selection synthesis is being used those prosodic models will then be predicting variation that is not present in the database, and hence unit selection will not be optimal. Thus in unit selection synthesis prosodic models should, where possible, be trained (or at least parameterized) from the same database that is used for unit selection.

The following specific points are worth considering when designing a database.

phonetic coverage
This is probably the most important criterion. Note that it is not quite as simple as it sounds. Some phonemes are particularly rare, especially when they become effectively optional in labelling. For example syllabic l, m and n may or may not be used depending on the phoneme set selected. Often syllabic consonants are included in the phone set but are only occasionally labelled in the database itself, and their rarity makes it difficult to resynthesize them. Of course you will need at least one occurrence of every phone in your database, but many more, in different contexts, are desirable. Note that there is a distinction between phonetically balanced and phonetically rich databases. The first offers phones in approximately the distribution in which they appear in normal speech, while the second biases the distribution towards the rarer combinations. Diphone databases typically go for phonetically rich in that they explicitly ensure occurrences of each phone in all (left) contexts. This measure is probably too strict for unit selection type databases; going for phonetically balanced, but with a reasonable number of the least common phones, is probably better as it offers more natural speech.
dialect
It is useful to have your speaker's dialect match your lexicon, as lexicons are difficult to reliably alter. This effectively limits databases to major dialects. Hopefully this will change with more generalized lexicons but at present better results will be attained on more standard dialects.
voice quality
Selecting a "good" speaker will make your synthesis better. "Good" unfortunately cannot yet be fully characterized. However many of the qualities people associate with good, clear speakers are exactly the properties that make it easier to do unit selection, and build prosodic models. Consistency is probably the most important characteristic. Fast, unusually widely varying speech is going to be harder to capture with models than clear consistent speech. That is, a synthesizer build from Patrick Stewart's voice is likely to be better than one built from Jim Carey's. Realistically be careful not to set your sights too high and hope to synthesize "character" voices. They are probably still too difficult to properly model. It is a lot of work to record and label a database so think carefully about the speaker's voice before you embark on the task.
prosodic coverage
A database consisting of just isolated words will carry the inherent properties of isolated spoken words no matter what tricks you apply to your model building: durations are typically longer, phones are more articulated and there are fewer intonational tunes. Sentences are better, but they will still lack the varying prosody that is used when reading longer passages. Isolated sentences (e.g. the TIMIT 460) lack interesting prosodic phrasing and show little intonation variation (typically no continuation rises, emphasis or phrase final rises). Reading short paragraphs is closer to the ideal, and including quoted dialogue will also help the variation. Also, common phrases and greetings often have their own very subtle properties that are very difficult to synthesize. Typically synthesis of "hello" is never as good as "The quick brown fox jumped over the lazy dog." Including common greetings (in proper context) is therefore worthwhile if you wish to synthesize such forms.
size
Obviously the larger the database the better, but only within reason. The larger the database, the more time is needed to record it, and given multiple recording sessions the quality may differ between sessions, making the data less useful for unit selection. Such recording differences are less important for building prosodic models, however, so more data can only help (colds, hangovers, etc. notwithstanding). Also remember that the more data you collect the more you will need to label. A minimum size for unit selection is probably something like the 200 CSTR phonetically balanced sentences. These consist of about 10 minutes of speech (about 9,000 phones). This is really too small to train duration and intonation models from, though it can be used to parameterize models trained on larger data sets. The TIMIT 460 phonetically balanced sentences are probably a more acceptable option (around 15,000 phones). We are likely to use this as a base database size for at least some of our future experiments, because copies already exist and it is a clearly defined set that others can access to record their own databases. As mentioned above it does have prosodic limitations, but for the time being we can live with these. Better prosodic coverage may be found in databases like Boston University's Radio News Corpus ostendorf95. These are single speaker databases of short news stories. The f2b database consists of about 45 minutes of speech (about 40,000 phones), while f3a consists of almost 120 minutes (about 100,000 phones). The text used to build these databases is unfortunately under copyright, so you will need to find your own text to use. Depending on future developments in Festival, we hope to release explicit recommendations for the sizes and content of databases and their relative trade-offs.

Ideally the database should be tuned for the requirements of the synthesizer voice; it should therefore provide the phonetic and prosodic coverage, discussed above, that the voice will be expected to produce.

Overall, around an hour of speech should be sufficient. Pruning methods are likely to allow reduction of this for unit selection, but the whole database can be used for training prosodic models.

Many of the techniques used to join and manipulate the units selected from a database require pitch marking, so it is best if a laryngograph is used during the recording session. A laryngograph records impedance across the vocal folds, which can be used to find the pitch marks in a signal more accurately. A head mounted microphone should also be used. Because of the amount of resources required to record, label and tune a database to build a speech waveform synthesizer, care should be taken to record the highest quality signal in the best surroundings. We are still some way from getting people to talk into a cheap far-field microphone on their PC while the TV plays in the background and then successfully building a high quality synthesizer from it.
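If a laryngograph signal is recorded, pitch marks can be extracted from it, for example with the `pitchmark' program distributed with the Edinburgh Speech Tools. The following is only a sketch: the file names are placeholders, and the option names and values are assumptions that should be checked against the Speech Tools manual and tuned to the speaker's pitch range.

    # sketch only: derive pitch marks from a laryngograph file (options are assumptions)
    pitchmark lar/utt01.lar -o pm/utt01.pm -otype est \
        -min 0.005 -max 0.012 -fill -def 0.01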

8.2 Labelling databases

In order for Festival to use a database it is most useful to build utterance structures for each utterance in the database. As we discussed earlier, utterance structures contain streams of ordered items, with relations between these items. Given such a structure we can easily read in the full utterance structure and access it, dumping information in a normalised way that allows easy building and testing of models.

Of course the level of labelling that exists, or that you are willing to do by hand or with some automatic tool, will vary from database to database. For many purposes you will at least need phonetic labelling. Hand labelled data is still better than auto-labelled data, but that could change. The size and consistency of the data is important too, though further issues regarding that subject are dealt with in the next chapter.

In all, for this example we will need labels for segments, syllables, words, phrases, intonation events and pitch targets. Some of these can be derived automatically; others need to be labelled by hand.

These files are assumed to be stored in a directory `festival/relations/' with a file extension identifying which utterance relation they represent. They should be in Entropic's Xlabel format, though it is fairly easy to convert any reasonable format to the Xlabel format.
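For reference, an Xlabel file consists of a short ASCII header terminated by a line containing only `#', followed by one label per line giving an end time in seconds, a colour field and the label itself. The times and labels below are purely illustrative.

    separator ;
    nfields 1
    #
        0.2900  26  h
        0.3800  26  @
        0.5200  26  l
        0.7000  26  ou
        0.9800  26  #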

Segment
These give the phoneme labels for each file. Note that these labels must be members of the phoneset that you will be using for this database. Often phone label files contain extra labels (e.g. beginning and end silence) which are not really part of the phoneset; you should remove (or relabel) these phones accordingly.
Word
Again these will need to be provided. The end of the word should come at the last phone in the word (or just after). Pauses/silences should not be part of the word.
Syllable
There is a chance these can be automatically generated from the Word and Segment files, given a lexicon. Ideally they should include lexical stress.
IntEvent
These should ideally mark accent/boundary tone type for each syllable, but this almost certainly requires hand labelling. Also, given that hand labelling of accent type is often not very accurate, it is arguable whether anything more detailed than accented vs. non-accented can be used reliably.
Phrase
This could just mark the last non-silence phone in each utterance, or the phone before any silence in the utterance. The script `make_Phrase' marks a phrase break before all non-continuous silences and at the end of utterances, based on the Segment files.
Target
This can be automatically derived from an F0 file and the Segment files. Marking the mean F0 in each voiced phone seems to give adequate results. The script `make_Target' will do this, assuming Segment and F0 files. A slight modification of this script will also generate the F0 files, using `pda' (which is included with the Edinburgh Speech Tools) or ESPS's `get_f0'; a sketch of such F0 extraction is given below. Better results may well be possible by using more appropriate parameters to the pitch extraction program.
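For example, F0 files might be generated along the following lines; the directory layout is illustrative, and the pitch extraction parameters (the speaker's pitch range in particular) should be set according to the Speech Tools manual rather than left at their defaults.

    # sketch only: create an F0 file for each waveform (paths and options are assumptions)
    for f in festival/wav/*.wav
    do
        b=`basename $f .wav`
        pda $f -o festival/f0/$b.f0
    done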

Once these files are created, an utterance file can be automatically created from the above data. Note it is pretty easy to get the simple list relations right, but building the relations between these simple lists is a little harder. Firstly, labelling is rarely exact, and small windows of error must be allowed to ensure things line up properly. Secondly, some label files identify point type information (IntEvent and Target) while others identify segments (e.g. Segment, Word, etc.), and the relations have to take this into account. For example it is not correct to link all the syllables between two IntEvents to the latter IntEvent; the IntEvent should be linked only to the syllable it actually falls within.

The script `make_utts' automatically builds the utterance files from the above labelled files.

This script will generate an utterance file for each example file in the database; these can be loaded into Festival and used either for "natural synthesis" or to extract training and test data for building models.
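A quick way to check that a generated utterance file is well formed is to load it into Festival and inspect one of its relations. The file names below are placeholders, and the function names should be checked against your Festival version; this is only a sketch.

    # load an utterance file and print the number of items in its Segment relation
    festival -b '(begin
        (set! utt1 (utt.load nil "festival/utts/utt01.utt"))
        (print (length (utt.relation.items utt1 (quote Segment)))))'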

8.3 Extracting features

The easiest way to extract features from a labelled database of the form described in the previous section is to load in each of the utterance structures and dump the desired features.

Using the same mechanism to extract the features as will eventually be used by models built from those features has the important advantage of avoiding spurious errors that are easily introduced when collecting data. For example a feature such as n.accent in a Festival utterance will be defined as 0 when there is no next accent. Extracting all the accents and using an external program to calculate the next accent may make a different decision, so that when the generated model is used a different value for this feature will be produced. Such mismatches between training and actual use are unfortunately common, so using the same mechanism to extract data for training as for actual use is worthwhile.

The Festival function utt.features takes an utterance, a relation name and a list of desired features as arguments. This function can be used to dump the desired features for each item in a desired relation in each utterance.

The script `festival/examples/dumpfeats' gives Scheme code to dump features. It takes arguments identifying the relation to be dumped, the desired features, and the utterances to dump them from. The results may be dumped into a single file or into a set of files based on the names of the utterances. Arbitrary other code may also be included to add new features.
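For example, duration data for every segment might be dumped with something like the following. The exact option names and the `%s' output template are assumptions to be checked against the script itself, and `duration.featnames' is a hypothetical file listing the desired features one per line.

    # dump the named features for each Segment item, one feature file per utterance
    festival/examples/dumpfeats -relation Segment -feats duration.featnames \
        -output feats/%s.feats festival/utts/*.utt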

8.4 Building models

This section describes how to build models from data extracted from databases as described in the previous section. It uses the CART building program `wagon', which is available with the Edinburgh Speech Tools Library, but the data is suitable for many other types of model building technique, such as linear regression or neural networks.

Wagon is described in the Speech Tools Library Manual, though we will cover simple use here. To use Wagon you need a datafile and a data description file.

A datafile consists of a number of vectors, one per line, each containing the same number of fields. This, not coincidentally, is exactly the format produced by the `dumpfeats' script described in the previous section. The data description file describes the fields in the datafile and their range. Fields may be of any of the following types: class (a list of symbols), float, or ignored. Wagon will build a classification tree if the first field (the predictee) is of type class, or a regression tree if the first field is a float. An example data description file would be

(
( segment_duration float )
( name # @ @@ a aa ai au b ch d dh e e@ ei f g h i i@ ii jh k l m n 
    ng o oi oo ou p r s sh t th u u@ uh uu v w y z zh )
( n.name # @ @@ a aa ai au b ch d dh e e@ ei f g h i i@ ii jh k l m n 
    ng o oi oo ou p r s sh t th u u@ uh uu v w y z zh )
( p.name # @ @@ a aa ai au b ch d dh e e@ ei f g h i i@ ii jh k l m n 
    ng o oi oo ou p r s sh t th u u@ uh uu v w y z zh )
( R:SylStructure.parent.position_type 0 final initial mid single )
( pos_in_syl float )
( syl_initial 0 1 )
( syl_final 0 1 )
( R:SylStructure.parent.R:Syllable.p.syl_break 0 1 3 )
( R:SylStructure.parent.syl_break 0 1 3 4 )
( R:SylStructure.parent.R:Syllable.n.syl_break 0 1 3 4 )
( R:SylStructure.parent.R:Syllable.p.stress 0 1 )
( R:SylStructure.parent.stress 0 1 )
( R:SylStructure.parent.R:Syllable.n.stress 0 1 )
)

The script `COURSEDIR/bin/make_wgn_desc' goes some way towards helping you build a Wagon description file. Given a datafile and a file containing the field names, it will construct an approximation of the description file. This file should still be edited, as all fields are treated as of type class by `make_wgn_desc' and you may want to change some of them to float.
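The argument order below is an assumption and should be checked against the script's usage message; the idea is simply that it reads the datafile and a list of field names and writes a first draft of the description file, which you then edit by hand.

    # sketch only: assumed argument order is datafile, field-name file, output description
    COURSEDIR/bin/make_wgn_desc train.data duration.featnames dur.desc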

The data file must be a single file, although we created a number of feature files by the process described in the previous section. From a list of file ids select, say, 80% of them as training data and cat them into a single datafile. The remaining 20% may be catted together as test data, for example as sketched below.
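One simple way to do the split, assuming the per-utterance feature files are in a `feats/' directory as in the earlier sketch, is to hold out every fifth file as test data:

    # list the file ids, holding out every fifth one as test data
    ls feats | sed 's/\.feats$//' > files.list
    awk 'NR%5!=0' files.list > train.list
    awk 'NR%5==0' files.list > test.list
    # concatenate the corresponding feature files into single datafiles
    cat `sed 's/^/feats\//;s/$/.feats/' train.list` > train.data
    cat `sed 's/^/feats\//;s/$/.feats/' test.list`  > test.data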

To build a tree use a command like

wagon -desc DESCFILE -data TRAINFILE -test TESTFILE

The minimum cluster size (default 50) may be reduced using the command line option -stop plus a number.

Varying the features and stop size may improve the results.

You can also try -stepwise, which will look for the best features incrementally, testing the result on the test set, as in the example below.
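For example, a smaller stop value can be combined with stepwise feature selection (the file names here are just placeholders):

wagon -desc dur.desc -data train.data -test test.data -stop 20 -stepwise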

Building the models and getting good figures is only one part of the process; you must integrate the model into Festival if it is going to be of any use. In the case of CART trees generated by Wagon, Festival supports these directly. In the case of CART trees predicting zscores, or factors by which to modify duration averages, such trees can be used as is.

Other parts of the distributed system use CART trees and linear regression models that were trained using the processes described in this chapter. Some other parts of the distributed system use CART trees which were written by hand and may be improved by properly applying these processes.

8.5 Exercises

These exercises ideally require a labelled database, but an example one is provided in `COURSEDIR/gsw/' to make the exercises possible without requiring hours of phonetic labelling.

  1. From the given Segment, Word, Syllable and IntEvent files, create the Phrase and Target files. Then generate your own version of the utterance files. If you have an alternative database, try to build as many of these files as you can and build utterance files for each file in your database.
  2. Either for your own database or from the utterances provided, extract data from which we will build duration models. As a start, extract the following features
    segment_duration
    name 
    p.name
    n.name
    R:SylStructure.parent.position_type
    pos_in_syl
    syl_initial
    syl_final
    R:SylStructure.parent.R:Syllable.p.syl_break
    R:SylStructure.parent.syl_break
    R:SylStructure.parent.R:Syllable.n.syl_break
    R:SylStructure.parent.R:Syllable.p.stress
    R:SylStructure.parent.stress
    R:SylStructure.parent.R:Syllable.n.stress
    
    for each segment in the database. Split this data into train and test data and build a CART tree from the training data and test it against the test data. Add any other features you think useful (especially the ph_ features), to see if you can get better results.
  3. Try the same prediction again but use zscores, and/or log durations and see what the relative results are.

8.6 Hints

  1. For new databases, this is not necessarily easy and may require quite a lot of hand checking, but the results are worth it.
  2. You will need to use wagon for this. wagon requires a description file and a datafile. The data file is as dumped by dumpfeats. The description file must be written by hand, or you can use `speech_tools/bin/make_wgn_desc' to give an approximation. To run `wagon' use a command like
    wagon -desc DESCFILE -data TRAINDATA -test TESTDATA
    
    The default minimum cluster size is 50; try varying it using the option -stop to get better results.
  3. You will have to write simple Unix scripts to collect averages and calculate zscores for each feature vector in your datafile; one way to do this is sketched below. Remember that RMSE and correlation are fundamentally different in different domains and so cannot be directly compared; you would need to convert the data back into the same domain to compare different prediction methods properly.
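The following awk sketch assumes, as in the exercise above, that the segment duration is field 1 and the phone name is field 2 of each vector; phones that occur only once will get a zero standard deviation, so in practice rare phones need special handling.

    # collect per-phone duration mean and standard deviation from the training data
    awk '{ n[$2]++; s[$2] += $1; ss[$2] += $1*$1 }
         END { for (p in n) {
                 m = s[p]/n[p];
                 sd = sqrt(ss[p]/n[p] - m*m);
                 print p, m, sd } }' train.data > phone.stats

    # rewrite the datafile with the duration field replaced by its zscore
    awk 'NR==FNR { m[$1]=$2; sd[$1]=$3; next }
         { $1 = ($1 - m[$2]) / sd[$2]; print }' phone.stats train.data > train.zscore.data

Running the second command again with `test.data' in place of `train.data' converts the test set using the same training statistics.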

