26. Building models from databases

Because our research interests tend towards creating statistical models trained from real speech data, Festival offers several kinds of support for extracting information from speech databases in a form suitable for building models.

Models for accent prediction, F0 generation, duration, vowel reduction, homograph disambiguation, phrase break assignment and unit selection have been built using Festival to extract and process various databases.

26.1 Labelling databases

For Festival to use a database, it is most useful to build an utterance structure for each utterance in the database. As discussed earlier, utterance structures contain relations of items. Given such a structure for each utterance in a database, we can easily read in the utterance representation, access it, and dump information in a normalised way that makes building and testing models easy.

Of course, the level of labelling that exists for a particular database, or that you are willing to add by hand or with some automatic tool, will vary. For many purposes you will at least need phonetic labelling. Hand-labelled data is still better than auto-labelled data, but that could change. The size and consistency of the data are important too.

For this discussion we will assume labels for segments, syllables, words, phrases, intonation events and pitch targets. Some of these can be derived; some need to be labelled. The process would not fail with less labelling, but of course you wouldn’t be able to extract as much information from the result.

In our databases these labels are in Entropic’s Xlabel format, though it is fairly easy to convert any reasonable format.
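As a rough illustration of what reading such a label file involves, the sketch below parses Xlabel-style text into (end time, label) pairs. It assumes the common layout of a header terminated by a line containing only "#", followed by lines of the form "end_time colour label"; the function name read_xlabel is hypothetical and is not part of Festival.

```python
from typing import List, Tuple


def read_xlabel(text: str) -> List[Tuple[float, str]]:
    """Parse Xlabel-style text into (end_time, label) pairs.

    Assumed layout: optional header lines, a line containing only "#",
    then one "end_time colour label" entry per line.
    """
    lines = text.splitlines()
    body_start = 0
    for i, line in enumerate(lines):
        if line.strip() == "#":
            body_start = i + 1  # entries start after the header terminator
            break
    labels = []
    for line in lines[body_start:]:
        fields = line.split(None, 2)
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        end_time = float(fields[0])
        # fields[1] is the colour code, which is ignored here.
        label = fields[2].strip() if len(fields) > 2 else ""
        labels.append((end_time, label))
    return labels
```

Converting another label format then amounts to producing the same list of end times and labels.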

Segment
These give phoneme labels for files. Note that these labels must be members of the phoneset that you will be using for this database. Often phone label files contain extra labels (e.g. beginning and end silence) which are not really part of the phoneset. You should remove (or re-label) these phones accordingly.

Word
Again these will need to be provided. The end of the word should come at the last phone in the word (or just after). Pauses/silences should not be part of the word.

Syllable
There is a chance these can be automatically generated from Word and Segment files given a lexicon. Ideally these should include lexical stress.

IntEvent
These should ideally mark accent/boundary tone type for each syllable, but this almost definitely requires hand-labelling. Also, given that hand-labelling of accent type is harder and not as accurate, it is arguable whether anything other than accented vs. non-accented can be used reliably.

Phrase
This could just mark the last non-silence phone in each utterance, or the last phone before any silence phones in the whole utterance.

Target
This can be automatically derived from an F0 file and the Segment files. A marking of the mean F0 in each voiced phone seems to give adequate results.
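The derivation of pitch targets from an F0 track and segment labels can be sketched as follows. This is only an illustration under stated assumptions: segments are given as (end time, label) pairs in time order, the F0 track is a list of (time, F0) samples, and unvoiced frames are marked with an F0 of 0. The function name mean_f0_per_segment is hypothetical.

```python
from typing import List, Optional, Tuple


def mean_f0_per_segment(
    segments: List[Tuple[float, str]],
    f0_track: List[Tuple[float, float]],
) -> List[Tuple[str, Optional[float]]]:
    """Return (label, mean F0) per segment, or None where no voiced frame falls in it."""
    results = []
    start = 0.0
    for end, label in segments:
        # Keep only voiced frames (F0 > 0) that lie inside this segment.
        voiced = [f0 for t, f0 in f0_track if start <= t < end and f0 > 0.0]
        mean = sum(voiced) / len(voiced) if voiced else None
        results.append((label, mean))
        start = end
    return results
```

Segments with a mean of None (silences, unvoiced phones) would simply receive no target.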

Once these files are created, an utterance file can be automatically created from the above data. Note that it is pretty easy to get the streams right, but getting the relations between the streams right is much harder. Firstly, labelling is rarely accurate, and small windows of error must be allowed to ensure things line up properly. The second problem is that some label files identify point-type information (IntEvent and Target) while others identify segments (e.g. Segment, Word etc.). Relations have to know this in order to get it right. For example, it is not right for all syllables between two IntEvents to be linked to the IntEvent; only the Syllable the IntEvent falls within should be.
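The distinction between point labels and segment labels can be illustrated with a small sketch that finds the single syllable containing an intonation event. This is not the code used by Festival; the function name, the representation of syllables by their end times, and the 10 ms tolerance are all assumptions chosen for illustration.

```python
from typing import List, Optional


def syllable_for_event(
    event_time: float,
    syllable_ends: List[float],
    tolerance: float = 0.01,
) -> Optional[int]:
    """Index of the syllable containing event_time, or None if it lies outside all of them."""
    start = 0.0
    for i, end in enumerate(syllable_ends):
        if start <= event_time < end:
            return i  # a point event belongs to exactly one syllable
        start = end
    # Allow for a small labelling error just past the final syllable boundary.
    if syllable_ends and syllable_ends[-1] <= event_time <= syllable_ends[-1] + tolerance:
        return len(syllable_ends) - 1
    return None
```

An event is linked to one syllable only, never to every syllable lying between it and the next event.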

The script ‘festival/examples/make_utts’ is an example Festival script which automatically builds the utterance files from the above labelled files.

By default the script assumes a hierarchy in a database directory of the following form. Under a directory ‘festival/’, where all Festival-specific database information can be kept, a directory ‘relations/’ contains a subdirectory for each basic relation (e.g. ‘Segment/’, ‘Syllable/’, etc.), each of which contains the basic label files for that relation.

The following command will build a set of utterance structures (including building the relations that link these basic relations).

make_utts -phoneset radio festival/relations/Segment/*.Segment

This will create utterances in ‘festival/utts/’. There are a number of options to ‘make_utts’; use ‘-h’ to list them. The ‘-eval’ option allows extra Scheme code to be loaded, which may be called by the utterance building process. The function make_utts_user_function will be called on every utterance created. Redefining it in database-specific loaded code allows database-specific fixes to be applied to the utterances.

26.2 Extracting features

The easiest way to extract features from a labelled database of the form described in the previous section is by loading in each of the utterance structures and dumping the desired features.

Using the same mechanism to extract the features as will eventually be used by models built from the features has the important advantage of avoiding spurious errors that are easily introduced when collecting data. For example, a feature such as n.accent in a Festival utterance will be defined as 0 when there is no next accent. Extracting all the accents and using an external program to calculate the next accent may make a different decision, so that when the generated model is used a different value for this feature will be produced. Such mismatches between training and actual use are unfortunately common, so using the same mechanism to extract data for training and for actual use is worthwhile.
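The point about shared conventions can be made concrete with a small sketch. It is not Festival code: items are modelled as plain dictionaries and the function name is hypothetical. What matters is that one definition of "next item's feature, or 0 if there is none" serves both for training data and at run time.

```python
from typing import Dict, List


def next_item_feature(items: List[Dict[str, str]], index: int, name: str) -> str:
    """Value of feature `name` on the item after `index`, or "0" if there is no such item."""
    if index + 1 < len(items):
        # A next item that lacks the feature also yields "0".
        return items[index + 1].get(name, "0")
    return "0"
```

An external tool that returned, say, an empty string or "NONE" for the last item would produce training values the model never sees in use.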

The recommended method for extracting features is the Festival script ‘dumpfeats’. It takes a list of feature names and a list of utterance files and dumps the desired features.

Features may be dumped into a single file or into separate files, one for each utterance. Feature names may be specified on the command line or in a separate file. Extra code to define new features may be loaded too.

For example, suppose we wanted to save, for a set of utterances, the duration, phone name, and previous and next phone names of all segments in each utterance.

dumpfeats -feats "(segment_duration name p.name n.name)" \
          -output feats/%s.dur -relation Segment \
          festival/utts/*.utt

This will save these features in the directory ‘feats/’, in files named after the utterances they come from. The argument to ‘-feats’ is treated as a literal list only if it starts with a left parenthesis; otherwise it is treated as the name of a file containing the feature names (unbracketed).

Extra code (for new feature definitions) may be loaded through the ‘-eval’ option. If the argument to ‘-eval’ starts with a left parenthesis, it is treated as an s-expression rather than a filename and is evaluated. If the argument to ‘-output’ contains "%s", that will be filled in with the utterance’s filename; if it is a simple filename, the features from all utterances will be saved in that same file. The features for each item in the named relation are saved on a single line.

26.3 Building models

This section describes how to build models from data extracted from databases as described in the previous section. It uses the CART building program ‘wagon’, which is available in the speech tools distribution. The data is, however, also suitable for many other types of model building techniques, such as linear regression or neural networks.

Wagon is described in the speech tools manual, though we will cover simple use here. To use Wagon you need a datafile and a data description file.

A datafile consists of a number of vectors, one per line, each containing the same number of fields. This, not coincidentally, is exactly the format produced by ‘dumpfeats’, described in the previous section. The data description file describes the fields in the datafile and their range. Fields may be of any of the following types: class (a list of symbols), float, or ignored. Wagon will build a classification tree if the first field (the predictee) is of type class, or a regression tree if the first field is a float. An example data description file would be:

(
( duration float )
( name # @ @@ a aa ai au b ch d dh e e@ ei f g h i i@ ii jh k l m n 
    ng o oi oo ou p r s sh t th u u@ uh uu v w y z zh )
( n.name # @ @@ a aa ai au b ch d dh e e@ ei f g h i i@ ii jh k l m n 
    ng o oi oo ou p r s sh t th u u@ uh uu v w y z zh )
( p.name # @ @@ a aa ai au b ch d dh e e@ ei f g h i i@ ii jh k l m n 
    ng o oi oo ou p r s sh t th u u@ uh uu v w y z zh )
( R:SylStructure.parent.position_type 0 final initial mid single )
( pos_in_syl float )
( syl_initial 0 1 )
( syl_final 0 1 )
( R:SylStructure.parent.R:Syllable.p.syl_break 0 1 3 )
( R:SylStructure.parent.syl_break 0 1 3 4 )
( R:SylStructure.parent.R:Syllable.n.syl_break 0 1 3 4 )
( R:SylStructure.parent.R:Syllable.p.stress 0 1 )
( R:SylStructure.parent.stress 0 1 )
( R:SylStructure.parent.R:Syllable.n.stress 0 1 )
)

The script ‘speech_tools/bin/make_wagon_desc’ goes some way towards helping. Given a datafile and a file containing the field names, it will construct an approximation of the description file. This file should still be edited, as all fields are treated as being of type class by ‘make_wagon_desc’ and you may want to change some of them to float.
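To show what drafting a description file involves, here is a minimal sketch that is independent of ‘make_wagon_desc’ and does not reproduce its behaviour. It lists the observed values for each class field and marks the fields you name as float. The function name, its arguments, and the enclosing pair of parentheses around the whole description (which reflects our understanding that Wagon reads the description as a single list) are assumptions.

```python
from typing import Iterable, List, Sequence


def draft_description(
    field_names: Sequence[str],
    rows: List[Sequence[str]],
    float_fields: Iterable[str] = (),
) -> str:
    """Draft description-file text from field names and data rows.

    Fields named in float_fields are declared float; every other field is
    declared as a class listing the values observed in its column.
    """
    floats = set(float_fields)
    entries = []
    for col, name in enumerate(field_names):
        if name in floats:
            entries.append(f"( {name} float )")
        else:
            values = sorted({row[col] for row in rows})
            entries.append("( " + " ".join([name] + values) + " )")
    return "(\n" + "\n".join(entries) + "\n)\n"
```

A class field drafted this way only lists values seen in the data given, so it should be built from all the data, not just the training portion.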

The datafile must be a single file, although the process described in the previous section may have created a separate feature file for each utterance. From a list of file ids, select, say, 80% as training data and concatenate the corresponding feature files into a single datafile. The remaining 20% may be concatenated as test data.
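The split can be sketched as follows. This is one possible approach, not a prescribed one: the function names are hypothetical, and the seeded shuffle is an assumption made so that the same split can be reproduced later.

```python
import random
from typing import Iterable, List, Tuple


def split_file_ids(
    file_ids: Iterable[str], train_fraction: float = 0.8, seed: int = 0
) -> Tuple[List[str], List[str]]:
    """Shuffle the ids reproducibly and split them into (train, test) lists."""
    ids = sorted(file_ids)
    random.Random(seed).shuffle(ids)
    cut = int(len(ids) * train_fraction)
    return ids[:cut], ids[cut:]


def concatenate(paths: Iterable[str], out_path: str) -> None:
    """Write the contents of each feature file, in order, to one datafile."""
    with open(out_path, "w") as out:
        for path in paths:
            with open(path) as feature_file:
                out.write(feature_file.read())
```

Splitting by file id rather than by line keeps all the vectors from one utterance on the same side of the split.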

To build a tree use a command like

wagon -desc DESCFILE -data TRAINFILE -test TESTFILE

The minimum cluster size (default 50) may be reduced using the command line option -stop plus a number.

Varying the features and stop size may improve the results.
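Trying several stop sizes is easy to script. The sketch below only builds the argument lists, using the options named in this section (-desc, -data, -test and -stop); the function name and the file names in the usage note are hypothetical.

```python
from typing import Iterable, List


def wagon_commands(
    desc_file: str, train_file: str, test_file: str, stop_sizes: Iterable[int]
) -> List[List[str]]:
    """Build one wagon argument list per candidate minimum cluster size."""
    return [
        ["wagon", "-desc", desc_file, "-data", train_file,
         "-test", test_file, "-stop", str(stop)]
        for stop in stop_sizes
    ]
```

Each list could then be passed to subprocess.run, for example for the commands returned by wagon_commands("dur.desc", "dur.train", "dur.test", [50, 20, 10]), and the reported test results compared by hand.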

Building the models and getting good figures is only one part of the process. You must integrate the model into Festival if it is going to be of any use. CART trees generated by Wagon are supported directly by Festival. CART trees that predict zscores, or factors to modify duration averages, can also be used as is.
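For readers unfamiliar with these two kinds of predictee, the arithmetic for turning a predicted value back into a duration can be sketched as below. It assumes the usual definitions: a zscore is (duration - mean) / standard deviation for the phone concerned, and a factor multiplies the phone's average duration. The function names are hypothetical.

```python
def duration_from_zscore(zscore: float, mean: float, stddev: float) -> float:
    """Invert z = (duration - mean) / stddev for one phone's statistics."""
    return mean + zscore * stddev


def duration_from_factor(factor: float, mean: float) -> float:
    """Scale a phone's average duration by a predicted factor."""
    return factor * mean
```

In both cases the per-phone averages (and, for zscores, standard deviations) must come from the same data that was used to compute the training values.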

Note there are other options to Wagon which may help build better CART models. Consult the chapter in the speech tools manual on Wagon for more information.

Other parts of the distributed system use CART trees and linear regression models that were trained using the processes described in this chapter. Some other parts of the distributed system use CART trees which were written by hand and may be improved by properly applying these processes.

This document was generated by Alan W Black on December 2, 2014 using texi2html 1.82.