Prosody Walkthrough

This section gives a walkthrough of a set of basic scripts that can be used to build duration and F0 models. The results will be reasonable, but as the scripts are designed to be language independent, more specifically tailored models will almost certainly give better results. We have used these methods when building diphone voices for new languages when we knew almost nothing explicit about the language structure. This walkthrough explicitly covers most of the major steps and hence will be useful as a basis for building new, better models.

In many ways this process is similar to the limited domain voice building process. Here we design a set of prompts which are believed to cover the prosody that we wish to model, record and label the data, and then build models from the utterances built from the natural speech. In fact the basic structure uses the limited domain scripts for the initial part of the process.

The basic stages of this task are

Design database

The object here is to capture enough speech in the prosodic style that you wish your synthesizer to use. Note that as prosodic modeling is still an extremely difficult area, all models are extremely impoverished (especially the very simple models we are presenting here), so do not be too ambitious. However it is worthwhile considering whether you wish to model dialog (i.e. conversational speech) or prose (i.e. read speech). Prose can be news-reader style or story-telling style. Most synthesizers are trained on news-reader style because it is fairly consistent and believed to be easier to model, and reading paragraphs of text is seen as a basic application for text to speech synthesizers. However, today with more dialog systems, such prosodic models are often not as appropriate.

Ideally your database will be marked up with prosodic tagging that your voice talent will understand and be able to deliver appropriately. Designing such a database isn't easy, but when starting off in a new language anything may be better than fixed durations and a naive declining F0. Thus even a simple list of 500 sentences from newspapers may give rise to better models than none at all.

Suppose you have your 500 sentences; construct a prompt list as is done with the limited domain construction. That is, you need a file of the form:

( sent_0001 "She had your dark suit in greasy washwater all year.")
( sent_0002 "Don't make me carry an oily rag like that.")
( sent_0003 "They wanted to go on a barge trip.")
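Formatting such a file by hand is tedious, so it is usually generated from a plain sentence list. A minimal sketch (the sentences and the "sent" prefix are only illustrative; this is not a festvox script):

```python
# Sketch: format a list of sentences as a festvox-style prompt file,
# one parenthesized ( id "text") entry per line.
def make_prompt_lines(sentences, prefix="sent"):
    """Return lines of the form: ( sent_0001 "text")"""
    lines = []
    for i, text in enumerate(sentences, start=1):
        # Any double quotes inside the text itself would need escaping.
        lines.append('( %s_%04d "%s")' % (prefix, i, text.replace('"', '\\"')))
    return lines

if __name__ == "__main__":
    for line in make_prompt_lines([
            "She had your dark suit in greasy washwater all year.",
            "Don't make me carry an oily rag like that.",
            "They wanted to go on a barge trip."]):
        print(line)
```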

Setup directory structure

As with the rest of the festvox tools, you need to set the following two environment variables for them to work properly. In bash or other Bourne-shell compatibles, type the following, with the appropriate pathnames for your installation of the Edinburgh Speech Tools and Festvox itself.

export FESTVOXDIR=/home/awb/projects/festvox
export ESTDIR=/home/awb/projects/speech_tools

For csh and its derivatives you should type

setenv FESTVOXDIR /home/awb/projects/festvox
setenv ESTDIR /home/awb/projects/speech_tools

As the basic structure is so similar to the limited domain building structure, you should first follow all of that setup procedure. If you are building prosodic models for an already existing limited domain voice then you do not need this part.

mkdir cmu_timit_awb
cd cmu_timit_awb
$FESTVOXDIR/src/ldom/setup_ldom cmu timit awb

The arguments are, institution, domain type, and speaker name.

After setting this up you also need to set up the extra directories and scripts needed to build prosody models. This is done by the command


You should copy your database files as created in the previous section into etc/.

Synthesizing prompts

We then synthesize the prompts. As we are trying to collect natural speech, these prompts should not normally be presented to the voice talent, as they may then copy the synthesizer's intonation, which would almost certainly be a bad thing. As this will sometimes be the first serious use of a new diphone synthesizer in a new language (with impoverished prosody models), it is important to check that the prompts can be generated phonetically correctly. This may require more additions to the lexicon and/or more token-to-word rules. We synthesize the prompts for two reasons. First, to use for autolabeling, in that the synthesized prompts will be aligned using DTW against what the speaker actually says. Second, we are trying to construct festival utterance structures for each utterance in this database, with natural durations and F0, so we may learn from them.
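The DTW alignment itself works on per-frame acoustic features (typically cepstral vectors); a minimal sketch of the algorithm on one-dimensional sequences, purely for illustration (this is not the festvox implementation):

```python
def dtw(a, b, dist=lambda x, y: abs(x - y)):
    """Classic dynamic time warping: return total cost and alignment path
    pairing indices of sequence a with indices of sequence b."""
    n, m = len(a), len(b)
    INF = float("inf")
    cost = [[INF] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost[i][j] = dist(a[i - 1], b[j - 1]) + min(
                cost[i - 1][j],      # step in a only
                cost[i][j - 1],      # step in b only
                cost[i - 1][j - 1])  # step in both
    # Trace back the cheapest path from the end.
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        i, j = min([(i - 1, j), (i, j - 1), (i - 1, j - 1)],
                   key=lambda p: cost[p[0]][p[1]])
    return cost[n][m], list(reversed(path))
```

Applied to the synthesized prompt and the natural recording, the path tells us which natural frames correspond to each synthesized phone, which is what the autolabeler exploits.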

You should change the line setting the "closest" voice

(set! cmu_timit_awb::closest_voice 'voice_kal_diphone)

This is in the file festvox/cmu_timit_awb_ldom.scm. This is the voice that will be used to synthesize the prompts. Often this will be your new diphone voice.

Ideally we would like these utterances to also have natural phone sequences, such that schwas, allophones such as flaps, and post-lexical rules have been applied. At present we do not include that here, though for more serious prosody modeling such phenomena should be included in the utterance structures.

The prompts can be synthesized by the command

festival -b festvox/build_ldom.scm '(build_prompts "etc/")'

Recording the prompts

The usual caveats apply to recording, (the Section called Recording under Unix in the Chapter called Basic Requirements) and the issues on selecting a speaker.

As prosody modeling is difficult, and especially if you are inexperienced in building such models, it is wise not to attempt anything hard. Just building reliable models for default unmarked intonation is very useful if your current models are simply the default fixed intonation. Thus the sentences should be read in a natural but not too varied style.

Recording can be done with pointyclicky or prompt_them. If you are using prompt_them, you should modify that script so that it does not play the prompts, as they will confuse the speaker. The speaker should simply read the text (and markup, if present).

pointyclicky etc/


bin/prompt_them etc/

Phonetically label prompts

After recording, the spoken utterances must be labeled

bin/make_labs prompt-wav/*.wav

This is one of the computationally expensive parts of the process, and for longer sentences it can require a lot of memory too.

After autolabeling it is always worthwhile to inspect the labels and correct mistakes. Phrasing can particularly cause problems, so adding or deleting silences can make the derived prosody models much more accurate. You can use emulabel to do this.

emulabel etc/emu_lab

Extract pitchmarks and F0

At this point we diverge from the process used for building limited domain synthesizers. You can still construct such synthesizers from the same recordings, perhaps because you want more appropriate prosodic models for the fallback synthesizer. But at this point we need to extract the pitchmarks in a slightly different way. We are intending to extract F0 contours for all non-silence parts of the speech signal. We do this by extracting pitchmarks for the voiced sections alone, then (in the next section) interpolating the F0 through the non-voiced (but non-silence) sections.

The Section called Extracting pitchmarks from waveforms in the Chapter called Basic Requirements discusses the setting of parameters to get bin/make_pm_wave to work for a particular voice. In this case we need those same parameters (which should be found by experiment). These should be copied from bin/make_pm_wave and added to bin/make_f0_pm in the variable PM_ARGS. The distribution contains something like

PM_ARGS='-min 0.0057 -max 0.012 -def 0.01 -wave_end -lx_lf 140 -lx_lo 111 -lx_hf 80 -lx_ho 51 -med_o 0'

Importantly, this differs from the parameters in bin/make_pm_wave in that we do not use the -fill option to fill in pitchmarks over the rest of the waveform.

The second part of this section is the construction of an F0 contour, which is built from the extracted pitchmarks. Unvoiced speech sections are assigned an F0 contour by interpolation from the voiced sections around them, and the result is smoothed. The label files are used to define which parts of the signal are silence and which are speech.
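That interpolate-then-smooth idea can be sketched as follows, assuming a frame-based F0 track where unvoiced frames are 0 and a parallel silence mask derived from the labels (all names hypothetical; the real work is done by the scripts described here):

```python
def fill_f0(f0, is_silence, smooth_win=3):
    """Interpolate F0 through unvoiced (f0 == 0) non-silence frames,
    then smooth with a simple moving average.  Silence frames stay 0."""
    f0 = list(f0)
    voiced = [i for i, v in enumerate(f0) if v > 0]
    for i in range(len(f0)):
        if f0[i] > 0 or is_silence[i]:
            continue
        # Nearest voiced frames on either side of this unvoiced frame.
        left = max((j for j in voiced if j < i), default=None)
        right = min((j for j in voiced if j > i), default=None)
        if left is None and right is None:
            continue
        if left is None:
            f0[i] = f0[right]
        elif right is None:
            f0[i] = f0[left]
        else:
            w = (i - left) / (right - left)
            f0[i] = f0[left] * (1 - w) + f0[right] * w
    # Moving-average smoothing, skipping silence frames.
    half = smooth_win // 2
    out = []
    for i in range(len(f0)):
        if is_silence[i]:
            out.append(0.0)
            continue
        vals = [f0[j]
                for j in range(max(0, i - half), min(len(f0), i + half + 1))
                if not is_silence[j]]
        out.append(sum(vals) / len(vals))
    return out
```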

The variable SILENCE in bin/make_f0_pm must be modified to reflect the symbol used for silence in your phoneset.

Once the pitchmark parameters have been determined, and the appropriate SILENCE value set, you can extract the smoothed F0 with the command

bin/make_f0_pm wav/*.wav

You can view the F0 contours with the command

emulabel etc/emu_f0

Build utterance structures

With the labels and F0 created, we can now rebuild the utterance structures by synthesizing the prompts and merging in the natural durations and F0 from the naturally spoken utterances.

festival -b festvox/build_ldom.scm '(build_utts "etc/")'

Duration models

The script bin/make_dur_model contains all of the following commands, but it is wise to understand the stages, as due to errors in labeling it may not all run completely smoothly and small fixes may be required.

We are building a duration model using a CART tree to predict zscore values for phones. Zscores (number of standard deviations from the mean) have often been used in duration modeling as they allow a certain amount of normalization over different phones.

You should first look at the script bin/make_dur_model and edit the following three variable values

SILENCENAME=...
VOICENAME=...
MODELNAME=...

These should contain the name for silence in your phoneset, the call for the voice you are building the model for (or at least one that uses the same phoneset), and the name for the model, which can be the same INST_LANG_VOX part of the voice you call.

The first stage is to find the means and standard deviations for each phone. A script in the festival distribution is used to load in all the utterances and calculate these values, with the command

durmeanstd -output festival/dur/etc/durs.meanstd festival/utts/*.utt

You should check the generated file, festival/dur/etc/durs.meanstd, to ensure that the numbers look reasonable. If there is only one example of a particular phone, the standard deviation cannot be calculated and the value is given as nan (not-a-number). This must be changed to a standard numeric value (say one-third of the mean). Also, some of the values in this table may be adversely affected by bad labeling, so you may wish to hand modify the values, or go back and correct the labeling.
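The underlying computation, and the suggested nan fix, can be sketched as follows (a simplified stand-in for the distributed script, with hypothetical data):

```python
import math

def phone_stats(durations):
    """durations: list of (phone, seconds) pairs.
    Returns {phone: (mean, std)}.  Phones with a single example get
    std = mean / 3, mirroring the suggested fix for nan entries."""
    by_phone = {}
    for ph, d in durations:
        by_phone.setdefault(ph, []).append(d)
    stats = {}
    for ph, ds in by_phone.items():
        mean = sum(ds) / len(ds)
        if len(ds) > 1:
            var = sum((d - mean) ** 2 for d in ds) / (len(ds) - 1)
            std = math.sqrt(var)
        else:
            std = mean / 3.0   # no defined std for a single example
        stats[ph] = (mean, std)
    return stats

def zscore(dur, mean, std):
    """Number of standard deviations this duration is from the mean."""
    return (dur - mean) / std
```

These are the statistics the duration model trains against: each phone's observed duration is converted to a zscore before tree building.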

The next stage is to extract the features from which we will predict the durations. The list of features extracted is in festival/dur/etc/dur.feats. These cover phonetic context, syllable, word position, etc. These may or may not be appropriate for your new language or domain, and you may wish to add to them before doing the extraction. The extraction process takes each phone and dumps the named feature values for that phone into a file, using the standard festival script dumpfeats. The command looks like

$DUMPFEATS -relation Segment -eval $VOICENAME \
      -feats festival/dur/etc/dur.feats \
      -output festival/dur/feats/%s.feats \
      -eval festival/dur/etc/logdurn.scm \
      festival/utts/*.utt

These feature files are then concatenated into a single file which is then split (90/10) into training and test sets. The training set is further split for use as a held-out test set in the training phase. Also at this stage we remove all silence phones from the training and test sets. This is, perhaps naively, because the distribution of silences is very wide, and files often contain silences at the start and end of utterances which themselves aren't part of the speech content (they're just the edges); having these in the training set can skew the results.

This is done by the commands

cat festival/dur/feats/*.feats | \
      awk '{if ($2 != "'$SILENCENAME'") print $0}' >festival/dur/data/
bin/traintest festival/dur/data/
bin/traintest festival/dur/data/

For wagon, the CART tree builder, to work it needs to know what possible values each feature can take. This can mostly be determined automatically, but some features may have values that could be either numeric or classes; thus we use a post-processing function on the automatically generated description file to get our desired result.

$ESTDIR/bin/make_wagon_desc festival/dur/data/ \
       festival/dur/etc/dur.feats festival/dur/etc/dur.desc
festival -b --heap 2000000 festvox/build_prosody.scm \
      $VOICENAME '(build_dur_feats_desc)'

Now we can build the model itself. A key factor in the time this takes (and the accuracy of the model) is the "stop" value, that is, the number of examples that must exist before a split is searched for. The smaller this number, the longer the search will be, though up to a certain point the more accurate the model will be. But at some level this will overtrain. The default in the distribution is 50, which may or may not be appropriate. Note, for large databases and for smaller values of STOP, the training may take days, even on a fast processor.

Although we have guessed a reasonable value for this for databases of around 50-1000 utterances it may not be appropriate for you.

The learning technique used is basic CART tree growing, but with an important extension which makes the process much more robust on unseen data, though unfortunately much more computationally expensive. The -stepwise option on wagon incrementally searches for the best features to use in building the tree, in addition to finding, at each iteration, the best questions about each feature that best model the data. If you want a quicker result, removing the -stepwise option will give you that.

The basic wagon command is

wagon -data festival/dur/data/ \
      -desc festival/dur/etc/dur.desc \
      -test festival/dur/data/ \
      -stop $STOP \
      -output festival/dur/tree/$PREF.S$STOP.tree \

To test the results on data not used in the training we use the command

wagon_test -heap 2000000 -data festival/dur/data/ \
           -desc festival/dur/etc/dur.desc \
           -tree festival/dur/tree/$PREF.S$STOP.tree

Interpreting the results isn't easy in isolation. The smaller the RMSE (root mean squared error) the better, and the larger the correlation the better (it should never be greater than 1, though if your model is very bad it can fall below 0). For English, with this script on a TIMIT database, we get an RMSE value of 0.81 and a correlation of 0.58 on the test data. Note these values are not in the absolute domain (i.e. seconds); they are in the zscore domain.
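For reference, the two scores follow the standard definitions (this sketch is not wagon_test's code, just the textbook formulas it reports):

```python
import math

def rmse(pred, actual):
    """Root mean squared error between predicted and actual zscores."""
    n = len(pred)
    return math.sqrt(sum((p - a) ** 2 for p, a in zip(pred, actual)) / n)

def correlation(pred, actual):
    """Pearson correlation coefficient between predictions and truth."""
    n = len(pred)
    mp = sum(pred) / n
    ma = sum(actual) / n
    cov = sum((p - mp) * (a - ma) for p, a in zip(pred, actual))
    sp = math.sqrt(sum((p - mp) ** 2 for p in pred))
    sa = math.sqrt(sum((a - ma) ** 2 for a in actual))
    return cov / (sp * sa)
```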

The final stage, probably after a number of iterations of the build process, is to package the model into a scheme file that can be used with a voice. This scheme file contains the means and standard deviations (so we can convert the predicted values back into seconds) and the prediction tree itself. We also add in predictions for the silence phone by hand. The command to generate this is

festival -b --heap 2000000 \
         festvox/build_prosody.scm $VOICENAME \
         '(finalize_dur_model "'$MODELNAME'" "'$PREF.S$STOP.tree'")'

This will generate a file festvox/cmu_timit_awb_dur.scm. If your model name is the same as the basic diphone voice you intend to use it in, you can simply copy this file to the festvox/ directory of your diphone voice and it will automatically work. But it is worth explaining what this install process really is. The duration model scheme file contains two lisp expressions setting the variables MODELNAME::phone_durs and MODELNAME::zdurtree. To use these in a voice you must load this file, typically by adding

(require 'MODELNAME_dur)

to the diphone voice definition file (festvox/MODELNAME_diphone.scm), and then get the voice definition to use these new variables. This is done by the following commands in the voice definition function

  ;; Duration prediction
  (set! duration_cart_tree MODELNAME::zdurtree)
  (set! duration_ph_info MODELNAME::phone_durs)
  (Parameter.set 'Duration_Method 'Tree_ZScores)
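At synthesis time the Tree_ZScores method predicts a zscore from the tree and converts it back to seconds using the stored per-phone statistics. Conceptually (with hypothetical numbers):

```python
def zscore_to_seconds(z, mean, std):
    """Invert a zscore prediction: duration = mean + z * std, where
    (mean, std) are the phone's stored statistics (MODELNAME::phone_durs)."""
    return mean + z * std

# e.g. a phone with mean 0.15s, std 0.05s, predicted zscore 1.0
# comes out one standard deviation longer than average.
```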

F0 contour models

(what about accents ?)

extract features for prediction
build feature description files
build regression model to predict F0 at start, mid and end of syllable
construct scheme file with F0 model