Building prosodic models
This section gives a walkthrough of a set of basic scripts that can be used to build duration and F0 models. The results will be reasonable, but the scripts are designed to be language independent and hence more appropriate models will almost certainly give better results. We have used these methods when building diphone voices for new languages when we know almost nothing explicit about the language structure. This walkthrough, however, explicitly covers most of the major steps and hence will be useful as a basis for building new, better models.
In many ways this process is similar to the limited domain voice building process. Here we design a set of prompts which are believed to cover the prosody that we wish to model, we record and label the data, and then build models from the utterances built from the natural speech. In fact the basic structure for this uses the limited domain scripts for the initial part of the process.
The basic stages of this task are
Design database
Setup directory structure
Synthesizing prompts (for labeling)
Recording prompts
Phonetically label prompts
Extract pitchmarks and F0 contour
Build utterance structures
For duration models
extract means and standard deviations of phone durations
extract features for prediction
build feature description files
Build regression model to predict durations
construct scheme file with duration model
For F0 models
extract features for prediction
build feature description files
Build regression model to predict F0 at start, mid and end of syllable
construct scheme file with F0 model
The object here is to capture enough speech in the prosodic style that you wish your synthesizer to use. Note that as prosodic modeling is still an extremely difficult area, all models are extremely impoverished (especially the very simple models we are presenting here), thus do not be too ambitious. However it is worthwhile considering whether you wish dialog (i.e. conversational speech) or prose (i.e. read speech). Prose can be news reader style or story telling style. Most synthesizers are trained on news reader style because it is fairly consistent and believed to be easier to model, and reading paragraphs of text is seen as a basic application for text to speech synthesizers. However today, with more dialog systems, such prosodic models are often not as appropriate.
Ideally your database will be marked up with prosodic tagging that your voice talent will understand and be able to deliver appropriately. Designing such a database isn't easy, but when starting off in a new language anything may be better than fixed durations and a naive declining F0. Thus simply a list of 500 sentences from newspapers may give rise to better models than none at all.
Suppose you have your 500 sentences; construct a prompt list as is done with the limited domain construction. That is, you need a file of the form
( sent_0001 "She had your dark suit in greasy washwater all year.")
( sent_0002 "Don't make me carry an oily rag like that.")
( sent_0003 "They wanted to go on a barge trip.")
...
As with the rest of the festvox tools, you need to set the following two environment variables for them to work properly. In bash or other Bourne shell compatibles type, with the appropriate pathnames for your installation of the Edinburgh Speech Tools and Festvox itself

export FESTVOXDIR=/home/awb/projects/festvox
export ESTDIR=/home/awb/projects/speech_tools

For csh and its derivatives you should type

setenv FESTVOXDIR /home/awb/projects/festvox
setenv ESTDIR /home/awb/projects/speech_tools

As the basic structure is so similar to the limited domain building structure, you should first do all of that setup procedure. If you are building prosodic models for an already existing limited domain voice then you do not need this part.

mkdir cmu_timit_awb
cd cmu_timit_awb
$FESTVOXDIR/src/ldom/setup_ldom cmu timit awb

The arguments are institution, domain type, and speaker name.
After setting this up you also need to set up the extra directories and scripts needed to build prosody models. This is done by the command
$FESTVOXDIR/src/prosody/setup_prosody
You should copy your database file as created in the previous section into etc/.
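For example, if the prompt file built in the previous section is named timit.data (the name assumed by the commands below), copying it in might look like this (adjust the source path to wherever you created the file):

cp /path/to/timit.data etc/timit.data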
We then synthesize the prompts. As we are trying to collect natural speech, these prompts should not normally be presented to the voice talent, as they may then copy the synthesizer's intonation, which would almost certainly be a bad thing. As this will sometimes be the first serious use of a new diphone synthesizer in a new language (with impoverished prosody models), it is important to check that the prompts can be generated phonetically correctly. This may require more additions to the lexicon and/or more token to word rules. We synthesize the prompts for two reasons. First, to use for autolabeling, in that the synthesized prompts will be aligned using DTW against what the speaker actually says. Second, we are trying to construct festival utterance structures for each utterance in this database, with natural durations and F0, so we may learn from them.
You should change the line setting the "closest" voice

(set! cmu_timit_awb::closest_voice 'voice_kal_diphone)

This is in the file festvox/cmu_timit_awb_ldom.scm. This is the voice that will be used to synthesize the prompts. Often this will be your new diphone voice.
Ideally we would like these utterances to also have natural phone sequences, such that schwas, allophones such as flaps, and post-lexical rules have been applied. At present we do not include that here, though for more serious prosody modeling such phenomena should be included in the utterance structures.
The prompts can be synthesized by the command
festival -b festvox/build_ldom.scm '(build_prompts "etc/timit.data")'
The usual caveats apply to recording (the Section called Recording under Unix in the Chapter called Basic Requirements), as do the issues in selecting a speaker.
As prosody modeling is difficult, and if you are inexperienced in building such models, it is wise not to attempt anything hard. Just building reliable models for default unmarked intonation is very useful if your current models are simply the default fixed intonation. Thus the sentences should be read in a natural but not too varied style.
Recording can be done with pointyclicky or prompt_them. If you are using prompt_them, you should modify that script so that it does not play the prompts, as they will confuse the speaker. The speaker should simply read the text (and markup, if present).

pointyclicky etc/timit.data

or

bin/prompt_them etc/timit.data
After recording, the spoken utterances must be labeled.

bin/make_labs prompt-wav/*.wav

This is one of the computationally expensive parts of the process and for longer sentences it can require much memory too.
After autolabeling it is always worthwhile to inspect the labels and correct mistakes. Phrasing can particularly cause problems, so adding or deleting silences can make the derived prosody models much more accurate. You can use emulabel to do this.
emulabel etc/emu_lab
At this point we diverge from the process used for building limited domain synthesizers. You can still construct such a synthesizer from the same recordings, perhaps because you wish for more appropriate prosodic models in the fallback synthesizer. But at this point we need to extract the pitchmarks in a slightly different way. We intend to extract F0 contours for all non-silence parts of the speech signal. We do this by extracting pitchmarks for the voiced sections alone, then (in the next step) interpolating the F0 through the non-voiced (but non-silence) sections.
the Section called Extracting pitchmarks from waveforms in the Chapter called Basic Requirements
discusses the setting of
parameters to get bin/make_pm_wave to work for a particular
voice. In this case we need those same parameters (which should be
found by experiment). These should be copied from bin/make_pm_wave and added to bin/make_f0_pm in the variable PM_ARGS. The distribution contains something like
PM_ARGS='-min 0.0057 -max 0.012 -def 0.01 -wave_end -lx_lf 140 -lx_lo 111 -lx_hf 80 -lx_ho 51 -med_o 0'

Importantly this differs from the parameters in bin/make_pm_wave, as we do not use the -fill option to fill in pitchmarks over the rest of the waveform.

The second part of this process is the construction of an F0 contour which is built from the extracted pitchmarks. Unvoiced speech sections are assigned an F0 contour by interpolation from the voiced sections around them, and the result is smoothed. The label files are used to define which parts of the signal are silence and which are speech.
The variable SILENCE
in bin/make_f0_pm
must be modified to
reflect the symbol used for silence in your phoneset.
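For example, if silence in your phoneset is written pau, the setting would look something like the line below (a hedged example; check the exact variable in your copy of the script):

SILENCE=pau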
Once the pitchmark parameters have been determined, and the appropriate SILENCE value set, you can extract the smoothed F0 by the command
bin/make_f0_pm wav/*.wav
You can view the F0 contours with the command
emulabel etc/emu_f0
With the labels and F0 created, we can now rebuild the utterance structures by synthesizing the prompts and merging in the natural durations and F0 from the naturally spoken utterances.
festival -b festvox/build_ldom.scm '(build_utts "etc/timit.data")'
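To sanity-check the merge you can load one of the rebuilt utterances into Festival and inspect its segments. A minimal check, assuming the utterance names from the prompt file above:

festival> (set! u (utt.load nil "festival/utts/sent_0001.utt"))
festival> (utt.relation.print u 'Segment)

The printed Segment relation should show segment end times derived from the natural speech rather than the synthesizer defaults.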
The script bin/make_dur_model contains all of the following commands, but it is wise to understand the stages because, due to errors in labeling, it may not all run completely smoothly and small fixes may be required.
We are building a duration model using a CART tree to predict zscore values for phones. Zscores (number of standard deviations from the mean) have often been used in duration modeling as they allow a certain amount of normalization over different phones.
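As a concrete illustration (these two helper functions are not part of the distributed scripts), the mapping between absolute durations and zscores for a phone with mean m and standard deviation s, as listed in festival/dur/etc/durs.meanstd, is simply:

;; illustrative only: duration (seconds) <-> zscore for one phone
(define (dur_to_zscore dur m s) (/ (- dur m) s))
(define (zscore_to_dur z m s) (+ m (* z s)))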
You should first look at the script bin/make_dur_model and edit the following three variable values

SILENCENAME=SIL
VOICENAME='(kal_diphone)'
MODELNAME=cmu_us_kal

These should contain the name for silence in your phoneset, the call for the voice you are building the model for (or at least one that uses the same phoneset), and finally the name for the model, which can be the same INST_LANG_VOX part of the voice you call.

The first stage is to find the means and standard deviations for each phone. A festival script in the festival distribution is used to load in all the utterances and calculate these values, with the command

durmeanstd -output festival/dur/etc/durs.meanstd festival/utts/*.utt

You should check festival/dur/etc/durs.meanstd, the generated file, to ensure that the numbers look reasonable. If there is only one example of a particular phone, the standard deviation cannot be calculated and the value is given as nan (not-a-number). This must be changed to a standard numeric value (say one third of the mean).
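A quick way to find any such entries before editing the file is simply:

grep nan festival/dur/etc/durs.meanstd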
Also some of the values in this table may be adversely affected by bad labeling, so you may wish to hand modify the values, or go back and correct the labeling.

The next stage is to extract the features from which we will predict the durations. The list of features extracted is in festival/dur/etc/dur.feats. These cover phonetic context, syllable, word position etc. These may or may not be appropriate for your new language or domain and you may wish to add to these before doing the extraction. The extraction process takes each phoneme and dumps the named feature values for that phone into a file. This uses the standard festival script dumpfeats to do this. The command looks like
$DUMPFEATS -relation Segment -eval $VOICENAME \
    -feats festival/dur/etc/dur.feats \
    -output festival/dur/feats/%s.feats \
    -eval festival/dur/etc/logdurn.scm \
    festival/utts/*.utt

These feature files are then concatenated into a single file which is then split (90/10) into training and test sets. The training set is further split for use as a held-out test set used in the training phase. Also at this stage we remove all silence phones from the training and test sets. This is, perhaps naively, because the distribution of silences is very wide, and files often contain silences at the start and end of utterances which themselves aren't part of the speech content (they're just the edges); having these in the training set can skew the results.

cat festival/dur/feats/*.feats | \
    awk '{if ($2 != "'$SILENCENAME'") print $0}' >festival/dur/data/dur.data
bin/traintest festival/dur/data/dur.data
bin/traintest festival/dur/data/dur.data.train
For wagon, the CART tree builder, to work it needs to know what possible values each feature can take. This can mostly be determined automatically, but some features may have values that could be either numeric or classes, thus we use a post-processing function on the automatically generated description file to get our desired result.
$ESTDIR/bin/make_wagon_desc festival/dur/data/dur.data \
festival/dur/etc/dur.feats festival/dur/etc/dur.desc
festival -b --heap 2000000 festvox/build_prosody.scm \
$VOICENAME '(build_dur_feats_desc)'
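For reference, a wagon description file is a bracketed list giving each feature name followed by either float (for numeric features) or the set of values it can take, with the predictee first. The fragment below is purely illustrative (these feature names and values are invented, not generated from your data):

((zdur float)
 (name aa ae ah pau ...)
 (p.name aa ae ah pau ...)
 (syl_break 0 1 2 3 4))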
Now we can build the model itself. A key factor in the time this takes (and the accuracy of the model) is the "stop" value, that is, the number of examples that must exist before a split is searched for. The smaller this number the longer the search will be, though up to a certain point the more accurate the model will be. But at some level it will overtrain. The default in the distribution is 50, which may or may not be appropriate. Note that for large databases and for smaller values of STOP the training may take days even on a fast processor. Although we have guessed a reasonable value for databases of around 50-1000 utterances, it may not be appropriate for you.
The learning technique used is basic CART tree growing, but with an important extension which makes the process much more robust on unseen data but unfortunately much more computationally expensive. The -stepwise option on wagon incrementally searches for the best features to use in building the tree, in addition to finding, at each iteration, the questions about each feature that best model the data. If you want a quicker result, removing the -stepwise option will give you that.
wagon -data festival/dur/data/dur.data.train.train \
    -desc festival/dur/etc/dur.desc \
    -test festival/dur/data/dur.data.train.test \
    -stop $STOP \
    -output festival/dur/tree/$PREF.S$STOP.tree \
    -stepwise

To test the results on data not used in the training we use the command

wagon_test -heap 2000000 -data festival/dur/data/dur.data.test \
    -desc festival/dur/etc/dur.desc \
    -tree festival/dur/tree/$PREF.S$STOP.tree

Interpreting the results isn't easy in isolation. The smaller the RMSE (root mean squared error) the better, and the larger the correlation the better (it should never be greater than 1, and although it should not normally be below 0, a very bad model can give a negative correlation). For English, with this script on a TIMIT database, we get an RMSE value of 0.81 and a correlation of 0.58 on the test data. Note these values are not in the absolute domain (i.e. seconds); they are in the zscore domain.
The final stage, probably after a number of iterations of the build process, is to package the model into a scheme file that can be used with a voice. This scheme file contains the means and standard deviations (so we can convert the predicted values back into seconds) and the prediction tree itself. We also add in predictions for the silence phone by hand. The command to generate this is
festival -b --heap 2000000 \
festvox/build_prosody.scm $VOICENAME \
'(finalize_dur_model "'$MODELNAME'" "'$PREF.S$STOP.tree'")'
This will generate a file festvox/cmu_timit_awb_dur.scm. If your model name is the same as the basic diphone voice you intend to use it in, you can simply copy this file to the festvox/ directory of your diphone voice and it will automatically work. But it is worth explaining what this install process really is. The duration model scheme file contains two lisp expressions setting the variables MODELNAME::phone_durs and MODELNAME::zdurtree. To use these in a voice you must load this file, typically by adding

(require 'MODELNAME_dur)

to the diphone voice definition file (festvox/MODELNAME_diphone.scm). You then need to get the voice definition to use these new variables. This is done by the following commands in the voice definition function
;; Duration prediction
(set! duration_cart_tree MODELNAME::zdurtree)
(set! duration_ph_info MODELNAME::phone_durs)
(Parameter.set 'Duration_Method 'Tree_ZScores)
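Once your voice loads the model, a quick hedged check (MODELNAME here is the same placeholder as above) is to synthesize a sentence and print the predicted segments:

;; hedged example: confirm the new duration model is active
(voice_MODELNAME_diphone)
(set! u (SayText "This is a test of the new duration model."))
(utt.relation.print u 'Segment)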