Building a Unit Selection Cluster Voice

The previous section gives the low-level details of building a cluster unit selection voice. This section gives a higher-level view, with the explicit commands that you should run. The steps involved in building a unit selection voice are basically the same as those for building a limited domain voice (the Chapter called Limited domain synthesis). For general voices, however, in contrast to limited domain voices, it is much more important to get all parts correct, from pitchmarks to labeling.

The following tasks are required:

The following are the commands that you must type (assuming all the other hard work has been done beforehand). It is assumed that the environment variables FESTVOXDIR and ESTDIR have been set to point to their respective directories, for example:

export FESTVOXDIR=/home/awb/projects/festvox
export ESTDIR=/home/awb/projects/speech_tools
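
Since every later command depends on these two variables, it can save time to confirm that they are set and point at real directories before going further. The following is a minimal sketch only; check_dir_var is a hypothetical helper written for this illustration and is not part of festvox or the speech tools.

```shell
# Hypothetical helper (not part of festvox): report whether the environment
# variable named by the first argument is set and points at a directory.
check_dir_var() {
    var_name="$1"
    # Indirect lookup: fetch the value of the variable whose name is in $1
    eval "var_value=\"\${$var_name:-}\""
    if [ -z "$var_value" ]; then
        echo "$var_name is not set"
        return 1
    fi
    if [ ! -d "$var_value" ]; then
        echo "$var_name does not point to a directory: $var_value"
        return 1
    fi
    echo "$var_name ok: $var_value"
}
```

You would call it as check_dir_var FESTVOXDIR and check_dir_var ESTDIR; a non-zero return status means the variable needs fixing before you continue.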

Next you must select a name for the voice. By convention we use three-part names consisting of an institution name, a language, and a speaker. Make a directory of that name and change directory into it:

mkdir cmu_us_awb
cd cmu_us_awb

There is a basic setup script that will construct the directory structure and copy in the template files for voice building. If a fourth argument is given, it can name one of the standard prompt lists.

For example, the simplest is uniphone. This contains three sentences which contain each of the US English phonemes once (if spoken appropriately). This prompt set is hopelessly minimal for any high quality synthesis, but it allows us to illustrate the process and allows you to build a voice quickly.

$FESTVOXDIR/src/unitsel/setup_clunits cmu us awb uniphone

Alternatively you can copy a prompt list into the etc directory. It should be in the standard "data" format, as in

( uniph_0001 "a whole joy was reaping." )
( uniph_0002 "but they've gone south." )
( uniph_0003 "you should fetch azure mike." )

Note that the spaces after the initial left parenthesis are significant, and that double quotes and backslashes within the quoted part must be escaped (with a backslash), as is common in Perl or Festival itself.
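
Because the layout matters, a quick mechanical check of a hand-written prompt list can catch mistakes before they surface later in the build. The sketch below is an illustration only: check_prompt_format is a hypothetical helper, and the pattern it uses is inferred from the three example lines above rather than taken from any formal definition of the format. It does not verify that quotes and backslashes inside the text are escaped; that still needs checking by eye.

```shell
# Hypothetical helper: print (with line numbers) every non-blank line of a
# prompt file that does not look like
#   ( utterance_id "text" )
# Returns 0 if all lines look right, 1 if any were printed, 2 if unreadable.
check_prompt_format() {
    [ -r "$1" ] || { echo "cannot read $1" >&2; return 2; }
    # -v selects lines matching neither pattern: a well-formed entry,
    # or a blank line.
    if grep -n -v -E -e '^\( [^ "]+ ".*" *\) *$' -e '^[[:space:]]*$' "$1"; then
        return 1
    fi
    return 0
}
```

Run it on your own prompt file in etc; any line it prints is worth inspecting, most often for a missing space after the opening parenthesis.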

The next stage is to generate waveforms to act as prompts, or as timing cues even if the prompts are not actually played. These files are also used in aligning the spoken data.

festival -b festvox/build_clunits.scm '(build_prompts_waves "etc/")'

Use whatever prompt file you are intending to use. Note that you may want to add lexical entries to festvox/WHATEVER_lexicon.scm and make other text analysis changes as desired. The purpose is to ensure that the prompt files match the phonemes that the voice talent will actually say.

You may now record, assuming you have prepared the recording studio, gotten written permission to record your speaker (and explained to them what the resulting voice might be used for), checked recording levels and sound levels, and shielded the electrical equipment as much as possible.

./bin/prompt_them etc/

After recording, the recorded files should be in wav/. It is wise to check that they are actually there and sound as you expected. Getting the recording quality as high as possible is fundamental to the success of building a voice.
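
The "actually there" part of that check can be automated; listening to the files cannot. The following sketch assumes that each recording is named after its utterance id in the prompt list (for example wav/uniph_0001.wav for the prompt uniph_0001). missing_wavs is a hypothetical helper for illustration, not a festvox script.

```shell
# Hypothetical helper: list the utterance ids in a prompt file that have no
# non-empty .wav file of the same name in the given directory.
# Assumes entries of the form: ( utterance_id "text" )
missing_wavs() {
    prompt_file="$1"
    wav_dir="$2"
    # With "( id ..." the id is the second whitespace-separated field
    awk '$1 == "(" { print $2 }' "$prompt_file" |
    while read -r utt_id; do
        [ -s "$wav_dir/$utt_id.wav" ] || echo "$utt_id"
    done
}
```

Call it as missing_wavs PROMPTFILE wav, where PROMPTFILE is the prompt list you recorded from. No output means every prompt has a recording; it says nothing about whether the recordings are any good.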

Now we must label the spoken prompts. We do this by matching the synthesized prompts with the spoken ones. As we know where the phonemes begin and end in the synthesized prompts, we can map that onto the spoken ones and find the phoneme segments. This technique works fairly well, but it is far from perfect, and it is worthwhile to at least check the result, and most probably fix it by hand.

./bin/make_labs prompt-wav/*.wav

Especially in the case of the uniphone synthesizer, where there is one and only one occurrence of each phone, they all must be correct, so it's important to check the labels by hand. Note that for large collections you may find the full Sphinx based labeling technique better (see the Section called Labeling with Full Acoustic Models in the Chapter called Labeling Speech).

After labeling we can build the utterance structure using the prompt list and the now labeled phones and durations.

festival -b festvox/build_clunits.scm '(build_utts "etc/")'

The next stages are concerned with signal analysis, specifically pitch marking and cepstral parameter extraction. There are a number of methods for pitch mark extraction and a number of parameters within these files that may need tuning. Good pitch periods are important. See the Section called Extracting pitchmarks from waveforms in the Chapter called Basic Requirements. In the simplest case the following may work

./bin/make_pm_wave wav/*.wav

The next stage is to find the Mel Frequency Cepstral Coefficients. This is done pitch synchronously and hence depends on the pitch periods extracted above. These are used for clustering and for join measurements.

./bin/make_mcep wav/*.wav
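
Since each analysis stage is run over every waveform, it is worth confirming that a stage produced an output file for each recording before moving on. The sketch below is a generic check written for this illustration; missing_outputs is a hypothetical helper, and the output directory and extension are passed in as arguments because the names used by your setup should be confirmed by looking in your voice directory.

```shell
# Hypothetical helper: for every .wav file in wav_dir, report the base names
# that have no non-empty file "base.ext" in out_dir.
missing_outputs() {
    wav_dir="$1"
    out_dir="$2"
    ext="$3"
    for wav in "$wav_dir"/*.wav; do
        [ -e "$wav" ] || continue          # the glob matched nothing
        base=$(basename "$wav" .wav)
        [ -s "$out_dir/$base.$ext" ] || echo "$base"
    done
}
```

For example, if your pitchmark files are written to a directory pm/ with the extension .pm (an assumption to verify against your own build), missing_outputs wav pm pm lists the recordings that have no pitchmark file.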

Now we can do the main part of the build, building the cluster unit selection synthesizer. This consists of a number of stages, all based on the controlling Festival script, the parameters of which are described above.

festival -b festvox/build_clunits.scm '(build_clunits "etc/")'

For large databases this can take some time to run, as there is a squared aspect to this: the work grows roughly with the square of the number of instances of each unit type.
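
To get a feel for what a squared aspect means in practice, the sketch below counts the distinct pairs among n instances of one unit type, n(n-1)/2. It assumes that the dominant cost is comparing every instance of a type with every other instance of the same type; that is a simplification for illustration, not a description of the exact computation. pair_count is a hypothetical helper.

```shell
# Hypothetical helper: number of distinct pairs among n instances of one
# unit type, i.e. n * (n - 1) / 2.
pair_count() {
    n="$1"
    echo $(( n * (n - 1) / 2 ))
}

pair_count 100     # prints 4950
pair_count 1000    # prints 499500
```

Ten times as many instances of a unit type gives roughly a hundred times as many pairs, which is why build time rises so quickly as the database grows.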