Grapheme-based Synthesizer

General Grapheme-based Voices

This might be one of the easier ways to build a synthesizer in a language for which you do not have many resources. In many cases the techniques described here will do well enough to provide an understandable, usable synthesizer. You will need audio and an orthography for your language, but nothing else. The techniques described in this chapter provide generic phonetic support without requiring the construction of an explicit lexicon (though you can add explicit lexical entries for some words if you wish).

The techniques described here are unlikely to be better than techniques that require more language support (such as phoneme sets, lexicons and higher level knowledge of the language) so we will also assume that the data you have is not the best, and will address some issues in improving the quality of your data.

As always, high-quality recordings of a large, phonetically balanced corpus by a good, consistent speaker will be best. In our experience anything less than 30 minutes of actual speech (ignoring silence) is unlikely to give a good result. Where possible the data should be recorded using the same channel. Mixed-channel recordings, e.g. multiple sessions or multiple room acoustics, typically give much poorer results.
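As a rough check against the 30 minute figure, the total length of a directory of recordings can be estimated from file sizes. This is only a sketch: total_minutes is a hypothetical helper, not part of FestVox, and it assumes uncompressed 16-bit mono PCM files with a canonical 44-byte header at a sample rate you supply. The result includes silence, so treat it as an upper bound on the amount of actual speech.

```shell
# Estimate the total length, in whole minutes, of the .wav files in a
# directory. ASSUMPTIONS (not from FestVox): uncompressed 16-bit mono PCM,
# a canonical 44-byte header, and the sample rate passed as argument 2.
total_minutes() {
    dir=$1
    rate=$2
    bytes=0
    for f in "$dir"/*.wav; do
        [ -f "$f" ] || continue
        # The arithmetic expansion strips any padding that wc prints.
        size=$(( $(wc -c < "$f") ))
        [ "$size" -gt 44 ] || continue
        bytes=$(( bytes + size - 44 ))
    done
    # 2 bytes per sample, 60 seconds per minute.
    echo $(( bytes / (rate * 2 * 60) ))
}
```

For 16 kHz recordings this could be called as

total_minutes recording 16000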

We break this chapter down into core (clustergen) building, and then discuss some further techniques that might be relevant to your particular language and how you might improve them.

As with all parts of festvox, you must set the following environment variables to where you have installed versions of the Edinburgh Speech Tools, the FestVox distribution and NITECH's SPTK:

export ESTDIR=/home/awb/projects/speech_tools
export FESTVOXDIR=/home/awb/projects/festvox
export SPTKDIR=/home/awb/projects/SPTK
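Since a missing or mistyped variable tends to show up only as a failure later in the build, it can help to check the three variables before going further. The following is a minimal sketch; check_env_dirs is a hypothetical helper, not something shipped with FestVox.

```shell
# Return non-zero if any of the named environment variables is unset, empty,
# or does not point at an existing directory. check_env_dirs is a
# hypothetical helper name, not part of FestVox.
check_env_dirs() {
    status=0
    for name in "$@"; do
        # Look up the variable whose name is stored in $name.
        eval "value=\${$name:-}"
        if [ -z "$value" ]; then
            echo "$name is not set" >&2
            status=1
        elif [ ! -d "$value" ]; then
            echo "$name=$value is not a directory" >&2
            status=1
        fi
    done
    return "$status"
}
```

It could then be used as

check_env_dirs ESTDIR FESTVOXDIR SPTKDIR || echo "fix the settings above first"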

We will use Thai as the example language.

mkdir cmu_thai_am
cd cmu_thai_am
$FESTVOXDIR/src/clustergen/setup_cg cmu thai am

This will set up a base voice (that is incomplete, as the phoneme definition is missing). We will generate the appropriate additional information from the language data you have collected.

We assume that you have already prepared a list of utterances and recorded them. See the chapter called Corpus development for instructions on designing a corpus. We also assume that you have a prompt file in the expected format (with Thai encoded as Unicode).

cp -p WHATEVER/ etc/
cp -p WHATEVER/wav/*.wav recording/
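Because the later steps read the characters of the prompt file directly, it is worth confirming that the file really is valid UTF-8 before building. A minimal sketch follows, assuming the standard iconv utility is installed; is_utf8 is a hypothetical helper, and the path of the prompt file is passed as an argument. Note that this only shows the bytes are well-formed UTF-8, not that the text is in the script you expect.

```shell
# Succeed only if the given file decodes cleanly as UTF-8.
# is_utf8 is a hypothetical helper; it relies on iconv exiting with a
# non-zero status when it meets an invalid byte sequence.
is_utf8() {
    iconv -f UTF-8 -t UTF-8 "$1" > /dev/null 2>&1
}
```

For example, with PROMPTFILE standing in for your own prompt file:

is_utf8 PROMPTFILE || echo "prompt file is not valid UTF-8"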

Assuming the recordings might not be as good as they could be, you can power-normalize them.

./bin/get_wavs recording/*.wav

Also, synthesis builds (especially labeling) work best if there is only a limited amount of leading and trailing silence. We can trim it with:

./bin/prune_silence wav/*.wav

This uses F0 extraction to estimate where the speech starts and ends. This typically works well, but you should listen to the results to ensure it does the right thing. The original unpruned files are saved in unpruned/. A third preprocessing option is to shorten intra-sentence silences. This is only desirable for non-ideal recordings, but it can help labeling.

./bin/prune_middle_silence wav/*.wav
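Listening to every file is tedious, so one way to decide which pruned files to listen to first is to compare the files saved in unpruned/ with those now in wav/ and look for unusually large reductions. The sketch below uses file size as a proxy for duration, which only holds if both copies share the same sample format; trim_report is a hypothetical helper, not part of FestVox.

```shell
# Print "<file> <percent of bytes removed>" for each .wav file present in
# both directories. Byte size is only a proxy for duration, and
# trim_report is a hypothetical helper name.
trim_report() {
    before_dir=$1
    after_dir=$2
    for before in "$before_dir"/*.wav; do
        [ -f "$before" ] || continue
        name=$(basename "$before")
        after="$after_dir/$name"
        [ -f "$after" ] || continue
        # The arithmetic expansion strips any padding that wc prints.
        b=$(( $(wc -c < "$before") ))
        a=$(( $(wc -c < "$after") ))
        [ "$b" -gt 0 ] || continue
        echo "$name $(( (b - a) * 100 / b ))"
    done
}
```

Sorting the output puts the most heavily trimmed files at the end:

trim_report unpruned wav | sort -k2 -n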

Note that if you do not require these three stages, you can put your waveform files directly into wav/.

Now we can complete the voice templates, using the information in the etc/ prompt list.


This analyzes all the (Unicode) characters in the prompt list and builds a mapping from the UNITRAN list of characters to phonemes. This updates the templates in festvox/ to provide a complete voice for that language. Note that this only provides a default pronunciation for characters in the language. It certainly isn't suitable for numbers, symbols, etc., nor will it deal well with languages with more opaque writing systems (e.g. English and Chinese). But for many languages it gives a very good starting position.

Note that it only deals with characters in your prompt list, so you should ensure that the prompt list covers all the characters you need.
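One way to inspect character coverage is to count how often each character occurs in the prompt list. The following is a rough sketch: char_counts is a hypothetical helper, it assumes the shell is running in a UTF-8 locale so that grep treats each multi-byte character as one unit, and it counts every non-whitespace character in the file, including any utterance identifiers or markup around the text.

```shell
# List each distinct non-whitespace character in a text file together with
# the number of times it occurs, rarest first. char_counts is a
# hypothetical helper; it ASSUMES a UTF-8 locale for non-ASCII text.
char_counts() {
    grep -o '[^[:space:]]' "$1" | sort | uniq -c | sort -n
}
```

Characters that appear only once or twice at the top of the list are the ones most likely to be poorly modeled, and a character you expect but cannot find in the list is not covered at all.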

Now you can build a voice as before. First build the prompts and label the data.

./bin/do_build build_prompts etc/
./bin/do_build label etc/
./bin/do_clustergen parallel build_utts etc/
./bin/do_clustergen generate_statename
./bin/do_clustergen generate_filters

Then do feature extraction:

./bin/do_clustergen parallel f0_v_sptk
./bin/do_clustergen parallel mcep_sptk
./bin/do_clustergen parallel combine_coeffs_v
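Each of these stages can take a while, and a failure in one is easy to miss if the next is started regardless. A small wrapper that keeps a log per stage and reports failures may help. This is only a sketch: run_stage and the build_logs directory are hypothetical names, not part of FestVox.

```shell
# Run one build stage, keep its output in a log file, and report failure.
# run_stage and the build_logs directory are hypothetical names.
run_stage() {
    stage=$1
    shift
    mkdir -p build_logs
    echo "== $stage: $*"
    if "$@" > "build_logs/$stage.log" 2>&1; then
        return 0
    fi
    echo "stage $stage failed, see build_logs/$stage.log" >&2
    return 1
}
```

The feature extraction steps above could then be chained so that the build stops at the first failure:

run_stage f0 ./bin/do_clustergen parallel f0_v_sptk &&
run_stage mcep ./bin/do_clustergen parallel mcep_sptk &&
run_stage combine ./bin/do_clustergen parallel combine_coeffs_v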

Build the models:

./bin/traintest etc/
./bin/do_clustergen parallel cluster etc/
./bin/do_clustergen dur etc/

Then generate some test examples: the first gives MCD and F0D objective measures, the second generates standard TTS output.

./bin/do_clustergen cg_test resynth cgp etc/
./bin/do_clustergen cg_test tts tts etc/

Note that grapheme voices can also be built with the random forest techniques, and that should work, but you should only try it after confirming that a base build works and that every prompt can be processed properly.