|Building Synthetic Voices|
This might be one of the easier ways to build a synthesizer in a language for which you do not have many resources. In many cases the techniques described here will do well enough to provide an understandable, usable synthesizer. You will need audio and orthography for your language, but nothing else. The techniques described in this chapter provide generic phonetic support and do not require the construction of an explicit lexicon (though you could add explicit lexical entries for some words if you desire).
The techniques described here are unlikely to be better than techniques that require more language support (such as phoneme sets, lexicons and higher level knowledge of the language), so we will also assume that the data you have is not the best, and will address some issues in improving the quality of your data.
As always, a high quality recording of a large, phonetically balanced corpus by a good, consistent speaker will be best. In our experience anything less than 30 minutes of actual speech (ignoring silence) will likely not give a good result. Where possible the data should be recorded using the same channel. Mixed-channel recordings, e.g. multiple sessions or multiple room acoustics, typically give much poorer results.
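To get a rough sense of whether a set of recordings reaches the 30 minute mark, the total duration can be estimated from file sizes. The sketch below is only an upper bound, since it does not discount silence, and it assumes 16-bit mono PCM files with a 44-byte header; the function name `wav_minutes` and the default 16000 Hz sample rate are illustrative assumptions, not part of festvox.

```shell
# Estimate total audio duration in whole minutes from file sizes.
# Assumption: 16-bit mono PCM wav files with a 44-byte header.
# Usage: wav_minutes DIR [SAMPLE_RATE]
wav_minutes() {
    dir=$1
    rate=${2:-16000}
    total=0
    for f in "$dir"/*.wav; do
        [ -f "$f" ] || continue
        # Add the size of the sample data (file size minus header)
        total=$((total + $(wc -c < "$f") - 44))
    done
    # 2 bytes per sample, 60 seconds per minute, rounded down
    echo $((total / (rate * 2 * 60)))
}
```

For example, `wav_minutes recording` prints the estimated number of minutes for the files in recording/. A tool that reads the actual wav headers would be more reliable if the format of your files differs from the assumption above.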
We break this chapter down into core (clustergen) building, and then discuss some further techniques that might be relevant to your particular language and how you might improve the resulting voice.
As with all parts of festvox, you must set environment variables pointing to where you have installed the Edinburgh Speech Tools, the FestVox distribution and NITECH's SPTK.
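As a minimal sketch, the variables can be exported and checked before starting a build. `FESTVOXDIR` is the name used by the commands in this chapter; `ESTDIR` and `SPTKDIR` are the names conventionally used for the other two installations and are assumed here. The paths are placeholders and the helper `check_env` is hypothetical.

```shell
# Placeholder paths: replace with your own installation directories.
export ESTDIR=/path/to/speech_tools
export FESTVOXDIR=/path/to/festvox
export SPTKDIR=/path/to/SPTK

# Report any variable that is unset or does not point at a directory.
# Returns non-zero if a problem was found.
check_env() {
    missing=0
    for v in ESTDIR FESTVOXDIR SPTKDIR; do
        eval "val=\${$v:-}"
        if [ -z "$val" ]; then
            echo "$v is not set"
            missing=1
        elif [ ! -d "$val" ]; then
            echo "$v=$val is not a directory"
            missing=1
        fi
    done
    return $missing
}
```

Running `check_env` before the setup step catches a mistyped path early, rather than partway through a build.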
We will use Thai as the example language.
$FESTVOXDIR/src/clustergen/setup_cg cmu thai am
This will set up a base voice (that is incomplete, as the phoneme definition is missing). We will generate the appropriate additional information from the language data you have collected.
We assume that you have already prepared a list of utterances and recorded them. See the Chapter called Corpus development for instructions on designing a corpus. We also assume that you have a prompt file in the txt.done.data format (with Thai encoded as Unicode).
cp -p WHATEVER/txt.done.data etc/
cp -p WHATEVER/wav/*.wav recording/
Assuming the recordings might not be as good as they could be, you can power normalize them.
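Before going further, it can be worth checking that the prompt file and the copied recordings agree. The sketch below assumes each line of the prompt file has the form `( utt_id "text" )` and that each utterance is recorded as `utt_id.wav`; the helper name `check_prompts` is hypothetical.

```shell
# Check that every prompt line is well formed and has a recording.
# Usage: check_prompts PROMPT_FILE WAV_DIR
# Returns non-zero if any problem was reported.
check_prompts() {
    prompts=$1
    wavdir=$2
    problems=0
    lineno=0
    while IFS= read -r line; do
        lineno=$((lineno + 1))
        # Extract utt_id from a line shaped like: ( utt_id "text" )
        id=$(printf '%s\n' "$line" |
             sed -n -E 's/^\( *([^ "]+) +".*" *\) *$/\1/p')
        if [ -z "$id" ]; then
            echo "line $lineno: not in ( utt_id \"text\" ) form"
            problems=$((problems + 1))
        elif [ ! -f "$wavdir/$id.wav" ]; then
            echo "line $lineno: no recording for $id"
            problems=$((problems + 1))
        fi
    done < "$prompts"
    [ "$problems" -eq 0 ]
}
```

For example, `check_prompts etc/txt.done.data recording` lists malformed lines and prompts without a matching file in recording/.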
Synthesis builds (especially labeling) also work best if there is only a limited amount of leading and trailing silence, so it helps to prune it from the recordings.
The pruning uses F0 extraction to estimate where the speech starts and ends. This typically works well, but you should listen to the results to ensure it does the right thing. The original unpruned files are saved in unpruned/. A third pre-processing option is to shorten intrasentence silences; this is only desirable for non-ideally recorded data, but it can help labeling.
Note that if you do not require these three stages, you can put your waveform files directly into wav/.
Now we can complete the voice templates, using the information in the etc/txt.done.data prompt list.
This analyzes all the (Unicode) characters in the prompt list and builds a mapping from the UNITRAN list of characters to phonemes. This updates the templates in festvox/ to provide a complete voice for that language. Note that this only provides a default pronunciation for characters in the language. It certainly isn't suitable for numbers, symbols, etc., nor will it deal well with languages with more opaque writing systems (e.g. English and Chinese). But for many languages it gives a very good starting position.
Note that it only deals with characters that appear in your prompt list, so you should ensure your prompt list has full character coverage.
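One way to inspect character coverage is to count how often each character occurs in the prompt text. The sketch below assumes the `( utt_id "text" )` line format and a UTF-8 locale, so that `grep` treats each Thai character as a single character; the helper name `prompt_char_counts` is hypothetical.

```shell
# Print "count character" for every non-space character in the
# prompt text, rarest first.
# Usage: prompt_char_counts PROMPT_FILE
prompt_char_counts() {
    # Strip the leading '( utt_id "' and the trailing '" )'
    sed -e 's/^( *[^ ]* *"//' -e 's/" *) *$//' "$1" |
        grep -o '[^[:space:]]' |
        awk '{ count[$0]++ } END { for (c in count) print count[c], c }' |
        sort -k1,1n -k2,2
}
```

Characters near the top of the output occur only a few times in the recordings. Characters you expect the voice to handle but that do not appear at all will have no mapping, so prompts containing them may need to be added.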
Now you can build a voice as before. Firstly, build the prompts and label the data.
./bin/do_build build_prompts etc/txt.done.data
./bin/do_build label etc/txt.done.data
./bin/do_clustergen parallel build_utts etc/txt.done.data
Then do feature extraction.
./bin/do_clustergen parallel f0_v_sptk
./bin/do_clustergen parallel mcep_sptk
./bin/do_clustergen parallel combine_coeffs_v
Build the models.
./bin/do_clustergen parallel cluster etc/txt.done.data.train
./bin/do_clustergen dur etc/txt.done.data.train
And generate some test examples: the first gives MCD and F0D objective measures, the second generates standard TTS output.
./bin/do_clustergen cg_test resynth cgp etc/txt.done.data.test
./bin/do_clustergen cg_test tts tts etc/txt.done.data.test
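The model building and test commands read etc/txt.done.data.train and etc/txt.done.data.test, so both files need to exist before those steps run. If your festvox setup already provides a script for creating them, prefer it. Otherwise, the sketch below shows one way to hold out prompts; the function name `split_prompts` and the choice of every 10th prompt for the test set are assumptions made for illustration.

```shell
# Split a prompt file into FILE.train and FILE.test.
# Every 10th prompt is held out for testing; the rest are for training.
# Usage: split_prompts PROMPT_FILE
split_prompts() {
    prompts=$1
    # Start from empty output files so reruns do not append stale data
    : > "$prompts.train"
    : > "$prompts.test"
    awk -v base="$prompts" '
        NR % 10 == 0 { print > (base ".test"); next }
                     { print > (base ".train") }
    ' "$prompts"
}
```

For example, `split_prompts etc/txt.done.data` writes the two files next to the original prompt list, leaving the original untouched.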
Note that grapheme voices can also be built with the random forest techniques, and this should work, though you should only try it after confirming that a base build works and that there are no prompts that cannot be processed properly.