Building Synthetic Voices
Languages of the Indian subcontinent have millions of speakers, but comparatively few resources and little data are available for them. Linguistic phenomena that occur in Indic languages include schwa deletion in Indo-Aryan languages, voicing rules in Tamil, and stress patterns. We have a common Indic front end for building voices in these languages, with support for many of these phenomena.
Currently, we have explicit support for Hindi, Bengali, Kannada, Tamil, Telugu and Gujarati. Rajasthani and Assamese are also included; they use the same rules as Hindi and Bengali respectively, although this may not be completely accurate.
This chapter describes how to build an Indic voice for the languages that are supported explicitly in Festvox and also how to add support for a new Indic language. We will use Hindi as the example language for this tutorial.
Part 1: Building voices that have explicit support (Hindi, Bengali, Kannada, Tamil, Telugu, Gujarati, Rajasthani, Assamese)
As with all parts of Festvox, you must set the following environment variables to point to where you have installed the Edinburgh Speech Tools, the FestVox distribution and NITECH's SPTK:
export ESTDIR=/home/awb/projects/speech_tools
export FESTVOXDIR=/home/awb/projects/festvox
export SPTKDIR=/home/awb/projects/SPTK
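Because every later step depends on these three variables, it can help to confirm that each one is set and names an existing directory before going further. The following is a minimal sketch; `check_festvox_env` is a hypothetical helper name and is not part of the Festvox distribution.

```shell
# check_festvox_env: verify that each build-time variable is set and
# names an existing directory. Hypothetical helper, not shipped with Festvox.
check_festvox_env() {
    status=0
    for name in ESTDIR FESTVOXDIR SPTKDIR; do
        # Indirect lookup of the variable whose name is stored in $name.
        eval "value=\${$name:-}"
        if [ -z "$value" ]; then
            echo "$name is not set" >&2
            status=1
        elif [ ! -d "$value" ]; then
            echo "$name points to a missing directory: $value" >&2
            status=1
        fi
    done
    return $status
}
```

A build script could call `check_festvox_env || exit 1` before running any of the setup commands.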
Next, create a directory for the voice and run the clustergen setup script in it:

mkdir cmu_indic_ss
cd cmu_indic_ss
$FESTVOXDIR/src/clustergen/setup_cg cmu indic ss

This will set up a base Indic voice. The file that is Indic-voice specific is festvox/indic_lexicon.scm in the voice directory. The default language in the Indic lexicon is Hindi. You should change this to the language that you are working with so that the right language-specific rules are used.

(defvar lex:language 'Hindi)
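Changing the language is a one-line edit, so it can also be scripted. The sketch below assumes the `defvar` appears on a single line in exactly the form shown above; `set_indic_language` is a hypothetical helper name. The symbol the lexicon expects for each language (for example `'Tamil`) is assumed here by analogy with `'Hindi` and should be checked against indic_lexicon.scm itself.

```shell
# set_indic_language FILE LANGUAGE
# Rewrites the (defvar lex:language '...) line in FILE to name LANGUAGE.
# Hypothetical helper; assumes the defvar sits on one line as shown above.
set_indic_language() {
    file=$1
    language=$2
    if ! grep -q "^(defvar lex:language '" "$file"; then
        echo "no lex:language defvar found in $file" >&2
        return 1
    fi
    # Write to a temporary file first so a failed sed leaves the original intact.
    sed "s/^(defvar lex:language '[^)]*)/(defvar lex:language '$language)/" \
        "$file" > "$file.tmp" &&
        mv "$file.tmp" "$file"
}
```

From the voice directory, this would be called as `set_indic_language festvox/indic_lexicon.scm Tamil`.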
We assume that you have already prepared a list of utterances and recorded them. See the chapter called Corpus development for instructions on designing a corpus. We also assume that you have a prompt file in the txt.done.data format (with your Indic prompts encoded as Unicode).
Copy the prompt file and the recordings into the voice directory:

cp -p WHATEVER/txt.done.data etc/
cp -p WHATEVER/wav/*.wav recording/

If the recordings are not as good as they could be, you can power normalize them:

./bin/get_wavs recording/*.wav

Synthesis builds (especially labeling) also work best if there is only a limited amount of leading and trailing silence. You can trim it with:

./bin/prune_silence wav/*.wav

This uses F0 extraction to estimate where the speech starts and ends. This typically works well, but you should listen to the results to ensure it does the right thing. The original unpruned files are saved in unpruned/. A third pre-processing option is to shorten intrasentence silences. This is only desirable for data recorded in non-ideal conditions, but it can help labeling.

./bin/prune_middle_silence wav/*.wav

Note that if you do not require these three stages, you can put your wavefiles directly into the wav/ directory.
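Whichever route the waveforms take into wav/, it may be worth confirming that every prompt has a matching file before starting the build. The following sketch assumes the prompt file holds one entry per line in the form `( utt_id "prompt text" )` and that each waveform is named after its utterance id; `missing_wavs` is a hypothetical helper name.

```shell
# missing_wavs PROMPT_FILE WAV_DIR
# Prints the utterance ids in PROMPT_FILE that have no WAV_DIR/<id>.wav.
# Hypothetical helper; assumes one prompt per line in the form
#   ( utt_id "prompt text" )
missing_wavs() {
    prompts=$1
    wav_dir=$2
    # The utterance id is the first token after the opening parenthesis.
    sed -n 's/^[[:space:]]*([[:space:]]*\([^[:space:]"][^[:space:]"]*\).*/\1/p' "$prompts" |
    while read -r utt_id; do
        [ -f "$wav_dir/$utt_id.wav" ] || echo "$utt_id"
    done
}
```

Running `missing_wavs etc/txt.done.data wav` from the voice directory prints nothing when the prompts and recordings line up.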
Now you can build a voice as before. First, build the prompts and label the data.

./bin/do_build build_prompts etc/txt.done.data
./bin/do_build label etc/txt.done.data
./bin/do_clustergen parallel build_utts etc/txt.done.data
./bin/do_clustergen generate_statename
./bin/do_clustergen generate_filters

Then do feature extraction.

./bin/do_clustergen parallel f0_v_sptk
./bin/do_clustergen parallel mcep_sptk
./bin/do_clustergen parallel combine_coeffs_v

Build the models.

./bin/traintest etc/txt.done.data
./bin/do_clustergen parallel cluster etc/txt.done.data.train
./bin/do_clustergen dur etc/txt.done.data.train

And generate some test examples, the first to give MCD and F0D objective measures, the second to generate standard TTS output.

./bin/do_clustergen cg_test resynth cgp etc/txt.done.data.test
./bin/do_clustergen cg_test tts tts etc/txt.done.data.test
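The build commands in this section can be collected into a small driver that stops at the first failing step. This is only a sketch: `run_step`, `build_indic_voice` and the `DRY_RUN` variable are hypothetical names and are not part of Festvox. The commands themselves are the ones listed in this section, and the driver is meant to be run from the voice directory.

```shell
# run_step: print the command, then execute it unless DRY_RUN=1.
# run_step and DRY_RUN are hypothetical names, not part of Festvox.
run_step() {
    echo "+ $*"
    if [ "${DRY_RUN:-0}" != "1" ]; then
        "$@"
    fi
}

# build_indic_voice: the build commands from this section, in order.
# The && chain stops at the first step that fails.
build_indic_voice() {
    prompts=etc/txt.done.data
    # Prompts and labeling
    run_step ./bin/do_build build_prompts "$prompts" &&
    run_step ./bin/do_build label "$prompts" &&
    run_step ./bin/do_clustergen parallel build_utts "$prompts" &&
    run_step ./bin/do_clustergen generate_statename &&
    run_step ./bin/do_clustergen generate_filters &&
    # Feature extraction
    run_step ./bin/do_clustergen parallel f0_v_sptk &&
    run_step ./bin/do_clustergen parallel mcep_sptk &&
    run_step ./bin/do_clustergen parallel combine_coeffs_v &&
    # Model building
    run_step ./bin/traintest "$prompts" &&
    run_step ./bin/do_clustergen parallel cluster "$prompts.train" &&
    run_step ./bin/do_clustergen dur "$prompts.train" &&
    # Test examples
    run_step ./bin/do_clustergen cg_test resynth cgp "$prompts.test" &&
    run_step ./bin/do_clustergen cg_test tts tts "$prompts.test"
}
```

Setting `DRY_RUN=1` before calling `build_indic_voice` prints the commands without running them, which is a cheap way to review the sequence before committing to a long build.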