Building Indic voices

Languages of the Indian subcontinent have millions of speakers, but relatively few speech resources and little data. Indic languages exhibit linguistic phenomena such as schwa deletion in the Indo-Aryan languages, voicing rules in Tamil, and language-specific stress patterns. We provide a common Indic front end for building voices in these languages, with support for many of these phenomena.
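To make schwa deletion concrete: a Devanagari consonant letter carries an inherent schwa /ə/, and in Hindi the word-final schwa is not pronounced, so नमक ("salt", letters न म क) is read /nəmək/ rather than the literal /nəməkə/. The following is only a toy sketch of that one rule, not the actual festvox front-end rules (real Hindi schwa deletion also applies word-medially and interacts with vowel signs); the three-consonant inventory is invented for the demo.

```python
# Toy illustration of word-final schwa deletion (NOT the festvox rules).
CONSONANTS = {'न': 'n', 'म': 'm', 'क': 'k'}  # tiny demo inventory

def naive_g2p(word):
    """Expand each consonant letter to consonant + inherent schwa,
    then delete the word-final schwa, as Hindi pronunciation does."""
    phones = []
    for ch in word:
        if ch in CONSONANTS:
            phones += [CONSONANTS[ch], 'ə']
    if phones and phones[-1] == 'ə':
        phones.pop()  # word-final schwa deletion
    return phones

print(naive_g2p('नमक'))  # ['n', 'ə', 'm', 'ə', 'k']
```

The common Indic front end implements such rules per language, which is why selecting the right language (below) matters.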

Currently, we have explicit support for Hindi, Bengali, Kannada, Tamil, Telugu and Gujarati. Rajasthani and Assamese are also included; they reuse the Hindi and Bengali rules respectively, although this may not be completely accurate.

This chapter describes how to build an Indic voice for the languages that are supported explicitly in Festvox and also how to add support for a new Indic language. We will use Hindi as the example language for this tutorial.

Part 1: Building voices that have explicit support (Hindi, Bengali, Kannada, Tamil, Telugu, Gujarati, Rajasthani, Assamese)

As with all parts of festvox, you must set the following environment variables to point to your installed copies of the Edinburgh Speech Tools, the FestVox distribution and NITECH's SPTK:

export ESTDIR=/home/awb/projects/speech_tools
export FESTVOXDIR=/home/awb/projects/festvox
export SPTKDIR=/home/awb/projects/SPTK

mkdir cmu_indic_ss
cd cmu_indic_ss
$FESTVOXDIR/src/clustergen/setup_cg cmu indic ss

This will set up a base Indic voice. The Indic-specific file is festvox/indic_lexicon.scm in the voice directory. The default language in the Indic lexicon is Hindi; change this to the language you are working with so that the right language-specific rules are used.

(defvar lex:language 'Hindi)

We assume that you have already prepared a list of utterances and recorded them. See the Chapter called Corpus development for instructions on designing a corpus. We also assume that you have a prompt file in the festvox format (with your Indic prompts encoded as unicode).

cp -p WHATEVER/ etc/
cp -p WHATEVER/wav/*.wav recording/
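The festvox prompt-file convention is one utterance per line, as a parenthesised pair of an utterance id and the prompt text; by convention the file lives at etc/txt.done.data. The ids below are hypothetical (they must match your recording filenames), and the Hindi sentences are just illustrations:

```
( indic_0001 "नमस्ते" )
( indic_0002 "यह एक परीक्षण वाक्य है" )
```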

Since the recordings might not be as good as they could be, you can power-normalize them.

./bin/get_wavs recording/*.wav

Synthesis builds (especially labeling) also work best if there is only a limited amount of leading and trailing silence. We can remove it with

./bin/prune_silence wav/*.wav

This uses F0 extraction to estimate where the speech starts and ends. It typically works well, but you should listen to the results to ensure it does the right thing. The original unpruned files are saved in unpruned/. A third pre-processing option is to shorten intra-sentence silences. This is only desirable for non-ideal recordings, but it can help labeling.

./bin/prune_middle_silence wav/*.wav

Note that if you do not require these three stages, you can put your wavefiles directly into wav/
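Since the pruning stages should be spot-checked by ear, it can help to see which files lost the most audio. This is only a sketch: it assumes the wav/ and unpruned/ layout described above, and the function names are invented for the example.

```python
# Sketch: report how much audio pruning removed from each file, so
# outliers can be auditioned by hand.  Run from the voice directory.
import os
import wave

def duration(path):
    """Duration of a wav file in seconds, via the stdlib wave module."""
    with wave.open(path, 'rb') as w:
        return w.getnframes() / w.getframerate()

def report_trimming(pruned_dir='wav', backup_dir='unpruned'):
    """Print, per file, the seconds removed relative to the backup copy."""
    for name in sorted(os.listdir(backup_dir)):
        if not name.endswith('.wav'):
            continue
        trimmed = (duration(os.path.join(backup_dir, name))
                   - duration(os.path.join(pruned_dir, name)))
        print(f'{name}: {trimmed:.2f}s removed')
```

Files where several seconds disappeared are the ones most worth listening to, since F0-based endpointing can occasionally clip quiet speech.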

Now you can build a voice as before. First build the prompts and label the data.

./bin/do_build build_prompts etc/
./bin/do_build label etc/
./bin/do_clustergen parallel build_utts etc/
./bin/do_clustergen generate_statename
./bin/do_clustergen generate_filters

Then do feature extraction

./bin/do_clustergen parallel f0_v_sptk
./bin/do_clustergen parallel mcep_sptk
./bin/do_clustergen parallel combine_coeffs_v

Build the models

./bin/traintest etc/
./bin/do_clustergen parallel cluster etc/
./bin/do_clustergen dur etc/

And generate some test examples: the first command gives MCD and F0D objective measures, the second generates standard tts output.

./bin/do_clustergen cg_test resynth cgp etc/
./bin/do_clustergen cg_test tts tts etc/
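For reference, mel cepstral distortion (MCD) between the reference and resynthesized mel cepstra is commonly quoted in the form below, averaged over frames; the exact coefficient range and frame alignment used by the clustergen scripts are not specified here, so treat this as the standard definition rather than the script's exact computation:

```latex
\mathrm{MCD} = \frac{10}{\ln 10}
  \sqrt{\,2 \sum_{d=1}^{D} \left( mc_d^{\mathrm{ref}} - mc_d^{\mathrm{syn}} \right)^2 }
```

Lower is better; for clustergen voices, drops in MCD between builds are a useful proxy for improved voice quality.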