Statistical Parametric Synthesis

Building a CLUSTERGEN Statistical Parametric Synthesizer

This method, inspired by the work of Keiichi Tokuda and NITECH's HMM Speech Synthesis Toolkit, builds statistical parametric synthesizers from databases of natural speech. Although the result is still not as crisp as a well done unit selection voice, with this method it is much easier to get a nice clear synthetic voice that models the original speaker well.

Although this method is partially "tagged on to" the clunits method, it is actually quite independent. The tasks are as follows.

We assume you have read the rest of this chapter (though, in reality, we know you probably haven't), thus the descriptions here are quite minimal.

First make an empty directory and in it run the setup_cg setup command.

    mkdir cmu_us_awb_arctic
    cd cmu_us_awb_arctic
    $FESTVOXDIR/src/clustergen/setup_cg cmu us awb_arctic

If you already have an existing voice, running setup_cg will only copy in the necessary files for clustergen. However, I recommend starting from scratch, as I don't know when you created your previous voice and I'm not sure of its exact state.

Now you need to get your waveform files and prompt file. Put your waveform files in wav/ and your prompt file in etc/txt.done.data. Note that you should probably use bin/get_wavs to copy the waveform files so that they get power normalized and converted to a reasonable format (16 kHz, 16 bit, RIFF format).
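
If your waveform files came from somewhere other than bin/get_wavs, it may be worth checking their headers before going further. The following is only a minimal sketch in Python, using the standard wave module; the function name check_wav_format is hypothetical and is not part of festvox. It checks only the sample rate and sample width mentioned above, and does nothing about power normalization.

```python
import wave


def check_wav_format(path, rate=16000, sample_bytes=2):
    """Return a list of problems found in the header of a RIFF wav file.

    Hypothetical helper, not part of festvox. The defaults correspond to
    the 16 kHz, 16 bit format recommended in the text.
    """
    problems = []
    with wave.open(path, "rb") as w:
        if w.getframerate() != rate:
            problems.append(
                "sample rate is %d Hz, expected %d" % (w.getframerate(), rate))
        if w.getsampwidth() != sample_bytes:
            problems.append(
                "sample width is %d bytes, expected %d"
                % (w.getsampwidth(), sample_bytes))
        if w.getnframes() == 0:
            problems.append("file contains no samples")
    return problems
```

An empty list means the header matches; anything else is a hint that the file should be converted (for example by re-copying it with bin/get_wavs).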

If you are going to record them in your current directory, you should call

    ./bin/do_build build_prompts_waves

first to generate example waveforms, then use

    ./bin/prompt_them etc/txt.done.data 1

to prompt you and record the prompts. You must check that the recording actually works. It should generate recordings in wav/. You can use $ESTDIR/bin/na_play to play the waveform files. prompt_them can be stopped with ctrl-c and restarted at the line number given as the second argument.

If you have collected the waveform files by some other process, you do not need to generate the prompt waveform files, so you can just use

    ./bin/do_build build_prompts

which will generate the prompt utterances (which are used to find the expected phones), but not the prompt waveforms.

The next stage is to label the data. If you aren't very knowledgeable about labeling in clustergen, you should use the EHMM labeler. EHMM constructs the labels in the right format for segments and HMM states, and matches them properly with what the synthesizer generates for the prompts. Using other labels is likely to cause more problems. Even if you already have other labels, use EHMM first.

    ./bin/do_build build_prompts
    ./bin/do_build label
    ./bin/do_build build_utts

The EHMM labeler has been shown to be very reliable, and can nicely deal with silence insertion. It isn't very fast, though, and will take several hours. You can check the file ehmm/mod/log100.txt to see the Baum-Welch iterations; there will probably be 20-30. The ARCTIC a-set takes about 3-4 hours to label.

Parametric synthesis requires a reversible parameterization. The setup here uses a form of mel cepstrum, the same version that is used by NITECH's basic HTS build. The parameter build is in two parts: building the F0 and building the mceps themselves. These are then combined into a single parameter file for each utterance in the database.

    ./bin/do_clustergen f0
    ./bin/do_clustergen mcep
    ./bin/do_clustergen voicing
    ./bin/do_clustergen combine_coeffs_v

The mcep part takes the longest. Note that the F0 part now tries to estimate the speaker's F0 range and modifies the parameters for the F0 extraction program accordingly. (The F0 params are saved in etc/f0.params.)
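
As a rough picture of what "combined into a single parameter file" means, each frame of an utterance ends up as one vector holding its F0 value together with its mel cepstral coefficients. The sketch below is illustrative only: the function name combine_frames, the ordering of the fields (F0 first, then mceps, then voicing) and the example values are my assumptions, not a description of the actual file layout written by do_clustergen.

```python
def combine_frames(f0, mcep, voicing):
    """Join per-frame F0, mcep and voicing tracks into one vector per frame.

    Hypothetical illustration: the field order used here is an assumption.
    f0 and voicing are lists of floats, mcep is a list of lists of floats,
    and all three must describe the same number of frames.
    """
    if not (len(f0) == len(mcep) == len(voicing)):
        raise ValueError("tracks must have the same number of frames")
    return [[f] + list(m) + [v] for f, m, v in zip(f0, mcep, voicing)]
```

For example, two frames with a two-coefficient mcep each give two vectors of length four.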

If you want to have a test set of utterances, you can separate out some of your prompt list. The test set should be put in the file etc/txt.done.data.test. The following commands will make a training and test set (every 10th prompt in the test set, the other 9 in the training set).

    ./bin/traintest etc/txt.done.data
    cat etc/txt.done.data.train >etc/txt.done.data
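
To make the split concrete, here is a minimal Python sketch of the "every 10th prompt" rule described above. The function name split_train_test is hypothetical, and which prompt out of each ten lands in the test set (here the 10th, 20th, and so on) is an assumption; check bin/traintest itself if the exact selection matters to you.

```python
def split_train_test(lines, every=10):
    """Split prompt lines so that every `every`-th line goes to the test set.

    Hypothetical sketch of the rule described in the text, not the
    traintest script itself. Lines are counted from 1.
    """
    train, test = [], []
    for number, line in enumerate(lines, start=1):
        if number % every == 0:
            test.append(line)
        else:
            train.append(line)
    return train, test
```

With 20 prompts this puts 2 in the test set and 18 in the training set. Note that the second command above then replaces etc/txt.done.data with the training set, so later build steps only see the training prompts.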

The next stage is to build the parametric model. Three parts are required for this. The first is very quick and simply puts the state (and phone) names into their respective files. It assumes a file etc/statenames, which is generated by EHMM. The second stage builds the parametric models themselves. The last builds a duration model for the state names.

   ./bin/do_clustergen generate_statenames
   ./bin/do_clustergen generate_filters
   ./bin/do_clustergen cluster
   ./bin/do_clustergen dur

The resulting voice should now work.

   festival festvox/cmu_us_awb_arctic_cg.scm
   ...
   festival> (voice_cmu_us_awb_arctic_cg)
   ...
   festival> (SayText "This is a little example.")

The voice can be packaged for distribution by the command

   ./bin/do_clustergen festvox_dist

This will generate festvox_cmu_us_awb_arctic_cg.tar.gz, which will be quite small compared to a clunit voice made with the same databases. Because only the parameters are kept (in fact only the means and standard deviations of clusters of parameters), and these do not include residual or excitation information, the result is orders of magnitude smaller than a full unit selection voice.
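
To see why keeping only means and standard deviations is so compact, consider one cluster of parameter vectors: however many frames fall into it, only two numbers per dimension survive. The following Python sketch illustrates that reduction. The function name cluster_stats is hypothetical, the use of the population standard deviation is my assumption, and how frames get assigned to clusters in the first place is outside this sketch.

```python
import math


def cluster_stats(frames):
    """Return (means, stddevs), one value per dimension, for one cluster.

    Hypothetical illustration: `frames` is a non-empty list of equal-length
    parameter vectors that have already been grouped into a single cluster.
    The population standard deviation (dividing by n) is an assumption.
    """
    n = len(frames)
    dims = len(frames[0])
    means = [sum(f[d] for f in frames) / n for d in range(dims)]
    stddevs = [
        math.sqrt(sum((f[d] - means[d]) ** 2 for f in frames) / n)
        for d in range(dims)
    ]
    return means, stddevs
```

A cluster of a thousand 25-dimensional frames would be stored as 50 numbers rather than 25,000, and none of the original waveform needs to be kept.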

There are two other options in the clustergen voice build. These involve modeling trajectories rather than individual vectors. They give objectively better results (though only marginally better subjective results for the voices we have tested). Instead of the line

   ./bin/do_clustergen cluster

you can run

   ./bin/do_clustergen trajectory

or the slightly better

   ./bin/do_clustergen trajectory_ola

These two options may be run after the simple version of the voice has been built.

If you created etc/txt.done.data.test in the step above, you can test your voice with held-out data by running

   $FESTVOXDIR/src/clustergen/cg_test resynth cgp

NOTE: This no longer works automatically, as you need static mceps and ccoefs for this to work. This will create parameter files (and waveform files) in test/cgp. The output of cg_test also includes four measures: the mean difference for all features in the parameter vector, for F0 alone, for all but F0, and MCD (mel cepstral distortion).
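
For readers unfamiliar with these measures, the Python sketch below shows one common way such numbers can be computed from time-aligned original and predicted frames. It is an illustration under assumptions, not the cg_test implementation: the function names are hypothetical, the MCD uses the widely used form (10 / ln 10) * sqrt(2 * sum of squared differences) averaged over frames, and whether cg_test uses absolute or squared differences, or which coefficients it includes, is not stated in the text.

```python
import math


def mel_cepstral_distortion(ref, syn):
    """Mean MCD in dB over frames, for equal-length lists of mcep vectors.

    Hypothetical sketch. The caller decides which coefficients to pass in
    (for example, leaving out the energy term is a common choice).
    """
    if not ref or len(ref) != len(syn):
        raise ValueError("need the same non-zero number of frames")
    scale = 10.0 / math.log(10.0)
    total = 0.0
    for r, s in zip(ref, syn):
        squared = sum((a - b) ** 2 for a, b in zip(r, s))
        total += scale * math.sqrt(2.0 * squared)
    return total / len(ref)


def mean_abs_difference(ref, syn, dims):
    """Mean absolute difference over the chosen dimensions of each frame.

    Hypothetical sketch: `dims` lists the vector positions to compare,
    e.g. only the F0 position, or every position except F0.
    """
    diffs = [abs(r[d] - s[d]) for r, s in zip(ref, syn) for d in dims]
    return sum(diffs) / len(diffs)
```

Assuming, purely for illustration, that F0 sits at position 0 of each vector, mean_abs_difference(ref, syn, [0]) would give an F0-only figure and passing the remaining positions would give the "all but F0" figure.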