Statistical Parametric Synthesis

Building a CLUSTERGEN Statistical Parametric Synthesizer

This method, inspired by the work of Keiichi Tokuda and NITECH's HMM Speech Synthesis Toolkit, builds statistical parametric synthesizers from databases of natural speech. Although the result is still not as crisp as a well-done unit selection voice, it is much easier with this method to get a nice clear synthetic voice that models the original speaker well.

Although this method is partially "tagged on" to the clunits method, it is actually quite independent. The tasks are as follows.

We assume you have read the rest of this chapter (though, in reality, we know you probably haven't), thus the descriptions here are quite minimal.

First make an empty directory and in it run the setup_cg script. Its three arguments name the institution, the language, and the speaker/database.

    mkdir cmu_us_awb_arctic
    cd cmu_us_awb_arctic
    $FESTVOXDIR/src/clustergen/setup_cg cmu us awb_arctic

If you already have an existing voice, running setup_cg will only copy in the files necessary for clustergen. However, I'd recommend starting from scratch, as I don't know when you created your previous voice and I'm not sure of its exact state.

Now you need to get your waveform files and prompt file. Put your waveform files in the wav/ directory and your prompt file in etc/txt.done.data. Note you should probably use bin/get_wavs to copy in the wavefiles so that they get power normalized and converted to a reasonable format (16kHz, 16-bit, RIFF).
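
For example, assuming your raw recordings are in a directory called recording/ (the directory name here is just illustrative), the copy-and-normalize step looks like this:

    ./bin/get_wavs recording/*.wav

Each line of etc/txt.done.data pairs a fileid with its prompt text, in the standard festvox form:

    ( arctic_a0001 "Author of the danger trail, Philip Steels, etc." )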

The next stage is to label the data. If you aren't very knowledgeable about labeling in clustergen, you should use the EHMM labeler. EHMM constructs the labels in the right format for segments and HMM states, and matches them properly with what the synthesizer generates for the prompts. Using other labels is likely to cause more problems; even if you already have other labels, use EHMM first.

    ./bin/do_build build_prompts
    ./bin/do_build label
    ./bin/do_build build_utts

The EHMM labeler has been shown to be very reliable and deals nicely with silence insertion. It isn't very fast, though, and will take several hours. You can check the file ehmm/mod/log100.txt to follow the Baum-Welch iterations, of which there will probably be 20-30. The ARCTIC a-set takes about 3-4 hours to label.
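
Labeling can be left to run unattended; if you want to watch its progress, plain tail on the log file mentioned above is enough:

    tail -f ehmm/mod/log100.txt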

Parametric synthesis requires a reversible parameterization. The setup here uses a form of mel cepstrum, the same version that is used by NITECH's basic HTS build. Parameter building is in two parts: building the F0 and building the mceps themselves. These are then combined into a single parameter file for each utterance in the database.

    ./bin/do_clustergen f0
    ./bin/do_clustergen mcep
    ./bin/do_clustergen combine_coeffs

The mcep part takes the longest. Note that the F0 part now tries to estimate the F0 range of the speaker and modifies the parameters of the F0 extraction program accordingly. (The F0 params are saved in etc/f0.params.)
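
If you are curious what range was estimated, you can simply inspect the saved file; the exact variable names inside it may vary between festvox versions, but it should contain speaker-specific settings such as minimum and maximum F0:

    cat etc/f0.params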

If you want to have a test set of utterances, you can separate out some of your prompt list. The test set should be put in the file etc/txt.done.data.test. The following commands will make a training and test set (every 10th prompt goes into the test set, the other 9 into the training set).

    ./bin/traintest etc/txt.done.data
    cat etc/txt.done.data.train >etc/txt.done.data
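
To sanity-check the split, count the prompts in each file with standard tools; the train set should be roughly nine times the size of the test set:

    wc -l etc/txt.done.data.train etc/txt.done.data.test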

The next stage is to build the parametric model. Three parts are required for this. The first is very quick and simply puts the state (and phone) names into their respective files; it assumes a file etc/statenames, which is generated by EHMM. The second stage builds the parametric models themselves. The last builds a duration model for the state names.

   ./bin/do_clustergen generate_statenames
   ./bin/do_clustergen cluster
   ./bin/do_clustergen dur

The resulting voice should now work:

   festival festvox/cmu_us_awb_arctic_cg.scm
   ...
   festival> (voice_cmu_us_awb_arctic_cg)
   ...
   festival> (SayText "This is a little example.")

The voice can be packaged for distribution with the command:

   ./bin/do_clustergen festvox_dist

This will generate festvox_cmu_us_awb_arctic_cg.tar.gz, which will be quite small compared to a clunits voice made from the same database. Because only the parameters are kept (in fact, only the means and standard deviations of clusters of parameters), with no residual or excitation information, the result is orders of magnitude smaller than a full unit selection voice.
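
You can confirm what went into the distribution, and how small it actually is, with ordinary tar and du (nothing clustergen-specific):

   tar tzf festvox_cmu_us_awb_arctic_cg.tar.gz | head
   du -h festvox_cmu_us_awb_arctic_cg.tar.gz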

There are two other options in the clustergen voice build. These involve modeling trajectories rather than individual vectors. They give objectively better results (though only marginally better subjective results for the voices we have tested). Instead of the line

   ./bin/do_clustergen cluster

You can run

   ./bin/do_clustergen trajectory

or the slightly better

   ./bin/do_clustergen trajectory_ola

These two options may be run after building the simple version of the voice.

You can test your voice with held-out data. If you created etc/txt.done.data.test in the step above, you can run

   $FESTVOXDIR/src/clustergen/cg_test resynth cgp

This will create parameter files (and waveform files) in test/cgp. The output of cg_test also reports four measures: the mean difference for all features in the parameter vector, for F0 alone, for all features except F0, and the MCD (mel cepstral distortion).
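
For reference, MCD between a reference and a synthesized mel cepstral vector is conventionally defined as below, in dB, averaged over all frames. Note this is the textbook form; the exact coefficient range cg_test uses (e.g. whether c0 is included) may differ:

   MCD = \frac{10}{\ln 10} \sqrt{2 \sum_{d=1}^{D} \left( mc_d^{ref} - mc_d^{syn} \right)^2}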