Making it better: Mixed excitation and Random Forests

Given the base form of a clustergen voice, you can improve it by using better signal parameterization, better machine learning techniques, or both. There are a large number of options available here. Many are experimental, some depend on the particular voice (and the quality of the recordings), and some do not actually make the voice better.

Adding parallel as the first argument to do_clustergen will make the script use all processors on the current machine. This will typically make builds much faster.

One important technique is mixed excitation, which provides a better model for the excitation of the spectral signal. To use it, first generate the mixed-excitation strengths. You must have NITECH's SPTK 3.6 (or later) installed to do this.

    export SPTKDIR=/usr/local/SPTK
    ./bin/do_clustergen parallel str_sptk

Then you need to combine these extra coefficients (5 per frame) with the standard combined coefficients.

    ./bin/do_clustergen parallel combine_coeffs_me

Then you need to set the lisp variable in festvox/clustergen.scm to use mixed excitation

    (set! cg:mixed_excitation t)

Then you can cluster the new set of parameters

    ./bin/do_clustergen parallel cluster etc/
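
The steps above can be chained in a small wrapper. The following is only a sketch, assuming it is run from the top-level voice directory: the names cg_step, me_enabled and me_build and the DRY_RUN variable are hypothetical helpers invented for this illustration and are not part of festvox. It refuses to cluster unless festvox/clustergen.scm already contains the setting shown above.

```shell
#!/bin/sh
# Sketch only: run the mixed-excitation steps from this section in order.
# DRY_RUN, cg_step, me_enabled and me_build are hypothetical names used for
# illustration; they are not provided by festvox.

# Run one do_clustergen step, or only print it when DRY_RUN=1.
cg_step() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "./bin/do_clustergen $*"
    else
        ./bin/do_clustergen "$@"
    fi
}

# Succeed if the given Scheme file switches mixed excitation on.
me_enabled() {
    grep -q '^[[:space:]]*(set! cg:mixed_excitation t)' "$1"
}

# Generate strengths, combine coefficients, check the setting, then cluster.
me_build() {
    if [ -z "${SPTKDIR:-}" ]; then
        echo "SPTKDIR is not set (for example /usr/local/SPTK)" >&2
        return 1
    fi
    cg_step parallel str_sptk || return 1
    cg_step parallel combine_coeffs_me || return 1
    if ! me_enabled festvox/clustergen.scm; then
        echo "cg:mixed_excitation is not set to t in festvox/clustergen.scm" >&2
        return 1
    fi
    cg_step parallel cluster etc/
}
```

After exporting SPTKDIR, calling `DRY_RUN=1 me_build` prints the three do_clustergen commands without running them, which is a cheap way to check the setup before starting a long build.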

You can generate an MCD (mel cepstral distortion) for the test set with

    ./bin/do_clustergen cg_test resynth cgp etc/
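
For reference, mel cepstral distortion between an original frame and a resynthesized frame is commonly defined as below. This is the usual textbook form, not a statement of exactly what cg_test computes. The symbols are chosen here for illustration: c_d and \hat{c}_d are the d-th mel cepstral coefficients of the original and resynthesized frame, and D is the number of coefficients compared. Which coefficients the script includes is an assumption left open.

```latex
\mathrm{MCD} = \frac{10}{\ln 10} \sqrt{2 \sum_{d=1}^{D} \left( c_d - \hat{c}_d \right)^2}
```

The per-frame values are averaged over all frames in the test set, and lower values indicate a closer match to the original speech.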

We also support using random forests to make better use of the limited data in a voice. We have scripts to build random forests for spectrum and duration prediction by randomly varying which features to use. We also include scripts to subselect from the set of models generated to find an almost optimal set. This is best shown in the script build_cg_rfs_voice. This somewhat ambitious script does a full build (with mixed excitation, move label and random forests) and builds flite versions of the voice along the way. We have used this script for the released versions of voices in festival 2.4 (and flite 2.0). We have used it for arctic-type voices, large 20-hour voices, and a large number of voices in other languages, both with crafted language components and grapheme-based versions.