
Building voices

Because we want the data to be compact, we define what are basically compilers that turn lexicons, unit databases, CART tree models, etc. into an efficient byte representation that can be linked into the Flite binary. Rather than writing code that generates .o format binaries directly, we have written conversion functions that generate C code, which can then be compiled into the appropriate binary representation.
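As a rough illustration of the "generate C code rather than .o files" idea, the sketch below renders a block of bytes as the source text of a const C array. The function name emit_c_array and the array name unit_db used with it are hypothetical and are not taken from the Flite sources; the actual conversion functions may look quite different.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper (not a Flite function): write the C source text of a
   const byte array called `name`, holding the `n` bytes in `data`, into
   `buf`.  Returns the number of characters written, or -1 if there is
   nothing to emit or `buf` is too small. */
int emit_c_array(char *buf, size_t buflen, const char *name,
                 const unsigned char *data, size_t n)
{
    size_t used, i;
    int w;

    assert(buf != NULL && name != NULL && data != NULL);
    if (n == 0)
        return -1;              /* a zero-length array is not valid C */

    w = snprintf(buf, buflen, "const unsigned char %s[%lu] = {",
                 name, (unsigned long)n);
    if (w < 0 || (size_t)w >= buflen)
        return -1;
    used = (size_t)w;

    for (i = 0; i < n; i++) {
        /* comma-separated decimal byte values */
        w = snprintf(buf + used, buflen - used, "%s%u",
                     i ? "," : "", (unsigned)data[i]);
        if (w < 0 || (size_t)w >= buflen - used)
            return -1;
        used += (size_t)w;
    }

    w = snprintf(buf + used, buflen - used, "};\n");
    if (w < 0 || (size_t)w >= buflen - used)
        return -1;
    used += (size_t)w;
    return (int)used;
}
```

The emitted text would then be written to a .c file and handed to the ordinary C compiler, which takes care of the target's object format and, because of the const qualifier, is free to place the table in read-only storage.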

In most cases this C code consists only of C data structures. As most of these structures are constant, we want them to be explicitly declared as such, so that they are read-only and can be put in ROM. Building a new synthesizer that uses the same basic voice definitions as an existing synthesizer requires us to be very specific about what really is in a voice definition.

Although we intend to follow the NSW model for text normalization [6], something Festival does not yet quite do, the basic "expanders" (e.g. number-to-word routines) had to be explicitly recoded. This level of recoding is probably always going to be required for new languages and will never be automatically compiled from the Festival code.
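To give a feel for what a hand-coded expander involves, here is a minimal sketch of an English number-to-word routine for values from 0 to 999. It is an illustration written for this section, not the routine used in Flite; the names expand_number and append_word are hypothetical, and a real expander would also have to handle larger numbers, ordinals, years and so on.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

static const char *const ones[] = {
    "zero", "one", "two", "three", "four", "five", "six", "seven",
    "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
    "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"
};
static const char *const tens[] = {
    "", "", "twenty", "thirty", "forty", "fifty",
    "sixty", "seventy", "eighty", "ninety"
};

/* Append `word` to the string in `buf`, separated by a space if `buf`
   is not empty.  Returns 0 on success, -1 if it would not fit. */
static int append_word(char *buf, size_t buflen, const char *word)
{
    size_t used, need;

    assert(buf != NULL && word != NULL);
    used = strlen(buf);
    need = strlen(word) + (used ? 1 : 0);
    if (used + need + 1 > buflen)
        return -1;
    if (used)
        buf[used++] = ' ';
    strcpy(buf + used, word);
    return 0;
}

/* Hypothetical expander: write the English words for n (0..999) into buf.
   Returns 0 on success, -1 on a bad argument or a too-small buffer. */
int expand_number(unsigned n, char *buf, size_t buflen)
{
    if (buf == NULL || buflen == 0 || n > 999)
        return -1;
    buf[0] = '\0';

    if (n >= 100) {
        if (append_word(buf, buflen, ones[n / 100]) < 0) return -1;
        if (append_word(buf, buflen, "hundred") < 0) return -1;
        n %= 100;
        if (n == 0) return 0;   /* exact hundreds need no further words */
    }
    if (n >= 20) {
        if (append_word(buf, buflen, tens[n / 10]) < 0) return -1;
        n %= 10;
        if (n == 0) return 0;   /* exact tens need no units word */
    }
    return append_word(buf, buflen, ones[n]);
}
```

Note that the word tables and the logic are both specific to English, which is why this kind of code tends to be rewritten per language rather than derived automatically.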

Many of Festival's models use simple CART trees; thus we include a simple routine that takes a Scheme CART tree as used by Festival and converts it to a C representation. This allows various models to be translated into C directly. CART trees consist of nodes and leaves; each node consists of a question containing a feature pathname, an operator, and a value, plus an outcome-yes-node and an outcome-no-node. These can easily be encoded in an efficient C structure which can be treated as a constant (const) object. Although you may choose between different CART trees at run-time, the trees themselves will never be modified at run-time.
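The node layout described above might be encoded along the following lines. This is a minimal sketch under several assumptions: the type and function names (cart_node, cart_predict, feature), the two operators, the use of integer values, and the tiny example tree with its feature names are all hypothetical and are not the structures Flite actually uses.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Operators supported by this sketch; CART_LEAF marks a leaf node. */
typedef enum { CART_LEAF, CART_IS, CART_LESS } cart_op;

typedef struct {
    cart_op op;          /* question operator, or CART_LEAF */
    const char *feat;    /* feature pathname (NULL for leaves) */
    int value;           /* value to compare against, or the leaf's answer */
    unsigned short yes;  /* index of the outcome-yes node */
    unsigned short no;   /* index of the outcome-no node */
} cart_node;

/* A named feature value, standing in for a lookup on a real utterance. */
typedef struct { const char *name; int value; } feature;

/* Hypothetical tree: the whole table is const, so it can live in ROM. */
const cart_node example_tree[] = {
    /* 0 */ { CART_IS,   "stress",       1,   1, 2 },
    /* 1 */ { CART_LEAF, NULL,           120, 0, 0 },
    /* 2 */ { CART_LESS, "syl_position", 2,   3, 4 },
    /* 3 */ { CART_LEAF, NULL,           80,  0, 0 },
    /* 4 */ { CART_LEAF, NULL,           95,  0, 0 }
};

/* Find a feature by name; missing features default to 0 in this sketch. */
static int lookup(const feature *feats, size_t nfeats, const char *name)
{
    size_t i;
    for (i = 0; i < nfeats; i++)
        if (strcmp(feats[i].name, name) == 0)
            return feats[i].value;
    return 0;
}

/* Walk from the root (index 0) to a leaf and return the leaf's value. */
int cart_predict(const cart_node *tree, const feature *feats, size_t nfeats)
{
    unsigned short i = 0;

    assert(tree != NULL);
    while (tree[i].op != CART_LEAF) {
        int v = lookup(feats, nfeats, tree[i].feat);
        int yes = (tree[i].op == CART_IS) ? (v == tree[i].value)
                                          : (v < tree[i].value);
        i = yes ? tree[i].yes : tree[i].no;
    }
    return tree[i].value;
}
```

Referring to child nodes by array index rather than by pointer is one plausible design choice here: it keeps each node small and leaves the table free of addresses that would need fixing up at load time. Choosing between trees at run-time then amounts to passing a different const table to cart_predict.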

We have not yet made the conversion of a FestVox voice fully automatic, and it's not clear we ever will or should. Although each voice definition in FestVox follows a basic pattern, it may be customized in very idiosyncratic ways, including specific tokenization rules and prosody rules. However, we can provide the basic tools.

For basic diphone voices for known languages and simple generic limited domain voices built using the FestVox build model, we believe a generic conversion process is possible and will be provided, but there will always be a fair amount of skill involved in conversion, as there is in voice building itself.

Also, we expect that building a voice for Flite is not just a one-to-one mapping but an occasion for customization for size and speed, so human decisions will be necessary.


Alan W Black 2001-08-26