The basic synthesis process in Festival is viewed as applying a set of modules to an utterance. Each module will access various relations and items and potentially generate new features, items and relations. Thus as the modules are applied the utterance structure is filled in with more and more relations until ultimately the waveform is generated.
Modules may be written in C++ or Scheme. Which modules are executed is defined in terms of the utterance type, a simple feature on the utterance itself. For most text-to-speech cases this is defined to be Tokens. The function utt.synth simply looks up an utterance's type, then looks up the synthesis process defined for that type, and applies the named modules.
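Synthesis types may be defined using the function defUttType. For example, the definition for utterances of type Tokens lists every module to be applied, from token analysis through to waveform synthesis.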
A simpler case is when the input is phone names and we don't wish to do all that text analysis and prosody prediction. Then we use the type Phones, which simply loads the phones, applies fixed prosody and then synthesizes the waveform.
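The lookup-and-apply behaviour described above can be modelled in a few lines. What follows is a toy Python sketch, not Festival code: the names def_utt_type, utt_synth, load_phones, fixed_prosody and wave_synth, the dictionary layout of the utterance, and the duration and pitch values are all assumptions made for illustration.

```python
# Toy model only: none of these names are Festival's real API.
UTT_TYPES = {}  # maps an utterance type name to an ordered list of modules


def def_utt_type(name, modules):
    """Register the ordered module list for an utterance type."""
    UTT_TYPES[name] = list(modules)


def utt_synth(utt):
    """Look up the utterance's type and apply each of its modules in order."""
    for module in UTT_TYPES[utt["type"]]:
        utt = module(utt)
    return utt


def load_phones(utt):
    # Hypothetical module: copy the input phone names into a Segment relation.
    utt["relations"]["Segment"] = [{"name": p} for p in utt["input"]]
    return utt


def fixed_prosody(utt):
    # Hypothetical module: same duration (seconds) and pitch (Hz) everywhere.
    for seg in utt["relations"]["Segment"]:
        seg["duration"] = 0.1
        seg["f0"] = 120
    return utt


def wave_synth(utt):
    # Hypothetical module: a summary string stands in for the waveform.
    segs = utt["relations"]["Segment"]
    utt["relations"]["Wave"] = "wave(%d segments)" % len(segs)
    return utt


def_utt_type("Phones", [load_phones, fixed_prosody, wave_synth])

utt = {"type": "Phones", "input": ["h", "@", "l", "ou"], "relations": {}}
utt = utt_synth(utt)
print(sorted(utt["relations"]))  # ['Segment', 'Wave']
```

The point of the sketch is that the synthesis function itself knows nothing about any particular module. Supporting a new kind of input only requires registering another type with its own module list.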
In general the modules named in the type definitions are generic and allow further selection of more specific modules within them. For example, the Duration module respects the global parameter Duration_Method and will call the desired duration module depending on its value.
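This kind of indirection amounts to a dispatch table keyed on a global setting. The Python sketch below is only an illustration of the idea: the PARAMETERS dictionary, the method names "default" and "average", and the duration values are invented here and are not taken from Festival.

```python
# Toy model only: names and values are illustrative, not Festival's.
PARAMETERS = {"Duration_Method": "average"}


def duration_default(segments):
    # Hypothetical method: every segment gets the same fixed length (seconds).
    return [0.1 for _ in segments]


def duration_average(segments):
    # Hypothetical method: per-phone average, falling back to 0.1 s.
    averages = {"a": 0.12, "t": 0.06}
    return [averages.get(s, 0.1) for s in segments]


DURATION_METHODS = {"default": duration_default, "average": duration_average}


def duration(segments):
    """General module: delegate to whichever method the parameter selects."""
    method = DURATION_METHODS[PARAMETERS["Duration_Method"]]
    return method(segments)


print(duration(["t", "a", "k"]))  # [0.06, 0.12, 0.1]
```

Because the type definition only names the general module, a voice can switch duration models by changing one parameter and leave the list of modules untouched.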
When building a new voice you will probably not need to change any of these definitions, though you may wish to add a new module. A later chapter shows how to do that without requiring any change to the synthesis definitions.
There are many modules in the system, some of them simply wrappers that choose between other modules. However, the basic modules used for text-to-speech have the following functions:

Token_POS
    Basic token identification, used for homograph disambiguation.
Token
    Applies the token-to-word rules, building the Word relation.
POS
    A standard part-of-speech tagger (if desired).
Phrasify
    Builds the Phrase relation using the specified method. Various methods are offered, from statistically trained models to simple CART trees.
Word
    Lexical lookup, building the Syllable and Segment relations and the SylStructure relation that relates these together.
Pauses
    Prediction of pauses, inserting silence into the Segment relation, again through a choice of different prediction mechanisms.
Intonation
    Prediction of accents and boundaries, building the IntEvent relation and the Intonation relation that links IntEvents to syllables. This can easily be parameterized for most practical intonation theories.
PostLex
    Post-lexicon rules that can modify segments based on their context. This is used for things like vowel reduction, contractions, etc.
Duration
    Prediction of durations of segments.
Int_Targets
    The second part of intonation. This creates the Target relation representing the desired F0 contour.
Wave_Synth
    A rather general function that in turn calls the appropriate method to actually generate the waveform.
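One way to see why the order of these modules matters is to track which relations exist after each step. The Python sketch below is a rough illustration under stated assumptions: the stage labels are invented, the relation names follow the usual Festival naming, and the "reads" column is a guess at each stage's dependencies, not something documented here.

```python
# Illustrative table: (stage, relations it reads, relations it creates).
# Labels and dependencies are assumptions for this sketch, not Festival's
# internals.
STAGES = [
    ("token_to_word", {"Token"}, {"Word"}),
    ("phrasing", {"Word"}, {"Phrase"}),
    ("lexical_lookup", {"Word"}, {"Syllable", "Segment", "SylStructure"}),
    ("pauses", {"Segment", "Phrase"}, set()),
    ("intonation", {"Syllable"}, {"IntEvent", "Intonation"}),
    ("post_lexical", {"Segment"}, set()),
    ("duration", {"Segment"}, set()),
    ("f0_targets", {"Segment", "IntEvent"}, {"Target"}),
    ("wave_synthesis", {"Segment", "Target"}, {"Wave"}),
]


def trace(stages, initial):
    """Return the relations present after each stage, checking the order."""
    present = set(initial)
    history = []
    for name, needs, creates in stages:
        missing = needs - present
        if missing:
            raise ValueError("%s needs %s" % (name, sorted(missing)))
        present |= creates
        history.append((name, sorted(present)))
    return history


history = trace(STAGES, {"Token"})
for name, relations in history:
    print(name, relations)
```

Stages that create nothing new, such as pause insertion or duration prediction, still change the utterance by adding items or features to existing relations. Running the table out of order raises an error as soon as a stage asks for a relation that has not been built yet.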