The basic synthesis process in Festival is viewed as applying a set of modules to an utterance. Each module accesses various relations and items and may generate new features, items and relations. Thus, as the modules are applied, the utterance structure is filled in with more and more relations until ultimately the waveform is generated.

Modules may be written in C++ or Scheme. Which modules are executed is determined by the utterance type, a simple feature on the utterance itself. For most text-to-speech cases this type is Tokens. The function utt.synth looks up an utterance's type, then looks up the synthesis process defined for that type, and applies the named modules. Synthesis types may be defined using the function defUttType. For example, the definition for utterances of type Tokens is:

(defUttType Tokens
  (Token_POS utt)
  (Token utt)
  (POS utt)
  (Phrasify utt)
  (Word utt)
  (Pauses utt)
  (Intonation utt)
  (PostLex utt)
  (Duration utt)
  (Int_Targets utt)
  (Wave_Synth utt))
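
As a rough illustration of this dispatch, the following can be typed at the Festival prompt. It assumes a voice is already loaded and uses the Text utterance type, which is not shown above; the variable name utt1 is arbitrary.

```scheme
;; Create an utterance; its type (here Text) is recorded on the utterance.
(set! utt1 (Utterance Text "Hello world."))

;; utt.synth looks up the type and applies the modules defined for it.
(utt.synth utt1)

;; List the relations that the modules have added to the utterance.
(utt.relationnames utt1)

;; Play the generated waveform.
(utt.play utt1)
```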

A simpler case is when the input is phone names and we don't wish to do all that text analysis and prosody prediction. Then we use the type Phones, which simply loads the phones, applies fixed prosody and then synthesizes the waveform:

(defUttType Phones
  (Initialize utt)
  (Fixed_Prosody utt)
  (Wave_Synth utt))
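
A Phones utterance might be built and synthesized as sketched below. The phone names are an assumption: they are taken from the phone set used by the standard US English voices and must match whatever phone set the current voice defines.

```scheme
;; The Utterance form does not evaluate its arguments,
;; so the phone list is written unquoted.
(set! utt2 (Utterance Phones (pau hh ax l ow pau)))

;; Applies Initialize, Fixed_Prosody and Wave_Synth in turn.
(utt.synth utt2)
(utt.play utt2)
```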

In general the modules named in the type definitions are generic and allow further selection of more specific modules within them. For example, the Duration module respects the global parameter Duration_Method and will then call the desired duration module depending on its value.
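
For instance, the duration method could be inspected and changed from the Festival prompt as below. The value Default is an assumption about which methods are installed. A voice normally sets this parameter itself when it is selected, so a change made here may be overridden the next time a voice is selected.

```scheme
;; Show which duration method the current voice has selected.
(Parameter.get 'Duration_Method)

;; Ask the Duration module to use the Default method from now on.
(Parameter.set 'Duration_Method 'Default)

;; Later utterances pick up the new setting when Duration runs.
(utt.synth (Utterance Text "Hello world."))
```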

When building a new voice you will probably not need to change any of these definitions, though you may wish to add a new module. We will show how to do that in a later chapter, without requiring any change to the synthesis definitions.

There are many modules in the system, some of them simply wrappers that choose between other modules. However, the basic modules used for text-to-speech have the following functions:


Token_POS
    Basic token identification, used for homograph disambiguation.

Token
    Apply the token to word rules, building the Word relation.

POS
    A standard part of speech tagger (if desired).

Phrasify
    Build the Phrase relation using the specified method. Various methods are offered, from statistically trained models to simple CART trees.

Word
    Lexical lookup, building the Syllable and Segment relations and the SylStructure relation that links these together.

Pauses
    Prediction of pauses, inserting silence into the Segment relation, again through a choice of different prediction mechanisms.

Intonation
    Prediction of accents and boundaries, building the IntEvent relation and the Intonation relation that links IntEvents to syllables. This can easily be parameterized for most practical intonation theories.

PostLex
    Post-lexical rules that can modify segments based on their context. This is used for things like vowel reduction, contractions, etc.

Duration
    Prediction of durations of segments.

Int_Targets
    The second part of intonation. This creates the Target relation representing the desired F0 contour.

Wave_Synth
    A rather general function that in turn calls the appropriate method to actually generate the waveform.
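
To watch the utterance structure fill in, the modules can also be called one at a time rather than through utt.synth. This is a sketch under assumptions: it uses the Text utterance type and the Text module, which are not described above, and it presumes a voice with its lexicon is loaded.

```scheme
(set! utt1 (Utterance Text "Hello world."))
(Initialize utt1)
(Text utt1)               ; tokenize the input text
(Token_POS utt1)
(Token utt1)              ; token to word rules build the Word relation
(utt.relationnames utt1)  ; relations present so far

(POS utt1)
(Phrasify utt1)
(Word utt1)               ; lexical lookup adds syllables and segments
(mapcar item.name (utt.relation.items utt1 'Segment))
```

Calling utt.relationnames again after each later module shows which relations that module contributed.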