Building Synthetic Voices
A Practical Speech Synthesis System
The basic synthesis process in Festival is viewed as applying a set of modules to an utterance. Each module will access various relations and items and potentially generate new features, items and relations. Thus as the modules are applied the utterance structure is filled in with more and more relations until ultimately the waveform is generated.
Modules may be written in C++ or Scheme. Which modules are executed is defined in terms of the utterance type, a simple feature on the utterance itself. For most text-to-speech cases this is defined to be of type Tokens. The function utt.synth simply looks up an utterance's type, then looks up the synthesis process defined for that type and applies the named modules. Synthesis types may be defined using the function defUttType.
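The lookup that utt.synth performs can be pictured with a small model. The sketch below is written in Python rather than Festival's Scheme and is not Festival's implementation: UTT_TYPES, def_utt_type, utt_synth, make_module and the dictionary-based utterance are all hypothetical names chosen for illustration.

```python
# Toy model of type-driven synthesis; NOT Festival code.
# An utterance is a plain dict with a "type" feature and a log of applied modules.

UTT_TYPES = {}  # hypothetical registry: type name -> ordered list of modules


def def_utt_type(name, modules):
    """Register the ordered module list for an utterance type."""
    UTT_TYPES[name] = list(modules)


def utt_synth(utt):
    """Look up the utterance's type, then apply that type's modules in order."""
    utt_type = utt["type"]
    if utt_type not in UTT_TYPES:
        raise ValueError(f"unknown utterance type: {utt_type}")
    for module in UTT_TYPES[utt_type]:
        utt = module(utt)
    return utt


def make_module(name):
    """Build a stand-in module that only records that it ran."""
    def module(utt):
        utt["applied"].append(name)
        return utt
    return module


# "Demo", "First" and "Second" are made-up names, not Festival types or modules.
def_utt_type("Demo", [make_module("First"), make_module("Second")])

utt = utt_synth({"type": "Demo", "applied": []})
print(utt["applied"])  # ['First', 'Second']
```

The point of the design is that the caller never names the modules directly: changing the registered definition for a type changes what synthesis does for every utterance of that type.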
The definition for utterances of type Tokens, for example, is

    (defUttType Tokens
      (Token_POS utt)
      (Token utt)
      (POS utt)
      (Phrasify utt)
      (Word utt)
      (Pauses utt)
      (Intonation utt)
      (PostLex utt)
      (Duration utt)
      (Int_Targets utt)
      (Wave_Synth utt)
    )

A simpler case is when the input is phone names and we don't wish to do all that text analysis and prosody prediction. Then we use the type Phones, which simply loads the phones, applies fixed prosody and then synthesizes the waveform.
    (defUttType Phones
      (Initialize utt)
      (Fixed_Prosody utt)
      (Wave_Synth utt)
    )

In general the modules named in the type definitions are generic and allow further selection of more specific modules within them. For example, the Duration module respects the global parameter Duration_Method and will then call the desired duration module depending on this value. When building a new voice you will probably not need to change any of these definitions, though you may wish to add a new module; a later chapter shows how to do that without requiring any change to the synthesis definitions.
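The way a general module defers to a more specific one can be sketched as follows. This is a Python model under stated assumptions, not Festival code: the PARAMETERS table, the method names "fixed" and "by_class", and the duration values are all invented for illustration and do not correspond to Festival's actual duration methods.

```python
# Toy model of a wrapper module selecting a specific method; NOT Festival code.

PARAMETERS = {"Duration_Method": "fixed"}  # hypothetical global parameter table


def duration_fixed(utt):
    """Give every segment the same duration (0.1 s, an arbitrary value)."""
    utt["durations"] = [0.1 for _ in utt["segments"]]
    return utt


def duration_by_class(utt):
    """Give vowels a longer duration than other segments (arbitrary values)."""
    vowels = {"a", "e", "i", "o", "u"}
    utt["durations"] = [0.12 if seg in vowels else 0.06 for seg in utt["segments"]]
    return utt


# Hypothetical table of available methods, keyed by parameter value.
DURATION_METHODS = {"fixed": duration_fixed, "by_class": duration_by_class}


def duration(utt):
    """General module: dispatch on the global Duration_Method parameter."""
    method = PARAMETERS["Duration_Method"]
    return DURATION_METHODS[method](utt)


utt = {"segments": ["h", "e", "l", "o"]}
print(duration(dict(utt))["durations"])  # [0.1, 0.1, 0.1, 0.1]

PARAMETERS["Duration_Method"] = "by_class"
print(duration(dict(utt))["durations"])  # [0.06, 0.12, 0.06, 0.12]
```

Because the type definition only ever names the general module, a voice can switch duration models by setting one parameter and leave the synthesis definition untouched.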
There are many modules in the system, some of which are simply wrappers that choose between other modules. The basic modules used for text-to-speech have the following basic functions:
Token_POS
Basic token identification, used for homograph disambiguation.
Token
Applies the token-to-word rules, building the Word relation.
POS
A standard part-of-speech tagger (if desired).
Phrasify
Builds the Phrase relation using the specified method. Various methods are offered, from statistically trained models to simple CART trees.
Word
Lexical lookup, building the Syllable and Segment relations and the SylStructure relation that relates these together.
Pauses
Prediction of pauses, inserting silence into the Segment relation, again through a choice of different prediction mechanisms.
Intonation
Prediction of accents and boundaries, building the IntEvent relation and the Intonation relation that links IntEvents to syllables. This can easily be parameterized for most practical intonation theories.
PostLex
Post-lexical rules that can modify segments based on their context. This is used for things like vowel reduction, contractions, etc.
Duration
Prediction of durations of segments.
Int_Targets
The second part of intonation. This creates the Target relation representing the desired F0 contour.
Wave_Synth
A rather general function that in turn calls the appropriate method to actually generate the waveform.
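To see how the utterance structure fills in as the Tokens modules run, the list above can be turned into a simple bookkeeping model. The Python sketch below is not Festival code; the function and table names are hypothetical, and it records only the relations that the list explicitly says each module builds. Modules the list does not associate with a new relation are treated as adding none, which understates what they do (Pauses, for instance, inserts items into the existing Segment relation).

```python
# Bookkeeping model of the Tokens pipeline; NOT Festival code.
# Module order follows the Tokens type definition; relation names follow the list above.

TOKENS_MODULES = [
    "Token_POS", "Token", "POS", "Phrasify", "Word", "Pauses",
    "Intonation", "PostLex", "Duration", "Int_Targets", "Wave_Synth",
]

# Relations each module is described as building (only those named in the text).
RELATIONS_BUILT = {
    "Token": ["Word"],
    "Phrasify": ["Phrase"],
    "Word": ["Syllable", "Segment", "SylStructure"],
    "Intonation": ["IntEvent", "Intonation"],
    "Int_Targets": ["Target"],
}


def relations_after_each_module(modules):
    """Return (module, relations present so far) pairs, in application order."""
    present = []
    history = []
    for name in modules:
        for relation in RELATIONS_BUILT.get(name, []):
            if relation not in present:
                present.append(relation)
        history.append((name, list(present)))
    return history


for name, relations in relations_after_each_module(TOKENS_MODULES):
    summary = ", ".join(relations) if relations else "(none yet)"
    print(f"{name:12s} {summary}")
```

Reading the printed table from top to bottom shows the set of relations growing monotonically, which matches the description of synthesis as filling in the utterance until the waveform can be generated.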