Building Synthetic Voices
The basic building block for Festival is the utterance. Its
structure consists of a set of relations over a set of
items. Each item represents an object such as a word, segment
or syllable, while relations relate these items to one another. An item
may appear in multiple relations; for example, a segment will be in the
Segment relation and also in the SylStructure relation.
Relations define an ordered structure over the items within them. In
general these may be arbitrary graphs, but in practice we have so far
only used lists and trees. Items may contain a number of
features.
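The item and relation model described above can be sketched in a few lines of Python. This is only an illustrative model, not Festival's implementation or API: the class names `Item`, `Relation` and `Utterance`, the `relations_of` helper and the feature values are all assumptions made for this example. It shows one segment item being shared between a list relation and a tree relation.

```python
from dataclasses import dataclass, field


@dataclass(eq=False)
class Item:
    """An object in the utterance, carrying a set of features."""
    features: dict = field(default_factory=dict)


@dataclass
class Relation:
    """An ordered structure over items: a plain list, or trees via parent links."""
    name: str
    items: list = field(default_factory=list)   # items in order
    parent: dict = field(default_factory=dict)  # id(child) -> parent item

    def append(self, item, parent=None):
        self.items.append(item)
        if parent is not None:
            self.parent[id(item)] = parent
        return item

    def __contains__(self, item):
        # Membership is by identity: the same item object is shared.
        return any(member is item for member in self.items)


@dataclass
class Utterance:
    relations: dict = field(default_factory=dict)

    def relation(self, name):
        # Create the named relation on first use (hypothetical helper).
        if name not in self.relations:
            self.relations[name] = Relation(name)
        return self.relations[name]

    def relations_of(self, item):
        return [name for name, rel in self.relations.items() if item in rel]


utt = Utterance()
word = Item({"name": "hi"})   # illustrative feature values
syl = Item({"stress": 1})
seg = Item({"name": "h"})

utt.relation("Word").append(word)
utt.relation("Syllable").append(syl)
utt.relation("Segment").append(seg)

tree = utt.relation("SylStructure")
tree.append(word)
tree.append(syl, parent=word)
tree.append(seg, parent=syl)

print(utt.relations_of(seg))  # ['Segment', 'SylStructure']
```

The point of the sketch is that relations hold references to shared items rather than copies, so a feature set on the segment is visible whichever relation it is reached through.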
There are no built-in relations in Festival; their names and use are controlled by the particular modules used to do synthesis. Language-, voice- and module-specific relations can easily be created and manipulated. However, within our basic voices we have followed a number of conventions that you should also follow if you wish to use some of the existing modules.
The relation names used will depend on the particular structure chosen for your voice. So far most of our released voices have the same basic structure, though some of our research voices contain quite a different set of relations. For our basic English voices the relations used are as follows:
Text
Contains a single item with a feature holding the input character string that is being synthesized.
Token
A list of trees where the root of each tree is a whitespace-separated token from the input character string. Punctuation and whitespace have been stripped and placed in features on these token items. The daughters of each root are the words that the token is associated with. In many cases this is a one-to-one relationship, but in general it is one to zero or more. For example, tokens consisting of digits will typically be associated with a number of words.
Word
The words in the utterance. By word we typically mean something
that can be given a pronunciation from a lexicon (or letter-to-sound
rules). However, in most of our voices we distinguish pronunciation by
the word and a part of speech feature. Words will also be leaves of the
Token relation, leaves of the Phrase relation and roots of
the SylStructure relation.
Phrase
A simple list of trees representing the prosodic phrasing of the
utterance. In our voices we only have one level of prosodic phrase
below the utterance (though you can easily add a deeper hierarchy
if your models require it). The tree roots are labeled with
the phrase type and the leaves of these trees are in the
Word relation.
Syllable
A simple list of syllable items. These syllable items are intermediate
nodes in the SylStructure relation, allowing access to the words
these syllables are in and the segments that are in these syllables.
In this format no further onset/coda distinction is made explicit, but it
can be derived from this information.
Segment
A simple list of segment (phone) items. These form the leaves of
the SylStructure relation, through which we can find where each
segment is placed within its syllable and word. By convention
silence phones do not appear in any syllable (or word) but will
exist in the Segment relation.
SylStructure
A list of tree structures over the items in the Word, Syllable
and Segment relations.
IntEvent
A simple list of intonation events (accents and boundaries).
These are related to syllables through the Intonation relation.
Intonation
A list of trees whose roots are items in the Syllable relation
and whose daughters are in the IntEvent relation. It is assumed that a
syllable may have a number of intonation events associated with it (at
least accents and boundaries), but an intonation event may only be
associated with one syllable.
Wave
A relation consisting of a single item that has a feature with the synthesized waveform.
Target
A list of trees whose roots are segments and daughters are F0 target points. This is only used by some intonation modules.
Unit, SourceSegments, Frames, SourceCoef, TargetCoef
A number of relations used by the UniSyn module.
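As a rough illustration of how several of the relations listed above fit together, the following Python sketch builds the Word, Syllable, Segment, SylStructure, IntEvent and Intonation relations for a single word followed by a silence. It is a model of the conventions only, not Festival code: the `Tree` class, the `item` and `word_of` helpers, the phone labels and the syllable split are all hypothetical choices made for this example.

```python
class Tree:
    """A relation holding a list of trees, stored as parent/daughter links."""

    def __init__(self, name):
        self.name = name
        self.roots = []
        self._parent = {}     # id(child) -> parent item
        self._daughters = {}  # id(parent) -> ordered list of children

    def add_root(self, node):
        self.roots.append(node)
        return node

    def add_daughter(self, parent, child):
        # Each item has at most one parent within a single relation.
        if id(child) in self._parent:
            raise ValueError(f"item already has a parent in {self.name}")
        self._parent[id(child)] = parent
        self._daughters.setdefault(id(parent), []).append(child)
        return child

    def parent(self, node):
        return self._parent.get(id(node))

    def daughters(self, node):
        return list(self._daughters.get(id(node), []))


def item(name, **features):
    """Hypothetical helper: an item is modelled as a dict of features."""
    return {"name": name, **features}


# Illustrative content: one word, two syllables, four phones, one silence.
word = item("hello")
syl1, syl2 = item("syl"), item("syl")
segs = [item(p) for p in ["hh", "ax", "l", "ow"]]
pause = item("pau")

word_rel = [word]              # Word: simple list
syllable_rel = [syl1, syl2]    # Syllable: simple list
segment_rel = segs + [pause]   # Segment: simple list, silence included

# SylStructure: words are roots, syllables in between, segments are leaves.
syl_structure = Tree("SylStructure")
syl_structure.add_root(word)
for syl, members in ((syl1, segs[:2]), (syl2, segs[2:])):
    syl_structure.add_daughter(word, syl)
    for seg in members:
        syl_structure.add_daughter(syl, seg)
# By convention the silence is left out of SylStructure entirely.

# Intonation: a syllable is the root, intonation events are its daughters.
accent, boundary = item("accent"), item("boundary")
int_event_rel = [accent, boundary]
intonation = Tree("Intonation")
intonation.add_root(syl2)
intonation.add_daughter(syl2, accent)
intonation.add_daughter(syl2, boundary)


def word_of(segment):
    """Walk up SylStructure from a segment to its word, if it has one."""
    syl = syl_structure.parent(segment)
    return syl_structure.parent(syl) if syl is not None else None


print(word_of(segs[0])["name"])  # hello
print(word_of(pause))            # None
```

Because each item has at most one parent within a relation, attaching `accent` to a second syllable in this model raises an error, which mirrors the assumption that an intonation event may only be associated with one syllable.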