The basic building block for Festival is the utterance. The
structure consists of a set of relations over a set of items. Each
item represents an object such as a word, segment or syllable, while
relations relate these items together. An item may appear in multiple
relations: a segment, for example, will be in the Segment relation and
also in the SylStructure relation. Relations define an ordered
structure over the items within them; in general these may be
arbitrary graphs, but in practice so far we have only used lists and
trees. Items may contain a number of features.
There are no built-in relations in Festival; the names and use of relations are controlled by the particular modules used to do synthesis. Language, voice and module specific relations can easily be created and manipulated. However, within our basic voices we have followed a number of conventions that should be followed if you wish to use some of the existing modules.
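As a small sketch of creating such a relation from Scheme (the function names are from the Festival manual; the relation name MyRelation is invented for illustration):

```scheme
;; build a bare utterance from text
(set! u (Utterance Text "hello"))
;; add a new, empty relation to it
(utt.relation.create u 'MyRelation)
;; the new name should now appear among the utterance's relations
(utt.relationnames u)
```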
The relation names used will depend on the particular structure chosen for your voice. So far most of our released voices have the same basic structure, though some of our research voices contain quite a different set of relations. For our basic English voices the relations used are as follows.
Text
    Contains a single item which holds the input text as a string.
Token
    A list of trees. This is first formed as a list of the tokens
    found in the tokenization of the text; after text analysis each
    token's daughters are the words derived from that token.
Word
    A list of words. These items also appear as daughters of the
    Token relation, leaves of the Phrase relation and roots of
    the SylStructure relation.
Phrase
    A list of trees, one for each phrase, whose daughters are in the
    Word relation.
Syllable
    A list of syllables. Each syllable is also in the SylStructure
    relation allowing access to the words these syllables are in and
    the segments that are in these syllables. In this format no
    further onset/coda distinction is made explicit but it can be
    derived from this information.
Segment
    A list of segments (phones). Each segment is also in the
    SylStructure relation through which we can find where each
    segment is placed within its syllable and word. By convention
    silence phones do not appear in any syllable (or word) but do
    exist in the Segment relation.
SylStructure
    A list of trees relating the Word, Syllable and Segment items.
IntEvent
    A list of intonation events (accents and boundaries). These are
    related to syllables through the Intonation relation.
Intonation
    A list of trees relating syllables to intonation events. Roots of
    the trees are in the Syllable relation, and daughters are in the
    IntEvent relation. It is assumed that a syllable may have a number
    of intonation events associated with it (at least accents and
    boundaries), but an intonation event may only be associated with
    one syllable.
Wave
    Contains a single item which holds the generated waveform.
Target
    A list of trees whose roots are items in the Segment relation and
    whose daughters are F0 target points.
Unit, SourceSegments, Frames, SourceCoef, TargetCoef
    A number of relations used internally by the UniSyn waveform
    synthesis module.
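To make the linkage between these relations concrete, here is a small sketch, assuming some already-synthesized utterance utt1: in the SylStructure relation the daughters of a word are its syllables, whose daughters in turn are segments. All functions used are from the Festival manual.

```scheme
;; first word of the utterance
(set! w (utt.relation.first utt1 'Word))
;; its syllables, via the SylStructure tree
(set! syls (item.daughters (item.relation w 'SylStructure)))
;; and the segment names of its first syllable
(mapcar item.name (item.daughters (car syls)))
```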
The basic synthesis process in Festival is viewed as applying a set of modules to an utterance. Each module will access various relations and items and potentially generate new features, items and relations. Thus as the modules are applied the utterance structure is filled in with more and more relations until ultimately the waveform is generated.
Modules may be written in C++ or Scheme. Which modules are executed
is defined in terms of the utterance type, a simple feature on the
utterance itself. For most text-to-speech cases this is defined to be
of type Tokens. The function utt.synth simply looks up an utterance's
type, then looks up the definition of the synthesis process defined
for that type and applies the named modules. Synthesis types may be
defined using the function defUttType. For example, the definition
for utterances of type Tokens is
(defUttType Tokens
  (Token_POS utt)
  (Token utt)
  (POS utt)
  (Phrasify utt)
  (Word utt)
  (Pauses utt)
  (Intonation utt)
  (PostLex utt)
  (Duration utt)
  (Int_Targets utt)
  (Wave_Synth utt))
A simpler case is when the input is phone names and we don't wish to
do all that text analysis and prosody prediction. Then we use the
type Phones, which simply loads the phones, applies fixed prosody and
then synthesizes the waveform
(defUttType Phones
  (Initialize utt)
  (Fixed_Prosody utt)
  (Wave_Synth utt))
In general the modules named in the type definitions are generic and
allow further selection of more specific modules within them. For
example the Duration module respects the global parameter
Duration_Method and will call the desired duration module depending
on its value.
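For instance, a tree-based duration module might be selected by setting that parameter before synthesis. Parameter.set is the standard parameter interface; the value Tree is an assumption here, and the set of valid values depends on which duration modules your installation provides.

```scheme
;; choose a duration method; valid values depend on the installed modules
(Parameter.set 'Duration_Method 'Tree)
```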
When building a new voice you will probably not need to change any of these definitions, though you may wish to add a new module and we will show how to do that without requiring any change to the synthesis definitions in a later chapter.
There are many modules in the system, some of which are simply wrappers that choose between other modules. However, the basic modules used for text-to-speech have the following basic functions.
Token_POS
    Basic tagging of tokens, used mainly for homograph disambiguation.
Token
    Applies the token-to-word rules, building the Word relation.
POS
    A standard part-of-speech tagger (if desired).
Phrasify
    Builds the Phrase relation using the specified method. Various
    methods are offered, from statistically trained models to simple
    CART trees.
Word
    Lexical lookup, building the Syllable and Segment relations and
    the SylStructure relation that relates these together.
Pauses
    Prediction of pauses, inserting silences into the Segment
    relation, again through a choice of different prediction
    mechanisms.
Intonation
    Prediction of accents and boundaries, building the IntEvent
    relation and the Intonation relation that links IntEvents to
    syllables. This can easily be parameterized for most practical
    intonation theories.
PostLex
    Post-lexical rules that may modify segments in context, for
    phenomena such as vowel reduction and contractions.
Duration
    Prediction of durations for the segments.
Int_Targets
    The second part of intonation prediction: creates the Target
    relation representing the desired F0 contour.
Wave_Synth
    A general module that calls the appropriate waveform synthesis
    method to render the utterance as an actual waveform.
A set of simple access methods exists for utterances, relations, items and features, both in Scheme and C++. These access methods are kept as similar as possible in the two languages.
As the users of this document will primarily be accessing utterances via Scheme, we will describe the basic Scheme functions available for access and give some examples of idioms that achieve various standard functions.
In general the required arguments to a Lisp function are reflected in
the first parts of the function's name. Thus item.relation.next
requires an item and a relation name, and will return the next item
in that named relation from the given one.
A listing and short description of the major utterance access and manipulation functions is given in the Festival manual.
An important notion to be aware of is that an item is always viewed
through some particular relation. For example, assume a typical
utterance called utt1.
(set! seg1 (utt.relation.first utt1 'Segment))
seg1 is an item viewed from the Segment relation. Calling item.next
on this will return the next item in the Segment relation. A Segment
item may also be in the SylStructure relation. If we traverse it
using next in that relation we will hit the end when we come to the
end of the segments in that syllable.
You may view a given item from another relation by requesting a view
from that relation. In Scheme nil will be returned if the item is not
in the relation. The function item.relation takes an item and a
relation name and returns the item as viewed from that relation.
Here is a short example to help illustrate the basic structure.
(set! utt1 (utt.synth (Utterance Text "A short example.")))
The first segment in utt1 will be silence.
(set! seg1 (utt.relation.first utt1 'Segment))
This item will be a silence, as can be shown by
(item.name seg1)
If we find the next item we will get the schwa representing the indefinite article.
(set! seg2 (item.next seg1))
(item.name seg2)
Let us move on to the "sh" to illustrate the difference between
traversing the Segment relation as opposed to the SylStructure
relation.
(set! seg3 (item.next seg2))
Let us define a function which will take an item, print its name, call next on it in the same relation, and continue until it reaches the end.
(define (toend item)
  (if item
      (begin
        (print (item.name item))
        (toend (item.next item)))))
If we call this function on seg3
which is in the Segment
relation we will get a list of all segments until the end of the utterance
festival> (toend seg3)
"sh"
"oo"
"t"
"i"
"g"
"z"
"aa"
"m"
"p"
"@"
"l"
"#"
nil
festival>
However, if we first change the view of seg3 to the SylStructure
relation we will be traversing the leaf nodes of the syllable
structure tree, which will terminate at the end of that syllable.
festival> (toend (item.relation seg3 'SylStructure))
"sh"
"oo"
"t"
nil
festival>
Note that item.next returns the item immediately next to the given
one in that relation. Thus it returns nil when the end of a sub-tree
is found. item.next is most often used for traversing simple lists,
though it is defined for any of the structures supported by relations.
The function item.next_item allows traversal of any relation,
returning a next item until it has visited them all. In the simple
list case this is equivalent to item.next, but in the tree case it
will traverse the tree in pre-order: it will visit roots before their
daughters, and daughters before their next siblings.
Scheme is particularly adept at using functions as first class
objects. A typical traversal idiom is to apply some function to each
item in a relation. For example, suppose we have a function
PredictDuration which takes a single item and assigns a duration. We
can apply this to each item in the Segment relation
(mapcar PredictDuration (utt.relation.items utt1 'Segment))
The function utt.relation.items
returns all items in the
relation as a simple lisp list.
Another method to traverse the items in a relation is to use the
while looping paradigm, which many people are more familiar with.
(let ((f (utt.relation.first utt1 'Segment)))
  (while f
    (PredictDuration f)
    (set! f (item.next_item f))))
If you wish to traverse only the leaves of a tree you may call
utt.relation.leafs instead of utt.relation.items. A leaf is defined
to be an item with no daughters. For the while case there is no
standardly defined item.next_leaf, but one can easily be defined as
(define (item.next_leaf i)
  (let ((n (item.next_item i)))
    (cond
     ((null n) nil)                          ;; no more items at all
     ((item.daughters n) (item.next_leaf n)) ;; skip non-leaf nodes
     (t n))))
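As a usage sketch (assuming the utt1 built earlier in this chapter), utt.relation.leafs over the SylStructure relation visits only the segments that belong to some syllable, skipping the word and syllable nodes:

```scheme
(mapcar
 (lambda (leaf) (print (item.name leaf)))
 (utt.relation.leafs utt1 'SylStructure))
```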
Rather than explicitly calling a set of functions to find your way round an utterance we also allow access through a linear flat pathname mechanism. This mechanism is read-only but can succinctly access not just features on a given item but features on related items too.
For example, rather than calling an explicit next function to find the name of the following item, thus
(item.name (item.next i))
You can access it via the pathname
(item.feat i "n.name")
Festival will interpret the feature name as a pathname. In addition
to traversing the current relation you can switch between relations
via the element R:<relationname>. Thus to find the stress value of a
segment item seg we need to switch to the SylStructure relation, find
its parent and check the stress feature value.
(item.feat seg "R:SylStructure.parent.stress")
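A few more pathnames in the same style; the feature names used here (ph_vc, word_numsyls) are standard in our English voices, though you should treat the exact names as voice-dependent:

```scheme
;; name of the segment after the next one
(item.feat seg "n.n.name")
;; vowel/consonant flag of the previous segment
(item.feat seg "p.ph_vc")
;; number of syllables in the word containing this segment
(item.feat seg "R:SylStructure.parent.parent.word_numsyls")
```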
Feature pathnames make the definition of various prediction models much easier. CART trees, for example, simply specify a pathname as a feature, and dumping features for training is also a simple task. Full function access is still useful when manipulation of the data is required, but as most access is simply to find values, pathnames are the most efficient way to access information in an utterance.
For example, suppose you wish to traverse each segment in an utterance and replace all vowels in unstressed syllables with a schwa (a rather over-aggressive reduction strategy, but it serves for this illustrative example).
(define (reduce_vowels utt)
  ;; Replace each vowel in an unstressed syllable with schwa.
  (mapcar
   (lambda (segment)
     (if (and (string-equal "+" (item.feat segment "ph_vc"))
              (not (string-equal "1"
                     (item.feat segment "R:SylStructure.parent.stress"))))
         (item.set_name segment "@")))
   (utt.relation.items utt 'Segment)))
As well as using utterance structures in the actual runtime process of converting text to speech, we also use them in database representation. Basically we wish to build utterance structures for each utterance in a speech database. Once they are in that structure, as if they had been (correctly) synthesized, we can use these structures for training various models. For example, given the actual durations for the segments in a speech database and utterance structures for them, we can dump the actual durations together with the features (phonetic context, prosodic context, etc.) which we feel influence the durations, and train models on that data.
Obviously real speech isn't as clean as synthesized speech, so it's not always easy to build (reasonably) accurate utterances for the real utterances. However, here we will itemize a number of functions that will make the building of utterances from real speech easier. Building utterance structures is probably worth the effort considering how easy it is to build various models from them. Thus we recommend this even though at first the work may not immediately seem worthwhile.
In order to build an utterance of the type used for our English voices (and which is suitable for most of the other languages we have done), you will need label files for the following relations. Below we will discuss how to get these labels, automatically, by hand or derived from other label files in this list and the relative merits of such derivations.
The basic label types required are
Segment
Syllable
Word
IntEvent
Phrase
Target
Segment labels are probably the hardest to generate. Knowing what phones are there can only really be done by actually listening to the examples and labelling them. Any automatic method will have to make low-level phonetic classifications, which machines are not particularly good at (nor are humans, for that matter). Some discussion of autoaligning phones is given in the diphone chapter, where an aligner distributed with this document is described. This may help, but since so much depends on segmental accuracy, ultimately at least hand correction is required. We have used that aligner on a speech database where we already knew from another (less accurate) aligner what the phone sequences probably were. Our aligner improved the quality of the existing labels and of the synthesizer (phonebox) that used it, but there were external conditions that made this a reasonable thing to do.
Word labelling can most easily be done by hand; it is much easier to do than segment labelling. In the continuing process of trying to build automatic labellers for databases we currently reckon that word labelling could be the last to be done automatically, basically because given word labels, segment, syllable and intonation labelling become much more constrained tasks. However it is important that word labels properly align with segment labels, even when spectrally there may not be any real boundary between words in continuous speech.
Syllable labelling can probably best be done automatically given segment (and word) labelling. The actual algorithm for syllabification may change, but whatever is chosen (or defined from a lexicon) it is important that the chosen syllabification is used consistently throughout the rest of the system (e.g. in duration modelling). Note that automatic techniques for aligning lexical specifications of syllabification are by nature inexact. There are multiple acceptable ways to say words, and it is important to ensure that the labelling reflects what is actually there. That is, simply looking up a word in a lexicon and aligning those phones to the signal is not necessarily correct. Ultimately this is what we would like to do, but so far we have found our unit selection algorithms nowhere near robust enough to do this.
The Target labelling required here is a single average F0 value for each segment. This is currently done fully automatically from the signal. This is naive, and a better representation of F0 might be more appropriate; it is used only in some of the model building described below. Ultimately it would be good if the F0 need not be explicitly used at all, only the factors that determine the F0 value, but this is still a research topic.
Phrases could potentially be determined by a combination of F0, power and silence detection, but the relationship is not obvious. In general we hand label phrases as part of the intonation labelling process. Realistically only two levels of phrasing can reliably be labelled, even though there are probably more. These are, roughly, sentence-internal and sentence-final, what ToBI would label with break indices (2 or 3) and 4. More exact labellings would be useful.
For intonation events we have more recently been using Tilt accent
labelling. This is simpler than ToBI and, we feel, more reliable.
The hand labelling part marks a (for accent) and b (for boundary).
We have also split boundaries into rb (rising boundary) and fb
(falling boundary). We have been experimenting with autolabelling
these and have had some success, but that is still a research issue.
Because there is a well-defined and fully automatic method of going
from a/b labelled waveforms to a parameterization of the F0 contour,
we have found Tilt the most useful intonation labelling. Tilt is
described in taylor00a.
ToBI accent/tone labelling silverman92 is useful too, but time consuming to produce. If it already exists for the database then it is usually worth using.
In the standard Festival distribution there is a festival script `festival/examples/make_utts' which will build utterance structures from the labels for the six basic relations.
This function can most easily be used given the following directory/file structure in the database directory: `festival/relations/' should contain a directory for each set of labels, named for the utterance relation it is to be part of (e.g. `Segment/', `Word/', etc.).
The constructed utterances will be saved in `festival/utts/'.
An example of the label files is given with this document in `src/db_example/festival/relations/' and the built utterances in `src/db_example/festival/utts/'.
Many of the training techniques that are described in the
following chapters extract basic features (via pathnames) from
a set of utterances. This can most easily be done by the
`festival/examples/dumpfeats' Festival script. It takes
a list of feature/pathnames, as a list or from a file and saves
the values for a given set of items in a single feature file (or
one for each utterance). Call `festival/examples/dumpfeats'
with the argument -h
for more details.
For example suppose for all utterances we want the segment duration, its name, the name of the segment preceding it and the segment following it.
dumpfeats -feats '(segment_duration name p.name n.name)' \
    -relation Segment -output dur.feats festival/utts/*.utt
You may wish to save the features in separate files, one for each utterance: if the output filename contains a `%s' it will be filled in with the utterance's fileid. Thus to dump all the features named in the file `duration.featnames' we would call
dumpfeats -feats duration.featnames -relation Segment \
    -output feats/%s.dur festival/utts/*.utt
The file `duration.featnames' should contain the features/pathnames, one per line (without the opening and closing parentheses).
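For instance, to match the dumpfeats call shown earlier, `duration.featnames' could simply contain:

```
segment_duration
name
p.name
n.name
```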
Other features and other specific code (e.g. selecting a voice that
uses an appropriate phone set) can be included in this process by
naming a Scheme file with the -eval option.
The dumped feature files consist of one line for each item in the named relation, containing the requested feature values, whitespace separated. For example
0.399028 pau 0 sh
0.08243 sh pau iy
0.07458 iy sh hh
0.048084 hh iy ae
0.062803 ae hh d
0.020608 d ae y
0.082979 y d ax
0.08208 ax y r
0.036936 r ax d
0.036935 d r aa
0.081057 aa d r
...