Utterance building

As well as using Utterance structures in the actual runtime process of converting text to speech, we also use them to represent the speech database itself. Basically we wish to build an utterance structure for each utterance in a speech database. Once the utterances are in that structure, as if they had been (correctly) synthesized, we can use them for training various models. For example, given the actual durations of the segments in a speech database, together with utterance structures for those utterances, we can dump the durations along with the features (phonetic and prosodic context, etc.) which we feel influence duration, and train models on that data.
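As a minimal sketch of such a dump, assuming utterances have already been built and saved as described later in this section (the utterance filename and the particular choice of features are only illustrative):

    ;; Dump each segment's name, duration and a couple of contextual
    ;; features from a saved utterance structure (a sketch; the
    ;; features shown are just examples).
    (define (dump_durations uttfile)
      (let ((utt (utt.load nil uttfile)))
        (mapcar
         (lambda (seg)
           (format t "%s %f %s %s\n"
                   (item.feat seg "name")              ; the phone itself
                   (item.feat seg "segment_duration")  ; duration in seconds
                   (item.feat seg "p.name")            ; previous phone
                   (item.feat seg "R:SylStructure.parent.stress"))) ; syllable stress
         (utt.relation.items utt 'Segment))))

    (dump_durations "festival/utts/utt001.utt")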

Obviously real speech isn't as clean as synthesized speech, so it's not always easy to build (reasonably) accurate utterance structures for real utterances. However, here we will itemize a number of functions that make the building of utterance structures from real speech easier. Building utterance structures is probably worth the effort, considering how easy it is to build various models from them. Thus we recommend it, even though at first the work may not seem immediately worthwhile.

In order to build an utterance of the type used for our English voices (and which is suitable for most of the other languages we have worked on), you will need label files for the following relations. Below we discuss how to obtain these labels: automatically, by hand, or derived from other label files in this list, and the relative merits of each approach.

The basic label types required are:

Segment

Segment labels with (near) correct boundaries, in the phone set of your language. (An example of the label file format is given after this list.)

Syllable

Syllables, with stress marking (if appropriate), whose boundaries are closely aligned with the segment boundaries.

Word

Words with boundaries closely aligned to the syllables and segments. By words we mean the things which can be looked up in a lexicon; thus "1986" would not be considered a word and should be rendered as the three words "nineteen eighty six".

IntEvent

Intonation labels aligned to a syllable (either falling within the syllable's boundaries or explicitly naming the syllable they should align to). If using ToBI (or some derivative) these would be standard ToBI labels, while in something like Tilt these would be "a" and "b" marking accents and boundaries.

Phrase

A name and marking for the end of each prosodic phrase.

Target

The mean F0 value in Hertz at the mid-point of each segment in the utterance.
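These label files are in the simple xlabel format that Festival reads: an optional header terminated by a line containing only "#", followed by one label per line giving an end time in seconds, a (conventionally ignored) color field, and the label itself. A hypothetical fragment of a Segment file for the word "hello" (the phone names depend on your phone set) might look like:

    #
    0.120 26 pau
    0.225 26 h
    0.310 26 @
    0.395 26 l
    0.510 26 ou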

Segment labels are probably the hardest to generate. Knowing what phones are there can only really be determined by actually listening to the examples and labeling them. Any automatic method has to make low-level phonetic classifications, which machines are not particularly good at (nor are humans, for that matter). Some discussion of autoaligning phones is given in the diphone chapter, where an aligner distributed with this document is described. This may help, but because so much depends on segmental accuracy, getting it right ultimately requires at least hand correction. We have used that aligner on a speech database where we already knew, from another (less accurate) aligner, what the phone sequences probably were. Our aligner improved the quality of the existing labels and of the synthesizer (phonebox) that used them, but there were external conditions that made this a reasonable thing to do.

Word labeling can most easily be done by hand; it is much easier to do than segment labeling. In the continuing process of trying to build automatic labelers for databases, we currently reckon that word labeling may be the last to be done automatically, basically because given word labels, segment, syllable and intonation labeling become much more constrained tasks. However, it is important that word labels properly align with segment labels, even when spectrally there may not be any real boundary between words in continuous speech.

Syllable labeling can probably best be done automatically, given segment (and word) labeling. The actual algorithm for syllabification may change, but whatever is chosen (or defined from a lexicon), it is important that that syllabification is used consistently throughout the rest of the system (e.g. in duration modeling). Note that automatic techniques for aligning lexical specifications of syllabification are inherently inexact: there are multiple acceptable ways to say words, and it is important to ensure that the labeling reflects what is actually there. That is, simply looking up a word in a lexicon and aligning those phones to the signal is not necessarily correct. Ultimately this is what we would like to do, but so far we have found our unit selection algorithms are nowhere near robust enough to do this.
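As an illustration of what a lexical syllabification looks like (the word and pronunciation here are from a hypothetical lexicon entry; each syllable is a list of phones plus a stress value):

    festival> (lex.lookup "hello" nil)
    ("hello" nil (((h @) 0) ((l ou) 1)))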

The Target labeling required here is a single average F0 value for each segment. This is currently done fully automatically from the signal. This is naive, and a better representation of F0 might be more appropriate; it is used only in some of the model building described below. Ultimately it would be good if F0 need not be used explicitly at all, only the factors that determine the F0 value, but this is still a research topic.
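A hypothetical Target label file (the times and values are invented), where each time is a segment mid-point and the label is the mean F0 in Hertz:

    #
    0.172 26 112
    0.267 26 118
    0.352 26 105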

Phrases could potentially be determined by a combination of F0, power, and silence detection, but the relationship is not obvious. In general we hand label phrases as part of the intonation labeling process. Realistically only two levels of phrasing can be labeled reliably, even though there are probably more: roughly, sentence-internal and sentence-final, what ToBI would label as break levels (2 or 3) and 4. More exact labelings would be useful.
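A hypothetical Phrase label file marking one sentence-internal and one sentence-final break (the names "B" and "BB" here follow common Festival usage, but any consistent naming will do):

    #
    1.852 26 B
    3.410 26 BB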

For intonation events we have more recently been using Tilt accent labeling. This is simpler than ToBI and, we feel, more reliable. The hand-labeling part marks "a" for accents and "b" for boundaries; we have also split boundaries into "rb" (rising boundary) and "fb" (falling boundary). We have been experimenting with autolabeling these and have had some success, but that is still a research issue. Because there is a well-defined and fully automatic method of going from a/b-labeled waveforms to a parameterization of the F0 contour, we have found Tilt the most useful intonation labeling. Tilt is described in [taylor00a].
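A hypothetical IntEvent file in this scheme, with two accents and a rising boundary at the phrase end:

    #
    0.512 26 a
    1.204 26 a
    1.852 26 rb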

ToBI accent/tone labeling [silverman92] is useful too, but time-consuming to produce. If it already exists for the database then it is usually worth using.

In the standard Festival distribution there is a Festival script, festival/examples/make_utts, which builds utterance structures from the label files for the six basic relations.

This script can most easily be used given the following directory/file structure in the database directory: festival/relations/ should contain a directory for each set of label files, named for the utterance relation it is to become part of (e.g. Segment/, Word/, etc.), as illustrated below.

The constructed utterances will be saved in festival/utts/.
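For example (the utterance file names are illustrative):

    festival/
       relations/
          Segment/    utt001.Segment  utt002.Segment ...
          Syllable/   utt001.Syllable ...
          Word/       utt001.Word ...
          IntEvent/   utt001.IntEvent ...
          Phrase/     utt001.Phrase ...
          Target/     utt001.Target ...
       utts/          utt001.utt ...  (written by make_utts)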