
10 Future directions

This chapter discusses areas of synthesis that are now ripe for detailed research work. Diphone technology, and the further jump in quality from general unit selection, appear to offer significantly higher quality synthesis than anything we have had before. Improvements in speech recognition and speech coding, and of course larger, faster machines, make experiments, databases and training algorithms that were once beyond our reach practical on reasonably sized systems.

10.1 Synthesis of any voice

One goal of Festival over the next few years is to provide a system that can automatically build high quality new voices for anyone who wants them.

To add a new synthesis voice, the process should be:

Present a number of sentences/utterances to a person and record that person saying them.
Automatically label the recorded data with phones, syllables, words, and intonation.
Extract parameters for intonation, phrasing, duration, dialectal phone mappings etc. from the labelled data.
Build a voice database from the extracted parameters and phone labelled data.
This should be possible for people with no linguistic or phonetic training, and in reasonable time on standard workstations.

Festival is still some way from this goal, but the beginnings are already there. Note that we are not asking for a system that does acquisition of arbitrary languages. This system would be constrained. Although Festival would offer ways to build support for new languages, that process would still require greater skill. Adding a new voice in a supported language, but possibly in a different dialect, is our more modest goal.

10.2 Size of database

So what is the best size and type of database to use for such a model? There are two possible views:

Record a database large enough to cover the phonetic and prosodic variation in the language, so that most of the units required are contained within the database and there are sufficient examples of prosodic variation for the models to learn from.
Record only a small amount of speech and use the information in that to modify existing parametric systems. The work on voice conversion would be relevant here. Also, as recording very large databases will be prohibitive, extracting key examples to tune existing models that were trained on larger databases is probably more efficient.

Our current estimate is around 45 minutes of speech, which should contain the sort of speech that you wish to have synthesized. Short paragraphs containing rich phonetic variation are reasonable, but if the voice is to be used for reading numbers or answering the telephone, the database should include numbers, dates and welcome messages (the company name) in various prosodic contexts.

There is unlikely to be one specific text set that is good for all; different sets will be good for different applications. Also, many people will be unwilling to talk to their computer for 45 minutes before they can get their new voice, so several prompt sets of different sizes should be identified, trading off recording and build time against quality.

There are many experiments and design choices required here to come up with detailed guides and database texts that are appropriate when building new voices.
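One way to experiment with prompt sets of different sizes is a greedy coverage search. The following is only a sketch under stated assumptions, not a method Festival provides: it assumes each candidate prompt already has a phone string (for example from a lexicon), it measures coverage purely by distinct diphones, and the names `diphones` and `select_prompts` are hypothetical. Prosodic coverage, which the text above also calls for, is not modelled here.

```python
def diphones(phones):
    """Return the set of adjacent phone pairs in one utterance."""
    return set(zip(phones, phones[1:]))


def select_prompts(candidates, max_prompts):
    """Greedily pick prompts that add the most unseen diphones.

    candidates:  dict mapping prompt text -> list of phone symbols
                 (hypothetical input; the phone strings are assumed given).
    max_prompts: upper bound on the number of prompts to record.
    Returns (chosen prompts in selection order, set of covered diphones).
    """
    covered = set()
    chosen = []
    remaining = dict(candidates)
    while remaining and len(chosen) < max_prompts:
        # Pick the prompt contributing the most diphones not yet covered.
        best = max(remaining,
                   key=lambda p: len(diphones(remaining[p]) - covered))
        gain = diphones(remaining[best]) - covered
        if not gain:
            break  # no remaining prompt adds anything new
        chosen.append(best)
        covered |= gain
        del remaining[best]
    return chosen, covered


if __name__ == "__main__":
    prompts = {
        "hello": ["h", "e", "l", "ou"],
        "he": ["h", "e"],
        "lows": ["l", "ou", "z"],
    }
    chosen, covered = select_prompts(prompts, max_prompts=2)
    print(chosen, len(covered))
```

Varying `max_prompts` gives a family of prompt sets whose coverage can be compared against the recording time each would need.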

10.3 Autolabelling

Given a database of speech, even with the orthography, we will need some way to label it automatically with lower level labels such as phones. Speech recognition technology has improved greatly, and using packages like HTK it appears easy to build an autolabelling system that does reasonably well. The problem with labelling for synthesis, rather than for recognition, is that the labels need to be more accurate for a synthesis database than for word recognition. But it's not all bad: for recognition you (probably) need to recognize all words, whereas for synthesis you need only label the parts that can be labelled well. Ignoring badly labelled parts is acceptable, as long as there is enough redundancy in the database.

Therefore what we need is:

a robust phone labelling system
some measure of correctness for labels

The phone labels need to overlap with examples of phones in the waveform but phone boundaries probably don't need to be very accurate given that the synthesis method will use some form of optimal coupling.

However, gross errors from mis-alignment need to be identified. I have seen syllables misaligned in autolabelled corpora, and also occasions when the speaker has actually said something different from the orthography. Some automatic measure of accuracy is necessary to judge when to ignore a section, or even ask for re-recording.
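To make the idea of "ignore a section, or ask for re-recording" concrete, here is a minimal sketch. It assumes the aligner attaches some confidence score to each phone label; the `score` field, the thresholds, and the names `PhoneLabel` and `triage` are all hypothetical and chosen for illustration only. What a good measure of label correctness actually is remains the open question.

```python
from dataclasses import dataclass


@dataclass
class PhoneLabel:
    phone: str
    start: float  # seconds
    end: float    # seconds
    score: float  # hypothetical aligner confidence; higher is better


def triage(labels, min_score=-8.0, min_dur=0.015, max_bad_fraction=0.3):
    """Classify one utterance as 'keep', 'partial' or 'rerecord'.

    A label is suspect if its score is below min_score or its duration
    is implausibly short. All thresholds are illustrative assumptions.
    Returns (decision, list of suspect labels).
    """
    if not labels:
        return "rerecord", []
    bad = [l for l in labels
           if l.score < min_score or (l.end - l.start) < min_dur]
    if not bad:
        return "keep", []
    if len(bad) / len(labels) > max_bad_fraction:
        return "rerecord", bad   # too much of the utterance is suspect
    return "partial", bad        # use the utterance but skip suspect phones


if __name__ == "__main__":
    utt = [
        PhoneLabel("h", 0.00, 0.08, -3.0),
        PhoneLabel("e", 0.08, 0.09, -2.5),   # implausibly short
        PhoneLabel("l", 0.09, 0.20, -4.0),
        PhoneLabel("ou", 0.20, 0.40, -3.0),
    ]
    decision, suspects = triage(utt)
    print(decision, [s.phone for s in suspects])
```

Utterances marked partial can still contribute their well-labelled phones, which matches the point that synthesis only needs the parts that were labelled well.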

Another method is to ignore traditional phonetic classes as labels and use some other system for labelling. As the task of unit selection is to find appropriate units, it is reasonable to question the use of traditional phonetic classes. Two possible alternatives are mentioned here, but others could also be reasonable candidates.

bacchiani96 derives acoustically similar segmental units from coded parameters for speech recognition. Some experiments in labelling databases for synthesis have already been done with this method, and they indicate that further work could be fruitful, particularly because the method is accurate at labelling the data and fully automatic. Of course, you still need to match traditional phonetic classes (from the targets) to these acoustically derived labels, but that might be an easier task than mapping phonetic labels to waveform segments directly.

A second alternative to traditional phonetic labelling is described in fujimura93. The C/D model offers a multi-layered (autosegmental) labelling system for different phonetic phenomena. It offers a more reasonable labelling for variable speed speech, and articulatory constraints in forming speech are implicit in the model.

In addition to phone (or equivalent) labels, other forms of labelling are desirable. Syllable and word labelling are probably trivial given orthography and phones. Phrasing and intonation do require further labelling expertise. taylor94b provides a method for autolabelling, but it has not been fully implemented and measured within a text to speech framework.
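As an illustration of why word labelling is close to trivial once phones are labelled, the sketch below derives word boundaries from phone labels. It assumes each word's pronunciation is known from a lexicon and that the phone labels match those pronunciations exactly, with no pauses or post-lexical changes between words; real data would need to relax that. The function name `word_labels` is hypothetical.

```python
def word_labels(words, phone_labels):
    """Derive (word, start, end) labels from phone labels.

    words:        list of (word, pronunciation), where pronunciation is
                  a non-empty list of phone symbols (assumed from a lexicon).
    phone_labels: list of (phone, start, end) tuples in utterance order.
    Raises ValueError if the phone labels do not line up with the words.
    """
    out = []
    i = 0
    for word, pron in words:
        if not pron:
            raise ValueError(f"empty pronunciation for {word!r}")
        segment = phone_labels[i:i + len(pron)]
        if [p for p, _, _ in segment] != list(pron):
            raise ValueError(f"phone labels do not match entry for {word!r}")
        # A word spans from its first phone's start to its last phone's end.
        out.append((word, segment[0][1], segment[-1][2]))
        i += len(pron)
    if i != len(phone_labels):
        raise ValueError("unlabelled phones remain after the last word")
    return out


if __name__ == "__main__":
    words = [("hello", ["h", "e", "l", "ou"]), ("you", ["y", "uu"])]
    phones = [("h", 0.0, 0.1), ("e", 0.1, 0.2), ("l", 0.2, 0.3),
              ("ou", 0.3, 0.4), ("y", 0.4, 0.5), ("uu", 0.5, 0.7)]
    print(word_labels(words, phones))
```

A mismatch raised here is also a cheap signal that the speaker said something different from the orthography.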

It seems reasonable to prompt users when recording a database. When given a prompt, a user will most likely mimic the prosody of the prompt itself, so this may help seed the intonation labelling system. However, as full paragraphs will also be given, which is necessary to get fully fluent speech, users will not remember the full prosodic structure of what they have to say and hence will deviate from the prompt. It may be reasonable to record short sentences with specific variation to seed the intonation labelling system, then use that to train labelling for the rest of the corpus.

10.4 Training from labels

Given our neatly labelled database, we must use this information sensibly to extract parameters for our various synthesis models:

Lexicons need to be dialect sensitive. If Festival supports English it should trivially support major dialects of it without requiring whole new lexicons or whole new intonation methods. Lexicons would need to be constructed in a way that allows such mapping. Also note that few people speak in pure dialect, and it is the combination of "dialects" that makes their speech their own. Their lexicon, intonation parameters etc. should reflect that.
We need a good acoustic measure. No matter what eventual form of unit selection we choose, a good automatic measure of the quality of our synthesis is necessary for any type of training method. This probably requires a significant psycho-linguistic experiment to find how humans perceive different selection candidates, and then to find an automatic measure which follows those perceptions.
We still do not know which acoustic/phonetic/prosodic features are the most important in any form of unit selection, nor what their relative weighting (in different contexts) is. We also do not know how adequately various failures in selection can be corrected by signal processing techniques.
Models of intonation, duration and other prosodic features still require more work before they can be built fully automatically.
Post-lexical rules: how a speaker modifies their pronunciation from a base lexical form can also be learned. Vowel reduction is a prime example.
The influence of dialog structure on intonation (questions, agreement focus, topic shifting etc.) is still an open area. With the use of synthesizers in dialog systems, a more responsive voice is required, making more demands on the intonational properties of the synthesized voice. A CNN newsreader voice is probably not appropriate as a dialog partner when asking for train time-table information.
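The question of feature weighting in unit selection can be made concrete with a small sketch. It scores each candidate unit by a weighted sum of feature mismatches against the target and picks the cheapest. This is an assumption-laden illustration, not Festival's selection method: the feature names, the weight values, and the functions `target_cost` and `best_candidate` are hypothetical, and the right weights are exactly what the items above say is still unknown.

```python
def target_cost(target, candidate, weights):
    """Weighted sum of feature mismatches between target and candidate.

    Numeric features contribute weight * absolute difference;
    all other features contribute weight * (0 if equal else 1).
    """
    cost = 0.0
    for name, weight in weights.items():
        t, c = target[name], candidate[name]
        if isinstance(t, (int, float)) and isinstance(c, (int, float)):
            cost += weight * abs(t - c)
        else:
            cost += weight * (0.0 if t == c else 1.0)
    return cost


def best_candidate(target, candidates, weights):
    """Return the candidate unit with the lowest target cost."""
    return min(candidates, key=lambda c: target_cost(target, c, weights))


if __name__ == "__main__":
    # Illustrative weights only; learning them is the open problem.
    weights = {"stress": 2.0, "prev_phone": 1.0, "duration": 10.0}
    target = {"stress": 1, "prev_phone": "t", "duration": 0.10}
    candidates = [
        {"stress": 1, "prev_phone": "d", "duration": 0.10},
        {"stress": 0, "prev_phone": "t", "duration": 0.12},
    ]
    print(best_candidate(target, candidates, weights))
```

A perceptually grounded acoustic measure, as called for above, would provide the training signal for setting such weights instead of choosing them by hand.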
