This chapter discusses processes in synthesis that are now ripe for detailed research. The improvement offered by diphone technology, and the corresponding jump in quality offered by general unit selection, appear to give significantly higher quality synthesis than anything we have had before. Improvements in speech recognition and speech coding, and of course larger, faster machines, make experiments, databases, and training algorithms that were once beyond our reach practical on reasonably sized systems.
One goal of Festival over the next few years is to provide a system that can automatically build high quality new voices for anyone who wants them.
To add a new synthesis voice, the process should be little more than: record the new speaker reading a designed set of texts, automatically label the recordings, and automatically train the synthesis models from the labelled data.
Festival is still some way from this goal, but the beginnings are already there. Note that we are not asking for a system that can acquire arbitrary languages; such a system would be constrained. Although Festival offers ways to build support for new languages, that process still requires greater skill. Our more modest goal is adding a new voice, possibly in a different dialect, in an already supported language.
So what is the best size and type of database to use for such a model? There are two possible views: record a general database rich in phonetic and prosodic variation, or record speech closely matched to the intended application.
Our current estimate is around 45 minutes of speech, which should contain the sort of speech you wish to synthesize. Short paragraphs containing rich phonetic variation are reasonable, but if the voice is to be used for reading numbers or answering the telephone, the database should include numbers, dates, and welcome messages (including the company name) in various prosodic contexts.
There is unlikely to be one specific text set that is good for everything; different sets will be good for different applications. Also, many people will be unwilling to talk to their computer for 45 minutes before they can get their new voice, so multiple levels and sizes of text should be identified, trading off recording and build time against quality.
There are many experiments and design choices required here to come up with detailed guides and database texts that are appropriate when building new voices.
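One standard way to design such database texts is greedy selection: repeatedly pick the candidate sentence that adds the most uncovered units (diphones, here). The sketch below is a hypothetical illustration in Python, not Festival's own tooling; the phone sequences are toy stand-ins for real lexicon lookups.

```python
# Hypothetical sketch: greedily select prompts that maximize new diphone
# coverage. Phone sequences are toy stand-ins for real lexicon lookups.

def diphones(phones):
    """Return the set of adjacent phone pairs in a pronunciation."""
    return {(a, b) for a, b in zip(phones, phones[1:])}

def select_prompts(candidates, max_prompts):
    """Greedy set-cover: take the sentence adding the most new diphones."""
    covered, chosen = set(), []
    pool = list(candidates)
    for _ in range(max_prompts):
        best = max(pool, key=lambda c: len(diphones(c[1]) - covered),
                   default=None)
        if best is None or not (diphones(best[1]) - covered):
            break  # nothing new to gain; stop early
        chosen.append(best[0])
        covered |= diphones(best[1])
        pool.remove(best)
    return chosen, covered

# Toy candidates: (text, phone sequence) pairs.
cands = [
    ("a cat", ["ax", "k", "ae", "t"]),
    ("the cat sat", ["dh", "ax", "k", "ae", "t", "s", "ae", "t"]),
    ("sit", ["s", "ih", "t"]),
]
chosen, covered = select_prompts(cands, 2)
```

The same greedy loop works for any unit inventory (triphones, stressed syllables), which is one way the "multiple levels and sizes" of text sets mentioned above could be generated.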
Given a database of speech, even with its orthography, we will need some way to label it automatically with lower-level labels such as phones. Speech recognition technology has improved greatly, and using packages like HTK it appears easy to build an autolabelling system that does reasonably well. The problem with labelling for synthesis rather than for recognition, however, is that the labels need to be more accurate for a synthesis database than for word recognition. But it is not all bad: for recognition you (probably) need to recognize all words, whereas for synthesis you need only label the parts you can label well. Ignoring badly labelled parts is acceptable, as long as there is enough redundancy in the database.
Therefore what we need is:

- Phone labels that overlap with examples of phones in the waveform. Phone boundaries probably do not need to be very accurate, given that the synthesis method will use some form of optimal coupling.
- A way to identify gross errors arising from mis-alignment. I have seen syllables misaligned in autolabelled corpora, and also occasions when the speaker has actually said something different from the orthography. Some automatic measure of accuracy is necessary to judge when to ignore a section, or even ask for re-recording.
Another method is to abandon traditional phonetic classes as labels and use some other labelling system. As the task of unit selection is to find appropriate units, it is reasonable to question the use of traditional phonetic classes. Two possible alternatives are mentioned here, but others could also be reasonable candidates.
bacchiani96 derives acoustically similar segmental units from parameters coded for speech recognition. Some experiments have already been done on labelling synthesis databases this way, and they indicate that further work could be fruitful, particularly because the method labels the data accurately and fully automatically. Of course, you still need to match the traditional phonetic classes (from the targets) to these acoustically derived labels, but that may be an easier task than mapping phonetic labels directly to waveform segments.
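The core of such acoustically derived units is clustering frames of coded parameters and labelling segments by cluster. As a hypothetical illustration (not the method of bacchiani96 itself), here is a minimal k-means over toy two-dimensional "cepstral" frames; a real system would cluster genuine cepstral vectors and then segment runs of identical labels.

```python
# Hypothetical sketch: derive acoustic labels by clustering frames.
# Real systems cluster cepstral vectors; the 2-D points are stand-ins.
import random

def dist2(p, q):
    """Squared Euclidean distance between two equal-length points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(frames, k, iters=20, seed=0):
    """Plain k-means: return per-frame cluster labels and centroids."""
    rng = random.Random(seed)
    centroids = rng.sample(frames, k)
    assign = [0] * len(frames)
    for _ in range(iters):
        # assign each frame to its nearest centroid
        assign = [min(range(k), key=lambda j: dist2(f, centroids[j]))
                  for f in frames]
        # recompute each centroid as the mean of its members
        for j in range(k):
            members = [f for f, a in zip(frames, assign) if a == j]
            if members:
                centroids[j] = tuple(sum(c) / len(members)
                                     for c in zip(*members))
    return assign, centroids

# Two well-separated toy "acoustic" regions.
frames = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1),
          (5.0, 5.0), (5.1, 5.0), (5.0, 5.1)]
assign, centroids = kmeans(frames, 2)
```

The cluster indices then serve as the label inventory, and the remaining task is the mapping from target phonetic classes to these indices discussed above.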
A second alternative to traditional phonetic labelling is described in fujimura93. The C/D model offers a multi-layered (autosegmental) labelling system for different phonetic phenomena; it offers a more reasonable labelling for variable-speed speech, and the articulatory constraints involved in forming speech are implicit in the model.
In addition to phone (or equivalent) labels, other forms of labelling are desirable: syllable and word labelling are probably trivial given the orthography and phones, but phrasing and intonation require further labelling expertise. taylor94b provides a method for autolabelling, but it has not been fully implemented and evaluated within a text-to-speech framework.
It seems reasonable to prompt users when recording a database. Given a prompt, a user will most likely mimic its prosody, which may help seed the intonation labelling system. However, as full paragraphs will also be given (necessary to elicit fully fluent speech), users will not remember the full prosodic structure of what they have to say and will deviate from the prompt. It may therefore be reasonable to record short sentences with specific variation to seed the intonation labelling system, and then use that to train the labelling of the rest of the corpus.
Given our neatly labelled database, we must use this information sensibly to extract parameters for our various synthesis models.
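As one concrete instance of such parameter extraction, per-phone duration statistics (means and standard deviations) can be pulled straight from the phone labels; these are the raw material for z-score style duration models. The sketch below is illustrative Python, with simple (phone, start, end) tuples standing in for real label files.

```python
# Minimal sketch: extract per-phone duration statistics from labelled
# segments, the raw material for a z-score style duration model.
# The tuples stand in for real label files; the times are invented.
from collections import defaultdict
from statistics import mean, pstdev

def duration_stats(segments):
    """segments: (phone, start_sec, end_sec) -> {phone: (mean, std)}."""
    durs = defaultdict(list)
    for phone, start, end in segments:
        durs[phone].append(end - start)
    return {p: (mean(d), pstdev(d)) for p, d in durs.items()}

labels = [("pau", 0.00, 0.10), ("hh", 0.10, 0.16),
          ("ax", 0.16, 0.20), ("l", 0.20, 0.28),
          ("ow", 0.28, 0.43), ("ax", 0.43, 0.49)]
stats = duration_stats(labels)
```

Analogous extraction passes over the same labels would gather the F0, phrasing, and unit inventory statistics that the other synthesis models require.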