
9 New Voices

The FestVox project (http://festvox.org) deals specifically with this issue and contains full documentation, discussion, and tools for designing, recording, building, and testing new voices in new and currently supported languages. If you are actually going to build a new synthetic voice you should read the FestVox document itself.

9.1 Building New Voices

Building a new voice in a new language requires addressing the following:

Phoneset
Token processing rules
Prosodic phrasing method
Word pronunciation (lexicon and/or letter to sound rules)
Intonation (accents and F0 contour)
Duration
Waveform synthesizer

Before starting you must have a phoneset. This is often pre-defined by the lexicon and/or waveform synthesizer that is to be used. Most languages have a phoneset already defined for them, though if the language has had little computational work done on it the phoneset may not be defined completely enough for easy computational use. Wherever possible it is wise to use someone else's phoneset, but do consider the possibility that it needs to be augmented to handle allophonic phenomena such as vowel reduction, flaps, palatalization etc.

The next area to consider is tokenization of text, assuming the desired language has a written form; if not, you'll need to define one. This can be relatively easy or not, depending on the language. Chinese, Japanese and Thai (and others) do not use spaces between words, so even simple tokenization is a non-trivial process. There are statistical techniques that work for this, but they require a lexicon of words in the language (sproat96b). Other problems such as numbers, symbols, homographs etc. also require treatment. Note that although from an anglicized view treatment of text in the native writing system may not seem worth it, for the system to be considered a TTS system for that language you must allow input to be in that writing system. This requirement may be non-trivial to meet if there are few computer fonts or authoring tools for that language, but it should be seriously addressed.
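To make the dependence on a lexicon concrete, here is a minimal sketch of dictionary-based greedy longest-match segmentation. This is a much simpler technique than the statistical methods mentioned above and is not taken from them; the function name `segment` and the toy lexicon (English with the spaces removed) are assumptions made purely for illustration.

```python
def segment(text, lexicon):
    """Split an unspaced string into words by greedy longest match.

    Illustrative only: real systems use statistical models to choose
    between competing segmentations.
    """
    max_len = max((len(w) for w in lexicon), default=1)
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking until a lexicon entry matches.
        for j in range(min(len(text), i + max_len), i, -1):
            if text[i:j] in lexicon:
                words.append(text[i:j])
                i = j
                break
        else:
            # Nothing in the lexicon starts here: emit one character as an unknown token.
            words.append(text[i])
            i += 1
    return words


# Hypothetical toy lexicon, not data from any real language resource.
TOY_LEXICON = {"the", "time", "is", "no", "now"}
print(segment("thetimeisnow", TOY_LEXICON))
```

Greedy matching commits to the first long match it finds, so it can pick the wrong split when words overlap, which is one reason statistical methods are preferred in practice.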

In some cases a romanized-only form of synthesis may be sufficient. For example, if the synthesizer is only going to be used in language generation systems or machine translation systems, it may be possible to use this simplified script. Also note that, due to the dominance of ASCII-based machines, even when there is a native written form many computer users of the language have constructed a romanized system because of the inadequacy of existing computer systems. Supporting that writing system may be advantageous, as text (e.g. email) already exists in that form. For example, Greek can be written in ASCII using soundalike and lookalike symbols. Greek people do use this in email when no other input method is available, and we found it useful to include support for this in our Greek synthesizer.

Possibly the most expensive part of a synthesizer is the lexicon. However many languages don't require one, as the relationship between the written and phonetic form can be so close that simple hand-written letter to sound rules are sufficient. Building large lexicons is a long and tedious task and not something to be undertaken lightly. Wherever possible you should try to find an existing lexicon for the language, though remember to note its copyright restrictions.

Note that for some languages with a rich morphology, such as Finnish and Turkish, it's not easy to list the words in the language due to the large number of variations in form. Even in languages with relatively few variations it is impossible to list every word in the language: new words and rare words, especially proper and place names, will appear with small but regular frequency in new text to be spoken, and some pronunciation will be required. As there is often some relation, even if non-obvious, between the written form and the pronunciation, trainable letter to sound rule systems such as those supported by Festival are a solution. However there may always be symbols that cannot be pronounced. In this case we should defer to what a human would do. For example, what does a Chinese or Japanese person do when they come across a Chinese character they have never seen before, or for that matter an English-speaking person when they come across the symbol that denotes the artist formerly known as Prince?
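The interplay between a lexicon and letter to sound rules can be sketched as follows. This is only an illustration of the lookup-then-fallback idea using hand-written, context-free rules; it is not Festival's trained letter to sound system, and the rule table, phone names and function name are all hypothetical.

```python
# Hypothetical grapheme-to-phone rules for a toy language with a close
# spelling-to-sound relationship. Phone names are invented for illustration.
TOY_RULES = {
    "ch": ["ch"], "qu": ["k"], "c": ["k"],
    "a": ["a"], "e": ["e"], "i": ["i"], "o": ["o"], "u": ["u"],
    "l": ["l"], "m": ["m"], "n": ["n"], "p": ["p"], "s": ["s"], "t": ["t"],
}

# Hypothetical exception lexicon: words whose pronunciation the rules get wrong.
TOY_LEXICON = {"pizza": ["p", "i", "t", "s", "a"]}


def pronounce(word, lexicon, rules):
    """Return (phones, unknown_symbols) for a word.

    The lexicon is consulted first; otherwise the rules are applied left to
    right, preferring the longest matching letter sequence.
    """
    word = word.lower()
    if word in lexicon:
        return list(lexicon[word]), []
    longest = max(len(g) for g in rules)
    phones, unknown = [], []
    i = 0
    while i < len(word):
        for n in range(min(longest, len(word) - i), 0, -1):
            chunk = word[i:i + n]
            if chunk in rules:
                phones.extend(rules[chunk])
                i += n
                break
        else:
            # No rule covers this symbol: record it rather than guess a sound.
            unknown.append(word[i])
            i += 1
    return phones, unknown


print(pronounce("chica", TOY_LEXICON, TOY_RULES))
print(pronounce("pizza", TOY_LEXICON, TOY_RULES))
```

Returning the unpronounceable symbols separately leaves the decision about what to do with them (skip, spell out, or flag) to the caller.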

9.2 Limited domain synthesis

Unit selection synthesis offers high quality, natural sounding utterances, but unfortunately it can also deliver quite incomprehensible synthesis. Most of the work in unit selection synthesis is involved in trying to automatically detect when it's gone wrong and to ensure this happens as infrequently as possible. There are two reasons for the bad quality examples. The first is that the acoustic measures and optimal coupling are not simple, reliable measures of quality, hence the search for better measures that better reflect perceptual notions of good quality speech. The second reason is that the database being selected from doesn't have the distribution of units required for synthesizing the desired utterance. It is obvious when using a unit selection synthesizer that it works better when there are more examples of what you wish to say in the database.

To take advantage of this second point and produce high quality synthesis more reliably, we can build synthesizers from databases that are deliberately tailored to the task they are to perform. At the most extreme end we could simply record all the utterances that are required for the particular task. But that seems over-restrictive and possibly a lot of recording. Note that a task may still have a restricted domain even though it has an infinite number of utterances.

For example consider a talking clock. We can specify a particular structure for how to say the time, and then build a database that has a reasonable number of examples of the minutes and hours in each slot. For example, included in the Festival distribution is a simple `saytime' script which gives the current time in a simple approximate form, as in

The time is now, a little after ten past eleven, in the morning.

Thus we can design a number of utterances which cover all the basic words/phrases that such a program needs to say. In this case 24 utterances seem to give the coverage we need. These are

"The time is now, exactly five past one, in the morning."
"The time is now, just after ten past two, in the morning."
"The time is now, a little after quarter past three, in the morning."
"The time is now, almost twenty past four, in the morning."
"The time is now, exactly twenty-five past five, in the morning."
"The time is now, just after half past six, in the morning."
"The time is now, a little after twenty-five to seven, in the morning."
"The time is now, almost twenty to eight, in the morning."
"The time is now, exactly quarter to nine, in the morning."
"The time is now, just after ten to ten, in the morning."
"The time is now, a little after five to eleven, in the morning."
"The time is now, almost twelve."
"The time is now, just after five to one, in the afternoon."
"The time is now, a little after ten to two, in the afternoon."
"The time is now, exactly quarter to three, in the afternoon."
"The time is now, almost twenty to four, in the afternoon."
"The time is now, just after twenty-five to five, in the afternoon."
"The time is now, a little after half past six, in the evening."
"The time is now, exactly twenty-five past seven, in the evening."
"The time is now, almost twenty past eight, in the evening."
"The time is now, just after quarter past nine, in the evening."
"The time is now, almost ten past ten, in the evening."
"The time is now, exactly five past eleven, in the evening."
"The time is now, a little after quarter to midnight."

We then record these and build a unit selection synthesizer from them. This synthesizer can only say things in domain. As the above doesn't even have phoneme coverage for English it is unquestionably a limited domain synthesizer, but when it talks the quality is far superior to diphones and is almost never poor, even though there are still some problems with it.
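The slot structure behind the prompts listed above can be sketched in a few lines. This is not the `saytime' script itself. The rounding to the nearest five minutes, the mapping of offsets to qualifiers, the part-of-day boundaries and the omission of the part of day for twelve and midnight are all assumptions inferred from the example prompts.

```python
# Slot fillers taken from the example prompts; the selection logic is assumed.
QUALIFIERS = {-2: "almost", -1: "almost", 0: "exactly",
              1: "just after", 2: "a little after"}
MINUTES = {0: "", 5: "five past ", 10: "ten past ", 15: "quarter past ",
           20: "twenty past ", 25: "twenty-five past ", 30: "half past ",
           35: "twenty-five to ", 40: "twenty to ", 45: "quarter to ",
           50: "ten to ", 55: "five to "}
HOURS = ["midnight", "one", "two", "three", "four", "five", "six",
         "seven", "eight", "nine", "ten", "eleven", "twelve"]


def say_time(hour, minute):
    """Build an approximate time sentence from a 24-hour clock reading."""
    rounded = (minute + 2) // 5 * 5      # nearest multiple of five minutes
    offset = minute - rounded            # -2..2, selects the qualifier slot
    if rounded == 60:                    # e.g. 11:58 is "almost twelve"
        rounded, hour = 0, hour + 1
    elif rounded > 30:                   # "... to" phrases name the next hour
        hour += 1
    hour %= 24
    name = HOURS[12 if hour == 12 else hour % 12]
    text = f"The time is now, {QUALIFIERS[offset]} {MINUTES[rounded]}{name}"
    if hour in (0, 12):                  # assumed: no part of day for these
        return text + "."
    part = "morning" if hour < 12 else "afternoon" if hour < 18 else "evening"
    return f"{text}, in the {part}."


print(say_time(11, 12))
```

Every sentence this function can produce is built from a small closed set of fillers, which is what makes it feasible to cover the domain with a couple of dozen recorded prompts.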

http://festvox.org/ldom/ldom_time.html includes some examples of a talking clock using this technique, and the FestVox documentation includes detailed instructions on building such a talking clock.

However the above is very simple and really too limited. The technique does scale up, though, at least to much more interesting domains. A weather reporter that includes times, dates, weather outlook etc. is also available at http://festvox.org/ldom/ldom_weather.html. This was done with 100 in-domain sentences of the form

The weather at 10:00 am on Sunday April 23, outlook cloudy,
45 degrees, winds northwest 10 mile per hour
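A sentence form like this can be treated as a template with named slots, where some slots only accept fillers that were actually recorded. The sketch below is an illustration only: the slot names and the filler inventories are hypothetical and do not describe the actual weather database.

```python
TEMPLATE = ("The weather at {time} on {day} {month} {date}, outlook {outlook}, "
            "{temperature} degrees, winds {direction} {speed} mile per hour")

# Hypothetical inventory of recorded fillers for two of the slots.
RECORDED = {
    "outlook": {"cloudy", "sunny", "rain", "snow"},
    "direction": {"north", "northeast", "east", "southeast",
                  "south", "southwest", "west", "northwest"},
}


def weather_prompt(**slots):
    """Fill the template, refusing fillers that are outside the domain."""
    for name, allowed in RECORDED.items():
        if slots[name] not in allowed:
            raise ValueError(f"{slots[name]!r} is not a recorded {name}")
    return TEMPLATE.format(**slots)


print(weather_prompt(time="10:00 am", day="Sunday", month="April", date=23,
                     outlook="cloudy", temperature=45,
                     direction="northwest", speed=10))
```

Rejecting an out-of-domain filler up front is one simple way to avoid asking the synthesizer for something it has no units for.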

We have experimented further with these techniques in a limited domain synthesizer for the CMU DARPA Communicator project, which offers a telephone dialog system for booking flights etc. Here the output forms seem more general than a simple slot and filler approach, but much is still actually constrained. We have built a 500 utterance database which covers most of what is required: flight numbers, times, cities and airlines are the only major changing parts, while the standard expressions remain the same (except when we update the system).

Tailoring your synthesizer to the particular task is always a good idea, but it is also possibly a risk. If something needs to be changed in your system, having to re-record is not as easy as simply adding another line of text to your program. Thus there are disadvantages to limited domain synthesizers compared with general ones.

9.3 Voice Conversion

Instead of carefully collecting a diphone database from a speaker or a balanced database for unit selection it would be much better if only a small sample of speech was necessary from a speaker in order to model their speech. Ideally one would use an existing database of speech and prosodic models and convert them based on a small sample from the desired speaker.

Such techniques are being studied and will probably eventually be the standard method for building new synthetic voices. Such a technique would require much less participation from the target speaker, including fewer high quality recordings and less effort on their part.

Apart from the obvious advantage of being able to build a new voice more easily, voice conversion has other advantages too. First, it would be easier for a synthesizer to have many more voices available, allowing a user to choose, if not their own voice, one they prefer for their system. If the modelling is done appropriately this can be done without requiring large amounts of disk space to store the multiple voices. A second advantage is that modelling might be done across languages, so a speech to speech translation system could produce a voice like the originator's but speaking in a language that the original person cannot speak. Also imagine a network-based telephone system that is bandwidth constrained (i.e. all telephone systems): the modification parameters alone could be transferred to the other site, and a synthetic voice resembling the original could be constructed from a very low bandwidth line that passed only phones, durations and F0 points.

A person's voice is a combination of their lexical, prosodic, and spectral properties. Thus ideally to model a voice you need sufficient samples to augment existing representations. In some sense we have already discussed dialects and prosodic modelling. Here we will concentrate on the lower level parts.

Given the form of parameterization of speech we are using, we can view our standard parameterization as spectral plus residual, where the spectral information is contained within the LPC (or cepstrum) parameters. Mapping these parameters to a new voice is the basic technique, though what parameterization to use, how the mapping is done, and how to use the available parameters from the target speaker are all active areas of research.
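As a deliberately simple illustration of what "mapping parameters" can mean, the sketch below shifts and scales each coefficient of a source speaker's frames so that its mean and spread match those of the target speaker. This is a toy baseline chosen for brevity, not the method used in the work cited in this section. It assumes each frame is a list of cepstrum-like coefficients, and the function names are hypothetical.

```python
from statistics import mean, pstdev


def fit_mapping(source_frames, target_frames):
    """Collect per-coefficient mean and spread for both speakers.

    Each argument is a list of frames; each frame is a list of coefficients
    of the same length. The two lists need not be the same size.
    """
    stats = []
    for src_col, tgt_col in zip(zip(*source_frames), zip(*target_frames)):
        stats.append((mean(src_col), pstdev(src_col),
                      mean(tgt_col), pstdev(tgt_col)))
    return stats


def convert(frame, stats):
    """Map one source frame towards the target speaker's statistics."""
    out = []
    for x, (src_mean, src_sd, tgt_mean, tgt_sd) in zip(frame, stats):
        # Leave the scale alone if the source coefficient never varies.
        scale = tgt_sd / src_sd if src_sd > 0 else 1.0
        out.append(tgt_mean + (x - src_mean) * scale)
    return out


# Made-up two-coefficient frames, purely to exercise the functions.
source = [[1.0, 10.0], [3.0, 14.0]]
target = [[2.0, 0.0], [6.0, 1.0]]
stats = fit_mapping(source, target)
print(convert([3.0, 10.0], stats))
```

A mapping this crude treats every coefficient independently and ignores the residual entirely, which gives some sense of why the choice of parameterization and mapping remains an open question.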

Oregon Graduate Institute have a number of examples of voice conversion online at http://cslu.cse.ogi.edu/tts/research/vc/vc.html. They have also developed a Festival plugin for this technique (kain98).

They can achieve a reasonable mapping of a male to a female voice from 64 diphones in the target voice. However, although they (and others in this area) definitely do achieve a new synthetic voice, so far none of the techniques really fully captures all the properties of the target voice. This is also true when full diphone sets are collected, so it is not just the conversion that is the problem: we still do not yet have an adequate model for the representation of a person's voice.
