

2 History

Here is a (probably biased and naive) history of generating synthetic speech.

The idea that a machine could generate speech has been with us for some time, but the realization of such machines has only really been practical in the last 50 years, and only in the last 20 years have we actually seen what could be termed practical examples of text-to-speech systems.

First you must be aware of some distinctions here. The creation of synthetic speech covers a whole range of processes, and although they are often all lumped under the general term text-to-speech, much work has concentrated on how to generate the speech itself from known phones rather than on the whole text conversion process.

Probably the first practical application of speech synthesis was in 1936 when the UK Telephone Company introduced a speaking clock. It used optical storage for the phrases, words and part-words which were appropriately concatenated to form complete sentences.

Also around that time, Homer Dudley at Bell Labs developed a mechanical device that, through the movement of pedals and mechanical keys, like an organ, could generate almost recognizable speech when played by a trained operator. Called the Voder, it was demonstrated at the 1939 World's Fairs in New York and San Francisco. A recording of this device exists and can be heard as part of a collection of historical synthesis examples that were distributed on a record with klatt87. See http://www.festvox.org/history/klatt.html.

The realization that the speech signal could be decomposed into a source-filter model was used to build analogue systems that could mimic human speech. The vocoder, also developed by Homer Dudley, is one such example. Much of the work in synthesis in the 40s and 50s was primarily concerned with constructing the signal itself rather than generating the phones from some higher-level form like text.

Further decomposition of the speech signal allowed the development of formant synthesis, where collections of signals were composed to form recognizable speech. Predicting the parameters that adequately represent the signal has always been (and still is) difficult. Early versions of formant synthesis allowed these to be specified by hand, but automatic prediction was the goal. Today formant synthesizers can produce high quality, recognizable speech if the parameters are properly adjusted, but it is still difficult to get fully natural sounding speech from them when the process is fully automatic.

With the advancement of digital computers, the move to digital representations of speech, and the reduced dependence on specialized hardware, more work was done on concatenation of natural recorded speech. It quickly became obvious that diphones were the desired unit of representation. That is, units of speech from the middle of one phone to the middle of the next were the basic building blocks. The justification is that phone boundaries are much more dynamic than the middles of phones, and therefore mid-phone is a better place to concatenate units than at phone boundaries. The rise of concatenative synthesis, from the 70s through to today, has to a large extent been made practical by the reduction in the cost of electronic storage. When a megabyte of memory cost a significant part of a researcher's salary, less resource-intensive techniques were naturally investigated more. Of course formant synthesis often requires significant computational power, even if it requires less storage, so even into the 80s speech synthesis relied on specialized hardware.

In 1972 the standard Unix manual (3rd edition) included commands to process text to speech: from text analysis, through prosodic prediction and phoneme generation, to waveform synthesis via a specialized piece of hardware. Of course Unix had only about 16 installations at the time, and most (all?) were located at Bell Labs in Murray Hill.

Techniques were being developed to compress speech in a way that it could be more easily used in applications. The Texas Instruments Speak & Spell toy, released in the late 70s, was one of the earlier examples of mass production of speech synthesis. The quality was pretty poor, but for the time it was very impressive. Speech was basically encoded using LPC (Linear Predictive Coding) and mostly used isolated words and letters, though there were also a few phrases formed by concatenation. Simple TTS engines based on specialised chips became popular on home computers such as the BBC Micro in the UK and the Apple ][.

Dennis Klatt's MITalk synthesizer allen87 in many senses defined the perception of automatic speech synthesis to the world at large. Later developed into the product DECTalk, it produces a somewhat robotic, but very understandable, form of speech. It is a formant synthesizer (reflecting its development period).

Before 1980 speech synthesis research was limited to large laboratories that could afford to invest the time and money in hardware. By the mid-80s more labs and universities started to join in as the cost of the hardware dropped. By the late eighties, purely software synthesizers became feasible; these not only produced reasonable quality speech but could also do so in near real-time.

Of course, with faster machines and larger disks, people began to look at improving synthesis by using larger, and more varied, inventories for concatenative speech. Yoshinori Sagisaka at ATR in Japan developed nuu-talk nuutalk92 in the late 80s and early 90s, which used a much larger inventory of concatenative units; thus instead of one example of each diphone unit there could be many, and an automatic, acoustically based selection was used to find the best sequence of sub-word units from a fairly general database of speech. This work was done in Japanese, which has a much simpler phonetic structure than English, making it possible to get high quality with a still relatively small database. Even by 94, though, generating the parameter files for a new voice in nuu-talk (503 sentences) would take several days of CPU time, and synthesis was not generally real-time.

With the demonstration of general unit selection synthesis in English in Rob Donovan's PhD work donovan95 and ATR's CHATR system (campbell96 and hunt96), by the end of the 90's unit selection had become the hot topic in speech synthesis research. However, in spite of very high quality examples of it working, generalized unit selection can also produce some very poor quality synthesis. As the optimal search and selection algorithms used are not 100% reliable, both high and low quality synthesis is produced.

Of course the development of speech synthesis has not been in isolation from other developments in speech technology. Speech recognition has also benefited from the reduction in the cost of computational power and the increased availability of general computing to the populace. There are now many more people who have the computational resources and interest to run speech applications, and this ability puts demands on the technology to deliver both working recognition and acceptable quality speech synthesis.

The availability of free and semi-free synthesis systems, such as the MBROLA project and the Festival Speech Synthesis System, makes the cost of entering the field of speech synthesis much lower, and many more groups have now joined in the development.

However, although we are now at the stage where talking computers are with us, there is still much to do. We can now build synthesizers for (probably) any language that produce recognizable speech, but if we are to use speech to receive information as easily as we do from humans, much work remains. Synthesized speech must be natural, controllable and efficient (both in rendering and in the building of new voices).

