Overview of Speech Synthesis


AWB: probably way too biased as a history

The idea that a machine could generate speech has been with us for some time, but the realization of such machines has only really been practical within the last 50 years. More recently still, it is only in the last 20 years or so that we have seen practical text-to-speech systems that can say any text they are given -- though the result might be "wrong."

The creation of synthetic speech covers a whole range of processes. Although they are often all lumped under the general term text-to-speech, a good deal of work has gone into generating speech from sequences of speech sounds: that is, synthesis from speech sounds (phonemes) to an audio waveform, rather than going all the way from text to phonemes and then to sound.

One of the first practical applications of speech synthesis came in 1936, when the U.K. Telephone Company introduced a speaking clock. It used optical storage for the phrases, words, and part-words, which were appropriately concatenated to form complete sentences.

Around the same time, Homer Dudley developed a mechanical device at Bell Laboratories that was operated through the movement of pedals and mechanical keys, like an organ. With a trained operator, it could be made to create sounds that, given a good set-up, almost sounded like speech. Called the Voder, it was demonstrated at the 1939 World's Fairs in New York and San Francisco. A recording of this device exists and can be heard in a collection of historical synthesis examples distributed on a record as part of [klatt87].

The realization that the speech signal could be decomposed into a source-and-filter model, with the glottis acting as the sound source and the vocal tract acting as a filter, was used to build analog electronic devices that could mimic human speech. The vocoder, also developed by Homer Dudley, is one such example. Much of the synthesis work in the 40s and 50s was concerned with constructing replicas of the signal itself rather than generating the phones from an abstract form like text.
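
The source-and-filter idea can be illustrated with a minimal digital sketch. It is not a model of the vocoder or of any historical device: the impulse-train source, the single two-pole resonator, and every name and parameter value below (impulse_train, resonator, the 100 Hz pitch, the 500 Hz resonance) are assumptions chosen only for illustration.

```python
import math


def impulse_train(n_samples, sample_rate, f0):
    """Source: one unit impulse per pitch period (a crude stand-in for glottal pulses)."""
    period = int(round(sample_rate / f0))
    return [1.0 if i % period == 0 else 0.0 for i in range(n_samples)]


def resonator(signal, sample_rate, centre_hz, bandwidth_hz):
    """Filter: a two-pole resonator (a crude stand-in for one vocal-tract resonance)."""
    r = math.exp(-math.pi * bandwidth_hz / sample_rate)  # pole radius, < 1 so the filter is stable
    theta = 2.0 * math.pi * centre_hz / sample_rate      # pole angle
    a1 = 2.0 * r * math.cos(theta)
    a2 = -r * r
    out = []
    y1 = y2 = 0.0  # the two previous output samples
    for x in signal:
        y = x + a1 * y1 + a2 * y2
        out.append(y)
        y2, y1 = y1, y
    return out


SAMPLE_RATE = 16000  # assumed sampling rate in Hz
source = impulse_train(1600, SAMPLE_RATE, f0=100.0)
speechlike = resonator(source, SAMPLE_RATE, centre_hz=500.0, bandwidth_hz=80.0)
```

Changing f0 alters the perceived pitch without touching the filter, and changing centre_hz alters the timbre without touching the source; that independence is the point of the decomposition.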

Further decomposition of the speech signal allowed the development of formant synthesis, where collections of signals were composed to form recognizable speech. The prediction of parameters that compactly represent the signal, without the loss of any information critical for reconstruction, has always been, and still is, difficult. Early versions of formant synthesis allowed these parameters to be specified by hand, with automatic modeling as a goal. Today, formant synthesizers can produce high-quality, recognizable speech if the parameters are properly adjusted, and these systems can work very well for some applications. It is still hard to get fully natural-sounding speech from them when the process is fully automatic -- as it is with all synthesis methods.

With the rise of digital representations of speech, digital signal processing, and the proliferation of cheap, general-purpose computer hardware, more work was done on the concatenation of natural recorded speech. Diphones appeared: units made of two adjacent half-phones (context-dependent phoneme realizations), each cut in the middle of a phone and joined into one unit. The justification was that phone boundaries are much more dynamic than the stable interior parts of phones. The stable points have, by definition, little rapid change, whereas the boundaries change rapidly in ways that depend on the previous or next unit, so mid-phone is a better place to concatenate units.
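
As a small illustration of how a phone sequence maps onto diphone units, the sketch below pairs each phone with its successor. The function name phones_to_diphones, the "left-right" naming scheme, and the example phone labels are assumptions made for this sketch, not the conventions of any particular synthesizer.

```python
def phones_to_diphones(phones):
    """Pair each phone with its successor; each pair names one diphone unit."""
    return [f"{left}-{right}" for left, right in zip(phones, phones[1:])]


# Hypothetical phone string for a short word, padded with silence ("pau").
phones = ["pau", "h", "e", "l", "ou", "pau"]
diphones = phones_to_diphones(phones)
```

Each unit runs from the middle of one phone to the middle of the next, so a sequence of N phones needs N - 1 diphones, and a language with P phones needs at most P squared distinct units in its inventory.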

The rise of concatenative synthesis began in the 70s, and it has largely become practical as large-scale electronic storage has become cheap and robust. When a megabyte of memory cost a significant part of a researcher's salary, less resource-intensive techniques were worth their weight in gold. Of course, formant synthesis can still require significant computational power, even if it requires less storage; speech synthesis in the 80s relied on specialized hardware to deal with the constraints of the time.

In 1972, the standard Unix manual (3rd edition) included commands to convert text to speech, from text analysis, prosodic prediction, and phoneme generation through to waveform synthesis on a specialized piece of hardware. Of course, Unix had only about 16 installations at the time, and most, perhaps even all, were located at Bell Labs in Murray Hill.

Techniques were developed to compress (code) speech so that it could be used more easily in applications. The Texas Instruments Speak 'n Spell toy, released in the late 70s, was one of the early examples of mass-produced speech synthesis. The quality was poor by modern standards, but for the time it was very impressive. Speech was basically encoded using LPC (Linear Predictive Coding), and the toy mostly used isolated words and letters, though there were also a few phrases formed by concatenation. Simple text-to-speech (TTS) engines based on specialized chips became popular on home computers such as the BBC Micro in the UK and the Apple ][.

Dennis Klatt's MITalk synthesizer [allen87] in many senses defined the perception of automatic speech synthesis to the world at large. Later developed into the product DECTalk, it produces somewhat robotic, but very understandable, speech. It is a formant synthesizer, reflecting the state of the art at the time.

Before 1980, research in speech synthesis was limited to the large laboratories that could afford to invest the time and money for hardware. By the mid-80s, more labs and universities started to join in as the cost of the hardware dropped. By the late 80s, purely software synthesizers became feasible; the speech quality was still decidedly inhuman (and largely still is), but it could be generated in near real time.

Of course, with faster machines and more disk space, people began to look at improving synthesis by using larger and more varied inventories for concatenative speech. Yoshinori Sagisaka at Advanced Telecommunications Research (ATR) in Japan developed nuu-talk [nuutalk92] in the late 80s and early 90s. It introduced a much larger inventory of concatenative units: instead of one example of each diphone unit, there could be many, and an automatic, acoustically based distance function was used to find the best selection of sub-word units from a fairly broad database of general speech. This work was done in Japanese, which has a much simpler phonetic structure than English, making it possible to get high quality with a relatively small database. Even up through 1994, the time needed to generate the parameter files for a new voice in nuu-talk (503 sentences) was on the order of several days of CPU time, and synthesis was not generally possible in real time.
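
The general idea of choosing among many candidate units with a distance function can be sketched as a dynamic-programming search. This is not nuu-talk's actual distance function or search: the unit representation (a label with a start and end pitch), the two cost functions, and all names and values below are hypothetical, chosen only to make the trade-off visible.

```python
def select_units(targets, candidates, target_cost, join_cost):
    """Return (total_cost, path): the cheapest way to pick one candidate per target."""
    # best holds, for each candidate at the current position,
    # (cost of the cheapest path ending in that candidate, the path itself).
    best = [(target_cost(targets[0], c), [c]) for c in candidates[0]]
    for target, options in zip(targets[1:], candidates[1:]):
        new_best = []
        for c in options:
            # Cheapest predecessor once the cost of joining it to c is included.
            cost, path = min(
                ((prev_cost + join_cost(prev_path[-1], c), prev_path)
                 for prev_cost, prev_path in best),
                key=lambda item: item[0],
            )
            new_best.append((cost + target_cost(target, c), path + [c]))
        best = new_best
    return min(best, key=lambda item: item[0])


# Hypothetical units: (label, pitch at start in Hz, pitch at end in Hz).
def pitch_target_cost(wanted_pitch, unit):
    """Distance between the wanted pitch and the unit's mean pitch."""
    return abs(wanted_pitch - (unit[1] + unit[2]) / 2.0)


def pitch_join_cost(left, right):
    """Pitch discontinuity at the join between two consecutive units."""
    return abs(left[2] - right[1])


targets = [100.0, 110.0, 120.0]
candidates = [
    [("a1", 100.0, 105.0), ("a2", 90.0, 95.0)],
    [("b1", 104.0, 112.0), ("b2", 110.0, 140.0)],
    [("c1", 113.0, 121.0), ("c2", 120.0, 120.0)],
]
total_cost, path = select_units(targets, candidates, pitch_target_cost, pitch_join_cost)
labels = [unit[0] for unit in path]  # ["a1", "b1", "c1"], with total_cost 9.5
```

In this toy data, "c2" matches the final target pitch exactly but joins badly to "b1", so the search prefers "c1": the best overall sequence is not simply the best unit at each position.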

With the demonstration of general unit selection synthesis in English in Rob Donovan's PhD work [donovan95] and in ATR's CHATR system ([campbell96] and [hunt96]), unit selection had become a hot topic in speech synthesis research by the end of the 90s. However, despite examples of it working excellently, generalized unit selection is known for producing very bad quality synthesis from time to time. As the optimal search and selection algorithms used are not 100% reliable, both high- and low-quality synthesis is produced -- and many difficulties still exist in turning general corpora into high-quality synthesizers as of this writing.

In the 2000s, a new statistical method of speech synthesis came to the forefront, again pioneered by work in Japan. Prof. Keiichi Tokuda's HTS system (from Nagoya Institute of Technology) showed that building generative models of speech, rather than selecting unit instances, can generate reliable, high-quality speech. It came to prominence at the first Blizzard Challenge in 2005, which showed that HTS output was reliably understood by listeners. HTS, and so-called HMM synthesis in general, seems to do well on smaller amounts of data and when the data is less reliably recorded, which offers a significant advantage over the very large, carefully labelled corpora that seem to be required for unit selection work. We include detailed walkthroughs from CMU's CLUSTERGEN statistical parametric synthesizer, which is tightly coupled with this Festvox voice building toolkit, though HTS continues to benefit from the Festival system (and much of what is in this document).

Of course, the development of speech synthesis is not isolated from other developments in speech technology. Speech recognition, which has also benefited from the falling cost of computational power and the increased availability of general computing to the populace, informs the work on speech synthesis, and vice versa. There are now many more people who have the computational resources and the interest to run speech applications, and this puts demand on the technology to deliver both working recognition and acceptable-quality speech synthesis.

The availability of free and semi-free synthesis systems, such as the Festival Speech Synthesis System and the MBROLA project, makes the cost of entering the field of speech synthesis much lower, and many more groups have now joined in the development.

However, although we are now at the stage where talking computers are with us, there is still a great deal of work to be done. With a sufficient amount of work, we can now build synthesizers for (probably) any language that produce recognizable speech; but if we are to receive information from computers through speech as easily as we do in everyday conversation, synthesized speech must be natural, controllable, and efficient (both in rendering and in the building of new voices).