
1 Introduction

The purpose of this course is:

To give an understanding of the basic parts of speech synthesis
To convey the relative complexity of implementing solutions to the problems involved
To build familiarity with the Festival architecture and with what it can and can't do

As I am a firm believer in learning by doing, this course tries to touch on every aspect of speech synthesis from a practical point of view. Problems are discussed in general terms, with some presentation of potential theoretical solutions. Where appropriate, substantial exercises are given, which will hopefully lead to a greater understanding of the actual problems.

In addition to discussion, exercises will be given with hints about how to do them in Festival. The exercises may take some time and are sometimes open-ended, with no obviously right solution. This reflects the nature of the synthesis field.

1.1 Text to Speech

Many people have quite different views of what processes are involved in text to speech. Often what some consider primary areas are considered trivial by others. This is partly due to different problems arising in different languages, but also due to researchers seeing only the part that interests them (a common problem in most areas of research). Here we present our own particular view of the processes involved in text to speech, which is probably also biased, but which does at least discuss all the parts that are necessary for our system to run.

In this course we will view TTS as four major parts:

Architecture: something that can hold the system together. This introduces clearly defined objects and the processes that we wish to apply to them.
Text processing: Analysis of raw and labelled text into identifiable words. This covers tokenization, mapping tokens to words, resolving homographs, and explicit mark-up languages.
Linguistic/prosodic processing: From words to segments, F0 and durations (and anything else appropriate for waveform synthesis). This deals with lexicons for pronunciation of words, intonation (prosodic phrase, accent and F0 prediction) and durations.
Waveform synthesis: From segments, F0 and duration to a waveform. There are many techniques for this, including concatenative synthesis (diphone, unit selection), formant synthesis and articulatory synthesis.
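
To make the division into four parts concrete, the following is a minimal sketch in Python of how the stages might hand data to one another. It is not Festival code and does not reflect Festival's actual API: the names `Utterance`, `Segment`, `EXPANSIONS` and every function here are hypothetical, and each stage is a deliberately crude stand-in (letters instead of phones, a straight-line falling F0, sine tones instead of speech).

```python
import math
from dataclasses import dataclass, field


@dataclass
class Segment:
    """One phone-like unit with toy prosodic targets (hypothetical type)."""
    name: str
    duration_s: float
    f0_hz: float


@dataclass
class Utterance:
    """Architecture: one object that every stage reads from and writes to."""
    text: str
    words: list = field(default_factory=list)
    segments: list = field(default_factory=list)
    samples: list = field(default_factory=list)


# Toy token-to-word table (an assumption for illustration only).
EXPANSIONS = {"dr": "doctor", "2": "two"}


def text_processing(utt):
    """Text processing: split raw text into tokens, then map tokens to words."""
    tokens = utt.text.lower().replace(".", " ").split()
    utt.words = [EXPANSIONS.get(tok, tok) for tok in tokens]
    return utt


def linguistic_processing(utt):
    """Linguistic/prosodic processing: words to segments with duration and F0.

    Letters stand in for phones, every segment gets the same duration,
    and F0 falls linearly from 130 Hz to 100 Hz across the utterance.
    """
    letters = [ch for word in utt.words for ch in word if ch.isalpha()]
    for i, ch in enumerate(letters):
        f0 = 130.0 - 30.0 * i / max(len(letters) - 1, 1)
        utt.segments.append(Segment(ch, 0.08, f0))
    return utt


def waveform_synthesis(utt, sample_rate=16000):
    """Waveform synthesis: render each segment as a sine tone at its F0."""
    phase = 0.0
    for seg in utt.segments:
        n_samples = int(round(seg.duration_s * sample_rate))
        for _ in range(n_samples):
            phase += 2.0 * math.pi * seg.f0_hz / sample_rate
            utt.samples.append(math.sin(phase))
    return utt


def synthesize(text):
    """Run the three processing stages in order over one shared utterance."""
    utt = Utterance(text)
    for stage in (text_processing, linguistic_processing, waveform_synthesis):
        utt = stage(utt)
    return utt
```

Calling `synthesize("Dr Smith has 2 cats.")` fills in `words`, `segments` and `samples` in turn. The point of the sketch is only the shape: a shared utterance object, with each stage adding the information the next one needs.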

In addition to these sub-areas of speech synthesis, there are also general topics which touch on all of them and are best dealt with separately:

Building new voices: either in languages that are already supported or in new languages.
Building domain-specific voices: by tailoring modes (text analysis, lexicon, prosody) or by recording specific databases and building limited-domain synthesizers.
Building data-driven models of speech: for prosody, letter-to-sound rules, and text analysers.
How to use speech synthesis: what it can do and what it can't, and how to get the best possible synthetic voice for different applications.
Concept-to-speech vs Text-to-speech
