The purpose of this course is:
- To give an understanding of the basic parts of speech synthesis
- To understand the relative complexity of implementing solutions to the
problems
- To become familiar with the Festival architecture and know what it can
and can't do
Because we are firm believers in learning by doing, this course tries to
touch on every aspect of speech synthesis from a practical view. Problems
are discussed in general terms, with some presentation of potential
theoretical solutions. Where appropriate, substantial exercises are
given which will hopefully lead to greater understanding of the actual
problems.
In addition to discussion, exercises will be given with hints about how
to do them in Festival. The exercises may take some time and are
sometimes open-ended, with no obviously right solution. This reflects
the state of the synthesis field.
Many people have quite different views of what processes are involved in
text-to-speech. Often what are considered primary areas by some are
considered trivial by others. This is partly due to different problems
in different languages, but also due to researchers seeing only the part
that interests them (a common problem in most areas of research). Here
we present one particular view of the processes involved in
text-to-speech, which is probably also biased but does at least discuss
all those parts that are necessary for our system to run.
In this course we will view TTS in four major parts:
- Architecture:
Something that can hold the system together. This introduces clearly
defined objects and the processes that we wish to apply to them.
- Text processing:
Analysis of raw and labelled text into identifiable words. This covers
tokenization, mapping tokens to words, resolving homographs, and
explicit mark-up languages.
- Linguistic/prosodic processing:
From words to segments, F0 and durations (and anything else appropriate
for waveform synthesis). This deals with lexicons for pronunciation
of words, intonation (prosodic phrase, accent and F0 prediction)
and durations.
- Waveform synthesis:
From segments, F0 and durations to a waveform. There are many
techniques to do this, including concatenative synthesis (diphone, unit
selection), formant synthesis and articulatory synthesis.
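As a rough illustration of how these four parts fit together, the
following Python sketch passes a single utterance object through three
toy stages. Everything in it is an assumption made for illustration: the
names `Utterance`, `Segment`, `TOY_LEXICON` and the three stage functions
are hypothetical and are not Festival's API, the lexicon holds two
hand-written entries, and the "synthesizer" emits a sine tone per segment
rather than speech.

```python
import math
from dataclasses import dataclass, field
from typing import List


@dataclass
class Segment:
    name: str        # phone label
    duration: float  # seconds
    f0: float        # Hz


@dataclass
class Utterance:
    """Architecture: one object that every stage reads and extends."""
    text: str
    words: List[str] = field(default_factory=list)
    segments: List[Segment] = field(default_factory=list)
    samples: List[float] = field(default_factory=list)


# Toy lexicon with hypothetical entries, not a real pronunciation dictionary.
TOY_LEXICON = {
    "hello": ["hh", "ax", "l", "ow"],
    "world": ["w", "er", "l", "d"],
}


def text_processing(utt):
    """Text processing: split raw text into lower-case words."""
    stripped = (tok.strip(".,!?") for tok in utt.text.split())
    utt.words = [tok.lower() for tok in stripped if tok]
    return utt


def linguistic_processing(utt):
    """Linguistic/prosodic processing: words to segments with duration and F0."""
    phones = [p for w in utt.words for p in TOY_LEXICON.get(w, [])]
    last = max(len(phones) - 1, 1)
    for i, phone in enumerate(phones):
        f0 = 130.0 - 30.0 * i / last  # simple linear fall from 130 Hz to 100 Hz
        utt.segments.append(Segment(phone, 0.08, f0))
    return utt


def waveform_synthesis(utt, sample_rate=16000):
    """Waveform synthesis stand-in: one sine tone per segment."""
    for seg in utt.segments:
        n = round(seg.duration * sample_rate)
        utt.samples.extend(
            math.sin(2 * math.pi * seg.f0 * t / sample_rate) for t in range(n)
        )
    return utt


def synthesize(text):
    """Run the three processing stages in order over one utterance."""
    return waveform_synthesis(linguistic_processing(text_processing(Utterance(text))))
```

The point of the sketch is only the shape of the data flow: each stage
adds its own information to the shared utterance and leaves the earlier
information in place for later stages to use.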
In addition to these sub-areas of speech synthesis there are also
general topics which touch on all of them and are best dealt with
as separate parts:
- Building new voices:
either in existing supported languages or in new languages.
- Building domain specific voices:
by tailoring modes (text analysis, lexicon, prosody) or by
recording specific databases and building limited domain synthesizers.
- Building data-driven models of speech:
for prosody, letter to sound rules, and text analysers.
- How to use speech synthesis:
what it can do and what it can't, and how to get the best
possible synthetic voice for different applications.
- Concept-to-speech vs Text-to-speech