Building Synthetic Voices
This chapter discusses some of the basic problems in analyzing text when trying to convert it to speech. Although it is often considered a trivial problem, not worth spending time on, anyone who has to actually listen to general text-to-speech systems quickly realises that it is not as easy to pronounce text as it first appears. Numbers, symbols, acronyms, and abbreviations appear to various degrees in different types of text. Even the most general types, like news stories and novels, still have tokens which do not have a simple pronunciation that can be found merely by looking up the token in a lexicon or by using letter-to-sound rules.
In any new language, or any new domain in which you wish to convert text to speech, building an appropriate text analysis module is necessary. To define more specifically what we mean by text analysis, we will consider this module as taking in strings of characters and producing strings of words, where we define words to be items for which a lexicon can provide pronunciations, either by direct lookup or by some form of letter-to-sound rules.
The degree of difficulty of this conversion task depends on the text type and language. For example, in languages like Chinese and Japanese there is, conventionally, no use of whitespace characters between words as found in most Western languages, so even identifying the token boundaries is an interesting task. Even in English text the proportion of simple pronounceable words to what we will term non-standard words can vary greatly. We define non-standard words (NSWs) to be those tokens which do not appear directly in the lexicon (at least as a first simplification). Thus tokens containing digits, abbreviations, and out-of-vocabulary words are all considered to be NSWs that require some form of identification before their pronunciation can be specified. Sometimes NSWs are ambiguous and some (often shallow) level of analysis is necessary to identify them. For example, in English the string of digits 1996 can have several different pronunciations depending on its use. If it is used as a year it is pronounced as nineteen ninety-six; if it is a quantity it is more likely pronounced as one thousand nine hundred (and) ninety-six; while if it is used as a telephone extension it can be pronounced simply as a string of digits, one nine nine six. Determining the appropriate type of expansion is the job of the text analysis module.
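The three readings of 1996 described above can be sketched in a few lines of Python. This is only an illustration, not the FestVox or Festival implementation: the function names (`as_year`, `as_quantity`, `as_digits`, `expand`) are hypothetical, the sketch covers only non-negative numbers below ten thousand, and the type of the token is supplied by hand rather than inferred from context, which is the hard part that the text analysis module has to solve.

```python
# Minimal sketch: expanding one digit string in three different ways.
# All names here are illustrative assumptions, not part of Festival.

ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]


def two_digits(n):
    """Return the words for an integer in the range 0..99."""
    if n < 20:
        return ONES[n]
    tens, ones = divmod(n, 10)
    return TENS[tens] if ones == 0 else f"{TENS[tens]}-{ONES[ones]}"


def as_quantity(n):
    """Read 0..9999 as a cardinal number."""
    parts = []
    thousands, rest = divmod(n, 1000)
    hundreds, rest = divmod(rest, 100)
    if thousands:
        parts.append(f"{ONES[thousands]} thousand")
    if hundreds:
        parts.append(f"{ONES[hundreds]} hundred")
    if rest or not parts:
        if parts:
            parts.append("and")
        parts.append(two_digits(rest))
    return " ".join(parts)


def as_year(n):
    """Read a four-digit year as two pairs (simple cases only)."""
    high, low = divmod(n, 100)
    if low == 0:
        return f"{two_digits(high)} hundred"
    if low < 10:
        return f"{two_digits(high)} oh {ONES[low]}"
    return f"{two_digits(high)} {two_digits(low)}"


def as_digits(token):
    """Read the token one digit at a time."""
    return " ".join(ONES[int(ch)] for ch in token)


def expand(token, kind):
    """Expand a digit token given its (already decided) type."""
    if kind == "year":
        return as_year(int(token))
    if kind == "quantity":
        return as_quantity(int(token))
    if kind == "digits":
        return as_digits(token)
    raise ValueError(f"unknown NSW type: {kind}")


if __name__ == "__main__":
    for kind in ("year", "quantity", "digits"):
        print(kind, "->", expand("1996", kind))
```

The expansion functions themselves are mechanical. Choosing which one to call for a given token is where the ambiguity lies.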
Much of this chapter is based on a project that was carried out at a summer workshop at Johns Hopkins University in 1999 [JHU-NSW-99] and later published in [Sproat00]. The tools and techniques developed at that workshop were further developed and documented, and are now distributed as part of the FestVox project. After a discussion of the problem in more detail, concentrating on English examples, a full presentation of the NSW text analysis technique will be given with a simple example. After that we will address different approaches that can be taken in Festival to build general and customized text analysis models. Then we will address a number of specific problems that appear in text analysis in various languages, including homograph disambiguation, number pronunciation in Slavic languages, and segmentation in Chinese.
In an attempt to avoid relying solely on a bunch of "hacky" rules, we can better define the task of analyzing text using a number of statistically trained models, built from either labeled or unlabeled text from the desired domain. At first approximation it may seem to be a trivial problem, but even in what is considered clean text, such as press wire news articles, there are enough non-standard words to make the synthesis sound bad without proper text analysis.
Full NSW model description and justification to be added; for now, refer to the following (older) parts.