

5 Text processing

The first of the three major tasks in speech synthesis is the analysis of raw text into something that can be processed in a more reasonable manner.

In this section we will look at how to take arbitrary text and convert it to identifiable words chunked into reasonably sized utterances.

5.1 Text analysis

Consider how directly, or not, the written form of ordinary text follows its standard pronunciation.

Or worse still, consider this mail message (even with the headers deleted).

from awb@cstr.ed.ac.uk ("Alan W Black") on Thu 23 Nov 15:30:45:
>
>  ...  but, *I* wont make it :-) Can you tell me who's going?
>
  IMHO I think you should go, but I think the followign are going
     George Bush
     Bill Clinton
     and that other guy

Bob

-- 
                                                         _______
 +---------------------------------------------------+  |\\   //|
 | Bob Beck  E-mail bob@beck.demon.co.uk             |  | \\ // |
 +---------------------------------------------------+  |  > <  |
                                                        | // \\ |
                                          Alba gu brath |//___\\|

In the above there are a number of specific problems a speech synthesizer needs to address before it can adequately render this message as speech. At least the following need attention: the quoted text marked with ">", the emphasis marked by asterisks in "*I*", the smiley ":-)", the missing apostrophe in "wont", the list of names, the email addresses, and the ASCII art signature block.

We can split the task down as follows: identifying tokens, chunking tokens into utterances, and relating tokens to words. The following sections take each of these in turn.

5.2 Identifying tokens

Whitespace (space, tab, newline, and carriage return) can be viewed as separators.

Punctuation can also be separated from the raw tokens.

Festival converts text from files into an ordered list of tokens, each with its own preceding whitespace and succeeding punctuation as features of the token.
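These features can be inspected from Scheme. The following is a minimal sketch (show_token is our own illustrative helper; item.name, item.feat and utt.relation.items are standard Festival functions):

(define (show_token token)
 "Print a token's name with its whitespace and punc features."
 (format t "name=%s whitespace=%l punc=%s\n"
         (item.name token)
         (item.feat token "whitespace")
         (item.feat token "punc")))

;; e.g. after text analysis of an utterance utt:
;;   (mapcar show_token (utt.relation.items utt 'Token))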

5.3 Chunking into utterances

"Sentences end with a full stop." Unfortunately it is nowhere near as simple as that. We wish to chunk the text into reasonable sized chunks so that they can synthesized quickly and played so that there is as little time as possible between utterances, especially at the start.

Most synthesizers support some form of spooling which allows the next utterance to be synthesized while the previous one is actually playing. Synthesis time varies from machine to machine and synthesizer to synthesizer. There can be many factors, and they are not all due to the actual algorithms used in the synthesizer: resampling time, pauses introduced by audio hardware, and accessing files on remote disks often contribute as much to the overall timing as the algorithms and the machine speed itself.

In Festival we chunk tokens into utterances, which are what can most reasonably be recognized as sentences. Ideally chunks should be prosodic phrases, but that would require more analysis of the tokens before a decision about where the prosodic phrases should occur could be made. Festival uses a rule system to determine utterance breaks; it depends on the token itself and its context, and allows lookahead to the next token.

The currently used decision tree for determining end of utterance is as follows

((n.whitespace matches ".*\n.*\n[ \n]*") ;; A significant break in the text
  ((1))
  ((punc in ("?" ":" "!"))
   ((1))
   ((punc is ".")
    ;; This is to distinguish abbreviations vs periods
    ;; These are heuristics
    ((name matches "\\(.*\\..*\\|[A-Z][A-Za-z]?[A-Za-z]?\\|etc\\)")
     ((n.whitespace is " ")
      ((0))                  ;; if abbrev, a single space is not enough for a break
      ((n.name matches "[A-Z].*")
       ((1))
       ((0))))
     ((n.whitespace is " ")  ;; if it doesn't look like an abbreviation
      ((n.name matches "[A-Z].*")  ;; single space and non-cap is no break
       ((1))
       ((0)))
      ((1))))
    ((0)))))

Thus the above tree tries to deal with the difficult case where a token is terminated by a period but could be an abbreviation. An abbreviation is recognized as a token containing a dot, a capitalized word of at most three letters, or the word etc. When a token looks like an abbreviation there must be more than one space and the next word must be capitalized for it to signal a break. If the token doesn't look like an abbreviation, then either a longer break or a capitalized following word will signal a break.

This will fail for examples such as a sentence ending in an abbreviation-like token followed by a single space ("He lives in the U.S. His wife doesn't." gets no break after "U.S."), or a title like "Mr." at the end of a line followed by a capitalized name (which gets a spurious break).
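Because the tree is simply data, such cases can be patched by redefining it. Here is a sketch, assuming (as in current Festival versions) that the tree is held in the variable eou_tree; it adds a list of title abbreviations that never end an utterance:

(set! eou_tree
 '((n.whitespace matches ".*\n.*\n[ \n]*")  ;; a blank line always breaks
   ((1))
   ((punc in ("?" ":" "!"))
    ((1))
    ((punc is ".")
     ((name in ("Mr" "Mrs" "Dr" "Prof"))  ;; titles never end an utterance
      ((0))
      ((name matches "\\(.*\\..*\\|[A-Z][A-Za-z]?[A-Za-z]?\\|etc\\)")
       ((n.whitespace is " ")
        ((0))
        ((n.name matches "[A-Z].*")
         ((1))
         ((0))))
       ((n.whitespace is " ")
        ((n.name matches "[A-Z].*")
         ((1))
         ((0)))
        ((1)))))
     ((0))))))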

5.4 Tokens to words

Now we have our chunks, we can relate each token to zero or more words. This requires context-sensitive rules. It takes into account numbers, dates, money amounts, some abbreviations, email addresses, and random tokens: the aim is to identify and analyze tokens that have some internal structure. Note that some have different readings depending on dialect even when we agree on what they represent.

1100 meters -> eleven hundred or one thousand one hundred
$3.50 -> three dollars (and) fifty (cents)

The pronunciation of a number often depends on its type.

An example rule for dealing with such phrases as "$1.2 million" would be

(define (token_to_words token name)
 (cond
  ;; A dollar amount followed by "million", "billion", etc:
  ;; say the number (without the "$"), then the next token's name
  ((and (string-matches name "\\$[0-9,]+\\(\\.[0-9]+\\)?")
        (string-matches (item.feat token "n.name") ".*illion.?"))
   (append
    (builtin_english_token_to_words token (string-after name "$"))
    (list (item.feat token "n.name"))))
  ;; The "million" itself, preceded by a dollar amount:
  ;; say "dollars" in its place
  ((and (string-matches (item.feat token "p.name") 
                        "\\$[0-9,]+\\(\\.[0-9]+\\)?")
        (string-matches name ".*illion.?"))
   (list "dollars"))
  (t  ;; everything else is handled by the builtin rules
   (builtin_english_token_to_words token name))))
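With the rule above loaded, a quick check (the exact number words depend on the builtin rules):

(SayText "The deal is worth $1.2 million.")
;; heard as something like "the deal is worth one point two million dollars"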

But even this isn't 100% robust.

Homographs are words that are written the same but are pronounced differently. There are a number of different types of homographs which can be distinguished through different types of information.

Care should be taken when trying to deal with such a phenomenon. The question is, 'How often does it occur?'. The answer depends very much on the type of text you are dealing with.

There are, however, relatively few semantic homographs in languages; English has perhaps only a few hundred. This is probably related to the fact that text would be too difficult to read if too many words were homographs.

Although some phenomena can only realistically be treated by hand written rules, others depend on a host of different contextual information and can be better learned from suitable data. Festival supports a homograph disambiguation method based on yarowsky96. This technique covers all classes of homographs, those based on part of speech as well as "semantic" ones (e.g. "row" (boat and argument)), though part of speech homographs are usually dealt with by the part of speech tagger.

To use this disambiguation technique, first you need a large corpus of words. We've been using a collection of over 63 million words including various novels, newspapers, encyclopedias, research papers and email.

The stages involved in building a disambiguator for a particular homograph are as follows.

  1. Extract all occurrences of the homograph from the corpus.
  2. Label each occurrence with its class.
  3. Extract contextual features that will identify the class.
  4. Build a classification tree (or decision list) to classify occurrences.

This may seem like an impossibly large task but it is surprising how few occurrences of homographs there actually are. For example there are only 442 occurrences of the token bass in our corpus, and 6167 occurrences of the word lead.

Of course for some homographs there are many more. For example, there are hundreds of thousands of tokens containing only digits in our corpus. In this case we simply take some representative subset.

The features used to classify are relatively easy to find and improve on. Content words in a window of five words before and five words after are very good at disambiguating many semantic homographs, as discussed in yarowsky96. The classification tree for the token St, for street or saint, depends on the capitalization and punctuation of the immediately surrounding words. Often, while labelling the class of the occurrences of a homograph, potential discriminatory features come to mind.
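In the same tree format as the end of utterance tree above, the classifier this produces for St might look something like the following (the tests and classes here are invented for illustration, not the actual trained tree):

((p.name matches "[A-Z].*")  ;; capitalized word before, e.g. "Bernard St"
 ((street))
 ((n.name matches "[A-Z].*") ;; capitalized word after, e.g. "St Andrews"
  ((saint))
  ((street))))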

Using this technique we built a classifier for tokens containing only digits. Because there are many occurrences of such tokens we only used examples from two sub-corpora, namely four years of Time Magazine and 10 years of personal email. We classified around 100,000 occurrences of numbers into four classes: years, days (pronounced as ordinals), quantifiers and phone numbers (pronounced as digits). The distribution in the corpus was 42% quantifiers, 35% years, 19% days, and 3% phone numbers. We achieved an overall 97.4% correct classification on held out test data.

This technique is successful if the data has the appropriate coverage. But due to the changing nature of language we find that no matter how big your training set is, there will still be apparently common forms which never appear in it. For example, in building disambiguators for roman numerals, our database had no occurrences of the words Pentium or Palm appearing before roman numerals, even though in today's text these are common.

5.5 Summary of Festival's text processing

When Festival performs TTS what it really does is

  1. Read the text into tokens, recording each token's preceding
     whitespace and succeeding punctuation as features.
  2. Chunk the tokens into utterances using the end of utterance
     decision tree.
  3. Expand each token to zero or more words using the token to word
     rules (mode specific rules if a text mode is in use).
  4. Hand the resulting words on to the rest of the synthesis process.

5.6 Text modes

Another level of control that is desirable is mode specific processing, even when we have no explicit mark-up. Obviously the treatment of some tokens will differ between different types of text. For example a "/" (slash) inside a token in an email message is much more likely to identify a Unix path name than it is in a Reuters news article, where it probably identifies alternatives. Email messages have conventions for quoting (and signatures) which, when dealt with, make comprehension of the message much easier. Also some file formats have partial mark-up that is useful: LaTeX, for instance, has methods for marking emphasis which can easily be detected.

Festival supports the notion of text modes, following the notion of modes in Emacs, which allow customization of mode specific parameters.

Specifically it offers

filter
A Unix program filter for the file. In email-mode this removes most of the mail headers.
init_function
A Scheme function to be called when entering the mode. This allows selection of voice, addition of lexical entries, and mode specific tokenization rules to be set up.
exit_function
Called on exiting the mode, so you can tidy everything up and not leave mode specific rules that cause other synthesis modes to fail.

5.7 An example email text mode

In this example we will show a text mode which deals with email messages. It is not complete but shows some of the things you can do. We will filter the message extracting interesting parts (sender, subject and body). For token to word rules, we will set rules for email addresses, and remove greater than signs in quoted paragraphs. We will also switch voices for quoted text.

First we define a filter that extracts the From line and Subject from the headers, and the message body itself

#!/bin/sh
#  Email filter for Festival tts mode
#  usage: email_filter mail_message >tidied_mail_message
grep "^From: " $1
echo 
grep "^Subject: " $1
echo
# delete headers (up to first blank line)
sed '1,/^$/ d' $1

Now we define the init and exit functions. In this small example the only thing we do is save the existing token to word function and cause this mode to use ours. In the exit function we switch things back.

(define (email_init_func)
 "Called on starting email text mode."
 (set! email_previous_t2w_func token_to_words)
 (set! english_token_to_words email_token_to_words)
 (set! token_to_words email_token_to_words))

(define (email_exit_func)
 "Called on exit email text mode."
 (set! english_token_to_words email_previous_t2w_func)
 (set! token_to_words email_previous_t2w_func))

The function email_token_to_words must be defined. We'll discuss it in three parts.

(define (email_token_to_words token name)
  "Email specific token to word rules."
  (cond
   ((string-matches name "<.*@.*>")
     (append
      (email_previous_t2w_func token
       (string-after (string-before name "@") "<"))
      (cons 
       "at"
       (email_previous_t2w_func token
        (string-before (string-after name "@") ">")))))

This function will be called for each token in an utterance. It is called with two arguments: the token item (token) and the token's string (name).

The first clause identifies an email address, removes the angle brackets, and calls the token-to-word function recursively on the two halves so that the user name and the host are said separately, with the word "at" in between. For example <awb@cstr.ed.ac.uk> comes out as the words for "awb", then "at", then the words for "cstr.ed.ac.uk".

The second part of this function is designed to identify quotes in an email message.

   ((and (string-matches name ">")
         (string-matches (item.feat token "whitespace") 
                         "[ \t\n]*\n *"))
    (voice_don_diphone)
    nil ;; return nothing to say
   )

That is, when the token is a greater than sign and it appears at the start of a line. When this is true we select the alternate speaker and return no words to be said (i.e. the greater than sign is silent). This also has the advantage that the word relation created in the utterance will be continuous over the newline and quote marker, so it won't interfere with prosodic phrasing.

The third part deals with all other cases and simply calls the previously defined token to word function. But before that we must find out if we have switched back to unquoted text and hence need to switch back the speaker.

   (t  ;; for all other cases
     (if (string-matches (item.feat token "whitespace") 
                         ".*\n[ \n]*")
         (voice_gsw_diphone))
     (email_previous_t2w_func token name))))

Now the next stage is to define the mode itself. This is done through the variable tts_text_modes like this

(set! tts_text_modes
   (cons
    (list
      'email   ;; mode name
      (list         ;; email mode params
       (list 'init_func email_init_func)
       (list 'exit_func email_exit_func)
       '(filter "email_filter")))
    tts_text_modes))

You can test this mode with the example mail message in `FESTIVALDIR/examples/ex1.email'. You must load all of the above commands into Festival before the email mode will work. Save them in a file and name that file as an argument when starting Festival, then call tts on the file like this

(tts "FESTIVALDIR/examples/ex1.email" 'email)

Note there are a number of specific problems. No end of utterance is detected after "URL." and before "Alan" in the quoted text. This is because our end of utterance rules (as described above) don't deal with this case. Can you see how to modify them to fix this?

Other problems exist too; for example, the end of quoted text may not be detected if there is no blank line between the quoted and unquoted forms.

There is the question of whether text modes should produce STML or not. If they did produce STML then they could easily port to other synthesizers.

5.8 Mark-up languages

But wouldn't it all be easier if the input included information identifying what the text was? In many uses of synthesis such information does exist in the application and could easily be passed on to the synthesizer if there was a well defined way to do so. Synthesis uses such as language generation, machine translation, dialog systems and information providing systems often know significant details about what has to be said.

Sable is an XML-based language developed for marking up text (sproat98b, which is based on STML sproat97, and on previous work on SSML taylor97b). Sable was designed by a group involving AT&T, Bell Labs, Sun, Apple, Edinburgh University and CMU. It is intended as a standard that can be used across many different synthesis systems.

Input in Sable may be labelled, identifying pronunciation, breaks, emphasis etc.

An example best illustrates its use.

<?xml version="1.0"?>
<!DOCTYPE SABLE PUBLIC "-//SABLE//DTD SABLE speech mark up//EN" 
        "Sable.v0_2.dtd"
[]>
<SABLE>
<SPEAKER NAME="male1">

The boy saw the girl in the park <BREAK/> with the telescope.
The boy saw the girl <BREAK/> in the park with the telescope.

Some English first and then some Spanish.
<LANGUAGE ID="SPANISH">Hola amigos.</LANGUAGE>
<LANGUAGE ID="NEPALI">Namaste</LANGUAGE>

Good morning <BREAK /> My name is Stuart, which is spelled
<RATE SPEED="-40%">
<SAYAS MODE="literal">stuart</SAYAS> </RATE>
though some people pronounce it 
<PRON SUB="stoo art">stuart</PRON>.  My telephone number
is <SAYAS MODE="literal">2787</SAYAS>.

I used to work in <PRON SUB="Buckloo">Buccleuch</PRON> Place, 
but no one can pronounce that.

By the way, my telephone number is actually
<AUDIO SRC="http://www.cstr.ed.ac.uk/~awb/sounds/touchtone.2.au"/>
<AUDIO SRC="http://www.cstr.ed.ac.uk/~awb/sounds/touchtone.7.au"/>
<AUDIO SRC="http://www.cstr.ed.ac.uk/~awb/sounds/touchtone.8.au"/>
<AUDIO SRC="http://www.cstr.ed.ac.uk/~awb/sounds/touchtone.7.au"/>.
</SPEAKER>
</SABLE>

Sable currently supports tags for speaker selection, breaks, language switching, rate (and other prosodic) control, say-as modes such as spelling out, pronunciation substitution, and embedded audio, as illustrated above.

But Sable is still being developed and other tags are under consideration, including synthesis engine specific commands and labelling of phrase types (e.g. "question", "greeting", etc.).

There is much interest in defining such a mark-up language which is independent of any particular synthesizer, and work is continuing among a number of important laboratories to agree on a standard.

5.9 Normalization of Non-Standard Words

This section describes a cleaner, more general method for text normalization. The model was developed as part of a project at the 1999 Summer Workshop at Johns Hopkins University. See http://www.clsp.jhu.edu/ws99/projects/normal/ for a full description.

The idea behind this model was to formalize the notion of text analysis so that there need be only a small number of clearly defined, and hopefully trainable, modules that offer translation from strings of characters to lists of words.

In the NSW framework there are four basic stages of processing:

splitter
A simple tokenizer that splits not only "classic" whitespace separated tokens but also within such tokens where punctuation (e.g. hyphens) or capitalization suggests such a split.
type identifier
for each split token identify its type, one of around 20 types, identifying how the token is to be expanded.
token expander
for each typed token expand it to words (see the sketch after this list). In all except one case this expansion is pretty much deterministic, such as the expansion of numbers, dates, money and letter sequences. Only in the case of abbreviations is some extra work required.
language modelling
A language model is then used to select between possible alternative pronunciations of the output.
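As a concrete, much simplified sketch of the expander stage in Festival style Scheme (nsw_expand is an invented name and only a few of the types are covered):

(define (nsw_expand token name type)
 "Expand NAME to a list of words according to TYPE (a sketch)."
 (cond
  ((string-equal type "LSEQ")  ;; letter sequence, e.g. IBM -> (I B M)
   (symbolexplode name))
  ((string-equal type "SLNT")  ;; not spoken
   nil)
  ((string-equal type "EXPN")  ;; abbreviation: needs a lexicon or the
   (list name))                ;; expansion model described below
  (t                           ;; numbers etc are mostly deterministic
   (builtin_english_token_to_words token name))))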

This project wanted to look at how a text normalization function could be trained for different domains. To do this four basic text types were chosen.

NANTC
(North American News Text Corpus) presswire data from the New York Times, Wall Street Journal, etc. This was chosen as a baseline that many text conditioners and TTS engines are already tuned for. It consists of about 4.3 million tokens, of which around 8.8% were considered non-standard words.
Classified Ads
Because this genre is so productive we collected 415K tokens from various websites of real estate classified ads. 43.4% of these were non-standard words.
PC110
To allow testing on data similar to email (and data that could be freely distributed), we extracted 264K tokens from a mailing list about the IBM PC110 palmtop computer. 27.3% of these are non-standard words.
RFR
To show that it's not just geeky discussions that are full of non-standard words we extracted 209K tokens from the USENET group rec.food.recipes. 22% of these tokens are non-standard words.

The data was converted to a simple XML format and then each NSW was hand labelled (with a simple labelling tool that included simple prediction of the type of each token).

The NSW types were

EXPN
abbreviations, contractions e.g. adv, N.Y, mph, gov't
LSEQ
letter sequence e.g. CIA, D.C, CDs
ASWD
read as word, e.g. CAT, proper names
MSPL
misspelling e.g. geogaphy
NUM
number (cardinal) e.g. 12, 45, 1/2, 0.6
NORD
number (ordinal) e.g. May 7, 3rd, Bill Gates III
NTEL
telephone (or part of) e.g. 212 555-4523
NDIG
number as digits e.g. Room 101
NIDE
identifier e.g. 747, 386, I5, PC110, 3A
NADDR
number as street address e.g. 5000 Pennsylvania, 4523 Forbes
NZIP
zip code or PO Box e.g. 91020
NTIME
a (compound) time e.g. 3.20, 11:45
NDATE
a (compound) date e.g. 2/2/99, 14/03/87 (or US) 03/14/87
NYER
year(s) e.g. 1998 80s 1900s 2003
MONEY
money (US or otherwise) e.g. $3.45, HK$300, Y20,000, $200K
BMONY
money in tr/m/billions e.g. $3.45 billion
PRCT
percentage e.g. 75%, 3.4%
SLNT
not spoken, word boundary e.g. word boundary or emphasis character: M.bath, KENT*REALTY, _really_, ***Added
PUNC
not spoken, phrase boundary e.g. non-standard punctuation: "..." in e.g. DECIDE...Year, *** in $99,9K***Whites
FNSP
funny spelling e.g. slloooooww, sh*t
URL
url, pathname or email e.g. http://apj.co.uk, /usr/local, phj@teleport.com
NONE
token should be ignored e.g. ascii art, formatting junk

Although at first glance these types seem reasonable, after labelling we noted there is still some ambiguity in them. We did test our interlabeller agreement and found it very high, but as a large percentage of this task is trivial the problems are always in that last few percent, and there we found our labellers weren't consistent, especially when it came to splitting things.

After labelling a section such as

Today I bought a Sony NP-F530 1350maH. Like your 550 it is slightly larger than the native IBM battery pack. It's been 3 hours now on it's first charge - I am charging in the PC110.

will look like

Today I bought a Sony<W NSW="LSEQ"> NP-F530,</W><W NSW="SPLT"><WS NSW="NUM"> 1350</WS><WS NSW="EXPN">maH.</WS></W> Like your<W NSW="NIDE"> 550</W> it is slightly larger than the native<W NSW="LSEQ"> IBM</W> battery pack. It's been<W NSW="NUM"> 3</W> hours now on it's first charge - I am charging in the <W NSW="LSEQ"> PC110. </W>

Splitting of tokens only at white space, even for English, was considered to be too limiting: when looking at the Festival token to word rules (or the equivalent in the Bell Labs system) there seemed to be too many rules that were there just to further split tokens rather than to expand things to words. Thus the splitter additionally splits tokens that are letter/number conjunctions, (forms of) mixed case conjunctions and things containing punctuation. The splitter is actually implemented as a set of regular expressions, though this too has a (in some sense) weird set of exceptions such as money symbols, urls, telephone numbers and some others. A sketch of the idea is given below.
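Here is a hypothetical helper (not the actual NSW splitter) that breaks a letter/number conjunction such as PC110 at the letter-digit boundary:

(define (split_letter_number name)
 "Split NAME into sub-tokens wherever a letter is followed by a digit."
 (let ((chars (symbolexplode name))
       (toks nil)
       (cur ""))
  (while chars
   (let ((c (car chars)))
    (if (and (string-matches cur ".*[A-Za-z]")  ;; chunk so far ends in a letter
             (string-matches c "[0-9]"))        ;; and next char is a digit
        (begin
         (set! toks (append toks (list cur)))   ;; close off the chunk
         (set! cur c))
        (set! cur (string-append cur c))))
   (set! chars (cdr chars)))
  (append toks (list cur))))

;; (split_letter_number "PC110") => ("PC" "110")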

The second stage is to assign the type tag to each split token. This is done (by default) by CART models trained from the labelled data. For identifying the various alphabetic types (LSEQ, ASWD and EXPN), trigram letter language models were built that return an estimate of the probability that an alphabetic character sequence is a letter sequence, a pronounceable word, or an abbreviation. The results are fed into the CART classifier, rather than used directly.

The full classifier accuracy varies across domains, ranging from 98.1% correct on the news data to 91.8% on the PC110 (email) data.

Once classification is done a simple set of expanders is used to generate the words from the token plus its type. This is true except for EXPN labelled tokens (abbreviations): although they have been identified as abbreviations, we still need to identify what they are abbreviations of.

Finding the expansion of an abbreviation can most simply be done with an abbreviation lexicon. But in some cases a new abbreviation may occur that can be detected as an abbreviation but isn't in the lexicon. This model attempts to solve that. It relies on the assumption that, given an abbreviation, some full expansion of it appears somewhere in a corpus of the sort of text the text normalization model will be applied to. There are two parts to this abbreviation expansion model. The first is a model that predicts all possible expansions of an abbreviation. This is done with a WFST (weighted finite state transducer) built from a trained CART model that predicts the deletion of characters from full words to form abbreviations. The second is a language model that predicts the occurrence of those full words in text. Together, for the classifieds domain, this works with about 20% error.

The advantages of the NSW model are that there are only a small number of clearly defined modules, that these modules are trainable from labelled data rather than hand written, and that the whole model can therefore be retargeted to a new domain by collecting and labelling a suitable corpus.

5.10 Two interesting text analysis problems

5.10.1 Tokenization without whitespace

In most European languages white space is used to separate basic words, but this isn't true in languages like Chinese and Japanese: there, text is a continuous flow of characters. Punctuation is still often used but there is regularly no whitespace. Even newlines may be inserted within words to allow proper lining up of characters. Within English this problem partly occurs in compound words, and its consequences can be seen when finding pronunciations of unknown words by letter to sound rules.

Consider the word outhomer (a baseball term). Festival's letter to sound rules wrongly pronounce this as aw th am m er. Although we can see there is effectively a word break between the letters t and h, the letter to sound rules do not see this and map these two characters together to the single phone th. The reason we can see outhomer as the two words out and homer is the high frequency of these individual words and the lower frequency of the full word pronounced as aw th am m er. As another example, consider the word together: it could be split as to get her, but isn't, due to the relative frequency of the full word over the not unusual trigram to get her.

Thus the best split can be defined as the most probable of all possible splits. Mathematically we can put this as the split that maximizes the probability of the sentence, which we can estimate by

P(w_1 ... w_K) = prod_{i=1}^{K} P(w_i | w_{i-1}, ..., w_{i-N+1})

Thus we use, say, tri (N=3) or bi (N=2) grams to estimate the probability of each word. This technique is used in general for finding segmentations of Chinese and Japanese texts sproat96b, and it is quite successful.

However, in its simplest form it needs a pre-tokenized database of words in order to collect the statistics for the ngrams. One technique to solve this is an iterative approach: an unsegmented database is segmented with some simple algorithm, say longest matching, then statistics are collected, the statistical technique is applied, and the statistics are re-estimated. This process can be iterated until there are no more improvements on some held out test database. A sketch of the basic search is given below.
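Here is a minimal sketch of the search itself, using a unigram model for simplicity (word_prob is a hypothetical function returning an estimate of P(w); a real implementation would use N-grams and dynamic programming rather than this exhaustive recursion):

(define (best_split chars)
 "Return (prob words) for the most probable segmentation of CHARS,
a list of single character strings as returned by symbolexplode."
 (if chars
     (let ((best (list 0 nil))
           (head "")
           (rest nil)
           (p nil))
      (while chars
       (set! head (string-append head (car chars)))  ;; extend the first word
       (set! chars (cdr chars))
       (set! rest (best_split chars))                ;; best split of the rest
       (set! p (* (word_prob head) (car rest)))
       (if (> p (car best))
           (set! best (list p (cons head (car (cdr rest)))))))
      best)
     (list 1 nil)))

;; (best_split (symbolexplode "outhomer")) should return
;; (p ("out" "homer")) if word_prob favours those two words.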

5.10.2 Number pronunciation

In some languages, gender, case etc. affect how numbers are pronounced. That is, the pronunciation of the digit 1 depends on what it is referring to. For example in Spanish

1 niño --> un niño (one boy)
1 niña --> una niña (one girl)
1 hermano --> un hermano (one brother)
1 hermana --> una hermana (one sister)
1 país --> un país (one country)
1 raíz --> una raíz (one root)

Although it might seem that checking whether the following word ends in a or o would be a good disambiguator, there are many words in Spanish where gender cannot be easily identified from the surface form, and a lexicon is required. Furthermore the digit(s) may not be referring to the word immediately following the token, and getting the right pronunciation may ultimately require understanding of the sentence.
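A sketch of what such a rule might look like, assuming a hypothetical lexicon lookup gender_of that returns the string "fem" or "masc" for a word:

(define (spanish_number_token_to_words token name)
 "Expand the token \"1\" as \"un\" or \"una\" according to the gender
of the following word; pass everything else through unchanged."
 (if (string-equal name "1")
     (if (string-equal "fem"
                       (gender_of (item.feat token "n.name")))  ;; hypothetical
         (list "una")
         (list "un"))
     (list name)))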

Spanish is not the only language where declensions exist. In a Polish synthesizer we built some years ago, most of the work became concentrated on getting the pronunciation of numbers correct. The same would be true of all Slavic languages.

5.11 Exercises

  1. Add a token to word rule to say money values with two places after the point properly.
  2. Add a token to word rule to say numbers in dates as ordinals (first, second, third, etc.) rather than cardinals. Also add a token to word rule to say dates of the form "11/06/97" in their full form rather than number slash number ...
  3. Use Sable markup to tell a joke.
  4. Build a text mode for reading LaTeX, HTML, syslog messages, machine use summaries or such like.

5.12 Hints

  1. You will need to add a new definition for token_to_words; by convention, save the existing one and call that for things that don't match what you are looking for. Thus your file will look something like
    (set! previous_token_to_words token_to_words)
    
    (define (token_to_words token name)
       (cond
         ;; condition to recognize money tokens
         ;; return list of words 
        (t
         (previous_token_to_words token name))))
    
    The actual condition and returned list of words are similar to the treatment of email addresses described above. The regular expression for money is probably "\\$[0-9]+\\.[0-9][0-9]".
  2. Take the above and add to it. The rule is probably: if the token is a two digit number and the succeeding token's name is a month name (or an abbreviation of a month name) then return the word first, second, etc. You will need a new function to relate the digits to the pronunciation. If you don't know how to do that in Scheme the following will work
    (define (num-to-ordinal num)
    "Returns the ordinal as a list of words for num (up to 40)."
     (cdr
      (assoc (parse-number num)
       '((1 first) (2 second) (3 third) ...
         (39 thirty-ninth) (40 fortieth)))))
    
    Once you decide on the condition, remember that you need to return a list of words.
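    For example, once the table is filled in:
    
    (num-to-ordinal "3")   ;; returns (third)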

