The first of the three major tasks in speech synthesis is the analysis of raw text into something that can be processed in a more reasonable manner. In this section we will look at how to take arbitrary text and convert it into identifiable words chunked into reasonably sized utterances.
Consider the following examples to see how directly the written form follows standard pronunciation.
Or, worse still, consider this mail message (even with the headers deleted):
from awb@cstr.ed.ac.uk ("Alan W Black") on Thu 23 Nov 15:30:45:
>
> ... but, *I* wont make it :-)  Can you tell me who's going?
>
IMHO I think you should go, but I think the followign are going
   George Bush
   Bill Clinton
   and that other guy

Bob
--
 _______      +---------------------------------------------------+
|\\   //|     | Bob Beck   E-mail bob@beck.demon.co.uk            |
| \\ // |     +---------------------------------------------------+
|  > <  |
| // \\ |     Alba gu brath
|//___\\|
In the above there are a number of specific problems a speech synthesizer needs to address before it can adequately render this message as speech. At least the following need attention:
We can split the task down as follows:
Whitespace (space, tab, newline, and carriage return) can be viewed as separators.
Punctuation can also be separated from the raw tokens.
Festival converts text from files into an ordered list of tokens each with its own preceding whitespace and succeeding punctuation as features of the token.
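As a rough illustration of this representation, here is a small Python sketch (not Festival's actual implementation, which is in C++) that splits raw text into tokens carrying their preceding whitespace and succeeding punctuation:

```python
import re

def tokenize(text):
    """Split raw text into (whitespace, token, punctuation) triples,
    mirroring Festival's token representation (illustrative only)."""
    tokens = []
    # Each match captures the whitespace run preceding a token and the
    # token itself; trailing punctuation is then stripped off the token.
    for m in re.finditer(r"(\s*)(\S+)", text):
        whitespace, raw = m.group(1), m.group(2)
        stripped = raw.rstrip(".,!?:;\"')")
        punc = raw[len(stripped):]
        tokens.append((whitespace, stripped, punc))
    return tokens
```

For example, tokenize("Hello, world!") yields two tokens, with "," and "!" attached as punctuation features rather than as tokens of their own.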
"Sentences end with a full stop." Unfortunately it is nowhere near as simple as that. We wish to chunk the text into reasonable sized chunks so that they can synthesized quickly and played so that there is as little time as possible between utterances, especially at the start.
Most synthesizers support some form of spooling, which allows the next utterance to be synthesized while the previous one is actually playing. Synthesis time varies from machine to machine and synthesizer to synthesizer. There can be many factors, and they are not all due to the actual algorithms used in the synthesizer. Resampling time, pauses introduced by audio hardware, and accessing files on remote disks often contribute as much to the overall timing as the algorithms and the machine speed itself.
In Festival we chunk tokens into utterances, which are what can be most reasonably recognized as sentences. Ideally chunks should be prosodic phrases, but that would require more analysis of the tokens before a decision about where the prosodic phrases should occur can be made. Festival uses a rule system to determine utterance breaks; it depends on the token itself and its context, and it allows lookahead to the next token.
The currently used decision tree for determining end of utterance is as follows:
((n.whitespace matches ".*\n.*\n[ \n]*") ;; A significant break in the text
 ((1))
 ((punc in ("?" ":" "!"))
  ((1))
  ((punc is ".")
   ;; This is to distinguish abbreviations vs periods
   ;; These are heuristics
   ((name matches "\\(.*\\..*\\|[A-Z][A-Za-z]?[A-Za-z]?\\|etc\\)")
    ((n.whitespace is " ")
     ((0)) ;; if abbrev, single space is not enough for a break
     ((n.name matches "[A-Z].*")
      ((1))
      ((0))))
    ((n.whitespace is " ") ;; if it doesn't look like an abbreviation
     ((n.name matches "[A-Z].*") ;; single space and non-cap is no break
      ((1))
      ((0)))
     ((1))))
   ((0)))))
The difficult cases the above tree tries to deal with are those where a token is terminated by a period but could be an abbreviation. An abbreviation is recognized as anything containing a dot, a capitalized word of up to three letters, or "etc". When an abbreviation is detected, there must be more than a single space and the next word must be capitalized to signal a break. If the token doesn't appear to be an abbreviation, then any longer break, or a single space followed by a capitalized word, will signal a break.
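The same heuristics can be sketched procedurally. The following Python function is an illustrative translation of the decision tree above, not the code Festival actually runs:

```python
import re

def utt_break(name, punc, n_whitespace, n_name):
    """Rough rendering of the utterance-break decision tree: returns
    True when an utterance break should be placed after this token."""
    if re.search(r"\n.*\n", n_whitespace):   # significant break in the text
        return True
    if punc in ("?", ":", "!"):
        return True
    if punc == ".":
        # Heuristic abbreviation test: contains a dot, is a capitalized
        # word of at most three letters, or is "etc"
        if re.fullmatch(r".*\..*|[A-Z][A-Za-z]?[A-Za-z]?|etc", name):
            if n_whitespace == " ":
                return False                 # single space: no break
            return n_name[:1].isupper()      # longer gap + capital: break
        if n_whitespace == " ":
            return n_name[:1].isupper()      # non-abbrev: capital signals break
        return True                          # non-abbrev + long gap: break
    return False
```

So "etc." followed by a single space does not break, while an ordinary word ending in "." followed by a capitalized word does.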
This will fail for such examples as
Now that we have our chunks, we can relate each token to zero or more words. This requires context-sensitive rules and takes into account numbers, dates, money, some abbreviations, email addresses and random tokens; that is, it identifies and analyzes tokens that have some internal structure. Note that some tokens have different readings depending on dialect, even when we agree on what they represent.
1100 meters -> eleven hundred  or  one thousand one hundred
$3.50       -> three dollars (and) fifty (cents)
Pronunciation of numbers often depends on their type.
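For illustration, here is a toy Python expander showing how the same digit string is read differently depending on its assigned type; the word lists and type names are this sketch's own, not Festival's:

```python
def say_number(token, ntype):
    """Expand a digit string according to its token type (a toy sketch)."""
    ones = ["zero", "one", "two", "three", "four", "five", "six",
            "seven", "eight", "nine"]
    teens = ["ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
             "sixteen", "seventeen", "eighteen", "nineteen"]
    tens = ["", "", "twenty", "thirty", "forty", "fifty", "sixty",
            "seventy", "eighty", "ninety"]
    def pair(n):                          # 0..99
        if n < 10:
            return ones[n]
        if n < 20:
            return teens[n - 10]
        return tens[n // 10] + ("" if n % 10 == 0 else " " + ones[n % 10])
    if ntype == "digits":                 # e.g. phone numbers
        return " ".join(ones[int(d)] for d in token)
    n = int(token)
    if ntype == "year":                   # read in pairs: 19 96
        return pair(n // 100) + (" hundred" if n % 100 == 0
                                 else " " + pair(n % 100))
    words = []                            # plain quantity
    if n >= 1000:
        words += [pair(n // 1000), "thousand"]
        n %= 1000
    if n >= 100:
        words += [pair(n // 100), "hundred"]
        n %= 100
    if n or not words:
        words.append(pair(n))
    return " ".join(words)
```

The same token "1100" thus comes out as "eleven hundred" under the year-style reading and "one thousand one hundred" as a quantity.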
An example rule for dealing with such phrases as "$1.2 million" would be:
(define (token_to_words token name)
  (cond
   ((and (string-matches name "\\$[0-9,]+\\(\\.[0-9]+\\)?")
         (string-matches (item.feat token "n.name")
                         ".*illion.?"))
    (append
     (builtin_english_token_to_words token (string-after name "$"))
     (list (item.feat token "n.name"))))
   ((and (string-matches (item.feat token "p.name")
                         "\\$[0-9,]+\\(\\.[0-9]+\\)?")
         (string-matches name ".*illion.?"))
    (list "dollars"))
   (t
    (builtin_english_token_to_words token name))))
But even this isn't 100% robust.
Homographs are words that are written the same but are pronounced differently. There are a number of different types of homograph, which can be distinguished through different types of information.
project (part of speech: the noun and the verb differ in stress)
bass, tear (semantic: the fish vs. the instrument; crying vs. ripping)
Nice, Begin, Said (proper names vs. ordinary words, cued by capitalization)
Chapter II, James II (roman numerals: "two" vs. "the second")
5-3, high/low, usr/local (how "-" and "/" are read depends on context)
Care should be taken when trying to deal with such phenomena. The question is, "How often does it occur?", and the answer depends very much on the type of text you are dealing with.
There are, however, relatively few semantic homographs in a language; English has perhaps only a few hundred. This is probably related to the fact that text would be too difficult to read if too many words were homographs.
Although some phenomena can only realistically be treated by hand-written rules, others depend on a host of different contextual information and can be better learned from suitable data. Festival supports a homograph disambiguation method based on yarowsky96. This technique covers all classes of homographs, those based on part of speech as well as "semantic" ones (e.g. "row": the boat and the argument), though part-of-speech homographs are usually dealt with by the part-of-speech tagger.
To use this disambiguation technique, first you need a large corpus of words. We've been using a collection of over 63 million words including various novels, newspapers, an encyclopedia, research papers and email.
The stages involved in building a disambiguator for a particular homograph are as follows.
This may seem like an impossibly large task, but it is surprising how few occurrences of homographs there actually are. For example, there are only 442 occurrences of the token bass in our corpus, and 6167 occurrences of the word lead.
Of course for some homographs there are many more. For example, there are hundreds of thousands of tokens containing only digits in our corpus; in this case we simply take some representative subset.
The features used to classify are relatively easy to find and improve on. Content words within five words before and five words after are very good at disambiguating many semantic tokens, as discussed in yarowsky96. The classification tree for the token St, for street or saint, depends on the capitalization and punctuation of the immediately surrounding words. It is often the case that while labelling the class of the occurrences of a homograph, potential discriminatory features come to mind.
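A sketch of such features: the function below collects content words in a window of five words either side of a homograph, plus its immediate neighbours, much as described above (the stop-word list here is an arbitrary assumption of this sketch):

```python
def context_features(words, i, window=5):
    """Collect Yarowsky-style context features for the homograph at
    position i in a list of words (simplified illustration)."""
    stop = {"the", "a", "an", "of", "to", "in", "and", "is", "it", "on"}
    feats = set()
    lo, hi = max(0, i - window), min(len(words), i + window + 1)
    for j in range(lo, hi):
        if j != i and words[j].lower() not in stop:
            feats.add("ctx:" + words[j].lower())    # bag-of-words context
    if i > 0:
        feats.add("prev:" + words[i - 1].lower())   # immediate neighbours
    if i + 1 < len(words):
        feats.add("next:" + words[i + 1].lower())
    return feats
```

Features such as "ctx:guitar" near "bass" are exactly the kind of evidence a decision list or CART can exploit.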
Using this technique we built a classifier for tokens containing only digits. Because there are many occurrences of such tokens, we only used examples from two sub-corpora, namely four years of Time Magazine and 10 years of personal email. We classified around 100,000 occurrences of numbers into four classes: years, days (pronounced as ordinals), quantifiers and phone numbers (pronounced as digits). The distribution in the corpus was 42% numbers, 35% years, 19% days, and 3% phone numbers. We achieved an overall 97.4% correct classification on held-out test data.
This technique is successful if the data has the appropriate coverage. But due to the changing nature of language we find that no matter how big your training set is, there will still be apparently common forms which never appear in it. For example, in building disambiguators for roman numerals, our database had no occurrences of the words Pentium or Palm appearing before roman numerals, even though in today's text these are common.
When Festival performs TTS, what it really does is call utt.synth and utt.play on each utterance; utt.synth runs further analysis on each token in the utterance, converting it to one or more words.
Another level of control that is desirable is mode-specific processing, even when we have no explicit mark-up. Obviously the treatment of some tokens will differ between different types of text. For example, a "/" (slash) inside a token in an email message is much more likely to identify a Unix path name than when it appears within a Reuters news article, where it probably identifies alternatives. Email messages have conventions for quoting (and signatures) which can be dealt with to make comprehension of the message much easier. Also some file formats have partial mark-up that is useful: LaTeX has methods for marking emphasis which can easily be detected.
Festival supports the notion of text modes, following modes in Emacs, which allow customization of mode-specific parameters.
Specifically it offers:
In this example we will show a text mode which deals with email messages. It is not complete but shows some of the things you can do. We will filter the message, extracting the interesting parts (sender, subject and body). For the token-to-word rules, we will set rules for email addresses and remove greater-than signs in quoted paragraphs. We will also switch voices for quoted text.
First we define a filter that extracts the From line and Subject from the headers, plus the message body itself:
#!/bin/sh
# Email filter for Festival tts mode
# usage: email_filter mail_message >tidied_mail_message
grep "^From: " $1
echo
grep "^Subject: " $1
echo
# delete headers (up to first blank line)
sed '1,/^$/ d' $1
Now we define the init and exit functions. In this small example the only thing we do is save the existing token-to-word function and cause this mode to use ours. In the exit function we switch things back:
(define (email_init_func)
 "Called on starting email text mode."
 (set! email_previous_t2w_func token_to_words)
 (set! english_token_to_words email_token_to_words)
 (set! token_to_words email_token_to_words))

(define (email_exit_func)
 "Called on exit email text mode."
 (set! english_token_to_words email_previous_t2w_func)
 (set! token_to_words email_previous_t2w_func))
The function email_token_to_words must be defined. We'll discuss it in three parts.
(define (email_token_to_words token name)
 "Email specific token to word rules."
 (cond
  ((string-matches name "<.*@.*>")
   (append
    (email_previous_t2w_func token
     (string-after (string-before name "@") "<"))
    (cons
     "at"
     (email_previous_t2w_func token
      (string-before (string-after name "@") ">")))))
This function will be called for each token in an utterance. It is called with two arguments: the item (token) and the actual token's string (name). The first clause identifies an email address, removes the angle brackets, and then calls the token-to-word function recursively to say the name and address separately.
The second part of this function is designed to identify quotes in an email message.
  ((and (string-matches name ">")
        (string-matches (item.feat token "whitespace")
                        "[ \t\n]*\n *"))
   (voice_don_diphone)
   nil ;; return nothing to say
  )
That is, the clause fires when the token is a greater-than sign appearing at the start of a line. When this is true we select the alternate speaker and return no words to be said (i.e. the greater-than sign is silent). This also has the advantage that the word relation created in the utterance will be continuous over the newline and quote marker, so it won't interfere with prosodic phrasing.
The third part deals with all other cases and simply calls the previously saved token-to-word function. But before that, we must find out if we have switched back to unquoted text and hence need to switch back the speaker.
  (t ;; for all other cases
   (if (string-matches (item.feat token "whitespace")
                       ".*\n[ \n]*")
       (voice_gsw_diphone))
   (email_previous_t2w_func token name))))
The next stage is to define the mode itself. This is done through the variable tts_text_modes, like this:
(set! tts_text_modes
   (cons
    (list
     'email   ;; mode name
     (list    ;; email mode params
      (list 'init_func email_init_func)
      (list 'exit_func email_exit_func)
      '(filter "email_filter")))
    tts_text_modes))
You can test this mode with the example mail message in `FESTIVALDIR/examples/ex1.email'. You must load all of the above commands into Festival before the email mode will work. Save them in a file and give that file as an argument when starting Festival, then call tts on the example file like this:
(tts "FESTIVALDIR/examples/ex1.email" 'email)
Note there are a number of specific problems: no end of utterance is detected after "URL." and before "Alan" in the quoted text. This is because our end-of-utterance rules (as described above) don't deal with this case. Can you see how to modify them to fix this?
Other problems exist too; for example, the end of quoted text may not be detected if there is no blank line between the quoted and unquoted forms.
There is the question of whether text modes should produce STML or not. If they did produce STML, they could easily be ported to other synthesizers.
But wouldn't it all be easier if the input included information identifying what the text was? In many uses of synthesis such information does exist in the application and could easily be passed on to the synthesizer if there were a well-defined way to do so. Synthesis uses such as language generation, machine translation, dialog systems and information-providing systems often know significant details about what has to be said.
Sable is an XML-based language developed for marking up text (sproat98b), which is based on STML (sproat97) and on previous work on SSML (taylor97b). Sable was designed by a group involving AT&T, Bell Labs, Sun, Apple, Edinburgh University and CMU. It is intended as a standard that can be used across many different synthesis systems.
Input in Sable may be labelled, identifying pronunciation, breaks, emphasis etc.
An example best illustrates its use.
<?xml version="1.0"?>
<!DOCTYPE SABLE PUBLIC "-//SABLE//DTD SABLE speech mark up//EN"
          "Sable.v0_2.dtd" []>
<SABLE>
<SPEAKER NAME="male1">
The boy saw the girl in the park <BREAK/> with the telescope.
The boy saw the girl <BREAK/> in the park with the telescope.
Some English first and then some Spanish.
<LANGUAGE ID="SPANISH">Hola amigos.</LANGUAGE>
<LANGUAGE ID="NEPALI">Namaste</LANGUAGE>
Good morning <BREAK />
My name is Stuart, which is spelled
<RATE SPEED="-40%">
<SAYAS MODE="literal">stuart</SAYAS>
</RATE>
though some people pronounce it
<PRON SUB="stoo art">stuart</PRON>.
My telephone number is
<SAYAS MODE="literal">2787</SAYAS>.
I used to work in <PRON SUB="Buckloo">Buccleuch</PRON> Place,
but no one can pronounce that.
By the way, my telephone number is actually
<AUDIO SRC="http://www.cstr.ed.ac.uk/~awb/sounds/touchtone.2.au"/>
<AUDIO SRC="http://www.cstr.ed.ac.uk/~awb/sounds/touchtone.7.au"/>
<AUDIO SRC="http://www.cstr.ed.ac.uk/~awb/sounds/touchtone.8.au"/>
<AUDIO SRC="http://www.cstr.ed.ac.uk/~awb/sounds/touchtone.7.au"/>.
</SPEAKER>
</SABLE>
Sable currently supports:
But Sable is still being developed, and other tags are under consideration, including synthesis-engine-specific commands and labelling of phrase types (e.g. "question", "greeting", etc.).
There is much interest in defining such a mark-up language that is independent of any particular synthesizer, and work is continuing among a number of important laboratories on agreeing a standard.
A cleaner alternative is a more general model of text normalization, developed as part of a project at the 1999 Summer Workshop at Johns Hopkins University. See http://www.clsp.jhu.edu/ws99/projects/normal/ for a full description.
The idea behind this model was to formalize the notion of text analysis so that there need be only a small number of clearly defined, and hopefully trainable, modules that offer translation from strings of characters to lists of words.
In the NSW framework there are 4 basic stages of processing:
This project wanted to look at how a text normalization function could be trained for different domains. To do this four basic text types were chosen.
The data was converted to a simple XML format and then each NSW (non-standard word) was hand labelled (with a simple labelling tool that included simple prediction of the type of the token).
The NSW types were
EXPN   abbreviation, to be expanded (e.g. "gov't", "N.Y.")
LSEQ   letter sequence, read letter by letter (e.g. "IBM")
ASWD   read as a word (e.g. "NATO")
MSPL   misspelling
NUM    number, cardinal
NORD   number, ordinal
NTEL   telephone number (or part of one)
NDIG   number read as digits
NIDE   identifier (e.g. "PC110")
NADDR  number as a street address
NZIP   zip code or PO box
NTIME  a (compound) time
NDATE  a (compound) date
NYER   year(s)
MONEY  money amount
BMONY  money with a quantifier (e.g. "$3.45 billion")
PRCT   percentage
SLNT   not spoken, word-internal punctuation
PUNC   not spoken, phrase-boundary punctuation
FNSP   funny spelling (e.g. "slloooooww")
URL    url, pathname or email address
NONE   token should be ignored
Although at first glance these seem reasonable, after labelling we noted that there is still some ambiguity in them. We did test our inter-labeller agreement and found it very high, but as a large percentage of this task is trivial, the problems are always in that last few percent, and there we found our labellers weren't consistent, especially when it came to splitting things.
After labelling a section such as
Today I bought a Sony NP-F530 1350maH. Like your 550 it is slightly larger than the native IBM battery pack. It's been 3 hours now on it's first charge - I am charging in the PC110.
will look like
Today I bought a Sony<W NSW="LSEQ"> NP-F530,</W><W NSW="SPLT"><WS
NSW="NUM"> 1350</WS><WS NSW="EXPN">maH.</WS></W> Like your<W
NSW="NIDE"> 550</W> it is slightly larger than the native<W
NSW="LSEQ"> IBM</W> battery pack. It's been<W NSW="NUM"> 3</W> hours
now on it's first charge - I am charging in the <W NSW="LSEQ">
PC110. </W>
Splitting of tokens only at white space, even for English, was considered too limiting: looking at the Festival token-to-word rules (or the equivalent in the Bell Labs system), there seemed to be too many rules that were there just to further split objects rather than expanding things to words. Thus the splitter additionally splits tokens that are letter/number conjunctions, (forms of) mixed-case conjunctions, and things containing punctuation. The splitter is actually implemented as a set of regular expressions. But this too has a (in some sense) weird set of exceptions such as money symbols, urls, telephone numbers and some others.
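A toy version of such a splitter can be written as a few regular expressions. This Python sketch (much simpler than the real rule set) splits at letter/number boundaries, lowercase-to-uppercase transitions and internal punctuation, while leaving money amounts, URLs and long digit strings intact:

```python
import re

def split_token(tok):
    """Split a token into sub-tokens, NSW-splitter style (a sketch;
    the exception patterns here are illustrative, not the real set)."""
    if re.fullmatch(r"\$[0-9.,]+|https?://\S+|[0-9-]{7,}", tok):
        return [tok]                      # exceptions: leave intact
    # mark boundaries with spaces, then split on them and on punctuation
    tok = re.sub(r"([0-9])([A-Za-z])", r"\1 \2", tok)
    tok = re.sub(r"([A-Za-z])([0-9])", r"\1 \2", tok)
    tok = re.sub(r"([a-z])([A-Z])", r"\1 \2", tok)
    parts = re.split(r"[\s/_-]+", tok)
    return [p for p in parts if p]
```

So "NP-F530" splits into "NP", "F", "530", while "$3.50" survives as one token for the money expander.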
The second stage was to assign the token tags to each split token. This is done (by default) by CART models trained from the labelled data. For identifying the various alphabetic types (LSEQ, ASWD and EXPN), trigram letter language models were built that return an estimate of the probability that some alphabetic character sequence is a letter sequence, a pronounceable word, or an abbreviation. The results are fed into the CART classifier rather than used directly.
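The idea of a letter language model can be illustrated in a few lines. This Python sketch trains a letter-trigram model with add-one smoothing and returns an average per-letter log probability, so sequences that look like the pronounceable training words score higher than improbable letter strings:

```python
from collections import defaultdict
import math

def train_letter_lm(words):
    """Train a letter-trigram model (add-one smoothing); returns a
    scorer giving average log probability per letter (toy version)."""
    counts = defaultdict(int)    # trigram counts
    bigrams = defaultdict(int)   # bigram (history) counts
    alphabet = set("^$")
    for w in words:
        s = "^^" + w.lower() + "$"       # pad with start/end markers
        alphabet.update(s)
        for i in range(len(s) - 2):
            counts[s[i:i + 3]] += 1
            bigrams[s[i:i + 2]] += 1
    V = len(alphabet)
    def score(w):
        s = "^^" + w.lower() + "$"
        logp = 0.0
        for i in range(len(s) - 2):
            logp += math.log((counts[s[i:i + 3]] + 1)
                             / (bigrams[s[i:i + 2]] + V))
        return logp / (len(s) - 2)       # normalize by length
    return score
```

A model trained on ordinary words assigns a higher score to word-like strings than to letter-sequence-like ones; in the real system such scores are features for the CART, not decisions in themselves.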
The full classifier accuracy varies across domains, ranging from 98.1% correct on the news data to 91.8% on the PC110 (email) data.
Once classification is done, a simple set of expanders is used to generate the words from the token plus type. This is true except for EXPN-labelled tokens (abbreviations): although they have been identified as abbreviations, we still need to identify what they are an abbreviation of.
Finding out what the expansion of an abbreviation is can most simply be done with an abbreviation lexicon. But in some cases a new abbreviation may occur that can be detected as an abbreviation but isn't in the lexicon. This model attempts to solve that problem. It relies on the assumption that, given an abbreviation, some full expansion of it appears somewhere in a corpus of the sort of text the text normalization model will be applied to. There are two parts to this abbreviation expansion model. The first is a model that predicts all possible expansions of an abbreviation; this is done with a WFST (weighted finite state transducer) built from a trained CART model that predicts the deletion of characters from full words to form abbreviations. The second stage is a language model that predicts the occurrence of those full words in text. Together, for the classifieds domain, this works with about 20% error.
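The candidate-generation half of this idea can be sketched very simply: treat an abbreviation as a word with characters deleted, so any lexicon word containing the abbreviation's letters in order (and sharing its first letter) is a candidate, ranked here by raw corpus frequency instead of a full WFST plus language model:

```python
def expansions(abbrev, lexicon):
    """Propose expansions of an abbreviation as lexicon words that
    contain its letters in order, ranked by corpus frequency (a crude
    stand-in for the WFST + language model described above)."""
    def subsequence(short, long):
        it = iter(long)
        return all(c in it for c in short)   # letters appear in order
    a = abbrev.lower().rstrip(".")
    cands = [(freq, w) for w, freq in lexicon.items()
             if w[0] == a[0] and subsequence(a, w)]
    return [w for freq, w in sorted(cands, reverse=True)]
```

For example, "govt" maps onto "government" from a small lexicon of frequency counts, while "governor" is rejected because it lacks the final "t".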
The advantages of the NSW model are
In most European languages white space is used to separate basic words, but this isn't true in languages like Chinese and Japanese, where text is a continuous flow of characters. Punctuation is still often used, but there is regularly no whitespace; even newlines may be inserted within words to allow proper alignment of characters. Within English this problem partly occurs in compound words, and consequences of it can be seen when finding pronunciations of unknown words by letter-to-sound rules.
Consider the word outhomer (a baseball term). Festival's letter-to-sound rules wrongly pronounce this as aw th am m er. Although we can see there is effectively a word break between the letters t and h, the letter-to-sound rules do not see this and map these two characters together as a single phone th. The reason we can see outhomer as the two words out and homer is the high frequency of these individual words and the lower frequency of the full word pronounced as aw th am m er.
As another example, consider the word together: it could be split as to get her, but isn't, due to the relative frequency of the full word over the not unusual trigram to get her.
Thus the best split can be defined as the most probable of all possible splits. Mathematically, we can define this as the split that maximizes the probability of the sentence, which we can estimate by
P(w1, ..., wK) = product for i = 1 to K of P(wi | wi-1, ..., wi-N+1)
Thus we use, say, tri- (N=3) or bi- (N=2) grams to estimate the probability of each word. This technique is used in general for finding segmentations of Chinese and Japanese texts (sproat96b), and it is quite successful.
However, in its simplest form it needs a pre-tokenized database of words in order to collect the statistics for the n-grams. One technique to solve this is an iterative approach: an unsegmented database is segmented with some simple algorithm, say longest matching; statistics are collected; the statistical technique is applied; and the statistics are re-estimated. This process can be iterated until there are no more improvements on some held-out test database.
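The maximization itself is a standard dynamic program. The sketch below uses unigram probabilities (N=1 rather than the bi- or trigrams above, for brevity); freqs and total are assumed corpus counts:

```python
import math

def segment(text, freqs, total):
    """Best segmentation of unspaced text by unigram probabilities,
    via dynamic programming (illustrative sketch)."""
    best = [(0.0, 0)]                     # per prefix: (log prob, split point)
    for i in range(1, len(text) + 1):
        cands = []
        for j in range(max(0, i - 20), i):
            w = text[j:i]
            if w in freqs:
                cands.append((best[j][0] + math.log(freqs[w] / total), j))
        # fall back to a heavily penalized single character if no word fits
        best.append(max(cands) if cands
                    else (best[i - 1][0] + math.log(1e-9), i - 1))
    words, i = [], len(text)
    while i > 0:                          # backtrack through split points
        j = best[i][1]
        words.append(text[j:i])
        i = j
    return list(reversed(words))
```

With plausible counts this keeps "together" whole but splits "outhomer" into "out" and "homer", exactly the behaviour argued for above.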
In some languages, gender, case, etc. affect how numbers are pronounced. That is, the pronunciation of the digit 1 depends on what it is referring to. For example, in Spanish:
1 ni@~no   -->  un ni@~no     (one boy)
1 ni@~na   -->  una ni@~na    (one girl)
1 hermano  -->  un hermano    (one brother)
1 hermana  -->  una hermana   (one sister)
1 pais     -->  un pais       (one country)
1 ra'iz    -->  una ra'iz     (one root)
Although it might seem possible that checking whether the following word ends in a or o would be a good disambiguator, there are many words in Spanish where gender cannot be easily identified from surface form, and a lexicon is required. Furthermore, the digit(s) may not be referring to the word immediately following the token, and getting the right pronunciation may ultimately require understanding of the sentence.
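A minimal sketch of the lexicon-based approach for the digit 1 (the gender lexicon entries here are illustrative assumptions):

```python
def spanish_one(noun, gender_lexicon):
    """Pick "un" or "una" for the digit 1 using a gender lexicon, since
    surface form alone is unreliable (toy sketch)."""
    article = "una" if gender_lexicon.get(noun) == "f" else "un"
    return article + " " + noun
```

Even this tiny example needs the lexicon: "raiz" takes "una" despite not ending in a, which no surface-form rule would get right.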
Spanish is not the only language where declensions exist. In a Polish synthesizer we built some years ago, most of the work ended up concentrated on getting the pronunciation of numbers correct; this would be true of all Slavic languages.
When you write your own token_to_words function, by convention you save the existing one and call that for things that don't match what you are looking for. Thus your file will look something like:
(set! previous_token_to_words token_to_words)
(define (token_to_words token name)
  (cond
   ;; condition to recognize money tokens
   ;; return list of words
   (t (previous_token_to_words token name))))

The actual condition and returned list of words are similar to the treatment of email addresses described above. The regular expression for money is probably "$[0-9]+\\.[0-9][0-9]".
(define (num-to-ordinal num)
 "Returns the ordinal in words for num (up to 40)."
 (cdr (assoc (parse-number num)
   '((1 first) (2 second) (3 third)
     ...
     (39 thirty-ninth) (40 fortieth)))))

Once you decide on the condition, remember that you need to return a list of words.