This chapter discusses some of the basic problems in analyzing text when trying to convert it to speech. To be of practical use, a voice in a new language needs at least some level of text analysis, because almost any piece of real text will contain tokens that do not have a simple one-to-one pronunciation. In Festival our view is that the initial text is tokenized into whitespace-separated items. (See the discussion below about how you might handle languages that don't normally separate tokens by whitespace.) These tokens are then mapped to words through simple rules (or statistically trained models). One token may map to zero or more words, and the mapping may be context sensitive.
Numbers are probably the most common form of token that doesn't have a simple lookup pronunciation. There is no way you can list all strings of digits in a lexicon, so some analysis into words is the most reasonable way of dealing with them. This is discussed below. Also, in many languages strings of digits may sometimes be pronounced as numbers (ordinals or cardinals), sometimes as strings of digits (e.g. telephone numbers), and in some cases have their own special pronunciation in certain contexts (e.g. years in English). We will discuss some examples below.
The basic model in Festival is that each token is mapped to a
list of words by a call to a token_to_words function. This
function is called on each token and should return a list of
words. It may also check the token's context (within the current
utterance) if necessary. The default action should (for most
languages) simply be to return a list containing one word, the token
itself. For example your basic function should look something like:
(define (MYLANG_token_to_words token name)
  "(MYLANG_token_to_words TOKEN NAME)
Returns a list of words for the NAME from TOKEN.  This primarily
allows the treatment of numbers, money etc."
  (cond
   (t (list name))))
This function should be set in your voice selection function as the function for token analysis:
(set! token_to_words MYLANG_token_to_words)
This function should be extended to deal with all tokens that are not in your lexicon, cannot be treated by your letter-to-sound rules, or are ambiguous in some way and require context to resolve.
For example, suppose we simply want every token consisting of a string of digits to be pronounced as a string of digits (rather than as a number). We would add something like the following:
(set! MYLANG_digit_names
      '((0 "zero") (1 "one") (2 "two") (3 "three") (4 "four")
        (5 "five") (6 "six") (7 "seven") (8 "eight") (9 "nine")))

(define (MYLANG_token_to_words token name)
  "(MYLANG_token_to_words TOKEN NAME)
Returns a list of words for the NAME from TOKEN.  This primarily
allows the treatment of numbers, money etc."
  (cond
   ((string-matches name "[0-9]+") ;; any string of digits
    (mapcar
     (lambda (d)
       (car (cdr (assoc_string d MYLANG_digit_names))))
     (symbolexplode name)))
   (t (list name))))
But more elaborate rules are also necessary. Some tokens require context to disambiguate, and sometimes multiple tokens are really one object, e.g. `$12 billion' must be rendered as `twelve billion dollars', where the currency name moves past the second token. Such multi-token rules must be split into multiple conditions, one for each token in the group. Thus we need to identify when a `$<digits>' token is followed by a `?illion' token. The first condition below renders the full phrase for the dollar amount. The second condition ensures nothing is returned for the `?illion' token, as it has already been dealt with at the previous token.
((and (string-matches name "\\$[0-9]+")
      (string-matches (item.feat token "n.name") ".*illion.?"))
 (append
  (digits_to_cardinal (string-after name "$")) ;; amount
  (list (item.feat token "n.name"))            ;; magnitude
  (list "dollars")))                           ;; currency name
((and (string-matches name ".*illion.?")
      (string-matches (item.feat token "p.name") "\\$[0-9]+"))
 ;; dealt with in previous token
 nil)
Note this is still not enough, as there may be other types of currency (pounds, yen, francs etc.), some of which may be mass nouns that require no plural (e.g. `yen') and some of which may be count nouns that do require plurals. Also, this only deals with whole numbers of .*illions, while `$1.25 million' is common too. See the full example (for English) in `festival/lib/token.scm'.
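One way to handle the singular/plural distinction is a small table of currency names. The following is only a minimal sketch under stated assumptions, not the approach taken in `festival/lib/token.scm': the table MYLANG_currency_names and the helper MYLANG_currency_word are hypothetical names introduced here, digits_to_cardinal is assumed to be defined as in the example above, and the currency symbols are assumed to be in the same character encoding as your input text.

```scheme
;; Hypothetical table: currency symbol -> singular and plural name.
;; Mass nouns such as "yen" simply list the same form twice.
(set! MYLANG_currency_names
      '(("$" "dollar" "dollars")
        ("£" "pound" "pounds")
        ("¥" "yen" "yen")))

(define (MYLANG_currency_word symbol amount)
  "(MYLANG_currency_word SYMBOL AMOUNT)
Return the currency name for SYMBOL: the singular form when the digit
string AMOUNT is exactly 1, otherwise the plural form."
  (let ((entry (assoc_string symbol MYLANG_currency_names)))
    (cond
     ((null entry) "unknown")                ;; symbol not in the table
     ((string-equal amount "1") (car (cdr entry)))
     (t (car (cdr (cdr entry)))))))

(define (MYLANG_token_to_words token name)
  (cond
   ((string-matches name "\\$[0-9]+")        ;; whole dollar amounts only
    (append
     (digits_to_cardinal (string-after name "$"))
     (list (MYLANG_currency_word "$" (string-after name "$")))))
   (t (list name))))
```

With this sketch `$1' should come out with `dollar' and `$12' with `dollars'. Decimal amounts and the `?illion' case would still need their own conditions.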
A large list of rules is typically required. The rules should be
looked upon as breaking down the problem into smaller parts,
potentially recursively. For example hyphenated tokens can be split
into two words. It is probably wise to deal explicitly with all tokens
that are not purely alphabetic, perhaps with a catch-all that spells
out all tokens that are not explicitly dealt with (e.g. the numbers).
For example you could add the following as the penultimate condition
in your token_to_words function:
((not (string-matches name "[A-Za-z]+"))
 (symbolexplode name))
Note this isn't necessarily correct when certain letters may be
homographs. For example the token `a' may be a determiner or a letter
of the alphabet. When it is a determiner it may (often) be reduced,
while as a letter it probably isn't (i.e. it is pronounced `@' or
`ei'). Other languages also exhibit this problem (e.g. Spanish `y').
Therefore when we call symbolexplode we don't want just the letter; we
also want to specify that it is the letter pronunciation we want and
not any other form. To ensure the lexicon system gets the right
pronunciation, we therefore wish to specify the part of speech with
the letter. In fact, rather than just a list of atomic words, the
token_to_words function may return word descriptions that include
features. Thus, for example, we don't just want to return
(a b c)
We want to be more specific and return
(((name a) (pos nn)) ((name b) (pos nn)) ((name c) (pos nn)))
This can be done by the code
((not (string-matches name "[A-Za-z]+"))
 (mapcar
  (lambda (l)
    (list (list 'name l) (list 'pos 'nn)))
  (symbolexplode name)))
The above assumes that all single-character symbols (letters, digits,
punctuation and other "funny" characters) have an entry in your
lexicon with a part of speech field nn, giving the pronunciation of
the character in isolation.
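The recursive decomposition mentioned above for hyphenated tokens can be sketched as follows. This is a minimal illustration, assuming that string-before and string-after split at the first occurrence of the hyphen; each half is passed back through the same function, so a half that is itself a number or another hyphenated form is handled by whatever other conditions you have written.

```scheme
(define (MYLANG_token_to_words token name)
  "(MYLANG_token_to_words TOKEN NAME)
Returns a list of words for the NAME from TOKEN."
  (cond
   ;; Hyphenated token, e.g. text-to-speech.  The pattern needs at
   ;; least one character on each side of the hyphen, so neither
   ;; half is ever empty and the recursion terminates.
   ((string-matches name ".+-.+")
    (append
     (MYLANG_token_to_words token (string-before name "-"))
     (MYLANG_token_to_words token (string-after name "-"))))
   (t (list name))))
```

Note that the same TOKEN is passed down with each part, so any condition that inspects neighbouring tokens still sees the neighbours of the whole hyphenated token, not of the individual part.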
The list of tokens that you may wish to write/train rules for is of course language dependent and, to a certain extent, domain dependent. For example there are many more numbers in email text than in narrative novels. The number of abbreviations is also much higher in email and news stories than in more normal text. It may be worth having a look at some typical data to find out the distribution and what is worth working on. As a rough guide, the following is a list of the symbol types we currently deal with in English, many of which will require some treatment in other languages.
Remember the first purpose of text analysis is to ensure you can deal with anything, even if it is just by saying the word `unknown' (in the appropriate language). Also, it's probably not worth spending time on rare token forms, though remember it is not easy to judge which forms are rare and which are not.
Almost everyone will expect a synthesizer to be able to speak numbers. As it is not feasible to list all possible digit strings in your lexicon, you will need to provide a function that returns a list of words for a given string of digits.
In its simplest form you should provide a function that
decodes the string of digits. The example spanish_number
(and spanish_number_from_digits) in the released Spanish
voice (`festvox_ellpc11k.tar.gz') is a good general
example.
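To give a flavour of what such a function involves, here is a minimal sketch for an English-like language covering 0 to 999. It is not the code from the Spanish voice: all the MYLANG_ names and tables are hypothetical, and digit strings longer than three digits are simply read digit by digit. It works on the list of digits produced by symbolexplode rather than on numeric values.

```scheme
;; Hypothetical lookup tables, keyed by a single digit.
(set! MYLANG_ones
      '(("0" "zero") ("1" "one") ("2" "two") ("3" "three") ("4" "four")
        ("5" "five") ("6" "six") ("7" "seven") ("8" "eight") ("9" "nine")))
(set! MYLANG_teens
      '(("0" "ten") ("1" "eleven") ("2" "twelve") ("3" "thirteen")
        ("4" "fourteen") ("5" "fifteen") ("6" "sixteen")
        ("7" "seventeen") ("8" "eighteen") ("9" "nineteen")))
(set! MYLANG_tens
      '(("2" "twenty") ("3" "thirty") ("4" "forty") ("5" "fifty")
        ("6" "sixty") ("7" "seventy") ("8" "eighty") ("9" "ninety")))

(define (MYLANG_lookup key table)
  "(MYLANG_lookup KEY TABLE)
Return the word paired with the digit KEY in TABLE."
  (car (cdr (assoc_string key table))))

(define (MYLANG_number_from_digits digits)
  "(MYLANG_number_from_digits DIGITS)
Return a list of words for the list of single digits DIGITS."
  (let ((l (length digits)))
    (cond
     ((equal? l 0) nil)
     ((> l 3)                           ;; too long: read digit by digit
      (mapcar (lambda (d) (MYLANG_lookup d MYLANG_ones)) digits))
     ((string-equal (car digits) "0")   ;; skip a leading zero
      (MYLANG_number_from_digits (cdr digits)))
     ((equal? l 1)
      (list (MYLANG_lookup (car digits) MYLANG_ones)))
     ((equal? l 2)
      (if (string-equal (car digits) "1")
          (list (MYLANG_lookup (car (cdr digits)) MYLANG_teens))
          (cons (MYLANG_lookup (car digits) MYLANG_tens)
                (MYLANG_number_from_digits (cdr digits)))))
     (t                                 ;; exactly three digits
      (append
       (MYLANG_number_from_digits (list (car digits)))
       (list "hundred")
       (MYLANG_number_from_digits (cdr digits)))))))

(define (MYLANG_number name)
  "(MYLANG_number NAME)
Return a list of words for the digit string NAME."
  (if (string-matches name "0+")
      (list "zero")
      (MYLANG_number_from_digits (symbolexplode name))))
```

For example, (MYLANG_number "105") should return the three words one, hundred, five, and (MYLANG_number "40") the single word forty. Thousands and above would follow the same pattern, splitting the digit list into groups of three.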
A number of languages use spaces within numbers where English might use commas. For example, text in German, Polish and other languages may contain
64 000
to denote sixty four thousand. As this will be multiple tokens in
Festival's basic analysis, it is necessary to write multiple conditions
in your token_to_words function.
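A minimal sketch of such a pair of conditions, following the same pattern as the `$12 billion' example, is given below. It assumes a hypothetical function MYLANG_number of your own that converts a digit string of any length to a list of words. It only handles groups of two tokens: `1 000 000' and two genuinely separate adjacent numbers would need further conditions or extra checks (e.g. on intervening punctuation).

```scheme
;; MYLANG_number is assumed (hypothetical): digit string -> list of words.
(define (MYLANG_token_to_words token name)
  "(MYLANG_token_to_words TOKEN NAME)
Returns a list of words for the NAME from TOKEN."
  (cond
   ;; First token of a group such as 64 000: one to three digits
   ;; followed by a token of exactly three digits.
   ((and (string-matches name "[0-9][0-9]?[0-9]?")
         (item.next token)
         (string-matches (item.feat token "n.name") "[0-9][0-9][0-9]"))
    (MYLANG_number (string-append name (item.feat token "n.name"))))
   ;; Second token of the group: already spoken with the previous
   ;; token.  The explicit item.prev test guards against a missing
   ;; neighbour, whose name feature may come back as a default value
   ;; that happens to look like a digit.
   ((and (string-matches name "[0-9][0-9][0-9]")
         (item.prev token)
         (string-matches (item.feat token "p.name") "[0-9][0-9]?[0-9]?"))
    nil)
   (t (list name))))
```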
In many languages, the pronunciation of a number depends on the thing that is being counted. For example the digit `1' in Spanish has multiple pronunciations depending on whether it is referring to a masculine or feminine object. In some languages this becomes much more complex, as there are a number of possible declensions. In our Polish synthesizer we solved this by adding an extra argument to the number generation function, which then selected the actual number word (typically the final word in a number) based on the desired declension.
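As an illustration of the extra-argument idea only (this is not the code from our Polish synthesizer), the Spanish `1' case might be sketched like this. The function name, the gender values and the fall-through function MYLANG_number are all hypothetical, and deciding which gender applies in a given context is left to the caller.

```scheme
;; MYLANG_number is assumed (hypothetical): digit string -> list of words.
(define (MYLANG_number_with_gender name gender)
  "(MYLANG_number_with_gender NAME GENDER)
Return a list of words for the digit string NAME.  GENDER is the
string masculine or feminine, or anything else when the number
stands alone."
  (cond
   ((string-equal name "1")
    (cond
     ((string-equal gender "feminine") (list "una"))
     ((string-equal gender "masculine") (list "un"))
     (t (list "uno"))))
   ;; All other numbers are treated as gender independent here, which
   ;; is a simplification: numbers ending in 1 (e.g. 21) vary too.
   (t (MYLANG_number name))))
```

A language with a full set of declensions would need a table of forms per number word rather than a single special case, but the calling convention can stay the same.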
%%%%%%%%%%%%%%%%%%% Example to be added %%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%% Discussion to be added %%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%% Discussion to be added %%%%%%%%%%%%%%%%%%%%%%