Token to word rules

The basic model in Festival is that each token is mapped to a list of words by a call to a token_to_words function. This function is called on each token and should return a list of words. It may also examine the token's context (within the current utterance) if necessary. The default action should (for most languages) simply be to return the token's name as a single-element list. For example, your basic function should look something like this:

(define (MYLANG_token_to_words token name)
  "(MYLANG_token_to_words TOKEN NAME)
Returns a list of words for the NAME from TOKEN.  This primarily
allows the treatment of numbers, money etc."
  (cond
   (t
    (list name))))

This function should be set in your voice selection function as the function for token analysis:

  (set! token_to_words MYLANG_token_to_words)
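
For example, in a hypothetical voice definition function (the name voice_MYLANG_speaker is only a placeholder here), the assignment would sit alongside the other text analysis settings, something like:

(define (voice_MYLANG_speaker)
  "(voice_MYLANG_speaker)
Set up Festival to speak with the MYLANG speaker voice."
  ;; ... phone set, lexicon, prosody etc. set up before this point ...
  ;; Text analysis: use our own token to word rules
  (set! token_to_words MYLANG_token_to_words)
  ;; ... waveform synthesis set up after this point ...
  (set! current-voice 'MYLANG_speaker))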

This function should be extended to deal with all tokens that are not in your lexicon, cannot be handled by your letter-to-sound rules, or are ambiguous in some way and require context to resolve.

For example, suppose we wish to treat all tokens consisting of strings of digits as strings of digits (rather than as numbers). We would add something like the following:

(set! MYLANG_digit_names
   '((0 "zero")
     (1 "one")
     (2 "two")
     (3 "three")
     (4 "four")
     (5 "five")
     (6 "six")
     (7 "seven")
     (8 "eight")
     (9 "nine")))

(define (MYLANG_token_to_words token name)
  "(MYLANG_token_to_words TOKEN NAME)
Returns a list of words for the NAME from TOKEN.  This primarily
allows the treatment of numbers, money etc."
  (cond
   ((string-matches name "[0-9]+") ;; any string of digits
    (mapcar
     (lambda (d)
       (car (cdr (assoc_string d MYLANG_digit_names))))
     (symbolexplode name)))
   (t
    (list name))))
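
As a quick sanity check, you can call the function directly at the Festival prompt; the token item is not consulted in the digit branch, so nil can stand in for it (the exact output of course depends on your digit names):

festival> (MYLANG_token_to_words nil "1996")
("one" "nine" "nine" "six")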

But more elaborate rules are also necessary. Some tokens require context to disambiguate, and sometimes multiple tokens are really one object, e.g. "$12 billion" must be rendered as "twelve billion dollars", where the money name crosses over the second word. Such multi-token rules must be split into multiple conditions, one for each part of the combined token. Thus we need to identify that "$DIGITS" appears in a context followed by "?illion". The code below renders the full phrase for the dollar amount. The second condition ensures nothing is returned for the "?illion" word, as it has already been dealt with by the previous token.

   ((and (string-matches name "\\$[123456789]+")
         (string-matches (item.feat token "n.name") ".*illion.?"))
     (append
      (digits_to_cardinal (string-after name "$")) ;; amount
      (list (item.feat token "n.name"))            ;; magnitude
      (list "dollars")))                           ;; currency name
   ((and (string-matches name ".*illion.?")
         (string-matches (item.feat token "p.name") "\\$[123456789]+"))
     ;; dealt with in previous token
     nil)

Note this still is not enough, as there may be other types of currency (pounds, yen, francs etc.), some of which may be mass nouns requiring no plural (e.g. "yen") and some of which may be count nouns requiring plurals. Also this only deals with whole numbers of .*illions; "$1.25 million" is common too. See the full example (for English) in festival/lib/token.scm.
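
The fragment above also assumes a helper function, digits_to_cardinal, that expands a digit string into the words for the corresponding cardinal number. Writing that properly is language specific; as a placeholder you could start with something that simply spells the digits, and replace it with real cardinal rules later:

(define (digits_to_cardinal digitstring)
  "(digits_to_cardinal DIGITSTRING)
Return a list of words for the cardinal number named by DIGITSTRING.
This placeholder merely spells out the digits; a real version needs
language-specific rules for tens, hundreds, thousands etc."
  (mapcar
   (lambda (d)
     (car (cdr (assoc_string d MYLANG_digit_names))))
   (symbolexplode digitstring)))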

A large list of rules is typically required. They should be looked upon as breaking the problem down into smaller parts, potentially recursively: for example, hyphenated tokens can be split into two words (a sketch of this appears below). It is probably wise to deal explicitly with all tokens that are not purely alphabetic. It may be worth having a catch-all that spells out all tokens not explicitly dealt with elsewhere (e.g. the numbers). For example you could add the following as the penultimate condition in your token_to_words function:

   ((not (string-matches name "[A-Za-z]+"))  ;; not purely alphabetic
    (symbolexplode name))
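
As an instance of the recursive breakdown mentioned above, hyphenated tokens can be handled by splitting on the hyphen and calling the function again on each half (string-before and string-after are standard Festival functions). A sketch of such a condition, which should appear before the catch-alls since hyphenated tokens are not purely alphabetic, might be:

   ((string-matches name "[A-Za-z]+-[A-Za-z]+")   ;; e.g. "tea-pot"
    (append
     (MYLANG_token_to_words token (string-before name "-"))
     (MYLANG_token_to_words token (string-after name "-"))))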

Note that the spelling-out catch-all above isn't necessarily correct when certain letters may be homographs. For example the token "a" may be a determiner or a letter of the alphabet. When it is a determiner it may (often) be reduced, while as a letter it probably isn't (i.e. pronounced "@" or "ei" respectively). Other languages exhibit this problem too (e.g. Spanish "y"). Therefore when we call symbolexplode we don't want just the letter, we also want to specify that it is the letter pronunciation we want and not any other form. To ensure the lexicon system gets the right pronunciation we therefore wish to specify the part of speech with the letter. In fact, rather than just a list of atomic words, the token_to_words function may return word descriptions including features. Thus, for example, we don't just want to return

(a b c)

We want to be more specific and return

(((name a) (pos nn))
 ((name b) (pos nn))
 ((name c) (pos nn)))

This can be done with the following code:

   ((not (string-matches name "[A-Za-z]+"))  ;; not purely alphabetic
    (mapcar
     (lambda (l)
       (list (list 'name l) (list 'pos 'nn)))
     (symbolexplode name)))

The above assumes that all single-character symbols (letters, digits, punctuation and other "funny" characters) have an entry in your lexicon with a part of speech field of nn and a pronunciation of the character in isolation.
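
How such entries get into your lexicon depends on how you build it, but as a hedged illustration they can be added explicitly with lex.add.entry; the phones below are just placeholders for whatever is appropriate in your language's phone set:

(lex.add.entry '("a" nn (((ei) 1))))     ;; the letter "a" in isolation
(lex.add.entry '("b" nn (((b ii) 1))))   ;; the letter "b"
;; ... and so on for the other letters, digits and punctuation characters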

The list of token types that you may wish to write/train rules for is of course language dependent and, to a certain extent, domain dependent. For example there are many more numbers in email text than in narrative novels. The number of abbreviations is also much higher in email and news stories than in more normal text. It may be worth having a look at some typical data to find the distribution and decide what is worth working on. As a rough guide, the following is a list of the symbol types we currently deal with in English, many of which will require some treatment in other languages.

Money

Money amounts often have different treatment than simple numbers, and there are conventions about the sub-currency part (i.e. cents, pfennigs etc.). Remember that it is not just numbers in the local currency you have to deal with; currency values from different countries are common in many kinds of text (e.g. dollars, yen, DMs and euros).

Numbers

Strings of digits will of course need mapping, even if there is only one mapping for a language (rare). Consider at least telephone numbers versus amounts; most languages make a distinction here. In English we need to distinguish further, see below for the more detailed discussion.

number/number

This can be a date, a fraction, or an alternative; context will help, though techniques that drop back to saying the string of characters often preserve the ambiguity, which can be better than forcing a decision.

acronyms

Lists of upper-case letters (with or without vowels). The decision to pronounce these as a word or as letters is difficult in general, but good guesses go far. If it is short (fewer than four characters), not in your lexicon, and not surrounded by other words in upper case, it is probably an acronym; further analysis of vowels, consonant clusters etc. will help.

number-number

Could be a range, a score (football), a date, etc.

word-word

Usually a simple split on each part is sufficient, but not always, as when the hyphen is really being used as a dash.

word/word

Could be an alternative, or a Unix pathname.

's or TOKENs

An appended "s" to a non-alphabetic token is probably some form of pluralization; removing it and recursing on the analysis is a reasonable thing to try (see the sketch after this list).

times and dates

These exist in various standardized forms, many of which are easy to recognize and break down.

telephone numbers

These vary from country to country (and by various conventions), but there may be standard forms that can be recognized.

roman numerals

Sometimes these are pronounced as numbers ("chapter II") and sometimes as ordinals ("James II").

ascii art

If you are dealing with online text, there are often extra characters in a document that should be ignored, or at least not pronounced literally, e.g. lines of hyphens used as separators.

email addresses, URLs, file names

Depending on your context this may be worth spending time on.

tokens containing any other non-alphanumeric character

Splitting the token around the non-alphanumeric character and recursing on each part before and after it may be reasonable (a sketch is given just below).
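
To make the pluralization and splitting ideas above concrete, hedged sketches of the corresponding conditions (to be placed before the final catch-alls in MYLANG_token_to_words) might look like this; the word "and" for "&" is only a placeholder for whatever is right in your language:

   ;; Pluralized non-alphabetic token, e.g. "1950s": strip the final "s"
   ;; and recurse; add a plural word afterwards if your language needs one.
   ((string-matches name "[0-9]+s")
    (MYLANG_token_to_words token (string-before name "s")))
   ;; Token containing some other non-alphanumeric character, e.g. "AT&T":
   ;; split around it and recurse on each part.
   ((string-matches name ".*&.*")
    (append
     (MYLANG_token_to_words token (string-before name "&"))
     (list "and")                        ;; placeholder word for "&"
     (MYLANG_token_to_words token (string-after name "&"))))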

Remember, the first purpose of text analysis is to ensure you can deal with anything, even if it is just saying the word "unknown" (in the appropriate language). Also, it is probably not worth spending time on rare token forms, though remember it is not easy to judge what is rare and what is not.