12 Lexicons

This chapter covers methods for finding the pronunciation of a word. This is done either with a lexicon (a large list of words and their pronunciations) or with some form of letter-to-sound rules.

12.1 Word pronunciations

A pronunciation in Festival requires not just a list of phones but also a syllabic structure. In some languages the syllabic structure is very simple and well defined and can be unambiguously derived from a phone string. In English however this may not always be the case (compound nouns being the difficult case).

The lexicon structure available in Festival takes both a word and a part of speech (an arbitrary token) to find the given pronunciation. For English this is probably the optimal form: although there are homographs in the language, the word itself and a fairly broad part of speech tag will mostly identify the proper pronunciation.

An example entry is

("photography"
 n
 (((f @) 0) ((t o g) 1) ((r @ f) 0) ((ii) 0)))

Note that in addition to the explicit marking of syllables a stress value is also given (0 or 1). In some languages lexical stress is fully predictable, in others it is highly irregular. In some languages this field may be more appropriately used for another purpose, e.g. tone type in Chinese.

There may be other languages which require a more complex (or less complex) format, and the decision to use some other format rather than this one is up to you.

Currently there is only residual support for morphological analysis in Festival. A finite state transducer based analyzer for English, based on the work in ritchie92, is included in `festival/lib/engmorph.scm' and `festival/lib/engmorphsyn.scm', but this should be considered experimental at best. Given the lack of such an analyzer, our lexicons need to list not only the base forms of words but also all their morphological variants. This is (more or less) acceptable in languages such as English or French, but for languages with richer morphology such as German it may seem an unnecessary requirement. For agglutinative languages such as Finnish and Turkish this appears to be even more of a restriction. This is probably true, but the current restriction is not necessarily hopeless. We have successfully built very good letter-to-sound rules for German, a language with a rich morphology, which allow the system to properly predict pronunciations of morphological variants of root words it has not seen before. We have not yet done any experiments with Finnish or Turkish but expect this technique would work (though of course developing a proper morphological analyzer would be better).

12.2 Lexicons and addenda

The basic assumption in Festival is that you will have a large lexicon, tens of thousands of entries, that is used as a standard part of an implementation of a voice. Letter-to-sound rules are used as a back-up when a word is not explicitly listed. This view is based on how English is best dealt with. However this is a very flexible view. An explicit lexicon isn't necessary in Festival and it may be possible to do much of the work in letter-to-sound rules. This is how we have implemented Spanish. However even when there is a strong relationship between the letters in a word and their pronunciation we still find a lexicon useful. For Spanish we still use the lexicon for symbols such as `$', `%', individual letters, as well as irregular pronunciations.

In addition to a large lexicon Festival also supports a smaller list called an addenda. This is primarily provided to allow specific applications and users to add entries that aren't in the existing lexicon.
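The relationship between the addenda, the main lexicon and the letter-to-sound back-up can be pictured with a small standalone Python sketch. This is not Festival code: the function name `lookup`, the toy entries and `toy_lts` are all hypothetical, and checking the addenda before the main lexicon is an assumption made for the illustration.

```python
# Sketch only: names, data and the addenda-before-lexicon order are assumptions.
def lookup(word, addenda, lexicon, lts):
    """Return a pronunciation, preferring explicit entries over predicted ones."""
    for table in (addenda, lexicon):
        if word in table:
            return table[word]
    return lts(word)  # back-up for words that are not listed anywhere


lexicon = {"table": ["t", "ei", "b", "l"]}
addenda = {"festvox": ["f", "e", "s", "t", "v", "o", "k", "s"]}


def toy_lts(word):
    # Placeholder letter-to-sound method: one pseudo-phone per letter.
    return list(word)


print(lookup("table", addenda, lexicon, toy_lts))    # found in the lexicon
print(lookup("festvox", addenda, lexicon, toy_lts))  # found in the addenda
print(lookup("zzz", addenda, lexicon, toy_lts))      # falls through to the rules
```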

12.3 Out of vocabulary words

Because it's impossible to list all words in a natural language, for general text-to-speech you will need to provide something to pronounce out of vocabulary words. In some languages this is easy but in others it is very hard. No matter what you do you must provide something, even if it is simply replacing the unknown word with the word `unknown' (or its local language equivalent). By default a lexicon in Festival will throw an error if a requested word isn't found. To change this you can set the lts_method. Most usefully you can set this to the name of a function which takes a word and a part of speech specification and returns a word pronunciation as described above.

For example, if we are always going to return the word unknown but print a warning that the word is being ignored, a suitable function is

(define (mylex_lts_function word feats)
"Deal with out of vocabulary word."
  (format t "unknown word: %s\n" word)
  '("unknown" n (((uh n) 1) ((n ou n) 1))))

Note the pronunciation of `unknown' must be in the appropriate phone set. Also the syllabic structure is required. You need to specify this function for your lexicon as follows

(lex.set.lts.method 'mylex_lts_function)

One level above merely identifying out of vocabulary words is to spell them. This of course isn't ideal, but it will allow the basic information to be passed over to the listener. This can be done with the out of vocabulary function, as follows.

(define (mylex_lts_function word feats)
"Deal with out of vocabulary words by spelling out the letters in the
word."
 (if (equal? 1 (length word))
     (begin
       (format t "the character %s is missing from the lexicon\n" word)
       '("unknown" n (((uh n) 1) ((n ou n) 1))))
     (list
      word
      'n
      (apply
       append
       (mapcar
        (lambda (letter)
         (car (cdr (cdr (lex.lookup letter 'n)))))
        (symbolexplode word))))))

A few points are worth noting in this function. It recursively calls the lexical lookup function on the characters in a word. Each letter should appear in the lexicon with its pronunciation (in isolation), but a check is made to ensure we don't recurse forever. The symbolexplode function assumes that letters are single bytes, which may not be true for some languages, and that function would need to be replaced for such a language. Note that we append the syllables of each of the letters in the word. For long words this might be too naive, as there could be internal prosodic structure in such a spelling that this method would not allow for. In that case you would want the letters to be words, so the symbol explosion should happen at the token to word level. Also the above function assumes that the part of speech for letters is n. This is only really important where letters are homographs in a language, in which case it can be used to distinguish which pronunciation you require (cf. `a' in English or `y' in French).
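The same control flow can be followed outside Festival in a short Python sketch. It is an illustration under assumptions rather than a port: the dictionary-based `lexicon`, the phone names and the function names are hypothetical. It keeps the two essential points, the guard that stops the recursion on a single missing character, and the concatenation of the letters' syllables. Python iterates over characters rather than bytes, so the single-byte caveat does not apply to the sketch.

```python
# Hypothetical entries of the form (word, pos, [(phones, stress), ...]).
UNKNOWN = ("unknown", "n", [(["uh", "n"], 1), (["n", "ou", "n"], 1)])


def lookup(word, lexicon):
    """Return the lexicon entry, falling back to spelling the word out."""
    return lexicon[word] if word in lexicon else spell_out(word, lexicon)


def spell_out(word, lexicon):
    """Build an entry from the isolated pronunciations of the word's letters."""
    if len(word) == 1:
        # A single unlisted character: stop here so we never recurse forever.
        print(f"the character {word} is missing from the lexicon")
        return UNKNOWN
    syllables = []
    for letter in word:
        syllables.extend(lookup(letter, lexicon)[2])  # append each letter's syllables
    return (word, "n", syllables)


letters = {
    "b": ("b", "n", [(["b", "ii"], 1)]),
    "c": ("c", "n", [(["s", "ii"], 1)]),
}
print(lookup("bbc", letters))
```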

12.4 Letter-to-sound rules by hand

For many languages there is a systematic relationship between the written form of a word and its pronunciation. For some languages this can be fairly easy to write down by hand. In Festival there is a letter-to-sound rule system that allows such rules to be written. This rule system, described in detail in the Festival manual itself, is what you should use if you are going to write rules by hand. There is also an automatic training method, fully described in the next section, which produces CART trees; although these are easy to interpret, they are probably unsuitable as a notation for hand specification.

When writing a rule system it is often useful to do it in multiple passes. The Spanish diphone voice distributed as `festvox_ellpc11k.tar.gz' offers a good example of such a use. A set of cascaded LTS rule sets is used to transform the basic word into a fully accented, syllabified string of symbols which is then converted into the bracketed form used by Festival. The levels are normalization (downcasing and accent normalization), conversion to pronunciation, syllabification, stress and finally identifying weak vowels. Splitting the conversion tasks like this can often make writing the rules much easier, though care should be taken to ensure you don't mix up what you think are letters and what you think are phones.

The LTS rule system is a little primitive and lacks some syntactic sugar (sets etc.) that would make writing rules easier. In its present form you need to be very explicit. Your rule set can be tested in Festival in isolation (and should be, rather than by actual synthesis). The function lts.apply allows you to apply an LTS rule set to a word or list of symbols. See the manual and the Spanish example for more details.
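To make the idea of explicit, ordered, context-sensitive rules concrete, here is a small standalone Python sketch. It does not use Festival's rule syntax (see the Festival manual for that), and the rules are hypothetical, loosely Spanish-like toys rather than the rules shipped with any voice. It assumes that rules are tried in order at each position and that the first match wins, and it shows two passes: normalization, then conversion to phones.

```python
def apply_rules(symbols, rules):
    """Rewrite symbols left to right; at each position the first matching rule wins."""
    out, i = [], 0
    while i < len(symbols):
        for left, target, right, result in rules:
            j = i + len(target)
            if (symbols[i:j] == target
                    and i >= len(left)
                    and symbols[i - len(left):i] == left
                    and symbols[j:j + len(right)] == right):
                out.extend(result)
                i = j
                break
        else:
            raise ValueError(f"no rule matches {symbols[i]!r}")
    return out


# Each rule is (left context, target, right context, output); all hypothetical.
TO_PHONES = [
    ([], ["c", "h"], [], ["ch"]),   # digraph first, so it beats the plain c rules
    ([], ["c"], ["e"], ["th"]),
    ([], ["c"], ["i"], ["th"]),
    ([], ["c"], [], ["k"]),
    ([], ["h"], [], []),            # silent h
] + [([], [x], [], [x]) for x in "aeioulmnst"]


def pronounce(word):
    symbols = list(word.lower())            # pass 1: normalization (downcasing)
    return apply_rules(symbols, TO_PHONES)  # pass 2: letters to phones


print(pronounce("Coche"), pronounce("cena"), pronounce("hola"))
```

Further passes (syllabification, stress) would be additional rule sets applied to the output of the previous pass in the same way. Raising an error when no rule matches makes gaps in the rule set visible during testing.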

12.5 Building letter-to-sound rules

For some languages the writing of a rule system is too difficult. Although there have been many valiant attempts to do so for languages like English, life is basically too short to do this. Therefore we also include a method for automatically building LTS rule sets from a lexicon of pronunciations. This technique has successfully been used for English (British and American), French and German. The difficulty and appropriateness of using letter-to-sound rules is very language dependent.

The following outlines the processes involved in building a letter-to-sound model for a language given a large lexicon of pronunciations. This technique is likely to work for most European languages (including Russian) but doesn't seem particularly suitable for languages with very large alphabets like Japanese and Chinese. The process described here is not (yet) fully automatic, but the hand intervention required is small and may easily be done even by people with only a very little knowledge of the language being dealt with.

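The process involves the following steps: preprocessing the lexicon into a suitable training set; defining the set of allowable pairings of letters to phones; constructing the probabilities of each letter/phone pair; aligning letters to an equal-length sequence of phones and _epsilon_s; extracting the data by letter in a form suitable for training; building CART models to predict phones from letters (and context); and, if necessary, building an additional lexical stress assignment model.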

All except the first two stages of this are fully automatic.

Before building a model it's wise to think a little about what you want it to do. Ideally the model is an auxiliary to the lexicon, so only words not found in the lexicon will require use of the letter-to-sound rules. Thus only unusual forms are likely to require the rules. More precisely, the most common words, which often have the most non-standard pronunciations, should probably always be explicitly listed. It is possible to reduce the size of the lexicon (sometimes drastically) by removing all entries that the trained LTS model correctly predicts.

Before starting it is wise to consider removing some entries from the lexicon before training. I typically remove words under 4 letters and, if part of speech information is available, I remove all function words, ideally training only from nouns, verbs and adjectives as these are the most likely forms to be unknown in text. It is useful to have morphologically inflected and derived forms in the training set, as it is often such variant forms that are not found in the lexicon even though their root morpheme is. Note that in many forms of text, proper names are the most common form of unknown word, and even the technique presented here may not adequately cater for that form of unknown word (especially if the unknown words are non-native names). All this is to say that the technique may or may not be appropriate for your task, but the rules generated by this learning process have, in the examples we've done, been much better than what we could produce by hand writing rules of the form described in the previous section.

First preprocess the lexicon into a file of lexical entries to be used for training, removing function words and changing the head words to all lower case (this may be language dependent). The entries should be of the form used for input for Festival's lexicon compilation. Specifically the pronunciations should be simple lists of phones (no syllabification). Depending on the language, you may wish to remove the stressing. For the examples here we have, though later tests suggest that we should keep it in even for English. Thus the training set should look something like

("table" nil (t ei b l))
("suspicious" nil (s @ s p i sh @ s))

It is best to split the data into a training set and a test set if you wish to know how well your training has worked. In our tests we remove every tenth entry and put it in a test set. Note this will mean our test results are probably better than if we removed say the last ten in every hundred.
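The filtering and the every-tenth-entry split are easy to script. The following Python sketch is one possible way of doing it, not part of the Festival tools: the function names, the tuple representation of entries and the `function_words` set are assumptions made for the example.

```python
def prepare(entries, function_words=frozenset(), min_letters=4):
    """Downcase head words and drop short words and function words.

    entries: iterable of (headword, part_of_speech, phones).
    """
    kept = []
    for word, _pos, phones in entries:
        word = word.lower()
        if len(word) < min_letters or word in function_words:
            continue
        kept.append((word, None, phones))  # part of speech is written as nil
    return kept


def split_every_tenth(entries):
    """Put every tenth entry in the test set and the rest in the training set."""
    train = [e for n, e in enumerate(entries, 1) if n % 10 != 0]
    test = [e for n, e in enumerate(entries, 1) if n % 10 == 0]
    return train, test


raw = [("Table", "n", ["t", "ei", "b", "l"]),
       ("the", "dt", ["dh", "@"]),
       ("about", "in", ["@", "b", "au", "t"])]
print(prepare(raw, function_words={"about"}))
```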

The second stage is to define the set of allowable letter to phone mappings irrespective of context. This can sometimes be initially done by hand then checked against the training set. Initially construct a file of the form

(require 'lts_build)
(set! allowables 
      '((a _epsilon_)
        (b _epsilon_)
        (c _epsilon_)
        ...
        (y _epsilon_)
        (z _epsilon_)
        (# #)))

All letters that appear in the alphabet should (at least) map to _epsilon_, including any accented characters that appear in that language. Note the two hashes in the last entry. These are used to denote the beginning and end of a word and are automatically added during training; they must appear in the list and should only map to themselves.

To incrementally add to this allowable list run festival as

festival allowables.scm 

and at the prompt type

festival> (cummulate-pairs "oald.train")

with your training file. This will print out each lexical entry that couldn't be aligned with the current set of allowables. At the start this will be every entry. Looking at these entries, add to the allowables to make alignment work. For example if the following word fails

("abate" nil (ah b ey t)) 

Add ah to the allowables for letter a, b to b, ey to a and t to letter t. After doing that restart festival and call cummulate-pairs again. Incrementally add to the allowable pairs until the number of failures becomes acceptable. Often there are entries for which there is no real relationship between the letters and the pronunciation, such as in abbreviations and foreign words (e.g. "aaa" as "t r ih p ax l ey"). For the lexicons I've used the technique on, fewer than 10 per thousand fail in this way.
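What it means for an entry to be alignable can be shown with a short Python sketch. It is a simplified illustration, not the code in `lts_build.scm`: the function name is hypothetical, the word boundary symbol # is left out, and every letter is assumed to map either to _epsilon_ (consuming no phone) or to exactly one phone.

```python
from functools import lru_cache


def can_align(letters, phones, allowables):
    """True if every letter can take one allowable so that all phones are used up."""
    letters, phones = tuple(letters), tuple(phones)

    @lru_cache(maxsize=None)
    def fits(i, j):
        if i == len(letters):
            return j == len(phones)  # success only if no phones are left over
        for unit in allowables.get(letters[i], ()):
            if unit == "_epsilon_":
                if fits(i + 1, j):
                    return True
            elif j < len(phones) and phones[j] == unit and fits(i + 1, j + 1):
                return True
        return False

    return fits(0, 0)


initial = {c: ["_epsilon_"] for c in "abet"}
extended = {"a": ["_epsilon_", "ah", "ey"], "b": ["_epsilon_", "b"],
            "t": ["_epsilon_", "t"], "e": ["_epsilon_"]}
print(can_align("abate", ["ah", "b", "ey", "t"], initial))   # fails at the start
print(can_align("abate", ["ah", "b", "ey", "t"], extended))  # aligns after the additions
```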

It is worthwhile being consistent when defining your set of allowables. (At least) two mappings are possible for the letter sequence ch: letter c can go to phone ch and letter h to _epsilon_, or letter c can go to _epsilon_ and letter h to ch. However only one should be allowed; we preferred c to ch.

It may also be the case that some letters give rise to more than one phone. For example the letter x in English is often pronounced as the phone combination k and s. To allow this, use the multiphone k-s. Thus the multiphone k-s will be predicted for x in some contexts and the model will separate it into two phones, while also ignoring any predicted _epsilon_s. Note that multiphone units are relatively rare but do occur. In English, the letter x gives rise to a few: k-s in taxi, g-z in example, and sometimes g-zh and k-sh in luxury. Others are w-ah in one, t-s in pizza, y-uw in new (British), ah-m in -ism etc. Three-phone multiphones are much rarer but may exist. They are not supported by this code as is, and such entries should probably be ignored. Note the - sign in the multiphone examples is significant and is used to identify multiphones.
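The step from predicted units back to a plain phone string is simple enough to show directly. This Python sketch (the function name is hypothetical, and it is not the Festival implementation) drops every _epsilon_ and splits multiphones on the - sign.

```python
def expand(predicted):
    """Turn one predicted unit per letter into a flat list of phones."""
    phones = []
    for unit in predicted:
        if unit == "_epsilon_":
            continue                     # this letter produced no phone
        phones.extend(unit.split("-"))   # k-s becomes k followed by s
    return phones


print(expand(["t", "a", "k-s", "i"]))
print(expand(["ah", "b", "ey", "t", "_epsilon_"]))
```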

The allowables for OALD end up being

(set! allowables 
       '
      ((a _epsilon_ ei aa a e@ @ oo au o i ou ai uh e)
       (b _epsilon_ b )
       (c _epsilon_ k s ch sh @-k s t-s)
       (d _epsilon_ d dh t jh)
       (e _epsilon_ @ ii e e@ i @@ i@ uu y-uu ou ei aa oi y y-u@ o)
       (f _epsilon_ f v )
       (g _epsilon_ g jh zh th f ng k t)
       (h _epsilon_ h @ )
       (i _epsilon_ i@ i @ ii ai @@ y ai-@ aa a)
       (j _epsilon_ h zh jh i y )
       (k _epsilon_ k ch )
       (l _epsilon_ l @-l l-l)
       (m _epsilon_ m @-m n)
       (n _epsilon_ n ng n-y )
       (o _epsilon_ @ ou o oo uu u au oi i @@ e uh w u@ w-uh y-@)
       (p _epsilon_ f p v )
       (q _epsilon_ k )
       (r _epsilon_ r @@ @-r)
       (s _epsilon_ z s sh zh )
       (t _epsilon_ t th sh dh ch d )
       (u _epsilon_ uu @ w @@ u uh y-uu u@ y-u@ y-u i y-uh y-@ e)
       (v _epsilon_ v f )
       (w _epsilon_ w uu v f u)
       (x _epsilon_ k-s g-z sh z k-sh z g-zh )
       (y _epsilon_ i ii i@ ai uh y @ ai-@)
       (z _epsilon_ z t-s s zh )
       (# #)
       ))

Note this is an exhaustive list and (deliberately) says nothing about the contexts in which these letter to phone pairs appear or how frequently. That information will be generated automatically from the training set.

Once the number of failed matches is sufficiently low, let cummulate-pairs run to completion. This counts the number of times each letter/phone pair occurs in allowable alignments.

Next call

festival> (save-table "oald-")

with the name of your lexicon. This changes the cumulation table into probabilities and saves it.
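As an illustration of turning a table of counts into probabilities, the following Python sketch normalizes the counts separately for each letter. The per-letter normalization and the function name are assumptions made for the example; the exact scheme used by save-table is not described here.

```python
from collections import defaultdict


def to_probabilities(pair_counts):
    """Map {(letter, unit): count} to {(letter, unit): P(unit | letter)}."""
    totals = defaultdict(int)
    for (letter, _unit), n in pair_counts.items():
        totals[letter] += n
    return {(letter, unit): n / totals[letter]
            for (letter, unit), n in pair_counts.items()}


counts = {("a", "ah"): 3, ("a", "ey"): 1, ("b", "b"): 4}
print(to_probabilities(counts))
```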

Restart festival loading this new table

festival allowables.scm oald-pl-table.scm

Now each word can be aligned to an equal-length string of phones, epsilons and multiphones.

festival> (aligndata "oald.train" "oald.train.align")

Do this also for your test set.

This will produce entries like

aaronson _epsilon_ aa r ah n s ah n
abandon ah b ae n d ah n
abate ah b ey t _epsilon_
abbe ae b _epsilon_ iy
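One way to picture how a probability table selects a single alignment is a search for the highest-scoring assignment of one unit per letter. The Python sketch below is an assumption about the general approach, not the algorithm in `lts_build.scm`; the function name and the toy probabilities are hypothetical. It handles _epsilon_ and two-phone multiphones and scores with summed log probabilities.

```python
import math
from functools import lru_cache


def best_alignment(letters, phones, probs):
    """Return the most probable list of units (one per letter), or None."""
    letters, phones = tuple(letters), tuple(phones)
    by_letter = {}
    for (letter, unit), p in probs.items():
        if p > 0:
            by_letter.setdefault(letter, []).append((unit, math.log(p)))

    @lru_cache(maxsize=None)
    def best(i, j):
        if i == len(letters):
            return (0.0, ()) if j == len(phones) else None
        winner = None
        for unit, logp in by_letter.get(letters[i], ()):
            parts = () if unit == "_epsilon_" else tuple(unit.split("-"))
            if phones[j:j + len(parts)] != parts:
                continue  # this unit does not match the next phone(s)
            rest = best(i + 1, j + len(parts))
            if rest is not None and (winner is None or logp + rest[0] > winner[0]):
                winner = (logp + rest[0], (unit,) + rest[1])
        return winner

    result = best(0, 0)
    return None if result is None else list(result[1])


toy = {("a", "ah"): 0.5, ("a", "ey"): 0.3, ("a", "_epsilon_"): 0.2,
       ("b", "b"): 0.9, ("b", "_epsilon_"): 0.1,
       ("t", "t"): 0.9, ("t", "_epsilon_"): 0.1,
       ("e", "_epsilon_"): 1.0}
print(best_alignment("abate", ["ah", "b", "ey", "t"], toy))
```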

The next stage is to build features suitable for `wagon' to build models. This is done by

festival> (build-feat-file "oald.train.align" "oald.train.feats")

Again the same for the test set.
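The feature file pairs each predicted unit with the letter it came from and a window of surrounding letters. The Python sketch below shows the idea under stated assumptions: a window of four letters either side (which is consistent with the current letter being the sixth field, as the awk filter in the training script expects), # for the word boundary and 0 as padding beyond it. The padding symbol and the function name are hypothetical; inspect the output of build-feat-file for the real layout.

```python
def feature_rows(letters, units, width=4):
    """One row per letter: [unit, left context..., letter, right context...]."""
    if len(letters) != len(units):
        raise ValueError("letters and units must be aligned one to one")
    pad = ["0"] * (width - 1)
    padded = pad + ["#"] + list(letters) + ["#"] + pad
    rows = []
    for k, unit in enumerate(units):
        window = padded[k:k + 2 * width + 1]  # the current letter sits mid-window
        rows.append([unit] + window)
    return rows


for row in feature_rows("abate", ["ah", "b", "ey", "t", "_epsilon_"]):
    print(" ".join(row))
```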

Now you need to construct a description file for `wagon' for the given data. This can be done using the script `make_wgn_desc' provided with the speech tools.

Here is an example script for building the models. You will need to modify it for your particular database but it shows the basic processes.

for i in a b c d e f g h i j k l m n o p q r s t u v w x y z 
do
   # Stop value for wagon
   STOP=2
   echo letter $i STOP $STOP
   # Find training set for letter $i
   cat oald.train.feats |
    awk '{if ($6 == "'$i'") print $0}' >ltsdataTRAIN.$i.feats
   # split training set to get heldout data for stepwise testing
   traintest ltsdataTRAIN.$i.feats
   # Extract test data for letter $i
   cat oald.test.feats |
    awk '{if ($6 == "'$i'") print $0}' >ltsdataTEST.$i.feats
   # run wagon to predict model
   wagon -data ltsdataTRAIN.$i.feats.train -test ltsdataTRAIN.$i.feats.test \
          -stepwise -desc ltsOALD.desc -stop $STOP -output lts.$i.tree
   # Test the resulting tree against the test data for letter $i
   wagon_test -heap 2000000 -data ltsdataTEST.$i.feats -desc ltsOALD.desc \
              -tree lts.$i.tree
done

The script `traintest' splits the given file `X' into `X.train' and `X.test' with every tenth line in `X.test' and the rest in `X.train'.

This script can take a significant amount of time to run, about 6 hours on a Sun Ultra 140.

Once the models are created they must be collected together into a single list structure. The trees generated by `wagon' contain full probability distributions at each leaf. At this point that information can be removed, as only the most probable phone will actually be predicted. This substantially reduces the size of the trees.

(merge_models 'oald_lts_rules "oald_lts_rules.scm")

(merge_models is defined within `lts_build.scm'.) The given file will contain a set! for the given variable name to an assoc list of letter to trained tree. Note the above function naively assumes that the letters in the alphabet are the 26 lower case letters of the English alphabet; you will need to edit this, adding accented letters if required. Note that adding "'" (single quote) as a letter is a little tricky in Scheme but can be done: the command (intern "'") will give you the symbol for single quote.

To test a set of lts models load the saved model and call the following function with the test align file

festival oald-table.scm oald_lts_rules.scm
festival> (lts_testset "oald.test.align" oald_lts_rules)

The result (after showing all the failed ones) will be a table showing the results for each letter, for all letters and for complete words. The failed entries may give some notion of how good or bad the result is. Sometimes the failures will be simple vowel differences (long versus short, schwa versus full vowel); other times whole consonants may be missing. Remember the ultimate measure of quality for the letter-to-sound rules is how adequate they are at providing acceptable pronunciations rather than how good the numeric score is.
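The difference between the per-letter score and the whole-word score can be seen in a small Python sketch. It is an illustration only, with a hypothetical function name, and does not reproduce the table printed by lts_testset. Each test item pairs the predicted units with the reference units from the aligned test file.

```python
def score(results):
    """Return (fraction of letters correct, fraction of whole words correct)."""
    letters_total = letters_right = words_right = 0
    for predicted, reference in results:
        matches = sum(p == r for p, r in zip(predicted, reference))
        letters_total += len(reference)
        letters_right += matches
        # A word counts only if every letter's unit was predicted correctly.
        words_right += int(matches == len(reference) == len(predicted))
    return letters_right / letters_total, words_right / len(results)


results = [
    (["t", "ei", "b", "l", "_epsilon_"], ["t", "ei", "b", "l", "_epsilon_"]),
    (["ah", "b", "ae", "t", "_epsilon_"], ["ah", "b", "ey", "t", "_epsilon_"]),
]
print(score(results))
```

A single wrong vowel costs one letter in ten here but half of the words, which is why word accuracy is always the lower of the two figures.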

For some languages (e.g. English) it is necessary to also find a stress pattern for unknown words. Ultimately for this to work well you need to know the morphological decomposition of the word. At present we provide a CART trained system to predict stress patterns for English. It gets 94.6% correct on an unseen test set, but that isn't really very good. Later tests suggest that predicting stressed and unstressed phones directly is actually better for getting whole words correct, even though the models do slightly worse on a per phone basis (see black98b).

As the lexicon may be a large part of the system, we have also experimented with removing entries from the lexicon if the letter-to-sound rule system (and stress assignment system) can correctly predict them. For OALD this allows us to halve the size of the lexicon; it could possibly allow more if a certain amount of fuzzy acceptance was allowed (e.g. with schwa). For other languages the gain here can be very significant: for German and French we can reduce the lexicon by over 90%. The function reduce_lexicon in `festival/lib/lts_build.scm' was used to do this. The use of the above technique as a dictionary compression method is discussed in pagel98. A morphological decomposition algorithm, like that described in black91, may help even more.
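The idea behind lexicon reduction fits in a few lines. The Python sketch below is not Festival's reduce_lexicon; `prune_lexicon`, `toy_predict` and the entries are hypothetical. It keeps only the entries that the predictor gets wrong, using exact matching with no fuzzy acceptance.

```python
def prune_lexicon(lexicon, predict):
    """Keep only the entries whose pronunciation the predictor gets wrong."""
    return {word: phones for word, phones in lexicon.items()
            if predict(word) != phones}


def toy_predict(word):
    # Stand-in for trained letter-to-sound rules: one pseudo-phone per letter.
    return list(word)


full = {"bat": ["b", "a", "t"], "one": ["w", "uh", "n"]}
print(prune_lexicon(full, toy_predict))
```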

The technique described in this section and its relative merits with respect to a number of languages/lexicons and tasks is discussed more fully in black98b.

12.6 Post-lexical rules

In fluent speech word boundaries are often degraded in a way that causes co-articulation across boundaries. A lexical entry should normally provide pronunciations as if the word is being spoken in isolation. It is only once the word has been inserted into the context in which it is going to be spoken that co-articulatory effects can be applied.

Post lexical rules are a general set of rules which can modify the segment relation (or any other part of the utterance for that matter), after the basic pronunciations have been found. In Festival post-lexical rules are defined as functions which will be applied to the utterance after intonational accents have been assigned.

For example in British English word final /r/ is only produced when the following word starts with a vowel. Thus all other word final /r/s need to be deleted. A Scheme function that implements this is as follows

(define (plr_rp_final_r utt)
  (mapcar
   (lambda (s)
    (if (and (string-equal "r" (item.name s))  ;; this is an r
             ;; it is syllable final
             (string-equal "1" (item.feat s "syl_final"))
             ;; the syllable is word final
             (not (string-equal "0" 
                   (item.feat s "R:SylStructure.parent.syl_break")))
             ;; The next segment is not a vowel
             (string-equal "-" (item.feat s "n.ph_vc")))
        (item.delete s)))
   (utt.relation.items utt 'Segment)))
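For readers who want to trace the logic without a Festival utterance structure, here is a simplified Python sketch of the same rule. It works on a plain list of words, each a list of phones, so it only approximates the Scheme version: it tests for a word-final r directly instead of using the syl_final and syl_break features. The function name and the vowel set are hypothetical.

```python
# Hypothetical vowel inventory for the toy examples below.
VOWELS = {"a", "e", "i", "o", "u", "@", "aa", "ii", "oo", "uu"}


def delete_final_r(words, vowels=VOWELS):
    """Drop a word-final r unless the next word starts with a vowel."""
    out = []
    for k, phones in enumerate(words):
        following = words[k + 1][0] if k + 1 < len(words) and words[k + 1] else None
        if phones and phones[-1] == "r" and following not in vowels:
            phones = phones[:-1]  # no vowel follows, so the r is not produced
        out.append(list(phones))
    return out


print(delete_final_r([["f", "oo", "r"], ["a", "p", "l", "z"]]))  # r is kept
print(delete_final_r([["f", "oo", "r"], ["k", "a", "t", "s"]]))  # r is deleted
```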

In English we also use post-lexical rules for phenomena such as vowel reduction and schwa deletion in the possessive `'s'.

