This chapter covers methods for finding the pronunciation of a word, either from a lexicon (a large list of words and their pronunciations) or by letter-to-sound rules.
A pronunciation in Festival requires not just a list of phones but also a syllabic structure. In some languages the syllabic structure is very simple and well defined and can be unambiguously derived from a phone string. In English however this may not always be the case (compound nouns being the difficult case).
The lexicon structure available in Festival takes both a word and a part of speech (an arbitrary token) to find a pronunciation. For English this is probably the optimal form: although the language has homographs, the word itself and a fairly broad part of speech tag will mostly identify the proper pronunciation.
An example entry is
("photography" n (((f @) 0) ((t o g) 1) ((r @ f) 0) ((ii) 0)))
Note that in addition to the explicit marking of syllables, a stress value (0 or 1) is also given. In some languages lexical stress is fully predictable, in others highly irregular. In some languages this field may be more appropriately used for another purpose, e.g. tone type in Chinese.
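To make the shape of an entry concrete, here is a minimal Python sketch (not Festival code). It models an entry as a headword, a part of speech and a list of syllables, each syllable being a list of phones plus a stress value. The class and function names are hypothetical; the sample data is the `unknown' pronunciation used later in this chapter.

```python
from typing import List, NamedTuple, Tuple


class Entry(NamedTuple):
    """One lexical entry: headword, part of speech, syllables."""
    word: str
    pos: str
    # Each syllable is (phones, stress); stress is 0 or 1 in this sketch.
    syllables: List[Tuple[List[str], int]]


def flat_phones(entry: Entry) -> List[str]:
    """Drop the syllable structure and return the plain phone string."""
    return [phone for phones, _ in entry.syllables for phone in phones]


def stress_pattern(entry: Entry) -> List[int]:
    """Return one stress value per syllable."""
    return [stress for _, stress in entry.syllables]


example = Entry("unknown", "n", [(["uh", "n"], 1), (["n", "ou", "n"], 1)])
print(flat_phones(example), stress_pattern(example))
```

Going from syllables to a flat phone string is trivial, as the sketch shows; it is the reverse direction that may be ambiguous in a language like English, which is why the entry stores the syllable structure explicitly.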
There may be other languages which require a more complex (or less complex) format, and the decision to use some other format rather than this one is up to you.
Currently there is only residual support for morphological analysis in Festival. A finite state transducer based analyzer for English, based on the work in ritchie92, is included in `festival/lib/engmorph.scm' and `festival/lib/engmorphsyn.scm', but this should be considered experimental at best. Given the lack of such an analyzer, our lexicons need to list not only base forms of words but also all their morphological variants. This is (more or less) acceptable in languages such as English or French, but for languages with richer morphology such as German it may seem an unnecessary requirement. For agglutinative languages such as Finnish and Turkish it appears to be even more of a restriction. This is probably true, but the current restriction is not necessarily hopeless. We have successfully built very good letter-to-sound rules for German, a language with a rich morphology, which allow the system to properly predict pronunciations of morphological variants of root words it has not seen before. We have not yet done any experiments with Finnish or Turkish but believe this technique would work (though of course developing a proper morphological analyzer would be better).
The basic assumption in Festival is that you will have a large lexicon, tens of thousands of entries, that is used as a standard part of an implementation of a voice. Letter-to-sound rules are used as a backup when a word is not explicitly listed. This view is based on how English is best dealt with. However this is a very flexible view. An explicit lexicon isn't necessary in Festival and it may be possible to do much of the work in letter-to-sound rules. This is how we have implemented Spanish. However even when there is a strong relationship between the letters in a word and their pronunciation we still find a lexicon useful. For Spanish we still use the lexicon for symbols such as `$', `%', individual letters, as well as irregular pronunciations.
In addition to a large lexicon Festival also supports a smaller list called an addenda. This is primarily provided to allow specific applications and users to add entries that aren't in the existing lexicon.
Because it's impossible to list all words in a natural language for general text-to-speech, you will need to provide something to pronounce out of vocabulary words. In some languages this is easy but in others it is very hard. No matter what you do you must provide something, even if it is simply replacing the unknown word with the word `unknown' (or its local language equivalent). By default a lexicon in Festival will throw an error if a requested word isn't found. To change this you can set the `lts_method'. Most usefully you can set this to the name of a function which takes a word and a part of speech specification and returns a word pronunciation as described above. For example, if we are always going to return the word `unknown' but print a warning that the word is being ignored, a suitable function is
(define (mylex_lts_function word feats)
  "Deal with out of vocabulary word."
  (format t "unknown word: %s\n" word)
  '("unknown" n (((uh n) 1) ((n ou n) 1))))
Note the pronunciation of `unknown' must be in the appropriate phone set. Also the syllabic structure is required. You need to specify this function for your lexicon as follows
(lex.set.lts.method 'mylex_lts_function)
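The overall lookup behaviour can be sketched in Python as follows. This is an illustration only, not Festival's implementation: the function names, the dictionary layout, the syllabification given for "table", and the order in which the addenda and the main lexicon are consulted (addenda first) are all assumptions made for the example.

```python
# Pronunciation returned for anything we cannot find (from the text above).
UNKNOWN = ("unknown", "n", [(["uh", "n"], 1), (["n", "ou", "n"], 1)])


def always_unknown(word, pos):
    """Fallback in the spirit of mylex_lts_function: warn, return 'unknown'."""
    print(f"unknown word: {word}")
    return UNKNOWN


def lookup(word, pos, addenda, lexicon, lts_function=None):
    """Try the addenda, then the lexicon, then the fallback function."""
    for table in (addenda, lexicon):          # assumed search order
        if (word, pos) in table:
            return (word, pos, table[(word, pos)])
    if lts_function is None:
        # Mirrors the default behaviour: a missing word is an error.
        raise KeyError(f"{word!r} not found and no fallback set")
    return lts_function(word, pos)


# Hypothetical data; the syllabification of "table" is illustrative.
lexicon = {("table", "n"): [(["t", "ei"], 1), (["b", "l"], 0)]}
addenda = {}
print(lookup("table", "n", addenda, lexicon))
print(lookup("zzyzx", "n", addenda, lexicon, always_unknown))
```

Without a fallback function the second call would raise an error, which corresponds to the default behaviour described above.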
One level above merely flagging out of vocabulary words is to spell them out. This of course isn't ideal, but it allows the basic information to be passed to the listener. This can be done with the out of vocabulary function, as follows.
(define (mylex_lts_function word feats)
  "Deal with out of vocabulary words by spelling out the letters in the word."
  (if (equal? 1 (length word))
      (begin
        (format t "the character %s is missing from the lexicon\n" word)
        '("unknown" n (((uh n) 1) ((n ou n) 1))))
      (list
       word
       'n
       (apply
        append
        (mapcar
         (lambda (letter)
           (car (cdr (cdr (lex.lookup letter 'n)))))
         (symbolexplode word))))))
A few points are worth noting in this function. It recursively calls the lexical lookup function on the characters in a word. Each letter should appear in the lexicon with its pronunciation (in isolation), but a check is made to ensure we don't recurse forever. The `symbolexplode' function assumes that letters are single bytes, which may not be true for some languages, and that function would need to be replaced for such a language. Note that we append the syllables of each of the letters in the word. For long words this might be too naive, as there could be internal prosodic structure in such a spelling that this method would not allow for. In that case you would want letters to be words, and thus the symbol explosion to happen at the token to word level. Also the above function assumes that the part of speech for letters is `n'. This is only really important where letters are homographs in a language, so this can be used to distinguish which pronunciation you require (cf. `a' in English or `y' in French).
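The same idea can be sketched outside Festival in Python. The function below is a simplified, non-recursive illustration: it concatenates the syllables of each letter's own pronunciation and falls back to `unknown' if any character has no entry. The letter pronunciations are invented for the example and the names are hypothetical.

```python
UNKNOWN_SYLLABLES = [(["uh", "n"], 1), (["n", "ou", "n"], 1)]

# Hypothetical pronunciations of letters spoken in isolation.
LETTERS = {
    "a": [(["ei"], 1)],
    "b": [(["b", "ii"], 1)],
    "c": [(["s", "ii"], 1)],
}


def spell_out(word, letters):
    """Build an entry for word by appending each letter's syllables."""
    syllables = []
    for character in word:      # iterates by character, not by byte
        if character not in letters:
            print(f"the character {character} is missing from the lexicon")
            return ("unknown", "n", UNKNOWN_SYLLABLES)
        syllables.extend(letters[character])
    return (word, "n", syllables)


print(spell_out("cab", LETTERS))
```

Because Python strings are sequences of characters rather than bytes, the single-byte limitation noted for `symbolexplode' does not arise in this sketch; the point about missing prosodic structure in long spellings applies equally.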
For many languages there is a systematic relationship between the written form of a word and its pronunciation. For some languages this can be fairly easy to write down by hand. In Festival there is a letter-to-sound rule system that allows rules to be written. This rule system, described in detail in the Festival manual itself, is what you should use if you are going to write rules by hand. There is also an automatic training method, fully described in the next sections, which produces CART trees; although these are easy to interpret, they are probably unsuitable as a notation for hand specification.
When writing a rule system it is often useful to do it in multiple passes. The Spanish diphone voice distributed as `festvox_ellpc11k.tar.gz' offers a good example of such a use. A set of cascaded LTS rule sets is used to transform the basic word into a fully accented, syllabified string of symbols, which is then converted into the bracketed form used by Festival. The levels are normalization (downcasing and accent normalization), conversion to pronunciation, syllabification, stress and finally identifying weak vowels. Splitting the conversion tasks like this can often make writing the rules much easier, though care should be taken to ensure you don't mix up what you think are letters and what you think are phones.
The LTS rule system is a little primitive and lacks some syntactic sugar (sets etc.) that would make writing rules easier. In their present form you need to be very explicit. Testing your rule set can be done in Festival in isolation (and should be done so, rather than by actual synthesis). The function `lts.apply' allows you to apply an LTS rule set to a word or list of symbols. See the manual and the Spanish example for more details.
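As a rough illustration of cascaded, context-sensitive rewrite rules, here is a small Python sketch. It is not Festival's rule syntax and does not reproduce the rules of the Spanish voice: the rule representation (left context, target, right context, output), the first-match-wins ordering and the toy rules themselves are assumptions chosen to keep the example short.

```python
def apply_rules(symbols, rules):
    """Rewrite symbols left to right; the first matching rule wins."""
    padded = ["#"] + list(symbols) + ["#"]   # '#' marks the word boundary
    end = len(padded) - 1                    # index of the closing boundary
    output, i = [], 1
    while i < end:
        for left, target, right, result in rules:
            start, stop = i - len(left), i + len(target)
            if (start >= 0 and stop <= end
                    and padded[i:stop] == list(target)
                    and padded[start:i] == list(left)
                    and padded[stop:stop + len(right)] == list(right)):
                output.extend(result)
                i = stop
                break
        else:
            raise ValueError(f"no rule matches {padded[i]!r}")
    return output


def cascade(symbols, rule_sets):
    """Feed the output of each rule set into the next one."""
    for rules in rule_sets:
        symbols = apply_rules(symbols, rules)
    return symbols


# Pass 1: normalization (downcasing) for the few letters used here.
NORMALISE = [((), (c,), (), [c.lower()]) for c in "ACEHNOacehno"]

# Pass 2: toy letter-to-phone rules; more specific rules come first.
TO_PHONES = [
    ((), ("c", "h"), (), ["ch"]),
    ((), ("c",), ("e",), ["th"]),
    ((), ("c",), ("i",), ["th"]),
    ((), ("c",), (), ["k"]),
    ((), ("h",), (), []),            # silent letter: emits nothing
    ((), ("a",), (), ["a"]),
    ((), ("e",), (), ["e"]),
    ((), ("o",), (), ["o"]),
    ((), ("n",), (), ["n"]),
]

print(cascade("Cena", [NORMALISE, TO_PHONES]))
```

Keeping the passes separate means that the second rule set only ever sees lower-case letters, which is the kind of simplification the multi-pass approach is meant to buy. It also shows where letters and phones can get mixed up: after the second pass the symbol `ch' is a phone, not a letter sequence.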
For some languages the writing of a rule system is too difficult. Although there have been many valiant attempts to do so for languages like English, life is basically too short to do this. Therefore we also include a method for automatically building LTS rule sets from a lexicon of pronunciations. This technique has successfully been used for English (British and American), French and German. The difficulty and appropriateness of using letter-to-sound rules is very language dependent.
The following outlines the processes involved in building a letter-to-sound model for a language given a large lexicon of pronunciations. This technique is likely to work for most European languages (including Russian) but doesn't seem particularly suitable for languages with very large alphabets, such as Japanese and Chinese. The process described here is not (yet) fully automatic, but the hand intervention required is small and may easily be done even by people with only a very little knowledge of the language being dealt with.
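The process involves the following steps: pre-processing the lexicon into a suitable training set; defining the set of allowable pairings of letters to phones; constructing the probabilities of each letter/phone pair; aligning letters to an equal-length string of phones and `_epsilon_'s; extracting the data by letter in a form suitable for training; building CART models that predict a phone from a letter and its context; and, if necessary, building an additional lexical stress assignment model.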
All except the first two stages of this are fully automatic.
Before building a model it's wise to think a little about what you want it to do. Ideally the model is an auxiliary to the lexicon, so only words not found in the lexicon will require use of the letter-to-sound rules. Thus only unusual forms are likely to require the rules. More precisely, the most common words, which often have the most non-standard pronunciations, should probably always be explicitly listed. It is possible to reduce the size of the lexicon (sometimes drastically) by removing all entries that the trained LTS model correctly predicts.
It is wise to consider removing some entries from the lexicon before training. I typically remove words under 4 letters and, if part of speech information is available, all function words, ideally training only from nouns, verbs and adjectives, as these are the most likely forms to be unknown in text. It is useful to have morphologically inflected and derived forms in the training set, as it is often such variant forms that are not found in the lexicon even though their root morpheme is. Note that in many forms of text, proper names are the most common form of unknown word, and even the technique presented here may not adequately cater for that form of unknown word (especially if the unknown words are non-native names). All of this is to say that the technique may or may not be appropriate for your task, but the rules generated by this learning process have, in the examples we've done, been much better than what we could produce by hand writing rules of the form described in the previous section.
First preprocess the lexicon into a file of lexical entries to be used for training, removing function words and changing the head words to all lower case (this may be language dependent). The entries should be of the form used as input to Festival's lexicon compilation. Specifically the pronunciations should be simple lists of phones (no syllabification). Depending on the language, you may wish to remove the stressing; for the examples here we have, though later tests suggest that we should keep it in even for English. Thus the training set should look something like
("table" nil (t ei b l))
("suspicious" nil (s @ s p i sh @ s))
It is best to split the data into a training set and a test set if you wish to know how well your training has worked. In our tests we remove every tenth entry and put it in a test set. Note this will mean our test results are probably better than if we removed say the last ten in every hundred.
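A minimal Python sketch of this every-tenth split is given below; the function name is hypothetical and the entries can be any list. A plausible reason for the optimistic results, offered here as an assumption, is that in an alphabetically ordered lexicon each held-out entry sits between close relatives (inflections and derivations) that remain in the training set.

```python
def split_every_tenth(entries):
    """Put every tenth entry in the test set and the rest in training."""
    train, test = [], []
    for position, entry in enumerate(entries, start=1):
        if position % 10 == 0:
            test.append(entry)
        else:
            train.append(entry)
    return train, test


train, test = split_every_tenth([f"word{n}" for n in range(1, 21)])
print(len(train), len(test))
print(test)
```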
The second stage is to define the set of allowable letter to phone mappings irrespective of context. This can sometimes be initially done by hand then checked against the training set. Initially construct a file of the form
(require 'lts_build)
(set! allowables
      '((a _epsilon_)
        (b _epsilon_)
        (c _epsilon_)
        ...
        (y _epsilon_)
        (z _epsilon_)
        (# #)))
All letters that appear in the alphabet should (at least) map to `_epsilon_', including any accented characters that appear in that language. Note the last two hashes. These are used to denote the beginning and end of a word and are automatically added during training; they must appear in the list and should only map to themselves.
To incrementally add to this allowable list run festival as
festival allowables.scm
and at the prompt type
festival> (cummulate-pairs "oald.train")
with your train file. This will print out each lexical entry that couldn't be aligned with the current set of allowables. At the start this will be every entry. Looking at these entries add to the allowables to make alignment work. For example if the following word fails
("abate" nil (ah b ey t))
Add `ah' to the allowables for letter `a', `b' to `b', `ey' to `a' and `t' to letter `t'. After doing that restart festival and call `cummulate-pairs' again.
Incrementally add to the allowable pairs until the number of failures
becomes acceptable. Often there are entries for which there is no real
relationship between the letters and the pronunciation such as in
abbreviations and foreign words (e.g. "aaa" as "t r ih p ax l ey"). For
the lexicons I've used the technique on less than 10 per thousand fail
in this way.
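What it means for an entry to be alignable can be sketched in Python as below. This is an illustration of the idea rather than the code in `lts_build.scm': every letter must be paired with exactly one symbol from its allowables, either `_epsilon_' (consuming no phone) or the next phone in the pronunciation. The function name and the dictionary layout are assumptions.

```python
from functools import lru_cache

EPSILON = "_epsilon_"


def can_align(word, phones, allowables):
    """True if every letter can be paired with a phone or _epsilon_."""
    phones = tuple(phones)

    @lru_cache(maxsize=None)
    def align(i, j):
        # i letters and j phones have been consumed so far.
        if i == len(word):
            return j == len(phones)
        for symbol in allowables.get(word[i], ()):
            if symbol == EPSILON:
                if align(i + 1, j):
                    return True
            elif j < len(phones) and phones[j] == symbol:
                if align(i + 1, j + 1):
                    return True
        return False

    return align(0, 0)


initial = {letter: [EPSILON] for letter in "abet"}
grown = {
    "a": [EPSILON, "ah", "ey"],
    "b": [EPSILON, "b"],
    "t": [EPSILON, "t"],
    "e": [EPSILON],
}
print(can_align("abate", ["ah", "b", "ey", "t"], initial))
print(can_align("abate", ["ah", "b", "ey", "t"], grown))
```

With only `_epsilon_' mappings the entry for "abate" cannot be aligned; once the four pairs suggested above are added it can, with the final `e' mapping to `_epsilon_'.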
It is worthwhile being consistent when defining your set of allowables. (At least) two mappings are possible for the letter sequence `ch': having letter `c' go to phone `ch' and letter `h' go to `_epsilon_', or having letter `c' go to `_epsilon_' and letter `h' go to `ch'. However only one should be allowed; we preferred `c' to `ch'.
It may also be the case that some letters give rise to more than one phone. For example the letter `x' in English is often pronounced as the phone combination `k' and `s'. To allow this, use the multiphone `k-s'. Thus the multiphone `k-s' will be predicted for `x' in some contexts, and the model will separate it into two phones while also ignoring any predicted `_epsilon_'s. Note that multiphone units are relatively rare but do occur. In English, letter `x' gives rise to a few: `k-s' in `taxi', `g-s' in `example', and sometimes `g-zh' and `k-sh' in `luxury'. Others are `w-ah' in `one', `t-s' in `pizza', `y-uw' in `new' (British), `ah-m' in `-ism', etc. Three-phone multiphones are much rarer but may exist; they are not supported by this code as is, but such entries should probably be ignored. Note that the `-' sign in the multiphone examples is significant and is used to identify multiphones.
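Turning a predicted symbol string back into phones is then a matter of dropping `_epsilon_' and splitting multiphones on the `-' sign. A minimal Python sketch follows; the function name is hypothetical, and the phones other than `k-s' in the "taxi" example are illustrative.

```python
EPSILON = "_epsilon_"


def expand_prediction(symbols):
    """Drop _epsilon_ and split multiphones such as k-s into phones."""
    phones = []
    for symbol in symbols:
        if symbol == EPSILON:
            continue
        phones.extend(symbol.split("-"))
    return phones


print(expand_prediction(["t", "a", "k-s", "i"]))
print(expand_prediction(["ah", "b", "ey", "t", EPSILON]))
```

This is also why the `-' sign must not appear inside an ordinary phone name: it would be split as if it were a multiphone.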
The allowables for OALD end up being
(set! allowables
      '((a _epsilon_ ei aa a e@ @ oo au o i ou ai uh e)
        (b _epsilon_ b)
        (c _epsilon_ k s ch sh @-k s t-s)
        (d _epsilon_ d dh t jh)
        (e _epsilon_ @ ii e e@ i @@ i@ uu y-uu ou ei aa oi y y-u@ o)
        (f _epsilon_ f v)
        (g _epsilon_ g jh zh th f ng k t)
        (h _epsilon_ h @)
        (i _epsilon_ i@ i @ ii ai @@ y ai-@ aa a)
        (j _epsilon_ h zh jh i y)
        (k _epsilon_ k ch)
        (l _epsilon_ l @-l l-l)
        (m _epsilon_ m @-m n)
        (n _epsilon_ n ng n-y)
        (o _epsilon_ @ ou o oo uu u au oi i @@ e uh w u@ w-uh y-@)
        (p _epsilon_ f p v)
        (q _epsilon_ k)
        (r _epsilon_ r @@ @-r)
        (s _epsilon_ z s sh zh)
        (t _epsilon_ t th sh dh ch d)
        (u _epsilon_ uu @ w @@ u uh y-uu u@ y-u@ y-u i y-uh y-@ e)
        (v _epsilon_ v f)
        (w _epsilon_ w uu v f u)
        (x _epsilon_ k-s g-z sh z k-sh z g-zh)
        (y _epsilon_ i ii i@ ai uh y @ ai-@)
        (z _epsilon_ z t-s s zh)
        (# #)))
Note this is an exhaustive list and (deliberately) says nothing about the contexts or frequency that these letter to phone pairs appear. That information will be generated automatically from the training set.
Once the number of failed matches is sufficiently low, let `cummulate-pairs' run to completion. This counts the number of times each letter/phone pair occurs in allowable alignments.
Next call
festival> (save-table "oald-")
with the name of your lexicon. This changes the cumulation table into probabilities and saves it.
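The step from counts to probabilities can be pictured with the Python sketch below. It assumes that the counts for each letter are normalized so that the probabilities of that letter's symbols sum to one; that is an assumption about the table, not something stated above, and the function name and input format are hypothetical.

```python
from collections import Counter, defaultdict


def pair_probabilities(alignments):
    """Count letter/symbol pairs, then normalize the counts per letter."""
    counts = defaultdict(Counter)
    for word, symbols in alignments:
        for letter, symbol in zip(word, symbols):
            counts[letter][symbol] += 1
    table = {}
    for letter, symbol_counts in counts.items():
        total = sum(symbol_counts.values())
        table[letter] = {s: n / total for s, n in symbol_counts.items()}
    return table


aligned = [
    ("abate", ["ah", "b", "ey", "t", "_epsilon_"]),
    ("abbe", ["ae", "b", "_epsilon_", "iy"]),
]
table = pair_probabilities(aligned)
print(table["b"])
```

In this toy data the letter `b' is realised as the phone `b' twice and as `_epsilon_' once, so the two symbols receive probabilities of two thirds and one third.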
Restart festival loading this new table
festival allowables.scm oald-pl-table.scm
Now each word can be aligned to an equal-length string of phones, `_epsilon_'s and multiphones.
festival> (aligndata "oald.train" "oald.train.align")
Do this also for your test set.
This will produce entries like
aaronson _epsilon_ aa r ah n s ah n
abandon ah b ae n d ah n
abate ah b ey t _epsilon_
abbe ae b _epsilon_ iy
The next stage is to build features suitable for `wagon' to build models. This is done by
festival> (build-feat-file "oald.train.align" "oald.train.feats")
Again the same for the test set.
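The kind of feature vector involved can be sketched in Python: one row per letter, holding the symbol to be predicted followed by a window of letters centred on the current one. The window of four letters on each side and the padding symbols (`#' for the word boundary, `0' beyond it) are assumptions here, chosen to be consistent with the training script in this section, which selects rows by testing field 6 for the current letter. The real feature file may contain further fields.

```python
def feature_rows(word, symbols, width=4):
    """One row per letter: predicted symbol, then the letter window."""
    if len(word) != len(symbols):
        raise ValueError("word and aligned symbols differ in length")
    padded = ["0"] * width + ["#"] + list(word) + ["#"] + ["0"] * width
    rows = []
    for position, symbol in enumerate(symbols):
        centre = position + width + 1        # index of the current letter
        window = padded[centre - width:centre + width + 1]
        rows.append([symbol] + window)
    return rows


for row in feature_rows("abate", ["ah", "b", "ey", "t", "_epsilon_"]):
    print(" ".join(row))
```

Each row is a separate training example, so the model for a given letter sees only that letter's rows and learns to choose among its allowable symbols from the surrounding letters.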
Now you need to construct a description file for `wagon' for the given data. This can be done using the script `make_wgn_desc' provided with the speech tools.
Here is an example script for building the models. You will need to modify it for your particular database, but it shows the basic processes.
for i in a b c d e f g h i j k l m n o p q r s t u v w x y z
do
   # Stop value for wagon
   STOP=2
   echo letter $i STOP $STOP
   # Find training set for letter $i
   cat oald.train.feats | awk '{if ($6 == "'$i'") print $0}' >ltsdataTRAIN.$i.feats
   # split training set to get heldout data for stepwise testing
   traintest ltsdataTRAIN.$i.feats
   # Extract test data for letter $i
   cat oald.test.feats | awk '{if ($6 == "'$i'") print $0}' >ltsdataTEST.$i.feats
   # run wagon to predict model
   wagon -data ltsdataTRAIN.$i.feats.train -test ltsdataTRAIN.$i.feats.test \
         -stepwise -desc ltsOALD.desc -stop $STOP -output lts.$i.tree
   # Test the resulting tree against the test data for letter $i
   wagon_test -heap 2000000 -data ltsdataTEST.$i.feats -desc ltsOALD.desc \
              -tree lts.$i.tree
done
The script `traintest' splits the given file `X' into `X.train' and `X.test' with every tenth line in `X.test' and the rest in `X.train'.
This script can take a significant amount of time to run, about 6 hours on a Sun Ultra 140.
Once the models are created they must be collected together into a single list structure. The trees generated by `wagon' contain full probability distributions at each leaf; at this point that information can be removed, as only the most probable phone will actually be predicted. This substantially reduces the size of the trees.
(merge_models 'oald_lts_rules "oald_lts_rules.scm")
(`merge_models' is defined within `lts_build.scm'.) The given file will contain a `set!' for the given variable name to an assoc list of letter to trained tree. Note that the above function naively assumes that the letters in the alphabet are the 26 lower case letters of the English alphabet; you will need to edit this, adding accented letters if required. Note that adding "'" (single quote) as a letter is a little tricky in Scheme but can be done: the command `(intern "'")' will give you the symbol for single quote.
To test a set of lts models load the saved model and call the following function with the test align file
festival oald-table.scm oald_lts_rules.scm
festival> (lts_testset "oald.test.align" oald_lts_rules)
The result (after showing all the failed ones) will be a table showing the results for each letter, for all letters and for complete words. The failed entries may give some notion of how good or bad the result is: sometimes it will be simple vowel differences (long versus short, schwa versus full vowel), other times it may be whole consonants missing. Remember that the ultimate measure of quality for the letter-to-sound rules is how adequate they are at providing acceptable pronunciations, rather than how good the numeric score is.
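The difference between the per-letter and whole-word figures can be illustrated with a short Python sketch. It scores predicted symbol strings against the aligned reference, position by position; whether Festival's own test counts words on aligned symbols in exactly this way is an assumption, and the function name is hypothetical.

```python
def score(pairs):
    """Return (letter accuracy, word accuracy) for (predicted, actual) pairs."""
    letters = letters_correct = words_correct = 0
    for predicted, actual in pairs:
        letters += len(actual)
        letters_correct += sum(p == a for p, a in zip(predicted, actual))
        words_correct += list(predicted) == list(actual)
    return letters_correct / letters, words_correct / len(pairs)


results = [
    # One vowel wrong: four of five letters correct, word counted as wrong.
    (["ah", "b", "ae", "t", "_epsilon_"], ["ah", "b", "ey", "t", "_epsilon_"]),
    (["ae", "b", "_epsilon_", "iy"], ["ae", "b", "_epsilon_", "iy"]),
]
letter_accuracy, word_accuracy = score(results)
print(round(letter_accuracy, 3), word_accuracy)
```

A single wrong vowel costs one letter but the whole word, which is why word accuracy is always the harsher of the two figures.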
For some languages (e.g. English) it is necessary to also find a stress pattern for unknown words. Ultimately for this to work well you need to know the morphological decomposition of the word. At present we provide a CART-trained system to predict stress patterns for English. It does get 94.6% correct for an unseen test set but that isn't really very good. Later tests suggest that predicting stressed and unstressed phones directly is actually better for getting whole words correct, even though the models do slightly worse on a per-phone basis (see black98b).
As the lexicon may be a large part of the system, we have also experimented with removing entries from the lexicon if the letter-to-sound rule system (and stress assignment system) can correctly predict them. For OALD this allows us to halve the size of the lexicon; it could possibly allow more if a certain amount of fuzzy acceptance was allowed (e.g. with schwa). For other languages the gain here can be very significant: for German and French we can reduce the lexicon by over 90%. The function `reduce_lexicon' in `festival/lib/lts_build.scm' was used to do this. The use of the above technique as a dictionary compression method is discussed in pagel98. A morphological decomposition algorithm, like that described in black91, may help even more.
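The pruning idea itself is simple and can be sketched in a few lines of Python. This is not the `reduce_lexicon' function: the names are hypothetical, the predictor is a deliberately crude stand-in that maps each letter to a phone of the same name, and the pronunciations are illustrative.

```python
def prune_lexicon(lexicon, predict):
    """Keep only the entries the predictor gets wrong."""
    return {word: phones for word, phones in lexicon.items()
            if predict(word) != phones}


def toy_predict(word):
    """Stand-in for trained LTS rules: one phone per letter."""
    return list(word)


lexicon = {
    "bat": ["b", "a", "t"],          # predictable, can be dropped
    "one": ["w", "uh", "n"],         # irregular, must stay listed
}
print(prune_lexicon(lexicon, toy_predict))
```

Exact comparison is the strictest choice; the fuzzy acceptance mentioned above would correspond to replacing the `!=' test with a looser comparison.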
The technique described in this section and its relative merits with respect to a number of languages/lexicons and tasks is discussed more fully in black98b.
In fluent speech word boundaries are often degraded in a way that causes co-articulation across boundaries. A lexical entry should normally provide pronunciations as if the word is being spoken in isolation. It is only once the word has been inserted into the context in which it is going to be spoken that co-articulatory effects can be applied.
Post lexical rules are a general set of rules which can modify the segment relation (or any other part of the utterance for that matter), after the basic pronunciations have been found. In Festival post-lexical rules are defined as functions which will be applied to the utterance after intonational accents have been assigned.
For example in British English word final /r/ is only produced when the following word starts with a vowel. Thus all other word final /r/s need to be deleted. A Scheme function that implements this is as follows
(define (plr_rp_final_r utt)
  (mapcar
   (lambda (s)
     (if (and (string-equal "r" (item.name s))  ;; this is an r
              ;; it is syllable final
              (string-equal "1" (item.feat s "syl_final"))
              ;; the syllable is word final
              (not (string-equal "0"
                    (item.feat s "R:SylStructure.parent.syl_break")))
              ;; The next segment is not a vowel
              (string-equal "-" (item.feat s "n.ph_vc")))
         (item.delete s)))
   (utt.relation.items utt 'Segment)))
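Stripped of the utterance structure, the rule amounts to the following Python sketch, which works on a plain list of words, each a list of phones. It is a simplification for illustration: the vowel set and the example pronunciations are assumptions, and an utterance-final /r/ is treated as having no following vowel.

```python
# Hypothetical vowel inventory for the example phone set.
VOWELS = {"a", "e", "i", "o", "u", "@", "ii", "aa", "oo", "uu", "ei", "ai"}


def drop_final_r(words):
    """Delete word-final r unless the next word starts with a vowel."""
    result = []
    for index, phones in enumerate(words):
        phones = list(phones)
        following = words[index + 1] if index + 1 < len(words) else []
        next_is_vowel = bool(following) and following[0] in VOWELS
        if phones and phones[-1] == "r" and not next_is_vowel:
            phones.pop()
        result.append(phones)
    return result


print(drop_final_r([["f", "oo", "r"], ["a", "p", "l", "z"]]))
print(drop_final_r([["f", "oo", "r"], ["d", "ei", "z"]]))
```

The rule can only be applied once the words are in sequence, which is why it belongs after lexical lookup rather than in the lexicon itself.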
In English we also use post-lexical rules for phenomena such as vowel reduction and schwa deletion in the possessive `'s'.