Building letter-to-sound rules automatically
For some languages the writing of a rule system is too difficult. Although there have been many valiant attempts to do so for languages like English, life is basically too short to do this. Therefore we also include a method for automatically building LTS rule sets from a lexicon of pronunciations. This technique has been used successfully for English (British and American), French and German. The difficulty and appropriateness of using letter-to-sound rules is very language dependent.
The following outlines the processes involved in building a letter-to-sound model for a language given a large lexicon of pronunciations. This technique is likely to work for most European languages (including Russian) but doesn't seem particularly suitable for languages with very large character sets like Japanese and Chinese. The process described here is not (yet) fully automatic, but the hand intervention required is small and may easily be done even by people with only a little knowledge of the language being dealt with.
The process involves the following steps:
Pre-processing lexicon into suitable training set
Defining the set of allowable pairings of letters to phones. (We intend to do this fully automatically in future versions.)
Constructing the probabilities of each letter/phone pair.
Aligning letters to an equal-length set of phones/_epsilons_.
Extracting the data by letter suitable for training.
Building CART models for predicting phone from letters (and context).
Building additional lexical stress assignment model (if necessary).
Before building a model it is wise to think a little about what you want it to do. Ideally the model is an auxiliary to the lexicon, so only words not found in the lexicon will require use of the letter-to-sound rules. Thus only unusual forms are likely to require the rules. More precisely, the most common words, which often have the most non-standard pronunciations, should probably always be listed explicitly. It is possible to reduce the size of the lexicon (sometimes drastically) by removing all entries that the trained LTS model correctly predicts.
Before starting it is wise to consider removing some entries from the lexicon before training. I typically remove words under 4 letters and, if part of speech information is available, all function words, ideally training only from nouns, verbs and adjectives, as these are the forms most likely to be unknown in text. It is useful to have morphologically inflected and derived forms in the training set, as it is often such variant forms that are not found in the lexicon even though their root morpheme is. Note that in many forms of text proper names are the most common form of unknown word, and even the technique presented here may not adequately cater for that form of unknown word (especially if the unknown words are non-native names). All this is to say that this may or may not be appropriate for your task, but the rules generated by this learning process have, in the examples we've done, been much better than what we could produce by hand writing rules of the form described in the previous section.
First preprocess the lexicon into a file of lexical entries to be used for training, removing function words and changing the head words to all lower case (this may be language dependent). The entries should be of the form used for input for Festival's lexicon compilation. Specifically the pronunciations should be simple lists of phones (no syllabification). Depending on the language, you may wish to remove the stressing---for the examples here we have removed it, though later tests suggest that we should keep it in, even for English. Thus the training set should look something like
("table" nil (t ei b l))
("suspicious" nil (s @ s p i sh @ s))

It is best to split the data into a training set and a test set if you wish to know how well your training has worked. In our tests we remove every tenth entry and put it in a test set. Note this will mean our test results are probably better than if we removed, say, the last ten in every hundred.
The second stage is to define the set of allowable letter to phone mappings irrespective of context. This can sometimes be initially done by hand then checked against the training set. Initially construct a file of the form
(require 'lts_build)
(set! allowables
      '((a _epsilon_)
        (b _epsilon_)
        (c _epsilon_)
        ...
        (y _epsilon_)
        (z _epsilon_)
        (# #)))

All letters that appear in the alphabet should (at least) map to _epsilon_, including any accented characters that appear in that language. Note the last two hashes. These are used to denote the beginning and end of word and are automatically added during training; they must appear in the list and should only map to themselves.

To incrementally add to this allowable list run festival as
festival allowables.scm

and at the prompt type

festival> (cummulate-pairs "oald.train")

with your train file. This will print out each lexical entry that couldn't be aligned with the current set of allowables. At the start this will be every entry. Looking at these entries, add to the allowables to make alignment work. For example, if the following word fails

("abate" nil (ah b ey t))

Add ah to the allowables for letter a, b to b, ey to a and t to letter t. After doing that restart festival and call cummulate-pairs again.
Incrementally add to the allowable pairs until the number of failures
becomes acceptable. Often there are entries for which there is no real
relationship between the letters and the pronunciation such as in
abbreviations and foreign words (e.g. "aaa" as "t r ih p ax l ey"). For
the lexicons I've used the technique on less than 10 per thousand fail
in this way. It is worthwhile being consistent when defining your set of allowables.
(At least) two mappings are possible for the letter sequence ch---having letter c go to phone ch and letter h go to _epsilon_, or letter c go to phone _epsilon_ and letter h go to ch. However only one should be allowed; we preferred c to go to ch.
It may also be the case that some letters give rise to more than one phone. For example the letter x in English is often pronounced as the phone combination k and s. To allow this, use the multiphone k-s. Thus the multiphone k-s will be predicted for x in some contexts and the model will separate it into two phones, while also ignoring any predicted _epsilons_. Note that multiphone units are relatively rare but do occur. In English, letter x gives rise to a few: k-s in taxi, g-z in example, and sometimes g-zh and k-sh in luxury. Others are w-ah in one, t-s in pizza, y-uw in new (British), ah-m in -ism, etc. Three-phone multiphones are much rarer but may exist; they are not supported by this code as is, but such entries should probably be ignored. Note the - sign in the multiphone examples is significant and is used to identify multiphones.
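To make the role of multiphones and epsilons concrete, here is a small conceptual sketch (plain Python, not Festival code) of how a per-letter prediction sequence would be expanded back into a phone string: multiphones are split on the - sign and _epsilon_ predictions are dropped.

# Conceptual sketch (not Festival code): expand per-letter predictions
# into a phone string; multiphones split on "-", epsilons dropped.

def expand_predictions(symbols):
    phones = []
    for s in symbols:
        if s == "_epsilon_":          # this letter contributes no phone
            continue
        phones.extend(s.split("-"))   # "k-s" -> ["k", "s"]
    return phones

# e.g. hypothetical predictions for the letters of "taxi"
print(expand_predictions(["t", "a", "k-s", "ii"]))   # ['t', 'a', 'k', 's', 'ii']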
The allowables for OALD end up being
(set! allowables
'
((a _epsilon_ ei aa a e@ @ oo au o i ou ai uh e)
(b _epsilon_ b )
(c _epsilon_ k s ch sh @-k s t-s)
(d _epsilon_ d dh t jh)
(e _epsilon_ @ ii e e@ i @@ i@ uu y-uu ou ei aa oi y y-u@ o)
(f _epsilon_ f v )
(g _epsilon_ g jh zh th f ng k t)
(h _epsilon_ h @ )
(i _epsilon_ i@ i @ ii ai @@ y ai-@ aa a)
(j _epsilon_ h zh jh i y )
(k _epsilon_ k ch )
(l _epsilon_ l @-l l-l)
(m _epsilon_ m @-m n)
(n _epsilon_ n ng n-y )
(o _epsilon_ @ ou o oo uu u au oi i @@ e uh w u@ w-uh y-@)
(p _epsilon_ f p v )
(q _epsilon_ k )
(r _epsilon_ r @@ @-r)
(s _epsilon_ z s sh zh )
(t _epsilon_ t th sh dh ch d )
(u _epsilon_ uu @ w @@ u uh y-uu u@ y-u@ y-u i y-uh y-@ e)
(v _epsilon_ v f )
(w _epsilon_ w uu v f u)
(x _epsilon_ k-s g-z sh z k-sh z g-zh )
(y _epsilon_ i ii i@ ai uh y @ ai-@)
(z _epsilon_ z t-s s zh )
(# #)
))

Note this is an exhaustive list and (deliberately) says nothing about the contexts or frequency with which these letter to phone pairs appear. That information will be generated automatically from the training set.
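To see concretely what it means for an entry to fail to align, the check can be thought of as asking whether the pronunciation can be produced at all by walking the letters left to right and drawing each phone (or _epsilon_) from that letter's allowables. A rough Python sketch of that idea (not the cummulate-pairs implementation; the small allowables table is truncated for illustration):

# Conceptual sketch (not Festival code): check whether a pronunciation
# can be aligned at all to the letters of a word, given a table of
# allowable letter -> phone mappings (including _epsilon_).

EPS = "_epsilon_"

def can_align(letters, phones, allowables):
    if not letters:
        return not phones
    allowed = allowables.get(letters[0], [])
    if EPS in allowed and can_align(letters[1:], phones, allowables):
        return True
    return bool(phones) and phones[0] in allowed and \
        can_align(letters[1:], phones[1:], allowables)

allowables = {"a": [EPS, "ah", "ey"], "b": [EPS, "b"],
              "t": [EPS, "t"], "e": [EPS]}
print(can_align(list("abate"), ["ah", "b", "ey", "t"], allowables))  # True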
Once the number of failed matches is sufficiently low, let cummulate-pairs run to completion. This counts the number of times each letter/phone pair occurs in allowable alignments. Then call

festival> (save-table "oald-")

with the name of your lexicon as a prefix. This changes the cumulation table into probabilities and saves it.
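Conceptually the saved table holds, for each letter, the probability of each of its allowable phones, derived from the accumulated counts. A minimal sketch of that normalization step (illustrative Python, not the actual save-table code; the counts are made up):

# Conceptual sketch (not Festival code): turn accumulated letter/phone
# pair counts into conditional probabilities P(phone | letter).

from collections import defaultdict

def normalise(pair_counts):
    """pair_counts maps (letter, phone) -> count."""
    totals = defaultdict(float)
    for (letter, phone), count in pair_counts.items():
        totals[letter] += count
    return {(l, p): c / totals[l] for (l, p), c in pair_counts.items()}

# made-up counts for letter c
probs = normalise({("c", "k"): 900, ("c", "s"): 250, ("c", "_epsilon_"): 50})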
Restart festival loading this new table

festival allowables.scm oald-pl-table.scm

Now each word can be aligned to an equal-length string of phones, epsilons and multiphones.

festival> (aligndata "oald.train" "oald.train.align")

Do this also for your test set.
This will produce entries like
aaronson _epsilon_ aa r ah n s ah n
abandon ah b ae n d ah n
abate ah b ey t _epsilon_
abbe ae b _epsilon_ iy
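For readers curious about what the alignment step amounts to, here is a conceptual sketch (illustrative Python, not the aligndata implementation): scatter _epsilon_ over the phone string so letters and phones come out the same length, score each candidate by the product of its letter/phone pair probabilities, and keep the best one. Multiphones are ignored here for brevity and the probability table is made up.

# Conceptual sketch (not Festival code): align letters to phones by
# inserting _epsilon_ so both sequences have equal length, choosing the
# alignment with the highest product of letter/phone pair probabilities.
# Multiphones are ignored for brevity; the probabilities are made up.

EPS = "_epsilon_"

def best_alignment(letters, phones, prob):
    """Return (score, phone list padded with _epsilon_) or (0.0, None)."""
    if not letters:
        return (1.0, []) if not phones else (0.0, None)
    best = (0.0, None)
    for p in [EPS] + phones[:1]:               # consume no phone, or the next one
        pair_p = prob.get((letters[0], p), 0.0)
        if pair_p == 0.0:
            continue
        rest = phones if p == EPS else phones[1:]
        score, tail = best_alignment(letters[1:], rest, prob)
        if tail is not None and pair_p * score > best[0]:
            best = (pair_p * score, [p] + tail)
    return best

prob = {("a", "ah"): 0.2, ("a", "ey"): 0.3, ("b", "b"): 0.9,
        ("t", "t"): 0.9, ("e", EPS): 0.5}
print(best_alignment(list("abate"), ["ah", "b", "ey", "t"], prob))
# -> gives "abate" the alignment ah b ey t _epsilon_, as in the example above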
The next stage is to build features suitable for wagon to build models. This is done by
festival> (build-feat-file "oald.train.align" "oald.train.feats")

Again, do the same for the test set.
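Each line of the feature file pairs an aligned phone with its letter and some window of surrounding letters, which is what wagon then learns to predict from. The exact fields build-feat-file writes may differ in number and order; a rough sketch of the idea (illustrative Python):

# Conceptual sketch (not Festival code): one training vector per letter,
# the aligned phone plus a window of surrounding letters.  The real
# feature file may use a different number and order of fields.

def letter_features(word, aligned_symbols, window=3):
    padded = ["#"] * window + list(word) + ["#"] * window
    rows = []
    for i, symbol in enumerate(aligned_symbols):
        context = padded[i:i + 2 * window + 1]   # letters around position i
        rows.append([symbol] + context)
    return rows

for row in letter_features("abate", ["ah", "b", "ey", "t", "_epsilon_"]):
    print(" ".join(row))
# first line: "ah # # # a b a t"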
Now you need to construct a description file for wagon for the given data. This can be done using the script make_wgn_desc provided with the speech tools.
Here is an example script for building the models; you will need to modify it for your particular database, but it shows the basic process.
for i in a b c d e f g h i j k l m n o p q r s t u v w x y z
do
# Stop value for wagon
STOP=2
echo letter $i STOP $STOP
# Find training set for letter $i
cat oald.train.feats |
awk '{if ($6 == "'$i'") print $0}' >ltsdataTRAIN.$i.feats
# split training set to get heldout data for stepwise testing
traintest ltsdataTRAIN.$i.feats
# Extract test data for letter $i
cat oald.test.feats |
awk '{if ($6 == "'$i'") print $0}' >ltsdataTEST.$i.feats
# run wagon to predict model
wagon -data ltsdataTRAIN.$i.feats.train -test ltsdataTRAIN.$i.feats.test \
-stepwise -desc ltsOALD.desc -stop $STOP -output lts.$i.tree
# Test the resulting tree against the test data for letter $i
wagon_test -heap 2000000 -data ltsdataTEST.$i.feats -desc ltsOALD.desc \
-tree lts.$i.tree
done

The script traintest splits the given file X into X.train and X.test with every tenth line in X.test and the rest in X.train.
This script can take a significant amount of time to run, about 6 hours on a Sun Ultra 140.
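If you do not have the traintest helper to hand, its behaviour as described above (every tenth line to X.test, the rest to X.train) is easy to reproduce; a minimal sketch in Python, kept consistent with the other sketches in this section:

# Conceptual sketch of the traintest behaviour described above:
# every tenth line of X goes to X.test, the rest to X.train.

import sys

def traintest(path, nth=10):
    with open(path) as src, \
         open(path + ".train", "w") as train, \
         open(path + ".test", "w") as test:
        for i, line in enumerate(src):
            (test if i % nth == nth - 1 else train).write(line)

if __name__ == "__main__":
    traintest(sys.argv[1])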
Once the models are created they must be collected together into a single list structure. The trees generated by wagon contain full probability distributions at each leaf; at this point this information can be removed, as only the most probable phone will actually be predicted. This substantially reduces the size of the trees.
(merge_models 'oald_lts_rules "oald_lts_rules.scm" allowables)

(merge_models is defined within lts_build.scm.) The given file will contain a set! for the given variable name to an assoc list of letter to trained tree. Note the above function naively assumes that the letters in the alphabet are the 26 lower case letters of the English alphabet; you will need to edit it, adding accented letters if required. Note that adding "'" (single quote) as a letter is a little tricky in scheme but can be done---the command (intern "'") will give you the symbol for single quote.

To test a set of lts models, load the saved model and call the following function with the test align file
festival oald-table.scm oald_lts_rules.scm
festival> (lts_testset "oald.test.align" oald_lts_rules)

The result (after showing all the failed ones) will be a table showing the results for each letter, for all letters, and for complete words. The failed entries may give some notion of how good or bad the result is; sometimes it will be simple vowel differences, long versus short, schwa versus full vowel, other times whole consonants may be missing. Remember, the ultimate measure of the quality of the letter-to-sound rules is how adequate they are at providing acceptable pronunciations rather than how good the numeric score is.
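The summary lts_testset prints can be thought of as per-symbol and whole-word accuracy over the aligned test entries; a rough sketch of that scoring (illustrative Python, not the actual implementation; the predict function stands in for whatever model is being tested):

# Conceptual sketch (not Festival code): score predictions against the
# aligned test data, per symbol and per whole word.

def score(test_entries, predict):
    """test_entries: (word, reference symbol list) pairs; predict(word) -> symbols."""
    sym_ok = sym_all = word_ok = 0
    for word, ref in test_entries:
        hyp = predict(word)
        sym_all += len(ref)
        sym_ok += sum(1 for r, h in zip(ref, hyp) if r == h)
        word_ok += (hyp == ref)
    return sym_ok / sym_all, word_ok / len(test_entries)

# symbol_acc, word_acc = score(test_entries, my_lts_model)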
For some languages (e.g. English) it is necessary to also find a stress pattern for unknown words. Ultimately for this to work well you need to know the morphological decomposition of the word. At present we provide a CART-trained system to predict stress patterns for English. It does get 94.6% correct for an unseen test set, but that isn't really very good. Later tests suggest that predicting stressed and unstressed phones directly is actually better for getting whole words correct, even though the models do slightly worse on a per-phone basis [black98b].
As the lexicon may be a large part of the system we have also experimented with removing entries from the lexicon if the letter-to-sound rules (and stress assignment system) can correctly predict them. For OALD this allows us to halve the size of the lexicon; it could possibly allow more if a certain amount of fuzzy acceptance was allowed (e.g. with schwa). For other languages the gain here can be very significant: for German and French we can reduce the lexicon by over 90%. The function reduce_lexicon in festival/lib/lts_build.scm was used to do this. The use of the above technique as a dictionary compression method is discussed in [pagel98]. A morphological decomposition algorithm, like that described in [black91], may help even more.
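The idea behind reduce_lexicon is simply to keep only the entries the rules get wrong; a rough sketch of that idea (illustrative Python, not the Festival function):

# Conceptual sketch (not Festival code): drop lexicon entries that the
# trained letter-to-sound (and stress) rules already predict correctly.

def reduce_lexicon(entries, predict):
    """entries: (word, pos, phones) triples; predict(word) -> phones."""
    return [entry for entry in entries if predict(entry[0]) != entry[2]]

# reduced = reduce_lexicon(lexicon_entries, lts_predict)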
The technique described in this section and its relative merits with respect to a number of languages/lexicons and tasks is discussed more fully in [black98b].