Once an alignment is found, we can train a phone prediction model. In our work we have used decision tree technology [3], as we feel it is simple and produces compact models, and we do not believe other learning techniques would produce significantly better results.
For each letter in the alphabet of the language, we trained a CART tree, given the letter context (three letters either side), to predict epsilon, a phone, or a double phone from the aligned data. One could build a single tree without any significant difference in accuracy, but building separate trees is faster and allows for parallelization.
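As an illustration only, the following Python sketch shows one way to realize this per-letter training loop, using scikit-learn's DecisionTreeClassifier as a stand-in for a CART builder (its numeric splits over ordinally encoded letters only approximate true categorical CART questions); the function names, the ``#'' padding symbol, and the stop mapping are our own assumptions, not the paper's implementation.

```python
# Illustrative sketch: per-letter decision trees over a window of three
# letters either side, predicting "_epsilon_", a phone, or a double
# phone for each letter of an aligned lexicon entry.
from collections import defaultdict
from sklearn.preprocessing import OrdinalEncoder
from sklearn.tree import DecisionTreeClassifier

WINDOW = 3  # letters of context on each side, as described in the text

def contexts(word):
    """Yield, for each letter, its six-letter context, padded with '#'."""
    padded = "#" * WINDOW + word + "#" * WINDOW
    for i, letter in enumerate(word):
        j = i + WINDOW
        yield letter, list(padded[j - WINDOW:j]) + list(padded[j + 1:j + WINDOW + 1])

def train_letter_trees(aligned, stop=1):
    """aligned: (word, symbols) pairs with len(word) == len(symbols).
    stop is mapped onto min_samples_split, a rough analogue of the CART
    "stop" criterion discussed later in this section."""
    per_letter = defaultdict(lambda: ([], []))
    for word, symbols in aligned:
        for (letter, ctx), sym in zip(contexts(word), symbols):
            per_letter[letter][0].append(ctx)
            per_letter[letter][1].append(sym)
    trees = {}
    for letter, (X, y) in per_letter.items():
        enc = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
        tree = DecisionTreeClassifier(min_samples_split=max(stop, 2))
        tree.fit(enc.fit_transform(X), y)
        trees[letter] = (enc, tree)
    return trees
```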
We split the data into training and test sets by removing every tenth word from the lexicon for testing. Because the lexicon contains only one occurrence of each word, word frequency is ignored. Another factor is that, as these lexicons usually contain many morphological variants, a similar word or words will likely appear in the training set.
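A minimal sketch of this split, assuming the lexicon is held as an ordered list of entries:

```python
# Hold out every tenth entry of the lexicon for testing (illustrative).
def split_lexicon(entries):
    train = [e for i, e in enumerate(entries) if i % 10 != 9]
    test = [e for i, e in enumerate(entries) if i % 10 == 9]
    return train, test
```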
We removed short words (under four letters) from the training and test sets, as these are typically function words, which in general may have non-standard pronunciations, or abbreviations (e.g. ``aaa'' as /t r ih p ah l ey/), whose letters bear little or no relationship to their pronunciation. Also, where part-of-speech information was available, we removed all non-content words. The reasoning is that unknown words are typically not the most common words, and in general unknown words will have more standard pronunciations rather than idiosyncratic ones.
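A sketch of this filtering; the part-of-speech tag set here is purely illustrative, not the one actually used:

```python
# Illustrative filter: drop short words, and drop non-content words
# when part-of-speech information is available.
CONTENT_TAGS = {"noun", "verb", "adjective", "adverb"}  # assumed tag names

def keep_entry(word, pos=None):
    if len(word) < 4:        # short words: function words, abbreviations
        return False
    if pos is not None and pos not in CONTENT_TAGS:
        return False         # non-content word
    return True
```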
We have so far tried this technique on four lexicons: the Oxford Advanced Learner's Dictionary of Contemporary English (OALD) (British English) [10], CMUDICT (US English) [4], BRULEX (French) [5], and the German CELEX lexicon [1].
Lexicon   | Letters correct | Words correct
----------|-----------------|--------------
OALD      | 95.80%          | 74.56%
CMUDICT   | 91.99%          | 57.80%
BRULEX    | 99.00%          | 93.03%
DE-CELEX  | 98.79%          | 89.38%
The above results are the best achieved after testing various parameters in the CART building process. In particular, we varied the ``stop'' value, which specifies the minimum number of examples required in the training set before a question is hypothesized to distinguish the group. Normally, the smaller the stop value, the more over-trained the model may become. However, the following table shows the results for OALD, tested on held-out data, as the stop value changes:
Stop | Letters correct | Words correct | Size
-----|-----------------|---------------|------
8    | 92.89%          | 59.63%        |  9884
6    | 93.41%          | 61.65%        | 12782
5    | 93.70%          | 63.15%        | 14968
4    | 94.06%          | 65.17%        | 17948
3    | 94.36%          | 67.19%        | 22912
2    | 94.86%          | 69.36%        | 30368
1    | 95.80%          | 74.56%        | 39500
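To reproduce such a sweep, one could retrain at each stop value and score on held-out data. The sketch below reuses contexts and train_letter_trees from the earlier example; the scoring function is our own simplification (it assumes every letter in the test data also occurs in training, and counts a word as correct only when all of its letter symbols are predicted correctly).

```python
# Illustrative sweep over stop values on held-out data.
def letter_word_accuracy(trees, aligned_test):
    letters_ok = letters_total = words_ok = 0
    for word, symbols in aligned_test:
        preds = []
        for (letter, ctx), _ in zip(contexts(word), symbols):
            enc, tree = trees[letter]   # assumes letter seen in training
            preds.append(tree.predict(enc.transform([ctx]))[0])
        letters_ok += sum(p == s for p, s in zip(preds, symbols))
        letters_total += len(symbols)
        words_ok += preds == list(symbols)   # all letters must be right
    return letters_ok / letters_total, words_ok / len(aligned_test)

def sweep_stop(aligned_train, aligned_test, stops=(8, 6, 5, 4, 3, 2, 1)):
    for stop in stops:
        trees = train_letter_trees(aligned_train, stop=stop)
        letters, words = letter_word_accuracy(trees, aligned_test)
        print(f"stop={stop}: letters {letters:.2%}, words {words:.2%}")
```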
Note that comparisons with other LTS training techniques are not straightforward: when the train/test sets differ, or when the domains differ, no direct comparison is possible. For example, if we remove proper names from OALD and train and test on the remainder, our word-correct score rises to 80%. However, the above results compare favorably with those of other systems using similar data sets (e.g. [8]).