Evaluation and Improvements

This chapter discusses evaluation of speech synthesis voices and provides a detailed procedure to allow diagnostic testing of new voices.


Now that you have built your voice, how can you tell if it works, and how can you find out what you need to do to make it better? This chapter deals with some issues of evaluating a voice in Festival. Some of the points here also apply to testing and improving existing voices.

The evaluation of speech synthesis is notoriously hard. In speech recognition, evaluation was the major factor in making general recognition work: rigorous tests on well-defined data made it possible to compare different techniques. Yet in spite of that success, strict evaluation criteria such as those used in speech recognition can cloud the ultimate goal. It is important always to remember that tests are there to evaluate a system's performance rather than to become the task itself. Just as techniques can overtrain on data, it is possible to overtrain on the test data and/or methodology too, thus losing the generality and purpose of the evaluation.

In speech recognition a simple (though naive) measure of phones or words correct gives a reasonable indicator of how well a system works. In synthesis this is a lot harder. A word can have multiple pronunciations, so it is much harder to test a synthesizer's phoneme accuracy automatically. Besides, much of the quality lies not just in whether the output is correct but in whether it "sounds good". This is effectively the crux of the matter. The only real synthesis evaluation technique is having a human listen to the result. Humans individually are not very reliable testers of systems, but humans in general are. However, it is usually not feasible to have testers listen to large amounts of synthetic speech and return a general goodness score. More specific tests are required.
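
To make the recognition-side measure concrete, the following is a minimal sketch of a "words correct" score computed by word-level edit distance. The function names word_errors and word_accuracy are illustrative and are not part of Festival or of any recognition toolkit. The sketch also shows why the measure does not carry over to synthesis: it needs a single reference string to compare against, and it says nothing about how the output sounds.

```python
def word_errors(reference, hypothesis):
    """Return the minimum number of word substitutions, insertions
    and deletions needed to turn the reference into the hypothesis.

    Illustrative helper, assumed for this sketch only.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    # prev[j] is the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # reference word deleted
                            curr[j - 1] + 1,      # extra hypothesis word
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def word_accuracy(reference, hypothesis):
    """Return 1 - (errors / reference length); can be negative when
    the hypothesis contains many inserted words."""
    n = len(reference.split())
    if n == 0:
        raise ValueError("reference must contain at least one word")
    return 1.0 - word_errors(reference, hypothesis) / n


if __name__ == "__main__":
    ref = "the cat sat on the mat"
    hyp = "the cat sat on a mat"
    print(word_errors(ref, hyp), word_accuracy(ref, hyp))
```

The same routine works on phone strings if each phone is written as a space-separated token.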

Although listening tests are the ultimate measure, they are expensive in resources (undergraduates are not willing to listen to bad synthesis all day for free), and designing them is a non-trivial task. There are, however, a number of more general tests which can be run at less expense and which can help greatly.

It is common for a new voice in Festival (or any other speech synthesis system) to have limitations, and it is wise to test what those limitations are and decide whether they are acceptable. This depends a lot on what you wish to use your voice for. For example, if the voice is a Scottish English voice to be used primarily as the output of a Chinese speech translation system, the vocabulary is constrained by the translation system itself, so a large lexicon is probably not much of an issue. The vocabulary will, however, include many anglicized (caledonianized?) versions of Chinese names, which are not common in standard English, so the letter-to-sound rules should be tuned for that input. If the system is to be used to read address lists, it should be able to tokenize names and addresses appropriately, and if it is to be used in a dialogue system, the intonation model should be able to deal with questions and continuations properly. Optimizing your voice for its most common task, and minimizing the errors there, is what evaluation is for.