Checking and correcting diphones

This probably sounds like we're repeating ourselves here, and we are, because it matters a great deal for the overall quality of the voice: once you have the basic diphone database working, it is worthwhile testing it systematically, as mistakes are common. These may be mislabelings or mispronunciations of the phones themselves. Two strategies are possible for testing, both of which have their advantages. The first is a simple exhaustive synthesis of all diphones. Ideally, the diphone prompts are exactly the set of utterances that test each and every diphone. Using the SayPhones function you can synthesize and listen to each prompt. In fact, for a first pass it may even be useful to synthesize each nonsense word without listening, as some problems (missing files, missing diphones, badly extracted pitchmarks) will show up without you having to listen at all.
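Some of the "without listening" checks can also be done outside Festival. The following is a minimal Python sketch, not part of the distributed tools. It assumes the diphone index is a text file whose header ends with an EST_Header_End line and whose entries have five fields (diphone name, file id, start, mid and end times); check that against your own index before relying on it. The function names are our own invention for illustration.

```python
def parse_diphone_index(text):
    """Return {diphone: (file_id, start, mid, end)} from index text.

    Assumed layout: header lines, then a line "EST_Header_End",
    then one entry per line: name file_id start mid end.
    """
    entries = {}
    in_body = False
    for line in text.splitlines():
        line = line.strip()
        if not in_body:
            if line == "EST_Header_End":
                in_body = True
            continue
        fields = line.split()
        if len(fields) != 5:
            continue  # skip blank or malformed lines
        try:
            start, mid, end = (float(x) for x in fields[2:])
        except ValueError:
            continue  # non-numeric times: treat as malformed
        entries[fields[0]] = (fields[1], start, mid, end)
    return entries


def check_index(entries, wanted_diphones, available_file_ids):
    """List (diphone, problem) pairs for everything that looks wrong."""
    problems = []
    for name in sorted(wanted_diphones):
        if name not in entries:
            problems.append((name, "missing from index"))
            continue
        file_id, start, mid, end = entries[name]
        if file_id not in available_file_ids:
            problems.append((name, "no waveform for " + file_id))
        elif not (0.0 <= start < mid < end):
            problems.append((name, "boundaries out of order"))
    return problems
```

Here wanted_diphones would be the set of diphone names your prompt list is supposed to cover, and available_file_ids the file ids for which you actually have waveform (and pitchmark) files. An empty result does not mean the diphones sound right, only that the bookkeeping is consistent.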

When a problem occurs, trace back why: check the entry in the diphone index, then check the label for the nonsense word, then check how that label matches the actual waveform file itself (display the waveform with the label file and spectrogram to see if the label is correct).

Listing all the problems that could occur is impossible. What you need to do is break the problem down and find out where it might be occurring. If you just get apparent garbage being synthesized, take a look at the synthesized waveform:

(set! utt1 (SayPhones '(pau hh ah l ow pau)))
(utt.save.wave utt1 "hello.wav")

Is it garbage, or can you recognize any part of it? It could be a byte-swap problem or a format problem with your files. Can your nonsense word files be played and displayed as is? Can your LPC residual files be played and displayed? Residual files should look like very low-powered waveform files and sound very buzzy when played, but almost recognizable if you know what is being said (sort of like Kenny from South Park).
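If you suspect a byte-swap problem, a rough check can be sketched in Python. This assumes headerless 16-bit PCM data (files with a proper header declare their own byte order), and it relies on a heuristic of ours rather than anything Festival does: speech decoded with the right byte order changes smoothly from sample to sample, while the wrong byte order looks like loud noise.

```python
import struct


def roughness(samples):
    """Mean absolute sample-to-sample jump, relative to the mean level."""
    if len(samples) < 2:
        return 0.0
    jumps = sum(abs(b - a) for a, b in zip(samples, samples[1:]))
    jump = jumps / (len(samples) - 1)
    level = sum(abs(s) for s in samples) / len(samples)
    return jump / level if level else 0.0


def decode_16bit(raw, byte_order):
    """Decode raw bytes as signed 16-bit samples; byte_order is '<' or '>'."""
    count = len(raw) // 2
    return struct.unpack(byte_order + str(count) + "h", raw[:count * 2])


def guess_byte_order(raw):
    """Return 'little' or 'big', whichever decoding looks smoother.

    Pure silence gives no evidence either way and falls back to 'little'.
    """
    little = roughness(decode_16bit(raw, "<"))
    big = roughness(decode_16bit(raw, ">"))
    return "little" if little <= big else "big"
```

Read the sample data of a file with open(path, "rb").read() and pass the bytes to guess_byte_order. If the answer differs from what your tools assume, that is a strong hint about where the garbage is coming from.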

If you can recognize some of what is being said but it is fairly uniformly garbled, it is possible that your pitchmarks are not aligned properly. Use some display mechanism to see where the pitchmarks are. They should be aligned (during voiced speech) with the peaks in the signal.
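Looking at a display is the most reliable check, but a numeric summary can point you at the worst files first. The sketch below is a hypothetical helper, not part of any toolkit. It assumes you have already read the waveform into a list of samples and the pitchmark times (in seconds) into a list, and it takes "peak" to mean the largest positive sample near each mark; if your signal has the opposite polarity, negate the samples first.

```python
def peak_offsets(samples, rate, marks, window=0.002):
    """For each pitchmark, return (nearest peak time - mark time) in seconds.

    The peak is the largest sample within +/- window seconds of the mark.
    A mark that falls outside the waveform yields None.
    """
    offsets = []
    half = int(round(window * rate))
    for t in marks:
        centre = int(round(t * rate))
        lo = max(0, centre - half)
        hi = min(len(samples), centre + half + 1)
        if lo >= hi:
            offsets.append(None)
            continue
        peak = max(range(lo, hi), key=lambda i: samples[i])
        offsets.append((peak - centre) / rate)
    return offsets
```

Offsets close to zero throughout voiced speech are what you want. A constant non-zero offset suggests the marks are systematically shifted, while offsets that wander suggest the pitchmark extraction itself went wrong. Marks in unvoiced regions will give meaningless values and should be ignored.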

If all is well except that some parts of the signal are bad or overflowed, then check the diphones where the errors occur.
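To find where the signal has overflowed without scanning a display by eye, something like the following Python sketch may help. It assumes that overflow shows up as runs of samples pinned at (or beyond) full scale for 16-bit data; overflow that wraps around instead will not be caught this way. The function name and the default thresholds are our own choices.

```python
def clipped_regions(samples, rate, limit=32767, min_run=3):
    """Return (start, end) times in seconds of runs pinned at full scale.

    A run counts only if at least min_run consecutive samples have
    absolute value >= limit.
    """
    regions = []
    run_start = None
    for i, s in enumerate(samples):
        at_limit = abs(s) >= limit
        if at_limit and run_start is None:
            run_start = i
        elif not at_limit and run_start is not None:
            if i - run_start >= min_run:
                regions.append((run_start / rate, i / rate))
            run_start = None
    # A run that reaches the end of the file still counts.
    if run_start is not None and len(samples) - run_start >= min_run:
        regions.append((run_start / rate, len(samples) / rate))
    return regions
```

The times returned can then be compared with the diphone boundaries in the synthesized utterance to see which diphone is responsible.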

There are a number of solutions to problems that may save you some time. For the most part they should be considered cheating, but they may save you from having to re-record, which is something that you will probably want to avoid if at all possible.

Note that some phones are very similar; in particular, the left halves of most stops are indistinguishable, as they consist mostly of silence. Thus if you find you didn't get a good SOMETHING-p diphone, you can easily make it use the SOMETHING-b diphone instead. You can do this by hand-editing the diphone index file accordingly.
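Hand editing is fine for one entry. If you would rather script it, the substitution can be sketched in Python as below. It assumes index entries are lines of five whitespace-separated fields (diphone name, file id, start, mid, end), which you should check against your own index file; alias_diphone is a hypothetical helper, and the diphone names in the usage note are only examples.

```python
def alias_diphone(lines, bad, donor):
    """Return a copy of lines where bad's entry reuses donor's file and times.

    Both diphones must already have an entry; otherwise KeyError is raised.
    Header lines and all other entries are passed through unchanged.
    """
    donor_fields = None
    for line in lines:
        fields = line.split()
        if len(fields) == 5 and fields[0] == donor:
            donor_fields = fields
    if donor_fields is None:
        raise KeyError(donor)

    replacement = " ".join([bad] + donor_fields[1:])
    out = []
    replaced = False
    for line in lines:
        fields = line.split()
        if len(fields) == 5 and fields[0] == bad:
            out.append(replacement)
            replaced = True
        else:
            out.append(line)
    if not replaced:
        raise KeyError(bad)
    return out
```

For example, alias_diphone(lines, "aa-p", "aa-b") makes the aa-p entry point at the same file and boundaries as aa-b. Keep a copy of the original index so the change can be undone if you later re-record the bad diphone.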

The linguists among you may not find that acceptable, but you can go further: the burst part of /p/ and /b/ isn't that different when it comes down to it, and if it is just one or two diphones you can simply map those too. Considering that problems are often in one or two badly articulated phones, replacing a /p/ with a /b/ (or similar) in one or two diphones may not be that bad.

Once, however, the problems become systematic over a number of phones, re-recording them should be considered. Remember, though, that if you do have to re-record you want as similar an environment as possible, which is not always easy. Eventually you may need to re-record the whole database.

Recording diphone databases is not an exact science. Although we have a fair amount of experience in recording these databases, they never go completely as planned. Some apparently minor problem often occurs: noise on the channel, or slightly different power across two sessions. Even when everything seems the same and we can't identify any difference between two recording environments, we have found that some voices are better than others for building diphone databases. We can't immediately say why. We discussed some of these issues above in selecting a speaker, but there are still some other parameters which we can't identify, so don't be disheartened when your database isn't as good as you hoped; ours sometimes fail too.