Labeling the diphones

Labeling nonsense words is much easier than labeling continuous speech, whether by hand or automatically. With nonsense words, it is completely defined which phones are present, and they are (hopefully) clearly articulated.

We have had significant experience in hand labeling diphones, and with the right tools it can be done fairly quickly (e.g. 20 hours for 2500 nonsense words), even if it is a mind-numbing exercise for which your voice talent may offer you little sympathy after you've made them babble for hours in a box with electrodes on their throat (optional). But labeling can't realistically be done for more than an hour or two at any one time. As a minimum, the boundary from the preceding phone to the first phone in the diphone, the changeover between the two phones, and the end of the second phone in the diphone should be labeled. Note we recommend labeling phone boundaries, as these are much better defined than phone middles. The diphone will, by default, be extracted from the middle of phone one to the middle of phone two.

Your data set conventions may include labeling closures within stops explicitly. Thus you would have the label tcl at the end of the silence part of a /t/, and a label t after the burst. This way the diphone boundary can automatically be placed within the silence part of the stop. The label DB can be used when an explicit diphone boundary is desirable; this is useful within phones such as diphthongs, where the temporal middle need not be the most stable part.
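As an illustration, here is what part of a hand label file for a nonsense word might look like in EST label format, with explicit closures. All times, the file contents, and the nonsense word itself are purely illustrative; note the word-initial stop has a tcl closure "stolen" from the preceding silence:

```
#
0.290  26  pau
0.420  26  tcl
0.450  26  t
0.720  26  aa
0.750  26  tcl
0.780  26  t
1.050  26  pau
```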

Another place where specific diphone boundaries are recommended is in the phone-to-silence diphones. Phones at the ends of words are typically longer than word-internal phones, and tend to trail off in energy. Thus the midpoint of a phone immediately before a silence typically has much less energy than the midpoint of a word-internal phone, so when a diphone is concatenated to a phone-silence diphone there would be a big jump in energy (as well as in related spectral characteristics). Our solution is to explicitly label a diphone boundary near the beginning of the phone before the silence (about 20% in), where the energy is much closer to that of the diphone that will precede it.

If you are using explicit closures, it is worth noting that stops at the start of words don't seem to have a closure part; however, it is a good idea to label one anyway if you are doing this by hand. Just "steal" a suitable short piece of silence from the preceding part of the waveform.

Because the words will often have widely varying amounts of silence around them, it is a good idea to label multiple silences around the word: label the silence immediately before the first phone as about 200-300 ms, and the silence before that as another phone; likewise with the final silence. Also, as the final phone before the end silence may trail off, we recommend that the end of the last phone be placed at the very end of any signal, so that it appears to include silence within it; then label the real silence (200-300 ms) after it. The reason for this is that if the end silence happens to include some part of the spoken signal, and that part is duplicated, as happens when duration is elongated, an audible buzz can be introduced.

Because labeling diphone nonsense words is such a constrained task, we have included a program for automatically labeling the spoken prompts. This requires that prompts be generated for the diphone database; the aligner uses those prompts to do the aligning. Though it is not actually necessary that the prompts were played during recording, they do need to be generated for this alignment process. This is not the only means of alignment; you may also, for instance, use a speech recognizer, such as CMU Sphinx, to segment (align) the data.

The idea behind the aligner is to take the prompt and the spoken form, derive mel-scale cepstral parameterizations (and their deltas) of both files, and then use a DTW (dynamic time warping) algorithm to find the best alignment between the two sets of features. The prompt's label file is then used to index through the alignment, giving a label file for the spoken nonsense word. This is largely based on the techniques described in [malfrere97], though this general technique has been used for many years.
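The DTW step above can be sketched as follows. This is a minimal illustration, not the actual aligner: it assumes the cepstral features have already been computed (plain numpy arrays stand in for the mel-cepstra and their deltas), assumes a 10ms frame shift, and the function names are our own.

```python
import numpy as np

def dtw_path(a, b):
    """Dynamic time warping between two feature sequences a and b, each of
    shape (n_frames, n_dims).  Returns the minimum-cost alignment path as a
    list of (i, j) frame-index pairs."""
    n, m = len(a), len(b)
    # Euclidean distance between every prompt frame and every spoken frame.
    dist = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=2)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost[i, j] = dist[i - 1, j - 1] + min(
                cost[i - 1, j - 1], cost[i - 1, j], cost[i, j - 1])
    # Backtrack from the end, preferring the diagonal move on ties.
    i, j = n, m
    path = [(i - 1, j - 1)]
    while (i, j) != (1, 1):
        moves = [(i - 1, j - 1), (i - 1, j), (i, j - 1)]
        i, j = min(moves, key=lambda p: cost[p])
        path.append((i - 1, j - 1))
    path.reverse()
    return path

def map_time(path, t, frame_shift=0.010):
    """Map a label time in the prompt to the corresponding time in the
    spoken recording, via the alignment path (assumes 10ms frames)."""
    i = int(round(t / frame_shift))
    js = [j for (pi, j) in path if pi == i]
    return js[len(js) // 2] * frame_shift if js else t
```

Each label time in the prompt's label file would be pushed through map_time to produce the label file for the spoken nonsense word.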

We have tested this aligner on a number of existing hand-labeled databases to compare the quality of the alignments with respect to the hand labeling. We have also tested aligning prompts generated from a language different from that being recorded. To do this there needs to be a reasonable mapping between the language phonesets.

Here are the results of automatically finding labels for the ked (US English) diphones by aligning them against prompts generated by three different voices:

ked itself: mean error 14.77ms, stddev 17.08

mwm (US English): mean error 27.23ms, stddev 28.95

gsw (UK English): mean error 25.25ms, stddev 23.923

Note that gsw actually gives better results than mwm, even though it is a different dialect of English. We built three diphone index files from the label sets generated by these alignment processes. ked-to-ked was the best, and only marginally worse than the database made from the manually produced labels. The databases from the mwm- and gsw-produced labels were a little worse, but not unacceptably so. Considering that a significant amount of careful correction was made to the manually produced labels, these automatically produced labels are still significantly better than a first pass of hand labels.

A further experiment was made across languages; the ked diphones were used as prompts to align a set of Korean diphones. Even though there are a number of phones in Korean not present in English (various forms of aspirated consonants), the results are quite usable.

Whether you use hand labeling or automatic alignment, it is always worthwhile doing some hand correction after the basic database is built. Mistakes (sometimes systematic) always occur, and listening to a substantial subset of the diphones (or all of them, if you resynthesize the nonsense words) is definitely worth the time spent finding bad diphones. The diva is in the details.

The script festvox/src/diphones/make_labs will process a set of prompts and their spoken (recorded) forms, generating a set of label files to the best of its ability. The script expects the following to already exist:

prompt-wav/

The waveforms as synthesized by Festival

prompt-lab/

The label files corresponding to the synthesized prompts in prompt-wav.

prompt-cep/

The directory where the cepstral feature streams for each prompt will be saved.

wav/

The directory holding the nonsense words spoken by your voice talent. These should have the same file ids as the waveforms in prompt-wav/.

cep/

The directory where the cepstral feature streams for the recorded material will be saved.

lab/

The directory where the generated label files for the spoken words in wav/ will be saved.

To run the script over the prompt waveforms:

bin/make_labs prompt-wav/*.wav

The script is written so it may be used in parallel on multiple machines if you want to distribute the process. On a Pentium Pro 200MHz, which you probably won't be able to find any more, a 2000-word diphone database can be labeled in about 30 minutes; most of that time is spent generating the cepstrum coefficients. This is down to a few minutes at most on a dual Pentium III 550.

Once the nonsense words have been labeled, you need to build a diphone index. The index identifies which diphone comes from which files, and from where. This can be automatically built from the label files (mostly). The Festival script festvox/src/diphones/make_diph_index will take the diphone list (as used above), find the occurrence of each diphone in the label files, and build an index. The index consists of a simple header, followed by a single line for each diphone: the diphone name, the fileid, start time, mid-point (i.e. the phone boundary) and end time. The times are given in seconds (note that early versions of Festival, using a different diphone synthesizer module, used milliseconds for this. If you have such an old version of Festival, it's time to update it).

An example from the start of a diphone index file is

EST_File index
DataType ascii
NumEntries  1610
IndexName ked2_diphone
EST_Header_End
y-aw kd1_002 0.435 0.500 0.560
y-ao kd1_003 0.400 0.450 0.510
y-uw kd1_004 0.345 0.400 0.435
y-aa kd1_005 0.255 0.310 0.365
y-ey kd1_006 0.245 0.310 0.370
y-ay kd1_008 0.250 0.320 0.380
y-oy kd1_009 0.260 0.310 0.370
y-ow kd1_010 0.245 0.300 0.345
y-uh kd1_011 0.240 0.300 0.330
y-ih kd1_012 0.240 0.290 0.320
y-eh kd1_013 0.245 0.310 0.345
y-ah kd1_014 0.305 0.350 0.395
...

Note that the number-of-entries field must be correct; if it is too small, the entries after that point will (often confusingly) be ignored.
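A quick sanity check for this is easy to script. The following Python sketch (the function name is ours, and the field layout is assumed from the example above) compares the declared NumEntries with the actual count of entry lines:

```python
def check_diphone_index(lines):
    """Given the lines of an EST diphone index file, return the declared
    NumEntries value and the actual number of non-blank entry lines found
    after EST_Header_End."""
    header_end = lines.index("EST_Header_End")
    declared = None
    for line in lines[:header_end]:
        if line.startswith("NumEntries"):
            declared = int(line.split()[1])
    actual = len([l for l in lines[header_end + 1:] if l.strip()])
    return declared, actual
```

For example, `check_diphone_index(open("dic/kaldiph.est").read().splitlines())` should return two equal numbers; if the second is larger, the trailing entries are being silently ignored.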

This file can be created from the diphone list file and the label files in lab/ by the command

$FESTVOXDIR/src/diphones/make_diph_index etc/usdiph.list dic/kaldiph.est

You should check that this has successfully found all the named diphones. When a diphone is not found in a label file, an entry with zeroes for the start, middle, and end times is generated. This will produce a warning when used in Festival, but it is worth checking in advance.
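Such missing diphones can be found mechanically. A small Python sketch (our own helper, assuming the five-field entry layout shown in the example index above) that collects the names of any all-zero entries:

```python
def missing_diphones(entry_lines):
    """Return the names of diphones whose start, mid, and end times are all
    zero, i.e. diphones the index builder failed to find in any label file."""
    missing = []
    for line in entry_lines:
        fields = line.split()
        if len(fields) != 5:
            continue  # skip header or malformed lines
        try:
            start, mid, end = map(float, fields[2:])
        except ValueError:
            continue
        if start == mid == end == 0.0:
            missing.append(fields[0])
    return missing
```

Running this over the entry lines of dic/kaldiph.est before building the voice saves tracking the problem down later from Festival warnings.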

The make_diph_index program will take the midpoint between the phone boundaries as the diphone boundary, unless otherwise specified with the label DB. It will also automatically remove underscores and dollar symbols from diphone names before searching for the diphone in the label files, and it will use only the first occurrence of each diphone.
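The default extraction rule described above, middle of phone one to middle of phone two with the shared boundary as the midpoint, can be sketched as follows. This is an illustration of the rule, not the make_diph_index implementation; the function name is ours, labels are taken as (end_time, phone) pairs in EST label style, and the first label is assumed to be the preceding phone (e.g. the initial silence):

```python
def diphone_entry(labels, p1, p2):
    """Find the first occurrence of the diphone p1-p2 in a label sequence
    and return its (start, mid, end) times: from the temporal middle of p1
    to the middle of p2, with the phone boundary as the midpoint.
    labels is a list of (end_time, phone) pairs, where each time marks the
    END of its phone segment.  Returns None if the diphone is absent."""
    for k in range(1, len(labels) - 1):
        if labels[k][1] == p1 and labels[k + 1][1] == p2:
            p1_start = labels[k - 1][0]  # end of the preceding phone
            p1_end = labels[k][0]        # the p1/p2 boundary
            p2_end = labels[k + 1][0]
            start = (p1_start + p1_end) / 2.0
            mid = p1_end
            end = (p1_end + p2_end) / 2.0
            return start, mid, end
    return None
```

An explicit DB label would override the computed start or end point; handling that, and the underscore/dollar name stripping, is left out of this sketch.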