Extracting the pitchmarks

Festival, in its publicly distributed form, currently supports only residual-excited Linear Predictive Coding (LPC) resynthesis [hunt89]. It does support PSOLA [moulines90], though this is not distributed in the public version. Both techniques are pitch synchronous; that is, they require information about where pitch periods occur in the acoustic signal. Where possible, it is better to record with an electroglottograph (EGG, also known as a laryngograph) at the same time as the voice signal. The EGG records electrical activity in the glottis during speech, which makes it easier to find the pitch moments and to locate them more precisely.

Although extracting pitch periods from the EGG signal is not trivial, it is fairly straightforward in practice, as the Edinburgh Speech Tools include a program, pitchmark, which processes the EGG signal and produces a set of pitchmarks. However, it is not fully automatic: someone must look at the result and decide whether changing the parameters may improve it.

The first major issue in processing the signal is deciding which way is up. In our experience the signal is inverted in some cases, and the direction must be identified for the rest of the processing to work properly. In general we've found that CSTR's LAR output is upside down while OGI's and CMU's output is the right way up, though this can even flip from file to file. If you find inverted signals, add -inv to the arguments to pitchmark.

The object is to produce a single mark at the peak of each pitch period, plus "fake" or "phantom" periods during unvoiced regions. The basic command that we have found works for us is:

pitchmark lar/file001.lar -o pm/file001.pm -otype est \
     -min 0.005 -max 0.012 -fill -def 0.01 -wave_end

It is worth doing one or two by hand and confirming that reasonable pitch periods are found. Note that the -min and -max arguments are speaker-dependent. They can be moved towards the fixed F0 point used in the prompts, though remember that the speaker will not have been exactly constant. The script festvox/src/general/make_pm can be copied and modified (for the particular pitch range) and run to generate the pitchmarks:

bin/make_pm lar/*.lar
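
Since -min and -max are pitch periods in seconds, it can help to derive them from the speaker's F0 range in Hz (period = 1 / F0). The following is a minimal sketch of that arithmetic; the function name f0_to_period and the 200 Hz and 80 Hz limits are illustrative assumptions, not part of the Speech Tools. For comparison, the values 0.005 and 0.012 used in the command above correspond to roughly 200 Hz and 83 Hz.

```shell
# Convert an F0 value in Hz to a pitch period in seconds (period = 1 / F0).
# f0_to_period is a hypothetical helper, not a Speech Tools program.
f0_to_period() {
    awk -v f0="$1" 'BEGIN { printf "%.4f\n", 1 / f0 }'
}

# Assumed F0 limits for the speaker, in Hz; replace with your own estimates.
F0_HIGH=200
F0_LOW=80

PM_MIN=$(f0_to_period "$F0_HIGH")   # shortest period comes from the highest F0
PM_MAX=$(f0_to_period "$F0_LOW")    # longest period comes from the lowest F0

# Print the options in the form expected on the pitchmark command line.
echo "-min $PM_MIN -max $PM_MAX"
```

The printed options can then be pasted into your copy of make_pm in place of the default -min and -max values.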

If you don't have an EGG signal for your diphones, the alternative is to extract the pitch periods from the waveform itself using some other signal processing function. Finding the pitch periods is similar to finding the F0 contour and, although harder than finding them from the EGG signal, it is possible with clean laboratory-recorded speech such as diphones. The following script is a modification of the make_pm script above for extracting pitchmarks from a raw waveform signal. It is not as good as extracting from the EGG file, but it works. It is more computationally intensive, as it requires rather high order filters. The parameter values should be changed to suit the speaker's pitch range.

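for i in "$@"
do
   # Strip the directory and the .wav extension to name the output file
   fname=$(basename "$i" .wav)
   echo "$i"
   # Normalise the amplitude, scale by 0.9 and resample to 16kHz
   $ESTDIR/bin/ch_wave -scaleN 0.9 "$i" -F 16000 -o /tmp/tmp$$.wav
   # Filter the waveform and mark the pitch periods
   $ESTDIR/bin/pitchmark /tmp/tmp$$.wav -o "pm/$fname.pm" \
             -otype est -min 0.005 -max 0.012 -fill -def 0.01 \
             -wave_end -lx_lf 200 -lx_lo 71 -lx_hf 80 -lx_ho 71 -med_o 0
done
# Remove the temporary waveform
rm -f /tmp/tmp$$.wav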

If you are extracting pitch periods automatically, it is worth taking more care to check the signal. We have found that inconsistent recording and bad pitch extraction are the two most common causes of poor quality synthesis.
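
As a rough first check before looking at files by hand, the spacing between consecutive pitchmarks can be summarised with a few lines of awk. This is only a sketch: the helper name pm_summary is hypothetical, and it assumes the pitchmark file is an ASCII EST track, that is, header lines ending with EST_Header_End followed by one mark per line with the time in seconds in the first column. If your files are laid out differently, the parsing will need to change.

```shell
# Summarise the gaps between consecutive pitchmarks in one file.
# Assumption: ASCII EST track layout, i.e. a header terminated by
# "EST_Header_End", then one line per mark with the time in column 1.
pm_summary() {
    awk '
        body && NF > 0 {
            if (n > 0) {
                gap = $1 - prev
                if (n == 1 || gap < min) min = gap
                if (n == 1 || gap > max) max = gap
            }
            prev = $1
            n++
        }
        # Start reading marks on the line after the header terminator.
        $1 == "EST_Header_End" { body = 1 }
        END { printf "marks=%d min_gap=%.4f max_gap=%.4f\n", n, min, max }
    ' "$1"
}

# Report on every pitchmark file in pm/, skipping the loop if none exist.
for f in pm/*.pm
do
    if [ -f "$f" ]; then
        echo "$f: $(pm_summary "$f")"
    fi
done
```

Files with very few marks, or with gaps far from the -min and -max values given to pitchmark, are probably the ones to inspect by hand first. This does not replace looking at the marks against the waveform.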

See the Section called Extracting pitchmarks from waveforms in the Chapter called Basic Requirements for a more detailed discussion on how to do this.