Building Synthetic Voices | ||
---|---|---|
<<< Previous | Diphone databases | Next >>> |
Festival, in its publicly distributed form, currently only supports residual excited Linear-Predictive Coding (LPC) resynthesis [hunt89]. It does support PSOLA [moulines90], though this is not distributed in the public version. Both of these techniques are pitch synchronous; that is, they require information about where pitch periods occur in the acoustic signal. Where possible, it is better to record with an electroglottograph (EGG, also known as a laryngograph) at the same time as the voice signal. The EGG records electrical activity in the glottis during speech, which makes it easier to find the pitch moments and to locate them more precisely.
Although extracting pitch periods from the EGG signal is not trivial, it is fairly straightforward in practice, as the Edinburgh Speech Tools include a program, pitchmark, which processes the EGG signal to give a set of pitchmarks. However, it is not fully automatic: someone needs to look at the result and decide whether changing parameters might improve it.
The first major issue in processing the signal is deciding which way is up. In our experience the signal is inverted in some cases, and the direction must be identified for the rest of the processing to work properly. In general we've found that CSTR's LAR output is upside down while OGI's and CMU's output is the right way up, though this can even flip from file to file. If you find inverted signals, you should add -inv to the arguments to pitchmark.
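Because the direction can flip from file to file, it may help to keep a note of which recordings are inverted and add -inv only for those. The following is a minimal dry-run sketch, not part of the festvox scripts: the function name pm_command and the list file (one basename per line) are hypothetical conventions, and the remaining pitchmark arguments are simply the ones used elsewhere in this section. It only prints the command it would run.

```shell
#!/bin/sh
# Dry-run sketch: print the pitchmark command for one EGG file, adding
# -inv when the file's basename appears in a list of inverted recordings.
# The list file (one basename per line) is a hypothetical convention.
pm_command() {
    lar=$1
    inverted_list=$2
    base=$(basename "$lar" .lar)
    inv=""
    # -F: fixed string, -x: whole-line match, -q: no output
    if [ -f "$inverted_list" ] && grep -F -q -x "$base" "$inverted_list"; then
        inv=" -inv"
    fi
    echo "pitchmark $lar -o pm/$base.pm -otype est" \
         "-min 0.005 -max 0.012 -fill -def 0.01 -wave_end$inv"
}
```

Calling pm_command for each file in lar/ and inspecting the printed commands before running them keeps the decision about inversion in one editable list rather than in the script itself.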
The object is to produce a single mark at the peak of each pitch period and "fake" or "phantom" periods during unvoiced regions. The basic command we have found that works for us is

    pitchmark lar/file001.lar -o pm/file001.pm -otype est \
       -min 0.005 -max 0.012 -fill -def 0.01 -wave_end

It is worth doing one or two by hand and confirming that reasonable pitch periods are found. Note that the -min and -max arguments are speaker-dependent. These can be moved towards the fixed F0 point used in the prompts, though remember the speaker will not have been exactly constant. The script festvox/src/general/make_pm can be copied and modified (for the particular pitch range) and run to generate the pitchmarks

    bin/make_pm lar/*.lar
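The -min and -max values are pitch periods in seconds, so they are the reciprocals of the highest and lowest F0 expected from the speaker: the limits 0.005 and 0.012 used here correspond to roughly 200 Hz and 83 Hz. As a small sketch of that arithmetic (the function names hz_to_period and pm_range are illustrative, not part of the Speech Tools), the limits for a different speaker could be derived like this:

```shell
#!/bin/sh
# Convert a frequency in Hz to a period in seconds (period = 1 / F0).
hz_to_period() {
    awk -v f0="$1" 'BEGIN { printf "%.4f\n", 1 / f0 }'
}

# Print -min/-max limits for a speaker's F0 range in Hz.
# The lowest and highest F0 are values you estimate for your own
# speaker (assumed inputs). The highest F0 gives the shortest period.
pm_range() {
    lo_hz=$1
    hi_hz=$2
    echo "-min $(hz_to_period "$hi_hz") -max $(hz_to_period "$lo_hz")"
}

pm_range 83 200
```

Leaving some margin on either side of the estimated range is probably sensible, since the speaker will not have held F0 exactly constant.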
If you don't have an EGG signal for your diphones, the alternative is to extract the pitch periods using some other signal processing function. Finding the pitch periods is similar to finding the F0 contour and, although harder than finding it from the EGG signal, with clean laboratory-recorded speech, such as diphones, it is possible. The following script is a modification of the make_pm script above for extracting pitchmarks from a raw waveform signal. It is not as good as extracting from the EGG file, but it works. It is more computationally intensive, as it requires rather high order filters. The filter values should change depending on the speaker's pitch range.

    for i in $*
    do
       fname=`basename $i .wav`
       echo $i
       # Normalize the amplitude and resample to 16kHz
       $ESTDIR/bin/ch_wave -scaleN 0.9 $i -F 16000 -o /tmp/tmp$$.wav
       # Filter the waveform and mark the pitch periods
       $ESTDIR/bin/pitchmark /tmp/tmp$$.wav -o pm/$fname.pm \
          -otype est -min 0.005 -max 0.012 -fill -def 0.01 \
          -wave_end -lx_lf 200 -lx_lo 71 -lx_hf 80 -lx_ho 71 -med_o 0
    done
    rm -f /tmp/tmp$$.wav

If you are extracting pitch periods automatically, it is worth taking more care to check the signal. We have found that recording inconsistency and bad pitch extraction are the two most common causes of poor quality synthesis.
See the Section called Extracting pitchmarks from waveforms in the Chapter called Basic Requirements for a more detailed discussion on how to do this.
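Checking by eye can be supplemented with a rough count of suspicious pitch periods. The sketch below is an illustration only, not a Speech Tools program: the function name check_pm is made up, and it assumes the pitchmark file is an ASCII track whose header ends with a line beginning EST_Header_End and whose data lines start with the mark time in seconds. It counts the intervals between successive marks that fall outside the -min/-max limits given to pitchmark.

```shell
#!/bin/sh
# Report pitchmark intervals outside [min, max] seconds in an EST track.
# Assumes an ASCII file whose header ends with "EST_Header_End" and whose
# data lines start with the mark time in seconds.
# Usage: check_pm FILE MIN MAX
check_pm() {
    awk -v min="$2" -v max="$3" '
        body && NF > 0 {
            t = $1 + 0
            if (seen) {
                d = t - prev
                n++
                # Small tolerance for floating point rounding
                if (d < min - 1e-6 || d > max + 1e-6) bad++
            }
            prev = t; seen = 1
            next
        }
        /^EST_Header_End/ { body = 1 }
        END { printf "%d intervals, %d out of range\n", n, bad }
    ' "$1"
}
```

For example, `check_pm pm/file001.pm 0.005 0.012` prints one summary line for that file. A file with many out-of-range intervals may be worth displaying alongside its waveform before it is used.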