Building Synthetic Voices

Building LPC parameters

Currently the only publicly distributed signal processing method in Festival is residual excited LPC. To use this, you must extract LPC parameters and LPC residual files for each file in the diphone database. Ideally, the LPC analysis should be done pitch-synchronously, which requires that pitch marks are created before the LPC analysis takes place.

A script suitable for generating the LPC coefficients and residuals is given in festvox/src/general/make_lpc and is repeated here.

for i in $*
do
   fname=`basename $i .wav`
   echo $i

   # Potentially normalise the power
   #$ESTDIR/bin/ch_wave -scaleN 0.5 $i -o /tmp/tmp$$.wav
   # resampling can be done now too
   #$ESTDIR/bin/ch_wave -F 11025 $i -o /tmp/tmp$$.wav
   # Or use as is
   cp -p $i /tmp/tmp$$.wav
   $ESTDIR/bin/sig2fv /tmp/tmp$$.wav -o lpc/$fname.lpc \
             -otype est -lpc_order 16 -coefs "lpc" \
             -pm pm/$fname.pm -preemph 0.95 -factor 3 \
             -window_type hamming
   $ESTDIR/bin/sigfilter /tmp/tmp$$.wav -o lpc/$fname.res \
              -otype nist -lpcfilter lpc/$fname.lpc -inv_filter
   rm /tmp/tmp$$.wav
done

Note the (optional) use of ch_wave to attempt to normalize the power in the wave to a percentage of its maximum. This is a very crude method for making the waveforms have reasonably equivalent power. Wildly different power between segments is likely to be noticed when they are joined. Differing power in the nonsense words may occur if not enough care has been taken in the recording: either the settings on the recording equipment have been changed (bad) or the speaker has changed their vocal effort (worse). It is important that this be avoided, as the above normalization does not make the problem of different power go away; it only makes the problem slightly less bad.

A more elaborate power normalization is a little harder, but it was definitely successful for the KED US American voice, which had major power fluctuations over different recording sessions. The idea is to find the power during vowels in each nonsense word, then find the mean power for each vowel over all files. Then, for each file, find the average factor difference between each actual vowel and the mean for that vowel, and scale the waveform according to that value. We now provide a basic script which does this:

bin/find_powerfacts lab/*.lab

This script creates (among others) etc/powfacts which, if it exists, is used to normalize the power of each waveform file during the making of the LPC coefficients.

We generate a set of ch_wave commands that extract the parts of the wave that are vowels (using the -start and -end options), make the output ASCII (-otype raw -ostype ascii), and use a simple script to calculate the RMS power. We then calculate the mean power for each vowel with another awk script, using the result as a table. Finally, we process the fileid and actual vowel power information to generate a power factor for each file by averaging the ratio of each vowel's actual power to the mean power for that vowel. You may still wish to modify the power further after this if it is too low or high.
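
The two calculation steps described above can be sketched roughly as follows. This is not the find_powerfacts script itself: the function names rms and power_factors, and the three-column table layout (fileid, vowel, RMS value), are assumptions made purely for illustration.

```shell
#!/bin/sh
# Hypothetical helpers; names and table layout are illustrative only.

# rms: read ASCII sample values on stdin, print their RMS value.
rms() {
    awk '{ for (i = 1; i <= NF; i++) { sum += $i * $i; n++ } }
         END { if (n > 0) printf "%.4f\n", sqrt(sum / n); else print "0.0000" }'
}

# power_factors TABLE
# TABLE holds one line per vowel token: "fileid vowel rms".
# Pass 1 accumulates the mean RMS of each vowel over all files.
# Pass 2 averages, per file, the ratio actual/mean of its vowels.
power_factors() {
    awk 'NR == FNR { sum[$2] += $3; cnt[$2]++; next }
         { mean = sum[$2] / cnt[$2]
           if (mean > 0) { ratio[$1] += $3 / mean; n[$1]++ } }
         END { for (f in ratio) printf "%s %.4f\n", f, ratio[f] / n[f] }' "$1" "$1" | sort
}

# Example of feeding one vowel to rms (times and file name are made up,
# and this assumes ch_wave writes to standard output when -o is omitted):
#   $ESTDIR/bin/ch_wave -start 0.25 -end 0.40 -otype raw -ostype ascii \
#       wav/file.wav | rms
```

With the ratio defined this way, a value above 1 means the file's vowels are louder than average, so the waveform would be scaled by the reciprocal to bring it toward the mean.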

Note that power normalization is intended to remove artifacts caused by differences in the recording environment, e.g. the person moved from the microphone, the levels were changed, etc. It should not modify the intrinsic power differences in the phones themselves. The above techniques try to preserve the intrinsic power, which is why we take the average over all vowels in a nonsense word, though you should listen to the results and make the ultimate decision yourself.

If all has been recorded properly, of course, individual power modification should be unnecessary. Once again, we can't stress enough how important it is to have good and consistent recording conditions, so as to avoid steps like this.

If you want to generate a database using a different sampling rate from the one the recordings were made with, this is the time to resample. For example, an 8kHz or 11.025kHz database will be smaller than a 16kHz database. If the eventual voice is to be played over the telephone, for example, there is little point in generating anything but 8kHz. Also, it will be faster to synthesize 8kHz utterances than 16kHz ones.
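
As a rough sketch, resampling could also be done as a separate pass using the same ch_wave -F option that appears (commented out) in make_lpc. The function name resample_all, the output directory argument, and the CH_WAVE override variable are assumptions for illustration and are not part of festvox.

```shell
#!/bin/sh
# resample_all RATE OUTDIR FILE...
# Hypothetical helper: writes a copy of each FILE resampled to RATE Hz
# into OUTDIR. CH_WAVE is an assumed override for the ch_wave binary;
# by default $ESTDIR/bin/ch_wave is used, as in make_lpc.
resample_all() {
    rate=$1
    outdir=$2
    shift 2
    mkdir -p "$outdir" || return 1
    for f in "$@"
    do
        "${CH_WAVE:-$ESTDIR/bin/ch_wave}" -F "$rate" "$f" \
            -o "$outdir/$(basename "$f")" || return 1
    done
}

# Example (directory names are made up):
#   resample_all 8000 wav8k wav/*.wav
```

Keeping the resampled copies in their own directory leaves the original recordings untouched, so a different rate can be tried later without re-recording.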

The number of LPC coefficients used to represent each pitch period can be changed depending on the sample rate you choose. Hearsay, reasonable experience, and perhaps some theoretical underpinning suggest the following formula for calculating the order:

(sample_rate/1000)+2

But that should only be taken as a rough guide, though a larger sample rate deserves a greater number of coefficients.
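
For illustration, the formula can be evaluated in the shell as below. The function name lpc_order is hypothetical, and truncating the division for rates such as 11025 is an assumption, since the formula is only a rough guide. The result is the kind of value that would be passed to the -lpc_order option of sig2fv in the script above.

```shell
#!/bin/sh
# lpc_order RATE_HZ: print (sample_rate/1000)+2 using integer arithmetic.
# Hypothetical helper; the division truncates (11025 gives 11, not 11.025).
lpc_order() {
    echo $(( $1 / 1000 + 2 ))
}

for rate in 8000 11025 16000
do
    echo "$rate Hz: order $(lpc_order "$rate")"
done
# prints:
#   8000 Hz: order 10
#   11025 Hz: order 13
#   16000 Hz: order 18
```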
