F0 Generation

Predicting where accents go (and their types) is only half of the problem. We also have to build an F0 contour based on them. Note that intonation is split between accent placement and F0 generation because accent position clearly influences durations, and an F0 contour cannot be generated without knowing the durations of the segments the contour is to be generated over.

There are three basic F0 generation modules available in Festival, though others could be added: by general rule, by linear regression/CART, and by Tilt.

F0 by rule

The first is designed to be the most general and will always allow some form of F0 generation. This method allows target points to be programmatically created for each syllable in an utterance. The idea closely follows a generalization of the implementation of ToBI type accents in [anderson84], where n points are predicted for each accent. They (and others working in intonation) appeal to the notion of a baseline and place target F0 points above and below that line based on accent type and position in the phrase. The baseline itself is often defined to decline over the phrase, reflecting the general declination of F0 over time.
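The baseline idea can be illustrated with a minimal Python sketch. It is not taken from Festival or from [anderson84]: the function names, the linear shape of the declination, the start and end baseline values, and the per-accent offsets are all hypothetical and chosen only for illustration.

```python
def baseline_f0(t, phrase_start, phrase_end, f0_start=130.0, f0_end=100.0):
    """Baseline F0 (Hz) at time t, declining linearly over the phrase.

    f0_start and f0_end are made-up values, not Festival defaults.
    """
    if phrase_end <= phrase_start:
        raise ValueError("phrase must have positive duration")
    frac = (t - phrase_start) / (phrase_end - phrase_start)
    return f0_start + frac * (f0_end - f0_start)


# Hypothetical offsets (Hz) relative to the baseline for two accent types.
ACCENT_OFFSETS = {"H*": 30.0, "L*": -15.0}


def accent_target(t, accent, phrase_start, phrase_end):
    """Return a (time, F0) target placed above or below the baseline."""
    base = baseline_f0(t, phrase_start, phrase_end)
    return (t, base + ACCENT_OFFSETS[accent])
```

With these assumed values, the same accent type produces a lower target late in the phrase than early in it, which is the effect the declining baseline is meant to capture.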

The simple idea behind this general method is that a Lisp function is called for each syllable in the utterance. That Lisp function returns a list of target F0 points that lie within that syllable. Thus the generality of this method actually lies in the fact that it simply allows the user to program anything they want. For example, our simple hat accent can be generated using this technique as follows.

This example fixes the F0 range of the speaker, so it would need to be changed for different speakers.

(define (targ_func1 utt syl)
  "(targ_func1 UTT STREAMITEM)
Returns a list of targets for the given syllable."
  (let ((start (item.feat syl 'syllable_start))
        (end (item.feat syl 'syllable_end)))
    (if (equal? (item.feat syl "R:Intonation.daughter1.name") "Accented")
        ;; Wrap the three (position F0) pairs in a single list
        (list
         (list start 110)
         (list (/ (+ start end) 2.0) 140)
         (list end 100)))))

It simply checks whether the current syllable is accented and, if so, returns a list of position/target pairs: a value of 110Hz at the start of the syllable, a value of 140Hz at the mid-point of the syllable and a value of 100Hz at the end.

This general technique can be expanded with other rules as necessary. Festival includes an implementation of ToBI using exactly this technique; it is based on the rules described in [jilka96] and is in the file festival/lib/tobi_f0.scm.
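To see what such targets imply for the contour, here is a small Python sketch that mirrors the hat accent targets and reads off F0 values between them. The helper names are hypothetical, and joining the targets with straight lines is an assumption made for illustration, not a description of how Festival turns targets into a contour.

```python
def hat_targets(start, end, accented):
    """Python mirror of the hat accent rule: three (time, F0) targets, or none."""
    if not accented:
        return []
    return [(start, 110.0), ((start + end) / 2.0, 140.0), (end, 100.0)]


def interpolate_f0(targets, t):
    """F0 at time t, assuming straight lines between neighbouring targets."""
    if not targets:
        raise ValueError("no targets to interpolate")
    if t <= targets[0][0]:
        return targets[0][1]
    for (t0, f0), (t1, f1) in zip(targets, targets[1:]):
        if t0 <= t <= t1:
            return f0 + (f1 - f0) * (t - t0) / (t1 - t0)
    # Past the last target: hold its value.
    return targets[-1][1]
```

For a syllable running from 0.0 to 0.5 seconds, the contour under this assumption rises from 110Hz to the 140Hz peak at 0.25 seconds and then falls to 100Hz.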

F0 by linear regression

This technique was developed specifically to avoid the difficult decisions of exactly what parameters with what values should be used in rules like those of [anderson84]. The first implementation of this work is presented in [black96]. The idea is to find the appropriate F0 target value for each syllable based on available features by training from data. A set of features is collected for each syllable and a linear regression model is used to model three points on each syllable. The technique produces reasonable synthesis and requires less analysis of the intonation models than would be required to write a rule system using the general F0 target method described in the previous section.

However, to be fair, this technique is also much simpler and there are obviously a number of intonational phenomena which it cannot capture (e.g. multiple accents on a syllable, and it will never really capture accent placement with respect to the vowel). The previous technique allows specification of structure but without explicit training from data (though it doesn't exclude that), while this technique imposes almost no structure but depends solely on data. The Tilt modeling discussed in the following section tries to balance these two extremes.

The advantage of the linear regression method is that very little needs to be known about the intonation of the language under study. Of course, if there is knowledge and there are theories, it is usually better to follow them (or at least find the features which influence the F0 in that language). Extracting features for F0 modeling is similar to extracting features for the other models. This time we want the mean F0 at the start, middle and end of each syllable. The Festival features syl_startpitch, syl_midpitch and syl_endpitch provide this. Note that syl_midpitch returns the pitch at the middle of the vowel in the syllable rather than the middle of the syllable.

For a linear regression model all features must be continuous. Thus features which are categorical that influence F0 need to be converted. The standard technique for this is to introduce new features, one for each possible value in the class and output values of 0 or 1 for these modified features depending on the value of the base features. For example in a ToBI environment the output of the feature tobi_accent will include H*, L*, L+H* etc. In the modified form you would have features of the form tobi_accent_H*, tobi_accent_L*, tobi_accent_L_H*, etc.
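The conversion described above can be sketched in a few lines of Python. This is an illustration only: the function name is hypothetical, and replacing "+" with "_" in the generated names is an assumption based on the tobi_accent_L_H* example.

```python
def one_hot(feature_name, value, classes):
    """Expand one categorical value into 0/1 float features, one per class.

    Assumption: "+" in a class name becomes "_" in the new feature name,
    following the tobi_accent_L_H* example in the text.
    """
    encoded = {}
    for cls in classes:
        new_name = feature_name + "_" + cls.replace("+", "_")
        encoded[new_name] = 1.0 if value == cls else 0.0
    return encoded
```

For example, `one_hot("tobi_accent", "L+H*", ["H*", "L*", "L+H*"])` yields 0.0 for tobi_accent_H* and tobi_accent_L*, and 1.0 for tobi_accent_L_H*. A value outside the listed classes gives 0.0 for every new feature.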

The program ols in the speech tools takes feature files and description files in exactly the same format as wagon, except that all features must be declared as type float. The standard ordinary least squares algorithm used to find the coefficients cannot, in general, deal with features that are directly correlated with others, as this causes a singularity when inverting the matrix. The solution is to exclude such features. The option -robust enables that, though at the expense of a longer compute time. Again, like wagon, a stepwise option is included so that the best subset of features may be found.

The resulting models may be used by the Int_Targets_LR module, which takes its LR models from the variables f0_lr_start, f0_lr_mid and f0_lr_end. The output of ols is a list of coefficients (with the Intercept first). These need to be converted to the appropriate bracket form including their feature names; an example is in festival/lib/f2bf0lr.scm.

If the conversion of categoricals to floats seems too much work, or would prohibitively increase the number of features, you could use wagon to generate trees to predict F0 values. The advantage of a decision tree over the LR model is that it can deal with data in a non-linear fashion, but this is also its disadvantage. Also, the decision tree technique may split the data sub-optimally. The LR model is probably more theoretically appropriate, but ultimately the results depend on how good the models sound.

Dump features as with the LR models, but this time there is no need to convert categorical features to floats. A potential set of features to do this from (substitute syl_midpitch and syl_endpitch for the other two models) is
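The singularity problem can be reproduced with a small pure-Python least squares fit. This sketch is not how ols is implemented: it solves the normal equations by Gauss-Jordan elimination and only detects a dependent feature rather than excluding it. The function name and the tolerance are hypothetical.

```python
def fit_ols(rows, targets, eps=1e-9):
    """Fit y = b0 + b1*x1 + ... by least squares via the normal equations.

    Returns coefficients with the intercept first. Raises ValueError when
    X^T X is numerically singular, e.g. one feature is a multiple of another.
    """
    X = [[1.0] + [float(v) for v in row] for row in rows]
    n = len(X[0])
    # Augmented system [X^T X | X^T y].
    A = [[sum(x[i] * x[j] for x in X) for j in range(n)]
         + [sum(x[i] * y for x, y in zip(X, targets))]
         for i in range(n)]
    for col in range(n):
        # Partial pivoting: pick the row with the largest entry in this column.
        pivot = max(range(col, n), key=lambda r: abs(A[r][col]))
        if abs(A[pivot][col]) < eps:
            raise ValueError("singular matrix: a feature depends on others")
        A[col], A[pivot] = A[pivot], A[col]
        pivot_value = A[col][col]
        A[col] = [v / pivot_value for v in A[col]]
        for r in range(n):
            if r != col:
                factor = A[r][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
    return [A[i][n] for i in range(n)]
```

Fitting a single feature works as expected, while adding a second feature that is exactly twice the first makes the fit fail with the singularity error, which is the situation in which such a feature has to be excluded.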
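As a reminder of what a coefficient list with the intercept first means, the following Python sketch applies such a list to one syllable's feature values. It does not reproduce Festival's bracket form or the Int_Targets_LR module. The function name, feature names and weights are hypothetical.

```python
def predict_f0(coefficients, features):
    """Apply a linear model to one syllable.

    coefficients: list of (name, weight) pairs with ("Intercept", w) first.
    features: dict mapping feature name to a float value; features missing
    from the dict are treated as 0.0 (an assumption of this sketch).
    """
    first_name, intercept = coefficients[0]
    if first_name != "Intercept":
        raise ValueError("expected the Intercept term first")
    return intercept + sum(weight * features.get(name, 0.0)
                           for name, weight in coefficients[1:])


# Hypothetical model and syllable, for illustration only.
example_model = [("Intercept", 160.0), ("tobi_accent_H*", 10.0), ("stress", 5.0)]
example_syllable = {"tobi_accent_H*": 1.0, "stress": 1.0}
example_f0 = predict_f0(example_model, example_syllable)
```

One such model is needed for each of the three points (start, mid and end) predicted per syllable.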


The above, of course, assumes a ToBI accent labeling; modify it as appropriate for your actual labeling.

Once you have generated three trees predicting values for the start, mid and end points in each syllable, you will need to add some Scheme code to use these appropriately. Suitable code is provided in src/intonation/tree_f0.scm; you will need to include that in your voice. To use it as the intonation target module you will need to add something like the following to your voice function

(set! F0start_tree f2b_F0start_tree)
(set! F0mid_tree f2b_F0mid_tree)
(set! F0end_tree f2b_F0end_tree)
(set! int_params
'((target_f0_mean 110) (target_f0_std 10)
  (model_f0_mean 170) (model_f0_std 40)))
(Parameter.set 'Int_Target_Method Int_Targets_Tree)

The int_params values allow you to use the model with a speaker of a different pitch range. That is, all predicted values are converted using the formula

   (+ (* (/ (- value model_f0_mean) model_f0_std)
         target_f0_std) target_f0_mean)

Or for those of you who can't read Lisp expressions

   (((value - model_f0_mean) / model_f0_std) * target_f0_std) + target_f0_mean

The values in the example above are for converting a female speaker (used for training) to a male pitch range.
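The mapping is easy to check numerically. The Python sketch below applies the same conversion, using the means and standard deviations from the int_params example as defaults. The function and parameter names are hypothetical.

```python
def map_f0(value, model_mean=170.0, model_std=40.0,
           target_mean=110.0, target_std=10.0):
    """Map a predicted F0 value from the model speaker's range to the target's.

    The value is normalised by the model mean and standard deviation, then
    rescaled by the target standard deviation and shifted to the target mean.
    """
    return ((value - model_mean) / model_std) * target_std + target_mean
```

A prediction equal to the model mean (170Hz) maps to the target mean (110Hz), and a prediction one model standard deviation above the mean (210Hz) maps to one target standard deviation above the target mean (120Hz).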

Tilt modeling

Tilt modeling is still under development and is not as mature as the other methods described above, but it potentially offers a more consistent solution to the problem. A Tilt parameterization of a natural F0 contour can be automatically derived from a waveform and a labeling of accent placements (a simple "a" for accents and "b" for boundaries) [taylor99]. Further work is being done on trying to find the accent placements automatically too.

For each "a" in a labeling, four continuous parameters are found: height, duration, peak position with respect to vowel start, and tilt. Prediction models may then be generated to predict these parameters, which we feel better capture the dimensions of the F0 contour itself. We have had success in building models for these parameters [dusterhoff97a], with better results than the linear regression model on comparable data. However, so far we have not done any tests with Tilt on languages other than English.

The speech tools include the programs tilt_analyse and tilt_synthesize to aid model building, but we do not yet include full Festival-end support for using the generated models.