Measuring the accuracy of a generated F0 contour is not easy: small changes in the contour are perceptually important in some places, while similar changes elsewhere may be irrelevant. However, in order to have some measure of accuracy, we follow others ([6], [2]) and use the root mean squared error (RMSE) between the generated contour and the original (smoothed) contour. We also use the correlation between the generated and original contours. The magnitude of the RMSE depends on the F0 range of the speaker (larger for females than for males) as well as on the actual error, while the correlation is less dependent on the speaker's range. Note that for these examples the segment durations in the generated examples are the same as in the originals, and hence the voiced and unvoiced sections of the signal always align. RMSE and correlation are calculated only over the voiced sections.
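As an illustration of the two measures, the following is a minimal sketch that computes RMSE and Pearson correlation over the voiced frames of two aligned contours. It is not the code used in these experiments: the function name, the frame-level list representation, and the boolean voicing mask are assumptions made for the example, as is the choice of Pearson correlation, since the text does not specify which correlation coefficient was used.

```python
import math

def voiced_rmse_and_correlation(generated, original, voiced):
    """Return (RMSE, Pearson correlation) over frames flagged as voiced.

    generated, original: F0 values per frame (e.g. in Hz), frame-aligned.
    voiced: booleans, True where the frame is voiced (hypothetical mask).
    """
    if not (len(generated) == len(original) == len(voiced)):
        raise ValueError("contours and voicing mask must have equal length")

    # Keep only the voiced frames; unvoiced frames carry no F0 target.
    pairs = [(g, o) for g, o, v in zip(generated, original, voiced) if v]
    n = len(pairs)
    if n < 2:
        raise ValueError("need at least two voiced frames")

    # Root mean squared error: scales with the speaker's F0 range.
    rmse = math.sqrt(sum((g - o) ** 2 for g, o in pairs) / n)

    # Pearson correlation: invariant to the scale and offset of either contour.
    mean_g = sum(g for g, _ in pairs) / n
    mean_o = sum(o for _, o in pairs) / n
    cov = sum((g - mean_g) * (o - mean_o) for g, o in pairs)
    var_g = sum((g - mean_g) ** 2 for g, _ in pairs)
    var_o = sum((o - mean_o) ** 2 for _, o in pairs)
    if var_g == 0 or var_o == 0:
        raise ValueError("correlation is undefined for a flat contour")
    corr = cov / math.sqrt(var_g * var_o)

    return rmse, corr


# Toy data (invented for illustration): the third frame is unvoiced and ignored.
original = [100.0, 110.0, 0.0, 120.0, 130.0]
generated = [105.0, 115.0, 50.0, 125.0, 135.0]
voiced = [True, True, False, True, True]
rmse, corr = voiced_rmse_and_correlation(generated, original, voiced)

# Doubling both contours mimics a speaker with a wider F0 range.
rmse_wide, corr_wide = voiced_rmse_and_correlation(
    [2 * g for g in generated], [2 * o for o in original], voiced
)
```

In this toy case, doubling both contours doubles the RMSE but leaves the correlation unchanged, which is the sense in which RMSE reflects the speaker's F0 range while correlation does not.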
In addition to the overall comparison, we also recorded the accuracy, on held-out test data, of each of the individual models we built, which helped us concentrate on particular areas for improvement (notably peak position).
Three experiments were carried out, varying the method used to label the Tilt events. In all cases the continuous Tilt parameters were automatically derived from the Tilt events. In the first experiment the Tilt events were derived from the ToBI labels already in the database by a mostly trivial mapping. In the second experiment we used those same event labels but also included ToBI labels among the features used to predict the parameters. In the final experiment we used hand-labelled Tilt events and, as in the first experiment, used no ToBI features in building the models.