It is worth mentioning a few practical aspects of our algorithm. Because the training procedure is fully automatic, the system is simple to retrain on a new prosodically marked database. Training is quick, requiring a single pass over the data for both the POS sequence model and the phrase break model. At run time, the Viterbi decoder requires only a few calculations per input sentence (several orders of magnitude less computation than the signal processing component of the TTS system, for instance). The framework also allows some flexibility at run time. As is common with speech recognition Viterbi decoders, our system provides a grammar scaling factor which controls the relative importance of the POS sequence and phrase break models. With some experimentation, this factor can be used effectively to control the relative insertion and deletion ratios.
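To make the decoding step concrete, the following is a minimal sketch (not the actual Festival implementation) of a Viterbi search over juncture labels, here simplified to "B" (break) and "NB" (no break) with a bigram break model. The function names, the log-probability tables, and the two-label alphabet are illustrative assumptions; the grammar scaling factor `scale` weights the break model relative to the POS likelihood, as described above.

```python
# Toy Viterbi decoder over juncture labels at each word boundary.
# pos_likelihood[t][l] : assumed log P(POS context at juncture t | label l)
# break_bigram[(p, l)] : assumed log P(l | previous label p)
# scale                : grammar scaling factor on the break model
LABELS = ("B", "NB")

def viterbi(pos_likelihood, break_bigram, n, scale=1.0):
    best = {l: pos_likelihood[0][l] for l in LABELS}  # first juncture
    back = []                                         # back-pointers
    for t in range(1, n):
        new_best, ptr = {}, {}
        for l in LABELS:
            # best predecessor under the scaled break model
            prev, score = max(
                ((p, best[p] + scale * break_bigram[(p, l)]) for p in LABELS),
                key=lambda x: x[1])
            new_best[l] = score + pos_likelihood[t][l]
            ptr[l] = prev
        best, back = new_best, back + [ptr]
    # trace back the highest-scoring label sequence
    last = max(LABELS, key=lambda l: best[l])
    seq = [last]
    for ptr in reversed(back):
        seq.append(ptr[seq[-1]])
    return list(reversed(seq))
```

Raising `scale` above 1 makes the break model dominate (favouring its preferred insertion/deletion balance); lowering it lets the POS evidence dominate, which is how the insertion and deletion ratios can be traded off.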
The algorithm has been implemented in the Festival speech synthesis system [Black and Taylor, 1997]. The setup in Festival uses a POS sequence model based on 23 tags, with a window of two tags before the juncture and one after. Smoothing for the POS sequence model takes the form of Good-Turing smoothing followed by back-off smoothing with a threshold of 3. An unsmoothed 6-gram phrase break model is used.
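The thresholded back-off idea can be sketched as follows. This is a deliberately simplified illustration, not the Festival code: the fixed `discount` stands in for the Good-Turing discount, the back-off goes straight to a unigram estimate rather than recursing through shorter contexts, and the back-off weight normalization is omitted for brevity.

```python
from collections import Counter

def train_backoff(sequences, n=3, threshold=3, discount=0.5):
    """Build a toy thresholded back-off estimator from training sequences.
    N-grams seen at least `threshold` times use a discounted relative
    frequency; rarer or unseen n-grams fall back to a unigram estimate."""
    ngrams, contexts, unigrams = Counter(), Counter(), Counter()
    total = 0
    for seq in sequences:
        for i in range(len(seq) - n + 1):
            ngrams[tuple(seq[i:i + n])] += 1
            contexts[tuple(seq[i:i + n - 1])] += 1
        for tok in seq:
            unigrams[tok] += 1
            total += 1

    def prob(ngram):
        c = ngrams[tuple(ngram)]
        if c >= threshold:
            # discounted relative frequency for well-attested n-grams
            return (c - discount) / contexts[tuple(ngram[:-1])]
        # back off (here directly to the unigram estimate of the last symbol)
        return unigrams[ngram[-1]] / total

    return prob
```

In the actual system the same scheme would be applied to POS-tag sequences around a juncture, with Good-Turing discounting of the counts before backing off.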
We have not yet attempted to use our system on any languages other than English. We expect that the algorithm will work with any language that has a phrase structure which can be related to superficial syntactic information such as POS tags.