Building prosodic models

Phrasing

Prosodic phrasing in speech synthesis makes the speech as a whole more understandable. Due to the size of people's lungs there is a finite length of time people can talk before they must take a breath, which defines an upper bound on the length of prosodic phrases. However, we rarely make our phrases this maximum length, and instead use phrasing to mark groups within the speech. There is an apocryphal story of a speech synthesis example with an unnaturally long prosodic phrase being played at a conference presentation. At the end of the phrase the audience all took a deep intake of breath.

In most cases very simple prosodic phrasing is sufficient. A comparison of various prosodic phrasing techniques is given in [taylor98a], though we cover some of them here also.

For English (and most likely many other languages too) simple rules based on punctuation are a very good predictor of prosodic phrase boundaries. It is rare for punctuation to occur where there is no boundary, but there will be a substantial number of prosodic boundaries that are not explicitly marked with punctuation. Thus a prosodic phrasing algorithm based solely on punctuation will typically under-predict but rarely make a false insertion. However, depending on the application you wish to use the synthesizer for, it may be possible to explicitly add punctuation at desired phrase breaks, in which case a prediction system based solely on punctuation is adequate.

Festival supports two basic methods for predicting prosodic phrases, though any other method can easily be used. Note that these do not necessarily entail pauses in the synthesized output; pauses are predicted separately, from the prosodic phrase information.

The first basic method is by CART tree. A test is made on each word to predict whether it is at the end of a prosodic phrase. The basic CART tree returns B or BB (though it may return whatever break labels you consider appropriate, as long as the rest of your models support them). The two labels identify different levels of break, BB being used to denote a bigger break (an end of utterance).

The following tree is very simple and adds a break after the last word of any token that has following punctuation. Note the first condition is implemented by a Lisp function, as we want to ensure that only the last word in a token gets the break. (Earlier, erroneous versions of this would insert breaks after each word in "1984.")

(set! simple_phrase_cart_tree
'
((lisp_token_end_punc in ("?" "." ":"))
  ((BB))
  ((lisp_token_end_punc in ("'" "\"" "," ";"))
   ((B))
   ((n.name is 0)  ;; end of utterance
    ((BB))
    ((NB))))))

This tree is defined in festival/lib/phrase.scm in the standard distribution and is certainly a good first step in defining a phrasing model for a new language.

To make a better phrasing model requires more information. As the basic punctuation model under-predicts, we need information that will find reasonable boundaries within strings of words. In English, boundaries are more likely between content words and function words, because most function words precede the words they relate to; in Japanese, function words typically follow their related content words, so breaks are more likely between function words and content words. If you have no data to train from, hand-written rules, in a CART tree, can exploit this fact and give a phrasing model that is better than punctuation alone. Basically, a rule could be: if the current word is a content word and the next is a function word (or the reverse, if that is appropriate for the language) and we are more than five words from a punctuation symbol, then predict a break. We may also want to ensure that we are at least five words from any predicted break too.

The above basic rules aren't optimal, but when you are building a new voice in a new language and have no data to train from, you will get reasonably far with simple rules like these, such that phrasing prediction will be less of a problem than the other problems you will find in your voice.

To implement such a scheme we need three basic functions: one to determine whether the current word is a function or content word, one to determine the number of words since the previous punctuation (or start of utterance), and one to determine the number of words until the next punctuation (or end of utterance). The first of these is already provided by the feature function gpos. This uses the word lists in the Lisp variable guess_pos to determine the basic category of a word. Because in most languages the set of function words is very nearly a closed class, they can usually be explicitly listed. The format of the guess_pos variable is a list of lists, where the first element of each sublist is the set name and the rest of the sublist is the words that are part of that set. Any word not a member of any of these sets is defined to be in the set content. For example, the basic definition of this for English, given in festival/lib/pos.scm, is

(set! english_guess_pos
      '((in of for in on that with by at from as if that against about 
    before because if under after over into while without
    through new between among until per up down)
(to to)
(det the a an no some this that each another those every all any 
     these both neither no many)
(md will may would can could should must ought might)
(cc and but or plus yet nor)
(wp who what where how when)
(pps her his their its our their its mine)
(aux is am are was were has have had be)
(punc "." "," ":" ";" "\"" "'" "(" "?" ")" "!")
))
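
You can check how a particular word will be classified by querying the gpos feature on a word item with item.feat. The sketch below assumes a voice is already loaded; the utterance text and the returned value are illustrative only:

(set! utt1 (utt.synth (Utterance Text "The cat sat on the mat.")))
;; gpos of the first word ("The"), given the sets above
(item.feat (utt.relation.first utt1 'Word) "gpos")
;; expected to give "det"; a word in none of the sets, such as "cat",
;; gives "content"
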

The punctuation distance check can be written as a Lisp feature function

(define (since_punctuation word)
 "(since_punctuation word)
Number of words since last punctuation or beginning of utterance."
 (cond
   ((null word) 0) ;; beginning of utterance
   ((string-equal "0" (item.feat word "p.lisp_token_end_punc")) 0)
   (t
    (+ 1 (since_punctuation (item.prev word))))))

The function looking forward would be

(define (until_punctuation word)
 "(until_punctuation word)
Number of words until next punctuation or end of utterance."
 (cond
   ((null word) 0) ;; end of utterance
   ((string-equal "0" (token_end_punc word)) 0)
   (t
    (+ 1 (until_punctuation (item.next word))))))

The whole tree, using these features, that inserts a break at punctuation or between content and function words more than five words from a punctuation symbol, is as follows

(set! simple_phrase_cart_tree_2
'
((lisp_token_end_punc in ("?" "." ":"))
  ((BB))
  ((lisp_token_end_punc in ("'" "\"" "," ";"))
   ((B))
   ((n.name is 0)  ;; end of utterance
    ((BB))
    ((lisp_since_punctuation > 5)
     ((lisp_until_punctuation > 5)
      ((gpos is content)
        ((n.gpos is content)
         ((NB))
         ((B)))   ;; next is not content, so a function word
        ((NB)))   ;; this is a function word
       ((NB)))    ;; too close to punctuation
      ((NB)))     ;; too soon after punctuation
    ((NB))))))

To use this, add the above to a file in your festvox/ directory and ensure it is loaded by your standard voice file. In your voice definition function, add the following

   (set! guess_pos english_guess_pos) ;; or appropriate for your language
 
   (Parameter.set 'Phrase_Method 'cart_tree)
   (set! phrase_cart_tree simple_phrase_cart_tree_2)

A much better method for predicting phrase breaks is to use a full statistical model trained from data. The problem is that you need a lot of data to train phrase break models. Elsewhere in this document we suggest the use of a TIMIT-style database of around 460 sentences (around 14,500 segments) for training models. However, a database such as this has very few utterance-internal phrase breaks. An almost perfect model would predict breaks at the end of each utterance and never internally. Even the f2b database from the Boston University Radio News Corpus [ostendorf95], which does have a number of utterance-internal breaks, isn't really big enough. For English we used the MARSEC database [roach93], which is much larger (around 37,000 words). Finding such a database for your language will not be easy, and you may need to fall back on a purely hand-written rule system.

Syntax is often suggested as a strong correlate of prosodic phrasing. Although there is evidence that it influences prosodic phrasing, there are notable exceptions [bachenko90]. Also, considering how difficult it is to get a reliable parse tree, it is probably not worth the effort; training a reliable parser is non-trivial (though we provide a method for training stochastic context-free grammars in the speech tools; see the manual for details). Of course, if your text to be synthesized is coming from a language system, such as machine translation or language generation, then a syntax tree may be readily available. In that case a simple rule mechanism taking syntactic phrasing into account may be useful.

When only moderate amounts of data are available for training a simple CART tree may be able to tease out a reasonable model. See [hirschberg94] for some discussion on this. Here is a short example of building a CART tree for phrase prediction. Let us assume you have a database of utterances as described previously. By convention we build models in directories under festival/ in the main database directory. Thus let us create festival/phrbrk.

First we need to list the features that are likely to be suitable predictors for phrase breaks, and add them to a file phrbrk.feats. What goes in here will depend on what you have: full part of speech tagging helps a lot, but you may not have that for your language. The gpos feature described above is a good cheap alternative. Possible features may be

word_break
lisp_token_end_punc
lisp_until_punctuation
lisp_since_punctuation
p.gpos
gpos
n.gpos

Given this list, you can extract the features from your database of utterances with the Festival script dumpfeats

dumpfeats -eval ../../festvox/phrbrk.scm -feats phrbrk.feats \
   -relation Word -output phrbrk.data ../utts/*.utts
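
Each line of phrbrk.data will then contain one word's values for the features listed in phrbrk.feats, in that order, with the predictee word_break first. The exact values below are invented for illustration only:

NB 0 4 1 det content content
BB . 0 5 content aux 0
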

festvox/phrbrk.scm should contain the definitions of the functions until_punctuation, since_punctuation and any other Lisp feature functions you define.

Next we want to split this data into test and train sets. We provide a simple shell script called traintest which splits a given file 9:1, i.e. every tenth line is put in the test set, producing the files phrbrk.data.train and phrbrk.data.test.

traintest phrbrk.data

As we intend to run wagon, the CART tree builder, on this data, we also need to create a feature description file for the data. The feature description file consists of a bracketed list of feature names and types. A type may be int or float, or, for categorical features, a list of the possible values. The script make_wagon_desc (distributed with the speech tools) will make a reasonable approximation to this file

make_wagon_desc phrbrk.data phrbrk.feats phrbrk.desc

This script treats all features as categorical. Thus any float or int features will be treated categorically, and each value found in the data will be listed as a separate item. In our example, lisp_since_punctuation and lisp_until_punctuation are actually float (well, maybe even int), but they will be listed as categorical in phrbrk.desc, something like

...
(lisp_since_punctuation
0
1
2
4
3
5
6
7
8)
...

You should change this entry (by hand) to be

...
(lisp_since_punctuation float )
...

The script cannot work out the type of a feature automatically so you must make this decision yourself.

Now that we have the data and description we can build a CART tree. The basic command for wagon will be

wagon -desc phrbrk.desc -data phrbrk.data.train -test phrbrk.data.test \
   -output phrbrk.tree

You will probably also want to set a stop value. The default stop value is 50, which means there must be at least 50 examples in a group before wagon will consider looking for a question to split it. Unless you have a lot of data, this is probably too large, and a value of 10 to 20 is probably more reasonable.
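
For example, with a stop value of 10 the wagon call becomes

wagon -desc phrbrk.desc -data phrbrk.data.train -test phrbrk.data.test \
   -stop 10 -output phrbrk.tree
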

Other arguments to wagon should also be considered. A stepwise approach, where features are tested incrementally to find the subset which gives the best tree, can give better results than simply using all features. Though care should be taken with this, as the generated tree becomes optimized for the given test set. Thus a further held-out test set is required to properly test the accuracy of the result. In the stepwise case it is normal to split the train set again and call wagon as follows

traintest phrbrk.data.train
wagon -desc phrbrk.desc -data phrbrk.data.train.train \
   -test phrbrk.data.train.test \
   -output phrbrk.tree -stepwise
wagon_test -data phrbrk.data.test -desc phrbrk.desc \
   -tree phrbrk.tree

Stepwise is particularly useful when features are highly correlated with each other and it is not clear which is the best general predictor. Note that stepwise will take much longer to run, as it potentially must build a large number of trees.

Other arguments to wagon can also be considered; refer to the relevant chapter of the speech tools manual for details.

However, it should be noted that without good intonation and duration models, spending time on producing good phrasing is probably not worth it. The quality of these three prosodic components is closely related, such that if one is much better than the others there may not be any real benefit.