Building Synthetic Voices
Prosodic phrasing in speech synthesis makes the whole utterance more understandable. Because of the size of people's lungs, there is a finite length of time a person can talk before they must take a breath, which defines an upper bound on prosodic phrases. However, we rarely make our phrases this maximum length, and instead use phrasing to mark groups within the speech. There is an apocryphal story of a speech synthesis example with an unnaturally long prosodic phrase being played at a conference presentation: at the end of the phrase the whole audience took a large intake of breath.
In most cases very simple prosodic phrasing is sufficient. A comparison of various prosodic phrasing techniques is given in [taylor98a], though we will also cover some of them here.
For English (and most likely for many other languages too), simple rules based on punctuation are a very good predictor of prosodic phrase boundaries. It is rare for punctuation to occur where there is no boundary, but a substantial number of prosodic boundaries are not explicitly marked with punctuation. Thus a prosodic phrasing algorithm based solely on punctuation will typically under-predict, but will rarely make a false insertion. However, depending on the application you wish to use the synthesizer for, it may be possible to explicitly add punctuation at desired phrase breaks, in which case a prediction system based solely on punctuation is adequate.
Festival supports two basic methods for predicting prosodic phrases, though any other method can easily be used. Note that these do not necessarily entail pauses in the synthesized output; pauses are further predicted from the prosodic phrase information.
The first basic method is by CART tree. A test is made on each word to predict whether it is at the end of a prosodic phrase. The basic CART tree returns B or BB (though it may return whatever form of break labels you consider appropriate, as long as the rest of your models support them). The two labels identify different levels of break, BB being used to denote a bigger break (an end of utterance).
The following tree is very simple and merely adds a break after the last word of a token that has following punctuation. Note the first condition is implemented by a Lisp function, as we want to ensure that only the last word in a token gets the break. (Earlier, erroneous versions of this would insert breaks after each word in "1984".) This tree is defined in festival/lib/phrase.scm in the standard distribution and is certainly a good first step in defining a phrasing model for a new language.

(set! simple_phrase_cart_tree
'
((lisp_token_end_punc in ("?" "." ":"))
  ((BB))
  ((lisp_token_end_punc in ("'" "\"" "," ";"))
   ((B))
   ((n.name is 0) ;; end of utterance
    ((BB))
    ((NB))))))
To make a better phrasing model requires more information. As the basic punctuation model under-predicts, we need information that will find reasonable boundaries within strings of words. In English, boundaries are more likely between content words and function words, because most function words come before the words they relate to; in Japanese, function words typically come after their related content words, so breaks are more likely between function words and content words. If you have no data to train from, hand-written rules, expressed in a CART tree, can exploit this fact and give a phrasing model that is better than a punctuation-only one. Basically, a rule could be: if the current word is a content word and the next is a function word (or the reverse, if that is appropriate for the language), and we are more than five words from a punctuation symbol, then predict a break. We may also want to ensure that we are at least five words from a predicted break too.
Note that the above basic rules aren't optimal, but when you are building a new voice in a new language and have no data to train from, you will get reasonably far with simple rules like these, such that phrase prediction will be less of a problem than the other problems you will find in your voice.
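To see the rule concretely before expressing it as a Festival CART tree, here is a sketch of the same logic in plain Python. This is not Festival code; the function name, the attached-punctuation convention, and the treatment of the five-word thresholds are illustrative assumptions based on the description above:

```python
def predict_breaks(words, is_content, punc=".,:;?!"):
    """Label each word B (break after) or NB using the simple rule:
    break at punctuation, or after a content word followed by a
    function word when more than 5 words from any punctuation and
    at least 5 words from the last predicted break.
    Punctuation is assumed attached to the preceding word ("cat,")."""
    breaks = []
    since_punc = 0   # words since last punctuation
    since_break = 0  # words since last predicted break
    # distance from each word to the next punctuation (computed backward)
    until_punc = [0] * len(words)
    d = 0
    for i in range(len(words) - 1, -1, -1):
        d = 0 if words[i][-1] in punc else d + 1
        until_punc[i] = d
    for i, w in enumerate(words):
        if w[-1] in punc:
            breaks.append("B")
            since_punc = since_break = 0
        elif (i + 1 < len(words)
              and is_content(w) and not is_content(words[i + 1])
              and since_punc > 5 and until_punc[i] > 5
              and since_break >= 5):
            breaks.append("B")
            since_break = 0
            since_punc += 1
        else:
            breaks.append("NB")
            since_punc += 1
            since_break += 1
    return breaks
```

With a short input the rule reduces to punctuation-only behaviour; only in long punctuation-free stretches does the content/function condition fire.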
To implement such a scheme we need three basic functions: one to determine whether the current word is a function or content word, one to determine the number of words since the previous punctuation (or start of utterance), and one to determine the number of words until the next punctuation (or end of utterance). The first of these is already provided for through the feature function gpos. This uses the word list in the Lisp variable guess_pos to determine the basic category of a word. Because in most languages the set of function words is very nearly a closed class, they can usually be explicitly listed. The format of the guess_pos variable is a list of lists whose first element is the set name and the rest of the list is the words that are part of that set. Any word not a member of any of these sets is defined to be in the set content. For example, the basic definition of this for English, given in festival/lib/pos.scm, is
(set! english_guess_pos
'((in of for in on that with by at from as if that against about
   before because if under after over into while without
   through new between among until per up down)
  (to to)
  (det the a an no some this that each another those every all any
   these both neither no many)
  (md will may would can could should must ought might)
  (cc and but or plus yet nor)
  (wp who what where how when)
  (pps her his their its our their its mine)
  (aux is am are was were has have had be)
  (punc "." "," ":" ";" "\"" "'" "(" "?" ")" "!")
  ))

The punctuation distance check can be written as a Lisp feature function:

(define (since_punctuation word)
  "(since_punctuation word)
Number of words since the last punctuation or the beginning of the utterance."
  (cond
   ((null word) 0) ;; beginning of utterance
   ((not (string-equal "0" (item.feat word "p.lisp_token_end_punc"))) 0)
   (t
    (+ 1 (since_punctuation (item.prev word))))))

The function looking forward would be:

(define (until_punctuation word)
  "(until_punctuation word)
Number of words until the next punctuation or the end of the utterance."
  (cond
   ((null word) 0) ;; end of utterance
   ((not (string-equal "0" (token_end_punc word))) 0)
   (t
    (+ 1 (until_punctuation (item.next word))))))

The whole tree using these features, which will insert a break at punctuation, or between a content word and a following function word when more than five words from any punctuation, is as follows:

(set! simple_phrase_cart_tree_2
'
((lisp_token_end_punc in ("?" "." ":"))
  ((BB))
  ((lisp_token_end_punc in ("'" "\"" "," ";"))
   ((B))
   ((n.name is 0) ;; end of utterance
    ((BB))
    ((lisp_since_punctuation > 5)
     ((lisp_until_punctuation > 5)
      ((gpos is content)
       ((n.gpos is content)
        ((NB))
        ((B))) ;; next word is not content, so a function word follows
       ((NB))) ;; this is a function word
      ((NB))) ;; too close to punctuation
     ((NB))) ;; too soon after punctuation
    ((NB))))))

To use this, add the above to a file in your festvox/ directory and ensure it is loaded by your standard voice file. Then, in your voice definition function, add the following:

(set! guess_pos english_guess_pos) ;; or the appropriate set for your language
(Parameter.set 'Phrase_Method 'cart_tree)
(set! phrase_cart_tree simple_phrase_cart_tree_2)
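The guess_pos mechanism above amounts to a word-to-set lookup with a default class of content. For experimenting with phrasing rules outside Festival, the same idea can be sketched in Python; make_gpos is an illustrative name, and the first-set-wins policy for words listed in more than one set is an assumption, not Festival's documented behaviour:

```python
def make_gpos(sets):
    """Build a word->class lookup from guess_pos-style entries, where each
    entry is (set-name, word, word, ...); words in no set are 'content'."""
    table = {}
    for entry in sets:
        name, members = entry[0], entry[1:]
        for w in members:
            table.setdefault(w, name)  # first matching set wins (assumption)
    return lambda w: table.get(w, "content")
```

For example, `make_gpos([("md", "will", "may"), ("cc", "and", "but")])` returns a function mapping "will" to "md", "and" to "cc", and any unlisted word to "content".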
A much better method for predicting phrase breaks is to use a full statistical model trained from data. The problem is that you need a lot of data to train phrase break models. Elsewhere in this document we suggest the use of a timit-style database of around 460 sentences (around 14,500 segments) for training models. However, a database such as this has very few utterance-internal phrase breaks: an almost perfect model would predict breaks at the end of each utterance and never internally. Even the f2b database from the Boston University Radio News Corpus [ostendorf95], which does have a number of utterance-internal breaks, isn't really big enough. For English we used the MARSEC database [roach93], which is much larger (around 37,000 words). Finding such a database for your language will not be easy, and you may need to fall back on a purely hand-written rule system.
Syntax is often suggested as a strong correlate of prosodic phrasing. Although there is evidence that it influences prosodic phrasing, there are notable exceptions [bachenko90]. Also, considering how difficult it is to get a reliable parse tree, it is probably not worth the effort; training a reliable parser is non-trivial (though we provide a method for training stochastic context-free grammars in the speech tools, see the manual for details). Of course, if your text to be synthesized is coming from a language system, such as machine translation or language generation, then a syntax tree may be readily available, in which case a simple rule mechanism taking syntactic phrasing into account may be useful.
When only moderate amounts of data are available for training, a simple CART tree may be able to tease out a reasonable model. See [hirschberg94] for some discussion of this. Here is a short example of building a CART tree for phrase break prediction. Let us assume you have a database of utterances as described previously. By convention, we build models in directories under festival/ in the main database directory; thus let us create festival/phrbrk.
First we need to list the features that are likely to be suitable predictors for phrase breaks. Add these to a file phrbrk.feats; what goes in here will depend on what information you have available. Full part of speech helps a lot, but you may not have that for your language; the gpos feature described above is a good cheap alternative. Possible features may be

word_break
lisp_token_end_punc
lisp_until_punctuation
lisp_since_punctuation
p.gpos
gpos
n.gpos

Given this list, you can extract these features from your database of utterances with the Festival script dumpfeats:

dumpfeats -eval ../../festvox/phrbrk.scm -feats phrbrk.feats \
     -relation Word -output phrbrk.data ../utts/*.utts

festvox/phrbrk.scm should contain the definitions of the functions until_punctuation, since_punctuation and any other Lisp feature functions you define.

Next we want to split this data into test and train sets. We provide a simple shell script called traintest which splits a given file 9:1, i.e. every 10th line is put in the test set:

traintest phrbrk.data

As we intend to run wagon, the CART tree builder, on this data, we also need to create a feature description file for it. The feature description file consists of a bracketed list of feature names and types. A type may be
int, float, or categorical, where a list of possible values is given. The script make_wagon_desc (distributed with the speech tools) will make a reasonable approximation of this file:

make_wagon_desc phrbrk.data phrbrk.feats phrbrk.desc

This script will treat all features as categorical. Thus any float or int features will also be treated categorically, and each value found in the data will be listed as a separate item. In our example, lisp_since_punctuation and lisp_until_punctuation are actually float (well, maybe even int), but they will be listed categorically in phrbrk.desc, something like

...
(lisp_since_punctuation
0
1
2
4
3
5
6
7
8)
...

The script cannot work out the type of a feature automatically, so you must make this decision yourself. You should change this entry (by hand) to be

...
(lisp_since_punctuation float )
...
Now that we have the data and its description we can build a CART tree. The basic command for wagon will be:

wagon -desc phrbrk.desc -data phrbrk.data.train -test phrbrk.data.test \
     -output phrbrk.tree

You will probably also want to set a stop value. The default stop value is 50, which means there must be at least 50 examples in a group before wagon will consider looking for a question to split on. Unless you have a lot of data this is probably too large, and a value of 10 to 20 is probably more reasonable.
Other arguments to wagon should also be considered. A stepwise approach, where features are tested incrementally to find the subset which gives the best tree, can give better results than simply using all features. Care should be taken with this, though, as the generated tree becomes optimized for the given test set; a further held-out test set is then required to properly test the accuracy of the result. In the stepwise case it is normal to split the train set again and call wagon as follows:

traintest phrbrk.data.train
wagon -desc phrbrk.desc -data phrbrk.data.train.train \
     -test phrbrk.data.train.test \
     -output phrbrk.tree -stepwise

Stepwise is particularly useful when features are highly correlated with each other and it is not clear which is the best general predictor. Note that stepwise will take much longer to run, as it potentially must build a large number of trees.

The resulting tree can then be tested on the original held-out test set:

wagon_test -data phrbrk.data.test -desc phrbrk.desc \
     -tree phrbrk.tree
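Overall correctness on this task can be misleading: most words are not breaks, so a model that rarely predicts internal breaks still scores well. It is therefore worth also computing precision and recall on the break labels themselves. A minimal sketch, assuming label lists extracted from the predicted and reference data, with the B/BB/NB names following the trees above:

```python
def break_precision_recall(predicted, actual, break_labels=("B", "BB")):
    """Precision and recall of break prediction: a word counts as a break
    if its label is in break_labels, otherwise as a non-break."""
    pred = [p in break_labels for p in predicted]
    act = [a in break_labels for a in actual]
    true_pos = sum(1 for p, a in zip(pred, act) if p and a)
    n_pred = sum(pred)
    n_act = sum(act)
    precision = true_pos / n_pred if n_pred else 0.0
    recall = true_pos / n_act if n_act else 0.0
    return precision, recall
```

A punctuation-only model will typically show high precision but low recall, which quantifies the under-prediction discussed earlier.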
Other arguments to wagon can also be considered; refer to the relevant chapter of the speech tools manual for details.
However, it should be noted that without good intonation and duration models, spending time on producing good phrasing is probably not worth it. The quality of these three prosodic components is closely related, such that if one is much better than the others there may not be any real benefit.