This chapter discusses some of the options for building waveform synthesizers using unit selection techniques in Festival. This is still very much an on-going research question and we are still adding new techniques as well as improving existing ones, so the techniques described here are not as mature as those described in the previous diphone chapter.
By "unit selection" we actually mean the selection of some unit of speech which may be anything from whole phrase down to diphone (or even smaller). Technically diphone selection is a simple case of this. However typically what we mean is unlike diphone selection, in unit selection there is more than one example of the unit and some mechanism is used to select between them at run-time.
ATR's CHATR hunt96 system, and earlier work at that lab nuutalk92, is an excellent example of one particular method for selecting between multiple examples of a phone within a database. For a discussion of why a more generalized inventory of units is desired see campbell96, though we will reiterate some of the points here. With diphones a fixed view of the possible space of speech units has been made, which we all know is not ideal. There are articulatory effects which extend over more than one phone; e.g. /s/ can take on artifacts of the roundness of the following vowel even over an intermediate stop, as in `spout' vs `spit'. But it is not just obvious segmental effects that cause variation in pronunciation: syllable position, and word/phrase initial and final position, typically show a different level of articulation from segments taken from word-internal position. Stress and accents also cause differences. Rather than try to explicitly list the desired inventory of all these phenomena and then have to record all of them, a potential alternative is to take a natural distribution of speech and (semi-)automatically find the distinctions that actually exist, rather than predefining them.
The theory is obvious, but the design of such systems, finding the appropriate selection criteria, and weighting the costs of the relative candidates is a non-trivial problem. Techniques like this often produce very high quality, very natural sounding synthesis. However, they can also produce some very bad synthesis, when the database has unexpected holes and/or the selection costs fail.
Two forms of unit selection will be discussed here, not because we feel they are the best but simply because they are the ones actually implemented by us and hence can be distributed. These should still be considered research systems. Unless you are specifically interested in, or have the expertise for, developing new selection techniques it is not recommended that you try these. If you need a working voice within a month and can't afford to miss that deadline, then the diphone option is safe, well tried and stable.
This is a reimplementation of the techniques described in black97c. The idea is to take a database of general speech and try to cluster each phone type into groups of acoustically similar units based on the (non-acoustic) information available at synthesis time, such as phonetic context, prosodic features (F0 and duration) and higher level features such as stressing, word position, and accents. The actual features used may easily be changed and experimented with, as can the definition of acoustic distance between the units in a cluster.
In some sense this work builds on the results of both the CHATR selection algorithm hunt96 and the work of donovan95, but differs in some important and significant ways. Specifically, in contrast to hunt96, this cluster algorithm pre-builds CART trees to select the appropriate cluster of candidate phones, thus avoiding the computationally expensive function of calculating target costs (through linear regression) at selection time. Secondly, because the clusters are built directly from the acoustic scores and target features, a target estimation function isn't required, removing the need to calculate weights for each feature. This cluster method differs from the clustering method in donovan95 in that it can use more generalized features in clustering and uses a different acoustic cost function (Donovan uses HMMs); also his work is based on sub-phonetic units (HMM states). Further, Donovan selects a single candidate, while here we select a group of candidates and find the best overall selection by finding the best path through each set of candidates for each target phone, in a manner similar to hunt96 and iwahashi93 before it.
The basic processes involved in building a waveform synthesizer for the clustering algorithm are as follows.
Unlike diphone databases, which are carefully constructed to ensure specific coverage, one of the advantages of unit selection is that a much more general database is desired. However, although voices may be built from existing data not specifically gathered for synthesis, there are still factors about the data that will help make better synthesis.
As with diphone databases, the more cleanly and carefully the speech is recorded, the better the synthesized voice will be. As we are going to be selecting units from different parts of the database, the more similar the recordings are, the less likely bad joins will be. However, unlike diphone databases, prosodic variation is probably a good thing, as it is those variations that can make synthesis from unit selection sound more natural. Good phonetic coverage is also useful, at least phone coverage if not complete diphone coverage. Also, synthesis using these techniques seems to retain aspects of the original database. If the database is broadcast news stories, the synthesis from it will typically sound like read news stories (or more importantly, will sound best when it is reading news stories).
Although it is too early to make definitive statements about what size and type of data is best for unit selection, we do have some rough guides. A TIMIT-like database of 460 phonetically balanced sentences (around 14,000 phones) is not an unreasonable first choice. If the text has not been specifically selected for phonetic coverage, a larger database is probably required; for example, the Boston University Radio News Corpus speaker f2b ostendorf95 has been used relatively successfully. Of course all this depends on what use you wish to make of the synthesizer: if it is to be used in more restrictive environments (as is often the case), tailoring the database for the task is a very good idea. If you are going to be reading a lot of telephone numbers, having a significant number of examples of read numbers will make synthesis of numbers sound much better.
The database used as an example here is a TIMIT 460 sentence database read by an American male speaker.
Again the notes about recording the database apply, though it will sometimes be the case that the database is already recorded and beyond your control; in that case you will always have something legitimate to blame for poor quality synthesis.
Throughout our discussion we will assume the following database layout. It is highly recommended that you follow this format, otherwise scripts and examples will fail. There are many ways to organize databases and many such choices are arbitrary; here is our "arbitrary" layout.
The basic database directory should contain the following directories:
bin/       (any database-specific scripts)
wav/       (the waveform files)
lab/       (the segment label files)
wrd/       (the word label files)
lar/       (the laryngograph files, if collected)
pm/        (the pitchmark files)
festival/  (utterance structures, features, coefficients etc.)
Other directories will be created for various processing reasons.
In order to make access well defined you need to construct Festival utterance structures for each of the utterances in your database. This (in its basic form) requires labels for: segments, syllables, words, phrases, F0 targets, and intonation events. Ideally these should all be carefully hand labelled, but in most cases that's impractical. There are ways to automatically obtain most of these labels, but you should be aware of the inherent errors in the labelling system you use (including labelling systems that involve human labellers). Note that when a unit selection method fundamentally depends on segment boundaries, its quality is ultimately going to be determined by the quality of the segmental labels in the database.
For the unit selection algorithm described below, the segmental labels should use the same phoneset as the actual synthesis voice. However a more detailed phonetic labelling may be more useful (e.g. marking closures in stops), mapping that information back to the phone labels before actual use. Autoaligned databases typically aren't accurate enough for use in unit selection. Most autoaligners are built using speech recognition technology where actual phone boundaries are not the primary measure of success. General speech recognition systems primarily measure words correct (or more usefully, semantically correct) and do not require phone boundaries to be accurate. If the database is to be used for unit selection, it is very important that the phone boundaries are accurate. Having said this, we have successfully used the aligner described in the diphone chapter above to label general utterances where we knew which phone string we were looking for; using such an aligner may be a useful first pass, but the result should always be checked by hand.
It has been suggested that aligning techniques and unit selection training techniques can be used to judge the accuracy of the labels, and basically exclude any segments that appear to fall outside the typical range for their segment type. Thus it is believed that unit selection algorithms should be able to deal with a certain amount of noise in the labelling. This is the desire of researchers in the field, but we are some way from that, and at present the easiest way to improve the quality of unit selection algorithms is to ensure that segmental labelling is as accurate as possible. Once we have a better handle on selection techniques themselves it will then be possible to start experimenting with noisy labelling.
However it should be added that this unit selection technique (and many others) support what is termed "optimal coupling" (conkie96), where the acoustically most appropriate join point is found automatically at run time when two units are selected for concatenation. This technique is inherently robust to boundary labelling errors of at least a few tens of milliseconds.
For the cluster method defined here it is best to construct more than simply segments, durations and an F0 target. A whole syllabic structure plus word boundaries, intonation events and phrasing allow a much richer set of features to be used for clusters. See section 7.4 Utterance building for a more general discussion of how to build utterance structures for a database.
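As a quick sanity check that your utterance structures are usable, you can load one into Festival and walk its relations. The following is a minimal sketch, assuming the utterances have been saved under `festival/utts/' and that `kdt_001.utt' is one of them:

(set! utt1 (utt.load nil "festival/utts/kdt_001.utt"))
;; print each segment's name and end time
(mapcar
 (lambda (seg)
   (format t "%s %f\n" (item.name seg) (item.feat seg "end")))
 (utt.relation.items utt1 'Segment))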
In order to cluster similar units in a database we build an acoustic representation of them. This is also still a research issue, but in the example here we will use Mel cepstrum plus delta Mel cepstrum plus F0, though this is open to change (and can easily be changed).
Here is an example script which will generate these parameters for a database; it is included in `festvox/src/unitsel/make_mcep'. The main loop generates the cepstrum parameters and the F0, and then combines them into a single file with F0 as parameter 0. This format is assumed for the later acoustic measures, though the number of cepstrum/delta cepstrum parameters may be changed if desired.
ESTDIR=/usr/awb/projects/speech_tools/main
PDA_PARAMS="-fmax 180 -fmin 80"
SIG2FV=$ESTDIR/sig2fv
SIG2FVPARAMS='-coefs melcep -delta melcep -melcep_order 12 \
   -fbank_order 24 -shift 0.01 -factor 2.5 -preemph 0.97'

for i in $*
do
   fname=`basename $i .wav`
   echo $fname
   $SIG2FV $SIG2FVPARAMS -otype ascii $i -o /tmp/tmp.$$.ascii
   if [ ! -f festival/f0/$fname.f0 ]
   then
      $ESTDIR/pda -s 0.01 -o festival/f0/$fname.f0 -otype ascii \
         $PDA_PARAMS wav/$fname.wav
   fi
   $ESTDIR/ch_track -pc first -itype ascii -s 0.010 -otype htk \
      festival/f0/$fname.f0 /tmp/tmp.$$.ascii \
      -o festival/coeffs/$fname.dcoeffs
   rm /tmp/tmp.$$.*
done
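Assuming this script is installed as `bin/make_mcep' in the database directory, it can be run over all the waveforms with, for example:

bin/make_mcep wav/*.wav

Note the output directories `festival/f0/' and `festival/coeffs/' must already exist; the script does not create them.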
The above builds coefficients at fixed frames. We have also experimented with building parameters pitch synchronously and have found a slight improvement in the usefulness of the measure based on this. We do not pretend that this part of the system is particularly neat, but it does work. When pitch synchronous parameters are built, the clunits module will automatically put the local F0 value in coefficient 0 at load time. This happens to be appropriate for LPC coefficients. The script in `festvox/src/general/make_lpc' can be used to generate the parameters, assuming you have already generated pitch marks.
Note that a secondary advantage of using LPC coefficients is that they are required anyway for LPC resynthesis, so less information about the database is required at run time. We have not yet tried pitch synchronous Mel frequency cepstrum coefficients, but that should be tried. Also a more general duration/number-of-pitch-periods matching algorithm is worth defining.
Cluster building is mostly automatic. Of course you need the clunits module compiled into your version of Festival. Version 1.3.1 or later is required; the version of clunits in 1.3.0 is buggy and incomplete and will not work. To compile in clunits, add

ALSO_INCLUDE += clunits

to the end of your `festival/config/config' file, and recompile. To check if an installation already has support for clunits, check the value of the variable *modules*.
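For example, from the Festival prompt:

festival> (print *modules*)

If clunits appears in the printed list, support has already been compiled in.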
The file `festival/src/modules/clunits/acost.scm' contains the basic code to build a cluster model for a database that has utterance structures and acoustic parameters. The function do_all will build the distance tables, dump the features and build the cluster trees. The many parameters are set for the particular database (and instance of cluster building) through the Lisp variable clunits_params. An example is given in `festival/src/modules/clunits/ked_params.scm' for the KED TIMIT database.
The function do_all runs through all the steps, but as some of the steps are relatively time consuming, there may be times when each of the steps needs to be run individually. We will go through each step in turn, explaining which parameters affect it.
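Thus for the KED setup, a complete build might be run as follows. This is a sketch only: we assume do_all takes no arguments and that the parameter file sets up clunits_params and the related variables it needs.

(load "festival/src/modules/clunits/acost.scm")
(load "festival/src/modules/clunits/ked_params.scm")
(do_all)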
The first stage is to load in all the utterances in the database, sort them into segment types, and name each unit individually (as <type>_<num>). This first stage is required for all other stages, so even if you are not running do_all you still need to run this stage first. This is done by the calls
(format t "Loading utterances and sorting types\n") (set! utterances (acost:db_utts_load dt_params)) (set! unittypes (acost:find_same_types utterances)) (acost:name_units unittypes)
though the function do_init will do the same thing. This stage uses the following parameters:

name
    A name for the database (e.g. ked_timit).
db_dir
    The pathname of the database directory.
utts_dir
    The directory, relative to db_dir, containing the utterance structures.
utts_ext
    The file extension for the utterance files.
files
    The list of file ids in the database.
For the KED example these parameters are:

(name 'ked_timit)
(db_dir "/usr/awb/data/timit/ked/")
(utts_dir "festival/utts/")
(utts_ext ".utt")
(files ("kdt_001" "kdt_002" "kdt_003" ... ))
The next stage is to load the acoustic parameters and build the distance tables. The acoustic distance between each pair of segments of the same type is calculated and saved in the distance table. Precalculating this saves a lot of time, as the clustering will require these distances many times.
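As a rough sketch of the kind of measure involved (following black97c; the exact computation in the code may differ in detail), the acoustic distance between two units U and V is a mean weighted frame distance plus a duration penalty, with the frames of the shorter unit linearly mapped onto those of the longer:

A(U,V) = (1/n) * sum_{i=1..n} sqrt( sum_j w_j (U_i(j) - V_i(j))^2 )
         + p_dur * |dur(U) - dur(V)|

where the w_j are the ac_weights described below and p_dur is the duration penalty; ac_left_context additionally includes a weighted portion of the preceding unit in the comparison.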
This is done by the following two function calls
(format t "Loading coefficients\n") (acost:utts_load_coeffs utterances) (format t "Building distance tables\n") (acost:build_disttabs unittypes clunits_params)
The following parameters influence the behaviour:

coeffs_dir
    The directory, relative to db_dir, containing the acoustic coefficient files.
coeffs_ext
    The file extension for the coefficient files.
get_std_per_unit
    Takes the value t or nil. If t, the parameters for each segment type are normalized using the means and standard deviations for that class, so that a mean Mahalanobis distance is found between units rather than simply a Euclidean distance.
ac_left_context <float>
    The weighting given to the portion of the preceding unit (left context) included in the acoustic distance calculation.
ac_duration_penality <float>
    The penalty weighting for duration mismatch between units.
ac_weights (<float> <float> ...)
    The weights applied to each coefficient (F0 is coefficient 0) in the acoustic distance measure.
An example from KED is:

(coeffs_dir "festival/coeffs/")
(coeffs_ext ".dcoeffs")
(dur_pen_weight 0.1)
(get_stds_per_unit t)
(ac_left_context 0.8)
(ac_weights
  (1.0
   0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5
   2.0 2.0 2.0 2.0 2.0 2.0 2.0 2.0 2.0 2.0 2.0 2.0))

Note there are 25 weights: one for F0 (coefficient 0), twelve for the Mel cepstrum coefficients and twelve for their deltas, matching the coefficient files built above.
The next stage is to dump the features that will be used to index the clusters. Remember, the clusters are defined with respect to the acoustic distance between each unit in the cluster, but they are indexed by these features. These features are those which will be available at text-to-speech time, when no acoustic information is available. Thus they include things like phonetic and prosodic context rather than spectral information. The named features may (and probably should) be over-general, allowing the decision tree building program wagon to decide which of these features actually make an acoustic distinction in the units.
The function to dump the features is
(format t "Dumping features for clustering\n") (acost:dump_features unittypes utterances clunits_params)
The parameters which affect this function are:

feats_dir
    The directory, relative to db_dir, where the feature files will be saved.
feats
    The list of features to dump for each unit.
For our KED example these values are:

(feats_dir "festival/feats/")
(feats
  (occurid
   p.name p.ph_vc p.ph_ctype p.ph_vheight p.ph_vlng
   p.ph_vfront p.ph_vrnd p.ph_cplace p.ph_cvox
   n.name n.ph_vc n.ph_ctype n.ph_vheight n.ph_vlng
   n.ph_vfront n.ph_vrnd n.ph_cplace n.ph_cvox
   segment_duration
   seg_pitch p.seg_pitch n.seg_pitch
   R:SylStructure.parent.stress
   seg_onsetcoda n.seg_onsetcoda p.seg_onsetcoda
   R:SylStructure.parent.accented
   pos_in_syl syl_initial syl_final
   R:SylStructure.parent.syl_break
   R:SylStructure.parent.R:Syllable.p.syl_break
   pp.name pp.ph_vc pp.ph_ctype pp.ph_vheight pp.ph_vlng
   pp.ph_vfront pp.ph_vrnd pp.ph_cplace pp.ph_cvox))
Now that we have the acoustic distances and the feature descriptions of each unit, the next stage is to find a relationship between those features and the acoustic distances. This we do using the CART tree builder wagon. It will find questions about the features that best minimize the acoustic distance between the units in each class. wagon has many options, many of which are appropriate to this task. It is worth noting that this learning task is closed: we are trying to classify all the units in the database, so there is no test set as such. However, in synthesis there will be desired units whose feature vectors didn't exist in the training set.
The clusters are built by the following function
(format t "Building cluster trees\n") (acost:find_clusters (mapcar car unittypes) clunits_params)
The parameters that affect the tree building process are:

trees_dir
    The directory, relative to db_dir, where the cluster trees will be saved.
wagon_field_desc <file>
    A file containing the field descriptions for the features named in the feats parameter above.
wagon_progname <file>
    The pathname of the wagon program to use; flags such as -stepwise may be included here.
wagon_cluster_size <int>
    The minimum number of units in a cluster (wagon's -stop value).
prune_reduce <int>
    Reduce each cluster by removing this many of its most distant units after the tree is built (0 means no pruning).
Note that as the distance tables can be large, there is an alternative function that does both the distance table building and the clustering in one pass, deleting each distance table immediately after use; thus you only need enough disk space for the largest number of phones in any type. To do this, use

(acost:disttabs_and_clusters unittypes clunits_params)

removing the calls to acost:build_disttabs and acost:find_clusters above.
In our KED example these have the values:

(trees_dir "festival/trees/")
(wagon_field_desc "festival/clunits/all.desc")
(wagon_progname "/usr/awb/projects/speech_tools/bin/wagon")
(wagon_cluster_size 10)
(prune_reduce 0)
The final stage in building a cluster model is to collect the generated trees into a single file and dump the unit catalogue, i.e. the list of unit names, the files they come from, and their positions in those files. This is done by the Lisp calls
(acost:collect_trees (mapcar car unittypes) clunits_params)
(format t "Saving unit catalogue\n")
(acost:save_catalogue utterances clunits_params)
The only parameter that affects this stage is:

catalogue_dir
    The directory, relative to db_dir, where the catalogue will be saved (the name parameter is used to name the catalogue file).

In the KED example this is:

(catalogue_dir "festival/clunits/")
There are a number of parameters that are specified with a cluster voice, relating to the run-time aspects of the cluster model. These are:
join_weights (<float> <float> ...)
    A list of weights, of the same form as ac_weights, used in optimal coupling to find the best join point between two candidate units. This is kept separate from ac_weights as different values are likely desired, particularly an increased weight on the F0 value (column 0).
continuity_weight <float>
    The weight given to the join (continuity) cost relative to the target cost when searching for the best path through the candidates.
optimal_coupling <int>
    If 1, optimal coupling is used: the cepstrum vectors around each join point are searched to find the best possible join point. This is computationally expensive (as well as requiring lots of cepstrum files to be loaded), but does give better results.
extend_selections <int>
    If 1, the selected cluster will be extended to include any unit from the cluster of the previous segment's candidate units that has the correct phone type. This is experimental, but has shown its worth and hence is recommended. This means that instead of selecting just single units, selection is effectively selecting the beginnings of multiple-segment units. This option encourages far longer units.
pm_coeffs_dir <file>
    The directory (relative to db_dir) where the pitchmarks are.
pm_coeffs_ext <file>
    The file extension for the pitchmark files.
sig_dir <file>
    The directory (relative to db_dir) where the waveform signals are.
sig_ext <file>
    The file extension for the waveform files.
join_method <method>
    The method used to join the selected units. Valid values include simple, a very naive joining mechanism, and windowed, where the ends of the units are windowed using a hamming window then overlapped (no prosodic modification takes place though). The other two possible values for this feature are none, which does nothing, and modified_lpc, which uses the standard UniSyn module to modify the selected units to match the targets.
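As an illustration only (these values are not from a distributed voice; treat them as a plausible starting point), the run-time section of clunits_params might look like:

(join_weights
  (10.0
   0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5
   0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5))
(continuity_weight 5)
(optimal_coupling 1)
(extend_selections 1)
(pm_coeffs_dir "pm/")
(pm_coeffs_ext ".pm")
(sig_dir "wav/")
(sig_ext ".wav")
(join_method windowed)

Here the F0 column (coefficient 0) of join_weights has been increased relative to the cepstral columns, as suggested above.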
This cluster method is just a waveform synthesizer; it still requires text analysis and prosodic components. The only restriction is that the front end must generate the same sort of utterance structures as in your database, because selection uses features from utterances of the type used to train the selection trees. That is, you can't use a front end that uses different relation names and features.
Here we simply use the same front end as ked_diphone, as it is basically the same speaker.
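A voice definition tying this together might look like the following. This is only a sketch, not the distributed KED cluster voice: voice_ked_clunits and ked_clunits_params are hypothetical names, and the use of 'Cluster as the Synth_Method value is an assumption about how the clunits module is invoked.

(define (voice_ked_clunits)
"Set up ked with cluster unit selection waveform synthesis."
  (voice_ked_diphone)                       ; reuse ked's front end
  (set! clunits_params ked_clunits_params)  ; hypothetical parameter set
  (Parameter.set 'Synth_Method 'Cluster)    ; assumed clunits entry point
  (set! current-voice 'ked_clunits))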
A simple example of building a cluster unit selection synthesizer is given in section 10 Limited domain synthesis. In that example the features used in selection have been reduced and a few other simplifying assumptions have been made, but the underlying structure is the same. That is a good example to start from; then change the parameters as fully described above to improve the selection criteria.
As touched on above, the choice of an inventory of units can be viewed as a line from a small inventory of phones, through diphones and triphones, to arbitrary units, though the direction you come from influences the selection of the units from the database. CHATR campbell96 lies firmly at the "arbitrary units" end of the spectrum. Although it can exclude bad units from its inventory, it is very much an `everything minus some' view of the world. Microsoft's Whistler huang97, on the other hand, starts off with a general database but selects typical units from it. Thus its inventory is substantially smaller than the full general database the units are extracted from. At the other end of the spectrum we have the fixed pre-specified inventory, like the diphone synthesis described in the previous chapter.
In this section we'll give some examples of moving along the line from the fixed pre-specified inventory towards the more general inventories, using techniques that still have a strong component of prespecification.
First, let us assume you have a general database that is labelled with utterances as described above. We can extract a standard diphone database from this general database. However, unless the database was specifically designed for it, a general database is unlikely to have full diphone coverage. Even when phonetically rich databases are used, such as TIMIT, there are likely to be very few vowel-vowel diphones, as they are comparatively rare. But as these diphones are rare we may be able to do without them, and hence it is at least an interesting exercise to extract an as-complete-as-possible diphone index from a general database.
The simplest method is to linearly search for all phone-phone pairs in the phone set through all utterances, simply taking the first example. Some example code is given in `src/diphone/make_diphs_index.scm'. The basic idea is to load in all the utterances in a database, and index each segment by its phone name and succeeding phone name. Then various selection techniques can be used to select from the multiple candidates for each diphone (or you can split the indexing further). After selection, a diphone index file can be saved.
The utterances to load are identified by a list of fileids. For example, if the list of fileids (without parentheses) is in the file `etc/fileids', the following will build a diphone index.

festival .../make_diphs_utts.scm
...
festival> (set! fileids (load "etc/fileids" t))
...
festival> (make_diphone_index fileids "dic/f2bdiph.est")
Note that as this diphone index will contain a number of holes, you will need either to augment it with `similar' diphones or to process your diphone selections through UniSyn_module_hooks as described in the previous chapter.
As you complicate the selection, and increase the number of diphones you use from the database, you will need to complicate the names used to identify the diphones themselves. The convention of using underscores for syllable-internal consonant clusters and dollars for syllable-initial consonants can be followed, but you will need to go further if you wish to start introducing new features such as phrase finality and stress. Eventually, going to a generalized naming scheme (type and number), as used by the cluster selection technique described above, will prove worthwhile. Also, using CART trees, though hand written and fully deterministic (one candidate at the leaves), will be a reasonable way to select between hand-stipulated alternatives with reasonable backoff strategies.
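For instance, a hand-written, fully deterministic tree in Festival's CART format for choosing between stipulated examples of an a-t diphone might look like this (the feature values and unit names are purely illustrative):

((R:SylStructure.parent.stress is 1)
 ((a-t_5))
 ((R:SylStructure.parent.syl_break > 1)
  ((a-t_12))
  ((a-t_2))))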
Another potential direction is to use the acoustic costs used in the clustering methods described in the previous section. These can be used to identify which the most typical units in a cluster are (the mean distances from all other units are given in the leaves). Pruning these trees until each cluster contains only a single example should help to improve synthesis, in that variation among the units in the "diphone" index will then be determined by the features specified in the cluster training algorithm. Of course, as you limit the number of distinct unit types, more prosodic modification will be required by your signal processing algorithm, which requires that you have good pitch marks.
If you already have an existing database but don't wish to go to full unit selection, such techniques are probably quite feasible and worth further investigation.