unit size and type

The basic cluster unit selection code available in Festival uses segments as the unit size. However, the acoustic distance measure used in clustering includes a significant portion of the previous segment. Thus the cluster unit selection effectively selects diphones from the database.

By default, the type of unit the cluster selection code uses is based on the segment name. In the case of limited domain synthesis we have found that constraining this further gives both better and faster synthesis. Thus we allow the unit type to be defined by an arbitrary feature. In the default limited domain setup we use

SEGMENT_WORD

That is, the segment plus the word the segment comes from. Note that this doesn't mean we are doing word concatenation in our synthesizer. We are still selecting phone units, but these phones are differentiated by the word they come from; thus a /t/ from the word "unit" cannot be used to synthesize a /t/ in "table". The primary reason for doing this was to cut down the search, though it also notably improves synthesis quality. As we have constructed the database to have good coverage, this is a practical thing to do.

The feature function clunit_name constructs the unit type for a particular segment item. We have provided the above default (segment name plus (downcased) word name), but it is easy to extend this.
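The default naming rule described above can be illustrated with a minimal sketch. This is Python written for illustration only, not Festival's own code; the function name unit_type and the "_" separator are assumptions.

```python
def unit_type(segment: str, word: str) -> str:
    """Build a unit type from a segment name and the word it occurs in.

    Hypothetical helper: mirrors the default rule of segment name plus
    downcased word name. The "_" separator is an assumption.
    """
    return f"{segment}_{word.lower()}"


# The same phone in different words yields different unit types,
# so a /t/ from "unit" is never a candidate for a /t/ in "table".
t_in_unit = unit_type("t", "Unit")
t_in_table = unit_type("t", "table")
```

Because the two names differ, the selection search for "table" only ever considers units recorded in "table", which is what shrinks the search space.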

In one domain we have worked in, we wished to differentiate between words in different prosodic contexts. In particular, we wished to mark words as "questionable" so that we can ask users for confirmation. To do this we marked the "questionable" words in the prompts with a question mark prefix. We then recorded them with appropriate intonation and defined our clunit_name function to include "C_" if the word was prefixed by a question mark. For example, the following two prompts will be read in a different manner

theater is Squirrel Hill Theater
theater is ?Squirrel ?Hill ?Theater

Likewise, in unit selection the units in the word "Squirrel" will not be used to synthesize the word "?Squirrel", and vice versa. Although crude, this does give simple control over prosodic variation. However, the technique can cause the vocabulary of units to grow to the point where it ceases to be practical.
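A minimal sketch of such an extended naming rule follows, again in Python for illustration rather than as the actual clunit_name definition. The text only says that "C_" is included when the word has a question mark prefix; where exactly it appears in the name, the stripping of the "?", and the function name are all assumptions.

```python
def unit_type_with_confirmation(segment: str, word: str) -> str:
    """Unit type that keeps "questionable" words separate.

    Hypothetical helper. Assumes "C_" is placed before the word part
    of the name and that the "?" itself is dropped from the name.
    """
    if word.startswith("?"):
        return f"{segment}_C_{word[1:].lower()}"
    return f"{segment}_{word.lower()}"


plain = unit_type_with_confirmation("s", "Squirrel")
questioned = unit_type_with_confirmation("s", "?Squirrel")
```

Under these assumptions the two names never match, so units recorded with questioning intonation stay separate from the plain ones.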

It would be good if this technique had a back-off strategy so that, if no unit can be found for a particular word, other words would be allowed to contribute candidates. This is ultimately what general unit selection is. We do consider this our goal for unit types, but in the interest of building quick and reliable limited domain synthesizers we do not yet do this; it is an area we will experiment with. One specific area that only partially crosses this line is the synthesis of numbers. It seems very reasonable to allow units to be selected across related simple numbers (e.g. "seven" and "seventy"), but we have not experimented with that yet.
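To make the idea concrete, here is one way such a back-off could be sketched. This is purely hypothetical Python, not something the current code does: the index layout, the function names, and the choice to fall back to every unit with the same segment name are all assumptions.

```python
from collections import defaultdict


def build_indexes(units):
    """Index (unit_id, segment, word) triples by unit type and by segment."""
    by_type = defaultdict(list)
    by_segment = defaultdict(list)
    for unit_id, segment, word in units:
        by_type[f"{segment}_{word.lower()}"].append(unit_id)
        by_segment[segment].append(unit_id)
    return by_type, by_segment


def candidates(by_type, by_segment, segment, word):
    """Prefer units from the same word; otherwise back off to the segment."""
    exact = by_type.get(f"{segment}_{word.lower()}")
    if exact:
        return list(exact)
    # Back-off: no unit of this segment was recorded in this word,
    # so let the same segment from any other word contribute.
    return list(by_segment.get(segment, []))


# Toy database: "seventy" was never recorded, but "seven" and "six" were.
database = [(0, "s", "seven"), (1, "eh", "seven"), (2, "s", "six")]
by_type, by_segment = build_indexes(database)
from_same_word = candidates(by_type, by_segment, "s", "seven")
backed_off = candidates(by_type, by_segment, "s", "seventy")
```

A narrower back-off, for instance only between related numbers, would replace the segment-wide fallback with a lookup over a hand-built list of related words.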

One further important point should be highlighted about this method of defining unit types. Although including the word name in the unit name does greatly encourage whole words to be selected, it does not mean that joins in the synthesized utterances only occur at word boundaries. It is common for contiguous units to be selected from different occurrences of the same word. Mid-word joins at stable places (e.g. within vowels, or at stops) are common. The optimal coupling technique selects the best place within a word for the crossover between two different parts of the database.
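The idea of choosing a crossover point can be illustrated with a much simplified sketch. This is not Festival's optimal coupling implementation: it assumes each candidate is represented by a short window of acoustic feature frames around the nominal boundary, and it simply picks the pair of frames with the smallest Euclidean distance. The function name and the toy frame values are hypothetical.

```python
import math


def best_join(left_frames, right_frames):
    """Return (i, j, distance) for the closest pair of frames.

    left_frames: frames from the occurrence we are leaving.
    right_frames: frames from the occurrence we are moving to.
    Joining after left_frames[i] and continuing from right_frames[j]
    gives the smallest acoustic mismatch under this simple measure.
    """
    best = None
    for i, a in enumerate(left_frames):
        for j, b in enumerate(right_frames):
            d = math.dist(a, b)
            if best is None or d < best[2]:
                best = (i, j, d)
    return best


# Toy two-dimensional frames from two occurrences of the same word.
left = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0)]
right = [(9.0, 9.0), (1.0, 1.5), (3.0, 3.0)]
i, j, distance = best_join(left, right)
```

A real system would restrict the search to frames near the unit boundary and would typically weight the feature dimensions, but the principle of sliding the join to the most similar place is the same.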