Earlier work on Indian languages [5] and preliminary experiments with this Hindi database [6] suggested that a syllable-based approach to synthesis could lead to more reliable quality. Various unit sizes have been proposed for unit selection systems. [7] and other HMM-based techniques typically use sub-phonetic units, two or three per phoneme. AT&T's NextGen [8] uses half phones. FestVox's default method is phone-based. However, because FestVox supports optimal coupling [9], join points may be moved within the preceding unit, so that even with phone-sized units something closer to diphones is actually selected.
Larger units are also possible, from demi-syllables to syllables and beyond. [10] tie phones to words for domain synthesis; although this is not the same as having word-sized units, it is a step in that direction. The choice of unit size is an optimization problem: the larger the units, the fewer the discontinuities in synthesis, but the harder it is to ensure general coverage. Smaller units make it easier to cover the space of acoustic units, but at the cost of more joins.
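The trade-off between coverage and joins can be made concrete with a minimal sketch. The utterances and their syllabification are invented purely for illustration and are not taken from the Hindi database, and the helper names `segment` and `unit_stats` are hypothetical. The sketch counts how many distinct unit types a corpus would need to cover, and how many joins synthesis would require, for phone-sized versus syllable-sized units.

```python
# Toy illustration only: these utterances and their syllable boundaries are
# invented for this sketch; they are not drawn from any real database.
TOY_UTTERANCES = [
    [["k", "a"], ["m", "i"]],
    [["m", "a"], ["t", "i"]],
    [["t", "a"], ["k", "i"]],
    [["k", "a", "m"], ["t", "a"]],
]


def segment(utterance, unit):
    """Return the utterance as a list of units, each a tuple of phones."""
    if unit == "syllable":
        return [tuple(syllable) for syllable in utterance]
    if unit == "phone":
        return [(phone,) for syllable in utterance for phone in syllable]
    raise ValueError(f"unknown unit size: {unit}")


def unit_stats(utterances, unit):
    """Return (distinct unit types, total joins) for the given unit size."""
    inventory = set()
    joins = 0
    for utterance in utterances:
        units = segment(utterance, unit)
        inventory.update(units)
        # Concatenating n units requires n - 1 joins.
        joins += max(len(units) - 1, 0)
    return len(inventory), joins


if __name__ == "__main__":
    for unit in ("phone", "syllable"):
        types, joins = unit_stats(TOY_UTTERANCES, unit)
        print(f"{unit:8s} distinct unit types: {types}  joins: {joins}")
```

On this toy data, phone-sized units need 5 distinct types and 13 joins, while syllable-sized units need 7 distinct types and only 4 joins. The counts ignore context-dependent variants and optimal coupling, so they indicate only the direction of the trade-off, not its size in a real voice.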
The choice of unit size is also related to the language itself. Languages with a small, well-defined set of syllables may benefit from syllable-sized units. As Hindi has a much more regular syllable structure than English, we wanted to determine experimentally the optimal unit size for Hindi synthesis.