Clearly, determining the correct voweling is a major consideration for Arabic TTS systems. Kirchhoff et al. [6] describe an approach to automatic romanization for spontaneous speech recognition that achieves 80% token accuracy in generating the correct diacritization, as measured against manual diacritization. This is an enormous improvement over the 50% accuracy measured for commercially available diacritizers, which are targeted toward MSA. For TTS, however, a much higher level of accuracy is required; this state-of-the-art result underscores the need for manual diacritization in synthesis. Even manual diacritization of dialects, however, is not free of ambiguity.
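To make the accuracy figures above concrete, the following minimal sketch shows how token accuracy is typically computed: the fraction of tokens in the system output whose diacritization exactly matches a manually diacritized reference. The romanized example strings and function name are illustrative assumptions, not data or code from [6].

```python
def token_accuracy(predicted, reference):
    """Fraction of whitespace-delimited tokens whose predicted
    diacritization exactly matches the manual reference."""
    pred_tokens = predicted.split()
    ref_tokens = reference.split()
    assert len(pred_tokens) == len(ref_tokens), "token streams must align"
    matches = sum(p == r for p, r in zip(pred_tokens, ref_tokens))
    return matches / len(ref_tokens)

# Illustrative romanized strings (not from [6]); the second token's
# final vowel is wrong, so accuracy is 0.5.
reference = "kataba Al-waladu"
predicted = "kataba Al-waladi"
print(token_accuracy(predicted, reference))  # 0.5
```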
Although context-dependent units are generally thought to produce the most natural synthesized speech, a large number of them are needed to cover the phonological and prosodic space of a language accurately. Context-independent diphone units can provide broad coverage with relatively small storage requirements. Elshafei, Al-Muhtaseb, and Al-Ghamdi [5] argue that the resulting degradation in coarticulatory modeling is less severe for Arabic than for other languages, partly because of its consonant-heavy structure. They describe a synthesis system for classical Arabic that uses diphones along with a few other specific sub-syllable units. They generate vowels automatically, but require a morpho-syntactic analyzer, because the correct phonetic realization (with vowelization and consonant doubling) can be inferred only with information about word classes and dependencies. The vowel-inference task is easier for classical Arabic, which is well described and for which the effects of dialect and spontaneous speech can be ignored.
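As a point of reference for how diphone inventories achieve broad coverage, the sketch below maps a phoneme sequence to diphone units, each spanning the transition between two adjacent phones. The padding symbol and the /kataba/ ("he wrote") example are illustrative assumptions; this is not the unit inventory of [5].

```python
def to_diphones(phonemes):
    """Map a phoneme sequence to diphone units.  Each diphone spans
    the second half of one phone and the first half of the next;
    '_' marks utterance-boundary silence."""
    padded = ["_"] + list(phonemes) + ["_"]
    return [f"{a}-{b}" for a, b in zip(padded, padded[1:])]

# Classical Arabic /kataba/ ("he wrote"):
print(to_diphones(["k", "a", "t", "a", "b", "a"]))
# ['_-k', 'k-a', 'a-t', 't-a', 'a-b', 'b-a', 'a-_']
```

Because each unit depends only on its two constituent phones, the inventory size is bounded by the square of the phoneme set rather than by the full space of phonological and prosodic contexts.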
El-Imam [4] addresses the problem of vowel generation by requiring fully diacritized input text. El-Imam's system has a fixed inventory of 452 subphonetic units: 400 representing the basic steady-state and transition units and 52 representing allophonic variation. Letter-to-sound rules are manually enumerated. Ben Sassi, Braham, and Balghith [9] also specify letter-to-sound rules manually in a neural-network-based diphone system. Each phoneme is represented by a feature vector in which the length of both vowels and consonants is one element. Diphone feature vectors are composed from the feature vectors of their constituent phonemes. A fully diacritized set of phonetically balanced sentences was used to train this system.
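A minimal sketch of the feature-vector composition just described, assuming concatenation as the composition rule: the particular features and values below are hypothetical placeholders, not the feature set of [9], though segment length appears as an element as the paper specifies.

```python
# Hypothetical phoneme feature vectors: [vowel?, long?, emphatic?, length_ms].
# The feature inventory and the concatenation rule are assumptions for
# illustration, not the actual design of Ben Sassi et al. [9].
PHONEME_FEATURES = {
    "k": [0.0, 0.0, 0.0, 60.0],
    "a": [1.0, 0.0, 0.0, 80.0],
    "A": [1.0, 1.0, 0.0, 160.0],  # long /a:/
}

def diphone_vector(p1, p2):
    """Compose a diphone input vector by concatenating the feature
    vectors of its two constituent phonemes."""
    return PHONEME_FEATURES[p1] + PHONEME_FEATURES[p2]

print(diphone_vector("k", "a"))
# [0.0, 0.0, 0.0, 60.0, 1.0, 0.0, 0.0, 80.0]
```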