IIIT-H Indic Speech Databases, IIIT Hyderabad, India.

The IIIT-H Indic speech databases were developed at the Speech and Vision Lab, IIIT-H, for the purpose of building speech synthesis systems in Indian languages.


Currently, the IIIT-H Indic speech databases consist of text and speech data in Bengali, Hindi, Kannada, Malayalam, Marathi, Tamil and Telugu. These languages were chosen because each had more than 10,000 Wikipedia articles and native speakers of each were available on campus. Each of these languages has several dialects. As an initial approximation, we chose to record the speech in the dialect with which the native speaker was comfortable.

Text Data

We used Wikipedia articles in Indian languages as our text corpus. A set of 1000 sentences was selected for each language. These sentences were selected to cover the 5000 most frequent words in the text corpus of the corresponding language. The text data is made available in IT3 (a transliteration scheme) as well as in Unicode (UTF-8 encoding).
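
The selection procedure itself is not described here. As an illustration only, the following is a minimal sketch of one plausible approach: a greedy selection that repeatedly picks the sentence covering the most not-yet-covered frequent words. The greedy strategy, the whitespace tokenizer, and the names select_sentences, num_words and max_sentences are all assumptions made for this example, not the method actually used to build the databases.

```python
from collections import Counter


def tokenize(sentence):
    # Assumption: plain whitespace tokenization. A real pipeline would
    # likely also normalize punctuation and script-specific characters.
    return sentence.split()


def select_sentences(sentences, num_words=5000, max_sentences=1000):
    """Greedily pick sentences that cover the most frequent words.

    Hypothetical helper: returns (selected sentences, covered words).
    """
    token_sets = [set(tokenize(s)) for s in sentences]

    # Target vocabulary: the num_words most frequent words in the corpus.
    counts = Counter(w for s in sentences for w in tokenize(s))
    target = {w for w, _ in counts.most_common(num_words)}

    uncovered = set(target)
    remaining = list(range(len(sentences)))
    chosen = []

    while uncovered and remaining and len(chosen) < max_sentences:
        # Pick the sentence that adds the most uncovered target words.
        best = max(remaining, key=lambda i: len(uncovered & token_sets[i]))
        gain = uncovered & token_sets[best]
        if not gain:
            break  # no remaining sentence adds any new target word
        chosen.append(best)
        uncovered -= gain
        remaining.remove(best)

    return [sentences[i] for i in chosen], target - uncovered
```

Calling select_sentences(corpus_sentences) with the defaults mirrors the figures quoted above (1000 sentences, 5000 words). The greedy choice does not guarantee full coverage within the sentence limit, so the returned set of covered words should be checked against the target.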

Speech Recording

The speech data was recorded by a native speaker of the language. The recording was done in a studio environment using a standard headset microphone connected to a Zoom handy recorder. We used a handy recorder as it was highly portable and easy to operate. Using a headset kept both the distance from the microphone to the speaker's mouth and the recording level constant.


