Building Synthetic Voices

Recording the diphones

The object of recording diphones is to get as uniform a set of pronunciations as possible. Your speaker should be relaxed, and not suffering from a cold, a cough, or a hangover. If something goes wrong with the recording and some of the examples need to be re-recorded, it is important that the speaker's voice is as similar as possible to the original recording; waiting for another cold to come along is not reasonable (though some may argue that the same hangover can easily be induced). Also, to keep the voice as repeatable as possible, it is wise to record at the same time of day; morning is a good idea. The points on speaker selection and recording in the previous section should also be borne in mind.

The recording environment should be reconstructable, so that the conditions can be set up again if needed. Everything should be as well-defined as possible: gain settings, microphone distances, and so on. Anechoic chambers are best, but general recording studios will do. We've even recorded in an open room; with care this works (make sure there's little background noise from computers, air conditioning, outside traffic, etc.). Of course open rooms aren't ideal, but they are better than open noisy rooms.

The distance between the speaker and the microphone is crucial. A head-mounted microphone helps keep this constant; the Shure SM-2 headset, for instance, works well with the mic positioned about 8mm from the lips. This can be checked with a ruler. Considering the cost and availability of head-mounted microphones and rulers, you should really consider using them. While even fixed microphones like the Shure SM-57 can be used well by professional voice talent, we strongly recommend a good headset mic.

Ultimately, you need to split the recordings into individual files, one for each prompt. Ideally this is done while recording, on a file-by-file basis, but that may not be practical, in which case some other technique can be used, such as recording onto DAT and transferring the data to disk (and downsampling) later. Each file might then contain 50-100 nonsense words. In this case we hand label the words, taking into account any duplicates caused by errors in the recording. The program ch_wave in the Edinburgh Speech Tools (EST) offers a function to split a large file into individual files based on a label file, and we can use this to get our individual files. You may also add an identifiable noise during recording and automatically detect that as a split point, as is often done at the Oregon Graduate Institute. They typically use two different noises that can easily be distinguished, one for "OK" and one for "BAD"; this can make splitting the files into the individual nonsense words easier. Note that you will also need to split the electroglottograph (EGG) signal in exactly the same way, if you are using one.
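As a rough illustration of label-based splitting, here is a minimal Python sketch using only the standard `wave` module. It is not ch_wave and does not read EST label files: the function name `split_wave` and the `(name, start_sec, end_sec)` segment tuples are assumptions made for this example, standing in for whatever your hand-labelled file provides.

```python
import os
import wave


def split_wave(src_path, segments, out_dir):
    """Write one WAV file per (name, start_sec, end_sec) segment.

    `segments` is a hypothetical in-memory form of a hand-made label
    file. Returns the output paths in the order they were written.
    """
    os.makedirs(out_dir, exist_ok=True)
    written = []
    with wave.open(src_path, "rb") as src:
        params = src.getparams()
        rate = src.getframerate()
        total = src.getnframes()
        for name, start_sec, end_sec in segments:
            # Convert times to frame indices, clamped to the file length.
            first = max(0, int(round(start_sec * rate)))
            last = min(total, int(round(end_sec * rate)))
            if last <= first:
                raise ValueError(f"empty or reversed segment: {name}")
            src.setpos(first)
            frames = src.readframes(last - first)
            out_path = os.path.join(out_dir, name + ".wav")
            with wave.open(out_path, "wb") as dst:
                # Same channels, sample width and rate as the source;
                # the frame count in the header is corrected on close.
                dst.setparams(params)
                dst.writeframes(frames)
            written.append(out_path)
    return written
```

If an EGG signal was recorded alongside the speech, calling the same function with the same segment list on the EGG file (and a different output directory) keeps the two sets of files cut at identical points.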

No matter how you split these, you should be aware that there will still often be mistakes, and checking by listening will help.
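Listening remains the real check, but a quick automatic pass can suggest which files to listen to first. The sketch below is only an assumption-laden example: the function names, the `.wav` naming scheme, and the duration thresholds are all hypothetical and should be adjusted to your own prompts.

```python
import os
import wave


def wav_duration(path):
    """Return the duration of a WAV file in seconds."""
    with wave.open(path, "rb") as w:
        return w.getnframes() / w.getframerate()


def suspicious_files(wav_dir, expected_names, min_sec=0.3, max_sec=3.0):
    """List (name, reason) pairs for prompts that look wrong.

    A prompt is flagged if its file is missing, or if its duration
    falls outside [min_sec, max_sec]. The thresholds are arbitrary
    illustrative defaults, not recommended values.
    """
    problems = []
    for name in expected_names:
        path = os.path.join(wav_dir, name + ".wav")
        if not os.path.exists(path):
            problems.append((name, "missing"))
            continue
        dur = wav_duration(path)
        if dur < min_sec:
            problems.append((name, "too short"))
        elif dur > max_sec:
            problems.append((name, "too long"))
    return problems
```

A file that passes these checks may still contain the wrong word or a bad pronunciation, so this only narrows down where to start listening.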

We now almost always record directly to disk on a computer using a sound card; see the Section called Recording under Unix in the Chapter called Basic Requirements for recording setup details. Recording quality can suffer from poor-quality audio hardware in computers (and often too much noise), though at least the hardware issue is becoming less of a problem these days. Recording directly to disk has many advantages, as the stage of digitising, transferring and splitting offline recordings is laborious and prone to error.
