Building Synthetic Voices
In this chapter we work through a full example of creating a voice, given that most of the basic construction work (model building) has been done. In particular, it discusses the scheme files, the conventions for keeping a voice together, and how you can go about packaging it for general use.
Ultimately a voice in Festival consists of a diphone database, a lexicon (and LTS rules) and a number of scheme files that together offer the complete voice. When people other than the developer wish to use a newly developed voice, only that small set of files is required and needs to be distributed (freely or otherwise). By convention we have distributed diphone group files, each a single file holding the index and the diphone data itself, and a set of scheme files that describe the voice (and its necessary models).
Basic skeleton files are included in the festvox distribution. If you are unsure how to go about building the basic files it is recommended you follow this schema and modify these to your particular needs.
By convention a voice name consists of three parts. First is an institution name (like cmu, cstr, etc.); if you don't have an institution just use net. Second you need to identify the language. There is an ISO two-letter standard for this, but it fails to distinguish dialects (such as US and UK English), so it need not be strictly followed. However, a short identifier for the language is probably preferred. Third you identify the speaker; we have typically used the speaker's three-letter initials, but any name is reasonable. If you are going to build a US or UK English voice you should look at the chapter called US/UK English Diphone Synthesizer.
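The naming convention can be written down as a small helper. This is a minimal Python sketch and not part of festvox; the function name voice_name and the default "diphone" suffix are assumptions, taken from the directory name cmu_ja_awb_diphone used later in this chapter.

```python
def voice_name(institution, language, speaker, kind="diphone"):
    """Join the naming parts with underscores, e.g. cmu_ja_awb_diphone.

    An empty institution falls back to "net", as the convention suggests.
    """
    parts = [institution or "net", language, speaker, kind]
    return "_".join(parts)


print(voice_name("cmu", "ja", "awb"))   # cmu_ja_awb_diphone
print(voice_name("", "ja", "awb"))      # net_ja_awb_diphone
```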
The basic processes you will need to address
construct basic template files
generate phoneset definition
generate diphone schema file
generate prompts
record speaker
label nonsense words
extract pitchmarks and LPC coefficients
test phone synthesis
add lexicon/LTS support
add tokenization
add prosody (phrasing, durations and intonation)
test and evaluate voice
package for distribution
As with all parts of festvox, you must set the following environment variables to where you have installed the Edinburgh Speech Tools and the festvox distribution
export ESTDIR=/home/awb/projects/speech_tools
export FESTVOXDIR=/home/awb/projects/festvox
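Since every later script relies on these two variables, it can save time to check them before starting. The following is a small Python sketch of such a check, not something shipped with festvox; the function name missing_festvox_env is hypothetical.

```python
import os


def missing_festvox_env(env=None):
    """Return the required variables that are unset or do not name a directory."""
    env = os.environ if env is None else env
    required = ("ESTDIR", "FESTVOXDIR")
    return [name for name in required
            if not env.get(name) or not os.path.isdir(env[name])]


if __name__ == "__main__":
    problems = missing_festvox_env()
    if problems:
        print("Set these before continuing:", ", ".join(problems))
    else:
        print("ESTDIR and FESTVOXDIR look fine")
```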
In this example we will build a Japanese voice based on awb (a gaijin). First create a directory to hold the voice.
mkdir ~/data/cmu_ja_awb_diphone
cd ~/data/cmu_ja_awb_diphone
You will need in the region of 500M of space to build a voice. Actually for Japanese it's probably considerably less, but you must be aware that voice building does require disk space.
Construct the basic directory structure and skeleton files with the command
$FESTVOXDIR/src/diphones/setup_diphone cmu ja awb
The three arguments are institution, language and speaker name.
The next stage is to define the phoneset in festvox/cmu_ja_phones.scm. In many cases the phoneset for a language has already been defined, and it is wise to follow convention when it exists. Note that the default phonetic features in the skeleton file may need to be modified for other languages. For Japanese there are standards, and here we use a set similar to the ATR phoneset used by many in Japan for speech processing. (This file is included, but not automatically installed, in $FESTVOXDIR/src/vox_diphone/japanese.)
Now you must write the code that generates the diphone schema file. You can look at the examples in festvox/src/diphones/*_schema.scm. This stage is actually the first difficult part, and getting this right can be tricky. Finding all possible phone-phone pairs in a language isn't as easy as it seems (especially as many possible ones don't actually exist). The file festvox/ja_schema.scm is created, providing the function diphone-gen-list, which returns a list of nonsense words, each consisting of a list of diphones and a list of the phones in the nonsense word. For example
festival> (diphone-gen-list)
((("k-a" "a-k") (pau t a k a k a pau))
(("g-a" "a-g") (pau t a g a g a pau))
(("h-a" "a-h") (pau t a h a h a pau))
(("p-a" "a-p") (pau t a p a p a pau))
(("b-a" "a-b") (pau t a b a b a pau))
(("m-a" "a-m") (pau t a m a m a pau))
(("n-a" "a-n") (pau t a n a n a pau))
...)
In addition to generating the diphone schema, ja_schema.scm should also provide the functions Diphone_Prompt_Setup, which is called before generating the prompts, and Diphone_Prompt_Word, which is called before waveform synthesis of each nonsense word.
Diphone_Prompt_Setup should be used to select a speaker to generate the prompts. Note that even though you may not use the prompts when recording, they are necessary for labeling the spoken speech, so you still need to generate them. If you already have a synthesizer in the language, use it to generate the prompts (assuming you can get it to generate from phone lists and also generate label files). Often the MBROLA project already has a waveform synthesizer for the language, so you can use that. In this case we are going to use a US English voice (kal_diphone) to generate the prompts. For Japanese that's probably OK as the Japanese phoneset is (mostly) a subset of the English phoneset, though using the generated prompts to prompt the speaker is probably not a good idea.
The second function, Diphone_Prompt_Word, is used to map the Japanese phone set to the US English phone set so that waveform synthesis will work. In this case a simple map from each Japanese phone to one or more English phones is given, and the code simply changes the phone name in the segment relation (and adds a new segment in the multi-phone case).
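The mapping step can be pictured with a short Python sketch. It only illustrates the idea of a one-to-many phone map and is not the Scheme code in ja_schema.scm; the table entries below are illustrative assumptions rather than the map actually used for this voice.

```python
# Hypothetical Japanese-to-English phone map; the real one lives in ja_schema.scm.
JA_TO_EN = {
    "a": ["aa"],
    "i": ["iy"],
    "N": ["n"],
    "ts": ["t", "s"],   # one Japanese phone realized as two English phones
}


def map_phones(ja_phones, table=JA_TO_EN):
    """Replace each Japanese phone by its English phone(s).

    Phones missing from the table pass through unchanged.
    """
    en_phones = []
    for phone in ja_phones:
        en_phones.extend(table.get(phone, [phone]))
    return en_phones


print(map_phones(["pau", "t", "a", "ts", "a", "N", "pau"]))
```

A phone that maps to two English phones lengthens the output list, which corresponds to adding a new segment in the multi-phone case.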
Now we can generate the diphone schema list.
festival -b festvox/diphlist.scm festvox/ja_schema.scm \
'(diphone-gen-schema "ja" "etc/jadiph.list")'
It is worth checking etc/jadiph.list by hand to be sure it contains all the diphones you wish to use.
The diphone schema file, in this case etc/jadiph.list, is a key file for almost all the following scripts. Even if you generate the diphone list by some method other than the one described above, you should generate a schema list in exactly this format so that everything else will work. Modifying the other scripts for some other format is almost certainly a waste of your time.
The schema file has the following format
( ja_0001 "pau t a k a k a pau" ("k-a" "a-k"))
( ja_0002 "pau t a g a g a pau" ("g-a" "a-g") )
( ja_0003 "pau t a h a h a pau" ("h-a" "a-h") )
( ja_0004 "pau t a p a p a pau" ("p-a" "a-p") )
( ja_0005 "pau t a b a b a pau" ("b-a" "a-b") )
( ja_0006 "pau t a m a m a pau" ("m-a" "a-m") )
( ja_0007 "pau t a n a n a pau" ("n-a" "a-n") )
( ja_0008 "pau t a r a r a pau" ("r-a" "a-r") )
( ja_0009 "pau t a t a t a pau" ("t-a" "a-t") )
...
In this case it has 297 nonsense words.
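Checking the list by hand can be supplemented with a mechanical check. The Python sketch below parses entries in the schema format and reports which diphones from a wanted list are not covered by any nonsense word. It is an illustration only; the function names and the regular expression are assumptions based on the sample entries in this chapter, not a festvox tool.

```python
import re

# One entry looks like: ( ja_0001 "pau t a k a k a pau" ("k-a" "a-k"))
SCHEMA_LINE = re.compile(r'\(\s*(\S+)\s+"([^"]*)"\s+\(([^)]*)\)\s*\)')


def parse_schema(text):
    """Yield (file_id, phones, diphones) for each entry in a schema list."""
    for match in SCHEMA_LINE.finditer(text):
        file_id, phones, diphones = match.groups()
        yield file_id, phones.split(), re.findall(r'"([^"]+)"', diphones)


def missing_diphones(text, wanted):
    """Return the wanted diphones that no nonsense word covers."""
    covered = {d for _, _, found in parse_schema(text) for d in found}
    return sorted(set(wanted) - covered)


sample = '''( ja_0001 "pau t a k a k a pau" ("k-a" "a-k"))
( ja_0002 "pau t a g a g a pau" ("g-a" "a-g") )'''
print(missing_diphones(sample, ["k-a", "a-g", "r-a"]))   # ['r-a']
```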
Next we can synthesize the prompts and generate their label files with the following command
festival -b festvox/diphlist.scm festvox/ja_schema.scm \
'(diphone-gen-waves "prompt-wav" "prompt-lab" "etc/jadiph.list")'
Occasionally when you are building the prompts, some diphones requested in the prompt voice don't actually exist (especially when you are doing cross-language prompting). The generated prompt then has some default diphone (typically silence-silence) added. This is mostly OK, as long as it is not happening multiple times in the same nonsense word. The speaker should just be aware that some prompts aren't actually correct (which of course is going to be true for all prompts in the cross-language prompting case).
The next stage is to record the prompts. See the Section called Recording under Unix in the Chapter called Basic Requirements for details on how to do this under Unix (and in fact other techniques too). This can be done with the command
bin/prompt_them etc/jadiph.list
Depending on whether you want the prompts actually to be played or not, you can edit bin/prompt_them to comment out the playing of the prompts.
Note that an additional argument can be given to state which nonsense word to begin prompting from. Thus if you have already recorded the first 100 you can continue with
bin/prompt_them etc/jadiph.list 101
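If you lose track of where recording stopped, the restart number can be found from the files on disk. This Python sketch assumes that each recording is stored as wav/<file id>.wav, with file ids such as ja_0001 taken from the schema list; that layout is an assumption made for illustration, and first_unrecorded is a hypothetical helper.

```python
import os


def first_unrecorded(file_ids, wav_dir="wav", exists=os.path.exists):
    """Return the 1-based position of the first id with no wav file.

    Returns None when every nonsense word already has a recording.
    """
    for position, file_id in enumerate(file_ids, start=1):
        if not exists(os.path.join(wav_dir, file_id + ".wav")):
            return position
    return None


if __name__ == "__main__":
    ids = ["ja_%04d" % n for n in range(1, 298)]
    print(first_unrecorded(ids))
```

The number it reports is the value you would pass as the extra argument to bin/prompt_them.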
The recorded prompts can then be labeled by
bin/make_labs prompt-wav/*.wav
And the diphone index may be built by
bin/make_diph_index etc/jadiph.list dic/awbdiph.est
If no EGG signal has been collected you can extract the pitchmarks by
bin/make_pm_wave wav/*.wav
If you do have an EGG signal then use the following instead
bin/make_pm lar/*.lar
A program to move the predicted pitchmarks to the nearest peak in the waveform is also provided. This is almost always a good idea, even for EGG extracted pitchmarks
bin/make_pm_fix pm/*.pm
Getting good pitchmarks is important to the quality of the synthesis, see the Section called Extracting pitchmarks from waveforms in the Chapter called Basic Requirements for more discussion.
Because there is often a power mismatch across a set of diphones, we provide a simple method for finding what general power differences exist between files. This finds the mean power for each vowel in each file and calculates a factor with respect to the overall mean vowel power. A table of power modifiers for each file can be calculated by
bin/find_powerfactors lab/*.lab
The factors calculated by this are saved in etc/powfacts.
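The arithmetic behind such a factor can be sketched in a few lines of Python. This is not the code in bin/find_powerfactors. Two details are assumptions made here: the overall mean is taken as the mean of the per-file means, and the factor is the overall mean divided by the file mean, so a quiet file gets a factor above 1.

```python
def power_factors(vowel_powers):
    """Map each file id to overall_mean / file_mean of its vowel powers.

    vowel_powers: dict of file id -> list of positive per-vowel power values.
    Files with no vowels are skipped.
    """
    file_means = {fid: sum(p) / len(p) for fid, p in vowel_powers.items() if p}
    overall = sum(file_means.values()) / len(file_means)
    return {fid: overall / mean for fid, mean in file_means.items()}


factors = power_factors({"ja_0001": [4.0, 4.0], "ja_0002": [1.0, 1.0]})
print(factors)   # {'ja_0001': 0.625, 'ja_0002': 2.5}
```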
Then build the pitch-synchronous LPC coefficients, which use the power factors if they've been calculated.
bin/make_lpc wav/*.wav
This should get you to the stage where you can test the basic waveform synthesizer. There is still much to do, but initial tests (and correction of labeling errors etc.) can start now. Start festival as
festival festvox/cmu_ja_awb_diphone.scm "(voice_cmu_ja_awb_diphone)"
and then enter a string of phones
festival> (SayPhones '(pau k o N n i ch i w a pau))
In addition to the waveform generation part you must also provide text analysis for your language. Here, for the sake of simplicity, we assume that the Japanese is provided in romanized form with spaces between each word. This is of course not the case for normal Japanese (and we are working on a proper Japanese front end), but at present it shows the general idea. Thus we edit festvox/cmu_ja_token.scm and add (simple) support for numbers.
As the relationship between romaji (romanized Japanese) and phones is almost trivial, we write a set of letter-to-sound rules by hand that expand words into their phones. This is added to festvox/cmu_ja_lex.scm.
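To give a feel for how simple such rules can be, here is a Python sketch of a left-to-right romaji expander. It is not the rule set in festvox/cmu_ja_lex.scm. The digraph list and the rule that n becomes N unless a vowel or y follows are assumptions, inferred from the phone string used in the SayPhones example above.

```python
DIGRAPHS = ("ch", "sh", "ts")   # letter pairs treated as one phone (assumed)
VOWELS = set("aiueo")


def romaji_to_phones(word):
    """Expand one romanized word into a list of phones, scanning left to right."""
    phones = []
    i = 0
    while i < len(word):
        if word[i:i + 2] in DIGRAPHS:
            phones.append(word[i:i + 2])
            i += 2
            continue
        at_end = i + 1 == len(word)
        if word[i] == "n" and (at_end or word[i + 1] not in VOWELS | {"y"}):
            phones.append("N")   # moraic n: word final or before a consonant
        else:
            phones.append(word[i])
        i += 1
    return phones


print(romaji_to_phones("konnichiwa"))
```

With these assumed rules, "konnichiwa" expands to k o N n i ch i w a, matching the phone string entered by hand earlier.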
For the time being we just use the default intonation model, though simple rule-driven improvements are possible. See festvox/cmu_ja_awb_int.scm. For duration, we add a mean value for each phone in the phoneset to festvox/cmu_ja_awb_dur.scm.
These three Japanese-specific files are included in the distribution in festvox/src/vox_diphone/japanese/.
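The per-phone mean durations mentioned above could be estimated from labeled speech. The Python sketch below shows only the averaging step and assumes the segments have already been read into (phone, start, end) triples in seconds; how they are read from label files is left out, and mean_durations is a hypothetical helper rather than a festvox script.

```python
from collections import defaultdict


def mean_durations(segments):
    """Return phone -> mean duration in seconds from (phone, start, end) triples."""
    totals = defaultdict(lambda: [0.0, 0])   # phone -> [summed duration, count]
    for phone, start, end in segments:
        totals[phone][0] += end - start
        totals[phone][1] += 1
    return {phone: total / count for phone, (total, count) in totals.items()}


segments = [("a", 0.0, 0.5), ("k", 0.5, 0.75), ("a", 0.75, 1.0)]
print(mean_durations(segments))   # {'a': 0.375, 'k': 0.25}
```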
We now have a basic synthesizer and, although there is much still to do, we can type (romanized) text to it.
festival festvox/cmu_ja_awb_diphone.scm "(voice_cmu_ja_awb_diphone)"
...
festival> (SayText "boku wa gaijin da yo.")
The next part is to test and improve these various initial subsystems (lexicons, text analysis, prosody) and to correct waveform synthesis problems. This is an endless task, but you should spend significantly more time on it than we have done for this example.
Once you are happy with the completed voice you can package it for distribution. The first stage is to generate a group file for the diphone database. This extracts the subparts of the nonsense words and puts them into a single file offering something smaller and quicker to access. The groupfile can be built as follows.
festival festvox/cmu_ja_awb_diphone.scm "(voice_cmu_ja_awb_diphone)"
...
festival> (us_make_group_file "group/awblpc.group" nil)
...
The us_ in the function names stands for UniSyn (the unit concatenation subsystem in Festival) and has nothing to do with US English. To test this, edit festvox/cmu_ja_awb_diphone.scm and change the choice of databases used from separate to grouped. This is done by commenting out the line (around line 81)
(set! cmu_ja_awb_db_name (us_diphone_init cmu_ja_awb_lpc_sep))
and uncommenting the line (around line 84)
(set! cmu_ja_awb_db_name (us_diphone_init cmu_ja_awb_lpc_group))
The next stage is to integrate this new voice so that festival may find it automatically. To do this you should add a symbolic link from Festival's voice directory for the language (here japanese) to the directory containing the new voice. First cd to festival's voice directory (this will vary depending on where your version of festival is installed)
cd /home/awb/projects/festival/lib/voices/japanese/
creating the language directory if it does not already exist. Add a symbolic link back to where your voice was built
ln -s /home/awb/data/cmu_ja_awb_diphone
Now this new voice will be available for anyone running that version of festival, started from any directory, without the need for any explicit arguments
festival
...
festival> (voice_cmu_ja_awb_diphone)
...
festival> (SayText "ohayo gozaimasu.")
...
The final stage is to generate a distribution file so the voice may be installed on others' festival installations. Before you do this you must add a file COPYING to the directory you built the diphone database in. This should state the terms and conditions under which people may use, distribute and modify the voice.
Generate the distribution tarfile in the directory above the festival installation (the one containing the festival/ and speech_tools/ directories).
cd /home/awb/projects/
tar zcvf festvox_cmu_ja_awb_lpc.tar.gz \
festival/lib/voices/japanese/cmu_ja_awb_diphone/festvox/*.scm \
festival/lib/voices/japanese/cmu_ja_awb_diphone/COPYING \
festival/lib/voices/japanese/cmu_ja_awb_diphone/group/awblpc.group
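Before running tar it may be worth confirming that the COPYING file and the group file are actually in place, since tar will otherwise produce an incomplete archive. This is a small Python sketch of such a check, not part of festvox; missing_distribution_files is a hypothetical helper and the file names are the ones used in the tar command above.

```python
import os

# Files the tar command above expects besides festvox/*.scm
REQUIRED = ("COPYING", os.path.join("group", "awblpc.group"))


def missing_distribution_files(voice_dir, required=REQUIRED):
    """Return the required paths under voice_dir that do not exist as files."""
    paths = [os.path.join(voice_dir, name) for name in required]
    return [path for path in paths if not os.path.isfile(path)]


if __name__ == "__main__":
    voice_dir = "festival/lib/voices/japanese/cmu_ja_awb_diphone"
    for path in missing_distribution_files(voice_dir):
        print("missing:", path)
```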
The completed files from building this crude Japanese example are available at http://festvox.org/examples/cmu_ja_awb_diphone/.