

29 Examples

This chapter contains some simple walkthrough examples of using Festival in various ways, not just as a speech synthesizer.

29.1 POS Example

This example shows how we can use part of the standard synthesis process to tokenize and tag a file of text. This section does not cover training and setting up a part of speech tag set (see section 16 POS tagging), only how to use the standard POS tagger on text.

This example also shows how to use Festival as a simple scripting language, and how to modify various methods used during text to speech.

The file `examples/text2pos' contains an executable shell script which will read arbitrary ASCII text from standard input and produce words and their parts of speech (one word per line) on standard output.

A Festival script, like any other UNIX script, must start with the characters #! followed by the pathname of the `festival' executable. For scripts the option -script is also required. Thus our first line looks like

#!/usr/local/bin/festival -script

Note that the pathname may need to be different on your system.

Following this we have copious comments, to keep our lawyers happy, before we get into the real script.

The basic idea is that the tts process segments text into utterances, and each utterance is then passed to a list of functions defined by the Scheme variable tts_hooks. Normally this variable contains a list of two functions, utt.synth and utt.play, which synthesize the utterance and play the resulting waveform. In this case we instead wish to predict the part of speech of each word and then print it out.

The first function we define replaces the normal synthesis function utt.synth. It runs the standard Festival utterance modules used in the synthesis process, up to the point where POS is predicted. This function looks like

(define (find-pos utt)
"Main function for processing TTS utterances.  Runs the Token and
POS modules so that each word has a predicted POS"
  (Token utt)
  (POS utt)
)

The normal text-to-speech process first tokenizes the text, splitting it into "sentences". The utterance type of these is Token. We then call the Token utterance module, which converts the tokens to a stream of words, followed by the POS module, which predicts a part of speech tag for each word. Normally we would call further modules, ultimately generating a waveform, but in this case we need no further processing.

The second function we define prints out the words and their parts of speech

(define (output-pos utt)
"Output the word/pos for each word in utt"
 (mapcar
  (lambda (pair)
    (format t "%l/%l\n" (car pair) (car (cdr pair))))
  (utt.features utt 'Word '(name pos))))

This uses the utt.features function to extract features from the items in a named stream of an utterance. In this case we want the name and pos features for each item in the Word stream. Then for each pair we print out the word's name, a slash and its part of speech followed by a newline.
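To make the shape of that data concrete, here is a minimal sketch that applies the same printing loop to a hand-written list instead of a real utterance. The name sample_pairs, the two words, and the tags dt and nn are made up for illustration; the tags actually produced depend on the tag set in use.

```scheme
;; Hypothetical sample of the (name pos) lists that utt.features is
;; described as returning above: one two-element list per word.
(define sample_pairs '(("the" dt) ("cat" nn)))

;; The same loop as in output-pos, run over the sample list.
(mapcar
 (lambda (pair)
   ;; (car pair) is the word name, (car (cdr pair)) its POS tag
   (format t "%l/%l\n" (car pair) (car (cdr pair))))
 sample_pairs)
```

Typing this at the festival> prompt should print one word/tag line per entry, which is a quick way to check the output format without running the tagger.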

Our next job is to redefine the functions to be called during text to speech. The variable tts_hooks is defined in `lib/tts.scm'. Here we set it to our two newly-defined functions

(set! tts_hooks (list find-pos output-pos))

So that garbage collection messages do not appear on the screen, we suppress them with the following command

(gc-status nil)

The final stage is to start the tts process running on standard input. Because we have redefined what functions are to be run on the utterances, it will no longer generate speech but just predict part of speech and print it to standard output.

(tts_file "-")
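If the same tagging is wanted from an interactive session rather than a script, one possible arrangement is to wrap these steps in a function that puts the previous hooks back afterwards. This is only a sketch: pos_tag_file is a hypothetical helper, not part of Festival, and it assumes find-pos and output-pos have already been defined as above.

```scheme
(define (pos_tag_file filename)
"Run the POS-only hooks over FILENAME, then restore the previous hooks.
Hypothetical helper for illustration; not part of Festival."
  (let ((saved_hooks tts_hooks))        ; remember the current hook list
    (set! tts_hooks (list find-pos output-pos))
    (tts_file filename)                 ; prints word/pos, no speech
    (set! tts_hooks saved_hooks)))      ; normal synthesis works again
```

Note that if an error occurs while the file is being processed, the final set! is never reached and tts_hooks has to be reset by hand.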

29.2 Singing Synthesis

As an interesting example, a `singing-mode' is included. This offers an XML based mode for specifying songs, both notes and durations. This work was done as a student project by Dominic Mazzoni. A number of examples are provided in `examples/songs'. These may be run as

festival> (tts "doremi.xml" 'singing)

Each sung word can be given a pitch (the NOTE attribute) and a length in beats (the BEATS attribute)

<?xml version="1.0"?>
<!DOCTYPE SINGING PUBLIC "-//SINGING//DTD SINGING mark up//EN" 
      "Singing.v0_1.dtd"
[]>
<SINGING BPM="30">
<PITCH NOTE="G3"><DURATION BEATS="0.3">doe</DURATION></PITCH>
<PITCH NOTE="A3"><DURATION BEATS="0.3">ray</DURATION></PITCH>
<PITCH NOTE="B3"><DURATION BEATS="0.3">me</DURATION></PITCH>
<PITCH NOTE="C4"><DURATION BEATS="0.3">fah</DURATION></PITCH>
<PITCH NOTE="D4"><DURATION BEATS="0.3">sew</DURATION></PITCH>
<PITCH NOTE="E4"><DURATION BEATS="0.3">lah</DURATION></PITCH>
<PITCH NOTE="F#4"><DURATION BEATS="0.3">tee</DURATION></PITCH>
<PITCH NOTE="G4"><DURATION BEATS="0.3">doe</DURATION></PITCH>
</SINGING>
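As a rough guide to the timing in this example, the sketch below converts a BEATS value to seconds. It assumes the usual musical convention that one beat lasts 60/BPM seconds; whether the singing mode applies exactly this scaling has not been checked here, and beats_to_seconds is a hypothetical name used only for this illustration.

```scheme
;; Assumption: one beat lasts 60/BPM seconds (standard musical usage).
;; beats_to_seconds is a hypothetical helper, not part of Festival.
(define (beats_to_seconds beats bpm)
  (/ (* beats 60.0) bpm))

;; The values used for every note in the example above.
(beats_to_seconds 0.3 30)
```

Under that assumption each note above lasts 0.3 * 60 / 30 = 0.6 seconds.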

You can construct multi-part songs by synthesizing each part to a waveform and then combining the waveforms. For example

text2wave -mode singing america1.xml -o america1.wav
text2wave -mode singing america2.xml -o america2.wav
text2wave -mode singing america3.xml -o america3.wav
text2wave -mode singing america4.xml -o america4.wav
ch_wave -o america.wav -pc longest america?.wav

The voice used to sing is the current voice. Note that the number of syllables assumed for each word in the song must match the number the voice produces at run time, which means this doesn't always work across dialects (UK voices sometimes won't work without tweaking).

This technique is simple but effective. For a more serious singing synthesizer, however, we recommend you look at Flinger (http://cslu.cse.ogi.edu/tts/flinger/), which addresses the issues of synthesizing the human singing voice in more detail.

