This chapter contains some simple walkthrough examples of using Festival in various ways, not just as a speech synthesizer.
This example shows how we can use part of the standard synthesis process to tokenize and tag a file of text. This section does not cover training and setting up a part of speech tag set (see section 16 POS tagging), only how to use the standard POS tagger on text.
This example also shows how to use Festival as a simple scripting language, and how to modify various methods used during text to speech.
The file `examples/text2pos' contains an executable shell script which reads arbitrary ASCII text from standard input and writes words and their part of speech (one per line) to standard output.
A Festival script, like any other UNIX script, must start with the characters #! followed by the name of the `festival' executable. For scripts the option -script is also required. Thus our first line looks like
#!/usr/local/bin/festival -script
Note that the pathname may need to be different on your system.
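If you are unsure where Festival is installed, the shell can tell you. The following is a minimal sketch, assuming a POSIX shell; festival_shebang is a hypothetical helper name, not something supplied with Festival. It prints a first line that uses whichever festival executable is found first on your PATH.

```shell
# festival_shebang (hypothetical helper): print a suitable first line
# for a Festival script, based on the festival executable found on PATH.
# Returns non-zero, with a message on standard error, if none is found.
festival_shebang() {
    festival_path=$(command -v festival) || {
        echo "festival not found on PATH" >&2
        return 1
    }
    printf '#!%s -script\n' "$festival_path"
}
```

Calling festival_shebang with no arguments prints the line to paste at the top of your script.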
Following this we have copious comments, to keep our lawyers happy, before we get into the real script.
The basic idea we use is that the tts process segments text into utterances, and those utterances are then passed to a list of functions, as defined by the Scheme variable tts_hooks. Normally this variable contains a list of two functions, utt.synth and utt.play, which will synthesize and play the resulting waveform. In this case, instead, we wish to predict the part of speech value, and then print it out.
The first function we define basically replaces the normal synthesis function utt.synth. It runs the standard Festival utterance modules used in the synthesis process, up to the point where POS is predicted. This function looks like
(define (find-pos utt)
  "Main function for processing TTS utterances. Predicts POS and prints words with their POS"
  (Token utt)
  (POS utt)
)
The normal text-to-speech process first tokenizes the text, splitting it into "sentences". The utterance type of these is Token. Then we call the Token utterance module, which converts the tokens to a stream of words. Then we call the POS module to predict part of speech tags for each word. Normally we would call other modules, ultimately generating a waveform, but in this case we need no further processing.
The second function we define is one that will print out the words and their parts of speech
(define (output-pos utt)
  "Output the word/pos for each word in utt"
  (mapcar
   (lambda (pair)
     (format t "%l/%l\n" (car pair) (car (cdr pair))))
   (utt.features utt 'Word '(name pos))))
This uses the utt.features function to extract features from the items in a named stream of an utterance. In this case we want the name and pos features for each item in the Word stream. Then for each pair we print out the word's name, a slash and its part of speech followed by a newline.
Our next job is to redefine the functions to be called during text to speech. The variable tts_hooks is defined in `lib/tts.scm'. Here we set it to our two newly-defined functions
(set! tts_hooks (list find-pos output-pos))
To stop garbage collection messages from appearing on the screen, we turn them off with the following command
(gc-status nil)
The final stage is to start the tts process running on standard input. Because we have redefined what functions are to be run on the utterances, it will no longer generate speech but just predict part of speech and print it to standard output.
(tts_file "-")
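Because the script writes one word/pos pair per line, its output is easy to post-process with standard UNIX tools. As a minimal sketch, the shell function below tallies how often each tag occurs. The name count_tags is hypothetical and not part of Festival, and the function assumes that the tag is whatever follows the last slash on a line.

```shell
# count_tags (hypothetical helper): read "word/pos" lines on standard
# input and print each distinct tag preceded by its count, most
# frequent first. Lines that contain no slash are ignored.
count_tags() {
    # Keep only the text after the last slash, then count duplicates.
    sed -n 's|^.*/\([^/]*\)$|\1|p' | sort | uniq -c | sort -rn
}
```

Assuming the script is run from the top of the Festival distribution and `myfile.txt' stands for a text file of your own, this could be used as: examples/text2pos <myfile.txt | count_tags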
As an interesting example, a `singing-mode' is included. This offers an XML-based mode for specifying songs, both notes and durations. This work was done as a student project by Dominic Mazzoni. A number of examples are provided in `examples/songs'. This may be run as
festival> (tts "doremi.xml" 'singing)
Each sung word or syllable is given a note (pitch) and a beat value
<?xml version="1.0"?>
<!DOCTYPE SINGING PUBLIC "-//SINGING//DTD SINGING mark up//EN"
      "Singing.v0_1.dtd"
[]>
<SINGING BPM="30">
<PITCH NOTE="G3"><DURATION BEATS="0.3">doe</DURATION></PITCH>
<PITCH NOTE="A3"><DURATION BEATS="0.3">ray</DURATION></PITCH>
<PITCH NOTE="B3"><DURATION BEATS="0.3">me</DURATION></PITCH>
<PITCH NOTE="C4"><DURATION BEATS="0.3">fah</DURATION></PITCH>
<PITCH NOTE="D4"><DURATION BEATS="0.3">sew</DURATION></PITCH>
<PITCH NOTE="E4"><DURATION BEATS="0.3">lah</DURATION></PITCH>
<PITCH NOTE="F#4"><DURATION BEATS="0.3">tee</DURATION></PITCH>
<PITCH NOTE="G4"><DURATION BEATS="0.3">doe</DURATION></PITCH>
</SINGING>
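Song files with this shape are regular enough to generate from a script. The following is a sketch only: make_song is a hypothetical helper, not part of Festival, and its NOTE:BEATS:TEXT argument format is an assumption made for illustration. The element and attribute names are copied from the example above.

```shell
# make_song BPM NOTE:BEATS:TEXT...   (hypothetical helper)
# Write a SINGING document to standard output, with one
# PITCH/DURATION pair per NOTE:BEATS:TEXT argument.
make_song() {
    bpm=$1
    shift
    printf '%s\n' '<?xml version="1.0"?>'
    printf '%s\n' '<!DOCTYPE SINGING PUBLIC "-//SINGING//DTD SINGING mark up//EN" "Singing.v0_1.dtd" []>'
    printf '<SINGING BPM="%s">\n' "$bpm"
    for spec in "$@"; do
        note=${spec%%:*}     # text before the first colon
        rest=${spec#*:}      # text after the first colon
        beats=${rest%%:*}    # text before the second colon
        text=${rest#*:}      # everything after the second colon
        printf '<PITCH NOTE="%s"><DURATION BEATS="%s">%s</DURATION></PITCH>\n' \
            "$note" "$beats" "$text"
    done
    printf '%s\n' '</SINGING>'
}
```

For instance, make_song 30 G3:0.3:doe A3:0.3:ray B3:0.3:me >scale.xml would write the first three notes of the scale to `scale.xml' (a file name chosen here for illustration).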
You can construct multi-part songs by synthesizing each part and generating waveforms, then combining them. For example
text2wave -mode singing america1.xml -o america1.wav
text2wave -mode singing america2.xml -o america2.wav
text2wave -mode singing america3.xml -o america3.wav
text2wave -mode singing america4.xml -o america4.wav
ch_wave -o america.wav -pc longest america?.wav
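When a song has many parts, the same commands can be wrapped in a loop. The function below is a sketch under stated assumptions: sing_parts is a hypothetical name, each part is assumed to be a file ending in `.xml', and only the text2wave and ch_wave options shown above are used.

```shell
# sing_parts OUTPUT PART.xml...   (hypothetical helper)
# Synthesize each part to a .wav file beside its .xml file, then
# combine the resulting waveforms into OUTPUT.
sing_parts() {
    output=$1
    shift
    # The loop's word list is fixed when the loop starts, so it is safe
    # to swap each .xml argument for its .wav name as we go.
    for xml in "$@"; do
        wav=${xml%.xml}.wav
        text2wave -mode singing "$xml" -o "$wav" || return 1
        set -- "$@" "$wav"
        shift
    done
    ch_wave -o "$output" -pc longest "$@"
}
```

With this defined, sing_parts america.wav america1.xml america2.xml america3.xml america4.xml runs the same five commands as the example above.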
The voice used to sing is the current voice. Note that the number of syllables assumed for the words in the song must match the number the voice produces at run time, which means this doesn't always work across dialects (UK voices sometimes won't work without tweaking).
This technique is basically simple, though definitely effective. However, for a more serious singing synthesizer we recommend you look at Flinger (http://cslu.cse.ogi.edu/tts/flinger/), which addresses the issues of synthesizing the human singing voice in more detail.