NOTE: use of this diphone synthesizer is deprecated and it will probably be
removed from future versions; all of its functionality has been replaced by
the UniSyn synthesizer. It is not compiled by default; if required, add
ALSO_INCLUDE += diphone
to your `festival/config/config' file.
A basic diphone synthesizer offers a method for making speech from segments, durations and intonation targets. This module was mostly written by Alistair Conkie but the base diphone format is compatible with previous CSTR diphone synthesizers.
The synthesizer offers residual excited LPC based synthesis (hunt89) and PSOLA (TM) (moulines90) (PSOLA is not available for distribution).
A diphone database consists of a dictionary file, a set of waveform files, and a set of pitch mark files. These files are the same format as the previous CSTR (Osprey) synthesizer.
The dictionary file consists of one entry per line. Each entry consists of five fields: a diphone name of the form P1-P2, a filename (without extension), a floating point start position in the file in milliseconds, a mid position in milliseconds (the change in phone), and an end position in milliseconds. Lines starting with a semi-colon and blank lines are ignored. The list may be in any order.
For example, a partial list of entries might look like this:
ch-l r021 412.035 463.009 518.23
jh-l d747 305.841 382.301 446.018
h-l  d748 356.814 403.54  437.522
#-@  d404 233.628 297.345 331.327
@-#  d001 836.814 938.761 1002.48
Waveform files may be in any form, as long as every file is of the same type, headered or unheadered, and the format is supported by the speech tools wave reading functions. These may be standard linear PCM waveform files in the case of PSOLA, or LPC coefficients and residual when using the residual LPC synthesizer (see section 21.2 LPC databases).
Pitch mark files consist of a simple list of positions in milliseconds (plus places after the point), in order, one per line, giving each pitch mark in the file. For high quality diphone synthesis these should be derived from laryngograph data. During unvoiced sections pitch marks should be artificially created at reasonable intervals (e.g. 10 ms). In the current format there is no way to distinguish the "real" pitch marks from the "unvoiced" pitch marks.
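For example, the first few lines of a pitch mark file might look like the following (the values here are purely illustrative):
6.893
14.242
21.785
29.331
...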
It is normal to hold a diphone database in a directory with a number of sub-directories, namely `dic/' containing the dictionary file, `wave/' for the waveform files, typically of whole nonsense words (sometimes this directory is called `vox/' for historical reasons), and `pm/' for the pitch mark files. The filename in the dictionary entry should be the same for the waveform file and the pitch mark file (with different extensions).
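For example, such a database might be laid out as follows (the database name, filenames and extensions here are illustrative only):
rab_diphone/dic/rabdiph.dic
rab_diphone/wave/d001.wav
rab_diphone/pm/d001.pm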
The standard method for diphone resynthesis in the released system is residual excited LPC (hunt89). The actual method of resynthesis isn't important to the database format, but if residual LPC synthesis is to be used then it is necessary to make the LPC coefficient files and their corresponding residuals.
Previous versions of the system used a "host of hacky little scripts" to do this, but now that the Edinburgh Speech Tools support LPC analysis we can provide a walk-through for generating these files.
We assume that the waveform files of nonsense words are in a directory called `wave/'. The LPC coefficients and residuals will be, in this example, stored in `lpc16k/' with extensions `.lpc' and `.res' respectively.
Before starting it is worth considering power normalization. We have found this important for all of the databases we have collected so far. The ch_wave program, part of the speech tools, with the option -scaleN 0.4, may be used if a more complex method is not available.
The following shell commands generate the files:
for i in wave/*.wav
do
   fname=`basename $i .wav`
   echo $i
   lpc_analysis -reflection -shift 0.01 -order 18 -o lpc16k/$fname.lpc \
       -r lpc16k/$fname.res -otype htk -rtype nist $i
done
A rule of thumb is that the LPC order should be the sample rate divided by one thousand, plus 2 (e.g. 16 + 2 = 18 for 16kHz data, as in the command above). This may or may not be appropriate, and if you are particularly worried about database size it is worth experimenting.
The program `lpc_analysis', found in `speech_tools/bin', can be used to generate the LPC coefficients and residual. Note these should be reflection coefficients so that they may be quantised (as they are in group files).
The coefficients and residual files produced by different LPC analysis programs may start at different offsets. For example, Entropic's ESPS functions generate LPC coefficients that are offset by one frame shift (e.g. 0.01 seconds). Our own `lpc_analysis' routine has no offset.
The Diphone_Init parameter list allows these offsets to be specified. When using the above program to generate the LPC files, the description parameters should include
(lpc_frame_offset 0) (lpc_res_offset 0.0)
while when generating with the ESPS routines the description should be
(lpc_frame_offset 1) (lpc_res_offset 0.01)
If they are not explicitly mentioned, the defaults actually follow the ESPS form, that is, lpc_frame_offset is 1 and lpc_res_offset is equal to the frame shift.
Note that the biggest problem we had in implementing the residual excited LPC resynthesizer was getting the right part of the residual to line up with the right LPC coefficients describing the pitch mark. Making errors here degrades the synthesized waveform noticeably, though not severely, making it difficult to tell whether the cause is an offset problem or some other bug.
Although we have started investigating if extracting pitch synchronous LPC parameters rather than fixed shift parameters gives better performance, we haven't finished this work. `lpc_analysis' supports pitch synchronous analysis but the raw "ungrouped" access method does not yet. At present the LPC parameters are extracted at a particular pitch mark by interpolating over the closest LPC parameters. The "group" files hold these interpolated parameters pitch synchronously.
The American English voice `kd' was created using the speech tools `lpc_analysis' program and its set up should be looked at if you are going to copy it. The British English voice `rb' was constructed using ESPS routines.
Databases may be accessed directly but this is usually too inefficient for any purpose except debugging. It is expected that group files will be built which contain a binary representation of the database. A group file is a compact efficient representation of the diphone database. Group files are byte order independent, so may be shared between machines of different byte orders and word sizes. Certain information in a group file may be changed at load time so a database name, access strategy etc. may be changed from what was set originally in the group file.
A group file contains the basic parameters, the diphone index, the signal (original waveform or LPC residual), LPC coefficients, and the pitch marks. It is all you need for a run-time synthesizer. Various compression mechanisms are supported to allow smaller databases if desired. A full English LPC plus residual database at 8k ulaw is about 3 megabytes, while a full 16 bit version at 16k is about 8 megabytes.
Group files are created with the Diphone.group command, which takes a database name and an output filename as arguments. Making group files can take some time, especially if they are large. The group_type parameter specifies raw or ulaw encoding for the signal files; ulaw can significantly reduce the size of databases.
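For example, a group file for a database previously described with Diphone_Init might be built with something like the following sketch; the database name and output path are illustrative only, and the exact argument forms should be checked against `rab_diphone.scm':
;; Sketch only: database name and output filename are illustrative.
(Diphone.group 'rab_lpc "group/rablpc16k.group")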
Group files may be partially loaded (see access strategies) at run time for quicker start up and to minimise run-time memory requirements.
The basic method for describing a database is through the Diphone_Init command. This function takes a single argument, a list of pairs of parameter name and value. The parameters are:
name
group_file
type (pcm, but for distributed voices this is always lpc)
index_file
signal_dir
signal_ext
pitch_dir
pitch_ext
lpc_dir
lpc_ext
lpc_type
lpc_frame_offset
lpc_res_ext
lpc_res_type
lpc_res_offset
samp_freq
phoneset
num_diphones
sig_band
alternates_after
alternates_before
default_diphone
Examples of general set up, making group files and general use are in `lib/voices/english/rab_diphone/festvox/rab_diphone.scm'.
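As a rough sketch (not taken from that file; every name, path and value below is an illustrative assumption), a description for an LPC database accessed directly might look something like:
;; Sketch only: all names, paths and values are illustrative;
;; see rab_diphone.scm for a real description.
(Diphone_Init
 (list
  '(name "rab_lpc")
  '(index_file "dic/rabdiph.dic")
  '(type lpc)
  '(lpc_dir "lpc16k/")
  '(lpc_ext ".lpc")
  '(lpc_type htk)
  '(lpc_frame_offset 0)
  '(lpc_res_ext ".res")
  '(lpc_res_type nist)
  '(lpc_res_offset 0.0)
  '(pitch_dir "pm/")
  '(pitch_ext ".pm")
  '(samp_freq 16000)
  '(phoneset "mrpa")
  '(default_diphone "#-#")))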
Three basic accessing strategies are available when using diphone databases. They are designed to optimise access time, start up time and space requirements.
direct
dynamic
ondemand
Note that in group files pitch marks (and LPC coefficients) are always fully loaded (cf. direct), as they are typically smaller. Only signals (waveform files or residuals) are potentially dynamically loaded.
The appropriate diphone is selected based on the name of the phone identified in the segment stream. However, for better diphone synthesis it is useful to augment the diphone database with other diphones in addition to the ones directly from the phoneme set: for example, dark and light l's, or consonants in their consonant cluster form as distinct from their isolated form. There are two methods to identify this modification from the basic name.
When the diphone module is called, the hook diphone_module_hooks is applied. That is, a function or list of functions which will be applied to the utterance. Its main purpose is to allow the conversion of the basic name into an augmented one: for example, converting a basic l into a dark l, denoted by ll. The functions given in diphone_module_hooks may set the feature diphone_phone_name which, if set, will be used rather than the name of the segment.
For example, suppose we wish to use a dark l (ll) rather than a normal l for all l's that appear in the coda of a syllable. First we would define a function which identifies this condition and adds the feature diphone_phone_name to record the name change. The following function would achieve this:
(define (fix_dark_ls utt)
"(fix_dark_ls UTT)
Identify ls in coda position and relabel them as ll."
  (mapcar
   (lambda (seg)
     (if (and (string-equal "l" (item.name seg))
              (string-equal "+" (item.feat seg "p.ph_vc"))
              (item.relation.prev seg "SylStructure"))
         (item.set_feat seg "diphone_phone_name" "ll")))
   (utt.relation.items utt 'Segment))
  utt)
Then when we wish to use this for a particular voice we need to add
(set! diphone_module_hooks (list fix_dark_ls))
in the voice selection function.
For a more complex example, including consonant cluster identification, see the American English voice `ked' in `festival/lib/voices/english/ked/festvox/kd_diphone.scm'. The function ked_diphone_fix_phone_name carries out a number of mappings.
The second method for changing a name is during the actual look up of a diphone in the database. The list of alternates is given by the Diphone_Init function. These are used when the specified diphone can't be found. For example, we often allow mappings of dark l, ll, to l, as sometimes the dark l diphone doesn't actually exist in the database.
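For instance, entries in the parameter list passed to Diphone_Init might look along the following lines; the exact value syntax here is an assumption, so check `rab_diphone.scm' for a real example:
;; Sketch only: fall back from dark l (ll) to plain l when a dark l
;; diphone is missing.  The value syntax is an assumption.
'(alternates_before ((ll l)))
'(alternates_after ((ll l)))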