Labelling Guide for NSW

Background

Although we may think of text as made up of words, it often contains tokens that are not simply "words". Numbers, abbreviations and the like are surprisingly common. What is more, the pronunciation of these tokens is not always trivial. Consider the digit string "1985": it is pronounced differently depending on its context. As a year it is "nineteen eighty-five", as a number, as in "1985 pages", it is "one thousand nine hundred (and) eighty-five", and as a telephone number it could be "one nine eight five". Abbreviations are also common, e.g. "20GB", "450Mhz". Depending on the type of text, the proportion of "non-standard words" (NSWs) with non-trivial pronunciations can be as much as 50%. As part of a project to investigate the relationship between written text and its pronunciation, we wish to label large amounts of text from at least four different domains: news stories from press wires, some USENET/email data, classified ads and IRC (Internet Relay Chat). The project aims to design statistical models that predict the pronunciation of such words, both for speech synthesis and for building language models in speech recognition. It will run at Johns Hopkins University from mid July to the end of August this year.
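
To make the "1985" example concrete, here is a minimal Python sketch of how one digit string can receive different readings once its type is known. It is only an illustration: the function names and the mapping from labels to readings are assumptions made for this example, not part of the project's models or tools. The cardinal reading omits the optional "and", and the year reading is deliberately simplified.

```python
# Illustrative sketch only: all names here are hypothetical.
ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]


def two_digits(n):
    """Cardinal reading of an integer in the range 0..99."""
    if n < 20:
        return ONES[n]
    tens, ones = divmod(n, 10)
    return TENS[tens] + ("-" + ONES[ones] if ones else "")


def as_cardinal(n):
    """Cardinal reading of an integer in the range 0..9999 (no "and")."""
    thousands, rest = divmod(n, 1000)
    hundreds, rest = divmod(rest, 100)
    parts = []
    if thousands:
        parts.append(ONES[thousands] + " thousand")
    if hundreds:
        parts.append(ONES[hundreds] + " hundred")
    if rest or not parts:
        parts.append(two_digits(rest))
    return " ".join(parts)


def as_year(n):
    """Year reading as two pairs of digits, e.g. 1985.

    Simplification: years whose last two digits are below 10
    (1900, 2003, ...) fall back to the cardinal reading.
    """
    high, low = divmod(n, 100)
    if low < 10:
        return as_cardinal(n)
    return two_digits(high) + " " + two_digits(low)


def as_digits(token):
    """Digit-by-digit reading, as for a telephone number."""
    return " ".join(ONES[int(ch)] for ch in token)


def pronounce(token, label):
    """Pick a reading for a digit string given its (assumed) label."""
    if label == "NYER":
        return as_year(int(token))
    if label == "NUM":
        return as_cardinal(int(token))
    if label in ("NDIG", "NTEL"):
        return as_digits(token)
    raise ValueError("no reading defined for label " + label)


for label in ("NYER", "NUM", "NTEL"):
    print(label, "->", pronounce("1985", label))
```

The point of the sketch is that the reading cannot be chosen from the token alone; the label supplies the missing information, which is why the labelling described below is needed.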

The labelling task

The labelling task itself is to look at a number of tokens, each within a short context (three words on either side), and choose one of around twenty possible labels for each non-standard word. To aid this, the presentation method only presents tokens which might be NSWs, though the heuristic for finding them is slightly over-general, so some identified NSWs are actually just words that aren't in our lexicon. Sometimes (more often in some text types) the token must also be split to identify the pronunciation of its subparts, e.g. "WinNT" consists of an abbreviation "Win" for "Windows" and the part "NT", to be pronounced as a letter sequence.
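
As a rough illustration of how candidate NSWs might be picked out and shown in context, the following Python sketch flags every token that is not in a lexicon and returns it with up to three words on either side. The actual heuristic used by the project is not described beyond the lexicon check mentioned above, so the function name, the lower-casing step and the tiny lexicon are all assumptions.

```python
def find_candidates(tokens, lexicon, width=3):
    """Yield (left_context, token, right_context) for out-of-lexicon tokens.

    Hypothetical helper: a token counts as a candidate NSW simply
    because its lower-cased form is missing from the lexicon.
    """
    for i, tok in enumerate(tokens):
        if tok.lower() in lexicon:
            continue
        left = tokens[max(0, i - width):i]
        right = tokens[i + 1:i + 1 + width]
        yield left, tok, right


# Toy lexicon and sentence, for illustration only.
lexicon = {"begin", "the", "sale", "of", "million", "barrels", "oil"}
tokens = "begin the sale of 12 million barrels of oil".split()

for left, tok, right in find_candidates(tokens, lexicon):
    print(" ".join(left), "*", tok, "*", " ".join(right))
```

A check this simple also flags ordinary words that happen to be missing from the lexicon, which is the over-generality noted above.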

Simple example

For example the labelling tool, actually a special mode in the Emacs editor, presents each token on a new line surrounded by its context. A guess at the label is given at the start and the labeller must either accept the guess or provide an alternative.
      NUM                    for Bosnia by Oct * 15 * he would go to 109 
      NUM                    no later than Nov * 15 * The United States along $
      NUM                    begin the sale of * 12 * million barrels of oil  $
      ASWD           possibility of doing this * multilaterally * 0 0 0 0 358 
      LSEQ            The Washington Post says * U.S * relations with its alli$
      ASWD         Rosenblatt Stadium in Omaha * Neb * they have never seen 
The first two NSWs are not simple numbers but ordinals, as they are dates, and hence must be labelled NORD; the third line is a simple number. The fourth, "multilaterally", is a standard word, but because it is not in our lexicon it appears as a potential NSW; it is, however, correctly guessed as a word (ASWD). The next line is a letter sequence. The last line, "Neb", is an abbreviation for "Nebraska" and hence should be marked as EXPN. Thus, after labelling, the above will look like:
NORD  NUM                    for Bosnia by Oct * 15 * he would go to 109 
NORD  NUM                    no later than Nov * 15 * The United States along $
NUM   NUM                    begin the sale of * 12 * million barrels of oil  $
ASWD  ASWD           possibility of doing this * multilaterally * 0 0 0 0 358 
LSEQ  LSEQ            The Washington Post says * U.S * relations with its alli$
EXPN  ASWD         Rosenblatt Stadium in Omaha * Neb * they have never seen 
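
If you want to inspect lines like those above outside the labeller, they can be pulled apart with a regular expression. The Python sketch below is based only on the layout of the sample lines shown here: an optional label in column 1, the guess, then the context with the token between asterisks. The on-disk file format is not specified in this guide, so treat the pattern, the function name and the assumption that unlabelled lines start with whitespace as guesses.

```python
import re

# Assumed layout: [label] guess left-context * token * right-context
LINE_RE = re.compile(
    r"^(?P<label>\S*)\s+(?P<guess>[A-Z]+)\s+"
    r"(?P<left>.*?)\s\*\s(?P<token>\S+)\s\*\s(?P<right>.*)$"
)


def parse_line(line):
    """Return the fields of one labeller line, or None if it does not match."""
    match = LINE_RE.match(line)
    if match is None:
        return None
    fields = match.groupdict()
    fields["right"] = fields["right"].rstrip()
    return fields


labelled = "NORD  NUM                    for Bosnia by Oct * 15 * he would go to 109 "
unlabelled = "      NUM                    begin the sale of * 12 * million barrels of oil  $"

for line in (labelled, unlabelled):
    print(parse_line(line))
```

An empty label field means the line has not been labelled yet; comparing the label with the guess shows where the labeller overrode the default.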

More complex example

Some NSWs have internal structure, such as "PCCard", "64MB" and "LviewPro", and these need to be identified more fully. For such NSWs you may select the split option, the character '/'. The labeller will prompt you with the token; insert spaces at the appropriate boundaries, then hit return, and the subparts can be labelled. For example
      ASWD               down 110 Preferably a * 4MB * unit with no HD        $
requires splitting as "4" and "MB", giving
SPLT  ASWD               down 110 Preferably a * 4MB * unit with no HD        $
        4
        MB
which are labelled as NUM and EXPN. Note that the EXPN label may sometimes be used for things which could also be split. For every token that is labelled EXPN only one expansion should exist. Such examples are "mg", "kg", "N.Y.", "Capt" and also, because they occur so often in news text, "D-Mass", "R-TX" (identifying party and state of US Senators).
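
In the labeller the split is entered by hand, but the boundaries usually fall where the character class changes. The Python sketch below proposes split points at digit/letter boundaries and at changes of case. It is a hypothetical aid for thinking about where to put the spaces, not a feature of the tool, and it silently drops characters that are neither letters nor digits.

```python
import re

# A run of capitals not followed by a lower-case letter ("MB", "NT", "PC"),
# or an optionally capitalised lower-case run ("Win", "Card", "view"),
# or a run of digits ("4", "64").
PART_RE = re.compile(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+")


def suggest_split(token):
    """Return a proposed list of subparts for a token (hypothetical helper)."""
    return PART_RE.findall(token)


for token in ("4MB", "64MB", "WinNT", "PCCard", "LviewPro"):
    print(token, "->", " ".join(suggest_split(token)))
```

Such a rule is only a starting point; for tokens that are better labelled whole as EXPN, as discussed above, no split should be made at all.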

Tagging Chart

The labeller runs as a special mode in the Emacs editor; single keystrokes add labels to the NSW on the current line.
Key    Label  Explanation               Example
m      MSPL   misspelled word           geogaphy
e      EXPN   abbreviation/contraction  adv, N.Y, mph
l      LSEQ   letter sequence           CIA, D.C, CDs
a      ASWD   read as word              CAT, proper names
f      FNSP   funny spelling            sllloooww, sh*t
x      NONE   token should be ignored   ascii art, formatting junk
s      SLNT   not pronounced            punctuation in compounds
n      NUM    number (cardinal)         12, 45, 1/2
o      NORD   number (ordinal)          May 7, 3rd, Bill Gates III
t      NTEL   telephone (or part of)    212 555-4523
d      NDIG   number as digits          Room 101
i      NIDE   identifier                747, 386, 8086
,      NADDR  number as street address  5000 Pennsylvania, 4523 Forbes
z      NZIP   zip code or PO Box        91020
c      NTIME  a (compound) time         3.20, 11:45
C      NDATE  a (compound) date         2/2/99, 14/03/87 (or US) 03/14/87
u      URL    url/pathname              http://slashdot.org /usr/local
y      NYER   year(s)                   1998 80s 1900s 2003
$      MONEY  money (US or otherwise)   $3.45 HK$300, Y20,000
b      BMONY  money tr/m/billions       $3.45 billion
%      PRCT   percentage                75% 3.4%
.      OTHER  unknown (use sparingly)
SPACE         selects the guessed token
r             prompts for user specified token
/             prompts for split of token
Note that labelling should (primarily) identify how you would pronounce the token. If the guess is ROM, a roman numeral, identify its use as a NUM (as in World War II) or a NORD (as in Louis XIV or Louis the XIV). For unusual abbreviations, or ones where the token itself might be ambiguous, it is necessary to put the expansion in the label itself. All labels starting with lower case letters are treated as in-line expansions. This seems particularly useful with split NSWs.
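
The rule for in-line expansions can be stated in a few lines of Python. This sketch shows how a program reading the labelled data might tell a category label from an in-line expansion, using only the lower-case convention described above; the function name and the returned tuple shape are assumptions made for illustration.

```python
def interpret_label(label):
    """Classify a label string (hypothetical helper).

    Labels starting with a lower-case letter are in-line expansions,
    i.e. the words to pronounce; anything else is a category label.
    """
    if label[:1].islower():
        return ("expansion", label)
    return ("tag", label)


# "WinNT" split into two parts: the first carries its expansion directly,
# the second is marked as a letter sequence.
parts = [("Win", "windows"), ("NT", "LSEQ")]
for token, label in parts:
    print(token, interpret_label(label))
```

This is why an expansion typed as a label should be written in lower case: an upper-case first letter would make it look like a category label.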

How to run the labeller

The labeller is a special mode in Emacs. To run it you need the script toklabel and the Emacs Lisp file toklab.el. Download these files and save them in a new directory. Edit your copy of toklabel so that the value of TOKLABDIR is the name of the directory that contains both the toklabel script and the toklab.el file. You will be given files like example.feats to label; after labelling they will look like example.done. To label a file, download it (for example into the same directory as the scripts) and type
./toklabel example.feats
This presents a screen (you may wish to make the window wider) with the tokens in context. Pressing any of the single characters described above will add the appropriate label in column 1. The space key will select the default. For some texts the default will often be right, and for some tokens the default will almost always be right, but watch for the occasional weird forms, for example numbers that look like years, letter sequences that are really words, etc. The list of labels and examples may be obtained in Emacs itself with the command C-h m (describe-mode). When you type something you didn't mean to and things go "strange", you can use Emacs' undo feature, available from the Edit menu and as C-_ (that's control underscore). You can also override the special characters to type your own label by preceding them with C-q (though the r key will usually be sufficient).

Who can label

We will pay you for labelling; however, in order to be eligible you must meet the following conditions