Labelling Guide for NSW

Background
Who can label
How to label
Examples

Background

Although we may think of text as made up of words, text often contains tokens that are not simply "words". Numbers, abbreviations and the like are surprisingly common, and their pronunciation is not always trivial. Consider the digit string "1985", which is pronounced differently depending on its context: as a year it is "nineteen eighty-five", as a number, as in "1985 pages", it is "one thousand nine hundred (and) eighty-five", and as a telephone number it could be "one nine eight five". Abbreviations such as "20GB" and "450Mhz" are also common. Depending on the type of text, the proportion of "non-standard words" (NSWs) with non-trivial pronunciations can be as much as 50%.

As part of a project to investigate the relationship between written text and its pronunciation, we wish to label large amounts of text from at least four different domains: news stories from press wires, some USENET/email data, classified ads and IRC (Internet Relay Chat). The project aims to design statistical models that predict the pronunciation of such words, both for speech synthesis and for building language models in speech recognition. It will run at Johns Hopkins University from mid July to the end of August this year.
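To make the "1985" example concrete, here is a minimal Python sketch of how one digit string can expand differently depending on the label it receives. The function names and the mapping from labels to readings are assumptions made for illustration (the label names NYER, NUM and NTEL come from the tagging chart later in this guide); this is not the project's actual model, which is to be statistical.

```python
ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]

def two_digits(n):
    """Words for an integer in the range 0..99."""
    if n < 20:
        return ONES[n]
    tens, ones = divmod(n, 10)
    return TENS[tens] if ones == 0 else f"{TENS[tens]}-{ONES[ones]}"

def as_cardinal(n):
    """Words for an integer in the range 0..9999 (no optional "and")."""
    parts = []
    thousands, rest = divmod(n, 1000)
    hundreds, rest = divmod(rest, 100)
    if thousands:
        parts.append(f"{ONES[thousands]} thousand")
    if hundreds:
        parts.append(f"{ONES[hundreds]} hundred")
    if rest or not parts:
        parts.append(two_digits(rest))
    return " ".join(parts)

def as_year(n):
    """Read a four-digit year as two pairs of digits.

    Only handles years like 1985; years such as 1905 or 2000
    need special cases that this sketch leaves out.
    """
    high, low = divmod(n, 100)
    return f"{two_digits(high)} {two_digits(low)}"

def as_digits(text):
    """Read a digit string one digit at a time."""
    return " ".join(ONES[int(ch)] for ch in text)

def pronounce(token, label):
    """Expand a digit string according to its label (illustrative mapping)."""
    if label == "NYER":
        return as_year(int(token))
    if label == "NUM":
        return as_cardinal(int(token))
    if label == "NTEL":
        return as_digits(token)
    raise ValueError(f"label not covered by this sketch: {label}")

print(pronounce("1985", "NYER"))  # nineteen eighty-five
print(pronounce("1985", "NUM"))   # one thousand nine hundred eighty-five
print(pronounce("1985", "NTEL"))  # one nine eight five
```

The point of the sketch is only that the label, not the token, decides the reading, which is why the labels you assign matter.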

The labelling task

The labelling task itself is to look at a number of words within a short context (three words on either side) and choose one of around twenty possible labels for each non-standard word. To aid this, the presentation method shows only tokens which might be NSWs, though the heuristic for finding them is slightly over-general, so some identified NSWs are actually just words that aren't in our lexicon. Sometimes (more often in some text types) the token must also be split to identify the pronunciation of its subparts, e.g. "WinNT" consists of an abbreviation "Win" for "Windows" and the part "NT", which is pronounced as a letter sequence.
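The idea of an over-general candidate finder can be sketched as follows. This is an assumption about how such a heuristic might look, not the one the project uses: the toy lexicon, the function names and the whitespace tokenisation are all hypothetical.

```python
# Hypothetical toy lexicon; the project's real lexicon is not shown in this guide.
LEXICON = {"preferably", "a", "unit", "with", "no", "down"}

def is_candidate(token, lexicon=LEXICON):
    """Over-general test: flag every token whose lower-cased form is unknown."""
    return token.lower() not in lexicon

def candidates_with_context(text, width=3):
    """Yield (left context, token, right context) for each candidate NSW.

    `width` is the number of context words kept on either side.
    """
    tokens = text.split()
    for i, tok in enumerate(tokens):
        if is_candidate(tok):
            left = tokens[max(0, i - width):i]
            right = tokens[i + 1:i + 1 + width]
            yield " ".join(left), tok, " ".join(right)

for left, tok, right in candidates_with_context("Preferably a 4MB unit with no HD"):
    print(f"{left} * {tok} * {right}")
```

Because the test is "not in the lexicon", an ordinary word that happens to be missing from the lexicon is flagged too, which is exactly the over-generality described above.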

Simple example

The labelling tool, actually a special mode in the Emacs editor, presents each token on a new line surrounded by its context. A guess at the label is given at the start of the line, and the labeller must either accept the guess or provide an alternative.

      NUM                    for Bosnia by Oct * 15 * he would go to 109 
      NUM                    no later than Nov * 15 * The United States along $
      NUM                    begin the sale of * 12 * million barrels of oil  $
      ASWD           possibility of doing this * multilaterally * 0 0 0 0 358 
      LSEQ            The Washington Post says * U.S * relations with its alli$
      ASWD         Rosenblatt Stadium in Omaha * Neb * they have never seen

The first two NSWs are not simple numbers but ordinals, as they are dates, and hence must be labelled NORD; the third line is a simple number. The fourth, "multilaterally", is a standard word, but because it is not in our lexicon it appears as a potential NSW; it is, however, correctly guessed as a word (ASWD). The next line is a letter sequence. The last line, "Neb", is an abbreviation for "Nebraska" and hence should be marked as EXPN. Thus after labelling, the above will look like:

NORD  NUM                    for Bosnia by Oct * 15 * he would go to 109 
NORD  NUM                    no later than Nov * 15 * The United States along $
NUM   NUM                    begin the sale of * 12 * million barrels of oil  $
ASWD  ASWD           possibility of doing this * multilaterally * 0 0 0 0 358 
LSEQ  LSEQ            The Washington Post says * U.S * relations with its alli$
EXPN  ASWD         Rosenblatt Stadium in Omaha * Neb * they have never seen
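If you want to inspect such lines outside Emacs, a line can be taken apart roughly as follows. This is a sketch that assumes the layout seen in the examples above (an optional label in column 1, then the guess, then the context with the token between " * " markers); the real file format may differ, and the function name is hypothetical.

```python
def parse_line(line):
    """Split one display line into label, guess, contexts and token.

    Assumes a labelled line starts in column 1 and an unlabelled line
    starts with spaces, as in the examples in this guide.
    """
    labelled = not line[:1].isspace()
    # The token sits between the first two " * " markers.
    head, token, tail = line.split(" * ", 2)
    fields = head.split()
    if labelled:
        label, guess, left = fields[0], fields[1], fields[2:]
    else:
        label, guess, left = None, fields[0], fields[1:]
    return {
        "label": label,
        "guess": guess,
        "left": " ".join(left),
        "token": token,
        "right": tail.strip(),
    }

before = "      NUM                    for Bosnia by Oct * 15 * he would go to 109 "
after = "NORD  NUM                    for Bosnia by Oct * 15 * he would go to 109 "
print(parse_line(before)["label"])  # None
print(parse_line(after)["label"])   # NORD
```

The sketch does not handle blank lines or the indented subparts that appear after a split.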

More complex example

Some NSWs have internal structure, such as "PCCard", "64MB" and "LviewPro"; these need to be identified more fully. For such NSWs you may select the split option, the character '/', and the labeller will prompt you with the token. Insert spaces at the appropriate boundaries, then hit return, and the subparts can be labelled. For example

      ASWD               down 110 Preferably a * 4MB * unit with no HD        $

requires splitting as "4" and "MB" giving

SPLT  ASWD               down 110 Preferably a * 4MB * unit with no HD        $
        4
        MB

which are labelled as NUM and EXPN. Note that the EXPN label may sometimes be used for things which could also be split. For every token that is labelled EXPN only one expansion should exist. Examples are "mg", "kg", "N.Y." and "Capt" and, because they occur so often in news text, "D-Mass" and "R-TX" (identifying the party and state of US Senators).
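When deciding where to put the spaces, the boundaries usually fall where digits meet letters or where the capitalisation changes. The following Python sketch proposes a split on that basis. It is only an illustrative heuristic with a hypothetical function name, not part of the labelling tool, and you should still choose the boundaries by how you would pronounce the token.

```python
import re

# One piece is: a run of digits, a run of capitals not followed by a
# lower-case letter, a capitalised word, or a run of lower-case letters.
PIECE = re.compile(r"[0-9]+|[A-Z]+(?![a-z])|[A-Z][a-z]+|[a-z]+")

def suggest_split(token):
    """Propose subparts of a token; punctuation is dropped by this sketch."""
    return PIECE.findall(token)

print(suggest_split("4MB"))    # ['4', 'MB']
print(suggest_split("WinNT"))  # ['Win', 'NT']
```

A rule like this will propose splits that are wrong for pronunciation in some cases, so treat its output as a starting point only.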

Tagging Chart

The labeller runs as a special mode in the Emacs editor, single key strokes add labels to the NSW on the current line.

Key	Label	Explanation	Example
m	MSPL	misspelled word	geogaphy
e	EXPN	abbreviation/contraction	adv, N.Y, mph
l	LSEQ	letter sequence	CIA, D.C, CDs
a	ASWD	read as word	CAT, proper names
f	FNSP	funny spelling	sllloooww, sh*t
x	NONE	token should be ignored	ascii art, formatting junk
s	SLNT	not pronounced	punctuation in compounds
n	NUM	number (cardinal)	12, 45, 1/2
o	NORD	number (ordinal)	May 7, 3rd, Bill Gates III
t	NTEL	telephone (or part of)	212 555-4523
d	NDIG	number as digits	Room 101
i	NIDE	identifier	747, 386, 8086
,	NADDR	number as street address	5000 Pennsylvania, 4523 Forbes
z	NZIP	zip code or PO Box	91020
c	NTIME	a (compound) time	3.20, 11:45
C	NDATE	a (compound) date	2/2/99, 14/03/87 (or US) 03/14/87
u	URL	url/pathname	http://slashdot.org /usr/local
y	NYER	year(s)	1998 80s 1900s 2003
$	MONEY	money (US or otherwise)	$3.45 HK$300, Y20,000
b	BMONY	money tr/m/billions	$3.45 billion
%	PRCT	percentage	75% 3.4%
.	OTHER	unknown (use sparingly)
SPACE		Selects the guessed token
r		prompts for user specified token
/		prompt for split of token

Note that labelling should (primarily) identify how you would pronounce the token. If the guess is ROM, a roman numeral, identify its use as a NUM (as in World War II) or a NORD (as in Louis XIV or Louis the XIV). For unusual abbreviations, or ones where the token itself might be ambiguous, it is necessary to put the expansion in the label itself. All labels starting with lower case letters are treated as in-line expansions. This seems particularly useful with split NSWs.
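The lower-case rule can be expressed as a small check. The sketch below is an assumption about how a label field might be interpreted downstream; the function name and the returned categories are hypothetical, and the tag set is simply the labels that appear in this guide.

```python
# Labels from the tagging chart, plus SPLT as used for split tokens.
TAGS = {
    "MSPL", "EXPN", "LSEQ", "ASWD", "FNSP", "NONE", "SLNT", "NUM", "NORD",
    "NTEL", "NDIG", "NIDE", "NADDR", "NZIP", "NTIME", "NDATE", "URL",
    "NYER", "MONEY", "BMONY", "PRCT", "OTHER", "SPLT",
}

def interpret_label(field):
    """Classify a label field as an in-line expansion, a chart tag, or neither."""
    if field[:1].islower():
        # Anything starting with a lower-case letter is an in-line expansion.
        return ("expansion", field)
    if field in TAGS:
        return ("tag", field)
    return ("not in chart", field)

print(interpret_label("windows"))  # ('expansion', 'windows')
print(interpret_label("NORD"))     # ('tag', 'NORD')
```

Note that a guess such as ROM is not a chart label under this sketch, which matches the instruction to relabel roman numerals as NUM or NORD.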

How to run the labeller

The labeller is a special mode in Emacs. To run it you need the script toklabel and the Emacs Lisp file toklab.el. Download these files and save them in a new directory. Edit your copy of toklabel so that the value of TOKLABDIR is the name of the directory that contains both the toklabel script and the toklab.el file. You will be given files like example.feats to label; after labelling, the results will appear as example.done. To label a file, download it (for example into the same directory as the scripts) and type

./toklabel example.feats

This presents a screen (you may wish to make the window wider) with the tokens in context. Pressing any of the single characters described above will add the appropriate label in column 1. The space key will select the default. For some texts the default will often be right, and for some tokens the default will almost always be right, but watch out for the occasional weird forms, for example numbers that look like years, letter sequences that are really words, etc. The list of labels and examples may be obtained in Emacs itself with the command C-h m (describe-mode). When you type something you didn't mean to and things go "strange", you can use Emacs' undo feature, available from the Edit menu and as C-_ (that's control underscore). You can also override the special characters to type your own by preceding them with C-q (though the r key will usually be sufficient).

Who can label

We will pay you for labelling; however, in order to be eligible you must meet the following conditions:

Be a graduate student of Edinburgh University
Have permission from your supervisor
Be in a position for the University to pay you for casual labour without conflicts from your current funders
Be able to spend at least 10 hours over the next month on this
Have access to a Unix machine with Emacs (19 or 20) installed
Be fluent in English (you don't need to be native)
Complete the test example to a reasonable degree