Labelling Guide for NSW
- Background
- Who can label
- How to label
- Examples
Background
Although we may think of text as made up of words, it often contains
tokens that are not simply "words." Numbers and abbreviations, for
example, are surprisingly common. What is more, the pronunciation of
these tokens is not always trivial. Consider the digit string "1985",
which is pronounced differently depending on its context: as a year it
is "nineteen eighty-five"; as a number, as in "1985 pages", it is "one
thousand nine hundred (and) eighty-five"; and as a telephone number it
could be "one nine eight five". Abbreviations such as "20GB" and
"450Mhz" are also common. Depending on the type of text, the
proportion of "non-standard words" with non-trivial pronunciations can
be as high as 50%.
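To make the "1985" example concrete, here is a minimal Python sketch of the three readings. The function names and the simplified number-to-words rules are illustrative assumptions only; they are not the project's models, which are intended to be statistical.

```python
ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]


def two_digits(n):
    """Words for an integer in the range 0..99."""
    if n < 20:
        return ONES[n]
    tens, ones = divmod(n, 10)
    return TENS[tens] if ones == 0 else f"{TENS[tens]}-{ONES[ones]}"


def cardinal(n):
    """Words for an integer in the range 0..9999, read as a quantity."""
    parts = []
    thousands, rest = divmod(n, 1000)
    hundreds, rest = divmod(rest, 100)
    if thousands:
        parts.append(f"{ONES[thousands]} thousand")
    if hundreds:
        parts.append(f"{ONES[hundreds]} hundred")
    if rest or not parts:
        parts.append(two_digits(rest))
    return " ".join(parts)


def as_year(n):
    """Words for a four-digit year, read as two pairs of digits.

    Simplified: years such as 2000 are not special-cased.
    """
    high, low = divmod(n, 100)
    if low == 0:
        return f"{two_digits(high)} hundred"
    if low < 10:
        return f"{two_digits(high)} oh {ONES[low]}"
    return f"{two_digits(high)} {two_digits(low)}"


def as_digits(token):
    """Words for a digit string read one digit at a time."""
    return " ".join(ONES[int(ch)] for ch in token)


print(as_year(1985))      # nineteen eighty-five
print(cardinal(1985))     # one thousand nine hundred eighty-five
print(as_digits("1985"))  # one nine eight five
```

Choosing between these readings is exactly what the labels are meant to record.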
As part of a project to investigate the relationship between written
text and its pronunciation, we wish to label large amounts of text
from at least four different domains: news stories from press wires,
some USENET/email data, classified ads and IRC (Internet Relay Chat).
The project aims to design statistical models that predict the
pronunciation of such words, both for speech synthesis and for
building language models in speech recognition. It will run at Johns
Hopkins University from mid-July to the end of August this year.
The labelling task
The labelling task is to look at a number of tokens, each within a
short context (three words on either side), and to choose one of around
twenty possible labels for each non-standard word (NSW). To aid this,
the tool presents only tokens which might be NSWs. The heuristic for
finding them is slightly over-general, so some of the tokens presented
are actually just words that aren't in our lexicon.
Sometimes (more often in some text types) the token must also be split
so that the pronunciation of its subparts can be identified, e.g.
"WinNT" consists of an abbreviation "Win" for "Windows" and the part
"NT", which is pronounced as a letter sequence.
Simple example
The labelling tool, actually a special mode in the Emacs editor,
presents each token on its own line, marked off by asterisks and
surrounded by its context. A guess at the label is given at the start
of the line, and the labeller must either accept the guess or provide
an alternative.
NUM for Bosnia by Oct * 15 * he would go to 109
NUM no later than Nov * 15 * The United States along $
NUM begin the sale of * 12 * million barrels of oil $
ASWD possibility of doing this * multilaterally * 0 0 0 0 358
LSEQ The Washington Post says * U.S * relations with its alli$
ASWD Rosenblatt Stadium in Omaha * Neb * they have never seen
The first two NSWs are not simple numbers but ordinals, as they
are parts of dates, and hence must be labelled NORD. The third
is a simple number. The fourth, "multilaterally", is a standard
word, but because it is not in our lexicon it appears as a potential
NSW; however, it is guessed as a word (ASWD). The fifth is a letter
sequence. The last, "Neb", is an abbreviation for "Nebraska" and
hence should be marked as EXPN.
Thus, after labelling, the above will look like this:
NORD NUM for Bosnia by Oct * 15 * he would go to 109
NORD NUM no later than Nov * 15 * The United States along $
NUM NUM begin the sale of * 12 * million barrels of oil $
ASWD ASWD possibility of doing this * multilaterally * 0 0 0 0 358
LSEQ LSEQ The Washington Post says * U.S * relations with its alli$
EXPN ASWD Rosenblatt Stadium in Omaha * Neb * they have never seen
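If labelled lines need to be processed afterwards, a line laid out as in the display above can be taken apart roughly as follows. This Python sketch assumes the stored lines look exactly like the display (label, guess, then the context with the token between " * " markers); the actual file format is not described here, and the names LabelledLine and parse_line are hypothetical.

```python
from typing import NamedTuple, Optional


class LabelledLine(NamedTuple):
    label: Optional[str]  # label chosen by the labeller, None if unlabelled
    guess: str            # label guessed by the tool
    left: str             # context to the left of the token
    token: str            # the potential NSW itself
    right: str            # context to the right of the token


def parse_line(line, labelled=True):
    """Split one display line into its parts.

    Assumes the token is delimited by the first two ' * ' markers.
    Raises ValueError if the line does not contain two markers.
    """
    head, token, right = line.split(" * ", 2)
    fields = head.split()
    n_tags = 2 if labelled else 1
    tags, left = fields[:n_tags], fields[n_tags:]
    label = tags[0] if labelled else None
    guess = tags[-1]
    return LabelledLine(label, guess, " ".join(left), token, right.strip())


done = parse_line("NORD NUM for Bosnia by Oct * 15 * he would go to 109")
todo = parse_line("ASWD Rosenblatt Stadium in Omaha * Neb * they have never seen",
                  labelled=False)
print(done.label, done.guess, done.token)  # NORD NUM 15
print(todo.label, todo.guess, todo.token)  # None ASWD Neb
```

Comparing label with guess over many lines would show how often the tool's guess was accepted.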
More complex example
Some NSWs have internal structure, such as "PCCard", "64MB" and
"LviewPro", and these need to be identified more fully. For such NSWs
you may select the split option, the character '/'. The labeller will
prompt you with the token; insert spaces at the appropriate
boundaries, then hit return, and the subparts can be labelled.
For example
ASWD down 110 Preferably a * 4MB * unit with no HD $
requires splitting as "4" and "MB", giving
SPLT ASWD down 110 Preferably a * 4MB * unit with no HD $
4
MB
which are labelled as NUM and EXPN.
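The boundaries in such tokens often fall where digits meet letters or where lower case changes to upper case. The Python sketch below suggests a split on that assumption. It is a hypothetical helper (suggest_split is not part of the labelling tool), and it misses cases such as "PCCard", so the labeller's judgement is still required.

```python
import re

# Zero-width boundaries: digit->letter, letter->digit, lower->upper case.
_BOUNDARY = re.compile(
    r"(?<=[0-9])(?=[A-Za-z])"
    r"|(?<=[A-Za-z])(?=[0-9])"
    r"|(?<=[a-z])(?=[A-Z])"
)


def suggest_split(token):
    """Return the suggested subparts of a token as a list of strings."""
    # Insert a space at every boundary, then split on whitespace.
    return _BOUNDARY.sub(" ", token).split()


print(suggest_split("4MB"))       # ['4', 'MB']
print(suggest_split("WinNT"))     # ['Win', 'NT']
print(suggest_split("LviewPro"))  # ['Lview', 'Pro']
print(suggest_split("PCCard"))    # ['PCCard'] (no boundary found)
```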
Note that the EXPN label may sometimes be used for things which could
also be split. Every token that is labelled EXPN should have only one
expansion. Examples are "mg", "kg", "N.Y." and "Capt" and,
because they occur so often in news text, "D-Mass" and "R-TX"
(identifying the party and state of US Senators).
Tagging Chart
The labeller runs as a special mode in the Emacs editor. Single
keystrokes add a label to the NSW on the current line.
Key | Label | Explanation | Example
---|---|---|---
m | MSPL | misspelled word | geogaphy
e | EXPN | abbreviation/contraction | adv, N.Y, mph
l | LSEQ | letter sequence | CIA, D.C, CDs
a | ASWD | read as word | CAT, proper names
f | FNSP | funny spelling | sllloooww, sh*t
x | NONE | token should be ignored | ascii art, formatting junk
s | SLNT | not pronounced | punctuation in compounds
n | NUM | number (cardinal) | 12, 45, 1/2
o | NORD | number (ordinal) | May 7, 3rd, Bill Gates III
t | NTEL | telephone (or part of) | 212 555-4523
d | NDIG | number as digits | Room 101
i | NIDE | identifier | 747, 386, 8086
, | NADDR | number as street address | 5000 Pennsylvania, 4523 Forbes
z | NZIP | zip code or PO Box | 91020
c | NTIME | a (compound) time | 3.20, 11:45
C | NDATE | a (compound) date | 2/2/99, 14/03/87 (or US) 03/14/87
u | URL | url/pathname | http://slashdot.org, /usr/local
y | NYER | year(s) | 1998, 80s, 1900s, 2003
$ | MONEY | money (US or otherwise) | $3.45, HK$300, Y20,000
b | BMONY | money tr/m/billions | $3.45 billion
% | PRCT | percentage | 75%, 3.4%
. | OTHER | unknown (use sparingly) |
SPACE | | selects the guessed label |
r | | prompts for a user-specified label |
/ | | prompts for a split of the token |
Note that labelling should (primarily) identify how you would
pronounce the token.
Note that if the guess is ROM, a roman numeral, you should identify
its use as a NUM (as in World War II) or a NORD (as in Louis XIV or
Louis the XIV).
For unusual abbreviations, or ones where the token itself might
be ambiguous, it is necessary to put the expansion in the label
itself. All labels starting with lower-case letters are treated as
in-line expansions. This seems particularly useful with split
NSWs.
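The key chart and the lower-case rule can be summarised in a short Python sketch. The dictionary copies the chart above; the function classify_label and its return values are hypothetical, and treating SPLT as a valid label is an assumption based on the split example earlier in this guide.

```python
# Key -> label, copied from the tagging chart.
KEY_TO_LABEL = {
    "m": "MSPL", "e": "EXPN", "l": "LSEQ", "a": "ASWD", "f": "FNSP",
    "x": "NONE", "s": "SLNT", "n": "NUM", "o": "NORD", "t": "NTEL",
    "d": "NDIG", "i": "NIDE", ",": "NADDR", "z": "NZIP", "c": "NTIME",
    "C": "NDATE", "u": "URL", "y": "NYER", "$": "MONEY", "b": "BMONY",
    "%": "PRCT", ".": "OTHER",
}

# SPLT marks a token that has been split (assumed from the 4MB example).
KNOWN_LABELS = set(KEY_TO_LABEL.values()) | {"SPLT"}


def classify_label(label):
    """Classify a value found in the label column.

    Returns "expansion" for an in-line expansion (starts with a
    lower-case letter), "tag" for a label from the chart, and
    "unknown" for anything else.
    """
    if label[:1].islower():
        return "expansion"
    if label in KNOWN_LABELS:
        return "tag"
    return "unknown"


print(classify_label("EXPN"))      # tag
print(classify_label("nebraska"))  # expansion
print(classify_label("ROM"))       # unknown (a guess, to be relabelled)
```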
How to run the labeller
The labeller is a special mode in Emacs. To run it you need the
script toklabel and the Emacs Lisp file toklab.el. Download these
files and save them in a new directory. Edit your copy of toklabel so
that the value of TOKLABDIR is the name of the directory that contains
both the toklabel script and the toklab.el file. You will be given
files like example.feats to label; after labelling they will look like
example.done. To label a file, download it (for example into the same
directory as the scripts) and type
./toklabel example.feats
This presents a screen (you may wish to make the window wider) with
the tokens in context. Pressing any of the single characters
described above will add the appropriate label in column 1. The
space key will select the default. For some texts the default
will often be right, and for some tokens the default will
almost always be right, but watch for the occasional weird forms,
for example numbers that look like years, letter sequences that
are really words etc. The list of labels and examples may
be obtained in Emacs itself with the command C-h m (describe-mode).
When you type something you didn't mean to and things go "strange",
you can use Emacs' undo feature, available from the Edit menu and
as C-_ (that's control underscore). You can also override the
special characters to type your own label by preceding them with C-q
(though the r key will usually be sufficient).
Who can label
We will pay you for labelling. However, in order to be eligible you
must meet the following conditions:
- Be a graduate student of Edinburgh University
- Have permission from your supervisor
- Be in a position for the University to pay you for casual labour without
conflicts from your current funders
- Be able to spend at least 10 hours over the next month on this
- Have access to a Unix machine with Emacs (19 or 20) installed
- Be fluent in English (you don't need to be a native speaker)
- Complete the test example to a reasonable degree