
Appendix: Tagsets

The full initial set of 37 tags, derived from the WSJ, is given in table 6. The best tagset, shown in table 7, is formed by collapsing these into 23 tags. As ex, fw and 2 are rare and typically unreliably predicted, they are not included in the implementation released with Festival; POS tags of these types, if predicted, are treated as part of the nn_nnp_nnps_nns group. Although this marginally reduces accuracy on our test set, it reduces the size of the models and hence seems worthwhile in a run-time system.


Table 6: The original WSJ tagset
Tag Function
CC coordinating conjunction
CD cardinal number
DT determiner
EX existential "there"
FW foreign word
IN preposition
JJ adjective
JJR adjective, comparative
JJS adjective, superlative
MD modal
NN non-plural common noun
NNP non-plural proper noun
NNPS plural proper noun
NNS plural common noun
of the word "of"
PDT pre-determiner
POS possessive
PRP pronoun
puncf final punctuation (period, question mark and exclamation mark)
punc other punctuation
RB adverb
RBR adverb, comparative
RBS adverb, superlative
RP particle
TO the word "to"
UH interjection
VB verb, base form
VBD verb, past tense
VBG verb, gerund or present participle
VBN verb, past participle
VBP verb, non-3rd person singular present
VBZ verb, 3rd person singular present
WDT wh-determiner
WP wh-pronoun
WRB wh-adverb
sym symbol
2 ambiguously labelled


Table 7: The best clustered tagset. The names directly show the clustering of the original WSJ set.
Tag
cc
cd
dt
ex
fw
in
jj_jjr_jjs
md
nn_nnp_nnps_nns
of
pdt
pos
prp
punc_puncf
rb_rbr_rbs_rp
to
uh
vb_vbd_vbg_vbn_vbp_vbz
wdt
wp
wrb
sym
2
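
Because each clustered name in table 7 spells out the original tags it merges, the mapping from the 37 original tags to the 23 clusters can be recovered by splitting the names on underscores. The following is a minimal sketch of that idea in Python, including the folding of ex, fw and 2 into the noun group described above. The names CLUSTERED_TAGS, build_tag_map and collapse_tag are hypothetical and chosen for illustration only; this is not the code released with Festival.

```python
# Clustered tagset from table 7. Each name lists the original WSJ tags
# (lower-cased) that it merges, separated by underscores.
CLUSTERED_TAGS = [
    "cc", "cd", "dt", "ex", "fw", "in", "jj_jjr_jjs", "md",
    "nn_nnp_nnps_nns", "of", "pdt", "pos", "prp", "punc_puncf",
    "rb_rbr_rbs_rp", "to", "uh", "vb_vbd_vbg_vbn_vbp_vbz",
    "wdt", "wp", "wrb", "sym", "2",
]

# Rare, unreliably predicted tags that are folded into the noun group.
RARE_TAGS = {"ex", "fw", "2"}
NOUN_GROUP = "nn_nnp_nnps_nns"


def build_tag_map(clusters):
    """Map each original tag to the name of the cluster containing it."""
    tag_map = {}
    for cluster in clusters:
        for tag in cluster.split("_"):
            tag_map[tag] = cluster
    return tag_map


TAG_MAP = build_tag_map(CLUSTERED_TAGS)


def collapse_tag(tag, fold_rare=True):
    """Return the clustered tag for an original tag such as 'JJR'.

    With fold_rare=True (an assumed default for this sketch), ex, fw and 2
    are treated as part of the nn_nnp_nnps_nns group.
    """
    cluster = TAG_MAP[tag.lower()]
    if fold_rare and cluster in RARE_TAGS:
        return NOUN_GROUP
    return cluster
```

For example, collapse_tag("JJR") gives jj_jjr_jjs, while collapse_tag("EX") gives nn_nnp_nnps_nns unless fold_rare is set to False, in which case it gives ex. With folding enabled, 20 distinct output tags remain.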


Alan W Black
1999-03-20