
Some simple algorithms

As with all models, there is a trade-off between complexity, in both time and space, and ease of implementation. The table below gives results from two simple algorithms tested on our data. The first deterministically inserts a phrase break after every punctuation mark, while the second inserts a phrase break after every content word that is followed by a function word (e.g. as suggested by [10]).
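The two rules just described can be sketched in a few lines. This is only an illustrative sketch, not the implementation used in the experiments: it assumes that each token has already been labelled with one of three hypothetical coarse classes ("content", "function" or "punc"), that punctuation marks are separate tokens, and that a break decision is recorded as one boolean per token, meaning "a break follows this token". The function names are likewise made up for the example.

```python
from typing import List, Tuple

# A token is a (word, coarse_class) pair. The three class names below are
# assumptions made for this sketch, not labels taken from the paper.
Token = Tuple[str, str]


def punctuation_breaks(tokens: List[Token]) -> List[bool]:
    """Punctuation rule: place a break after every punctuation token."""
    return [cls == "punc" for _, cls in tokens]


def content_function_breaks(tokens: List[Token]) -> List[bool]:
    """Content/function rule: place a break after every content word
    whose immediate successor is a function word."""
    breaks = []
    for i, (_, cls) in enumerate(tokens):
        next_cls = tokens[i + 1][1] if i + 1 < len(tokens) else None
        breaks.append(cls == "content" and next_cls == "function")
    return breaks


# Toy sentence, hand-labelled for illustration only.
sentence = [
    ("the", "function"), ("cat", "content"), ("sat", "content"),
    ("on", "function"), ("the", "function"), ("mat", "content"),
    (".", "punc"),
]

punct_result = punctuation_breaks(sentence)
cf_result = content_function_breaks(sentence)
```

On this toy sentence the punctuation rule marks only the position after the final full stop, while the content/function rule marks the position after "sat", the one content word directly followed by a function word.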


We can see that the punctuation model assigns breaks conservatively: the positions it marks are almost always correct, but it misses many others. The content/function model finds many more of the correct breaks, but at the cost of massive over-insertion.
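The contrast between a conservative model and an over-inserting one can be made concrete by counting, for each model, how many reference breaks it recovers and how many breaks it places where the reference has none. The sketch below assumes this simple definition of the two quantities, which may differ from the exact measures used for the table; the function name `score_breaks` and the example sequences are hypothetical and do not reproduce any result from our data.

```python
from typing import Dict, List


def score_breaks(predicted: List[bool], reference: List[bool]) -> Dict[str, float]:
    """Compare predicted break decisions with reference decisions.

    Both lists hold one boolean per position (True means "break here").
    Returns the fraction of reference breaks that were predicted and the
    number of predicted breaks with no matching reference break.
    """
    if len(predicted) != len(reference):
        raise ValueError("predicted and reference must have the same length")
    hits = sum(1 for p, r in zip(predicted, reference) if p and r)
    insertions = sum(1 for p, r in zip(predicted, reference) if p and not r)
    total_reference = sum(1 for r in reference if r)
    breaks_correct = hits / total_reference if total_reference else 0.0
    return {"breaks_correct": breaks_correct, "insertions": float(insertions)}


# Made-up sequences for illustration only.
reference = [False, False, True, False, False, False, True]
conservative = [False, False, False, False, False, False, True]
over_inserting = [False, True, True, False, True, False, True]

conservative_score = score_breaks(conservative, reference)
over_inserting_score = score_breaks(over_inserting, reference)
```

Here the conservative sequence recovers half of the reference breaks with no insertions, whereas the over-inserting sequence recovers all of them but adds two spurious breaks.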

Within our basic model there are a number of variables to investigate, including the size of the POS tagset, the size of the POS window for the POS sequence model, and the size of the n-gram for the phrase break model.

Alan W Black
Tue Jul 1 17:09:00 BST 1997