The standard model uses 23 POS tags in a POS sequence model of length 3. This gives 12,167 different possible observations. With a training set of 31,707 words it is clear that there will be a large number of POS sequences which never occur or occur only once. In the basic model, sequences with zero counts are assigned a small fixed floor probability. These cases are not particularly important as the chances of breaks and non-breaks being inserted are then governed by the phrase model. More worrying are single occurrences. If a POS sequence is observed only once, and with a break at the juncture, it is assigned the same probability as a sequence for which a large number of breaks and zero non-breaks are observed. Clearly the second case is a far better indicator that the POS sequence in question really does carry a high likelihood of a break.
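The sparsity claim above can be checked with a back-of-envelope calculation (the figures are from the text; the variable names are illustrative only):

```python
# Sparsity of POS trigram observations relative to the training data.
n_tags = 23
window = 3
n_sequences = n_tags ** window          # 12,167 possible POS trigrams
n_training = 31707                      # training words (figure from the text)
avg_count = n_training / n_sequences    # average observations per trigram

print(n_sequences)                      # 12167
print(round(avg_count, 1))              # roughly 2.6 observations per trigram
```

With fewer than three observations per trigram on average, and counts in practice far from uniform, zero and singleton counts are inevitable.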
To counter this problem we employ a smoothing technique which adjusts the frequency counts of rare and non-occurring POS sequences. First, Good-Turing smoothing (explained in Church and Gale, 1991) is used to adjust the frequency counts of all occurrences for the break and non-break models. This effectively gives zero counts a small value and reduces the counts of rare cases. Next, a form of backing-off is applied whereby a juncture likelihood P(c_{k-1}, c_k, c_{k+1} | j_k) is discarded if its adjusted frequency count falls below a threshold, and the estimate P(c_k, c_{k+1} | j_k) is used instead. A threshold of 3 usually gave the best results. Table 2 compares the V23 tagset with and without smoothing. The smoothed POS sequence models with the 6-gram phrase break model are significantly better than the unsmoothed equivalents, with both word and break accuracy increasing at the cost of only a slight degradation in word insertions.
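The two steps described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the toy counts in the usage comments, and the flat maximum-likelihood estimates are all assumptions made for the sketch (a real system would also renormalise and apply the floor probability mentioned earlier).

```python
from collections import Counter

BACKOFF_THRESHOLD = 3  # the threshold of 3 that usually gave the best results


def good_turing_adjust(counts):
    """Simple Good-Turing adjustment of raw frequency counts:
    r* = (r + 1) * N_{r+1} / N_r, where N_r is the number of event
    types observed exactly r times. Rare counts are discounted;
    where no higher-count bucket exists, the raw count is kept."""
    freq_of_freq = Counter(counts.values())
    adjusted = {}
    for event, r in counts.items():
        n_r = freq_of_freq[r]
        n_r_plus_1 = freq_of_freq.get(r + 1, 0)
        if n_r_plus_1 > 0:
            adjusted[event] = (r + 1) * n_r_plus_1 / n_r
        else:
            adjusted[event] = float(r)
    return adjusted


def juncture_likelihood(trigram, counts3, counts2, total3, total2, adjusted3):
    """Estimate P(c_{k-1}, c_k, c_{k+1} | j_k), backing off to the
    bigram estimate P(c_k, c_{k+1} | j_k) when the Good-Turing
    adjusted trigram count falls below the threshold."""
    if adjusted3.get(trigram, 0.0) >= BACKOFF_THRESHOLD:
        return counts3[trigram] / total3
    bigram = trigram[1:]  # drop c_{k-1}, keep (c_k, c_{k+1})
    return counts2.get(bigram, 0) / total2


# Toy usage (invented counts, for illustration only):
counts3 = {("N", "V", "N"): 5}
counts2 = {("N", "V"): 4}
adjusted3 = good_turing_adjust(counts3)
p_seen = juncture_likelihood(("N", "V", "N"), counts3, counts2, 10, 20, adjusted3)
p_rare = juncture_likelihood(("A", "N", "V"), counts3, counts2, 10, 20, adjusted3)
```

In this toy run the well-attested trigram is estimated directly, while the unseen trigram falls back on its bigram suffix instead of the floor probability.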
The table shows that smoothing significantly increases performance when used with a high-order n-gram phrase break model.