In this assignment, you will build a Hidden Markov Model and use it to tag words with their parts of speech. This assignment description is long, but contains lots of specific directions and hints so that you won't get stuck. Don't be intimidated—your program will be a lot shorter than this description! Just read carefully.
In this assignment, you will do supervised learning, estimating the parameters p(tag | previous tag) and p(word | tag) from a training set of already-tagged text. Some smoothing is necessary. You will then evaluate the learned model by finding the Viterbi tagging (i.e., best tag sequence) for some test data and measuring how many tags were correct.
For speed and simplicity, you will use relatively small datasets, and a bigram model instead of a trigram model. You will also ignore the spelling of words (which would otherwise be useful for tagging unknown words). All these simplifications hurt accuracy.1) So overall, your percentage of correct tags will be in the low 90's instead of the high 90's that are reported for state-of-the-art POS taggers.
Programming language: You must do this assignment using Python. Jason Eisner's final program (for both this assignment and the next) was about 250 lines in Perl, and it handled problem 2 in about 15 seconds and 22M memory, and problem 1 of the next assignment (four iterations) in about 17.5 minutes and 131M memory. (A simple pruning step cuts the latter time to 6 minutes with virtually no effect on results, but you are not required to implement this.) Jason Baldridge's final program was about 300 lines in Python, and it handled problem 2 in about 4 seconds and 15M memory, and problem 1 of the next assignment (four iterations) in about 7.5 minutes and 100M memory on the UT Computational Linguistics lab computers.2) Note that implementations in compiled languages like Java would probably run far faster than either of these.
How to hand in your work: The procedure will be similar to previous assignments. You must test that your programs run on the lab machines with no problems before submitting them. You may prefer to develop them on the lab machines in the first place, since there are copies of the data already there.
Data: There are three datasets, contained in the archive ALL_DATA_TAGGING.tgz: the artificial ic (ice cream) dataset used in class, the English en dataset, and a third dataset used only for the extra credit problem.
You only need to hand in results on the en dataset. The others are just for your convenience in testing your code, and for the extra credit problem.
FIGURE 1: English part-of-speech tagset: the tags in the en dataset. These come from longer, more specific tag names stripped down to their first letters. For example, all kinds of nouns (formerly NN, NNS, NNP, NNPS) are simply tagged as N.
Each dataset consists of three files: a tagged training file (e.g., ictrain, entrain), a tagged test file (e.g., ictest, entest), and an untagged raw file (used for reestimation in problem 3).
The file format is quite simple. Each line has a single word/tag pair separated by the / character.
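For example, a minimal reader for this format might look like the following sketch (the function name read_tagged is mine, not part of the assignment):

    def read_tagged(path):
        """Read a tagged file into a list of (word, tag) pairs.
        Splits on the last slash (rsplit), in case a word itself
        contains a slash."""
        pairs = []
        with open(path) as f:
            for line in f:
                line = line.strip()
                if line:
                    word, tag = line.rsplit('/', 1)
                    pairs.append((word, tag))
        return pairs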
Notation: The discussion and figures use the following notation, which you might want to adopt in your program as well: wi and ti denote the i-th word and tag of a file, c(…) denotes a count accumulated from training data, and ptt and ptw denote the tag-bigram (transition) and tag-word (emission) probability estimates.
Spreadsheets: You are strongly encouraged to test your code using the artificial ic dataset. This dataset is small and should run fast. More important, it is designed so you can check your work: when you run the forward-backward algorithm, the initial parameters, intermediate results, and perplexities should all agree exactly with the results on the spreadsheet we used in class.
For this assignment, use the Viterbi version of Jason Eisner's spreadsheet. Excel 2000 and Excel 97 display these spreadsheets correctly. The experiments we did in class are described at http://www.cs.jhu.edu/~jason/papers/eisner.tnlp02.pdf.
Here is a suggested work plan to help you complete this assignment more quickly and easily:
Write a bigram Viterbi tagger that can be run as follows:
$ icvtag ictrain ictest
You may want to review the slides on Hidden Markov Model tagging, and perhaps the textbook exposition in Jurafsky & Martin.
For now, you should use naive unsmoothed estimates (i.e., maximum-likelihood estimates), such as ptt(ti|ti-1) = c(ti-1, ti)/c(ti-1) and ptw(wi|ti) = c(ti, wi)/c(ti).
Your program must print two lines summarizing its performance on the test set, in the following format:
Tagging accuracy: 92.48% (known: 95.99% novel: 56.07%)
Perplexity per tagged test word: 1577.499
You are also free to print out whatever other information is useful to you, including the tags your program picks, its accuracy as it goes along, various probabilities, etc. A common trick, to give yourself something to stare at, is to print a period to standard error (not standard output) every so often, so that you can watch your program's progress.
In the required output illustrated above, each accuracy number considers some subset of the test tokens and asks what percentage of them received the correct tag: the overall accuracy considers every test token except the boundary markers,3) the known-word accuracy considers only tokens whose word appeared in training data, and the novel-word accuracy considers only tokens whose word did not. A sketch of this bookkeeping appears below.
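Here is one way to keep those tallies (a sketch only; the function name, the '###' boundary symbol, and the exact exclusion rules are my assumptions based on the discussion above):

    def accuracy_report(words, gold_tags, predicted_tags, train_vocab,
                        boundary='###'):
        """Overall/known/novel tagging accuracy, excluding boundaries."""
        totals = {'all': 0, 'known': 0, 'novel': 0}
        correct = {'all': 0, 'known': 0, 'novel': 0}
        for w, g, p in zip(words, gold_tags, predicted_tags):
            if w == boundary:            # no credit for boundary tokens
                continue
            group = 'known' if w in train_vocab else 'novel'
            for key in ('all', group):
                totals[key] += 1
                correct[key] += (g == p)

        def pct(key):
            return 100.0 * correct[key] / totals[key] if totals[key] else 0.0

        print("Tagging accuracy: %.2f%% (known: %.2f%% novel: %.2f%%)"
              % (pct('all'), pct('known'), pct('novel')))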
The perplexity per tagged test word (e.g., 1577.499) is defined as4)

perplexity = exp(- log p(w1, t1, …, wn, tn | w0, t0) / n)

where t0, t1, t2 … tn is the winning tag sequence that your tagger assigns to test data (with w0/t0 being the boundary word/tag pair that starts the test file).
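If your tagger accumulates the total log-probability of the winning path as it tags, the perplexity then falls out in one line (a minimal sketch; the names are mine):

    import math

    def perplexity_per_word(total_log_prob, n):
        """total_log_prob is the natural log of
        p(w1, t1, ..., wn, tn | w0, t0), i.e., the sum of the log
        transition and emission probabilities along the winning path;
        n excludes the initial boundary token."""
        return math.exp(-total_log_prob / n)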
Here are some suggestions that will make your life easier (read carefully!).
• Make sure you really understand the algorithm before you start coding! Write pseudocode before you write the details. Work out an example on paper if that helps. Play with the spreadsheet. Review the reading or the slides. Read this handout more than once, and ask questions. Coding should be a few straightforward hours of work if you really understand everything and can avoid careless bugs.
Your program should go through the following steps: first read the training file and accumulate the counts you need; then read the test file, find its best tag sequence with the Viterbi algorithm, and report the accuracy and perplexity of that tagging.
• Don't bother to train on each sentence separately, or to tag each sentence separately. Just treat the train file as one long string that happens to contain some sentence-boundary word/tag pairs.
Tagging sentences separately would save you memory, since then you could throw away each sentence (and its tag probabilities and backpointers) when you were done with it. But why bother if you seem to have enough memory? Just pretend it's one long sentence.
FIGURE 2: The Viterbi algorithm: Pseudocode for the Viterbi tagging algorithm. μt(i) is the probability of the best path from the start state (the boundary tag, at time 0) to state t at time i.
• Figure 2 refers to a “tag dictionary” that stores all the possible tags for each word. As long as you only use the artificial ic dataset, the tag dictionary is trivial (every ordinary word can take either tag C or H, while the boundary word takes only the boundary tag), so you can postpone building a real dictionary until the next problem. A log-space implementation sketch of Figure 2 follows.
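Here is a minimal log-space rendering of Figure 2 (a sketch, not a reference implementation: the helpers log_ptt, log_ptw, and tag_dict, and the '###' boundary symbol, are my assumptions; it also assumes the input starts and ends with the boundary word, whose only allowed tag is the boundary tag):

    import math

    def viterbi(words, tag_dict, log_ptt, log_ptw, boundary='###'):
        n = len(words) - 1
        # mu[i][t] = log-prob of the best path ending in tag t at time i
        mu = [{boundary: 0.0}]           # time 0: the start state
        backpointer = [{}]
        for i in range(1, n + 1):
            mu.append({})
            backpointer.append({})
            for t in tag_dict(words[i]):
                best_prev, best_score = None, -math.inf
                for prev in mu[i - 1]:
                    score = (mu[i - 1][prev] + log_ptt(prev, t)
                             + log_ptw(t, words[i]))
                    if score > best_score:
                        best_prev, best_score = prev, score
                mu[i][t] = best_score
                backpointer[i][t] = best_prev
        # follow backpointers from the final boundary tag back to time 0
        tags = [boundary]
        for i in range(n, 0, -1):
            tags.append(backpointer[i][tags[-1]])
        tags.reverse()
        return tags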
• Before you start coding, make a list of the data structures you will need to maintain, and choose names for those data structures as well as their access methods. For example, you will have to look up certain values of c(…). So write down, for example, that you will store the count c(ti-1, ti) in a table (say, a dictionary count_tt indexed by the tag bigram).
• You will need some multidimensional tables, indexed by strings and/or integers, to store the training counts and the path probabilities. One possible layout is sketched below.
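For instance, one possible layout uses dictionaries with default values (the names here are illustrative, not prescribed):

    from collections import defaultdict

    count_tt = defaultdict(int)   # count_tt[(prev_tag, tag)] = c(ti-1, ti)
    count_tw = defaultdict(int)   # count_tw[(tag, word)]     = c(t, w)
    count_t  = defaultdict(int)   # count_t[tag]              = c(t)
    count_w  = defaultdict(int)   # count_w[word]             = c(w)

    def train_counts(pairs):
        """Accumulate counts from a list of (word, tag) pairs.
        (Problem 2 refines this slightly: one boundary token should be
        omitted from the unigram counts; see the smoothing section.)"""
        prev_tag = None
        for word, tag in pairs:
            if prev_tag is not None:
                count_tt[(prev_tag, tag)] += 1
            count_tw[(tag, word)] += 1
            count_t[tag] += 1
            count_w[word] += 1
            prev_tag = tag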
• Small probabilities should be stored in memory as log-probabilities. This is actually crucial to prevent underflow, as the snippet below illustrates.
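A two-line experiment shows why: multiplying a thousand ordinary probabilities underflows to zero in floating point, while the equivalent sum of logs stays perfectly representable:

    import math

    p = 0.01
    print(p ** 1000)            # 0.0  -- underflow: the true value is 1e-2000
    print(1000 * math.log(p))   # -4605.17...  -- the log-space equivalent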
Check your work as follows: train and test on the ic dataset, and compare your initial parameters, intermediate probabilities, and perplexity against iteration 0 of the Viterbi version of the spreadsheet.6)7) The best path recovered by your backpointers should match the best path shown there.8)
You will hand in the program icvtag for this problem.
Now, you will improve your tagger so that you can run it on real data. For this, you will copy icvtag from the previous problem to vtag and modify it, such that you will be able to train and test on English POS tagging data, like this:
$ vtag entrain entest
This means using a proper tag dictionary (for speed) and smoothed probabilities (for accuracy).9) Your tagger should beat the following “baseline” result:
Tagging accuracy: 92.48% (known: 95.99% novel: 56.07%)
This baseline result came from a stupid unigram tagger (which just tagged every known word with its most common part of speech from training data, ignoring context, and tagged all novel words with the overall most common tag, N).
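For reference, such a baseline takes only a few lines (a sketch under my assumptions; in particular, tagging novel words with N is a guess at the baseline's rule, not a published detail of it):

    from collections import Counter, defaultdict

    def unigram_baseline(train_pairs, test_words, default_tag='N'):
        """Tag each known word with its most common training tag,
        ignoring context; tag all novel words with default_tag."""
        tag_counts = defaultdict(Counter)
        for word, tag in train_pairs:
            tag_counts[word][tag] += 1
        best = {w: c.most_common(1)[0][0] for w, c in tag_counts.items()}
        return [best.get(w, default_tag) for w in test_words]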
You are required to use a “tag dictionary” — otherwise your tagger will be much too slow. Each word has a list of allowed tags, and you should consider only those tags. That is, don't consider tag sequences that are incompatible with the dictionary, even if they have positive smoothed probability. See the pseudocode in Figure 2.
Derive your tag dictionary from the training data. For a known word, allow only the tags that it appeared with in the training set. For an unknown word, allow all tags except the boundary tag.
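Building such a dictionary is one pass over the training data (a sketch; the '###' boundary symbol is my assumption):

    from collections import defaultdict

    def build_tag_dict(train_pairs, boundary='###'):
        """Known words are allowed only their observed tags; unknown
        words are allowed every tag except the boundary tag."""
        seen = defaultdict(set)
        all_tags = set()
        for word, tag in train_pairs:
            seen[word].add(tag)
            all_tags.add(tag)
        open_tags = all_tags - {boundary}

        def tag_dict(word):
            return seen[word] if word in seen else open_tags

        return tag_dict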
How about smoothing? Just to get the program working on the en dataset, you could use some very simple form of smoothing at first. For example, add-one smoothing on ptw will take care of the novel words in entest, and you could get away with no smoothing at all on ptt.
However, the version of the program that you submit should use the following type of smoothing. It is basically just add-λ smoothing with backoff, but λ is set higher in contexts with a lot of “singletons”—words that have only occurred once—because such contexts are likely to have novel words in test data. This is called “one-count” smoothing.10)
First let us define our backoff estimates. For tag bigrams, back off to the unsmoothed unigram estimate ptt-backoff(ti) = c(ti)/n. Do you see why it's okay to back off to this totally unsmoothed, maximum likelihood estimate?11) (See below for why the denominator is n rather than n+1, even though there are n+1 tokens t0, t1, …, tn.)
• Next, let ptw-backoff(wi) = (c(wi) + 1) / (n + V). This backoff estimate uses add-one smoothing. n and V denote the number of word tokens and types, respectively, that were observed in training data. (In addition, V includes an OOV type. Again, see below for why the token count is taken to be n even though there are n+1 tokens t0, t1, …, tn.) Notice that according to this formula, any novel word has count 0 and backoff probability 1/(n + V). In effect, we are treating all novel words as if they had been replaced in the input by a single special word OOV. That way we can pretend that the vocabulary is limited to exactly V types, one of which is the unobserved OOV.
Now for the smoothed estimates. Define a function sing that counts singletons. Let:
singtt(·|ti-1) = number of tag types t such that c(ti-1, t) = 1
singtw(·|ti) = number of word types w such that c(ti, w) = 1
The smoothed estimates are then add-λ estimates that back off to the definitions above, with λ tied to the singleton counts:
ptt(ti|ti-1) = (c(ti-1, ti) + λ · ptt-backoff(ti)) / (c(ti-1) + λ), where λ = singtt(·|ti-1)
ptw(wi|ti) = (c(ti, wi) + λ · ptw-backoff(wi)) / (c(ti) + λ), where λ = singtw(·|ti)12)
There is an easy way to accumulate these singleton counts during training. Whenever you increment c(ti, wi) or c(ti-1, ti), check whether it is now 1 or 2. If it is now 1, you have just found a new singleton and you should increment the appropriate singleton count. If it is now 2, you have just lost a singleton and you should decrement the appropriate singleton count.
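Putting that bookkeeping together with the smoothed word-emission estimate might look like this (a sketch that assumes my reconstruction of the formulas above, i.e., λ = singtw(·|t) plus a tiny ε; the tag-bigram case is symmetric):

    from collections import defaultdict

    count_tw = defaultdict(int)   # c(t, w)
    count_t  = defaultdict(int)   # c(t)
    sing_tw  = defaultdict(int)   # sing_tw[t] = # word types w with c(t, w) == 1

    def observe_tw(tag, word):
        """Update c(t, w), c(t), and the singleton count incrementally."""
        count_tw[(tag, word)] += 1
        if count_tw[(tag, word)] == 1:
            sing_tw[tag] += 1     # just found a new singleton
        elif count_tw[(tag, word)] == 2:
            sing_tw[tag] -= 1     # just lost a singleton
        count_t[tag] += 1

    def ptw(word, tag, n, V, count_w):
        """One-count smoothed p(word | tag): add-lambda with backoff,
        where lambda is the singleton count (plus epsilon so that 0/0
        cannot occur) and the backoff is the add-one estimate."""
        lam = sing_tw[tag] + 1e-100
        backoff = (count_w.get(word, 0) + 1) / (n + V)
        return (count_tw[(tag, word)] + lam * backoff) / (count_t[tag] + lam)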
• Notice that singtw(·|t) is large for an open-class tag t such as N, since many word types appear only once as nouns. Note that λ will be higher for singtw(·|t) in such open-class contexts, so the model reserves more probability there for novel words, whereas a closed-class tag whose few words all occur frequently gets λ ≈ 0 and admits essentially no novel words.
If one doesn't pay respect to the difference between open and closed classes, then novel words will often get tagged with closed-class tags, which is linguistically implausible.
• If λ=0 because there are no singletons, some probabilities can still work out to 0/0. A trick to avoid this is to add a very small number, such as 1e-100, to λ; then the 0/0 case reduces gracefully to the backoff probability.
Note: The ic dataset happens to have no singletons at all, so you will always back off to the unsmoothed estimate on this dataset. Therefore, your results on the ic dataset should not change from problem 1.
• If you want to be careful and obtain precisely the sample results provided in this assignment, you should ignore the training file's very first or very last token (one of the two boundary word/tag pairs) when accumulating the counts c(t), c(w), and c(t,w), so that those counts cover only n of the n+1 tokens.
Why count this way? Because doing so makes the smoothed (or unsmoothed) probabilities sum to 1 as required.13)
When reestimating counts using raw data in problem 3, you should similarly ignore the initial or final boundary token of the raw file.
Turn in the source code for your smoothing version of vtag.
1) But another factor helps your accuracy measurement: you will also use a smaller-than-usual set of tags. The motivation is speed, but it has the side effect that your tagger won't have to make fine distinctions.
2) Apart from the fact that these numbers are from different machines at different times with different languages, the difference in times between the two Jasons' programs is probably due mostly to the fact that Baldridge's precomputes transition and emission probabilities before tagging, whereas Eisner's computes them on demand. That means probably recomputing many smoothed probabilities, but it is conceptually the most straightforward solution, so you should probably start with that.
3) No one in NLP tries to take credit for tagging the boundary tokens correctly, so they are excluded from the accuracy figures.
4) The wi, ti notation was discussed above and refers here to test data. Why are we computing the perplexity with exp and log base e instead of base 2? It doesn't matter, as the two bases cancel each other out: exp(-(1/n) log_e p) = p^(-1/n) = 2^(-(1/n) log_2 p), so this really is perplexity as we've defined it. Why is the corpus probability in the formula conditioned on w0, t0? Because you knew in advance that the tagged test corpus would start with the boundary word/tag pair.
5) A uniform probability distribution over the 7 possible tagged words (1/C, 1/H, 2/C, 2/H, 3/C, 3/H, and the boundary pair) would give a perplexity of 7 per tagged word.
6) The Viterbi version of the spreadsheet is almost identical to the forward-backward version. However, it substitutes “max” for “+”, so instead of computing the forward probability α, it computes the Viterbi approximation μ.
7) To check your work, you only have to look at iteration 0, at the left of the spreadsheet. But for your interest, the spreadsheet does do reestimation. It is just like the forward-backward spreadsheet, but uses the Viterbi approximation. Interestingly, this approximation prevents it from really learning the pattern in the ice cream data, especially when you start it off with bad parameters. Instead of making gradual adjustments that converge to a good model, it jumps right to a model based on the Viterbi tag sequence. This sequence tends never to change again, so we have convergence to a mediocre model after one iteration. This is not surprising. The forward-backward algorithm is biased toward interpreting the world in terms of its stereotypes and then uses those interpretations to update its stereotypes. But the Viterbi approximation turns it into a blinkered fanatic that is absolutely positive that its stereotypes are correct, and therefore can't learn much from experience.
8) You won't be able to check your backpointers directly. Backpointers would be clumsy to implement in Excel, so to find the best path, the Viterbi spreadsheet instead uses the μ and ν probabilities, which are the Viterbi approximations to the forward and backward probabilities α and β. This trick makes it resemble the original spreadsheet more. But backpointers are conceptually simpler, and in a conventional programming language they are both faster and easier for you to implement.
9) On the ic dataset, you were able to get away without smoothing because you didn't have sparse data. You had actually observed all possible “words” and “tags” in ictrain.
10) Many smoothing methods use the probability of singletons to estimate the probability of novel words, as in Good-Turing smoothing. The “one-count” method is due to Chen and Goodman, who actually give it in a more general form where λ is a linear function of the number of singletons. This allows some smoothing to occur (λ>0) even if there are no singletons (sing=0). Chen and Goodman recommend using held-out data to choose the slope and intercept of the linear function.
11) It's because tags are not observed in the test data, so we can safely treat novel tag unigrams as impossible (probability 0). This just means that we will never guess a tag that we didn't see in training data—which is reasonable. By contrast, it would not be safe to assign 0 probability to novel words, because words are actually observed in the test data: if any novel words showed up there, we'd end up computing p(t,w)=0 probability for every tagging t of the test corpus w. So we will have to smooth ptw-backoff(wi) below; it is only ptt-backoff(ti) that can safely rule out novel events.
12) If you want to be anal-retentive, the ptw routine should specially force λ=0 for the boundary tag, so that the boundary word remains the only word that can ever be emitted by it.
13) The root of the problem is that there are n+1 tagged words but only n tag-tag pairs. Omitting one of the boundaries arranges that Σt c(t), Σw c(w), and Σt,w c(t,w) all equal n, just as Σ c(ti-1, ti) = n already does. To see how this works out in practice, consider the unsmoothed estimate ptt(ti|ti-1) = c(ti-1, ti)/c(ti-1): i can only range from 1 to n in the numerator (i ≠ 0), so for consistency it should only be allowed to range from 1 to n in the denominator c(ti-1) and in the numerator of the backoff estimate c(ti)/n. Hence the stored count c(t) should only count n of the n+1 tags, omitting one of the boundary tags.