
HW4: HMM Tagging

Part-of-speech tagging with hidden Markov models

In this assignment, you will build a Hidden Markov Model and use it to tag words with their parts of speech. This assignment description is long, but contains lots of specific directions and hints so that you won't get stuck. Don't be intimidated—your program will be a lot shorter than this description! Just read carefully.

In this assignment, you will do supervised learning, estimating the parameters p(tag| previous tag) and p(word|tag) from a training set of already-tagged text. Some smoothing is necessary. You will then evaluate the learned model by finding the Viterbi tagging (i.e., best tag sequence) for some test data and measuring how many tags were correct.

For speed and simplicity, you will use relatively small datasets, and a bigram model instead of a trigram model. You will also ignore the spelling of words (useful for tagging unknown words). All these simplifications hurt accuracy.1) So overall, your percentage of correct tags will be in the low 90's instead of the high 90's that are reported for state-of-the-art POS taggers.


Programming language: You must do this assignment using Python. Jason Eisner's final program (for both this assignment and the next) was about 250 lines in Perl, and it handled problem 2 in about 15 seconds and 22M memory, and problem 1 of the next assignment (four iterations) in about 17.5 minutes and 131M memory. (A simple pruning step cuts the latter time to 6 minutes with virtually no effect on results, but you are not required to implement this.) Jason Baldridge's final program was about 300 lines in Python, and it handled problem 2 in about 4 seconds and 15M memory, and problem 1 of the next assignment (four iterations) in about 7.5 minutes and 100M memory on the UT Computational Linguistics lab computers.2) Note that implementations in compiled languages like Java would probably run far faster than either of these.

How to hand in your work: The procedure will be similar to previous assignments. You must test that your programs run on the lab machines with no problems before submitting them. You may prefer to develop them on the lab machines in the first place, since there are copies of the data already there.

Data: There are three datasets, contained in the archive ALL_DATA_TAGGING.tgz. They are as follows:

  • ic : Ice cream cone sequences with 1-character tags (C , H ). Start with this easy dataset.
  • en : English word sequences with 1-character tags (documented in Figure 1).
  • cz : Czech word sequences with 2-character tags. (If you want to see the accented characters more-or-less correctly, look at the files in Emacs.)

You only need to hand in results on the en dataset. The others are just for your convenience in testing your code, and for the extra credit problem.

FIGURE 1: English part-of-speech tagset.: The tags in the en dataset. These come from longer, more specific tag names stripped down to their first letters. For example, all kinds of nouns (formerly NN , NNS , NNP , NNPS ) are simply tagged as N in this assignment. Using only the first letters reduces the number of tags, speeding things up. (However, it results in a couple of unnatural categories, C and P.)

Tag Description
C Coordinating conjunction or Cardinal number
D Determiner
E Existential there
F Foreign word
I Preposition or subordinating conjunction
J Adjective
L List item marker (a., b., c., …) (rare)
M Modal (could, would, must, can, might …)
N Noun
P Pronoun or Possessive ending ('s) or Predeterminer
R Adverb or Particle
S Symbol, mathematical (rare)
T The word to
U Interjection (rare)
V Verb
W wh-word (question word)
### Boundary between sentences
, Comma
. Period
: Colon, semicolon, or dash
- Parenthesis
' Quotation mark
$ Currency symbol

Each dataset consists of three files:

  • train : tagged data for supervised training (en provides 4,000–100,000 words)
  • test : tagged data for testing (25,000 words for en ); your tagger should ignore the tags in this file except when measuring the accuracy of its tagging
  • raw : untagged data for reestimating parameters (100,000 words for en)

The file format is quite simple. Each line has a single word/tag pair separated by the / character. (In the raw file, only the word appears.) Punctuation marks count as words. The special word ### is used for sentence boundaries, and is always tagged with ###.
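To make the format concrete, here is one way to read such a file in Python (a sketch; the helper name is ours, not required):

```python
def read_tagged(path):
    """Read a train/test file into parallel lists of words and tags.

    Each line holds one word/tag pair; the tag follows the LAST slash,
    so a word that itself contains '/' is still split correctly.
    """
    words, tags = [], []
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            word, _, tag = line.rpartition('/')
            words.append(word)
            tags.append(tag)
    return words, tags
```

(A raw file, which has no tags, would need a separate, even simpler reader.)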

Notation: The discussion and figures use the following notation. You might want to use the same notation in your program.

  • Whichever string is being discussed (whether it is from train , test , or raw ) consists of n+1 words, w0, w1, …, wn.
  • The corresponding tags are t0, t1, …, tn. We have wi/ti = ###/### for i=0 and i=n, and probably also for some other values of i.
  • “tt” names tag-to-tag transition probabilities, as in ptt(ti|ti-1)
  • “tw” names tag-to-word emission probabilities, as in ptw(wi|ti)

Spreadsheets: You are strongly encouraged to test your code using the artificial ic dataset. This dataset is small and should run fast. More important, it is designed so you can check your work: when you run the forward-backward algorithm, the initial parameters, intermediate results, and perplexities should all agree exactly with the results on the spreadsheet we used in class.

For this assignment, use the Viterbi version of Jason Eisner's spreadsheet. Excel 2000 and Excel 97 display these spreadsheets correctly. The experiments we did in class are described at

Suggested Work Plan

Here is a suggested work plan to help you complete this assignment more quickly and easily:

  1. Overview
    1. Read the overview material above.
    2. Briefly look at the data files.
    3. Play with the spreadsheets from class, and study Figure 2. Repeat until you understand the algorithms.
  2. Problem 1
    1. Read problem 1 carefully.
    2. Implement the unsmoothed Viterbi tagger vtag for problem 1. Follow Figure 2 and the implementation suggestions.
    3. Run vtag on the ic (ice cream) dataset. Check that your tagging accuracy and perplexity match the numbers provided. Check your tag sequence against the Viterbi spreadsheet as described.
  3. Problem 2
    1. Read problem 2's smoothing method carefully.
    2. Improve vtag to do smoothing.
    3. Run vtag with smoothing on the en (English) dataset. Create the README requested at the end of problem 2.

Problem 1 [40 points]

Write a bigram Viterbi tagger that can be run as follows:

$ icvtag ictrain ictest

You may want to review the slides on Hidden Markov Model tagging, and perhaps the textbook exposition in Jurafsky & Martin.

For now, you should use naive unsmoothed estimates (i.e., maximum-likelihood estimates).

Your program must print two lines summarizing its performance on the test set, in the following format:

  Tagging accuracy: 92.48%  (known: 95.99% novel: 56.07%)  
Perplexity per tagged test word: 1577.499

You are also free to print out whatever other information is useful to you, including the tags your program picks, its accuracy as it goes along, various probabilities, etc. A common trick, to give yourself something to stare at, is to print a period to standard error (stderr or cerr ) every 100 words.

In the required output illustrated above, each accuracy number considers some subset of the test tokens and asks what percentage of them received the correct tag:

  • The overall accuracy (e.g., 92.48%) considers all word tokens, other than the sentence boundary markers ###.3)
  • The known-word accuracy (e.g., 95.99%) considers only tokens of words (other than ###) that also appeared in train .
  • The novel-word accuracy (e.g., 56.07%) considers only tokens of words that did not also appear in train . (These are very hard to tag, since context is the only clue to the correct tag. But they constitute about 9% of all tokens in entest, so it is important to tag them as accurately as possible.)

The perplexity per tagged test word (e.g., 1577.499) is defined as4)

  exp( −(log p(w1, t1, w2, t2, …, wn, tn | w0, t0)) / n )

where t0, t1, …, tn is the winning tag sequence that your tagger assigns to test data (with t0=tn=w0=wn=###). The perplexity is high because it considers the model's uncertainty about predicting both the word and its tag. (We are not computing perplexity per word, or per tag, but rather per tagged word.)
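Once your tagger has accumulated the total (natural) log-probability of its winning tagged sequence, the perplexity is a one-liner (the function name is ours):

```python
import math

def perplexity_per_tagged_word(total_logprob, n):
    """total_logprob: natural log of p(w1,t1, ..., wn,tn | w0,t0).
    n: number of test tokens after the initial ### (i.e., i = 1..n)."""
    return math.exp(-total_logprob / n)
```

For example, if every one of 4 tagged words had probability 1/8, the perplexity would be exactly 8.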

Here are some suggestions that will make your life easier (read carefully!).

• Make sure you really understand the algorithm before you start coding! Write pseudocode before you write the details. Work out an example on paper if that helps. Play with the spreadsheet. Review the reading or the slides. Read this handout more than once, and ask questions. Coding should be a few straightforward hours of work if you really understand everything and can avoid careless bugs.

Your program should go through the following steps:

  1. Read the train data and store the counts in global tables. (Your functions for computing probabilities on demand, such as ptw, should access these tables. In problem 2, you will modify those functions to do smoothing.)
  2. Read the test data w into memory.
  3. Follow the Viterbi algorithm pseudocode in Figure 2 to find the tag sequence t that maximizes p(t,w).
  4. Compute and print the accuracy and perplexity of the tagging. (You can compute the accuracy at the same time as you extract the tag sequence while following backpointers.)
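Step 3 is where most of the work happens. Here is a minimal Python sketch of the Viterbi loop of Figure 2; tag_dict, log_ptt, and log_ptw are hypothetical callables standing in for your tag dictionary and (log) probability functions:

```python
import math

def viterbi(words, tag_dict, log_ptt, log_ptw):
    """Viterbi decoding over one long sequence that starts and ends with ###.

    log_ptt(t_prev, t) and log_ptw(w, t) are assumed to return natural
    log-probabilities (illustrative names, standing in for your own
    smoothed estimates).
    """
    n = len(words) - 1                    # words[0] == '###'
    mu = [{'###': 0.0}]                   # mu[0][###] = log 1
    backptr = [{}]
    for i in range(1, n + 1):
        mu.append({})
        backptr.append({})
        for t in tag_dict(words[i]):
            best_prev, best_lp = None, -math.inf
            for tp, lp_prev in mu[i - 1].items():
                lp = lp_prev + log_ptt(tp, t) + log_ptw(words[i], t)
                if lp > best_lp:
                    best_prev, best_lp = tp, lp
            mu[i][t] = best_lp
            backptr[i][t] = best_prev
    # follow backpointers from the final ### to recover the best tag sequence
    tags = ['###']
    for i in range(n, 0, -1):
        tags.append(backptr[i][tags[-1]])
    tags.reverse()
    return tags, mu[n]['###']
```

The returned mu[n]['###'] is log p(t,w) for the winning sequence, which is exactly what you need for the perplexity computation.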

• Don't bother to train on each sentence separately, or to tag each sentence separately. Just treat the train file as one long string that happens to contain some ### words. Similarly for the test file.

Tagging sentences separately would save you memory, since then you could throw away each sentence (and its tag probabilities and backpointers) when you were done with it. But why bother if you seem to have enough memory? Just pretend it's one long sentence.

FIGURE 2: The Viterbi algorithm: Pseudocode for the Viterbi tagging algorithm. μt(i) is the probability of the best path from the start state (### at time 0) to state t at time i. In other words, it maximizes p(t1,w1,t2,w2,…,ti,wi|t0,w0) over all possible choices of t1,…,ti such that ti=t.

Viterbi pseudocode

• Figure 2 refers to a “tag dictionary” that stores all the possible tags for each word. As long as you only use the ic dataset, the tag dictionary is so simple that you can specify it directly in the code: tag_dict(###) = {###}, and tag_dict(w) = {C, H} for any other word w. In the next problem, you'll generalize this to derive the tag dictionary from training data.

• Before you start coding, make a list of the data structures you will need to maintain, and choose names for those data structures as well as their access methods. For example, you will have to look up certain values of c(…). So write down, for example, that you will store the count c(ti-1, ti) in a table count_tt whose elements have names like count_tt(”D”,”N”). When you read the training data you will increment these elements.

• You will need some multidimensional tables, indexed by strings and/or integers, to store the training counts and the path probabilities. (E.g., count_tt(”D”,”N”) above, and μD(5) in Figure 2.) There are various easy ways to implement these:

  • a hash table indexed by a single string that happens to have two parts, such as D/N or 5/D. This works well, and is especially memory-efficient since no space is wasted on nonexistent entries.
  • a hash table of arrays. This wastes a little more space.
  • an ordinary multidimensional array (or array of arrays). This means you have to convert strings (words or tags) to integers and use those integers as array indices. But this conversion is a simple matter of lookup in a hash table. (High-speed NLP packages do all their internal processing using integers, converting to and from strings only during I/O.)
  • Warning: You should avoid an array of hash tables or a hash table of hash tables. It is slow and wasteful of memory to have many small hash tables. Better to combine them into one big hash table as described in the first bullet point above.
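In Python, the natural analogue of the first option is a single dict per table, keyed by tuples (all names below are just illustrative):

```python
from collections import defaultdict

# One big table per kind of count -- no space is wasted on
# nonexistent entries, and missing entries default to 0.
count_tt = defaultdict(int)   # count_tt[('D', 'N')] = c(D, N)
count_tw = defaultdict(int)   # count_tw[('N', 'dog')] = c(N, dog)
count_t  = defaultdict(int)   # count_t['N'] = c(N)

def observe(prev_tag, tag, word):
    """Record one training token (a sketch of the counting step)."""
    count_tt[(prev_tag, tag)] += 1
    count_tw[(tag, word)] += 1
    count_t[tag] += 1
```

Tuple keys avoid the string-concatenation step of keys like "D/N" while keeping everything in one hash table, as the warning above recommends.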

• Small probabilities should be stored in memory as log-probabilities. This is actually crucial to prevent underflow.

  • This assignment will talk in terms of probabilities, but when you see something like p := p·q you should implement it as something like lp := lp + log(q), where lp is a variable storing log(p).
  • Tricky bit: If p is 0, what should you store in lp? How can you represent that value in your program? (You are welcome to use any trick or hack that works.)
  • Suggestion: Use natural logarithms (loge) because they are simpler, slightly faster, and less error-prone than log2. (Although it is conventional to report log-probability using log2, you can use whatever representation you like internally, and convert it later with the formula log2 p = (loge p) / (loge 2). Anyway, you are not required to report any log-probabilities for this assignment. See footnote 4 on calculating perplexity from a natural logarithm.)
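In Python, the "p = 0" case can be represented with the IEEE value −inf, which behaves correctly under both addition and max, so no special-casing is needed later:

```python
import math

def log_prob(p):
    """Natural log, with log 0 represented as -inf."""
    return math.log(p) if p > 0 else -math.inf
```

Adding anything (finite) to −inf yields −inf, so an impossible path stays impossible, and max() correctly prefers any possible path over it.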

Check your work as follows. icvtag ictrain ictest should yield a tagging accuracy of 87.88%, with no novel words and a perplexity per tagged word of 3.620.5) You can use the Viterbi version of the spreadsheet to check your μ probabilities and your tagging:6)

ictrain has been designed so that your initial supervised training on it will yield the initial parameters from the spreadsheet (transition and emission probabilities).

ictest has exactly the data from the spreadsheet. Running your Viterbi tagger on these data should produce the same values as the spreadsheet's iteration 0:7)

  • μ probabilities for each day
  • weather tag for each day (shown on the graph)8)
  • perplexity per tagged word: see the upper right corner of the spreadsheet. (There is actually a point where a tie arises: if the tie is broken in one direction you get the above results, and if it is broken the other way, you end up getting one more tag correct, for an accuracy of 90.91%.)

You will hand in the program icvtag for this problem.

Problem 2 [60 points]

Now, you will improve your tagger so that you can run it on real data. For this, copy icvtag from the previous problem to vtag and modify it so that you can train and test on English POS tagging data, like this:

$ vtag entrain entest

This means using a proper tag dictionary (for speed) and smoothed probabilities (for accuracy).9) Your tagger should beat the following “baseline” result:

  Tagging accuracy: 92.48%  (known: 95.99% novel: 56.07%)  
Perplexity per tagged test word: 1577.499

This baseline result came from a stupid unigram tagger (which just tagged every known word with its most common part of speech from training data, ignoring context, and tagged all novel words with N ). This baseline tagger does pretty well because most words are easy to tag. To justify using a bigram tagger, you must show it can do better!

You are required to use a “tag dictionary” — otherwise your tagger will be much too slow. Each word has a list of allowed tags, and you should consider only those tags. That is, don't consider tag sequences that are incompatible with the dictionary, even if they have positive smoothed probability. See the pseudocode in Figure 2.

Derive your tag dictionary from the training data. For a known word, allow only the tags that it appeared with in the training set. For an unknown word, allow all tags except ###. (Hint: During training, before you add an observed tag t to tag_dict(w) (and before incrementing c(t,w)), check whether c(t,w) > 0 already. This lets you avoid adding duplicates.)
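A compact way to build and consult such a dictionary in Python (helper names are illustrative; using sets makes the duplicate check automatic):

```python
from collections import defaultdict

def build_tag_dict(words, tags):
    """Map each training word to the set of tags it appeared with."""
    tag_dict = defaultdict(set)
    for w, t in zip(words, tags):
        tag_dict[w].add(t)
    return tag_dict

def allowed_tags(tag_dict, word, all_tags):
    """Known words: observed tags only; unknown words: everything but ###."""
    if word in tag_dict:
        return tag_dict[word]
    return all_tags - {'###'}
```
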

How about smoothing? Just to get the program working on the en dataset, you could use some very simple form of smoothing at first. For example, add-one smoothing on ptw will take care of the novel words in entest, and you could get away with no smoothing at all on ptt.

However, the version of the program that you submit should use the following type of smoothing. It is basically just add-λ smoothing with backoff, but λ is set higher in contexts with a lot of “singletons”—words that have only occurred once—because such contexts are likely to have novel words in test data. This is called “one-count” smoothing.10)

First let us define our backoff estimates.

• Let:

  ptt-backoff(ti) = c(ti) / n

Do you see why it's okay to back off to this totally unsmoothed, maximum-likelihood estimate?11) (See below for why the denominator is n rather than n+1, even though there are n+1 tokens t0, t1, …, tn.)

• Next, let:

  ptw-backoff(wi) = (c(wi) + 1) / (n + V)

This backoff estimate uses add-one smoothing. n and V denote the number of word tokens and types, respectively, that were observed in training data. (In addition, V includes an OOV type. Again, see below for why the token count is taken to be n even though there are n+1 tokens t0, t1, …, tn.) Notice that according to this formula, any novel word has count 0 and backoff probability 1/(n+V). In effect, we are treating all novel words as if they had been replaced in the input by a single special word OOV. That way we can pretend that the vocabulary is limited to exactly V types, one of which is the unobserved OOV.

Now for the smoothed estimates. Define a function sing that counts singletons. Let:

  sing_tt(ti-1) = number of tag types t such that c(ti-1, t) = 1

  sing_tw(ti) = number of word types w such that c(ti, w) = 1

There is an easy way to accumulate these singleton counts during training. Whenever you increment c(t,w) or c(t′,t), check whether it is now 1 or 2. If it is now 1, you have just found a new singleton and you should increment the appropriate singleton count. If it is now 2, you have just lost a singleton and you should decrement the appropriate singleton count.
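The bookkeeping above can be sketched like this (illustrative names; the tag-tag case is analogous):

```python
from collections import defaultdict

count_tw = defaultdict(int)
sing_tw = defaultdict(int)   # sing_tw[t] = # word types seen exactly once with t

def add_tw(tag, word):
    """Increment c(tag, word), keeping the singleton count in sync."""
    count_tw[(tag, word)] += 1
    if count_tw[(tag, word)] == 1:
        sing_tw[tag] += 1        # gained a singleton
    elif count_tw[(tag, word)] == 2:
        sing_tw[tag] -= 1        # lost a singleton
```
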

• Notice that sing_tw(N) will be high because many nouns appeared only once. This suggests that the class of nouns is open to accepting new members, so it is reasonable to tag new words with N too. By contrast, sing_tw(D) will be 0 or very small because the class of determiners is pretty much closed, suggesting that novel words should not be tagged with D. We will now take advantage of these suggestions.

• Let λ = sing_tt(ti-1) when estimating ptt, and λ = sing_tw(ti) when estimating ptw. Then:

  ptt(ti|ti-1) = (c(ti-1, ti) + λ · ptt-backoff(ti)) / (c(ti-1) + λ)

  ptw(wi|ti) = (c(ti, wi) + λ · ptw-backoff(wi)) / (c(ti) + λ)

Note that λ will be higher when conditioning on N than on D, since sing_tw(N) is larger. Hence ptw(·|N) allows more backoff, other things equal, and so assigns a higher probability to novel words.

If one doesn't pay respect to the difference between open and closed classes, then novel words will often get tagged as D (for example) in order to make neighboring words happy. Such a tagger does worse than the baseline tagger (which simply tags all novel words with the most common singleton tag, N )!

• If λ=0 because there are no singletons, some probabilities can still work out to 0/0. A trick to avoid this is to add a very small number, like 1e-100, to λ before using it. Then in the 0/0 case you will get the backoff probability.12)
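Putting the pieces together, here is a sketch of the smoothed ptw, including the tiny-λ trick (function and argument names are ours; ptt is analogous):

```python
def ptw_smoothed(word, tag, count_tw, count_t, sing_tw, backoff_ptw):
    """One-count smoothing for p(word | tag).

    backoff_ptw(word) stands in for the add-one-smoothed unigram
    (c(word)+1)/(n+V); the counts are the tables built during training.
    """
    lam = sing_tw[tag] + 1e-100   # avoid 0/0 when there are no singletons
    return ((count_tw[(tag, word)] + lam * backoff_ptw(word))
            / (count_t[tag] + lam))
```

With no singletons, lam is negligible and the estimate reduces to the unsmoothed ratio c(t,w)/c(t), except that the 0/0 case falls back to the backoff probability.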

Note: The ic dataset happens to have no singletons at all, so λ will always be (essentially) 0 and your smoothed estimates will reduce to the unsmoothed ones on this dataset. Therefore, your results on the ic dataset should not change from problem 1.

• If you want to be careful and obtain precisely the sample results provided in this assignment, you should ignore the training file's very first or very last ### when you accumulate counts during training. So there are only n word/tag tokens, not n+1. This counting procedure also slightly affects c(t), c(w), and c(t,w).

Why count this way? Because doing so makes the smoothed (or unsmoothed) probabilities sum to 1 as required.13)

When reestimating counts using raw data in problem 3, you should similarly ignore the initial or final ### in the raw data.

Turn in the source code for your smoothing version of vtag. In README give your observations and results, including the output from running vtag entrain entest (or at least the required lines from that output). How much did your tagger improve on the accuracy and perplexity of the baseline tagger?

1) But another factor helps your accuracy measurement: you will also use a smaller-than-usual set of tags. The motivation is speed, but it has the side effect that your tagger won't have to make fine distinctions.
2) Apart from the fact that these numbers are from different machines at different times with different languages, the difference in times between the two Jasons is probably due mostly to the fact that Baldridge's precomputes transition and emission probabilities before tagging, whereas Eisner's computes them on demand. That means probably recomputing many smoothed probabilities, but it is conceptually the most straightforward solution, so you should probably start with that.
3) No one in NLP tries to take credit for tagging ### correctly with ###!
4) The wi,ti notation was discussed above and refers here to test data. Why are we computing the perplexity with exp and log base e instead of base 2? It doesn't matter, as the two bases cancel each other out: exp(−(loge p)/n) = 2^(−(log2 p)/n), so this really is perplexity as we've defined it. Why is the corpus probability in the formula conditioned on w0,t0? Because you knew in advance that the tagged test corpus would start with ###—your model is only predicting the rest of that corpus. (The model has no parameter that would even tell you p(w0,t0). Instead, Figure 2, line 2, explicitly hard-codes your prior knowledge that t0 = ###.)
5) A uniform probability distribution over the 7 possible tagged words (###, 1/C , 1/H , 2/C , 2/H , 3/C , 3/H) would give a perplexity of 7, so 3.620 is an improvement.
6) The Viterbi version of the spreadsheet is almost identical to the forward-backward version. However, it substitutes “max” for ”+”, so instead of computing the forward probability α, it computes the Viterbi approximation μ.
7) To check your work, you only have to look at iteration 0, at the left of the spreadsheet. But for your interest, the spreadsheet does do reestimation. It is just like the forward-backward spreadsheet, but uses the Viterbi approximation. Interestingly, this approximation prevents it from really learning the pattern in the ice cream data, especially when you start it off with bad parameters. Instead of making gradual adjustments that converge to a good model, it jumps right to a model based on the Viterbi tag sequence. This sequence tends never to change again, so we have convergence to a mediocre model after one iteration. This is not surprising. The forward-backward algorithm is biased toward interpreting the world in terms of its stereotypes and then uses those interpretations to update its stereotypes. But the Viterbi approximation turns it into a blinkered fanatic that is absolutely positive that its stereotypes are correct, and therefore can't learn much from experience.
8) You won't be able to check your backpointers directly. Backpointers would be clumsy to implement in Excel, so to find the best path, the Viterbi spreadsheet instead uses μ and Graph probabilities, which are the Viterbi approximations to the forward and backward probabilities α and β. This trick makes it resemble the original spreadsheet more. But backpointers are conceptually simpler, and in a conventional programming language they are both faster and easier for you to implement.
9) On the ic dataset, you were able to get away without smoothing because you didn't have sparse data. You had actually observed all possible “words” and “tags” in ictrain .
10) Many smoothing methods use the probability of singletons to estimate the probability of novel words, as in Good-Turing smoothing. The “one-count” method is due to Chen and Goodman, who actually give it in a more general form where λ is a linear function of the number of singletons. This allows some smoothing to occur (λ>0) even if there are no singletons (sing=0). Chen and Goodman recommend using held-out data to choose the slope and intercept of the linear function.
11) It's because tags are not observed in the test data, so we can safely treat novel tag unigrams as impossible (probability 0). This just means that we will never guess a tag that we didn't see in training data—which is reasonable. By contrast, it would not be safe to assign 0 probability to novel words, because words are actually observed in the test data: if any novel words showed up there, we'd end up computing p(t,w)=0 probability for every tagging t of the test corpus w. So we will have to smooth ptw-backoff(wi) below; it is only ptt-backoff(ti) that can safely rule out novel events.
12) If you want to be anal-retentive, the ptw routine should specially force λ=0 for the tag ### in order to prevent any smoothing of ptw(·|###): it is a fact of nature that ptw(###|###) = 1 exactly.
13) The root of the problem is that there are n+1 tagged words but only n tag-tag pairs. Omitting one of the boundaries arranges that Σt c(t), Σw c(w), and Σt,w c(t,w) all equal n, just as Σt,t′ c(t,t′) does. To see how this works out in practice, consider the unsmoothed estimate ptt(ti|ti-1) = c(ti-1, ti)/c(ti-1): i can only range from 1 to n in the numerator (i≠0), so for consistency it should only be allowed to range from 1 to n in the denominator c(ti-1) and in the numerator of the backoff estimate c(ti)/n. Hence the stored count c(t) should only count n of the n+1 tags, omitting one of the boundary tags ### (it doesn't matter which boundary, since they are identical). Now notice that c(t) is also used in the denominator of ptw(wi|ti); since we have just decided that this count will only consider i from 1 to n, we have to count the numerator c(ti,wi) and the count c(wi) in the backoff estimate in the same way, for consistency.
Jason Baldridge,
Apr 3, 2011, 7:46 AM