## Part-of-speech tagging with hidden Markov models

- Author: Jason Eisner, Johns Hopkins University
- Adapted by: Jason Baldridge, UT Austin
- See the original homework (this assignment is the supervised training portion of that homework, problems 1 and 2)
In this assignment, you will build a Hidden Markov Model and use it to tag words with their parts of speech. This assignment description is long, but contains lots of specific directions and hints so that you won't get stuck. Don't be intimidated—your program will be a lot shorter than this description! Just read carefully.
In this assignment, you will do supervised training only (problems 1 and 2 of the original homework). For speed and simplicity, you will use relatively small datasets, and a bigram model instead of a trigram model. You will also ignore the spelling of words (which would be useful for tagging unknown words). All these simplifications hurt accuracy.

## Preliminaries
- **ic**: Ice cream cone sequences with 1-character tags (`C`, `H`). Start with this easy dataset.
- **en**: English word sequences with 1-character tags (documented in Figure 1).
- **cz**: Czech word sequences with 2-character tags. (If you want to see the accented characters more-or-less correctly, look at the files in Emacs.)
You only need to hand in results on the **ic** and **en** datasets.
Each dataset consists of three files:

- **train**: tagged data for supervised training (**en** provides 4,000–100,000 words)
- **test**: tagged data for testing (25,000 words for **en**); your tagger should ignore the tags in this file except when measuring the accuracy of its tagging
- **raw**: untagged data for reestimating parameters (100,000 words for **en**)
The file format is quite simple. Each line has a single word/tag pair, with the word and its tag separated by the `/` character.
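As a concrete illustration, a minimal Python reader for this format might look like the following sketch (the function names are my own; it assumes the *last* slash on a line separates word from tag, so that words containing `/` still parse):

```python
def parse_pair(line):
    """Split a 'word/tag' line at the LAST slash, so words that
    themselves contain '/' are still handled (an assumption about
    the format, not stated in the handout)."""
    word, _, tag = line.rstrip("\n").rpartition("/")
    return word, tag

def read_tagged(path):
    """Read a train/test file into a list of (word, tag) pairs,
    skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        return [parse_pair(line) for line in f if line.strip()]
```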
- Whichever string is being discussed (whether it is from **train**, **test**, or **raw**) consists of *n+1* words, *w*_{0}, *w*_{1} … *w*_{n}.
- The corresponding tags are *t*_{0}, *t*_{1} … *t*_{n}. We have *w*_{i}/*t*_{i} = `###`/`###` for *i*=0, for *i*=*n*, and probably also for some other values of *i*.
- “tt” names *t*ag-to-*t*ag *transition* probabilities, as in *p*_{tt}(*t*_{i} | *t*_{i-1}).
- “tw” names *t*ag-to-*w*ord *emission* probabilities, as in *p*_{tw}(*w*_{i} | *t*_{i}).
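For instance, the unsmoothed (maximum-likelihood) versions of these two probabilities can be computed from count tables, as in the sketch below (the table and function names are my own, and it glosses over the boundary-counting subtlety discussed in footnote 13):

```python
from collections import Counter

c_tt = Counter()   # c(t_{i-1}, t_i): tag-bigram counts
c_tw = Counter()   # c(t_i, w_i):     tag/word counts
c_t  = Counter()   # c(t):            tag unigram counts

def train(tagged):
    """tagged: list of (word, tag) pairs for one long training sequence."""
    for (_, t_prev), (_, t) in zip(tagged, tagged[1:]):
        c_tt[t_prev, t] += 1
    for w, t in tagged:
        c_tw[t, w] += 1
        c_t[t] += 1

def p_tt(t, t_prev):   # p_tt(t_i | t_{i-1}), maximum likelihood
    return c_tt[t_prev, t] / c_t[t_prev]

def p_tw(w, t):        # p_tw(w_i | t_i), maximum likelihood
    return c_tw[t, w] / c_t[t]
```

In problem 2 these two functions are exactly the ones you would modify to do smoothing.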
For this assignment, use the Viterbi version of Jason Eisner's spreadsheet. Excel 2000 and Excel 97 display these spreadsheets correctly. The experiments we did in class are described at http://www.cs.jhu.edu/~jason/papers/eisner.tnlp02.pdf.

## Suggested Work Plan

Here is a suggested work plan to help you complete this assignment more quickly and easily:

- Overview
- Read the overview material above.
- Briefly look at the data files.
- Play with the spreadsheets from class, and study Figure 2. Repeat until you understand the algorithms.
- Problem 1
- Read problem 1 carefully.
- Implement the unsmoothed Viterbi tagger `icvtag` for problem 1. Follow Figure 2 and the implementation suggestions.
- Run `icvtag` on the **ic** (ice cream) dataset. Check that your tagging accuracy and perplexity match the numbers provided. Check your tag sequence against the Viterbi spreadsheet as described.
- Problem 2
- Read problem 2's smoothing method carefully.
- Improve your tagger to do smoothing, producing `vtag`.
- Run `vtag` with smoothing on the **en** (English) dataset. Create the README requested at the end of problem 2.
## Problem 1 [40 points]

Write a bigram Viterbi tagger that can be run as follows:

    $ icvtag ictrain ictest

You may want to review the slides on Hidden Markov Model tagging, and perhaps the textbook exposition in Jurafsky & Martin. For now, you should use naive unsmoothed estimates (i.e., maximum-likelihood estimates).
Your program must print two lines summarizing its performance on the **test** data, in the following format:

    Tagging accuracy: 92.48% (known: 95.99% novel: 56.07%)
    Perplexity per tagged test word: 1577.499
You are also free to print out whatever other information is useful to
you, including the tags your program picks, its accuracy as it goes
along, various probabilities, etc. A common trick, to give yourself
something to stare at, is to print a period to standard error (not standard output) as each word is tagged.
In the required output illustrated above, each accuracy number considers some subset of the word tokens in **test**:

- The overall accuracy (e.g., 92.48%) considers all word tokens, other than the sentence boundary markers `###`.^{3)}
- The known-word accuracy (e.g., 95.99%) considers only tokens of words (other than `###`) that also appeared in **train**.
- The novel-word accuracy (e.g., 56.07%) considers only tokens of words that did *not* also appear in **train**. (These are very hard to tag, since context is the only clue to the correct tag. But they constitute about 9% of all tokens in `entest`, so it is important to tag them as accurately as possible.)
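These three numbers can be computed in one pass over the test tokens. A minimal sketch (the function name is my own):

```python
def tagging_accuracies(gold, pred, train_vocab):
    """gold, pred: parallel lists of (word, tag) pairs for the test data;
    train_vocab: set of words seen in train.
    Returns (overall, known, novel) accuracies, skipping ### boundaries."""
    totals = {"all": 0, "known": 0, "novel": 0}
    correct = {"all": 0, "known": 0, "novel": 0}
    for (w, t_gold), (_, t_pred) in zip(gold, pred):
        if w == "###":
            continue  # no credit for tagging ### as ###
        for group in ("all", "known" if w in train_vocab else "novel"):
            totals[group] += 1
            correct[group] += (t_gold == t_pred)
    return tuple(correct[g] / totals[g] if totals[g] else 0.0
                 for g in ("all", "known", "novel"))
```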
The perplexity per tagged test word (e.g., 1577.499) is defined as exp(−(log *p*(*w*_{1}, *t*_{1}, … *w*_{n}, *t*_{n} | *w*_{0}, *t*_{0})) / *n*),
where *n* is the number of tagged test words following the initial `###`.^{4)}

Here are some suggestions that will make your life easier (read carefully!).

• Make sure you really understand the algorithm before you start coding! Write pseudocode before you write the details. Work out an example on paper if that helps. Play with the spreadsheet. Review the reading or the slides. Read this handout more than once, and ask questions. Coding should be a few straightforward hours of work if you really understand everything and can avoid careless bugs.

Your program should go through the following steps:

- Read the **train** data and store the counts in global tables. (Your functions for computing probabilities on demand, such as *p*_{tw}, should access these tables. In problem 2, you will modify those functions to do smoothing.)
- Read the **test** data **w** into memory.
- Follow the Viterbi algorithm pseudocode in Figure 2 to find the tag sequence **t** that maximizes *p*(**t**, **w**).
- Compute and print the accuracy and perplexity of the tagging. (You can compute the accuracy at the same time as you extract the tag sequence while following backpointers.)
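The perplexity per tagged test word then falls straight out of the total natural log-probability of the test corpus. A minimal sketch (the function name is my own):

```python
import math

def perplexity_per_tagged_word(log_prob, n):
    """log_prob: natural log of p(w_1,t_1, ..., w_n,t_n | w_0,t_0);
    n: number of tagged test words after the initial ###."""
    return math.exp(-log_prob / n)
```

As a sanity check, a uniform model that gives every tagged word probability 1/7 should come out with perplexity exactly 7 (compare footnote 5).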
• Don't bother to train on each sentence separately, or to tag each sentence separately. Just treat the entire **test** file as one long multi-sentence sequence that happens to contain some `###` words. Tagging sentences separately would save you memory, since then you could throw away each sentence (and its tag probabilities and backpointers) when you were done with it. But why bother if you seem to have enough memory? Just pretend it's one long sentence.
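The Viterbi pseudocode of Figure 2, run over the whole file as one long sequence and using backpointers, might be sketched in Python as follows (this is my own sketch, working in log space; `tag_dict`, `p_tt`, and `p_tw` are assumed to be supplied as described in the text):

```python
import math

def viterbi(words, tag_dict, p_tt, p_tw):
    """Find the tag sequence t maximizing p(t, w), where words starts and
    ends with '###'. tag_dict maps each word to its candidate tags;
    p_tt(t, t_prev) and p_tw(w, t) are probability functions."""
    n = len(words) - 1
    mu = {(0, "###"): 0.0}          # log mu; t_0 = ### is hard-coded
    backpointer = {}
    for i in range(1, n + 1):
        for t in tag_dict[words[i]]:
            best = -math.inf
            for t_prev in tag_dict[words[i - 1]]:
                if (i - 1, t_prev) not in mu:
                    continue        # no path reaches t_prev at time i-1
                p = p_tt(t, t_prev) * p_tw(words[i], t)
                if p == 0.0:
                    continue        # log(0): path is impossible
                score = mu[i - 1, t_prev] + math.log(p)
                if score > best:
                    best, backpointer[i, t] = score, t_prev
            if best > -math.inf:
                mu[i, t] = best
    # Follow backpointers from t_n = ### to recover the best sequence.
    tags = ["###"]
    for i in range(n, 0, -1):
        tags.append(backpointer[i, tags[-1]])
    tags.reverse()
    return tags
```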
• Figure 2 refers to a “tag dictionary” that stores all the possible tags for each word. As long as you are only working with the tiny **ic** dataset, a trivial tag dictionary suffices: allow the tag `###` for the word `###`, and the tags `C` and `H` for every other word. A real tag dictionary becomes necessary for speed in problem 2.
• Before you start coding, make a list of the data structures you will need to maintain, and choose names for those data structures as well as their access methods. For example, you will have to look up certain counts and probabilities, such as *c*(*t*_{i-1}, *t*_{i}) and *p*_{tw}(*w*_{i} | *t*_{i}).
• You will need some multidimensional tables, indexed by strings and/or integers, to store the training counts and the path probabilities. For example, a table of tag-bigram counts could be stored as:

- a hash table indexed by a single string that happens to have two parts, such as `D/N` or `5/D`. This works well, and is especially memory-efficient since no space is wasted on nonexistent entries.
- a hash table of arrays. This wastes a little more space.
- an ordinary multidimensional array (or array of arrays). This means you have to convert strings (words or tags) to integers and use those integers as array indices. But this conversion is a simple matter of lookup in a hash table. (High-speed NLP packages do all their internal processing using integers, converting to and from strings only during I/O.)

*Warning*: You should **avoid** an array of hash tables or a hash table of hash tables. It is slow and wasteful of memory to have many small hash tables. Better to combine them into one big hash table as described in the first bullet point above.
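In Python, for instance, the “one big hash table” option is natural with tuple keys, which play the role of a combined string like `D/N` (a sketch; the table name is my own):

```python
from collections import Counter

# One flat table keyed by a (tag, tag) tuple -- the analogue of
# indexing a single hash table by a two-part string like "D/N".
c_tt = Counter()

def count_bigram(t_prev, t):
    c_tt[t_prev, t] += 1   # same as c_tt[(t_prev, t)] += 1

count_bigram("D", "N")
count_bigram("D", "N")
count_bigram("###", "D")
```

No space is wasted on nonexistent entries, and a missing entry simply reads as a count of 0.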
• Small probabilities should be stored in memory as log-probabilities. This is actually crucial to prevent underflow.

- This assignment will talk in terms of probabilities, but when you see something like *p := p·q* you should implement it as something like *lp := lp* + log(*q*), where *lp* is a variable storing log(*p*). **Tricky bit:** If *p* is 0, what should you store in *lp*? How can you represent that value in your program? (You are welcome to use any trick or hack that works.)
- *Suggestion*: Use natural logarithms (log_{e}) because they are simpler, slightly faster, and less error-prone than log_{2}. (Although it is conventional to *report* log-probability using log_{2}, you can use whatever representation you like internally, and convert it later with the formula log_{2}(*p*) = log_{e}(*p*) / log_{e}(2). Anyway, you are not required to report any log-probabilities for this assignment. See footnote 4 on calculating perplexity from a natural logarithm.)
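One concrete way to handle the tricky bit in Python is to represent log 0 as floating-point negative infinity, which propagates correctly through addition and comparisons (a sketch; the names are my own):

```python
import math

LOG_ZERO = float("-inf")   # lp for p = 0; -inf + anything finite is -inf

def log_mul(lp, q):
    """Implements p := p * q in log space, i.e. lp := lp + log(q)."""
    if q == 0.0:
        return LOG_ZERO
    return lp + math.log(q)
```

Because `-inf` compares as smaller than every finite log-probability, zero-probability paths automatically lose every max comparison in Viterbi.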
Check your work as follows. On the **ic** dataset, compare your output against the Viterbi spreadsheet:

- μ probabilities for each day
- weather tag for each day (shown on the graph)^{8)}
- perplexity per tagged word: see upper right corner of spreadsheet

There is actually a point where a tie arises – if the tie is broken in one direction you get the above results and if it is broken the other way, you end up getting one more tag correct (for an accuracy of 90.91%).
You will hand in the program `icvtag` for this problem.

## Problem 2 [60 points]

Now, you will improve your tagger so that you can run it on real data. For this, you will copy `icvtag` from the previous problem to `vtag` and modify it, such that you will be able to train and test on English POS tagging data, like this:

    $ vtag entrain entest

This means using a proper tag dictionary (for speed) and smoothed probabilities (for accuracy). Here is a baseline result to compare against:

    Tagging accuracy: 92.48% (known: 95.99% novel: 56.07%)
This baseline result came from a stupid unigram tagger (which just tagged every known word with its most common part of speech from training data, ignoring context, and tagged all novel words with the single most common tag overall).

You are required to use a “tag dictionary” — otherwise your tagger will be much too slow. Each word has a list of allowed tags, and you should consider only those tags. That is, don't consider tag sequences that are incompatible with the dictionary, even if they have positive smoothed probability. See the pseudocode in Figure 2.
Derive your tag dictionary from the training data. For a known word, allow only the tags that it appeared with in the training set. For an unknown word, allow all tags except `###`.
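A sketch of such a tag dictionary derived from training data (the names are my own):

```python
from collections import defaultdict

def build_tag_dict(train_pairs):
    """Known words get only the tags they appeared with in training;
    unknown words get every tag except ###."""
    allowed = defaultdict(set)
    all_tags = set()
    for w, t in train_pairs:
        allowed[w].add(t)
        all_tags.add(t)
    open_tags = sorted(all_tags - {"###"})
    def candidates(w):
        return sorted(allowed[w]) if w in allowed else open_tags
    return candidates
```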
How about smoothing? Just to get the program working on the **en** dataset, you could initially use some simple method such as add-λ smoothing with a fixed λ.
However, the version of the program that you submit should use the
following type of smoothing. It is basically just add-λ smoothing with
backoff, but λ is set higher in contexts with a
lot of “singletons”—words that have only occurred once—because such
contexts are likely to have novel words in test data. This is called
“one-count” smoothing. First let us define our backoff estimates.

• Let *p*_{tt-backoff}(*t*_{i}) = *c*(*t*_{i}) / *n*, the unsmoothed unigram estimate of a tag. Do you see why it's okay to back off to this totally unsmoothed, maximum likelihood estimate?^{11)}

• Next, let *p*_{tw-backoff}(*w*_{i}) = (*c*(*w*_{i}) + 1) / (*n* + *V*), where *V* is the number of word types. This backoff estimate uses add-one smoothing.
Now for the smoothed estimates. Define a function *sing*_{tt}(·|*t*_{i-1}) = the number of tag types *t* such that *c*(*t*_{i-1}, *t*) = 1, and similarly *sing*_{tw}(·|*t*_{i}) = the number of word types *w* such that *c*(*t*_{i}, *w*) = 1. These count the singletons in each context. Also define:

- *T* = number of tag types
- *V* = number of word types
There is an easy way to accumulate these singleton counts during training. Whenever you increment a count such as *c*(*t*_{i-1}, *t*_{i}) or *c*(*t*_{i}, *w*_{i}), check its new value: if it has just reached 1, increment the corresponding singleton count; if it has just reached 2, decrement it (that event is no longer a singleton).
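For example, the tag-bigram singleton counts can be maintained incrementally like this (a sketch with my own names; the same pattern works for *c*(*t*_{i}, *w*_{i})):

```python
from collections import Counter

c_tt = Counter()      # c(t_{i-1}, t_i)
sing_tt = Counter()   # number of singleton tag types after each context

def observe_tt(t_prev, t):
    """Increment c(t_prev, t), keeping sing_tt(.|t_prev) up to date:
    a count reaching 1 creates a singleton; reaching 2 destroys one."""
    c_tt[t_prev, t] += 1
    if c_tt[t_prev, t] == 1:
        sing_tt[t_prev] += 1
    elif c_tt[t_prev, t] == 2:
        sing_tt[t_prev] -= 1
```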
• Notice that these backoff estimates ignore the context tag entirely.

• Let λ = *sing*_{tt}(·|*t*_{i-1}) when estimating *p*_{tt}, and λ = *sing*_{tw}(·|*t*_{i}) when estimating *p*_{tw}, and define

- *p*_{tt}(*t*_{i} | *t*_{i-1}) = (*c*(*t*_{i-1}, *t*_{i}) + λ · *p*_{tt-backoff}(*t*_{i})) / (*c*(*t*_{i-1}) + λ)
- *p*_{tw}(*w*_{i} | *t*_{i}) = (*c*(*t*_{i}, *w*_{i}) + λ · *p*_{tw-backoff}(*w*_{i})) / (*c*(*t*_{i}) + λ)

Note that λ will be higher for contexts with many singletons, such as open-class tags that frequently take novel words.
If one doesn't pay respect to the difference between open and closed classes, then novel words will often get tagged incorrectly with closed-class tags.
• If λ=0 because there are no singletons, some probabilities can still work out to 0/0. A trick to avoid this is to add a very small positive number to λ.
• If you want to be careful and obtain precisely the sample results provided in this assignment, you should ignore the training file's very first or very last `###` token when computing the unigram counts *c*(*t*) and *c*(*w*) (it doesn't matter which, since the boundaries are identical; see footnote 13).
Why count this way? Because doing so makes the smoothed (or unsmoothed) probabilities sum to 1 as required.
When reestimating counts using the **raw** data (not required for this portion of the assignment), you should count in the same way.
Turn in the source code for your smoothing version of `vtag`.

FOOTNOTES

^{1)}
But another factor helps your accuracy measurement: you will also use a
smaller-than-usual set of tags. The motivation is speed, but it has the
side effect that your tagger won't have to make fine distinctions.^{2)}
Apart from the fact that these numbers are from different machines at different times with different languages, the difference in times between the two Jasons' taggers is probably due mostly to the fact that Baldridge's precomputes transition and emission probabilities before tagging, whereas Eisner's computes them on demand. That means probably recomputing many smoothed probabilities – but it is conceptually the most straightforward solution, so you should probably start with that.^{3)}
No one in NLP tries to take credit for tagging `###` correctly with `###` !^{4)}
The w_{i},t_{i} notation was discussed above and refers here to test data. Why are we computing the perplexity with exp and log base e instead of base 2? It doesn't matter, as the two bases cancel each other out: e^{−(log_{e} p)/n} = p^{−1/n} = 2^{−(log_{2} p)/n}, so this really is perplexity as we've defined it. Why is the corpus probability in the formula conditioned on w_{0},t_{0}? Because you knew in advance that the tagged test corpus would start with `###` —your model is only predicting the rest of that corpus. (The model has no parameter that would even tell you p(w_{0},t_{0}). Instead, Figure 2, line 2, explicitly hard-codes your prior knowledge that t_{0}=`###`.)^{5)}
A uniform probability distribution over the 7 possible tagged words (`###` , `1/C` , `1/H` , `2/C` , `2/H` , `3/C` , `3/H` ) would give a perplexity of 7, so 3.620 is an improvement.^{6)}
The Viterbi version of the spreadsheet is almost identical to the
forward-backward version. However, it substitutes “max” for ”+”, so
instead of computing the forward probability α, it computes the Viterbi
approximation μ.^{7)}
To check your work, you only have to look at iteration 0, at the left of
the spreadsheet. But for your interest, the spreadsheet does do
reestimation. It is just like the forward-backward spreadsheet, but
uses the Viterbi approximation. Interestingly, this approximation prevents
it from really learning the pattern in the ice cream data, especially
when you start it off with bad parameters. Instead of making gradual
adjustments that converge to a good model, it jumps right to a model
based on the Viterbi tag sequence. This sequence tends never to change
again, so we have convergence to a mediocre model after one iteration.
This is not surprising. The forward-backward algorithm is biased toward
interpreting the world in terms of its stereotypes and then uses those
interpretations to update its stereotypes. But the Viterbi
approximation turns it into a blinkered fanatic that is absolutely
positive that its stereotypes are correct, and therefore can't learn
much from experience.^{8)}
You won't be able to check your backpointers directly. Backpointers
would be clumsy to implement in Excel, so to find the best path, the
Viterbi spreadsheet instead uses μ and
probabilities, which are the Viterbi approximations to the
forward and backward probabilities α and β. This trick makes it resemble
the original spreadsheet more. But backpointers are conceptually
simpler, and in a conventional programming language they are both faster
and easier for you to implement.^{9)}
On the ic dataset, you were able to get away without smoothing because you didn't have sparse data. You had actually observed all possible “words” and “tags” in `ictrain`.^{10)}
Many smoothing methods use the probability of singletons to estimate the
probability of novel words, as in Good-Turing smoothing. The
“one-count” method is due to Chen and Goodman, who actually give it in a
more general form where λ is a linear function of the number of
singletons. This allows some smoothing to occur (λ>0) even if there
are no singletons (sing=0). Chen and Goodman recommend using held-out data to choose the slope and intercept of the linear function.^{11)}
It's because tags are not observed in the test data, so we can safely treat novel tag unigrams as impossible (probability 0). This just means that we will never guess a tag that we didn't see in training data—which is reasonable. By contrast, it would not be safe to assign 0 probability to novel words, because words are actually observed in the test data: if any novel words showed up there, we'd end up computing p(t,w)=0 for every tagging t of the test corpus w. So we have to smooth p_{tw-backoff}(w_{i}); it is only p_{tt-backoff}(t_{i}) that can safely rule out novel events.^{12)}
If you want to be anal-retentive, the p_{tw} routine should specially force λ=0 for the tag `###` in order to prevent any smoothing of p_{tw}(·|`###` ): it is a fact of nature that p_{tw}(`###` |`###` ) = 1 exactly.^{13)}
The root of the problem is that there are n+1 tagged words but only n tag-tag pairs. Omitting one of the boundaries arranges that Σ_{t} c(t), Σ_{w} c(w), and Σ_{t,w} c(t,w) all equal n, just as Σ_{t,t′} c(t,t′) does. To see how this works out in practice, consider the unsmoothed estimate p_{tt}(t_{i}|t_{i-1}) = c(t_{i-1}, t_{i})/c(t_{i-1}): i can only range from 1 to n in the numerator (i≠0), so for consistency it should only be allowed to range from 1 to n in the denominator c(t_{i-1}) and in the numerator of the backoff estimate c(t_{i})/n. Hence the stored count c(t) should only count n of the n+1 tags, omitting one of the boundary tags `###` (it doesn't matter which boundary, since they are identical). Now notice that c(t) is also used in the denominator of p_{tw}(w_{i}|t_{i}); since we have just decided that this count will only consider i from 1 to n, we have to count the numerator c(t_{i},w_{i}) and the numerator of the backoff estimate c(w_{i})/n in the same way, for consistency.
