This assignment is based on Jason Eisner's language modeling homework. Many thanks to Jason Eisner for making this and other materials for teaching NLP available!
The first thing to do is to download the PDF of the homework (which is the same as for HW1) and the code and data tarball from Blackboard. Note that the PDF discusses directories on machines at Johns Hopkins University -- everything you need is in the tarball, and it should be obvious how to find the files you need for the problems. However, if you have any doubts or questions, don't hesitate to ask the instructor or the TA.
Then, do problems 6, 7, 8, 9, 10, and 13. Note: problem 10c is not extra credit, so you must do it to obtain all points for that problem. Likewise, for problem 13, the part labeled "extra credit" is not actually extra credit -- you must do the entire problem, including that part, to obtain full points.
Point values for the problems are as follows:
In answering this, you should consider providing evidence based on the files themselves. Here are some handy Unix commands that will help you do that.
Convert all spaces to newlines so that you get one word per line, then sort and deduplicate the lines to get just the types:
$ tr -s ' ' '\n' < /groups/classes/nlp/ngrams/All_Training/switchboard-small | sort -u
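The same pipeline also gives you token and type counts if you append wc -l. Here is a sketch using a tiny made-up corpus file in place of the switchboard-small path above:

```shell
# Hypothetical stand-in corpus; in the homework, use the training file from the tarball.
printf 'yeah uh i mean yeah uh huh\n' > corpus.txt

# Token count: one word per line, then count lines (7 tokens here).
tr -s ' ' '\n' < corpus.txt | wc -l

# Type count: sort -u collapses repeated words first (5 types here).
tr -s ' ' '\n' < corpus.txt | sort -u | wc -l
```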
Remember that to redirect output to a file, you can use
$ cat /groups/classes/nlp/ngrams/All_Training/switchboard-small > mycopy_of_switchboard_small
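Combining redirection with the pipeline above, you can save the type list to a file and inspect it later. A sketch with made-up file names:

```shell
# Hypothetical input file; substitute the corpus file from the tarball.
printf 'a b a c\n' > corpus.txt

# Redirect the pipeline's output into a file containing one type per line.
tr -s ' ' '\n' < corpus.txt | sort -u > types.txt

# Count the types in the saved file (3 here: a, b, c).
wc -l < types.txt
```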
Find the intersection of two files (for this to be helpful, each file must contain one word per line, and both files must be sorted, since comm requires sorted input):
$ comm -12 file1 file2
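For instance, with two small hand-made word lists (both already sorted, as comm requires):

```shell
# Two hypothetical sorted word lists.
printf 'apple\nbanana\ncherry\n' > file1
printf 'banana\ncherry\ndate\n' > file2

# -1 suppresses lines unique to file1, -2 suppresses lines unique to file2,
# so -12 prints only the words common to both: banana and cherry.
comm -12 file1 file2
```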
You might have a look at some basic tips on graphing such data with gnuplot and xgraph. If you want more power (and don't mind learning a new, slightly weird programming language), check out R. Lots of information and examples are available on the site of the Analyzing Linguistic Data class that Jason Baldridge and Katrin Erk taught in 2009.
Alternatively, you could plot your data in OpenOffice, Google Spreadsheets, or Microsoft Excel.
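Whichever plotting tool you pick, a quick way to produce plottable count data is the classic count-and-sort pipeline. A sketch with a made-up corpus file:

```shell
# Hypothetical corpus; substitute the training file from the tarball.
printf 'a b a c a b\n' > corpus.txt

# One word per line, sorted so uniq -c can count repeated runs,
# then re-sorted by descending frequency: a "count word" table
# you can feed to gnuplot, R, or a spreadsheet.
tr -s ' ' '\n' < corpus.txt | sort | uniq -c | sort -rn > freqs.txt
cat freqs.txt
```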