
Sentiment analysis


For this homework, you will complete a sentiment analysis system for tweets by implementing features that will be used by supervised machine learners.

To complete the homework, you need to obtain the code and data, nlp-classify. I have made the code, and nearly all of the data, freely available so that others outside the class can use it for self-study, especially to learn about maxent classification with OpenNLP. The recommended way of getting the code and data is through the Mercurial repository:
$ hg clone
You'll need to install Mercurial if you don't have it already. If you haven't used version control systems before, I strongly encourage you to give this a try! If you prefer not to try out Mercurial, you can download the latest version of the code via the tip of the nlp-classify repository. You can see more information and instructions on the download page on Bitbucket for nlp-classify.

You'll also need to grab the Health Care Reform dataset from Blackboard: hcr-train-dev-v0.1.tgz. Copy that file to nlp-classify/data, and do:

$ tar xzf hcr-train-dev-v0.1.tgz

NOTE: you must not distribute the HCR twitter dataset to anyone outside the class -- I plan to make it available for download for others eventually, but cannot do so at this time. If you have any questions, just ask. 

Your submission will be:
  • a file called ANSWERS that is plain text and provides the answers to the written parts of this assignment. The problem descriptions clearly state where a written answer is expected.
  • a zip or tar.gz file containing ANSWERS and your modified code files. The name of the file should be in the format <lastname>_<firstname>. Submit this on Blackboard.
If you have any questions or problems with any of the materials, don't hesitate to ask!

Tip: Look over the entire homework before starting on it. Then read through each problem carefully, in its entirety, before answering questions and doing the implementation. Note that the homework is long, but a lot of that is explanation to help you solve the problems!

Follow the instructions in the README to get your environment set up.

Tip: Check out Bo Pang and Lillian Lee's book: Opinion Mining and Sentiment Analysis (free online!)

Warning: there is a mix of software and resources in this homework, and you should not assume you can use them outside of the academic context. In other words, there are several things here that you cannot use in closed commercial applications. Licenses for each resource are stated in the files themselves, or in the README. The core code is licensed under the GNU General Public License, which means you must follow the rules of that license: in sum, if you use it in any other application, then that application must also be released according to the GPL license.

Finally: if possible, don't print this homework out! Just read it online, which ensures you'll be looking at the latest version of the homework (in case there are any corrections), you can easily cut-and-paste and follow links, and you won't waste paper.

Problem 1 - Use OpenNLP Maxent on the tennis dataset [5 pts]

This problem will get you using the OpenNLP Maxent classifier on the tennis dataset from HW2. The native format of the tennis dataset is ready to use with OpenNLP Maxent.

All of the instructions assume you are running the commands from the top-level directory of classify (e.g. CLASSIFY_DIR if you followed the directions in the README). First, verify that things are working with the naive Bayes implementation provided in the download:

$ ./ -t data/tennis/train -p data/tennis/test -l 5.0 | ./ -g data/tennis/test
Accuracy: 76.92

Next, try it out with maxent:

$ mkdir out
$ classify train data/tennis/train out/tennis_model.txt
Indexing events using cutoff of 1

    Computing event counts...  done. 14 events
    Indexing...  done.
Sorting and merging events... done. Reduced 14 events to 14.
Done indexing.
Incorporating indexed data for training... 
    Number of Event Tokens: 14
        Number of Outcomes: 2
      Number of Predicates: 10
Computing model parameters...
Performing 100 iterations.
  1:  .. loglikelihood=-9.704060527839234    0.35714285714285715
  2:  .. loglikelihood=-7.977369481882935    0.7142857142857143
  3:  .. loglikelihood=-7.274865786406583    0.8571428571428571
  4:  .. loglikelihood=-6.820070735855129    0.8571428571428571
  5:  .. loglikelihood=-6.494901774488233    0.8571428571428571
<more lines>
 75:  .. loglikelihood=-5.081164775456419    0.8571428571428571
 76:  .. loglikelihood=-5.081044303386861    0.8571428571428571
 77:  .. loglikelihood=-5.080931523277981    0.8571428571428571
 78:  .. loglikelihood=-5.080825909041938    0.8571428571428571
 79:  .. loglikelihood=-5.080726973756942    0.8571428571428571

This output shows the log likelihood (log prob) of the training dataset (data/tennis/train) and the accuracy on the training set after every iteration of the parameter setting algorithm. Note that even though it states that it is doing 100 iterations, it stops early if the change in log likelihood is below a threshold.
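The early-stopping behavior can be sketched roughly as follows (a generic sketch, not OpenNLP's actual code; `tol` is a hypothetical name for its internal threshold):

```python
def train(step, max_iters=100, tol=1e-4):
    """Call step() (one parameter update that returns the new log
    likelihood) until the improvement drops below tol or the
    iteration limit is reached."""
    prev = float("-inf")
    ll = prev
    for _ in range(max_iters):
        ll = step()
        if ll - prev < tol:  # change in log likelihood below threshold
            break
        prev = ll
    return ll
```

Since the log likelihood improvements shrink with each iteration, training typically halts well before the iteration limit, as in the run above.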

You can see the resulting model by inspecting out/tennis_model.txt, which contains information about the features and the parameter values that have been learned from the training data. Next, we'll use this model to classify test items.

Tip: You can suppress the output by directing it to /dev/null, e.g.: classify train data/tennis/train out/tennis_model.txt > /dev/null

Classify the items in the test set as follows:

$ classify apply out/tennis_model.txt data/tennis/test
No 0.8436671652082209 Yes 0.15633283479177895
Yes 0.9430013482443135 No 0.05699865175568643
Yes 0.716243615645877 No 0.28375638435412287
No 0.8285289429029252 Yes 0.17147105709707472
Yes 0.7592631514228292 No 0.24073684857717087
No 0.5624507721921893 Yes 0.43754922780781064
Yes 0.5321312571177222 No 0.4678687428822778
Yes 0.9067302278999085 No 0.09326977210009141
Yes 0.5766008802968591 No 0.423399119703141
Yes 0.8365744412930616 No 0.16342555870693856
Yes 0.7788844854214183 No 0.2211155145785817
No 0.6254365317232463 Yes 0.3745634682767537
Yes 0.5215105433689039 No 0.4784894566310961

The output is the same as you saw before with your naive Bayes implementation. That means you can funnel it right over to the scoring program used for HW2:

$ classify apply out/tennis_model.txt data/tennis/test | ./ -g data/tennis/test
Accuracy: 76.92

Like naive Bayes, a maxent model can be smoothed, though the underlying mechanism is quite different from add-lambda smoothing. Instead of adding virtual counts, smoothing in maxent constrains the parameter values, so the parameters don't become as extreme as they would without smoothing. Another way of thinking about it: we don't fully trust the training data, so rather than letting the parameters fit it perfectly, we keep them from moving too far from zero, in accordance with a Gaussian (normal) prior.
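As a rough sketch (not the actual OpenNLP implementation), the smoothed training objective subtracts a Gaussian penalty from the log likelihood, so each parameter w pays a cost of w²/(2σ²):

```python
def penalized_loglik(loglik, params, sigma):
    # Gaussian (L2) prior: each parameter is penalized by w^2 / (2 sigma^2),
    # so a small sigma pulls the parameters strongly toward zero.
    penalty = sum(w * w for w in params) / (2.0 * sigma ** 2)
    return loglik - penalty
```

For example, a parameter of 2.0 costs 2.0 units of log likelihood when sigma is 1.0, but 8.0 units when sigma is 0.5, which is why smaller sigma values smooth more aggressively.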

The smoothing value you can provide to OpenNLP maxent is the standard deviation of that Gaussian distribution, sigma. Larger values for sigma perform less smoothing in that they allow the parameters to be more extreme, while smaller values keep the parameters closer to zero. Here's how you specify it:

$ classify train -sigma 1.0 data/tennis/train out/tennis_model.txt

In the problems for this homework, a suggested set of values to try for sigma is [.01, .1, .5, 1.0, 5.0, 10, 50, 100]. You are of course welcome to explore other values, including others within this range.

Do the following exercises, the main purpose of which is to ensure your setup is working properly.

Part (a) [1 pts]. Written answer. What is the accuracy when you use a sigma value of .1?

Part (b) [2 pts]. Written answer. What is the accuracy when you use naive Bayes with a lambda of 5.0, trained on data/tennis/test and tested on data/tennis/train?

Part (c) [2 pts]. Written answer. What is the accuracy when you use maxent with a sigma of .01, trained on data/tennis/test and tested on data/tennis/train?

Problem 2 - Prepositional Phrase Attachment with maxent [20 pts]

For the tennis dataset, the items are already in the required format, but for the PPA dataset you worked with in HW2, we need to create the events (the features and the outcome) in that same format. Not by accident, the Python script you wrote for HW2 already does the job of extracting features from the PPA dataset into the required format. Copy your solution over to CLASSIFY_DIR now.

Next, have a look at the provided script, which implements several extended features for the PPA task. It is by no means exhaustive, but it performs respectably. Here it is with naive Bayes:

$ ./ -i data/ppa/training -e > out/ppa.train
$ ./ -i data/ppa/devset -e > out/
$ ./ -t out/ppa.train -p out/ -l 1.0 | ./ -g out/
Accuracy: 82.82

Part (a) [5 pts]. Written answer. What is the accuracy on the dev set when you use (a) your features and (b) the provided extended features, with naive Bayes? (Feel free to try different values of lambda.) Briefly describe the differences between the features you used and the provided ones. Is there anything obvious in the different features that could be responsible for any differences in accuracy?

Part (b) [5 pts]. Written answer. Now use maxent to obtain a score on the dev set with all three feature sets: (a) simple features, (b) the extended features, and (c) your extended features. You'll want to find a good sigma for each feature set. Note also that you can increase the number of iterations with the option -maxit, followed by a number ('-maxit 1000' is a good choice). Write down your best score with each feature set, and what your values for sigma and maxit were.

Part (c) [10 pts]. Written answer. Based on the values of lambda, sigma, and maxit that you found to be optimal, run naive Bayes and maxent on the PPA test set. Report the results again for all three feature sets, using a table like the following:

 Model/Features    Parameters    Simple    NLP Extended    Your extended
 Naive Bayes
 Maxent

Did the relative performance of the different models from the dev set to the test set stay the same? Briefly describe any differences you notice.

Tip: What you did for this problem can be used for any kind of standard labeling task you might be interested in. You just need to produce feature files in the formats we did here, and you can then get results back and use them as you like, etc.

Interlude: Introducing the twitter sentiment datasets

We now turn to the sentiment analysis task: predicting the sentiment of tweets. There are two datasets: the Debate08 (Obama-McCain) dataset and the Health Care Reform (HCR) dataset. The Debate08 dataset comes from the following papers:

Tweet the Debates: Understanding Community Annotation of Uncollected Sources. David A. Shamma, Lyndon Kennedy, and Elizabeth F. Churchill. ACM Multimedia, ACM, 2009.

Characterizing Debate Performance via Aggregated Twitter Sentiment. Nicholas A. Diakopoulos and David A. Shamma. CHI 2010, ACM, 2010.

This dataset can be found in data/debate08. It has been split into train/dev/test XML files that you'll be using for the obvious purposes. See the script if you want to see the details of how the raw annotations were processed to create the files you'll be using.

The HCR dataset is currently in development, so it must remain internal to UT Austin and can't be used by others at the moment. The data split you'll be using includes:
  • data/hcr/train.xml: the set of tweets annotated by the Language and Computers class (and reviewed by Jason)
  • data/hcr/dev.xml: the set of tweets annotated by the Natural Language Processing class (and reviewed by Jason)
There is a separate blind test set that we'll be using to score your best configuration (we'll post the scores to the entire class).

The Python script is your entry point for training and evaluating different models in the context of twitter sentiment analysis. The script hides a number of the details of running various models for you, so you don't have to run one command for training, another for applying the model, another for evaluation, and so on. Many of these details are contained in the helper script -- feel free to have a look at that if you are interested.

You will modify the file for your submission, as instructed in the following problems.

You should go ahead and look at these files, and then look at the Tweet object in the code to understand how you can access information about a tweet.

Problem 3 - Sentiment analysis: majority class baseline [5 pts]

One of the most important things to do when working on empirical natural language processing is to compare your results to reasonable baselines to ensure that the effort you are putting into some fancy model is better than a super simple approach. This is the "do the dumb thing first" rule, so let's do that.

If you have training data, then the easiest rule to follow is to find the majority class label and use that for labeling new instances. Using standard Unix tools, you can find this out for data/debate08/train.xml as follows:

$ grep 'label="' data/debate08/train.xml | cut -d ' ' -f4 | sort | uniq -c
    369 label="negative"
    143 label="neutral"
    283 label="positive"
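The same count can be done in plain Python (a generic sketch using collections.Counter, independent of the actual Tweet objects in the code):

```python
from collections import Counter

def majority_label(labels):
    # Return the most frequent label in a list of gold labels.
    return Counter(labels).most_common(1)[0][0]
```

Called on the label counts above (369 negative, 143 neutral, 283 positive), it returns "negative".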

So, negative is the majority label. However, you need to compute this in the Python code based on a list of tweet objects. Look at the method majority_class_baseline in -- the stub implementation returns the label neutral. You can see the result of running with this stub implementation as follows:

$ ./ -t data/debate08/train.xml -e data/debate08/dev.xml -m maj
Polarity Evaluation
17.74 Overall Accuracy
0.00 0.00 0.00 negative
17.74 100.00 30.13 neutral
0.00 0.00 0.00 positive
5.91 33.33 10.04 Average

Subjectivity Evaluation
17.74 Overall Accuracy
17.74 100.00 30.13 neutral
0.00 0.00 0.00 subjective
8.87 50.00 15.06 Average

The output shows evaluation of both full polarity classification (positive, negative, neutral) and subjectivity classification, where subjective is positive and negative and objective is neutral. Right now the detailed results aren't very interesting because we're not predicting more than a single label, in this case neutral, so we'll discuss what these mean more in the next problem. For now, note that the overall accuracy of assigning every tweet in the evaluation set (dev.xml) the label neutral is quite poor. Overall accuracy is computed simply as the number of tweets that received the correct label, divided by the total number of tweets.

Implementation. Fix the majority_class_baseline method so that it computes the majority class label from the tweet set that is given to it as an argument and returns that majority class label. Then run as shown above. Writing. What is the overall accuracy when you do this?

Of course, predicting everything to be one label or another is pretty useless for the kinds of later processing and analysis one might do based on sentiment analysis. That means it is very important not just to consider the overall accuracy, but also to pay attention to the performance for each label. The lexicon based method we consider next can predict any label, and thus lets us start considering label-level performance and looking for models that not only have good overall accuracy, but also do a good job of finding instances of each label.

Problem 4 - Lexicon ratio baseline [10 pts]

Another reasonable baseline is to use a polarity lexicon to find the number of positive tokens and negative tokens and pick the label that has more tokens. The lexicon we'll use is Bing Liu's Opinion Lexicon, which you can find in the following files:
  • data/resources/positive-words.txt
  • data/resources/negative-words.txt
Look at the header of one of those files for a few comments on what is in them. Note that there are misspelled words in these lists, which is intentional.

For this problem, you'll modify the method lexicon_ratio_baseline in The method is currently set up to predict neutral for all tweets. Run it like this:

$ ./ -e data/hcr/dev.xml -m lex

Note that no training material is needed because this is only based on the words that are present and the polarity lexicon entries. The results are the same as the pick-neutral-for-every-item baseline you fixed in the previous problem. Let's fix the lexicon baseline next.

Part (a) [3 pts]. Implementation. Change the lexicon_ratio_baseline method so that it computes the number of positive tokens and negative tokens, using the sets pos_words and neg_words (which have been read in from the Opinion Lexicon). See the instructions in the file. Note that this is all you should do. The code after that handles ties and allows the neutral label to be predicted. Also note that you could surely improve on this method with some more effort, but the point here is to get a reasonable baseline, not make a great lexicon based classifier. After you've done this, your output should look like the following:

$ ./ -e data/debate08/dev.xml -m lex
Polarity Evaluation
37.23 Overall Accuracy
76.43 23.57 36.03 negative
23.31 82.98 36.39 neutral
47.06 36.00 40.79 positive
48.93 47.52 48.21 Average

Subjectivity Evaluation
48.55 Overall Accuracy
23.31 82.98 36.39 neutral
91.81 41.13 56.81 subjective
57.56 62.06 59.72 Average

At this point, let's stop and look at the results in more detail. The overall accuracy is lower than what you should have gotten for the majority class baseline (which should be in the 55-65% range). However, the lexicon ratio method can predict any of the labels, which leads to more interesting patterns. Note the following:
  • Polarity and Subjectivity evaluations are done on exactly the same output. The difference is that for the latter, the labels positive and negative are mapped to subjective. This measures how well we are doing at distinguishing opinion-bearing tweets from tweets that express a statement of fact.
  • P, R and F stand for Precision, Recall and F-score. For example, for the label neutral, the precision is the number of tweets correctly identified as neutral divided by all the tweets that the system classified as neutral. The recall is the number of items correctly identified as neutral divided by all the tweets that the gold standard annotations say are neutral. F-score is the harmonic mean of precision and recall: F = 2*P*R/(P+R).
  • For each evaluation type, an average of P/R/F is provided. Note that the average F-score is not computed as an average of the F-scores for the labels, but is computed from the average P and R values.
  • The values we'll care about the most in final evaluations are the F-score average and the overall accuracy. However, it is important to consider precision and recall individually for different kinds of tasks.
  • Even though the overall accuracy is a lot lower than the majority class baseline, the output is far more meaningful; this shows in the label-level results, and the P/R/F averages, which are much higher than for the majority class baseline.
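The precision/recall/F-score computation described above can be written out directly (a generic sketch; the actual scoring code may differ in details):

```python
def prf(correct, predicted, gold):
    # correct: tweets the system labeled X whose gold label is X
    # predicted: all tweets the system labeled X
    # gold: all tweets whose gold label is X
    p = correct / predicted if predicted else 0.0
    r = correct / gold if gold else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f
```

For example, a system that labels 10 tweets as neutral, 5 of them correctly, out of 20 truly neutral tweets, gets P=.50, R=.25, F=.33 for that label.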
You may have noticed that the tweet content is tokenized on white space. This is sub-optimal. For example, it means that for a tweet like:

RT @hippieprof: RT @loudhearted: RT @quaigee: I find that the ppl who decry ''socialism'' loudest have no idea what the word means. #hcr #p2

the tokens produced are:

['RT', '@hippieprof:', 'RT', '@loudhearted:', 'RT', '@quaigee:', 'I', 'find', 'that', 'the', 'ppl', 'who', 'decry', "''socialism''", 'loudest', 'have', 'no', 'idea', 'what', 'the', 'word', 'means.', '#hcr', '#p2']

Obviously there are some problems here that are going to lead to loss of generalization for features, including tokens like "@hippieprof:", which should be "@hippieprof" and ":", "means." -> "means" ".", etc. Included in the code base is a tokenizer (from Brendan O'Connor's TweetMotif source code) that is tailored for twitter. The tokens it produces for the above example are:

[u'RT', u'@hippieprof', u':', u'RT', u'@loudhearted', u':', u'RT', u'@quaigee', u':', u'I' , u'find', u'that', u'the', u'ppl', u'who', u'decry', u"''", u'socialism', u"''", u'loudest', u'have', u'no', u'idea', u'what', u'the', u'word', u'means', u'.', u'#hcr', u'#p2']

Much better! (Note that the "u" in front of each string indicates it is a Unicode string. This won't cause you any extra work or problems.)
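To see the idea behind such tokenization, here is a minimal punctuation-aware sketch; it is only an illustration, and the tokenizer in the code base handles far more cases (emoticons, URLs, abbreviations, and so on):

```python
import re

# Keep @-mentions and #-hashtags intact, split off punctuation otherwise.
TOKEN_RE = re.compile(r"@\w+|#\w+|\w+|[^\w\s]")

def simple_tokenize(text):
    return TOKEN_RE.findall(text)
```

On the fragment "RT @quaigee: means. #hcr" this yields ['RT', '@quaigee', ':', 'means', '.', '#hcr'], separating the colon and period as the twitter tokenizer does.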

Part (b) [3 pts]. Implementation. Modify the lexicon_ratio_baseline method so that it uses tokens from the twitter tokenizer. This is very simple, but you must figure out how to do it for yourself. Run the lexicon classifier again -- you should see overall accuracy for polarity go to 44.03.

Part (c) [4 pts]. Written answer.  Note that the overall Subjectivity score improved by over 6% over using white-space tokens for the Debate08 data. However, the average Subjectivity P/R values are only a little bit lower with white-space tokens (for HCR, they actually are higher). Explain exactly what is going on here. (In other words, you must demonstrate that you understand what these various numbers actually mean.)

Overall, this is clearly a poor set of results, but that is okay -- it's just a baseline! Let's do better with models acquired from the training data.

Problem 5 - Supervised models for sentiment classification [30 pts]

Now that we have done a couple of simple sanity checks establishing what we should expect to do at least as well as (and hopefully much better than), we can turn to machine learning from labeled training examples. The code is already set up to do classification. If no options are provided, a maxent model is trained and used, with a sigma of 1.0:

$ ./ -t data/debate08/train.xml -e data/debate08/dev.xml 
Polarity Evaluation
55.97 Overall Accuracy
62.22 72.91 67.14 negative
29.57 24.11 26.56 neutral
54.05 40.00 45.98 positive
48.61 45.67 47.10 Average

Subjectivity Evaluation
76.35 Overall Accuracy
29.57 24.11 26.56 neutral
84.26 87.61 85.91 subjective
56.91 55.86 56.38 Average

Already looking much better than the baselines! Try it with naive Bayes, using the option --model=nb (-m nb). For Debate08 with just the unigram features, naive Bayes gets a bit better result than maxent, and for HCR it is a bit worse than maxent. (And these values will change with different smoothing values -- see part (a).)

Tip: you can use the --verbose (-v) flag to see more output.

For this problem, you'll improve the extraction of features and determine a good value for smoothing (for both maxent and naive Bayes). The option --smoothing (-s) allows you to specify this for both model types (the value will be used to set sigma for maxent and lambda for naive Bayes).

Part (a) [3 pts]. Written answer. Find a better smoothing value than the default (1.0) for both maxent and naive Bayes for both the Debate08 and HCR datasets. Write down what your best values are for each model and paste the output for both in your ANSWERS file. You should find a good balance between the overall accuracy and the average F-score. Is either of the model types a clear winner after you've done this optimization?

Part (b) [7 pts]. Implementation. Improve the handling of tokens for creating unigram features. Things you should experiment with are lower casing all tokens, getting the stems, and excluding stop words from being features. (See comments in the code.) You'll want to consider a different smoothing value than what you had before. (Reality check: my scores for this step are in the 65% range for Polarity overall accuracy and 58% for F-score average for maxent, and 63% Polarity overall accuracy and 51% F-score average for naive Bayes.)

Part (c) [20 pts]. Implementation. Go crazy adding more features in the extended features area of the extract_features method. You can specify that these features are to be used with the --extendedfeatures (-x) option. Consider bigrams, trigrams, using the polarity lexicon, regular expressions that detect patterns like "loooove", "LOVE", "*love*", presence of emoticons, etc. Find new smoothing values that improve scores on the development set.

Note: you cannot use the "target" values as features, since these were created by human annotators.

Writing. Describe the features that you used and why, giving examples from the training set. Include the output from your best model.

For comparison, here are my best results on data/debate08/dev.xml:

Polarity Evaluation
70.44 Overall Accuracy
71.93 85.24 78.02 negative
57.02 48.94 52.67 neutral
76.47 52.00 61.90 positive
68.48 62.06 65.11 Average

Subjectivity Evaluation
84.40 Overall Accuracy
57.02 48.94 52.67 neutral
89.32 92.05 90.66 subjective
73.17 70.49 71.81 Average

Hopefully some of you will beat this!

Problem 6 - Two stage classification [10 pts]

You can do two-stage classification by using the --twostage (-w) option. The first stage classifies tweets as subjective (positive/negative) or objective (neutral), and then passes the subjective ones on to a model trained on just positive and negative examples (and which assigns only those labels).
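The pipeline can be sketched as follows (a generic outline, with the two trained models represented as hypothetical callables rather than the actual classes in the code):

```python
def two_stage_classify(tweet, subjectivity_model, polarity_model):
    # Stage 1: subjective vs. neutral (objective).
    if subjectivity_model(tweet) == "neutral":
        return "neutral"
    # Stage 2: positive vs. negative, trained only on subjective examples.
    return polarity_model(tweet)
```

Note that the second-stage model never sees neutral examples, so every tweet it receives gets either positive or negative.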

Written answer. Explore single stage and two stage classification with your extended features with both naive Bayes and maxent, including adjustments to the smoothing values, for both the Debate08 and HCR datasets. Describe what you found -- were there any configurations where the two stage model was better than one stage, or vice versa? Did you find differences between the two datasets with respect to one versus two stage classification? Based on the results for both Polarity and Subjectivity Evaluation, can you describe any general trends with the two strategies?

Note: there is no right answer here. There are reasons to think that a two stage model might work better, but it is an empirical question, and is subject both to the feature set you are using and the datasets being examined.

Problem 7 - Confidence thresholding [5 pts]

Supervised classification algorithms output a probability distribution over the labels they predict, and the probabilities associated with those labels provide information about the confidence of the classification. For example, a classifier might predict the following distributions for two tweets:

Tweet 1:
 Positive Negative Neutral
 .34 .33 .33

Tweet 2:
 Positive Negative Neutral
 .10 .85 .05

For tweet 1, the model is quite uncertain -- it seems to prefer the positive label, but only by a small amount. However, for tweet 2, the classifier is very confident that the label is negative. In general, higher confidence correlates with higher accuracy, which means that one can use confidence thresholds to increase precision, usually at the expense of recall. Whether or not this is desirable may depend a great deal on the application needs. For example, if you are using sentiment analysis to get an aggregate measure of sentiment about some topic, you can probably use a lower threshold (or no threshold at all). But if the goal is to identify individual tweets that express positive opinions for automatic retweeting or display on a company's web page (a bad idea, but it's an example, okay?), then high precision is very important.

You can specify a threshold for polarity classification with the --pthreshold (-p) option to Recall that this is a three way classification, so any value less than or equal to 1/3 is the same as no threshold at all.
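A sketch of how such a threshold can be applied (a hypothetical helper, not the actual implementation behind the option):

```python
def thresholded_label(dist, threshold):
    # dist: mapping from label to probability. Return the most probable
    # label if its probability meets the threshold; otherwise abstain.
    label, prob = max(dist.items(), key=lambda kv: kv[1])
    return label if prob >= threshold else None
```

With a threshold of .5, tweet 1 above would be abstained on (its best label only reaches .34), while tweet 2 would still be labeled negative.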

Part (a) [2 pts]. Written answer. Try different values of the polarity threshold with your best model on Debate08. What is the highest average Polarity precision you can get without allowing the recall on any individual label to drop below 10%? Give the threshold you used to obtain this result and paste the output in your answer.

With two stage classification, another threshold can be used for subjectivity classification. Only the items labeled subjective in the first stage are considered for classification as positive/negative in the second stage, so the conservative thing to do with respect to subjectivity is to pass the buck and send items on to the second stage only when the subjective label is assigned with high confidence. The subjectivity threshold, invoked with the --sthreshold (-r) option, sets that minimum confidence: any item labeled subjective with confidence below the threshold, or simply labeled objective, receives the label neutral. All other tweets are put aside for classification with the second stage polarity classifier.

Note that in two stage classification, the second stage polarity classifier is only choosing between two labels (positive and negative), so the threshold values should be above .5 rather than .33.

Part (b) [3 pts]. Written answer. Try different values of the subjectivity threshold and polarity threshold for the two stage strategy with your best model on Debate08. What is the highest average Polarity precision you can get without allowing the recall on any individual label to drop below 10%? Give the thresholds you used to obtain this result and paste the output in your answer. Did having the two thresholds make it easier or harder to coax this out of the analyzer?

Problem 8 - Reporting and analysis [20 pts]

We'll wrap up with a summary of your best results and a look at the output of your best model.

Part (a) [5 pts] For both data/debate08/dev.xml and data/hcr/dev.xml, state your best score for each model and configuration, including which parameters were involved. Do it as a table:

 Model/Config      Parameters   Polarity Overall Accuracy   Polarity Average F-score   Subjectivity Overall Accuracy   Subjectivity Average F-score
 NBayes 1 stage
 NBayes 2 stage
 Maxent 1 stage
 Maxent 2 stage

For Debate08, run all of the above model/configurations with these same parameters on data/debate08/test.xml and produce a table of those results. Writing. Did the models maintain the same relative performance as they had on the development set? Are the differences in performance about the same, or have they increased?

Writing. For HCR, state your best model, including which model and parameters we should use to calculate your submission's performance on the held-out test set.

Extra credit
[up to 10 extra points]. The submission with the best overall accuracy for Polarity Evaluation gets 5 extra credit points, and the submission with the best average F-score for Polarity Evaluation gets 5 extra credit points.

The option --detailed (-d) outputs the correctly resolved tweets, the incorrectly resolved ones, and the tweets for which the system abstained (e.g. because of thresholds).

Part (b) [10 points] Written answer. Obtain the detailed output for your best system for data/hcr/dev.xml. Look at at least 20 of the incorrect ones and discuss the following:
  1. Which tweets, if any, do you think have the wrong gold label?
  2. Which tweets, if any, are too complex or subtle for the simple positive/negative/neutral distinction? (For example, they are positive toward one thing and negative toward another, even in the same breath.)
  3. Which tweets, if any, have a clear sentiment value, but are linguistically too complex for the kind of classifiers you are using here?
  4. Which tweets, if any, do you think the system should have gotten? Are there any additional features you could easily add that could fix them (provided sufficient training data)?
For each of these, paste the relevant tweets you discuss into your ANSWER file.

Part (c) [5 points] Written answer. Based on your experience creating features with the resources made available to you and having looked at the errors in detail, describe 2-3 additional strategies you think would help in this context, such as other forms of machine learning, additional linguistic processing, etc. Roughly speaking (off the cuff), how much effort do you think it would take to implement these ideas?

Extra credit - Train on noisy emoticon-labeled tweets [up to 30 additional pts]

Look in data/emoticon -- you'll see:
  • happy.txt: 2000 tweets that have a smiley emoticon
  • sad.txt: 2000 tweets that have a frowny emoticon
  • neutral.txt: 2000 tweets that don't have smilies or frownies or certain subjective terms (it's noisy, so it is very much neutral-ISH)
Part 1. Write a script to produce a data/emoticon/train.xml file from the above files, in the format that the classification script expects. All the tweets in happy.txt should be labeled positive, those in sad.txt should be labeled negative, and those in neutral.txt should be labeled, well, neutral. So, this is clearly an attempt to get annotations for free -- and there are indications that it should work; see, e.g., the technical report by Go et al. 2009: Twitter Sentiment Classification using Distant Supervision

Part 2. Use data/emoticon/train.xml as a training source and evaluate on data/debate08/dev.xml and data/hcr/dev.xml. 

Writing. Discuss the results, with reference to both datasets. Does it at least beat a lexicon based classifier? How close does it come to the supervised models? Does the effectiveness of the noisy labels vary with respect to the model? Pay particular attention to the label-level precision and recall values. Are they more balanced or less balanced than the results from models trained on human annotated examples? If there are differences, can you think of reasons why?

Part 3. I haven't actually run the above experiment myself, but I'm willing to bet the results aren't as good as with the models trained on human annotated examples. So, perhaps there is a way to take advantage of both the human annotations and this larger set of noisily labeled examples. Actually, there are many ways of doing this -- here you'll do the very simple strategy of just concatenating the two training sets. You can do this without changing the files by using the --auxtrain (-a) option. As an example, you can evaluate on data/debate08/dev.xml using a model trained on both data/debate08/train.xml and data/hcr/train.xml as follows:

$ ./ -t data/debate08/train.xml -e data/debate08/dev.xml -a data/hcr/train.xml

You'll probably find that you need to adjust the smoothing value to get better results. Try this strategy for both Debate08 and HCR, using data/emoticon/train.xml.

Part 4. Determine the best parameter values and run the evaluation for (a) using just emoticon training, (b) using both human annotated and emoticon training, for both naive Bayes and maxent (optional: try two stage classification too), and fill in a table like 8(a) for both data/debate08/dev.xml and data/hcr/dev.xml. Did you get better results than the best results for 8(a)? Fill in the table for data/debate08/test.xml and discuss what comes of that.

General comment. For this extra credit, things are much more free form -- go crazy and impress me a lot, or do the bare minimum and get a few points. It's up to you. I'm happy to discuss ideas! (For example, you could create your own emoticon set, with more examples, or do a better job at identifying actually neutral items -- you just can't do any hand labeling.)

Make sure to include your script in your submission.

Copyright 2011 Jason Baldridge

The text of this homework is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike License. Attribution may be provided by linking to this original homework.

Please email Jason with suggestions, improvements, extensions and bug fixes.