Data homework 2: Sentiment analysis

Distributed Sep 28; due before class on Oct 5

Problem 1

Figures 1 and 2 depict the distribution of some scalar modifiers relative to categories at IMDB and Experience Project. (The leftmost panels show "good" outside the scope of negation.) In both figures, the x-axis gives the categories and the y-axis gives the probability of the category given the word. The vertical bars mark 95% confidence intervals (tiny for IMDB), and the horizontal gray line is the value we would expect if the word were equally frequent in all categories. (For more on the calculations and visualizations, see the Sep 28 handout on classifiers.)
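As a rough illustration of the y-axis values, here is a minimal sketch of one way to estimate the probability of a category given a word. It assumes the estimate is obtained by taking the word's relative frequency within each category and normalizing across categories; the Sep 28 handout is the authority on the actual calculation, and it may differ. The function name, the input format, and the counts are all made up for illustration, and the confidence intervals are left out.

```python
def category_probabilities(word_counts, category_totals):
    """Estimate Pr(category | word) from per-category counts.

    word_counts[c]     -- tokens of the word in category c
    category_totals[c] -- all tokens in category c
    Both dicts are hypothetical inputs; the handout does not fix a format.
    """
    # Relative frequency of the word inside each category, Pr(word | c).
    rel_freq = {c: word_counts.get(c, 0) / category_totals[c]
                for c in category_totals}
    total = sum(rel_freq.values())
    if total == 0:
        raise ValueError("word does not occur in any category")
    # Normalize so the values sum to 1 across categories.
    probs = {c: rf / total for c, rf in rel_freq.items()}
    # Value for a word that is equally frequent in every category
    # (the horizontal gray line, under the assumption above).
    baseline = 1 / len(category_totals)
    return probs, baseline


# Made-up counts for a three-category scale, for illustration only.
probs, baseline = category_probabilities(
    {"low": 10, "mid": 20, "high": 30},
    {"low": 1000, "mid": 1000, "high": 1000},
)
for category, p in probs.items():
    print(f"{category}: {p:.3f} (baseline {baseline:.3f})")
```

Under this assumption, a word whose relative frequency is the same in every category lands exactly on the baseline, regardless of how large each category is.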

Your task: First, describe the basic pattern that you see, noting any linguistically interesting sub-patterns. A lot could be said, but we're imagining one medium-sized paragraph on this. Second, offer a hypothesis about the underlying causes of these distributions. If all is going well, this will tie in with your description. Third, speculate as to what the corresponding negative data might look like for the Experience Project corpus (say, "good" in the scope of negation, "depressing", "bad", and "terrible").

Problem 2

The file imdb-advadj-with-ratings.csv contains 91,713 adverb–adjective pairs from the short summaries attached to user-supplied movie reviews at IMDB.com, along with the associated star rating (1-10 stars). In addition, column 2 (Polarity) collapses the ratings into three categories: for rating R, if R ≥ 8, then Pos; if R ≤ 3, then Neg; otherwise Neutral. Finally, column 5 (AdjPolarity) gives the classification of the adjective according to the Harvard Inquirer: Positiv, Negativ, or Objectiv (if the adjective is listed as neither Positiv nor Negativ). Here's a sample to illustrate the format:

Rating  Polarity  Adverb      Adjective  AdjPolarity
8       Pos       really      knowing    Objectiv
5       Neutral   nearly      identical  Objectiv
3       Neg       amazingly   dull       Negativ
7       Neutral   not         bad        Negativ
8       Pos       especially  great      Positiv
10      Pos       utterly     charming   Objectiv
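
The collapsing rule for the Polarity column can be written as a small function. The sketch below checks it against the six sample rows above; the name `collapse_rating` is made up for illustration and is not part of the data file.

```python
def collapse_rating(rating: int) -> str:
    """Map a 1-10 star rating to Pos, Neg, or Neutral."""
    if rating >= 8:
        return "Pos"
    if rating <= 3:
        return "Neg"
    return "Neutral"


# The six sample rows: (Rating, Polarity, Adverb, Adjective, AdjPolarity).
SAMPLE_ROWS = [
    (8, "Pos", "really", "knowing", "Objectiv"),
    (5, "Neutral", "nearly", "identical", "Objectiv"),
    (3, "Neg", "amazingly", "dull", "Negativ"),
    (7, "Neutral", "not", "bad", "Negativ"),
    (8, "Pos", "especially", "great", "Positiv"),
    (10, "Pos", "utterly", "charming", "Objectiv"),
]

# Rows where the rule disagrees with the Polarity column (none expected).
mismatches = [row for row in SAMPLE_ROWS if collapse_rating(row[0]) != row[1]]
print("mismatches:", mismatches)
```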

Your task: The feature AdjPolarity alone does modestly well at predicting Polarity (micro-averaged precision/recall/F1 = 0.45). Propose at least two additional features that you think can improve performance. Exceptional answers will actually test out the proposed features.
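
For anyone who wants to test features, here is a minimal sketch of a micro-averaged precision/recall/F1 computation. It is run on a stand-in baseline that maps AdjPolarity directly to a Polarity label, using only the six sample rows from the format illustration. The mapping in `ADJ_TO_POLARITY` and the function name `micro_prf` are assumptions for illustration. We are not claiming this is how the 0.45 figure was obtained, and a score on six rows says nothing about the full file.

```python
# Assumed direct mapping from Inquirer class to Polarity label.
ADJ_TO_POLARITY = {"Positiv": "Pos", "Negativ": "Neg", "Objectiv": "Neutral"}


def micro_prf(gold, pred):
    """Micro-averaged precision, recall, and F1 over all labels."""
    tp = fp = fn = 0
    for label in set(gold) | set(pred):
        for g, p in zip(gold, pred):
            if p == label and g == label:
                tp += 1      # predicted this label, correctly
            elif p == label:
                fp += 1      # predicted this label, wrongly
            elif g == label:
                fn += 1      # missed this label
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# (Polarity, AdjPolarity) pairs from the six sample rows.
sample = [
    ("Pos", "Objectiv"),
    ("Neutral", "Objectiv"),
    ("Neg", "Negativ"),
    ("Neutral", "Negativ"),
    ("Pos", "Positiv"),
    ("Pos", "Objectiv"),
]
gold = [polarity for polarity, _ in sample]
pred = [ADJ_TO_POLARITY[adj] for _, adj in sample]
precision, recall, f1 = micro_prf(gold, pred)
print(f"P={precision:.2f} R={recall:.2f} F1={f1:.2f}")
```

When every item gets exactly one predicted label, each error counts once as a false positive and once as a false negative, so micro-averaged precision, recall, and F1 all equal plain accuracy. That is why a single number can be reported for all three.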

Note: The raw star ratings are off-limits as features! Feel free to ignore them. We kept them in case you are feeling particularly ambitious.

Note: This is a real-world data set. There might be some junk adverb–adjective pairs in the mix, and the file inherits the rating-scale imbalances of the larger corpus.