CS 224U / LING 188/288, Spring 2014, Stanford

All these problems concern sentiment analysis, but the underlying issues are common wherever one is dealing with naturalistic corpora.

Question 1 [2 points]

Below are the distributions of reviews in corpora derived from Amazon.com. Each text has an associated star rating from 1 to 5 (1 most negative, 5 most positive).

Your task: identify one problem that the nature of these distributions might cause for a classifier predicting the rating attached to a given text. (1-3 sentence response.)

Amazon reviews
1-star   29,642   39,383    2,984    3,973
2-star   32,602   48,455    1,880    4,166
3-star  100,272   90,528    2,646    8,708
4-star  160,817  148,260    4,427   18,960
5-star  204,461  237,839   15,774   43,331
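
To make the shape of such a distribution concrete, it can help to convert raw counts into proportions. The following is a minimal sketch in Python that uses the first column of counts from the table above; the names `counts` and `rating_proportions` are illustrative assumptions, not part of the assignment.

```python
# Counts from the first column of the Amazon table (1-star through 5-star).
counts = {1: 29642, 2: 32602, 3: 100272, 4: 160817, 5: 204461}


def rating_proportions(counts):
    """Return each rating's share of the total number of reviews."""
    total = sum(counts.values())
    return {stars: n / total for stars, n in counts.items()}


proportions = rating_proportions(counts)
for stars in sorted(proportions):
    print(f"{stars}-star: {proportions[stars]:.3f}")
```

The same function can be applied to any other column of the table by swapping in that column's counts.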

CSV version of the above table and the table in question 3

Question 2 [2 points]

One common response to major class imbalances like the above is to artificially balance the training/testing data by sampling an equal number of examples from each category.
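
One way such balancing is sometimes done is by downsampling every category to the size of the smallest one. The sketch below assumes the data is a list of (text, label) pairs; the function name `balance_by_downsampling` and the toy data are hypothetical and chosen only for illustration. Other balancing schemes (for example, oversampling the rare categories) are equally possible.

```python
import random
from collections import defaultdict


def balance_by_downsampling(examples, seed=0):
    """Downsample every class to the size of the smallest class.

    `examples` is a non-empty list of (text, label) pairs. This helper is
    hypothetical and only illustrates the idea of balancing by sampling.
    """
    by_label = defaultdict(list)
    for text, label in examples:
        by_label[label].append((text, label))

    # The rarest class determines how many examples each class keeps.
    smallest = min(len(items) for items in by_label.values())

    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    balanced = []
    for label in sorted(by_label):
        balanced.extend(rng.sample(by_label[label], smallest))
    rng.shuffle(balanced)
    return balanced


# Toy data: five 5-star texts and two 1-star texts.
toy = [(f"review {i}", 5) for i in range(5)]
toy += [(f"review {i}", 1) for i in range(5, 7)]
balanced = balance_by_downsampling(toy)
```

On the toy data, the result keeps two examples per rating, so three of the five 5-star texts are discarded.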

Your task: identify two problems that this might pose for training and/or testing on real data. (2-4 sentence response.)

Question 3 [2 points]

Here is the distribution of reviews in a corpus derived from the website RateBeer. Whereas Amazon.com is an online store, RateBeer is basically a social-networking site on which beer enthusiasts share information (via short reviews and ratings) about beers. The site members vary greatly in their expertise about beer, they interact with each other a lot, they tend to write a lot of reviews, and their tastes evolve over time.

1-star 74,508
2-star 196,397
3-star 722,797
5-star 382,754

Your task: given what you know about Amazon.com and RateBeer.com, offer two conjectures about why the overall rating distributions might be so different (comparing the table in question 1 with the table above). Assume that the differences are not due to problematic sampling from the sites' overall data. (1-3 sentence response.)

Question 4 [2 points]

The file imdb-advadj-with-ratings.csv.zip contains 91,713 adverb–adjective pairs from the short summaries attached to user-supplied movie reviews at IMDB.com. The file can be read with any program that reads tabular data (Excel, OpenOffice, R, SPSS, ...). All values are comma-separated, and no value contains a comma itself.

Here's a sample of the format:

Rating  Polarity  Adverb      Adjective  AdjPolarity
8       Pos       really      knowing    Objectiv
5       Neutral   nearly      identical  Objectiv
3       Neg       amazingly   dull       Negativ
7       Neutral   not         bad        Negativ
8       Pos       especially  great      Positiv
10      Pos       utterly     charming   Objectiv

Your task: The feature AdjPolarity alone does modestly well at predicting Polarity (micro-averaged precision/recall/F1 = 0.45). Propose two additional features that you think can improve performance. Exceptional answers will actually test out the proposed features.
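
For reference, here is a minimal sketch of how a micro-averaged score for the AdjPolarity baseline could be computed. It runs only on the six sample rows shown above, written out comma-separated as the file is described. The mapping from AdjPolarity values to Polarity labels (`ADJ_TO_POLARITY`), the presence of a header row, and the helper name `micro_prf` are assumptions made for illustration; the course's own evaluation may differ.

```python
import csv
import io

# The six sample rows, comma-separated. A header row is assumed.
SAMPLE = """Rating,Polarity,Adverb,Adjective,AdjPolarity
8,Pos,really,knowing,Objectiv
5,Neutral,nearly,identical,Objectiv
3,Neg,amazingly,dull,Negativ
7,Neutral,not,bad,Negativ
8,Pos,especially,great,Positiv
10,Pos,utterly,charming,Objectiv
"""

# Assumed mapping from AdjPolarity values to Polarity labels.
ADJ_TO_POLARITY = {"Positiv": "Pos", "Negativ": "Neg", "Objectiv": "Neutral"}


def micro_prf(gold, predicted):
    """Micro-averaged precision, recall, and F1 over all labels."""
    tp = fp = fn = 0
    for label in set(gold) | set(predicted):
        for g, p in zip(gold, predicted):
            if p == label and g == label:
                tp += 1  # predicted this label, and it was correct
            elif p == label:
                fp += 1  # predicted this label, but gold differs
            elif g == label:
                fn += 1  # gold is this label, but prediction differs
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


rows = list(csv.DictReader(io.StringIO(SAMPLE)))
gold = [row["Polarity"] for row in rows]
predicted = [ADJ_TO_POLARITY[row["AdjPolarity"]] for row in rows]
print(micro_prf(gold, predicted))
```

When every example receives exactly one predicted label, micro-averaged precision, recall, and F1 all coincide with plain accuracy, which is consistent with the single figure being reported for all three. The Rating column is read but never used as a feature here.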

Note: The raw star ratings are off-limits as features! Feel free to ignore them. We kept them in case you are feeling particularly ambitious.

Note: This is a real-world data set. There might be some junk adverb–adjective pairs in the mix, and the file inherits the rating-scale imbalances of the larger corpus (similar to what we saw in question 1).

Question 5 [2 points]

Many algorithms for building large sentiment lexicons classify simple strings, ignoring grammatical information like part of speech as well as contextual information about where and how the strings were used. Thus, they will likely miss the reliable sentiment contrast between the adjective gross (as in yucky) and the noun gross (as in profits), and the fact that sensitive is likely positive when it describes a novelist but negative when it describes a bruise.

Your task: Think up four examples in which grammatical information or contextual information is important for sentiment classification. For each one, include an example sentence that highlights the contrast you identified. If your examples are from a language other than English, please provide English glosses. (Non-English data are encouraged!)