Data homework 1: Sentiment lexicons and classifiers

Distributed Sep 21; due before class on Sep 28

Problem 1

Many algorithms for building large sentiment lexicons classify simple strings, ignoring grammatical information like part of speech as well as contextual information about where and how the strings were used. Thus, they will likely miss the reliable sentiment contrast between the adjective gross (as in yucky) and the noun gross (as in profits), and the fact that sensitive is likely positive when it describes a novelist but negative when it describes a bruise.

Your task: Find three original examples in which grammatical information or contextual information is important for sentiment classification. For each one, include an example sentence that highlights the contrast you identified. If your examples are from a language other than English, please provide English glosses.

Problem 2

In class, we discussed a simple procedure that begins with seed sets of positive and negative synsets and extends them using WordNet's various relations of similarity and difference. There is in principle no reason to limit oneself to positive and negative; any opposing seed sets should do.

Your task: Think up your own lexical opposition and construct some seed sets (as strings, perhaps with part-of-speech classifications). Go to this Web-based implementation of the propagation algorithm and see how your seed sets behave. Do the results look useful? How many iterations does it take for the two lists to share members? Which WordNet relations seem to lead to the overlap? In your write-up, give your seed sets, articulate the basic opposition you have in mind, and write a few sentences addressing these questions (and perhaps others).
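The propagation idea can be sketched in a few lines. This is only an illustration under stated assumptions: the relation graph below is a tiny hand-built stand-in for WordNet (the words and links are hypothetical), and the update rule (grow each set through similarity links from its own members and through antonym links from the opposing set) is one simple variant, not necessarily the exact procedure from class or the one behind the Web-based implementation.

```python
from typing import Dict, List, Optional, Set, Tuple

# Hypothetical toy relations standing in for WordNet's similarity and antonym links.
SIMILAR: Dict[str, Set[str]] = {
    "hot": {"warm", "scorching"},
    "warm": {"hot", "tepid"},
    "tepid": {"warm", "cool"},
    "cool": {"cold", "tepid"},
    "cold": {"cool", "freezing"},
    "scorching": {"hot"},
    "freezing": {"cold"},
}
ANTONYM: Dict[str, Set[str]] = {"hot": {"cold"}, "cold": {"hot"}}


def step(same: Set[str], other: Set[str]) -> Set[str]:
    """One iteration: add words similar to `same` and antonyms of `other`."""
    grown = set(same)
    for word in same:
        grown |= SIMILAR.get(word, set())
    for word in other:
        grown |= ANTONYM.get(word, set())
    return grown


def propagate(seed_a: Set[str], seed_b: Set[str],
              max_iter: int = 10) -> List[Tuple[Set[str], Set[str]]]:
    """Return the two sets after each iteration (index 0 holds the seeds)."""
    a, b = set(seed_a), set(seed_b)
    history = [(a, b)]
    for _ in range(max_iter):
        # Both updates use the sets from the previous iteration.
        a, b = step(a, b), step(b, a)
        history.append((a, b))
    return history


def first_overlap(history: List[Tuple[Set[str], Set[str]]]) -> Optional[int]:
    """Index of the first iteration at which the two sets share a member."""
    for i, (a, b) in enumerate(history):
        if a & b:
            return i
    return None


history = propagate({"hot"}, {"cold"})
print(first_overlap(history))
```

In this toy graph the two sets first meet through the chain of similarity links that passes through "tepid", which shows how gradable oppositions can leak into each other after only a few iterations.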

Problem 3

Hatzivassiloglou and McKeown (1997) use coordinators and other local morpho-syntactic features to predict the polarity of adjectives. This problem asks you to explore and extend some of their ideas. The data file is imdb-adjcoord.csv (download as a ZIP archive), which contains data on over 100,000 coordinated adjective tokens drawn from user-supplied movie reviews. This file can be read with any program that reads tabular data (Excel, OpenOffice, R, SPSS, ...). The format is as follows (with all values comma-separated and no value containing a comma itself):

Column | Name | Description
1 | PolMatch | 1 if the values in cols. 3 and 8 match, else 0.
2 | Adj1 | The left adjective in the coordination.
3 | Adj1Pol | The polarity of Adj1 according to the Harvard Inquirer.
4-6 | Adj1Prefix, Adj1Stem, Adj1Suffix | A heuristic parse of Adj1. For example, disinterested → (dis, interest, ed) and unhappy → (un, happy, ), where the suffix is empty.
7 | Adj2 | The right adjective in the coordination.
8 | Adj2Pol | The polarity of Adj2 according to the Harvard Inquirer.
9-11 | Adj2Prefix, Adj2Stem, Adj2Suffix | A heuristic parse of Adj2.
12 | BaseMatch | 1 if col. 5 matches col. 10, else 0.
13 | Coord | The coordinator: and, but, nor, or, versus, or yet.
14 | Conj1Neg | 1 if the first conjunct was negated with not or n't, else 0. (It is generally hard to tell whether such negation scopes over the entire conjunction or just the first conjunct.)
15 | Conj2Neg | 1 if the second conjunct was negated with not, else 0.
16 | ClauseType | The clause type of the containing sentence: DEC(larative), INT(errogative), or EXC(lamative), based on the final punctuation tag.
17 | Genre | The genre of the movie under review.
18 | Year | The release year of the movie under review.
19 | Rating | The rating of the review containing this coordination: 1...10.
20 | Helpful | The proportion of registered users who found the review containing this coordination helpful: [0,1].
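For those working in a programming language rather than a spreadsheet, here is a minimal sketch of reading the file and checking the two derived columns against their definitions. It rests on several assumptions: that the CSV has a header row using the column names above, that the numeric columns are always filled in, and that polarity is coded with labels like "pos" and "neg" (placeholders; the actual Harvard Inquirer labels in the file may differ). The two sample rows are invented for illustration and are not taken from the data.

```python
import csv
import io

# Invented sample in the documented 20-column layout (header row is an assumption).
SAMPLE = "\n".join([
    "PolMatch,Adj1,Adj1Pol,Adj1Prefix,Adj1Stem,Adj1Suffix,"
    "Adj2,Adj2Pol,Adj2Prefix,Adj2Stem,Adj2Suffix,BaseMatch,"
    "Coord,Conj1Neg,Conj2Neg,ClauseType,Genre,Year,Rating,Helpful",
    "1,funny,pos,,funny,,charming,pos,,charm,ing,0,and,0,0,DEC,Comedy,1999,8,0.75",
    "0,happy,pos,,happy,,unhappy,neg,un,happy,,1,but,0,0,DEC,Drama,2004,5,0.5",
])

INT_COLS = ("PolMatch", "BaseMatch", "Conj1Neg", "Conj2Neg", "Year", "Rating")


def load_rows(handle):
    """Read rows as dicts, converting the numeric columns."""
    rows = []
    for rec in csv.DictReader(handle):
        for name in INT_COLS:
            rec[name] = int(rec[name])
        rec["Helpful"] = float(rec["Helpful"])
        rows.append(rec)
    return rows


def check_derived(row):
    """True if PolMatch and BaseMatch agree with the columns they summarize."""
    pol_ok = row["PolMatch"] == int(row["Adj1Pol"] == row["Adj2Pol"])
    base_ok = row["BaseMatch"] == int(row["Adj1Stem"] == row["Adj2Stem"])
    return pol_ok and base_ok


rows = load_rows(io.StringIO(SAMPLE))
print(all(check_derived(r) for r in rows))
```

To run this on the real file, replace `io.StringIO(SAMPLE)` with `open("imdb-adjcoord.csv", newline="")`, adjusting for whatever header and polarity coding the file actually uses.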

There are lots of hypotheses one can address using this database. For example, a hypothesis based on Hatzivassiloglou and McKeown is that but is a strong indicator of contrasting sentiment, whereas and is a strong indicator of matching sentiment. The following table of counts from the database supports this view:

Coordinator | Same Polarity | Different Polarity | Total
and | 70,296 (78%) | 19,803 (22%) | 90,099
but | 1,707 (36%) | 2,981 (64%) | 4,688
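A table like this can be produced with a short cross-tabulation of Coord against PolMatch. The sketch below assumes each row is available as a dictionary keyed by the column names above (for instance from `csv.DictReader`); the function names and the three toy rows are illustrative, not part of the assignment materials. The last lines recompute the percentages from the counts reported above.

```python
from collections import Counter


def pct(part, total):
    """Percentage of `part` in `total`, rounded to the nearest integer."""
    return round(100 * part / total)


def polarity_table(rows):
    """Map each coordinator to (same, different, total, percent same)."""
    counts = Counter((r["Coord"], int(r["PolMatch"])) for r in rows)
    table = {}
    for coord in sorted({c for c, _ in counts}):
        same, diff = counts[(coord, 1)], counts[(coord, 0)]
        total = same + diff
        table[coord] = (same, diff, total, pct(same, total))
    return table


# Toy rows for illustration only; the real input is the full data file.
toy = [
    {"Coord": "and", "PolMatch": "1"},
    {"Coord": "and", "PolMatch": "1"},
    {"Coord": "but", "PolMatch": "0"},
]
print(polarity_table(toy))

# Recomputing the percentages from the reported counts.
print(pct(70296, 90099), pct(19803, 90099))
print(pct(1707, 4688), pct(2981, 4688))
```

The same function works for any other grouping: swapping "Coord" for another column such as "ClauseType" gives a starting point for a different hypothesis.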

Your task: Formulate a novel hypothesis and test it against evidence from the database. The above and/but example illustrates what we have in mind. Note: You needn't work to predict the values in PolMatch. Take any perspective on the data that you like.

Problem 4

At the website FMyLife, users supply very short stories meant to highlight moments that "ruined their day". At the website It Made My Day, users supply very short stories meant to highlight "little moments of WIN". Suppose your task were to write a classifier that, for any text T drawn from the union of these two sites' stories, predicted which site T came from.

Your task: Propose three features that you think might be useful for such a classifier.

Important: We assume for this task that the compulsory prefix "Today" and all tokens of "FML" have been removed from the FMyLife texts and that all tokens of "IMMD" have been removed from the It Made My Day texts.
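The preprocessing assumption above can be made concrete with a small sketch. The site labels "fml" and "immd", the function names, and the example stories are all invented for illustration. The unigram counts at the end are only a generic baseline representation; choosing more informative features is the point of the problem.

```python
import re
from collections import Counter


def strip_markers(text, site):
    """Remove the site-specific giveaway tokens described above."""
    if site == "fml":
        # Drop the compulsory "Today" prefix and every token "FML".
        text = re.sub(r"^\s*Today\b[,\s]*", "", text)
        text = re.sub(r"\bFML\b", "", text)
    elif site == "immd":
        text = re.sub(r"\bIMMD\b", "", text)
    # Collapse whitespace left behind by the removals.
    return " ".join(text.split())


def unigram_features(text):
    """Lowercased word counts: a generic baseline feature set."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


# Invented examples, not taken from either site.
a = strip_markers("Today, I dropped my phone. FML", "fml")
b = strip_markers("My neighbor waved at me. IMMD", "immd")
print(a)
print(b)
print(unigram_features(a))
```

After stripping, neither text carries its site's marker, so any classifier has to rely on the content of the story itself.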

Note: The FMyLife site is pretty depressing/shocking to read. Our apologies for that. Not all sentiment can be positive sentiment.