Developing adjective scales from user-supplied textual metadata

This page provides data and associated documentation for this talk:

Christopher Potts. 2011. Developing adjective scales from user-supplied textual metadata. NSF Workshop on Restructuring Adjectives in WordNet. Arlington, VA, September 30–October 1.

The goal of the talk is to develop and evaluate methods for using naturally occurring metadata (star ratings on service and product reviews) to inform WordNet annotators in constructing modifier scales.

Data

File (zipped CSV file): wn-asr-multicorpus.csv.zip

  Column name Explanation
1 Word In the format WORD/tag where tag is a or r
2 Rating 1..10 for IMDB; 1..5 for the other corpora
3 Category Rating rescaled to the range -0.5..0.5
4 Count Token count for Word in reviews with Rating in Corpus
5 Total Total token count for words in reviews with Rating in Corpus
6 Corpus IMDB, Goodreads, OpenTable, Amazon/Tripadvisor
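
As an illustration of how these columns fit together, the sketch below computes the relative frequency Count / Total of each word at each Category value. It assumes that the unzipped CSV has a header row with exactly the column names listed above; the sample rows are invented for the example and are not taken from the data.

```python
import csv
import io
from collections import defaultdict

# Toy rows in the documented column order. The numbers are invented
# for illustration only; they do not come from the released file.
SAMPLE = """Word,Rating,Category,Count,Total,Corpus
good/a,1,-0.5,20,1000,OpenTable
good/a,5,0.5,90,2000,OpenTable
bad/a,1,-0.5,30,1000,OpenTable
"""

def relative_frequencies(handle):
    """Map (Word, Corpus) to a sorted list of (Category, Count / Total) pairs."""
    out = defaultdict(list)
    for row in csv.DictReader(handle):
        freq = float(row["Count"]) / float(row["Total"])
        out[(row["Word"], row["Corpus"])].append((float(row["Category"]), freq))
    for pairs in out.values():
        pairs.sort()  # order each word's points by Category
    return dict(out)

freqs = relative_frequencies(io.StringIO(SAMPLE))
print(freqs[("good/a", "OpenTable")])
```

To run this on the real data, pass an open file handle for the unzipped CSV in place of the io.StringIO object.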

Single-word assessment values

File (zipped CSV file): wn-asr-multilevel-assess.csv.zip

  Column name Explanation
1 Word In the format WORD/tag where tag is a or r
2-5 fit1.coef1, fit1.coef1.p, fit1.coef2, fit1.coef2.p The linear model coefficients with associated p-values; the fitted values can be obtained with invlogit(fit1.coef1 + fit1.coef2*x)
6-8 fit1.aic, fit1.bic, fit1.loglik Values for assessing the goodness of fit for the linear model (Akaike Information Criterion, Bayesian Information Criterion, Log-Likelihood)
9-14 fit2.coef1, fit2.coef1.p, fit2.coef2, fit2.coef2.p, fit2.coef3, fit2.coef3.p The quadratic model coefficients with associated p-values; the fitted values can be obtained with invlogit(fit2.coef1 + fit2.coef2*x + fit2.coef3*x^2)
15-17 fit2.aic, fit2.bic, fit2.loglik Values for assessing the goodness of fit for the quadratic model (Akaike Information Criterion, Bayesian Information Criterion, Log-Likelihood)
18 Inquirer The Harvard Inquirer classification: Positiv, Negativ, Neutral; NA iff the word is not in the Harvard Inquirer
19 SentiWordNetPositive The SentiWordNet positive score: [0-1] or NA iff the word is not in SentiWordNet
20 SentiWordNetNegative The SentiWordNet negative score: [0-1] or NA iff the word is not in SentiWordNet
21 SentiWordNetPolarity positive if SentiWordNetPositive > SentiWordNetNegative; negative if SentiWordNetPositive < SentiWordNetNegative, else neutral; NA iff the word is not in SentiWordNet
22 MicroWNOpPositive The MicroWNOp positive score: [0-1] or NA iff the word is not in MicroWNOp
23 MicroWNOpNegative The MicroWNOp negative score: [0-1] or NA iff the word is not in MicroWNOp
24 MicroWNOpPolarity positive if MicroWNOpPositive > MicroWNOpNegative; negative if MicroWNOpPositive < MicroWNOpNegative, else neutral; NA iff the word is not in MicroWNOp
25 MqapPolarity positive, negative, or neutral; NA iff the word is not in the MPQA subjectivity lexicon
26 MqapStrength 1 if the strength is weaksubj; 2 if the strength is strongsubj; NA iff the word is not in the MPQA subjectivity lexicon
27 Predicted If Model == Linear, then positive if fit1.coef2 (column 4) is ≥ 0, else negative; if Model == Quadratic, then positive if fit2.coef2 (column 11) is ≥ 0, else negative; if Model == None, then neutral
28 Model Values: Linear, Quadratic, None. The preferred model choice: if only one model is significant, then it is chosen; if both are significant, then we pick the one with the greater log-likelihood (columns 8 and 17); if neither model is significant, then we choose None. Throughout, significant means p < 0.05.
29 RawScore fit1.coef2 (column 4) if Model == Linear; fit2.coef2 (column 11) if Model == Quadratic; else 0
30 NormedScore RawScore converted to a z-score relative to the population of significant coefficients for fit1.coef2 or fit2.coef2, depending on which value is in RawScore
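
The fitted values and the derived columns Predicted and RawScore can be recomputed from the coefficient columns roughly as follows. This is a minimal sketch, not the code used for the talk. It takes the quadratic fitted value to be invlogit(fit2.coef1 + fit2.coef2*x + fit2.coef3*x^2), treats a row as a dictionary keyed by the column names above with numeric values already parsed, and uses an invented example row.

```python
import math

def invlogit(z):
    """Inverse logit (logistic) function."""
    return 1.0 / (1.0 + math.exp(-z))

def fitted_linear(row, x):
    """Fitted value of the linear model at Category value x."""
    return invlogit(row["fit1.coef1"] + row["fit1.coef2"] * x)

def fitted_quadratic(row, x):
    """Fitted value of the quadratic model at Category value x."""
    return invlogit(row["fit2.coef1"] + row["fit2.coef2"] * x
                    + row["fit2.coef3"] * x ** 2)

def raw_score(row):
    """RawScore (column 29): the coef2 of the preferred model, else 0."""
    if row["Model"] == "Linear":
        return row["fit1.coef2"]
    if row["Model"] == "Quadratic":
        return row["fit2.coef2"]
    return 0.0

def predicted(row):
    """Predicted (column 27): neutral for Model == None, else the sign of coef2."""
    if row["Model"] == "None":
        return "neutral"
    return "positive" if raw_score(row) >= 0 else "negative"

# Hypothetical row; the coefficient values are invented for illustration.
row = {"fit1.coef1": -6.0, "fit1.coef2": 1.5,
       "fit2.coef1": -6.1, "fit2.coef2": 1.4, "fit2.coef3": -0.8,
       "Model": "Linear"}

print(predicted(row), raw_score(row), fitted_linear(row, 0.5))
```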

Word comparison assessment values

File (zipped CSV file): wn-asr-multilevel-cmp.csv.zip

     
  Column name Explanation
1 Word In the format WORD/tag where tag is a or r
2 SimWord In the format WORD/tag where tag is a or r; this word is related to Word via the WordNet similar_to relation
3 Polarity The polarity assigned both by the MPQA subjectivity lexicon and the method proposed in the talk. (We limit attention to pairs where this category value is agreed upon; the classification experiments assess the agreement level for this problem.)
4 MqapWordStrength The MPQA strength for Word: 1 == weaksubj; 2 == strongsubj
5 MqapSimStrength The MPQA strength for SimWord: 1 == weaksubj; 2 == strongsubj
6 WordScore Our predicted score for Word; same as NormedScore from wn-asr-multilevel-assess.csv
7 SimScore Our predicted score for SimWord; same as NormedScore from wn-asr-multilevel-assess.csv
8 MqapCmp Comparison value from MPQA: stronger if MqapWordStrength > MqapSimStrength; weaker if MqapWordStrength < MqapSimStrength; same otherwise
9 PredictedCmpInformal Comparison value for our informal method: stronger if WordScore > SimScore; weaker if WordScore < SimScore; same otherwise
10-11 category.coef, category.p Coefficient and p-value for the basic Category predictor in the comparison model
12-13 interaction.coef, interaction.p Coefficient and p-value for the interaction term Category*Stronger in the comparison model
14 PredictedCmpFormal if category.p ≥ 0.05 or interaction.p ≥ 0.05, same; else if sign(category.coef) == sign(interaction.coef), stronger; else weaker
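
The three comparison columns (MqapCmp, PredictedCmpInformal, PredictedCmpFormal) follow simple decision rules, which can be sketched as below. The function names are hypothetical and chosen for this example; only the rules themselves come from the table above.

```python
def compare(word_value, sim_value):
    """Rule shared by MqapCmp (on strengths) and PredictedCmpInformal (on scores)."""
    if word_value > sim_value:
        return "stronger"
    if word_value < sim_value:
        return "weaker"
    return "same"

def sign(v):
    """Return -1, 0, or 1 according to the sign of v."""
    return (v > 0) - (v < 0)

def predicted_cmp_formal(category_coef, category_p,
                         interaction_coef, interaction_p, alpha=0.05):
    """PredictedCmpFormal: same unless both terms are significant at alpha."""
    if category_p >= alpha or interaction_p >= alpha:
        return "same"
    if sign(category_coef) == sign(interaction_coef):
        return "stronger"
    return "weaker"

# Invented values for illustration.
print(compare(2, 1))                                # MqapCmp from MPQA strengths
print(compare(0.8, 1.3))                            # PredictedCmpInformal from scores
print(predicted_cmp_formal(1.2, 0.01, 0.4, 0.03))   # both significant, same sign
```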