CS 224U / LING 188/288, Spring 2014, Stanford

The following is a very small, invented word × document matrix (words A, B, C, D; documents d1 ... d5):

|   | d1 | d2 | d3 | d4 | d5 |
|---|----|----|----|----|----|
| A | 10 | 15 | 0  | 9  | 10 |
| B | 5  | 8  | 1  | 2  | 5  |
| C | 14 | 11 | 0  | 10 | 9  |
| D | 13 | 14 | 10 | 11 | 12 |

(A CSV version of the matrix is available for use with spreadsheet and matrix programs. Also, the code distribution for questions 3 and 4 contains functions that will solve this problem and the next for you.)

Your tasks:

- For each word A, B, C, and D, calculate and provide its Euclidean distance from word A, and then use those values to rank all of the words with respect to their closeness to A (closest = 1; farthest = 4). (For each word, you should provide its distance and rank.)
- Normalize each row of the matrix by length (definition below), recalculate the Euclidean distances of all the words from word A, and recalculate the ranking with respect to A. (For each word, you should provide its distance and rank. You needn't provide the length-normalized matrix.)
- If the ranking changed between step 1 and step 2, provide a brief explanation for the nature of that change, and try to articulate why it changed. If the ranking did not change, provide a brief explanation for why normalization did not have an effect here.

- Euclidean distance
- The Euclidean distance between vectors \(x\) and \(y\) of dimension \(n\) is \( \sqrt{\sum_{i=1}^{n} |x_{i} - y_{i}|^{2}} \)

- Length (L2) normalization
- Given a vector \(x\) of dimension \(n\), the normalization of \(x\) is a vector \(\hat{x}\) also of dimension \(n\) obtained by dividing each element of \(x\) by \(\sqrt{\sum_{i=1}^{n} x_{i}^{2}}\)
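The two definitions above can be sanity-checked with a minimal NumPy sketch (this is not part of the course code distribution; the matrix values are the ones from the table above):

```python
import numpy as np

# Rows of the word x document matrix above (words A, B, C, D; documents d1..d5).
m = np.array([[10, 15,  0,  9, 10],
              [ 5,  8,  1,  2,  5],
              [14, 11,  0, 10,  9],
              [13, 14, 10, 11, 12]], dtype=float)

def euclidean(x, y):
    # Square root of the sum of squared coordinate differences.
    return np.sqrt(np.sum((x - y) ** 2))

def l2_normalize(x):
    # Divide each element of x by the vector's Euclidean length.
    return x / np.sqrt(np.sum(x ** 2))

# Distances from word A (row 0), before and after length normalization:
raw = [euclidean(m[0], row) for row in m]
norm = [euclidean(l2_normalize(m[0]), l2_normalize(row)) for row in m]
```

Comparing `raw` and `norm` gives the two rankings the task asks about.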

Here is another invented word × document matrix (words A, B, C; documents d1, d2, d3):

|   | d1   | d2   | d3   |
|---|------|------|------|
| A | 1    | 0    | 0    |
| B | 1000 | 1000 | 4000 |
| C | 1000 | 2000 | 999  |

Calculate the pointwise mutual information (PMI) for cells (A, d1) and (B, d3), as defined in equation 4 of Turney and Pantel (p. 157). Give these values, and articulate, in 1-2 sentences, what is problematic about the values obtained.
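If you want to check your arithmetic, the sketch below computes PMI in what should match the Turney and Pantel formulation (joint and marginal probabilities estimated from the matrix total); do verify it against their equation 4 before relying on it:

```python
import numpy as np

# The second invented word x document matrix (rows A, B, C; columns d1, d2, d3).
m = np.array([[   1,    0,    0],
              [1000, 1000, 4000],
              [1000, 2000,  999]], dtype=float)

def pmi_cell(m, i, j):
    # p(i, j) from the cell; marginals from the row and column sums.
    total = m.sum()
    p_ij = m[i, j] / total
    p_i = m[i].sum() / total
    p_j = m[:, j].sum() / total
    return np.log(p_ij / (p_i * p_j))

pmi_A_d1 = pmi_cell(m, 0, 0)
pmi_B_d3 = pmi_cell(m, 1, 2)
```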

This question and the next rely on a small code distribution. You can download it for use on your own machine (it requires Python with NumPy and SciPy): cs224u-hw1.zip. Alternatively, you can run it from your Stanford Unix account from this directory: /afs/ir/class/cs224u/hwcode/hw1/. You don't need to know Python; you'll use it only to make some function calls.

To start exploring this method, navigate to the directory containing the code and data distribution, enter the Python shell by typing python at the prompt, and then execute these commands, which load the code and matrix:

- from vsm import *
- m = Matrix('imdb-wordword.csv')

The file imdb-wordword.csv contains a (2998 × 2998) word × word matrix, where the count in cell m[i, j] is the number of times that word i and word j occurred together in the same text (a review from IMDB).

The function neighbors lets you see the words that are closest to a given word according to a distance measure:

- neighbors(m, 'scary', distfunc=cosine_distance, n=5)
- [('scary', 0.0), ('scare', 0.0033059029955925245), ('thing', 0.0040114240503390519), ('just', 0.0041900487203675452), ('cheesy', 0.0042465840121400644)]
- neighbors(m, 'scary', distfunc=euclidean_distance, n=5)
- [('scary', 0.0), ('seriously', 2734.9711150211442), ('basically', 2777.8320683583447), ('annoying', 2809.6780954408282), ('ridiculous', 2811.8454793960495)]
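For reference, cosine distance is conventionally one minus the cosine of the angle between the two vectors; the sketch below shows that standard definition (the distribution's cosine_distance presumably matches it, but may differ in minor details):

```python
import numpy as np

def cosine_distance(x, y):
    # 1 - (x . y) / (|x| |y|): 0 for parallel vectors, 1 for orthogonal ones.
    return 1.0 - np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))
```

Because this measure ignores vector length, it can rank neighbors very differently from Euclidean distance, which is one way to understand why the two neighbor lists above look so different.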

The code also has two basic reweighting functions: tfidf and pmi. The pmi implementation has options for specifying Positive PMI and performing the discounting procedure that Turney and Pantel describe. Here are example function calls (which might take a few minutes to finish):

- p = pmi(m, positive=False, discounting=False)
- t = tfidf(m)
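As a rough picture of what a tf-idf reweighting does (the distribution's tfidf may use a different variant, e.g. with term-frequency normalization), here is one standard formulation, with columns playing the role of documents:

```python
import numpy as np

def tfidf_reweight(m):
    # One common tf-idf variant: raw term frequency times log inverse
    # document frequency, with each column treated as a "document".
    n_docs = m.shape[1]
    df = (m > 0).sum(axis=1)      # number of documents containing each word
    idf = np.log(n_docs / df)     # assumes no all-zero rows
    return m * idf[:, np.newaxis]
```

Words that occur in every document get idf 0 and are zeroed out entirely, which is the sense in which tf-idf downweights ubiquitous terms.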

These new matrices behave like the original, so you can do things like this:

- neighbors(p, 'scary', distfunc=cosine_distance, n=5)
- [('scary', 2.2204460492503131e-16), ('horror', 0.44045320622678008), ('just', 0.52984219031468927), ('gore', 0.54364357758755311), ('creepy', 0.54628422820431433)]
- neighbors(t, 'scary', distfunc=euclidean_distance, n=5)
- [('scary', 0.0), ("it's", 6.2215154001902769e-05), ('blood', 6.6180613306240124e-05), ('suspense', 6.7280395622054464e-05), ('m', 6.9636551744935624e-05)]

Finally, the code implements Latent Semantic Analysis (truncated singular value decomposition). Here's how to use it; k gives the number of column dimensions to use:

- sem = lsa(m, k=100)
- neighbors(sem, 'scary', distfunc=cosine_distance, n=5)
- [('scary', 0.0), ('boring', 0.001122878286416662), ('scare', 0.0011598457991651712), ('cheesy', 0.0012451811570568516), ('lame', 0.0015101458334525475)]

LSA can in principle be combined with (preceded by) either of the reweighting schemes discussed above, and it is in principle compatible with either distance metric on vectors.
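Under the hood, LSA is just truncated singular value decomposition; a minimal sketch of that operation follows (the distribution's lsa may scale or organize its output differently):

```python
import numpy as np

def lsa(m, k=100):
    # Truncated SVD: factor m and keep only the top-k singular dimensions.
    U, s, Vt = np.linalg.svd(m, full_matrices=False)
    return U[:, :k] * s[:k]   # each row is a k-dimensional word vector
```

At full rank this transformation preserves all pairwise dot products between rows; truncating to small k discards the low-variance dimensions, which is what produces the "latent" smoothing effect.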

**Your task**: Sample around for a little while, seeing what neighbors turns up for different parts of the lexicon, and then venture a proposal about how best to build a VSM from this data:

- Which words did you use to explore? (List them.)
- Which weighting scheme should we use, if any? (If you choose PMI, be sure to say which options you favor.)
- Should we run LSA? If so, what value of k seems best?
- Which distance metric should we use?

The idea is that these should all work together to deliver the final VSM that we use for semantics. You need only give answers. We're not requesting justification.

The semantic orientation method of Turney and Littman 2003 is offered as a general method for building lexicons for any desired semantic dimension using just similarity relations on vector representations. Here are the steps:

- Define two seed-sets \(S_{1}\) and \(S_{2}\) of words (they should be opposing in some way that is appropriate for your matrix).
- Pick a vector distance measure \(dist\).
- For the word \(w\) of interest, use \(dist\) to get the sum of all distances between \(w\) and the words in \(S_{1}\). Do the same for \(S_{2}\).
- The score is the sum for \(S_{1}\) minus the sum for \(S_{2}\).

- Semantic orientation summarized
- For a given distance metric \(dist\), word \(w\), and seed sets \(S_{1}\) and \(S_{2}\), \[ (\sum_{w' \in S_{1}} dist(w, w')) - (\sum_{w' \in S_{2}} dist(w, w')) \]
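The steps above amount to the following sketch (vector lookup and vocabulary ranking are omitted, and the function names here are illustrative, not the distribution's):

```python
import numpy as np

def cosine_distance(x, y):
    # Standard cosine distance: 1 minus the cosine of the angle.
    return 1.0 - np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

def so_score(w, seeds1, seeds2, dist=cosine_distance):
    # Summed distance to seeds1 minus summed distance to seeds2;
    # negative scores mean w sits closer to the seeds1 pole.
    return sum(dist(w, s) for s in seeds1) - sum(dist(w, s) for s in seeds2)
```

Note the sign convention: since small distances mean closeness, words associated with the first seed set receive negative scores, which is why the seeds1-associated words in the example output below have negative values.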

The code vsm.py implements this as semantic_orientation, which ranks the entire vocabulary according to the two seed sets. Here is a call with the default arguments, which use Turney and Littman's original seed-sets:

- m = Matrix('imdb-wordword.csv')
- so = semantic_orientation(m, seeds1=['bad', 'nasty', 'poor', 'negative', 'unfortunate', 'wrong', 'inferior'], seeds2=['good', 'nice', 'excellent', 'positive', 'fortunate', 'correct', 'superior'], distfunc=cosine_distance)
- so[ : 5] # words most associated with seeds1
- [('1/10', -0.01734046484573426), ('suck', -0.015119973084056548), ('gonna', -0.014471191150799867), ('crap', -0.014289001238634969), ('renting', -0.013775117639892809)]
- so[-5: ] # words most associated with seeds2
- [('breathtaking', 0.015729692669502304), ('titanic', 0.015771864952280112), ('victoria', 0.016104061915204637), ('powell', 0.019246824409916319), ('columbo', 0.019610017731074403)]

**Your task:**

- Make sure you have a VSM set up in the way you advocated for in question 3.
- Pick one semantic opposition (distinct from pos/neg sentiment) that works well in this VSM, built from this data. Name the opposition, give your seed sets, and paste in the top 5 and bottom 5 values, as above.
- As in the previous step, but now with a semantic opposition that seems not to work at all with this VSM and method.