The following is a very small, invented word × document matrix (words A, B, C, D; documents d1 ... d5):

|   | d1 | d2 | d3 | d4 | d5 |
|---|----|----|----|----|----|
| A | 10 | 15 | 0  | 9  | 10 |
| B | 5  | 8  | 1  | 2  | 5  |
| C | 14 | 11 | 0  | 10 | 9  |
| D | 13 | 14 | 10 | 11 | 12 |

(A CSV version of the matrix for use with spreadsheet and matrix programs.)
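To make the matrix easy to manipulate programmatically, it can be read straight from the CSV. As a sketch (the CSV text is inlined here since the actual filename isn't given, and the assumed column layout is a `word` column followed by `d1`–`d5`):

```python
import csv
import io

# Inline stand-in for the provided CSV file; in practice you would
# open the file instead of wrapping a string in io.StringIO.
csv_text = """word,d1,d2,d3,d4,d5
A,10,15,0,9,10
B,5,8,1,2,5
C,14,11,0,10,9
D,13,14,10,11,12
"""

# Map each word to its row vector of document counts.
rows = {}
for record in csv.DictReader(io.StringIO(csv_text)):
    word = record.pop("word")
    rows[word] = [float(v) for v in record.values()]

print(rows["A"])  # [10.0, 15.0, 0.0, 9.0, 10.0]
```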

Your tasks:

- (2.5 points) For each word A, B, C, and D, calculate and provide its Euclidean distance from word A, and then use those values to rank all of the words with respect to their closeness to A (closest = 1; farthest = 4). (For each word, you should provide its distance and rank.)
- (2.5 points) Normalize each row of the matrix by length (definition below), recalculate the Euclidean distances of all the words from word A, and recalculate the ranking with respect to A. (For each word, you should provide its distance and rank. You needn't provide the length-normalized matrix.)
- (2 points) If the ranking changed between step 1 and step 2, provide a brief explanation for the nature of that change, and try to articulate why it changed. If the ranking did not change, provide a brief explanation for why normalization did not have an effect here.

- Euclidean distance
  - The Euclidean distance between vectors \(x\) and \(y\) of dimension \(n\) is \( \sqrt{\sum_{i=1}^{n} |x_{i} - y_{i}|^{2}} \).

- Length (L2) normalization
  - Given a vector \(x\) of dimension \(n\), the normalization of \(x\) is a vector \(\hat{x}\), also of dimension \(n\), obtained by dividing each element of \(x\) by \(\sqrt{\sum_{i=1}^{n} x_{i}^{2}}\).
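The two definitions above translate directly into code. The following sketch uses toy vectors rather than rows of the assignment matrix, so it illustrates the operations without working the exercise for you:

```python
from math import sqrt

def euclidean(x, y):
    # Square root of the summed squared differences, per the
    # Euclidean distance definition above.
    return sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

def l2_normalize(x):
    # Divide each element by the vector's L2 length.
    length = sqrt(sum(xi ** 2 for xi in x))
    return [xi / length for xi in x]

# Toy vectors (not from the assignment matrix):
x = [3.0, 4.0]
y = [0.0, 0.0]
print(euclidean(x, y))   # 5.0
print(l2_normalize(x))   # [0.6, 0.8]
```

Given a dict `dists` mapping each word to its distance from A, the ranking step is then just a sort, e.g. `sorted(dists, key=dists.get)`.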

(3 points) Here is another invented word × document matrix (words A, B, C; documents d1, d2, d3):

|   | d1   | d2   | d3   |
|---|------|------|------|
| A | 1    | 0    | 0    |
| B | 1000 | 1000 | 4000 |
| C | 1000 | 2000 | 999  |

Calculate the pointwise mutual information (PMI) for cells (A, d1) and (B, d3), as defined in equation 4 of Turney and Pantel (p. 157). What is problematic about the values obtained? How might we address the problem, so that the PMI values are more intuitive?
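As a computational aid, PMI in the sense of Turney and Pantel's equation 4 can be sketched as follows: the cell, row, and column probabilities are estimated from raw counts, and the PMI of a cell is the log of the cell probability over the product of its row and column probabilities. The natural log is used here (the base only rescales the values), and the example matrix is a toy one, not the one above:

```python
from math import log

def pmi(matrix, i, j):
    # pmi_ij = log( p_ij / (p_i* * p_*j) ), with all probabilities
    # estimated from the raw count matrix.
    total = sum(sum(row) for row in matrix)
    p_ij = matrix[i][j] / total                   # cell probability
    p_i = sum(matrix[i]) / total                  # row (word) probability
    p_j = sum(row[j] for row in matrix) / total   # column (document) probability
    return log(p_ij / (p_i * p_j))

# A toy 2x2 count matrix (not the assignment's) to exercise the code:
counts = [[4, 0],
          [0, 4]]
print(pmi(counts, 0, 0))  # log(0.5 / (0.5 * 0.5)) = log 2 ≈ 0.693
```

Note that a zero count makes `p_ij` zero and the log undefined, which is one of the practical issues this question is probing.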