HOMEWORK 2

We're going to build a supervised WSD system to disambiguate two senses of the made-up word vek, as described in Section 20.2 of Jurafsky & Martin. The annotated training corpus consists of just five sentences:
```
          foo vek¹ bar baz
          foo foo vek¹ zub
          zip vek¹ zip zub
          foo vek² zip zip
          foo zub vek² bar 
```
Our feature representation will be exceedingly simple: we'll use a bag of words containing the one word before and one word after the target word. That is, for every word w_i in our vocabulary, we'll have a boolean feature f_i indicating whether w_i appears as one of these two context words. Assume that the vocabulary includes every word observed in the training data.

Using this feature representation, we will train a standard Naive Bayes classifier on our small training corpus, and then use the classifier to make disambiguation predictions on test examples.
1. Write down the equation that our Naive Bayes classifier will use to make predictions.
2. What is the prior probability of sense 1?
3. What is conditional probability of feature foo, given sense 1?
4. Suppose we apply our trained classifier to the following test example.
```
              zip vek^? zub 
```
  Which sense will our classifier predict to be more likely?
5. What specific probabilities will our classifier assign to each of the two senses?
6. [Extra credit] Now suppose we apply our trained classifier to this test example.
```
              foo vek^? baz 
```
  What difficulties might we encounter, and how can we address them?

We've put a 1GB fragment of Wikipedia here. Please download it to some convenient location and unzip it using the unzip command.

By grepping through the unzipped XML file, you can find occurrences of (Wikipedia-style) hyperlinks like the ones that [Mihalcea 2007] used to create her sense-annotated corpus. For example, try the following grep command to find hyperlinks with anchor text "bar":
```
          egrep -o "\[{2}[a-z\(\) \-\_]+\|bar\]{2}" enwiki-20081008-pages-articles-first-1gb.xml 
```
Play around with this command and some variations to get familiar with the possibilities. (Pro tip: add "| sort | uniq -c" to the end of your command line to combine and count duplicates.)

Then, please find a noun having the following properties:
- It has at least five senses in WordNet.
- It has at least five senses in (our fragment of) Wikipedia, using Mihalcea's strategy for extracting sense annotations from Wikipedia.
- It does not appear in Table 2 of [Mihalcea 2007].
Show us how you'd map the Wikipedia senses you've discovered into WordNet senses. We're looking for mappings that look like this:
[[bar (law)|bar]] --> 'legal_profession.n.01' (the canonical name for the synset)
or
[[bar (law)|bar]] --> 'the body of individuals qualified to practice law in a particular jurisdiction' (synset definition of the sort one gets from the Web interface)
Be sure to list any Wikipedia senses that don't map to any WordNet sense, and conversely, any WordNet senses to which no Wikipedia senses are mapped.