CS224U Homework 2 (Spring 2014)

Homework 2

CS224U, Stanford, Spring 2014

This homework has three parts. There are a lot of words here, but don't be intimidated — there is not actually that much work to do.

Part 1: Bootstrapping for relation extraction

In this part, we're going to experiment with a bootstrapping approach to relation extraction, as depicted in Figure 22.16 of Jurafsky & Martin.

Let's suppose we have a brilliant idea for a startup which will be to literature what IMDb is to movies. (OMG, genius, we're gonna be rich!) One table in our database will relate authors to the novels they have written:

author	novel
Ayn Rand	Atlas Shrugged
Joseph Heller	Catch-22
Vladimir Nabokov	Lolita
James Joyce	Ulysses
(etc.)	(etc.)

In order to populate this table, we're going to try to extract (author, novel) tuples from natural language text on the web, using phrasal patterns containing wildcards, as described in section 22.2.2 of Jurafsky & Martin.

Google makes it easy to search for phrasal patterns on the web using a feature called "fill-in-the-blank" search, in which you use the * character to represent one or more wildcard tokens. As a warm-up, try these queries on Google:

"* is a novel by *"
"* wrote the novel *"
"the novel * was written by *"

(Make sure you include the quotation marks in your query.) Spend some time looking through the results. Note how a portion of each returned snippet appears in bold. Google is trying to guess at the appropriate boundaries for the matching phrase, and it doesn't always get it right. Note also that Google sometimes allows minor variations in the match. For example, the query "* is a novel by *" yields as a match "David's Buick is a new novel by Bob Saar". Let's not get hung up on these quirks here — they need not distract us from the main thrust of this exercise.

OK, now that you've warmed up: suppose we're going to begin populating our database using the first pattern above, namely "* is a novel by *". In our initial attempt, we'll retrieve all search results for that query, take the matching text from each snippet, and then construct a tuple from the subphrases corresponding to each wildcard (again, without fretting too much about how we find the right phrase boundaries — let's just assume we can do that reasonably well).

Answer each of the following questions in just a couple of sentences — no need to write a treatise here.

[1 point] Show one or two examples where the resulting tuple is close to something we'd want to put in our database, but not quite right. Can you suggest some simple strategies or heuristics to fix such examples up?
[1 point] Show one or two examples where the extracted tuple is definitely not something we'd want to put in our database, and there's no easy way to repair it. What strategies might we use to identify and filter out such tuples?
[0 points] Show one or two examples where our strategy works like a charm: we discover a new (author, novel) tuple to add to our database, and both names are clean and correct as is.

OK, suppose that we've found effective (if not perfect) solutions to the problems highlighted in questions 1 and 2, and we've used our first pattern to find several thousand (author, novel) tuples. But we know we're still missing many thousands more, so we're going to use the tuples we already have to identify additional extraction patterns. This time, enter the following search query:

"Ayn Rand" "Atlas Shrugged"

Again, spend some time perusing the results. Do the results suggest any good patterns that might be used to identify additional tuples?

[1 point] Find an example of a pattern that is likely to be precise (i.e., matches to the pattern are likely to be valid instances of the relation) but unproductive (i.e., we're unlikely to find many other matches). What strategies might we use to eliminate (or improve) such patterns?
[1 point] Find an example of a pattern that is likely to be productive (i.e., we expect to find lots of matches to the pattern) but imprecise (i.e., many or most of the matches will not be valid instances of the relation). What strategies might we use to eliminate (or improve) such patterns?
[0 points] Find an example of a pattern that looks likely to succeed. Then, use this pattern to search for some new tuples! Did it work as well as you expected?

Part 2: Playing with ReVerb

There is a demo of the ReVerb system online at:

http://openie.cs.washington.edu/

To get warmed up, spend a few minutes playing with the demo. Try using it to answer the following relational queries:

Who founded Stanford?
Who discovered America?
What did Edison invent?
What cures cancer?
What's the relationship between Brad Pitt and Angelina Jolie?
Who invented the internet?

Try a few more queries of your own invention. Fun, huh? OK, now you're ready to answer a couple of easy questions. Again, just a couple of sentences is enough.

[½ point] What's an especially amusing "fact" that you discovered in your explorations? (By "fact", we really mean falsehood. In case that wasn't clear. And your answer should be based on a new query, not one of the queries listed above.) Explain how ReVerb might have been fooled into accepting this "fact".
[½ point] Suggest a strategy for avoiding problems like the one you described in the last question.

Part 3: Evaluation metrics for clustering

Background

A crucial ingredient of the Yao et al. 2012 paper you read for today is the clustering of path senses into semantic relations. In this exercise, we'll take a closer look at the clustering evaluation metrics used by Yao et al. to measure their success: the pairwise metrics and the B³ metrics. Clustering tasks crop up in many different areas of NLP and NLU (including documentation classification, coreference resolution, topic modeling, and automatic thesaurus construction), so it's useful to know something about how clustering models are evaluated.

Assume that we have a set of n items to be clustered, and two clusterings (that is, partitions) of the items: a predicted clustering which is to be evaluated, and a target clustering against which to perform the evaluation. Both the pairwise approach and the B³ approach reformulate the clustering task as a binary classification task (or a set of such tasks). By comparing the binary labels derived from the predicted clustering to the binary labels derived from the target clustering, we can then compute precision and recall, just as in a standard evaluation of a binary classifier. However, the two approaches differ in how they define this reformulation.

The pairwise approach reformulates the clustering task as a binary classification task over pairs of items. Given a clustering, we label each of the n(n − 1)/2 pairs of items true if the two items are in the same cluster, and false otherwise. We then compute precision and recall by comparing the pairwise labels derived from the predicted clustering to those derived from the target clustering.

The B³ approach reformulates the clustering task as a set of binary classification tasks, each focused around a single item. Given a clustering and a focus item x, we label every item (including x itself) true if it appears in the same cluster as x, and false otherwise. We then compute per-item precision and recall by comparing the labels derived from the predicted clustering to those derived from the target clustering. Finally, we compute aggregate precision and recall by averaging per-item precision and recall over all items.

Lastly, in both approaches we compute F₁ as the harmonic mean of precision and recall.

The problem

Suppose we have a set of five items to cluster: {koala, ostrich, porpoise, salmon, tuna}.

The target clustering is {{koala, porpoise}, {ostrich}, {salmon, tuna}}. (In other words, we want to cluster the critters by taxonomic class.)

We'll evaluate three possible clusterings against this target clustering:

naïve: the clustering {{koala}, {ostrich}, {porpoise, salmon, tuna}}. (This differs from the target clustering by putting the porpoise with the fish.)
minimal: a clustering which puts every item into its own cluster: {{koala}, {ostrich}, {porpoise}, {salmon}, {tuna}}.
maximal: a clustering which puts every item into one cluster: {{koala, ostrich, porpoise, salmon, tuna}}.

We'll evaluate each clustering using both the pairwise metrics and the B³ metrics.

[0 points] Without doing any math, which of the three clusterings seems like the best match to the target clustering?
[0 points] Which seems like the worst match?

[2 points] OK, here comes the math. Use the pairwise metrics to fill in the following table.

	pairwise metrics
clustering	precision	recall	F₁
naïve
minimal
maximal

[½ point] Do the results of the pairwise evaluation match your intuitions? What's expected, and what's surprising?

[2 points] Now use the B³ metrics to fill in the following table.

	B³ metrics
clustering	precision	recall	F₁
naïve
minimal
maximal

[½ point] Do the results of the B³ evaluation match your intuitions? What's expected, and what's surprising? How do the results compare to the results of the pairwise evaluation?

Appendix: precision, recall, and F₁

In case you're unfamiliar with the concepts of precision, recall, and F₁, here's a quick explanation.

To evaluate a binary classifier, we compare the predicted labels to the target labels for every item in our test set. We begin by compiling counts, over the test set, of each of the four possible outcomes:

true positive (TP): predicted label is true and target label is true
false positive (FP): predicted label is true and target label is false
false negative (FN): predicted label is false and target label is true
true negative (TN): predicted label is false and target label is false

We then compute precision and recall from the counts of these outcomes.

P = #TP / (#TP + #FP)
R = #TP / (#TP + #FN)

Finally, F₁ is defined as the harmonic mean of precision and recall:

F₁ = 2PR / (P + R)

For example, suppose we have a test set of 10 items, with the following predicted and target labels:

item	predicted	target	outcome
1	T	F	FP
2	T	T	TP
3	F	F	TN
4	T	T	TP
5	F	T	FN
6	F	F	TN
7	T	T	TP
8	T	F	FP
9	F	F	TN
10	T	T	TP

The outcome counts are:

outcome	count
TP	4
FP	2
FN	1
TN	3

The precision, recall, and F₁ are therefore:

P = 4 / (4 + 2) = 0.667
R = 4 / (4 + 1) = 0.800
F₁ = 2 · 0.667 · 0.800 / (0.667 + 0.800) = 0.727

More background can be found in the Wikipedia articles on precision and recall and F₁.