CS 224N (Ling 237) -- Natural Language Processing -- FAQs

CS 224N -- Ling 237: Natural Language Processing
Spring 2004
FAQs

Homework #7 - Computational Semantics

Problem 2

This sentence involves plural noun phrases without determiners. You should basically ignore the plurality, but you will need an NP -> N' rule which does the right thing. You can regard the sentence as synonymous with 'Which customers purchased a red leather jacket?' (or 'Which customers purchased some red leather jackets?').

Homework #6 - PCFGs

Problem 1

Part (b): Calculating the Error Rate
In the training corpus, the string 'aa' occurs 5 times with tree-1, the string 'bb' occurs 5 times with tree-2, and the string 'ba' occurs 1 time with tree-3. We know that. But we also have a model that we created for the first part of the question, and there may be more than one way for it to derive the strings 'aa', 'bb', or 'ba'. How does our model match the training data? Does the model predict (i.e. assign the highest probability to) the same tree for each string? If the model predicts a different tree from the one in the training corpus, then the model is making an error. The error rate is the total number of errors divided by the total number of trees in the training corpus.
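As a rough sketch of the bookkeeping only (not the answer to the problem), the calculation could look like the following. The training counts come from the problem statement; the model's predictions, including the label 'tree-4', are made up here purely for illustration, and error_rate is a hypothetical helper name.

```python
from collections import Counter

# Training corpus from the problem: (string, tree) -> number of occurrences.
training = Counter({("aa", "tree-1"): 5, ("bb", "tree-2"): 5, ("ba", "tree-3"): 1})

def error_rate(training, predict):
    """Fraction of training trees whose string the model assigns a different tree."""
    total = sum(training.values())
    errors = sum(n for (string, tree), n in training.items()
                 if predict(string) != tree)
    return errors / total

# HYPOTHETICAL model output, for illustration only: it disagrees on 'ba'.
hypothetical_prediction = {"aa": "tree-1", "bb": "tree-2", "ba": "tree-4"}

rate = error_rate(training, hypothetical_prediction.get)
print(rate)
```

With your own model, replace the hypothetical dictionary with the highest-probability tree your grammar gives each string.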

Problem 2

What is interesting about this problem is not that you're looking for the probability of the w_i and w_j under node N^1, but that you're looking for N^1 as their SINGLE common ancestor. That means that there are no intermediate nodes which dominate both w_i and w_j. This makes the question a little bit tricky.

Problem 3

You should have the rule S -> B A in the set of rules and not S -> A B. This rule (S -> B A) will be observed once in the training data.

Homework #4 - WSD

It's a very sensible and good thing to use something like a POS tagger, see whether you can get value from it for improving WSD, and report experiments on that. At the limiting case, though, downloading and running someone else's WSD system and turning in the results wouldn't be a very good approach. It would be fine to download an SVM package, say, and concentrate on finding useful features with suitable experiments.
We are trying to determine what value we should use for the vocabulary size when we smooth our model. What is a reasonable method for determining the vocabulary size?
Well, one easy way to approach this is to say that your vocabulary of interest is just the words seen with some sense in the training data. There are other words, but in statistical terms, you're just defining the event model so as to include only words in that subset. Another way is to add one unknown-word event that collapses all unseen words. One could instead choose a large value, but it's hard to choose one on principle. You could just regard it as a smoothing parameter to be optimized (which is a perfectly good thing to do).
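To make the options concrete, here is a minimal sketch that assumes add-one (Laplace) smoothing, which is only one possible choice of smoothing method. The counts, the "<UNK>" token, and the function name add_one_prob are all made up for illustration.

```python
from collections import Counter

def add_one_prob(word, counts, vocab_size):
    """Add-one estimate of P(word | sense) from context-word counts for that sense."""
    total = sum(counts.values())
    return (counts[word] + 1) / (total + vocab_size)

# Toy counts of context words seen with one sense (made-up data).
counts = Counter({"bank": 3, "river": 2, "water": 1})

# Option 1: vocabulary = exactly the words seen in training.
v_seen = len(counts)
# Option 2: add one extra event standing for every unseen word.
v_with_unk = len(counts) + 1

p_seen = add_one_prob("river", counts, v_with_unk)
p_unk = add_one_prob("<UNK>", counts, v_with_unk)
print(p_seen, p_unk)
```

Treating vocab_size as a tunable parameter just means trying several values on held-out data and keeping the one that works best.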
What is available in terms of morphological analyzers?
Under src you can find a C program, morph-1.5, which will remove inflectional endings, or a Java one, morpha-2004-04-18, which will do the same (but really only provided you've already POS-tagged things). Or you could use a stemmer, which also, more heuristically, removes derivational endings. If you search for "porter stemmer java" on the web, you'll find one of those.

Homework #3

Problem 1

The definition of linear discounting should have been in terms of P(w_n|w_1, ... w_{n-1}) and C(w_n|w_1, ... w_{n-1}) [not P(w_1, ..., w_n)]. The online copy has been corrected.

Homework #2

Problem 4

There should be no apostrophe at the end of the (Leftcorner attach) line. Instead, the gamma should be a gamma-bar.

Programming Project #1

Java programming help

A useful website that contains many Java-related links is the class site for CS 108: Object Oriented Programming. Check out the links available. You can find help for both the novice and advanced Java programmer.

Grammar file

The gfile in /afs/ir/class/cs224n/pp1 is in CNF. Making a parser that deals with unary, etc. rules is extra credit.

Extra credit

Some ways to earn extra credit are:

Make a generalized CKY parser that can handle empties and unaries.
Provide a runtime graph showing that runtime goes up roughly cubically with sentence length.

Printing Parse Trees

For printing out the parse trees, you can either use the bracketing scheme on p.98 of the textbook (in chapter 3), or you can use an indented list.
You can access the textbook online from any Stanford computer.
Here is an example of the bracketing scheme following the example in the book:
[S [NP [AT The] [NNP children]] [VP [VBD ate] [NP [AT the] [NN cake]]]]
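One possible way to produce this bracketing is sketched below. It assumes trees are stored as nested tuples of the form (label, child, ...), with words as plain strings; that representation and the function name bracket are choices made for this example, not something the assignment requires.

```python
def bracket(tree):
    """Render a (label, child, ...) tuple in the [S [NP ...] ...] bracketing."""
    label, *children = tree
    parts = [c if isinstance(c, str) else bracket(c) for c in children]
    return "[" + " ".join([label] + parts) + "]"

tree = ("S",
        ("NP", ("AT", "The"), ("NNP", "children")),
        ("VP", ("VBD", "ate"),
               ("NP", ("AT", "the"), ("NN", "cake"))))

print(bracket(tree))
```

Run on the tree above, this prints the same bracketed string as the example.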

Parsing versus Recognizing: Recording/Printing Parse Trees

The algorithm in the handwritten parsing handout is essentially a CKY recognizer, rather than a parser. To check that parsing is working, you want a parser that prints out parses. (It's not strictly necessary for just counting the number of parses, but it's almost certainly wise for debugging.)
For a parser, there are then two ways to proceed.
1. The one that I think is conceptually easiest is that each time you enter a cell in the table, you record in a data structure what two things you put together to make it. These can be pointers back to the items in the chart that you combined.
2. The other is not to do this and then to do a read-out of the parses from the table (if I made an S from 1-5, let me consider again every splitpoint, and the rules for an S, and see how I could have made it, and then recurse on doing this). This may seem a stupid duplication of work, but (i) it doesn't change the asymptotic time complexity, and (ii) it is more economical of memory (it _does_ lower the asymptotic memory bound). Nevertheless, for the assignment, I'd recommend the first strategy.
What about the parenthetical note "without duplication!" in the last line of the algorithm? If there's more than one way to represent a given category in a given cell, don't I need to record that?
Yes, you need to record it, if adopting strategy one above. But, to keep the parser cubic in time, you need to carefully differentiate entering things in the parse triangle, and recording ways of making that thing. To get a cubic time parser, it is crucial that you only put a nonterminal, say, XP, in the table once, but if following strategy one, you will want to record in a list the various ways of making it.
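Here is a small sketch of strategy one, written in Python for brevity. The grammar, lexicon, data layout, and function names are all assumptions made for this example (it is not the handout's algorithm verbatim, and the gfile format is not shown). Each cell is a dictionary keyed by category, so a category enters a cell only once, while the list stored under it records every way of making it.

```python
from collections import defaultdict

def cky(words, lexicon, rules):
    """CKY for a CNF grammar.

    chart[(i, j)][X] is the list of ways to build X over words[i:j]:
    either the word itself, or a backpointer (split, left_cat, right_cat).
    """
    n = len(words)
    chart = defaultdict(dict)
    for i, w in enumerate(words):
        for cat in lexicon.get(w, ()):
            chart[(i, i + 1)][cat] = [w]
    for width in range(2, n + 1):
        for i in range(n - width + 1):
            j = i + width
            for k in range(i + 1, j):
                for b in chart[(i, k)]:
                    for c in chart[(k, j)]:
                        for a in rules.get((b, c), ()):
                            # The category is entered once (dict key); extra
                            # derivations only extend its backpointer list.
                            chart[(i, j)].setdefault(a, []).append((k, b, c))
    return chart

def count_parses(chart, cat, i, j):
    """Count the distinct trees rooted in cat over words[i:j]."""
    total = 0
    for way in chart[(i, j)].get(cat, []):
        if isinstance(way, str):
            total += 1
        else:
            k, b, c = way
            total += count_parses(chart, b, i, k) * count_parses(chart, c, k, j)
    return total

# Toy grammar (hypothetical): binary rules keyed by their right-hand side.
rules = {("NP", "VP"): ["S"], ("V", "NP"): ["VP"], ("VP", "PP"): ["VP"],
         ("NP", "PP"): ["NP"], ("P", "NP"): ["PP"]}
lexicon = {"Chris": ["NP"], "saw": ["V"], "Pat": ["NP"],
           "with": ["P"], "binoculars": ["NP"]}

sentence = "Chris saw Pat with binoculars".split()
chart = cky(sentence, lexicon, rules)
n_parses = count_parses(chart, "S", 0, len(sentence))

bad = "Chris saw".split()
n_bad = count_parses(cky(bad, lexicon, rules), "S", 0, len(bad))
print(n_parses, n_bad)
```

In this toy grammar the VP over "saw Pat with binoculars" is entered in its cell once but carries two backpointers, one per attachment of the PP, which is where the ambiguity is recorded.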

Part 1.2 - "correct sentences"

What do you mean by 'correct sentences'?
The basic idea behind this part of the assignment is to develop a grammar/lexicon/corpus of sentences that shows us that your parser works. It should correctly parse grammatical sentences and deal with any ambiguities that exist in the sentences. The 4 sentences listed in this section are just examples of grammatical sentences, some of which have ambiguities, and therefore would have multiple parses, and some of which would have only one parse. Your parser should also return 0 parses for any ungrammatical sentences in your corpus of sentences (and make sure you put some ungrammatical examples in the corpus!). One example of an ungrammatical sentence is *Chris saw.
When I use the term corpus of sentences, I am referring to the mysentences file.
Do we have to handle the sentences shown in the example?
You do not need to have these exact sentences in your corpus of sentences. These are just examples. Of course, there is no restriction against using these sentences, either. Just make sure you have some of your own sentences, too, and make sure that you explain why you chose the sentences you did in your report. Basically answer the questions "What were you testing for with these sentences? How does this show that your parser works?"

Homework #1

Looking At Corpora

References to wsj7_001 should instead refer to WS960102.

Problem 1

You only need to tokenize the text inside the <text> tags. By ignoring the rest, you're really only ignoring the headline to the <text> and some metadata.
It would be great to get rid of all the extra HTML tags, but it is not necessary.
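A minimal sketch of restricting tokenization to the <text> regions follows. The sample markup is invented for illustration (the real WS* files may look different), the helper names are hypothetical, and whitespace splitting stands in for whatever tokenizer you actually write.

```python
import re

def text_sections(document):
    """Return the contents of each <text>...</text> region, ignoring the rest."""
    return re.findall(r"<text>(.*?)</text>", document,
                      flags=re.DOTALL | re.IGNORECASE)

def strip_tags(s):
    """Replace any remaining markup tags with spaces (optional clean-up)."""
    return re.sub(r"<[^>]+>", " ", s)

# Made-up sample document.
sample = ("<doc><hl>Headline</hl>"
          "<text><p>Stocks rose.</p>\n<p>Bonds fell.</p></text></doc>")

tokens = [tok
          for body in text_sections(sample)
          for tok in strip_tags(body).split()]
print(tokens)
```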

Part b) You may either give 2 instances for each method (for a total of four instances) or you may give 2 instances where the two methods are both wrong.

Problem 2

Part iii) of the question reads "What are the 10 most common word bigrams and 10 most common word trigrams in the two corpora ...".
Two corpora? As far as I can tell, there's only one corpus -- the wsj.words files we generate from the six WS* input files. Which second corpus is this question referring to?

Yes, there was a typo in a few places in the assignment. This assignment only involves ONE corpus. Please ignore references to more than one corpus. There is now a new version of the assignment online.
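For the single corpus, counting the most common bigrams and trigrams could be sketched as below. The token list is a toy stand-in for the contents of your wsj.words files, and top_ngrams is a hypothetical helper name.

```python
from collections import Counter

def top_ngrams(tokens, n, k=10):
    """Return the k most common n-grams as ((w1, ..., wn), count) pairs."""
    grams = zip(*(tokens[i:] for i in range(n)))
    return Counter(grams).most_common(k)

# Toy token sequence (made-up data).
tokens = "the cat sat on the mat and the cat slept".split()

top_bigrams = top_ngrams(tokens, 2)
top_trigrams = top_ngrams(tokens, 3)
print(top_bigrams[0])
```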

Problem 3

In terms of answering exactly what's asked in the questions, the mention of count-counts is technically an aside, and you do not strictly have to do it.

But the fact that you can is a neat observation, and it is closely related to material discussed in the section of the book given to you: for example, it gives you precisely what's found in Table 1.2 (although it's called "frequency of frequencies" there).
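If you do want to compute count-counts, a sketch along these lines would do it; the token list is a toy example, not course data.

```python
from collections import Counter

# Toy token sequence (made-up data).
tokens = "a a a b b c c d e f".split()

# How often each word type occurs.
word_counts = Counter(tokens)

# Count-counts ("frequency of frequencies"): how many word types occur r times.
count_counts = Counter(word_counts.values())

for r in sorted(count_counts):
    print(r, count_counts[r])
```

Here three word types occur once, two occur twice, and one occurs three times.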