CS 224N -- Ling 237: Natural Language Processing
Spring 2002
FAQs
Homework #4:
For question #6, we don't have lexical entries for "saw", "with", or "a". How should we
treat these words?
Please replace the word "a" in the sentence with "the". (Kathy saw the woman with the telescope.)
The word "saw" should have a lexical entry like "respect".
The word "with" should have a lexical entry like "in".
Homework #3:
Don't we need the words "as", "condiment", and "side" to be in the grammar?
Yes -- you should add these three words to the grammar: "condiment" and "side" as nouns,
and "as" as a preposition.
Word Sense Disambiguation Project:
Why don't we have any test data? Are we supposed to use the data in the "wordsense" directory?
Use the data in the "wordsense" directory only as additional data if you have time; it is old data
that is not in exactly the same format as the data you have been given this year. The important data is in
the "data" directory. That directory contains no data in test format -- you should divide the data into training and
test sets yourself, either using the Java class provided or by simply splitting up the data and ignoring the sense
indicators at test time.
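If you split the data by hand rather than with the provided Java class, the split could look something like the following Python sketch (the function and variable names are illustrative, not part of the course code):

```python
import random

def train_test_split(examples, test_fraction=0.1, seed=0):
    """Randomly partition labeled examples into training and test sets.

    `examples` is assumed to be a list of (context, sense_label) pairs;
    a fixed seed keeps the split reproducible across runs.
    """
    rng = random.Random(seed)
    shuffled = list(examples)
    rng.shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]  # (train, test)
```

At test time, hide the sense label from your program and use it only to score the predictions.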
Can we use late days for the checkpoint?
You are technically allowed to use late days on the checkpoint -- however,
we strongly discourage this. First of all, it's in your best interests to
look at the project early and to get started on working with the data so
that you're comfortable with it. Secondly, the purpose of the checkpoint
is to give people feedback on how well their programs work. This feedback
is more valuable to you if you have time afterwards to adjust your program
(so delaying the feedback is unwise). Furthermore, since we are posting
statistics on the performance of everyone's projects, the feedback is more
valuable to the rest of the class if they can obtain those figures in good
time -- if many people delay their checkpoints, then there will be no
meaningful statistics available next Wednesday.
It's not critical to have a very strong working program by next week (although
the feedback will be more valuable if you do) -- so if you don't have a
lot of time, try implementing one of the simpler techniques. Of course, if
you really have no time, you may choose to use late days, despite all of
the reasons above.
What exactly needs to be submitted for the checkpoint?
For the checkpoint, you should submit a working word sense disambiguation program and a working spelling
correction program (these can be the same program, but must be accessed through two different shell scripts, as
discussed in the handout). You do not need to have two machine learning techniques implemented by the checkpoint.
Is there a part-of-speech tagger available for our use?
Yes -- there is a tagger in /afs/ir/class/cs224n/src/icopost-0.9.1
There's documentation for it in the html directory or at
http://nats-www.informatik.uni-hamburg.de/~ingo/icopost/
There is an English (Penn treebank tagset) ngram and lexicon file in
the data directory for use with the trigram tagger in particular.
Homework 1:
In Problem 6a, the handout says to get rid of all text between SGML elements.
In some texts, there are <subsample> tags that surround most of the text. Should we keep
this text?
Yes, you should keep the text within <subsample> tags. You should just delete the tags
themselves, not the text between them.
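One way to do this is with a regular expression, sketched here in Python (the tag name is the one from the corpus; everything else is illustrative):

```python
import re

# Matches an opening <subsample ...> tag (with any attributes) or the
# closing </subsample> tag, but not the text in between.
SUBSAMPLE_TAG = re.compile(r"</?subsample[^>]*>")

def strip_subsample_tags(text):
    """Delete the <subsample> tags themselves, keeping the enclosed text."""
    return SUBSAMPLE_TAG.sub("", text)
```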
In Problem 1, it looks like I'll get different answers depending on the exact
structure of the text. Is this something to be worried about?
If you make assumptions about the structure of the text, be sure to state them clearly
in your answer.
Problem 2 says to "show that, in the limit, 'Zipf's law' holds for text
randomly generated according to X." Which variable should we be taking to the limit?
This is the limit as word length |w| increases.
In Problem 3c, what is meant by the "prior term"?
The prior term is the term that represents the prior probability of
a class. That is, it's the probability that a document belongs to a
class if you don't have any information about the document. In the
algorithm in Mitchell's book, this is P(vj). (The prior is indicated on slide 265
of the lecture notes from Wednesday's class.)
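As a concrete sketch (in Python rather than the course's Java, with illustrative names), estimating the prior from the training labels is just relative frequency:

```python
from collections import Counter

def class_priors(labels):
    """Estimate P(vj) for each class as its fraction of the training documents."""
    counts = Counter(labels)
    total = len(labels)
    return {c: n / total for c, n in counts.items()}
```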
What should we be doing in problem 3e? Is there an error in the question?
There's no error in this question: you are meant to calculate P(c|w) and
then take the average of that value and P(c). That is, you're looking at the effect of
using [P(c|w) + P(c)]/2.
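In code, the averaging is a one-liner; here is a hedged Python sketch (the names are illustrative):

```python
def averaged_estimate(p_c_given_w, p_c):
    """Return [P(c|w) + P(c)] / 2: the word-conditioned estimate
    averaged with (i.e. smoothed toward) the class prior."""
    return (p_c_given_w + p_c) / 2.0
```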
In the ACE corpus, some of the <sample> tags don't have corresponding </sample>
tags. Is this a problem?
Don't worry about this -- just make sure that you delete any leftover tags after you have
extracted the text.
I can't figure out how to get the corpus tokenized perfectly -- I keep running into
problems with details. For example, should I keep the headings in, or delete them?
Working with corpora is always messy. You want to get the text to be as "clean" as possible,
but there are always details that are difficult to work out and inconsistencies that seem
impossible to reconcile. In other words, it may not be possible (or worth your time)
to get everything perfectly clean, but as long as you make sensible decisions, you should be okay.