CS 224N -- Ling 237: Natural Language Processing
Spring 2002
FAQs
Homework #4:
For question #6, we don't have lexical entries for "saw", "with", or "a". How should we
treat these words?
Please replace the word "a" in the sentence with "the". (Kathy saw the woman with the telescope.)
The word "saw" should have a lexical entry like "respect".
The word "with" should have a lexical entry like "in".
Homework #3:
Don't we need the words "as", "condiment", and "side" to be in the grammar?
Yes -- you should add these three words to the grammar: "condiment" and "side" as nouns,
and "as" as a preposition.
Word Sense Disambiguation Project:
Why don't we have any test data? Are we supposed to use the data in the "wordsense" directory?
Use the data in the "wordsense" directory only as additional data if you have time; it is old data
that is not in exactly the same format as the data you have been given this year. The important data is in
the "data" directory. That directory contains no data in test format -- you should divide the data into training and
test sets yourself, either using the Java class provided or by simply splitting up the data and ignoring the sense
indicators at test time.
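If you split the data by hand rather than with the provided Java class, the split could look something like the following Python sketch (the function and variable names are illustrative, not part of the course code):

```python
import random

def train_test_split(examples, test_fraction=0.1, seed=0):
    """Randomly partition labeled examples into training and test sets.

    `examples` is assumed to be a list of (context, sense_label) pairs;
    a fixed seed keeps the split reproducible across runs.
    """
    rng = random.Random(seed)
    shuffled = list(examples)
    rng.shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]  # (train, test)
```

At test time, hide the sense label from your program and use it only to score the predictions.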
Can we use late days for the checkpoint?
You are technically allowed to use late days on the checkpoint -- however,
we strongly discourage this. First of all, it's in your best interests to
look at the project early and to get started on working with the data so
that you're comfortable with it. Secondly, the purpose of the checkpoint
is to give people feedback on how well their programs work. This feedback
is more valuable to you if you have time afterwards to adjust your program
(so delaying the feedback is unwise). Furthermore, since we are posting
statistics on the performance of everyone's projects, the feedback is more
valuable to the rest of the class if they can obtain those figures in good
time -- if many people delay their checkpoints, then there will be no
meaningful statistics available next Wednesday.
It's not critical to have a very strong working program by next week (although
the feedback will be more valuable if you do) -- so if you don't have a
lot of time, try implementing one of the simpler techniques. Of course, if
you really have no time, you may choose to use late days, despite all of
the reasons above.
What exactly needs to be submitted for the checkpoint?
For the checkpoint, you should submit a working word sense disambiguation program and a working spelling
correction program (these can be the same program, but must be accessed through two different shell scripts, as
discussed in the handout). You do not need to have two machine learning techniques implemented by the checkpoint.
Is there a part-of-speech tagger available for our use?
Yes -- there is a tagger in /afs/ir/class/cs224n/src/icopost-0.9.1
There's documentation for it in the html directory or at
http://nats-www.informatik.uni-hamburg.de/~ingo/icopost/
There is an English (Penn treebank tagset) ngram and lexicon file in
the data directory for use with the trigram tagger in particular.
Homework 1:
In Problem 6a, the handout says to get rid of all text between SGML elements.
In some texts, there are <subsample> tags that surround most of the text. Should we keep
this text?
Yes, you should keep the text within <subsample> tags. You should just delete the tags
themselves, not the text between them.
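One way to do this is with a regular expression, sketched here in Python (the tag name is the one from the corpus; everything else is illustrative):

```python
import re

# Matches an opening <subsample ...> tag (with any attributes) or the
# closing </subsample> tag, but not the text in between.
SUBSAMPLE_TAG = re.compile(r"</?subsample[^>]*>")

def strip_subsample_tags(text):
    """Delete the <subsample> tags themselves, keeping the enclosed text."""
    return SUBSAMPLE_TAG.sub("", text)
```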
In Problem 1, it looks like I'll get different answers depending on the exact
structure of the text. Is this something to be worried about?
If you make assumptions about the structure of the text, be sure to state them clearly
in your answer.
Problem 2 says to "show that, in the limit, 'Zipf's law' holds for text
randomly generated according to X." Which variable should we be taking to the limit?
This is the limit as word length |w| increases.
In Problem 3c, what is meant by the "prior term"?
The prior term is the term that represents the prior probability of
a class. That is, it's the probability that a document belongs to a
class if you don't have any information about the document. In the
algorithm in Mitchell's book, this is P(vj). (The prior is indicated on slide 265
of the lecture notes from Wednesday's class.)
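As a concrete sketch (in Python rather than the course's Java, with illustrative names), estimating the prior from the training labels is just relative frequency:

```python
from collections import Counter

def class_priors(labels):
    """Estimate P(vj) for each class as its fraction of the training documents."""
    counts = Counter(labels)
    total = len(labels)
    return {c: n / total for c, n in counts.items()}
```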
What should we be doing in problem 3e? Is there an error in the question?
There's no error in this question: you are meant to calculate P(c|w) and
then take the average of that value and P(c). That is, you're looking at the effect of
using [P(c|w) + P(c)]/2.
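In code, the averaging is a one-liner; here is a hedged Python sketch (the names are illustrative):

```python
def averaged_estimate(p_c_given_w, p_c):
    """Return [P(c|w) + P(c)] / 2: the word-conditioned estimate
    averaged with (i.e. smoothed toward) the class prior."""
    return (p_c_given_w + p_c) / 2.0
```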
In the ACE corpus, some of the <sample> tags don't have corresponding </sample>
tags. Is this a problem?
Don't worry about this -- just make sure that you delete any leftover tags after you have
extracted the text.
I can't figure out how to get the corpus tokenized perfectly -- I keep running into
problems with details. For example, should I keep the headings in, or delete them?
Working with corpora is always messy. You want to get the text to be as "clean" as possible,
but there are always details that are difficult to work out and inconsistencies that seem
impossible to reconcile. In other words, it may not be possible (or worth your time)
to get everything perfectly clean, but as long as you make sensible decisions, you should be okay.