This page answers frequently asked questions (FAQs) for CS224N / Ling 237.
HW4: Parent Annotation and miniTest
17 May 2005
Once you have added parent annotation to your grammar, there are two parses for the test sentence in the miniTest data set that are equally probable given the training sentences. If you take these parses in the default order you will find the exact match, but don't worry if you aren't getting an exact match, because ordering is arbitrary for parses of equal probability.
HW4: parent annotation, speed targets
17 May 2005
Two points about HW4:
Memory issues for HW3
3 May 2005
Various people have had problems with running out of memory when building maximum entropy classifiers, or with the classifiers taking a long time to train. Here are three answers to this problem.
One answer is that training discriminative models simply takes a lot of memory for optimization and takes a long time (you might just be convinced that generative models are great after all!!).
The second answer is to try to avoid storing unnecessary state and doing excess computation when training your models. But if you're doing things correctly (the assignment does urge you to calculate the value and derivative of the likelihood surface efficiently, since that is the inner loop of the optimization process), then things should not be too bad. While the code provided is of a somewhat pedagogical character, and it's written in Java, it's not toy code.
The third answer is to think about your features. The assignment deadline is soon, and you want to get something running by then. Try to define useful, often-applicable features that capture the most important generalities in the data before putting in feature templates that add tens of thousands of features of much more questionable utility. To get you started, here are two good features of an unknown word: "Is it capitalized?" and "Does it end with 's'?". That gives you two φ features, or 90 f features (given 45 Penn tags). A very modest number. In contrast, something like "What is the next word?" would add about 1.3 million f features all by itself. That's already getting you up into a model size that will be difficult to train.
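To make the counting concrete, here is a minimal sketch (the class and method names here are mine, not the starter code's) of the two φ features and the arithmetic behind the 90 f features:

import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: extracting phi features for an unknown word.
// Each phi feature gets paired with every class (e.g., 45 Penn tags),
// so 2 phi features yield 2 * 45 = 90 f features.
public class UnknownWordFeatures {

  // The two phi features suggested above.
  public static List<String> phiFeatures(String word) {
    List<String> features = new ArrayList<String>();
    if (Character.isUpperCase(word.charAt(0))) features.add("CAPITALIZED");
    if (word.endsWith("s")) features.add("ENDS-IN-S");
    return features;
  }

  public static void main(String[] args) {
    int numTags = 45; // Penn Treebank tag set
    int numPhi = 2;   // the two features above
    System.out.println("f features: " + (numPhi * numTags)); // prints 90
    System.out.println(phiFeatures("Whales")); // [CAPITALIZED, ENDS-IN-S]
  }
}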
1 May 2005
            cat: 0    bear: 1
fuzzy: 0    i = 0     i = 1
claws: 1    i = 2     i = 3
small: 2    i = 4     i = 5
big: 3      i = 6     i = 7
medium: 4   i = 8     i = 9
These are the indexes into the weights vector (the λs). And where you see the handout refer to f_i, that's the same index. So what might be a little confusing is that a given f_i (such as f_2) corresponds not to a single observed feature (like claws) but to a <feature, class> pair (like <claws, cat>).
So, for example, in the formulas on p. 3 of the handout, you see things like f_i(c', d). How do you evaluate this? Well, suppose i is 2 and c' is cat. f_2 is supposed to be "on" whenever the class is cat and the feature claws is present, so f_2(cat, d) will be on just in case the datum d has the feature claws.
Now what if c' is bear? In this case, f_2(bear, d) will be off, regardless of whether the datum d has the feature claws or not.
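If it helps to see this in code, here is a tiny sketch (hypothetical names, not the starter code) of f_i as an indicator function over <feature, class> pairs, using the indexing from the table above:

import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of the joint feature f_i(c, d) from the table above:
// i = featureIndex * numClasses + classIndex, so f_2 = <claws, cat>.
public class JointFeature {

  static final List<String> FEATURES =
      Arrays.asList("fuzzy", "claws", "small", "big", "medium");
  static final List<String> CLASSES = Arrays.asList("cat", "bear");

  // Returns 1 if joint feature i fires on class c and datum d, else 0.
  static int f(int i, String c, List<String> d) {
    String feature = FEATURES.get(i / CLASSES.size());
    String clazz = CLASSES.get(i % CLASSES.size());
    return (clazz.equals(c) && d.contains(feature)) ? 1 : 0;
  }

  public static void main(String[] args) {
    List<String> d = Arrays.asList("fuzzy", "claws"); // a datum's features
    System.out.println(f(2, "cat", d));  // 1: class is cat and d has claws
    System.out.println(f(2, "bear", d)); // 0: class is not cat
  }
}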
As the assignment instructions suggest, there is far more data available than is used by default. It's already there in the data/training directory, which contains more sentences than the U.S. tax code. OK, that may or may not be literally true. But there are a lot of sentences, more than a million. You're welcome to use as many as you like, if you can manage to deal with, e.g., memory issues.
I got some questions today regarding formulas found in section 31 of the Kevin Knight tutorial. He begins the section as follows:
P(a | e, f) = P(a, f | e) / P(f | e) = P(a, f | e) / Σ_a P(a, f | e)
Consider the computation of the denominator above. Using Model 1, we can write it as:
Σ_a Π_j t(f_j | e_{a_j})
Knight is being imprecise here, and it's quite confusing. The latter expression is not really equal to the denominator above, because it neglects to take account of the length of the French sentence.
Here's a more precise account. Let's ignore the summation over a for the moment, and just focus on P(a, f | e). Since len(f) is a deterministic function of f, P(a, f | e) = P(a, f, len(f) | e). Now we repeatedly apply the chain rule for conditional probability:
P(a, f, len(f) | e) = P(len(f) | e) × P(a | len(f), e) × P(f | a, len(f), e)
So it's really the product of three probabilities, and Knight kind of ignored the first two, which determine how long the French sentence will be and what the alignment will be. He was really focused on the third piece, and it is true that
P(f | a, len(f), e) = Π_j t(f_j | e_{a_j})
So Knight's slip is very confusing. I think the reason he made it is just that he was focused on something else. The point he is trying to make in this section is something quite different: that the sum and product in his expression can be effectively interchanged (read the section for details), saving a huge number of multiplications.
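To see the savings concretely, here is a small sketch (hypothetical names; it ignores the NULL word and the length/alignment factors discussed above) that computes the naive sum over alignments and the interchanged product of sums, which always agree:

// Hypothetical sketch of Knight's sum/product interchange for Model 1.
// t[j][i] stands in for t(f_j | e_i); e has l words, f has m words.
// For simplicity we ignore the NULL word, so a_j ranges over 0..l-1.
public class Model1Interchange {

  // Naive: sum over all l^m alignments of the product over j.
  static double naive(double[][] t, int l, int m) {
    return naiveHelper(t, l, m, 0, 1.0);
  }

  static double naiveHelper(double[][] t, int l, int m, int j, double prod) {
    if (j == m) return prod;
    double sum = 0.0;
    for (int i = 0; i < l; i++)
      sum += naiveHelper(t, l, m, j + 1, prod * t[j][i]);
    return sum;
  }

  // Interchanged: product over j of the sum over i.
  static double interchanged(double[][] t, int l, int m) {
    double prod = 1.0;
    for (int j = 0; j < m; j++) {
      double sum = 0.0;
      for (int i = 0; i < l; i++) sum += t[j][i];
      prod *= sum;
    }
    return prod;
  }

  public static void main(String[] args) {
    double[][] t = {{0.7, 0.3}, {0.4, 0.6}, {0.5, 0.5}}; // m = 3, l = 2
    System.out.println(naive(t, 2, 3));        // both print the same value
    System.out.println(interchanged(t, 2, 3));
  }
}

The naive version does work proportional to l^m alignments, while the interchanged version does only l × m multiplications and additions.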
A line in the starter code adds the test sentences to the training sentences. Although it is normally a mistake to combine test and training data, combining the test and training sentences is acceptable in this case because the information you are attempting to produce, namely the alignments of the test sentences, is not added to the training data. Adding the test sentences simply allows you to optimize your model parameters with full knowledge of what sort of paired sentences can occur. In general, you can train on test data in an unsupervised learning setting, so long as the test data you are using is, like your training data, unlabeled.
No. You can use Perl, C, C++, or any other widely used programming language. Extra credit if you design a Turing machine to compute your final project. Double extra credit if you build a diesel-powered mechanical computer to compute your final project. Triple extra credit if you build a human-level AI capable of autonomously conceiving, executing, and presenting your final project.[*]
readSpeechNBestLists()
Have you seen an error like this?
Exception in thread "main" java.io.FileNotFoundException: /afs/ir/class/cs224n/corpora/assignment1/wsj_n_bst/4ojc0209.acc (Too many open files)
This is a consequence of a small flaw in the starter code -- it is failing to close file readers when loading the HUB data.
There's a simple fix, which is to add three lines near the end of the readSpeechNBestLists() method. The revised method is shown below, with the only modifications marked:
public static List readSpeechNBestLists(String path, Set vocabulary) throws IOException {
  List speechNBestLists = new ArrayList();
  BufferedReader correctSentenceReader = open(path + "/REF.HUB1");
  Map correctSentenceMap = readCorrectSentences(correctSentenceReader);
  List prefixList = getPrefixes(path);
  for (Iterator it = prefixList.iterator(); it.hasNext();) {
    String prefix = (String) it.next();
    BufferedReader wordReader = open(path + "/" + prefix);
    BufferedReader scoreReader = open(path + "/" + prefix + ".acc");
    List correctSentence = (List) correctSentenceMap.get(prefix);
    SpeechNBestList speechNBestList =
        buildSpeechNBestList(correctSentence, wordReader, scoreReader, vocabulary);
    if (speechNBestList != null)
      speechNBestLists.add(speechNBestList);
    wordReader.close();             // <-- ADD!
    scoreReader.close();            // <-- ADD!
  }
  correctSentenceReader.close();    // <-- ADD!
  return speechNBestLists;
}
A few people have inquired about smoothing and unknown words (or more generally, n-grams). The basic idea of smoothing is to take some probability mass from the words seen during training and reallocate it to words not seen during training. Assume we have decided how much probability mass to reallocate, according to some smoothing scheme. The question is, how do we decide how to allocate this probability mass among unknown words, when we don't even know how many unknown words there are? (No fair peeking at the test data!)
There are multiple approaches, but no perfect solution. (This is an opportunity for you to experiment and innovate.) A straightforward and widely-used approach is to assume a special token *UNK* which represents (an equivalence class of) all unknown words. All of the reallocated probability mass is assigned to this special token, and any unknown word encountered during testing is treated as an instance of this token.
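For example, here is a minimal sketch (a hypothetical class, not the starter code; add-one smoothing is used just to have a concrete reallocation scheme) of the *UNK* approach for a unigram model:

import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of the *UNK* approach for a unigram model.
// Add-one smoothing over the observed vocabulary plus *UNK* reallocates
// mass; any word not seen in training is scored as *UNK* at test time.
public class UnkUnigramModel {
  private final Map<String, Integer> counts = new HashMap<String, Integer>();
  private int total = 0;

  public void train(List<String> words) {
    for (String w : words) {
      Integer c = counts.get(w);
      counts.put(w, c == null ? 1 : c + 1);
      total++;
    }
  }

  public double probability(String word) {
    String key = counts.containsKey(word) ? word : "*UNK*";
    Integer c = counts.get(key);       // null for *UNK* (never observed)
    int count = (c == null) ? 0 : c;
    int vocabSize = counts.size() + 1; // observed words plus *UNK*
    return (count + 1.0) / (total + vocabSize); // add-one smoothing
  }
}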
Another approach is to make the (completely unwarranted) assumption that there is some total vocabulary of fixed size B from which all data (training and test) has been drawn. Assuming a fixed value for B allows you to fix a value for N_0, the number of unknown words, and the reallocated probability mass can then be divided equally (or according to some other scheme) among the N_0 unknown words. The question then arises: how do you choose B (or equivalently, N_0)? There is no principled way to do it, but you might think of B as a hyperparameter to be tuned using the validation data (see M&S p. 207).
Both of these approaches have the shortcoming that they treat all unknown words alike, that is, they will assign the same probability to any unknown word. You might think that it's possible to do better than this. Here are two unknown words: "sneck" and "mmrfio". Intuitively, which should be considered more probable? What kind of knowledge are you applying? Do you think a machine could make the same judgment?
In class, Chris mentioned a paper by Efron & Thisted, Estimating the number of unseen species: How many words did Shakespeare know?, which addresses related issues.
The instructions for HW1 suggest that you might try experimenting with a larger dataset. The primary dataset (treebank-sentences-spoken) contains less than a million words — which is not so large.
A larger dataset is now available for you to play with. The dataset is named bllip-speechified-sentences, and includes about 10 million words of Wall Street Journal text. (So, it's ten times as large as the primary dataset.) The data has been split 80/10/10 into training, validation, and test files, which can now be found in corpora/assignment1/.
If you examine the main method of LanguageModelTester, you will find that it accepts an optional second argument, which is the name of the dataset to be used. Note that if you supply bllip-speechified-sentences as the name of the dataset, it will use that dataset for both training and testing, with the consequence that your test results will not be directly comparable with test results for the primary dataset. However, with a few simple changes to the main method of LanguageModelTester, you can make it train on the larger training set and test on the original test set.
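The exact edit depends on the starter code, but the shape of the change looks something like this (the method and path details below are illustrative placeholders, not the actual starter-code API):

import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch -- readSentences() and the file layout are
// placeholders; adapt to your copy of LanguageModelTester. The point
// is just the shape of the change: train on the large dataset, but
// evaluate on the primary dataset's test set.
public class MixedDatasetExample {

  // Stand-in for however the starter code reads a sentence file.
  static List<List<String>> readSentences(String path) {
    System.out.println("would read sentences from " + path);
    return new ArrayList<List<String>>();
  }

  public static void main(String[] args) {
    String base = "corpora/assignment1";
    // Train on the big dataset (file names elided; check the directory)...
    List<List<String>> train = readSentences(base + "/bllip-speechified-sentences/...");
    // ...but test on the primary dataset, so results stay comparable.
    List<List<String>> test = readSentences(base + "/treebank-sentences-spoken/...");
    // build your language model on train, evaluate on test
  }
}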
One reason it's nice to have more data is that it should make it possible to build more accurate models. Another is that it makes it easier to explore how model performance varies with the amount of training data — in other words, to investigate the learning curve.
Java 1.5 enhancements include generic types, metadata, autoboxing, an enhanced for loop, enumerated types, static import, C-style formatted input/output, variable argument lists, concurrency utilities, and simpler RMI interface generation. Read Sun's Java 1.5 overview for a concise explanation of the new features, with simple examples. For more detail, check out this document.
You don't need to master all the new 1.5 features. The ones to focus on are generic types, autoboxing, and the enhanced for loop. The last two are very straightforward and can be grasped in a few minutes. Generic types can become quite complex, but you don't need to know the intricacies. You can get the basic idea pretty quickly.
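For example, here is a small program that uses all three features at once; the same code in Java 1.4 would need explicit casts, manual Integer wrapping, and an Iterator loop:

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class Java15Demo {
  public static void main(String[] args) {
    // Generic types: the compiler knows this list holds Strings,
    // so no casts are needed when reading elements back out.
    List<String> words = new ArrayList<String>();
    words.add("colorless");
    words.add("green");
    words.add("ideas");

    // Autoboxing: ints are wrapped/unwrapped to Integer automatically.
    Map<String, Integer> counts = new HashMap<String, Integer>();
    for (String word : words) {           // enhanced for loop
      Integer c = counts.get(word);
      counts.put(word, c == null ? 1 : c + 1);
    }
    System.out.println(counts);
  }
}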
The starter code for assignment 1 uses only Java 1.4 code, but the starter code for later assignments will use Java 1.5.
Although Java 1.5 has been out for almost a year, there is still no JDK 1.5 for Macintosh. It turns out that Sun doesn't provide the Macintosh JDK — Apple does, and they haven't been quick about it. (It took Apple 23 months to port Java 1.4 to the Mac a few years ago.) But the good news is that the wait is almost over: according to the rumor mill, the Mac JDK 1.5 will be released concurrently with Mac OS X 10.4 (aka Tiger), which is expected in mid-April.
In the meantime, there's a workaround that I've been using successfully for the last six months to do Java 1.5 on my Mac. At the beginning of last year, Sun published a Java 1.5 prototype called JSR014. It was unsupported software made available to developers to help them get up to speed with, and comment on, the new Java 1.5 language. It is basically a Java 1.5 compiler, written in Java 1.4, which can compile 1.5 source code into 1.4 bytecode. Since it's written in Java 1.4, it will run anywhere JDK 1.4 is installed — like your Mac, for instance. Although it's an unsupported static piece of code, and is based on a beta version of 1.5, I've had very few problems using it. Pretty much, you just write Java 1.5 (and use the 1.5 libraries), and it just works. However, there are no guarantees, and you should be prepared to fall back to using Java 1.5 on another platform if this doesn't work out for you.
sudo mv ~/Desktop/adding_generics-2_4ea/*.jar /Library/Java/Extensions/
sudo chmod +x ~/Desktop/tiger*
sudo mv ~/Desktop/tiger* /usr/bin/
rehash
Now you can compile and run with:
tigerc MyClass.java
tiger MyClass
and you should be good to go!
source="${compile.source}"Add the following attributes to the <javac> element:
fork="yes" executable="/usr/bin/tigerc" memoryMaximumSize="200m"(The last line tells it to use plenty of memory — you can fiddle with the exact value.) Also, search build.xml for the string "1.4", and replace each occurrence with "1.5". Now ant will compile Java 1.5 source files for you.
If you try this, let me know whether it works for you! If you find mistakes in my instructions, please let me know, so that your classmates can benefit from your discovery. And if everything works fine, I'd like to hear about that too.
[*] The maximum amount of extra credit you can earn for any such alternative implementation is 0.03 points.
Site design by Bill MacCartney