CS 224N / Ling 237 — Natural Language Processing

This page answers frequently asked questions (FAQs) for CS224N / Ling 237.

5/16/08		Buggy command lines in PA3 handout?
5/5/08		What was the distribution of grades for PA1?
4/28/08		A minor bug in Hypothesis
4/25/08		To use the decoder in PA2, we have to implement getAlignmetProb which takes an "Alignment". What is its format? How's NULL represented?
4/25/08		Why am I getting an Exception when running `./run-dec` in the starter code?
4/24/08		What kind of numbers should I expect for PA2?
4/24/08		Why is `wordAligner.train` not called in the starter code?
4/24/08		In Model 1 and 2, do we explicitly sum over all possible alignments?
4/21/08		Update on the PA2 decoder
4/21/08		"Results of quiz are not emailed to the instructor"?
4/16/08		The decoder in PA2
4/15/08		Where do I hand in my report late?
4/14/08		Where's mini.sent.txt for PA1?
4/11/08		Why is there last year's FAQ up on the site? What kind of numbers should I be getting?
4/3/08		How do I get past the JAVA_HOME errors when I try to compile on Hedge?
4/13/07		Do I have to create my validation set for PA1?
4/12/07		Increasing perplexity with more training data?
4/12/07		Do we smooth at train time or test time?
4/12/07		How do I use stop tokens in my n-gram model?
4/12/07		Does PA1 require a strict proof?
4/12/07		Smoothing implementation details
4/12/07		Smoothing and conditional probabilities
4/12/07		Smoothing and unknown words
4/9/07		What numbers should I be getting for PA1?
4/9/07		Having trouble compiling/running PA1?
4/1/07		Do I have to do my final project in Java?

Buggy command lines in PA3 handout?

16 May 2008

As some students have noticed, the example command lines in PA3 are full of bugs, some of them legacy (such as missing the "ir/" or having three 2's in "224n"), and some of them new (such as where we tell you to append "genia" et al. to the path information). The command lines have all been corrected in the PA3 handout, which is available from the Assignments tab.

What was the distribution of grades for PA1?

5 May 2008

Mean: 8.38
Standard deviation: 1.80
Mode: 8

A minor bug in Hypothesis

28 April 2008

There's actually a minor bug in Hypothesis.java Line 121 (the assertion statement). Instead of

assert(targetSentence.size() != newTargetSentence.size());

It really should be:

assert(targetSentence.size() == newTargetSentence.size());

This shouldn't affect the way your code works. Previously we said the code works with the Sun JDK; that's probably because the Sun JDK doesn't check assertions by default, while gcj does. I will change this in the starter code. But don't panic: whatever works for you should still work.

To use the decoder in PA2, we have to implement getAlignmetProb which takes an "Alignment". What is its format? How's NULL represented?

25 April 2008

In Model 1 and 2, each foreign word has to align to one English word (or NULL). In the internal representation of Alignment, the indices of sentences start from 0. If a foreign word is aligned to NULL, the index for the English word (NULL) will be -1.
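To make the indexing convention concrete, here is a minimal sketch. The class `AlignmentSketch` and its method names are hypothetical, written only for illustration; they are not the starter code's Alignment class, whose actual methods you should check in the source. The sketch assumes one English index per foreign position, 0-based, with -1 standing for NULL.

```java
import java.util.Arrays;

/** Hypothetical stand-in for illustration only; NOT the starter code's Alignment class. */
class AlignmentSketch {
    /** Index used for the English side when a foreign word is aligned to NULL. */
    static final int NULL_INDEX = -1;

    // englishIndexFor[j] is the 0-based English position aligned to foreign position j,
    // or NULL_INDEX if foreign word j is aligned to NULL.
    private final int[] englishIndexFor;

    AlignmentSketch(int foreignLength) {
        englishIndexFor = new int[foreignLength];
        // Until told otherwise, treat every foreign word as aligned to NULL.
        Arrays.fill(englishIndexFor, NULL_INDEX);
    }

    void align(int foreignIndex, int englishIndex) {
        englishIndexFor[foreignIndex] = englishIndex;
    }

    int getEnglishIndex(int foreignIndex) {
        return englishIndexFor[foreignIndex];
    }

    boolean isAlignedToNull(int foreignIndex) {
        return englishIndexFor[foreignIndex] == NULL_INDEX;
    }
}
```

With this convention, any code that looks up the English word for a foreign position has to test for -1 before indexing into the English sentence.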

Why am I getting an Exception when running `./run-dec` in the starter code?

25 April 2008

If you're on bramble (or hedge), you should check whether your "java" points to the Sun JDK one. If not, you should do this:

bramble13:~> setenv JAVA_HOME /usr/lib/j2sdk1.5-sun
bramble13:~> setenv PATH /usr/lib/j2sdk1.5-sun/bin:$PATH

(Thanks Tague for noticing the problem and providing the commands)

What kind of numbers should I expect for PA2?

24 April 2008

Here are some AER numbers to give you some general idea of what you might get from PA2:

PMI: 0.79
Model1: 0.58
Model2: 0.45

Why is `wordAligner.train` not called in the starter code?

24 April 2008

You'll have to implement the "train" method, and it should be called in the WordAlignmentTester class. I just added one line in the WordAligner class in the starter code. If your starter code doesn't have it, add it in yourself. It's just one line:

wordAligner.train(trainingSentencePairs);

In Model 1 and 2, do we explicitly sum over all possible alignments?

24 April 2008

Many people were confused by the small examples in the J&M reader and the Knight workbook. The workbook did mention that "It is not hard to figure out how to collect fractional counts during EM training efficiently as well." But it didn't say how.

For Model 1 and 2, you don't have to explicitly sum over all possible alignments. In Chris's lecture notes on April 5, there were a few slides that talked about how this could be done efficiently. (There were also some discussions on the newsgroup.)
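The reason no explicit sum is needed is that, in Model 1, the posterior over alignments factorizes by foreign position: the probability that foreign word f_j aligns to English word e_i is t(f_j | e_i) divided by the sum of t(f_j | e_i') over all English positions i' (including NULL). Fractional counts can therefore be collected word pair by word pair. The following is only a rough sketch of that idea, not the reference solution: the class name `Model1Sketch`, the "<NULL>" marker, and the nested-map representation are all assumptions made for illustration. (For Model 2, each term would additionally be multiplied by the position-dependent alignment probability before normalizing.)

```java
import java.util.HashMap;
import java.util.Map;

/** Illustrative sketch of Model 1 EM; names and data layout are hypothetical. */
class Model1Sketch {
    static final String NULL_WORD = "<NULL>";

    /** Returns t where t.get(e).get(f) estimates t(f | e). */
    static Map<String, Map<String, Double>> train(String[][] english, String[][] foreign, int iterations) {
        Map<String, Map<String, Double>> t = new HashMap<>();
        // Initialise every co-occurring pair to the same positive value.
        // The E-step normalises per foreign word, so the constant does not matter.
        for (int s = 0; s < english.length; s++) {
            for (String e : withNull(english[s])) {
                for (String f : foreign[s]) {
                    t.computeIfAbsent(e, k -> new HashMap<>()).put(f, 1.0);
                }
            }
        }
        for (int it = 0; it < iterations; it++) {
            // E-step: collect fractional counts in O(l * m) per sentence pair.
            Map<String, Map<String, Double>> counts = new HashMap<>();
            for (int s = 0; s < english.length; s++) {
                String[] es = withNull(english[s]);
                for (String f : foreign[s]) {
                    double z = 0.0;
                    for (String e : es) {
                        z += t.get(e).get(f);
                    }
                    for (String e : es) {
                        double posterior = t.get(e).get(f) / z;
                        counts.computeIfAbsent(e, k -> new HashMap<>()).merge(f, posterior, Double::sum);
                    }
                }
            }
            // M-step: renormalise the fractional counts for each English word.
            for (Map.Entry<String, Map<String, Double>> entry : counts.entrySet()) {
                double total = 0.0;
                for (double c : entry.getValue().values()) {
                    total += c;
                }
                for (Map.Entry<String, Double> fc : entry.getValue().entrySet()) {
                    t.get(entry.getKey()).put(fc.getKey(), fc.getValue() / total);
                }
            }
        }
        return t;
    }

    /** Prepends the NULL word so that foreign words can align to nothing. */
    static String[] withNull(String[] english) {
        String[] out = new String[english.length + 1];
        out[0] = NULL_WORD;
        System.arraycopy(english, 0, out, 1, english.length);
        return out;
    }
}
```

The inner loops touch each (English word, foreign word) pair in a sentence pair once per iteration, instead of enumerating every one of the (l+1)^m possible alignments.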

Update on the PA2 decoder

21 April 2008

We've updated some of the starter code of PA2. For the code update and how to use the decoder, see:

http://www.stanford.edu/class/cs224n/pa/pa2-dec.pdf

"Quiz results not emailed to the instructor"?

21 April 2008

We've been getting a lot of questions about what this message printed out by the quiz software means. It means only that, when you take the quiz, the results are saved on the server instead of being emailed to the instructor every time. Also, while you can "retake" the quiz to see the questions and answers again, we only save the results from the first time you take the quiz. Thus clicking through the quiz again does not change your score or the timestamp on it.

The decoder in PA2

16 April 2008

We've handed out PA2 today. Some details to notice:

1. Implement your aligner and see if it improves the AER. For now, don't panic if your aligner doesn't work sensibly on the decoding results. We will make a few changes to the decoder within a week. (This will not affect your progress on this programming assignment, since you can plug in your aligner to run the decoder at the end.)

2. In the handouts we distributed in class today, it says:

"Additionally, you must add one more method to your LanguageModel: getUnigramProbability(String word) which will return the unigram probability of the passed in word."

This is actually not necessary anymore. I updated the 2 files in: /afs/ir/class/cs224n/pa2/java/src/cs224n/langmodel/. So, if you copied the starter code before 3pm this afternoon, you should comment out getUnigramProbability.

Where do I hand in my report late?

15 April 2008

There will be a submission box outside of Professor Manning's office at Gates 158. For code submitted before midnight, the writeup is due in the late box by 10 AM the following day. Please write the date and time of submission on your report and sign it before placing it in the box.

Where's mini.sent.txt for PA1?

14 April 2008

The handout of PA1 mentioned there is a mini.sent.txt. We changed the dataset this year and we didn't make a mini dataset. If you need one, you can take some lines from the existing data. This is just to speed things up while you develop your code. As mentioned in the handout, the results obtained on 10 lines of data are meaningless.

Last year's FAQ; What kind of scores should I be getting?

11 April 2008

We've kept last year's FAQ entries up on the site as they are mostly relevant to this year's assignment. Any perplexity numbers or such listed in the old FAQs should be disregarded, however, as those numbers have been updated in the current PA1 handout. Below are some exact numbers from reference models run on this year's data:

                           Unigram     Bigram       Trigram      Linearly-interpolated Bigram   Linearly-interpolated Trigram
Training set perplexity:   849.9011    71.8341      9.853        91.9442                        21.9336
Test set perplexity:       817.5552    435.7855     2429.7497    228.0201                       195.6814
HUB perplexity:            3070.1622   10366.2218   33935.5494   2258.6445                      2092.19
HUB Word Error Rate:       0.0986      0.0757       0.0872       0.078                          0.0734

Setting JAVA_HOME on Hedge

3 April 2008

Run "setenv JAVA_HOME /usr/lib/j2sdk1.5-sun/" and ant should work on hedge.

Using a validation set for PA1

13 April 2007

You do not have to create your own. Uncomment lines 227-228 in LanguageModelTester.java (after loading the training sentences, but before loading the test pairs) and you will then have a validation set to use.

You can use this set to choose parameters, such as the lambdas for linear interpolation (trying out hand-picked values is fine).
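Trying hand-picked values can be as simple as the loop sketched below. This is only an illustration under assumptions: the class `LambdaSearchSketch` and its methods are hypothetical, and the two input arrays are assumed to hold the probabilities that your bigram and unigram models assign to each token of the validation set (in the same order). It interpolates the two with weight lambda and keeps the candidate with the lowest validation perplexity.

```java
/** Illustrative sketch; class and method names are hypothetical. */
class LambdaSearchSketch {
    /** Perplexity of lambda * bigram + (1 - lambda) * unigram over the validation tokens. */
    static double perplexity(double[] bigramProbs, double[] unigramProbs, double lambda) {
        double logSum = 0.0;
        for (int i = 0; i < bigramProbs.length; i++) {
            double p = lambda * bigramProbs[i] + (1.0 - lambda) * unigramProbs[i];
            logSum += Math.log(p) / Math.log(2.0); // log base 2
        }
        return Math.pow(2.0, -logSum / bigramProbs.length);
    }

    /** Returns the candidate lambda with the lowest validation perplexity. */
    static double bestLambda(double[] bigramProbs, double[] unigramProbs, double[] candidates) {
        double best = candidates[0];
        double bestPerplexity = Double.POSITIVE_INFINITY;
        for (double lambda : candidates) {
            double pp = perplexity(bigramProbs, unigramProbs, lambda);
            if (pp < bestPerplexity) {
                bestPerplexity = pp;
                best = lambda;
            }
        }
        return best;
    }
}
```

Note that a candidate which gives any validation token probability zero ends up with infinite perplexity and is never selected, which is one reason an unsmoothed bigram model on its own (lambda = 1) tends to lose.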

Increasing perplexity with more training data?

12 April 2007

Some students who have been trying to investigate learning curves have reported seeing test-set perplexity increase as the amount of training data grows. This is counter-intuitive: shouldn't more training data yield a better model, which is therefore able to attain lower perplexity? Chris came up with a possible explanation involving the handling of the <UNK> token. Remember that the <UNK> token actually represents an equivalence class of tokens. As more training data is added, this equivalence class shrinks. Because the meaning of the <UNK> token is changing, model perplexities are not directly comparable. Especially when the amount of training data is small, adding more data will rapidly lower the model probability of <UNK>, causing the entropy and perplexity of the model distribution to grow.

If you've been looking at learning curves, an interesting investigation — not specifically required for the assignment — would be to measure the learning curve while holding the definition of <UNK> constant. This would mean allowing the models trained on small data sets to "know about" all the words in the largest training set. All known words in this sense would get explicit counts, which could be 0, and then you'd still have an <UNK> token representing all words which did not appear in even the largest training set.

Do we smooth at train time or test time?

12 April 2007

Generally speaking, smoothing should be the last step of training a model: first you collect counts from your training data, and then you compute a smoothed model distribution which can be applied to (that is, used to make predictions about) any test data.

In principle, it would be possible to postpone the computation of a smoothed probability until test time. But (a) it's not very efficient, because most smoothing algorithms require iterating through all the training data, which you shouldn't have to do more than once, and (b) if you want to do this because your smoothing computation depends upon something in the test data, then you're doing things wrong. (For example, model probabilities should not depend on how many unknown words appear in the test data.)

How do I use stop tokens in my n-gram model?

12 April 2007

Real sentences are not infinite; they begin and end. To capture this in your n-gram model, you'll want to use so-called "stop" tokens, which are just arbitrary markers indicating the beginning and end of the sentence.

It's typically done as follows. Let <s> and </s> (or whatever) be arbitrary tokens indicating the start and end of a sentence, respectively. During training, wrap these tokens around each sentence before counting n-grams. So, if you're building a bigram model, and the sentence is

I like fish tacos

you'll change this to

<s> I like fish tacos </s>

and you'll collect counts for 5 bigrams, starting with (<s>, I) and ending with (tacos, </s>). If you encountered the same sentence during testing, you'd predict its probability as follows:

P(<s> I like fish tacos </s>) = P(<s>) · P(I | <s>) · ... · P(tacos | fish) · P(</s> | tacos)

where P(<s>) = 1. (After all, the sentence must begin.)
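The wrapping and counting step described above might look roughly like this for a bigram model. It is a sketch only: the class `StopTokenSketch`, its method names, and the choice to key each bigram by a space-joined string are assumptions for illustration, not part of the starter code.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Illustrative sketch; class and method names are hypothetical. */
class StopTokenSketch {
    static final String START = "<s>";
    static final String STOP = "</s>";

    /** Returns a copy of the sentence with the start and stop tokens wrapped around it. */
    static List<String> wrap(List<String> sentence) {
        List<String> wrapped = new ArrayList<>();
        wrapped.add(START);
        wrapped.addAll(sentence);
        wrapped.add(STOP);
        return wrapped;
    }

    /** Counts the bigrams of one wrapped sentence, keyed as "previous current". */
    static Map<String, Integer> countBigrams(List<String> sentence) {
        List<String> wrapped = wrap(sentence);
        Map<String, Integer> counts = new HashMap<>();
        for (int i = 1; i < wrapped.size(); i++) {
            String bigram = wrapped.get(i - 1) + " " + wrapped.get(i);
            counts.merge(bigram, 1, Integer::sum);
        }
        return counts;
    }
}
```

For the four-word sentence "I like fish tacos" this yields the 5 bigrams mentioned above, from "<s> I" through "tacos </s>".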

Does PA1 require a strict proof?

12 April 2007

Q. Is it necessary to give a strict mathematical proof that the smoothing we've done yields a proper probability distribution? Or is it enough to just give a brief explanation?

A. You should give a concise, rigorous proof. No hand-waving. I'll show an example on Friday. Note that it's important that your proof applies to your actual implementation, not some ideal abstraction.

Smoothing implementation details

12 April 2007

Do you have questions regarding details of various smoothing methods? (For example, maybe you're wondering how to compute those alphas for Katz back-off smoothing.)

You might benefit from looking at a smoothing tutorial Bill put together last year.

For greater detail, an excellent source is the Chen & Goodman paper, An empirical study of smoothing techniques for language modeling.

Smoothing and conditional probabilities

12 April 2007

Some people have the wrong idea about how to combine smoothing with conditional probability distributions. You know that a conditional distribution can be computed as the ratio of a joint distribution and a marginal distribution:

P(x | y) = P(x, y) / P(y)

What if you want to use smoothing? The wrong way to compute the smoothed conditional probability distribution P(x | y) would be:

1. From the joint P(x, y), compute a smoothed joint P'(x, y).
2. Separately, from the marginal P(y), compute a smoothed marginal P''(y).
3. Divide them: let P'''(x | y) = P'(x, y) / P''(y).

The problem is that steps 1 and 2 do smoothing separately, so it makes no sense to divide the results. (In fact, doing this might even yield "probabilities" greater than 1.) The right way to compute the smoothed conditional probability distribution P(x | y) is:

1. From the joint P(x, y), compute a smoothed joint P'(x, y).
2. From the smoothed joint P'(x, y), compute a smoothed marginal P'(y).
3. Divide them: let P'(x | y) = P'(x, y) / P'(y).

Here, there is only one smoothing operation. We compute a smoothed joint distribution, and compute everything else from that.
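As a rough illustration of the right way, the sketch below smooths a small table of joint counts once and derives everything else from the result. Add-one smoothing is used here purely because it is short; it is a stand-in, not a recommendation for PA1. The class `JointSmoothingSketch` and the counts[y][x] layout are assumptions made for the example.

```java
/** Illustrative sketch; class name and array layout are hypothetical. */
class JointSmoothingSketch {
    /**
     * counts[y][x] is the number of times the pair (x, y) was observed.
     * Returns cond[y][x] = P'(x | y), computed from a single smoothed joint.
     */
    static double[][] smoothedConditional(int[][] counts) {
        int numY = counts.length;
        int numX = counts[0].length;

        double total = 0.0;
        for (int[] row : counts) {
            for (int c : row) {
                total += c;
            }
        }

        // Step 1: the only smoothing operation, applied to the joint (add-one here).
        double[][] joint = new double[numY][numX];
        for (int y = 0; y < numY; y++) {
            for (int x = 0; x < numX; x++) {
                joint[y][x] = (counts[y][x] + 1.0) / (total + (double) numX * numY);
            }
        }

        double[][] cond = new double[numY][numX];
        for (int y = 0; y < numY; y++) {
            // Step 2: the marginal P'(y) is summed from the smoothed joint, not smoothed separately.
            double marginal = 0.0;
            for (int x = 0; x < numX; x++) {
                marginal += joint[y][x];
            }
            // Step 3: divide.
            for (int x = 0; x < numX; x++) {
                cond[y][x] = joint[y][x] / marginal;
            }
        }
        return cond;
    }
}
```

Because the marginal is obtained by summing the same smoothed joint that appears in the numerator, each row of the result sums to 1 by construction, which is exactly the property the wrong way can violate.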

If there's interest, I can show a worked-out example of this in Friday's section.

(It would also be correct to compute all the conditional distributions before doing any smoothing, and then to smooth each conditional distribution separately. This is a valid alternative to smoothing the joint distribution, and because it's simple to implement, this is often the approach used in practice. However, the results might not be as good, because less information is used in computing each smoothing function. This would be an interesting question to investigate in your PA1 submission.)

Smoothing and unknown words

12 April 2007

A few people have inquired about smoothing and unknown words (or more generally, n-grams). The basic idea of smoothing is to take some probability mass from the words seen during training and reallocate it to words not seen during training. Assume we have decided how much probability mass to reallocate, according to some smoothing scheme. The question is, how do we decide how to allocate this probability mass among unknown words, when we don't even know how many unknown words there are? (No fair peeking at the test data!)

There are multiple approaches, but no perfect solution. (This is an opportunity for you to experiment and innovate.) A straightforward and widely-used approach is to assume a special token <UNK> which represents (an equivalence class of) all unknown words. All of the reallocated probability mass is assigned to this special token, and any unknown word encountered during testing is treated as an instance of this token.
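A minimal sketch of the <UNK> approach for a unigram model follows. It assumes add-one smoothing with one extra vocabulary slot reserved for <UNK>, which is just one simple way to decide how much mass to reallocate; the class `UnkUnigramSketch` and its method are hypothetical names used only for illustration.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Illustrative sketch; class and method names are hypothetical. */
class UnkUnigramSketch {
    private final Map<String, Integer> counts = new HashMap<>();
    private int total = 0;

    UnkUnigramSketch(List<String> trainingTokens) {
        for (String w : trainingTokens) {
            counts.merge(w, 1, Integer::sum);
            total++;
        }
    }

    /**
     * Add-one probability over the known vocabulary plus one slot for <UNK>.
     * Any word not seen in training has count 0 and so receives the <UNK> probability.
     */
    double probability(String word) {
        int c = counts.getOrDefault(word, 0);
        return (c + 1.0) / (total + counts.size() + 1.0);
    }
}
```

Every unknown word gets the same probability here, and the known words plus the single <UNK> slot sum to 1, so the distribution is proper as long as all unknown words are treated as one token.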

Another approach is to make the (completely unwarranted) assumption that there is some total vocabulary of fixed size B from which all data (training and test) has been drawn. Assuming a fixed value for B allows you to fix a value for N₀, the number of unknown words, and the reallocated probability mass can then be divided equally (or according to some other scheme) among the N₀ unknown words. The question then arises: how do you choose B (or equivalently, N₀)? There is no principled way to do it, but you might think of B as a hyperparameter to be tuned using the validation data (see M&S p. 207).

Both of these approaches have the shortcoming that they treat all unknown words alike, that is, they will assign the same probability to any unknown word. You might think that it's possible to do better than this. Here are two unknown words you might encounter, say, on the internet: "flavodoxin" and "B000EQHXQY". Intuitively, which should be considered more probable? What kind of knowledge are you applying? Do you think a machine could make the same judgment?

A paper by Efron & Thisted, Estimating the number of unseen species: How many words did Shakespeare know?, addresses related issues.

PA1 perplexity and WER

9 April 2007

Here are some estimates for what kind of numbers you should be seeing as you do PA1:

Model      Test Set Perplexity   HUB WER

Unigram    1300                  .09
Bigram     1200                  .075
Trigram    300                   .06

For the bigram and trigram models, you should be able to get your training set perplexity below 100. Of course, these numbers are just guidelines to help you check whether you are on track. Try your best to get them as low as possible!

Compiling and Running PA1

9 April 2007

Follow the instructions on the first page of the PA1 handout. On the second page, the instruction to "cd cs224n/java" is incorrect - there is no such directory (the handout has now been fixed).

You can run ant in one of two ways. From the directory ~/cs224n/pa1/java/ you can type "./ant" to use the symbolic link, or you can add the cs224n bin directory to your path variable: "setenv PATH ${PATH}:/afs/ir/class/cs224n/bin/apache-ant-1.6.2/bin/". Once you change the PATH variable, you can invoke ant by simply typing "ant" (note, however, that to compile the files you should still make sure you are in directory ~/cs224n/pa1/java/ so it can find build.xml).

ant will compile the .java files in the ~/cs224n/pa1/java/src folder and put the resulting .class files in the ~/cs224n/pa1/java/classes folder. To run these classes, you must make sure java knows to look in the classes folder:

Make sure you are in ~/cs224n/pa1/java/. Check to see whether you have anything in the CLASSPATH variable by typing "printenv CLASSPATH". If nothing was printed out, you can set the CLASSPATH by simply typing "setenv CLASSPATH ./classes". If something was printed, type "setenv CLASSPATH ${CLASSPATH}:./classes". This will preserve whatever was already stored in the variable.

You should now be able to run the class files. From ~/cs224n/pa1/java/ try typing "java cs224n.assignments.LanguageModelTester" or try out the included shell script "./run".

Do I have to do my final project in Java?

1 April 2007

No. You can use Perl, C, C++, or any other widely used programming language. Extra credit if you design a Turing machine to compute your final project. Double extra credit if you build a diesel-powered mechanical computer to compute your final project. Triple extra credit if you build a human-level AI capable of autonomously conceiving, executing, and presenting your final project.

Site design by Bill MacCartney
