This page answers frequently asked questions (FAQs) for CS224N / Ling 237.

5/17/05 HW4: Parent Annotation and miniTest
5/17/05 HW4: parent annotation, speed targets
5/3/05 Memory issues for HW3
5/1/05 "stepSize underflow" on HW3
5/1/05 More questions on HW3
4/29/05 Typo in HW3 formula for G(λ)
4/26/05 IndexLinearizer, features, etc.
4/19/05 More data for HW2?
4/19/05 Questions on Knight, section 31
4/17/05 Starter code combines test and training sentences
4/13/05 Do I have to do my final project in Java?
4/9/05 Closing streams in readSpeechNBestLists()
4/5/05 Smoothing and unknown words
4/2/05 Larger dataset for HW1
3/30/05 How is Java 1.5 different from 1.4?
3/30/05 Using Java 1.5 on the Mac


HW4: Parent Annotation and miniTest

17 May 2005

Once you have added parent annotation to your grammar, there are two parses of the test sentence in the miniTest data set that are equally probable given the training sentences. If you take these parses in the default order, you will find the exact match, but don't worry if you aren't getting an exact match, because the ordering is arbitrary for parses of equal probability.



HW4: parent annotation, speed targets

17 May 2005

Two points about HW4:

  1. We recommend that you apply parent annotation only to phrasal categories, and not to POS tags (preterminals). Parent annotation of POS tags can be useful as well, but you need to do some fancier smoothing to get it to work well, and leaving it out will keep your grammar more compact. (A small sketch follows this list.)
  2. You should interpret "5 seconds per sentence" as a guideline and a target, not as a strict requirement. (As in other assignments, we're more interested in your efforts to achieve good performance than in the performance itself.) That being said, 5 seconds is indeed quite achievable, despite what you may be starting to think! For comparison, the Stanford Parser, which is also an exhaustive generalized CKY parser, implemented in a Java framework very similar to the one you're using, can parse a 20-word sentence in about 2 seconds, and a 10-word sentence in about 1 second. Admittedly, that parser has been carefully tuned for speed, but it shouldn't be too hard to get performance within some small factor of that. However, you needn't worry that you'll be marked down if you clock in at 11 seconds rather than 5, provided that your investigation is otherwise well planned, executed, and documented.
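
Here is the promised sketch of phrasal-only parent annotation (point 1). It uses a toy Tree class defined inside the sketch; the Tree class in the starter code has its own API, so treat this purely as an illustration of the idea.

    import java.util.List;

    class ParentAnnotationSketch {

      // A toy tree: 'children' is empty for leaves (words).
      static class Tree {
        String label;
        List<Tree> children;
        Tree(String label, List<Tree> children) {
          this.label = label;
          this.children = children;
        }
        boolean isLeaf() { return children.isEmpty(); }
        boolean isPreTerminal() { return children.size() == 1 && children.get(0).isLeaf(); }
      }

      // Annotate phrasal nodes with the (unannotated) label of their parent,
      // leaving POS tags and words untouched, then recurse.
      static void parentAnnotate(Tree node, String parentLabel) {
        if (node.isLeaf()) return;
        String originalLabel = node.label;
        if (!node.isPreTerminal() && parentLabel != null) {
          node.label = node.label + "^" + parentLabel;   // e.g. NP under S becomes NP^S
        }
        for (Tree child : node.children) {
          parentAnnotate(child, originalLabel);          // pass down the unannotated category
        }
      }
    }

Calling parentAnnotate(root, null) leaves the root and all POS tags unannotated, and turns, for example, an NP under S into NP^S.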


Memory issues for HW3

3 May 2005

Various people have had problems with maximum entropy classifiers running out of memory during training, or simply taking a very long time to train.

Here are three answers to this problem.

One is that training discriminative models takes a lot of memory for optimization and takes a long time (you might just be convinced that generative models are great after all!!). Remember the -Xmx flag. Try to find a machine with lots of memory (the elaines do have 2Gb of memory, but if many people are using one, they may not have 2Gb for you). You could also try tree or tree2 (Solaris) or vine (Linux). Try leaving your model to train overnight, or leaving several models to train overnight by writing a shell script. Don't try to start 6 classifiers training at once if each needs a lot of memory (the computer will either run out of swap space or swap like crazy, and neither result will get your assignment done).

The second answer is to try to avoid storing unnecessary stuff and doing a lot of excess computation in training your models. But if you're doing things correctly (the assignment does urge you to calculate the value and derivative of the likelihood surface efficiently, since that is the inner loop of the optimization process), then things should not be too bad. While the code provided is of a somewhat pedagogical character, and it's written in Java, it's not toy code. All the stuff with Index and IndexLinearizer is there precisely to encode features efficiently as integer indexes, so you shouldn't need to intern strings, etc. - they should be gone by the time optimization starts. Similarly, the provided L-BFGS quasi-Newton optimizer is basically the state of the art for solving large maxent optimization problems. It's not a toy. You should be able to train models with lots of features.

The third answer is to think about your features. The assignment deadline is soon, and you want to get something running by then. Try to define useful, often applicable features that capture the most important data generalities, before putting in feature templates that put in tens of thousands of features of much more questionable utility. To get you started, here are two good features of an unknown word: "Is it capitalized?" and "Does it end with 's'?". That gives you two φ features, or 90 f features (given 45 Penn tags). A very modest number. In contrast, something like "What is the next word?" would add in about 1.3 million f features all by itself. That's already getting you up into the size of model that it will be difficult to train.
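
To make those two unknown-word features concrete, here is a small sketch. The class and method names are made up for illustration; pairing each φ feature with the 45 candidate tags (to produce the f features) happens separately, e.g. via the IndexLinearizer.

    import java.util.ArrayList;
    import java.util.List;

    class UnknownWordFeatures {

      // The observable phi features: functions of the datum (here, the word) only.
      static boolean isCapitalized(String word) {
        return word.length() > 0 && Character.isUpperCase(word.charAt(0));
      }

      static boolean endsWithS(String word) {
        return word.endsWith("s");
      }

      // Each phi feature that fires gets paired with every candidate tag to form
      // the f features: 2 phi features x 45 tags = 90 f features, as in the text.
      static List<String> activeFeatures(String word) {
        List<String> features = new ArrayList<String>();
        if (isCapitalized(word)) features.add("CAPITALIZED");
        if (endsWithS(word)) features.add("ENDS_WITH_S");
        return features;
      }
    }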



"stepSize underflow" on HW3

1 May 2005

Have you seen an error saying "BacktrackingSearcher.minimize: stepSize underflow."? Here's what it means. (This description should make sense if you've seen line search minimization before.) The minimizer determines a direction to walk using the partial derivative information you supply, and then does a line search in that direction. It heads out a certain distance in that direction; if the function's value at the new point isn't sufficiently below the previous value, relative to the step size and the derivative calculated at the previous point, it goes into a loop where it repeatedly shortens the step until the decrease is sufficient. You get the "stepSize underflow" message when the minimizer decides that there is no distance it can walk in the indicated direction which will lower the function value (as you want to do in minimization). In all cases of which we are aware, this doesn't indicate a problem in the optimization code, but rather that the function value or derivative information you are supplying is wrong. For instance, if the derivatives say that a certain direction is downhill, but it is actually uphill, and that direction is chosen, then you'll get this message.
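
If it helps, here is a stripped-down sketch of the backtracking loop just described. It is not the actual BacktrackingSearcher code (the interface, constants, and names are invented), but it shows where the underflow check arises.

    interface DifferentiableFunction {
      double value(double[] x);
      double[] gradient(double[] x);
    }

    class LineSearchSketch {
      static double[] backtrack(DifferentiableFunction f, double[] x, double[] direction) {
        double stepSize = 1.0;
        double fx = f.value(x);
        double[] grad = f.gradient(x);
        double slope = 0.0;
        for (int i = 0; i < x.length; i++) slope += grad[i] * direction[i];  // directional derivative
        while (true) {
          double[] candidate = new double[x.length];
          for (int i = 0; i < x.length; i++) candidate[i] = x[i] + stepSize * direction[i];
          // Accept the step only if the decrease is sufficient relative to the
          // step size and the derivative at the previous point:
          if (f.value(candidate) <= fx + 1e-4 * stepSize * slope) return candidate;
          stepSize *= 0.5;
          if (stepSize < 1e-20) {
            // If the value/derivative information supplied is inconsistent (e.g. the
            // derivatives claim 'direction' is downhill when it is actually uphill),
            // no step size will ever pass the test, and you get this error:
            throw new RuntimeException("BacktrackingSearcher.minimize: stepSize underflow.");
          }
        }
      }
    }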



More questions on HW3

1 May 2005

There is still some confusion about the way fi(c, d) is defined, and about which calculations depend on the true class c of a datum and which use only some candidate class c′, and so on. A previous FAQ was intended to shed some light on this, but I guess further elucidation is needed. Let me make several points.

  • Each datum has a bundle of observable attributes represented by d. We can define a set of observable feature functions φj(d) which gives us access to these attributes. In the miniTest example, we could define five such functions:
    φ0(d) ≡ d includes fuzzy
    φ1(d) ≡ d includes claws
    etc.
  • Each datum also has a class (or label) represented by c. In some contexts, c is available; in others it is not. In some computations, c is relevant; in others it is not.
  • The observable feature functions φj are only a subset of all possible feature functions. In general, maximum entropy models use features fi(c, d) which can be any function of a class c and observable attributes d. This gives maximum entropy models great flexibility and power.
  • However, in many contexts — including assignment 3 — we simplify things somewhat by restricting ourselves to using only feature functions fi having the form:
    fi(c, d) ≡ (c = ci) ∧ φji(d)
    In other words, each fi just says that the datum belongs to some class ci and has some observable feature φji, where ci and ji are functions of i. (See slide 6 in the MaxEnt lecture slides.) For example, in the miniTest:
    f0(c, d) ≡ c is cat ∧ d includes fuzzy
    f1(c, d) ≡ c is bear ∧ d includes fuzzy
    f2(c, d) ≡ c is cat ∧ d includes claws
    etc.
  • Whether or not we know the true class c of a datum, we can compute the value of a feature function fi for any candidate class c′. The computation of fi(c′, d) does not depend on the true class c, and can be performed whether or not c is available.
  • The computation of the probability, according to a model, that a datum with observable attributes d belongs to candidate class c′ does not depend on the datum's true class c. Indeed, it must not; otherwise our model would be useless for classifying new data.
  • The computation of the likelihood, according to a model, of a set of fully observed training data consisting of pairs <c, d> depends on the probabilities that the model assigns to the true class c of each datum.
  • The computation of the derivative of the objective function with respect to weight λi depends on the difference, for each training datum <c, d>, between the true count of feature function fi(c, d) (which depends on the true class c) and the expected count, under the model, of fi(c′, d) (which does not).
  • Computing the expectation of a function of a variable c′ normally involves summing over all possible values of c′. However, in computing the expectation of fi(c′, d), we can determine in advance that all terms of the summation except one must be zero. (Why?) Therefore there is no need to compute the summation explicitly.
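
To make the last two bullets concrete, here is a small sketch of computing the class probabilities and the (collapsed) expected count of a feature fi, assuming the restricted feature form above and the <feature, class> linearization described in the IndexLinearizer entry further down this page. All names here are illustrative, not part of the starter code.

    class ExpectedCountSketch {

      // P(c' | d) for every candidate class, from the linear scores. 'lambdas' is
      // indexed like the f features; phi[j] says whether observable feature j is
      // present in the datum d.
      static double[] classProbabilities(double[] lambdas, boolean[] phi, int numClasses) {
        double[] scores = new double[numClasses];
        for (int c = 0; c < numClasses; c++) {
          for (int j = 0; j < phi.length; j++) {
            if (phi[j]) scores[c] += lambdas[j * numClasses + c];   // f_{j,c} fires
          }
        }
        double total = 0.0;
        double[] probs = new double[numClasses];
        for (int c = 0; c < numClasses; c++) { probs[c] = Math.exp(scores[c]); total += probs[c]; }
        for (int c = 0; c < numClasses; c++) probs[c] /= total;
        return probs;
      }

      // Expected count of f_i on datum d: the sum over candidate classes collapses
      // to a single term, because the restricted form makes f_i(c', d) zero for
      // every class except c_i.
      static double expectedCount(double[] probs, boolean[] phi, int phiIndex, int classIndex) {
        return phi[phiIndex] ? probs[classIndex] : 0.0;
      }
    }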

I hope this is of benefit to you — I am trying to walk a fine line between being helpful and doing all the thinking for you.



Typo in HW3 formula for G(λ)

29 April 2005

There is a crucial typo in the handout for assignment 3. The formulas for the "regularized" objective function G(λ) and its derivatives should read as follows:

G(λ) = F(λ) + Σi λi²/2σ²
∂G(λ)/∂λi = ∂F(λ)/∂λi + λi/σ²

Let's think about this. What we really want to maximize is the conditional log likelihood of the data. Let's call this quantity L(λ):

L(λ) = Σ<c, d>∈<C, D> log p(c | d, λ)

However, the optimizer we're using is designed to minimize whatever function is handed to it, so we define our objective function F(λ) as the negative of the conditional log likelihood:

F(λ) = −1 · L(λ)

Now when we regularize, we want to be penalizing large weight vectors. Since we're minimizing the objective function, the penalty term should be added to the objective function:

G(λ) = F(λ) + Σi λi²/2σ² = −1 · [ L(λ) − Σi λi²/2σ² ]

The last form is nearly identical to the (incorrect) formula for G(λ) in the handout, and it helps explain why the typo was never noticed (even though several experienced people looked at it before it was released). When you look at the formula in the handout, it's easy to forget that F(λ) was defined to be the negative of the conditional log likelihood. If instead you interpret that appearance of F(λ) as referring to the positive conditional log likelihood L(λ), then everything looks OK. But that's not consistent with how we defined F(λ).
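
In code, the correction amounts to adding the penalty term to the (negative log likelihood) objective, and the corresponding term to each partial derivative. Here is a minimal sketch with invented names, not the starter code's API:

    class RegularizationSketch {

      // G(lambda) = F(lambda) + sum_i lambda_i^2 / (2 sigma^2)
      static double regularizedObjective(double f, double[] lambdas, double sigma) {
        double penalty = 0.0;
        for (int i = 0; i < lambdas.length; i++) {
          penalty += lambdas[i] * lambdas[i] / (2 * sigma * sigma);
        }
        return f + penalty;
      }

      // dG/dlambda_i = dF/dlambda_i + lambda_i / sigma^2
      static double[] regularizedGradient(double[] fGradient, double[] lambdas, double sigma) {
        double[] gGradient = new double[lambdas.length];
        for (int i = 0; i < lambdas.length; i++) {
          gGradient[i] = fGradient[i] + lambdas[i] / (sigma * sigma);
        }
        return gGradient;
      }
    }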

I am a little concerned that this correction may confuse some people who were previously unconfused. I think it's likely that many people, like me, failed to notice the error in the formula for G, but nevertheless gave it the correct interpretation. If you have already implemented this part of the assignment, and got the expected result of probability 0.73 for "cat", then it's quite likely that you've already set up your objective function correctly, and don't need to worry about it any further.

I will post a corrected version of the handout.



IndexLinearizer, features, etc.

26 April 2005

The use of IndexLinearizer in assignment 3 may create some confusion. Consider the miniTest example. There are five features (fuzzy, claws, small, big, medium) and two labels (cat and bear). The IndexLinearizer will assign indices i to the <feature, class> pairs as follows:

                 cat: 0    bear: 1
    fuzzy:  0    i = 0     i = 1
    claws:  1    i = 2     i = 3
    small:  2    i = 4     i = 5
    big:    3    i = 6     i = 7
    medium: 4    i = 8     i = 9

These i's are the indexes into the weights vector (the λs). And where you see the handout refer to fi, that's the same index. So what might be a little confusing is that a given fi (such as f2) corresponds not to a single observed feature (like claws) but to a <feature, class> pair (like <claws, cat>).

So, for example, in the formulas on p. 3 of the handout, you see things like fi(c', d). How do you evaluate this? Well, suppose i is 2 and c' is cat. f2 is supposed to be "on" whenever the class is cat and the feature claws is present, so f2(cat, d) will be on just in case the datum d has the feature claws.

Now what if c' is bear? In this case, f2(bear, d) will be off, regardless of whether the datum d has the feature claws or not.
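
If it helps, the arithmetic behind the table is just a row-major linearization of the <feature, class> grid. A tiny sketch (the actual IndexLinearizer in the starter code wraps the same idea in a class):

    class LinearIndexSketch {
      public static void main(String[] args) {
        int numClasses = 2;                              // cat: 0, bear: 1
        int featureIndex = 1;                            // claws
        int classIndex = 0;                              // cat
        int i = featureIndex * numClasses + classIndex;  // 1 * 2 + 0 = 2
        System.out.println("<claws, cat> linearizes to i = " + i);  // matches the table: f_2
      }
    }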



More data for HW2?

19 April 2005

As the assignment instructions suggest, there is far more data available than is used by default. It's already there in the data/training directory, which contains more sentences than the U.S. tax code. OK, that may or may not be literally true. But there are a lot of sentences, more than a million. You're welcome to use as many as you like, if you can manage to deal with e.g. memory issues.



Questions on Knight section 31

19 April 2005

I got some questions today regarding formulas found in section 31 of the Kevin Knight tutorial. He begins the section as follows:

P(a | e, f)   =    P(a, f | e)  /  P(f | e)   =    P(a, f | e)  /  Σa P(a, f | e)

Consider the computation of the denominator above. Using Model 1, we can write it as:

Σa Πj t(fj | eaj)

Knight is being imprecise here, and it's quite confusing. The latter expression is not really equal to the denominator above, because it neglects to take account of the length of the French sentence.

Here's a more precise account. Let's ignore the summation over a for the moment, and just focus on P(a, f | e). Since len(f) is a deterministic function of f, P(a, f | e) = P(a, f, len(f) | e). Now we repeatedly apply the chain rule for conditional probability:

P(a, f, len(f) | e) = P(len(f) | e) ×
P(a | len(f), e) ×
P(f | a, len(f), e)

So it's really the product of three probabilities, and Knight kind of ignored the first two, which determine how long the French sentence will be and what the alignment will be. He was really focused on the third piece, and it is true that

P(f | a, len(f), e)    =    Πj   t(fj | eaj)

So Knight's slip is very confusing. I think the reason he made it is just that he was focused on something else. The point he is trying to make in this section is something quite different: that the sum and product in his expression can be effectively interchanged (read the section for details), saving a huge number of multiplications.
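
If you want to convince yourself of the interchange numerically, here is a tiny sketch. The t(f | e) values are invented purely for illustration; with 2 English words and 3 French words, the naive sum ranges over 2³ = 8 alignments, while the factored form needs only a handful of additions and multiplications.

    class Model1InterchangeSketch {
      public static void main(String[] args) {
        double[][] t = {            // t[i][j] = t(f_j | e_i), made-up numbers
          { 0.3, 0.1, 0.6 },
          { 0.2, 0.5, 0.3 }
        };
        int l = 2, m = 3;

        // Naive: sum over all l^m alignments of the product over French positions.
        double naive = 0.0;
        for (int a0 = 0; a0 < l; a0++)
          for (int a1 = 0; a1 < l; a1++)
            for (int a2 = 0; a2 < l; a2++)
              naive += t[a0][0] * t[a1][1] * t[a2][2];

        // Interchanged: product over French positions of the sum over English positions.
        double factored = 1.0;
        for (int j = 0; j < m; j++) {
          double sum = 0.0;
          for (int i = 0; i < l; i++) sum += t[i][j];
          factored *= sum;
        }

        System.out.println(naive + " == " + factored);  // both equal (0.5)(0.6)(0.9) = 0.27
      }
    }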



Starter code combines test and training sentences

17 April 2005

There is a line in the starter code that adds the test sentences to the training sentences. Although it is normally a mistake to combine test and training data, doing so is acceptable in this case because the information you are trying to produce (namely, the alignments of the test sentences) is not added to the training data. Adding the test sentences simply allows you to optimize your model parameters with full knowledge of what sort of paired sentences can occur. In general, even when the evaluation is supervised, it is acceptable to train on test data so long as the test data you use is, like your training data, unlabeled.



Do I have to do my final project in Java?

13 April 2005

No. You can use Perl, C, C++, or any other widely used programming language. Extra credit if you design a Turing machine to compute your final project. Double extra credit if you build a diesel-powered mechanical computer to compute your final project. Triple extra credit if you build a human-level AI capable of autonomously conceiving, executing, and presenting your final project.[*]



Closing streams in readSpeechNBestLists()

9 April 2005

Have you seen an error like this?

Exception in thread "main" java.io.FileNotFoundException:
/afs/ir/class/cs224n/corpora/assignment1/wsj_n_bst/4ojc0209.acc (Too many open files)

This is a consequence of a small flaw in the starter code -- it is failing to close file readers when loading the HUB data.

There's a simple fix, which is to add three lines near the end of the readSpeechNBestLists() method. The revised method is shown below, with the only modifications marked:

    public static List readSpeechNBestLists(String path, Set vocabulary) 
      throws IOException {
      List speechNBestLists = new ArrayList();
      BufferedReader correctSentenceReader = open(path + "/REF.HUB1");
      Map correctSentenceMap = readCorrectSentences(correctSentenceReader);
      List prefixList = getPrefixes(path);
      for (Iterator it = prefixList.iterator(); it.hasNext();) {
        String prefix = (String) it.next();
        BufferedReader wordReader = open(path + "/" + prefix);
        BufferedReader scoreReader = open(path + "/" + prefix + ".acc");
        List correctSentence = (List) correctSentenceMap.get(prefix);
        SpeechNBestList speechNBestList = 
          buildSpeechNBestList(correctSentence, wordReader, scoreReader, vocabulary);
        if (speechNBestList != null)
          speechNBestLists.add(speechNBestList);
        wordReader.close();             // <-- ADD!
        scoreReader.close();            // <-- ADD!
      }
      correctSentenceReader.close();    // <-- ADD!
      return speechNBestLists;
    }

Smoothing and unknown words

5 April 2005

A few people have inquired about smoothing and unknown words (or more generally, n-grams). The basic idea of smoothing is to take some probability mass from the words seen during training and reallocate it to words not seen during training. Assume we have decided how much probability mass to reallocate, according to some smoothing scheme. The question is, how do we decide how to allocate this probability mass among unknown words, when we don't even know how many unknown words there are? (No fair peeking at the test data!)

There are multiple approaches, but no perfect solution. (This is an opportunity for you to experiment and innovate.) A straightforward and widely-used approach is to assume a special token *UNK* which represents (an equivalence class of) all unknown words. All of the reallocated probability mass is assigned to this special token, and any unknown word encountered during testing is treated as an instance of this token.
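
Here is a minimal sketch of the *UNK* approach for a unigram model, assuming you have already decided (by whatever smoothing scheme) how much probability mass to reserve for unseen words. The class and field names are illustrative, not the starter-code API; the sketch simply rescales the seen-word probabilities so the reserved mass is left over.

    import java.util.HashMap;
    import java.util.Iterator;
    import java.util.Map;

    class UnkUnigramSketch {
      static final String UNK = "*UNK*";

      Map wordProbabilities = new HashMap();   // String -> Double

      UnkUnigramSketch(Map trainingCounts, double totalCount, double unseenMass) {
        for (Iterator it = trainingCounts.entrySet().iterator(); it.hasNext();) {
          Map.Entry entry = (Map.Entry) it.next();
          double count = ((Integer) entry.getValue()).doubleValue();
          // Scale seen-word probabilities down so that 'unseenMass' is left over.
          wordProbabilities.put(entry.getKey(), new Double((1.0 - unseenMass) * count / totalCount));
        }
        wordProbabilities.put(UNK, new Double(unseenMass));   // all reserved mass goes to *UNK*
      }

      double probability(String word) {
        Double p = (Double) wordProbabilities.get(word);
        if (p == null) p = (Double) wordProbabilities.get(UNK);  // unknown word -> treat as *UNK*
        return p.doubleValue();
      }
    }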

Another approach is to make the (completely unwarranted) assumption that there is some total vocabulary of fixed size B from which all data (training and test) has been drawn. Assuming a fixed value for B allows you to fix a value for N0, the number of unknown words, and the reallocated probability mass can then be divided equally (or according to some other scheme) among the N0 unknown words. The question then arises: how do you choose B (or equivalently, N0)? There is no principled way to do it, but you might think of B as a hyperparameter to be tuned using the validation data (see M&S p. 207).
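
For example (with made-up numbers): if 30,000 distinct words appear in training and you assume B = 40,000, then N0 = 10,000; if your smoothing scheme has reserved probability mass 0.05 for unseen words and you divide it equally, each unknown word receives probability 0.05 / 10,000 = 5 × 10⁻⁶.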

Both of these approaches have the shortcoming that they treat all unknown words alike, that is, they will assign the same probability to any unknown word. You might think that it's possible to do better than this. Here are two unknown words: "sneck" and "mmrfio". Intuitively, which should be considered more probable? What kind of knowledge are you applying? Do you think a machine could make the same judgment?

In class, Chris mentioned a paper by Efron & Thisted, Estimating the number of unseen species: How many words did Shakespeare know?, which addresses related issues.



Larger dataset for HW1

2 April 2005

The instructions for HW1 suggest that you might try experimenting with a larger dataset. The primary dataset (treebank-sentences-spoken) contains less than a million words — which is not so large.

A larger dataset is now available for you to play with. The dataset is named bllip-speechified-sentences, and includes about 10 million words of Wall Street Journal text. (So, it's ten times as large as the primary dataset.) The data has been split 80/10/10 into training, validation, and test files, which can now be found in corpora/assignment1/.

If you examine the main method of LanguageModelTester, you will find that it accepts an optional second argument, which is the name of the dataset to be used. Note that if you supply bllip-speechified-sentences as the name of the dataset, it will use that dataset for both training and testing, with the consequence that your test results will not be directly comparable with test results for the primary dataset. However, with a few simple changes to the main method of LanguageModelTester, you can make it train on the larger training set and test on the original test set.

One reason it's nice to have more data is that it should make it possible to build more accurate models. Another is that it makes it easier to explore how model performance varies with the amount of training data — in other words, to investigate the learning curve.



How is Java 1.5 different from 1.4?

30 March 2005

Java 1.5 enhancements include generic types, metadata, autoboxing, an enhanced for loop, enumerated types, static import, C-style formatted input/output, variable argument lists, concurrency utilities, and simpler RMI interface generation. Read Sun's Java 1.5 overview for a concise explanation of the new features, with simple examples. For more detail, check out this document.

You don't need to master all the new 1.5 features. The ones to focus on are generic types, autoboxing, and the enhanced for loop. The last two are very straightforward and can be grasped in a few minutes. Generic types can become quite complex, but you don't need to know the intricacies. You can get the basic idea pretty quickly.
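
For instance, this little snippet (illustrative only) uses all three:

    import java.util.ArrayList;
    import java.util.List;

    class Java15Sketch {
      public static void main(String[] args) {
        List<Integer> counts = new ArrayList<Integer>();  // generic types: no casts needed
        counts.add(3);                                    // autoboxing: int wrapped as Integer
        counts.add(4);
        int total = 0;
        for (int c : counts) {                            // enhanced for loop (plus auto-unboxing)
          total += c;
        }
        System.out.println("total = " + total);           // prints 7
      }
    }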

The starter code for assignment 1 uses only Java 1.4 code, but the starter code for later assignments will use Java 1.5.


Using Java 1.5 on the Mac

30 March 2005

Although Java 1.5 has been out for almost a year, there is still no JDK 1.5 for Macintosh. It turns out that Sun doesn't provide the Macintosh JDK — Apple does, and they haven't been quick about it. (It took Apple 23 months to port Java 1.4 to the Mac a few years ago.) But the good news is that the wait is almost over: according to the rumor mill, the Mac JDK 1.5 will be released concurrently with Mac OS X 10.4 (aka Tiger), which is expected in mid-April.

In the meantime, there's a workaround that I've been using successfully for the last six months to do Java 1.5 on my Mac. At the beginning of last year, Sun published a Java 1.5 prototype called JSR014. It was unsupported software made available to developers to help them get up to speed with, and comment on, the new Java 1.5 language. It is basically a Java 1.5 compiler, written in Java 1.4, which can compile 1.5 source code into 1.4 bytecode. Since it's written in Java 1.4, it will run anywhere JDK 1.4 is installed — like your Mac, for instance. Although it's an unsupported static piece of code, and is based on a beta version of 1.5, I've had very few problems using it. Pretty much, you just write Java 1.5 (and use the 1.5 libraries), and it just works. However, there are no guarantees, and you should be prepared to fall back to using Java 1.5 on another platform if this doesn't work out for you.

Directions to install JSR014 on Macintosh OS X

  1. First, you need to download the JSR014 code from the Sun website. Put it somewhere easy to find, like your Desktop. You can delete it later.
  2. If it hasn't been unzipped, unzip it into a directory of the same name (e.g. ~/Desktop/adding_generics-2_4ea).
  3. Next you need to copy all of the JAR files from the adding_generics-2_4ea directory to a convenient place for future use. In order for the scripts below to work, you need to put them in /Library/Java/Extensions/. Open a shell (with X11 or with the Terminal application), and run the following command:
    sudo mv ~/Desktop/adding_generics-2_4ea/*.jar /Library/Java/Extensions/
    
  4. Now you need to download and install two scripts:
    • tiger: the runtime script (replaces java)
    • tigerc: the compilation script (replaces javac)
    First, save these to your Desktop. Now you need to make them executable and put them in some directory on your path, such as /usr/bin/:
    sudo chmod +x ~/Desktop/tiger*
    sudo mv ~/Desktop/tiger* /usr/bin/
    rehash
    
  5. OK. Now you should be ready to compile and run Java 1.5 source code. To compile and run, type
    tigerc MyClass.java
    tiger MyClass
    
    and you should be good to go!
  6. If you're compiling with ant, you'll need to modify the build.xml file (supplied with the starter code for this class) to tell ant to use tigerc instead of javac for compilation. Open that file and look for the <javac> element (within the compile target). The last attribute for that element is
               source="${compile.source}"
    
    Add the following attributes to the <javac> element:
                 fork="yes"
           executable="/usr/bin/tigerc"
    memoryMaximumSize="200m"
    
    (The last line tells it to use plenty of memory — you can fiddle with the exact value.) Also, search build.xml for the string "1.4", and replace each occurrence with "1.5". Now ant will compile Java 1.5 source files for you.

If you try this, let me know whether it works for you! If you find mistakes in my instructions, please let me know, so that your classmates can benefit from your discovery. And if everything works fine, I'd like to hear about that too.


Notes

[*] The maximum amount of extra credit you can earn for any such alternative implementation is 0.03 points.