
This page answers frequently asked questions (FAQs) for CS224N / Ling 284.

I don't understand this whole feature index thing in PA3. (May 5, 2010)
The use of indices here is the key point. The i values are the indexes into the weights vector (the λs), and where you see the handout refer to f_i, that's the same index. What might be a little confusing is that a given f_i (such as f_2) corresponds not to a single observed feature of a datum, but to a pair of an observed feature and a class label (this pairing is what the IndexLinearizer in the starter code computes).
So, for example, in the formulas on pp. 3-4 of the handout, you see things like f_i(c', d). How do you evaluate this? Well, suppose i is 2, and index 2 pairs some observed feature with some class label c. If datum d exhibits that observed feature and c' equals c, then f_i(c', d) is 1 (or, more generally, the feature's count in d). Now, what if c' is some other label? Then f_i(c', d) is 0, even if d exhibits the observed feature.
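To make the pairing concrete, here is a minimal sketch. The linearization scheme (i = feature * numLabels + label) and all names below are made up for illustration; the actual starter code's IndexLinearizer may use a different scheme:

```java
import java.util.*;

public class FeatureIndexSketch {
    // Hypothetical linearization of a (feature, label) pair into one index i,
    // in the spirit of what an IndexLinearizer might do (illustrative only).
    static int linearIndex(int feature, int label, int numLabels) {
        return feature * numLabels + label;
    }

    // f_i(c', d): 1 if the datum's active features include the observed feature
    // paired with index i, and c' equals the label paired with index i; else 0.
    static int f(int i, int cPrime, Set<Integer> activeFeatures, int numLabels) {
        int featureOfI = i / numLabels;  // invert the linearization
        int labelOfI = i % numLabels;
        return (activeFeatures.contains(featureOfI) && cPrime == labelOfI) ? 1 : 0;
    }

    public static void main(String[] args) {
        int numLabels = 3;
        Set<Integer> active = new HashSet<>(Arrays.asList(0, 4));  // a datum's active features
        int i = linearIndex(0, 2, numLabels);          // (feature 0, label 2) -> i = 2
        System.out.println(i);                         // 2
        System.out.println(f(i, 2, active, numLabels)); // 1: label matches, feature active
        System.out.println(f(i, 1, active, numLabels)); // 0: different label
    }
}
```

The point the sketch illustrates: a single index i carries both an observed feature and a label, so f_i(c', d) checks both at once.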
People may be confused about how features are represented in our starter code. The following loop shows how the pieces fit together: each datum stores its active features compactly, and the IndexLinearizer maps a (feature, label) pair to a single index into the weights array:

    for (EncodedDatum datum : data) {
      for (int label = 0; label < encoding.getNumLabels(); label++) {
        int numFeats = datum.getNumActiveFeatures();
        for (int i = 0; i < numFeats; i++) {
          int feat = datum.getFeatureIndex(i);
          int index = indexLinearizer.getLinearIndex(feat, label);
          double val = datum.getFeatureCount(i);
          double weight = weights[index];
        }
      }
    }

Distortion Parameters for Model 2 (28 April 2007)

There have been a lot of questions about the distortion parameters used for Model 2. "For each bucket j, you should have a parameter d(j) to indicate the probability of that distortion. These parameters should be learned during the EM process. (It turns out that choosing reasonable functions for d, instead of trying to learn parameters, can work pretty well. We'd like you to attempt learning distortion parameters with the EM algorithm, but dealing with distortions in some other sensible way could also warrant credit.)" Note that since d is over buckets determined by the indices of the English/French words and the lengths of the English/French sentences, it is indeed represented by a one-dimensional table of floating-point values, as mentioned in Knight's tutorial.

What should getAlignmentProb() return? (28 April 2007)

Say f is the source sentence and e is the target sentence (as in all our examples). Then getAlignmentProb() should return P(a, f | e).

For PA2 Part I, why is the test data part of the training data? (28 April 2007)

If you look at the starter code for PA2, you'll notice there's a suspicious line: the test sentence pairs are added to the training data.
The reason we include the test data is just to avoid the unseen-word problem and to simplify the assignment. And since we're doing unsupervised learning, the EM process doesn't make use of the annotated alignments in the test data, so we can say it's not cheating.

checkModel(): How to Implement (03 April 2009)

Your checkModel() function should loop over the entire vocabulary, as well as the unknown words you have allocated mass for, and check that the probabilities sum to 1. The assignment says that for higher-order n-grams (anything over unigrams), you should just choose a few words to condition on (the w1 in P(w2 | w1)), and check each of those words individually to make sure the probabilities conditioned on it sum to 1.
One easy way to do this is to choose 20 random words for w1, sum the probabilities P(w2 | w1) over all w2 for each w1, then return the total divided by 20. It should be 1 if you've properly created conditional probabilities.
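As a minimal sketch in Java: the model representation and names below are hypothetical (a plain map from conditioning word to next-word probabilities), not the actual starter-code API:

```java
import java.util.*;

public class CheckModelSketch {
    // Hypothetical bigram model: conditioning word -> (next word -> probability).
    static Map<String, Map<String, Double>> bigramProbs = new HashMap<>();

    // Sum P(w2 | w1) over all w2 for each sampled conditioning word w1,
    // and return the average total; it should be 1.0 for a proper model.
    static double checkModel(List<String> sampleW1s) {
        double total = 0.0;
        for (String w1 : sampleW1s) {
            double sum = 0.0;
            for (double p : bigramProbs.get(w1).values()) {
                sum += p;
            }
            total += sum;
        }
        return total / sampleW1s.size();
    }

    public static void main(String[] args) {
        Map<String, Double> afterThe = new HashMap<>();
        afterThe.put("cat", 0.5);
        afterThe.put("dog", 0.25);
        afterThe.put("<UNK>", 0.25);  // mass reserved for unknown words
        bigramProbs.put("the", afterThe);
        System.out.println(checkModel(Arrays.asList("the")));  // prints 1.0
    }
}
```

A result noticeably different from 1.0 signals a normalization bug; remember to include the mass you reserved for unknown words in the sum.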
Increasing perplexity with more training data? (1 April 2009)

Some students who have been trying to investigate learning curves have reported seeing test-set perplexity increase as the amount of training data grows. This is counterintuitive: shouldn't more training data yield a better model, which is therefore able to attain lower perplexity? Chris came up with a possible explanation involving the handling of the <UNK> token. Remember that the <UNK> token actually represents an equivalence class of tokens. As more training data is added, this equivalence class shrinks. Because the meaning of the <UNK> token is changing, model perplexities are not directly comparable. Especially when the amount of training data is small, adding more data will rapidly lower the model probability of <UNK>, causing the entropy and perplexity of the model distribution to grow.

If you've been looking at learning curves, an interesting investigation (not specifically required for the assignment) would be to measure the learning curve while holding the definition of <UNK> constant. This would mean allowing the models trained on small data sets to "know about" all the words in the largest training set. All known words in this sense would get explicit counts, which could be 0, and then you'd still have an <UNK> token representing all words which did not appear in even the largest training set.

Do we smooth at train time or test time? (1 April 2009)

Generally speaking, smoothing should be the last step of training a model: first you collect counts from your training data, and then you compute a smoothed model distribution which can be applied to (that is, used to make predictions about) any test data. In principle, it would be possible to postpone the computation of a smoothed probability until test time.
But (a) it's not very efficient, because most smoothing algorithms require iterating through all the training data, which you shouldn't have to do more than once, and (b) if you want to do this because your smoothing computation depends upon something in the test data, then you're doing things wrong. (For example, model probabilities should not depend on how many unknown words appear in the test data.)

How do I use stop tokens in my n-gram model? (1 April 2009)

Real sentences are not infinite; they begin and end. To capture this in your n-gram model, you'll want to use so-called "stop" tokens, which are just arbitrary markers indicating the beginning and end of the sentence.
It's typically done as follows. Let the sentence be the four words

    w1 w2 w3 w4

You'll change this to

    <s> w1 w2 w3 w4 </s>

and you'll collect counts for 5 bigrams, starting with (<s>, w1) and ending with (w4, </s>). The probability of the sentence is then the product

    P(w1 | <s>) P(w2 | w1) P(w3 | w2) P(w4 | w3) P(</s> | w4)
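As a concrete sketch (class and method names are hypothetical; "<s>" and "</s>" are the arbitrary start and stop markers), collecting bigram counts with stop tokens might look like:

```java
import java.util.*;

public class StopTokens {
    // Collect bigram counts from a sentence, padding with start/stop markers.
    static Map<String, Integer> bigramCounts(List<String> sentence) {
        List<String> padded = new ArrayList<>();
        padded.add("<s>");
        padded.addAll(sentence);
        padded.add("</s>");
        Map<String, Integer> counts = new HashMap<>();
        for (int i = 0; i + 1 < padded.size(); i++) {
            String bigram = padded.get(i) + " " + padded.get(i + 1);
            counts.merge(bigram, 1, Integer::sum);  // increment, starting from 1
        }
        return counts;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = bigramCounts(Arrays.asList("w1", "w2", "w3", "w4"));
        System.out.println(counts.size());          // 5 bigrams for a 4-word sentence
        System.out.println(counts.get("<s> w1"));   // 1
        System.out.println(counts.get("w4 </s>"));  // 1
    }
}
```

Note that a 4-word sentence yields 5 bigrams, as in the example above, because the start and stop markers each contribute one.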
where P(</s> | w4) is the probability of ending the sentence after w4.

Does PA1 require a strict proof? (1 April 2009)

Q. Is it necessary to give a strict mathematical proof that the smoothing we've done is a proper probability distribution? Or is it enough to just give a brief explanation?

A. You should give a concise, rigorous proof. No hand-waving. I'll show an example on Friday. Note that it's important that your proof applies to your actual implementation, not some ideal abstraction.

Smoothing implementation details (1 April 2009)

Do you have questions regarding the details of various smoothing methods? (For example, maybe you're wondering how to compute those alphas for Katz backoff smoothing.) You might benefit from looking at a smoothing tutorial Bill put together last year. For greater detail, an excellent source is the Chen & Goodman paper, An empirical study of smoothing techniques for language modeling.

Smoothing and conditional probabilities (1 April 2009)

Some people have the wrong idea about how to combine smoothing with conditional probability distributions. You know that a conditional distribution can be computed as the ratio of a joint distribution and a marginal distribution:

    P(x | y) = P(x, y) / P(y)
What if you want to use smoothing? The wrong way to compute the smoothed conditional probability distribution is to (1) smooth the joint distribution P(x, y), (2) separately smooth the marginal distribution P(y), and then divide. The problem is that steps 1 and 2 do smoothing separately, so it makes no sense to divide the results. (In fact, doing this might even yield "probabilities" greater than 1.) The right way to compute the smoothed conditional probability distribution is to (1) smooth the joint distribution P(x, y), (2) obtain the marginal P(y) by summing the smoothed joint over x, and then divide. Here, there is only one smoothing operation: we compute a smoothed joint distribution and compute everything else from that. If there's interest, I can show a worked-out example of this in Friday's section.

(It would also be correct to compute all the conditional distributions before doing any smoothing, and then to smooth each conditional distribution separately. This is a valid alternative to smoothing the joint distribution, and because it's simple to implement, it is often the approach used in practice. However, the results might not be as good, because less information is used in computing each smoothing function. This would be an interesting question to investigate in your PA1 submission.)

Smoothing and unknown words (1 April 2009)

A few people have inquired about smoothing and unknown words (or, more generally, n-grams). The basic idea of smoothing is to take some probability mass from the words seen during training and reallocate it to words not seen during training. Assume we have decided how much probability mass to reallocate, according to some smoothing scheme. The question is: how do we allocate this probability mass among unknown words, when we don't even know how many unknown words there are? (No fair peeking at the test data!) There are multiple approaches, but no perfect solution. (This is an opportunity for you to experiment and innovate.) A straightforward and widely used approach is to assume a special token <UNK> which represents (an equivalence class of) all unknown words. All of the reallocated probability mass is assigned to this special token, and any unknown word encountered during testing is treated as an instance of this token. Another approach is to make the (completely unwarranted) assumption that there is some total vocabulary of fixed size B from which all data (training and test) has been drawn.
Assuming a fixed value for B allows you to fix a value for N_0, the number of unknown words, and the reallocated probability mass can then be divided equally (or according to some other scheme) among the N_0 unknown words. The question then arises: how do you choose B (or, equivalently, N_0)? There is no principled way to do it, but you might think of B as a hyperparameter to be tuned using the validation data (see M&S p. 207).

Both of these approaches have the shortcoming that they treat all unknown words alike; that is, they will assign the same probability to any unknown word. You might think that it's possible to do better than this. Here are two unknown words you might encounter, say, on the internet: "flavodoxin" and "B000EQHXQY". Intuitively, which should be considered more probable? What kind of knowledge are you applying? Do you think a machine could make the same judgment? A paper by Efron & Thisted, Estimating the number of unseen species: How many words did Shakespeare know?, addresses related issues.

Do I have to do my final project in Java? (1 April 2009)

No. You can use Perl, C, C++, or any other widely used programming language. Extra credit if you design a Turing machine to compute your final project. Double extra credit if you build a diesel-powered mechanical computer to compute your final project. Triple extra credit if you build a human-level AI capable of autonomously conceiving, executing, and presenting your final project.

Where do I hand in my report late? (19 January 2011)

There is a hand-in box in the basement of Gates, near the bottom of the A-wing stairwell. You can find directions to it here. To get into the basement after the building is locked, slide your SUID card in the card reader by the main basement entrance. For code submitted late, please write the date and time of submission on your report and sign it before placing it in the box.

Site design by Bill MacCartney 