Part 1: Group Exercise
We are interested in building a language model over a language with three words: A, B, C. Our training corpus is
First, train a unigram language model using maximum likelihood estimation. What are the probabilities? (Leave your answers in the form of a fraction.)
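The MLE unigram estimate is just count(w) / N. The actual training corpus is omitted from this handout, so the sketch below uses a purely illustrative stand-in corpus, `A B A C A B`; the computation is the same for any corpus over {A, B, C}.

```python
from collections import Counter
from fractions import Fraction

# Hypothetical stand-in corpus -- the real training corpus is not shown
# in the handout, so this is only for illustration.
corpus = "A B A C A B".split()

counts = Counter(corpus)
total = sum(counts.values())  # N, the number of tokens

# Maximum likelihood estimate: P(w) = count(w) / N, kept as exact fractions
unigram = {w: Fraction(c, total) for w, c in counts.items()}
print(unigram)  # {'A': Fraction(1, 2), 'B': Fraction(1, 3), 'C': Fraction(1, 6)}
```

Using `Fraction` keeps the answers in the fractional form the problem asks for.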
Reminder: We don't need start or end tokens for training a unigram model, since the context of each word doesn't matter. So, we will not add any special tokens to our corpus for this part of the problem.
Next, train a bigram language model using maximum likelihood estimation. For this problem, we will add an end token, $$, at the end of the string, so that we can model the probability of the sentence ending after a particular word. If you choose to add a start token as well, that's fine too, but these solutions assume no start token. Fill in the probabilities below. Leave your answers in the form of a fraction.
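The MLE bigram estimate is count(w1, w2) / count(w1), where counts are taken after appending the end token. Again, the real corpus is omitted, so this sketch reuses the same hypothetical corpus `A B A C A B`:

```python
from collections import Counter
from fractions import Fraction

# Hypothetical corpus (the handout's actual corpus is not shown),
# with the end token $$ appended as the problem describes.
END = "$$"
tokens = "A B A C A B".split() + [END]

bigram_counts = Counter(zip(tokens, tokens[1:]))
# Context counts exclude the final token, which never serves as a context.
context_counts = Counter(tokens[:-1])

# MLE: P(w2 | w1) = count(w1, w2) / count(w1)
bigram = {(w1, w2): Fraction(c, context_counts[w1])
          for (w1, w2), c in bigram_counts.items()}
print(bigram[("A", "B")])   # Fraction(2, 3)
print(bigram[("B", END)])   # Fraction(1, 2)
```

Note that unseen bigrams simply get probability 0 under MLE; that is exactly the problem add-1 smoothing addresses later.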
Now evaluate your language models on the corpus
What is the perplexity of the unigram language model evaluated on this corpus? Since we didn't add any special start/end tokens when we were training our unigram language model, we won't add any when we evaluate the perplexity of the unigram language model, either, so that we're consistent.
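Perplexity is the inverse probability of the corpus, normalized by the number of tokens: PP = exp(-(1/N) * sum_i log P(w_i)). Since the evaluation corpus is not shown in the handout, this sketch evaluates the unigram model on the same hypothetical corpus it was trained on:

```python
import math
from collections import Counter

# Hypothetical corpus used for both training and evaluation
# (the handout's actual corpora are not shown).
corpus = "A B A C A B".split()

counts = Counter(corpus)
N = len(corpus)
probs = {w: c / N for w, c in counts.items()}

# Perplexity = exp(-(1/N) * sum_i log P(w_i)); no start/end tokens,
# consistent with how the unigram model was trained.
log_prob = sum(math.log(probs[w]) for w in corpus)
perplexity = math.exp(-log_prob / N)
print(perplexity)  # 432**(1/6), about 2.7495
```

Working in log space avoids underflow, which matters once corpora get longer than a toy example.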
What is the perplexity of the bigram language model evaluated on this corpus? Since we added an end token when we were training our bigram model, we'll add an end token to this corpus again before we evaluate perplexity.
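For the bigram model, the corpus probability is the product of the conditional probabilities P(w_{i+1} | w_i), with the end token appended; since no start token was used, the first token contributes no term. A sketch on the same hypothetical corpus:

```python
import math
from collections import Counter

# Hypothetical corpus (not the handout's actual corpus), plus end token.
END = "$$"
tokens = "A B A C A B".split() + [END]

bigram_counts = Counter(zip(tokens, tokens[1:]))
context_counts = Counter(tokens[:-1])

# Sum log P(w2 | w1) over every bigram in the corpus. With no start token,
# the first word itself is not scored -- only the transitions are.
log_prob = sum(math.log(bigram_counts[(w1, w2)] / context_counts[w1])
               for w1, w2 in zip(tokens, tokens[1:]))
N = len(tokens) - 1  # number of predictions made
perplexity = math.exp(-log_prob / N)
print(perplexity)  # sqrt(3), about 1.7321
```

As expected, the bigram perplexity on this toy corpus is lower than the unigram perplexity: conditioning on the previous word makes the corpus less surprising.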
Now repeat everything above for add-1 smoothing.
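Add-1 (Laplace) smoothing adds one to every count and compensates in the denominator with the vocabulary size: P(w) = (count(w) + 1) / (N + |V|) for unigrams, and P(w2 | w1) = (count(w1, w2) + 1) / (count(w1) + |V|) for bigrams, where the bigram model's next-word vocabulary includes the end token. A sketch on the same hypothetical corpus:

```python
from collections import Counter
from fractions import Fraction

# Hypothetical corpus (the handout's actual corpus is not shown).
corpus = "A B A C A B".split()
vocab = ["A", "B", "C"]

# Smoothed unigram: P(w) = (count(w) + 1) / (N + |V|)
counts = Counter(corpus)
N = len(corpus)
unigram = {w: Fraction(counts[w] + 1, N + len(vocab)) for w in vocab}
print(unigram["C"])  # Fraction(2, 9)

# Smoothed bigram, with the end token as a possible next word.
END = "$$"
tokens = corpus + [END]
bigram_counts = Counter(zip(tokens, tokens[1:]))
context_counts = Counter(tokens[:-1])
next_vocab = vocab + [END]

# P(w2 | w1) = (count(w1, w2) + 1) / (count(w1) + |V|), |V| = 4 here
bigram = {(w1, w2): Fraction(bigram_counts[(w1, w2)] + 1,
                             context_counts[w1] + len(next_vocab))
          for w1 in vocab for w2 in next_vocab}
print(bigram[("A", "B")])  # Fraction(3, 7)
print(bigram[("C", "C")])  # Fraction(1, 5) -- unseen, but no longer zero
```

Because every count is nonzero after smoothing, the smoothed perplexities can then be computed exactly as before, with no log-of-zero problems for unseen bigrams.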