### CS 124 / LING 180 From Languages to Information, Dan Jurafsky, Winter 2019 Week 3: Group Exercises on Language Modeling, Tuesday Jan 17, 2019

#### You can copy and paste this text into a file, or just write answers on note paper, or whatever is easiest to work with for yourself (and keep to study for the final)! You will not be turning this in!

Part 1: Shannon Entropy

See Matt's presentation! Afterwards you might want to check out Shannon's famous paper on the entropy of English that proposed this task, which is:
Shannon, Claude E. "Prediction and entropy of printed English." Bell system technical journal 30, no. 1 (1951): 50-64.

Part 2: Language Model exercises

We are interested in building a language model over a language with three words: A, B, C. Our training corpus is

```AAABACBABBBCCACBCC
```

1. First train a unigram language model using maximum likelihood estimation. What are the probabilities? (Just leave in the form of a fraction)? Reminder: We don't need start or end tokens for training a unigram model, since the context of each word doesn't matter. So, we will not add any special tokens to our corpus for this part of the problem.
P(A) =

P(B) =

P(C) =

2. Next train a bigram language model using maximum likelihood estimation. For bigrams we will want to add {start} and {end} tokens. Fill in the probabilities below. Leave your answers in the form of a fraction.
P(A|start) =

P(A|A) =

P(A|B) =

P(A|C) =

P(B|start) =

P(B|A) =

P(B|B) =

P(B|C) =

P(C|start) =

P(C|A) =

P(C|B) =

P(C|C) =

P(end|A) =

P(end|B) =

P(end|C) =

3. Now evaluate your language models on the following corpus
```ABACABB
```

What is the perplexity of the unigram language model evaluated on this corpus? Since we didn't add any special start/end tokens when we were training our unigram language model, we won't add any when we evaluate the perplexity of the unigram language model, either, so that we're consistent.

What is the perplexity of the bigram language model evaluated on this corpus? We'll add a {start} and {end} token to this corpus again before we evaluate perplexity. Note that for perplexity we include the end token but not the start token in the total count of words N. In other words, assume we have a sequence, e.g. {start}ABC{end} that is of length M including start/end tokens. To compute the perplexity of this sequence we compute M-1 probabilities and take their product to the -(M-1)st root, because we do not compute a probability for seeing the {start} token. E.g., for the sequence {start}ABC{end} we would get (P(A|start) * P(B|A) * P(C|B) * P(end|C))**(-1/4).

4. Now repeat everything above for add-1 smoothing.

• Now answer this conceptual question: What is the difference between using an UNK token (for unknown words) and smoothing? What situations would you use one versus the other?

1. Go to the Google Ngram Viewer at https://books.google.com/ngrams, and answer the following questions. Google Ngrams is unreliable after 2000 (the data distribution shifted), so you can only use it correctly up until 2000.

2. Explore some positive adjectival exclamatives, modern or old (cool, terrific, amazing, keen, awesome, groovy, gnarly, or any of your favorites). Discuss any differences in time course.

3. Now look at swear words. Most of them share their temporal pattern. Why?

4. Approximately when did pizza first become mentioned in the media in the United States?

5. The United States slowly shifted over its history from seeing itself mostly as a collection of states to seeing itself more as a single entity. This change was reflected linguistically. In the 18th century, the United States was a plural noun phrase (after all, states ends in an s). When did the United States start being talked about as a singular entity instead of plural collection of states? Hint: Here's a sentence from a 1789 treaty with the Osage Nations:

The United States agree to furnish, at this place, for the use of the Osage nations, a blacksmith