CS 124 / LING 180 From Languages to Information,
Dan Jurafsky, Winter 2019
Week 3: Group Exercises on Naive Bayes and Sentiment
Jan 22, 2019
Part 1: Group Exercise
We want to build a Naive Bayes sentiment classifier using add-1 smoothing,
as described in the lecture (not binary Naive Bayes, regular Naive Bayes). Here is our training corpus:
Training Set:
- just plain boring
- entirely predictable and lacks energy
- no surprises and very few laughs
+ very powerful
+ the most fun film of the summer
Test Set:
predictable with no originality
1. Compute the prior for the two classes + and -, and the likelihoods for each word given the class
(leave in the form of fractions).
2. Then compute whether the sentence in the test set is of class positive or negative
(you may need a computer for this final computation).
3. Would using binary multinomial Naive Bayes change anything?
4. Why do you add V to the denominator of add-1 smoothing, instead of just counting the words in one class?
5. What would the answer to question 2 be without add-1 smoothing?
6. Can you think of any other features (or preprocessing) that you could add that might be useful
in predicting sentiment? (This will come in handy for PA3!)
7. Naive Bayes treats words as if they are independent conditioned on the class (that is why we multiply the individual probabilities). For which (if any) of the new features you suggested does this independence assumption roughly hold?
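The prior, likelihood, and classification questions above can be checked with a short script. The following is a minimal Python sketch, not the course's reference solution. The function names (`train`, `likelihood`, `score`) are made up for illustration, tokens are split on whitespace, and test words that never appear in the training set ("with", "originality") are assumed to be skipped. Likelihoods use add-1 smoothing, (count(w, c) + 1) / (count(c) + |V|), and exact fractions are kept throughout.

```python
from collections import Counter
from fractions import Fraction

# Training corpus copied from the handout: (class label, sentence).
TRAIN = [
    ("-", "just plain boring"),
    ("-", "entirely predictable and lacks energy"),
    ("-", "no surprises and very few laughs"),
    ("+", "very powerful"),
    ("+", "the most fun film of the summer"),
]
TEST = "predictable with no originality"


def train(docs):
    """Return class priors, per-class word counts, and the vocabulary."""
    doc_counts = Counter(label for label, _ in docs)
    word_counts = {label: Counter() for label in doc_counts}
    for label, text in docs:
        word_counts[label].update(text.split())  # full counts, not binary
    vocab = set().union(*word_counts.values())
    priors = {c: Fraction(n, len(docs)) for c, n in doc_counts.items()}
    return priors, word_counts, vocab


def likelihood(word, label, word_counts, vocab):
    """Add-1 smoothed P(word | label) as an exact fraction."""
    total = sum(word_counts[label].values())
    return Fraction(word_counts[label][word] + 1, total + len(vocab))


def score(text, priors, word_counts, vocab):
    """Return prior times the product of likelihoods for each class."""
    scores = {}
    for label, prior in priors.items():
        p = prior
        for w in text.split():
            if w in vocab:  # assumption: words unseen in training are skipped
                p *= likelihood(w, label, word_counts, vocab)
        scores[label] = p
    return scores


priors, word_counts, vocab = train(TRAIN)
scores = score(TEST, priors, word_counts, vocab)
prediction = max(scores, key=scores.get)
print("priors:", priors)
print("scores:", scores)
print("predicted class:", prediction)
```

Because `Fraction` reduces automatically, the printed values may look different from hand-computed fractions that have not been reduced, even though they are equal.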
Part 2: Challenge Problems

1. Go to the Sentiment demo at
http://nlp.stanford.edu:8080/sentiment/rntnDemo.html.
Come up with 5 sentences that the classifier gets wrong.
Can you figure out what is causing the errors?
2. It is sometimes the case that more complex features (like trigrams or bigrams) perform better than simple features (like unigrams) on the training set, but perform worse than simple features
on the test set. This is a particular case of the phenomenon called "overfitting" in machine learning. Discuss why this might be. Can you create a tiny training set with two 3-word documents and a test set with one document for which this overfitting situation holds?
3. Binary multinomial NB seems to work better on some problems than full-count NB,
but full-count NB works better on others.
For what kinds of problems might binary NB be better, and why?
(There is no known right answer to this question, but
I'd like you to think about the possibilities.)
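Binary NB differs from full-count NB only in how words are counted: within each document, a word is counted at most once. A minimal Python sketch of that preprocessing step, assuming whitespace tokenization and a hypothetical helper name `binarize`, could look like this:

```python
def binarize(text):
    """Drop repeated words within one document, keeping first occurrences in order."""
    # dict.fromkeys keeps insertion order and discards duplicate keys.
    return " ".join(dict.fromkeys(text.split()))


doc = "the most fun film of the summer"
binary_doc = binarize(doc)
print(binary_doc)
```

Passing binarized documents to an otherwise unchanged full-count trainer gives binary NB. A word can still end up with a class count above one if it occurs in several documents of that class.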