CS 124 / LING 180 From Languages to Information,
Dan Jurafsky, Spring 2021
Week 3: Group Exercises on Naive Bayes and Sentiment
April 13, 2021

Part 1: Quick Naive Bayes review

First, move to your equivalently numbered room in the work section.

We want to build a Naive Bayes sentiment classifier using add-1 smoothing, as described in the lecture (regular multinomial Naive Bayes, not binary Naive Bayes). Here is our training corpus:

Training Set:

    - the movie has no plot
    - honestly pretty boring
    + pretty interesting movie 

Test Set:

    pretty enjoyable plot

  1. Compute the prior for the two classes + and -, and the likelihoods for each word given the class (leave in the form of fractions).

  2. Then compute whether the sentence in the test set is of class positive or negative (you may need a computer for this final computation).

  3. Would using binary multinomial Naive Bayes change anything?

  4. Why do you add |V| to the denominator of add-1 smoothing, instead of just counting the words in one class?
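
If you want to check your hand calculations for questions 1 and 2, the following is a minimal Python sketch of add-1 smoothed multinomial Naive Bayes on the training corpus above. It is not course-provided code: the function names (`train`, `likelihood`, `classify`) are made up for illustration, and it assumes that test words never seen in training are simply skipped. Exact fractions are used so the output can be compared directly with fractions worked out by hand.

```python
from collections import Counter
from fractions import Fraction

# Training corpus from the exercise: (label, sentence) pairs.
TRAIN = [
    ("-", "the movie has no plot"),
    ("-", "honestly pretty boring"),
    ("+", "pretty interesting movie"),
]


def train(docs):
    """Return class priors, per-class word counts, and the vocabulary."""
    doc_counts = Counter(label for label, _ in docs)
    word_counts = {label: Counter() for label in doc_counts}
    for label, text in docs:
        word_counts[label].update(text.split())
    # The vocabulary V is the set of word types seen in any class.
    vocab = set().union(*word_counts.values())
    priors = {c: Fraction(n, len(docs)) for c, n in doc_counts.items()}
    return priors, word_counts, vocab


def likelihood(word, label, word_counts, vocab):
    """Add-1 smoothed P(word | label): (count + 1) / (class tokens + |V|)."""
    total = sum(word_counts[label].values())
    return Fraction(word_counts[label][word] + 1, total + len(vocab))


def classify(text, priors, word_counts, vocab):
    """Score each class; assumption: out-of-vocabulary words are skipped."""
    scores = {}
    for label, prior in priors.items():
        score = prior
        for word in text.split():
            if word in vocab:
                score *= likelihood(word, label, word_counts, vocab)
        scores[label] = score
    return max(scores, key=scores.get), scores


priors, word_counts, vocab = train(TRAIN)
label, scores = classify("pretty enjoyable plot", priors, word_counts, vocab)
print("priors:", priors)
print("scores:", scores)
print("predicted class:", label)
```

Try the exercise by hand first, then run the sketch and compare. For question 4, note where `len(vocab)` enters the denominator in `likelihood` and think about what the smoothed likelihoods for one class would sum to without it.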

Part 2: Conceptual Problems
  1. For the following problem, please choose a group facilitator/representative who will also take notes on your discussion. When we come back to the lecture room, I will call for volunteer groups to report back to the whole class on your thoughts or results, and so I will call some of the representatives to the stage.

    Data ethics is an important component of any supervised machine learning task, since the data has to come from somewhere. For example, all research involving human subjects in the United States must follow the three "Belmont Principles", which are:

    (the full Belmont report is here)

    Do the following:

    Return to your equivalently numbered room in the main lecture section and we will discuss as a class.

  2. After coming back to your work rooms, choose a different group facilitator/representative. Then start with this short ML conceptual problem: It is sometimes the case that more complex features (like trigrams or bigrams) perform better than simple features (like unigrams) on the training set, but perform worse than simple features on the test set. This is a particular case of the phenomenon called "overfitting" in machine learning. Discuss why this might be. Can you create a tiny training set with two 3-word documents and a test set with one document for which this overfitting situation holds?

  3. Now go to the Sentiment demo at http://nlp.stanford.edu:8080/sentiment/rntnDemo.html. Come up with 5 sentences that the classifier gets wrong. Can you figure out what is causing the errors?

    Return to your equivalently numbered room in the main lecture section and we will discuss any particularly interesting sentiment examples as a class.

  4. After coming back to your work rooms, choose a third group facilitator/representative. Now let's continue thinking about data sources. Our data sources are the basis of all of our optimization problems. But who are we optimizing for? Should we be concerned that we are building systems in response to the people who create the most text on the web, namely men from wealthy, English-speaking countries? According to the researcher Ricardo Baeza-Yates in the journal CACM, "the percentage of all publicly reported Wikipedia female editors is just 11%". Furthermore, "it is estimated that over 50% of the most popular websites are in English, while the percentage of native English speakers in the world is approximately only 5%".

    What are the potential effects of biases in internet data sources? Does this bias negatively impact some populations more than others? If so, whom and in which ways? Do you think we should attempt to remediate these biases, and if so, do you have any ideas for steps that can be taken? Or do you think that remediation should not be a focus for data researchers? Again, your opinions will likely differ from each other, and of course we expect you will discuss this respectfully!

    Return to your equivalently numbered room in the main lecture section and we will discuss as a class.