Analysis of opinion editorials from two Stanford student newspapers

Stanford has two large student newspapers. The Stanford Daily is the main campus paper, while the Stanford Review publishes conservative-leaning political articles biweekly. Each paper has a substantial op-ed section. In this exercise, we apply topic models to a corpus of opinion pieces published in both papers within the last year.

If you want to follow along, download the following directory which contains an IPython notebook, and the scraped datasets:

http://www.stanford.edu/class/stats202/lda.tar.gz

Scraping the Stanford Daily

We start by scraping the website of the Stanford Daily, using the packages requests and BeautifulSoup4 introduced in Homework 3.

In [1]:
import requests
from bs4 import BeautifulSoup
In [2]:
baseURL = "http://www.stanforddaily.com/category/opinions/op-eds/page/"
hdr = {'User-Agent':'Mozilla/5.0'}

# Get a list of permalinks to each opinion piece (in first 11 pages)
articleURLs = {}
for page in range(1,12):
    opinionPage = requests.get(baseURL+str(page),headers = hdr)
    soup = BeautifulSoup(opinionPage.text, "html.parser")
    listItem = soup.findAll(attrs={'class':'item-list'})
    for li in listItem:
        link = li.find('a')
        articleURLs[link.attrs['title']] = link.attrs['href']       
In [3]:
# Print the number of pieces
len(articleURLs)
Out[3]:
88
In [4]:
# Remove "Permalink to" from each title
articleURLs = {key[13:] : value for key,value in articleURLs.items()}
In [27]:
# Scrape each article page for the text of the article
articles = {}
for title, url in articleURLs.items():
    print(title)
    articlePage = requests.get(url, headers=hdr)
    soup = BeautifulSoup(articlePage.text, "html.parser")
    # Strip the irrelevant JavaScript block with id "videoread"
    for script in soup.findAll(attrs={'id': 'videoread'}):
        script.extract()
    # The article text appears in the second element with class "entry"
    entry = soup.find_all(attrs={'class': 'entry'})[1]
    articles[title] = entry.getText()
Improving sex ed?
Multiple meanings and community critique: The complex place of muralism in Casa Zapata
Snowden and Manning are criminals and traitors, not heroes
Why SJP is antithetical to Stanford values
The value of anonymity (yes, even when it’s terrible)
From climate change to campus change: How climate discussions highlight a need for more productive debates
A false security
Pour one out for Meyer
Divestment doesn’t foster discrimination — Hillel and the ADL do
Why the Presbyterian Church (USA) chose boycott and divestment
Rethinking gender and sexual assault policy: My story
Divestment from Palestine: A human rights farce
Make the choice to make a difference
Fossil Free Stanford: Open letter to the Stanford University Board of Trustees
Working at Relcy
The real price of athletics at Stanford
Beware the religious freedom boogeyman
Decriminalizing victims: Let’s adopt the Nordic Model of prostitution law
Why I didn’t report
Stanford falls for BDS
The price of athletics at Stanford
On living in fear of telling the truth: My experience with SAE, retaliation and Title IX
A campus united against division
The time is now: Funding reform cannot be delayed
Limitations of identity activism
Letter to Stanford University on Tobacco Use Policies
Innovation for equity: The case for Bus Rapid Transit
Bite into a healthy lifestyle with a plant-strong diet
Uncle Sam still wants you
Stanford Class Confessions
Stanford Students for Life’s annual protest is shameful
Keeping the focus where it belongs: Condemning racial injustice is a stepping stone towards progress
‘India’s Daughter’ as seen by America’s Daughter
Abusing the term ‘anti-Semitism’
Response to ‘Islamophobia and the White moderate’
In support of Stanford Out of Occupied Palestine and the oppressed
Loving all those who matter: My struggle to love myself
Ferguson and Palestine ad absurdum
Fossil Free freshmen respond to criticism of civil disobedience
Mental health is our Vietnam
Speaking out: Student-athletes weigh in on LGBT acceptance in athletics
If I am not for myself, who will be for me?
Fight the natural gas expansion
Yes, know thyself, but first do a situational analysis
An awkward echo: 1977/2015
Forbes Café: A New Pricing Scam?
Oh no, FoHo: A critique of Stanford’s newest publication
Saving face vs. saving money in Greek life
Behind the decision making of the Sexual Assault Task Force
Racism in a Palo Alto establishment
How Finley fixed funding
I don’t love Stanford
How can an Israeli support the BDS movement?
Zionism, civil rights and Stanford activism: The case for productive education
The Israel I have seen
Islamophobia and the White moderate
Marine protected areas in the Southern Ocean
We don’t need your help
Ode to (Meyer) a work-in-progress
Marriage does not unite us
Facts and truth empower a smoke-free campus
David Shaw 2.0
Do divest from fossil fuels: A response
Affirming our commitment to community
Gross misconduct in APIRL’s handling of SJP’s divestment request
Stanford should abandon the Searsville Dam
Rape: An uncomfortable truth
A way to effect change at Stanford: University committees
Catharsis: CAPS and eating disorders
Facing the mental health crisis at Stanford
Confronting baseless allegations: The SOCC endorsement process
Response to ‘Enough of Shakespeare’
The hidden costs of ISC recruitment
Response to Stanford’s release of the climate survey results
Addressing differences within and without: An open letter from the JSA Board
Why I give to Stanford
Fun at the expense of respect: Changing how we see Native Americans in the 21st century
Concerning violence
Understanding the nature of prejudice: The truth of Islamophobia and Ahmed Mohamed
An open letter to Ms. Bloch-Horowitz from an SAE
The dream of peace vs. the nightmare of divestment
The politics of Instagramming genocide
The good professor and the research university
A needed voice
Girl power (for someone else’s world)
Housestaff family life at Stanford Hospital
The media’s role in climate change action
Don’t tell us how to feel
In [28]:
# Save the articles in a dictionary in the current directory
f = open('dailyArticles.txt','w')
f.write(str(articles))
f.close()

Scraping the Stanford Review

We now scrape articles from the website of the Stanford Review.

In [5]:
baseURL = 'http://stanfordreview.org/cat/sections/opinion/page/'
hdr = {'User-Agent':'Mozilla/5.0'}

# Get a list of permalinks to each opinion piece (in first 11 pages)
articleURLs = {}
for page in range(1,12):
    opinionPage = requests.get(baseURL+str(page),headers = hdr)
    soup = BeautifulSoup(opinionPage.text, "html.parser")
    listItem = soup.findAll(attrs={'class':'entry-title'})
    for li in listItem:
        link = li.find('a')
        articleURLs[link.text] = link.attrs['href'] 
In [6]:
# Print the number of articles
len(articleURLs)
Out[6]:
55
In [8]:
# Scrape each article website for the text of the article
reviewArticles = {}
for title, url in articleURLs.items():
    articlePage = requests.get(url,headers = hdr)
    soup = BeautifulSoup(articlePage.text, "html.parser")
    entry = soup.find(attrs={'class':'post-content'})
    reviewArticles[title] = entry.getText()
    print(title)
SOCC Should Be More Inclusive If It’s Going To Claim It Represents All Minority Views
Own Up
Israel Again Pitted Against Hatred
The Pandora’s Box of Intersectionality and Solidarity
The Fundamental Double Standard
Thawing Cold War Relations with Cuba
In Response to an Attack on Gay Marriage in the Stanford Daily
A Fresh Perspective On The New NSO
Stanford’s ASSU Divests from its own Legitimacy
Don’t Speak; Don’t Think
The Executive Slate Debate: What You Need to Know
The Dangers of Divestment
For the Skeptics: Why Divest?
Bridging the Anti-Semitism Gap
ASSU “At Large”
Havens and echo chambers: Identity politics at Stanford
Does the Honor Code Work?
Buzzword, Buzzword, Blah, Blah, Blah: Why We Tune Out ASSU Senate Campaigns
Fossil Fuel Divestment: The Logical Choice
The Undergraduate Senate Should Represent You
The Hidden Agenda Behind SOOP’s False Divestment Claims
The Case for Clarity in Campaign Contracts
Dear ProFros: Stanford Students Don’t Bite
When Law No Longer Rules
Forgotten Mexico: The Hypocrisy of #WeAreCharlie
Hurry Up! The Provost’s Task Force on Sexual Assault Reform’s Delay is Unacceptable
Who Really Hacked Sony Pictures?
Can the humanities thrive without being required?
Israel Has a Right to Exist – and So Does the United States
Fossil Fuel Divestment is Misguided: Focus on Investment Instead
Please, offend me
Valuing Victims and Valuing Fairness: Stanford’s Sexual Assault Problem
A Survivor Speaks Out Against Stanford’s Sexual Assault Proposal
An Open Letter to Stanford on the Economics of Divestment
SOCC Had Every Right to Ask About Candidate’s Identity
In Defense of SAL
Fossil Free Stanford’s Real Conflict of Interest
Marijuana at Stanford
Should American Sniper Come Under Fire?
Baltimore’s Legacy of Racial Discrimination
Cultural Oversensitivity Stifles Cultural Understanding
ASSU Poised to Demonstrate Its Economic Incompetence, Again
Editor’s Note: The Stanford Review and Bayonets
Silicon Valley’s Love For Weed
Stern Dining Should Not Honor Mass Murderers
End the Title IX Inquisition
The Death of Stanford’s Humanities Core
Applauding Stanford’s Sorority Recruitment Process
True Acceptance or Political Correctness?
Stanford’s Title IX Process Violates Due Process
Cut Corporate Taxes to Empower Stanford Workers
The Connection Between Race, Riots, and Rational Thinking
Monetise the Draw
Stanford GSB taps Charles Koch as 2016 Commencement Speaker
Demand the Change, or Make the Change?
In [19]:
# Save the articles to the working directory
f = open('reviewArticles.txt','w')
f.write(str(reviewArticles))
f.close()

Load articles from file

We can start from here if we've already done the scraping.

In [29]:
# Note: eval executes whatever the file contains, so only use it on files you created yourself
f = open('dailyArticles.txt','r')
articles = eval(f.read())
f.close()

f = open('reviewArticles.txt','r')
reviewArticles = eval(f.read())
f.close()
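Since `eval` executes arbitrary code, a safer way to round-trip a dictionary of strings is the standard-library `json` module. A minimal sketch (the filename and toy data are illustrative, not the actual scraped corpus):

```python
import json

# Toy stand-in for the scraped {title: text} dictionary
articles = {"Sample title": "sample article text"}

# Save: json.dump writes the dict as standard JSON
with open('dailyArticles.json', 'w') as f:
    json.dump(articles, f)

# Load: json.load parses the file back into a dict; no eval needed
with open('dailyArticles.json', 'r') as f:
    loaded = json.load(f)

print(loaded == articles)  # True
```

JSON also has the advantage of being readable by other tools (e.g. R), unlike a `str()`-serialized Python dict.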

Cleaning the articles of non-ASCII characters and punctuation

In [30]:
# Define a set of punctuation characters (including digits)
punct = set('!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~1234567890')
# Add the newline character
punct.add('\n')
In [64]:
# This will be a list of all the articles as lowercase strings with punctuation removed
cleanArticles = []   

for title, article in articles.items():
    asciiArticle = article.encode('ascii','ignore').decode()
    cleanArticles.append("".join(' ' if x in punct else x for x in asciiArticle).lower())
    
for title, article in reviewArticles.items():
    asciiArticle = article.encode('ascii','ignore').decode()
    cleanArticles.append("".join(' ' if x in punct else x for x in asciiArticle).lower())
In [65]:
# For example, here is the first "clean" article
cleanArticles[0]
Out[65]:
u' in the aftermath of the stand with leah protests and national attention on date rape on college campuses  our class of      has had more inundation of information on sexual assault than any other class in stanford history  in their think about it  think alcohol edu  online class before coming to campus as well as additional sexual assault programming during facing reality at new student orientation  they have had multiple exposures to stanford programming condemning sexual assault  tuesday night  the sara office came to donner  a freshman dorm on campus  to host a panel titled mating  dating  and relating  whats good  whats healthy    whats harmful  as a freshman resident assistant  i was charged with planning logistics for the mandatory presentation  in emails back and forth with the sara office  the donner staff stated  our residents went through the new student orientation that incorporated sexual assault training  and theyve had a lot of caution around the worst that can happen with sexual relations on campus  wed love your presentation to bring in a little more sex positivity  and i think that would also increase the turnout  the sara presentation did not meet expectations  rather  the overwhelming focus was on unhealthy relationships with little focus on what positive sexual experiences might look like  skye lovett    noted  the presentation was unnecessarily hetronormative  nick salzar    and a donner ra noted  i thought that the people were very nice  and i want to thank them for coming  i think that the information that is presented is useful for people to know  that being said  that is not what we asked for  why does sara wrap a talk that is really just about sexual assault with the title  mating  dating  and relating  peter litzow    said  although the issues they talked about were very important  i dont think they necessarily represented the wider spectrum of intimacy that can happen in freshmen year  most people are experiencing casual hook ups or 
awkward encounters in general  and information about that would be more pertinent to freshmen in their first quarter   megan calfas    stated  i felt like it was a conversation thats worth having  but its not the only conversation worth having  and its kind of the only conversation that were having at these events  the way its presented feels very inorganic  hannah pho    commented  i think it made everyone even more uncomfortable than they already were  it was words off a paper that wed heard before  and i wanted to leave the room so badly but felt like i had some sort of obligation to stay  anton de leon    stated  these sorts of panels trivialize real issues  as a freshman ra  i am cognizant that many of my freshmen have limited sexual experience and would benefit from university sponsored sex education  one freshman likened it to teaching abstinence in that there has been little acknowledgement from the university that some students are sexually active  when the office of alcohol policy and education  oape  came to donner  they did a wonderful job presenting on how to drink responsibly if one chooses to drink  my freshmen have yet to hear a presentation about getting tested for stis  asking for consent or how to use birth control  the lack of effectiveness of the sara sexual assault presentation as well as the negligence of sex positive messages and practical information regarding sex that is being made available to the freshmen class is shocking  in the current state  greek organizations have become a scapegoat for the university that needs to find a target  in       duke suffered a      decline in applications for admission after the rape charges against members of the schools lacrosse team  i fear that the increased programming around sexual assault at stanford is as much motivated to maintain our image of selectivity than as to protect and prevent instances of sexual assault  stanford  we can do better  sex education policy should not just focus on the 
worst case scenarios  but also provide information for how students can grow into their sexuality  by providing sex positive programming to the freshmen class  we enable them to make responsible decisions as they grow into their adulthood  mckenzie andrews    is an ra in donner  she can be contacted at andrews  at stanford edu    '
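The character-by-character join above works, but the same cleaning can also be expressed with `str.maketrans`/`str.translate`, the idiomatic (and usually faster) way to map characters in Python. A sketch using the same punctuation set and a made-up sentence:

```python
# Same punctuation/digit set as above (backslash escaped), each character mapped to a space
punct = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~1234567890\n'
table = str.maketrans(punct, ' ' * len(punct))

article = "Sex ed, take 2: what's next?"
clean = article.translate(table).lower()
print(clean)  # "sex ed  take    what's next "
```

Note that the apostrophe is deliberately absent from `punct`, matching the behavior of the set defined earlier.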

Tokenization

Before we apply the topic model, we want to remove common words such as articles and pronouns, which are referred to as "stop words". We also add to the stop list words that appear in nearly every article, such as "Stanford" and "student".

In [66]:
stopwords = ['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', 'your', 'yours',
'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', 'her', 'hers',
'herself', 'it', 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves',
'what', 'which', 'who', 'whom', 'this', 'that', 'these', 'those', 'am', 'is', 'are',
'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does',
'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until',
'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into',
'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down',
'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here',
'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more',
'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so',
'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don', 'should', 'now', 'u', 'many']

# Adding a few words that don't help us distinguish between articles
stopwords.extend(['stanford','student','students'])

We also want to cluster words that are slight variations of one another, such as "politics" and "political". This can be done with the well-known Porter stemmer, which maps each word to its stem; for the example above, "polit". An implementation is available in the package nltk.
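As a quick sanity check, the stemmer can be applied to these words directly (outside the notebook's main pipeline):

```python
import nltk

stemmer = nltk.PorterStemmer()
# Both variants map to the same stem, so they will be counted together
print(stemmer.stem("politics"))
print(stemmer.stem("political"))
```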

In [67]:
import nltk
stemmer = nltk.PorterStemmer()
In [68]:
# This will be a list of articles, where each article is a list of words
tokenizedArticles = []
# This will be a dictionary mapping each "stem" to a representative word 
stem2representative = {}

for article in cleanArticles:
    # Split the article into words and drop the stopwords
    tokens = [word for word in article.split() if word not in stopwords]  
    # Replace each word by its stem, storing a representative word for each stem
    stems = []
    for word in tokens:
        s = stemmer.stem(word)
        stem2representative[s] = word
        stems.append(s)
    # Keep only stems that appear more than twice in the article
    tokens = [word for word in stems if stems.count(word) > 2]
    tokenizedArticles.append(tokens)

# For each article, replace the stem by a single representative
tokenizedArticles = [ [stem2representative[s] for s in article] for article in tokenizedArticles ]
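One caveat on the filtering step above: `stems.count(word)` rescans the whole list for every token, which is quadratic in article length. `collections.Counter` applies the same rule in a single pass. A sketch with a hypothetical toy list:

```python
from collections import Counter

# Hypothetical stemmed article
stems = ["polit", "divest", "polit", "campus", "polit", "divest"]

counts = Counter(stems)
# Keep only stems appearing more than twice, preserving order (same rule as above)
tokens = [s for s in stems if counts[s] > 2]
print(tokens)  # ['polit', 'polit', 'polit']
```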

Topic models

Now we can apply the Latent Dirichlet Allocation (LDA) model to this set of articles. The model is implemented in the Python package gensim.

In [69]:
import gensim
In [70]:
# Assign each word a unique identifier (a number), and define a dictionary for mapping from word to number
dictionary = gensim.corpora.Dictionary(tokenizedArticles)
In [71]:
# Number of unique tokens in all the articles
len(dictionary.token2id)
Out[71]:
1660
In [72]:
# Here we define a "corpus": a representation of each article as a list of (word ID, count) pairs
corpus = [dictionary.doc2bow(text) for text in tokenizedArticles]
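Conceptually, `doc2bow` just counts how often each token ID occurs in a document. A stdlib-only sketch of the output format, with a hypothetical toy vocabulary (real IDs come from gensim's Dictionary):

```python
# Toy vocabulary and document (illustrative only)
token2id = {"polit": 0, "divest": 1, "campus": 2}
doc = ["polit", "divest", "polit"]

# doc2bow-style output: (token ID, count) pairs for tokens present in the document
bow = sorted((token2id[t], doc.count(t)) for t in set(doc))
print(bow)  # [(0, 2), (1, 1)]
```

Tokens that never appear in a document (here "campus") simply produce no pair, which keeps the representation sparse.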
In [73]:
# Now, we can train the LDA model on the corpus
model = gensim.models.ldamodel.LdaModel(corpus=corpus, id2word=dictionary, num_topics=30, update_every=1, alpha="auto", passes=400, chunksize=143)
In [74]:
# We can view the words associated with topic 12 by calling
model.print_topic(12)
Out[74]:
u'0.064*taxing + 0.046*company + 0.044*fairness + 0.044*trade + 0.037*corporate + 0.033*workers + 0.026*profit + 0.026*money + 0.024*make + 0.024*people'
In [75]:
# Similarly, we can view the distribution of topics in article 57 by calling
model[corpus[57]]
Out[75]:
[(20, 0.99928893406074626)]
In [76]:
# Here, we store the distribution of topics in every article
topicDists = [ model[corpus[i]] for i in range(len(corpus)) ]

Saving the output

In [77]:
import pandas as pd
In [78]:
numTopics = 30
topics = {"topic":[],"word":[],"weight":[]}
for topic in range(numTopics):
    x = model.show_topic(topic,20)
    for weight, word in x:
        topics["topic"].append(topic)
        topics["word"].append(word)
        topics["weight"].append(weight)
topics = pd.DataFrame(topics)
In [79]:
topics.to_csv("topics.csv")
In [80]:
topics
Out[80]:
topic weight word
0 0 0.037576 athletes
1 0 0.024578 sexual
2 0 0.021362 percent
3 0 0.021362 sae
4 0 0.018421 one
5 0 0.016797 assault
6 0 0.016129 campus
7 0 0.015389 harassment
8 0 0.014360 report
9 0 0.013682 retaliating
10 0 0.011975 drug
11 0 0.011711 investigation
12 0 0.011122 lgbt
13 0 0.010269 policy
14 0 0.010269 marijuana
15 0 0.010031 years
16 0 0.009415 issue
17 0 0.009415 victims
18 0 0.009359 culture
19 0 0.009102 university
20 1 0.033309 dialogue
21 1 0.031093 kyle
22 1 0.028877 american
23 1 0.026662 sniper
24 1 0.022230 viewers
25 1 0.022230 movie
26 1 0.022230 film
27 1 0.017799 yik
28 1 0.017799 controversial
29 1 0.017799 people
... ... ... ...
570 28 0.016414 membership
571 28 0.016414 discussion
572 28 0.014371 putnams
573 28 0.014371 policy
574 28 0.014371 communicate
575 28 0.014162 one
576 28 0.012328 even
577 28 0.012328 virtue
578 28 0.012328 activists
579 28 0.012328 experience
580 29 0.039702 identities
581 29 0.032835 socc
582 29 0.029020 political
583 29 0.029020 communicate
584 29 0.027918 campus
585 29 0.019851 years
586 29 0.016419 divestment
587 29 0.015897 people
588 29 0.014818 senate
589 29 0.013210 assu
590 29 0.012060 sexual
591 29 0.011910 university
592 29 0.011192 even
593 29 0.010734 movement
594 29 0.010708 latino
595 29 0.010708 groups
596 29 0.010708 chicano
597 29 0.009945 endorsement
598 29 0.009917 would
599 29 0.009182 minor

600 rows × 3 columns

In [81]:
opEds = { "topic":[], "probability":[], "document":[], "paper":[]}
for opEd in range(143):
    for topic,prob in topicDists[opEd]:
        opEds["topic"].append(topic)
        opEds["probability"].append(prob)
        opEds["document"].append(opEd)
        opEds["paper"].append("Stanford Daily" if opEd<88 else "Stanford Review")
opEds = pd.DataFrame(opEds)
In [82]:
opEds.to_csv("opEds.csv")
In [83]:
opEds
Out[83]:
document paper probability topic
0 0 Stanford Daily 0.998778 29
1 1 Stanford Daily 0.998730 7
2 2 Stanford Daily 0.999258 20
3 3 Stanford Daily 0.269004 13
4 3 Stanford Daily 0.729223 20
5 4 Stanford Daily 0.999111 25
6 5 Stanford Daily 0.998287 1
7 6 Stanford Daily 0.978254 8
8 7 Stanford Daily 0.998719 24
9 8 Stanford Daily 0.618979 4
10 8 Stanford Daily 0.380087 21
11 9 Stanford Daily 0.172650 21
12 9 Stanford Daily 0.825826 23
13 10 Stanford Daily 0.999668 15
14 11 Stanford Daily 0.138658 4
15 11 Stanford Daily 0.860255 15
16 12 Stanford Daily 0.998717 2
17 13 Stanford Daily 0.924747 11
18 13 Stanford Daily 0.074378 18
19 14 Stanford Daily 0.997866 14
20 15 Stanford Daily 0.999274 22
21 16 Stanford Daily 0.999075 27
22 17 Stanford Daily 0.998798 16
23 18 Stanford Daily 0.998971 15
24 19 Stanford Daily 0.519245 5
25 19 Stanford Daily 0.479591 20
26 20 Stanford Daily 0.999176 22
27 21 Stanford Daily 0.943922 0
28 21 Stanford Daily 0.046557 6
29 22 Stanford Daily 0.998359 1
... ... ... ... ...
197 131 Stanford Review 0.041330 15
198 131 Stanford Review 0.031038 16
199 131 Stanford Review 0.028016 17
200 131 Stanford Review 0.044704 18
201 131 Stanford Review 0.034421 19
202 131 Stanford Review 0.044838 20
203 131 Stanford Review 0.027976 21
204 131 Stanford Review 0.037756 22
205 131 Stanford Review 0.037892 23
206 131 Stanford Review 0.031018 24
207 131 Stanford Review 0.028124 25
208 131 Stanford Review 0.034317 26
209 131 Stanford Review 0.034451 27
210 131 Stanford Review 0.027998 28
211 131 Stanford Review 0.037783 29
212 132 Stanford Review 0.183900 15
213 132 Stanford Review 0.075694 18
214 132 Stanford Review 0.739799 29
215 133 Stanford Review 0.509945 11
216 133 Stanford Review 0.489490 27
217 134 Stanford Review 0.999568 9
218 135 Stanford Review 0.083696 24
219 135 Stanford Review 0.915512 28
220 136 Stanford Review 0.999370 17
221 137 Stanford Review 0.999090 10
222 138 Stanford Review 0.997551 26
223 139 Stanford Review 0.999426 17
224 140 Stanford Review 0.999311 12
225 141 Stanford Review 0.999408 27
226 142 Stanford Review 0.999295 24

227 rows × 4 columns

Visualization using ggplot2 and Shiny

We will visualize the topic model stored in topics.csv and opEds.csv using the R packages ggplot2 and Shiny. The latter lets you create interactive plots in R. First install and load the package with the commands

install.packages('shiny')
library(shiny)

You can read more about Shiny at http://shiny.rstudio.com. A Shiny app is composed of two scripts in the same directory. The first script, ui.R, sets up the layout of the HTML page where the app is hosted, as well as any interactive widgets, such as the sliders we use in this example. The second script, server.R, contains R commands that generate a plot from the variables defined interactively through the widgets. In our case, we use ggplot2 to visualize the output of the topic model.

The archive at

http://www.stanford.edu/class/stats202/lda.tar.gz

also contains the data files and R scripts needed to run the app. Once you have downloaded and decompressed the archive, you can launch the app by setting your working directory in R to the directory containing ui.R and server.R and using the command:

runApp('.')