Stanford has two large student newspapers. The Stanford Daily is the main campus paper, while the Stanford Review, published biweekly, runs conservative-leaning political articles. Each paper devotes a substantial section to op-eds. In this exercise, we will apply topic models to analyze a corpus of opinion pieces published in both papers within the last year.
If you want to follow along, download the following directory which contains an IPython notebook, and the scraped datasets:
http://www.stanford.edu/class/stats202/lda.tar.gz
We start by scraping the website of the Stanford Daily, using the packages requests and BeautifulSoup4 introduced in Homework 3.
import requests
from bs4 import BeautifulSoup
baseURL = "http://www.stanforddaily.com/category/opinions/op-eds/page/"
hdr = {'User-Agent':'Mozilla/5.0'}
# Get a list of permalinks to each opinion piece (in first 11 pages)
articleURLs = {}
for page in range(1, 12):
    opinionPage = requests.get(baseURL + str(page), headers=hdr)
    soup = BeautifulSoup(opinionPage.text, "html.parser")
    listItem = soup.findAll(attrs={'class': 'item-list'})
    for li in listItem:
        link = li.find('a')
        articleURLs[link.attrs['title']] = link.attrs['href']
# Print the number of pieces
len(articleURLs)
# Strip the "Permalink to " prefix (13 characters) from each title
articleURLs = {key[13:]: value for key, value in articleURLs.items()}
# Scrape each article website for the text of the article
articles = {}
for title, url in articleURLs.items():
    print(title)
    articlePage = requests.get(url, headers=hdr)
    soup = BeautifulSoup(articlePage.text, "html.parser")
    # Remove irrelevant JavaScript text
    [x.extract() for x in soup.findAll(attrs={'id': 'videoread'})]
    # The article text appears in the second element labeled "entry"
    entry = soup.find_all(attrs={'class': 'entry'})[1]
    articles[title] = entry.getText()
# Save the dictionary of articles to a file in the current directory
f = open('dailyArticles.txt','w')
f.write(str(articles))
f.close()
We now scrape articles from the website of the Stanford Review.
baseURL = 'http://stanfordreview.org/cat/sections/opinion/page/'
hdr = {'User-Agent':'Mozilla/5.0'}
# Get a list of permalinks to each opinion piece (in first 11 pages)
articleURLs = {}
for page in range(1, 12):
    opinionPage = requests.get(baseURL + str(page), headers=hdr)
    soup = BeautifulSoup(opinionPage.text, "html.parser")
    listItem = soup.findAll(attrs={'class': 'entry-title'})
    for li in listItem:
        link = li.find('a')
        articleURLs[link.text] = link.attrs['href']
# Print the number of articles
len(articleURLs)
# Scrape each article website for the text of the article
reviewArticles = {}
for title, url in articleURLs.items():
    articlePage = requests.get(url, headers=hdr)
    soup = BeautifulSoup(articlePage.text, "html.parser")
    entry = soup.find(attrs={'class': 'post-content'})
    reviewArticles[title] = entry.getText()
    print(title)
# Save the articles to the working directory
f = open('reviewArticles.txt','w')
f.write(str(reviewArticles))
f.close()
If the scraping has already been done, we can start from here by reloading the saved files.
f = open('dailyArticles.txt','r')
articles = eval(f.read())
f.close()
f = open('reviewArticles.txt','r')
reviewArticles = eval(f.read())
f.close()
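Since eval will execute any Python it finds in the file, a safer way to read these saved dictionaries back (assuming they were written with str() as above; the helper name load_articles and the sample file are illustrative, not part of the exercise) is ast.literal_eval, which accepts only Python literals:

```python
import ast

def load_articles(path):
    # ast.literal_eval parses literals (dicts, strings, numbers, ...) but,
    # unlike eval, refuses to execute arbitrary expressions
    with open(path, 'r') as f:
        return ast.literal_eval(f.read())

# Round-trip check with a small in-memory example
sample = {"A hypothetical op-ed": "Body text goes here."}
with open('sampleArticles.txt', 'w') as f:
    f.write(str(sample))
assert load_articles('sampleArticles.txt') == sample
```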
# Define a set of punctuation characters (including numbers)
punct = set('!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~1234567890')
# Add the newline character
punct.add('\n')
# This will be a list containing all of the articles as lowercase strings without punctuation (a bag of words)
cleanArticles = []
for title, article in articles.items():
    asciiArticle = article.encode('ascii', 'ignore').decode()
    cleanArticles.append("".join(' ' if x in punct else x for x in asciiArticle).lower())
for title, article in reviewArticles.items():
    asciiArticle = article.encode('ascii', 'ignore').decode()
    cleanArticles.append("".join(' ' if x in punct else x for x in asciiArticle).lower())
# For example, here is the first "clean" article
cleanArticles[0]
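The character-by-character join above works, but the same substitution can be done in a single call with str.translate; this sketch repeats the punctuation characters from the set defined above:

```python
# Translation table mapping each punctuation/digit character to a space
punct = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~1234567890\n'
table = str.maketrans(punct, ' ' * len(punct))

def clean(text):
    # Drop non-ASCII characters, replace punctuation with spaces, lowercase
    return text.encode('ascii', 'ignore').decode().translate(table).lower()

clean("Hello, World! 99").split()  # -> ['hello', 'world']
```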
Before we apply the topic model, we want to remove very common words such as articles and pronouns, which are referred to as "stop words". We also add to this list words that appear in nearly every article, such as "Stanford" and "student".
stopwords = ['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', 'your', 'yours',
'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', 'her', 'hers',
'herself', 'it', 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves',
'what', 'which', 'who', 'whom', 'this', 'that', 'these', 'those', 'am', 'is', 'are',
'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does',
'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until',
'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into',
'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down',
'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here',
'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more',
'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so',
'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don', 'should', 'now', 'u', 'many']
# Adding a few words that don't help us distinguish between articles
stopwords.extend(['stanford','student','students'])
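Because this list is scanned once per word in every article, it can pay to convert it to a set, which gives constant-time membership tests; a small sketch with an excerpt of the list:

```python
# A small excerpt of the stop-word list above; sets give O(1) lookups
stopset = {'the', 'are', 'at', 'students'}

text = "the students are at the protest"
kept = [w for w in text.split() if w not in stopset]
print(kept)  # -> ['protest']
```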
We also want to merge words that are slight variations of one another, such as "politics", "political", and "politician". This can be done using the well-known Porter stemmer, which strips suffixes to map related words onto a common stem: for example, "politics" and "political" both map to "polit". This function is available in the package nltk.
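The actual Porter algorithm applies several cascaded phases of rewrite rules; as a rough illustration of the idea only (this toy function is not the algorithm nltk implements), a single pass of suffix stripping already merges two of the examples above:

```python
# Toy suffix stripper -- a drastic simplification of the Porter algorithm,
# shown only to illustrate mapping word variants onto a common stem
SUFFIXES = ['ical', 'ics', 'ic', 's']

def toy_stem(word):
    for suffix in SUFFIXES:
        # Strip the first matching suffix, keeping at least a 3-letter stem
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word

print(toy_stem('politics'), toy_stem('political'))  # -> polit polit
```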
import nltk
stemmer = nltk.PorterStemmer()
# This will be a list of articles, where each article is a list of words
tokenizedArticles = []
# This will be a dictionary mapping each "stem" to a representative word
stem2representative = {}
for article in cleanArticles:
    # Split the article into words and drop the stopwords
    tokens = [word for word in article.split() if word not in stopwords]
    # Replace each word by its stem, storing a representative word for each stem
    stems = []
    for word in tokens:
        s = stemmer.stem(word)
        stem2representative[s] = word
        stems.append(s)
    # Keep only stems that appear more than twice in the article
    tokens = [word for word in stems if stems.count(word) > 2]
    tokenizedArticles.append(tokens)
# For each article, replace the stem by a single representative
tokenizedArticles = [ [stem2representative[s] for s in article] for article in tokenizedArticles ]
Now, we can apply the Latent Dirichlet Allocation (LDA) model to this set of articles. The model is implemented in the Python package gensim.
import gensim
# Assign each word a unique identifier (a number), and define a dictionary for mapping from word to number
dictionary = gensim.corpora.Dictionary(tokenizedArticles)
# Number of unique tokens in all the articles
len(dictionary.token2id)
# Here we define a "corpus", which represents each article as a list of (token id, count) pairs
corpus = [dictionary.doc2bow(text) for text in tokenizedArticles]
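The same sparse (token id, count) representation that doc2bow produces can be built by hand with collections.Counter; a sketch with made-up tokens (this doc2bow is our own re-implementation for illustration, not gensim's):

```python
from collections import Counter

docs = [['campus', 'housing', 'campus'], ['housing', 'tuition']]

# Assign each unique token an integer id, mimicking gensim's Dictionary
token2id = {}
for doc in docs:
    for tok in doc:
        token2id.setdefault(tok, len(token2id))

def doc2bow(doc):
    # Represent a document as sorted (token_id, count) pairs
    counts = Counter(doc)
    return sorted((token2id[t], n) for t, n in counts.items())

print(doc2bow(docs[0]))  # -> [(0, 2), (1, 1)]
```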
# Now, we can train the LDA model on the corpus
model = gensim.models.ldamodel.LdaModel(corpus=corpus, id2word=dictionary, num_topics=30, update_every=1, alpha="auto", passes=400, chunksize=143)
# We can view the words associated with topic 12 by calling
model.print_topic(12)
# Similarly, we can view the distribution of topics in article 57 by calling
model[corpus[57]]
# Here, we store the distribution of topics in every article
topicDists = [ model[corpus[i]] for i in range(len(corpus)) ]
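Each entry of topicDists is a sparse list of (topic, probability) pairs, so, for instance, an article's dominant topic is the pair with the largest probability. A sketch with made-up numbers:

```python
# A made-up sparse topic distribution for one article: (topic, probability)
dist = [(3, 0.12), (12, 0.55), (27, 0.33)]

# Take the max over the second element to find the dominant topic
dominantTopic, prob = max(dist, key=lambda pair: pair[1])
print(dominantTopic)  # -> 12
```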
import pandas as pd
numTopics = 30
topics = {"topic":[],"word":[],"weight":[]}
for topic in range(numTopics):
    x = model.show_topic(topic, 20)
    for weight, word in x:
        topics["topic"].append(topic)
        topics["word"].append(word)
        topics["weight"].append(weight)
topics = pd.DataFrame(topics)
topics.to_csv("topics.csv")
topics
opEds = { "topic":[], "probability":[], "document":[], "paper":[]}
for opEd in range(143):
    for topic, prob in topicDists[opEd]:
        opEds["topic"].append(topic)
        opEds["probability"].append(prob)
        opEds["document"].append(opEd)
        opEds["paper"].append("Stanford Daily" if opEd < 88 else "Stanford Review")
opEds = pd.DataFrame(opEds)
opEds.to_csv("opEds.csv")
opEds
We will visualize our topic model, stored in topics.csv and opEds.csv, using the R packages ggplot2 and Shiny. The latter allows you to create interactive plots in R. You must first install and load the package using the commands
install.packages('shiny')
library(shiny)
You can read more about Shiny at http://shiny.rstudio.com. A Shiny app is composed of two scripts in the same directory. The first script, ui.R, sets up the layout of the HTML page where the app is hosted, as well as any interactive widgets, such as the sliders that we use in the example. The second script, server.R, contains R commands which generate a plot using variables defined interactively through the widgets. In our case, we use ggplot2 to visualize the output of the topic model.
The archive at
http://www.stanford.edu/class/stats202/lda.tar.gz
also contains the data files and R scripts needed to run the app. Once you have downloaded and decompressed the archive, you can launch the app by setting your working directory in R to the directory containing ui.R and server.R and using the command:
runApp('.')