Stanford has two large student newspapers. The Stanford Daily is the main campus paper, while the Stanford Review, published biweekly, runs conservative-leaning political articles. Each paper devotes a substantial section to op-eds. In this exercise, we will apply topic models to analyze a corpus of opinion pieces published in both papers within the last year.
If you want to follow along, download the following directory which contains an IPython notebook, and the scraped datasets:
http://www.stanford.edu/class/stats202/lda.tar.gz
We start by scraping the website of the Stanford Daily, using the packages requests and BeautifulSoup4 introduced in Homework 3.
import requests
from bs4 import BeautifulSoup
baseURL = "http://www.stanforddaily.com/category/opinions/op-eds/page/"
hdr = {'User-Agent':'Mozilla/5.0'}
# Get a list of permalinks to each opinion piece (in first 11 pages)
articleURLs = {}
for page in range(1, 12):
    opinionPage = requests.get(baseURL + str(page), headers=hdr)
    soup = BeautifulSoup(opinionPage.text, "html.parser")
    listItem = soup.findAll(attrs={'class': 'item-list'})
    for li in listItem:
        link = li.find('a')
        articleURLs[link.attrs['title']] = link.attrs['href']
# Print the number of pieces
len(articleURLs)
# Strip the "Permalink to " prefix (13 characters) from each title
articleURLs = {key[13:]: value for key, value in articleURLs.items()}
# Scrape each article website for the text of the article
articles = {}
for title, url in articleURLs.items():
    print(title)
    articlePage = requests.get(url, headers=hdr)
    soup = BeautifulSoup(articlePage.text, "html.parser")
    # Remove irrelevant JavaScript text
    [x.extract() for x in soup.findAll(attrs={'id': 'videoread'})]
    # The article text appears in the second element labeled "entry"
    entry = soup.find_all(attrs={'class': 'entry'})[1]
    articles[title] = entry.getText()
# Save the dictionary of articles to a file in the current directory
f = open('dailyArticles.txt','w')
f.write(str(articles))
f.close()
We now scrape articles from the website of the Stanford Review.
baseURL = 'http://stanfordreview.org/cat/sections/opinion/page/'
hdr = {'User-Agent':'Mozilla/5.0'}
# Get a list of permalinks to each opinion piece (in first 11 pages)
articleURLs = {}
for page in range(1, 12):
    opinionPage = requests.get(baseURL + str(page), headers=hdr)
    soup = BeautifulSoup(opinionPage.text, "html.parser")
    listItem = soup.findAll(attrs={'class': 'entry-title'})
    for li in listItem:
        link = li.find('a')
        articleURLs[link.text] = link.attrs['href']
# Print the number of articles
len(articleURLs)
# Scrape each article website for the text of the article
reviewArticles = {}
for title, url in articleURLs.items():
    articlePage = requests.get(url, headers=hdr)
    soup = BeautifulSoup(articlePage.text, "html.parser")
    entry = soup.find(attrs={'class': 'post-content'})
    reviewArticles[title] = entry.getText()
    print(title)
# Save the articles to the working directory
f = open('reviewArticles.txt','w')
f.write(str(reviewArticles))
f.close()
If the scraping has already been done, we can start from here by reloading the saved files.
f = open('dailyArticles.txt','r')
articles = eval(f.read())
f.close()
f = open('reviewArticles.txt','r')
reviewArticles = eval(f.read())
f.close()
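Since eval will execute any Python it finds in the file, a safer way to read these saved dictionaries back (assuming they were written with str() as above; the helper name load_articles and the sample file are illustrative, not part of the exercise) is ast.literal_eval, which accepts only Python literals:

```python
import ast

def load_articles(path):
    # ast.literal_eval parses literals (dicts, strings, numbers, ...) but,
    # unlike eval, refuses to execute arbitrary expressions
    with open(path, 'r') as f:
        return ast.literal_eval(f.read())

# Round-trip check with a small in-memory example
sample = {"A hypothetical op-ed": "Body text goes here."}
with open('sampleArticles.txt', 'w') as f:
    f.write(str(sample))
assert load_articles('sampleArticles.txt') == sample
```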
# Define a set of punctuation characters (including numbers)
punct = set('!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~1234567890')
# Add the newline character
punct.add('\n')
# This will be a list containing all of the articles as lowercase strings without punctuation (a bag of words)
cleanArticles = []
for title, article in articles.items():
    asciiArticle = article.encode('ascii', 'ignore').decode()
    cleanArticles.append("".join(' ' if x in punct else x for x in asciiArticle).lower())
for title, article in reviewArticles.items():
    asciiArticle = article.encode('ascii', 'ignore').decode()
    cleanArticles.append("".join(' ' if x in punct else x for x in asciiArticle).lower())
# For example, here is the first "clean" article
cleanArticles[0]
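The character-by-character join above works, but the same substitution can be done in a single call with str.translate; this sketch repeats the punctuation characters from the set defined above:

```python
# Translation table mapping each punctuation/digit character to a space
punct = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~1234567890\n'
table = str.maketrans(punct, ' ' * len(punct))

def clean(text):
    # Drop non-ASCII characters, replace punctuation with spaces, lowercase
    return text.encode('ascii', 'ignore').decode().translate(table).lower()

clean("Hello, World! 99").split()  # -> ['hello', 'world']
```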
Before we apply the topic model, we want to remove very common words such as articles and pronouns, which are referred to as "stop words". We also add to this list words that appear in nearly every article, such as "Stanford" and "student".
stopwords = ['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', 'your', 'yours',
'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', 'her', 'hers',
'herself', 'it', 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves',
'what', 'which', 'who', 'whom', 'this', 'that', 'these', 'those', 'am', 'is', 'are',
'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does',
'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until',
'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into',
'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down',
'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here',
'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more',
'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so',
'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don', 'should', 'now', 'u', 'many']
# Adding a few words that don't help us distinguish between articles
stopwords.extend(['stanford','student','students'])
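Because this list is scanned once per word in every article, it can pay to convert it to a set, which gives constant-time membership tests; a small sketch with an excerpt of the list:

```python
# A small excerpt of the stop-word list above; sets give O(1) lookups
stopset = {'the', 'are', 'at', 'students'}

text = "the students are at the protest"
kept = [w for w in text.split() if w not in stopset]
print(kept)  # -> ['protest']
```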
We also want to merge words that are slight variations of one another, such as "politics", "political", and "politician". This can be done using the well-known Porter stemmer, which strips suffixes to map related words onto a common stem: for example, "politics" and "political" both map to "polit". This function is available in the package nltk.
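The actual Porter algorithm applies several cascaded phases of rewrite rules; as a rough illustration of the idea only (this toy function is not the algorithm nltk implements), a single pass of suffix stripping already merges two of the examples above:

```python
# Toy suffix stripper -- a drastic simplification of the Porter algorithm,
# shown only to illustrate mapping word variants onto a common stem
SUFFIXES = ['ical', 'ics', 'ic', 's']

def toy_stem(word):
    for suffix in SUFFIXES:
        # Strip the first matching suffix, keeping at least a 3-letter stem
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word

print(toy_stem('politics'), toy_stem('political'))  # -> polit polit
```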
import nltk
stemmer = nltk.PorterStemmer()
# This will be a list of articles, where each article is a list of words
tokenizedArticles = []
# This will be a dictionary mapping each "stem" to a representative word
stem2representative = {}
for article in cleanArticles:
    # Split the article into words and drop the stopwords
    tokens = [word for word in article.split() if word not in stopwords]
    # Replace each word by its stem, storing a representative word for each stem
    stems = []
    for word in tokens:
        s = stemmer.stem(word)
        stem2representative[s] = word
        stems.append(s)
    # Keep only stems that appear more than twice in the article
    tokens = [word for word in stems if stems.count(word) > 2]
    tokenizedArticles.append(tokens)
# For each article, replace the stem by a single representative
tokenizedArticles = [ [stem2representative[s] for s in article] for article in tokenizedArticles ]
Now, we can apply the Latent Dirichlet Allocation (LDA) model to this set of articles. The model is implemented in the Python package gensim.
import gensim
# Assign each word a unique identifier (a number), and define a dictionary for mapping from word to number
dictionary = gensim.corpora.Dictionary(tokenizedArticles)
# Number of unique tokens in all the articles
len(dictionary.token2id)
# Here we define a "corpus", which represents each article as a list of (token id, count) pairs
corpus = [dictionary.doc2bow(text) for text in tokenizedArticles]
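The same sparse (token id, count) representation that doc2bow produces can be built by hand with collections.Counter; a sketch with made-up tokens (this doc2bow is our own re-implementation for illustration, not gensim's):

```python
from collections import Counter

docs = [['campus', 'housing', 'campus'], ['housing', 'tuition']]

# Assign each unique token an integer id, mimicking gensim's Dictionary
token2id = {}
for doc in docs:
    for tok in doc:
        token2id.setdefault(tok, len(token2id))

def doc2bow(doc):
    # Represent a document as sorted (token_id, count) pairs
    counts = Counter(doc)
    return sorted((token2id[t], n) for t, n in counts.items())

print(doc2bow(docs[0]))  # -> [(0, 2), (1, 1)]
```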
# Now, we can train the LDA model on the corpus
model = gensim.models.ldamodel.LdaModel(corpus=corpus, id2word=dictionary, num_topics=30, update_every=1, alpha="auto", passes=400, chunksize=143)
# We can view the words associated with topic 12 by calling
model.print_topic(12)
# Similarly, we can view the distribution of topics in article 57 by calling
model[corpus[57]]
# Here, we store the distribution of topics in every article
topicDists = [ model[corpus[i]] for i in range(len(corpus)) ]
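Each entry of topicDists is a sparse list of (topic, probability) pairs, so, for instance, an article's dominant topic is the pair with the largest probability. A sketch with made-up numbers:

```python
# A made-up sparse topic distribution for one article: (topic, probability)
dist = [(3, 0.12), (12, 0.55), (27, 0.33)]

# Take the max over the second element to find the dominant topic
dominantTopic, prob = max(dist, key=lambda pair: pair[1])
print(dominantTopic)  # -> 12
```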
import pandas as pd
numTopics = 30
topics = {"topic":[],"word":[],"weight":[]}
for topic in range(numTopics):
    x = model.show_topic(topic, 20)
    for weight, word in x:
        topics["topic"].append(topic)
        topics["word"].append(word)
        topics["weight"].append(weight)
topics = pd.DataFrame(topics)
topics.to_csv("topics.csv")
topics
opEds = { "topic":[], "probability":[], "document":[], "paper":[]}
for opEd in range(143):
    for topic, prob in topicDists[opEd]:
        opEds["topic"].append(topic)
        opEds["probability"].append(prob)
        opEds["document"].append(opEd)
        opEds["paper"].append("Stanford Daily" if opEd < 88 else "Stanford Review")
opEds = pd.DataFrame(opEds)
opEds.to_csv("opEds.csv")
opEds
We will visualize our topic model, stored in topics.csv and opEds.csv, using the R packages ggplot2 and Shiny. The latter allows you to create interactive plots in R. You must first install and load the package using the commands
install.packages('shiny')
library(shiny)
You can read more about Shiny at http://shiny.rstudio.com. A Shiny app is composed of two scripts in the same directory. The first script, ui.R, sets up the layout of the HTML page where the app is hosted, as well as any interactive widgets, such as the sliders that we use in the example. The second script, server.R, contains R commands which generate a plot using variables defined interactively through the widgets. In our case, we use ggplot2 to visualize the output of the topic model.
The archive at
http://www.stanford.edu/class/stats202/lda.tar.gz
also contains the data files and R scripts needed to run the app. Once you have downloaded and decompressed the archive, you can launch the app by setting your working directory in R to the directory containing ui.R and server.R and using the command:
runApp('.')