Homework 8: Zeitgeist

Assignment developed by Brahm Capoor

This is the last homework, pulling it all together, and it is relatively small. It is due Wednesday, March 11th at 11:55pm as usual. The grace period extends until Friday, March 13th, after which (due to university policy) we will not provide extensions, OAE or otherwise.

The first part of this project is standard Python code, slicing up and organizing data. The second part uses Jupyter to display the data and produce a notebook.

In this assignment, you'll take the Python skills you've been developing over the last few months and apply them to a real-world data analysis problem. Specifically, we've collected a dataset of 1.5 million news headlines published over the last two decades and you'll be leveraging dictionaries, Jupyter notebooks and matplotlib to produce graphs tracking the usage of different words in these headlines over time.

Download the starter code here.

Headline Data

The headline data looks like the following text, with the data on each line separated by a comma. Each line represents one particular news headline. The first column is the month in which the headline was published, measured in months past January 2003, which we'll refer to as the month number; the second is the text of the headline itself.

0,ambitious olsson wins triple jump
...
80,golfers tee worlds longest golf course
80,muse named worlds best act q awards
...
190,tasmanian golf museum claims world oldest electronic computer
    

Each headline has been made lowercase and preprocessed to remove all punctuation and all stopwords: words such as 'the' and 'at', which occur so frequently that they skew any meaningful analysis.

To work with this data, organize it as a "headlines" dict with a key for each unique word that appears in a headline. Each key's value is a counts dictionary mapping a month number (using the definition above) to the number of times that word was used in that month:

{
    'ambitious': {0 : 1},
    'golfers': {80: 1},
    'worlds': {80: 2},
    'golf': {80: 1, 190: 1}        
}
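This nested structure can be built with the standard dictionary counting pattern. Here is a hedged sketch using made-up sample data; the variable and data names are illustrative, not from the starter code:

```python
# Illustrative sketch of the nested-dict counting pattern,
# using made-up (month, word) pairs rather than the real starter file.
headlines = {}
samples = [(80, 'golfers'), (80, 'worlds'), (80, 'worlds'), (190, 'golf')]
for month, word in samples:
    if word not in headlines:
        headlines[word] = {}                        # first time seeing this word
    counts = headlines[word]
    counts[month] = counts.get(month, 0) + 1        # bump this month's count

# headlines == {'golfers': {80: 1}, 'worlds': {80: 2}, 'golf': {190: 1}}
```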
    

Part 1: produce_headlines_dict(text)

This part is familiar PyCharm / Python coding. In the headlinelib.py file, complete the produce_headlines_dict(text) function, which takes in the headline text and returns a headlines dict as above.

Use the string function text.splitlines(), which splits a string into a list of lines to loop over.
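For example (using made-up headline text):

```python
# splitlines() breaks the raw text at newline characters.
text = '0,ambitious olsson wins triple jump\n80,golfers tee worlds longest golf course'
lines = text.splitlines()
# lines[0] == '0,ambitious olsson wins triple jump'
# lines[1] == '80,golfers tee worlds longest golf course'
```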

One simple Doctest is provided. Syntax quirk: within Doctests, the newline character must be written with a double backslash (\\n), as shown in the provided test. Write at least 2 additional Doctests, each with at least 2 headlines and at least 3 unique words.
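To illustrate the quirk on a toy function that is not part of the assignment: inside the docstring the newline is written \\n, but the function still receives a real newline character when the test runs.

```python
def count_lines(text):
    """Return how many lines are in text (toy example, not the assignment function).

    >>> count_lines('0,first headline\\n80,second headline')
    2
    """
    return len(text.splitlines())

if __name__ == '__main__':
    import doctest
    doctest.testmod()
```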

The provided main() calls your produce_headlines_dict from the command line as another way to run your code. We've provided a small file with a few headlines called small-headlines.txt that you can use to test your program like so:

$ python3 headlinelib.py small-headlines.txt
    

The only deliverable for Part 1 is that your produce_headlines_dict is done and tested to work correctly. Make sure your function is correct before moving on to the next part: it's a lot easier to test and perfect your function in PyCharm, where you have the Doctests working for you.

Part 2: Jupyter

Make sure jupyter and matplotlib are installed, as in lecture. It's not a problem if pip mentions that it could be upgraded; you can ignore that. (On Windows, the command below uses "python" instead of "python3".)

$  python3 -m pip install jupyter matplotlib 
    

Run Jupyter

From a terminal in your Zeitgeist folder, run jupyter notebook, which should open a Jupyter web page.

$ jupyter notebook
    

1. Open the 'Headline Analysis' jupyter notebook in the starter code folder. Run the first cell, which contains this code:

%load_ext autoreload
%autoreload 2
%matplotlib inline

import matplotlib.pyplot as plt
import urllib.request

import headlinelib
    

The first two lines ensure that your notebook automatically picks up any changes you make to your headlinelib.py file. The %matplotlib line avoids the problem of graphs not always showing up. The import headlinelib line just brings in your Python code so the notebook can call it.

2. Web data. Here are the 4 lines from the lecture example that load the contents of the file at http://web.stanford.edu/class/cs106a/hello.txt and show its length:

import urllib.request
f = urllib.request.urlopen('http://web.stanford.edu/class/cs106a/hello.txt')
text = f.read().decode('utf-8')
len(text)
    

The headlines data is at the following URL, so change the code to load that text:

http://web.stanford.edu/class/cs106a/headlines.txt
    

Get the urllib calls working in your notebook so it downloads the headlines data text. This is a sizeable amount of data and might take a while to download, so do it in a separate cell from the rest of your code so you only need to run it once. If the download is taking too long, you can use the small headlines datafile at http://cs106a.stanford.edu/smallheadlines.txt, although this will limit the graphs you're able to produce later on.

3. Call your headlinelib code to parse the data.

headlines = headlinelib.produce_headlines_dict(text)
len(headlines)
    

Calling len on the headlines dictionary will output a number, providing a little bit of confirming output to signal that it worked. The length in this case is 163265, indicating that there are 163265 distinct words across the headlines. If you are using the small headlines data, the number of keys should be 8831.

Graphing

Now, we'll use Jupyter's strength in graphing to build a tool that allows us to track the frequency of word usage in newspaper headlines over time. Your main job is to implement the following function:

def plot_words(words)
    

which accepts a list of words as a parameter and plots, on the same graph, the frequency trend of each word so you can compare their usage over time.

For each word in words, plot a graph of month number against the number of times the word was used in that month, which you should retrieve from the headlines dictionary. To produce the graph, you'll need to have two equal-length lists: one for the x values, and one for the y values.

For the x values, you can use the MONTHS constant, which stores every integer from 0 through 199, representing every month for which we have data for at least one headline.

To find the y values, for each month in MONTHS, look up how many times that word was used in a newspaper headline that month; if it was not used at all in a month, treat the frequency as 0. You can use the dictionary's .get function, which accepts as parameters a key and a value to return if the key is not in the dictionary. For example, d.get('a', 42) will return the value associated with 'a' if it is a key in the dictionary, and 42 otherwise. Store all of these frequencies in a list; once you have it, you should be able to plot MONTHS against it.

Plotting tips
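As a starting point, here is a hedged sketch of the y-value computation described above, assuming headlines and MONTHS exist as described; the helper name word_frequencies is illustrative, not from the starter code:

```python
def word_frequencies(word, headlines, months):
    """Return a list of per-month counts for word, using 0 where absent."""
    counts = headlines.get(word, {})          # {} if the word never appears
    return [counts.get(month, 0) for month in months]

# In plot_words, each word then becomes one line on the shared graph:
#   for word in words:
#       plt.plot(MONTHS, word_frequencies(word, headlines, MONTHS), label=word)
#   plt.legend()
#   plt.show()
```

Passing a label to plt.plot and calling plt.legend() is what lets multiple words share one labeled graph.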

Testing your function

To allow you to test your plot_words function, we've provided you with the ability to directly query it to see the trends for various words. You can do so using the cell with this code:

words = input("Type a space-separated list of search terms and press enter: ").split()
plot_words(words)
    

Running the code in this cell produces a textbox for you to type in, as you'll see below:

A prompt in Jupyter

Typing in your search terms, separated by spaces, will pass those terms as a list into plot_words. You should then be able to produce graphs like the examples we've produced below:

Plot for Olympics
Plot for Bitcoin

Most frequent words

As a last step, we're interested in finding out which words have most frequently been used in headlines over the last 20 years. In the cell below your graphs, write code leveraging the sorted function's key parameter to determine the 10 most commonly used words. There are multiple ways to do this; our own solution is two lines of code and additionally leverages the sum function as well as a dictionary's .values() function, which returns a collection of all the values in the dictionary. Some of the most frequent words might surprise you, whilst others might be more expected.
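As a refresher on sorted's key parameter, here is a toy example on unrelated made-up data (deliberately not the headlines solution):

```python
# Toy example of sorted's key parameter, unrelated to the headlines data.
scores = {'ada': [3, 1], 'bob': [10], 'cy': [2, 2, 2]}
# Sort names by the total of their score lists, largest total first.
ranked = sorted(scores, key=lambda name: sum(scores[name]), reverse=True)
# ranked == ['bob', 'cy', 'ada']
```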

This is all the analysis you are required to implement for this assignment, but we'd encourage you to explore this data further if you have the time and inclination. There are plenty of interesting questions to ask of it, and should you arrive at any interesting conclusions, we'd love to hear about them.

When your graphs look good, use File > Save and Checkpoint to save the .ipynb file in its current state. Then use File > Close and Halt to get out of the notebook. Back at the Jupyter file list, there's a Quit button at the top and you can close the tab. Please turn in the 2 files on Paperless: headlinelib.py and analysis.ipynb.