Today: comprehensions, matplotlib intro

List Comprehensions - Beautiful Hacker

For more details, see the Python Guide Comprehensions section

Comprehension Introduction

# Say we have:
nums = 
[1, 2, 3, 4, 5, 6]

# We want:
[1, 4, 9, 16, 25, 36]

# 1-line comprehension solves this:
[n * n for n in nums]

Comprehension Steps

Comprehension Examples

>>> nums = [1, 2, 3, 4, 5, 6]
>>> [n + 10 for n in nums]
[11, 12, 13, 14, 15, 16]
>>>
>>> [abs(n - 3) for n in nums]
[2, 1, 0, 1, 2, 3]
>>>
>>> [str(n) + '!!' for n in nums]
['1!!', '2!!', '3!!', '4!!', '5!!', '6!!']
>>>

Comprehension Summary

We have a list of n elements, the comprehension is a dense, 1-line way to compute a list of new elements. That happens to be a useful little pattern for our next topic, matplotlib. We'll talk about comprehensions more on Friday.


The Universe of Matplotlib

Matplotlib is an extremely capable and popular Python module for producing visualizations of data. Install it with "-m pip" like this:

$ python3 -m pip install matplotlib

Matplotlib is very popular library used by researches and media for generating visualizations. Matplotlib has a dizzying number of features. We will just scratch the surface here, so you get a feel for what's there, and can become more expert with it after CS106A if you like.

For more matplotlib information, see matplotlib.org and here is a popular tutorial (with tons of ads!): Matplotlib Tutorial

For this lecture example, we'll just use the few matplotlib features shown below, and later CS106A matplotlib work will use these same features.

Election 2020 Data - mystery-plot.zip

Download to get started

mystery-plot.zip which contains real data from the 2020 election. There's a lot of real data in here we can graph.

The file "mystery-plot.py" is complete - we'll just play with it to experiment with matplotlib, and it is working example matplotlib code.

Data source: there's a nicely organized 2020 data set - graphics at the New York Times and other things you've seen are built from this data. It's also huge and complicated! The California and Texas data sets below are from this url.

Look at ca-2020-election.csv

The file ca-2020-election.csv looks like this: (you can click on it from within Pycharm to see it). There are more than 70,000 lines of data in here, with data for each voting precinct in California.

COUNTY,FIPS,RGPREC_KEY,ELECTION,TYPE,RGPREC,TOTREG_R,DEM,REP,AIP,PAF,MSC,LIB,NLP,GRN...
25,06049,0604902001,g20,V,02001,96,22,41,1,0,1,2,0,0,0,29,40,56,3,0,2,0,0,1,0,0,0,0,0,0,0...
25,06049,0604904001,g20,V,04001,111,25,61,6,0,0,1,0,0,0,18,58,53,0,2,1,1,0,0,0,0,0,0,2,0,0...
25,06049,0604906001,g20,V,06001,365,101,184,16,2,0,2,0,4,0,56,166,199,5,10,2,1,5,4,3,1,0,0,..
..

How To Parse - read_ca()

How to parse this? No problem! Use: (1) for line in f, (2) line.split(',')

The function read_ca() parses the above text, computing a dict with one key for each county, and its value is the total votes in that county.

See "Votes" Dict

Run like this to see the votes dict computed by read_ca(). It turns out there are 58 counties in California, the most populous state in the US.

$ python3 mystery-plot.py -ca ca-2020-election.csv 
{'Modoc': 4349, 'Madera': 53889, 'Napa': 72700, 'Sonoma': 268569, 'Merced': 90482, 'Kern': 298938, 'San Joaquin': 281988, 'San Benito': 28724, 'Stanislaus': 211971, 'Ventura': 427164, 'Santa Cruz': 146024, ...
$

Now Matplotib

Say we want to produce graphics like this - a bar chart for 3 counties, showing vote total per county.

alt: matplotlib outout of 3 bars

Step 1 - Matplotlib Standard Import

At top of file, have the following import. This form of import sets up plt as a shorthand for matplotlib.pyplot

This is a standard, idiomatic shorthand used in most examples, so we're using it too.

import matplotlib.pyplot as plt

Step 2 - plt.figure()

The function plot_ca1() builds the above plot with these steps.

First make the blank rectangle, called a "figure" in matplotlib. One unit of size here is about 0.5 inch, so this is kind of 4 inches by 2 inches.

    plt.figure(figsize=(8, 4))

Step 3 - x_vals, y_vals

We have the votes dict, like this - key is county, value is votes in that county.

votes = {'Modoc': 4349, 'Madera': 53889, 'Napa': 72700,
         'Sonoma': 268569, ... }

1. We want x_vals to be list of the counties to graph (i.e. the values going across the x-direction). In this case, this is just a list of the 3 county names.

    x_vals = ['Santa Cruz', 'Santa Clara', 'San Mateo']

2. We want y_vals to be a list of the y values corresponding to each x value (essentially the height for each x value). In this case, the y value is the number of votes for each county. We get each y value by looking up the number of votes up in the votes dict.

    y_vals = [votes['Santa Cruz'], votes['Santa Clara'], votes['San Mateo']]
    # y_vals is:
    #   [146024, 857609, 377876]

Step 4 - Make Bar Plot

With x_vals and y_vals ready, call plt.bar() to make the bar chart. Set the color we want, add some titling and show it.

    plt.bar(x_vals, y_vals, color='green')
    plt.title('Votes Per County')
    plt.show()

All Together - plot_ca1()

Here is the code run with the -ca1 flag. Computes the counts dict as above, then has some basic matplotlib calls to put the data on screen. You will need to be able to write code like this to make graphs.

def plot_ca1(filename):
    """Plot 3 counties - basic matplotlib"""
    votes = read_ca(filename)
    plt.figure(figsize=(8, 4))  # each unit is about 0.5 inch
    x_vals = ['Santa Cruz', 'Santa Clara', 'San Mateo']
    y_vals = [votes['Santa Cruz'],
              votes['Santa Clara'],
              votes['San Mateo']]
    # y_vals ends up as:
    #   [146024, 857609, 377876]
    plt.bar(x_vals, y_vals, color='green')
    plt.title('Votes Per County')
    # plt.xlabel('County')  # could have more titling
    # plt.ylabel('Votes')
    plt.show()

Run -ca1

You can run with -ca1 on the command line which runs the above code to make that chart. This is the sort of code you will need later. Use the tab key to autocomplete the "ca-20..." filename.

$ python3 mystery-plot.py -ca1 ca-2020-election.csv

alt: matplotlib outout of 3 bars

More Examples in Pattern

I'm going to dig around in the data to make more graphs, but all the examples will work in this x_vals, y_vals pattern, with the other parts basically the same as the example above.


7 Counties -ca2

This version plots 7 bay area counties, run with the -ca2 flag on the same data file.

$ python3 mystery-plot.py -ca2 ca-2020-election.csv 

alt: matplotlib outout of 7 bars

How To Compute y_vals

Have x_vals, now with 7 counties...

x_vals = ['Santa Clara', 'San Mateo',
              'Alameda', 'San Francisco',
              'Marin', 'Sonoma', 'Napa']

Could do y_vals like this, for each county, retrieve its number from the votes dict.

y_vals = [votes['Santa Clara'], votes['San Mateo'],
              votes['Alameda'], votes['San Francisco'],
              votes['Marin'], votes['Sonoma'], votes['Napa']]

The above code is tedious. Is there a way to make the list of numbers without typing all the county names? The x_vals list is already a list of all the county names. The neat way is to write a comprehension, for each county name, looking up its value in the votes dict

y_vals = [votes[county] for county in x_vals]

# y_vals ends up as:
#  [857609, 377876, 777781, 442345, 156801, 268569, 72700]

We don't want to have to manually type in the code to look up the value for each of the 7 counties.

Here's the full plot_ca2() code with the comprehension.

def plot_ca2(filename):
    """Plot more ca counties, using comprehension"""
    votes = read_ca(filename)
    plt.figure(figsize=(8, 4))
    # Expand to 7 bay-area counties
    x_vals = ['Santa Clara', 'San Mateo',
              'Alameda', 'San Francisco',
              'Marin', 'Sonoma', 'Napa']

    # Instead of typing each county again,
    # comprehension pulls each county name
    # out of the x_vals list - nice!

    y_vals = [votes[county] for county in x_vals]

    # y_vals ends up as:
    #  [857609, 377876, 777781, 442345, 156801, 268569, 72700]
    plt.bar(x_vals, y_vals, color='green')
    plt.title('Votes Per County')
    plt.show()

A Mystery - Look at First Digit Of Totals

Modern best practice - look at the data to inform yourself about what's going on. Here we'll use matplotlib to look in a big data set, get insights out.

Digit Plot Code -ca3

The function first_digits() takes in a list of numbers, computes a dict of how often each first digit appears, used in the matplotlib code below. The function output looks like this - here saying that the digit 4 appears 7 times, the digit 5 appears 3 times, and so on.

first_digits(nums) ->
  {4: 7, 5: 3, 7: 5, 2: 14, 9: 5, 1: 15, 8: 2, 6: 4, 3: 3}

Here is the code to draw 9 bars, for the digits 1, 2, 3, .. 9.

def plot_ca3(filename):
    """Plot first digits of all ca counties"""
    votes = read_ca(filename)
    nums = votes.values()
    counts = first_digits(nums)

    plt.figure(figsize=(8, 4))
    x_vals = ['1', '2', '3', '4', '5', '6', '7', '8', '9']
    y_vals = [counts[int(x)] for x in x_vals]
    plt.bar(x_vals, y_vals, color='green')
    plt.title('CA First Digits')
    plt.show()

Look At Output - Hmmmmmmmmmm

$ python3 mystery-plot.py -ca3 ca-2020-election.csv

That does not look random or uniform. What is going on here? Why are 1 and 2 so much more prevalent? Maybe just random noise? Maybe signs of a conspiracy tampering with election data?

Look At Texas Data

$ python3 mystery-plot.py -tx tx-2020-election.csv 

Hmmm again. This conspiracy is widespread!

Look at Worldwide Potato Production

Look at file potato-production.csv - 10,000 rows of potato production data in tons, like this

Afghanistan,AFG,1961,130000
Afghanistan,AFG,1962,115000
Afghanistan,AFG,1963,122000
Afghanistan,AFG,1964,129000
Afghanistan,AFG,1965,132000
...

Have read_potato() function to read in the text file as before. Graph the digits of the values of this data set.

$ python3 mystery-plot.py -potato potato-production.csv

Hmm. Looks just like the Texas data. This conspiracy is truly everywhere!

Isaac Asimov Eureka Quote

Quote: The most exciting phrase to hear in science, the one that heralds new discoveries, is not "Eureka!" (I found it!) but "That’s funny ..." - Isaac Asimov

What is going on? Benford's Law

Explanation of Benford's

alt: benfords slope decreases as x goes to infinity

Why 1xxx is more common than 2xxx

alt: benfords slope decreases as x goes to infinity