L28

Today: matplotlib, jupyter notebooks, numpy

Today we'll look at Python features you might want to use in the future. These are not on the final!

Recall: Comprehension 1-2-3

Nick's mnemonic: re-use syntax of other Python features
1. type in a pair of outer brackets [ ]
2. inside write a foreach "for n in nums"
3. then the result expression "n * n" goes on the left

>>> nums = [1, 2, 3, 4, 5, 6]
>>> [n + 10 for n in nums]
[11, 12, 13, 14, 15, 16]
>>> [abs(n - 3) for n in nums]
[2, 1, 0, 1, 2, 3]
>>> [str(n) + '!!' for n in nums]
['1!!', '2!!', '3!!', '4!!', '5!!', '6!!']
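
To connect the 1-2-3 steps to code you already know, here is a small sketch comparing a regular foreach loop with the equivalent comprehension. The variable names squares_loop and squares_comp are made up for this illustration.

```python
nums = [1, 2, 3, 4, 5, 6]

# Loop version: build the result list one append at a time
squares_loop = []
for n in nums:
    squares_loop.append(n * n)

# Comprehension version: same foreach, result expression on the left
squares_comp = [n * n for n in nums]

print(squares_loop)
print(squares_comp)
```

Both lines print [1, 4, 9, 16, 25, 36]. The comprehension is just a compact way to write the loop.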

The Universe of Matplotlib

Matplotlib is an extremely capable and popular open-source Python module for producing visualizations of data. Install it with "pip" like this:

$ python3 -m pip install matplotlib

Matplotlib is super popular with researchers and media for generating visualizations. The number of features and variations available in Matplotlib is a little dizzying. We will just scratch the surface here, so you get a feel for what's there.

See the official site, matplotlib.org. There is also a popular tutorial (with tons of ads!): Matplotlib Tutorial

For this lecture example, we'll just use the few matplotlib features shown below.

Matplotlib Standard Import

At the top of the file, include the following import. This sets up "plt" as the name used for matplotlib in the file. This is the standard import for matplotlib, seen in most examples, so we'll use it too.

# standard import line .. 'plt' is idiomatic here
import matplotlib.pyplot as plt

mystery-data.zip

Download to get started

mystery-data.zip. There's a lot of real data in here we can graph.

The file "mystery-plot.py" is complete, working matplotlib code; we'll just play with it to experiment with matplotlib.

Data: Election 2020

Data source: there's a nicely organized 2020 data set - graphics at the New York Times and other things you've seen are built from this data. It's also huge and complicated! The California and Texas data sets below are from this url.

Look at ca-2020-election.csv

The file ca-2020-election.csv looks like this (you can click on it from within PyCharm to see it). There are more than 70,000 lines of data in here, with data for each voting precinct in California.

COUNTY,FIPS,RGPREC_KEY,ELECTION,TYPE,RGPREC,TOTREG_R,DEM,REP,AIP,PAF,MSC,LIB,NLP,GRN...
25,06049,0604902001,g20,V,02001,96,22,41,1,0,1,2,0,0,0,29,40,56,3,0,2,0,0,1,0,0,0,0,0,0,0...
25,06049,0604904001,g20,V,04001,111,25,61,6,0,0,1,0,0,0,18,58,53,0,2,1,1,0,0,0,0,0,0,2,0,0...
25,06049,0604906001,g20,V,06001,365,101,184,16,2,0,2,0,4,0,56,166,199,5,10,2,1,5,4,3,1,0,0,..
..

How To Parse That?

How to parse this? No problem! (1) for line in f, (2) line.split(','). See the function read_ca() which parses the above, computing totals per county in a dict which is returned.
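
Here is a minimal sketch of that two-step parsing pattern. It is not the real read_ca(): the helper name total_by_key, the column positions, and the sample rows are all made up for illustration, and the real function also has to pick the right columns out of the very wide CSV lines.

```python
def total_by_key(lines, key_col, value_col):
    """Sum the int in value_col for each distinct key_col value.

    Hypothetical helper, not the real read_ca(). The first line is
    assumed to be a header and is skipped.
    """
    totals = {}
    for i, line in enumerate(lines):
        if i == 0:
            continue  # skip the header line
        parts = line.strip().split(',')
        key = parts[key_col]
        totals[key] = totals.get(key, 0) + int(parts[value_col])
    return totals


# Made-up rows for illustration (not from ca-2020-election.csv)
sample = [
    'COUNTY,PRECINCT,VOTES',
    'A,001,10',
    'A,002,5',
    'B,001,7',
]
result = total_by_key(sample, 0, 2)
print(result)
```

Since the function only loops over lines, it should work the same way when passed an open file, as in "for line in f".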

Run -ca

Run like this to see the dict computed by read_ca(), where the county name is the key, and the total votes in that county is the value.

$ python3 mystery-plot.py -ca ca-2020-election.csv
{'Modoc': 4349, 'Madera': 53889, 'Napa': 72700, 'Sonoma': 268569, 'Merced': 90482, 'Kern': 298938, 'San Joaquin': 281988, 'San Benito': 28724, 'Stanislaus': 211971, 'Ventura': 427164, 'Santa Cruz': 146024, ...
$

Basic Plot Code -ca1

Here is the code run with the -ca1 flag. It computes the votes dict as above, then makes some basic matplotlib calls to put the data on screen. This is a basic example of pulling in some data in Python, and then calling matplotlib to graph it.

def plot_ca1(filename):
    """Plot 3 counties - basic matplotlib"""
    votes = read_ca(filename)
    plt.figure(figsize=(8, 4))  # (width, height) in inches; on-screen size may vary
    xs = ['Santa Cruz', 'Santa Clara', 'San Mateo']
    ys = [votes['Santa Cruz'], votes['Santa Clara'],
          votes['San Mateo']]
    plt.bar(xs, ys, color='green')
    plt.title('Votes Per County')
    # plt.xlabel('County')  # more titling on the graph
    # plt.ylabel('Votes')
    plt.show()

votes - dict of county -> count
xs is list of county names, each is an "x" in this scheme
e.g. ['Santa Cruz', 'Santa Clara', 'San Mateo']
ys is list of "y" values, one for each x
e.g. [18000, 800000, 395000]
Look up each y value in votes dict
plt.bar() makes a bar chart out of the xs, ys data
Try plt.plot() for a line graph instead

Run -ca1

Running with -ca1 runs the above code to make this chart. This is the sort of code you will need later.

$ python3 mystery-plot.py -ca1 ca-2020-election.csv

alt: matplotlib output of 3 bars

Comprehension With -ca2

Here is the matplotlib code for the -ca2 option. This version plots 7 bay area counties. We don't want to have to manually type in the code to look up the value for each of the 7.

Instead, the code uses a comprehension to compute the ys list - look at the ys = .. line. For each county in xs, it looks up that county's value in the votes dict. This is short, and also avoids the sort of error you might make manually trying to keep the xs and ys lists in the same order.

def plot_ca2(filename):
    """Plot more ca counties, using comprehension"""
    votes = read_ca(filename)
    plt.figure(figsize=(8, 4))
    # Expand to 7 bay-area counties
    xs = ['Santa Clara', 'San Mateo',
          'Alameda', 'San Francisco',
          'Marin', 'Sonoma', 'Napa']
    # Instead of typing each county again,
    # comprehension pulls each county name
    # out of the xs variable - nice!
    ys = [votes[county] for county in xs]
    plt.bar(xs, ys, color='green')
    plt.title('Votes Per County')
    plt.show()

Run this to see the -ca2 chart:

$ python3 mystery-plot.py -ca2 ca-2020-election.csv

Explore - First Digits

We have this big data set
We'll use matplotlib to explore the data
Look at all vote totals
Look at first digit (leftmost) of each total: 1, 2, .. 9
Look at histogram of those digits
First guess: the occurrence of the digits should be kind of uniform -
1's 2's 3's .. occur about the same amount

Digit Plot Code -ca3

The function first_digits() takes in a list of numbers and returns a dict counting how often each first digit appears; the matplotlib code below uses that dict. The function output looks like this - here saying that the digit 4 appears 7 times, the digit 5 appears 3 times, and so on.

first_digits(nums) ->
  {4: 7, 5: 3, 7: 5, 2: 14, 9: 5, 1: 15, 8: 2, 6: 4, 3: 3}
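
One way first_digits() could be written is sketched below. This is only a guess at the approach; the version in mystery-plot.py may differ. It assumes the inputs are ints, and it skips zero and negative values since they have no leading digit 1..9. The sample numbers are the first few county totals from the -ca output above.

```python
def first_digits(nums):
    """Return dict mapping each leading digit to how often it appears.

    A sketch only; the version in mystery-plot.py may differ.
    Assumes ints; values <= 0 are skipped.
    """
    counts = {}
    for num in nums:
        if num <= 0:
            continue
        digit = int(str(num)[0])   # leftmost character as an int
        counts[digit] = counts.get(digit, 0) + 1
    return counts


digit_counts = first_digits([4349, 53889, 72700, 268569, 90482, 298938])
print(digit_counts)
```

This prints {4: 1, 5: 1, 7: 1, 2: 2, 9: 1}. The str(num)[0] trick grabs the leftmost digit without any math.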

Here is the code to draw 9 bars, for the digits 1, 2, 3, .. 9.

def plot_ca3(filename):
    """Plot first digits of all ca counties"""
    votes = read_ca(filename)
    nums = votes.values()
    counts = first_digits(nums)

    plt.figure(figsize=(8, 4))
    xs = ['1', '2', '3', '4', '5', '6', '7', '8', '9']
    plt.bar(xs, [counts[int(x)] for x in xs], color='green')
    plt.title('CA First Digits')
    plt.show()

Look At Output - Hmmmmmmmmmm

$ python3 mystery-plot.py -ca3 ca-2020-election.csv

That does not look random or uniform. What is going on here? Why are 1 and 2 so much more prevalent? Conspiracy to tamper with election data?

Look At Texas Data

Data file tx-2020-election.csv
Texas has a lot of counties, 254, so the pattern comes out stronger
Function read_texas(filename) returns a votes dict, just like CA
Look at its digits
The data file begins 'tx..' - use tab key

$ python3 mystery-plot.py -tx tx-2020-election.csv

hmmm again. This conspiracy is widespread!

Great Asimov Quote about Data

Quote: The most exciting phrase to hear in science, the one that heralds new discoveries, is not "Eureka!" (I found it!) but "That’s funny ..." - Isaac Asimov

Look at Worldwide Potato Production

Data file potato-production.csv
Worldwide production of potatoes since 1961, many countries, in tons/year
Have read_potato() function as before
Graph the digits of the values of this data

$ python3 mystery-plot.py -potato potato-production.csv

Hmm. Looks just like the Texas data. This conspiracy is truly everywhere!

What is going on? Benford's Law

This is not a fluke
This is a surprising but real feature of data
Benford's Law
Have a bunch of numbers counting something, where the numbers span a big range
You will get more numbers starting with 1 or 2, far fewer starting with 8 or 9
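
The usual statement of Benford's Law, which these notes do not spell out, predicts that the fraction of values with leading digit d is log10(1 + 1/d). A short sketch to print those predicted fractions (the dict name benford is just for this example):

```python
import math

# Benford's Law: predicted fraction of values with leading digit d
benford = {d: math.log10(1 + 1 / d) for d in range(1, 10)}

for d in range(1, 10):
    print(d, round(benford[d], 3))
```

The prediction is about 30% for the digit 1 and under 5% for the digit 9, which is the shape to compare against the bar charts.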

Explanation of Benford's

Here is my attempted story/explanation of Benford's
Say there is some thing, and counts of it span a big range
Think about the distribution of counts
Towards infinity .. there are fewer of the thing, approaching zero
e.g. counting ants per square meter in many different spots on earth
A thousand ants is common
A million ants might happen
But for some big enough number, like a trillion, it never happens
So (hand-wave) there is this downward slant towards infinity
The Wikipedia page talks about logarithms
Here we simply note that the larger values get more rare
Think about counts in these ranges: 100s, 200s, 300s, ... 900s
Because of the slant, get more in the 100s and 200s, fewer in the 800s and 900s
This story is not an air-tight proof!
But the data is super clear
Benford's law is true
Look at the potato example
I did not look through many agricultural counts to find one with Benford's
Every dataset I looked at had Benford's
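
You can also try this on made-up numbers. The sketch below is an idealized experiment, not one of the lecture data sets: it assumes values spread evenly across orders of magnitude (a random exponent between 0 and 6, so values run from 1 up to a million), which is one simple way to get numbers that "span a big range". The helper leading_digit() and the seed value are arbitrary choices for this example.

```python
import random


def leading_digit(x):
    """Return the leading digit of a number >= 1."""
    return int(str(int(x))[0])


random.seed(106)  # fixed seed so the run is repeatable
# Random exponent between 0 and 6 -> values from 1 up to a million
values = [10 ** random.uniform(0, 6) for _ in range(10000)]

counts = {}
for v in values:
    d = leading_digit(v)
    counts[d] = counts.get(d, 0) + 1

for d in range(1, 10):
    print(d, counts.get(d, 0))
```

Under this assumption the counts should fall off from 1 down to 9, the same slanted shape as the election and potato charts.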

Jupyter Notebook

traffic.zip Jupyter example

Jupyter notebooks
An extremely popular technology built on Python; all sorts of real-world data analysis uses Jupyter
Built on top of regular python code
Support an interactive notebook style
Produce a notebook - shows the steps, lets others build on it
Like spreadsheets, but with the full power of Python

This command will install both jupyter and matplotlib

$ python3 -m pip install jupyter matplotlib

Jupyter Startup

I'll run through a small "traffic" example here
List of Jupyter commands below
Create terminal in the "traffic" folder, then this command:

$ jupyter notebook

This should open a Jupyter tab in your browser - Jupyter works through the browser
In browser, could use "New" Python3 to create blank notebook
Here click on the "traffic.ipynb" for our lecture example

Traffic Example Steps

This example starts with delay data for hours 0-23 of every day for a year. Process it to make a graph showing average delay for each hour of the day. Some of the support code is in traffic.py, and we call its functions from within the notebook.
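
The core computation, average delay per hour, can be sketched in plain Python as below. This is not the code in traffic.py: the helper name average_by_hour, the (hour, delay) pair format, and the sample readings are assumptions made up for this illustration.

```python
def average_by_hour(readings):
    """Given (hour, delay) pairs, return dict of hour -> average delay.

    Hypothetical helper; the functions in traffic.py may look different.
    """
    sums = {}
    counts = {}
    for hour, delay in readings:
        sums[hour] = sums.get(hour, 0) + delay
        counts[hour] = counts.get(hour, 0) + 1
    return {hour: sums[hour] / counts[hour] for hour in sums}


# Made-up readings: (hour of day, delay in minutes)
readings = [(8, 30), (8, 20), (17, 40), (3, 2), (17, 50)]
averages = average_by_hour(readings)
hours = sorted(averages)
print(hours)
print([averages[h] for h in hours])
```

The last two lines produce xs and ys lists in matching order, the same shape of data that plt.plot() or plt.bar() takes.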

Hit shift-return in each Jupyter cell to run that code
See its output right there, play with the parameters
Click on "traffic.ipynb" in the Jupyter list to open the notebook
OR traffic-output.html is a non-interactive picture of the final notebook state, showing all the steps and output
This was created by exporting the notebook to HTML

Why Scientists Love Jupyter

With notebook form, you publish your analysis and output, along with the mechanism to create it - invites iteration and study.

Commands in Jupyter

A Jupyter notebook is interactive, showing the steps to follow with the output of each step.

Jupyter steps
1. import foo - import foo module, to call its foo.bar() functions, which are typical Python/CS106A functions that get the data organized
Within Jupyter, call the foo.py functions, experiment and graph the resulting data
2. Shift-return - runs the code in that cell
If the last item in the cell is a list or dict, prints out that data. A good way to confirm the step worked. Can say len(lst) or lst[:10] to avoid printing something huge.
Basically shift-return your way down, computing intermediate steps
3a. Kernel > Restart Clear - Erase all the outputs, so you can shift-return from the top again
3b. Kernel > Restart and Run - Erase all the outputs, then run everything from the top again
4. File > Save - save the notebook to its .ipynb file (the output may be there, or may be in the blank/reset state)
5. File > Shutdown - end the jupyter process, then you can close the tab

Recall: Functions in matplotlib

The code in the notebook makes simple matplotlib calls to insert some graphs.

# standard import line .. 'plt' is idiomatic here
import matplotlib.pyplot as plt

# 1. plot 1-d list of values. plot() uses lists
plt.plot([5, 13, 2, 7])
plt.show()


# 2. Provide both x-values and y-values lists to plot() - a common pattern
# specify color, titling
plt.title('Some Words Here')
# plot() pattern: plot( [ x-values ], [ y-values] )
plt.plot([1, 2, 3, 4], [5, 13, 2, 7], color='red')
plt.show()

Scientific Numerics: numpy

numpy.org

Widely used, free / open-source module for scientific math
Fast bulk math operations
You write in Python
The operation runs in another, fast language (C)
This is a great combination - expressiveness of Python, but with the performance of C
e.g. the sum() operation below is not running as Python code, but in a faster way on the C side

alt: array is created on Python side, a version of the array is on fast C side. sum operation is called for on the Python side, but runs on the fast C side.

$ python3 -m pip install numpy

This example creates a numpy array. Look at its dimensions, compute "sum" a couple ways. numpy supports common numerical patterns, including the vector/matrix operations of linear algebra.

>>> import numpy as np
>>> 
>>> 
>>> a = np.array([[1, 2, 3],     # Create 2-d array
...               [4, 5, 6]])
>>> 
>>> # len 1st dimension = 2, len of 2nd dim = 3
>>> a.shape
(2, 3)
>>> 
>>> a.sum()         # sum of whole thing
np.int64(21)
>>> a.sum(axis=0)   # vector of sums, taking out axis 0
array([5, 7, 9])
>>> a.sum(axis=1)   # vector of sums, taking out axis 1
array([ 6, 15])
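
To see what the axis parameter is doing, here is the same arithmetic spelled out in plain Python on a list of lists. This is only to show the meaning of the three sums; numpy does the work on the C side rather than with Python loops like these. The variable names are made up for this example.

```python
rows = [[1, 2, 3],
        [4, 5, 6]]

total = sum(sum(row) for row in rows)          # like a.sum()
col_sums = [sum(col) for col in zip(*rows)]    # like a.sum(axis=0)
row_sums = [sum(row) for row in rows]          # like a.sum(axis=1)

print(total, col_sums, row_sums)
```

This prints 21 [5, 7, 9] [6, 15], matching the numpy results above: axis=0 adds down each column, axis=1 adds across each row.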