Today: comprehensions, matplotlib intro
For more details, see the Python Guide Comprehensions section
# Say we have:
nums = [1, 2, 3, 4, 5, 6]
# We want:
[1, 4, 9, 16, 25, 36]
# 1-line comprehension solves this:
[n * n for n in nums]
>>> nums = [1, 2, 3, 4, 5, 6]
>>> [n + 10 for n in nums]
[11, 12, 13, 14, 15, 16]
>>>
>>> [abs(n - 3) for n in nums]
[2, 1, 0, 1, 2, 3]
>>>
>>> [str(n) + '!!' for n in nums]
['1!!', '2!!', '3!!', '4!!', '5!!', '6!!']
>>>
Given a list of n elements, a comprehension is a dense, 1-line way to compute a list of new elements from it. That happens to be a useful little pattern for our next topic, matplotlib. We'll talk about comprehensions more on Friday.
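As a point of comparison, here is a small sketch of the same squares computation written two ways: as an ordinary loop with append, and as a comprehension. The variable names squares_loop and squares_comp are made up for this illustration.

```python
nums = [1, 2, 3, 4, 5, 6]

# Loop form: start with an empty list, append one computed value per element.
squares_loop = []
for n in nums:
    squares_loop.append(n * n)

# Comprehension form: the same computation in one line.
squares_comp = [n * n for n in nums]

print(squares_loop)  # [1, 4, 9, 16, 25, 36]
print(squares_comp)  # [1, 4, 9, 16, 25, 36]
```

Both forms leave the original nums list unchanged and build a new list.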
Matplotlib is an extremely capable and popular Python module for producing visualizations of data. Install it with "-m pip" like this:
$ python3 -m pip install matplotlib
Matplotlib is a very popular library used by researchers and the media for generating visualizations. It has a dizzying number of features. We will just scratch the surface here, so you get a feel for what's there, and can become more expert with it after CS106A if you like.
For more matplotlib information, see matplotlib.org and here is a popular tutorial (with tons of ads!): Matplotlib Tutorial
For this lecture example, we'll just use the few matplotlib features shown below, and later CS106A matplotlib work will use these same features.
Download to get started
mystery-plot.zip which contains real data from the 2020 election. There's a lot of real data in here we can graph.
The file "mystery-plot.py" is complete, working matplotlib example code - we'll just play with it to experiment with matplotlib.
Data source: there's a nicely organized 2020 data set - graphics at the New York Times and other things you've seen are built from this data. It's also huge and complicated! The California and Texas data sets below are from this url.
The file ca-2020-election.csv looks like this (you can click on it from within PyCharm to see it). There are more than 70,000 lines of data in here, with data for each voting precinct in California.
COUNTY,FIPS,RGPREC_KEY,ELECTION,TYPE,RGPREC,TOTREG_R,DEM,REP,AIP,PAF,MSC,LIB,NLP,GRN...
25,06049,0604902001,g20,V,02001,96,22,41,1,0,1,2,0,0,0,29,40,56,3,0,2,0,0,1,0,0,0,0,0,0,0...
25,06049,0604904001,g20,V,04001,111,25,61,6,0,0,1,0,0,0,18,58,53,0,2,1,1,0,0,0,0,0,0,2,0,0...
25,06049,0604906001,g20,V,06001,365,101,184,16,2,0,2,0,4,0,56,166,199,5,10,2,1,5,4,3,1,0,0,..
..
read_ca()
How to parse this? No problem! Use: (1) for line in f, (2) line.split(',')
The function read_ca() parses the above text, computing a dict with one key for each county, and its value is the total votes in that county.
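To make the "for line / split" pattern concrete, here is a minimal sketch that sums one column per county. It is not the course's read_ca(): the helper name sum_by_county, its column-index parameters, and the tiny sample input are all made up for illustration, and the notes do not say which columns of the real file hold the county and vote values.

```python
def sum_by_county(lines, county_col, votes_col):
    """Hypothetical helper, NOT the course's read_ca().

    Given an iterable of comma-separated lines (header row first),
    return a dict mapping each county to the sum of its votes column.
    """
    totals = {}
    first = True
    for line in lines:
        if first:
            first = False  # skip the header row
            continue
        parts = line.strip().split(',')
        county = parts[county_col]
        votes = int(parts[votes_col])
        totals[county] = totals.get(county, 0) + votes
    return totals


# Made-up miniature input, NOT the real ca-2020-election.csv layout.
sample = [
    'COUNTY,PRECINCT,VOTES\n',
    'Napa,p1,100\n',
    'Napa,p2,250\n',
    'Modoc,p1,40\n',
]
print(sum_by_county(sample, 0, 2))  # {'Napa': 350, 'Modoc': 40}
```

Because the function only loops over lines, an open file object could be passed in place of the sample list.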
Run like this to see the votes dict computed by read_ca(). It turns out there are 58 counties in California, the most populous state in the US.
$ python3 mystery-plot.py -ca ca-2020-election.csv
{'Modoc': 4349, 'Madera': 53889, 'Napa': 72700, 'Sonoma': 268569, 'Merced': 90482, 'Kern': 298938, 'San Joaquin': 281988, 'San Benito': 28724, 'Stanislaus': 211971, 'Ventura': 427164, 'Santa Cruz': 146024, ...
$
Say we want to produce graphics like this - a bar chart for 3 counties, showing vote total per county.
At top of file, have the following import. This form of import sets up plt as a shorthand for matplotlib.pyplot. This is a standard, idiomatic shorthand used in most examples, so we're using it too.
import matplotlib.pyplot as plt
The function plot_ca1() builds the above plot with these steps.
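First make the blank rectangle, called a "figure" in matplotlib. The figsize values are (width, height), nominally in inches, but the size on screen depends on the display. Here one unit comes out to about 0.5 inch, so this figure is roughly 4 inches by 2 inches.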
plt.figure(figsize=(8, 4))
x_vals, y_vals
We have the votes dict, like this - key is county, value is votes in that county.
votes = {'Modoc': 4349, 'Madera': 53889, 'Napa': 72700, 'Sonoma': 268569, ... }
1. We want x_vals to be a list of the counties to graph (i.e. the values going across the x-direction). In this case, this is just a list of the 3 county names.
x_vals = ['Santa Cruz', 'Santa Clara', 'San Mateo']
2. We want y_vals to be a list of the y values corresponding to each x value (essentially the height for each x value). In this case, the y value is the number of votes for each county. We get each y value by looking up the number of votes in the votes dict.
y_vals = [votes['Santa Cruz'], votes['Santa Clara'], votes['San Mateo']]
# y_vals is:
# [146024, 857609, 377876]
With x_vals and y_vals ready, call plt.bar() to make the bar chart. Set the color we want, add some titling and show it.
plt.bar(x_vals, y_vals, color='green')
plt.title('Votes Per County')
plt.show()
plot_ca1()
Here is the code run with the -ca1 flag. It computes the votes dict as above, then makes some basic matplotlib calls to put the data on screen. You will need to be able to write code like this to make graphs.
def plot_ca1(filename):
    """Plot 3 counties - basic matplotlib"""
    votes = read_ca(filename)
    plt.figure(figsize=(8, 4))  # each unit is about 0.5 inch
    x_vals = ['Santa Cruz', 'Santa Clara', 'San Mateo']
    y_vals = [votes['Santa Cruz'], votes['Santa Clara'], votes['San Mateo']]
    # y_vals ends up as:
    # [146024, 857609, 377876]
    plt.bar(x_vals, y_vals, color='green')
    plt.title('Votes Per County')
    # plt.xlabel('County')  # could have more titling
    # plt.ylabel('Votes')
    plt.show()
You can run with -ca1 on the command line which runs the above code to make that chart. This is the sort of code you will need later. Use the tab key to autocomplete the "ca-20..." filename.
$ python3 mystery-plot.py -ca1 ca-2020-election.csv
I'm going to dig around in the data to make more graphs, but all the examples will work in this x_vals, y_vals pattern, with the other parts basically the same as the example above.
This version plots 7 Bay Area counties. Run it with the -ca2 flag on the same data file.
$ python3 mystery-plot.py -ca2 ca-2020-election.csv
Have x_vals, now with 7 counties...
x_vals = ['Santa Clara', 'San Mateo', 'Alameda', 'San Francisco', 'Marin', 'Sonoma', 'Napa']
Could do y_vals like this, for each county, retrieve its number from the votes dict.
y_vals = [votes['Santa Clara'], votes['San Mateo'], votes['Alameda'], votes['San Francisco'], votes['Marin'], votes['Sonoma'], votes['Napa']]
The above code is tedious. Is there a way to make the list of numbers without typing all the county names again? The x_vals list is already a list of all the county names. The neat way is to write a comprehension that, for each county name in x_vals, looks up its value in the votes dict.
y_vals = [votes[county] for county in x_vals]
# y_vals ends up as:
# [857609, 377876, 777781, 442345, 156801, 268569, 72700]
We don't want to have to manually type in the code to look up the value for each of the 7 counties.
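One payoff of deriving y_vals from x_vals is that changing x_vals alone keeps the two lists lined up. The sketch below reorders the counties from most to least votes and reuses the same comprehension. The vote totals are copied from the y_vals shown in these notes; the sorted() call with key=votes.get is an addition for illustration and is not part of mystery-plot.py.

```python
# Vote totals copied from the y_vals values shown in these notes.
votes = {
    'Santa Clara': 857609, 'San Mateo': 377876, 'Alameda': 777781,
    'San Francisco': 442345, 'Marin': 156801, 'Sonoma': 268569,
    'Napa': 72700,
}

x_vals = ['Santa Clara', 'San Mateo', 'Alameda', 'San Francisco',
          'Marin', 'Sonoma', 'Napa']

# Reorder the county names by vote total, largest first.
# key=votes.get tells sorted() to compare counties by their dict value.
x_vals = sorted(x_vals, key=votes.get, reverse=True)

# The comprehension is unchanged, and y_vals follows the new order.
y_vals = [votes[county] for county in x_vals]

print(x_vals)
print(y_vals)  # [857609, 777781, 442345, 377876, 268569, 156801, 72700]
```

Passing these two lists to plt.bar() would draw the bars in descending order.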
Here's the full plot_ca2() code with the comprehension.
def plot_ca2(filename):
    """Plot more ca counties, using comprehension"""
    votes = read_ca(filename)
    plt.figure(figsize=(8, 4))
    # Expand to 7 bay-area counties
    x_vals = ['Santa Clara', 'San Mateo', 'Alameda', 'San Francisco', 'Marin', 'Sonoma', 'Napa']
    # Instead of typing each county again,
    # comprehension pulls each county name
    # out of the x_vals list - nice!
    y_vals = [votes[county] for county in x_vals]
    # y_vals ends up as:
    # [857609, 377876, 777781, 442345, 156801, 268569, 72700]
    plt.bar(x_vals, y_vals, color='green')
    plt.title('Votes Per County')
    plt.show()
Modern best practice - look at the data to inform yourself about what's going on. Here we'll use matplotlib to look into a big data set and get insights out.
The function first_digits() takes in a list of numbers, computes a dict of how often each first digit appears, used in the matplotlib code below. The function output looks like this - here saying that the digit 4 appears 7 times, the digit 5 appears 3 times, and so on.
first_digits(nums) -> {4: 7, 5: 3, 7: 5, 2: 14, 9: 5, 1: 15, 8: 2, 6: 4, 3: 3}
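The notes do not show the body of first_digits(), so here is one way it might be written, as a sketch only. It assumes every number is a positive int, and the name first_digits_sketch is made up to keep it distinct from the real function in mystery-plot.py. The sample numbers are the first five vote totals from the votes dict shown earlier.

```python
def first_digits_sketch(nums):
    """Sketch only, not the course's first_digits().

    Count how often each leading digit appears in nums.
    Assumes every number is a positive int.
    """
    counts = {}
    for num in nums:
        digit = int(str(num)[0])  # first character of the decimal text
        if digit not in counts:
            counts[digit] = 0
        counts[digit] += 1
    return counts


print(first_digits_sketch([4349, 53889, 72700, 268569, 90482]))
# {4: 1, 5: 1, 7: 1, 2: 1, 9: 1}
```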
Here is the code to draw 9 bars, for the digits 1, 2, 3, .. 9.
def plot_ca3(filename):
    """Plot first digits of all ca counties"""
    votes = read_ca(filename)
    nums = votes.values()
    counts = first_digits(nums)
    plt.figure(figsize=(8, 4))
    x_vals = ['1', '2', '3', '4', '5', '6', '7', '8', '9']
    y_vals = [counts[int(x)] for x in x_vals]
    plt.bar(x_vals, y_vals, color='green')
    plt.title('CA First Digits')
    plt.show()
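One thing to watch with counts[int(x)]: if some digit never appears as a first digit in a smaller data set, the dict has no key for it and the lookup raises KeyError. A hedged alternative is dict .get() with a default of 0, sketched below. The counts values here are made up purely to show the missing-key case.

```python
# Made-up counts with several digits missing, for illustration only.
counts = {1: 3, 2: 2, 4: 1}

x_vals = ['1', '2', '3', '4', '5', '6', '7', '8', '9']

# .get(key, 0) returns 0 for a digit that never appeared,
# where counts[int(x)] would raise KeyError.
y_vals = [counts.get(int(x), 0) for x in x_vals]

print(y_vals)  # [3, 2, 0, 1, 0, 0, 0, 0, 0]
```

With the 58 California counties every digit 1 through 9 shows up, so the plain counts[int(x)] lookup works there.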
$ python3 mystery-plot.py -ca3 ca-2020-election.csv
That does not look random or uniform. What is going on here? Why are 1 and 2 so much more prevalent? Maybe just random noise? Maybe signs of a conspiracy tampering with election data?
$ python3 mystery-plot.py -tx tx-2020-election.csv
Hmmm again. This conspiracy is widespread!
Look at file potato-production.csv - 10,000 rows of potato production data in tons, like this
Afghanistan,AFG,1961,130000
Afghanistan,AFG,1962,115000
Afghanistan,AFG,1963,122000
Afghanistan,AFG,1964,129000
Afghanistan,AFG,1965,132000
...
Have read_potato() function to read in the text file as before. Graph the digits of the values of this data set.
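The body of read_potato() is not shown in these notes. As a rough sketch of the idea, the hypothetical helper below pulls the last column out of each line as an int and skips any line whose last field is not all digits. The name last_column_ints and the header line in the sample are assumptions; the three data rows are copied from the excerpt above.

```python
def last_column_ints(lines):
    """Hypothetical stand-in, NOT the course's read_potato().

    Return the last comma-separated field of each line as an int,
    skipping lines where that field is not all digits.
    """
    values = []
    for line in lines:
        parts = line.strip().split(',')
        last = parts[-1]
        if last.isdigit():  # skips blank lines, headers, missing values
            values.append(int(last))
    return values


sample = [
    'Entity,Code,Year,Tons\n',  # made-up header; the real file may differ
    'Afghanistan,AFG,1961,130000\n',
    'Afghanistan,AFG,1962,115000\n',
    'Afghanistan,AFG,1963,122000\n',
    '\n',
]
print(last_column_ints(sample))  # [130000, 115000, 122000]
```

The resulting list of numbers is the kind of input that first_digits() takes.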
$ python3 mystery-plot.py -potato potato-production.csv
Hmm. Looks just like the Texas data. This conspiracy is truly everywhere!
Quote: The most exciting phrase to hear in science, the one that heralds new discoveries, is not "Eureka!" (I found it!) but "That’s funny ..." - Isaac Asimov