Today: modules, urllib, Jupyter notebooks

Productivity Boost: Modules Of Code To Use

Organized modules of code for common problems. Organized, documented, and waiting for your code to use. We say that you build your code "on top of" the libraries.

alt: your code built on top of modules like sys

We ♥ Modules

Key question: starting a project
Is part of this solved in a module already?
Using module code you didn't have to write is very attractive
(This is kind of a no-brainer case to make!)
Somebody else wrote it, you can just use it
It's well tested
It has real docs
Your teammates may already be familiar with it
In CS106A we see this "module" theme a little
CS106A needs to cover fundamentals
loops, strs, dicts, files ..
Courses beyond CS106A, you will probably spend more time using modules

Module = Dependency

When you write code using a module
That code now depends on that module's existence
If that module disappeared, your code would stop working
Official, standard modules are the safest
They are officially part of Python, they are not going away

"Standard" Modules

Standard = included/maintained as part of Python3 install
Can rely on this module now and in the future
No separate install is required
Module is installed when python is installed

1. Many Standard Modules

Do not: memorize whole list of modules
Do: check the list for help when starting a project
Standard Python Modules List
A few examples...
math module of math functions, e.g. math.cos()
email module for creating and parsing email messages
random module for creating pseudo random numbers
os module for listing directories, creating files
datetime module of calendar functions
zipfile module for reading/creating .zip files
urllib module for making http requests, using data
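
To get a feel for the list above, here is a minimal sketch calling three of these standard modules. The specific values (the seed, the date) are made up for the example.

```python
import datetime
import math
import random

# math: cosine of 0 radians is 1.0
cos_zero = math.cos(0)

# random: seeding makes the pseudo-random sequence repeatable
random.seed(106)
roll = random.randint(1, 6)  # an int between 1 and 6 inclusive

# datetime: calendar arithmetic, here the day after Dec 31 2018
next_day = datetime.date(2018, 12, 31) + datetime.timedelta(days=1)

print(cos_zero, roll, next_day)
```

No pip install is needed for any of these, since they come with Python.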

2. Non-Standard "pip" Modules

Other modules are valuable but are not a standard part of Python. For code using a non-standard module to work, the module must be installed on that computer with the "pip" Python tool. For example, for the homeworks we had you pip-install the "Pillow" module with this command:

$ python3 -m pip install Pillow
..prints stuff...
Successfully installed Pillow-5.4.1

A non-standard module can be great, although the risk is harder to measure. The history thus far is that popular modules continue to be maintained. Sometimes the maintenance is picked up by a different group than the original module author.
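
One way to check whether a module is available on a machine before relying on it is sketched below, using the standard importlib module. The helper name is_installed() is made up for this example. Note that Pillow is imported under the name "PIL".

```python
import importlib.util

def is_installed(module_name):
    """Return True if Python can find a module with this import name."""
    return importlib.util.find_spec(module_name) is not None

# 'math' is standard, so it is always found.
print(is_installed('math'))

# Pillow is imported under the name 'PIL'. This prints True only
# if Pillow has been pip-installed on this machine.
print(is_installed('PIL'))
```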

Module Docs

Every module has formal "documentation" - "docs"
Explain what functions do
The "abstraction" of each function
-what it does
-how to call it
Demo web search: "python math module"
Gets python.org math docs

Recall: Import

Every module has a short name: "sys", "math", "urllib"
-Typically just a word, no dashes, dots, underbars
Look up the "math" module docs
Simplest import form, by name:
```
import math
```

Refer to elements within module as module.elem

    import math   # once at top of file

    ...

    val = math.sqrt(x)

Readability - can see where things came from, looking at a line
There are other forms of import, but this is the simplest
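
For reference, a small sketch of the plain form next to two other common import forms. All three reach the same math.sqrt() function. The plain form keeps the "where did this come from" readability.

```python
import math                 # plain form: names are written math.sqrt
import math as m            # alias form: same module, shorter name
from math import sqrt       # from form: sqrt is used with no prefix

a = math.sqrt(16)
b = m.sqrt(16)
c = sqrt(16)
print(a, b, c)
```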

How Does The Web Work?

alt: web client makes request to server, gets back HTML response, browser draws the HTML

Browser has url ("client" side)
Web server is running on some machine ("server" side)
Browser sends GET request to server
Server gets request, sends back HTML response data
HTML is a text code
Browser "renders" the HTML on screen
Common request/response data types: HTML JPEG PNG GIF SVG
Video formats MPEG4 h.265 (crazy patent problems)
New open format "AV1" .. you heard it here first!
Demo:
Visit python.org or sfgate.com or whatever
Right-click on page, "View Source" to see the HTML text that makes a web page
Think of all the surfing you have done .. HTML code defines each page

HTML

HTML is plain text that looks like this, tags like <b> mark up the text.

This <b>bolded</b> text
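
To see that tags are just structure within plain text, here is a minimal sketch using the standard html.parser module to pull out the text inside the <b> tags. The class name BoldFinder is made up for this example, and it is not how a browser is built.

```python
from html.parser import HTMLParser

class BoldFinder(HTMLParser):
    """Collect the text that appears inside <b>...</b> tags."""

    def __init__(self):
        super().__init__()
        self.in_bold = False
        self.found = []

    def handle_starttag(self, tag, attrs):
        if tag == 'b':
            self.in_bold = True

    def handle_endtag(self, tag):
        if tag == 'b':
            self.in_bold = False

    def handle_data(self, data):
        # Called for each run of plain text between tags
        if self.in_bold:
            self.found.append(data)

finder = BoldFinder()
finder.feed('This <b>bolded</b> text')
finder.close()
print(finder.found)
```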

HTML Experiment - View Source

Go to python.org. Try view-source command on this page (right click on page).

urllib 1

CS106A theme: look at some modules you can build upon
Python code to request an HTML page by url
urllib - making requests to a server, getting back data
Another example of using a standard module
docs - urllib.request docs on python.org
Makes a URL look like a local file mostly

>>> import urllib.request
>>> f = urllib.request.urlopen('http://www.python.org/')
>>> text = f.read().decode('utf-8')

f.read() works once, returning bytes
decode('utf-8') decode raw bytes -> unicode string
Does not always work - some sites block requests from Python on purpose
f.read() - all the bytes
f.read(300) - just first 300 bytes
Can try http: or https: for these
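
The points above can be wrapped into one helper, sketched below. The function name fetch_text() and the browser-like User-Agent value are assumptions for this example. A different User-Agent sometimes gets past a site that blocks Python, but that is not guaranteed.

```python
import urllib.error
import urllib.request

def fetch_text(url, max_bytes=None):
    """
    Return the data at url decoded as a utf-8 string,
    or None if the request fails.
    """
    # Some servers refuse urllib's default User-Agent. Sending a
    # browser-like one sometimes helps (assumption: not guaranteed).
    req = urllib.request.Request(url, headers={'User-Agent': 'Mozilla/5.0'})
    try:
        with urllib.request.urlopen(req) as f:
            # read() gives all the bytes, read(n) just the first n
            data = f.read() if max_bytes is None else f.read(max_bytes)
    except urllib.error.URLError as err:
        print('request failed:', err)
        return None
    return data.decode('utf-8')

# Typical use, e.g. the first 300 bytes of a page:
# text = fetch_text('https://www.python.org/', 300)
```

If max_bytes cuts a multi-byte character in half, decode('utf-8') raises an error, so reading everything is the safer default.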

>>> import urllib.request
>>> f = urllib.request.urlopen('http://www.python.org/')
>>> text = f.read().decode('utf-8')
>>> # text is the HTML
>>> # use text.find('xxx') to look for something, show that slice
>>> # like text[5000:5200]
>>>
>>> f = urllib.request.urlopen('https://sfgate.com/')
>>> text = f.read().decode('utf-8')

# without >>>
import urllib.request
f = urllib.request.urlopen('http://www.python.org/')
text = f.read().decode('utf-8')

Web hello.txt Example / Exercise

You can try this one or follow along
Theme: Read data from a url instead of a file
Function: str.splitlines() -> a list of lines
Look at this url in browser
http://web.stanford.edu/class/cs106a/hello.txt
Now try that from python

>>> import urllib.request
>>> f = urllib.request.urlopen('http://web.stanford.edu/class/cs106a/hello.txt')
>>> text = f.read().decode('utf-8')
>>> text
'Hello from CS106A url!\nWhat if there were data here?\n12,34\n25,19\n66,0\n1,2\n'
>>> lines = text.splitlines()
>>> lines
['Hello from CS106A url!', 'What if there were data here?', '12,34', '25,19', '66,0', '1,2']
>>> lines = lines[2:]  # one way to get rid of first 2 lines
>>> lines
['12,34', '25,19', '66,0', '1,2']
>>> lines[0].split(',')
['12', '34']
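
Continuing from the session above, a minimal sketch that finishes the parse into a list of int pairs. The text is pasted in as a string literal here so the example runs without a network connection.

```python
# The text as returned by f.read().decode('utf-8') in the session above.
text = ('Hello from CS106A url!\nWhat if there were data here?\n'
        '12,34\n25,19\n66,0\n1,2\n')

lines = text.splitlines()[2:]   # skip the 2 non-data lines

pairs = []
for line in lines:
    parts = line.split(',')     # e.g. ['12', '34']
    pairs.append((int(parts[0]), int(parts[1])))

print(pairs)
```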

Data From the Web vs. Files

Very often data is drawn from a file into your program .. parsed into a list or dict ..
New picture: data is drawn from a URL
The parsing into a list/dict is the same as ever

Traffic Example - Jupyter

traffic.zip Jupyter example

This is a real data set about a particular area (thanks to Jason Lin for getting this data). It measures the total seconds spent in traffic, grouped by hour of the day, with data for every day over a calendar year.

Parsing: we are thinking "for line in f" and "line.split(',')" here. The data looks like this:

day_of_year,hour_of_day,time_in_secs
01-01-18,0,2549
01-01-18,1,2751
01-01-18,2,2248
01-01-18,3,2440
01-01-18,4,2666
01-01-18,5,2084
01-01-18,6,2302
01-01-18,7,3410
01-01-18,8,3229
01-01-18,9,2367
01-01-18,10,2217
01-01-18,11,2082
01-01-18,12,2055
01-01-18,13,2842
01-01-18,14,2206
01-01-18,15,2178
01-01-18,16,2974
01-01-18,17,2444
01-01-18,18,2965
01-01-18,19,2714
01-01-18,20,2783
01-01-18,21,1951
01-01-18,22,2234
01-01-18,23,2263
01-02-18,0,2311
01-02-18,1,2732
01-02-18,2,2165
01-02-18,3,2377
01-02-18,4,2836
01-02-18,5,2841
01-02-18,6,2719
01-02-18,7,3671
01-02-18,8,3417
01-02-18,9,2578
01-02-18,10,2301
01-02-18,11,2357
01-02-18,12,2165
01-02-18,13,1958
...

traffic.py

traffic.py - Open in PyCharm to see code
traffic.py developed normally, tested
Code to read this data into a dict
This is CS106A bread-and-butter type code
Add all days together, key is hour: 0..23

{0: 922633,
 1: 870481,
 2: 814756,
 3: 850357,
...
 22: 844618,
 23: 902191}

read_times()

def read_times(filename):
    """
    Read filename into a "times" dict.
    """
    with open(filename) as f:
        lines = f.readlines()
    lines.pop(0)  # first line is not data
    times = {}
    for hour in range(24):
        times[hour] = 0

    for line in lines:
        splits = line.split(',')
        hour = int(splits[1])
        time = int(splits[2])
        times[hour] += time
    return times
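
To try the accumulation logic without the data file, here is a sketch of the same loop working on a list of lines in memory. The function name sum_times() is made up for this example, and the four rows are copied from the data shown earlier.

```python
def sum_times(lines):
    """
    Given data lines like '01-01-18,0,2549' (header already removed),
    return a dict mapping hour 0..23 to total seconds.
    """
    times = {hour: 0 for hour in range(24)}
    for line in lines:
        splits = line.strip().split(',')
        times[int(splits[1])] += int(splits[2])
    return times

# Four rows from the data: hours 0 and 1 on two different days.
sample = [
    '01-01-18,0,2549\n',
    '01-01-18,1,2751\n',
    '01-02-18,0,2311\n',
    '01-02-18,1,2732\n',
]
times = sum_times(sample)
print(times[0], times[1], times[2])
```

Hours with no rows stay at 0, since every key 0..23 is set up before the loop.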

Jupyter Notebook

Jupyter notebooks jupyter.org
An extremely popular tech built on Python
Jupyter is for calling your code, analyzing data
PyCharm is for building your code
Much real-world data analysis is done in Jupyter
Built on top of regular python code
Support an interactive notebook style
Produce notebook - shows steps, lets others build
Like spreadsheets, but with the full power of Python
Install modules on your machine: "jupyter" and "matplotlib"
Not a problem if pip install mentions that "pip" could be upgraded, can ignore that

$ python3 -m pip install jupyter
$ python3 -m pip install matplotlib

Start Jupyter

I'll run through the small Jupyter traffic example here
We'll do a bigger example in section
HW8 will be a small Jupyter project
List of Jupyter commands below
Download the traffic.zip to get the "traffic" folder
Get a command line prompt in the "traffic" folder, then run this command to start
$ jupyter notebook
This should open a Jupyter tab in your browser - Jupyter works through the browser
For this example, click on the "traffic.ipynb" for our lecture example
To create a new doc, use "New" then "Python 3" to create a blank notebook

Traffic Example Setup

Jupyter notebook is a vertical series of "cells", each with some computation
Hit shift-return in a Jupyter cell to run it
See cell's output right there, play with the parameters
Typically run cells from top to bottom
So later cells use output of earlier cells
Click on "traffic.ipynb" in the Jupyter list to open the notebook
OR traffic-output.html is a non-interactive picture of the final notebook state, showing all the steps and output

Traffic Example - First Half

Here is the code from the traffic notebook. Looks best in the notebook, so you can see each cell and its output.

Shift-return to run each cell, top to bottom
If the last line is an expression like times
Its value gets printed after the cell
A way to confirm it worked, and think about next steps
With a list, could use lst[:10] or len(lst) to avoid too much output there (times is a dict, so a slice like times[:10] does not work on it)
Importing "traffic.py", calling its functions to work on data here

%matplotlib inline
# Use the traffic.read_times() function to read in "times" dict.
# Times dict has keys 0..23 for hours, values are wait-times in seconds
# for traffic at that hour over a whole year.
import traffic
times = traffic.read_times('commute-times.txt')
times

# Use traffic.div_times() to scale it down to seconds for one day.
times = traffic.div_times(times, 365)
times
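
The code of traffic.div_times() is not shown in these notes. Below is a hypothetical sketch of what such a function might look like, assuming it divides every value by n and returns a new dict. The real function may differ.

```python
def div_times(times, n):
    """
    Return a new dict with the same keys as times,
    each value divided by n.
    (Hypothetical sketch; the real traffic.div_times() may differ.)
    """
    result = {}
    for key in times:
        result[key] = times[key] / n
    return result

# First two yearly totals from the dict shown earlier
yearly = {0: 922633, 1: 870481}
daily = div_times(yearly, 365)
print(daily)
```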

Commands in Jupyter

Jupyter is ..
a. Interactive
b. Produces a notebook you can share, showing your process
Jupyter steps
1. import foo - import foo module, to call its foo.bar() functions, which are typical Python/CS106A functions that get the data organized
Within Jupyter, call the foo.py functions, experiment and graph the resulting data
2. Shift-return - runs the code in that cell
If a list or dict as last item, prints out that data. A good way to confirm the step worked
Can say len(lst) or lst[:10] to avoid printing something huge.
Keep hitting shift-return on a cell until it's doing what you want
3a. Kernel > Restart Clear - Erase all the outputs, so you can shift-return from the top again
3b. Kernel > Restart and Run - Erase all the outputs, then run everything from the top again
4. File > Save and Checkpoint - save the notebook to its .ipynb file
5. File > Close and Halt - save notebook, close tab and return to file-list
6. Quit, button at top of file-list, quit the whole thing (or can ctrl-c in terminal that started it)

The Universe of Matplotlib

Matplotlib is an extremely capable and popular Python module for producing visualizations of data. Install it with "pip" as above. Matplotlib is so widely used that it is very unlikely to stop being supported. Matplotlib has a dizzying number of features. We will just scratch the surface here, so you get a feel for what's there.

For this lecture example, we'll just use the few matplotlib features shown below. If you would like more information about matplotlib in the future, see matplotlib.org

Functions in matplotlib

matplotlib has many, many features
We'll just use a few
Here's an outline

# standard import line .. 'plt' is idiomatic here
import matplotlib.pyplot as plt

# 1. plot 1-d list of values. plot() uses lists
plt.plot([5, 13, 2, 7])
plt.show()


# 2. Provide both x-values and y-values lists to plot() - a common pattern
# specify color, titling
plt.title('Some Words Here')
# plot() pattern: plot( [ x-values-list ], [ y-values-list ] )
plt.plot([1, 2, 3, 4], [5, 13, 2, 7], color='red')
plt.show()

Have: times dict, want: y values list

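Have "times" dict - look at it. How to make a list of the y values in order by hour? Just times.values() is not reliable for this: it returns the values in the order the keys were inserted into the dict, which is not guaranteed to be hour order.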

{0: 2527.7616438356163,
 1: 2384.8794520547945,
 2: 2232.2082191780823,
 3: 2329.745205479452,
...
 20: 2421.4301369863015,
 21: 2316.26301369863,
 22: 2314.0219178082193,
 23: 2471.7561643835616}
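
A tiny sketch of the ordering issue, using a made-up 3-key dict whose keys were inserted out of order. Going through the keys in sorted order gives the values in hour order.

```python
# A small dict whose keys were inserted out of order (made-up values).
times = {2: 30.0, 0: 10.0, 1: 20.0}

# values() follows insertion order, which here is not hour order.
print(list(times.values()))

# Looking up each hour in sorted key order gives hour order.
waits = [times[hour] for hour in sorted(times)]
print(waits)
```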

Traffic Plot Code

...
# Plotting works on *lists* - make a len-24 "waits" list
# from the times dict.
waits = [times[key] for key in range(24)]
waits

# This is the canonical line to import matplotlib using the name "plt"
import matplotlib.pyplot as plt

# 1. Simplest: plot a 1-d list of values, no x-values specified
plt.plot(waits)
plt.show()

# 2. More typical, provide [ x-values ] and [ y-values ] lists
# Here we narrow the data to the hours 6..20.
# (Comprehension is not strictly needed here, but a common pattern)
plt.plot([i for i in range(6, 21)], waits[6:21])
plt.show()
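
The x-values and y-values lists passed to plot() must be the same length, so it helps to build both from the same start and end numbers. A minimal sketch, with the helper name hour_slice() and the stand-in data made up for this example:

```python
def hour_slice(waits, start, end):
    """
    Given the len-24 waits list, return (xs, ys) lists covering
    hours start..end-1, suitable for plt.plot(xs, ys).
    """
    xs = list(range(start, end))
    ys = waits[start:end]
    return xs, ys

# Made-up stand-in data: the wait for hour h is just h * 100
waits = [h * 100 for h in range(24)]
xs, ys = hour_slice(waits, 6, 21)
print(xs[0], xs[-1], len(xs), len(ys))
```

This part needs no matplotlib, so it can be tested on its own before plotting.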

Fancier Plotting

Plotting with more flair!

# 3. Like above, but set size, add labels, can play with line color and width
plt.figure(figsize=(10, 5))  # figsize is (width, height), nominally in inches
plt.title('Commute times')
plt.xlabel('Hour of day')
plt.ylabel('Commute time in seconds')
plt.plot([i for i in range(6,21)], waits[6:21], color='purple', linewidth=9)
plt.show()