L20

Today: recall nested, recall dict-count, more sophisticated nested-dict examples, other ways to read a file

Recall: Reference to Nested List

(Example from lecture-18.) Say outer is a list of lists shown below. The expression outer[2] is a reference to the nested [5] list.

>>> outer = [[1, 2], [3, 4], [5]]
>>> nums = outer[2]
>>> 
>>> nums
[5]

The line nums = outer[2] sets the nums variable to point to the nested list. We will very often set a variable in this way, pointing to the nested structure, before doing operations on it.

alt: outer points to list, nums points to nested [5] list

How To Append `6` on the `[5]`?

The variable nums is pointing to the nested list [5]. Call nums.append(6) to append, changing the nested list. Looking at the original outer list, we see its contents are changed too. This shows that the nums variable really was pointing to the list nested inside of the outer list.

>>> nums.append(6)
>>> nums
[5, 6]
>>> outer
[[1, 2], [3, 4], [5, 6]]
>>>
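A minimal sketch to double-check the point above, assuming we use Python's `is` operator (not otherwise used in these notes) to ask whether two expressions refer to the very same list object. Assignment with = stores a reference; it does not make a copy.

```python
outer = [[1, 2], [3, 4], [5]]
nums = outer[2]                  # nums points to the nested list, no copy made

same_object = nums is outer[2]   # True: both refer to one list in memory
nums.append(6)                   # a change made through nums ...

print(same_object)               # True
print(outer)                     # ... shows up in outer: [[1, 2], [3, 4], [5, 6]]
```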

Recall: Dict-Count Algorithm

The dict-count algorithm is very important, so let's review the steps.

Say we are building a counts dict, counting how many times each string appears in the strs list:

strs = ['a', 'b', 'a', 'c', 'b']

Ultimately we want to build this counts dict:

counts == {'a': 2, 'b': 2, 'c': 1}

Dict-Count - Init and Increment

Here's the standard dict-count code. Look at each value, build up the dict. The central question for each value is: not seen before? The first time we see a value, initialize the dict with it. We'll call this the "init" step. Equivalently we could think of the question as "is this the first time seeing this value?"

counts = {}
for s in strs:
    # 1. not seen before - "init"
    if s not in counts:
        counts[s] = 0

    # 2. unified - "increment"
    counts[s] += 1

The above is the unified version, where the counts[s] += 1 step is done for every s. We'll call that the "increment" step.
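As a quick check, the init/increment loop can be wrapped in a function and run on the strs list from earlier. The function name count_strs is made up for this sketch and is not part of the lecture code.

```python
def count_strs(strs):
    """Return a counts dict: each distinct string -> how many times it appears."""
    counts = {}
    for s in strs:
        if s not in counts:   # 1. not seen before - "init"
            counts[s] = 0
        counts[s] += 1        # 2. unified - "increment"
    return counts


strs = ['a', 'b', 'a', 'c', 'b']
counts = count_strs(strs)
print(counts)   # {'a': 2, 'b': 2, 'c': 1}
```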

(optional) Our Friend Modulo `%`

Recall that modulo % is the remainder after int division.

57 % 10 -> 7
123 % 10 -> 3
19 % 10 -> 9
10 % 10 -> 0
98 % 10 -> 8
99 % 10 -> 9
100 % 10 -> 0

The % 10 of a non-negative int is just its last digit.

Mathematics angle: The numbers represented by the digits to the left of the rightmost digit all include 10 as a factor. Computing % 10 is just what's left after all the multiples of 10 are taken away.
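The examples above can be checked with a short loop. This sketch also shows why the rule is stated for non-negative ints: in Python, % with a negative left side does not give the last digit.

```python
nums = [57, 123, 19, 10, 98, 99, 100]

last_digits = []
for num in nums:
    last_digits.append(num % 10)   # remainder after dividing by 10

print(last_digits)   # [7, 3, 9, 0, 8, 9, 0]

# Negative numbers behave differently, so the "last digit" rule
# only applies to non-negative ints.
negative_result = -57 % 10
print(negative_result)   # 3, not 7
```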

(optional) Digit Count - Exercise

> digit_count()

Apply the dict-count algorithm to count how many numbers end with each digit.

digit_count(nums): Given a list of non-negative ints. The last digit of each num can be found by computing num % 10. For example 57 % 10 is 7, and 7 is the last digit of 57. Build and return a counts dictionary where each key is an int digit, and its value is the count of one or more numbers in the list ending with that digit.

Digit Count - Solution

def digit_count(nums):
    counts = {}
    for num in nums:
        digit = num % 10
        if digit not in counts:
            counts[digit] = 0
        counts[digit] += 1
    return counts
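Here is a small usage sketch of the solution, run on the same numbers as the modulo examples. The input list is chosen just for illustration.

```python
def digit_count(nums):
    counts = {}
    for num in nums:
        digit = num % 10           # last digit of a non-negative int
        if digit not in counts:    # init
            counts[digit] = 0
        counts[digit] += 1         # increment
    return counts


result = digit_count([57, 123, 19, 10, 98, 99, 100])
print(result)   # {7: 1, 3: 1, 9: 2, 0: 2, 8: 1}
```

Two numbers end in 9 (19, 99) and two end in 0 (10, 100), so those keys have the count 2.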

Advanced Dict - Nested / Inner

Now we'll work more sophisticated problems, where we nest a list or dict inside of dict.

More sophisticated dict algorithms
Old: dict value is int counter 0, 1, 2, 3
Now: dict value is a nested structure
a list, dict stored as a value inside a dict
We'll use the words "inner" or "nested" for the list or dict stored inside
Two nested examples below: emails and birthdays

Email - Parse User and Host

# Have email strings
-'abby@foo.com'
-'bob@bar.com'

# One @
-"user" is left of @ -> 'abby'
-"host" is right of @ -> 'foo.com'

Email Hosts Challenge

This is a tricky problem. We'll go step by step in lecture, you can follow along. Then we'll work a similar problem in section.

High level: we have a big list of email addresses. We want to organize the data by host (read: use host as key). For each host, build up a list of all the users for that host.

Given a list of email address strings. Each email address has one '@' in it, e.g. 'abby@foo.com', where 'abby' is the user, and 'foo.com' is the host.

Create a nested dict with a key for each host, and the value for that key is a list of all the users for that host, in the order they appear in the original list (repetition allowed).

Here is the input and output. Essentially going through the data, organizing it by host.

input emails:
  ['abby@foo.com', 'bob@bar.com', 'abe@foo.com']

output hosts dict:
  {
   'foo.com': ['abby', 'abe'],
   'bar.com': ['bob']
  }

Type Commitments: key and value

When working a nested problem, it's good to keep in mind the type of the key and value, as it's easy to get confused about these. Here we'll write down the key and value type and refer to these later in the coding.

Here are the two types we have for the hosts dict. Write these on the board, for reference later when we get to the code. A commitment.

1. hosts key = string `'foo.com'`

Each key in the hosts dict is a host string, e.g. 'foo.com'

2. hosts value = nested list of users

The value for each key is an inner list of users for that host, e.g. ['abby', 'abe']

Email Hosts Example

> email_hosts() - nested dict problem

Here is the code to start with. We need code to add each 'abby@foo.com' into the hosts structure.

def email_hosts(emails):
    hosts = {}
    for email in emails:
        at = email.find('@')
        user = email[:at]
        host = email[at + 1:]
        # your code here
        pass
    return hosts

We have host and user...

1. Not seen host before? Init

With a key, the first question is: not seen before? If not seen before, create an initial value in the dict for that key - aka "init".

What is the type of each value? A list. So the init value will be a list. For counting the init value was 0. Now the init value is [].

        # init ([])
        if host not in hosts:
            hosts[host] = []

2. Set Variable for Nested `hosts[host]`

The outer hosts dict looks like this:

  {
   'foo.com': ['abby', 'abe'],
   'bar.com': ['bob']
  }

Now we want to edit the list for this host. Say host is 'foo.com'. The reference to that nested list is hosts[host].

The expression hosts[host] is hard to read. Let's set a variable to point to the nested list. In this case, it's the list of users. Use the variable name users

        users = hosts[host]  # var -> nested

3. Add User to List - Increment

We want to append this user to the nested list of users. The variable users points to that list, so we just do an append on it.

        # increment (.append)
        users.append(user)

Email Hosts Solution Code

def email_hosts(emails):
    hosts = {}
    for email in emails:
        at = email.find('@')
        user = email[:at]
        host = email[at + 1:]

        # key algorithm: init/increment
        if host not in hosts:
            hosts[host] = []
        users = hosts[host]  # var -> nested
        users.append(user)
    return hosts
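To see the solution at work, this sketch runs it on the three example emails from the problem statement. The function is repeated so the block runs on its own.

```python
def email_hosts(emails):
    hosts = {}
    for email in emails:
        at = email.find('@')
        user = email[:at]          # text left of the '@'
        host = email[at + 1:]      # text right of the '@'

        if host not in hosts:      # init: value type is list
            hosts[host] = []
        users = hosts[host]        # var -> nested list
        users.append(user)         # "increment" for a list
    return hosts


emails = ['abby@foo.com', 'bob@bar.com', 'abe@foo.com']
hosts = email_hosts(emails)
print(hosts)   # {'foo.com': ['abby', 'abe'], 'bar.com': ['bob']}
```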

Drawing of the Email Hosts Sequence

alt:what hosts memory looks like, adding one user

Here's another example using nested-lists, using a handy technique to parse the data.

Handy Parsing Technique `split(':')`

Nice technique, say we have a string like rating = 'donut:10'

Use split(':') to separate the string into parts (we mentioned split() before, but this is maybe our first real example with it):

rating = 'donut:10'
parts = rating.split(':')
# parts is now ['donut', '10']
# parts[0] -> 'donut'
# parts[1] -> '10'

(optional) food_ratings()

Say we have a bunch of ratings about foods, and we want to organize them per food. Each input rating is a string combining the food name and its numeric rating like this 'donut:10'. So the list of ratings looks like this:

['donut:10', 'apple:8', 'donut:9', 'apple:6', 'donut:7']

We process all the ratings to load up a dict with a key for each distinct food, and its value is a list of all that food's ratings, like this:

{
   'donut': [10, 9, 7],
   'apple': [8, 6]
}

> food_ratings() - nested dict problem

food_ratings(ratings): Given a list of food survey rating strings like this 'donut:10'. Create and return a "foods" dict that has a key for each food, and its value is a list of all its rating numbers in int form. Use split() to divide each food rating into its parts. There's a lot of Python packed into this question!

Build dict with structure:

Key = one food string

Value = list of rating ints
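One possible solution sketch, assuming the same init / set-variable / append pattern as email_hosts(). The variable names food, score, and scores are chosen for this sketch, and int() converts the rating text like '10' to an int. Try the problem yourself before reading it.

```python
def food_ratings(ratings):
    foods = {}
    for rating in ratings:
        parts = rating.split(':')   # 'donut:10' -> ['donut', '10']
        food = parts[0]
        score = int(parts[1])       # '10' -> 10

        if food not in foods:       # init: value type is list
            foods[food] = []
        scores = foods[food]        # var -> nested list
        scores.append(score)
    return foods


ratings = ['donut:10', 'apple:8', 'donut:9', 'apple:6', 'donut:7']
foods = food_ratings(ratings)
print(foods)   # {'donut': [10, 9, 7], 'apple': [8, 6]}
```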

Preamble: Birthdays `split('-')`

The birthdays problem below has dates like 'dec-31-2002'

We'll use split('-') to extract the parts from this string, like this:

>>> date = 'dec-31-2002'
>>> 
>>> parts = date.split('-')
>>> parts
['dec', '31', '2002']
>>>
>>> parts[0]
'dec'
>>> parts[2]
'2002'

Birthdays Example

Here is a more complex nested-dict example to work in class.

> birthdays()

Say we have birthdays of Stanford students. Want to know - has the distribution of months changed over the years? Like maybe Jan used to be most common, but now it's Feb? (Malcolm Gladwell examined the effect of birth-date on student performance in his book Outliers, and recently did a podcast episode on it if you are curious.)

Say as input we have a list of birthday dates. Output will be a years dict with a key for each year. The value for each year will be a count dict of that year's months.

dates = ['jan-31-2002', 'jan-20-2002', 'dec-10-2001']

years = {

    '2002': {'jan': 2},
         
    '2001': {'dec': 1}
}

Types

To help later, we'll note down the key/value types for this nested structure.

1. Key of years dict is string year, e.g. '2002'

2. Value of years dict is a nested count dict. Its key is a month string, e.g. 'dec', and its value is the int count of how many times that month appears in that year's data.

1. Year Not Seen Before - init

What is the key? The year.

What is the value for each year? A count dict. So the init if not seen before is the empty dict.

        # Year not seen before - init
        if year not in years:
            years[year] = {}

2. Set var "counts"

Set a "counts" var pointing to the nested counts dict. We'll use the variable name "counts" here, since it's just a counts dict, using the standard counts-dict steps.

        # Set var -> nested
        counts = years[year]

3. Increment Counts dict

Do increment step on the counts dict. This amounts to the standard 3 lines to add a data point to a counts dict:

1. Month not seen before: init = 0

2. This month += 1

        # Standard init/+= counts steps
        if month not in counts:
            counts[month] = 0
        counts[month] += 1

Birthday years solution

def birthdays(dates):
    years = {}
    for date in dates:
        parts = date.split('-')
        month = parts[0]
        year = parts[2]

        # Year not seen before - init
        if year not in years:
            years[year] = {}

        # Set var -> nested
        counts = years[year]

        # Standard init/+= counts steps
        if month not in counts:
            counts[month] = 0
        counts[month] += 1
    return years
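A usage sketch running the solution on the example dates from above. The function is repeated so the block runs on its own.

```python
def birthdays(dates):
    years = {}
    for date in dates:
        parts = date.split('-')     # 'jan-31-2002' -> ['jan', '31', '2002']
        month = parts[0]
        year = parts[2]

        if year not in years:       # outer init: value type is dict
            years[year] = {}
        counts = years[year]        # var -> nested counts dict

        if month not in counts:     # inner init
            counts[month] = 0
        counts[month] += 1          # inner increment
    return years


dates = ['jan-31-2002', 'jan-20-2002', 'dec-10-2001']
years = birthdays(dates)
print(years)   # {'2002': {'jan': 2}, '2001': {'dec': 1}}
```

Note that the keys are strings like '2002', not ints, since split() returns strings and the code does not convert them.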

File Reading 2.0

See the Python Guide about file operations for more details: File Reading and Writing

We've done it this way: for line in f:
Advantage: low memory use even with huge file
Look at a few different reading forms
"with" form - handles closing automatically
'r' for reading (the default)
'w' for file-writing - not used in CS106A

Standard "with" to open a text file for reading:

with open(filename) as f:
    # use f in here

You can specify a particular encoding (the default depends on your machine / locale). The encoding 'utf-8' is what many files use. If you get a UnicodeDecodeError, try adding 'utf-8' as below, and if that fails try 'utf-16'.

with open(filename, encoding='utf-8') as f:
    # use f

Older way to open() a file (handy in the interpreter). Normally use "with" as above, which is the modern way.

f = open(filename)
# use f
# f.close() when done
# "with" does the .close() automatically

1. File Loop

The most common way to look at the text of a file: process one line at a time. This uses the least memory. We often call strip() on each line to remove the '\n' from its end (strip() removes whitespace from both ends), as shown below.

for line in f:
    # process each line
    line = line.strip()   # trim off '\n'
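Here is the file loop combined with the "with" form in a complete, runnable sketch. The setup lines that write a temporary poem.txt are only there so the example runs anywhere; they use 'w' mode and the tempfile module, which are assumptions of this sketch and not part of the lecture.

```python
import os
import tempfile

# Setup only: create a small stand-in poem.txt in a temp directory.
path = os.path.join(tempfile.mkdtemp(), 'poem.txt')
with open(path, 'w', encoding='utf-8') as f:
    f.write('Roses are red\nViolets are blue\nThis does not rhyme\n')

# The standard pattern: "with" to open, for-loop to read one line at a time.
lines = []
with open(path, encoding='utf-8') as f:
    for line in f:
        line = line.strip()   # trim off '\n'
        lines.append(line)

print(lines)   # ['Roses are red', 'Violets are blue', 'This does not rhyme']
```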

2. Alternative: f.readlines()

f.readlines() - return list of line strings, can do slices etc. to access lines in a custom order. Each line has the '\n' at its end. As above, .strip() can remove the '\n' from a line if needed.

>>> f = open('poem.txt')  # use open() in interpreter
>>> lines = f.readlines()
>>> lines
['Roses are red\n', 'Violets are blue\n', 'This does not rhyme\n']
>>> 
>>> line = lines[0]    # first line
>>> line
'Roses are red\n'
>>> line.strip()       # strip() - remove whitespace from ends
'Roses are red'
>>>
>>> # What if we want to skip the first line?
>>>
>>> lines[1:]   # slice to skip first line
['Violets are blue\n', 'This does not rhyme\n']
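The same idea in non-interpreter code might look like the sketch below: readlines() inside a "with", then a slice to skip the first line. As before, the temp-file setup is an assumption added only to make the example runnable.

```python
import os
import tempfile

# Setup only: create a small stand-in poem.txt in a temp directory.
path = os.path.join(tempfile.mkdtemp(), 'poem.txt')
with open(path, 'w', encoding='utf-8') as f:
    f.write('Roses are red\nViolets are blue\nThis does not rhyme\n')

with open(path, encoding='utf-8') as f:
    lines = f.readlines()        # list of line strings, each ending in '\n'

# The list is still available after the file is closed.
rest = []
for line in lines[1:]:           # slice skips the first line
    rest.append(line.strip())

print(rest)   # ['Violets are blue', 'This does not rhyme']
```

Unlike the file loop, readlines() holds every line in memory at once, so it suits files of modest size.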

3. Alternative: f.read()

read() - whole file into one string. Handy if you can process the whole thing at once, not needing to go line by line. Reading from a file "consumes" the data. Doing a second read returns the empty string.

>>> f = open('poem.txt')
>>> s = f.read()          # whole file in string
>>> s
'Roses are red\nViolets are blue\nThis does not rhyme\n'
>>> 
>>> f.read()          # reading again gets nothing
''
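A small sketch of processing the whole file at once: read() gives one string, and split() with no arguments breaks it into words on any whitespace, including the '\n' chars. The word-count task and the temp-file setup are made up for this illustration.

```python
import os
import tempfile

# Setup only: create a small stand-in poem.txt in a temp directory.
path = os.path.join(tempfile.mkdtemp(), 'poem.txt')
with open(path, 'w', encoding='utf-8') as f:
    f.write('Roses are red\nViolets are blue\nThis does not rhyme\n')

with open(path, encoding='utf-8') as f:
    s = f.read()        # whole file as one string
    again = f.read()    # data already consumed, so this is ''

words = s.split()       # split on whitespace -> list of words
print(len(words))       # 10
print(repr(again))      # ''
```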

Babynames Background

The Social Security Administration's (SSA) baby names data set covers babies born in the US going back more than 100 years. This part of the project will load and organize the data. Part-b of the project will build out interactive code that displays the data.

New York Times: Where Have All The Lisas Gone. This is the article that gave Nick the idea to create this assignment way back when.

This is an endlessly interesting data set to look through: john and mary, jennifer, ethel and emily, trinity and bella and dawson, blanche and stella and stanley, michael and miguel.

We'll demo HW6 Baby Names with this data shortly.