L19

Today: code for dict-count algorithm, when does Python make a copy? More sophisticated nested-dict example, other ways to read a file

Recall: Dict = Advanced

Have: have some big data set, data is not organized, perhaps random order
Dict:
Pick out data items we want
Store each item under a key in the dict
Done: now the data is organized by key
The dict is fast doing get/set by key, its defining superpower
Job interview pattern:
Interview question has some messed up data
Best answer inevitably uses a dict to organize the data
Because the dict is powerful and fast...
Interviewers cannot resist using it

Recall: Dict-Count Algorithm

Say we are building a counts dict, counting how many times each string appears in the strs list

strs = ['a', 'b', 'a', 'c', 'b']

Want to build this dict ultimately

counts == {'a': 2, 'b': 2, 'c': 1}

Dict-Count Picture

alt: counts a 2 b 2 c 1

Dict-Count Code

Standard dict-count code:

counts = {}
for s in strs:
    # 1. not in -> init
    if s not in counts:
        counts[s] = 0

    # 2. increment
    counts[s] = counts[s] + 1

Dict-Count "fix" Story

Want to do the (2) step for every s
Fundamentally, this increment line does the counting
BUT that line crashes the first time s is seen
counts[s] on the right side raises a KeyError if s is not in counts
Therefore, have (1) to "fix" the dict, installing s with value 0
Many dict algorithms have in/not-in logic as a step
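
The "fix" story can be seen by running the increment line with and without step (1). This is a small sketch using the strs list from above; the failed_key variable is made up for illustration:

```python
strs = ['a', 'b', 'a', 'c', 'b']

# Without the init step, the increment fails on the first string seen.
counts = {}
failed_key = None
try:
    for s in strs:
        counts[s] = counts[s] + 1
except KeyError as err:
    failed_key = err.args[0]   # the key that was not in the dict

# With the init step, the same increment line works for every s.
counts = {}
for s in strs:
    if s not in counts:
        counts[s] = 0
    counts[s] = counts[s] + 1

print(failed_key)
print(counts)
```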

Recall Style `not in` Form

Could write as: not s in counts
But PEP8 prefers: s not in counts
They are equivalent
Similarly, prefer s != 'x' instead of not s == 'x'

A Reminder From Our Friend Modulo `%`

Recall that modulo % is the remainder after int division.

57 % 10 -> 7
123 % 10 -> 3
19 % 10 -> 9
10 % 10 -> 0
98 % 10 -> 8
99 % 10 -> 9
100 % 10 -> 0

The % 10 of a non-negative int is just its last digit.

Mathematics angle: every digit to the left of the rightmost one contributes a multiple of 10, e.g. 123 = 120 + 3. Computing % 10 gives what is left after all the multiples of 10 are taken away.
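
A short sketch of the same idea in code, using the numbers listed above. It splits each number into its multiples-of-10 part and its last digit; the variable names are chosen here for illustration:

```python
nums = [57, 123, 19, 10, 98, 99, 100]

last_digits = []
for n in nums:
    tens_part = (n // 10) * 10    # all the multiples of 10 in n
    last_digit = n % 10           # what is left over
    # The two parts always add back up to the original number.
    assert tens_part + last_digit == n
    last_digits.append(last_digit)

print(last_digits)
```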

Digit Count - Exercise

> digit_count()

Apply the dict-count algorithm to count how many numbers end with each digit.

digit_count(nums): Given a list of non-negative ints. The last digit of each num can be found by computing num % 10. For example 57 % 10 is 7, and 7 is the last digit of 57. Build a counts dictionary where each key is an int digit, and its value is the count of one or more numbers in the list ending with that digit.

Digit Count - Solution

def digit_count(nums):
    counts = {}
    for num in nums:
        digit = num % 10
        if digit not in counts:
            counts[digit] = 0
        counts[digit] += 1
    return counts
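
To try the solution, here is a sketch that repeats the function and calls it on a small made-up list. The input values are assumptions for illustration, not part of the exercise:

```python
def digit_count(nums):
    counts = {}
    for num in nums:
        digit = num % 10
        if digit not in counts:
            counts[digit] = 0
        counts[digit] += 1
    return counts


# 57 and 17 both end in 7, 123 ends in 3, 10 ends in 0.
result = digit_count([57, 17, 123, 10])
print(result)

# An empty list gives an empty dict: no digit gets a key.
print(digit_count([]))
```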

Python and Copying

For more detail see guide: Python Not Copying

When Python uses an assignment = with a data structure like a list or a dict, Python does not make a copy of the structure. Instead, there is just the one list or dict, and multiple pointers pointing to it.

1. One List Two Vars

>>> lst = [1, 2, 3]
>>> b = lst
>>>
>>> # lst and b appear to have the same value
>>> # in fact, they both point to the same list
>>> lst
[1, 2, 3]
>>> b
[1, 2, 3]

Key: there is one list, two vars pointing to it. We can call .append() using either variable, and they both do the same thing, changing the one underlying list.

>>> b.append(99)  # b.append()
>>> b
[1, 2, 3, 99]     # b's list is changed
>>> lst
[1, 2, 3, 99]     # so is lst - it's the same list

alt: lst and b point to the one list with 99 appended

2. One List and One Dict

Here is code that creates one list and one dict, each with a variable pointing to it.

>>> lst = [1, 2, 3]
>>> d = {}
>>> d['a'] = 1

Memory looks like:
alt: one lst points to list, d points to dict

2. Store Reference To List inside Dict

>>> d['b'] = lst

What does this do? Key: the = does not make a copy of the list. Instead, it stores an additional reference to the one list inside the dict.

Memory looks like:
alt: reference to list stored inside dict

3. d['b'].append() - What Happens?

There is just one list, and there are two references to it. This is fine. What does the following code do?

>>> d['b'].append(4)

What does memory look like now? First, what does the list look like? Who is pointing to it?

Memory looks like:
alt: list is modified

What do these lines of code print now?

>>> lst
???
>>> d['b']
???

Answer

lst and d['b'] are both references to the one list, which is now [1, 2, 3, 4]
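
The steps above can be run as a script to check the answer. This sketch also uses Python's "is" operator, which tests whether two references point to the same object; the same_list variable is added here for illustration:

```python
lst = [1, 2, 3]
d = {}
d['a'] = 1
d['b'] = lst          # stores a reference to the list, not a copy
d['b'].append(4)      # changes the one underlying list

# "is" is True when both sides refer to the very same object.
same_list = d['b'] is lst

print(lst)
print(d['b'])
print(same_list)
```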

4. Use "nums" Variable

Use = to store another reference to list in a "nums" variable. Does this make a copy of the list? No. It's just another reference to the one list. What happens when we do nums.append(99)?

>>> nums = d['b']
>>> nums.append(99)
>>> nums
[1, 2, 3, 4, 99]
>>> d['b']
[1, 2, 3, 4, 99]
>>>

alt: nums also points to the list

Summary - Pointers Proliferate

Python does not copy a list or dict when used with, say, =. Instead, Python just spreads around more pointers to the one list. This is a normal way for Python programs to work - a few important lists or dicts, and pointers to those structures spread around in the code. This does not require any action on your part, just realize that there are no copies.
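
If a separate copy is actually wanted, it has to be requested explicitly. Here is a minimal sketch using the list .copy() method to contrast the two cases; the names alias and snapshot are made up for illustration:

```python
lst = [1, 2, 3]
alias = lst            # = stores another pointer to the same list
snapshot = lst.copy()  # explicit copy: a new, separate list

lst.append(99)

print(alias)      # sees the append, since it is the same list
print(snapshot)   # unchanged, since it is a separate list
```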

Dict-Count Chapter 1 Summary - Init/Incr

Our first use of dict was counting
Super important dict code pattern
Stereotypical not-in/in logic per data item
1. "not in" - fix dict so key is present
Call this "init", e.g. value = 0
2. Then can assume key is in there, update value
Call this "increment", e.g. value += 1

Suppose "x" holds the key we're counting...

    if x not in counts:   # Fix so x is in there
        counts[x] = 0     # -Init
    counts[x] += 1        # -Increment

Advanced Dict - Nested / Inner

More sophisticated dict algorithms
Old: dict value is int counter 0, 1, 2, 3
Now: dict value is a nested "inner" structure
a list or dict stored as a value inside a dict (like d/lst above)
We'll use the word inner for the list or dict stored inside
email_hosts example below - nice, realistic example of this

Email - Parse User and Host

# Have email strings
-'abby@foo.com'
-'bob@bar.com'

# One @
-"user" is left of @ -> 'abby'
-"host" is right of @ -> 'foo.com'
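
One way to pull out the two parts is str.find() plus slices. This is a sketch that assumes the email has exactly one '@':

```python
email = 'abby@foo.com'

at = email.find('@')      # index of the '@' char
user = email[:at]         # everything left of the '@'
host = email[at + 1:]     # everything right of the '@'

print(user)
print(host)
```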

Email Hosts Challenge

This is a tricky problem. We'll go step by step in lecture, you can follow along. Then we'll work a similar problem in section.

High level: we have a big list of email addresses. We want to organize the data by host. For each host, build up a list of all the users for that host.

Given a list of email address strings. Each email address has one '@' in it, e.g. 'abby@foo.com', where 'abby' is the user, and 'foo.com' is the host.

Create a nested dict with a key for each host, and the value for that key is a list of all the users for that host, in the order they appear in the original list (repetition allowed).

emails:
  ['abby@foo.com', 'bob@bar.com', 'abe@foo.com']

returns hosts dict:
  {
   'foo.com': ['abby', 'abe'],
   'bar.com': ['bob']
  }

Type Commitments: key and value

When working a nested dict problem, it's good to keep in mind the type of each key and its value. This info guides code that reads or writes in the dict - when do you do += and when do you do .append(). We will refer back to the types for this problem when writing a key line of code.

Here are the two types we have for the hosts dict. Write these on the board, for reference later when we get to the code. A commitment.

1. hosts key = string `'foo.com'`

Each key in the hosts dict is a host string, e.g. 'foo.com'

2. hosts value = inner list of users

The value for each key is an inner list of users for that host, e.g. ['abby', 'abe']

Email Hosts Example

> email_hosts() - nested dict problem

Here is the code to start with. The "not in" structure still applies.

def email_hosts(emails):
    hosts = {}
    for email in emails:
        at = email.find('@')
        user = email[:at]
        host = email[at + 1:]
        # your code here
        pass
    return hosts

1. Init Line

Need to init for the case where the host is not in dict already. For counting the init value was 0. Now the init value is [].

        # init ([])
        if host not in hosts:
            hosts[host] = []

2. Increment Line v1

Think about the "increment" line - want to append this user to the inner list of users. What is the reference to the inner list of users? Look above at the definition for each key and value. The inner list is hosts[host] - not so readable though.

        # increment (.append)
        hosts[host].append(user)

Issue: `hosts[host]`

What is hosts[host]? It is hard to read.

Recall that the hosts dict looks like this:

  {
   'foo.com': ['abby', 'abe'],
   'bar.com': ['bob']
  }

In hosts dict, each key is a host, and each value is a list of user names. Therefore, hosts[host] is accessing one of the lists of users.

Style Technique: Decomp By Var - v2

Instead of using hosts[host] as is, put its value into a variable with a good name, spelling out what sort of data it holds. This helps go step by step and is how our solution is written. Note how the names in this line of code confirm that the logic is correct: users.append(user). This depends on the "shallow" feature of Python data (above), e.g. hosts[host] returns a reference to the embedded list, not a copy of it.

Store the inner list in a var, then append() on the var:

        users = hosts[host]
        users.append(user)

Or if you cannot think of a word for the inner list, you could at least use "inner" as the var name. Not fancy, but better than v1:

        inner = hosts[host]
        inner.append(user)

Email Hosts Solution Code

def email_hosts(emails):
    hosts = {}
    for email in emails:
        at = email.find('@')
        user = email[:at]
        host = email[at + 1:]

        # key algorithm: init/increment
        if host not in hosts:
            hosts[host] = []
        users = hosts[host]  # decomp by var
        users.append(user)
    return hosts
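
To check the solution against the example from the problem statement, here is a sketch that repeats the function and calls it on the three sample emails:

```python
def email_hosts(emails):
    hosts = {}
    for email in emails:
        at = email.find('@')
        user = email[:at]
        host = email[at + 1:]

        # key algorithm: init/increment
        if host not in hosts:
            hosts[host] = []
        users = hosts[host]  # decomp by var
        users.append(user)
    return hosts


emails = ['abby@foo.com', 'bob@bar.com', 'abe@foo.com']
hosts = email_hosts(emails)
print(hosts)

# Each value is an inner list, with users in their original order.
print(hosts['foo.com'])
```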

Drawing of the Email Hosts Sequence

alt: what hosts memory looks like, adding one user

Practice Later: food_ratings()

Say we have a bunch of ratings about foods, and we want to organize them per food.

> food_ratings() - nested dict problem

food_ratings(ratings): Given a list of food survey rating strings like this 'donut:10'. Create and return a "foods" dict that has a key for each food, and its value is a list of all its rating numbers in int form. Use split() to divide each food rating into its parts. There's a lot of Python packed into this question!

Build dict with structure:

Key = one food string

Value = list of rating ints

Nice Parsing Technique `split(':')`

Nice technique, say we have rating = 'donut:10'

Use split():

rating = 'donut:10'
parts = rating.split(':')
# parts is now ['donut', '10']
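
A sketch of the parsing step for one rating string. Note that split() returns strings, so the rating needs int() to become a number. The names food and num are chosen here for illustration:

```python
rating = 'donut:10'

parts = rating.split(':')   # list of the strings around the ':'
food = parts[0]             # the food name
num = int(parts[1])         # '10' is a string, int() converts it

print(parts)
print(food)
print(num)
```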

File Reading 2.0

See guide for more details: File Reading and Writing

We've done this way: for line in f:
Advantage: low memory use even with huge file
Look at a few different reading forms
"with" form - handles closing automatically
'r' for reading (the default)
'w' for file-writing

Standard "with" to open a text file for reading:

with open(filename) as f:
    # use f in here

You can specify a particular encoding (the default depends on your machine / locale). The encoding 'utf-8' is what many files use. Try this if you get a UnicodeDecodeError. If the file uses a different encoding, you will need to try others such as 'utf-16'.

with open(filename, encoding='utf-8') as f:
    # use f
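
Here is a sketch of what an encoding mismatch looks like. It writes a small demo file in utf-8, reads it back with the matching encoding, and then tries 'ascii' to trigger the error. The file name, its contents, and the choice of 'ascii' as the wrong encoding are all assumptions for illustration:

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as folder:
    # Write a demo file containing one non-ASCII char (e with accent).
    filename = os.path.join(folder, 'demo.txt')
    with open(filename, 'w', encoding='utf-8') as f:
        f.write('caf\u00e9\n')

    # Reading with the matching encoding works.
    with open(filename, encoding='utf-8') as f:
        text = f.read()

    # Reading with an encoding that cannot represent the bytes fails.
    got_error = False
    try:
        with open(filename, encoding='ascii') as f:
            f.read()
    except UnicodeDecodeError:
        got_error = True

print(text)
print(got_error)
```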

Older way to open() a file (use in interpreter)

f = open(filename)
# use f
# f.close() when done
# "with" does the .close() automatically

1. File Loop

Most common way to look at the text of a file, process 1 line at a time. Uses the least memory.

for line in f:
    # process each line
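
A sketch of the file loop in action. It first writes a small poem.txt into a temporary folder so the example is self-contained, then loops over the lines to count them and find the longest one. The names line_count and longest are made up for illustration:

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as folder:
    # Write a small demo file so the example is self-contained.
    filename = os.path.join(folder, 'poem.txt')
    with open(filename, 'w') as f:
        f.write('Roses are red\nViolets are blue\nThis Does Not Rhyme\n')

    line_count = 0
    longest = ''
    with open(filename) as f:
        for line in f:
            line = line.strip()      # remove the '\n' at the end
            line_count += 1
            if len(line) > len(longest):
                longest = line

print(line_count)
print(longest)
```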

2. Alternative: f.readlines()

f.readlines() - return list of line strings, can do slices etc. to access lines in a custom order. Each line has the '\n' at its end. Use str.strip() to strip off whitespace from the ends of a line.

>>> f = open('poem.txt')  # use open() in interpreter
>>> lines = f.readlines()
>>> lines
['Roses are red\n', 'Violets are blue\n', 'This Does Not Rhyme\n']
>>> 
>>> line = lines[0]    # first line
>>> line
'Roses are red\n'
>>> line.strip()       # strip() - remove whitespace from ends
'Roses are red'
>>>
>>> # What if we want to skip the first line?
>>>
>>> lines[1:]   # slice to skip first line
['Violets are blue\n', 'This Does Not Rhyme\n']
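
The same steps as a script, as a sketch. It writes the poem to a temporary file first so it can run anywhere, then skips the first line with a slice and strips each remaining line. The cleaned list is a name made up for illustration:

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as folder:
    # Write a small demo file so the example is self-contained.
    filename = os.path.join(folder, 'poem.txt')
    with open(filename, 'w') as f:
        f.write('Roses are red\nViolets are blue\nThis Does Not Rhyme\n')

    with open(filename) as f:
        lines = f.readlines()    # list of line strings, each with '\n'

# Skip the first line with a slice, strip the '\n' off each line.
cleaned = []
for line in lines[1:]:
    cleaned.append(line.strip())

print(cleaned)
```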

3. Alternative: f.read()

read() - whole file into one string. Handy if you can process the whole thing at once, not needing to go line by line. Reading from a file "consumes" the data. Doing a second read returns the empty string.

>>> f = open('poem.txt')
>>> s = f.read()          # whole file in string
>>> s
'Roses are red\nViolets are blue\nThis Does Not Rhyme\n'
>>> 
>>> f.read()          # reading again gets nothing
''
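
Here is a sketch of processing the whole file at once. It counts the words in the poem with str.split() and also shows the second read() coming back empty. The file is written to a temporary folder first so the example is self-contained; the names again and words are made up for illustration:

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as folder:
    # Write a small demo file so the example is self-contained.
    filename = os.path.join(folder, 'poem.txt')
    with open(filename, 'w') as f:
        f.write('Roses are red\nViolets are blue\nThis Does Not Rhyme\n')

    with open(filename) as f:
        text = f.read()     # whole file in one string
        again = f.read()    # data already consumed: empty string

# split() with no argument splits on any whitespace, including '\n'.
words = text.split()

print(len(words))
print(repr(again))
```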

Babynames Demo

The Social Security Administration's (SSA) baby names data set covers babies born in the US, going back more than 100 years. This part of the project will load and organize the data. Part-b of the project will build out interactive code that displays the data.

New York Times: Where Have All The Lisas Gone. This is the article that gave Nick the idea to create this assignment way back when.

This is an endlessly interesting data set to look through: john and mary, ethel and emily.