Today: code for the dict-count algorithm, when does Python make a copy?, a more sophisticated nested-dict example, other ways to read a file

Dict-Count Algorithm

(started doing this in lecture-18)
Important class of dict algorithms
(Read: we'll use it a lot)
Note the key/value role in the counts dict below
In this case
key for each unique string seen
value is int count of that string

Dict Count Code Examples

> dict2 Demos and Exercises

Dict-Count Steps

1. Start with empty dict
2. For each string, test: not seen before?
3. Not seen before: store key = str, value = 1
4. Seen before: for key = str, value += 1
e.g. input strs: 'a', 'b', 'a', 'c', 'b'
Start with empty dict: {}
Dict-count algorithm builds: {'a': 2, 'b': 2, 'c': 1}

1. str_count1() - if/else

str_count1 demo, canonical dict-count algorithm

Demo: remove the code, type it in
Central test of this algorithm: not seen before?
if/else solution
Test: not seen before?
not seen before: counts[s] = 1
seen before: counts[s] += 1
This if/else approach is fine, but we'll see another way below

Solution code

def str_count1(strs):
    counts = {}
    for s in strs:
        # s not seen before?
        if s not in counts:
            counts[s] = 1   # first time
        else:
            counts[s] += 1  # every later time
    return counts

2. str_count2() - "Invariant" Version, no else

A slight "invariant" improvement on the above code
"Invariant" means something which is true in all cases
Same central test: not seen s before?
1. I want to do this line for all cases:
counts[s] += 1
Problem: if s not seen before, this crashes (try it!)
Fix: if s not in there, make it be in there
2. Add if-statement for not-in case:
counts[s] = 0
I have a slight preference for this version
It's one line shorter and does not use else
All counting goes through that one += line
I like that all the action funnels through the one spot
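
To see the crash mentioned above, here is a small sketch run on an empty dict. The variable names are just for illustration, and the try/except is there only so the script keeps running and can report what happened.

```python
counts = {}
s = 'a'

try:
    counts[s] += 1      # s is not in counts yet, so looking up counts[s] fails
    crashed = False
except KeyError:
    crashed = True

print(crashed)          # True

# The fix: make sure the key is present first, then += always works
if s not in counts:
    counts[s] = 0
counts[s] += 1
print(counts)           # {'a': 1}
```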

Standard Dict-Count Code - "invariant" Version

def str_count2(strs):
    counts = {}
    for s in strs:
        if s not in counts:  # fix counts/s if not seen before
            counts[s] = 0
        # Invariant: now s is in counts one way or
        # another, so can do next step unconditionally
        counts[s] += 1
    return counts

Int Count - Exercise

Apply the dict-count algorithm to a list of int values, return a counts dict, counting how many times each int value appears in the list. Reminder of the key test: not seen this element before?

Char Count - Exercise

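Apply the dict-count algorithm to the chars in a string. Build a counts dict of how many times each char appears, so 'Coffee' returns {'c': 1, 'o': 1, 'f': 2, 'e': 2}. Note that the example counts 'C' under the lowercase key 'c', so each char is converted to lowercase before counting.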

Python and Copying

For more detail see guide: Python Not Copying

When Python uses an assignment = with a data structure like a list or a dict, Python does not make a copy of the structure. Instead, there is just the one list or dict, and multiple pointers pointing to it.

1. One List and One Dict

Here is code that creates one list and one dict, each with a variable pointing to it.

>>> lst = [1, 2, 3]
>>> d = {}
>>> d['a'] = 1

Memory looks like:
[Figure: lst points to the list, d points to the dict]

2. Store Reference To List in Dict

>>> d['b'] = lst

What does this do? Key: the = does not make a copy of the list. Instead, it stores an additional reference to the one list inside the dict.

Memory looks like:
[Figure: a reference to the list is stored inside the dict]

3. lst.append() - What Happens?

There is just one list, and there are two references to it. This is fine. What does the following code do?

>>> lst.append(4)

What does memory look like now?

Memory looks like:
[Figure: the list is modified]

What do these lines of code print now?

>>> lst
???
>>> d['b']
???

Answer

Both lst and d['b'] refer to the same list, which is now [1, 2, 3, 4]
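
Here is a sketch of the same sequence as a script, using the `is` operator to confirm there is only one list. The last part goes one step beyond the example above and is only for contrast: `lst.copy()` is one way to ask for a separate list explicitly, and the key 'c' is made up for this illustration.

```python
lst = [1, 2, 3]
d = {'a': 1}
d['b'] = lst            # stores a reference to the list, not a copy

lst.append(4)
print(d['b'])           # [1, 2, 3, 4]
print(d['b'] is lst)    # True: both names refer to the same list

# Contrast: an explicit copy is a separate list
d['c'] = lst.copy()
lst.append(5)
print(d['c'])           # [1, 2, 3, 4]
print(d['b'])           # [1, 2, 3, 4, 5]
```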

Dict-Count Chapter 1 Summary - Init/Incr

Our first use of dict was counting
Super important dict code pattern
Stereotypical not-in/in logic per data item
Invariant strategy:
1. "not in" - fix dict so key is present
Call this "init", e.g. value = 0
2. Then can assume key is in there, update value
Call this "increment", e.g. value += 1

Suppose "x" holds the key we're counting...

    if x not in counts:   # Fix so x is in there
        counts[x] = 0     # -Init
    counts[x] += 1        # -Increment

Dict - Chapter 2 - Nested

More sophisticated dict algorithms
Old: value is int counter 0, 1, 2, 3
Now: value is "nested" structure
a list or dict stored as a value inside a dict (like d/lst above)
email_hosts example below - nice, realistic example of this

Email Hosts Data

Have email strings
'abby@foo.com'
'bob@bar.com'
has one @
"user" is left of @ -> 'abby'
"host" is right of @ -> 'foo.com'

Email Hosts Challenge

High level: we have a big list of email addresses. We want to organize the data by host. For each host, build up a list of all the users for that host.

Given a list of email address strings. Each email address has one '@' in it, e.g. 'abby@foo.com', where 'abby' is the user, and 'foo.com' is the host.

Create a nested dict with a key for each host, and the value for that key is a list of all the users for that host, in the order they appear in the original list (repetition allowed).

emails:
  ['abby@foo.com', 'bob@bar.com', 'abe@foo.com']

returns hosts dict:
  {
   'foo.com': ['abby', 'abe'],
   'bar.com': ['bob']
  }

Nested - Key/Value Types

When working on a nested dict problem, it's good to keep in mind the type of each key and its value. This info guides the code that reads or writes in the dict: when do you do += and when do you do .append()? Here is what we have for this problem. We will refer back to it when writing the key line of code:

1. Each key is a host string, e.g. 'foo.com'

2. The value for each key is a list of users for that host, e.g. ['abby', 'abe']

Email Hosts Example

> dict3 Emails - nested dict problem

Here is the code to start with. The "not in" structure still applies.

def email_hosts(emails):
    hosts = {}
    for email in emails:
        at = email.find('@')
        user = email[:at]
        host = email[at + 1:]
        # your code here
        pass
    return hosts

1. Think about the invariant line first. What is the append line? Look above at the definition for each key and value.

2. Need to init for the not-in case. For counting the init was: 0. Now the init is: [].

Issue: `hosts[host]`

This line is very hard to read, like what on earth is it?

Recall that the hosts dict looks like:

  {
   'foo.com': ['abby', 'abe'],
   'bar.com': ['bob']
  }

In hosts dict, each key is a host, and each value is a list of user names.

Style Technique: Decomp By Var

Instead of using hosts[host] as is, store its value in a well-named variable that spells out what sort of data it holds. This is a big help, and it is how our solution is written. Note how the names in this line of code confirm that the logic is correct: users.append(user). This depends on the "shallow" feature of Python data (above): hosts[host] returns a reference to the embedded list, not a copy.

No:

    hosts[host].append(user)

Yes:

    users = hosts[host]
    users.append(user)
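
A minimal sketch of why the two-line form works, using a small hand-built hosts dict as sample data: the variable `users` refers to the list inside the dict, so appending through it changes what the dict holds.

```python
hosts = {'foo.com': ['abby']}

users = hosts['foo.com']    # users refers to the list stored in the dict
users.append('abe')         # change that list through the users variable

print(hosts)                        # {'foo.com': ['abby', 'abe']}
print(users is hosts['foo.com'])    # True: one list, two references
```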

Email Hosts Solution Code

def email_hosts(emails):
    hosts = {}
    for email in emails:
        at = email.find('@')
        user = email[:at]
        host = email[at + 1:]

        # key algorithm: init/increment
        if host not in hosts:
            hosts[host] = []
        users = hosts[host]  # decomp by var
        users.append(user)
    return hosts
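
As a quick check, here is the solution run on the example from the challenge statement. The function is repeated only so the snippet runs by itself. The second call uses made-up input to show the "repetition allowed" rule.

```python
def email_hosts(emails):
    # Same logic as the solution above, repeated so this runs by itself
    hosts = {}
    for email in emails:
        at = email.find('@')
        user = email[:at]
        host = email[at + 1:]
        if host not in hosts:
            hosts[host] = []
        users = hosts[host]
        users.append(user)
    return hosts


result = email_hosts(['abby@foo.com', 'bob@bar.com', 'abe@foo.com'])
print(result)
# {'foo.com': ['abby', 'abe'], 'bar.com': ['bob']}

# A repeated address shows up twice in that host's list
repeats = email_hosts(['abby@foo.com', 'abby@foo.com'])
print(repeats)
# {'foo.com': ['abby', 'abby']}
```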

Drawing of the Email Hosts Sequence

[Figure: what hosts memory looks like when adding one user]

File Reading 2.0

See guide for more details: File Reading and Writing

We've done it this way: for line in f:
Advantage: low memory use even with huge file
Look at a few different reading forms
"with" form - handles closing automatically
'r' for reading (the default)
'w' for file-writing

Standard "with" to open a text file for reading:

with open(filename) as f:
    # use f in here

The form below is equivalent to above since 'r' is the default, meaning read the file. 'w' means write the file from RAM to the file system. See the guide above for sample writing code.

with open(filename, 'r') as f:
    # use f in here

Can specify the encoding (the default depends on your machine / locale). Encoding 'utf-8' is what many files use. Try this if you get a UnicodeDecodeError.

with open(filename, encoding='utf-8') as f:
    # use f
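
Here is a small round-trip sketch of the encoding parameter. The file name demo.txt and its contents are made up for this example, and the file is created in a temporary directory so nothing is left behind.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'demo.txt')    # hypothetical demo file

    # Write a non-ASCII char using utf-8
    with open(path, 'w', encoding='utf-8') as f:
        f.write('café\n')

    # Read it back, naming the same encoding explicitly
    with open(path, encoding='utf-8') as f:
        text = f.read()

print(text == 'café\n')    # True
```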

Older way to open() a file (use in interpreter)

f = open(filename)
# use f
# f.close() when done
# "with" does the .close() automatically

File Loop

Most common, process 1 line at a time. Uses the least memory.

for line in f:
    # process each line
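
A runnable sketch of the line loop, under the assumption that poem.txt holds the three lines used in the examples in these notes. The script writes that file into a temporary directory first. Each line arrives with its '\n' at the end, so the loop uses strip() to remove it.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    # Create a small sample file to read
    path = os.path.join(tmp, 'poem.txt')
    with open(path, 'w') as f:
        f.write('Roses are red\nViolets are blue\n"RED" BLUE.\n')

    cleaned = []
    with open(path) as f:
        for line in f:                    # one line at a time
            cleaned.append(line.strip())  # strip() removes the trailing '\n'

print(cleaned)
# ['Roses are red', 'Violets are blue', '"RED" BLUE.']
```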

Alternative: f.readlines()

f.readlines() - return list of line strings, can do slices etc. to access lines in a custom order. Each line has the '\n' at its end. Use str.strip() to strip off whitespace from the ends of a line.

>>> f = open('poem.txt')  # alternative to with, use in interpreter
>>> lines = f.readlines()
>>> lines
['Roses are red\n', 'Violets are blue\n', '"RED" BLUE.\n']
>>> 
>>> line = lines[0]    # first line
>>> line
'Roses are red\n'
>>> line.strip()       # strip() - remove whitespace from ends
'Roses are red'
>>>
>>> lines[1:]   # slice to grab subset of lines
['Violets are blue\n', '"RED" BLUE.\n']

Alternative: f.read()

read() - whole file into one string. Handy if you can process the whole thing at once, not needing to go line by line. Reading from a file "consumes" the data. Doing a second read returns the empty string.

>>> f = open('poem.txt')
>>> s = f.read()          # whole file in string
>>> s
'Roses are red\nViolets are blue\n"RED" BLUE.\n'
>>> 
>>> f.read()          # reading again gets nothing
''

f.read() -> str.split() -> list

String function str.split(), saw before
With no parameters, splits on whitespace, returns a list of "words"
Spaces and '\n' count as whitespace
Could use it with text read from file
Change the whole file into a list of words
Can loop over the list, never needed to mess with: for line in f
A very easy way to access the contents of a file

>>> f = open('poem.txt')
>>> s = f.read()
>>> 
>>> s
'Roses are red\nViolets are blue\n"RED" BLUE.\n'
>>> 
>>> s.split()
['Roses', 'are', 'red', 'Violets', 'are', 'blue', '"RED"', 'BLUE.']
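
Putting the pieces together, here is a sketch that combines f.read().split() with the dict-count pattern from earlier to count the words in a file. The function name word_counts is made up for this example, and the script writes the sample poem to a temporary file first.

```python
import os
import tempfile


def word_counts(filename):
    """Return a counts dict for the whitespace-separated words in a file."""
    with open(filename) as f:
        words = f.read().split()    # whole file -> list of words
    counts = {}
    for word in words:
        if word not in counts:      # init
            counts[word] = 0
        counts[word] += 1           # increment
    return counts


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'poem.txt')
    with open(path, 'w') as f:
        f.write('Roses are red\nViolets are blue\n"RED" BLUE.\n')
    counts = word_counts(path)

print(counts['are'])     # 2
print(counts['Roses'])   # 1
```

Note that split() does not touch punctuation or case, so '"RED"' and 'red' are counted as different words here.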