Today: code for dict-count algorithm, when does python make a copy? More sophisticated nested-dict example, other ways to read a file

Dict-Count Algorithm

Dict Count Code Examples

> dict2 Demos and Exercises

Dict-Count Steps

1. str-count1() - if/else

str_count1 demo, canonical dict-count algorithm

Solution code

def str_count1(strs):
    counts = {}
    for s in strs:
        # s not seen before?
        if s not in counts:
            counts[s] = 1   # first time
        else:
            counts[s] += 1  # every later time
    return counts

2. str-count2() - "Invariant" Version, no else

Standard Dict-Count Code - "invariant" Version

def str_count2(strs):
    counts = {}
    for s in strs:
        if s not in counts:  # fix counts/s if not seen before
            counts[s] = 0
        # Invariant: now s is in counts one way or
        # another, so can do next step unconditionally
        counts[s] += 1
    return counts

Int Count - Exercise

Apply the dict-count algorithm to a list of int values, return a counts dict, counting how many times each int value appears in the list. Reminder of the key test: not seen this element before?

Char Count - Exercise

Apply the dict-count algorithm to chars in a string. Build a counts dict of how many times each char appears in a string so 'Coffee' returns {'c': 1, 'o': 1, 'f': 2, 'e': 2}.


Python and Copying

For more detail see guide: Python Not Copying

When Python uses an assignment = with a data structure like a list or a dict, Python does not make a copy of the structure. Instead, there is just the one list or dict, and multiple pointers pointing to it.

1. One List and One Dict

Here is code that creates one list and one dict, each with a variable pointing to it.

>>> lst = [1, 2, 3]
>>> d = {}
>>> d['a'] = 1

Memory looks like:
alt: one lst points to list, d points to dict

2. Store Reference To List in Dict

>>> d['b'] = lst

What does this do? Key: the = does not make a copy of the list. Instead, it stores an additional reference to the one list inside the dict.

Memory looks like:
alt: reference to list stored inside dict

3. lst.append() - What Happens?

There is just one list, and there are two references to it. This is fine. What does the following code do?

>>> lst.append(4)

What does memory look like now?

Memory looks like:
alt: list is modified

What do these lines of code print now?

>>> lst
???
>>> d['b']
???

Answer

Both lst and d['b'] refer to the same list, which is now [1, 2, 3, 4]


Dict-Count Chapter 1 Summary - Init/Incr

Suppose "x" holds the key we're counting...

    if x not in counts:   # Fix so x is in there
       counts[x] = 0      # -Init
    counts[x] += 1        # -Increment

Dict - Chapter 2 - Nested

Email Hosts Data

Email Hosts Challenge

High level: we have a big list of email addresses. We want to organize the data by host. For each host, build up a list of all the users for that host.

Given a list of email address strings. Each email address has one '@' in it, e.g. 'abby@foo.com', where 'abby' is the user, and 'foo.com' is the host.

Create a nested dict with a key for each host, and the value for that key is a list of all the users for that host, in the order they appear in the original list (repetition allowed).

emails:
  ['abby@foo.com', 'bob@bar.com', 'abe@foo.com']

returns hosts dict:
  {
   'foo.com': ['abby', 'abe'],
   'bar.com': ['bob']
  }

Nested - Key/Value Types

When working a nested dict problem, it's good to keep in mind the type of each key and its value. This info guides code that reads or writes in the dict - when do you do += and when do you do .append(). What we have for this problem - will refer to this when writing a key line of code:

1. Each key is a host string, e.g. 'foo.com'

2. The value for each key is a list of users for that host, e.g. ['abby', 'abe']

Emil Hosts Example

> dict3 Emails - nested dict problem

Here is the code to start with. The "not in" structure still applies.

def email_hosts(emails):
    hosts = {}
    for email in emails:
        at = email.find('@')
        user = email[:at]
        host = email[at + 1:]
        # your code here
        pass
    return hosts

1. Think about the invariant line first. What is the append line? Look above at the definition for each key and value.

2. Need to init for the not-in case. For counting the init was: 0. Now the init is: [].

Issue: hosts[host]

This line is very hard to read, like what on earth is it?

Recall that the hosts dict looks like:

  {
   'foo.com': ['abby', 'abe'],
   'bar.com': ['bob']
  }

In hosts dict, each key is a host, and each value is a list of user names.

Style Technique: Decomp By Var

Instead of using hosts[host] as is, put its value into a well named var, spelling out what sort of data it holds. This is a big help and is how our solution is written. Note how the names in this line of code confirm that the logic is correct: users.append(user) This depends on the "shallow" feature of Python data (above), e.g. hosts[host] returns a reference to the embedded list to us.

No:

    hosts[host].append(user)

Yes:

    users = hosts[host]
    users.append(user)

Email Hosts Solution Code

def email_hosts(emails):
    hosts = {}
    for email in emails:
        at = email.find('@')
        user = email[:at]
        host = email[at + 1:]

        # key algorithm: init/increment
        if host not in hosts:
            hosts[host] = []
        users = hosts[host]  # decomp by var
        users.append(user)
    return hosts

Drawing of the Email Hosts Sequence

alt:what hosts memory looks like, adding one user


File Reading 2.0

See guide for more details: File Reading and Writing

Standard "with" to open a text file for reading:

with open(filename) as f:
    # use f in here

The form below is equivalent to above since 'r' is the default, meaning read the file. 'w' means write the file from RAM to the file system. See the guide above for sample writing code.

with open(filename, 'r') as f:
    # use f in here

Can specify encoding (default depends on your machine / locale). Encoding 'utf-8' is what many files use. Try this if you get a UnciodeDecodeError

with open(filename, encoding='utf-8') as f:
    # use f

Older way to open() a file (use in interpreter)

f = open(filename)
# use f
# f.close() when done
# "with" does the .close() automatically

File Loop

Most common, process 1 line at a time. Uses the least memory.

for line in f:
    # process each line

Alternative: r.readlines()

f.readlines() - return list of line strings, can do slices etc. to access lines in a custom order. Each line has the '\n' at its end. Use str.strip() to strip off whitespace from the ends of a line.

>>> f = open('poem.txt')  # alternative to with, use in interpreter
>>> lines = f.readlines()
>>> lines
['Roses are red\n', 'Violets are blue\n', '"RED" BLUE.\n']
>>> 
>>> line = lines[0]    # first line
>>> line
'Roses are red\n'
>>> line.strip()       # strip() - remove whitespace from ends
'Roses are red'
>>>
>>> lines[1:]   # slice to grab subset of lines
['Violets are blue\n', '"RED" BLUE.\n']

Alternative: f.read()

read() - whole file into one string. Handy if you can process the whole thing at once, not needing to go line by line. Reading from a file "consumes" the data. Doing a second read returns the empty string.

>>> f = open('poem.txt')
>>> s = f.read()          # whole file in string
>>> s
'Roses are red\nViolets are blue\n"RED" BLUE.\n'
>>> 
>>> >>> f.read()          # reading again gets nothing
''

f.read() -> str.split() -> list

>>> f = open('poem.txt')
>>> s = f.read()
>>> 
>>> s
'Roses are red\nViolets are blue\n"RED" BLUE.\n'
>>> 
>>> s.split()
['Roses', 'are', 'red', 'Violets', 'are', 'blue', '"RED"', 'BLUE.']