L19

Today: new data type dict and the dict-count algorithm

Python `dict` - Hash Table - Fast

Python "dict" type, a "dictionary"
Stores values, each value under a "key"
In CS generally known as a "hash table"
Sounds like a real hacker thing
CS106B - implement dictionary from scratch
Dict is a bit advanced, compared to basic string/list
Defining features of a dict:
1. Can get or set a value under a key chosen by the programmer
2. The get/set operations are fast

For more details, see the chapter in the Guide: Dict

Dict Story Arc

Have: have some big data set, data is not organized, perhaps random order
Real world data looks like this
Dict strategy
1. Pick out data item to use as key
2. Load all the data, storing each item under its appropriate key
3. Done: now the data is organized by key, only needed to handle each item once
The dict is fast doing get/set by key, its defining superpower
Job interview pattern:
Interview question has some messed up data
Best answer inevitably uses a dict to organize the data
Because the dict is powerful and fast...
Interviewers cannot resist using it

Restaurant - From Chaos to Order

Suppose you are out ordering dinner at a restaurant, and the order is proceeding in a chaotic way, with people throwing out their orders in random order:

Alice: I'd like to start with a cup of gazpacho
Bob:   I like beignets for dessert
Alice: Then a Caesar salad
Zoe:   I'll have lasagna
Bob:   Actually two orders of beignets
Alice: Then I'll have tacos
Bob:   And a hot dog
...

People mention the parts of their order piece by piece in no organized order - fine. However, what is needed for the kitchen is to organize each order by person. In a dict, we choose the person as the key for their order, and organize the data that way.

Once all the data is loaded, organized by key, it looks like this:

Alice: gazpacho, Caesar salad, tacos
Bob: hot dog, two orders of beignets
...

This is what the dictionary does - data comes in randomly and the dict can organize it by a chosen "key" part of the data, here the name.
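
As a rough sketch of the restaurant idea, the code below loads the spoken orders into a dict keyed by person. The list-of-pairs input format and the names `spoken` and `orders_by_person` are assumptions made for illustration, and Bob's "two orders" correction is left out to keep it short.

```python
# Hypothetical input: (person, item) pairs in the order they were spoken.
spoken = [
    ('Alice', 'gazpacho'),
    ('Bob', 'beignets'),
    ('Alice', 'caesar salad'),
    ('Zoe', 'lasagna'),
    ('Alice', 'tacos'),
    ('Bob', 'hot dog'),
]

orders_by_person = {}
for person, item in spoken:
    # First time we see this person: start an empty list under their key.
    if person not in orders_by_person:
        orders_by_person[person] = []
    # Each item is handled once, stored under its person key.
    orders_by_person[person].append(item)

print(orders_by_person['Alice'])   # ['gazpacho', 'caesar salad', 'tacos']
print(orders_by_person['Bob'])     # ['beignets', 'hot dog']
```

The data came in mixed together, but afterwards each person's whole order can be pulled up with one get by key.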

What does `x[y]` Mean?

What does this mean in Python generally...

 x[y]

We have an outer thing x. Refer to an inner thing inside it identified by y.

The dict will use square bracket [ ] also, and will follow this outer/inner pattern.
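
A small sketch of the outer/inner pattern across three types. The variable names and values here are made up for illustration.

```python
s = 'Python'
lst = ['red', 'green', 'blue']
d = {'a': 'alpha', 'g': 'gamma'}

print(s[0])      # outer str, inner char at index 0 -> P
print(lst[2])    # outer list, inner element at index 2 -> blue
print(d['g'])    # outer dict, inner value under key 'g' -> gamma
```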

Dict Basics

Drawing: Python dict with key/value pairs 'a' -> 'alpha', 'g' -> 'gamma', 'b' -> 'beta'

Organize data around keys
1. Set - for a key, store one associated value
In drawing with an arrow: key -> value
2. Get - look up the value for any key

Dict-1 - Set key:value into Dict

Create empty dict: d = {}
Set: d[key] = value
e.g. d['a'] = 'alpha'
Set creates that key entry in dict if needed
Overwrites any previous value for that key
i.e. Each key has one value

>>> d = {}             # Start with empty dict {}
>>> d['a'] = 'alpha'   # Set key/value
>>> d['g'] = 'gamma'
>>> d['b'] = 'beta'
>>> # Now we have built the picture above
>>> # Python can input/output a dict using
>>> # the literal { .. } syntax.
>>> d
{'a': 'alpha', 'g': 'gamma', 'b': 'beta'}
>>>

Dict-2 - Get value out of Dict

Get a value out of a dict by key
Get: d[key] - returns the value for that key
e.g. d['a'] returns 'alpha'
Note Left/Right of =
On left of = - setting
On right of = - getting
Handy: +=
Does a get/set series on the value
This will be a handy pattern
e.g. d['a'] += '!!!'
Equivalent to: d['a'] = d['a'] + '!!!'
Adds '!!!' to end of that value

>>> s = d['g']         # Get by key
>>> s
'gamma'
>>> d['b']
'beta'
>>> d['a'] = 'apple'   # Overwrite 'a' key
>>> d['a']
'apple'
>>>
>>> # += modify value
>>> d['a'] += '!!!'
>>> d['a']
'apple!!!'
>>>
>>> d
{'a': 'apple!!!', 'g': 'gamma', 'b': 'beta'}
>>>

Dict-3 - Get Error / "in" Test

There is one big catch
Problem: get d[key] only works if the key is in the dict
If the key is not in the dict, get d[key] is an error
Solution: use in to check if a key is in the dict
Habit: when you see d[key] .. think if that key is good
Pattern: before using a key to get a value, in-check the key
Sort of the "guard" pattern again
Note: the in check is for keys, not values
All dict logic is on the key, the value is just stored

>>> # Can initialize dict with literal
>>> d = {'a': 'alpha', 'g': 'gamma', 'b': 'beta'}
>>>
>>> val = d['x']         # Key not in -> Error
Error:KeyError('x',)
>>>
>>> 'a' in d             # "in" key tests
True
>>> 'x' in d
False
>>> 
>>> # Guard pattern: only get d['x'] if the key is present
>>> if 'x' in d:
...     val = d['x']
...
>>>
>>> # "in" uses key, not value, so this does not work:
>>> 'alpha' in d 
False
>>>

Dict Summary - 3 Operations

1. Set: d[key] = value
2. Get: x = d[key]
3. In check: key in d
The dict logic is always by key, the value is just stored
These are all fast even with millions of key/value pairs
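
One rough way to see the speed claim is to time an "in" check on a large dict against the same check on a list. This is only a sketch: the size `n` and the variable names are arbitrary choices, and the actual timings depend on the machine, so no particular numbers are promised.

```python
import time

n = 1_000_000
d = {}
for i in range(n):
    d[i] = i * 2              # set: one key/value pair per int
nums = list(range(n))         # same numbers in a list, for comparison

target = n - 1                # last element: worst case for scanning the list

start = time.perf_counter()
in_dict = target in d         # in check by key
val = d[target]               # get by key
dict_secs = time.perf_counter() - start

start = time.perf_counter()
in_list = target in nums      # list "in" looks through the elements one by one
list_secs = time.perf_counter() - start

print('dict seconds:', dict_secs)
print('list seconds:', list_secs)
```

The dict goes directly to the key rather than scanning, so on a typical run its time should come out much smaller than the list's.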

Dict = Memory, Meals Examples

At a high level, the dict is like a memory store - code can store a piece of information at one time, retrieve it later.

Meals problems: use a dict to remember what food was eaten under the keys 'breakfast', 'lunch', 'dinner'. For example: meals['breakfast'] = 'apple'

>>> meals = {}
>>> meals['breakfast'] = 'apple'
>>> meals['lunch'] = 'donut'
>>>
>>> # time passes, other lines run
>>>
>>> # what was lunch again?
>>> meals['lunch']
'donut'
>>> 
>>> # did I have breakfast and dinner yet?
>>> 'breakfast' in meals
True
>>> 'dinner' in meals
False
>>>

Basic Dict Code Examples - Meals

Look at the dict1 "meals" exercises on the experimental server

> dict1 meals exercises

With the "meals" examples, the keys are 'breakfast', 'lunch', 'dinner' and the values are like 'hot dog' and 'bagel'. A key like 'breakfast' may or may not be in the dict, so the code needs to "in" check first.

bad_start() - check for bad breakfast - return True if no breakfast or if it is 'candy'
candyish() - check for candy lunch or dinner
enkale() - if 'candy' for dinner, change it to 'kale'

Preface: Think About `dict[key]`

Often pulling up a value by its key

val = d[key]

Think about: is that key always valid? If the key is not in the dict, accessing it crashes with a KeyError. Therefore, use "in" logic to check that the key is present before accessing it with square brackets:

if key in d:
    val = d[key]

1. bad_start()

> bad_start()

bad_start(meals): Given a "meals" dict which contains key/value pairs like 'lunch' -> 'hot dog'. The possible keys are 'breakfast', 'lunch', 'dinner'. Return True if there is no 'breakfast' key in meals, or the value for 'breakfast' is 'candy'. Otherwise return False.

Try running code without the "in" check - see the KeyError.

bad_start() Solution Code

Question: is the meals['breakfast'] == 'candy' line safe? Yes. The earlier if-statement guards the [ ].

def bad_start(meals):
    if 'breakfast' not in meals:
        return True
    if meals['breakfast'] == 'candy':
        return True
    return False
    # Can be written with "or" / short-circuiting
    # if 'breakfast' not in meals or meals['breakfast'] == 'candy':
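
The "or" form mentioned in the comment can be sketched as a complete function. The name `bad_start_or` is made up here so it can sit beside the original.

```python
def bad_start_or(meals):
    # The "in" test runs first. If 'breakfast' is missing, the "or"
    # is already True and stops there, so the [ ] access on the
    # right never runs and cannot raise KeyError.
    if 'breakfast' not in meals or meals['breakfast'] == 'candy':
        return True
    return False


print(bad_start_or({}))                        # True
print(bad_start_or({'breakfast': 'candy'}))    # True
print(bad_start_or({'breakfast': 'bagel'}))    # False
```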

2. enkale()

> enkale()

enkale(meals): Given a "meals" dict which contains key/value pairs like 'lunch' -> 'hot dog'. The possible keys are 'breakfast', 'lunch', 'dinner'. If the key 'dinner' is in the dict with the value 'candy', change the value to 'kale'. Otherwise leave the dict unchanged. Return the dict in all cases.

enkale() Solution Code

Demo: work out the code, see key error

Cannot access meals['dinner'] in the case that dinner is not in the dict, so need logic to avoid that case.

def enkale(meals):
    if 'dinner' in meals and meals['dinner'] == 'candy':
        meals['dinner'] = 'kale'
    return meals

Typical pattern: the "in" check guards the meals['dinner'] access, since the and short-circuits and only proceeds to the second test when the first test is True. Could write it out in this longer form with two if-statements, which is ok. It works exactly the same as the above and/short-circuit form:

def enkale(meals):
    if 'dinner' in meals:
        if meals['dinner'] == 'candy':
            meals['dinner'] = 'kale'
    return meals

Exercise: is_boring()

> is_boring()

is_boring(meals): Given a "meals" dict. We'll say the meals dict is boring if lunch and dinner are both present and are the same food. Return True if the meals dict is boring, False otherwise.

Idea: could solve without worrying about the KeyError first. Then put in the needed "in" guard checks.
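
One possible approach, offered as a sketch rather than the official solution (try the exercise first). The comparison is the core logic, and the "in" checks in front guard both [ ] accesses.

```python
def is_boring(meals):
    # Guard: both keys must be present before using [ ] on them.
    if 'lunch' in meals and 'dinner' in meals:
        # Core logic: boring when lunch and dinner are the same food.
        return meals['lunch'] == meals['dinner']
    return False


print(is_boring({'lunch': 'bagel', 'dinner': 'bagel'}))   # True
print(is_boring({'lunch': 'bagel', 'dinner': 'kale'}))    # False
print(is_boring({'lunch': 'bagel'}))                      # False
```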

Dict Observations

Dict Random Order

Note: the keys in a dict are not sorted
They stay in the order they were added (guaranteed since Python 3.7)
Simplest to think of the order as random and not depend on it
Key type is typically a str or int (immutable)
Value type can be anything (str, list, ...)
The get/set/in logic is on the keys
The values are just dumb payload - stored, not used by in/[]
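
A short sketch of these points, with made-up keys and values: the keys are str or int, the values are of various types, and printing shows the keys in the order they were added rather than sorted.

```python
d = {}
d['zebra'] = [1, 2, 3]      # value can be a list
d[42] = 'forty-two'         # key can be an int
d['apple'] = 3.5            # value can be a float

# Keys appear in the order added: 'zebra', 42, 'apple'
print(d)

# The "in" check looks only at keys, never at the values.
print(42 in d)              # True
print('forty-two' in d)     # False
```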

"in" Guard Pattern

Very often see "in" checks just before key access
Accessing meals['dinner'] is an error if 'dinner' is not in the dict
Therefore: check 'dinner' in meals first
Only access meals['dinner'] when the key is present
The and on the following line does this, only proceeding when in is True:
if 'dinner' in meals and meals['dinner'] == 'candy':
Aka "short circuiting" of boolean expressions

Key and Value - Different Roles

Note that get/set/in are all by key
The key is the control, value is just dumb payload
Could say key/value are asymmetric, having specialized roles
YES set by key: d['a'] = 'alpha'
YES get by key: d['a'] -> 'alpha'
YES in check of key: 'a' in d -> True
NO, in check by value: 'alpha' in d
Does not work
get/set/in all work with key

Dict vs. List - Keys

Dict and List both remember things
What's the difference?
Keys!
The "keys" for list are always index numbers
0, 1, 2, 3, ... len-1
The "keys" for dict, you choose!
Any string or int etc. works as a key
Can set the keys in any order
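
A side-by-side sketch of the difference, using made-up values. The list is stuck with index numbers starting at 0, while the dict takes whatever keys the code chooses.

```python
lst = ['alpha', 'beta', 'gamma']
print(lst[1])              # list "key" is always an index number: 0, 1, 2

d = {}
d['g'] = 'gamma'           # dict keys are chosen by the programmer
d['a'] = 'alpha'           # and can be set in any order
d[500] = 'five hundred'    # an int key does not need to be 0, 1, 2, ...
print(d['a'])              # alpha
print(d[500])              # five hundred
```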

Dict-Count Algorithm

Extremely important dict algorithm pattern
(Read: we'll use it a lot)
A "counts" dict:
We have some big data set
Store a key for each distinct value in the data
The value for each key is count of occurrences of that key in the data
e.g. strs: 'a', 'b', 'a', 'c', 'b'
Compute output "counts" dict: {'a': 2, 'b': 2, 'c': 1}

Dict Count Code Examples

> dict2 Count exercises

Dict-Count Algorithm Steps

1. Start with empty dict counts = {}
2. For each str, test: not seen before?
3. Not seen before: store key = str, value = 1
4. Seen before: key = str, value = value + 1

Dict-Count abacb

Go through these strs
strs = ['a', 'b', 'a', 'c', 'b']

Sketch out counts dict here:

Counts dict ends up as {'a': 2, 'b': 2, 'c': 1}:

Drawing: counts dict with 'a' -> 2, 'b' -> 2, 'c' -> 1

strs: 'a', 'b', 'a', 'c', 'b'
Each distinct str is a key in the dict
The value for each key is the number of times it is seen
Algorithm: loop through all s, update dict with counts as we go
Each s: seen this before or not?
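
To watch the counts dict build up step by step, here is a small trace sketch. It follows the four algorithm steps listed earlier and prints the dict after each str. The print line is added only for the trace.

```python
strs = ['a', 'b', 'a', 'c', 'b']
counts = {}
for s in strs:
    if s not in counts:
        counts[s] = 1      # not seen before
    else:
        counts[s] += 1     # seen before
    print(s, counts)       # trace: state of the dict after this s
```

Running it prints one line per str:

```
a {'a': 1}
b {'a': 1, 'b': 1}
a {'a': 2, 'b': 1}
c {'a': 2, 'b': 1, 'c': 1}
b {'a': 2, 'b': 2, 'c': 1}
```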

1. str-count1() - if/else

> str_count1()

str_count1 demo, canonical dict-count algorithm

Central test of this algorithm: not seen before?
if/else solution
Test: not seen before?
not seen before: counts[s] = 1
seen before: counts[s] += 1
This if/else approach is fine, but we'll see another way below
Demo: write code on board, then fix in next step

str_count1() Solution

def str_count1(strs):
    counts = {}
    for s in strs:
        # s not seen before?
        if s not in counts:
            counts[s] = 1   # first time
        else:
            counts[s] += 1   # every later time
    return counts

2. str-count2() - Unified/Invariant Version, no else

> str_count2()

A slight unified/invariant improvement on the above code
Same central test: not seen s before?
If not seen before - set to zero - aka "fix" dict for that s
counts[s] = 0
With fix done, following line works for all cases:
counts[s] += 1
"Unified" - 1 line works for all cases ("invariant")
I have a very slight preference for this version
It's one line shorter and does not use else
All counting goes through that one += 1 line

Standard Dict-Count Code - Unified/Invariant Version

def str_count2(strs):
    counts = {}
    for s in strs:
        # fix counts/s if not seen before
        if s not in counts:
            counts[s] = 0
        # Unified: now s is in counts one way or
        # another, so this works for all cases:
        counts[s] += 1
    return counts

Int Count - Exercise

> int_count()

Apply the dict-count algorithm to a list of int values, return a counts dict, counting how many times each int value appears in the list.

Char Count - Exercise

> char_count()

Apply the dict-count algorithm to chars in a string. Build a counts dict of how many times each char, converted to lowercase, appears in a string so 'Coffee' returns {'c': 1, 'o': 1, 'f': 2, 'e': 2}.
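
One possible sketch of this exercise, not necessarily the official solution (try it first). It is the unified dict-count pattern with one extra step: each char is lowercased before it is used as the key. The variable name `low` is an arbitrary choice.

```python
def char_count(s):
    counts = {}
    for ch in s:
        low = ch.lower()       # count 'C' and 'c' under the same key
        # fix counts/low if not seen before
        if low not in counts:
            counts[low] = 0
        counts[low] += 1
    return counts


print(char_count('Coffee'))    # {'c': 1, 'o': 1, 'f': 2, 'e': 2}
```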
