Today: dict output, wordcount.py example program, list functions, List patterns, state-machine pattern
An "exception" in Python halts the program with an error message and notes the line number. You have seen these many, many times. It's possible to write an exception handler which catches the exception and takes some action, but most programs do not do that. By far the most common strategy is that an exception simply halts the program with its error message. This is a fine strategy and what we will do for CS106A.
The last line of the error message describes the specific problem, and the "traceback" lines above give context about the series of function calls / line-numbers which led to the error. Generally, just look at the last couple of lines to see the error and the line of code where it occurred. We can prompt an exception easily enough with some bad code in the interpreter:
>>> s = 'Hello'
>>> s[9]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
IndexError: string index out of range
>>>
Python's types like string and list raise exceptions when they are fed bad data, as in the example above. Your code can raise an exception on its own, too.
If the program's input data has a problem so the computation cannot continue, the best strategy is raising an exception with an error message to halt the program at that point. The next step is up to whoever is running the program. Perhaps they typed the filename wrong, say. The line raise Exception(message) does this, like this:
>>> raise Exception('This went wrong')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
Exception: This went wrong
Python has a taxonomy of different sorts of exceptions that code can raise, but the above is the simplest and that's what we'll do for HW6.
Some programming systems would not halt when the data was bad, instead trying to stumble forward, pretending that the missing data was the empty string or whatever to see if that would work. This turned out to be a bad strategy, as it hid the underlying error. Imagine debugging that system, where input a is wrong, but the program stumbles forward to fail with bad data b a few lines later. That's harder to debug, as the underlying issue is obscured. So the best practice is: just halt with a real error message right where the bad data is detected, and that's what we do for HW6.
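As a small sketch of this halt-on-bad-data strategy, the function below raises an exception as soon as it sees a bad value, rather than substituting a default and stumbling forward. The name parse_count and its error message are made up for illustration; they are not part of the homework code.

```python
def parse_count(text):
    """
    Given a string that should hold a non-negative int,
    return it as an int. Raise an Exception right away
    if the data is bad, rather than guessing a value.
    (Hypothetical helper, for illustration only.)
    """
    if not text.isdigit():
        # Halt here, right where the bad data is detected
        raise Exception('Bad count in input: ' + repr(text))
    return int(text)
```

Calling parse_count('42') returns 42, while parse_count('4x2') halts with the error message, pointing at the bad data itself instead of at some later line.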
Thus far examples look like this, loading and organizing the data in the dict:
counts = {}
for line in f:
    ...
    counts[xxx] = yyy
    ...
What about getting data out of the dict?
Wordcount example below - show the full load-up and print-out lifecycle.
dict.keys()
>>> # Load up dict
>>> d = {}
>>> d['a'] = 'alpha'
>>> d['g'] = 'gamma'
>>> d['b'] = 'beta'
>>>
>>> # d.keys() - list-like "iterable" of keys,
>>> # loop over keys to see all of dict
>>> d.keys()
dict_keys(['a', 'g', 'b'])
>>>
>>> # d.values() - not used as often
>>> d.values()
dict_values(['alpha', 'gamma', 'beta'])
for key in d.keys():
Say we want to print the contents of a dict. Loop over d.keys() to see every key, and look up the value for each key. This works fine and accesses all of the dict. The only problem is that the keys are not in sorted order; they appear in whatever order they were added to the dict.
>>> d = {'a': 'alpha', 'g': 'gamma', 'b': 'beta'}
>>> for key in d.keys():
...     print(key, '->', d[key])
...
a -> alpha
g -> gamma
b -> beta
sorted(lst)
>>> nums = [5, 2, 7, 3, 1]
>>> sorted(nums)
[1, 2, 3, 5, 7]
>>>
>>> strs = ['banana', 'alpha', 'donut', 'carrot']
>>> sorted(strs)
['alpha', 'banana', 'carrot', 'donut']
sorted(d.keys())
>>> d.keys()           # unsorted order - not pretty
dict_keys(['a', 'g', 'b'])
>>>
>>> sorted(d.keys())   # sorted order - nice
['a', 'b', 'g']
>>>
>>> for key in sorted(d.keys()):
...     print(key, '->', d[key])
...
a -> alpha
b -> beta
g -> gamma
for key in d:
>>> d = {'a': 'alpha', 'g': 'gamma', 'b': 'beta'}
>>> for key in sorted(d):
...     print(key)
...
a
b
g
list(xxx)
The d.keys() is not exactly a list. You can loop over it and take len(), but square bracket [ ] does not work. If you have a list-like and need an actual list, you can form one with list() as below. Typically this is not needed for CS106A, as looping is good enough.
>>> # list-like: loop and len() work
>>> d.keys()
dict_keys(['a', 'g', 'b'])  # list-like
>>> len(d.keys())
3
>>>
>>> d.keys()[2]             # [ ] does not work
TypeError: 'dict_keys' object is not subscriptable
>>>
>>> strs = list(d.keys())   # make real list
>>> strs
['a', 'g', 'b']             # now [ ] works
>>> strs[2]
'b'
>>>
The wordcount program below reads in a text, separates out all the words, builds a count dict to count how often each word appears, and finally produces a report with all the words in alphabetical order, each with its count, like this:
$ python3 wordcount.py somefile.txt
aardvark 1
anvil 3
ban 1
be 19
boat 4
...
The program loads up a dictionary to count the words in the file, and then produces an alphabetic order list of each word with its count. In a later lecture, we'll look at other features, such as how the data flows between main() and the other functions.
The file **redblue.txt** has punctuation added to our old poem, so we can see how wordcount.py cleans up each word for counting. The file **alice-book.txt** has the whole text of the book Alice in Wonderland.
$ cat redblue.txt
Roses are red
Violets -are- blue
"RED" BLUE.
$
$ python3 wordcount.py redblue.txt
are 2
blue 2
red 2
roses 1
violets 1
$
$ python3 wordcount.py alice-book.txt  # whole book
...lots...
yourself 10
youth 6
zealand 1
zigzag 1
$
This is the core of the program. It reads the text of the file and splits it into individual words. It converts each word to a clean, lowercase form. It builds and returns a counts dict, counting how many times each word occurs. The code is below, and explanations of sub-parts follow afterwards.
def read_counts(filename):
    """
    Given filename, reads its text, splits it into words.
    Returns a "counts" dict where each word
    is the key and its value is the int count
    number of times it appears in the text.
    Converts each word to a "clean", lowercase
    version of that word.
    >>> read_counts('test1.txt')
    {'a': 2, 'b': 2}
    >>> read_counts('test2.txt')
    {'b': 1, 'a': 2}
    >>> read_counts('test3.txt')
    {'bob': 1}
    """
    with open(filename) as f:
        text = f.read()        # read file as string
    words = text.split()       # splits on whitespace

    counts = {}
    for word in words:
        word = word.lower()
        cleaned = clean(word)  # style: call fn once, store in var
        if cleaned != '':      # subtle - cleaning may leave only ''
            if cleaned not in counts:
                counts[cleaned] = 0
            counts[cleaned] += 1
    return counts
The most common form of reading is line by line: for line in f:
Alternative - read file into string in one step: text = f.read()
This reads the whole contents of the file into a single string. This is easier than going line by line, since we get the whole text in one step. It makes the most sense if we do not need to handle each line of data on its own. Minor point: this requires enough RAM to hold all the bytes of the file, while for-line-in-f uses much less RAM.
>>> with open('poem.txt') as f:
...     text = f.read()
...
>>> text
'Roses are red\nViolets are blue\nThis does not rhyme\n'
>>>
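To compare the two reading styles side by side, here is a sketch that writes a small temporary file and then reads it both ways. The file contents and the use of the tempfile module are assumptions made just for this demo.

```python
import os
import tempfile

# Write a small demo file (contents made up for this sketch)
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False) as tmp:
    tmp.write('Roses are red\nViolets are blue\n')
    path = tmp.name

# Style 1: line by line - only one line is held at a time
lines = []
with open(path) as f:
    for line in f:
        lines.append(line)

# Style 2: whole file read into one string
with open(path) as f:
    text = f.read()

os.remove(path)  # clean up the demo file

print(len(lines))  # number of lines in the file
print(len(text))   # number of chars in the whole file
```

Both styles see the same chars; the difference is whether the code handles the text one line at a time or all at once.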
For more detail, see the Guide chapter: File Read and Write
s.split() Trick
Normally we split like this: parts = line.split(',')
However, calling s.split() with no parameters within the parentheses performs a special "whitespace" split, looking for chars like space and newline to separate the text into pieces.
>>> s = 'Line1 is here\nThis-be -line- 2\n'
>>> s.split()
['Line1', 'is', 'here', 'This-be', '-line-', '2']
It doesn't have any knowledge of language to separate the "words" exactly. It just separates where there is one or more whitespace char, which is good enough.
The clean(s) function is used to clean punctuation from the edges of words, e.g. given '--woot!' it extracts just 'woot'. It is written as a black-box function with Doctests, of course! The counting code uses this to clean up each word pulled from the file.
clean('--woot!') -> 'woot'
clean('red.') -> 'red'
Look at source code and Doctests of clean() in wordcount.py
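For readers without wordcount.py at hand, here is one way such a function could be written. This is a sketch under the assumption that "clean" means stripping non-alphanumeric chars from both ends; the actual clean() in wordcount.py may be written differently.

```python
def clean(s):
    """
    Sketch: return s with non-alphanumeric chars
    removed from both ends. May return ''.
    (An assumed implementation, not the course's code.)
    >>> clean('--woot!')
    'woot'
    >>> clean('red.')
    'red'
    >>> clean('--')
    ''
    """
    # Advance start past leading non-alphanumeric chars
    start = 0
    while start < len(s) and not s[start].isalnum():
        start += 1
    # Move end back past trailing non-alphanumeric chars
    end = len(s)
    while end > start and not s[end - 1].isalnum():
        end -= 1
    return s[start:end]
```

Note that a word made entirely of punctuation cleans down to the empty string, which is why the counting code checks for '' before counting.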
In the print_counts() function, the counts dict is passed in and the code uses the standard v2 sorted-keys print code seen above. This prints out all the words and their counts, one per line, in alphabetical order. This code is what produces the alphabetized output above.
def print_counts(counts):
    """
    Given counts dict, print out each word and count
    one per line in alphabetical order, like this
    aardvark 1
    apple 13
    ...
    """
    for word in sorted(counts.keys()):
        print(word, counts[word])
Try more realistic files. Try the file alice-book.txt, the full text of Alice in Wonderland, 27,000 words. Time the run of the program with the command line "time" command to see if the dict/hash-table is as fast as they say (Windows command shown below). The second run will be a little faster, as the file is cached by the operating system.
$ time python3 wordcount.py alice-book.txt
...
...
youth 6
zealand 1
zigzag 1

real    0m0.103s
user    0m0.079s
sys     0m0.019s
Here "real 0.103s" means regular clock time, 0.103 of a second, aka 103 milliseconds, aka about a tenth of a second elapsed to run this command.
Note in Windows, you need the "Powershell" terminal, not the more primitive terminal PyCharm may be set for. Here are instructions for enabling PowerShell.
Windows PowerShell equivalent to "time" the run of a command:
$ Measure-Command { py wordcount.py alice-book.txt }
Let's try it with the book A Tale of Two Cities which is 133,000 words.
$ time python3 wordcount.py tale-of-two-cities.txt
... lots of printing ...
zealous 2

real    0m0.122s
user    0m0.083s
sys     0m0.020s
$
So that takes 0.12 seconds. There are about 133,000 words in the Tale of Two Cities. How many accesses to the dict are there for each word, conservatively:
if word not in counts:   # 1 dict "in"
    counts[word] = 0     # (not counting this one)
counts[word] += 1        # 1 dict get, 1 dict set
Each word hits the dict at least 3 times: 1x "in", then don't count the possible = 0, then 1 get and 1 set for the +=. So how long does each dict access take?
>>> 0.12 / (133000 * 3)
3.007518796992481e-07
Ten to the -7 is a tenth of a millionth, so with our back-of-envelope math here, the dict is taking 3/10 of a millionth of a second per dict access. In reality it's faster than that, as we are not separating out the time for the file reading, splitting, and word-cleaning which went into the 0.12 seconds. Nonetheless the basic claim about dicts holds here: the dict is very fast accessing per key, even if the number of keys is large. In CS106B, you look at the internals of the dictionary more closely.
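To check the estimate more directly, the sketch below times the same three dict accesses in a loop, without any file reading or word cleaning. The key names and the count of 100,000 keys are made up for the experiment, and the measured time still includes building each key string, so treat the printed number as a rough upper bound that varies by machine.

```python
import time

NUM_WORDS = 100000  # made-up size for this experiment

# Build a counts dict with many keys
counts = {}
for i in range(NUM_WORDS):
    counts['word' + str(i)] = 0

# Time the same 3 accesses per word used in the estimate
start = time.perf_counter()
for i in range(NUM_WORDS):
    word = 'word' + str(i)
    if word not in counts:   # 1 dict "in"
        counts[word] = 0
    counts[word] += 1        # 1 dict get, 1 dict set
elapsed = time.perf_counter() - start

per_access = elapsed / (NUM_WORDS * 3)
print('seconds per dict access:', per_access)
```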
See also Python guide Lists
We'll call the basic list features we've used so far the 1.0 features - you can get quite far with just those.
>>> nums = []
>>> nums.append(1)
>>> nums.append(0)
>>> nums.append(6)
>>>
>>> nums
[1, 0, 6]
>>>
>>> 6 in nums
True
>>> 5 in nums
False
>>> 5 not in nums
True
>>>
>>> nums.index(6)
2
>>> nums[0]
1
>>>
>>> for n in nums:
...     print(n)
...
1
0
6
>>> lst = ['a', 'b', 'c']
>>> lst2 = lst[1:]   # slice without first elem
>>> lst2
['b', 'c']
>>> lst
['a', 'b', 'c']
>>> lst3 = lst[:]    # copy whole list
>>> lst3
['a', 'b', 'c']
>>> # can prove lst3 is a copy, modify lst
>>> lst[0] = 'xxx'
>>> lst
['xxx', 'b', 'c']
>>> lst3
['a', 'b', 'c']
Now we'll look at some functions that are related to lists, and we will use all of these.
>>> nums = [45, 100, 2, 12]
>>> sorted(nums)            # numeric
[2, 12, 45, 100]
>>>
>>> nums                    # original unchanged
[45, 100, 2, 12]
>>>
>>> sorted(nums, reverse=True)
[100, 45, 12, 2]
>>>
>>> strs = ['banana', 'apple', 'donut', 'arple']
>>> sorted(strs)            # alphabetic
['apple', 'arple', 'banana', 'donut']
>>>
>>> min([1, 3, 2])
1
>>> max([1, 3, 2])
3
>>> min([1])      # len-1 works
1
>>> min([])       # len-0 is an error
ValueError: min() arg is an empty sequence
>>>
>>> min(['banana', 'apple', 'zebra'])  # strs work too
'apple'
>>> max(['banana', 'apple', 'zebra'])
'zebra'
>>>
>>> min(1, 3, 2)  # w/o list form
1
>>> max(1, 3, 2)
3
Compute the sum of a collection of ints or floats, like +.
>>> nums = [1, 2, 1, 5]
>>> sum(nums)
9
Strategy: prefer using Python built-ins to writing the code yourself
Look at the "listpat" exercises on the experimental server
> listpat exercises. This section starts with basic "accumulate" pattern problems. The later problems require more sophisticated state-machine solutions.
Many functions we've done before actually fit the state-machine pattern. Start the state variable as empty, += in the loop. This is known as the "accumulate" pattern: start a variable empty, build up the answer there.
# 1. init state before loop
result = ''

loop:
    ....
    # 2. update in the loop
    if xxx:
        result += yyy

# 3. Use state to compute result
return result
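As a concrete instance of the outline above, here is a small sketch. The function only_digits is a made-up example, not one of the listpat exercises; the comments mark the three steps of the pattern.

```python
def only_digits(s):
    """
    Return a string of just the digit chars in s.
    (Hypothetical example of the accumulate pattern.)
    >>> only_digits('a1b22')
    '122'
    """
    result = ''            # 1. init state before loop
    for ch in s:
        if ch.isdigit():   # 2. update in the loop
            result += ch
    return result          # 3. use state to compute result
```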
Use the state-machine strategy outlined below to solve something a little more interesting.
> min()
The style "len rule": we have Python built-in functions like len() min() max() list(). Avoid creating a variable with the same name as an important function, like "min" or "list". This is why our solution uses "best" as the variable to keep track of the smallest value seen so far instead of "min".
def min(nums):
    # best tracks smallest value seen so far.
    # Compare each element to it.
    best = nums[0]
    for num in nums:
        if num < best:
            best = num
    return best
If we think about it carefully, we could loop over nums[1:] to avoid one comparison, but that extra complication is not worthwhile.
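For the curious, the nums[1:] variation could look like the sketch below. The name min_skip_first is made up here to avoid clashing with the exercise name. Note that the slice builds a copy of nearly the whole list, so saving one comparison is not free.

```python
def min_skip_first(nums):
    # best starts as the first element, so the
    # loop only needs to look at the rest.
    best = nums[0]
    for num in nums[1:]:   # slice copies all but the first elem
        if num < best:
            best = num
    return best
```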
Say '@..!' sections in a string should be changed to uppercase, like this:
'This code @has no bugs! probably' -> 'This code HAS NO BUGS probably'
'I @am hungry! right @now!' -> 'I AM HUNGRY right NOW'
1. Have a boolean variable up_mode - True if changing chars to uppercase, False otherwise. Init to False.
2. When seeing a '@' or '!', change up_mode to True or False as appropriate
3. When processing a regular ch, look at up_mode to see what to do
4. Use an if/elif structure to look for '@' '!' or regular char
up_mode:  FFFFFFFFFFTTTTTTTTTTTTTFFFFFFFFF
         'This code @has no bugs! probably'
def upper_code(s):
    result = ''
    up_mode = False  # State variable
    for ch in s:
        # Detect: @, !, regular char
        if ch == '@':
            up_mode = True
        elif ch == '!':
            up_mode = False
        else:
            if up_mode:
                result += ch.upper()
            else:
                result += ch
    return result
Say we have a code where most of the chars in s are garbage. Except each time there is a digit in s, the next char goes in the output. Maybe you could use this to keep your parents out of your text messages in high school.
'xxyy9H%vvv%2i%t6!' -> 'Hi!'
I can imagine writing a while loop to find the first digit, then taking the next char, .. then the while loop again ... ugh.
take_next Var
Have a boolean variable take_next which is True if the next char should be taken (i.e. the char of the next iteration of the loop) and False otherwise.
Write a nice, plain loop through all the chars. Set take_next to True when you see a digit. For each char, look at take_next to see if it should be taken. The exact details of the code in the loop are unusually tricky.
This is such a nice approach vs. trying to solve it with a bunch of while loops.
Type in some code that is an attempt. Run it, see the output, work from there. Compared to most problems, I think this problem is easiest to debug by looking at the wrong output. Put some code in there, run it, and go from there.
You could solve this using index numbers and -1. However, it's worth working out this state-machine approach which does not rely on index numbers at all.
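For comparison, the index-based alternative mentioned above could look like the sketch below. The name digit_decode_index is made up for illustration. It loops over index numbers and looks back at index i - 1 to decide whether to take the char at i.

```python
def digit_decode_index(s):
    result = ''
    # Start at 1 so that s[i - 1] is always a valid index
    for i in range(1, len(s)):
        if s[i - 1].isdigit():
            result += s[i]
    return result
```

This works, but the loop has to be careful about where the index range starts, which is exactly the sort of detail the state-machine approach avoids.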
def digit_decode(s):
    result = ''
    take_next = False
    for ch in s:
        if take_next:
            result += ch
            take_next = False
        if ch.isdigit():
            take_next = True
            # Set take_next at the bottom of the
            # loop, taking effect on the next char
            # at the top of the loop.
    return result
Could set take_next for both True and False cases with an if/else structure vs. setting to False in the upper if.
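The if/else variation described above might look like this sketch (digit_decode2 is a made-up name). Here take_next is set on every iteration, to True or False, so the upper if no longer needs to reset it.

```python
def digit_decode2(s):
    result = ''
    take_next = False
    for ch in s:
        if take_next:
            result += ch
        # Set take_next for the next iteration,
        # covering both the True and False cases.
        if ch.isdigit():
            take_next = True
        else:
            take_next = False
    return result
```

The two versions should produce the same output; which one reads more clearly is a matter of taste.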
Previous pattern:
# 1. Init with not-in-list value
previous = None
for elem in lst:
    # 2. Use elem and previous in loop

    # 3. last line in loop:
    previous = elem
In the "previous" strategy, the previous variable points to None, or some other chosen init value, for the first iteration of the loop. For later iterations, the previous variable lags one behind, pointing to the value from the previous iteration.
count_dups(): Given a list of numbers, count how many "duplicates" there are in the list - a number the same as the value immediately before it in the list. Use a "previous" variable.
The init value just needs to be some harmless value such that the == test will be False. None often works for this.
def count_dups(nums):
    count = 0
    previous = None  # init
    for num in nums:
        if num == previous:
            count += 1
        previous = num  # set for next loop
    return count
The project uses the Social Security Administration's (SSA) baby names data set, covering babies born in the US going back more than 100 years. This part of the project will load and organize the data. Part-b of the project will build out interactive code that displays the data.
New York Times: Where Have All The Lisas Gone. This is the article that gave Nick the idea to create this assignment way back when.
This is an endlessly interesting data set to look through: john and mary, jennifer, ethel and emily, trinity and bella and dawson, blanche and stella and stanley, michael and miguel.
Optional more state-machine practice for fun
'%abc^Yxyz' -> '%xxx^yyyy'
'AaBcdEf' -> 'aabbbee'
Given a string s. Return a version of s where every alphabetic char is replaced by 'x'. In addition, for each uppercase alphabetic char, e.g. 'A', replace the later alphabetic chars with its lowercase form, e.g. replace with 'a' instead of 'x'.
def alpha_replace(s):
    result = ''
    replace = 'x'  # Replace alphas with this
    for ch in s:
        if ch.isupper():  # Update replace
            replace = ch.lower()
        if ch.isalpha():
            result += replace
        else:
            result += ch
    return result
A neat example of a state-machine approach. Optional for later.
The "hat" code is a more complex way to hide some text inside some other text. The string s is mostly made of garbage chars to ignore. However, '^' marks the beginning of actual message chars, and '.' marks their end. Grab the chars between the '^' and the '.', ignoring the others:
'xx^Ya.xx^y!.bb' -> 'Yay!'
Solve using a state-variable "copying" which is True when chars should be copied to the output and False when they should be ignored. Strategy idea: (1) write code to set the copying variable in the loop. (2) write code that looks at the copying variable to either add chars to the result or ignore them.
There is a very subtle issue about where the '^' and '.' checks go in the loop. Write the code the first way you can think of, setting copying to True and False when seeing the appropriate chars. Run the code, even if it's not going to be perfect. If it's not right (very common!), look at the output you got. Why are extra chars in there? How could the loop be rearranged to fix it?
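After trying it yourself, compare with the sketch below. It shows one possible arrangement, not necessarily the official solution, and the name hat_decode is made up here. The key choice is the if/elif chain: the '^' and '.' chars are handled first, so they only change the state and are never themselves copied to the result.

```python
def hat_decode(s):
    result = ''
    copying = False  # State variable
    for ch in s:
        if ch == '^':
            copying = True    # marker char, not copied
        elif ch == '.':
            copying = False   # marker char, not copied
        elif copying:
            result += ch      # regular char inside ^ .. .
    return result
```

With two separate if statements instead of the elif chain, a marker char can fall through and get copied, which is a common source of the extra chars in the output.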