Python Language Guide

Introduction to the main Python language features.
Nick Parlante

Contents

Python

Python is an extremely popular computer language for solving many types of problems. Compared to other languages Python is in the "programmer efficient" niche - allowing the programmer to specify what the want pretty easily, perhaps at the cost of running a little slower or using more memory.

Python is distributed for free as open source software, and the central web site for Python information is python.org. Python was created and for many years was run by Guido Van Rossum, who is kind of a rock star for creating such a world-influencing technology.

Python is cross-platform, meaning a python program developed on Mac OS X, can likely also run on Windows or Linux without any change to the code.

Python version 3, used here, made some changes vs. Python 2.x, in particular changing strings to be unicode, and the print() function to be a function. For the most part, Python 2.x code looks very similar to Python 3.x code. All discussion here is Python 3.

Python Variable model

A variable in Python is like a named box that holds a reference to a value. Variables are assigned a value using a single equal sign =. Assigning to a variable creates the variable if necessary (some languages make the programmer pre-declare their variables).

x = 10
y = 'hello'
y = 'bye'

variable x refers to 4, variable y refers to 'bye'

Assigning a value into a variable overwrites any previous reference stored in that variable, as shown with the variable named "y" above. Values in Python store their type, so in the example above the number 10 is tagged with the type "int", and the string 'hello' is tagged with the type "str".

In Python, = assignment to a variable always works in a shallow way, not copying any data, just rearranging references to data. The line above y = 'bye' does not change the string data on the right. It only moves the arrows around on the left.

Memory - Garbage

Using = with a variable, sets that variable to "point to" that value in memory. Using = again sets the variable to point to something else, leaving the original value in memory. If nothing refers to a value in memory (no arrow to it), it is called "garbage". Eventually the bytes of such garbage values are recycled to hold new values, called "garbage collection".

x = 'Python'
y = x + '!'
y = 'Yay'

x is 'Python' y is 'Yay'

The above sequence works fine. The intermediate string 'Python!' is left as garbage. Many algorithms throw off some garbage as they go along, computing their final answer.

In some languages, garbage-collection is a task for the programmer - ugh! In modern languages like Python, Java, and Javascript, it's done automatically by the language.

Numbers - int and float

Surprisingly, there are two distinct types of numbers for doing arithmetic in a computer - "int" for whole integer numbers, and "float" for numbers like 3.14, such as on a calculator.

Int Type

The Python "int" type represents integer values. Addition, subtraction, and multiplication work the usual way with + - *.

>>> 1 + 2 * 3
7
>>> 10 - 2 * 3
4
>>> (10 - 2) * 3
24

Multiplication and division have a higher "precedence" than addition and subtraction, so they evaluated first in an expression. After accounting for precedence, the arithmetic is done left-to-right. Add in parenthesis to change the order of evaluation.

The ** operator does exponentiation, so 3 ** 2 is 32

>>> 3 ** 2
9
>>> 2 ** 10
1024

Unusually, Python int values do not have a maximum. Python allocates more and more bytes to store the int as it gets larger. The number of particles in the universe when I was in school was thought to be about 2100, playing the role of handy very-large-number. In Python, we can write an expression with that number and it just works.

>>> 2 ** 100
1267650600228229401496703205376
>>> 2 ** 100 + 1
1267650600228229401496703205377

Memory use approximation: int values of 256 or less are stored in a special way that uses very few bytes. Other ints take up about 24 bytes apiece.

Int Division //

Division is tricky with ints, since it takes in two ints but does not necessarily yield an int - e.g. 5 / 2 is 2.5. You should only use "/" if you want a float value.

Many times an algorithm makes the most sense if all of the values are kept as ints. Therefore (Python v3), Python has the // operator which does "int" division - meaning it rounds-down to the next int.

>>> 5 / 2    # "/" yields a float, not what we wanted
2.5
>>> 5 // 2   # "//" rounds down to int
2
>>> 4 // 2
2
>>> 10 // 3
3

Int Mod %

The "modulus" or "mod" operator % is essentially the remainder after division. So 11 % 5 is 1 - if we divide 11 by 5, 1 is the leftover remainder.

>>> 11 % 5
1
>>> 12 % 5
2
>>> 15 % 5
0
>>> 1 % 5
1
>>> 10 % 5
0
>>> 10 % 0
Traceback (most recent call last):
ZeroDivisionError: integer division or modulo by zero

The modulus is 0, it means the division can out evenly, like (15 % 5) above. The best practice is to only think about mod with non-negative numbers. Modding by 0 is an error, just like dividing by 0.

Float Type

Floating point numbers are used to do math with real quantities, such as a velocity or angle. The regular math operators + - * / ** all work with floats, producing a float result. If an expression mixes some int values and some float values, the math is converted to float - a one-way street called "promotion" to float.

>>> 1.0 + 2.0 * 3.0
7.0
>>> 
>>> 1.0 + 2 * 3
7.0
>>> 
>>> 2.1 ** 2
4.41

Famously, floating point numbers have a tiny error term that builds up way off 15 digits or so to the right. Mostly this error is not shown when the float value is printed, but it will appear sometimes. This is an intrinsic limitation of floating point values in the computer. (Perhaps also why CS people are drawn to do their algorithms with int.)

>>> 0.1 + 0.1
0.2
>>> 0.1 + 0.1 + 0.1
0.30000000000000004

Therefore, do not use == with float values, which would be confused by the error term. To compare float values, subtract them and compare the absolute value of the difference to some small delta.

Memory use approximation: float values take up about 24 bytes apiece.

Builtin Functions

There are more than 50 built-in functions in python, the full list is here: https://docs.python.org/3/library/functions.html

Here are the most commonly used functions (we'll add to this list later):

Convert to type: int() str() list() bool() float()
e.g. int('600') - > 600
e.g. range(10) does not print usefully, use list():
list(range(10)) -> [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

Collection of elements (like str or list): len(), in, [ ]
-len(lst)
in - check if a value is in a collection
-e.g. 6 in [1, 2, 3, 4, 5, 6, 7] -> True
Works uniformly for many collections we'll see (as len() is uniform)

meta: type('hello') - returns the type of value parameter: int, str, float, list, ...
Typically you do not call this yourself, but the system uses it on your behalf

min(), max(),
-Return the min or max value according to < or >
-Take either a series of parameters separated by commas, or an iterable
-e.g. min(4, 5, 1) -> 1
-e.g. min([1, 2, 3]) -> 1
-e.g. min(['banana', 'apple']) -> 'apple' # < applied to strings
-Takes an optional key= parameter (see sorting/lambda section)

Math: round(n), abs(n) absolute value, pow(a, b) a to the b power
-pow can be written as "a ** b"
-Later we'll see the math module with lots of math functions

Python if - de-facto True/False

The Python if statement looks like this

if test:
    print('test is true!')

Any value can be used in the test expression - string, int, float, list, ...

Python has a de-facto true/false system, where all the "empty" values count as de-facto False: '', None, 0, 0.0, empty-list, empty-dict

Any other value de-facto counts as True: a non-empty string, a non-zero int, a non-empty list. Many languages use this anything-empty-is-false system. The bool() function takes any value and returns a formal bool False/True value, so that's a way to see the empty=False interpretation:

>>> bool(0)
False
>>> bool(0.0)
False
>>> bool('')   # empty string - False
False
>>> bool([])   # empty list - False
False
>>> bool(None)
False
>>> bool(6)       # non-zero int - True
True
>>> bool('hi')    # non-empty string - True
True
>>> bool([1, 2])  # list of something - True
True

If test = vs. ==

if a = 6:       # syntax error, meant ==
    print('hi')

The if-test requires some value. It's easy to accidentally type in = instead of == in an if-test. Fortunately in Python an expression like a = 6 does not represent any value itself, so the if will generate a syntax error, flagging the mistake. In the language C, typing this sort of mistake just silently runs incorrectly, providing many, many, many hours of debugging fun.

Do Not Write x == True:

Suppose some foo() function is supposed to return True or False. Do not write an if-test like this:

    if foo() == True:   # NO NO NO == True
        print('yay')

Instead, let the if/while take the value directly like this:

    if foo():           # YES this way
        print('yay')

    
    # Or this way for False
    if not foo(): 
        print('not yay')

The reason is that the flow of data from foo() could easily use the de-facto convention, like 1 for True and 0 for False. That would not be an ideal design, but Python makes that easy to do even by accident. It's better to not depend on precisely the values True and False, instead code that works for any de-facto True/False value.

Python if-else

The optional 'else:' part of an if provides code to run if the test if not True. Else is perfect if in the code we want to do exactly one of action-1 or action-2.

# set message according to score
if score > high_score:
    message = 'New high score!'
else:
    message = 'Oh well!

There's also 'elif' where a series of if-tests can be strung together (mnemonic: 'elif' is length 4, like 'else'):

if s == 'a':
    # a case
elif s == 'b':
    # b case
else:
    # catch-all case

Exactly 1 of the 3 cases will run. However, the logic of when each case runs can get very tangled. Combining if/else with and/or can make a deceptively complicated structure. What must be true for case c below? It is not transparent from the code.

if score > high and s != 'alice':
    # a
elif s == 'bob':
    # b
else:
    # c

Answer: c happens when s is not 'bob' but also (score <= high or s == 'Alice or both)

If/else chains are fine, just don't think they are trivial.

Alternative to if/else chains

Nick style idea: I only use else when it is truly needed. If the code can work using only a plain-if, I prefer to write it that way. This works extra well with decomposed out functions, where 'return' can be used to bail out for the first few cases, leaving the main case below without indentation, like this:

def foo(s):
    if len(s) < 4:
        return 'too small!'

    if len(s) > 100:
        return 'too big!'

    # rest of computation having screened out too-short and too-long
    # notice: no if-else structure, not indented down here
    ...

You can use else if you prefer, just thinking about possible alternative structure here.

Boolean Comparison Operators

The Python "bool" type (short for "boolean") has only two values - True and False. The most common way to get a boolean value is a comparison like > ...

4 > 2 → True
4 > 5 → False
4 > 4 → False
4 >= 4 → True

Comparison operators..

== : (2 equals signs together) equal, works for int, string, list, dict, .. most types. Not recommended to use with float values

!= : not-equal (uses == under the hood)

< <= : less-than, less-or-equal. These work for most types that have an ordering: numbers, strings, dates. For strings, < provides alphabetic ordering. Uppercase letters are ordered before lowercase. It is generally an error to compare different types, like this error: 4 < 'hello'

> >= : greater than greater-or-equal.

A custom defined type can provide its own definition of == by defining a function with the special name __eq__() and analogous functions define < etc.

Constructs like if test: can use any type - int, string, list .. - as as test value, not just boolean. See the if section for examples of this behavior.

Boolean And Or Not - Precedence

Boolean expressions like a > 10, can be combined with and or not, like this:

if a >= 0 and a <= 100:
    print('yay')

Python is unique in using plain words like "and" for this. many languages use "&&" for and, "||" for or.

a and b → True if both are True

a or b → True if one or the other or both are True

not a → Inverts True/False

There is "precedence" between and/or/not, analogous to arithmetic precedence, where "*" goes before "+". Precedence order: not is highest precedence, followed by and, followed by or. Mnemonic: not is like unary minus e.g. '-6', and is like *, or is like +.

Q: what does the following do:

if a < 6 and b < 6 or c < 6:

The "and" goes first (higher precedence), so the above is equivalent to this form with parenthesis added to show in what order the comparisons are evaluated:

if (a < 6 and b < 6) or c < 6:

To force the or to go first, put in parenthesis like this:

if a < 6 and (b < 6  or c < 6):

If you are unsure, you can always add parenthesis to an expression to clarify what order you want and/or/not to apply.

Boolean Short-Circuit

Suppose you have an int i, and you want to check if the char at that index is alphabetic .. but only if i is valid. You can write that this way...

if i < len(s) and s[i].isalpha():...

This works because the boolean evaluation goes left to right, stopping ("short-circuiting") as soon as it can. So in the example above, if i < len(s) is False (i.e. i is large), the whole expression evaluates to False. In particular the s[i].isalpha() is not evaluated. Attempting to evaluate it would be an error anyway, since i is too large, creating an IndexError. Writing the i<len(s) test to the left of the s[i] in effect protects it, and this is a common programming pattern. Most computer languages use short circuiting in the their boolean expressions like this.

Strings

A python string represents a sequence of unicode characters (this was an area of language-cleanup in the move to Python 3.0). Python strings are written between single or double quotes. Commonly strings are written on one line.

a = 'hello'
b = ''   # the empty string, length 0
c = "Python Spaces Ok"

A backslash \ in a string "escapes" a special char we wish to include in the string, such as a quote or \n newline. Common backslash escapes:

\'   # single quote
\"   # double quote
\\   # a backslash
\n   # newline char

A string using \n:

a = 'First line\nSecond line\nThird line\n'

Python strings can be written within triple ''' or """, in which case they can span multiple lines. This is useful for writing longer blocks of text.

a = """First line
Second line
Third line
"""

Each character is drawn from the unicode character set, which includes the "characters" or pretty much every language on earth. Also, common emoji have been added. Every unicode character is defined by a unicode "code point" which is basically a big int value that identifies that character. Unicode characters can be written using the "hex" version of their codepoint, e.g. "03A3" is the "Sigma" char Σ, and "2665" is the heart emoji char ♥.

Hexadecimal aside: Hex numbers uses the digits 0-9 plus the letters A-F to write a number like this: 7F9A or 7f9a. Two hex digits together like 9A or FF represent the value stored in one byte. Traditionally, unicode uses hex digits like this to write the 2 or 4 byte codepoint of a unicode char.

You can write a unicode char out in python with \u followed by 4 hex digits of its codepoint:

>>> s = 'hi \u03A3'
>>> s
'hi Σ'
>>> len(s)
4
>>> 'I \u2665'
'I ♥'

For more than 4-hex-digits in Python use \U (uppercase U) followed by 8 digits with leading 0's as needed, like the toilet emoji 1F6BD, and the inevitable 1F489.

>>> 'in the \U0001F6BD'
'in the 🚽'
>>> s = 'oh \U0001F489'
>>> len(s)
4

Not all computers have the ability to display all unicode chars, so the display of a string may fall back to something like \x0001F489 -- telling you the hex digits for the char, even though it can't draw it.

String Char Access [ ], Concatenate +

The len() function returns the length of a string. The len() function in Python is omnipresent - it's used to retrieve the length of every data type, with string just a first example. Chars are accessed with zero-based indexing with square brackets, so the valid index numbers are 0..len-1. Accessing a too large index number is an error. Strings are immutable, so they cannot be changed once created. Code to compute a different string always creates a new string in memory to represent the result (e.g. + below), leaving the original strings unchanged.

>>> s = 'Hello'
>>> len(s)
5
>>> s[0]
'H'
>>> s[4]
'o'
>>> s[5]
IndexError: string index out of range
>>> s[0] = 'x'   # no, string is immutable
TypeError: 'str' object does not support item assignment

The + combines (aka "concatenates") strings to make a bigger string. This creates new strings to represent the result, leaving the original strings unchanged.

>>> s1 = 'Hello'
>>> s2 = 'There'
>>> s3 = s1 + ' ' + s2
>>> s3
'Hello There'
>>> s1
'Hello'

Concatenate + only works with 2 or more strings, not for example to concatenate a string and an int. Call the function str(4) to make a string out of an int, then concatenation works.

>>> 'score:' + 6
TypeError: can only concatenate str (not "int") to str
>>> 'score:' + str(6)
'score:6'

String Search Functions

Here are the most useful Python string functions (mostly noun.verb form):

x in s - true if string x appears anywhere in s. Use find() below if you want to know the index where it appears, but in is great for simple testing.

>>> 'c' in 'abcd'
True
>>> 'aa' in 'XXaaZZ'  # test string can be any length
True

s.find(x) - searches s left to right, returns int index where string x appears, or -1 if not found. In this and other searches, characters much match exactly, so 'a' matches 'a', but does not match 'A'.

>>> s = 'Python'
>>> s.find('y')
1
>>> s.find('tho')
2
>>> s.find('xx')
-1

s.find(x, start_index) - variant of find(), beginning the search at the given index

String Test Functions

These functions return a boolean True/False test result about a string

s.startswith(x) - True if s start with string x

s.endswith(x) - True if s ends with string x

>>> 'Python'.startswith('Py')
True
>>> 'Python'.startswith('Px')
False
>>> 'resume.html'.endswith('.html')
True

Character class tests - it's handy to divide characters into broad classes: "alphabetic" chars like 'abc' that make words, "digits" '0', '1' to make numbers, "space" chars like space, tab, and newline. Then there are all the other miscellaneous characters like '$' and '^' which are not alphabetic, digit, or space.

s.isdigit() - True if all chars in s are digits '0..9'

s.isalpha() - True for alphabetic word char, i.e. a-z A-Z (applies to "word" characters in other unicode alphabets too)

s.isalnum() - alphanumeric, just combines isalpha() and isdigit()

s.isspace() - True for whitespace char, e.g. space, tab, newline

s.isupper(), s.islower() - True for uppercase / lowercase alphabetic. False for characters like '9' and '$' which do not have upper/lower versions.

>>> '6'.isdigit()
True
>>> 'a'.isalpha()
True
>>> '$'.isalpha()
False
>>> s = '\u03A3'
>>> s
'Σ'
>>> s.isalpha()
True
>>> 'a'.islower()
True
>>> 'a'.isupper()
False
>>> '$'.islower()
False
>>> '\n'.isspace()
True

String Change Functions

Upper/lower case conversion

s.lower() - returns a version of s where each char is converted to its lowercase form, chars like '$' are unchanged. The original s is unchanged - a good example of strings being immutable. Each unicode alphabet includes its own rules about upper/lower case.

s.upper() - returns an uppercase version of s

s.strip() - return a version of s with the whitespace characters from the very start and very end of the string all removed. Handy to clean up strings parsed out of a file.

>>> '   hi there  \n'.strip() 
'hi there'

s.replace(old, new) - returns a version of s where all occurrences of old have been replaced by new. Does not pay attention to word boundaries, just replaces every instance of old in s. Replacing with the empty string effectively deletes the matching strings.

>>> 'this is it'.replace('is', 'xxx')
'thxxx xxx it'
>>> 'this is it'.replace('is', '')
'th  it'

String Loops

Standard i/range() loop goes through indices of all chars in s:

for i in range(len(s)):
    # use s[i] in here

The "foreach" loop works on strings too, accessing each char. Unlike the above form, here you do not have access to the index of each char as it accessed.

for char in s:
    # use char in here

list('abc') of a string yields a list ['a', 'b', 'c'] of its chars.

More details at official Python String Docs

String Slices

string 'Python' shown with index numbers 0..5

Slice syntax is a powerful way to refer to sub-parts of a string instead of just 1 char. s[ start : end ] - returns a substring from s beginning at start index, running up to but not including end index. If the start index is omitted, starts from the beginning of the string. If the end index is omitted, runs through the end of the string. If the end index is too large (out of bounds), the slice just runs through the end of the string. This is the one case where Python is permissive about wrong/out-of-bounds indexes.

>>> s = 'Python'
>>> s[2:4]
'th'
>>> s[2:]
'thon'
>>> s[:5]
'Pytho'
>>> s[2:999]
'thon'
>>> s[4:4]  # a slice with zero chars in it
''

Negative numbers also work within [ ] and slices: -1 is the rightmost char, -2 is the char to its left, and so on. This is convenient when you want to extract chars relative to their position from the end of the string.

>>> s[-1]
'n'
>>> s[-2:]
'on'

String split()

str.split(',') is a string function which divides a string up into a list of string pieces based on a "separator" parameter that separates the pieces:

>>> 'a,b,c'.split(',')
['a', 'b', 'c']
>>> 'a:b:c'.split(':')
['a', 'b', 'c']

A returned piece will be the empty string if we have two separators next to each other, e.g. the '::', or the separator is at the very start or end of the string:

>>> ':a:b::c:'.split(':')
['', 'a', 'b', '', 'c', '']

Special whitespace: split with no arguments at all splits on whitespace (space, tab, newline), and it groups multiple whitespace together. So it's a simple way to break a line of text into 'words' based on whitespace (note how the punctuation is lumped onto each 'word'):

>>> 'Hello there,     he said.'.split()
['Hello', 'there,', 'he', 'said.']

File strategy: a common pattern is to use 'for line in f' to loop over the lines in a file and 'line.split()' to break each line up into pieces. Some text file formats have a format that split() works on easily.

String Join

','.join(lst) is a string function which is approximately the opposite of split: take a list of strings parameter and forms it into a big string, using the string as a separator:

>>> ','.join(['a', 'b', 'c'])
'a,b,c'

The elements in the list should be strings, and join just puts them all together to make one big string. Note that split() and join() are both noun.verb on string. The list is just passed in as a parameter.

Lists

A list can contain a linear series of any data type: strings, ints, other lists. The things inside a list are generically called "elements". Unlike strings, lists are "mutable" - they can be changed.

The first element of the list is at index [0] and last element is at length-1, i.e. len(lst)-1.

lst = [] - create empty list

len(lst) - access length of string

lst[0] - access individual elements with square brackets

for x in lst: - loop over contents, do not modify lst during loop

x in lst - boolean test if x is in lst (just like for string)

lst.append(x) -- add x to the end of lst, increasing its length by 1. The easiest way to add to a list. Does not return anything. Changes lst in place.

lst.extend(lst2) - add all the elements of lst2 on to the end of lst. Alternately + can be used: (lst + lst2) puts all their elements together to make a new, big list, e.g. [1, 2] + [3, 4] yields [1, 2, 3, 4]. Does not return anything.

Append vs. extend example:

lst = [1, 2, 3]
x =[4, 5]
# what happens .append vs. extend

# append:
lst.append(x)  -> x is added as an *element* so lst is [1, 2, 3, [4, 5]]

# extend:
lst.extend(x)  -> all elements of x are added at end, so lst is [1, 2, 3, 4, 5]

lst.pop() -- remove the element from the end of the list and return it, decreasing the length of the list by 1, i.e. the opposite of append()

lst.pop(index) -- alternate version with the index to remove is given, e.g. lst.pop(0) removes the first element from the list.

lst.insert(index, x) -- insert the element x so it is at the given index, shifting elements towards the end of the list as needed. Use index=len(lst) to insert at the end. Append() is simpler since it just goes on the end without any shifting and you don't have to think about index numbers. So it seems like I never use insert().

lst.copy() -- returns a copy of lst. You could use this to loop over a list and also modify it - loop over a copy, modify the original. (mentioned for completeness, I don't think this will ever come up.)

lst.reverse() -- reverse lst "in place", that means no new list is created. lst is modified where it is. (Related: reversed(lst) takes a list or similar linear collection, returns a reversed version, not modifying the original).

lst.sort() -- sort lst into increasing order in place, modifying it. By default numbers sort increasing by value, text sorts increasing alphabetically. We'll study custom sorting in more detail later. (Related: sorted(lst) takes a list or similar linear collection, returns a sorted version, not modifying the original).

min(lst) max(lst) - Return the smallest or largest element in lst. Uses the same underlying < foundation as sort(), but much faster than sorting the whole list. Raises an error if the list is empty. Note: these are functions like len(), not noun.verb.

lst.index(x) -- Look for first instance of x in lst and return its index. Raises an error if x is not in there (!). Therefore check with 'in' first, and only if x is in there call index(). Translation: there is no .find() for lists which IMHO seems like a real omission.

More details at official Python List Docs

List Slices

Slices work to pull out parts of list just as with strings.

lst = ['a', 'b', 'c']
lst[:2] -> ['a', 'b']
lst[2:] -> ['c']

The original list is not modified, this creates a new list populated with elements from the original. Omitting both start and end indexes yields a copy -- lst[:]

Dict Type

The dict "dictionary" type is very important because it takes in disorganized data, organizes it, and it is fast. The dict here is based on the "hash table" data structure. You will learn how to build and analyze a hash table in CS106B, here we're happy to just use them. Many important algorithms leverage hash tables in some way since they can handle large data sets and remain fast.

'dict' is the name of the Python dict type, so we'll use just 'd' as a generic variable name.

d = {}                  # Create empty dict
d['ca'] = 'California'  # 1. Set key/value pairs into dict
d['ok'] = 'Oklahoma'
d['nj'] = 'New Jersey'
d['tx'] = 'Texas'
val = d['nj']           # 2. Retrieve value by key
val = d['xx']           # fails with KeyError
check = 'nj' in d       # 3. in check -> True


dict with keys ca ok nj tx

1. d['ca'] = 'California' - with equal-sign: d[key] = xxx creates a key/value pair in the dict. If a pair was in there already, it is overwritten.

2. val = d['nj'] - referring to d[key] retrieves the value for that key, or is a KeyError. Code needs to check that the key is in before doing [ ] retrieval.

3. if 'nj' in d: - use in to check if a key is in the dict or not. This sort of 'in' works for lists too, but for dicts it's much faster.

Key point: dicts (hash tables) are fast. Even if the dict has 1 million key/value pairs in it, performing the get/set/in of single key is very fast. If the dict grows to hold 10 million key/value pairs, the speed is largely unchanged.

Strategy: therefore if we are reading a file or a network taking in disorganized data, load the data into a dict choosing the key and value definitions we want to output. The data will become organized by that key, even though it was in random order as it was read.

The type used as key should be "immutable", often a string or int (or tuple when we get to those).

Dict Counting

Here is the canonical logic to load up a dict - in this example the code counts the number of occurrences of each word, but many useful dict operations will follow this basic loop/in/out pattern.

def strs_counts(strs):
    """
    Given a list of strings, return a 'counts' dict of
    str->count key/value pairs
    """
    counts = {}
    for s in strs:
        # fix up not-in case
        if not s in counts:
            counts[s] = 0
        # invariant: at this line, s is in
        counts[s] += 1
        # alternate 'else' solution:
        #if not s in counts:
        #    counts[s] = 1
        #else:
        #    counts[s] += 1
    return counts

Getting Data out of Dict

1. len(d) - as you would guess, the number of key/value pairs in the dict

2. d.get(key) - retrieves the value for a key, but if the key is not there, returns None by default (vs. throwing an error like [ ]). A 2 parameter form d.get(key, missing-value) specifies what value to return if the key is missing. This forms an alternative to writing if/in logic to check if the key is in.

3. d.keys() - returns an iterable of all the keys in dict (in a random order). Can loop over this to get each key. The keys alone, no values.

4. d.values() - returns an iterable of all the values in dict (in a random order). Can loop over this to get each value.

5. d.items() returns an iterable of the key,value pairs. This works with a particular sort of double-variable loop, see below. If you want to look at all of the key,value pairs, this is the most direct way.

The most common pattern for looping over a dict, using sorted() function which returns a linear collection sorted into increasing order:

def print_counts2(counts):
    # sort the keys for nicer output
    for key in sorted(counts.keys()):
        print(key, counts[key])

The double-variable key,value loop (more detailed explanation in the tuples section below)

def print_counts3(counts):
    # key,value loop over .items()
    # unsorted
    for key,value in counts.items():
        print(key, value)

Tuples

The 'tuple' is a Python type which is like a small list. Tuples are like a list but written within parenthesis, here is a tuple of three strings: ('a', 'b', 'c')

Accessing a tuple works like a list -- len, [ ], in, slices, loops -- but the big difference is that a tuple is immutable. It is built once and then never changes. There is no append() function.

>>> t = ('a', 'b', 'c')
>>> len(t)
3
>>> t[0]
'a'
>>> 'c' in t
True
>>> t[0:2]
('a', 'b')
>>> t[0:2]
('a', 'b')
>>> t[0] = 'hi'   # no changing it!
Error:'tuple' object does not support item assignment

Tuples are useful where you have a fixed number of things to keep as a group -- e.g. an x,y coordinate pair stored in a len-2 tuple like (4, 10) keeps the x,y together as a unit. In contrast, a list is useful when you have an unlimited number of items you want to store together, such as the typical pattern of reading out of a file and using lst.append() to put all the items together in one list.

Recall that dict-keys should be immutable. Therefore if you have some composite data that you want to use as a dict key, e.g. a string and an int, form them into a tuple ('meh', 16) and then use that as the key.

Tuples and Assignment

One quirky but important use of tuples is assignment multiple variable at once, like this:

>>> (x, y) = (12, 4)
>>> x
12
>>> y
4

The length of the left-hand-side and right-hand-side must be the same (CS terminology protip "lhs" and "rhs".)

Tuples and Multiple Value Return

Communicating in to a function is a rich area: you can have any number of parameters, they each get a name. How do you communicate out of a function .. return 1 value. Returning 1 value covers 90% of the cases. But sometimes it really makes sense to return 2 or more values. The Pythonic way to do this is to return a tuple packing together the values. Like this

def min2(a, b):
    """
    Given 2 ints, returns (True, min_val) if at least
    one is negative and min_val is the minimum value.
    Otherwise returns (False, None)
    """
    if a<0 or b<0:
        return (True, min(a, b))
    return (False, None)

There are 2 requirements:

1. The function doc string must state the size and content-types of the returned tuple. There's no way the caller code can be written correctly without this information.

2. All paths should return a tuple of that same length, even if it is (None, None), so that standard looking calling code of the form (x, y) = fn(..) works without crashing should it hit the None case.

Calling min2 looks like this, catching the len-2 tuple result into 2 variables.

    (negative, min_val) = min2(a, b)
    if negative:
       ...

Tuples and Dicts

One handy use of tuples is the dict.items() function, which returns the entire contents of the dict as an list of len-2 (key, value) tuples.

>>> d = {'a':1, 'd':4, 'c':2, 'b':3}
>>> d
{'a': 1, 'd': 4, 'c': 2, 'b': 3}
>>> d.items()
dict_items([('a', 1), ('d', 4), ('c', 2), ('b', 3)])  # (key, value) tuples
>>> sorted(d.items())   # same tuples, sorted by key
[('a', 1), ('b', 3), ('c', 2), ('d', 4)]

Sorting of tuples goes left-to-right within each tuple -- first sorting by all the [0] values across the tuples, then by [1], and so on.

Once all the data has been loaded into a dict, it's natural to have a process-all-the-data phase. This can be written as a loop over the above d.items() list. The loop syntax below takes one tuple off the list for each iteration, setting the two variable, key and value each time:

# Example: for loop setting key/value for each iteration
for key,value in d.items():
    # use key and value in here

This is a special version of the for loop, where there are multiple variables, and the number of variables matches the size of a tuples coming off the list. The above example, looping key,value over dict.items() is probably the most common use of this multiple-variable variant of the for loop.

Sorting

Sorting is a classic computer algorithm and all computer languages provide it for you. There are piles of CS research and language features on just this one topic which I would venture say is now a solved problem. (In CS106B you'll design and analyze your own sort algorithm.) Here we'll look at the sorted() function which works on any linear data structure.

Sorting 1.0

The default built-in sorted() orders the elements into increasing order. In a very Pythonic twist, the elements can be any type (str, int, float, ..) so long as that type supports < and == correctly.

The simplest use of sorted is just sorting a linear series of strs or ints...

>>> a = [23, 42, 4, 8, 15, 16]
>>> sorted(a)
[4, 8, 15, 16, 23, 42]
>>> a
[23, 42, 4, 8, 15, 16]   # original lst not changed
>>>
>>> b = ['banana', 'apple']
>>> sorted(b)
['apple', 'banana']

The elements should be < comparable to each other - int vs. int, str vs. str is fine. Comparing < str to int does not have an intuitive definition and is an error (and that's got to one of the more readable error messages!):

>>> sorted([2, 3, 'hi'])
Error:'<' not supported between instances of 'str' and 'int'

Sorting 2.0 - Tuples

Say you have a list of tuples, giving a web domain name and a path on that domain. A common goal for this data is to sort first by domain, but when domains are equal, fall-back to sorting by path. This fall-back strategy is exactly what sorted() does, ordering first by [0], then falling back to [1] and so on.

>>> domains = [('stanford.edu', '/meh'), ('google.com','/search'), ('stanford.edu', '/admit')]
>>> sorted(domains)
[('google.com', '/search'), ('stanford.edu', '/admit'), ('stanford.edu', '/meh')]

Stable - a common feature of good sort algorithm is that it only re-orders elements from the original if they are not ==. So if we had 2 copies of the ('stanford.edu', '/admit') tuple in that list, they would stay in the same order after sorting. The built in sorting algorithm is stable in this way.

Sorting 3.0 - Custom Sorting

Say you have a list of strings, and you want to sort in a way other than alphabetic. Python has a very nice way to do this, the 'key' function.

The 'key' function is a function of 1 argument, that produces a value to use in the sort-comparison instead of the actual list value. The key value is a sort of proxy for the sorting, but the actual values are what is sorted. The other option for sorting is a reverse=True option to switch to output in descending order (reverse=False is the default).

Sort Key Example 1

Sort strings by their length. Introduce by_len(s) function that returns length. When calling sorted, pass in key=by_len. Note: no parens! We are indicating what function to use, but we are not calling it.

>>> a = ['aaaa', 'bb', 'z']
>>> sorted(a)
['aaaa', 'bb', 'z']
>>> def by_len(s): return len(s)  # writing def as 1-line in the console
>>> sorted(a, key=by_len)
['z', 'bb', 'aaaa']
>>> sorted(a, key=by_len, reverse=True)
['aaaa', 'bb', 'z']

Actually can just use the built-in function 'len' but typically you need to write your own by_len or whatever function to capture the sorting you want.

Sort Key Example 2

Say we have (str, int) 'food' tuples where the int is a count, such as from a dict-count algorithm.

>>> foods = [('apple', 5), ('banana', 2), ('chocolate', 137), ('bread', 10)]
>>> sorted(foods)
[('apple', 5), ('banana', 2), ('bread', 10), ('chocolate', 137)]

Say we want to sort these same tuples, but increasing by the count. Introduce

>>> def by_count(food): return food[1] 
>>> sorted(foods)
[('apple', 5), ('banana', 2), ('bread', 10), ('chocolate', 137)]
>>> sorted(foods, key=by_count)

What if we want the highest count first? That's for the reverse=True parameter.

>>> sorted(foods, key=by_rating, reverse=True)
[('chocolate', 137), ('bread', 10), ('apple', 5), ('banana', 2)]

Aside: How much time does sorting take?

An algorithm that makes 1 pass over a list, doing a fixed-cost computation for each element could be called "linear" time cost. If you double the length of the list, the time taken roughly doubles. Many algorithms we've worked out have this linear 1-pass-over-the-data quality.

Sorting is more expensive than linear. It has to go back and forth repeatedly, some elements have to be examined many times. For CS106A, just remember that sorting is medium expensive.

File Reading

1. Here is the canonical code to open a file, read all the lines out of it, handling one line at a time. The advantage of processing 1 line at a time is that we do not require memory to hold every byte of the file at once - files can be quite big! This way we can process a file larger than the install memory of the computer, which is why every language supports something like this. This example uses Python's newish 'with' form which automates closing the file properly. The 'r' here indicates reading. We'll talk about 'w' writing forms later.

with open(filename, 'r') as f:
    for line in f:
        # process line (line str has '\n' at its end)
        print(line, end='')

2. Sometimes it is more handy to read the whole text file in as one big string (requiring enough bytes of memory to hold the whole file contents). The f.read() function does this, and here are 3 common things to do with the text from read():

with open(filename, 'r') as f:
    text = f.read()  # the whole file contents is in str variable 'text'
                     # with '\n' chars embedded in it
    
    # a. Could use split() to make a words-list out of the whole file.
    # The '\n' are treated like spaces by split()
    words = text.split()

    # b. Say we want print the text out unchanged.
    # The text probably ends with '\n' already so just printing
    # it you get an extra '\n' at the end. Therefore use end=''
    print(text, end='')

    # c. Say you want to process the lines anyway, str.splitlines()
    # finds all the '\n' type chars, and divides the text into a list
    # of lines. The '\n' are omitted from each returned line by default.
    lines = text.splitlines()

3. Can read the entire file into memory as a list of string lines:

with open(filename, 'r') as f:
    lines = f.readlines()
    # lines is a list of the lines you would get with 'for line in f'
    # now can access them all at once
    # but requires much more memory

Loops

1. for in - 'foreach'

Loop through the elements in various liner list-type things, called a 'foreach' in many languages.

# words is a list of strings
for word in words:
  # use word

Do not modify the lst during a foreach. It can work in some cases, but it's an unreliable practice. If you really need to modify lst, make a copy of lst, loop over the copy, and safely modify lst.

Foreach is often paired with the range() function:

# string s
for i in range(len(s)):  # all indexes of s
    if s[i].isdigit():
        ...

2. while

while test:

Check the test, run all the 'body' lines inside the loop, check the test again, and so on. Can run the body zero times if the test is False the first time.

i = 0
while i < len(s) and s[i].isdigit():
    i += 1

break and continue

'break' exits the loop immediately. Use to make up a loop-exit test midway through the loop body.

while True:
    s = obtain_string()
    if len(s) == 0:
        break
    ...

'continue' is used less frequently. It jumps immediately back to the top/test of the loop, discarding the current run of the body. Many languages have these same 'break' and 'continue' operations.

Canonical Loops Phrases

Looping over collections is very common. Here are the most common forms.

Loop over int range

for i in range(10):
  print(i)
# 0 1 2 3 4 5 6 7 8 9

Loop over elements in a list

If you do not need to know the index numbers as you go, you can just use for/in to see each element once. Do not modify the list during iteration. (In fact range(10) returns something like a list so this form and the one above are the same.)

words = ['hi', 'there!', ...]
for word in words:
    # use word in here
    print(word)

Loop over each index into a collection

For a linear storage (e.g. a list, a string), index numbers are 0, 1, ... len-1. With this form you access each element with [i], but since you know the value of i too, you can use it in slices or whatever.

s = 'Hello'
for i in range(len(s)):
  print(s[i])
# H e l l o

Load a list with data

Reading data out of a file or something, loading it up into a list.

lst = []
for i in range(10):
  lst.append(i)
# lst = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

Programming Style - Readable

Experience shows that the best code is "readable" - someone sweeping their eye over it can see what it does and how its parts fit together.

Variable Names

A good variable names reflects what data it holds so lines with that data make sense. Typing a very long variable name is a chore, so we want a name just long enough to show what's going on; 's2' is not good, but 'start' or 'mid' are enough.

Below is an example from the midterm review - lots of string variables. If they had names like s1 s2 s3 it's a lot harder to get right. Creating meaningful+short variable names lets the programmer narrate their own ideas as they write each line. The last line is a nice example of names supporting the code. We would say that the code "reads" nicely, showing what it does.

foo() Example Function Problem - Given a string s of even length, if the string length is 2 or less, return it unchanged. Otherwise take off the first and last chars. Consider the remaining middle piece. Split the middle into front and back halves. Swap the order of these two halves, and return the whole thing with the first and last chars restored. So 'x1234y' yields 'x3412y'.

def foo(s):
    if len(s) <= 2:
        return s
    first = s[0]
    last = s[len(s)-1]
    mid = s[1:len(s)-1]
    halfway = len(mid) // 2  # index of halfway
    return first + mid[halfway:] + mid[:halfway] + last

Allowable 1-Letter Names

In a real program, most variables are not 1-letter, since the variables are carrying real data around, and the variable name reflects that type of data. But there are a few idiomatic uses where the 1-letter variable is quite readable, as the 1-letter form is a kind of standard for certain situations:

Collection Data Structures - 's'

Variable names for lists, dicts, .. that hold many things work well with 's' at the end. One sign this is working well is writing the foreach loop, and the logic of the names confirms you have it right:

# e.g. 'lines' is a list of text lines from a file
for line in lines:
    ...

# e.g. 'nums' is a list of int values
for num in nums:
    ...

# e.g. 'weights' is a list of weight values
for weight in weights:
    ...

# e.g. 'word_counts' is a dict mapping word -> count
for word in word_counts.keys():
    count = word_counts[word]
    ...

Names vs. Bugs - Time Saver!

Tracking down bugs is a very time consuming. Often a bug creeps in when you lose track of - a word vs. a list of words vs. a dict of lists of words. Being careful with your variable names as you go code up your steps can be a big time saver by cutting down on these bugs.

Function Decomposition

Decomposition refers to separating out smaller functions to compute sub-parts of the big problem. Divide and conquer!

PEP8

The 'PEP8' standard is a series of code formatting and other standards for Python code which help keep things clean. Here's the top issues IMHO...

Lots more detail: Official PEP8

Python Modules and Import

A Python "module" is a container of code ready-to-use. One reason coding in Python is so productive is the great variety of off-the-shelf code ready to use from your code, and modules are how this code is organized.

There are more than 100 "standard" modules included with Python 3 -- ranging from math functions, to parsing Excel CSV files, to reading and writing .zip archive files.

- List Of Modules

This is not the sort of list you memorize. Rather, when you come to part of your problem that may have an off-the-shelf solution, look through the modules to see what's available.

There are also modules which exist but which are not part of the standard Python distribution. Such modules need to be downloaded and installed separately (google "pip3"). Standard modules are the most common and easiest to work with, since they are already installed.

Import

To use a module, it must be imported by its name. This is typically done at the top of the Python file. Here's an example that imports the 'sys' module, an extremely common thing to do. The sys module contains some variables and functions for interacting with the operating system of the computer which starts and stops programs.

#!/usr/bin/env python3

# typically import a few modules at the top of the file,
# here the standard 'sys' module
import sys


def main():
    args = sys.argv[1:]

    if len(args) == 0:
        print('Expected a command line argument')
        sys.exit(1)
        print("We don't get to this line!")
...

The sys module contains the historically named 'argv' list which is the list of command line arguments to when this program was started. Elements inside a module are referred to using a dot, so here we see 'sys.argv'.

The sys module also includes a function called exit() which exits the program immediately. An old convention is that a program exit has an int status code, 0 = ran correctly, non-zero means there was some problem, so in this example we pass 1 to exit. Note that the exit() function is referred to with module-dot notation.