Today: better/shorter code invariant, more string functions, unicode
See the course page for timing, logistics, and lots of practice problems. Finish the Crypto program first, take a day off, then worry about the exam. You might plan on spending Sun evening practicing.
Topics on the exam: simple Bit (hw1), images/pixels/nested-loops (hw2), 2-d grids (hw3), strings, loops, simple lists (hw4)
Topics not on exam: bit decomposition problems, bluescreen algorithm, writing main(), file reading
The bad news / good news of it
Today, I'll show you some techniques where we have code that is already correct, but we can write it in a better, shorter way. It's intuitively satisfying to take a 10-line thing and shrink it down to 6 lines that read better.
> Better problems
speeding(speed, birthday): Compute speeding ticket fine as function of speed and birthday boolean. Rule: speed under 50, fine is 100, otherwise 200. If it's your birthday, the allowed speed is 5 mph more. Challenge: change this code to be shorter, not have so many distinct paths.
The code below works correctly. You can see there is one set of lines each for the birthday/not-birthday cases. What exactly is the difference between these two sets of lines?
def speeding(speed, birthday):
    if not birthday:
        if speed < 50:
            return 100
        else:
            return 200
    else:  # is birthday
        if speed < 55:
            return 100
        else:
            return 200
def speeding(speed, birthday):
    # Set limit var
    limit = 50
    if birthday:
        limit = 55
    # Unified - one if stmt works for all cases
    if speed < limit:
        return 100
    return 200
The code above shows a handy pattern for setting a variable according to a boolean - initialize (set) the variable to a default value first. Then an if-statement detects if we need to change it to something else.
limit = 50
if is_birthday:
    limit = 55
# limit is now set, one way or another

# Equivalently could do it with "else"
if not is_birthday:
    limit = 50
else:
    limit = 55
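When shortening code that already works, one way to gain confidence is to run the old and new versions on the same inputs and check that they agree. Here is a minimal sketch of that idea for the speeding problem. The names speeding_before, speeding_after, and versions_agree are made up for this illustration and are not part of the assignment.

```python
def speeding_before(speed, birthday):
    # Original version: separate paths for not-birthday / birthday
    if not birthday:
        if speed < 50:
            return 100
        else:
            return 200
    else:
        if speed < 55:
            return 100
        else:
            return 200


def speeding_after(speed, birthday):
    # Shortened version: set limit first, then one unified if-statement
    limit = 50
    if birthday:
        limit = 55
    if speed < limit:
        return 100
    return 200


def versions_agree():
    # Hypothetical checker: try speeds 0..99 with both birthday values
    for speed in range(100):
        for birthday in [False, True]:
            if speeding_before(speed, birthday) != speeding_after(speed, birthday):
                return False
    return True


print(versions_agree())  # True
```

The range 0..99 covers the interesting boundary speeds 49, 50, 54, and 55 for both birthday values.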
Change this code to be better / shorter. Look at lines that are similar - make an invariant.
ncopies(word, n, suffix): Given a word string, int n, and suffix string, return n copies of word + suffix. If suffix is the empty string, use '!' as the suffix. Challenge: change this code to be shorter, not have so many distinct paths.
Before:
def ncopies(word, n, suffix):
    result = ''
    if suffix == '':
        for i in range(n):
            result += word + '!'
    else:
        for i in range(n):
            result += word + suffix
    return result
Solution: use logic to set "suffix" to hold the suffix to use for all cases. Later code just uses suffix vs. separate if-stmt for each case.
def ncopies(word, n, suffix):
    result = ''
    # Set suffix if necessary to value to use
    if suffix == '':
        suffix = '!'
    # Unified: one loop, using suffix
    for i in range(n):
        result += word + suffix
    return result
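Going one step further than the solution above: once suffix is settled, the loop itself can be replaced with Python's built-in * repetition on strings. This is a sketch of an alternative, not the expected solution for the problem, and it assumes using * is allowed.

```python
def ncopies(word, n, suffix):
    # Set suffix to the value to use for all cases
    if suffix == '':
        suffix = '!'
    # String * int repeats the string n times (n = 0 gives '')
    return (word + suffix) * n


print(ncopies('Yay', 3, ''))   # Yay!Yay!Yay!
print(ncopies('ab', 2, 'x'))   # abxabx
```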
> match()
match(a, b): Given two strings a and b. Compare the chars of the strings at index 0, index 1 and so on. Return a string of all the chars where the strings have the same char at the same position. So for 'abcd' and 'adddd' return 'ad'. The strings may be of any length. Use a for/i/range loop. The starter code works correctly. Re-write the code to be shorter.
Before:
def match(a, b):
    result = ''
    if len(a) < len(b):
        for i in range(len(a)):
            if a[i] == b[i]:
                result += a[i]
    else:
        for i in range(len(b)):
            if a[i] == b[i]:
                result += a[i]
    return result
After:
def match(a, b):
    result = ''
    # Set length to whichever is shorter
    length = len(a)
    if len(b) < len(a):
        length = len(b)
    for i in range(length):
        if a[i] == b[i]:
            result += a[i]
    return result
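The "whichever is shorter" step can also be written with Python's built-in min() function, which returns the smaller of its arguments. A sketch of that variation, assuming min() is fair game for the problem:

```python
def match(a, b):
    result = ''
    # min() picks the shorter of the two lengths
    length = min(len(a), len(b))
    for i in range(length):
        if a[i] == b[i]:
            result += a[i]
    return result


print(match('abcd', 'adddd'))  # ad
```

This is the same invariant idea: compute length once, then a single loop works for every case.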
See guide for details: Strings
Thus far we have done String 1.0: len, index numbers, upper, lower, isalpha, isdigit, slices, .find().
There are more functions. You should at least have an idea that these exist, so you can look them up if needed. The important strategy is: don't write code manually to do something a built-in function in Python will do for you. The most important functions you should have memorized, and the more rare ones you can look up.
The startswith() and endswith() functions are very convenient True/False tests for the specific case of checking if a substring appears at the start or end of a string. They are also a pretty nice example of function naming.
>>> 'Python'.startswith('Py')
True
>>> 'Python'.startswith('Px')
False
>>> 'resume.html'.endswith('.html')
True
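To see why the built-in is worth using, here is a small sketch comparing endswith() with a hand-written slice version of the same test. The helper names is_html and is_html_manual are made up for this example.

```python
def is_html(filename):
    # Hypothetical helper: True if filename ends with '.html'
    return filename.endswith('.html')


def is_html_manual(filename):
    # Same test written by hand with a slice - the 5 must match len('.html')
    return filename[-5:] == '.html'


print(is_html('resume.html'))         # True
print(is_html('resume.txt'))          # False
print(is_html_manual('resume.html'))  # True
```

Both work, but the manual version has a magic number to keep in sync, while the endswith() version says what it means.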
>>> s = 'this is it'
>>> s.replace('is', 'xxx') # returns changed version
'thxxx xxx it'
>>>
>>> s.replace('is', '')
'th it'
>>>
>>> s # s not changed
'this is it'
Recall that calling a string function does not change the original string. You need to use the return value...
# NO: Call without using result:
s.replace('is', 'xxx')
# s is the same as it was
# YES: this works
s = s.replace('is', 'xxx')
>>> s = ' this and that\n'
>>> s.strip()
'this and that'
>>> s = '11,45,19.2,N'
>>> s.split(',')
['11', '45', '19.2', 'N']
>>> 'apple:banana:donut'.split(':')
['apple', 'banana', 'donut']
>>>
>>> 'this is it\n'.split() # special whitespace form
['this', 'is', 'it']
>>> foods = ['apple', 'banana', 'donut']
>>> ':'.join(foods)
'apple:banana:donut'
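The strip(), split(), and join() functions often work together when processing a line of text. Here is a rough sketch, using a made-up helper named clean_fields and a made-up input line, that cleans up a messy comma-separated line and then re-joins the fields with ':'.

```python
def clean_fields(line):
    # Hypothetical helper: strip the line, split on ',', strip each field
    fields = line.strip().split(',')
    result = []
    for field in fields:
        result.append(field.strip())
    return result


line = ' 11, 45,19.2 ,N\n'
fields = clean_fields(line)
print(fields)             # ['11', '45', '19.2', 'N']
print(':'.join(fields))   # 11:45:19.2:N
```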
>>> 'Alice' + ' got score:' + str(12) # old: + and str()
'Alice got score:12'
>>>
>>> '{} got score:{}'.format('Alice', 12) # new: format()
'Alice got score:12'
>>>
In the early days of computers, the ASCII character encoding was very common, encoding the roman a-z alphabet. ASCII is simple, and requires just 1 byte to store 1 character, but it has no ability to represent characters of other languages.
Each character in a Python string is a unicode character, so characters for all languages are supported. Also, many emoji have been added to unicode as a sort of character.
Every unicode character is defined by a unicode "code point" which is basically a big int value that uniquely identifies that character. Unicode characters can be written using the "hex" version of their code point, e.g. "03A3" is the "Sigma" char Σ, and "2665" is the heart emoji char ♥.
Hexadecimal aside: hexadecimal is a way of writing an int in base-16 using the digits 0-9 plus the letters A-F, like this: 7F9A or 7f9a. Two hex digits together like 9A or FF represent the value stored in one byte, so hex is a traditional easy way to write out the value of a byte. When you look up an emoji on the web, typically you will see the code point written out in hex, like 1F644, the eye-roll emoji 🙄.
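Python can convert between hex and regular ints for you. A quick sketch using the built-in int() with base 16, the hex() function, and the 0x prefix for hex literals (the variable names are just for illustration):

```python
byte_value = int('FF', 16)     # parse hex digits as base-16
big_value = int('7f9a', 16)    # upper or lower case both work
hex_text = hex(255)            # int back to a hex string
eye_roll = 0x1F644             # a hex literal is just an int

print(byte_value)  # 255
print(big_value)   # 32666
print(hex_text)    # 0xff
print(eye_roll)    # 128580
```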
You can write a unicode char out in a Python string with a \u followed by the 4 hex digits of its code point. Notice how each unicode char is just one more character in the string:
>>> s = 'hi \u03A3'
>>> s
'hi Σ'
>>> len(s)
4
>>> s[0]
'h'
>>> s[3]
'Σ'
>>>
>>> s = '\u03A9'  # upper case omega
>>> s
'Ω'
>>> s.lower()     # compute lowercase
'ω'
>>> s.isalpha()   # isalpha() knows about unicode
True
>>>
>>> 'I \u2665'
'I ♥'
For a code point with more than 4-hex-digits, use \U (uppercase U) followed by 8 digits with leading 0's as needed, like the fire emoji 1F525, and the inevitable 1F4A9.
>>> 'the place is on \U0001F525'
'the place is on 🔥'
>>> s = 'oh \U0001F4A9'
>>> len(s)
4
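The link between a character and its code point int can be seen directly with the built-in functions ord() (char to int) and chr() (int to char). These functions are not covered above, so treat this as an optional sketch:

```python
sigma = '\u03A3'
sigma_code = ord(sigma)        # code point as a regular int
heart = chr(0x2665)            # int code point back to a 1-char string
fire = chr(0x1F525)            # works for code points above 4 hex digits too

print(sigma_code)              # 931
print(hex(sigma_code))         # 0x3a3
print(heart == '\u2665')       # True
print(fire == '\U0001F525')    # True
print(len(fire))               # 1
```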
The history of ASCII and Unicode is an example of ethics.
In the early days of computing in the US, computers were designed with the ASCII character set, supporting only the roman a-z alphabet. This hurt the rest of the planet, which mostly doesn't write in English. There is a well known pattern where technology comes first in the developed world, is scaled up and becomes inexpensive, and then proliferates to the developing world. Computers in the US using ASCII hurt that technology pipeline. Choosing a US-only solution was the cheapest choice for the US in the moment, but made the technology hard to access for most of the world. This choice is somewhere between ungenerous and unethical.
Unicode takes 2-4 bytes per char, so it is more costly than ASCII. Cost per byte aside, Unicode is a good solution - a freely available standard. If a system uses Unicode, it and its data can interoperate with the other Unicode compliant systems.
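The exact byte count depends on the encoding used to store the text. As a rough sketch, assuming the common UTF-8 encoding (an assumption here, since these notes do not name an encoding), plain ASCII letters still take 1 byte while other characters take 2 to 4. The helper name utf8_size is made up for this example.

```python
def utf8_size(ch):
    # Hypothetical helper: number of bytes ch takes when encoded as UTF-8
    return len(ch.encode('utf-8'))


print(utf8_size('a'))            # 1
print(utf8_size('\u03A3'))       # 2  (Sigma)
print(utf8_size('\u2665'))       # 3  (heart)
print(utf8_size('\U0001F525'))   # 4  (fire)
```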
The cost of supporting non-ASCII data can be related to the cost of the RAM to store the unicode characters. In the 1960s every byte was literally expensive. An IBM System/360 could be leased for $5,000 per month, not adjusted for inflation, and had about 32 kilobytes of RAM (not megabytes or gigabytes .. kilobytes!). So doing very approximate math, figuring RAM is half the cost of the computer, we get a cost of about $1 per byte per year.
>>> 5000 * 12 / (2 * 32000)
0.9375
So in the 1960s, Unicode would have been a non-starter. RAM was expensive.
What does the RAM in your phone cost today? Say your phone costs $500 and has 8GB of RAM (conservative). Say the RAM is all the cost and the rest of the phone is free. What is the cost per byte?
The figure 8 GB is 8 billion bytes. In Python, you can write that as 8e9 - like on your scientific calculator.
>>> 500 / 8e9  # 8 GB
6.25e-08
>>>
>>> 500 / 8e9 * 100  # in pennies
6.2499999999999995e-06
RAM costs almost nothing today - about 6 millionths of a cent per byte. This is the result of Moore's law. Exponential growth is incredible.
Sometime in the 1990s, RAM was cheap enough that spending 2-4 bytes per char was not so bad, and around then is when Unicode was created. Unicode is a standard way of encoding chars in bytes, so that all the Unicode systems can transparently exchange data with each other.
With Unicode, the tech leaders were showing a little generosity to all the non-ASCII computer users out there in the world.
With Unicode, there is just one Python that works in every country. A world of programmers contribute to Python as free, open source software. We all benefit from that community, vs. each country maintaining their own in-country programming language, which would be a crazy waste of duplicated effort.
So being generous is the right thing to do. But the story also shows that when you are generous to the world, that generosity may well come around and help you as well.