L14

Today: exam prep, hardware tour, string functions, unicode

Midterm Tuesday Evening

See course page for timing, logistics, lots of practice problems. Finish the Crypto program first, take a couple of days off, then worry about the exam. You might plan on spending, say, Sun and Mon evening working practice problems.

Topics on the exam: simple Bit (hw1), images/pixels/nested-loops (hw2), 2-d grids (hw3), strings, loops, simple lists (hw4)

Topics not on exam: bit decomposition problems, bluescreen algorithm, writing main(), file reading, int div //

CS Coding Exam

The bad news / good news of it

High School; 90% = A
That's not how Stanford Engineering works
We'll compute a median/curve, decide where the grades are
We want to scare you a bit now, so you practice
Closed note, closed computer
Closed note = normal looking problems
Not grading on syntax, we award partial credit
e.g. list.add(12) - full credit, should be lst.append(12)
e.g. missing colon - full credit
e.g. write "if" instead of "while" - not full credit
The exam topics are predictable
Important code patterns .. these are in lecture and hw problems
Exam is made of these same, familiar looking patterns
Therefore: studying for this exam pays off
Exam = 60 minutes
Enough time to solve
Not enough time to learn it
Ideal: "Oh, I've solved something like this before"
No HW5 going out until after the exam .. clearing out time for you to practice

How To Practice For A CS Exam

Just looking over the solutions is mostly worthless, psychological trap
Get a problem
Don't look at the solution
Get a blank sheet of paper
Try to solve it like on the exam
Then you can compare your solution to ours
Repeat!

Practice Problems - Reps

Exam practice problems - on course page
Last year's exam
Lecture problems
Homework problems
Initially: peek at lecture example - fine
Goal: can do it on your own
Section problems
Also: practice mode on experimental server - scroll down to see button

Computer Hardware

What is a Computer?

You have one on your person all day. You're debugging code for one. You see the output of them constantly. What is it and how does it work?

1. Why is it called Silicon Valley?

Silicon Valley may be here because of Stanford
Stanford Prof Fred Terman -> Hewlett and Packard (1939) -> Silicon Valley
Orchards and cheap real estate at that time!
Think silicon chip
alt: silicon chip
Tiny transistors are "etched" onto the chip
PSA: Silicon (chips) and Silicone (rubbery stuff) - two easily confused words

2. Moore's Law

Moore's Law: the number of transistors per chip doubles every 2 years
(Moore's law appears to be slowing from the 2 year cadence around 2020)
i.e. smaller transistors, fit more per chip ... cheaper!
Since 1965, an incredible run of improvement
Think about phone 6 years ago (junior high)
6 years = 3 doublings = 8x
Was 32GB storage per phone .. now 256 GB is the minimum, 8x more
Moore's law!
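The doubling arithmetic above can be sketched in a few lines of Python. This is a rough illustration only; the function name moores_law_factor and the fixed 2 year doubling period are assumptions for the example, not a precise model of the industry.

```python
def moores_law_factor(years, doubling_period=2):
    """
    Return the growth factor after the given number of years,
    assuming one doubling every doubling_period years.
    (Hypothetical helper, for illustration only.)
    """
    doublings = years / doubling_period
    return 2 ** doublings


factor = moores_law_factor(6)      # 6 years = 3 doublings
storage_then = 32                  # GB, the phone from 6 years ago
storage_now = storage_then * factor
print(factor, storage_now)
```

With 6 years the factor works out to 8, matching the 32 GB to 256 GB example.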

Features on a chip - "nm" Generations

Features on chip - try to make smaller on new generation of chips
A nanometer - 1 billionth of a meter - "nm"
These chip measures are approximate, different definitions per manufacturer
2020 - 5nm features
2022 - 3nm features
Future: working on 2nm
Point: smaller features, can fit more on a chip

Aside: Chip Factories are Expensive, Amazingly Complex

Recent shortage of chips in the news
Chip factories now cost around $10 billion
Each step of Moore's law requires more expensive equipment
See: Bloomberg - chips are hard to make
See video: ASML tin droplet laser system - it's hard to believe how complex the system is
This is the technology making your phone chips
Talk of building US chip factories for security

Quick Tour of How Computers Work

Your Python code runs on computer hardware, using CPU, RAM etc.
Let's look for a minute at how those parts work
Terminology explained below:
CPU, RAM, storage, operating system, process, core

Computer - CPU, RAM, Storage

alt: computer is made of CPU, RAM, storage

3 parts of the computer (or phone)
1. CPU
Central Processing Unit
The brains, 2 GHz, simple instructions
CPU does work (RAM stores the work)
e.g. run a line: a = b + c
2. RAM
Random Access Memory
Temporary store of bytes for CPU
Stores code and variables of program
Not persistent (power-off = erased)
3. Storage
Storage in laptop / phone / USB key
aka "persistent storage"
Storage in the form of files, folders
Measured in bytes, like RAM, but cheaper per byte
Your phone might have 8GB of RAM, but 256GB of storage
"Persistent" = keeps state even if powered-off
1MB = 1 megabyte = 1 million bytes
1GB = 1 gigabyte = 1 billion bytes
1TB = 1 terabyte = 1 trillion bytes

Extra: GPU

GPU Graphics Processing Unit
Optimized for pixel processing, lots of float arithmetic
Some AI problems run best on GPUs also
Ordinary code runs on the CPU, not the GPU
The GPU has its own distinct computer language
Used by, say, game developers
Or for certain AI problems
Most programmers never write GPU code, it's a specialty

Extra: CPU types: x86, Arm, Risc-V

There are different types of CPU: x86, Arm, and more recently Risc-V. Low-level software created for one will not run on another. (Python is portable - your Python code will work without modification on many different CPUs.) The x86 processors are associated with Intel and AMD and have had a long dominance dating back to the creation of the PC, which used an x86 processor, in 1981. Arm licenses processor designs which are totally dominant in cell phones, and more recently Apple has used them in computers. Arm chips are a more modern design compared to x86. Most recently, Risc-V is an open, royalty-free type of CPU, where a manufacturer has the freedom to make them without permission (Arm and x86 are quite the opposite). I would not be surprised to see Risc-V grow in importance, as openness has a long history of bringing in a lot of investment and creativity.

Want to talk about running a computer program...

A Running Program is a "Process"
Gets its own area in RAM

A running program is known as a "process"
Each process gets its own area in RAM
The areas in RAM are kept separate from each other
Multiple processes can run at one time
When a process exits, its RAM space is reclaimed
The "Operating System" manages the processes
Your computer has a utility to show the list of running processes (below)

For example, we have cat.py - a python program. When not running, it is just a file sitting in storage (a file which you wrote!). To run the cat.py program, a "process" is created with space in RAM, and the CPU runs it there. When the program exits, the process is destroyed and the space in RAM can be used for something else.

alt: each running program is a process, gets its own area in RAM
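As a small illustration that a running Python program really is a process, the standard os module can report the id number the operating system assigned to it. This is just a sketch; the number differs on every run, and the same id appears next to the program in the process list utility.

```python
import os

# Each running program (process) gets an id number from the operating system.
pid = os.getpid()
print('This Python process has id:', pid)
```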

Operating System (OS)

"Operating System" (OS) manages CPU, RAM etc.
e.g. Windows, iOS, Android, Mac OS, Linux
Starts and stops programs
Manages memory between processes
Manages files
When you type commands in the "terminal"
You are typing commands to your operating system
Run a program: python3 crazycat.py alice.txt
List files: ls (or old windows "dir")
Show the contents of files: cat poem.txt (or windows "type")

CPU / Cores

The CPU is divided into "cores"
Each CPU core can run code in RAM
So a 4-core CPU can run 4 processes simultaneously
Often a core is "idle" waiting for something to do
Adding cores provides diminishing returns
For typical work, 8 cores only slightly more useful than 2 cores
Only so much simultaneous work to be done
Certain tasks, like video encoding can use many cores
Cores use more power when running, less when idle
This is why the fans spin up to cool the CPU when it is active
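To see how many cores your own machine has, Python's standard os.cpu_count() function can be used, as in this sketch. Note that it reports "logical" cores, which on some chips is higher than the number of physical cores, and it can return None if the count cannot be determined.

```python
import os

# Number of logical cores the operating system reports (None if unknown).
cores = os.cpu_count()
print('Cores on this machine:', cores)
```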

RAM holds code + variables

The area in RAM for each process holds:
1. The code to run
2. The values for the variables
The CPU core runs the code, manipulates the values
The CPU core does not run Python directly
Instead the CPU core runs a very simple "machine code"
1 line of Python code is expanded to about 10 "machine code" instructions to run on the CPU
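A hedged way to get a feel for this expansion: Python's standard dis module lists the simple "bytecode" instructions that Python itself generates for a function. Bytecode is not the CPU's machine code (it is run by the Python interpreter, which in turn runs as machine code), so treat this only as an analogy for how one line becomes several simpler steps. The function name add_demo is made up for the example.

```python
import dis


def add_demo(b, c):
    a = b + c
    return a


# Collect the simple instructions Python generated for add_demo().
# The exact list varies between Python versions.
instructions = list(dis.get_instructions(add_demo))
for ins in instructions:
    print(ins.opname, ins.argrepr)
```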

alt: process area in RAM stores both code and values, CPU core runs the code of process1

Multiple Processes at Once

Multiprocessing
The CPU will run one process for a fraction of a second
Then switch to run another process for a fraction of a second
Create the illusion that all the processes are running at once
Even if there's just one CPU core

alt: CPU core switches to run process2 for a fraction of a second

Browser Tab = Process

Each process in RAM is isolated from the other processes
Each tab in your browser is supposed to be isolated from the other tabs
Modern web browsers implement tabs by running each tab in your browser as its own process
Most tabs don't do much computation, but animation heavy tabs can use a lot of CPU
Look at the processes on your computer, you will likely see processes with names like "Chrome Helper" or "Firefox Web Content". Each of these holds the data and runs the code of one tab
This shows how a web page in your browser is using your local CPU and RAM to do animations and whatnot

Demo: See the Processes on your Computer
Process Manager

"Process" = a running program with space in RAM
Use an OS utility to see all the processes on your computer:
On the Mac: Applications > Utilities > Activity Monitor
On Windows: Task Manager
See dozens of processes, mostly idle
"%CPU" = the percentage of one core used by a process
Can kill a wayward process from in here BTW

Python Shields us from Hardware Details - Great!

Python allows us to write code to solve the problems we want without needing to know the details of the CPU and RAM. This is progress, much as it's useful to be able to ride a bicycle without knowing the details of, say, its wheel bearings. That said, here we will look at how CPU and RAM are used to get a feel for the whole picture.

Hardware Demo Program

Nick's Hardware Squandering Program!

> hardware-demo.zip

Demo: the computer is mostly idle to start. An idle CPU does not create much heat. When the CPU starts running hard, it generates heat, and often the laptop fan will start running to cool the CPU. This program is an infinite loop, see the code below. It uses 100% of one core. If the fan is running on your laptop, use Activity Monitor (Mac) or Task Manager (Windows) to see what programs are running, see CPU% and MEM%.

Core function of -cpu feature:

def use_cpu(n):
    """
    Infinite loop counting a variable 0, 1, 2...
    print a line every n (0 = no printing)
    """
    i = 0
    while True:
        if n != 0 and i % n == 0:
            print(i)
        i = i + 1

Try 1000 first ... yikes! Try 1 million instead. Type ctrl-c in the terminal to kill the process.

$ python3 hardware-demo.py -cpu 1000000
0
1000000
2000000
3000000
4000000
5000000
6000000
7000000
^CTraceback (most recent call last):
  File "hardware-demo.py", line 66, in <module>
    main()
  File "hardware-demo.py", line 56, in main
    use_cpu(n)
  File "hardware-demo.py", line 24, in use_cpu
    i = i + 1
KeyboardInterrupt

(ctrl-c to exit)
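If you want to experiment without an infinite loop, here is a minimal sketch of a bounded variant that counts to a fixed limit and times itself with the standard time module. The function count_to is hypothetical and not part of hardware-demo.py. The elapsed time depends entirely on your machine.

```python
import time


def count_to(limit):
    """
    Count a variable 0, 1, 2... up to limit, then stop.
    (Hypothetical bounded cousin of use_cpu(), for illustration.)
    """
    i = 0
    while i < limit:
        i = i + 1
    return i


start = time.perf_counter()
total = count_to(1000000)
elapsed = time.perf_counter() - start
print('counted to', total, 'in', elapsed, 'seconds')
```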

Run It Twice

Demo: Nick opens a second terminal. This needs to be done outside of PyCharm - see the Command Line chapter. Run a second copy of hardware-demo.py. Look in the process manager .. now see two programs running at once.

(optional) Let's Talk About RAM

When code reads and writes values, those values are stored in RAM. RAM is a big array of bytes, read and written by the CPU.

Say we have this code

n = 10
s = 'Hello'
lst = [1, 2, 3]
lst2 = lst

Every value in use by the program takes up space in RAM.
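One detail worth checking in the code above: the line lst2 = lst does not copy the list. Both variables refer to the same list value in RAM, so only one list's worth of space is used. A short sketch using the is operator, which tests whether two variables refer to the same value:

```python
n = 10
s = 'Hello'
lst = [1, 2, 3]
lst2 = lst

# "is" is True when both variables refer to the very same value in RAM.
same_list = lst2 is lst

# A change made through lst2 is visible through lst - there is only one list.
lst2.append(4)
print(same_list, lst)
```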

alt:python values each taking space in RAM

RAM

RAM - Random Access Memory
"random access" = can access any byte at will
Each Python value is stored using bytes in RAM

How Many Bytes does a Python Value Use?

Say each Python value has, approximately, 16 bytes of fixed overhead
Here's how it works out
An int value like 2561 takes 8 bytes + 16 overhead = 24 bytes
The string 'hello' - is 2 bytes per char (10) + 16 = 26 bytes
If the string were 100 chars long, that would be 200 + 16 = 216 bytes
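The figures above are deliberately approximate. If you are curious what your own Python reports, the standard sys.getsizeof() function returns the number of bytes a value uses. The exact numbers depend on the Python version and will likely differ from the round figures above, so this sketch prints them rather than promising particular values.

```python
import sys

int_size = sys.getsizeof(2561)          # bytes used by an int value
short_size = sys.getsizeof('hello')     # bytes used by a 5 char string
long_size = sys.getsizeof('h' * 100)    # bytes used by a 100 char string
print(int_size, short_size, long_size)
```

The one pattern that should hold is that the longer string uses more bytes than the shorter one.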

Demo using -mem. Look in Activity Monitor (Task Manager), "mem" area. 100 = 100 MB per second. Watch our program use more and more memory of the machine. Program exits .. not in the list any more! Fancy: try killing off the process from inside the process manager window.

$ python3 hardware-demo.py -mem 100
Memory MB: 100
Memory MB: 200
Memory MB: 300
Memory MB: 400
Memory MB: 500
Memory MB: 600
Memory MB: 700
^CTraceback (most recent call last):
...
KeyboardInterrupt
(ctrl-c to exit)

String - More Functions

See guide for details: Strings

Thus far we have done String 1.0: len, index numbers, [ ], in, upper, lower, isalpha, isdigit, slices, .find().

There are more functions. You should at least have an idea that these exist, so you can look them up if needed. The important strategy is: don't write code manually to do something a built-in function in Python will do for you. The most important functions you should have memorized, and the more rare ones you can look up.

s.startswith() s.endswith()

These are very convenient True/False tests for the specific case of checking if a substring appears at the start or end of a string. Also a pretty nice example of function naming.

>>> 'Python'.startswith('Py')
True
>>> 'Python'.startswith('Px')
False
>>> 'resume.html'.endswith('.html')
True

String - strip()

Removes whitespace chars from either end
Use inside "for line in f" to trim off the '\n'

>>> s = '   this and that\n'
>>> s.strip()
'this and that'
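A sketch of the loop use mentioned above. To keep the example self-contained, a plain list of strings stands in for the lines of a file (an assumption for the example; real file reading comes later).

```python
# Stand-in for the lines of a file, each ending with '\n'
lines = ['  this and that\n', 'roses are red\n', '\n']

cleaned = []
for line in lines:
    # strip() removes whitespace, including '\n', from both ends
    cleaned.append(line.strip())

print(cleaned)
```

Note that a line holding only '\n' strips down to the empty string.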

String - split()

Nice feature to parse a line of text
e.g. from a file line 11,45,19.2,N
str.split() -> list of strings
str.split(',') - split on ',' substring
str.split() - with zero parameters
a special form of split()
splits on 1 or more whitespace chars
combines multiple whitespace chars
handy, primitive way to get the "words" from a line
We'll re-visit this when we get to some appropriate file reading

>>> # Say read a line like this from file
>>> line = 'Smith,Astrid,112453,2022'
>>> parts = line.split(',')
>>> parts
['Smith', 'Astrid', '112453', '2022']  # split into parts
>>> parts[0]
'Smith'
>>> parts[2]
'112453'
>>>
>>> 'apple:banana:donut'.split(':')
['apple', 'banana', 'donut']
>>> 
>>> 'this    is     it\n'.split()  # special whitespace form
['this', 'is', 'it']
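The parts that split() returns are always strings, even when they look like numbers. Here is a sketch using the 11,45,19.2,N line from above, converting parts with int() and float(). What each field means is not stated in the notes, so the variable names here are placeholders.

```python
line = '11,45,19.2,N\n'

# strip() off the newline first, then split on commas
parts = line.strip().split(',')

first = int(parts[0])       # '11' -> 11
second = int(parts[1])      # '45' -> 45
third = float(parts[2])     # '19.2' -> 19.2
letter = parts[3]           # already a string, leave as is

print(parts)
print(first + second, third, letter)
```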

String - join()

Mentioning for completeness
Reverse of split()
Given list of strings, puts them together to make a big string
Mnemonic: str.split() and str.join()
The string is the noun in noun.verb form

>>> foods = ['apple', 'banana', 'donut']
>>> ':'.join(foods)
'apple:banana:donut'
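To see split() and join() as opposites, this sketch takes a string apart and puts it back together, then re-joins the same parts with a different separator.

```python
line = 'apple:banana:donut'

parts = line.split(':')         # string -> list of strings
rebuilt = ':'.join(parts)       # list of strings -> string again
with_commas = ','.join(parts)   # same parts, different separator

print(parts)
print(rebuilt)
print(with_commas)
```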

Recall: String + and str()

Recall
Use + and the str() function to put a string together

>>> name = 'Alice'
>>> score = 12
>>> name + ' got score:' + str(score)
'Alice got score:12'
>>>

Format String - New

Put a lowercase 'f' to the left of the string literal, making a specially treated "format" string. For each curly bracket {..} in the string, Python evaluates the expression within and pastes the resulting value into the string. Super handy! The expression has access to local variables. We do not need to call str() to convert to string, it's done automatically.

>>> name = 'Alice'
>>> 
>>> f'this is {name}'
'this is Alice'
>>> 
>>> score = 12
>>> f'{name} got score:{score}'
'Alice got score:12'
>>>
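A side-by-side sketch of the two approaches: the + and str() form and the format string form build the same string. The last line shows that the curly brackets can hold any expression, not just a variable name.

```python
name = 'Alice'
score = 12

# Old way: + and str()
msg_plus = name + ' got score:' + str(score)

# Format string: no str() call needed
msg_format = f'{name} got score:{score}'

# Any expression works inside the curly brackets
msg_expr = f'{name.upper()} doubled: {score * 2}'

print(msg_plus)
print(msg_format)
print(msg_expr)
```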

Optional: Limit Digits `{x:.4}`

Add ':.4' after the value in the curly braces to limit the number of significant digits printed. There are many other "format options", but this is the one I use the most by far.

>>> x = 2/3
>>> f'value: {x}'
'value: 0.6666666666666666'
>>> f'value: {x:.4}'
'value: 0.6667'

String Unicode

In the early days of computers, the ASCII character encoding was very common, encoding the roman a-z alphabet. ASCII is simple, and requires just 1 byte to store 1 character, but it has no ability to represent characters of other languages.

Each character in a Python string is a unicode character, so characters for all languages are supported. Also, many emoji have been added to unicode as a sort of character.

Every unicode character is defined by a unicode "code point" which is basically a big int value that uniquely identifies that character. Unicode characters can be written using the "hex" version of their code point, e.g. "03A3" is the "Sigma" char Σ, and "2665" is the heart emoji char ♥.

Hexadecimal aside: hexadecimal is a way of writing an int in base-16 using the digits 0-9 plus the letters A-F, like this: 7F9A or 7f9a. Two hex digits together like 9A or FF represent the value stored in one byte, so hex is a traditional easy way to write out the value of a byte. When you look up an emoji on the web, typically you will see the code point written out in hex, like 1F644, the eye-roll emoji 🙄.
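Python can convert between hex text and int values, which is handy for checking the hex arithmetic yourself. A short sketch using the built-in int() with base 16, the built-in hex(), and the 0x form for writing a hex number directly in code:

```python
value = int('7F9A', 16)      # parse hex text as base-16 -> 32666
byte_max = int('FF', 16)     # two hex digits = one byte, max value 255
text = hex(255)              # int back to hex text -> '0xff'
code_point = 0x1F644         # hex literal in code, the eye-roll code point

print(value, byte_max, text, code_point)
```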

You can write a unicode char out in a Python string with a \u followed by the 4 hex digits of its code point. Notice how each unicode char is just one more character in the string:

>>> s = 'hi \u03A3'
>>> s
'hi Σ'
>>> len(s)
4
>>> s[0]
'h'
>>> s[3]
'Σ'
>>>
>>> s = '\u03A9'  # upper case omega
>>> s
'Ω'
>>> s.lower()     # compute lowercase
'ω'
>>> s.isalpha()   # isalpha() knows about unicode
True
>>>
>>> 'I \u2665'
'I ♥'

For a code point with more than 4-hex-digits, use \U (uppercase U) followed by 8 digits with leading 0's as needed, like the fire emoji 1F525, and the inevitable 1F4A9.

>>> 'the place is on \U0001F525'
'the place is on 🔥'
>>> s = 'oh \U0001F4A9'
>>> len(s)
4
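To go between a character and its code point number, Python has the built-in functions ord() (char to int) and chr() (int to char). A sketch using the same characters as above:

```python
sigma = '\u03A3'
fire = '\U0001F525'

sigma_code = ord(sigma)         # char -> code point int, 931
sigma_hex = hex(sigma_code)     # same number in hex text, '0x3a3'
heart = chr(0x2665)             # code point int -> char
fire_hex = hex(ord(fire))       # '0x1f525'

print(sigma_code, sigma_hex, heart, fire_hex)
```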

Ethics of Generosity and Unicode

Generosity is Good

Your life goal is not to consume everything just for yourself
Part of your life is contributing to others
Most closely, your family
But the circle of people to contribute to extends outwards
Including people around the world you don't know
You do not need to live a vow of poverty
But there should be generosity to others in your life
Happiness research:
Being generous is a source of personal happiness

History of Unicode and Python

The history of ASCII and Unicode is an example of ethics.

ASCII

One byte per char, but only the a-z roman alphabet. Not so helpful for the non-English-speaking world.

In the early days of computing in the US, computers were designed with the ASCII character set, supporting only the roman a-z alphabet. This hurt the rest of the planet, which mostly doesn't write in English. There is a well known pattern where technology comes first in the developed world, is scaled up and becomes inexpensive, and then proliferates to the developing world. Computers in the US using ASCII hurt that technology pipeline. Choosing a US-only solution was the cheapest choice for the US in the moment, but made the technology hard to access for most of the world. This choice is somewhere between ungenerous and unethical.

Unicode Technology

Unicode takes 2-4 bytes per char, so it is more costly than ASCII.

Cost per byte aside, Unicode is a good solution - a freely available standard. If a system uses Unicode, it and its data can interoperate with the other Unicode compliant systems.
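The bytes-per-char figure is a rough one, since the exact count depends on which encoding is used to store the characters. As a sketch, assuming the common UTF-8 encoding (the notes do not name a specific encoding), the string .encode() function shows how many bytes each character takes: plain ASCII letters still take 1 byte, while other characters take 2 to 4.

```python
# An ASCII letter, Sigma, the heart, and the fire emoji
samples = ['a', '\u03A3', '\u2665', '\U0001F525']

byte_counts = []
for ch in samples:
    # encode() converts the char to its bytes in the UTF-8 encoding
    byte_counts.append(len(ch.encode('utf-8')))

print(byte_counts)
```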

Unicode vs. RAM Costs vs. Moore's Law

The cost of supporting non-ASCII data can be related to the cost of the RAM to store the unicode characters. In the 1960s every byte was literally expensive. An IBM model 360 could be leased for $5,000 per month, not inflation adjusted, and had about 32 kilobytes of RAM (not megabytes or gigabytes .. kilobytes!). So doing very approximate math, figuring RAM is half the cost of the computer, we get a cost of about $1 per byte per year.

>>> 5000 * 12 / (2 * 32000)
0.9375

So in the 1960s, Unicode is a non-starter. RAM is expensive.

RAM Costs Today

What does the RAM in your phone cost today? Say the RAM cost of your phone is $500 and it has 8GB of RAM. What is the cost per byte?

The figure 8 GB is 8 billion bytes. In Python, you can write that as 8e9 - like on your scientific calculator.

>>> 500 / 8e9   # 8 GB
6.25e-08
>>> 
>>> 500 / 8e9 * 100  # in pennies
6.2499999999999995e-06

RAM costs nothing today - 6 millionths of a cent per byte. This is the result of Moore's law. Exponential growth is incredible.

Unicode Makes Sense in 1990s

Sometime in the 1990s, RAM was cheap enough that spending 2-4 bytes per char (unicode) was not so bad compared to 1 byte per char (ASCII). The Unicode standard was created around this time. Unicode is a standard way of encoding chars in bytes, so that all the Unicode systems can transparently exchange data with each other.

With Unicode, the tech leaders were showing a little generosity to all the non-ASCII computer users out there in the world.

Generosity and Python Story

Python created by Guido van Rossum in 1991
From the Netherlands
Where they speak Dutch
What language encoding was used for Python?
Unicode, of course
Therefore, Python works with data in Palo Alto, Tokyo, Stockholm .. everywhere

With Unicode, there is a single Python language that can be used in every country - US, China, India, Netherlands.

A world of programmers contribute to Python as free, open source software. We all benefit from that community, vs. each country maintaining their own in-country programming language, which would be a crazy waste of duplicated effort.

Ethic: Generosity

So being generous is the right thing to do. But the story also shows that when you are generous to the world, that generosity may well come around and help you as well.