Python for Probability

We’ll hold two Python review sessions to get you up to speed on what you’ll need for the problem sets. They will be held after lecture on the Fridays of week 1 and week 2. The first session is a basic Python review, and the second session covers NumPy and other data science tools. You are welcome to attend whichever sessions you’ll find helpful.

Resources
Installing Python
1. Installing an editor (also called an IDE)
Introduction to Python
Installing Packages
1. Using Packages
Python for Data Scientists

Resources

Use the additional resources below to see some examples or try it yourself. If you get stuck or have additional questions, reach out on Ed or stop by office hours.

Replay (coming soon)
Colab Notebook

Installing Python

First, navigate to the Python website (python.org/downloads) and download the version for your operating system (Mac or Windows). Run the installer after downloading.

Validating your installation

To check if Python is installed, follow these steps:

Open a command prompt (Terminal on Mac, Windows Terminal on Windows)

Run Python (python3 on Mac, py on Windows) and hit enter.

(optional) Add the --version flag to show the installed version of Python (python3 --version on Mac, py --version on Windows)

Installing an editor (also called an IDE)

You can download either editor. We suggest VS Code because we find its setup and use smoother, but it is up to you!

VS Code

Download Visual Studio Code (VS Code) from code.visualstudio.com

Install the Python extension (Extensions -> Search -> Python) to get language features in the editor.

PyCharm

Download PyCharm Community Edition from jetbrains.com/pycharm/download

Important! Make sure it's the Community Edition; you may need to scroll down the page to find it.

Add an interpreter (Customize -> All Settings -> Python Interpreter -> Add Interpreter -> Add Local Interpreter -> System Interpreter). See the slides for screenshots.

Introduction to Python

Basic Syntax

To run Python:

In the terminal: run python (python3 on Mac, py on Windows) to open the interpreter.
In your editor (VS Code / PyCharm / something else), make a new file (my_file.py) and run it from the terminal with python my_file.py (python3 on Mac, py on Windows)

Print anything with print("Hello World") and comment with a hashtag (# this is my comment)

Style comment! Comments in your code make it easier to follow, and we love readable code

Variables

To create variables, we name a value with the assignment operator (the equals sign): x = 5 or my_var = "hi there!". Unlike some other languages, we don't need to specify the type of the variable (more on that in a second!).

If you do need to switch from one type to the other, we can use a cast (var = str(5) or num = int("10")). You can also get the type of the variable with type(my_var) (which is great for debugging with a print statement).
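Here is a small sketch of casting and type checking. The variable names are made up for illustration.

```python
# Casting between types; all names here are illustrative.
num_as_text = str(5)        # int -> str, gives "5"
text_as_num = int("10")     # str -> int, gives 10
ratio = float("2.5")        # str -> float, gives 2.5

# type() reports the type of a value, which helps when debugging.
print(type(num_as_text))    # <class 'str'>
print(type(text_as_num))    # <class 'int'>
print(text_as_num + 1)      # 11
```

Note that "5" + 1 would raise a TypeError, which is why the cast matters.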

Style comment! Use snake case for multi-word variable names! (e.g. my_favorite_number, best_python_variable)

Data Types

Primitives

Booleans: True or False
Integers: whole numbers (-5, 3, etc.)
Floats: decimal numbers (3.14, 4.0)
Strings: collections of characters ("hello world", "this is my string")

Collections

Lists: an ordered (not sorted) sequence of elements (["Apple", "Banana", "Clementine"])
- Zero-indexed (the first position is considered position 0)
- You can access and assign positions with brackets (my_list[0] = "Apricot")
Sets: an unordered collection of unique elements (no repeats) ({3, 4, 5})
Dictionaries: a data structure that stores key-value pairs (no repeat keys)
- ```
my_dict = {
    "apples": 3,
    "bananas": 4,
    "clementines": 5,
}
```
- You can access and assign values with brackets (my_dict["apples"] = 5)
- You can also add new key-value pairs (my_dict["dates"] = 6)

Useful functions for collections:

len(my_collection): returns the "size" of the collection
- Length of a list, number of elements in a set, number of keys in a dictionary, etc.
sum(my_collection): returns the sum of the elements in the collection
- Only works for collections of numbers (integers or floats)
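To try the three collection types together with len and sum, here is a short sketch. The variable names are illustrative.

```python
# A list is ordered and zero-indexed.
fruits = ["Apple", "Banana", "Clementine"]
fruits[0] = "Apricot"              # replace the first element
print(fruits[0], len(fruits))      # Apricot 3

# A set keeps only unique elements, so repeats are dropped.
unique_nums = {3, 4, 5, 5, 3}
print(len(unique_nums))            # 3
print(sum(unique_nums))            # 12

# A dictionary maps keys to values.
fruit_counts = {"apples": 3, "bananas": 4, "clementines": 5}
fruit_counts["apples"] = 5         # update an existing key
fruit_counts["dates"] = 6          # add a new key-value pair
print(len(fruit_counts))           # 4
print(sum(fruit_counts.values()))  # 5 + 4 + 5 + 6 = 20
```

Calling sum on a dictionary directly adds up its keys, so we use .values() to add up the counts.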

Booleans

A boolean is a value that's either True or False.

Note: Python uses the capitalized version - lowercase doesn't work

We can combine boolean values with boolean operators:

and: True if both values are True (a and b)
or: True if either (one or both) values are True (a or b)
not: Flips the value from True to False or False to True (not a)

We can get a boolean value either through its name (True, False) or through a comparison operator:

Less than: a < b
Less than or equal to: a <= b
Greater than: a > b
Greater than or equal to: a >= b
Equal to: a == b
Not equal to: a != b
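A quick sketch of comparisons and boolean operators working together, with example values chosen for illustration:

```python
a = 3
b = 7

is_smaller = a < b      # True
is_equal = a == b       # False

print(is_smaller and is_equal)   # False: both must be True
print(is_smaller or is_equal)    # True: at least one is True
print(not is_equal)              # True
```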

Arithmetic (Math) Operators

Addition: a + b
Subtraction: a - b
Multiplication: a * b
Division: a / b
Modulo (remainder from division): a % b
Exponentiation (raising to a power): a ** b
Floor division (divides, then rounds down to the nearest whole number): a // b
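The difference between the division-related operators is easiest to see side by side. The values below are just examples.

```python
a = 7
b = 2

print(a / b)     # 3.5  (regular division always gives a float)
print(a // b)    # 3    (floor division)
print(a % b)     # 1    (remainder)
print(a ** b)    # 49   (7 squared)

# Floor division rounds down (toward negative infinity), not toward zero.
print(-7 // 2)   # -4
```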

Variable Assignment Operators

Assign: a = 5
Add and assign: a += 5 (same as a = a + 5)
Subtract and assign: a -= 5 (same as a = a - 5)
Multiply and assign: a *= 5 (same as a = a * 5)
Divide and assign: a /= 5 (same as a = a / 5)
Modulo and assign: a %= 5 (same as a = a % 5)
Exponent and assign: a **= 5 (same as a = a ** 5)
Floor divide and assign: a //= 5 (same as a = a // 5)

Control Flow

A conditional statement is a block of code that runs if a condition is True. We use the if keyword to start a conditional statement.

if condition:
    print("This code runs if the condition is True")

We can also add an else block to run code if the condition is False.

if condition:
    print("This code runs if the condition is True")
else:
    print("This code runs if the condition is False")

We can also add an elif block to check multiple conditions.

if condition1:
    print("This code runs if condition1 is True")
elif condition2:
    print("This code runs if condition2 is True and condition1 is False")
else:
    print("This code runs if neither condition1 nor condition2 is True")

We can also loop through a block of code based on a condition using a while loop.

while condition:
    print("This code runs while the condition is True")

We can also loop through a block of code a specific number of times or over each item in a collection using a for loop.

To run something a specific number of times:

for i in range(5):
    print("This code runs 5 times")

To loop over each item in a collection:

for val in [1, 2, 3, 4, 5]:
    print(val)

We can also stop a loop early using the break keyword.
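For example, this sketch searches a list and stops at the first even number. The list and variable names are made up for illustration.

```python
# Search a list and stop at the first even number.
first_even = None
for val in [1, 3, 4, 5, 6]:
    if val % 2 == 0:
        first_even = val
        break   # leave the loop; 5 and 6 are never checked

print(first_even)   # 4
```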

Functions

We can create a function using the def keyword.

def my_func():
    print("Running my function")

We then call the function by using its name followed by parentheses.

my_func()

We can also pass arguments to a function. Notice that we can set default values for arguments.

def my_func_with_args(a, b=5, c=10):
    print(a, b, c)

We then call the function with the arguments in the parentheses and can specify which argument we're passing if we want to.

my_func_with_args(1, c=20)
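Here is a sketch of how default values get filled in. The function add_with_defaults is a hypothetical example, and it uses the return keyword (not covered above) to hand a value back to the caller.

```python
def add_with_defaults(a, b=5, c=10):
    # b and c fall back to 5 and 10 when the caller leaves them out.
    return a + b + c

print(add_with_defaults(1))          # 1 + 5 + 10 = 16
print(add_with_defaults(1, 2))       # 1 + 2 + 10 = 13
print(add_with_defaults(1, c=20))    # 1 + 5 + 20 = 26
```

Naming the argument (c=20) lets us skip b and keep its default.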

Installing Packages

Python can be augmented by adding packages that we can use in our code. In CS 109, we use 1 built-in package (math) and 3 external packages:

Pandas: data manipulation and analysis
SciPy: math and statistics algorithms
NumPy: data structures (matrices) and statistics

To install packages, we use pip, Python's package installer, from the terminal (use python3 on Mac or py on Windows in place of python):

python -m pip install pandas scipy numpy

Using Packages

You can import libraries in your files so you can use their built-in tools and functions in your code! At the top of your file, import all of the libraries you need (import math).

We use 'as' to give nicknames to the libraries to make them easier to use.

import numpy as np
import pandas as pd
import scipy.stats as stats

Python for Data Scientists

Reading Files

We can read files in Python using the built-in open function and then reading the file line by line.

with open("filename.csv") as f:
    for line in f:
        print(line)

However, Pandas is a much easier way to read files, especially CSVs (a spreadsheet format).

import pandas as pd

dataframe = pd.read_csv("filename.csv")
print(dataframe)

Spreadsheet Reading with Pandas

Let's start by reading in a CSV file with Pandas.

dataframe = pd.read_csv("filename.csv")

We can view the data in the dataframe:

dataframe.head(): shows the first 5 rows of the dataframe
dataframe.tail(): shows the last 5 rows of the dataframe
dataframe.columns: shows the column names of the dataframe

We can also read specific data from the dataframe:

A column: dataframe["column_name"]
A row by index: dataframe.iloc[index]
Multiple rows by slice: dataframe[start:end]

Or we can select data by label or by condition:

Rows matching an index label: dataframe.loc[label]
All rows of specific columns: dataframe.loc[:, ["A", "B"]]
All rows where a column meets a condition: dataframe[dataframe["column_name"] > 0]

We can also convert the dataframe to a numpy array:

data = dataframe.to_numpy()
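To try these operations without a CSV file on hand, here is a sketch that builds a small dataframe in memory. The column names and values are made up for illustration.

```python
import pandas as pd

# Hypothetical data standing in for the contents of a CSV file.
dataframe = pd.DataFrame({
    "name": ["Ann", "Bo", "Cy", "Di"],
    "score": [90, 72, 85, 60],
})

print(dataframe.head())           # first rows (all 4 here)
print(list(dataframe.columns))    # ['name', 'score']

scores = dataframe["score"]       # one column
second_row = dataframe.iloc[1]    # row at position 1
first_two = dataframe[0:2]        # rows at positions 0 and 1

# Keep only the rows whose score is above 80.
high = dataframe[dataframe["score"] > 80]
print(high)

data = dataframe.to_numpy()
print(data.shape)                 # (4, 2)
```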

The Math Library

Some useful functions in the math library:

Square root: math.sqrt(25)
Factorial: math.factorial(5)
Choose (n choose k): math.comb(5, 2) computes $\binom{5}{2}$
Natural exponent: math.exp(3) computes $e^3$
Natural logarithm: math.log(10) computes $\ln(10)$
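A short sketch that runs these functions and checks math.comb against the factorial formula $\binom{n}{k} = \frac{n!}{k!(n-k)!}$. Note that math.comb needs Python 3.8 or newer.

```python
import math

print(math.sqrt(25))        # 5.0
print(math.factorial(5))    # 120
print(math.comb(5, 2))      # 10

# n choose k from factorials: n! / (k! (n-k)!)
n, k = 5, 2
by_hand = math.factorial(n) // (math.factorial(k) * math.factorial(n - k))
print(by_hand)              # 10

# exp and log undo each other (up to floating-point rounding).
print(math.isclose(math.log(math.exp(3)), 3))   # True
```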

Matrices with NumPy

We can make an array with NumPy:

import numpy as np

my_array = np.array([[1, 2, 3],
                     [4, 5, 6]])

To examine the array we can:

Look at its size (called "shape"): my_array.shape
Index into it: my_array[0, 1] (row 0, column 1)
Slice it: my_array[0, :] (row 0, all columns)
Transpose it: my_array.T or my_array.transpose()

We can also do math with arrays (the four arithmetic operators below work element by element):

Add: my_array + my_array
Subtract: my_array - my_array
Multiply: my_array * my_array
Divide: my_array / my_array
Dot product: np.dot(my_array, my_array.T)
Matrix multiplication: np.matmul(my_array, my_array.T)
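The difference between my_array * my_array and matrix multiplication is worth seeing once. This sketch uses a 2-by-3 example array.

```python
import numpy as np

my_array = np.array([[1, 2, 3],
                     [4, 5, 6]])

print(my_array.shape)     # (2, 3)
print(my_array[0, 1])     # 2

# * multiplies matching entries, so the shape stays (2, 3).
elementwise = my_array * my_array
print(elementwise)        # [[ 1  4  9] [16 25 36]]

# matmul does row-by-column products: (2, 3) times (3, 2) gives (2, 2).
product = np.matmul(my_array, my_array.T)
print(product)            # [[14 32] [32 77]]
```

For 2-D arrays like this one, np.dot gives the same result as np.matmul.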

NumPy also lets us do some useful statistical operations on arrays:

Mean: my_array.mean()
Min: my_array.min()
Max: my_array.max()
Sum: my_array.sum()
Standard deviation: my_array.std()
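Here is a sketch of these operations on a small example array. The axis and ddof arguments are extras not covered above: axis picks rows or columns, and ddof=1 switches std from the population formula (the default) to the sample formula.

```python
import numpy as np

my_array = np.array([[1, 2, 3],
                     [4, 5, 6]])

print(my_array.mean())         # 3.5 (over all six entries)
print(my_array.sum())          # 21
print(my_array.min(), my_array.max())   # 1 6

print(my_array.mean(axis=0))   # column means: [2.5 3.5 4.5]
print(my_array.mean(axis=1))   # row means: [2. 5.]

population_sd = my_array.std()        # divides by n
sample_sd = my_array.std(ddof=1)      # divides by n - 1
print(population_sd, sample_sd)
```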

Statistics with SciPy

We'll typically use the stats module from scipy for statistical operations, especially with random variables. The stats module has a lot of built-in functions for common distributions that let us do PMFs, CDFs, mean, variance, standard deviation, and sampling.

For any SciPy random variable X, we can use the following functions:

PMF (discrete X): X.pmf(k) computes $P(X = k)$
PDF (continuous X): X.pdf(x) computes the density $f(x)$
CDF: X.cdf(k) computes $P(X \leq k)$
Mean: X.mean() computes $E[X]$
Variance: X.var() computes $Var(X)$
Standard deviation: X.std() computes $\sqrt{Var(X)}$
Sampling: X.rvs(size) samples size values from $X$

Note: make sure to import the stats module with import scipy.stats as stats

Binomial

Let $X \sim Bin(n, p)$ be a binomial random variable with $n$ trials and probability of success $p$. We can create X in Python with:

X = stats.binom(n, p)
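To see the random variable functions in action, here is a sketch with example parameters (10 fair coin flips). The random_state argument is an extra that fixes the seed so the samples are repeatable.

```python
import scipy.stats as stats

# Example parameters chosen for illustration: 10 fair coin flips.
n, p = 10, 0.5
X = stats.binom(n, p)

print(X.pmf(5))    # P(X = 5), about 0.246
print(X.cdf(5))    # P(X <= 5), about 0.623
print(X.mean())    # n * p = 5.0
print(X.var())     # n * p * (1 - p) = 2.5

# Draw 5 samples from X.
samples = X.rvs(size=5, random_state=0)
print(samples)
```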

Poisson

Let $X \sim Poi(\lambda)$ be a Poisson random variable with rate $\lambda$. We can create X in Python with:

X = stats.poisson(lam)  # lam holds the rate; lambda is a reserved word in Python

Geometric

Let $X \sim Geo(p)$ be a geometric random variable with probability of success $p$. We can create X in Python with:

X = stats.geom(p)

Normal

Let $X \sim N(\mu, \sigma^2)$ be a normal random variable with mean $\mu$ and variance $\sigma^2$. We can create X in Python with:

X = stats.norm(mu, sigma)

Important! The second parameter for the normal distribution is the standard deviation, not the variance.
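A sketch of the conversion from variance to standard deviation, using example values $\mu = 3$ and $\sigma^2 = 4$:

```python
import scipy.stats as stats

mu = 3
variance = 4
sigma = variance ** 0.5      # standard deviation is 2.0

X = stats.norm(mu, sigma)    # pass sigma, not the variance

print(X.var())               # 4.0
print(X.std())               # 2.0
print(X.cdf(mu))             # 0.5: half the probability lies below the mean

# P(mu - sigma <= X <= mu + sigma), about 0.683
within_one_sd = X.cdf(mu + sigma) - X.cdf(mu - sigma)
print(within_one_sd)
```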

Exponential

Let $X \sim Exp(\lambda)$ be an exponential random variable with rate $\lambda$. We can create X in Python with:

X = stats.expon(scale=1/lam)  # lam holds the rate; lambda is a reserved word in Python

Important! The parameter for the exponential distribution is the "scale" ($1/\lambda$) instead of the rate.
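As a check on the scale parameter, this sketch uses an example rate of 2 and compares SciPy's CDF with the formula $1 - e^{-\lambda x}$. The rate is stored in a variable named lam because lambda is a reserved word in Python.

```python
import math
import scipy.stats as stats

lam = 2.0                          # example rate
X = stats.expon(scale=1 / lam)

print(X.mean())                    # 1 / lam = 0.5
print(X.var())                     # 1 / lam**2 = 0.25

# The CDF should match 1 - e^(-lam * x).
x = 1.0
by_formula = 1 - math.exp(-lam * x)
print(X.cdf(x), by_formula)
```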

Uniform

Let $X \sim Uni(a, b)$ be a uniform random variable on the interval $[a, b]$. We can create X in Python with:

X = stats.uniform(a, b-a)

Important! The second parameter for the uniform distribution is the width of the interval, not the upper bound.
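This sketch shows what goes wrong if the upper bound is passed instead of the width, using the example interval $[2, 5]$:

```python
import scipy.stats as stats

a, b = 2, 5
X = stats.uniform(a, b - a)    # start at 2, width 3: the interval [2, 5]

print(X.mean())                # (a + b) / 2 = 3.5
print(X.cdf(b))                # 1.0: all the probability is at or below 5
print(X.pdf(3))                # 1 / (b - a), about 0.333

# Passing b as the second parameter gives the interval [2, 7] instead.
wrong = stats.uniform(a, b)
print(wrong.mean())            # 4.5
```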

Beta

Let $X \sim Beta(\alpha, \beta)$ be a beta random variable with shape parameters $\alpha$ and $\beta$. We can create X in Python with:

X = stats.beta(alpha, beta)