# Python Tutorial

## Collections

Python has several built-in types that are useful for storing and manipulating data: list, tuple, dict. Here is the official Python documentation on these types (and many others): https://docs.python.org/3/library/stdtypes.html.

### Lists

Lists are mutable arrays. Let's see how they work.

In [1]:
names = ["Zach", "Jay"]

In [2]:
# Access an element by its index (indices start at 0)
print(names[0])

Zach


In [3]:
# Append to list (appends to end of list)
names.append("Richard")
print(names)

['Zach', 'Jay', 'Richard']


In [4]:
# Get length of list
print(len(names))

3


In [5]:
# Concatenate two lists
# += extends the list in place; the resulting contents match names = names + [...]
# (augmented assignment also exists for -, *, / and works on other types, such as numbers)
names += ["Abi", "Kevin"]
print(names)

['Zach', 'Jay', 'Richard', 'Abi', 'Kevin']
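
The difference between `+=` and `+` on lists only shows up when a second name refers to the same list. The sketch below illustrates this; the variable names (`a`, `alias`, `b`, `other`) are made up for the example.

```python
# In-place extension: both names still refer to the same list object
a = [1, 2]
alias = a
a += [3]
print(alias)   # [1, 2, 3]

# Concatenation builds a new list and rebinds b; the old list is untouched
b = [1, 2]
other = b
b = b + [3]
print(other)   # [1, 2]
print(b)       # [1, 2, 3]
```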


In [6]:
# Two ways to create an empty list
more_names = []
more_names = list()

In [7]:
# A list can hold elements of different data types
stuff = [1, ["hi", "bye"], -0.12, None]
print(stuff)

[1, ['hi', 'bye'], -0.12, None]


Slicing returns a new list containing a range of elements from the original list.

In [8]:
numbers = [0, 1, 2, 3, 4, 5, 6]

# Slices from start index (inclusive) to end index (exclusive)
print(numbers[0:3])

[0, 1, 2]


In [9]:
# When start index is not specified, it is start of list
# When end index is not specified, it is end of list
print(numbers[:3])
print(numbers[5:])

[0, 1, 2]
[5, 6]


In [10]:
# A bare : selects every element along a dimension; this is especially useful with numpy arrays
print(numbers[:])

[0, 1, 2, 3, 4, 5, 6]


In [11]:
# Negative indices count from the end of the list (-1 is the last element)
print(numbers[-1])
print(numbers[-3:])
print(numbers[3:-2])

6
[4, 5, 6]
[3, 4]
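
Two further properties of list slices may be worth knowing, although the cells above do not cover them: a slice accepts an optional third `step` value, and slicing a list produces a new list rather than a window onto the original. A small sketch (the name `copy_of_numbers` is just for illustration):

```python
numbers = [0, 1, 2, 3, 4, 5, 6]

# start:stop:step, where step defaults to 1
print(numbers[::2])     # [0, 2, 4, 6]
print(numbers[1:6:2])   # [1, 3, 5]
print(numbers[::-1])    # [6, 5, 4, 3, 2, 1, 0]

# A slice of a list is a new list, so editing it leaves the original unchanged
copy_of_numbers = numbers[:]
copy_of_numbers[0] = 99
print(numbers[0])       # 0
```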


### Tuples

Tuples are immutable arrays. Let's see how they work.

In [12]:
# Use parentheses for tuples, square brackets for lists
names = ("Zach", "Jay")

In [13]:
# The syntax for accessing an element and getting the length is the same as for lists
print(names[0])
print(len(names))

Zach
2


In [14]:
# But unlike lists, tuples do not support item re-assignment
names[0] = "Richard"

TypeError: 'tuple' object does not support item assignment

In [15]:
# Create an empty tuple
empty = tuple()
print(empty)

# Create a tuple with a single item, the comma is important
single = (10,)
print(single)

()
(10,)
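
Because a tuple cannot be changed in place, the usual workaround is to build a new tuple from pieces of the old one. The following is a minimal sketch of that idea; `updated`, `first`, and `second` are illustrative names.

```python
names = ("Zach", "Jay")

# Item assignment raises TypeError, so catch it to keep the script running
try:
    names[0] = "Richard"
except TypeError as err:
    print("TypeError:", err)

# Build a new tuple from a one-element tuple and a slice of the old one
updated = ("Richard",) + names[1:]
print(updated)   # ('Richard', 'Jay')
print(names)     # ('Zach', 'Jay')

# Tuples can be unpacked into separate variables
first, second = updated
print(first, second)   # Richard Jay
```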


### Dictionaries

Dictionaries are hash maps: they store key-value pairs and look up values by key. Let's see how they work.

In [16]:
# Two ways to create an empty dictionary
phonebook = {}
phonebook = dict()

In [17]:
# Create dictionary with one item 
phonebook = {"Zach": "12-37"}
# Add another item
phonebook["Jay"] = "34-23"

In [18]:
# Check if a key is in the dictionary
print("Zach" in phonebook)
print("Kevin" in phonebook)

True
False


In [19]:
# Get corresponding value for a key
print(phonebook["Jay"])

34-23


In [20]:
# Delete an item
del phonebook["Zach"]
print(phonebook)

{'Jay': '34-23'}
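
Looking up a key that is not present behaves differently depending on how you ask. The sketch below shows the standard `dict` methods `get` and `pop` next to plain indexing; the `"unknown"` default is an arbitrary choice for the example.

```python
phonebook = {"Zach": "12-37", "Jay": "34-23"}

# Indexing with a missing key raises KeyError
try:
    phonebook["Kevin"]
except KeyError as err:
    print("missing key:", err)

# get returns None, or a supplied default, instead of raising
print(phonebook.get("Kevin"))              # None
print(phonebook.get("Kevin", "unknown"))   # unknown

# pop removes a key and returns its value
number = phonebook.pop("Zach")
print(number)      # 12-37
print(phonebook)   # {'Jay': '34-23'}
```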


## Loops

In [21]:
# Basic for loop
for i in range(5):
    print(i)

0
1
2
3
4


In [22]:
# To iterate over a list
names = ["Zach", "Jay", "Richard"]
for name in names:
    print(name)

Zach
Jay
Richard


In [23]:
# To iterate over indices and values in a list
# Way 1
for i in range(len(names)):
    print(i, names[i])
    
print("---")

# Way 2
for i, name in enumerate(names):
    print(i, name)

0 Zach
1 Jay
2 Richard
---
0 Zach
1 Jay
2 Richard


In [24]:
# To iterate over a dictionary
phonebook = {"Zach": "12-37", "Jay": "34-23"}

# Iterate over keys
for name in phonebook:
    print(name)

print("---")

# Iterate over values
for number in phonebook.values():
    print(number)

print("---")

# Iterate over keys and values
for name, number in phonebook.items():
    print(name, number)

Zach
Jay
---
12-37
34-23
---
Zach 12-37
Jay 34-23


## NumPy
NumPy is a Python library that adds support for large, multi-dimensional arrays and matrices, along with a large collection of optimized, high-level mathematical functions to operate on these arrays.

You may need to install numpy before importing it in the next cell.

There are many ways to manage your packages, but the workflow we suggest for this class is to use Anaconda.
 - Download Anaconda. Create a conda environment when you work on a new project.
 - Activate your conda environment and install libraries using conda or pip if they are not available in conda.
 - If you are running scripts on command line, run inside your conda environment.
 - If you are using a Jupyter notebook, add your conda environment to your Jupyter notebook: https://towardsdatascience.com/get-your-conda-environment-to-show-in-jupyter-notebooks-the-easy-way-17010b76e874. Create your Jupyter notebook and verify you're in your conda environment kernel (top right of notebook should display the name). If you're not, go to the Kernel tab on the top left and click Change kernel to change to your conda environment kernel.

In [25]:
# Import numpy
import numpy as np

In [26]:
# Create numpy arrays from lists
x = np.array([1,2,3]) 
y = np.array([[3,4,5]]) 
z = np.array([[6,7],[8,9]])

# Let's take a look at their shapes.
# When working with numpy arrays, .shape will be a very useful debugging tool 
print(x.shape)
print(y.shape)
print(z.shape)

(3,)
(1, 3)
(2, 2)


Vectors can be represented as 1-D arrays of shape (N,) or 2-D arrays of shape (N, 1) or (1, N). Note, however, that the shapes (N,), (N, 1), and (1, N) are not the same and may result in different behavior (we'll see some examples below involving matrix multiplication and broadcasting).

Matrices are generally represented as 2-D arrays of shape (M, N).

The best way to ensure your code gives you the behavior you expect is to keep track of your array shapes and try out small test cases or refer back to documentation when you are unsure.
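
As a quick illustration of how the three vector shapes differ, the sketch below converts one 1-D array into a column and a row and checks a few operations. The names `col` and `row` are illustrative only.

```python
import numpy as np

v = np.array([1, 2, 3])     # shape (3,)
col = v.reshape(-1, 1)      # shape (3, 1), a column
row = v[np.newaxis, :]      # shape (1, 3), a row
print(v.shape, col.shape, row.shape)

# Transposing a 1-D array does nothing; transposing a column gives a row
print(v.T.shape)            # (3,)
print(col.T.shape)          # (1, 3)

# The same addition gives different shapes depending on the operands
print((v + v).shape)        # (3,)
print((v + col).shape)      # (3, 3)

# ravel flattens a column or row back to 1-D
print(col.ravel().shape)    # (3,)
```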

### Array Operations

There are many NumPy operations that can be used to reduce a numpy array along an axis.

Let's look at the np.max operation (documentation: https://numpy.org/doc/stable/reference/generated/numpy.ndarray.max.html).

In [27]:
x = np.array([[1,2],[3,4], [5, 6]]) 

In [28]:
print(np.max(x, axis = 1))

[2 4 6]


In [29]:
print(np.max(x, axis = 1, keepdims = True))

[[2]
 [4]
 [6]]
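
To make the `axis` argument more concrete, here is a small sketch that reduces the same array along each axis. It uses `np.sum` in addition to `np.max`; the centering step at the end is just one example of why `keepdims=True` can be convenient.

```python
import numpy as np

x = np.array([[1, 2], [3, 4], [5, 6]])   # shape (3, 2)

# axis=0 collapses the rows, leaving one value per column
col_max = np.max(x, axis=0)              # shape (2,)
col_sum = np.sum(x, axis=0)              # shape (2,)

# axis=1 collapses the columns, leaving one value per row
row_sum = np.sum(x, axis=1)              # shape (3,)

# With no axis, the reduction runs over every element
overall_max = np.max(x)

# keepdims=True keeps the reduced axis with length 1, so the result
# broadcasts against the original array
row_max = np.max(x, axis=1, keepdims=True)   # shape (3, 1)
shifted = x - row_max                        # shape (3, 2)

print(col_max, col_sum, row_sum, overall_max)
print(shifted)
```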


Next, let's look at some matrix operations. Let's take an element-wise product (Hadamard product).

In [30]:
A = np.array([[1, 2], [3, 4]])
B = np.array([[3, 3], [3, 3]])
print(A)
print(B)
print("---")
print(A * B)

[[1 2]
 [3 4]]
[[3 3]
 [3 3]]
---
[[ 3  6]
 [ 9 12]]


We can do matrix multiplication with np.matmul or @.

In [31]:
# One way to do matrix multiplication
print(np.matmul(A, B))

# Another way to do matrix multiplication
print(A @ B)

[[ 9  9]
 [21 21]]
[[ 9  9]
 [21 21]]


We can compute a dot product or a matrix-vector product with np.dot.

In [32]:
u = np.array([1, 2, 3])
v = np.array([1, 10, 100])

print(np.dot(u, v))

# Many numpy operations are also available as array methods, which is useful for chaining operations
print(u.dot(v))

321
321


In [33]:
W = np.array([[1, 2], [3, 4], [5, 6]])
print(v.shape)
print(W.shape)

# This works.
print(np.dot(v, W))
print(np.dot(v, W).shape)

(3,)
(3, 2)
[531 642]
(2,)


In [34]:
# This does not. Why?
print(np.dot(W, v))

ValueError: shapes (3,2) and (3,) not aligned: 2 (dim 1) != 3 (dim 0)

In [35]:
# We can fix the above issue by transposing W.
print(np.dot(W.T, v))
print(np.dot(W.T, v).shape)

[531 642]
(2,)
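
The error above comes from a shape mismatch: for a 2-D array times a 1-D array, the last axis of the first operand must have the same length as the vector. The sketch below repeats the three cases with the shapes written out; `left`, `right`, and `col` are illustrative names.

```python
import numpy as np

W = np.array([[1, 2], [3, 4], [5, 6]])   # shape (3, 2)
v = np.array([1, 10, 100])               # shape (3,)

# (3, 2) . (3,): the inner sizes 2 and 3 differ, so this raises
try:
    np.dot(W, v)
except ValueError as err:
    print("ValueError:", err)

left = np.dot(v, W)       # (3,) . (3, 2) -> (2,)
right = np.dot(W.T, v)    # (2, 3) . (3,) -> (2,)
print(left, right)

# With an explicit column vector the result stays 2-D
col = v.reshape(-1, 1)           # shape (3, 1)
print(np.dot(W.T, col).shape)    # (2, 1)
```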


### Indexing

Slicing / indexing numpy arrays is an extension of the Python concept of slicing (lists) to N dimensions.

In [36]:
x = np.random.random((3, 4))

# Selects all of x
print(x[:])

[[0.67967409 0.7503561  0.2819389  0.47239277]
 [0.8377827  0.91115093 0.57307322 0.2862079 ]
 [0.60423802 0.33463655 0.97074304 0.39552708]]


In [37]:
# Selects the 0th and 2nd rows 
print(x[np.array([0, 2]), :])

print("---")

# Selects the 1st row as a 1-D vector, then its 1st through 2nd elements (counting from 0)
print(x[1, 1:3])

[[0.67967409 0.7503561  0.2819389  0.47239277]
 [0.60423802 0.33463655 0.97074304 0.39552708]]
---
[0.91115093 0.57307322]


In [38]:
# Boolean indexing
print(x[x > 0.5])

[0.67967409 0.7503561  0.8377827  0.91115093 0.57307322 0.60423802
 0.97074304]


In [39]:
# 3-D array of shape (3, 4, 1)
print(x[:, :, np.newaxis])

[[[0.67967409]
  [0.7503561 ]
  [0.2819389 ]
  [0.47239277]]

 [[0.8377827 ]
  [0.91115093]
  [0.57307322]
  [0.2862079 ]]

 [[0.60423802]
  [0.33463655]
  [0.97074304]
  [0.39552708]]]
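
One detail the cells above do not show: basic slicing of a numpy array returns a view that shares memory with the original, while integer-array and boolean indexing return copies. The sketch below demonstrates this on a small integer array (chosen instead of random values so the results are predictable); `view` and `picked` are illustrative names.

```python
import numpy as np

x = np.arange(12).reshape(3, 4)

# Basic slicing gives a view: writing to it changes x
view = x[1, 1:3]
view[0] = -1
print(x[1, 1])    # -1

# Integer-array indexing gives a copy: writing to it leaves x alone
picked = x[np.array([0, 2]), :]
picked[0, 0] = 100
print(x[0, 0])    # 0

# np.shares_memory confirms which results alias the original buffer
print(np.shares_memory(x, view))     # True
print(np.shares_memory(x, picked))   # False
```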


### Broadcasting

The term broadcasting describes how NumPy treats arrays with different shapes during arithmetic operations.

**General Broadcasting Rules**

When operating on two arrays, NumPy compares their shapes element-wise. It starts with the trailing (i.e. rightmost) dimensions and works its way left. Two dimensions are compatible when:
- they are equal, or
- one of them is 1 (in which case that array's elements are repeated along that dimension)

More details: https://numpy.org/doc/stable/user/basics.broadcasting.html
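
The rules above can be written out as a short function. The helper below, `broadcast_shape`, is a hypothetical illustration and not part of NumPy; it also relies on the fact that a shape with fewer dimensions is treated as if it were padded with 1s on the left. The result is compared against the shape NumPy reports through `np.broadcast`.

```python
import numpy as np

def broadcast_shape(shape_a, shape_b):
    """Hypothetical helper: combine two shapes using the broadcasting rules."""
    result = []
    # Walk from the trailing dimension leftwards; missing dimensions count as 1
    for i in range(1, max(len(shape_a), len(shape_b)) + 1):
        dim_a = shape_a[-i] if i <= len(shape_a) else 1
        dim_b = shape_b[-i] if i <= len(shape_b) else 1
        if dim_a == dim_b or dim_b == 1:
            result.append(dim_a)
        elif dim_a == 1:
            result.append(dim_b)
        else:
            raise ValueError(f"incompatible shapes {shape_a} and {shape_b}")
    return tuple(reversed(result))

for sa, sb in [((3, 4), (3, 1)), ((3, 4), (1, 4)), ((3, 1), (1, 3)), ((3, 1), (3,))]:
    ours = broadcast_shape(sa, sb)
    numpys = np.broadcast(np.empty(sa), np.empty(sb)).shape
    print(sa, sb, "->", ours, numpys)

# Trailing dimensions 4 and 3 are neither equal nor 1, so this pair is rejected
try:
    broadcast_shape((3, 4), (3,))
except ValueError as err:
    print("ValueError:", err)
```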

In [40]:
x = np.random.random((3, 4))
y = np.random.random((3, 1))
z = np.random.random((1, 4))

# In this example, y and z are broadcast to match the shape of x.
# y is broadcast along dim 1.
s = x + y
# z is broadcast along dim 0.
p = x * z
p = x * z

In [41]:
print(x.shape)
print(y.shape)
print(s.shape)

(3, 4)
(3, 1)
(3, 4)


In [42]:
print(x.shape)
print(z.shape)
print(p.shape)

(3, 4)
(1, 4)
(3, 4)


Let's look at a more complex example.

In [43]:
a = np.random.random((3, 4))
b = np.random.random((3, 1))
c = np.random.random((3, ))

What is the expected broadcasting behavior for these operations? What do the following operations give us? What are the resulting shapes?

In [44]:
result1 = b + b.T

print(b.shape)
print(b.T.shape)
print(result1.shape)
print(result1)

(3, 1)
(1, 3)
(3, 3)
[[1.5683191  1.32822013 1.16548899]
 [1.32822013 1.08812117 0.92539003]
 [1.16548899 0.92539003 0.76265888]]


In [45]:
result2 = a + c

print(a.shape)
print(c.shape)
print(result2.shape)
print(result2)

ValueError: operands could not be broadcast together with shapes (3,4) (3,) 

In [46]:
result3 = b + c

print(b.shape)
print(c.shape)
print(result3.shape)
print(result3)

(3, 1)
(3,)
(3, 3)
[[1.7830505  0.91483839 1.21397415]
 [1.54295154 0.67473942 0.97387519]
 [1.3802204  0.51200828 0.81114405]]
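
The results above follow directly from the broadcasting rules. The sketch below restates the reasoning in comments and shows one way to make `a + c` work, assuming the intent is to add `c[i]` to every element in row `i` of `a` (the original does not say which behavior is wanted). The name `per_row` is illustrative.

```python
import numpy as np

a = np.random.random((3, 4))
b = np.random.random((3, 1))
c = np.random.random((3,))

# a + c fails: the trailing dimensions are 4 and 3, and neither is 1
try:
    a + c
except ValueError as err:
    print("ValueError:", err)

# b + c works: c is treated as shape (1, 3), so (3, 1) + (1, 3) -> (3, 3)
print((b + c).shape)

# Turning c into a column makes the shapes compatible: (3, 4) + (3, 1) -> (3, 4)
per_row = a + c[:, np.newaxis]
print(per_row.shape)
```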


### Efficient NumPy Code

When working with numpy arrays, avoid explicit for-loops over indices/axes wherever possible. Explicit Python loops can slow your code down dramatically (often ~10-100x).

We can time code using the %%timeit magic. Let's compare an explicit for-loop with the equivalent numpy operation.

In [47]:
%%timeit
x = np.random.rand(1000, 1000)
for i in range(100, 1000):
    for j in range(x.shape[1]): 
        x[i, j] += 5

334 ms ± 1.16 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [48]:
%%timeit
x = np.random.rand(1000, 1000)
x[np.arange(100,1000), :] += 5 

6.23 ms ± 28.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
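
Outside a notebook, the `%%timeit` magic is not available, but the standard-library `timeit` module can make a similar comparison. The sketch below is one way to do it, under a few assumptions: the function names are made up, the array is smaller than the one above so the script finishes quickly, and a plain slice `x[100:, :]` stands in for the `np.arange` index. Absolute timings will vary by machine.

```python
import timeit
import numpy as np

def add_with_loops(x):
    # Explicit Python loops over rows 100 onwards and every column
    for i in range(100, x.shape[0]):
        for j in range(x.shape[1]):
            x[i, j] += 5
    return x

def add_vectorized(x):
    # One slice assignment covering the same rows
    x[100:, :] += 5
    return x

base = np.random.rand(300, 300)

# Both versions should produce the same array
looped = add_with_loops(base.copy())
vectorized = add_vectorized(base.copy())
print(np.allclose(looped, vectorized))

# Time each version on a fresh copy so the runs do not affect each other
loop_time = timeit.timeit(lambda: add_with_loops(base.copy()), number=3)
vec_time = timeit.timeit(lambda: add_vectorized(base.copy()), number=3)
print(f"loops: {loop_time:.4f} s, vectorized: {vec_time:.4f} s")
```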
