CS 168: The Modern Algorithmic Toolbox
- 6/3: Because of the shifted due dates for the miniproject, I will not be having office hours today. Instead, I'll have office hours Wednesday 11-noon, as well as some extra hours Friday 3-4pm.
- Here is mini-project 9, with
It is due Thursday, June 6. Mini-project #9 solution template is available.
- 5/27: Memorial Day: No class today, and I will not hold office hours today (out of town).
- 5/24: Here
is last year's final exam. This year's will have a similar format. [You will be allowed to bring a single 2-sided page of notes to the exam, provided you make the notes yourself.]
- 5/22: Mini-project #8 is due Wednesday the 29th at midnight, not Tuesday.
- 5/21: Mini-project #8 solution template is available.
- 5/20: Mini-project #8 is available. Here is the laurel_yanny.wav audio clip for Part 3, and a zipped folder worldsYikes.zip containing examples that we created illustrating that the 30-point research-oriented bonus part is not impossible : ) The miniproject is due Wednesday, May 29th at 11:59pm.
- 5/17: For the rest of the quarter, Jeff's office hours (Thurs, 3-5pm) will be held in Gates B26.
- 5/13: Mini-project #7 is available. Here is the data
file of national parks for Part 2. For Part 3, the QWOP scripts are available for MATLAB here and Python here. It is due Tuesday, May 21st at 11:59pm.
- 5/6: Clarification on Mini-project #5, parts 1d and 1e: The word-embeddings should be 100-dimensional vectors, corresponding to the rows of the 10000x100 matrix U whose columns are the first/top 100 left singular vectors of the scaled co-occurrence. (If you use the untruncated SVD, and your vectors have dimension 10k, the results will be similar, but noisier because the bottom singular vectors are essentially just random noise.) Sorry for the confusion, and the very late clarification. [If you already did this part, your grade won't be affected, though the results won't be as surprising/compelling.]
- 5/6: Mini-project #6 is available.
It is due Tuesday, May 14th at 11:59pm. Here is the data
file for Part 2.
- 4/29: Mini-project #5 is available. It is
due Tuesday, May 7th at 11:59pm. The data files you will need are: dictionary.txt, co_occur.csv, analogy_task.txt, and p5_image.gif.
- 4/29: Greg's 3-4pm OH are canceled. Sorry for the inconvenience.
- 4/22: Mini-project #4 is available. It is
due Tuesday, April 30th at 11:59pm. Here is the data set for Part 1.
- 4/15: Mini-project #3 is available. It is due Tuesday, April 23rd at 11:59pm.
- 4/9: Mini-project #2 is available.
It is due Tuesday, April 16th at 11:59pm. The dataset is available here.
- Here is some starter code for drawing
heatmaps in Python (and/or check Stack Overflow).
- 4/9: Here
is a review (courtesy of EE263) of the most basic aspects of vectors and matrices (e.g., how
to multiply them). Most relevant for CS168 are pages 1--9 (up to "Block
matrices and submatrices") and the "Linear functions" and "Linear
equations" sections on pages 10--11.
- 4/8: Mini-project #2 will be posted in the morning....
- 4/4: TAs' office hours are posted below. Melody will hold the first office hour Friday, 3:45-5:45pm, in Gates 119.
- 4/1: Mini-project #1 is available.
It is due Tuesday, April 9th (at 11:59pm). Please submit via Gradescope, entry code 9NYKNZ.
- Here is some starter code for drawing
histograms in Python (and/or check Stack Overflow).
- 4/1: First class, 1:30-2:50pm in 420-040. Welcome to CS168!
Gregory Valiant (Office hours: Mon 3-4pm, Gates 470. Email: last name at stanford.edu).
(Office hours: Thurs 3-5pm, Gates B26,
Email: jbarratt at stanford.edu).
(Office hours: Mon 4-6pm, Gates 498,
(Office hours: Fri 3:45-5:45, Gates 159 (except for 5/10, in 104),
Email: mguan at stanford.edu).
(Office hours: Wed 5:30-7:30, Gates 159,
Time/location: 1:30 - 2:50pm Mon/Wed in 420-040.
Piazza site: Here.
Prerequisites: CS107 and CS161, or permission from the instructor.
This course will provide a rigorous and hands-on
introduction to the central ideas and algorithms that constitute the
core of the modern algorithms toolkit. Emphasis will be on
understanding the high-level theoretical intuitions and principles
underlying the algorithms we discuss, as well as developing a concrete
understanding of when and how to implement and apply the algorithms.
The course will be structured as a sequence of one-week
investigations; each week will introduce one algorithmic idea, and
discuss the motivation, theoretical underpinning, and practical
applications of that algorithmic idea. Each topic will be accompanied
by a mini-project in which students will be guided through a practical
application of the ideas of the week. Topics include modern techniques
in hashing, dimension reduction, linear and convex programming,
gradient descent and regression, sampling and estimation, compressive
sensing, and linear-algebraic techniques (principal components
analysis, singular value decomposition, spectral techniques).
Week 1: Modern Hashing
- Lecture 1 (Mon 4/1): Course introduction. Consistent hashing.
- Lecture 2 (Wed 4/3):
Property-preserving lossy compression.
From majority elements to approximate heavy hitters.
From bloom filters to the count-min sketch.
Week 2: Data with Distances (Similarity Search, Nearest Neighbor,
Dimension Reduction, LSH)
- Lecture 3 (Mon 4/8):
(Dis)similarity metrics: Jaccard, Euclidean, Lp.
Efficient algorithm for finding similar elements in small/medium (ie. <20)
dimensions using k-d-trees.
- Lecture 4 (Wed 4/10):
Curse of Dimensionality, kissing number.
Estimating Jaccard similarity using MinHash.
JL dimensionality reduction.
- A nice survey of "kissing number", and some other strange phenomena from high dimensional spaces:
Kissing Numbers, Sphere Packings, and some Unexpected Proofs (from 2000).
- Origins of MinHash at Alta Vista:
and Filtering Near-Duplicate
Documents (from 2000).
- Ailon/Chazelle, Faster
Dimension Reduction, CACM '10.
Hashing Algorithms for Approximate Nearest Neighbor in High
Dimensions, CACM '08.
- For much more on LSH, see this chapter of
the CS246 textbook (by Leskovec, Rajaraman, and Ullman).
Week 3: Generalization and Regularization
- Lecture 5 (Mon 4/15):
Generalization (or, how much data is enough?).
Learning an unknown function from samples from an unknown distribution.
Training error vs. test error. PAC guarantees for linear classifiers. Empirical risk minimization.
- Lecture 6 (Wed 4/17): Regularization. The polynomial embedding and random projection, L2 regularization, and L1 regularization as a computationally tractable surrogate for L0 regularization.
- A recent paper arguing that, to understand why deep learning works, we need to rethink the theory of generalization. This paper is quite controversial, with one camp thinking that its conclusions are completely obvious, and the other camp thinking that it is revealing an extremely deep mystery. You decide for yourself! Paper is here.
Week 4: Linear-Algebraic Techniques:
Understanding Principal Components Analysis
- Lecture 7 (Mon 4/22):
Understanding Principal Component Analysis (PCA).
Minimizing squared distances equals maximizing variance.
Use cases for data visualization and data compression.
Failure modes for PCA.
- A nice exposition by 23andMe of the fact that the top 2 principal components of genetic SNP data of Europeans essentially recovers the geography of europe: nice exposition w. figures. Original Nature paper: Genes mirror geography in Europe, Nature, Aug. 2008.
(see also this blog post)
- There's tons of PCA tutorials floating around the Web (some good, some
not so good), which you are also permitted to refer to.
- Lecture 8 (Wed 4/24):
How PCA works. Maximizing variance as finding the
"direction of maximum stretch" of the covariance matrix.
The simple geometry of "diagonals in disguise."
The power iteration algorithm.
Week 5: Linear-Algebraic Techniques: Understanding the Singular Value Decomposition
- Lecture 9 (Mon 4/29):
Low-rank matrix approximations. The singular value decomposition (SVD), applications to matrix compression, de-noising, and matrix completion (i.e. recovering missing entries).
- Lecture 10 (Wed 5/1):
Tensor methods. Differences between matrices and tenors, the uniqueness of low-rank tensor factorizations, and Jenrich's algorithm.
- A blog post discussing Spearman's original experiment and the motivation for tensor methods: here.
- See chapter 3 of Ankur Moitra's course notes here for a more technical in depth discussion of tensor methods, and Jenrich's algorithm.
Week 6: Spectral Graph Theory
- Lecture 11 (Mon 5/6):
Graphs as matrices and the Laplacian of a graph. Interpretations of the largest and smallest eigenvectors/eigenvalues of the Laplacian. Spectral embeddings, and an overview of applications (e.g. graph coloring, spectral clustering.)
- Dan Spielman's excellent lecture notes for his semester-long course on Spectral Graph Theory. The notes include a number of helpful plots.
Amin Saberi offered a grad course last year quarter that is covering some of the research frontier of Spectral Graph Theory.
- Lecture 12 (Wed 5/8):
Spectral techniques, Part 2.
Interpretations of the second eigenvalue
(via conductance and isoperimetric number), and connections with the
speed at which random walks/diffusions convergence.
Week 7: Sampling and Estimation
- Lecture 13 (Mon 5/13):
Reservoir sampling (how to select a random sample from a datastream). Basic probability tools: Markov's inequality and Chebyshev's inequality. Importance Sampling (how to make inferences about one distribution based on samples from a different distribution). Description of the Good-Turing estimate of the missing/unseen mass.
- A nice description of the probabilistic tools/approach that went into Nate Silver's Senate election forecasting model (from 2014): here.
- A somewhat recent paper of mine showing that one can accurately estimate the structure of the unobserved portion of a distribution---not just the total probability mass. paper link here.
- Lecture 14 (Wed 5/15):
Markov Chains, stationary distributions. Markov Chain Monte Carlo (MCMC)
as approaches to solving hard problems by sampling from carefully
- A basic description of MCMC, here.
- Lecture notes from Persi Diaconis on MCMC, including a description of the MCMC approach to decoding substitution-ciphers,
- Example of MCMC used for fitting extremely complex biological models:
The human splicing code... Science, Jan. 2015.
- For those interested in computer Go: here is the Jan, 2016 Nature paper from Google's DeepMind group.
Week 8: The Fourier Perspective (and other bases)
- Lecture 15 (Mon 5/20):
Fourier methods, part 1.
- A very basic intro to Fourier transformations with some nice visualizations: here.
- A book version of a Stanford course on the Fourier Transform, which has many extremely nice applications: pdf here.
- Lecture 16 (Wed 5/22):
Fourier methods, part 2 (emphasis on convolutions).
- Lecture notes combined with Lecture 15.
Week 9: Sparse Vector/Matrix Recovery (Compressive Sensing)
- Lecture 17 (Wed 5/29):
- Lecture 18 (Mon 6/3):
Linear and convex programming. Matrix completion.
Bonus Lecture: Privacy-Preserving Computation
- Lecture 19 (Mon 6/5):
Differential Privacy in data analysis and machine learning.
Lecture notes will not be ready for a day or two...
- A fairly recent book on Differential Privacy by Cynthia Dwork and Aaron Roth: here (should be downloadable from stanford network).
- The original paper of Dwork, McSherry, Nissim, and Smith, introducing differential privacy.
- The 2016 paper of Papernot, Abadi, Erlingsson, Goodfellow, and Talwar, describing the "Private Aggregation of Teacher Ensembles" approach of training multiples models on subsets of the data, and using them to ``teach'' a new, private, model, by combining their predictions in a differentially private fashion: here.
- Assignments (75%): There will be 9 weekly mini-projects centered around the topics covered that week. Each mini-project contains both written and programming parts. The projects can be done individually or in pairs. If you work in a pair, only one member should submit all of the relevant files.
For the written part, you are encouraged to use LaTeX to typeset your homeworks;
we've provided a template for your
convenience. We will be using the GradeScope online submission system. You should have received an email saying that you've been enrolled in CS168 on Gradescope. If not, create an account on Gradescope using your Stanford ID and join CS168 using entry code 9NYKNZ.
For the programming part, you are encouraged to use matlab (tutorial), Numpy and Pyplot in Python (Python tutorial, Numpy tutorial, Pyplot tutorial), or some other scientific computing tool (with plotting). Here is a comprehensive python tutorial using IPython Notebook. IPython Notebook is an interactive computational environment, especially useful for scientific computing (tutorial on how to set up). For easy reference, you can also view the notebook here.
Assignments are released on Mondays, and are due at 11:59pm on Tuesdays the following week (both the written and the programming parts). No late assignments will be accepted, but we will drop your lowest assignment grade when calculating your final grade.
- Exam (25%):
Date: Monday, June 10th, 3:30 - 6:30 pm.
Except where otherwise noted, you may refer to your course notes, the
textbooks and research papers listed on the course Web
page only. You cannot refer to textbooks, handouts, or research
papers that are not listed on the course home page. If you do use any
approved sources, make you sure you cite them appropriately, and make
sure that all your words are your own.
You are also permitted to use general resources for whatever programming
language you choose to use.
You can discuss the problems verbally at a high level with other groups. And of course, you are encouraged to contact the course staff (via Piazza or office hours) for additional help.
Please follow the honor code.