Announcements

- Final Exam, Friday June 8th, 3:30-6:30pm @ Bishop Auditorium in Lathrop Library (closed book, though you can bring one double-sided sheet of notes that you prepare yourself).
- 6/4: Due to popular demand, we have posted solution sketches for last year's final exam on Piazza here. Feel free to use that page to discuss any of the problems.
- 5/30: Here is mini-project 9, with images. It is due Wednesday, June 6 at 11:59pm.
- 5/30: The final exam will be Friday, June 8th. It is closed book, though you can bring one double-sided sheet of notes that you prepare yourself (though feel free to discuss with eachother). Here is last year's final.
- 5/24: I updated the laurel/yanny .wav file for p8---originally it contained 2-channel audio (for left/right speakers) which corresponded to a 2x43k matrix; now it is just a single channel, so just a single vector. Also slightly updated footnote for problem 2...
- 5/23: Mini-project #8 is available. Here is the Laurel/Yanny .wav file for Part 3. This is due WEDNESDAY, May 30th at 11:59pm---note the extra day.
- 5/15: Mini-project #7 is available. Here is the data file of national parks for Part 2. For Part 3, the QWOP scripts are available for MATLAB here and Python here. It is due Tuesday, May 22nd at 11:59pm.
- 5/11: My Monday 4:30-5:30 office hour will be moved back to 5-6pm this coming week (5/14). Sorry for any inconvenience. -g
- 5/7: Mini-project 6 is posted. It is due Tuesday, May 15th at 11:59pm. Here is the data file for Part 2.
- 5/6: Minor correction to miniproject #5: in part 1(c), it said that since M is symmetric, for the SVD M = USV', it should be the case that U=V. This is not the case, though they are equal up to flipping the signs of some of their columns (and hence there is no point in looking at both the word embeddings derived from U and those derived from V---they will have identical properties. Sorry for any confusion this caused.
- 4/30: Mini-project #5 is available. It is due Tuesday, May 8th at 11:59pm. The data files you will need are: dictionary.txt, co_occur.csv, analogy_task.txt, and p5_image.gif.
- 4/23: Mini-project #4 is available. It is due Tuesday, May 1st at 11:59pm. Here is the data set for Part 1.
- 4/19: Here is a link to some notes on Gradient Descent, which might be helpful for mini-project #3 if you haven't seen gradient descent before. Gradient Descent Notes. It is due Tuesday, April 24th at 11:59pm.
- 4/17: Mini-project #3 is available. It is due Tuesday, April 24th at 11:59pm.
- 4/13: There was some confusion on Problem 3 (c); I re-worded that part of the problem to make it more clear. (Though any good answer to the original wording is also a valid answer to the re-worded question. Sorry for the confusion, and please download the latest version.
- 4/10: Fixed some minor typos in Miniproject 2 at 11pm. Please download the latest version.
- 4/9: Mini-project #2 is available. It is due Tuesday, April 17th at 11:59pm. The dataset is available here.
- Here is some starter code for drawing heatmaps in Python (and/or check Stack Overflow).
- 4/9: Here is a review (courtesy of EE263) of the most basic aspects of vectors and matrices (e.g., how to multiply them). Most relevant for CS168 are pages 1--9 (up to "Block matrices and submatrices") and the "Linear functions" and "Linear equations" sections on pages 10--11.
- 4/4: We updated the miniproject submission instructions, and set up the Gradescope submission page (entry code 986PDG).
- 4/2: Mini-project #1 is available.
It is due Tuesday, April 10th (at 11:59pm).
- Here is some starter code for drawing histograms in Python (and/or check Stack Overflow).

- 4/2: First class, 3-4:20pm in 320-105. Welcome to CS168!

Administrative Information

- Gregory Valiant (Office hours: Mon 4:30-5:30pm, Gates 470. Email: last name at stanford.edu).

**Course Assistants:**

- Matthew Katzman (Office hours: Mon 9-11am, Huang Basement, Email: mkatzman at stanford.edu).
- Stephen Mussmann (Office hours: Mon 1-3pm, Gates 459 (starting 4/16) Email: mussmann at stanford).
- Vatsal Sharan (Office hours: Tues 4:15-6:15pm, Gates 460, Email: first initial last name at stanford.edu).
- Warut Suksompong (Office hours: Fri 3-5pm, Gates 498, Email: jungs at stanford).

**Time/location:** 3:00 - 4:20pm Mon/Wed in 320-105.

**Piazza site:** Here.

**Prerequisites**: CS107 and CS161, or permission from the instructor.

Course Description

This course will provide a rigorous and hands-on introduction to the central ideas and algorithms that constitute the core of the modern algorithms toolkit. Emphasis will be on understanding the high-level theoretical intuitions and principles underlying the algorithms we discuss, as well as developing a concrete understanding of when and how to implement and apply the algorithms. The course will be structured as a sequence of one-week investigations; each week will introduce one algorithmic idea, and discuss the motivation, theoretical underpinning, and practical applications of that algorithmic idea. Each topic will be accompanied by a mini-project in which students will be guided through a practical application of the ideas of the week. Topics include modern techniques in hashing, dimension reduction, linear and convex programming, gradient descent and regression, sampling and estimation, compressive sensing, and linear-algebraic techniques (principal components analysis, singular value decomposition, spectral techniques).

Detailed Schedule

### Week 1: Modern Hashing

**Lecture 1 (Mon 4/2):**Course introduction. Consistent hashing. Supplementary material:- The Akamai paper: Karger/Lehman/Leighton/Levine/Lewin/Panigrahy, Consistent Hashing and Random Trees: Distributed Caching Protocols for Relieving Hot Spots on the World Wide Web, STOC 1997.
- Akamai stories by co-founder Tom Leighton.
- Further implementation details (see Section 3).
- The Chord paper: Stoica et al., Chord: A Scalable Peer-to-peer Lookup Service for Internet Applications, SIGCOMM 2001.
- The Amazon Dynamo paper: DeCandia et al., Dynamo: Amazon's Highly Available Key-value Store, SOSP 2007.
- Review videos on hashing: Operations and Applications, Implementation Details Part 1 and Part 2.

**Lecture 2 (Wed 4/4):**Property-preserving lossy compression. From majority elements to approximate heavy hitters. From bloom filters to the count-min sketch. Supplementary material:- Review videos on bloom filters: The Basics and Heuristic Analysis
- Broder/Mitzenmacher, Network Applications of Bloom Filters: A Survey, 2005.
- Cormode/Muthukrishnan, An Improved Data Stream Summary: The Count-Min Sketch and its Applications, 2003.
- One of several count-min sketch implementations.

### Week 2: Data with Distances (Similarity Search, Nearest Neighbor, Dimension Reduction, LSH)

**Lecture 3 (Mon 4/9):**Similarity Search. (Dis)similarity metrics: Jaccard, Euclidean, Lp. Efficient algorithm for finding similar elements in small/medium (ie. <20) dimensions using k-d-trees. Supplementary material:- Original paper of Bentley: Multidimensional binary search trees used for associative searching, 1975.
- Python scipy kd-tree implementation here.

**Lecture 4 (Wed 4/12):**Curse of Dimensionality, kissing number. Distance-preserving compression. Estimating Jaccard similarity using MinHash. JL dimensionality reduction. Supplementary material:- A nice survey of "kissing number", and some other strange phenomena from high dimensional spaces: Kissing Numbers, Sphere Packings, and some Unexpected Proofs (from 2000).
- Origins of MinHash at Alta Vista: Broder, Identifying and Filtering Near-Duplicate Documents (from 2000).
- Ailon/Chazelle, Faster Dimension Reduction, CACM '10.
- Andoni/Indyk, Near-Optimal Hashing Algorithms for Approximate Nearest Neighbor in High Dimensions, CACM '08.
- For much more on LSH, see this chapter of the CS246 textbook (by Leskovec, Rajaraman, and Ullman).

### Week 3: Generalization and Regularization

**Lecture 5 (Mon 4/16):**Generalization (or, how much data is enough?). Learning an unknown function from samples from an unknown distribution. Training error vs. test error. PAC guarantees for linear classifiers. Empirical risk minimization.

**Lecture 6 (Wed 4/18):**Regularization. The polynomial embedding and random projection, L2 regularization, and L1 regularization as a computationally tractable surrogate for L0 regularization.- A nice/short technical description of recovering sparse signals via L1 regularization: Candes/Wakin'08.
- Very recent paper on generalization and implicit regularization in deep learning, which raises more questions than it answers: Understanding deep learning requires rethinking generalization, ICLR'17.

### Week 4: Linear-Algebraic Techniques: Understanding Principal Components Analysis

**Lecture 7 (Mon 4/23):**Understanding Principal Component Analysis (PCA). Minimizing squared distances equals maximizing variance. Use cases for data visualization and data compression. Failure modes for PCA. Supplementary material:- A nice exposition by 23andMe of the fact that the top 2 principal components of genetic SNP data of Europeans essentially recovers the geography of europe: nice exposition w. figures. Original Nature paper: Genes mirror geography in Europe, Nature, Aug. 2008.
- Eigenfaces (see also this blog post)
- There's tons of PCA tutorials floating around the Web (some good, some not so good), which you can also refer to.

**Lecture 8 (Wed 4/26):**How PCA works. Maximizing variance as finding the "direction of maximum stretch" of the covariance matrix. The simple geometry of "diagonals in disguise." The power iteration algorithm.

### Week 5: Linear-Algebraic Techniques: Understanding the Singular Value Decomposition

**Lecture 9 (Mon 4/30):**Low-rank matrix approximations. The singular value decomposition (SVD), applications to matrix compression, de-noising, and matrix completion (i.e. recovering missing entries).

**Lecture 10 (Wed 5/2):**Tensor methods!! Differences between matrices and tenors, the uniqueness of low-rank tensor factorizations, and Jenrich's algorithm. Supplementary material:

### Week 6: Spectral Graph Theory

**Lecture 11 (Mon 5/7):**Graphs as matrices and the Laplacian of a graph. Interpretations of the largest and smallest eigenvectors/eigenvalues of the Laplacian. Spectral embeddings, and an overview of applications (e.g. graph coloring, spectral clustering.) Supplementary material:- Dan Spielman's excellent lecture notes for his semester-long course on Spectral Graph Theory. The notes include a number of helpful plots.
- Amin Saberi offered a grad course in 2016 that covered some of the research frontier of Spectral Graph Theory.

**Lecture 12 (Wed 5/9):**Spectral techniques, part 2. Interpretations of the second eigenvalue (via conductance and isoperimetric number), and connections with the speed at which random walks convergence to the stationary distribution and the power iteration method.

### Week 7: Sampling and Estimation

**Lecture 13 (Mon 5/14):**Reservoir sampling (how to select a random sample from a datastream). Basic probability tools: Markov's inequality and Chebyshev's inequality. Importance Sampling (how to make inferences about one distribution based on samples from a different distribution). Description of the Good-Turing estimate of the missing/unseen mass. Supplementary material:- A nice description of the probabilistic tools/approach that go into Nate Silver's political forecasting model: here.

**Lecture 14 (Wed 5/16):**Markov Chains, stationary distributions. Markov Chain Monte Carlo (MCMC) as approaches to solving hard problems by sampling from carefully crafted distributions. Supplementary material:- A basic description of MCMC, here.
- Lecture notes from Persi Diaconis on MCMC, including a description of the MCMC approach to decoding substitution-ciphers, here.
- Example of MCMC used for fitting extremely complex biological models: The human splicing code... Science, Jan. 2015.
- For those interested in computer Go: here is the Jan, 2016 Nature paper from Google's DeepMind group.

### Week 8: The Fourier Perspective (and other bases)

**Lecture 15 (Mon 5/21):**Fourier methods, part 1. Supplementary material:

**Lecture 16 (Wed 5/23):**Fourier methods, part 2 (emphasis on convolutions).- Lecture notes combined with Lecture 15

### Week 9: Mathematical Programming and Sparse Vector/Matrix Recovery (Compressive Sensing)

**Lecture 17 (Wed 5/30):**Compressive sensing. Supplementary material:- A video and lecture notes from CS264 with more of the mathematics behind compressive sensing.
- Survey talk by Candes from ICM 2014.
- Rice's single-pixel camera
- More on potential applications in cameras.
- Developments in medical imaging.
- More resources on compressive sensing.

**Lecture 18 (Mon 6/4):**Linear and convex programming. Matrix completion. Supplementary material:- Linear programming FAQ, out of date but still with lots of useful info.
- For convex optimization at Stanford, start with Stephen Boyd.
- For more on matrix completion, start with Chapter 7 of Moitra's notes.

### Week 10: Privacy Preserving Data Analysis

**Lecture 19 (Wed 6/6):**Intro to Differential Privacy- Lecture notes will be posted at some point, but not before the final exam (Friday). This material will not be covered on the final.

- An excellent book on differential privacy, written by Cynthia Dwork and Aaron Roth, from 2014: here.

Coursework

**Assignments (75%)**: There will be 9 weekly mini-projects centered around the topics covered that week. Each mini-project contains both written and programming parts. The projects can be done individually or in pairs. If you work in a pair,**only one member**should submit all of the relevant files.For the written part, you are encouraged to use LaTeX to typeset your homeworks; we've provided a template for your convenience. We will be using the GradeScope online submission system. You should have received an email saying that you've been enrolled in CS168 on Gradescope. If not, create an account on Gradescope using your Stanford ID and join CS168 using entry code 986PDG.

For the programming part, you are encouraged to use matlab (tutorial), Numpy and Pyplot in Python (Python tutorial, Numpy tutorial, Pyplot tutorial), or some other scientific computing tool (with plotting). Here is a comprehensive python tutorial using IPython Notebook. IPython Notebook is an interactive computational environment, especially useful for scientific computing (tutorial on how to set up). For easy reference, you can also view the notebook here.

To turn in your programming part:

- Compress all your files into a .zip file, such
as
`p1.zip`

. - See the instructions on the mini-project for details about what files to submit. For example, we generally want your code in addition to your answers.
- Copy the zip file to the cardinal machine (don't miss the colon in the end): scp <zip file> <your SUNetID>@cardinal.stanford.edu:
- Log onto the cardinal machine:
ssh <your SUNetID>cardinal.stanford.edu

- Run the submit script (run the script without any argument to see its usage):
/usr/class/cs168/WWW/submit.py <project ID> .

Note: your file must have the proper name (such as

`p1.zip`

) for the submit script to process it.You can submit multiple times; each submission will just replace the previous one.

Assignments are released on Mondays, and are due at 11:59pm on Tuesdays the following week (both the written and the programming parts).

**No late assignments will be accepted**, but we will drop your lowest assignment grade when calculating your final grade.- Compress all your files into a .zip file, such
as
**Exam (25%)**: Date: Friday, June 8th, 3:30 - 6:30 pm.

Collaboration Policy

Except where otherwise noted, you may refer to your course notes, the
textbooks and research papers listed on the course Web
page **only**. You cannot refer to textbooks, handouts, or research
papers that are not listed on the course home page. If you do use any
approved sources, make you sure you cite them appropriately, and make
sure that all your words are your own.

You are also permitted to use general resources for whatever programming language you choose to use.

You can discuss the problems verbally at a high level with other groups. And of course, you are encouraged to contact the course staff (via Piazza or office hours) for additional help.

Please follow the honor code.