Announcements

- 6/1: Here is mini-project 9, with images. The miniproject is due Tuesday, June 9th at 11:59pm. This project is a bit shorter, though also note the new policy of dropping the 2 lowest miniproject scores...
- 5/25: Mini-project #8 is available. Here is the laurel_yanny.wav audio clip for Part 3, and a zipped folder worldsYikes.zip containing examples that we created illustrating that the 30-point research-oriented bonus part is not impossible : ) The miniproject is due Tuesday, June 2nd at 11:59pm. This miniproject will be much more fun to do in a group, so we
*strongly*suggest finding one, two, or three partners to work with. - 5:18: Mini-project #7 is available. Here is the data file of national parks for Part 2. For Part 3, the QWOP scripts are available for MATLAB here and Python here. It is due Tuesday, May 26th.
- 5/17: I'm uploading a lecture video for tomorrow's lecture. There will NOT be a live lecture at 1:30pm tomorrow.
- 5/11: Mini-project #6 is available. It is due Tuesday, May 19th at 11:59pm. Here is the data file for Part 2.
- 5/4: Mini-project #5 is available. It is due Tuesday, May 12th at 11:59pm. The data files you will need are: dictionary.txt, co_occur.csv, analogy_task.txt, and p5_image.gif.
- 4/29: Jeffrey's office hours will be Tuesdays 10-1pm from now on.
- 4/28: Mini-project #4 is available. It is due Tuesday, May 5th at 11:59pm. Here is the dataset for Part 2.
- 4/27: Greg's office hours today will be 3-4:30pm instead of the usual 3-5pm.
- 4/20: Mini-project #3 is now available. It is due Tuesday, April 28 at 11:59pm.
- 4/13: Mini-project #2 is now available. It is due Tuesday, April 21 at 11:59pm. The dataset is available here. This miniproject is probably the longest one of the quarter---the remaining mini-projects will range in length between mini-project 1 and mini-project 2.
- Here is some starter code for drawing heatmaps in Python (and/or check Stack Overflow). Here is a review (courtesy of EE263) of the most basic aspects of vectors and matrices (e.g., how to multiply them). Most relevant for CS168 are pages 1--9 (up to "Block matrices and submatrices") and the "Linear functions" and "Linear equations" sections on pages 10--11.
- 4/13: Lecture 3 video Password: a0#56?.=
- 4/13: Lecture 3 today will be live (and recorded) at 1:30pm today. Please see the zoom link from the Canvas calendar or Piazza. (I'm not publicly posting it here to avoid unwanted guests : )
- 4/9: Mini-project 1 asks you to plot some histograms: here is some starter code for drawing histograms in Python (and/or check Stack Overflow).
- 4/8: Sorry for all the technical issues with today's video--I think this will be completely resolved by next lecture (connecting directly to Zoom from my ipad seems to work). Thanks for your patience today. I really liked the live chat feature, and was thrilled that you all were very actively answering each others questions real-time...this might be one of the main benefits of live remote lectures versus in-person classes. Lets plan on a live lecture Monday, and we can have a class-wide vote regarding live vs pre-recorded after that. 4/7: For the miniprojects, we're going to allow groups of up to *four* students. As always, you should submit a single assignment with all of your names tagged via gradescope, and each member of the group must understand everything that is written.
- 4/6: Mini-project #1 is available.
It is due Tuesday, April 14th (at 11:59pm). Please submit via Gradescope, entry code M8BG7R.
- Here is some starter code for drawing
histograms in Python (and/or check Stack Overflow).

- 4/6: I'm aware there are a few issues with the course videos, including some times when the "whiteboard" seems to lag behind the audio, and also some compression artifacts that occur when I scroll from one whiteboard frame to another. The artifacts are actually not in my original videos, and instead are introduced when I upload them to Canvas. I'm looking into what can be done about this---one option would be to have the videos on Youtube. The whiteboard lag (only happened a few times in lecture 1) is a different matter...I'm looking into what I can do about that too. Sorry for this....
- 4/6: The ENROLLMENT CAP will be removed---it might take a day or two for this to be processed, but don't worry if you are on the waitlist.
- 4/5: My first office hours will be Monday 3-5pm. Connect via the Zoom portal on Canvas. See you there!
- 4/5: The first lecture video is posted on Canvas. Lets try to do Wednesday's lecture live, via Zoom. (It will still be recorded and the video will be available on Canvas with a few hours delay.) I do feel your live questions during lecture contributes significantly to the class experience....if this next live lecture has issues (zoom bombing, strange lags, etc. then I will go back to pre-recording videos.)
- 3/31: Add yourself to our class Piazza discussion forum here. Our weekly miniprojects will be submitted/graded via the GradeScope online submission system. Please create an account on Gradescope using your Stanford ID and join CS168 using entry code M8BG7R.
- 3/23: My current plan for lectures/videos is the following: Each lecture will be split into a couple of separate video segments/"chapters" which I will post. Having these separate chapters should make it easier for you to refer back to specific material. We will have a special Piazza thread for questions about the material/lecture, and I will post a new video 24hrs after the lecture is posted where I go through and address those questions that seem most helpful to the class. [In addition, we will have the usual written lecture notes that will also be posted.]
- 3/19: With the COVID situation, as with all stanford classes, CS168 will be entirely remote/online. I am still planning to cover all the material, and I am hoping and expecting this course to still be a fantastic experience!
There are two changes that you should know about:
- I'm happy to report that there will be no final exam, and 100% of your grade will be based on the weekly mini-projects. There will be a small amount of discretionary bonus points allocated to students who constructively contribute to the Piazza discussions and help improve lecture notes (by pointing out typos, unclear portions, etc.)

- In previous years, I encouraged students to work in pairs on the miniprojects, submitting one assignment with both names. Given the isolation due to COVID, I am tempted to try to encourage EVEN MORE collaboration, by allowing groups of up to *four* students, and including some more open-ended directions on the miniprojects (possibly as bonus). I plan to query the class to see if you support this change, and/or have other ideas for making the remote 168 experience even more exciting and stimulating.

Administrative Information

- Gregory Valiant (Office hours: Mon 3-5pm via Canvas' Zoom portal. Contact: Fastest response will be via Piazza, but you can also email me at my last name at stanford.edu)

**Course Assistants: **

- Jeffrey Barratt (Office hours: Tues 10am-1pm)
- Ben Cohen-Wang (Office hours: Thurs 7-9pm, Sun 7-9pm)
- Neha Gupta (Office hours: Mon 9-noon)
- Melody Guan (Office hours: Wed 4:10-7pm)
- Mingda Qiao (Office hours: Fri 1-4pm)
- Pras Ramakrishnan (Office hours: Fri 10am-1pm)
- Woody Wang (Office hours: Thurs 1-4pm)

**Lecture Time/location:** Mon/Wed, 1:30-2:50pm @ your sofa.

**Prerequisites**: CS107 and CS161, or permission from the instructor.

Course Description

This course will provide a rigorous and hands-on introduction to the central ideas and algorithms that constitute the core of the modern algorithms toolkit. Emphasis will be on understanding the high-level theoretical intuitions and principles underlying the algorithms we discuss, as well as developing a concrete understanding of when and how to implement and apply the algorithms. The course will be structured as a sequence of one-week investigations; each week will introduce one algorithmic idea, and discuss the motivation, theoretical underpinning, and practical applications of that algorithmic idea. Each topic will be accompanied by a mini-project in which students will be guided through a practical application of the ideas of the week. Topics include modern techniques in hashing, dimension reduction, linear and convex programming, gradient descent and regression, sampling and estimation, compressive sensing, and linear-algebraic techniques (principal components analysis, singular value decomposition, spectral techniques).

Detailed Schedule

### Week 1: Modern Hashing

**Lecture 1 (Mon 4/6):**Course introduction. Consistent hashing. Supplementary material:- The Akamai paper: Karger/Lehman/Leighton/Levine/Lewin/Panigrahy, Consistent Hashing and Random Trees: Distributed Caching Protocols for Relieving Hot Spots on the World Wide Web, STOC 1997.
- Akamai stories by co-founder Tom Leighton.
- Further implementation details (see Section 3).
- The Chord paper: Stoica et al., Chord: A Scalable Peer-to-peer Lookup Service for Internet Applications, SIGCOMM 2001.
- The Amazon Dynamo paper: DeCandia et al., Dynamo: Amazon's Highly Available Key-value Store, SOSP 2007.
- Review videos on hashing: Operations and Applications, Implementation Details Part 1 and Part 2.

**Lecture 2 (Wed 4/8):**Property-preserving lossy compression. From majority elements to approximate heavy hitters. From bloom filters to the count-min sketch. Supplementary material:- Review videos on bloom filters: The Basics and Heuristic Analysis
- Broder/Mitzenmacher, Network Applications of Bloom Filters: A Survey, 2005.
- Cormode/Muthukrishnan, An Improved Data Stream Summary: The Count-Min Sketch and its Applications, 2003.
- One of several count-min sketch implementations.

### Week 2: Data with Distances (Similarity Search, Nearest Neighbor, Dimension Reduction, LSH)

**Lecture 3 (Mon 4/13):**Similarity Search. (Dis)similarity metrics: Jaccard, Euclidean, Lp. Efficient algorithm for finding similar elements in small/medium (ie. <20) dimensions using k-d-trees.- Lecture notes . Lecture video Password: a0#56?.=

- Original paper of Bentley: Multidimensional binary search trees used for associative searching, 1975.
- Python scipy kd-tree implementation here.

**Lecture 4 (Wed 4/15):**Curse of Dimensionality, kissing number. Distance-preserving compression. Estimating Jaccard similarity using MinHash. JL dimensionality reduction. Supplementary material:- A nice survey of "kissing number", and some other strange phenomena from high dimensional spaces: Kissing Numbers, Sphere Packings, and some Unexpected Proofs (from 2000).
- Origins of MinHash at Alta Vista: Broder, Identifying and Filtering Near-Duplicate Documents (from 2000).
- Ailon/Chazelle, Faster Dimension Reduction, CACM '10.
- Andoni/Indyk, Near-Optimal Hashing Algorithms for Approximate Nearest Neighbor in High Dimensions, CACM '08.
- For much more on Locality Sensitive Hashing, see this chapter of the CS246 textbook (by Leskovec, Rajaraman, and Ullman).

### Week 3: Generalization and Regularization

**Lecture 5 (Mon 4/20):**Generalization (or, how much data is enough?). Learning an unknown function from samples from an unknown distribution. Training error vs. test error. PAC guarantees for linear classifiers. Empirical risk minimization.

**Lecture 6 (Wed 4/22):**Regularization. The polynomial embedding and random projection, L2 regularization, and L1 regularization as a computationally tractable surrogate for L0 regularization. Supplementary material:- A recent paper arguing that, to understand why deep learning works, we need to rethink the theory of generalization. This paper is quite controversial, with one camp thinking that its conclusions are completely obvious, and the other camp thinking that it is revealing an extremely deep mystery. You decide for yourself! Paper is here.

### Week 4: Linear-Algebraic Techniques: Understanding Principal Components Analysis

**Lecture 7 (Mon 4/27):**Understanding Principal Component Analysis (PCA). Minimizing squared distances equals maximizing variance. Use cases for data visualization and data compression. Failure modes for PCA. Supplementary material:- A nice exposition by 23andMe of the fact that the top 2 principal components of genetic SNP data of Europeans essentially recovers the geography of europe: nice exposition w. figures. Original Nature paper: Genes mirror geography in Europe, Nature, Aug. 2008.
- Eigenfaces (see also this blog post)
- There's tons of PCA tutorials floating around the Web (some good, some not so good), which you are also permitted to refer to.

**Lecture 8 (Wed 4/29):**How PCA works. Maximizing variance as finding the "direction of maximum stretch" of the covariance matrix. The simple geometry of "diagonals in disguise." The power iteration algorithm.

### Week 5: Linear-Algebraic Techniques: Understanding the Singular Value Decomposition

**Lecture 9 (Mon 5/4):**Low-rank matrix approximations. The singular value decomposition (SVD), applications to matrix compression, de-noising, and matrix completion (i.e. recovering missing entries).

**Lecture 10 (Wed 5/6):**Tensor methods. Differences between matrices and tensors, the uniqueness of low-rank tensor factorizations, and Jenrich's algorithm. Supplementary material:

### Week 6: Spectral Graph Theory

**Lecture 11 (Mon 5/11):**Graphs as matrices and the Laplacian of a graph. Interpretations of the largest and smallest eigenvectors/eigenvalues of the Laplacian. Spectral embeddings, and an overview of applications (e.g. graph coloring, spectral clustering.) Supplementary material:- Dan Spielman's excellent lecture notes for his semester-long course on Spectral Graph Theory. The notes include a number of helpful plots.
- Amin Saberi offered a grad course a few years ago that is covering some of the research frontier of Spectral Graph Theory.

**Lecture 12 (Wed 5/13):**Spectral techniques, Part 2. Interpretations of the second eigenvalue (via conductance and isoperimetric number), and connections with the speed at which random walks/diffusions converge.

### Week 7: Sampling and Estimation

**Lecture 13 (Mon 5/18):**Reservoir sampling (how to select a random sample from a datastream). Basic probability tools: Markov's inequality and Chebyshev's inequality. Importance Sampling (how to make inferences about one distribution based on samples from a different distribution). Description of the Good-Turing estimate of the missing/unseen mass. Supplementary material:- A nice description of the probabilistic tools/approach that went into Nate Silver's Senate election forecasting model (from 2014): here.
- A somewhat recent paper of mine showing that one can accurately estimate the structure of the unobserved portion of a distribution---not just the total probability mass. paper link here.

**Lecture 14 (Wed 5/20):**Markov Chains, stationary distributions. Markov Chain Monte Carlo (MCMC) as approaches to solving hard problems by sampling from carefully crafted distributions. Supplementary material:- A basic description of MCMC, here.
- Lecture notes from Persi Diaconis on MCMC, including a description of the MCMC approach to decoding substitution-ciphers, here.
- Example of MCMC used for fitting extremely complex biological models: The human splicing code... Science, Jan. 2015.
- For those interested in computer Go: here is the Jan, 2016 Nature paper from Google's DeepMind group.

### Week 8: The Fourier Perspective (and other bases)

**Lecture 15 (Mon 5/24):**Fourier methods, part 1. Supplementary material:

**Lecture 16 (Wed 5/26):**Fourier methods, part 2 (emphasis on convolutions).- Lecture notes combined with Lecture 15.

### Week 9: Sparse Vector/Matrix Recovery (Compressive Sensing)

**Lecture 17 (Mon 6/1):**Compressive sensing. Supplementary material:- A video and lecture notes from CS264 with more of the mathematics behind compressive sensing.
- Survey talk by Candes from ICM 2014.
- Rice's single-pixel camera
- More on potential applications in cameras.
- Developments in medical imaging.
- More resources on compressive sensing.

**Lecture 18 (Wed 6/3):**Linear and convex programming. Matrix completion. Supplementary material:- Linear programming FAQ, out of date but still with lots of useful info.
- For convex optimization at Stanford, start with Stephen Boyd.
- For more on matrix completion, start with Chapter 7 of Moitra's notes.

### Week 10: Privacy Preserving Computational

**Lecture 19:**Differential Privacy in data analysis and machine learning.- I'm really sorry, but I don't think I'll be able to finish Lecture notes until the end of this week, though the first chapter of the the Dwork/Roth book linked below is a fantastic starting place for learning about differential privacy.

- A fairly recent book on Differential Privacy by Cynthia Dwork and Aaron Roth: here (should be downloadable from stanford network).
- The original paper of Dwork, McSherry, Nissim, and Smith, introducing differential privacy.
- The 2016 paper of Papernot, Abadi, Erlingsson, Goodfellow, and Talwar, describing the "Private Aggregation of Teacher Ensembles" approach of training multiples models on subsets of the data, and using them to ``teach'' a new, private, model, by combining their predictions in a differentially private fashion: here.

Coursework

**Assignments (100% of Grade)**: There will be 9 weekly mini-projects centered around the topics covered that week. Each mini-project contains both written and programming parts. You are strongly encouraged to work in teams of two or three [this might change depending on your feedback after lecture 1]. If you work in a team**only one member**should submit all of the relevant files.For the written part, you are encouraged to use LaTeX to typeset your homeworks; we've provided a template for your convenience. We will be using the GradeScope online submission system. Please create an account on Gradescope using your Stanford ID and join CS168 using entry code M8BG7R.

For the programming part, you are encouraged to use matlab (tutorial), Numpy and Pyplot in Python (Python tutorial, Numpy tutorial, Pyplot tutorial), or some other scientific computing tool (with plotting). Here is a comprehensive python tutorial using IPython Notebook. IPython Notebook is an interactive computational environment, especially useful for scientific computing (tutorial on how to set up). For easy reference, you can also view the notebook here.

Assignments are released on Mondays, and are due at 11:59pm on Tuesdays the following week (both the written and the programming parts).

**No late assignments will be accepted**, but we will drop your lowest assignment grade when calculating your final grade.**Contributions to the 168 Community (Discretionary Grade Bumps)**: You are encouraged to help your fellow classmates when possible. This includes responding to Piazza questions when you know the answer, helping improve the course lecture notes by pointing out typos/errors/confusing parts, etc. [there will be a special Piazza folder specifically for providing this feedback on the lecture notes]. At the end of the course, I will bump up grades of those students who had the most positive impact on the 168 community, according to the (quite subjective) judgement of the TAs and me.

Collaboration Policy

You can discuss the problems at a high level with other groups and contact the course staff (via Piazza or office hours) for additional help. And of course, you are encouraged to help respond to Piazza questions .

You may refer to the course notes and research papers linked from the course webpage, and can also use other references that you find online. You may *NOT* consult solution sets or code from past offerings of this course. Services such as CourseHero are strictly forbidden, and if you try to sell my lecture notes or your solution sets to these sites, a well-paid lawyer representing Stanford will be in contact with you : ) [Please don't do this!] Of course, you are expected to understand everything that is written on any assignment turned in with your name; if you do refer to references, make sure you cite them appropriately, and make sure that all your words are your own.

You are also permitted to use general resources for whatever programming language you choose to use. None of the assignments require you to write much code, and *you may NOT reuse any code from friends/internet sources related to previous offerings of this course*. If you use helper functions or code snippets that you did not write yourself (aside from standard python packages, etc.) you must clearly cite the source in your writeup.

Please follow the honor code.