CS 168: The Modern Algorithmic Toolbox, Spring 2021
Here is the link to a Google Doc with some possible ideas for your final [mini]project. This list is certainly not comprehensive, and we will continue to add to the list. Also, the course staff would be more than happy to chat with you all about final project ideas, both over email and during office hours.
 5/10: Miniproject #7 is available. Here is the data
file of national parks for Part 2. For Part 3, the QWOP scripts are available for MATLAB here and Python here. It is due Tuesday, May 18th at 11:59pm.
 5/4: Miniproject #6 is available.
It is
due Tuesday, May 11th at 11:59pm. Here is the data
file for Part 2.
 5/3: Kimberly's office hours this week will be 10am–12pm on Thursday, instead of the usual 8–10am.
 4/26: Kimberly's office hours this week will be 10am–12pm on Thursday, instead of the usual 8–10am.
 4/26: Miniproject #5 is available (this is one of my favorite miniprojects....)
It is
due Tuesday, May 4th at 11:59pm. The data files you will need are: dictionary.txt, co_occur.csv, analogy_task.txt, and p5_image.gif.
 4/19: Miniproject #4 is available.
It is
due Tuesday, April 27th at 11:59pm. Here is the dataset for Part 2.
 4/15: Typo in miniproject 3 Q2 part (e): originally, the question asked you to plot the "average normalized training and test error over 10 iterations". The word "iterations" should be replaced by "trials". This should be clear from the context, but sorry for any confusion. This has been updated in the miniproject version available on this webpage.
 4/12: Just this week, Woody is moving his Saturday office hours to MONDAY 7–10am.
 4/12: Miniproject #3 is now available. It is due Tuesday, April 20th at 11:59pm.
 4/11: Correction to problem 3b on the miniproject: The original problem statement asked you to assume that "each hash can be computed in constant time". This isn't a reasonable assumption; instead, please assume that each of the L hashes takes d*k operations to compute, and also assume that checking the distance between length-k vectors takes time k. No points will be deducted if you followed the original formulation of the problem (assuming constant hash time for each of the L hashes), though you'll get a slightly less satisfying answer, which won't be as much of a win over the runtime of the naive brute-force search for the nearest neighbor. Sorry for any confusion.
 4/6: After much discussion regarding the final exam policies, we have decided to NOT have a final exam. Instead, there will be a project, worth 10% of your grade. You can turn the project in at any point up until the last lecture of the quarter, and can work in a team of at most four students. This project should take one of two forms:
 Option 1) Design an alternate miniproject for one topic. Choose one of the weeks in the course, and think about the high-level take-home messages from that week. Craft a miniproject that explores some of those take-home messages. The miniproject you design should have the same sort of ``feel'' and length as the other miniprojects for the course. If your miniproject involves a dataset, please include a clear link to that dataset (Google Drive/Dropbox/etc. is fine) and a description of what that dataset is. Please also include a sample solution for the miniproject, together with a short (i.e., 1–2 paragraph) discussion/reflection on how the miniproject complements and/or reinforces the material for that week.
 Option 2) Dive deeper into one of the topics we've covered. Choose a week or even a single lecture, and explore how these ideas have been extended beyond the material covered in lecture. What are some of the newer applications of these ideas? How have the ideas been optimized/tweaked? Where is the current research frontier in this direction? See where your curiosity takes you, and then in a short (3–10 page) paper, present/discuss what you learned. Your paper can resemble a survey/research paper, or lecture notes.
 4/5: Miniproject #2 is now available. It is due Tuesday, April 13 at 11:59pm. The dataset is available here. This miniproject is probably the longest one of the quarter; the remaining miniprojects will mostly range in length between miniproject 1 and miniproject 2.
 Here is some starter code for drawing
heatmaps in Python (and/or check Stack Overflow). Here
is a review (courtesy of EE263) of the most basic aspects of vectors and matrices (e.g., how
to multiply them). Most relevant for CS168 are pages 1–9 (up to "Block
matrices and submatrices") and the "Linear functions" and "Linear
equations" sections on pages 10–11.
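If you just want a sense of what the heatmap starter code involves, here is a minimal unofficial sketch using matplotlib (the data and output filename are placeholders, not part of the assignment):

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # render without a display; drop this line for interactive use
import matplotlib.pyplot as plt

# Placeholder data: a random 10x10 matrix of values to visualize.
data = np.random.rand(10, 10)

plt.imshow(data, cmap="viridis", interpolation="nearest")
plt.colorbar()
plt.title("Example heatmap")
plt.savefig("heatmap.png")  # or plt.show() interactively
```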
 4/5: We have added another fabulous CA: Ying Sheng, who will be having her office hours on Tuesdays, 2–5pm!
 3/31: CA office hours will be beginning tomorrow (Thursday). Please see either Piazza or the announcement on Canvas for the zoom links for office hours. (They are not posted here as we have had zoom bombing issues in the past...)
 3/31: One of your classmates, Maddie Wang, put together a study-buddy matching system (for CS168 and other Stanford classes). If you're interested in trying it, here is the link.
 3/29: Miniproject #1 is available.
It is due Tuesday, April 6th (at 11:59pm). Please submit via Gradescope, entry code D56BWR.
Here is some starter code for drawing
histograms in Python (and/or check Stack Overflow).
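As an unofficial taste of what that starter code does, a minimal matplotlib histogram sketch (placeholder data and filename) looks like:

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # render without a display; drop this line for interactive use
import matplotlib.pyplot as plt

# Placeholder data: 10,000 draws from a standard Gaussian.
samples = np.random.randn(10_000)

plt.hist(samples, bins=50, edgecolor="black")
plt.xlabel("value")
plt.ylabel("count")
plt.savefig("histogram.png")  # or plt.show() interactively
```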
 3/29: The first lecture video is posted on Canvas, and lecture notes and links to the original paper on consistent hashing, and some other relevant links, are provided below in the 'Lecture 1' portion of the 'Proposed Schedule' below.
 3/25: Add yourself to our class Piazza discussion forum here. Our weekly miniprojects will be submitted/graded via the Gradescope online submission system. Please create an account on Gradescope using your Stanford ID and join CS168 using entry code D56BWR. Students who constructively contribute to the Piazza discussions will garner priceless gratitude from both the teaching staff and your classmates... additionally, a small amount of discretionary bonus points may be allocated prior to computing your final grade.
 I encourage you all to attend the live lectures. That said, lectures will be recorded, and lecture videos will be available via Canvas a few hours after each lecture finishes. In addition, I will post nicely typed lecture notes either before or right after each lecture. The notes might change slightly in the few hours after each lecture as I make some post-lecture adjustments. Also, please help me improve the lecture notes by pointing out typos, unclear portions, etc.; there will be a special Piazza thread for such helpful comments/suggestions.
 In previous years, I encouraged students to work in pairs on the miniprojects, submitting one assignment with both names. Given that remote office hours are a bit less effective, and we could all use a bit more interaction these days, we will be allowing groups of up to *four* students. We will also include some more openended directions on the miniprojects (possibly as bonus).
Instructor:

Gregory Valiant (Office hours: Mon 2:30–3:30pm via Canvas' Zoom portal. Contact: Fastest response will be via Piazza, but you can also email me at my last name at stanford.edu)
Course Assistants:
 Neha Gupta
(Office hours: Fri 9am–noon)

Jerry Qu
(Office hours: Mon 4:30–6:30pm)

Ying Sheng
(Office hours: Tues 2–5pm)

Kimberly Te
(Office hours: Thurs 8am–10:20am)

Kevin Tian
(Office hours: Tues 9am–noon)

Woody Wang
(Office hours: Sat 9am–noon)

Jackie Yang
(Office hours: Thurs 4:30–6:30pm)
Lecture Time/location: Mon/Wed, 1–2:20pm @ your sofa.
Piazza site
Prerequisites: CS107 and CS161, or permission from the instructor.
This course will provide a rigorous and hands-on
introduction to the central ideas and algorithms that constitute the
core of the modern algorithms toolkit. Emphasis will be on
understanding the highlevel theoretical intuitions and principles
underlying the algorithms we discuss, as well as developing a concrete
understanding of when and how to implement and apply the algorithms.
The course will be structured as a sequence of one-week
investigations; each week will introduce one algorithmic idea, and
discuss the motivation, theoretical underpinning, and practical
applications of that algorithmic idea. Each topic will be accompanied
by a miniproject in which students will be guided through a practical
application of the ideas of the week. Topics include modern techniques
in hashing, dimension reduction, linear and convex programming,
gradient descent and regression, sampling and estimation, compressive
sensing, and linear-algebraic techniques (principal components
analysis, singular value decomposition, spectral techniques).
Week 1: Modern Hashing
 Lecture 1 (Mon 3/29): Course introduction. ``Consistent'' hashing.
Supplementary material:
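As a concrete (unofficial) illustration of the Lecture 1 idea, here is a bare-bones consistent-hashing ring; the class and server names are our own, and real implementations add "virtual nodes" for better load balance. The key property: removing a server only remaps the keys that lived on it.

```python
import hashlib
from bisect import bisect_right

def ring_hash(key: str) -> int:
    # Map a string to a point on the ring [0, 2^32).
    return int(hashlib.sha256(key.encode()).hexdigest(), 16) % (2**32)

class ConsistentHashRing:
    """Minimal consistent-hashing ring (illustrative; no virtual nodes)."""

    def __init__(self, servers):
        # Each server occupies one point on the ring.
        self.ring = sorted((ring_hash(s), s) for s in servers)

    def lookup(self, key: str) -> str:
        # A key is assigned to the first server clockwise from its hash.
        h = ring_hash(key)
        points = [p for p, _ in self.ring]
        i = bisect_right(points, h) % len(self.ring)
        return self.ring[i][1]

ring = ConsistentHashRing(["cache-a", "cache-b", "cache-c"])
server = ring.lookup("some-object")
```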
 Lecture 2 (Wed 3/31):
Property-preserving lossy compression.
From majority elements to approximate heavy hitters.
From Bloom filters to the count-min sketch.
Supplementary material:
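To make the count-min sketch from Lecture 2 concrete, here is a small unofficial sketch in Python (table dimensions and hashing scheme are illustrative choices, not the lecture's): estimates never undercount, and overcount only through hash collisions.

```python
import hashlib
import random

class CountMinSketch:
    """Minimal count-min sketch (illustrative parameters)."""

    def __init__(self, width=272, depth=5, seed=0):
        self.width, self.depth = width, depth
        self.table = [[0] * width for _ in range(depth)]
        # One salted hash function per row.
        rng = random.Random(seed)
        self.salts = [str(rng.random()) for _ in range(depth)]

    def _index(self, item, row):
        h = hashlib.sha256((self.salts[row] + item).encode()).hexdigest()
        return int(h, 16) % self.width

    def add(self, item, count=1):
        for r in range(self.depth):
            self.table[r][self._index(item, r)] += count

    def estimate(self, item):
        # Take the minimum over rows: never underestimates the true count.
        return min(self.table[r][self._index(item, r)] for r in range(self.depth))

cms = CountMinSketch()
for _ in range(42):
    cms.add("heavy")
cms.add("light")
```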
Week 2: Data with Distances (Similarity Search, Nearest Neighbor,
Dimension Reduction, LSH)
 Lecture 3 (Mon 4/5):
Similarity Search.
(Dis)similarity metrics: Jaccard, Euclidean, Lp.
Efficient algorithms for finding similar elements in small/medium (i.e., <20)
dimensions using k-d trees.
Supplementary material:
 Lecture 4 (Wed 4/7):
Curse of Dimensionality, kissing number.
Distance-preserving compression.
Estimating Jaccard similarity using MinHash.
Euclidean distance-preserving dimensionality reduction (aka the Johnson-Lindenstrauss Transform).
Supplementary material:
 A nice survey of "kissing number", and some other strange phenomena from high dimensional spaces:
Kissing Numbers, Sphere Packings, and some Unexpected Proofs (from 2000).
 Origins of MinHash at Alta Vista:
Broder,
Identifying
and Filtering Near-Duplicate
Documents (from 2000).
 Ailon/Chazelle, Faster
Dimension Reduction, CACM '10.
 Andoni/Indyk,
Near-Optimal
Hashing Algorithms for Approximate Nearest Neighbor in High
Dimensions, CACM '08.
 For much more on Locality Sensitive Hashing, see this chapter of
the CS246 textbook (by Leskovec, Rajaraman, and Ullman).
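To see the MinHash idea from Lecture 4 in action, here is a small unofficial sketch (function names are our own; we use simple random linear hash functions in place of true random permutations): the fraction of agreeing signature coordinates estimates the Jaccard similarity.

```python
import random

def jaccard(a, b):
    # Exact Jaccard similarity: |A ∩ B| / |A ∪ B|.
    return len(a & b) / len(a | b)

def minhash_signature(s, num_hashes=200, seed=0):
    # Approximate random permutations with random linear hashes mod a prime.
    p = 2_147_483_647
    rng = random.Random(seed)
    coeffs = [(rng.randrange(1, p), rng.randrange(p)) for _ in range(num_hashes)]
    return [min((a * hash(x) + b) % p for x in s) for a, b in coeffs]

def estimate_jaccard(sig1, sig2):
    # Fraction of coordinates where the two signatures agree.
    matches = sum(u == v for u, v in zip(sig1, sig2))
    return matches / len(sig1)

A = set(range(0, 100))
B = set(range(50, 150))  # true Jaccard similarity = 50/150 = 1/3
est = estimate_jaccard(minhash_signature(A), minhash_signature(B))
```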
Week 3: Generalization and Regularization
 Lecture 5 (Mon 4/12):
Generalization (or, how much data is enough?).
Learning an unknown function from samples from an unknown distribution.
Training error vs. test error. PAC guarantees for linear classifiers. Empirical risk minimization.
 Lecture 6 (Wed 4/14): Regularization. The polynomial embedding and random projection, L2 regularization, and L1 regularization as a computationally tractable surrogate for L0 regularization.
Supplementary material:
 A recent paper arguing that, to understand why deep learning works, we need to rethink the theory of generalization. This paper is quite controversial, with one camp thinking that its conclusions are completely obvious, and the other camp thinking that it is revealing an extremely deep mystery. You decide for yourself! Paper is here.
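To make the L2-regularization idea from Lecture 6 concrete, here is a small unofficial NumPy sketch of ridge regression via its closed-form solution (the data is synthetic and the lambda value is just illustrative):

```python
import numpy as np

def ridge_fit(X, y, lam):
    # Closed-form L2-regularized least squares:
    #   w = (X^T X + lam * I)^(-1) X^T y
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

# Synthetic data: 100 points in 5 dimensions with small label noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
w_true = np.array([1.0, -2.0, 0.5, 0.0, 3.0])
y = X @ w_true + 0.01 * rng.normal(size=100)

w_hat = ridge_fit(X, y, lam=0.1)  # close to w_true for small lambda
```

Cranking lambda up shrinks the learned weights toward zero, which is the regularization effect at work.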
Week 4: Linear-Algebraic Techniques:
Understanding Principal Components Analysis
 Lecture 7 (Mon 4/19):
Understanding Principal Component Analysis (PCA).
Minimizing squared distances equals maximizing variance.
Use cases for data visualization and data compression.
Failure modes for PCA.
Supplementary material:
 A nice exposition by 23andMe of the fact that the top 2 principal components of genetic SNP data of Europeans essentially recover the geography of Europe: nice exposition with figures. Original Nature paper: Genes mirror geography in Europe, Nature, Aug. 2008.
 Eigenfaces
(see also this blog post)
 There are tons of PCA tutorials floating around the Web (some good, some not so good), which you are also permitted to refer to.
 Lecture 8 (Wed 4/21):
How PCA works. Maximizing variance as finding the
"direction of maximum stretch" of the covariance matrix.
The simple geometry of "diagonals in disguise."
The power iteration algorithm.
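The power iteration algorithm from Lecture 8 fits in a few lines of NumPy; this is an unofficial illustrative sketch (the test matrix is a toy diagonal "covariance"), not code from the lecture notes:

```python
import numpy as np

def power_iteration(A, num_iters=200, seed=0):
    """Approximate the top eigenvalue/eigenvector of a symmetric matrix A."""
    rng = np.random.default_rng(seed)
    v = rng.normal(size=A.shape[0])  # random starting direction
    for _ in range(num_iters):
        v = A @ v                    # repeatedly "stretch" by A...
        v /= np.linalg.norm(v)       # ...and renormalize to avoid overflow
    eigenvalue = v @ A @ v
    return eigenvalue, v

# Toy matrix with a clear direction of maximum stretch (eigenvalue 10).
A = np.diag([10.0, 2.0, 1.0])
lam, v = power_iteration(A)
```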
Week 5: Linear-Algebraic Techniques: Understanding the Singular Value Decomposition
 Lecture 9 (Mon 4/26):
Low-rank matrix approximations. The singular value decomposition (SVD), with applications to matrix compression, denoising, and matrix completion (i.e., recovering missing entries).
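An unofficial sketch of the low-rank approximation idea via truncated SVD (the matrix here is synthetic, built to have exact rank 2):

```python
import numpy as np

def rank_k_approx(M, k):
    # Truncated SVD: keep only the top-k singular values/vectors.
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return (U[:, :k] * s[:k]) @ Vt[:k, :]

# A 20x30 matrix of exact rank 2 (product of thin random factors).
rng = np.random.default_rng(0)
M = rng.normal(size=(20, 2)) @ rng.normal(size=(2, 30))

M_hat = rank_k_approx(M, 2)  # recovers M essentially exactly
```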
 Lecture 10 (Wed 4/28):
Tensor methods. Differences between matrices and tensors, the uniqueness of low-rank tensor factorizations, and Jenrich's algorithm.
Supplementary material:
 A blog post discussing Spearman's original experiment and the motivation for tensor methods: here.
 See chapter 3 of Ankur Moitra's course notes here for a more technical, in-depth discussion of tensor methods and Jenrich's algorithm.
Week 6: Spectral Graph Theory
 Lecture 11 (Mon 5/3):
Graphs as matrices and the Laplacian of a graph. Interpretations of the largest and smallest eigenvectors/eigenvalues of the Laplacian. Spectral embeddings, and an overview of applications (e.g., graph coloring, spectral clustering).
Supplementary material:
 Dan Spielman's excellent lecture notes for his semester-long course on Spectral Graph Theory. The notes include a number of helpful plots.

Amin Saberi offered a grad course a few years ago covering some of the research frontier of Spectral Graph Theory.
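As an unofficial illustration of the spectral clustering idea from Lecture 11, here is a tiny example: build the Laplacian L = D - A of a graph made of two triangles joined by one edge, and read off the cluster structure from the signs of the second-smallest eigenvector (the Fiedler vector). The graph is our own toy example.

```python
import numpy as np

def laplacian(adj):
    # L = D - A, where D is the diagonal degree matrix.
    return np.diag(adj.sum(axis=1)) - adj

# Two triangles {0,1,2} and {3,4,5} joined by the single edge (2,3).
A = np.zeros((6, 6))
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
for i, j in edges:
    A[i, j] = A[j, i] = 1.0

L = laplacian(A)
eigvals, eigvecs = np.linalg.eigh(L)   # eigenvalues in ascending order
fiedler = eigvecs[:, 1]                # eigenvector of the 2nd-smallest eigenvalue
cluster = fiedler > 0                  # sign pattern splits the two triangles
```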
 Lecture 12 (Wed 5/5):
Spectral techniques, Part 2.
Interpretations of the second eigenvalue
(via conductance and isoperimetric number), and connections with the
speed at which random walks/diffusions converge.
Week 7: Sampling and Estimation
 Lecture 13 (Mon 5/10):
Reservoir sampling (how to select a random sample from a data stream). Basic probability tools: Markov's inequality and Chebyshev's inequality. Importance sampling (how to make inferences about one distribution based on samples from a different distribution). Description of the Good-Turing estimate of the missing/unseen mass.
Supplementary material:
 A nice description of the probabilistic tools/approach that went into Nate Silver's Senate election forecasting model (from 2014): here.
 A somewhat recent paper of mine showing that one can accurately estimate the structure of the unobserved portion of a distribution, not just the total probability mass: paper link here.
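Reservoir sampling from Lecture 13 is short enough to sketch here (an unofficial illustration; the classic one-pass "Algorithm R"): each item in the stream ends up in the sample with equal probability, without knowing the stream's length in advance.

```python
import random

def reservoir_sample(stream, k, seed=0):
    """Uniformly sample k items from a stream of unknown length, in one pass."""
    rng = random.Random(seed)
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)  # fill the reservoir with the first k items
        else:
            # Item i replaces a random reservoir slot with probability k/(i+1),
            # which keeps every size-k subset equally likely.
            j = rng.randrange(i + 1)
            if j < k:
                reservoir[j] = item
    return reservoir

sample = reservoir_sample(range(1000), 10)
```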
 Lecture 14 (Wed 5/12):
Markov Chains and stationary distributions. Markov Chain Monte Carlo (MCMC)
as an approach to solving hard problems by sampling from carefully
crafted distributions.
Supplementary material:
 A basic description of MCMC, here.
 Lecture notes from Persi Diaconis on MCMC, including a description of the MCMC approach to decoding substitution ciphers, here.
 Example of MCMC used for fitting extremely complex biological models:
The human splicing code... Science, Jan. 2015.
 For those interested in computer Go: here is the Jan. 2016 Nature paper from Google's DeepMind group.
Week 8: The Fourier Perspective (and other bases)
 Lecture 15 (Mon 5/17):
Fourier methods, part 1.
 Lecture 16 (Wed 5/19):
Fourier methods, part 2 (emphasis on convolutions).
Week 9: Sparse Vector/Matrix Recovery (Compressive Sensing) and Mathematical Programming
 Lecture 17 (Mon 5/24):
Compressive sensing.
 Lecture 18 (Wed 5/26):
Linear and convex programming. Matrix completion.
Week 10: Privacy-Preserving Computation
 Lecture 19 (Wed 6/2):
Differential Privacy in data analysis and machine learning.
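As a tiny unofficial taste of the topic, here is the classic Laplace mechanism for privately releasing a count (the epsilon value and variable names are illustrative): adding Laplace noise with scale sensitivity/epsilon yields an epsilon-differentially-private answer.

```python
import numpy as np

def laplace_mechanism(true_value, sensitivity, epsilon, rng):
    # Release true_value + Laplace(scale = sensitivity / epsilon) noise.
    return true_value + rng.laplace(scale=sensitivity / epsilon)

rng = np.random.default_rng(0)
true_count = 123  # e.g., the number of records matching some query
# Counting queries have sensitivity 1: one person changes the count by at most 1.
noisy = laplace_mechanism(true_count, sensitivity=1.0, epsilon=0.5, rng=rng)
```

Averaging many independent releases concentrates around the true count, which is exactly the privacy/accuracy trade-off governed by epsilon.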
 Assignments (90% of Grade): There will be 9 weekly miniprojects centered around the topics covered that week. Each miniproject contains both written and programming parts. You are strongly encouraged to work in teams of up to four students. If you work in a team, only one member should submit all of the relevant files.
For the written part, you are encouraged to use LaTeX to typeset your homeworks;
we've provided a template for your
convenience. We will be using the Gradescope online submission system. Please create an account on Gradescope using your Stanford ID and join CS168 using entry code D56BWR.
For the programming part, you are encouraged to use MATLAB (tutorial), NumPy and Pyplot in Python (Python tutorial, NumPy tutorial, Matplotlib tutorial), or some other scientific computing tool (with plotting). Here is a comprehensive Python tutorial using Google's Colab that was put together for the CS231n course, with examples of plotting using Matplotlib at the very end.
Assignments are released on Mondays, and are due at 11:59pm on Tuesdays the following week (both the written and the programming parts). No late assignments will be accepted, but we will drop your lowest assignment grade when calculating your final grade.
 Final Project (10% of Grade): You can turn the project in at any point up until the last lecture of the quarter, and can work in a team of at most four students. This project should take one of two forms:
 Option 1) Design an alternate miniproject for one topic. Choose one of the weeks in the course, and think about the high-level take-home messages from that week. Craft a miniproject that explores some of those take-home messages. The miniproject you design should have the same sort of ``feel'' and length as the other miniprojects for the course. If your miniproject involves a dataset, please include a clear link to that dataset (Google Drive/Dropbox/etc. is fine) and a description of what that dataset is. Please also include a sample solution for the miniproject, together with a short (i.e., 1–2 paragraph) discussion/reflection on how the miniproject complements and/or reinforces the material for that week.
 Option 2) Dive deeper into one of the topics we've covered. Choose a week or even a single lecture, and explore how these ideas have been extended beyond the material covered in lecture. What are some of the newer applications of these ideas? How have the ideas been optimized/tweaked? Where is the current research frontier in this direction? See where your curiosity takes you, and then in a short (3–10 page) paper, present/discuss what you learned. Your paper can resemble a survey/research paper, or lecture notes.
 Contributions to the 168 Community (Discretionary Grade Bumps): You are encouraged to help your fellow classmates when possible. This includes responding to Piazza questions when you know the answer, helping improve the course lecture notes by pointing out typos/errors/confusing parts, etc. [there will be a special Piazza folder specifically for providing this feedback on the lecture notes]. At the end of the course, I will bump up grades of those students who had the most positive impact on the 168 community, according to the (quite subjective) judgement of the TAs and me.
You can discuss the problems at a high level with other groups and contact the course staff (via Piazza or office hours) for additional help. And of course, you are encouraged to help respond to Piazza questions.
You may refer to the course notes and research papers linked from the course webpage, and can also use other references that you find online. You may *NOT* consult solution sets or code from past offerings of this course. Services such as CourseHero are strictly forbidden, and if you try to sell my lecture notes or your solution sets to these sites, a well-paid lawyer representing Stanford will be in contact with you : ) [Please don't do this!] Of course, you are expected to understand everything that is written on any assignment turned in with your name; if you do refer to references, make sure you cite them appropriately, and make
sure that all your words are your own.
You are also permitted to use general resources for whatever programming
language you choose to use. None of the assignments require you to write much code, and *you may NOT reuse any code from friends/internet sources related to previous offerings of this course*. If you use helper functions or code snippets that you did not write yourself (aside from standard python packages, etc.) you must clearly cite the source in your writeup.
Please follow the honor code.