Meeting time and recorded lectures

Stats 202 meets MWF 11:30 am-12:20 pm in Gates B1.

All lectures will be recorded on video by the Stanford Center for Professional Development and posted on their site.

Lecture slides will be posted on this site (see the Lectures link on the left).

Course description

Stats 202 is an introduction to Data Mining. By the end of the quarter, students will:

  • Understand the distinction between supervised and unsupervised learning and be able to identify appropriate tools to answer different research questions.
  • Become familiar with basic unsupervised procedures including clustering and principal components analysis.
  • Become familiar with the following regression and classification algorithms: linear regression, ridge regression, the lasso, logistic regression, linear discriminant analysis, K-nearest neighbors, splines, generalized additive models, tree-based methods, and support vector machines.
  • Gain a practical appreciation of the bias-variance tradeoff and apply model selection methods based on cross-validation and bootstrapping to a prediction challenge.
  • Analyze a real dataset of moderate size using R.
  • Develop the computational skills for data wrangling, collaboration, and reproducible research.
  • Be exposed to other topics in machine learning, such as missing data, prediction using time series and relational data, non-linear dimensionality reduction techniques, web-based data visualizations, anomaly detection, and representation learning.


Prerequisites

Introductory courses in statistics or probability (e.g., Stats 60), linear algebra (e.g., Math 51), and computer programming (e.g., CS 105).


Questions

The vast majority of questions about the homework, the lectures, or the course should be asked on our Piazza forum, as others will benefit from the responses. You can join the Piazza forum using the link. We strongly encourage students to respond to one another's questions!

Personal staff email addresses should generally be avoided unless the email contains confidential information.

Staff and office hours

Consult this table for up-to-date office hour information. There is one weekly online office hour via Zoom. You should install the software beforehand; see Stanford Zoom.

Role        Name              Office hours                    Location
Instructor  Guenther Walther  F 2:00-3:00, or by appointment  Sequoia 135
TA          Swarnadip Ghosh   Th 6:30-7:30                    Sequoia 105
TA          Isaac Gibbs       F 10-11                         Sequoia Hall 200
TA          Sifan Liu         F 3-5                           380-381U
TA          Samyak Rajanala   T 3:15-5:15                     460-301
TA          Yu Wang           Th 2:30-4:20 pm                 380-381U
TA          Han Wu            M 5-7 pm                        460-334
TA          Chenyang Zhong    F 12-2 pm                       Sequoia 220
Zoom        Guenther Walther  T 2-3                           Zoom Meeting
Zoom        Swarnadip Ghosh   Th 5:30-6:30                    Zoom Meeting
Zoom        Isaac Gibbs       F 4-5 pm                        Zoom Meeting


Textbooks

The only required textbook is An Introduction to Statistical Learning with Applications in R by Gareth James, Daniela Witten, Trevor Hastie, and Robert Tibshirani (Springer, 1st ed., 2013). The book is available at the Stanford Bookstore and free online through the Stanford Libraries.

We may occasionally assign optional supplementary readings from The Elements of Statistical Learning by Hastie, Tibshirani, and Friedman (Springer, 2nd ed.).

In our lecture notes, we use the abbreviations ISL for An Introduction to Statistical Learning and ESL for The Elements of Statistical Learning.

If you would like to review concepts from linear algebra, we recommend Linear Algebra and Learning from Data by Gilbert Strang.


Exams

(If you are an online SCPD student, please see SCPD info for more information on remote exam instructions and timings.)

  • Midterm exam: Friday, October 25, 11:30 am-12:20 pm (in our normal classroom).
  • Final exam: Friday, December 13, 8:30-11:30 am (in a room TBD).

If you cannot take the exams on these dates, you will need to take this class in a different quarter. There will be no alternative exam dates, except for official university business such as certain athletic commitments. If you do better on the final than on the midterm, the final score supersedes the midterm score.


Homework

There will be 8 graded homework assignments, due on Fridays by 11:59 pm. Homework should be submitted through the course's Gradescope page. An ungraded assignment (Homework 0) will help you install and become familiar with the tools used in this course. The homework assignments and staff solutions will be posted on this website and will be accessible to enrolled students (see the Homework link on the left).

You must write up your own solutions individually and explicitly indicate with whom (if anyone) you discussed the homework problems at the top of your homework solutions. In your solutions, please show your work and include all relevant code written. Please also keep in mind the university honor code.

Homework problems are similar to writing assignments in other courses in terms of citing sources and plagiarism. Students must cite (via URL or otherwise) sources used in preparing their homework solution.

This quarter, we will be using the Gradescope online submission and scoring system for all homework submissions. Gradescope will send a Stats 202 enrollment notification to your Stanford email address. If you have not received such a notification by Wednesday, Sep. 25, please contact the course staff via the staff mailing list.

Your problem sets should be submitted as PDF or image files through Gradescope.

Any regrade requests should be submitted through Gradescope within one week of receiving your grade. Please read the relevant solutions and review the relevant course material before sending a request, and specify (1) the part(s) of the homework you believe were wrongly graded and (2) why you deserve additional credit. We will typically regrade the entirety of any homework for which a regrade is requested, and the resulting score may be higher or lower than the original one.

Late homework will not be accepted, but the lowest homework score will be ignored.

Kaggle competition

An important part of the class will be an in-class prediction challenge hosted by Kaggle. This competition will allow you to apply the concepts learned in class and develop the computational skills to analyze data in a collaborative setting.

To learn more about the competition see the link on the left.


Grading

  • Homework: 35% (lowest score dropped).
  • Midterm: 20%.
  • Final: 40%.
  • Kaggle competition: 5% (based on satisfactory participation).

Tentative outline

Day Topic Chapters
Mon 09/23 Class logistics, HW 0
Wed 09/25 Supervised and unsupervised learning 2
Fri 09/27 Principal components analysis 10.1, 10.2, 10.4
Mon 09/30 Clustering 10.3, 10.5
Wed 10/02 Linear regression 3.1-3.3
Fri 10/04 Linear regression 3.3
Mon 10/07 Linear regression 3.5
Wed 10/09 Classification, logistic regression 4.1-4.3
Fri 10/11 Linear and quadratic discriminant analysis 4.4-4.5
Mon 10/14 Classification examples 4.6
Wed 10/16 Cross validation 5.1
Fri 10/18 Bootstrap 5.2
Mon 10/21 Model selection 6.1, 6.2
Wed 10/23 Shrinkage 6.2
Fri 10/25 Midterm exam
Mon 10/28 Dimension reduction 6.3, 6.4
Wed 10/30 High-dimensional and non-linear regression, splines 6.4, 7.1-7.4
Fri 11/01 Smoothing splines, GAMs, local regression 7.5-7.7
Mon 11/04 GAMs, document analysis 7.7
Wed 11/06 Decision trees 8.1
Fri 11/08 Classification trees, bagging, random forests 8.1, 8.2
Mon 11/11 Boosting, support vector classifiers 9.1-9.2
Wed 11/13 Support vector machines 9.3, 9.4
Fri 11/15 Support vector machines 9
Mon 11/18 Non-linear dimensionality reduction ESL 14.5.4, 14.8-9
Wed 11/20 Missing data ESL 9.6
Fri 11/22 Missing data lab
Mon 11/25 Thanksgiving
Wed 11/27 Thanksgiving
Fri 11/29 Thanksgiving
Mon 12/02 Web visualizations
Wed 12/04 Final review All chapters
Fri 12/06 Final review All chapters
Fri 12/13 Final exam