Stanford MS&E 226 – “Small” Data

Class description – Autumn 2018

This course is about understanding “small data”: datasets small enough to allow interaction, visualization, exploration, and analysis on a local machine. The material provides an introduction to applied data analysis, with an emphasis on a conceptual framework for thinking about data from both statistical and machine learning perspectives. Topics will be drawn from the following list, depending on time constraints and class interest: approaches to data analysis from statistics (frequentist and Bayesian) and machine learning; binary classification; regression; bootstrapping; causal inference and experimental design; multiple hypothesis testing.

Homework assignments will carry a significant practical and computational load to help students apply the concepts discussed in class.

Outline

  1. Summarization (2 weeks). Given a single data set, how do we summarize it? Basic sample statistics. Using models to succinctly summarize data. The algebra of linear regression and logistic regression. In-sample measures of fit: R² and residuals.

  2. Prediction (2-3 weeks). How do we generalize our understanding of a data set to new samples? Formalizing the prediction problem. Binary classification. Linear regression and logistic regression as approaches to prediction. Model complexity and the bias-variance tradeoff. Training vs. test sets and cross validation.

  3. Inference (2-3 weeks). How do we generalize our understanding of a data set to draw inferences about the population or system from which the data came? The basics of frequentist estimation and hypothesis testing. Application to linear regression. The bootstrap. The multiple hypothesis testing problem. Comparison to Bayesian estimation and hypothesis testing.

  4. Causality (2 weeks). How do we determine the effect that changing a system will have? The Rubin causal model, potential outcomes, and counterfactuals. The “gold standard”: randomized experiments. The basics of causal inference from observational data. From causal inference to data-driven decisions.
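As a rough illustration of how the four parts of the outline look in practice, here is a minimal sketch in base R. It is not course material: the data are simulated, and every variable name, sample size, and coefficient below is an assumption made up for the example.

```r
# Simulated data; all names and numbers are illustrative assumptions.
set.seed(226)
n <- 200
x <- rnorm(n)
y <- 1 + 2 * x + rnorm(n)
dat <- data.frame(x = x, y = y)

# 1. Summarization: fit a linear regression and read off in-sample R^2.
fit <- lm(y ~ x, data = dat)
r2_in <- summary(fit)$r.squared

# 2. Prediction: fit on a training set, measure squared error on a held-out test set.
train_idx <- sample(n, size = 150)
train <- dat[train_idx, ]
test <- dat[-train_idx, ]
fit_train <- lm(y ~ x, data = train)
test_mse <- mean((test$y - predict(fit_train, newdata = test))^2)

# 3. Inference: bootstrap the slope by resampling rows with replacement.
B <- 500
boot_slopes <- replicate(B, {
  idx <- sample(n, replace = TRUE)
  coef(lm(y ~ x, data = dat[idx, ]))[["x"]]
})
slope_se <- sd(boot_slopes)  # bootstrap estimate of the slope's standard error

# 4. Causality: in a simulated randomized experiment, the difference in
#    group means estimates the average treatment effect.
treat <- rbinom(n, 1, 0.5)
outcome <- 0.5 * treat + rnorm(n)
ate_hat <- mean(outcome[treat == 1]) - mean(outcome[treat == 0])
```

Comparing `r2_in` with `test_mse` shows the difference between in-sample fit and out-of-sample error, and `slope_se` can be checked against the standard error reported by `summary(fit)`.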

Logistics

Class times and locations:

  • (Most) Mondays, Wednesdays, and Fridays, 10:30-11:50 AM, Skilling Auditorium

  • Discussion sections Fridays, 3:30-4:20 PM, Skilling Auditorium

Evaluation:

  • 5 problem sets.

  • In-class midterm (multiple choice) on October 31.

  • Take-home midterm handed out October 31, due November 2.

  • Final exam (multiple choice) on December 13.

  • Mini-project.

Downloading R

There is a computational component to this class, which requires using R. (You may use Python or MATLAB if you prefer, but the class officially uses R.)

An easy interface to R that you can use on your local machine is RStudio Desktop, which is available free for non-commercial use.

R is powerful in part because of the range of available packages that extend its capabilities. After downloading and installing R, you will find it helpful to also install the following packages:

  1. tidyverse: A collection of useful packages, including ggplot2 and dplyr.

  2. arm: A set of helper functions from Andrew Gelman and Jennifer Hill's book, Data Analysis Using Regression and Multilevel/Hierarchical Models.

To install a package, run install.packages("<package_name>") at the R command prompt.

To load a package, run library("<package_name>") at the R command prompt.
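For the two packages listed above, the install and load steps might be combined as in the sketch below. The variable names `pkgs` and `missing_pkgs` are made up for this example, and the install step assumes an internet connection and a configured CRAN mirror (RStudio normally sets one for you).

```r
# Packages named in the list above.
pkgs <- c("tidyverse", "arm")

# Find the packages that are not yet installed on this machine.
installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
missing_pkgs <- pkgs[!installed]

# Install only what is missing (assumes network access and a CRAN mirror).
if (length(missing_pkgs) > 0) {
  install.packages(missing_pkgs)
}

# Load each package for the current session.
for (p in pkgs) {
  library(p, character.only = TRUE)
}
```

Installation is needed only once per machine, whereas `library()` must be run again in every new R session.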

Some links to get you started with R:

  1. R for Data Science

  2. Code School R tutorial

  3. R for beginners

  4. Stack Overflow

  5. Cross Validated

  6. ggplot2 homepage

  7. ggplot2 book (free using Stanford Libraries)

  8. Cookbook for R

  9. Quick-R

  10. R tutorial at Cyclismo

Course staff

Professor:
Ramesh Johari

TAs:

Je-ok Choi (ICME Ph.D.)

Lin Fan (MS&E Ph.D.)

Nikhil Garg (EE Ph.D.)

Allison Park (Stats MS)

Meltem Tutar (MS&E MS)

NOTE: Please use Piazza for course-related communication.