## Stanford MS&E 226 – “Small” Data

## Class description – Autumn 2018

This course is about understanding “small data”: these are datasets that allow interaction, visualization, exploration, and analysis on a local machine. The material provides an introduction to applied data analysis, with an emphasis on providing a conceptual framework for thinking about data from both statistical and machine learning perspectives. Topics will be drawn from the following list, depending on time constraints and class interest: approaches to data analysis: statistics (frequentist, Bayesian) and machine learning; binary classification; regression; bootstrapping; causal inference and experimental design; multiple hypothesis testing. Homeworks will have a significant practical and computational load to help students apply the concepts discussed in class.

## Outline

**Summarization** (2 weeks). Given a single data set, how do we summarize it? Basic sample statistics. Using models to succinctly summarize data. The algebra of linear regression and logistic regression. In-sample measures of fit: R² and residuals.

**Prediction** (2-3 weeks). How do we generalize our understanding of a data set to new samples? Formalizing the prediction problem. Binary classification. Linear regression and logistic regression as approaches to prediction. Model complexity and the bias-variance tradeoff. Training vs. test sets and cross validation.

**Inference** (2-3 weeks). How do we generalize our understanding of a data set to draw inferences about the population or system from which the data came? The basics of frequentist estimation and hypothesis testing. Application to linear regression. The bootstrap. The multiple hypothesis testing problem. Comparison to Bayesian estimation and hypothesis testing.

**Causality** (2 weeks). How do we determine the effect that changing a system will have? The Rubin causal model, potential outcomes, and counterfactuals. The “gold standard”: randomized experiments. The basics of causal inference from observational data. From causal inference to data-driven decisions.

## Logistics
(Most) Mondays, Wednesdays, and Fridays, 10:30 - 11:50 AM, Skilling Auditorium

Discussion sections: Fridays, 3:30 - 4:20 PM, Skilling Auditorium

- 5 problem sets.
- Midterm, in-class (multiple choice), on October 31.
- Midterm, take-home, handed out October 31 and due November 2.
- Final exam (multiple choice) on December 13.
- Mini-project.

## Downloading R

There is a computational component to this class, which requires using R. (If you like, you may use Python or Matlab, but officially the class will use R.) An easy interface to R that you can use on your local machine is RStudio Desktop, which is available free for non-commercial use. R is powerful in part because of the range of packages available for it, including:

- `tidyverse`: A collection of useful packages, including `ggplot2` and `dplyr`.
- `arm`: A set of helper functions from Andrew Gelman and Jennifer Hill's book.

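As a minimal sketch, the two packages named above could be installed and loaded as follows. The `repos` URL is an assumption (any CRAN mirror works, and RStudio usually has one configured already), and the check for already-installed packages is a convenience added here, not part of the course instructions.

```r
# Packages named in this section.
pkgs <- c("tidyverse", "arm")

# Find which of them are not yet installed on this machine.
is_installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
to_install <- pkgs[!is_installed]

# Install only what is missing (one-time download from CRAN).
# The mirror URL below is an assumption; substitute your preferred mirror.
if (length(to_install) > 0) {
  install.packages(to_install, repos = "https://cloud.r-project.org")
}

# Load the packages into the current session. Loading tidyverse last
# means its function names take precedence over same-named functions
# from packages attached earlier.
library(arm)
library(tidyverse)
```

Installation is needed only once per machine, while `library()` must be called in every new R session that uses the package.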
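To install a package, run `install.packages()` with the package name in quotes, for example `install.packages("tidyverse")`. To load a package, run `library()` with the package name, for example `library(tidyverse)`.

Some links to get you started with R:

## Course staff
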
- Je-ok Choi (ICME Ph.D.)
- Lin Fan (MS&E Ph.D.)
- Nikhil Garg (EE Ph.D.)
- Allison Park (Stats MS)
- Meltem Tutar (MS&E MS)

NOTE: Please use Piazza for course-related communication.