Stats 306B: Methods for Applied Statistics: Unsupervised Learning

Lester Mackey, Stanford University, Spring 2014

Lectures

Monday and Wednesday, 3:15 - 4:30 PM in Building 300 Room 300.

Instructor

Lester Mackey. Office hours: Mon. 2:05 - 3:05 PM, Weds. 2:05 - 3:05 PM, 141 Sequoia Hall. Email: lmackey@

Teaching Assistants

Xiaoying Tian. Office hours: Tues. 11:00 AM - 12:00 PM, Fri. 10:00 - 11:00 AM, 232 Sequoia Hall. Email: xtian@

Jackson Gorham. Office hours: Mon. 12:00 - 2:00 PM, 207 Sequoia Hall. Email: jgorham@

Prerequisites

Introductory statistical theory (e.g., Stats 200), linear algebra (e.g., Math 113), and programming (e.g., Computer Science 106A). Students should be comfortable with a matrix-oriented programming language such as R or MATLAB.

Texts

Grading

Your grade will be determined by scribing (3%), three problem sets (42%), a midterm (15%), and a final project (40%).

Scribing

In order to gain experience with technical writing, each student will be required to prepare scribe notes for a single lecture. After taking careful notes in class, the scribes for a given lecture will jointly prepare a LaTeX document (using this style file and this template) written in full prose understandable to a student who may have missed class. The LaTeX document, along with any image or auxiliary files, should be submitted to the instructor within two weekdays of the scribed lecture. After review, the scribe notes will be posted to the course website.

Problem Sets

Problem sets will be posted on the class website and are due in class on Wednesdays at the start of lecture. If you are traveling, you may email your solution to one of the course staff in advance of the deadline. Ten percent of an assignment's value will be deducted for each day it is late. Exceptions will be made for documented emergencies. No credit will be given for homework submitted after solutions have been posted.

After attempting the problems on an individual basis, you may discuss a homework assignment with up to two classmates. However, you must write your own code, write up your own solutions individually, and explicitly name any collaborators at the top of the homework.

Please keep in mind the university honor code.

Midterm

The midterm will be held in our normal classroom during our normal class time. Any material from lectures, problem sets, or assigned readings issued before May 15 may be tested. You may refer to your course texts, assigned readings, and notes during the exam. You may not make use of the internet or any other outside resources during the exam.

Final Project

See the final project page.

R Resources

You can download R for free for any computing platform from the R Project for Statistical Computing. R is already installed on many campus computers.

Getting Started

Popular Development Environments

Course Overview

How do you identify cancer subtypes from unlabeled gene expression data?

How do you detect anomalous network behavior from traffic patterns?

How do you segment an image into its constituent parts?

In Stats 306B, we will learn to recover the hidden structure underlying our observations as we survey classic and modern unsupervised learning techniques and their practical applications.

Course Topics (as time and interest permit)

Clustering and Latent Class Methods

  • K-means, K-medoids

  • Mixture Models, Expectation Maximization

  • Hidden Markov Models, Baum-Welch

  • Hierarchical Clustering

  • Spectral Clustering

Dimensionality Reduction and Latent Feature Methods

  • Principal Component Analysis, Kernel PCA

  • Factor Analysis, Probabilistic PCA

  • State Space Models, Kalman Filtering

  • Canonical Correlation Analysis

  • Independent Component Analysis

Modern Topics

  • Unsupervised Learning with Missing Data

  • Sparse / Interpretable Unsupervised Learning

  • Nonnegative Matrix Factorization, Document Topic Modeling

  • Subspace Clustering

  • Method of Moments for Latent Variable Models

  • Unsupervised Deep Learning