Mining Massive Data Sets
Winter 2018

In the first two weeks of the class, we will hold three recitation sessions that will serve as refreshers on important course material.

Course description

The course will discuss data mining and machine learning algorithms for analyzing very large amounts of data. The emphasis will be on MapReduce as a tool for creating parallel algorithms that can process very large amounts of data.

Topics include: Frequent itemsets and Association rules, Near Neighbor Search in High Dimensional Data, Locality Sensitive Hashing (LSH), Dimensionality reduction, Recommendation Systems, Clustering, Link Analysis, Large scale supervised machine learning, Data streams, Mining the Web for Structured Data, Web Advertising.

CS246 is the first part of a two-part sequence, CS246 and CS341. CS246 will discuss methods and algorithms for mining massive data sets, while CS341: Project in Mining Massive Data Sets will be a project-focused advanced class with unlimited access to a large MapReduce cluster.

For students who want to learn more about Spark and Hadoop, we are also offering CS246H: Mining Massive Data Sets: Hadoop/Spark Labs. In CS246H, Spark and Hadoop will be covered in depth to give students a more complete understanding of these platforms and their role in data mining. CS246H videos may be viewed here.

Course outline

The list of topics to be covered is tentative and may change as the quarter progresses.

See Handouts for a list of topics and reading materials.

Students are expected to have the following background:

The recitation sessions in the first weeks of the class will give an overview of the expected background.

Course materials

Lecture notes and slides will be posted online. Readings have been derived from the book Mining of Massive Datasets by Jure Leskovec, Anand Rajaraman, and Jeff Ullman.

Books: Leskovec-Rajaraman-Ullman: Mining of Massive Datasets can be downloaded for free. It can also be purchased from Cambridge University Press, but you are not required to do so.

MOOC: You can watch videos from a past Coursera MOOC (similar to this course) on YouTube.

Piazza: Piazza Discussion Group for this class.

Course handouts: Available here.

Course work and grading

The coursework for the course will consist of:

Please read the homework submission instructions and policies for instructions on how to submit homework, register for Gradiance, etc.


Most assignments will require some level of programming in Spark. Spark is an open-source distributed data processing engine that generalizes the MapReduce model and is used for mining large data sets across clusters of computers.
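
For readers new to the MapReduce model, the sketch below shows a word count split into map, shuffle, and reduce steps. It is plain single-machine Python, not Spark code and not taken from the course materials. The function names (`map_phase`, `shuffle_phase`, `reduce_phase`, `word_count`) are hypothetical and chosen only for illustration.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle_phase(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine the values for one key into a single count."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle, and reduce in sequence on a list of lines."""
    mapped = chain.from_iterable(map_phase(line) for line in lines)
    grouped = shuffle_phase(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    # Illustrative input only; real jobs read from distributed storage.
    sample = ["mining massive data sets", "massive data"]
    print(word_count(sample))
```

In Spark, roughly the same pipeline is usually written as a chain of `flatMap`, `map`, and `reduceByKey` calls on an RDD, with the framework handling the shuffle and the distribution across machines.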

You will be running Spark jobs on your local laptop/desktop. Instructions on installing Spark can be found in homework 0.

Recitation sessions

Three recitation sessions will be held:

The recitation sessions are only intended to be refreshers; it is expected that you have already taken courses that include this material.

