Project in Mining Massive Data Sets
Spring 2018

Course Information

Course description

CS341 (Project in Mining Massive Data Sets) is a project-focused advanced class with access to a large MapReduce cluster. This course is the second part of the two-part sequence CS246/CS341.

CS246 discusses methods and algorithms for mining massive data sets. In this class, we will develop large-scale data mining techniques and research projects. Students will have access to a Google Cloud computing cluster, which means we will be able to run massive MapReduce jobs. Because it is challenging to work on algorithms for large-scale data mining, we will be able to work with only a small number of students, and enrollment will be limited.

This is a purely project-based course. We expect students to already have some familiarity with data mining methods. There will be lectures on some advanced data mining algorithms at the beginning of the quarter. We also expect to have a good number of industrial guest lecturers discussing big data case studies.

Course projects and datasets for 2018


Knowledge of and familiarity with the concepts of CS246 or a similar class (Hadoop, large-scale data mining, and machine learning algorithms).

Other courses that might be helpful: CS221, CS224N, CS224W, CS228, CS229, CS276, EE364A.


Doing research in data mining can be challenging! Thus we will only be able to work with a small number of students, and enrollment will be limited.

Course application procedure

To apply to the course, follow these instructions.

If you would like help or guidance when developing your project idea, feel free to contact the course staff; we will help you develop it.

The project proposal submission deadline is March 15 at 11:59pm Pacific Time. Over the weekend, we will evaluate the proposals and notify you whether your team has been accepted into the class.

Project writeups

The result of the project is a 5-10 page paper. We will not accept longer reports.

Course materials

The book Mining of Massive Datasets by Anand Rajaraman and Jeff Ullman serves as a comprehensive reference for the background required for this course. You will also find Section 20.2 and Chapters 22 and 23 of the second edition of Database Systems: The Complete Book (Garcia-Molina, Ullman, Widom) relevant.


Students will be required to successfully complete a substantial data-mining project. There will be a mid-quarter project milestone submission/presentation in addition to the final project presentation and report submission.

Recitation sessions

Recitation sessions will be held to guide the students on the use of Google Cloud services.

Course work

The coursework for the course will consist of:


Mailing list: You can reach us at cs341-spr1718-staff@lists.stanford.edu
Piazza: For class-related questions and discussions, you can use Piazza: https://piazza.com/stanford/spring2018/cs341.