What is this course about?
The course will discuss data mining and machine learning algorithms for analyzing very large amounts of data. The emphasis will be on MapReduce and Spark as tools for creating parallel algorithms that scale to datasets of this size.
Topics include: frequent itemsets and association rules, near neighbor search in high-dimensional data, locality-sensitive hashing (LSH), dimensionality reduction, recommendation systems, clustering, link analysis, large-scale supervised machine learning, data streams, mining the Web for structured data, and Web advertising.
The previous version of the course was CS345A: Data Mining, which also included a course project. CS345A has now been split into two courses: CS246 (Winter, 3-4 Units, homeworks, final, no project) and CS341 (Spring, 3 Units, project focused). You can access class notes and slides of previous versions of the course here:
CS246 Websites: CS246: Winter 2018 / CS246: Winter 2017 / CS246: Winter 2016 / CS246: Winter 2015 / CS246: Winter 2014 / CS246: Winter 2013 / CS246: Winter 2012 / CS246: Winter 2011
CS345a Website: CS345a: Winter 2010
In Winter 2019, CS246H: Mining Massive Data Sets: Hadoop Labs is a partner course to CS246 and includes a limited number of additional assignments. CS246H focuses on the practical application of big data technologies rather than on the theory behind them.
In Spring 2019, we will be offering CS341: Project in Mining Massive Data Sets, a project-based course in which students apply data mining and machine learning techniques to real-world datasets.
Students are expected to have the following background:
- Knowledge of basic computer science principles and skills, at a level sufficient to write a reasonably non-trivial computer program (e.g., CS107 or CS145 or equivalent are recommended).
- Good knowledge of Java and Python will be extremely helpful since most assignments will require the use of Spark/Hadoop.
- Familiarity with basic probability theory (CS109 or Stat116 or equivalent is sufficient but not necessary).
- Familiarity with writing rigorous proofs (at a minimum, at the level of CS 103).
- Familiarity with basic linear algebra (e.g., any of Math 51, Math 103, Math 113, CS 205, or EE 263 would be much more than necessary).
- Familiarity with algorithmic analysis (e.g., CS 161 would be much more than necessary).
The recitation sessions in the first weeks of the class will give an overview of the expected background.
The following text is useful, but not required. It can be downloaded for free, or purchased from Cambridge University Press.
Leskovec-Rajaraman-Ullman: Mining of Massive Datasets