CS 345, Winter 2014:
Topics in Database Management Systems

Coordinates: MW 12:35-2:05 in Hewlett Teaching Center 103
Instructor: Christopher Ré (chrismre)
Office Hours: W 11-12 (or by appointment or Skype)
Mailing list: cs345-win1314-all
This page is preliminary in every sense of the word. The page is intended only to give a rough, vague sense of the course.

Description

The first part of the course will describe classical database systems topics including join processing, concurrency control, recovery, query optimization, and database theory. On each topic, there will be an in-depth discussion of a few representative papers and recent results. The second part of the course will focus on additional topics that are relevant to database systems including MapReduce-style processing, information extraction, and predictive analytics.

One of the unique flavors of database research is that the area touches applications, systems building, and theory. To try to convey this flavor, this course will cover applications, systems, and theory papers. No one is expected to be expert in all the areas, but a willingness to appreciate what each brings to the table is required. Theory papers seem to cause a particular kind of trauma and so are highlighted in red.

The plan is to cover about 2 papers per week. Lecture time will be spent expanding on the details of these papers and related ideas.

Prerequisites: You are expected to have taken undergraduate courses in databases and algorithms. If you have concerns about meeting the prerequisites, please contact Chris.

Text: There is no formal textbook for this course. The reading list is a collection of papers, which is posted on the course web page.

Reference texts: The following sources may be used in this course. You do not need to buy these books.

J. Hellerstein and M. Stonebraker. Red Book: Readings in Database Systems, 2004. Extra material is here.
H. Garcia-Molina, J. Widom, and J. Ullman. Database Systems: The Complete Book.
R. Ramakrishnan and J. Gehrke. Cow Book: Database Management Systems.
J. Ullman. (The Baseball Book) Principles of Database & Knowledge-base Systems
S. Abiteboul, R. Hull and V. Vianu. Foundations of Databases.

Course Project

A component of this course is a research project. For the project, you pick a topic in the area of database systems and explore this topic in detail. I am happy to suggest a list of project topics, but you are free to select a project outside of this list. I require that you meet with me periodically throughout the quarter. The course project is a group project, and each group must be of size 2 or 3. Please start looking for project partners right away. It is your responsibility to form and manage groups. The course project will include a course project report, a short project presentation at the end of the quarter, and a final project report. The final project presentation will be in a workshop-like format.

More detail can be found here.

Grading and Deadlines

Deadlines and Milestones coming soon...

Element	Percentage	Breakdown
Homework	30%	Fundamentals of Query Evaluation (Due: Feb. 15, 11:59pm); Entity Resolution
Course Project	35%	Project selection report (Due: Jan. 31); intermediate project report (Due: Feb. 21); final project report (Due: March 14); project talk and demo.
Class and Reading	35%	You need to ask on average one question per week in class or by email.

Lecture Plan

This plan is as preliminary as it gets.

#	Topic	Reading	Slides
Fundamentals
1	Course Logistics and Database History	Glance at Research Overview Section.
2	Classical Join Processing	L. Shapiro on Joins from the 80s
3	System R-Style Optimizers and Histograms	Selinger System R. Optional: Chaudhuri
4	Formal Query Languages and Acyclic Joins	Reference only: AHV (Chapters 3, 4, and 6.4)
5	Worst-case Optimal Optimizers: NPRR and LFTJ	Must read! (Just kidding) Ngo et al.'s Survey
6	Wrap-up of Fundamentals
Data Systems for Analytics
7	Parallel Databases: from Gamma to Column-Stores	Gamma and C-Store
8	Optimizing Joins on MapReduce (Theory)	Ullman and Afrati's Shares paper
9	NoSQL: The Rise of MapReduce and Fault Tolerance	CACM MapReduce flame wars: Google vs. DB Researchers
10	NewSQL: PIG, Hive, and MapReduce Joins	PIG and Hive
11	Predictive Analytics Systems.	GraphLab, MADlib, and Hogwild!
Probabilistic and KB Systems
12	Knowledge Base Construction	Watson, Elementary, and NELL
13	Why Probabilistic Systems? Fundamentals of Probabilistic Query Evaluation.	Cox's Theorem (Jaynes Ch. 2) and Prob DB book Ch. 1
Transactions and OLTP
14	Locking, Latching, and Recovery.	ARIES
15	NewSQL OLTP: Spanner+F1 and Main Memory Databases	Spanner and F1
Grand Finale
**	Project Presentations

Reading and Topic List

This list is preliminary. It will change as the quarter evolves.

Database Research Overview

M. Stonebraker. What Goes Around Comes Around. Readings in Database Systems. 2004.
M. Stonebraker et al. "One Size Fits All": An Idea Whose Time Has Come and Gone, 2005
A. Halevy et al. The Unreasonable Effectiveness Of Data, IEEE Intelligent Systems, 2009.

Fundamentals of Relational Query Processing (SQL)

SQL-style Analytics and Log Processing (OLAP)

SQL

J. Gray et al.: Data Cube: A Relational Aggregation Operator Generalizing Group-by, Cross-Tab, and Sub Totals. DMKD 1(1): 29-53 (1997).

Fagin's Algorithm

NoSQL

J. Dean et al. MapReduce: simplified data processing on large clusters. Commun. ACM 51(1): 107-113 (2008).

NewSQL

C. Olston et al. Pig Latin: a not-so-foreign language for data processing. SIGMOD Conference 2008: 1099-1110
A. Gates et al. Building a High-Level Dataflow System on top of MapReduce: The Pig Experience. PVLDB 2(2): 1414-1425 (2009)
A. Thusoo et al., Hive: A Warehousing Solution Over A MapReduce Framework, VLDB, 2009.
S. Melnik et al., Dremel: Interactive Analysis Of Web-Scale Datasets, VLDB, 2010.

Predictive Analytics

Statistical Analytics

Y. Zhang: RIOT: I/O-Efficient Numerical Computing without SQL, CIDR 2009.
ArrayStore.
F. Niu. Hogwild!: A Lock-Free Approach to Parallelizing Stochastic Gradient Descent. NIPS, 2011.
J. Canny. Big data analytics with small footprint: squaring the cloud, KDD 2013.
Notes: Simple Analysis of First-order Methods and QR decomposition.

Frameworks for Statistical Analytics

Y. Low, et al., Distributed GraphLab: A Framework For Machine Learning And Data Mining In The Cloud, VLDB, 2012
M. Zaharia, et al., Resilient Distributed Datasets: A Fault-Tolerant Abstraction For In-Memory Cluster Computing, NSDI, 2012
J. Hellerstein. The MADlib Analytics Library or MAD Skills, the SQL. PVLDB 2012
Y. Bu. HaLoop: Efficient Iterative Data Processing on Large Clusters. VLDB 10.

Knowledge Base Construction

D. Ferrucci et al. Building Watson: An Overview of the DeepQA Project. AI Magazine, 2010.
Google's Knowledge Graph. paper coming soon.
Kasneci et al. The YAGO-NAGA approach to knowledge discovery, 2009
Niu et al. Elementary: Large-scale Knowledge-base Construction via Machine Learning and Statistical Inference, 2012.
A. Carlson. Toward an Architecture for Never-Ending Language Learning, AAAI 2010.
O. Deshpande et al. Building, maintaining, and using knowledge bases: a report from the trenches. SIGMOD 2013.
Probabilistic Inference in Large Factor Graphs

Transaction Processing (OLTP)

SQL

J. Gray: Granularity of Locks and Degrees of Consistency in a Shared Data Base, 1976.

NoSQL

W. Vogels. Eventually consistent. Commun. ACM 52(1): 40-44 (2009).
F. Chang et al. Bigtable: A Distributed Storage System for Structured Data. ACM Trans. Comput. Syst. 26(2): (2008).
G. DeCandia et al., Dynamo: Amazon's Highly Available Key-Value Store, SOSP, 2007
B. Cooper et al., PNUTS: Yahoo!'s Hosted Data Serving Platform, VLDB, 2008

NewSQL

VoltDB and HStore (Main Memory Systems)
J. Lee, et al., High-Performance Transaction Processing In SAP HANA, ICDE Bulletin, 2013
J.C. Corbett, et al., Spanner: Google's Globally-Distributed Database, OSDI, 2012
J. Shute, et al., F1: A Distributed SQL Database That Scales, VLDB, 2013
M. Demirbas. An Overview Of Spanner, Online, 2013
