CS 345, Winter 2014:
Topics in Database Management Systems

Description Project Grading Reading Lectures
Coordinates: MW 12:35-2:05 in Hewlett Teaching Center 103
Instructor: Christopher Ré (chrismre)
Office Hours: W 11-12 (or by appointment or Skype)
mailing list: cs345-win1314-all
This page is preliminary in every sense of the word. The page is intended only to give a rough, vague sense of the course.

Description

The first part of the course will describe classical database systems topics including join processing, concurrency control, recovery, query optimization, and database theory. On each topic, there will be an in-depth discussion of a few representative papers and recent results. The second part of the course will focus on additional topics that are relevant to database systems including MapReduce-style processing, information extraction, and predictive analytics.

One of the unique flavors of database research is that the area touches applications, systems building, and theory. To try to convey this flavor, this course will cover applications, systems, and theory papers. No one is expected to be expert in all the areas, but a willingness to appreciate what each brings to the table is required. Theory papers seem to cause a particular kind of trauma and so are highlighted in red.

The plan is to cover about 2 papers per week. Lecture time will be spent expanding on the details of these papers and related ideas.

Prerequisites: You are expected to have taken an undergrad database and algorithms course. If you have concerns about meeting the prerequisites, please contact Chris.

Text: There is no formal textbook for this course. The reading list is a collection of papers, which is posted on the course web page.
 
Reference texts: The following sources may be used in this course. You do not need to buy these books.

Course Project

A component of this course is a research project. For the project, you pick a topic in the area of database systems and explore this topic in detail. I am happy to suggest a list of project topics, but you are free to select a project outside of this list. I require that you meet with me periodically throughout the quarter. The course project is a group project, and each group must be of size 2 or 3. Please start looking for project partners right away. It is your responsibility to form and manage groups. The course project will include a course project report, a short project presentation at the end of the quarter, and a final project report. The final project presentation will be in a workshop-like format.

More detail can be found here.

Grading and Deadlines

Deadlines and Milestones coming soon...
Element Percentage Breakdown
Homework 30%
Course Project 35%
  • Project selection report. (Due Jan. 31)
  • Intermediate project report (Due Feb. 21)
  • Final project report (Due March 14)
  • Project talk and demo.
Class and Reading 35%
  • You need to ask on average one question per week in class or by email.

Lecture Plan

This plan is as preliminary as it gets.
# Topic Reading Slides
Fundamentals
1 Course Logistics and Database History Glance at Research Overview Section.
2 Classical Join Processing L. Shapiro on Joins from the 80s
3 System R-Style Optimizers and Histograms Selinger System R. Optional: Chaudhuri
4 Formal Query Languages and Acyclic Joins Reference only: AHV (Chapters 3, 4, and 6.4)
5 Worst-case Optimal Optimizers: NPRR and LFTJ Must read! (Just kidding) Ngo et al.'s Survey
6 Wrap-up of Fundamentals
Data Systems for Analytics
7 Parallel Databases: from Gamma to Column-Stores Gamma and C-Store
8 Optimizing Joins on MapReduce (Theory) Ullman and Afrati's Shares paper
8 NoSQL: The Rise of MapReduce and Fault Tolerance CACM MapReduce flame wars:
Google vs. DB Researchers
9 NewSQL: PIG, Hive, and MapReduce Joins PIG and Hive
11 Predictive Analytics Systems. GraphLab, MADlib, and Hogwild!
Probabilistic and KB Systems
12 Knowledge Base Construction Watson, Elementary, and NELL
13 Why Probabilistic Systems? Fundamentals of Probabilistic Query Evaluation. Cox's Theorem (Jaynes Ch. 2) and Prob DB book Ch. 1
Transactions and OLTP
14 Locking, Latching, and Recovery. ARIES
15 NewSQL OLTP: Spanner+F1 and Main Memory Databases Spanner and F1
Grand Finale
** Project Presentations


Reading and Topic List

This list is preliminary. It will change as the quarter evolves.
Database Research Overview
  1. M. Stonebraker. What Goes Around Comes Around. Readings in Database Systems. 2004.
  2. M. Stonebraker et al. "One Size Fits All": An Idea Whose Time Has Come and Gone, 2005
  3. A. Halevy et al. The Unreasonable Effectiveness Of Data, IEEE Intelligent Systems, 2009.
Fundamentals of Relational Query Processing (SQL)
  1. L. Shapiro: Join Processing in Database Systems with Large Main Memories. ACM Trans. Database Syst. 11(3): 239-264 (1986)
  2. P. Selinger et al.: Access Path Selection in a Relational Database Management System. SIGMOD 1979: 23-34
  3. S. Chaudhuri: An Overview of Query Optimization in Relational Systems. PODS 1998: 34-43
  4. Y. Ioannidis et al. Balancing Histogram Optimality and Practicality for Query Result Size Estimation. SIGMOD 1995.
  5. AHV Chapter 6.4 and Yannakakis's Algorithm (Acyclic Joins)
  6. H. Ngo et al.: Skew Strikes Back: New Developments in the Theory of Join Algorithms, Manuscript, 2013
SQL-style Analytics and Log Processing (OLAP)
SQL
  1. J. Gray et al.: Data Cube: A Relational Aggregation Operator Generalizing Group-by, Cross-Tab, and Sub Totals. DMKD 1(1): 29-53 (1997).
  2. S. Abiteboul et al. Complexity of Answering Queries Using Materialized Views, PODS 1998.
  3. M Stonebraker et al. C-Store: A Column-oriented DBMS. VLDB 2005: 553-56.
  4. Fagin's Algorithm
NoSQL
  1. J. Dean et al. MapReduce: simplified data processing on large clusters. Commun. ACM 51(1): 107-113 (2008).
  2. Parallel DBMS versus MapReduce
  3. Theory. Ullman and Suciu's papers about MapReduce and Joins
NewSQL
  1. C. Olston et al. Pig Latin: a not-so-foreign language for data processing. SIGMOD Conference 2008: 1099-1110
  2. A. Gates et al. Building a High-Level Dataflow System on top of MapReduce: The Pig Experience. PVLDB 2(2): 1414-1425 (2009)
  3. A. Thusoo et al., Hive: A Warehousing Solution Over A MapReduce Framework, VLDB, 2009.
  4. S. Melnik et al., Dremel: Interactive Analysis Of Web-Scale Datasets, VLDB, 2010.
Predictive Analytics
Statistical Analytics
  1. Y. Zhang: RIOT: I/O-Efficient Numerical Computing without SQL, CIDR 2009.
  2. ArrayStore.
  3. F. Niu. Hogwild!: A Lock-Free Approach to Parallelizing Stochastic Gradient Descent NIPS, 2011.
  4. J. Canny. Big data analytics with small footprint: squaring the cloud , KDD 2013.
  5. Notes: Simple Analysis of First-order Methods and QR decomposition.
Frameworks for Statistical Analytics
  1. Y. Low, et al., Distributed GraphLab: A Framework For Machine Learning And Data Mining In The Cloud, VLDB, 2012
  2. M. Zaharia, et al., Resilient Distributed Datasets: A Fault-Tolerant Abstraction For In-Memory Cluster Computing, NSDI, 2012
  3. J. Hellerstein. The MADlib Analytics Library or MAD Skills, the SQL. PVLDB 2012
  4. Y. Bu. HaLoop: Efficient Iterative Data Processing on Large Clusters. VLDB 10.
Knowledge Base Construction
  1. D. Ferruci et al. Building Watson: An Overview of the DeepQA Project. AI Magazine, 2013.
  2. Google's Knowledge Graph. paper coming soon.
  3. Kasneci et al. The YAGO-NAGA approach to knowledge discovery, 2009
  4. Niu et al. Elementary: Large-scale Knowledge-base Construction via Machine Learning and Statistical Inference, 2012.
  5. A. Carlson. Toward an Architecture for Never-Ending Language Learning, AAAI 2010.
  6. O. Deshpande et al. Building, maintaining, and using knowledge bases: a report from the trenches. SIGMOD 2013.
  7. Probabilistic Inference in Large Factor Graphs
Transaction Processing (OLTP)
SQL
  1. J. Gray: Granularity of Locks and Degrees of Consistency in a Shared Data Base, 1976.
  2. P. Lehman et al. Efficient Locking for Concurrent Operations on B-Trees. TODS 6(4): 650-670 (1981)
  3. C. Mohan et al.: ARIES: A Transaction Recovery Method Supporting Fine-Granularity Locking and Partial Rollbacks Using Write-Ahead Logging. TODS 17(1): 94-162 (1992) You do not need to read the whole paper.
  4. C. Mohan et al.: Transaction Management in the R* Distributed Database Management System. TODS 11(4): 378-396 (1986)
  5. L. Lamport, Paxos Made Simple, ACM SIGACT News, 2001.
  6. CAP Theorem.
NoSQL
  1. W. Vogels. Eventually consistent. Commun. ACM 52(1): 40-44 (2009).
  2. F. Chang et al. Bigtable: A Distributed Storage System for Structured Data. ACM Trans. Comput. Syst. 26(2): (2008).
  3. G. DeCandia et al., Dynamo: Amazon's Highly Available Key-Value Store, SOSP, 2007
  4. B. Cooper et al., PNUTS: Yahoo!'s Hosted Data Serving Platform, VLDB, 2008
  5. P. Alvaro et al. Consistency Analysis in Bloom: a CALM and Collected Approach. CIDR 11.
  6. CALM Conjecture: a proof and a refutation.
NewSQL
  1. VoltDB and HStore (Main Memory Systems)
  2. J. Lee, et al., High-Performance Transaction Processing In SAP HANA, ICDE Bulletin, 2013
  3. J.C. Corbett, et al., Spanner: Google's Globally-Distributed Database, OSDI, 2012
  4. J. Shute, et al., F1: A Distributed SQL Database That Scales, VLDB, 2013
  5. M. Demirbas. An Overview Of Spanner, Online, 2013
Other Topics
Crowds
  1. A. Parameswaran, et al., Crowdscreen: Algorithms For Filtering Data With Humans, SIGMOD, 2012
  2. M. Stonebraker, et al., Data Curation At Scale: The Data Tamer System, CIDR, 2013
  3. M. Franklin, et al., CrowdDB: Answering Queries With Crowdsourcing, SIGMOD, 2011
Misc
  1. R. Agrawal. Fast Algorithms for Mining Association Rules in Large Databases. VLDB 1994.
  2. H. T. Kung: On Optimistic Methods for Concurrency Control. TODS 6(2): 213-226 (1981).
  3. J. Gray and L. Lamport. Consensus on Transaction Commit. MSR-TR-2003-96.
Relational Engine Internals
  1. Chou: An Evaluation of Buffer Management Strategies for Relational Database Systems. Algorithmica 1(3): 311-336 (1986).
  2. J. Gray. The 5 Minute Rule for Trading Memory for Disk Accesses and The 10 Byte Rule for Trading Memory for CPU Time. SIGMOD 1987.
  3. E. O'Neil: The LRU-K Page Replacement Algorithm For Database Disk Buffering. SIGMOD 1993: 297-306
  4. G. Graefe: The five-minute rule 20 years later (and how flash memory changes the rules). Commun. ACM 52(7): 48-59 (2009)
Incomplete, Inconsistent, and Probabilistic Databases
  1. T. Imielinksi et al. Incomplete Information in Relational Databases, JACM 1994
  2. Chapter 19 in AHV
  3. A. Das Sarma et al. Representing Uncertain Data: Uniqueness, Equivalence, Minimization, and Approximation, 2005.
  4. Suciu et al. Probabilistic Databases, Synthesis Lectures, 2011.
The Art of Reading Papers
Please read these on your own.
  1. M.J. Hanson. Efficient Reading of Papers in Science and Technology, 1989
  2. P. Valduriez, Some Hints to Improve Writing of Technical Papers, 1994.