Principles of Data-Intensive Systems

Spring 2019
Mon/Wed 1:30-2:50 PM, Skillaud

This course covers the architecture of modern data storage and processing systems, including relational databases, cluster computing systems, and streaming and machine learning systems. Topics include database system architecture, storage, query optimization, transaction management, fault recovery, and parallel processing, with a focus on the key design ideas shared across many types of data-intensive systems.

Course Staff

Instructor

Matei Zaharia (Office hours: Monday 3-4 PM, Gates 412)

Teaching Assistants

Benjamin Braun (Office hours: Wednesday 5:30-7 PM, over Google Hangouts: bjmnbraun at gmail dot com)
Deepak Narayanan (Office hours: Tuesday 4-5 PM, Huang Basement)
Edward Gan (Office hours: Friday 3-4 PM, Huang Basement)
James Thomas (Office hours: Tuesday 3-4 PM, Huang Basement)
Leo Mehr (Office hours: Wednesday 3-4 PM, Huang Basement)
Pratiksha Thaker (Office hours: Monday 9:30-10:30 AM, Huang Basement)

Schedule

M, Apr 1	Introduction
W, Apr 3	Database System Architecture Reading: A History and Evaluation of System R You can skip "The Recovery Subsystem", "The Locking Subsystem", "The Convoy Phenomenon" and "Additional Observations" on pages 643-644. Question: Why did the storage system change from "inversions" to B-trees during the project? Optional Reading: How to Read a Paper
M, Apr 8	Database Architecture 2 and Storage Assignment 1 Posted
W, Apr 10	Storage Formats and Indexing
M, Apr 15	Storage Formats and Indexing 2 Reading: Integrating Compression and Execution in Column-Oriented Database Systems Question: How might the conclusions of this paper change if it ran on an NVMe SSD instead? Optional Reading: C-Store: A Column-Oriented DBMS
W, Apr 17	Query Execution
M, Apr 22	Query Optimization
W, Apr 24	Guest Talk: Program Optimization in TensorFlow Rasmus Larsen and Tatiana Shpeisman, Google Optional Background Material: Neural Network Basics (Video), TensorFlow High-Level API: Keras, TensorFlow Graph API Assignment 1 Due (at noon)
M, Apr 29	Query Optimization 2 Reading: Spark SQL: Relational Data Processing in Spark Question: What proof points does the paper give that Catalyst achieves its goal as an extensible optimizer? Can you think of any limitations to Catalyst's approach for supporting external extensions? Assignment 2 Posted
W, May 1	Transactions and Failure Recovery
M, May 6	Midterm (in class) This Year's Midterm: exam, solutions Past Midterms: Winter 2009 (solutions), Summer 2009 (solutions), Winter 2017 (solutions)
W, May 8	Failure Recovery & Concurrency
M, May 13	Concurrency Assignment 2 Due (at noon) Assignment 3 Posted
W, May 15	Guest Talk: Delta Lake: Making Cloud Data Lakes Transactional and Scalable Reynold Xin, Databricks Many organizations store their largest datasets in data lakes, i.e. collections of files in a large-scale, low-cost storage system such as Amazon S3 or HDFS. Unfortunately, making data lakes reliable and efficient is challenging: readers can see incomplete data from in-progress writes, there are no indexes to accelerate access, and rolling back incorrect updates made by a data pipeline is complex. To address these problems, Databricks recently open sourced Delta Lake, a transactional storage management system that runs over commodity cloud and HDFS storage to provide reliability and scalability. Delta Lake manages a transaction log that allows for efficient isolation, point-in-time snapshots and rollback, and accesses metadata using Apache Spark in order to scale to billions of files and petabytes of data. Delta Lake was first released to Databricks customers in 2017 and is now in production use at over 1000 organizations, where it processes multiple exabytes of data per month. We describe some of the largest-scale use cases and the motivation for Delta Lake's design. Bio: Reynold Xin is a cofounder and Chief Architect at Databricks. In the open source community, Reynold is known as a top contributor to the Apache Spark project, having designed many of its core user-facing APIs and execution engine features. Reynold received a PhD in Computer Science from UC Berkeley, where he worked on large-scale data processing systems including Apache Spark, Spark SQL, GraphX and CrowdDB. Reynold also led the team that set the 2014 GraySort record in the Sort Benchmark.
M, May 20	Concurrency 2
W, May 22	Distributed Databases Reading: BASE: An ACID Alternative (pdf version) Question: What are examples of database constraints that would be difficult to provide using the message queue solution discussed in this article?
M, May 27	No Class (Memorial Day)
W, May 29	Distributed Databases 2
Th, May 30	Assignment 3 Due at 11:59 PM (Midnight)
M, Jun 3	Security and Data Privacy Reading: Privacy Integrated Queries Question: Give an example of a computation on data for which it's hard to provide any differential privacy. Optional Readings: Splinter: Practical Private Queries on Public Data, Opaque: An Oblivious and Encrypted Distributed Analytics Platform
W, Jun 5	Review
M, Jun 10	Final Exam Time: 3:30-6:30 PM, Dinkelspiel Auditorium Room 100 This Year's Final: exam, solutions Past Finals: Winter 2009 (solutions), Winter 2017 (solutions)

Logistics

Announcements

All announcements will be made on our Piazza page for the class. Make sure you sign up for Piazza!

Prerequisites

Students should ideally have taken CS 145 and CS 161, or equivalent courses. In particular, we expect students to be familiar with SQL syntax. You can take a basic SQL tutorial for an overview of SQL if needed.

Assignments and Exams

We will have three programming assignments, a midterm and a final. The programming assignments are designed to be runnable on your personal machine and should be submitted through GradeScope.

Exams will be open-notes and "open-laptop" (you can bring any material you want on paper or on your laptop), except that network access will not be allowed during exams. Exams will cover material in the lectures, readings and assignments.

Readings

We have occasional readings for the lectures. We expect students to complete these and think about the respective questions on their own (you do not need to turn in answers). Reading material can appear on the exams.

Grading

Assignments: 15% each (total: 45%)
Midterm: 25%
Final: 30%

Late Policy

Students each have up to 3 late days that they may use during the quarter. Assignments submitted late after these late days have been used up will incur a penalty of 10% per additional day late.

Feedback

Please post public questions about the class on Piazza. For private questions to the staff, email cs245-spr1819-staff@lists.stanford.edu.