Principles of Data-Intensive Systems

Spring 2019
Mon/Wed 1:30-2:50 PM, Skillaud

This course covers the architecture of modern data storage and processing systems, including relational databases, cluster computing systems, streaming and machine learning systems. Topics include database system architecture, storage, query optimization, transaction management, fault recovery, and parallel processing, with a focus on the key design ideas shared across many types of data-intensive systems.

Course Staff

Instructor
Matei Zaharia (Office hours: Monday 3-4 PM, Gates 412)
Teaching Assistants
  • Benjamin Braun (Office hours: Wednesday 5:30-7 PM, over Google Hangouts: bjmnbraun at gmail dot com)
  • Deepak Narayanan (Office hours: Tuesday 4-5 PM, Huang Basement)
  • Edward Gan (Office hours: Friday 3-4 PM, Huang Basement)
  • James Thomas (Office hours: Tuesday 3-4 PM, Huang Basement)
  • Leo Mehr (Office hours: Wednesday 3-4 PM, Huang Basement)
  • Pratiksha Thaker (Office hours: Monday 9:30-10:30 AM, Huang Basement)

Schedule

M, Apr 1
Introduction
W, Apr 3
Database System Architecture
Reading: A History and Evaluation of System R
You can skip "The Recovery Subsystem", "The Locking Subsystem", "The Convoy Phenomenon" and "Additional Observations" on pages 643-644.
Question: Why did the storage system change from "inversions" to B-trees during the project?
Optional Reading: How to Read a Paper
M, Apr 8
Database Architecture 2 and Storage
W, Apr 10
Storage Formats and Indexing
M, Apr 15
Storage Formats and Indexing 2
Reading: Integrating Compression and Execution in Column-Oriented Database Systems
Question: How might the conclusions of this paper change if it ran on an NVMe SSD instead?
W, Apr 17
Query Execution
M, Apr 22
Query Optimization
W, Apr 24
Guest Talk: Program Optimization in TensorFlow
M, Apr 29
Query Optimization 2
Reading: Spark SQL: Relational Data Processing in Spark
Question: What proof points does the paper give that Catalyst achieves its goal as an extensible optimizer? Can you think of any limitations to Catalyst's approach for supporting external extensions?
W, May 1
Transactions and Failure Recovery
M, May 6
Midterm (in class)
This Year's Midterm: exam, solutions
W, May 8
Failure Recovery & Concurrency
M, May 13
Concurrency
W, May 15
Guest Talk: Delta Lake: Making Cloud Data Lakes Transactional and Scalable
Reynold Xin, Databricks
Many organizations store their largest datasets in data lakes, i.e. collections of files in a large-scale, low-cost storage system such as Amazon S3 or HDFS. Unfortunately, making data lakes reliable and efficient is challenging: readers can see incomplete data from in-progress writes, there are no indexes to accelerate access, and rolling back incorrect updates made by a data pipeline is complex. To address these problems, Databricks recently open sourced Delta Lake, a transactional storage management system that runs over commodity cloud and HDFS storage to provide reliability and scalability. Delta Lake manages a transaction log that allows for efficient isolation, point-in-time snapshots and rollback, and accesses metadata using Apache Spark in order to scale to billions of files and petabytes of data. Delta Lake was first released to Databricks customers in 2017 and is now in production use at over 1000 organizations, where it processes multiple exabytes of data per month. We describe some of the largest-scale use cases and the motivation for Delta Lake's design.

Bio: Reynold Xin is a cofounder and Chief Architect at Databricks. In the open source community, Reynold is known as a top contributor to the Apache Spark project, having designed many of its core user-facing APIs and execution engine features. Reynold received a PhD in Computer Science from UC Berkeley, where he worked on large-scale data processing systems including Apache Spark, Spark SQL, GraphX and CrowdDB. Reynold also led the team that set the 2014 GraySort record in the Sort Benchmark.
M, May 20
Concurrency 2
W, May 22
Distributed Databases
Reading: BASE: An ACID Alternative (pdf version)
Question: What are examples of database constraints that would be difficult to provide using the message queue solution discussed in this article?
M, May 27
No Class (Memorial Day)
W, May 29
Distributed Databases 2
Th, May 30
M, Jun 3
Security and Data Privacy
Reading: Privacy Integrated Queries
Question: Give an example of a computation on data for which it's hard to provide any differential privacy.
W, Jun 5
Review
M, Jun 10
Final Exam
Time: 3:30-6:30 PM, Dinkelspiel Auditorium Room 100
This Year's Final: exam, solutions
Past Finals: Winter 2009 (solutions), Winter 2017 (solutions)

Logistics

Announcements

All announcements will be made on our Piazza page for the class. Make sure you sign up for Piazza!

Prerequisites

Students should ideally have taken CS 145 and CS 161, or equivalent courses. In particular, we expect students to be familiar with SQL syntax. You can take a basic SQL tutorial for an overview of SQL if needed.

Assignments and Exams

We will have three programming assignments, a midterm and a final. The programming assignments are designed to be runnable on your personal machine and should be submitted through GradeScope.

Exams will be open-notes and "open-laptop" (you can bring any material you want on paper or on your laptop), except that network access will not be allowed during exams. Exams will cover material in the lectures, readings and assignments.

Readings

We have occasional readings for the lectures. We expect students to complete these and think about the respective questions on their own (you do not need to turn in answers). Reading material can appear on the exams.

Grading

Late Policy

Each student has up to 3 late days to use during the quarter. Assignments submitted late after these late days have been used up will incur a penalty of 10% per additional day late.

Feedback

Please post public questions about the class on Piazza. For private questions to the staff, email cs245-spr1819-staff@lists.stanford.edu.