Principles of Data-Intensive Systems

Winter 2021
Tue/Thu 2:30-3:50 PM Pacific

This course covers the architecture of modern data storage and processing systems, including relational databases, cluster computing systems, streaming and machine learning systems. Topics include database system architecture, storage, query optimization, transaction management, fault recovery, and parallel processing, with a focus on the key design ideas shared across many types of data-intensive systems.

Join our Zoom Lectures on Canvas

Course Staff

Instructor
Matei Zaharia (Office hours: by appointment, please email me)
Teaching Assistants

Schedule

T, Jan 12
Introduction
Th, Jan 14
Database System Architecture
Reading: A History and Evaluation of System R
You can skip "The Recovery Subsystem", "The Locking Subsystem", "The Convoy Phenomenon" and "Additional Observations" on pages 643-644.
Question: Why did the storage system change from "inversions" to B-trees during the project?
Optional Reading: How to Read a Paper
T, Jan 19
Database Architecture 2 & Storage
Th, Jan 21
Storage Formats and Indexing
Reading: Integrating Compression and Execution in Column-Oriented Database Systems
Question: How might the conclusions of this paper change if it ran on an NVMe SSD instead?
T, Jan 26
Storage Formats and Indexing 2
Th, Jan 28
Query Execution
T, Feb 2
Query Optimization
Th, Feb 4
Query Optimization 2
Reading: Spark SQL: Relational Data Processing in Spark
Question: What proof points does the paper give that Catalyst achieves its goal as an extensible optimizer? Can you think of any limitations to Catalyst's approach for supporting external extensions?
T, Feb 9
Guest Talk: Automatically Discovering Systems Optimizations for Deep Learning
Zhihao Jia, Carnegie Mellon University and Facebook
Th, Feb 11
Transactions and Failure Recovery
Past Midterms for Reference: Winter 2020 (solutions), Spring 2019 (solutions), Winter 2017 (solutions)
T, Feb 16
Failure Recovery 2
Th, Feb 18
Concurrency
T, Feb 23
Concurrency 2
Reading: Granularity of Locks and Degrees of Consistency in a Shared Data Base
Read up to page 372. You can skim the rest of the paper.
Question: Draw the hierarchy of data structures in a DBMS that contains a table sorted by its primary key field with multiple pages. What locks have to be acquired if a transaction wants to move a record from one page of the table to another (by changing its primary key)?
Th, Feb 25
Concurrency 3 & Distributed Databases
T, Mar 2
Distributed Databases 2
Th, Mar 4
Cloud Database Systems
Optional Readings: Amazon Aurora, Dynamo, Delta Lake
T, Mar 9
Streaming Systems
Th, Mar 11
Security and Data Privacy
Reading: Privacy Integrated Queries
Question: Give an example of a computation on data for which it's hard to provide any differential privacy.
Past Finals for Reference: Winter 2020 (solutions), Spring 2019 (solutions), Winter 2017 (solutions)
T, Mar 16
Guest Talk: Lakehouse Technology as the Future of Data Warehousing
Reynold Xin, Databricks
Enterprise data architectures usually contain many systems—data lakes, message queues, and data warehouses—that data must pass through before it can be analyzed. Each transfer step between systems adds a delay and a potential source of errors. What if we could remove all these steps?
In recent years, cloud storage and new open source systems have enabled a radically new architecture: the lakehouse, an ACID transactional layer over cloud storage that can provide streaming, management features, indexing, and high-performance access similar to a data warehouse. Thousands of organizations including the largest Internet companies are now using lakehouses to replace separate data lake, warehouse and streaming systems and deliver high-quality data faster internally.
I'll discuss the key trends, break down the different modules (storage, query engine, ecosystem) of the lakehouse architecture, and look at the main principles behind their design.
Bio: Reynold Xin is a cofounder and Chief Architect at Databricks. Prior to Databricks, he was a PhD student at the UC Berkeley AMPLab.
Th, Mar 18
No Class (Work on Test 2)

Logistics

Announcements

All announcements will be made on our Piazza page for the class. Make sure you sign up for Piazza!

Prerequisites

Students should ideally have taken CS 145 and CS 161, or equivalent courses. In particular, we expect students to be familiar with SQL syntax. You can take a basic SQL tutorial for an overview of SQL if needed.

Lectures and Video Recordings

Lectures for the class will be given live on Zoom and recorded. You can find our Zoom link and the lecture recordings on Canvas. Please note that these recordings might be reused in other Stanford courses, viewed by other Stanford students, faculty, or staff, or used for other education and research purposes. If you have questions about video recording, please contact a member of the teaching team.

Assignments and Tests

We will have three programming assignments and two take-home tests. The programming assignments are designed to be runnable on your personal machine and should be submitted through Gradescope.

The tests are open-book, meaning that you can use your course notes, slides, books, or online resources, but communication with others is not allowed during them (e.g., you can't ask a question on Stack Overflow or contact another student). Tests will cover material in the lectures, readings and assignments.

Readings

We have occasional readings for the lectures. We expect students to complete these and think about the questions we list for each paper on their own (you do not need to turn in answers). Our tests will cover content in the readings.

Optional Textbook

Database Systems: The Complete Book (2nd Edition), by Garcia-Molina, Ullman and Widom, covers a lot of the technical material in the course and may be helpful as a study guide. We focus on chapters 13-20. We will also cover the material in lectures, but this book is a good source of additional information.

Grading

Late Policy

Students each have up to 2 late days that they may use for assignments and tests. Assignments and tests submitted after these late days have been used up will incur a penalty of 10% per extra day late. In addition, we will not accept submissions after March 20th at midnight Pacific to give the staff enough time for grading.

Auditing

The course Zoom meetings and recordings are open for auditing to any Stanford student on Canvas.

Feedback

Please post public questions about the class on Piazza. For private questions to the staff, please open a private post on Piazza. You can also email Professor Zaharia at matei@cs.stanford.edu.