Principles of Data-Intensive Systems

Winter 2020
Tue/Thu 1:30-2:50 PM, NVIDIA Auditorium

This course covers the architecture of modern data storage and processing systems, including relational databases, cluster computing systems, streaming and machine learning systems. Topics include database system architecture, storage, query optimization, transaction management, fault recovery, and parallel processing, with a focus on the key design ideas shared across many types of data-intensive systems.

Course Staff

Instructor
Matei Zaharia (Office hours: Thursdays 11:00-12:00, Gates 412)
Teaching Assistants

Schedule

T, Jan 7
Introduction
Th, Jan 9
Database System Architecture
Reading: A History and Evaluation of System R
You can skip "The Recovery Subsystem", "The Locking Subsystem", "The Convoy Phenomenon" and "Additional Observations" on pages 643-644.
Question: Why did the storage system change from "inversions" to B-trees during the project?
Optional Reading: How to Read a Paper
T, Jan 14
Database Architecture 2 and Storage
Th, Jan 16
Storage Formats and Indexing
T, Jan 21
Storage Formats and Indexing 2
Reading: Integrating Compression and Execution in Column-Oriented Database Systems
Question: How might the conclusions of this paper change if it ran on an NVMe SSD instead?
Th, Jan 23
Query Execution
T, Jan 28
Query Execution 2 and Query Optimization
Th, Jan 30
Query Optimization 2
Reading: Spark SQL: Relational Data Processing in Spark
Question: What proof points does the paper give that Catalyst achieves its goal as an extensible optimizer? Can you think of any limitations to Catalyst's approach for supporting external extensions?
T, Feb 4
Guest Talk: How PyTorch Optimizes Deep Learning Computations
Th, Feb 6
Transactions and Failure Recovery
T, Feb 11
Midterm (in class)
This Year's Midterm: exam, solutions
Th, Feb 13
Failure Recovery & Concurrency
T, Feb 18
Concurrency
Th, Feb 20
Concurrency 2
T, Feb 25
Streaming Systems
Th, Feb 27
Distributed Databases
T, Mar 3
Distributed Databases 2
Th, Mar 5
Guest Talk: TBD
T, Mar 10
Security and Data Privacy
Th, Mar 12
Review
T, Mar 17
Final Exam

Logistics

Announcements

All announcements will be made on our Piazza page for the class. Make sure you sign up for Piazza!

Prerequisites

Students should ideally have taken CS 145 and CS 161, or their equivalent courses. In particular, we expect students to be familiar with SQL syntax. You can take a basic SQL tutorial for an overview of SQL if needed.

Assignments and Exams

We will have three programming assignments, a midterm and a final. The programming assignments are designed to be runnable on your personal machine and should be submitted through Gradescope.

Exams are open-notes and "open-laptop" (you can bring any material you want on paper or on your laptop), except that network access is not allowed during exams. Exams will cover material in the lectures, readings and assignments.

Readings

We have occasional readings for the lectures. We expect students to complete these and think about the respective questions on their own (you do not need to turn in answers). Reading material can appear on the exams.

Optional Textbook

Database Systems: The Complete Book (2nd Edition), by Garcia-Molina, Ullman and Widom, covers a lot of the technical material in the course and may be helpful as a study guide. We focus on chapters 13-20. We will also cover the material in lectures, but this book is a good source of additional information.

Grading

Late Policy

Students each have up to 3 late days that they may use during the quarter. Assignments submitted after these late days have been used up will incur a penalty of 10% per additional day late.
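
As a rough illustration of the late-day arithmetic, here is a minimal sketch in Python. It rests on assumptions the policy above does not spell out: lateness is counted in whole days, remaining late days are applied before any penalty, and the 10% is taken as a fraction of the assignment's full credit, capped at 100%. The function name and parameters are hypothetical and not part of any course tooling.

```python
from typing import Tuple


def late_penalty(days_late: int,
                 late_days_left: int = 3,
                 rate: float = 0.10) -> Tuple[float, int]:
    """Return (penalty fraction, late days remaining) for one submission.

    Assumptions (not stated in the policy text): whole-day granularity,
    free late days are consumed first, and the penalty is capped at 1.0.
    """
    if days_late < 0 or late_days_left < 0:
        raise ValueError("day counts must be non-negative")

    # Cover as many late days as possible with the free allowance.
    used = min(days_late, late_days_left)
    # Any remaining days are penalized at `rate` per day.
    extra = days_late - used
    penalty = min(1.0, rate * extra)
    return penalty, late_days_left - used


if __name__ == "__main__":
    # Two days late with all three late days available: no penalty.
    print(late_penalty(2))
    # Five days late: three free days, then two penalized days.
    print(late_penalty(5))
```

Under these assumptions, a student with no late days left who submits one day late would lose 10% on that assignment. If you are unsure how the policy applies to your situation, ask the course staff.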

SCPD Lecture Recording Notice

Video cameras located in the back of the room will capture the instructor presentations in this course. For your convenience, you can access these recordings by logging into the course Canvas site. These recordings might be reused in other Stanford courses, viewed by other Stanford students, faculty, or staff, or used for other education and research purposes. Note that while the cameras are positioned with the intention of recording only the instructor, occasionally a part of your image or voice might be incidentally captured. If you have questions, please contact a member of the teaching team.

Feedback

Please post public questions about the class on Piazza. For private questions to the staff, please open a private post on Piazza. You can also email Professor Zaharia at matei@cs.stanford.edu.