Principles of Data-Intensive Systems

Spring 2019
Mon/Wed 1:30-2:50 PM, Skillaud

This course covers the architecture of modern data storage and processing systems, including relational databases, cluster computing systems, streaming systems, and machine learning systems. Topics include database system architecture, storage, query optimization, transaction management, fault recovery, and parallel processing, with a focus on the key design ideas shared across many types of data-intensive systems.

Course Staff

Instructor
Matei Zaharia (Office hours: Monday 3-4 PM, Gates 412)
Teaching Assistants
  • Benjamin Braun (Office hours: Wednesday 5:30-7 PM, over Google Hangouts: bjmnbraun at gmail dot com)
  • Deepak Narayanan (Office hours: Tuesday 4-5 PM, Huang Basement)
  • Edward Gan (Office hours: Friday 3-4 PM, Huang Basement)
  • James Thomas (Office hours: Tuesday 3-4 PM, Huang Basement)
  • Leo Mehr (Office hours: Wednesday 3-4 PM, Huang Basement)
  • Pratiksha Thaker (Office hours: Monday 9:30-10:30 AM, Huang Basement)

Schedule

M, Apr 1
Introduction
W, Apr 3
Database System Architecture
Reading: A History and Evaluation of System R
You can skip "The Recovery Subsystem", "The Locking Subsystem", "The Convoy Phenomenon" and "Additional Observations" on pages 643-644.
Question: Why did the storage system change from "inversions" to B-trees during the project?
Optional Reading: How to Read a Paper
M, Apr 8
Database Architecture 2 and Storage
W, Apr 10
Storage Formats and Indexing
M, Apr 15
Storage Formats and Indexing 2
Reading: Integrating Compression and Execution in Column-Oriented Database Systems
Question: How might the conclusions of this paper change if it ran on an NVMe SSD instead?
W, Apr 17
Query Execution
M, Apr 22
Query Optimization
W, Apr 24
Guest Talk: Program Optimization in TensorFlow
M, Apr 29
Query Optimization 2
Reading: Spark SQL: Relational Data Processing in Spark
Question: What proof points does the paper give that Catalyst achieves its goal as an extensible optimizer? Can you think of any limitations to Catalyst's approach for supporting external extensions?
W, May 1
Transactions and Failure Recovery
M, May 6
Midterm (in class)
W, May 8
Failure Recovery
M, May 13
Concurrency
W, May 15
Concurrency 2
M, May 20
Distributed Storage and Consistency
W, May 22
Parallel Processing
W, May 29
Stream Processing
M, Jun 3
Security and Data Privacy
W, Jun 5
Review
M, Jun 10
Final Exam
Time: 3:30-6:30 PM
Location: TBD

Logistics

Announcements

All announcements will be made on our Piazza page for the class. Make sure you sign up for Piazza!

Prerequisites

Students should ideally have taken CS 145 and CS 161, or equivalent courses. In particular, we expect students to be familiar with SQL syntax. If needed, you can take a basic SQL tutorial for an overview.

Assignments and Exams

We will have three programming assignments, a midterm and a final. The programming assignments are designed to be runnable on your personal machine and should be submitted through GradeScope.

Exams will be open-notes and "open-laptop" (you can bring any material you want on paper or on your laptop), except that network access will not be allowed during exams. Exams will cover material in the lectures, readings and assignments.

Readings

We have occasional readings for the lectures. We expect students to complete these and think about the respective questions on their own (you do not need to turn in answers). Reading material can appear on the exams.

Grading

Late Policy

Each student has up to 3 late days to use during the quarter. Once these late days have been used up, assignments submitted late will incur a penalty of 10% per additional day late.

Feedback

Please post public questions about the class on Piazza. For private questions to the staff, email cs245-spr1819-staff@lists.stanford.edu.