Stanford / Spring 2026
The rapid adoption of large language models (LLMs) and agentic AI techniques has made efficient inference a critical challenge. This research seminar explores the infrastructure and systems behind AI inference at scale, covering the techniques, architectures, and engineering decisions required to serve LLMs efficiently and cost-effectively in production.
| # | Date | Topic | Readings | Deadlines |
|---|------|-------|----------|-----------|
We will use most lectures for an in-class discussion of the required papers. Students are expected to read the papers before each class and submit a short summary of each paper on Gradescope. We will assign two students to each lecture to introduce the papers and lead the discussion; two more students will keep notes.
A few lectures will feature guest lectures from industry experts.
Tentative grading breakdown: coursework 50%, leading a lecture 25%, in-class participation 25%.
This course aims to provide an extensive overview of LLM serving systems and an opportunity to dive into the details by building one. By the end of the class, students will have their own minimal serving engine capable of achieving reasonably good performance on typical workloads.
The mini engine project is broken into multiple milestones. We will provide a starter notebook, benchmark scripts, and a shared environment. Students will incrementally add new features on top of each prior milestone; all features must remain compatible, so expect later milestones to be increasingly challenging.
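To give a flavor of what an early milestone might look like, here is a minimal sketch of a batched decode loop, the core pattern at the heart of most serving engines: pending requests are processed together, one token per step, and requests retire as they finish. Everything here (the `Request` class, the stubbed decode step) is illustrative, not a specification of the actual starter code.

```python
# Hypothetical sketch of a minimal batched serving loop.
# The "model" is a stub that appends one dummy token per request per step;
# a real milestone would replace fake_decode_step with a batched forward pass.
from dataclasses import dataclass, field

@dataclass
class Request:
    prompt: list[int]                              # input token ids
    max_new_tokens: int                            # stop condition
    output: list[int] = field(default_factory=list)

def fake_decode_step(batch: list[Request]) -> None:
    """Stand-in for one batched forward pass: emit one token per request."""
    for req in batch:
        req.output.append(len(req.output))         # dummy "next token"

def serve(requests: list[Request]) -> list[Request]:
    running = list(requests)
    finished: list[Request] = []
    while running:
        fake_decode_step(running)                  # one decode step over the whole batch
        still_running = []
        for req in running:
            if len(req.output) >= req.max_new_tokens:
                finished.append(req)               # retire completed requests
            else:
                still_running.append(req)
        running = still_running
    return finished

done = serve([Request(prompt=[1, 2], max_new_tokens=3),
              Request(prompt=[3], max_new_tokens=1)])
```

Note that requests with fewer remaining tokens finish and leave the batch early; admitting new requests into `running` mid-loop is the natural next step, and is essentially what continuous batching systems do.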