The rapid adoption of large language models (LLMs) and agentic AI techniques has made efficient inference a critical challenge. This research seminar explores the infrastructure and systems behind AI inference at scale, covering the techniques, architectures, and engineering decisions required to serve LLMs efficiently and cost-effectively in production.

Course Staff

Zhiqiang Xie
Teaching Assistant
Swapnil Gandhi
Teaching Assistant

Logistics


Schedule (Tentative)

# Date Topic Readings Deadlines
Readings may be updated closer to each class date.

Lectures

We will use most lectures for an in-class discussion of the required papers. Students are expected to read the papers before each class and submit a short summary of each paper on Gradescope. We will assign two students to each lecture to introduce the papers and moderate the discussion; two more students will take notes.

A few sessions will feature guest lectures from industry experts.

Tentative grading breakdown: coursework 50%, leading a lecture 25%, in-class participation 25%.


Coursework

Mini Serving Engine

This course aims to provide an extensive overview of LLM serving systems and an opportunity to dive into the details by building one. By the end of the class, students will have their own minimal serving engine capable of achieving reasonably good performance on typical workloads.

The mini engine project is broken into multiple milestones. We will provide a starter notebook, benchmark scripts, and a shared environment. Students will incrementally add new features on top of each prior milestone — all features must remain compatible, so expect later milestones to be increasingly challenging.

Milestones (tentative)

  1. Data parallelism and tensor parallelism — enable higher throughput and support for larger models
  2. Continuous batching and PagedAttention — improve resource utilization
  3. Context caching and chunked prefill — reduce latency and improve throughput
  4. Advanced feature: speculative decoding, hierarchical context caching, prefill-decode disaggregation, or a student-proposed feature
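To give a flavor of what milestone 2 involves, below is a minimal sketch of a PagedAttention-style KV-cache block allocator. All names (`BlockAllocator`, `BLOCK_SIZE`, etc.) are illustrative assumptions, not part of the course starter code: each sequence maps logical token positions to fixed-size physical blocks through a block table, so cache memory is allocated on demand instead of reserved for the maximum context length.

```python
# Illustrative sketch of a paged KV-cache block allocator; names and
# sizes are assumptions, not the course's actual starter code.

BLOCK_SIZE = 16  # tokens per KV-cache block (illustrative choice)

class BlockAllocator:
    def __init__(self, num_blocks: int):
        self.free_blocks = list(range(num_blocks))     # free physical block ids
        self.block_tables: dict[int, list[int]] = {}   # seq_id -> physical blocks
        self.seq_lens: dict[int, int] = {}             # seq_id -> tokens stored

    def add_sequence(self, seq_id: int) -> None:
        self.block_tables[seq_id] = []
        self.seq_lens[seq_id] = 0

    def append_token(self, seq_id: int) -> int:
        """Reserve cache space for one new token; return its physical block id."""
        pos = self.seq_lens[seq_id]
        if pos % BLOCK_SIZE == 0:  # current block is full (or this is token 0)
            if not self.free_blocks:
                raise MemoryError("KV cache exhausted; a real engine would preempt")
            self.block_tables[seq_id].append(self.free_blocks.pop())
        self.seq_lens[seq_id] = pos + 1
        return self.block_tables[seq_id][pos // BLOCK_SIZE]

    def free_sequence(self, seq_id: int) -> None:
        """Return a finished sequence's blocks to the free pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id))
        del self.seq_lens[seq_id]

if __name__ == "__main__":
    alloc = BlockAllocator(num_blocks=4)
    alloc.add_sequence(0)
    for _ in range(20):  # 20 tokens need ceil(20/16) = 2 blocks
        alloc.append_token(0)
    assert len(alloc.block_tables[0]) == 2
    alloc.free_sequence(0)
    assert len(alloc.free_blocks) == 4
```

Because blocks are freed the moment a sequence finishes, this kind of allocator is what makes continuous batching effective: a new request can be admitted as soon as any running request completes, rather than waiting for the whole batch.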