The rapid adoption of large language models (LLMs) and agentic AI techniques has made efficient inference a critical challenge. This research seminar explores the infrastructure and systems behind AI inference at scale, covering the techniques, architectures, and engineering decisions required to serve LLMs efficiently and cost-effectively in production.

Course Staff

Zhiqiang Xie
Teaching Assistant
Swapnil Gandhi
Teaching Assistant

Logistics


Schedule (Tentative)

# Date Topic Readings Deadlines
Readings may be updated closer to each class date.

Lectures

We will use most lectures for an in-class discussion of the required papers. Students are expected to read the papers before each class and submit a short summary of each paper on Gradescope. We will assign two students to each lecture to introduce the papers and moderate the discussion; two more students will take notes.

A few sessions will feature guest lectures from industry experts.

Tentative grading breakdown: coursework 50%, leading a lecture 25%, in-class participation 25%.


Coursework

Mini Serving Engine

This course aims to provide an extensive overview of LLM serving systems and an opportunity to dive into the details by building one. By the end of the class, students will have their own minimal serving engine capable of achieving reasonably good performance on typical workloads.

The mini engine project is broken into multiple milestones. We will provide a starter notebook, benchmark scripts, and a shared environment. Students will incrementally add new features on top of each prior milestone — all features must remain compatible, so expect later milestones to be increasingly challenging.

Milestones (tentative)

  1. Data parallelism and tensor parallelism — enable higher throughput and support for larger models
  2. Continuous batching and PagedAttention — improve resource utilization
  3. Context caching and chunked prefill — reduce latency and improve throughput
  4. Advanced feature: speculative decoding, hierarchical context caching, prefill-decode disaggregation, or a student-proposed feature
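To give a flavor of what milestone 2 involves, below is a minimal sketch of a PagedAttention-style KV-cache block allocator. All names (`BlockAllocator`, `BLOCK_SIZE`, etc.) are illustrative assumptions, not part of the course starter code: each sequence maps logical token positions to fixed-size physical blocks through a block table, so cache memory is allocated on demand instead of reserved for the maximum context length.

```python
# Illustrative sketch of a paged KV-cache block allocator; names and
# sizes are assumptions, not the course's actual starter code.

BLOCK_SIZE = 16  # tokens per KV-cache block (illustrative choice)

class BlockAllocator:
    def __init__(self, num_blocks: int):
        self.free_blocks = list(range(num_blocks))     # free physical block ids
        self.block_tables: dict[int, list[int]] = {}   # seq_id -> physical blocks
        self.seq_lens: dict[int, int] = {}             # seq_id -> tokens stored

    def add_sequence(self, seq_id: int) -> None:
        self.block_tables[seq_id] = []
        self.seq_lens[seq_id] = 0

    def append_token(self, seq_id: int) -> int:
        """Reserve cache space for one new token; return its physical block id."""
        pos = self.seq_lens[seq_id]
        if pos % BLOCK_SIZE == 0:  # current block is full (or this is token 0)
            if not self.free_blocks:
                raise MemoryError("KV cache exhausted; a real engine would preempt")
            self.block_tables[seq_id].append(self.free_blocks.pop())
        self.seq_lens[seq_id] = pos + 1
        return self.block_tables[seq_id][pos // BLOCK_SIZE]

    def free_sequence(self, seq_id: int) -> None:
        """Return a finished sequence's blocks to the free pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id))
        del self.seq_lens[seq_id]

if __name__ == "__main__":
    alloc = BlockAllocator(num_blocks=4)
    alloc.add_sequence(0)
    for _ in range(20):  # 20 tokens need ceil(20/16) = 2 blocks
        alloc.append_token(0)
    assert len(alloc.block_tables[0]) == 2
    alloc.free_sequence(0)
    assert len(alloc.free_blocks) == 4
```

Because blocks are freed the moment a sequence finishes, this kind of allocator is what makes continuous batching effective: a new request can be admitted as soon as any running request completes, rather than waiting for the whole batch.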