CS25: Transformers United V6

CS25 has become one of Stanford's hottest seminar courses, featuring top researchers at the forefront of Transformers research such as Geoffrey Hinton, Ashish Vaswani, and Andrej Karpathy. The class has been incredibly popular both within and outside Stanford, drawing millions of total views on YouTube. Each week, we dive into the latest breakthroughs in AI, from large language models like GPT to applications in art, biology, and robotics. Now in our sixth iteration of the course, we are excited to bring you fresh perspectives on where Transformer research is heading next.

The only homework for students is attendance at the weekly talks/lectures. Anybody is free to audit in person or join our Zoom livestreams - you don't have to sign up or be affiliated with Stanford! (Please do not contact us about this.) We also have a lively Discord community (over 5,000 members) - feel free to join and chat with hundreds of others about Transformers!

Instructors

Time and Location

Spring Quarter (March 30 - June 3)
Thursdays 4:30 - 5:50 pm PDT
Skilling Auditorium   |   Zoom Link   |   Slido

April 2nd: Overview of Transformers [In-Person]

Speakers: Instructors
Brief intro and overview of the history of ML/NLP, Transformers and how they work, and their impact. Discussion about recent trends, breakthroughs, applications, and current challenges. Link to slides. Papers discussed:
Feng et al., Baby Scale: Investigating Models Trained on Individual Children's Language Input, arXiv:2603.29522

Zeng et al., Bringing Up a Bilingual BabyLM: Investigating Multilingual Language Acquisition Using Small-Scale Models, arXiv:2603.29552

Singh et al., To Memorize or to Retrieve: Scaling Laws for RAG-Considerate Pretraining, arXiv:2604.00715

Singh et al., Curriculum-Guided Layer Scaling for Language Model Pretraining, arXiv:2506.11389

Singh et al., Interpretable Cross-Network Attention for Resting-State fMRI Representation Learning, arXiv:2603.00786

Liu et al., A Unified Definition of Hallucination: It's The World Model, Stupid!, arXiv:2512.21577

April 9th: From Representation Learning to World Modeling through Joint Embedding Predictive Architectures [In-Person]

Speakers: Heejeong Nam (Brown University) & Lucas Maes (Mila, University of Montreal)
World models are increasingly moving away from reconstruction and toward prediction in latent space. In this talk, we will present two recent JEPA-based approaches that illustrate this shift from complementary angles. Link to slides.
Causal-JEPA induces an object-level relational bias to promote representations that capture entities and interactions, leading to stronger reasoning and more efficient planning. LeWorldModel shows that such predictive world models can also be trained stably end-to-end from raw pixels using a minimal objective and a clean architectural recipe, while remaining competitive on control tasks. Taken together, these works argue for a unified view of world modeling: predictive latent learning becomes most powerful when combined with both structural bias and architectural simplicity. This perspective suggests a promising path toward robust world models that support abstraction, reasoning, and control.
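The core idea behind JEPA-style training can be sketched in a few lines: encode a context view and a target view, predict the target's latent from the context's latent, and score the prediction error in latent space rather than reconstructing pixels. The sketch below is our toy illustration (all names and shapes are ours, not from the talk or the papers), assuming random linear encoders and a linear predictor.

```python
import numpy as np

# Toy JEPA-style objective (illustrative only): predict the target view's
# latent from the context view's latent; the loss lives in latent space,
# so no pixel reconstruction is involved.
rng = np.random.default_rng(0)

def encoder(x, W):
    return np.tanh(x @ W)  # toy nonlinear encoder

d_in, d_lat = 16, 8
W_ctx = rng.normal(size=(d_in, d_lat)) * 0.1   # context encoder weights
W_tgt = rng.normal(size=(d_in, d_lat)) * 0.1   # target encoder (often an EMA copy in practice)
W_pred = rng.normal(size=(d_lat, d_lat)) * 0.1 # predictor: context latent -> target latent

x_context = rng.normal(size=(4, d_in))  # e.g. a masked/partial view
x_target = rng.normal(size=(4, d_in))   # the view to be predicted

z_ctx = encoder(x_context, W_ctx)
z_tgt = encoder(x_target, W_tgt)        # target branch gets no gradient in practice
z_hat = z_ctx @ W_pred

loss = np.mean((z_hat - z_tgt) ** 2)    # latent-space prediction error
print(loss)
```

In real JEPA systems the target encoder is typically a stop-gradient or exponential-moving-average copy of the context encoder, which prevents representational collapse; the sketch omits training entirely and just shows where the loss is computed.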

Heejeong Nam is a Master's student at Brown University, working on representation learning, causality, and self-supervised learning. Lucas Maes is a PhD student at Mila and the University of Montreal, working on JEPA and planning.
April 16th: On the Tradeoffs of State Space Models and Transformers [In-Person]

Speaker: Albert Gu (CMU, Cartesia AI)
This talk will provide a high-level overview of a recently popular subquadratic alternative to the Transformer, the state space model (SSM). We will discuss the core characteristics and design choices of SSMs and other related modern linear models. We will also focus on the fundamental tradeoffs between SSMs and Transformers, both from a modeling perspective and in terms of their strengths and weaknesses across application areas. A central theme is that different architectures have very different performance characteristics depending on the resolution of the data and its tokenization scheme; we will also talk about recent progress on tokenizer-free models such as H-Nets.
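The "subquadratic" claim comes from the SSM's recurrent form: each step updates a fixed-size state, so per-token cost is independent of sequence length, unlike attention's cost that grows with context. A minimal sketch of the discrete linear recurrence (illustrative only; real SSMs like Mamba use learned, input-dependent, and parallelizable variants):

```python
import numpy as np

# Minimal discrete linear state space model:
#   h_t = A h_{t-1} + B x_t,    y_t = C h_t
# Per-step cost depends only on the state size, not on sequence length.
def ssm_scan(A, B, C, x):
    """Run the SSM recurrence over a 1-D input sequence x."""
    h = np.zeros(A.shape[0])
    ys = []
    for x_t in x:
        h = A @ h + B * x_t   # state update (B, C are d_state-vectors here)
        ys.append(C @ h)      # scalar readout per step
    return np.array(ys)

rng = np.random.default_rng(0)
d_state, T = 4, 8
A = 0.9 * np.eye(d_state)     # stable dynamics so the state doesn't blow up
B = rng.normal(size=d_state)
C = rng.normal(size=d_state)
x = rng.normal(size=T)

y = ssm_scan(A, B, C, x)
print(y.shape)  # (8,)
```

Because `A` is linear and fixed per step, the same computation can also be expressed as a convolution over the input, which is what makes efficient parallel training of SSMs possible.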

Albert Gu is an Assistant Professor in the Machine Learning Department at Carnegie Mellon University and Chief Scientist of Cartesia AI. His research broadly focuses on theoretical and empirical foundations of deep learning; he is particularly known for new approaches to deep sequence modeling and neural network architectures, and was recognized on the TIME AI100 list of most influential researchers in 2024. Previously, he completed his PhD at Stanford.
April 23rd: The Ultra-Scale Talk: Scaling Training to Thousands of GPUs [In-Person]

Speaker: Nouamane Tazi (Hugging Face)
Training large language models demands more than raw compute; it requires a deep understanding of parallelism and architecture choices. This talk dives into the practicalities of ultra-scale training: how 5D parallelism makes it possible to stretch a single run across massive GPU clusters, how Mixture-of-Experts architectures introduce new scaling dimensions and stability challenges, and the performance tuning and communication patterns that drive throughput. Drawing from the Ultra-Scale Playbook and real-world scaling efforts, we'll cover benchmarks, hard-earned lessons, and hands-on recommendations for engineers and researchers ready to train state-of-the-art models efficiently.
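One of the axes in "5D" parallelism is tensor (intra-layer) parallelism: a single layer's weight matrix is sharded across devices, each device computes its shard, and the shards are gathered back. A toy single-process sketch of column-wise sharding (our illustration, assuming two simulated devices; real systems do this with collective communication over GPUs):

```python
import numpy as np

# Toy tensor parallelism: split a linear layer's weight column-wise
# across two "devices", compute each shard independently, then
# concatenate (an all-gather in a real cluster). The result matches
# the single-device matmul exactly.
rng = np.random.default_rng(0)
x = rng.normal(size=(2, 8))      # (batch, d_in) activations, replicated
W = rng.normal(size=(8, 6))      # full weight of the linear layer

W0, W1 = np.split(W, 2, axis=1)  # column shards -> "device 0" / "device 1"
y0 = x @ W0                      # each device computes only its columns
y1 = x @ W1
y_parallel = np.concatenate([y0, y1], axis=1)  # gather along columns

print(np.allclose(y_parallel, x @ W))
```

Column sharding needs no communication during the matmul itself, only a gather at the end; the other parallelism axes (data, pipeline, context, expert) partition the batch, the layers, the sequence, and the MoE experts respectively.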

Nouamane Tazi is a Machine Learning Engineer at Hugging Face, focused on large-scale distributed LLM training. He is the lead author of the Ultra-Scale Playbook and a core developer of Nanotron, Hugging Face's open-source distributed training library. His work spans projects such as StarCoder2, SmolLM3, and several Mixture-of-Experts scaling initiatives. He is passionate about making large-scale training practical and accessible.
April 30th: From Next-Token Prediction to Next-Generation Intelligence: The Future of Pretraining [In-Person]

Speaker: Shrimai Prabhumoye (Mistral AI, prev. NVIDIA)
This talk presents recent progress in pretraining algorithm design for large language models (LLMs), emphasizing the role of data ordering, reasoning-centric data integration, and reinforcement-based objectives in shaping model capability.
We introduce a two-phase pretraining framework that formalizes strategies for data selection, blending, and sequencing. We also demonstrate that front-loading reasoning-rich data during pretraining yields persistent gains in reasoning accuracy that post-training alone cannot reproduce. Further, we propose Reinforcement Learning during Pretraining (RLP) — a reinforcement-based objective that treats chain-of-thought generation as exploratory behavior, rewarding trajectories that maximize token-level information gain. Empirical results across diverse model scales show that structured data ordering combined with RLP improves downstream accuracy and reasoning generalization, establishing a principled approach for integrating reasoning as a first-class objective in LLM pretraining.
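An "information gain" reward of the kind described above can be illustrated with a log-likelihood ratio: a sampled chain of thought earns reward in proportion to how much it raises the model's probability of the observed next token, relative to predicting without it. This is our simplified illustration of the idea, not the paper's exact formulation:

```python
import numpy as np

# Toy information-gain-style reward (our simplification): compare the
# probability the model assigns to the observed token with vs. without
# the sampled chain of thought in context.
def info_gain_reward(p_with_cot, p_without_cot):
    """Log-likelihood ratio of the observed token with vs. without CoT."""
    return np.log(p_with_cot) - np.log(p_without_cot)

# A chain of thought that lifts the token probability from 0.2 to 0.4
# earns a positive reward; an unhelpful one earns zero or negative.
print(info_gain_reward(0.4, 0.2) > 0)
```

Under such a reward, the pretraining objective encourages thoughts that are predictively useful on ordinary corpus tokens, without requiring any task-specific reward model.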

Shrimai Prabhumoye is an AI Scientist at Mistral AI and an Adjunct Professor at Boston University. Her research focuses on advancing large language models, particularly improving their reasoning capabilities. Prior to this, she was a lead contributor to NVIDIA’s Nemotron family of models, working on data curation, pretraining, and scaling. Her work emphasizes optimizing pretraining pipelines, including data selection, blending, and ordering strategies to maximize downstream model performance.
May 7th: Distinct Modes of Generalization from Parameters and Context, and Paths to Bridge the Gap [In-Person]

Speaker: Andrew Lampinen (Anthropic)
Language models can be taught information through multiple routes: either updating their parameters by training on the information, or presenting it to the model in context. In this talk I'll describe striking differences in the types of generalization that models make when they learn information via these two routes; in particular, ways in which models can generalize more flexibly from information in their context than from their parameters. I'll then describe three different strategies that can help bridge this gap, based on data augmentation, retrieval, and RL. I'll link these findings to broader issues in the nature of intelligent systems, including findings in cognitive neuroscience about the role of complementary learning systems in the brain.

Andrew Lampinen is a Member of Technical Staff at Anthropic. Previously, he was a Staff Research Scientist at Google DeepMind, and received his PhD in Cognitive Psychology at Stanford University. His research interests bridge artificial intelligence and cognitive science, on topics including learning, generalization, and representation in language models, agents, and humans — as well as broader issues like methodologies for generalizable research. Some of his musings on these topics can be found on his substack: https://infinitefaculty.substack.com/.
May 14th: Title TBD [In-Person]

Speaker: Vivek Natarajan (DeepMind)
 
May 21st: Title TBD [In-Person]

Speaker: Victoria Lin (Thinking Machines)
 
May 28th: Serving Transformers: Lessons from the Trenches of Production Inference [In-Person]

Speaker: Charles Frye (Modal)
Training is fun, but it's not the end of the story -- it's just the first step in building an intelligent application. In applications, the AI engineer must be concerned with inference -- forward passes without backward passes, consumed by clients, human or machine. In this talk, Charles will share insights, lessons, and gnarly scars from serving transformer model inference at the scale of thousands of GPUs.

Charles Frye builds and teaches people to build AI applications. After publishing research in psychopharmacology and neurobiology, he got his Ph.D. at the University of California, Berkeley, for dissertation work on neural network optimization. He has taught thousands the entire stack of AI application development -- from linear algebra fundamentals and GPU arcana to building defensible businesses -- through work at Weights and Biases, Full Stack Deep Learning, and Modal.