CS25: Transformers United V6

CS25 has become one of Stanford's hottest seminar courses, featuring top researchers at the forefront of Transformers research such as Geoffrey Hinton, Ashish Vaswani, and Andrej Karpathy. The class has been enthusiastically received both within and outside Stanford, with millions of total views on YouTube. Each week, we dive into the latest breakthroughs in AI, from large language models like GPT to applications in art, biology, and robotics. Now in our sixth iteration of the course, we are excited to bring you fresh perspectives on where Transformer research is heading next.

The only homework for students is weekly attendance at the talks/lectures. Anybody is free to audit in person or join our Zoom livestreams - you don't have to sign up or be affiliated with Stanford! (Please do not contact us about this.) We also have a lively Discord community (over 5000 members) - feel free to join and chat with hundreds of others about Transformers!

Instructors

<a href='https://styfeng.github.io'>Steven Feng</a>

<a href='https://karanps.com'>Karan Singh</a>

<a href='https://web.stanford.edu/~mcfrank/'>Michael C. Frank</a>

<a href='https://nlp.stanford.edu/~manning/'>Christopher Manning</a>

Time and Location

Spring Quarter (March 30 - June 3)
Thursdays 4:30 - 5:50 pm PDT
Skilling Auditorium | Zoom Link | Slido

Date	Title	Description
April 2nd	Overview of Transformers [In-Person] Speakers: Instructors	Brief intro and overview of the history of ML/NLP, Transformers and how they work, and their impact. Discussion about recent trends, breakthroughs, applications, and current challenges. Link to slides. Papers discussed: Feng et al., Baby Scale: Investigating Models Trained on Individual Children's Language Input, arXiv:2603.29522; Zeng et al., Bringing Up a Bilingual BabyLM: Investigating Multilingual Language Acquisition Using Small-Scale Models, arXiv:2603.29552; Singh et al., To Memorize or to Retrieve: Scaling Laws for RAG-Considerate Pretraining, arXiv:2604.00715; Singh et al., Curriculum-Guided Layer Scaling for Language Model Pretraining, arXiv:2506.11389; Singh et al., Interpretable Cross-Network Attention for Resting-State fMRI Representation Learning, arXiv:2603.00786; Liu et al., A Unified Definition of Hallucination: It's The World Model, Stupid!, arXiv:2512.21577.
April 9th	From Representation Learning to World Modeling through Joint Embedding Predictive Architectures [In-Person] Speakers: Hazel Nam & Lucas Maes (Brown University)	World models are increasingly moving away from reconstruction and toward prediction in latent space. In this talk, we will present two recent JEPA-based approaches that illustrate this shift from complementary angles. Link to slides. Causal-JEPA induces object-level relational bias to promote representations that capture entities and interactions, leading to stronger reasoning and more efficient planning. LeWorldModel shows that such predictive world models can also be trained stably end-to-end from raw pixels using a minimal objective and a clean architectural recipe, while remaining competitive on control tasks. Taken together, these works argue for a unified view of world modeling: predictive latent learning becomes most powerful when combined with both structural bias and architectural simplicity. This perspective suggests a promising path toward robust world models that support abstraction, reasoning, and control. Heejeong Nam is a Master's student at Brown University, working on representation learning, causality, and self-supervised learning. Lucas Maes is a PhD student at Mila and the University of Montreal, working on JEPA and planning.
April 16th	On the Tradeoffs of State Space Models and Transformers [In-Person] Speaker: Albert Gu (CMU, Cartesia AI)	This talk will provide a high-level overview of a recently popular subquadratic alternative to the Transformer, the state space model (SSM). We will discuss the core characteristics and design choices of SSMs and other related... modern linear models. We will also focus on the fundamental tradeoffs between SSMs and Transformers, both from a modeling perspective and in terms of their strengths and weaknesses in different application areas. A central theme is that different architectures have very different performance characteristics depending on the resolution of the data and its tokenization scheme; we will also talk about recent progress on tokenizer-free models such as H-Nets. Albert Gu is an Assistant Professor in the Machine Learning Department at Carnegie Mellon University and Chief Scientist of Cartesia AI. His research broadly focuses on theoretical and empirical foundations of deep learning; he is particularly known for new approaches to deep sequence modeling and neural network architectures, and was recognized on the TIME AI100 list of most influential researchers in 2024. Previously, he completed his PhD at Stanford.
April 23rd	The Ultra-Scale Talk: Scaling Training to Thousands of GPUs [In-Person] Speaker: Nouamane Tazi (Hugging Face)	Training large language models demands more than raw compute; it requires a deep understanding of parallelism and architecture choices. This talk dives into the practicalities of ultra-scale training. Link to slides. Topics include how 5D parallelism makes it possible to stretch a single run across massive GPU clusters, how Mixture-of-Experts architectures introduce new scaling dimensions and stability challenges, and the performance tuning and communication patterns that drive throughput. Drawing from the Ultra-Scale Playbook and real-world scaling efforts, we'll cover benchmarks, hard-earned lessons, and hands-on recommendations for engineers and researchers ready to train state-of-the-art models efficiently. Nouamane Tazi is a Machine Learning Engineer at Hugging Face, focused on large-scale distributed LLM training. He is the lead author of the Ultra-Scale Playbook and a core developer of Nanotron, Hugging Face's open-source distributed training library. His work spans projects like StarCoder2, SmolLM3, and Mixture-of-Experts scaling with several initiatives. He is passionate about making large-scale training practical and accessible.
April 30th	From Next-Token Prediction to Next-Generation Intelligence: The Future of Pretraining [In-Person] Speaker: Shrimai Prabhumoye (Mistral AI, prev. NVIDIA)	This talk presents recent progress in pretraining algorithm design for large language models (LLMs), emphasizing the role of data ordering, reasoning-centric data integration, and reinforcement-based objectives in shaping model capability. Link to slides. We introduce a two-phase pretraining framework that formalizes strategies for data selection, blending, and sequencing. We also demonstrate that front-loading reasoning-rich data during pretraining yields persistent gains in reasoning accuracy that post-training alone cannot reproduce. Further, we propose Reinforcement Learning during Pretraining (RLP), a reinforcement-based objective that treats chain-of-thought generation as exploratory behavior, rewarding trajectories that maximize token-level information gain. Empirical results across diverse model scales show that structured data ordering combined with RLP improves downstream accuracy and reasoning generalization, establishing a principled approach for integrating reasoning as a first-class objective in LLM pretraining. Shrimai Prabhumoye is an AI Scientist at Mistral AI and an Adjunct Professor at Boston University. Her research focuses on advancing large language models, particularly improving their reasoning capabilities. Prior to this, she was a lead contributor to NVIDIA’s Nemotron family of models, working on data curation, pretraining, and scaling. Her work emphasizes optimizing pretraining pipelines, including data selection, blending, and ordering strategies to maximize downstream model performance.
May 7th	Distinct Modes of Generalization from Parameters and Context, and Paths to Bridge the Gap [In-Person] Speaker: Andrew Lampinen (Anthropic)	Language models can be taught information through multiple routes: either updating their parameters by training on the information, or presenting it to the model in context. In this talk I'll describe striking differences in the types of generalization that models make when they learn information via these two routes. Link to slides. In particular, I'll highlight ways in which models can generalize more flexibly from information in their context than from information in their parameters. I'll then describe three different strategies that can help bridge this gap, based on data augmentation, retrieval, and RL. I'll link these findings to broader issues in the nature of intelligent systems, including findings in cognitive neuroscience about the role of complementary learning systems in the brain. Andrew Lampinen is a Member of Technical Staff at Anthropic. Previously, he was a Staff Research Scientist at Google DeepMind, and received his PhD in Cognitive Psychology at Stanford University. His research interests bridge artificial intelligence and cognitive science, on topics including learning, generalization, and representation in language models, agents, and humans, as well as broader issues like methodologies for generalizable research. Some of his musings on these topics can be found on his Substack: https://infinitefaculty.substack.com/.
May 14th	Advancing Science and Medicine with Collaborative AI Agents [In-Person] Speaker: Vivek Natarajan (DeepMind)	This talk introduces general-purpose AI systems from Google DeepMind designed to accelerate scientific discovery and democratize medical expertise. First, the AI co-scientist, a multi-agent Gemini-based system, assists researchers by systematically generating and refining novel hypotheses for complex scientific challenges. This approach has yielded promising, lab-validated results, including identifying drugs for repurposing against acute myeloid leukemia, discovering new therapeutic targets for liver fibrosis (Advanced Science), and recapitulating a novel gene transfer mechanism for bacterial resistance (Cell). While early, the co-scientist represents a promising step toward a true collaborative AI partner for scientists. Second, the AI co-physician, AMIE, aims to give doctors superpowers and make medical expertise universally accessible. In simulated settings, AMIE outperformed primary care physicians across multiple clinical evaluation axes (Nature) and is showing promise as an assistive tool in ongoing real-world validations (Nature). Together, these initiatives demonstrate AI's potential to transform scientific research and care delivery. Vivek Natarajan is a Research Scientist at Google DeepMind leading research at the intersection of AI, science and medicine. He is the lead researcher behind Med-PaLM (Nature, 2023) and Med-PaLM 2 (Nature Medicine, 2025), the first AI systems to obtain passing and expert-level scores on US Medical License exam questions, respectively. Vivek also co-leads Project AMIE, a research program aiming to build and democratize medical superintelligence. Over the past year, AMIE has shown promising potential in controlled settings, including primary care, specialty care, and complex diagnostic challenges, as both a standalone tool (Nature, 2025) and an assistive tool for clinicians (Nature, 2025). Finally, Vivek recently co-led the development of the AI co-scientist - a virtual AI collaborator designed to augment scientists, help uncover new original knowledge and accelerate the clock speed of scientific discoveries. Prior to Google, Vivek worked on multimodal assistant systems at Facebook AI Research. He is also part of the faculty for executive education at Harvard T.H. Chan School of Public Health in a part-time capacity.
May 21st	From Language Models to Native Multimodal Intelligence [In-Person] Speaker: Victoria Lin (Thinking Machines)	Large language models have demonstrated that next-token prediction at scale can induce remarkably general capabilities, including knowledge acquisition, instruction following, reasoning and planning. Yet the physical world is fundamentally multimodal: perception, interaction, and understanding emerge not only from language. Link to slides. They also emerge from the integration of visual, auditory and temporal information. As AI systems increasingly move beyond language toward real-world interaction, native multimodal intelligence becomes a central challenge for the field. In this talk, I will discuss the evolution from language models to native multimodal systems, focusing on the architectural and training principles that transfer from the LLM paradigm, as well as the new challenges introduced by multimodal learning. I will review the building blocks of modern multimodal LLMs, including modality representations, autoregressive modeling, and reasoning capabilities inherited from strong language models. I will then discuss emerging directions in multimodal architecture design, including sparsity and modality specialization, arguing that efficient and scalable multimodal intelligence may require both shared abstract reasoning and modality-aware computation. Finally, I will briefly discuss real-time multimodal intelligence and several open problems facing the field. The goal of this talk is to present a perspective on how the core ideas behind the language model revolution may evolve into the next generation of multimodal intelligence systems. Victoria Lin is a Member of Technical Staff at Thinking Machines Lab, where she focuses on native multimodal intelligence. She is passionate about building AI systems that empower humans to tackle complex, knowledge-intensive problems effectively. Previously, she was a research scientist at Meta AI and Salesforce AI Research. She received a PhD in computer science from the University of Washington.
May 28th	Serving Transformers: Lessons from the Trenches of Production Inference [In-Person] Speaker: Charles Frye (Modal)	Training is fun, but it's not the end of the story -- it's just the first step in building an intelligent application. In applications, the AI engineer must be concerned with inference -- forward passes without backwards passes, consumed by client humans or machines. In this talk, Charles will share insights, lessons, and gnarly scars from serving transformer model inferences at the scale of thousands of GPUs. Charles Frye builds and teaches people to build AI applications. After publishing research in psychopharmacology and neurobiology, he got his Ph.D. at the University of California, Berkeley, for dissertation work on neural network optimization. He has taught thousands the entire stack of AI application development -- from linear algebra fundamentals and GPU arcana to building defensible businesses -- through work at Weights and Biases, Full Stack Deep Learning, and Modal.