CS25: Transformers United V6

CS25 has become one of Stanford's hottest seminar courses, featuring top researchers at the forefront of Transformers research such as Geoffrey Hinton, Ashish Vaswani, and Andrej Karpathy. The class has been incredibly popular both within and outside Stanford, with millions of total views on YouTube. Each week, we dive into the latest breakthroughs in AI, from large language models like GPT to applications in art, biology, and robotics. Now on the sixth iteration of the course, we are excited to bring you fresh perspectives on where Transformer research is heading next.

The only homework for students is weekly attendance at the talks/lectures. Anybody is free to audit in person or join our Zoom livestreams - you don't have to sign up or be affiliated with Stanford! (Please do not contact us about this.) We also have a lively Discord community (over 5,000 members) - feel free to join and chat with hundreds of others about Transformers!

Instructors

Time and Location

Spring Quarter (March 30 - June 3)
Thursdays 4:30 - 5:50 pm PDT
Skilling Auditorium   |   Zoom Link   |   Slido

Date | Title | Description
April 2nd | Overview of Transformers [In-Person]

Speakers: Instructors
Brief intro and overview of the history of ML/NLP, Transformers and how they work, and their impact. Discussion of recent trends, breakthroughs, applications, and current challenges. Link to slides. Papers discussed (a short attention sketch follows the paper list):
Feng et al., Baby Scale: Investigating Models Trained on Individual Children's Language Input, arXiv:2603.29522

Zeng et al., Bringing Up a Bilingual BabyLM: Investigating Multilingual Language Acquisition Using Small-Scale Models, arXiv:2603.29552

Singh et al., To Memorize or to Retrieve: Scaling Laws for RAG-Considerate Pretraining, arXiv:2604.00715

Singh et al., Curriculum-Guided Layer Scaling for Language Model Pretraining, arXiv:2506.11389

Singh et al., Interpretable Cross-Network Attention for Resting-State fMRI Representation Learning, arXiv:2603.00786

Liu et al., A Unified Definition of Hallucination: It's The World Model, Stupid!, arXiv:2512.21577
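
As a quick refresher on the mechanism at the heart of this lecture, below is a minimal, self-contained sketch of single-head scaled dot-product attention. It is our own illustration (the function name, toy dimensions, and random projections are assumptions for readability), not code from the lecture or the papers above.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Single-head attention: softmax(Q K^T / sqrt(d)) V."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                   # pairwise query-key similarities
    scores -= scores.max(axis=-1, keepdims=True)    # subtract row max for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over the key dimension
    return weights @ V                              # each output is a weighted mix of values

# Toy example: 4 tokens with 8-dimensional embeddings and random projections.
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
out = scaled_dot_product_attention(x @ Wq, x @ Wk, x @ Wv)
print(out.shape)  # (4, 8): one contextualized vector per token
```

A full Transformer layer repeats this across multiple heads and interleaves it with feed-forward blocks, residual connections, and normalization.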

April 9th | From Representation Learning to World Modeling through Joint Embedding Predictive Architectures [In-Person]

Speakers: Heejeong Nam (Brown University) & Lucas Maes (Mila / University of Montreal)
World models are increasingly moving away from reconstruction and toward prediction in latent space. In this talk, we will present two recent JEPA-based approaches that illustrate this shift from complementary angles.
Causal-JEPA induces an object-level relational bias to promote representations that capture entities and interactions, leading to stronger reasoning and more efficient planning. LeWorldModel shows that such predictive world models can also be trained stably end-to-end from raw pixels using a minimal objective and a clean architectural recipe, while remaining competitive on control tasks. Taken together, these works argue for a unified view of world modeling: predictive latent learning becomes most powerful when combined with both structural bias and architectural simplicity. This perspective suggests a promising path toward robust world models that support abstraction, reasoning, and control.
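
To make the "prediction in latent space" idea concrete, here is a minimal, hypothetical sketch of a JEPA-style objective: a context view is encoded, a predictor guesses the latent of a target view, and the loss is computed between latents rather than pixels. The module names, dimensions, and stop-gradient choice are our own assumptions; this is not the speakers' Causal-JEPA or LeWorldModel code.

```python
import torch
import torch.nn as nn

def _mlp(d_in, d_out, hidden=128):
    return nn.Sequential(nn.Linear(d_in, hidden), nn.ReLU(), nn.Linear(hidden, d_out))

class TinyJEPA(nn.Module):
    """Minimal JEPA-style sketch: predict the latent of a target view
    from the latent of a context view (no pixel reconstruction)."""
    def __init__(self, obs_dim=64, latent_dim=32):
        super().__init__()
        self.context_encoder = _mlp(obs_dim, latent_dim)
        # In practice the target encoder is usually an EMA copy of the context encoder.
        self.target_encoder = _mlp(obs_dim, latent_dim)
        self.predictor = _mlp(latent_dim, latent_dim)

    def loss(self, context_obs, target_obs):
        z_context = self.context_encoder(context_obs)
        with torch.no_grad():                       # stop-gradient on the target branch
            z_target = self.target_encoder(target_obs)
        z_pred = self.predictor(z_context)          # predict the target *in latent space*
        return ((z_pred - z_target) ** 2).mean()    # distance between latents, not pixels

# Toy usage: batches of flattened context/target observations.
model = TinyJEPA()
ctx, tgt = torch.randn(16, 64), torch.randn(16, 64)
print(model.loss(ctx, tgt).item())
```

The essential point is that nothing here asks the model to reconstruct its input, so capacity goes toward predictable, abstract structure rather than pixel-level detail.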

Heejeong Nam is a Master's student at Brown University, working on representation learning, causality, and self-supervised learning. Lucas Maes is a PhD student at Mila and the University of Montreal, working on JEPA and planning.
April 16th | SSMs [In-Person]

Speaker: Albert Gu (CMU)
 
April 23rd | The Ultra-Scale Talk: Scaling Training to Thousands of GPUs [In-Person]

Speaker: Nouamane Tazi (Hugging Face)
Training large language models demands more than raw compute; it requires a deep understanding of parallelism and architecture choices. This talk dives into the practicalities of ultra-scale training: how 5D parallelism makes it possible to stretch a single run across massive GPU clusters, how Mixture-of-Experts architectures introduce new scaling dimensions and stability challenges, and the performance tuning and communication patterns that drive throughput. Drawing from the Ultra-Scale Playbook and real-world scaling efforts, we'll cover benchmarks, hard-earned lessons, and hands-on recommendations for engineers and researchers ready to train state-of-the-art models efficiently.
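
To give a feel for how the parallelism dimensions compose, here is a back-of-the-envelope sketch; the degrees below are made-up examples, not numbers from the talk or the Ultra-Scale Playbook. The key constraint is that the product of the per-dimension degrees must match the number of GPUs in the run.

```python
# "5D" parallelism: the world size is the product of the per-dimension degrees,
# so picking them is a constraint-satisfaction exercise against the cluster size.
# All numbers below are illustrative assumptions.
parallelism = {
    "data": 16,      # model replicas, each consuming different batches
    "tensor": 8,     # individual weight matrices split across GPUs within a node
    "pipeline": 4,   # layers partitioned into sequential stages
    "context": 2,    # long sequences split across GPUs
    "expert": 1,     # MoE experts sharded across GPUs (1 = dense model)
}

world_size = 1
for degree in parallelism.values():
    world_size *= degree
print(f"GPUs required: {world_size}")   # 16 * 8 * 4 * 2 * 1 = 1024

# The global batch size then follows from data parallelism, micro-batch size,
# and gradient-accumulation steps.
micro_batch, grad_accum = 2, 8
print("global batch (sequences):", parallelism["data"] * micro_batch * grad_accum)
```

Communication cost differs sharply across these dimensions (tensor parallelism, for example, needs the fastest interconnect), which is why the degrees are usually chosen to match the cluster topology.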

Nouamane Tazi is a Machine Learning Engineer at Hugging Face, focused on large-scale distributed LLM training. He is the lead author of the Ultra-Scale Playbook and a core developer of Nanotron, Hugging Face's open-source distributed training library. His work spans projects like StarCoder2, SmolLM3, and several Mixture-of-Experts scaling initiatives. He is passionate about making large-scale training practical and accessible.
April 30th | TBA
May 7th | Speaker: Andrew Lampinen (Anthropic)
May 14th | Speaker: Vivek Natarajan (DeepMind)
May 21st | TBA
May 28th | Speaker: Charles Frye (Modal)