Interested in Transformers, the deep learning model that has taken the world by storm? Want to have intimate discussions with researchers? If so, this course is for you! It's not every day that you get to personally hear from and chat with the authors of the papers you read!

Each week, we invite folks at the forefront of Transformers research to discuss the latest breakthroughs, from LLM architectures like GPT and Gemini to creative use cases in generating art (e.g. DALL-E and Sora), biology and neuroscience applications, robotics, playing complex games, and so forth!

CS25 has become one of Stanford's hottest and most exciting seminar courses. We invite the coolest speakers, such as Andrej Karpathy, Geoffrey Hinton, Jim Fan, and Ashish Vaswani, as well as folks from OpenAI, Google, NVIDIA, etc. Our class has been incredibly popular within and outside Stanford, with around 1 million total views on YouTube. Our class with Andrej Karpathy was the second most popular YouTube video uploaded by Stanford in 2023, with over 500k views!

We have significant improvements for Spring 2024, including a large lecture hall, professional recording and livestreaming (open to the public), social events, and potential 1-on-1 networking! The only homework for students is weekly attendance at the talks/lectures. Livestreaming and auditing are also open to the public: feel free to audit in person or join the Zoom livestream. Anybody can attend; you don't have to be affiliated with Stanford!

We also have a Discord server (over 1,500 members) for Transformers discussion. It is open to the public as more of a "Transformers community". Feel free to join and chat with hundreds of others about Transformers!


  • This is a 1-unit S/NC (pass/fail) course. Enroll on Axess as a Stanford student! (Waitlist available)
  • Lectures are on Thursdays, 4:30 - 5:50 pm PDT, in the Gates Computer Science Building, Room B01 (basement)
  • Zoom Livestream (Anyone can join!): Link [Meeting ID: 999 2215 1759, Password: 123456]
  • Announcements will be made by email, Discord, Canvas (for students), and this mailing list (for auditors/public).
  • Attendance: Enrolled students should attend in-person (up to 3 absences). During/following each lecture, submit a response here. Note: the form will only open during each lecture.
  • Auditing: Open to everyone! Please join in-person or the Zoom livestream. No need to email us. Join this mailing list for announcements.
  • Questions: There will be an opportunity for questions after each lecture. You can submit questions for the speakers using the code "cs25". Please do not unmute on Zoom to ask questions. We cannot guarantee the Zoom chat will be monitored, so please submit questions that way instead.
  • Public (Non-Stanford): There is no way to "officially" enroll in or audit this course (i.e. get a credit/certificate/acknowledgement) unless you are a Stanford student. We are just opening it up to the public for attendance.
  • Contact: If you have any questions about the course, contact us at
  • Recordings & Slides

  • We plan to publicly release YouTube recordings after each talk at a reasonable pace (i.e. approx. 2 weeks afterward).
  • Recordings of previous talks can be found here. Future recordings will also be posted to this same playlist. Video links will also be attached directly to the schedule below.
  • Slides will be posted during/after each lecture, on this website (attached to the schedule below), our Discord, and sent by email through the class mailing lists. We will aim to post them in a timely manner (i.e. within a week of each talk).
  • Disclaimers for Students & Attendees

  • In-person attendees: We will be recording, broadcasting (over Zoom), and publishing the speaker presentations to YouTube to help the timely spread of this cutting-edge information. For your convenience, you can also access these recordings by logging into the course Canvas site (students only). Video cameras located in the back of the room will capture the instructor presentations in this course. Note that while the cameras are positioned with the intention of recording only the instructor, occasionally a part of your image or voice might be incidentally captured. Before the recordings are published, an editor will review to remove any student and attendee appearances. If you have questions, please contact a member of the teaching team.
  • Auditors: If the room is full, please give seats to enrolled students who have priority.
  • Zoom attendees: Please do not unmute yourself on Zoom, use the whiteboard functionality, or any other disruptive behavior! If you have any questions/concerns, please send them in the chat; we will be actively monitoring it.
  • Inappropriate behavior will result in being blacklisted from the course (and possibly other consequences with Stanford).
  • Faculty Advisor


    The current class schedule is below (subject to change):

    Date Title Description
    April 4 Instructor Lecture: Overview of Transformers [In-Person]

    Speakers: Steven Feng, Div Garg, Emily Bunnapradist, Seonghee Lee
    Brief intro and overview of the history of NLP, Transformers and how they work, and their impact. Discussion about recent trends, breakthroughs, applications, and remaining challenges/weaknesses. Also discussion about AI agents. Recording here. Slides posted here.
    April 11 Intuitions on Language Models (Jason) [In-Person]

    Shaping the Future of AI from the History of Transformer (Hyung Won) [In-Person]

    Speakers: Jason Wei & Hyung Won Chung, OpenAI

    Jason Wei is an AI researcher based in San Francisco. He is currently working at OpenAI. He was previously a research scientist at Google Brain, where he popularized key ideas in large language models such as chain-of-thought prompting, instruction tuning, and emergent phenomena.

    Hyung Won Chung is a research scientist on the ChatGPT team at OpenAI. He has worked on various aspects of large language models: pre-training, instruction fine-tuning, reinforcement learning from human feedback, reasoning, multilinguality, parallelism strategies, etc. His notable work includes the Flan scaling paper (Flan-T5, Flan-PaLM) and T5X, the training framework used to train the PaLM language model. Before OpenAI, he was at Google Brain, and before that he received a PhD from MIT.
    Jason will talk about some basic intuitions on language models, inspired by manual examination of data. First, he will discuss how one can view next word prediction as massive multi-task learning. Then, he will discuss how this framing reconciles scaling laws with emergent individual tasks. Finally, he will talk about the more general implications of these learnings. Slides posted here.

    Hyung Won: AI is developing at such an overwhelming pace that it is hard to keep up. Instead of spending all our energy catching up with the latest developments, I argue that we should study the change itself. The first step is to identify and understand the driving force behind the change. For AI, it is the exponentially cheaper compute and the associated scaling. I will provide a highly opinionated view on the early history of Transformer architectures, focusing on what motivated each development and how each became less relevant with more compute. This analysis will help us connect the past and present in a unified perspective, which in turn makes it more manageable to project where the field is heading. Slides posted here.
    April 18 Aligning Open Language Models [Virtual/Zoom]
    Speaker: Nathan Lambert, Allen Institute for AI (AI2)

    Nathan Lambert is a Research Scientist at the Allen Institute for AI focusing on RLHF. Previously, he helped build an RLHF research team at HuggingFace. He received his PhD from the University of California, Berkeley, working at the intersection of machine learning and robotics. He was advised by Professor Kristofer Pister in the Berkeley Autonomous Microsystems Lab and by Roberto Calandra at Meta AI Research.
    Since the emergence of ChatGPT there has been an explosion of methods and models attempting to make open language models easier to use. This talk retells the major chapters in the evolution of open chat, instruct, and aligned models, covering the most important techniques, datasets, and models. Alpaca, QLoRA, DPO, PPO, and everything in between will be covered. The talk will conclude with predictions and expectations for the future of aligning open language models. Slides posted here. All the models in the figures are in this HuggingFace collection.
    April 25 Demystifying Mixtral of Experts [Virtual/Zoom]
    Speaker: Albert Jiang, Mistral AI / University of Cambridge

    Albert Jiang is an AI scientist at Mistral AI and a final-year PhD student in the Computer Science Department at the University of Cambridge. He works on language model pretraining and reasoning at Mistral AI, and on language models for mathematics at Cambridge.
    In this talk I will introduce Mixtral 8x7B, a Sparse Mixture of Experts (SMoE) language model. Mixtral has the same architecture as Mistral 7B, with the difference that each layer is composed of 8 feedforward blocks (i.e. experts). For every token, at each layer, a router network selects two experts to process the current state and combines their outputs. Even though each token only sees two experts, the selected experts can be different at each timestep. As a result, each token has access to 47B parameters, but only uses 13B active parameters during inference. I will go into the architectural details and analyse the expert routing decisions made by the model.
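    The per-token routing described above can be sketched in a few lines of plain Python. This is a toy illustration, not Mistral's implementation: real Mixtral experts are gated feedforward blocks inside a full Transformer and the router is learned during training, whereas the linear "experts", random weights, and dimensions below are made up for demonstration.

    ```python
    import math
    import random

    def top2_moe_layer(x, gate_w, expert_ws):
        """One sparse MoE feedforward step for a single token state x."""
        # Router: one logit per expert (dot product of x with that expert's gate column)
        logits = [sum(xi * wi for xi, wi in zip(x, col)) for col in gate_w]
        # Select the two highest-scoring experts for this token
        top2 = sorted(range(len(logits)), key=lambda i: logits[i])[-2:]
        # Softmax over just the selected pair gives the combination weights
        m = max(logits[i] for i in top2)
        exps = [math.exp(logits[i] - m) for i in top2]
        gates = [e / sum(exps) for e in exps]
        # Each "expert" here is a plain linear map; combine the weighted outputs
        out = [0.0] * len(x)
        for g, i in zip(gates, top2):
            for r, row in enumerate(expert_ws[i]):
                out[r] += g * sum(xi * wi for xi, wi in zip(x, row))
        return out, top2

    random.seed(0)
    d, n_experts = 8, 8
    x = [random.gauss(0, 1) for _ in range(d)]
    gate_w = [[random.gauss(0, 1) for _ in range(d)] for _ in range(n_experts)]
    expert_ws = [[[random.gauss(0, 1) for _ in range(d)] for _ in range(d)]
                 for _ in range(n_experts)]
    y, chosen = top2_moe_layer(x, gate_w, expert_ws)
    assert len(y) == d and len(chosen) == 2
    ```

    Because the router can pick a different pair of experts at every layer and timestep, all 8 experts' parameters must be kept in memory, but only 2 experts' worth of computation runs per token, which is how Mixtral holds 47B parameters while using only 13B active parameters at inference.
    
    
    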
    May 2 Transformers that Transform Well Enough to Support Near-Shallow Architectures [In-Person]
    Speaker: Jake Williams, Drexel University

    Jake Ryland Williams is an Associate Professor of Information Science at Drexel University's College of Computing and Informatics in Philadelphia, Pennsylvania. Dr. Williams has a background in physics and math, with degrees from the University of Vermont, and his research takes a quantitative-linguistics perspective, applying mathematical and statistical methodology to analyze and improve linguistic learning systems, alongside others that use shared neural methodology. Following a one-year postdoctoral appointment at the University of California, Berkeley (Cal) studying large-scale machine learning in 2015, Dr. Williams joined the data science (DS) faculty at Drexel, where he led the founding of a DS MS program and develops and teaches DS coursework, including natural language processing with deep learning.
    The talk will discuss various effectiveness-enhancing and cost-cutting augmentations to the language model (LM) learning process, including the derivation and application of non-random parameter initializations for specialized self-attention-based architectures. These are referred to as precision LMs (PLMs), in part for their capability to effectively and efficiently train both large and small LMs. Highlighting their hallmark capability of training with only very limited resources, an introduction to PLMs will be followed by a presentation of a developing application that localizes untrained PLMs on microprocessors to act as hardware-based controllers for small electronic devices. The talk will cover their utility for training in air-gapped environments and for training progressively bigger models on CPUs, and will detail a fully developed control system and its user interface, including recent experiments on Le Potato, where effective inference of user directives occurred after only 20 minutes of lay interaction over a microphone and light switch.
    May 9 From Large Language Models to Large Multimodal Models [Virtual/Zoom]
    Speaker: Ming Ding, Zhipu AI

    Ming Ding is a research scientist at Zhipu AI based in Beijing. He obtained his bachelor's and doctoral degrees at Tsinghua University, advised by Prof. Jie Tang. His research interests include multimodality, generative models, and pre-training technologies. He has led or contributed to research on multimodal generative models such as CogView and CogVideo, multimodal understanding models such as CogVLM and CogAgent, and language models such as GLM and GLM-130B.
    As large language models (LLMs) have made significant advancements over the past five years, there is growing anticipation for seamlessly integrating other modalities of perception (primarily visual) with their capabilities. This talk will start with the basics of large language models, then discuss the academic community's attempts at multimodal models and architectural updates over the past year. We will focus on introducing CogVLM, a powerful open-source multimodal model with 17B parameters (equivalent to a 7B dense model), and CogAgent, a model designed for scenarios involving GUIs and OCR. Finally, we will discuss the applications of multimodal models and viable research directions in academia.
    May 16 Amortizing intractable inference in large language models [In-Person]
    Speaker: Edward Hu, Prev. OpenAI

    Edward Hu is building his own company. He was previously a researcher at OpenAI and received his research training as a Ph.D. student advised by Yoshua Bengio, a recipient of the 2018 A.M. Turing Award. Before graduate school, Edward was a researcher at Microsoft, where he invented LoRA and μTransfer. LoRA is now one of the most popular methods for customizing AI models, and μTransfer underpins the largest AI models being developed today.
    Autoregressive large language models (LLMs) compress knowledge from their training data through next-token conditional distributions. This limits tractable querying of this knowledge to start-to-end autoregressive sampling. However, many tasks of interest -- including sequence continuation, infilling, and other forms of constrained generation -- involve sampling from intractable posterior distributions. We address this limitation by using amortized Bayesian inference to sample from these intractable posteriors. Such amortization is algorithmically achieved by fine-tuning LLMs via diversity-seeking reinforcement learning algorithms: generative flow networks (GFlowNets). We empirically demonstrate that this distribution-matching paradigm of LLM fine-tuning can serve as an effective alternative to maximum-likelihood training and reward-maximizing policy optimization. As an important application, we interpret chain-of-thought reasoning as a latent variable modeling problem and demonstrate that our approach enables data-efficient adaptation of LLMs to tasks that require multi-step rationalization and tool use.

    Recommended Reading:
    1. Amortizing Intractable Inference in Large Language Models
    May 23 Behind the Scenes of LLM Pre-training: StarCoder Use Case [Virtual/Zoom]
    Speaker: Loubna Ben Allal, Hugging Face

    Loubna Ben Allal is a Machine Learning Engineer on the Science team at Hugging Face, working on large language models for code and synthetic data generation. She is part of the core team behind the BigCode project and has co-authored The Stack dataset and the StarCoder models for code generation. Loubna holds Master's degrees in Mathematics and Deep Learning from École des Mines de Nancy and ENS Paris-Saclay.
    As large language models (LLMs) become essential to many AI products, learning to pretrain and fine-tune them is now crucial. In this talk, we will explore the intricacies of training LLMs from scratch, including lessons on scaling laws and data curation. Then, we will study the StarCoder use case as an example of LLMs tailored for code, highlighting how their development differs from standard LLMs. Additionally, we will discuss important aspects of data governance and evaluation, crucial elements in today's conversations about LLMs and AI that are frequently overshadowed by the pre-training discussions.
    May 30 NO CLASS!! Enjoy the summer :)