Interested in Transformers, the deep learning model that has taken the world by storm? Want to have intimate discussions with researchers? If so, this course is for you! It's not every day that you get to personally hear from and chat with the authors of the papers you read!

Each week, we invite folks at the forefront of Transformers research to discuss the latest breakthroughs, from LLM architectures like GPT and Gemini to creative use cases in generating art (e.g. DALL-E and Sora), biology and neuroscience applications, robotics, playing complex games, and so forth!

CS25 has become one of Stanford's hottest and most exciting seminar courses. We invite the coolest speakers, such as Andrej Karpathy, Geoffrey Hinton, Jim Fan, and Ashish Vaswani, as well as folks from OpenAI, Google, NVIDIA, etc. Our class has been incredibly popular within and outside Stanford, with around 1 million total views on YouTube. Our class with Andrej Karpathy was the second most popular YouTube video uploaded by Stanford in 2023, with over 500k views!

We have significant improvements for Spring 2024, including a large lecture hall, professional recording and livestreaming (open to the public), social events, and potential 1-on-1 networking! The only homework for students is weekly attendance at the talks/lectures. Livestreaming and auditing are also open to the public: feel free to audit in person or join the Zoom livestream. Anybody can attend; you don't have to be affiliated with Stanford!

We also have a Discord server (over 1,500 members) for Transformers discussion. It is open to the public as more of a "Transformers community". Feel free to join and chat with hundreds of others about Transformers!


  • This is a 1-unit S/NC (pass/fail) course. Enroll on Axess as a Stanford student! (Waitlist available)
  • Lectures are on Thursdays, 4:30 - 5:50 pm PDT, in the Gates Computer Science Building, Room B01 (basement)
  • Zoom Livestream (Anyone can join!): Link [Meeting ID: 999 2215 1759, Password: 123456]
  • Announcements will be made by email, Discord, Canvas (for students), and this mailing list (for auditors/public).
  • Attendance: Enrolled students should attend in-person (up to 3 absences). During/following each lecture, submit a response here. Note: the form will only open during each lecture.
  • Auditing: Open to everyone! Please join in-person or the Zoom livestream. No need to email us. Join this mailing list for announcements.
  • Questions: There will be an opportunity for questions after each lecture. You can submit questions for the speakers using the code "cs25". Please do not unmute on Zoom to ask questions. We cannot guarantee the Zoom chat will be monitored, so please submit questions that way instead.
  • Public (Non-Stanford): There is no way to "officially" enroll in or audit this course (i.e. get a credit/certificate/acknowledgement) unless you are a Stanford student. We are just opening it up to the public for attendance.
  • Contact: If you have any questions about the course, contact us at
  • Recordings & Slides

  • We plan to publicly release YouTube recordings after each talk at a reasonable pace (i.e. approx. 2 weeks afterward).
  • Recordings of previous talks can be found here. Future recordings will also be posted to this same playlist. Video links will also be attached directly to the schedule below.
  • Slides will be posted during/after each lecture, on this website (attached to the schedule below), our Discord, and sent by email through the class mailing lists. We will aim to post them in a timely manner (i.e. within a week of each talk).
  • Disclaimers for Students & Attendees

  • In-person attendees: We will be recording, broadcasting (over Zoom), and publishing the speaker presentations to YouTube to help the timely spread of this cutting-edge information. For your convenience, you can also access these recordings by logging into the course Canvas site (students only). Video cameras located in the back of the room will capture the instructor presentations in this course. Note that while the cameras are positioned with the intention of recording only the instructor, occasionally a part of your image or voice might be incidentally captured. Before the recordings are published, an editor will review to remove any student and attendee appearances. If you have questions, please contact a member of the teaching team.
  • Auditors: If the room is full, please give seats to enrolled students who have priority.
  • Zoom attendees: Please do not unmute yourself on Zoom, use the whiteboard functionality, or any other disruptive behavior! If you have any questions/concerns, please send them in the chat; we will be actively monitoring it.
  • Inappropriate behavior will result in being blacklisted from the course (and possibly other consequences with Stanford).
  • Faculty Advisor


    The current class schedule is below (subject to change):

    Date Title Description
    April 4 Instructor Lecture: Overview of Transformers [In-Person]

    Speakers: Steven Feng, Div Garg, Emily Bunnapradist, Seonghee Lee
    Brief intro and overview of the history of NLP, Transformers and how they work, and their impact. Discussion about recent trends, breakthroughs, applications, and remaining challenges/weaknesses. Also discussion about AI agents. Recording here. Slides posted here.
    April 11 Intuitions on Language Models (Jason) [In-Person]

    Shaping the Future of AI from the History of Transformer (Hyung Won) [In-Person]

    Speakers: Jason Wei & Hyung Won Chung, OpenAI

    Jason Wei is an AI researcher based in San Francisco. He is currently working at OpenAI. He was previously a research scientist at Google Brain, where he popularized key ideas in large language models such as chain-of-thought prompting, instruction tuning, and emergent phenomena.

    Hyung Won Chung is a research scientist on the ChatGPT team at OpenAI. He has worked on various aspects of large language models: pre-training, instruction fine-tuning, reinforcement learning from human feedback, reasoning, multilinguality, parallelism strategies, etc. His notable work includes the Flan scaling paper (Flan-T5, Flan-PaLM) and T5X, the training framework used to train the PaLM language model. Before OpenAI, he was at Google Brain, and before that he received a PhD from MIT.
    Jason will talk about some basic intuitions on language models, inspired by manual examination of data. First, he will discuss how one can view next word prediction as massive multi-task learning. Then, he will discuss how this framing reconciles scaling laws with emergent individual tasks. Finally, he will talk about the more general implications of these learnings. Slides posted here.

    Hyung Won: AI is developing at such an overwhelming pace that it is hard to keep up. Instead of spending all our energy catching up with the latest developments, I argue that we should study the change itself. The first step is to identify and understand the driving force behind the change. For AI, it is the exponentially cheaper compute and the associated scaling. I will provide a highly opinionated view on the early history of Transformer architectures, focusing on what motivated each development and how each became less relevant with more compute. This analysis will help us connect the past and present in a unified perspective, which in turn makes it more manageable to project where the field is heading. Slides posted here.
    April 18 Aligning Open Language Models [Virtual/Zoom]
    Speaker: Nathan Lambert, Allen Institute for AI (AI2)

    Nathan Lambert is a Research Scientist at the Allen Institute for AI focusing on RLHF. Previously, he helped build an RLHF research team at HuggingFace. He received his PhD from the University of California, Berkeley, working at the intersection of machine learning and robotics. He was advised by Professor Kristofer Pister in the Berkeley Autonomous Microsystems Lab and by Roberto Calandra at Meta AI Research.
    Since the emergence of ChatGPT there has been an explosion of methods and models attempting to make open language models easier to use. This talk retells the major chapters in the evolution of open chat, instruct, and aligned models, covering the most important techniques, datasets, and models. Alpaca, QLoRA, DPO, PPO, and everything in between will be covered. The talk will conclude with predictions and expectations for the future of aligning open language models. Slides posted here. All the models in the figures are in this HuggingFace collection.
    April 25 Demystifying Mixtral of Experts [Virtual/Zoom]
    Speaker: Albert Jiang, Mistral AI / University of Cambridge

    Albert Jiang is an AI scientist at Mistral AI and a final-year PhD student in the Computer Science Department at the University of Cambridge. He works on language model pretraining and reasoning at Mistral AI, and on language models for mathematics at Cambridge.
    In this talk I will introduce Mixtral 8x7B, a Sparse Mixture of Experts (SMoE) language model. Mixtral has the same architecture as Mistral 7B, with the difference that each layer is composed of 8 feedforward blocks (i.e. experts). For every token, at each layer, a router network selects two experts to process the current state and combines their outputs. Even though each token only sees two experts, the selected experts can be different at each timestep. As a result, each token has access to 47B parameters, but only uses 13B active parameters during inference. I will go into the architectural details and analyse the expert routing decisions made by the model.
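    The per-token routing described above can be sketched in a few lines of plain Python. This is a toy illustration, not Mistral's implementation: real Mixtral experts are gated feedforward blocks inside a full Transformer and the router is learned during training, whereas the linear "experts", random weights, and dimensions below are made up for demonstration.

    ```python
    import math
    import random

    def top2_moe_layer(x, gate_w, expert_ws):
        """One sparse MoE feedforward step for a single token state x."""
        # Router: one logit per expert (dot product of x with that expert's gate column)
        logits = [sum(xi * wi for xi, wi in zip(x, col)) for col in gate_w]
        # Select the two highest-scoring experts for this token
        top2 = sorted(range(len(logits)), key=lambda i: logits[i])[-2:]
        # Softmax over just the selected pair gives the combination weights
        m = max(logits[i] for i in top2)
        exps = [math.exp(logits[i] - m) for i in top2]
        gates = [e / sum(exps) for e in exps]
        # Each "expert" here is a plain linear map; combine the weighted outputs
        out = [0.0] * len(x)
        for g, i in zip(gates, top2):
            for r, row in enumerate(expert_ws[i]):
                out[r] += g * sum(xi * wi for xi, wi in zip(x, row))
        return out, top2

    random.seed(0)
    d, n_experts = 8, 8
    x = [random.gauss(0, 1) for _ in range(d)]
    gate_w = [[random.gauss(0, 1) for _ in range(d)] for _ in range(n_experts)]
    expert_ws = [[[random.gauss(0, 1) for _ in range(d)] for _ in range(d)]
                 for _ in range(n_experts)]
    y, chosen = top2_moe_layer(x, gate_w, expert_ws)
    assert len(y) == d and len(chosen) == 2
    ```

    Because the router can pick a different pair of experts at every layer and timestep, all 8 experts' parameters must be kept in memory, but only 2 experts' worth of computation runs per token, which is how Mixtral holds 47B parameters while using only 13B active parameters at inference.
    
    
    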
    May 2 Transformers that Transform Well Enough to Support Near-Shallow Architectures [In-Person]
    Speaker: Jake Williams, Drexel University

    Jake Ryland Williams is an Associate Professor of Information Science at Drexel University's College of Computing and Informatics in Philadelphia, Pennsylvania. Dr. Williams has a background in physics and math, with degrees from the University of Vermont, and his research takes a quantitative-linguistics perspective, applying mathematical and statistical methodology to analyze and improve linguistic learning systems, alongside others that use shared neural methodology. Following a one-year postdoctoral appointment at the University of California, Berkeley (Cal) studying large-scale machine learning in 2015, Dr. Williams joined the data science (DS) faculty at Drexel, where he led the founding of a DS MS program and develops and teaches DS coursework, including natural language processing with deep learning.
    The talk will discuss various effectiveness-enhancing and cost-cutting augmentations to the language model (LM) learning process, including the derivation and application of non-random parameter initializations for specialized self-attention-based architectures. These are referred to as precision LMs (PLMs), in part for their capability to effectively and efficiently train both large and small LMs. Highlighting their hallmark capability of training with only very limited resources, an introduction to PLMs will be followed by a presentation of a developing application that localizes untrained PLMs on microprocessors to act as hardware-based controllers for small electronic devices. The talk will cover their utility for training in air-gapped environments and for training progressively bigger models on CPUs, and will detail a fully developed control system and its user interface, including recent experiments on Le Potato, where effective inference of user directives occurred after only 20 minutes of lay interaction over a microphone and light switch.
    May 9 From Large Language Models to Large Multimodal Models [Virtual/Zoom]
    Speaker: Ming Ding, Zhipu AI

    Ming Ding is a research scientist at Zhipu AI based in Beijing. He obtained his bachelor's and doctoral degrees at Tsinghua University, advised by Prof. Jie Tang. His research interests include multimodality, generative models, and pre-training technologies. He has led or contributed to research on multimodal generative models such as CogView and CogVideo, multimodal understanding models such as CogVLM and CogAgent, and language models such as GLM and GLM-130B.
    As large language models (LLMs) have made significant advancements over the past five years, there is growing anticipation for seamlessly integrating other modalities of perception (primarily visual) with their capabilities. This talk will start with the basics of large language models, then discuss the academic community's attempts at multimodal models and architectural updates over the past year. We will focus on introducing CogVLM, a powerful open-source multimodal model with 17B parameters (equivalent to a 7B dense model), and CogAgent, a model designed for scenarios involving GUIs and OCR. Finally, we will discuss the applications of multimodal models and viable research directions in academia.
    May 16 Amortizing intractable inference in large language models [In-Person]
    Speaker: Edward Hu, Prev. OpenAI

    Edward Hu is building his own company. He was previously a researcher at OpenAI and received his research training as a Ph.D. student advised by Yoshua Bengio, a recipient of the 2018 A.M. Turing Award. Before graduate school, Edward was a researcher at Microsoft, where he invented LoRA and μTransfer. LoRA is now one of the most popular methods for customizing AI models, and μTransfer underpins the largest AI models being developed today.
    Autoregressive large language models (LLMs) compress knowledge from their training data through next-token conditional distributions. This limits tractable querying of this knowledge to start-to-end autoregressive sampling. However, many tasks of interest -- including sequence continuation, infilling, and other forms of constrained generation -- involve sampling from intractable posterior distributions. We address this limitation by using amortized Bayesian inference to sample from these intractable posteriors. Such amortization is algorithmically achieved by fine-tuning LLMs via diversity-seeking reinforcement learning algorithms: generative flow networks (GFlowNets). We empirically demonstrate that this distribution-matching paradigm of LLM fine-tuning can serve as an effective alternative to maximum-likelihood training and reward-maximizing policy optimization. As an important application, we interpret chain-of-thought reasoning as a latent variable modeling problem and demonstrate that our approach enables data-efficient adaptation of LLMs to tasks that require multi-step rationalization and tool use.

    Recommended Reading:
    1. Amortizing Intractable Inference in Large Language Models
    May 23 Behind the Scenes of LLM Pre-training: StarCoder Use Case [Virtual/Zoom]
    Speaker: Loubna Ben Allal, Hugging Face

    Loubna Ben Allal is a Machine Learning Engineer on the Science team at Hugging Face, working on large language models for code and synthetic data generation. She is part of the core team behind the BigCode project and has co-authored The Stack dataset and the StarCoder models for code generation. Loubna holds Master's degrees in Mathematics and Deep Learning from École des Mines de Nancy and ENS Paris-Saclay.
    As large language models (LLMs) become essential to many AI products, learning to pretrain and fine-tune them is now crucial. In this talk, we will explore the intricacies of training LLMs from scratch, including lessons on scaling laws and data curation. Then, we will study the StarCoder use case as an example of LLMs tailored for code, highlighting how their development differs from standard LLMs. Additionally, we will discuss important aspects of data governance and evaluation, crucial elements in today's conversations about LLMs and AI that are frequently overshadowed by the pre-training discussions.
    May 30 NO CLASS!! Enjoy the summer :)