CS 224R: Deep Reinforcement Learning
Notes on CS 224R: Deep Reinforcement Learning, Stanford, Spring 2026.
0. RL Basics
MDPs, policies, value functions, state-visitation distributions, and the family of deep RL algorithms
1. Imitation Learning
Behavior cloning, expressive policy distributions, action chunking, DAgger, and demonstration collection
2. Policy Gradients
REINFORCE, the log-derivative trick, reward-to-go, baselines, off-policy importance sampling, and practical considerations
3. Actor-Critic Methods
Policy evaluation, advantage estimates, basic and off-policy actor-critic, PPO, and SAC
4. Q-Learning and Value-Based RL
From actor-critic to critic-only methods, Q-learning, target networks, double Q-learning, n-step returns, and the DQN recipe
5. Offline Reinforcement Learning
Why offline RL differs from off-policy RL, filtered BC, advantage-weighted regression, IQL, and conservative Q-learning
6. Reward Learning
Goal classifiers, preference-based reward models, RLHF for language models, and RLAIF
7. Model-Based Reinforcement Learning
Learning dynamics models, MBPO, planning with learned models, MPC, CEM, and the tradeoffs of model-based RL
8. Multi-Task and Goal-Conditioned RL
Universal policies and value functions, multi-task imitation and RL, hindsight relabeling, and HER
9. Meta-Reinforcement Learning
Meta-RL problem setup, task distributions, black-box meta-RL, exploration-exploitation, posterior sampling, and task-inference objectives
10. Hierarchical Imitation and Reinforcement Learning
Long-horizon tasks, high-level and low-level policies, subgoal representations, hierarchical imitation, hierarchical RL, and skill discovery
11. Sim2Real Robot Learning
Physics simulators, sim2real gaps, domain randomization, adaptation, Real2Sim, human data, retargeting, and RL algorithms for robot policy learning
12. Cross-Cutting Study Guide
Formula sheet, data regimes, method comparison table, bias/variance/distribution shift, meta-RL, hierarchy, sim2real, common failure modes, and exam traps
Appendix A. Guest Lecture: Archit Sharma — Post-Training, RLHF, and DPO
LLM training pipeline, instruction tuning, RLHF, preference modeling, and Direct Preference Optimization
Appendix B. Guest Lecture: Noam Brown — RL for LLM Reasoning
Inference-time scaling, lessons from poker and Go, RL for reasoning, and serial vs. parallel test-time compute