CS 224R: Deep Reinforcement Learning

Notes on CS 224R: Deep Reinforcement Learning at Stanford, Spring 2026.

0. RL Basics

MDPs, policies, value functions, state-visitation distributions, and the family of deep RL algorithms

Shurui Liu

1. Imitation Learning

Behavior cloning, expressive policy distributions, action chunking, DAgger, and demonstration collection

Shurui Liu

2. Policy Gradients

REINFORCE, the log-derivative trick, reward-to-go, baselines, off-policy importance sampling, and practical considerations

Shurui Liu

3. Actor-Critic Methods

Policy evaluation, advantage estimates, basic and off-policy actor-critic, PPO, and SAC

Shurui Liu

4. Q-Learning and Value-Based RL

From actor-critic to critic-only methods, Q-learning, target networks, double Q-learning, n-step returns, and the DQN recipe

Shurui Liu

5. Offline Reinforcement Learning

Why offline RL differs from off-policy RL, filtered BC, advantage-weighted regression, IQL, and conservative Q-learning

Shurui Liu

6. Reward Learning

Goal classifiers, preference-based reward models, RLHF for language models, and RLAIF

Shurui Liu

7. Model-Based Reinforcement Learning

Learning dynamics models, MBPO, planning with learned models, MPC, CEM, and the tradeoffs of model-based RL

Shurui Liu

8. Multi-Task and Goal-Conditioned RL

Universal policies and value functions, multi-task imitation and RL, hindsight relabeling, and HER

Shurui Liu

9. Meta-Reinforcement Learning

Meta-RL problem setup, task distributions, black-box meta-RL, exploration-exploitation, posterior sampling, and task-inference objectives

Shurui Liu

10. Hierarchical Imitation and Reinforcement Learning

Long-horizon tasks, high-level and low-level policies, subgoal representations, hierarchical imitation, hierarchical RL, and skill discovery

Shurui Liu

11. Sim2Real Robot Learning

Physics simulators, sim2real gaps, domain randomization, adaptation, Real2Sim, human data, retargeting, and RL algorithms for robot policy learning

Shurui Liu

12. Cross-Cutting Study Guide

Formula sheet, data regimes, method comparison table, bias/variance/distribution shift, meta-RL, hierarchy, sim2real, common failure modes, and exam traps

Shurui Liu

Appendix A. Guest Lecture: Archit Sharma — Post-Training, RLHF, and DPO

LLM training pipeline, instruction tuning, RLHF, preference modeling, and Direct Preference Optimization

Shurui Liu

Appendix B. Guest Lecture: Noam Brown — RL for LLM Reasoning

Inference-time scaling, lessons from poker and Go, RL for reasoning, and serial vs. parallel test-time compute

Shurui Liu