CS 224R: Deep Reinforcement Learning

Notes on CS 224R: Deep Reinforcement Learning at Stanford, Spring 2026.

0. RL Basics

MDPs, policies, value functions, state-visitation distributions, and the family of deep RL algorithms

9 min · Shurui Liu

1. Imitation Learning

Behavior cloning, expressive policy distributions, action chunking, DAgger, and demonstration collection

8 min · Shurui Liu

2. Policy Gradients

REINFORCE, the log-derivative trick, reward-to-go, baselines, off-policy importance sampling, and practical considerations

8 min · Shurui Liu

3. Actor-Critic Methods

Policy evaluation, advantage estimates, basic and off-policy actor-critic, PPO, and SAC

10 min · Shurui Liu

4. Q-Learning and Value-Based RL

From actor-critic to critic-only methods, Q-learning, target networks, double Q-learning, n-step returns, and the DQN recipe

7 min · Shurui Liu

5. Offline Reinforcement Learning

Why offline RL differs from off-policy RL, filtered BC, advantage-weighted regression, IQL, and conservative Q-learning

8 min · Shurui Liu

6. Reward Learning

Goal classifiers, preference-based reward models, RLHF for language models, and RLAIF

6 min · Shurui Liu

7. Model-Based Reinforcement Learning

Learning dynamics models, MBPO, planning with learned models, MPC, CEM, and the tradeoffs of model-based RL

6 min · Shurui Liu

8. Multi-Task and Goal-Conditioned RL

Universal policies and value functions, multi-task imitation and RL, hindsight relabeling, and HER

6 min · Shurui Liu

9. Meta-Reinforcement Learning

Meta-RL problem setup, task distributions, black-box meta-RL, exploration-exploitation, posterior sampling, and task-inference objectives

9 min · Shurui Liu

10. Hierarchical Imitation and Reinforcement Learning

Long-horizon tasks, high-level and low-level policies, subgoal representations, hierarchical imitation, hierarchical RL, and skill discovery

9 min · Shurui Liu

11. Sim2Real Robot Learning

Physics simulators, sim2real gaps, domain randomization, adaptation, Real2Sim, human data, retargeting, and RL algorithms for robot policy learning

8 min · Shurui Liu

12. Cross-Cutting Study Guide

Formula sheet, data regimes, method comparison table, bias/variance/distribution shift, meta-RL, hierarchy, sim2real, common failure modes, and exam traps

12 min · Shurui Liu

Appendix A. Guest Lecture: Archit Sharma — Post-Training, RLHF, and DPO

LLM training pipeline, instruction tuning, RLHF, preference modeling, and Direct Preference Optimization

7 min · Shurui Liu

Appendix B. Guest Lecture: Noam Brown — RL for LLM Reasoning

Inference-time scaling, lessons from poker and Go, RL for reasoning, and serial vs. parallel test-time compute

6 min · Shurui Liu
