12. Cross-Cutting Study Guide
Core Formula Sheet
Trajectory probability:
$$ \tau=(s_ 1,a_ 1,\ldots,s_ T,a_ T,s_ {T+1}), \qquad p_ \theta(\tau)=\rho_ 0(s_ 1)\prod_ {t=1}^{T} \pi_ \theta(a_ t\mid s_ t)p(s_ {t+1}\mid s_ t,a_ t). $$
Return and reward-to-go:
$$ R(\tau)=\sum_ t \gamma^{t-1}r_ t, \qquad G_ t=\sum_ {k=0}^{T-t}\gamma^k r_ {t+k}. $$
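The reward-to-go sum satisfies the recursion $G_ t=r_ t+\gamma G_ {t+1}$, so it can be computed in one backward pass. The following is a minimal sketch; the function names and the toy reward list are illustrative assumptions, not part of the notes.

```python
def discounted_return(rewards, gamma):
    """R(tau) = sum_t gamma^(t-1) r_t, with the list index k playing the role of t-1."""
    return sum(gamma ** k * r for k, r in enumerate(rewards))


def rewards_to_go(rewards, gamma):
    """G_t for every step, via the backward recursion G_t = r_t + gamma * G_{t+1}."""
    out = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        out[t] = running
    return out


# Toy episode (hypothetical numbers).
rewards = [1.0, 0.0, 2.0]
G = rewards_to_go(rewards, gamma=0.5)  # [1.5, 1.0, 2.0]
```

The first entry of `G` equals `discounted_return(rewards, 0.5)`, since reward-to-go from the first step is the full return.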
Policy-gradient estimator:
$$ \nabla_ \theta J(\theta) \approx \frac{1}{N}\sum_ {i,t} \nabla_ \theta\log\pi_ \theta(a_ {i,t}\mid s_ {i,t})\hat{A}_ {i,t}. $$
TD residual and one-step advantage:
$$ \delta_ t=r_ t+\gamma V(s_ {t+1})-V(s_ t). $$
Q-learning target:
$$ y=r+\gamma\max_ {a'}Q_ {\bar{\phi}}(s',a'). $$
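A minimal sketch of the target computation, assuming a discrete action set so the max is a plain maximum over a list. Here `next_q_values` stands for the target network's outputs $Q_ {\bar{\phi}}(s',\cdot)$, and the `done` flag is an added assumption for terminal transitions that the formula above does not show.

```python
def q_learning_target(reward, next_q_values, gamma, done=False):
    """y = r + gamma * max_a' Q_target(s', a').

    next_q_values: list of target-network Q-values, one per action at s'.
    done: assumed terminal flag; if True there is no bootstrap term.
    """
    if done:
        return reward
    return reward + gamma * max(next_q_values)


# Hypothetical transition: the max picks the second action (value 2.0),
# regardless of which action the behavior policy actually took at s'.
y = q_learning_target(1.0, [0.5, 2.0, -1.0], gamma=0.5)
```

The max is taken over the learned Q-values, which is what makes the target independent of the behavior policy's next action.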
PPO ratio and clipped objective:
$$ \rho_ t(\theta)= \frac{\pi_ \theta(a_ t\mid s_ t)} {\pi_ {\mathrm{old}}(a_ t\mid s_ t)}, $$
$$ L^{\mathrm{PPO}}= \mathbb{E}\left[ \min(\rho_ t\hat{A}_ t, \operatorname{clip}(\rho_ t,1-\epsilon,1+\epsilon)\hat{A}_ t) \right]. $$
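To see how the clip interacts with the sign of the advantage, here is a small pure-Python sketch of the objective, averaged over a batch. The function name, the log-probability inputs, and the default `eps=0.2` are illustrative assumptions.

```python
import math


def ppo_clip_objective(logp_new, logp_old, advantages, eps=0.2):
    """Batch mean of min(rho * A, clip(rho, 1 - eps, 1 + eps) * A)."""
    total = 0.0
    for ln, lo, adv in zip(logp_new, logp_old, advantages):
        ratio = math.exp(ln - lo)  # rho_t = pi_theta / pi_old
        clipped = min(max(ratio, 1.0 - eps), 1.0 + eps)
        total += min(ratio * adv, clipped * adv)
    return total / len(advantages)


# rho = 1.5 with A > 0: the gain is capped at (1 + eps) * A.
capped = ppo_clip_objective([math.log(1.5)], [0.0], [1.0])
# rho = 1.5 with A < 0: the min keeps the unclipped (more pessimistic) term.
uncapped = ppo_clip_objective([math.log(1.5)], [0.0], [-1.0])
```

Because of the outer min, clipping only removes the incentive to move the ratio further in the direction that would increase the objective; it never hides a worse outcome.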
SAC soft actor objective:
$$ \max_ \theta \mathbb{E}_ {s\sim\mathcal{D},a\sim\pi_ \theta} [Q(s,a)-\alpha\log\pi_ \theta(a\mid s)]. $$
IQL policy extraction:
$$ \max_ \theta \mathbb{E}_ {(s,a)\sim\mathcal{D}} \left[ \log\pi_ \theta(a\mid s) \exp\left(\frac{Q(s,a)-V(s)}{\alpha}\right) \right]. $$
Preference reward model:
$$ \Pr(\tau_ a\succ\tau_ b)= \sigma(r_ \theta(\tau_ a)-r_ \theta(\tau_ b)). $$
Policy-gradient theorem:
$$ \nabla_ \theta J(\theta)= \frac{1}{1-\gamma} \mathbb{E}_ {s\sim d^{\pi_ \theta},a\sim\pi_ \theta} [\nabla_ \theta\log\pi_ \theta(a\mid s)Q^{\pi_ \theta}(s,a)]. $$
GAE:
$$ \hat{A}^{\mathrm{GAE}}_ t= \sum_ {l=0}^{\infty}(\gamma\lambda)^l\delta_ {t+l}, \qquad \delta_ t=r_ t+\gamma V(s_ {t+1})-V(s_ t). $$
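In practice the infinite sum is truncated at the end of the collected trajectory and computed backward with $\hat{A}_ t=\delta_ t+\gamma\lambda\hat{A}_ {t+1}$. The sketch below assumes a single finite trajectory and a `values` list with one extra entry for the bootstrap value of the final next state (zero if that state is terminal); these conventions are assumptions for illustration.

```python
def gae(rewards, values, gamma, lam):
    """Truncated GAE for one trajectory.

    rewards: r_1..r_T.
    values:  V(s_1)..V(s_{T+1}); the last entry is the bootstrap value.
    """
    T = len(rewards)
    adv = [0.0] * T
    running = 0.0
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]  # TD residual
        running = delta + gamma * lam * running
        adv[t] = running
    return adv


# Hypothetical numbers; the final next state is treated as terminal (value 0).
rewards = [1.0, 0.0, 2.0]
values = [0.5, 0.25, 1.0, 0.0]
one_step = gae(rewards, values, gamma=0.5, lam=0.0)     # equals the TD residuals
monte_carlo = gae(rewards, values, gamma=0.5, lam=1.0)  # equals G_t - V(s_t)
```

Setting `lam=0` recovers the one-step advantage $\delta_ t$, and `lam=1` recovers the Monte Carlo return minus the value baseline, which is the bias/variance dial that $\lambda$ controls.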
TRPO-style constrained surrogate:
$$ \max_ \theta \mathbb{E}\left[ \frac{\pi_ \theta(a_ t\mid s_ t)} {\pi_ {\mathrm{old}}(a_ t\mid s_ t)} \hat{A}_ t \right] \quad \text{s.t.}\quad \mathbb{E}_ s[D_ {\mathrm{KL}}(\pi_ {\mathrm{old}}(\cdot\mid s)\,\|\,\pi_ \theta(\cdot\mid s))]\le\delta. $$
IQL expectile/value and Q losses (here $\tau\in(0,1)$ is the expectile level, not a trajectory):
$$ L_ \tau(u)=|\tau-\mathbf{1}\{u<0\}|\,u^2, $$
$$ \min_ \psi \mathbb{E}_ {(s,a)\sim\mathcal{D}} [L_ \tau(Q_ \phi(s,a)-V_ \psi(s))], \qquad \min_ \phi \mathbb{E}[(Q_ \phi(s,a)-r-\gamma V_ \psi(s'))^2]. $$
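The asymmetry of the expectile loss is easiest to see on a scalar example. The sketch below is an illustration under stated assumptions: `expectile` fits a single number to a list of Q-samples by iteratively reweighted averaging (a standard way to solve the weighted least-squares stationarity condition, not the gradient-based training IQL uses), and `tau` here is the expectile level.

```python
def expectile_loss(u, tau):
    """L_tau(u) = |tau - 1{u < 0}| * u^2."""
    weight = abs(tau - (1.0 if u < 0 else 0.0))
    return weight * u * u


def expectile(samples, tau, iters=100):
    """Scalar v minimizing the mean of L_tau(q - v) over the samples."""
    v = sum(samples) / len(samples)
    for _ in range(iters):
        # Samples above v get weight tau, samples below get 1 - tau.
        w = [abs(tau - (1.0 if q - v < 0 else 0.0)) for q in samples]
        v = sum(wi * qi for wi, qi in zip(w, samples)) / sum(w)
    return v


q_samples = [0.0, 1.0, 2.0, 3.0]        # hypothetical Q-values at one state
v_mean = expectile(q_samples, tau=0.5)  # tau = 0.5 gives the ordinary mean
v_high = expectile(q_samples, tau=0.9)  # pulled toward the larger Q-values
```

With `tau` above 0.5, positive errors $Q-V$ are penalized more heavily than negative ones, so the fitted value moves from the mean toward the best in-dataset actions without ever querying actions outside the data.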
DPO loss:
$$ \mathcal{L}_ {\mathrm{DPO}}=- \mathbb{E} \left[ \log\sigma\left( \beta \left[ \log\frac{\pi_ \theta(y_ w\mid x)}{\pi_ {\mathrm{ref}}(y_ w\mid x)} - \log\frac{\pi_ \theta(y_ l\mid x)}{\pi_ {\mathrm{ref}}(y_ l\mid x)} \right] \right) \right]. $$
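A minimal sketch of the loss for a single preference pair, assuming the four sequence log-probabilities have already been computed. The function names and the default `beta=0.1` are illustrative assumptions; a real implementation would average over a batch.

```python
import math


def log_sigmoid(x):
    """Numerically stable log(sigma(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """-log sigma(beta * [(log-ratio of winner) - (log-ratio of loser)])."""
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -log_sigmoid(margin)


# When the policy equals the reference, the margin is 0 and the loss is log 2.
at_init = dpo_loss(-1.0, -2.0, -1.0, -2.0)
# Raising the preferred response's log-probability lowers the loss.
improved = dpo_loss(-0.5, -2.0, -1.0, -2.0)
```

The loss depends only on log-ratios against the reference policy, so it penalizes the policy for preferring $y_ l$ more than the reference does, not for the absolute likelihood of either response.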
Meta-RL adaptation policy:
$$ a\sim\pi_ \theta(\cdot\mid s,\mathcal{D}_ {\mathrm{train}}), \qquad \mathcal{D}_ {\mathrm{train}} =\{(s_ t,a_ t,r_ t,s_ {t+1})\}_ {t=1}^k. $$
Black-box meta-RL memory:
$$ h_ t=f_ \theta(h_ {t-1},s_ t,a_ {t-1},r_ {t-1}), \qquad a_ t\sim\pi_ \theta(\cdot\mid s_ t,h_ t). $$
Hierarchical policy:
$$ g_ t\sim\pi_ {\mathrm{HL}}(\cdot\mid o_ t,c), \qquad a_ t\sim\pi_ {\mathrm{LL}}(\cdot\mid o_ t,g_ t). $$
Domain randomization objective:
$$ \max_ \theta \mathbb{E}_ {e\sim p(e)} [J_ {\mathrm{sim}}(\pi_ \theta;e)], \qquad x_ {t+1}=f_ {\mathrm{sim}}(x_ t,u_ t,e). $$
Bellman operators:
$$ (\mathcal{B}^\pi V)(s)= \mathbb{E}_ {a\sim\pi,s'\sim p}[r(s,a)+\gamma V(s')], $$
$$ (\mathcal{B}^*Q)(s,a)= r(s,a)+\gamma\mathbb{E}_ {s'\sim p}[\max_ {a'}Q(s',a')]. $$
Data Regimes
The course repeatedly compares algorithms by data regime:
- Offline imitation learning: fixed expert dataset, no reward needed.
- Online imitation learning: expert interventions or labels after rollout.
- On-policy RL: use only current policy data.
- Lightly off-policy RL: reuse one recent batch for several gradient updates.
- Fully off-policy RL: replay buffer from many past policies.
- Offline RL: fixed reward-labeled dataset, no new policy data.
More online data usually permits more aggressive improvement. Less online data requires stronger conservatism.
Method Comparison Table
| Method | Data regime | Learned objects | Main update | Main failure mode |
|---|---|---|---|---|
| BC | Offline demos | Policy | Supervised likelihood | Covariate shift, multimodal averaging |
| DAgger | Online expert labels | Policy | Aggregate learner-state labels | Requires expert intervention |
| REINFORCE | On-policy RL | Policy | Score-function gradient | High variance |
| Actor-critic | Usually on-policy or lightly off-policy | Policy and value/critic | PG with learned advantage | Critic bias/instability |
| TRPO | Recent on-policy batches | Policy and value | KL-constrained surrogate | Harder implementation, conservative updates |
| PPO | Recent on-policy batches | Policy and value | Clipped/KL-constrained actor update | Data inefficiency |
| SAC | Replay buffer | Policy and soft Q | Entropy-regularized off-policy AC | Tuning and critic extrapolation |
| DQN | Replay buffer | Q-function | Bellman optimality backup | Overestimation, discrete-action bias |
| AWR/AWAC | Offline or replay data | Policy plus value/Q | Advantage-weighted BC | Weak or noisy advantage estimates |
| IQL | Fixed reward dataset | V, Q, policy | Expectile value + weighted BC | Depends on data support and expectile tuning |
| CQL | Fixed reward dataset | Conservative Q and policy | Penalize unsupported high Q | Too much pessimism |
| Model-based RL | Real plus model data or planning | Dynamics, often reward/value | Synthetic rollouts or MPC | Model error exploitation |
| Goal-conditioned RL | Multi-goal data | Policy/Q conditioned on goal | Relabeling and off-policy RL | Invalid relabeling, sparse exploration |
| Black-box meta-RL | Many training tasks, few-shot test tasks | Memory/context-conditioned policy | Optimize adaptation over task distribution | Hard exploration, poor task-distribution generalization |
| PEARL-style meta-RL | Many training tasks, replay data | Latent task posterior and policy | Infer $z$ from context and act with $\pi(a\mid s,z)$ | Posterior sampling may not gather task-identifying information |
| Hierarchical IL/RL | Demonstrations or RL with subgoals | High-level and low-level policies | Predict subgoals, then actions conditioned on subgoals | Level mismatch, bad termination/replanning |
| Sim2Real RL | Simulated plus sometimes real data | Robot policy, sometimes adaptation/residual models | Train in randomized/calibrated simulation | Simulator mismatch and reward misspecification |
What Each Learned Object Does
- Policy $\pi(a\mid s)$: chooses actions.
- Value $V(s)$: estimates how good a state is under a policy.
- Q-function $Q(s,a)$: estimates how good an action is in a state.
- Advantage $A(s,a)$: estimates whether an action is better than expected.
- Dynamics model $p(s'\mid s,a)$: predicts the next state.
- Reward model $r(s,a)$ or $r(x,y)$: predicts task success or human preference.
Most deep RL algorithms are combinations of these learned objects.
Bellman and DP Checklist
Exact dynamic programming assumes known $p(s'\mid s,a)$ and tractable state/action spaces.
- Policy evaluation: repeatedly apply $\mathcal{B}^\pi$ until convergence.
- Policy improvement: set $\pi_ {\mathrm{new}}(s)\in\arg\max_ a Q^\pi(s,a)$.
- Policy iteration: alternate evaluation and improvement.
- Value iteration: repeatedly apply the optimality backup.
The contraction property
$$ \|\mathcal{B}V-\mathcal{B}U\|_ \infty\le\gamma\|V-U\|_ \infty $$
is why tabular backups converge. Deep RL is harder because function approximation couples errors across states, samples replace exact expectations, and bootstrapped targets move during training.
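The contraction and the resulting convergence can be checked on a tiny tabular MDP. The sketch below uses the state-value form of the optimality backup, $(\mathcal{B}V)(s)=\max_ a[r(s,a)+\gamma\mathbb{E}_ {s'}V(s')]$; the two-state, two-action transition and reward tables are made-up numbers for illustration.

```python
def bellman_optimality_backup(V, P, R, gamma):
    """One exact backup: (B V)(s) = max_a [R[s][a] + gamma * sum_s' P[s][a][s'] * V[s']]."""
    return [
        max(
            R[s][a] + gamma * sum(p * V[s2] for s2, p in enumerate(P[s][a]))
            for a in range(len(R[s]))
        )
        for s in range(len(V))
    ]


def sup_norm(V, U):
    return max(abs(v - u) for v, u in zip(V, U))


# Hypothetical MDP: P[s][a] is a distribution over next states.
P = [
    [[0.9, 0.1], [0.2, 0.8]],
    [[0.5, 0.5], [0.0, 1.0]],
]
R = [[0.0, 1.0], [2.0, 0.5]]
gamma = 0.9

# Contraction: one backup shrinks the sup-norm gap by at least a factor gamma.
V0, U0 = [0.0, 0.0], [10.0, -5.0]
gap_before = sup_norm(V0, U0)
gap_after = sup_norm(
    bellman_optimality_backup(V0, P, R, gamma),
    bellman_optimality_backup(U0, P, R, gamma),
)

# Value iteration: repeated backups converge to a fixed point.
V = [0.0, 0.0]
for _ in range(500):
    V = bellman_optimality_backup(V, P, R, gamma)
residual = sup_norm(bellman_optimality_backup(V, P, R, gamma), V)
```

Everything here relies on exact expectations over a known `P`; replacing them with samples and a function approximator is exactly where the guarantee stops applying.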
Bias, Variance, and Distribution Shift
Many exam questions can be solved by identifying which side of these tradeoffs an algorithm chooses:
| Choice | Lower variance | Lower bias |
|---|---|---|
| Return estimator | TD target | Monte Carlo return |
| Policy gradient | Baseline/advantage | Raw return remains unbiased but noisy |
| RL data | Off-policy replay | Fresh on-policy data |
| Offline RL | Conservative policy/value | Aggressive policy improvement |
| Model rollouts | Short rollouts from real states | Long rollouts if model is accurate |
Distribution shift appears in different forms:
- State shift: learned policy visits states absent from demonstrations.
- Action shift: offline/off-policy actor queries Q-values for actions absent from data.
- Model shift: planner rolls learned dynamics into unreal states.
- Reward shift: RL discovers states or outputs where the learned reward is miscalibrated.
- Task shift: meta-test tasks differ from the meta-training task distribution.
- Level shift: a high-level policy outputs subgoals outside the low-level policy's training support.
- Sim2Real shift: real dynamics, sensors, actuators, contacts, or delays differ from simulation.
Meta-RL, Hierarchy, and Sim2Real Checklist
Meta-RL:
- What is the task distribution $p(\mathcal{T})$?
- What data is available at meta-test time: full episodes, a few transitions, rewards, demonstrations, or context only?
- Does the method learn a memory state, a latent task posterior, or an explicit update rule?
- Is exploration trained only through downstream return, or through an auxiliary task-inference objective?
Hierarchy:
- What is the high-level action $g$: language, image, state goal, option ID, or latent skill?
- Is $g$ expressive enough but still achievable by the low-level policy?
- Are the two levels trained on compatible distributions?
- Is replanning event-based, fixed-interval, or learned?
Sim2Real:
- Which mismatch matters most: dynamics parameters, missing physics, sensing, actuation, latency, perception, or reward?
- Does the method use robustness, adaptation, calibration, residual modeling, or retargeting?
- Which signals are privileged during simulation but unavailable at deployment?
- Does the deployed actor depend only on real-world available observations?
Probability Tools from the TA Notes
Total variation distance between distributions $\mu$ and $\nu$ is
$$ D_ {\mathrm{TV}}(\mu,\nu)= \sup_ A |\mu(A)-\nu(A)| =\frac{1}{2}\|\mu-\nu\|_ 1 $$
in the discrete case. It measures the largest possible difference in probability assigned to the same event.
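A coupling of $\mu$ and $\nu$ is a joint distribution over $(X,Y)$ whose marginals are $\mu$ and $\nu$. Coupling is useful for policy-change arguments because it lets us reason about the probability that two policies take different actions under a shared random draw. A common bound: if the two action distributions are within $\epsilon$ in total variation at every state, a maximal coupling makes the policies pick the same action with probability at least $1-\epsilon$ per step, so the trajectories stay identical through $t$ steps with probability at least $(1-\epsilon)^t$. Only after they decouple can the state distributions diverge, which bounds the state-distribution gap at step $t$ by $1-(1-\epsilon)^t\le\epsilon t$ in total variation.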
This is the intuition behind trust regions: small per-state policy changes limit how much the state distribution can change, which keeps old-policy advantage estimates meaningful for the new policy.
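A small numerical check of these tools, as a sketch under stated assumptions. It computes total variation both as half the $\ell_ 1$ distance and as a brute-force maximum over events, then evaluates the standard coupling bound (assumed here, with per-state total variation at most `eps`): trajectories stay coupled for `t` steps with probability at least `(1 - eps) ** t`. The example distributions are made up.

```python
from itertools import combinations


def tv_half_l1(mu, nu):
    """Total variation as half the L1 distance (discrete case)."""
    return 0.5 * sum(abs(m - n) for m, n in zip(mu, nu))


def tv_by_events(mu, nu):
    """Total variation as the max over all events A of |mu(A) - nu(A)|."""
    idx = range(len(mu))
    best = 0.0
    for k in range(len(mu) + 1):
        for event in combinations(idx, k):
            diff = abs(sum(mu[i] for i in event) - sum(nu[i] for i in event))
            best = max(best, diff)
    return best


def state_tv_upper_bound(eps, t):
    """Bound on the state-distribution gap after t steps: 1 - (1 - eps)^t."""
    return 1.0 - (1.0 - eps) ** t


mu = [0.5, 0.25, 0.25]   # hypothetical old-policy action distribution
nu = [0.25, 0.25, 0.5]   # hypothetical new-policy action distribution
eps = tv_half_l1(mu, nu)
bound_5 = state_tv_upper_bound(eps, t=5)
```

The bound grows with the horizon `t`, which is why a trust region has to keep the per-state change small when old-policy advantages are reused over long trajectories.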
Common Failure Modes
- Behavior cloning: covariate shift and action averaging.
- Policy gradients: high variance and data inefficiency.
- Actor-critic: biased or unstable critic misleads policy.
- Q-learning: moving targets, overestimation, poor action coverage.
- Offline RL: OOD action exploitation and value overestimation.
- Reward learning: reward hacking and classifier/reward-model exploitation.
- Model-based RL: compounding model error and model exploitation.
- Multi-task RL: negative transfer and data imbalance.
- Meta-RL: learning to explore and learning to exploit are coupled, so sparse rewards can create poor local optima.
- Hierarchical policies: the high-level policy can issue subgoals that the low-level policy cannot reliably complete.
- Sim2Real: a policy can exploit simulator details that are absent or different on the real robot.
Practical Algorithm Map
- Start with behavior cloning if demonstrations exist.
- Use DAgger or intervention data if BC fails due to compounding errors and expert online feedback is feasible.
- Use PPO when online data is cheap enough and stability matters.
- Use SAC when online data is expensive but tuning is acceptable.
- Use DQN-style Q-learning for discrete action spaces.
- Use offline RL when you only have a static dataset with rewards.
- Use reward learning when the reward is hard to write down but preferences or success examples are available.
- Use model-based RL when dynamics are easier to learn than the policy or when test-time planning is valuable.
- Use goal-conditioned RL/HER when many tasks can be expressed as reaching goals.
- Use meta-RL when deployment requires fast adaptation from a few examples drawn from a known task family.
- Use hierarchy when long-horizon tasks have reusable subtasks or when high-level reasoning and low-level control need different time scales.
- Use sim2real RL when real robot data is expensive but the simulator can be randomized, calibrated, or adapted enough for deployment.
Common Exam Traps
- "Off-policy" does not mean "works with arbitrary data." You still need coverage of useful state-action pairs.
- A baseline in policy gradients must not depend on the sampled action. If it does, the estimator can become biased.
- $V^\pi$ and $Q^\pi$ evaluate a fixed policy. $V^*$ and $Q^*$ assume optimal future behavior.
- Q-learning's max is over the learned Q-function, not over the behavior policy's sampled next action.
- PPO is often called on-policy even though it reuses a recent batch for several updates; the updates are constrained so the data is still approximately on-policy.
- Offline RL is not just "run SAC on a fixed replay buffer"; the actor and critic can exploit unsupported actions without conservatism.
- A learned model that predicts one step well may still be poor for long-horizon planning.
- A learned reward that classifies examples well may still be exploitable by an RL policy.
- Meta-RL is not ordinary multi-task RL with a hidden task ID; the test-time policy must infer the task from experience.
- A hierarchical policy is only useful if the subgoal interface is actually used and both levels are trained on compatible distributions.
- Domain randomization does not make the real world disappear; it only helps if the real deployment lies inside the randomized training envelope.
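
The baseline trap above can be verified exactly for a softmax policy over a few actions, where the score with respect to logit $j$ is $\mathbf{1}\{a=j\}-\pi(j)$. The sketch below computes $\mathbb{E}_ {a\sim\pi}[\nabla\log\pi(a)\,b]$ by summing over actions; the logits and the two baseline functions are hypothetical.

```python
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def expected_baseline_term(logits, baseline):
    """Exact E_{a~pi}[grad_logits log pi(a) * baseline(a)] for a softmax policy."""
    pi = softmax(logits)
    n = len(pi)
    grad = [0.0] * n
    for a in range(n):
        for j in range(n):
            score = (1.0 if a == j else 0.0) - pi[j]  # d log pi(a) / d logit_j
            grad[j] += pi[a] * score * baseline(a)
    return grad


logits = [0.2, -1.0, 0.7]
# Baseline that ignores the action: the expected term vanishes, so no bias.
state_only = expected_baseline_term(logits, lambda a: 3.0)
# Baseline that depends on the sampled action: the expected term is nonzero.
action_dep = expected_baseline_term(logits, lambda a: float(a))
```

A nonzero expected term means subtracting that baseline shifts the gradient estimate itself, not just its variance.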