12. Cross-Cutting Study Guide

Shurui Liu

Core Formula Sheet

Trajectory probability:

$$ \tau=(s_ 1,a_ 1,\ldots,s_ T,a_ T,s_ {T+1}), \qquad p_ \theta(\tau)=\rho_ 0(s_ 1)\prod_ {t=1}^{T} \pi_ \theta(a_ t\mid s_ t)p(s_ {t+1}\mid s_ t,a_ t). $$

Return and reward-to-go:

$$ R(\tau)=\sum_ {t=1}^{T} \gamma^{t-1}r_ t, \qquad G_ t=\sum_ {k=0}^{T-t}\gamma^k r_ {t+k}. $$
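As a minimal sketch of the reward-to-go formula, assuming one finite episode stored as a NumPy array: the function name `reward_to_go` is illustrative, and the array is 0-indexed, so entry `t` plays the role of $r_ {t+1}$.

```python
import numpy as np

def reward_to_go(rewards, gamma):
    """Return G_t = sum_k gamma^k * r_{t+k} for every step of one episode."""
    rewards = np.asarray(rewards, dtype=float)
    out = np.zeros_like(rewards)
    running = 0.0
    # Sweep backwards so that G_t = r_t + gamma * G_{t+1}.
    for t in range(len(rewards) - 1, -1, -1):
        running = rewards[t] + gamma * running
        out[t] = running
    return out

rtg = reward_to_go([1.0, 0.0, 2.0], gamma=0.5)
```

The first entry of `rtg` equals the full discounted return $R(\tau)$, since the reward-to-go from the first step covers the whole episode.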

Policy-gradient estimator:

$$ \nabla_ \theta J(\theta) \approx \frac{1}{N}\sum_ {i,t} \nabla_ \theta\log\pi_ \theta(a_ {i,t}\mid s_ {i,t})\hat{A}_ {i,t}. $$

TD residual and one-step advantage:

$$ \delta_ t=r_ t+\gamma V(s_ {t+1})-V(s_ t). $$

Q-learning target:

$$ y=r+\gamma\max_ {a'}Q_ {\bar{\phi}}(s',a'). $$
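A minimal batched sketch of this target, assuming $\bar{\phi}$ denotes lagged target-network parameters (the usual reading). The `dones` mask, which switches off bootstrapping at terminal transitions, and the array shapes are assumptions of this sketch, not part of the formula above.

```python
import numpy as np

def q_learning_targets(rewards, next_q_target, dones, gamma):
    """y = r + gamma * max_a' Q_target(s', a'), with no bootstrap at terminals.

    next_q_target has shape (batch, num_actions): the target network's
    Q-values at s'. The max is over actions, not the sampled next action.
    """
    rewards = np.asarray(rewards, dtype=float)
    dones = np.asarray(dones, dtype=float)
    best_next = np.asarray(next_q_target, dtype=float).max(axis=1)
    return rewards + gamma * (1.0 - dones) * best_next

y = q_learning_targets([1.0, 1.0], [[0.0, 2.0], [3.0, 1.0]], [0, 1], gamma=0.5)
```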

PPO ratio and clipped objective:

$$ \rho_ t(\theta)= \frac{\pi_ \theta(a_ t\mid s_ t)} {\pi_ {\mathrm{old}}(a_ t\mid s_ t)}, $$

$$ L^{\mathrm{PPO}}= \mathbb{E}\left[ \min(\rho_ t\hat{A}_ t, \operatorname{clip}(\rho_ t,1-\epsilon,1+\epsilon)\hat{A}_ t) \right]. $$
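The clipped objective can be sketched in a few lines of NumPy. This assumes per-sample log-probabilities and advantage estimates are already available as arrays; the function name and the default `eps=0.2` are illustrative choices, and the value returned is the quantity to maximize.

```python
import numpy as np

def ppo_clip_objective(logp_new, logp_old, adv, eps=0.2):
    """Sample mean of min(rho * A, clip(rho, 1 - eps, 1 + eps) * A)."""
    ratio = np.exp(np.asarray(logp_new, dtype=float) - np.asarray(logp_old, dtype=float))
    adv = np.asarray(adv, dtype=float)
    unclipped = ratio * adv
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * adv
    # The elementwise min removes the incentive to push the ratio far
    # from 1 in the direction that would increase the objective.
    return np.mean(np.minimum(unclipped, clipped))

gain_capped = ppo_clip_objective(np.log([1.5]), [0.0], [1.0])     # ratio 1.5, A > 0
penalty_kept = ppo_clip_objective(np.log([1.5]), [0.0], [-1.0])   # ratio 1.5, A < 0
```

With a positive advantage the gain is capped at $(1+\epsilon)\hat{A}_ t$; with a negative advantage and a ratio above 1 the unclipped (more pessimistic) term is kept.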

SAC soft actor objective:

$$ \max_ \theta \mathbb{E}_ {s\sim\mathcal{D},a\sim\pi_ \theta} [Q(s,a)-\alpha\log\pi_ \theta(a\mid s)]. $$

IQL policy extraction:

$$ \max_ \theta \mathbb{E}_ {(s,a)\sim\mathcal{D}} \left[ \log\pi_ \theta(a\mid s) \exp\left(\frac{Q(s,a)-V(s)}{\alpha}\right) \right]. $$

Preference reward model:

$$ \Pr(\tau_ a\succ\tau_ b)= \sigma(r_ \theta(\tau_ a)-r_ \theta(\tau_ b)). $$
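A small sketch of this preference model, assuming each trajectory has already been scored by the reward model. The negative log-likelihood below is the standard maximum-likelihood training loss for this model; the formula sheet does not spell it out, so treat it as an assumption, along with the function names.

```python
import numpy as np

def preference_prob(r_a, r_b):
    """Probability that trajectory a is preferred over b: sigmoid(r_a - r_b)."""
    return 1.0 / (1.0 + np.exp(-(r_a - r_b)))

def preference_nll(r_preferred, r_rejected):
    """Mean negative log-likelihood of the observed preference labels."""
    diff = np.asarray(r_preferred, dtype=float) - np.asarray(r_rejected, dtype=float)
    # -log(sigmoid(diff)) = log(1 + exp(-diff)), computed stably.
    return np.mean(np.logaddexp(0.0, -diff))
```

Only the reward difference matters, so adding a constant to every reward leaves the predicted preferences unchanged.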

Policy-gradient theorem:

$$ \nabla_ \theta J(\theta)= \frac{1}{1-\gamma} \mathbb{E}_ {s\sim d^{\pi_ \theta},a\sim\pi_ \theta} [\nabla_ \theta\log\pi_ \theta(a\mid s)Q^{\pi_ \theta}(s,a)]. $$

GAE:

$$ \hat{A}^{\mathrm{GAE}}_ t= \sum_ {l=0}^{\infty}(\gamma\lambda)^l\delta_ {t+l}, \qquad \delta_ t=r_ t+\gamma V(s_ {t+1})-V(s_ t). $$
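In practice the infinite sum is truncated at the end of the collected episode. The sketch below assumes a single episode and a `values` array with one extra bootstrap entry for the final state (use 0.0 if it is terminal); both conventions are assumptions of this example.

```python
import numpy as np

def gae(rewards, values, gamma, lam):
    """Truncated GAE for one episode.

    `values` has len(rewards) + 1 entries; the last one is the
    bootstrap value of the state after the final reward.
    """
    rewards = np.asarray(rewards, dtype=float)
    values = np.asarray(values, dtype=float)
    deltas = rewards + gamma * values[1:] - values[:-1]  # TD residuals
    adv = np.zeros_like(rewards)
    running = 0.0
    # Backward recursion: A_t = delta_t + gamma * lam * A_{t+1}.
    for t in range(len(rewards) - 1, -1, -1):
        running = deltas[t] + gamma * lam * running
        adv[t] = running
    return adv

adv_td = gae([1.0, 1.0], [0.5, 0.5, 0.0], gamma=1.0, lam=0.0)
adv_mc = gae([1.0, 1.0], [0.5, 0.5, 0.0], gamma=1.0, lam=1.0)
```

Setting `lam=0` recovers the one-step TD residual, while `lam=1` recovers the Monte Carlo return minus the value baseline, which is the bias-variance dial that $\lambda$ controls.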

TRPO-style constrained surrogate:

$$ \max_ \theta \mathbb{E}\left[ \frac{\pi_ \theta(a_ t\mid s_ t)} {\pi_ {\mathrm{old}}(a_ t\mid s_ t)} \hat{A}_ t \right] \quad \text{s.t.}\quad \mathbb{E}_ s[D_ {\mathrm{KL}}(\pi_ {\mathrm{old}}\,\|\,\pi_ \theta)]\le\delta. $$

IQL expectile/value and Q losses:

$$ L_ \tau(u)=|\tau-\mathbf{1}\{u<0\}|\,u^2, $$

$$ \min_ \psi \mathbb{E}_ {(s,a)\sim\mathcal{D}} [L_ \tau(Q_ \phi(s,a)-V_ \psi(s))], \qquad \min_ \phi \mathbb{E}[(Q_ \phi(s,a)-r-\gamma V_ \psi(s'))^2]. $$
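The asymmetric weighting is easier to see numerically. This sketch implements the expectile loss and then, on a hypothetical two-sample example (Q-values 0 and 1 at the same state), grid-searches for the scalar value that minimizes it; the toy data and grid are made up for illustration.

```python
import numpy as np

def expectile_loss(u, tau):
    """|tau - 1{u < 0}| * u^2, applied elementwise."""
    u = np.asarray(u, dtype=float)
    weight = np.abs(tau - (u < 0).astype(float))
    return weight * u ** 2

# Toy check: which V minimizes the loss when Q takes values 0 and 1?
q_samples = np.array([0.0, 1.0])
grid = np.linspace(0.0, 1.0, 101)
losses = [expectile_loss(q_samples - v, tau=0.9).mean() for v in grid]
v_star = grid[int(np.argmin(losses))]
```

With $\tau=0.9$ the minimizer lands near 0.9, well above the mean of 0.5: larger $\tau$ pushes $V$ toward the best Q-values that appear in the data, without ever querying actions outside the dataset.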

DPO loss:

$$ \mathcal{L}_ {\mathrm{DPO}}=- \mathbb{E} \left[ \log\sigma\left( \beta \left[ \log\frac{\pi_ \theta(y_ w\mid x)}{\pi_ {\mathrm{ref}}(y_ w\mid x)} - \log\frac{\pi_ \theta(y_ l\mid x)}{\pi_ {\mathrm{ref}}(y_ l\mid x)} \right] \right) \right]. $$
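A minimal NumPy sketch of this loss, assuming the summed log-probabilities of each full response under the policy and the reference model have already been computed; the argument names and the default `beta=0.1` are illustrative.

```python
import numpy as np

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """-log sigmoid(beta * (policy log-ratio margin between winner and loser))."""
    logp_w, logp_l = np.asarray(logp_w, dtype=float), np.asarray(logp_l, dtype=float)
    ref_logp_w, ref_logp_l = np.asarray(ref_logp_w, dtype=float), np.asarray(ref_logp_l, dtype=float)
    margin = (logp_w - ref_logp_w) - (logp_l - ref_logp_l)
    # -log(sigmoid(x)) = log(1 + exp(-x)), computed stably.
    return np.mean(np.logaddexp(0.0, -beta * margin))

loss_at_ref = dpo_loss([-5.0], [-7.0], [-5.0], [-7.0])   # policy equals reference
loss_better = dpo_loss([-4.0], [-8.0], [-5.0], [-7.0])   # winner up, loser down
```

When the policy equals the reference the margin is zero and the loss is $\log 2$; raising the preferred response's likelihood relative to the reference lowers it.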

Meta-RL adaptation policy:

$$ a\sim\pi_ \theta(\cdot\mid s,\mathcal{D}_ {\mathrm{train}}), \qquad \mathcal{D}_ {\mathrm{train}} =\{(s_ t,a_ t,r_ t,s_ {t+1})\}_ {t=1}^k. $$

Black-box meta-RL memory:

$$ h_ t=f_ \theta(h_ {t-1},s_ t,a_ {t-1},r_ {t-1}), \qquad a_ t\sim\pi_ \theta(\cdot\mid s_ t,h_ t). $$

Hierarchical policy:

$$ g_ t\sim\pi_ {\mathrm{HL}}(\cdot\mid o_ t,c), \qquad a_ t\sim\pi_ {\mathrm{LL}}(\cdot\mid o_ t,g_ t). $$

Domain randomization objective:

$$ \max_ \theta \mathbb{E}_ {e\sim p(e)} [J_ {\mathrm{sim}}(\pi_ \theta;e)], \qquad x_ {t+1}=f_ {\mathrm{sim}}(x_ t,u_ t,e). $$

Bellman operators:

$$ (\mathcal{B}^\pi V)(s)= \mathbb{E}_ {a\sim\pi,s'\sim p}[r(s,a)+\gamma V(s')], $$

$$ (\mathcal{B}^*Q)(s,a)= r(s,a)+\gamma\mathbb{E}_ {s'\sim p}[\max_ {a'}Q(s',a')]. $$

Data Regimes

The course repeatedly compares algorithms by data regime:

  • Offline imitation learning: fixed expert dataset, no reward needed.
  • Online imitation learning: expert interventions or labels after rollout.
  • On-policy RL: use only current policy data.
  • Lightly off-policy RL: reuse one recent batch for multiple updates.
  • Fully off-policy RL: replay buffer from many past policies.
  • Offline RL: fixed reward-labeled dataset, no new policy data.

More online data usually permits more aggressive improvement. Less online data requires stronger conservatism.

Method Comparison Table

| Method | Data regime | Learned objects | Main update | Main failure mode |
|---|---|---|---|---|
| BC | Offline demos | Policy | Supervised likelihood | Covariate shift, multimodal averaging |
| DAgger | Online expert labels | Policy | Aggregate learner-state labels | Requires expert intervention |
| REINFORCE | On-policy RL | Policy | Score-function gradient | High variance |
| Actor-critic | Usually on-policy or lightly off-policy | Policy and value/critic | PG with learned advantage | Critic bias/instability |
| TRPO | Recent on-policy batches | Policy and value | KL-constrained surrogate | Harder implementation, conservative updates |
| PPO | Recent on-policy batches | Policy and value | Clipped/KL-constrained actor update | Data inefficiency |
| SAC | Replay buffer | Policy and soft Q | Entropy-regularized off-policy AC | Tuning and critic extrapolation |
| DQN | Replay buffer | Q-function | Bellman optimality backup | Overestimation, discrete-action bias |
| AWR/AWAC | Offline or replay data | Policy plus value/Q | Advantage-weighted BC | Weak or noisy advantage estimates |
| IQL | Fixed reward dataset | V, Q, policy | Expectile value + weighted BC | Depends on data support and expectile tuning |
| CQL | Fixed reward dataset | Conservative Q and policy | Penalize unsupported high Q | Too much pessimism |
| Model-based RL | Real plus model data or planning | Dynamics, often reward/value | Synthetic rollouts or MPC | Model error exploitation |
| Goal-conditioned RL | Multi-goal data | Policy/Q conditioned on goal | Relabeling and off-policy RL | Invalid relabeling, sparse exploration |
| Black-box meta-RL | Many training tasks, few-shot test tasks | Memory/context-conditioned policy | Optimize adaptation over task distribution | Hard exploration, poor task-distribution generalization |
| PEARL-style meta-RL | Many training tasks, replay data | Latent task posterior and policy | Infer $z$ from context and act with $\pi(a\mid s,z)$ | Posterior sampling may not gather task-identifying information |
| Hierarchical IL/RL | Demonstrations or RL with subgoals | High-level and low-level policies | Predict subgoals, then actions conditioned on subgoals | Level mismatch, bad termination/replanning |
| Sim2Real RL | Simulated plus sometimes real data | Robot policy, sometimes adaptation/residual models | Train in randomized/calibrated simulation | Simulator mismatch and reward misspecification |

What Each Learned Object Does

  • Policy $\pi(a\mid s)$: chooses actions.
  • Value $V(s)$: estimates how good a state is under a policy.
  • Q-function $Q(s,a)$: estimates how good an action is in a state.
  • Advantage $A(s,a)$: estimates whether an action is better than expected.
  • Dynamics model $p(s'\mid s,a)$: predicts the next state.
  • Reward model $r(s,a)$ or $r(x,y)$: predicts task success or human preference.

Most deep RL algorithms are combinations of these learned objects.

Bellman and DP Checklist

Exact dynamic programming assumes known $p(s'\mid s,a)$ and tractable state/action spaces.

  • Policy evaluation: repeatedly apply $\mathcal{B}^\pi$ until convergence.
  • Policy improvement: set $\pi_ {\mathrm{new}}(s)\in\arg\max_ a Q^\pi(s,a)$.
  • Policy iteration: alternate evaluation and improvement.
  • Value iteration: repeatedly apply the optimality backup.

The contraction property

$$ \|\mathcal{B}V-\mathcal{B}U\|_ \infty\le\gamma\|V-U\|_ \infty $$

is why tabular backups converge. Deep RL is harder because function approximation couples errors across states, samples replace exact expectations, and bootstrapped targets move during training.
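The contraction and the resulting convergence can be checked on a tiny tabular MDP. The two-state, two-action transition and reward numbers below are invented for illustration, and the backup is written in its state-value form, $V(s)\leftarrow\max_ a[r(s,a)+\gamma\mathbb{E}_ {s'}V(s')]$.

```python
import numpy as np

gamma = 0.9
# P[a, s, s'] are transition probabilities, R[s, a] are rewards (toy numbers).
P = np.array([[[0.8, 0.2], [0.1, 0.9]],
              [[0.5, 0.5], [0.6, 0.4]]])
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])

def optimality_backup(V):
    # Q[s, a] = R[s, a] + gamma * sum_s' P[a, s, s'] * V[s']
    Q = R + gamma * np.einsum("ast,t->sa", P, V)
    return Q.max(axis=1)

# One backup shrinks the sup-norm gap between two value tables by at least gamma.
V, U = np.zeros(2), np.array([5.0, -3.0])
gap_before = np.max(np.abs(V - U))
gap_after = np.max(np.abs(optimality_backup(V) - optimality_backup(U)))

# Value iteration: repeated backups approach the fixed point.
for _ in range(500):
    V = optimality_backup(V)
residual = np.max(np.abs(optimality_backup(V) - V))
```

Here `gap_after` is at most `gamma * gap_before`, and `residual` shrinks to floating-point noise, which is the tabular guarantee that function approximation gives up.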

Bias, Variance, and Distribution Shift

Many exam questions can be solved by identifying which side of these tradeoffs an algorithm chooses:

| Choice | Lower variance | Lower bias |
|---|---|---|
| Return estimator | TD target | Monte Carlo return |
| Policy gradient | Baseline/advantage | Raw return remains unbiased but noisy |
| RL data | Off-policy replay | Fresh on-policy data |
| Offline RL | Conservative policy/value | Aggressive policy improvement |
| Model rollouts | Short rollouts from real states | Long rollouts if model is accurate |

Distribution shift appears in different forms:

  • State shift: learned policy visits states absent from demonstrations.
  • Action shift: offline/off-policy actor queries Q-values for actions absent from data.
  • Model shift: planner rolls learned dynamics into unreal states.
  • Reward shift: RL discovers states or outputs where the learned reward is miscalibrated.
  • Task shift: meta-test tasks differ from the meta-training task distribution.
  • Level shift: a high-level policy outputs subgoals outside the low-level policy's training support.
  • Sim2Real shift: real dynamics, sensors, actuators, contacts, or delays differ from simulation.

Meta-RL, Hierarchy, and Sim2Real Checklist

Meta-RL:

  • What is the task distribution $p(\mathcal{T})$?
  • What data is available at meta-test time: full episodes, a few transitions, rewards, demonstrations, or context only?
  • Does the method learn a memory state, a latent task posterior, or an explicit update rule?
  • Is exploration trained only through downstream return, or through an auxiliary task-inference objective?

Hierarchy:

  • What is the high-level action $g$: language, image, state goal, option ID, or latent skill?
  • Is $g$ expressive enough but still achievable by the low-level policy?
  • Are the two levels trained on compatible distributions?
  • Is replanning event-based, fixed-interval, or learned?

Sim2Real:

  • Which mismatch matters most: dynamics parameters, missing physics, sensing, actuation, latency, perception, or reward?
  • Does the method use robustness, adaptation, calibration, residual modeling, or retargeting?
  • Which signals are privileged during simulation but unavailable at deployment?
  • Does the deployed actor depend only on real-world available observations?

Probability Tools from the TA Notes

Total variation distance between distributions $\mu$ and $\nu$ is

$$ D_ {\mathrm{TV}}(\mu,\nu)= \sup_ A |\mu(A)-\nu(A)| =\frac{1}{2}\|\mu-\nu\|_ 1 $$

in the discrete case. It measures the largest possible difference in probability assigned to the same event.
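Both forms of the definition can be compared directly on a small discrete example. The two distributions below are made up, and the brute-force search over events is only feasible for tiny supports.

```python
import numpy as np
from itertools import chain, combinations

def tv_distance(mu, nu):
    """Half the L1 distance between two discrete distributions."""
    return 0.5 * np.sum(np.abs(np.asarray(mu, dtype=float) - np.asarray(nu, dtype=float)))

def tv_by_events(mu, nu):
    """Largest |mu(A) - nu(A)| over all events A, by enumeration."""
    n = len(mu)
    events = chain.from_iterable(combinations(range(n), k) for k in range(n + 1))
    return max(abs(sum(mu[i] - nu[i] for i in A)) for A in events)

mu = [0.5, 0.3, 0.2]
nu = [0.2, 0.3, 0.5]
tv_l1 = tv_distance(mu, nu)
tv_sup = tv_by_events(mu, nu)
```

The two values agree; the maximizing event is the set of outcomes where $\mu$ assigns more mass than $\nu$.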

A coupling of $\mu$ and $\nu$ is a joint distribution over $(X,Y)$ whose marginals are $\mu$ and $\nu$. Coupling is useful for policy-change arguments because it lets us reason about the probability that two policies take different actions under a shared random draw. A common argument: if two policies' action distributions are within total variation $\epsilon$ at every state, there is a coupling under which they choose the same action with probability at least $1-\epsilon$ at each step, so when $\epsilon$ is small the two trajectories stay identical for many steps. Once they decouple, the state distributions can diverge.

This is the intuition behind trust regions: small per-state policy changes limit how much the state distribution can change, which keeps old-policy advantage estimates meaningful for the new policy.
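A short worked bound makes this quantitative. It assumes both policies start from the same initial state, share the same dynamics, and satisfy $D_ {\mathrm{TV}}(\pi(\cdot\mid s),\pi'(\cdot\mid s))\le\epsilon$ at every state. The symbols $\pi'$, $\epsilon$, and $p_ t$ are introduced here for illustration, and this $\epsilon$ is unrelated to the PPO clip range. Under the coupling that makes the two policies agree as often as possible,

$$ \Pr(\text{same actions for the first } t \text{ steps})\ge(1-\epsilon)^t, $$

so, writing $p_ t^{\pi}$ for the distribution of the state reached after $t$ actions,

$$ D_ {\mathrm{TV}}(p_ t^{\pi},p_ t^{\pi'})\le 1-(1-\epsilon)^t\le\epsilon t. $$

The state-distribution mismatch therefore grows at most linearly in the horizon, with slope set by the per-state policy change.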

Common Failure Modes

  • Behavior cloning: covariate shift and action averaging.
  • Policy gradients: high variance and data inefficiency.
  • Actor-critic: biased or unstable critic misleads policy.
  • Q-learning: moving targets, overestimation, poor action coverage.
  • Offline RL: OOD action exploitation and value overestimation.
  • Reward learning: reward hacking and classifier/reward-model exploitation.
  • Model-based RL: compounding model error and model exploitation.
  • Multi-task RL: negative transfer and data imbalance.
  • Meta-RL: learning to explore and learning to exploit are coupled, so sparse rewards can create poor local optima.
  • Hierarchical policies: the high-level policy can issue subgoals that the low-level policy cannot reliably complete.
  • Sim2Real: a policy can exploit simulator details that are absent or different on the real robot.

Practical Algorithm Map

  • Start with behavior cloning if demonstrations exist.
  • Use DAgger or intervention data if BC fails due to compounding errors and expert online feedback is feasible.
  • Use PPO when online data is cheap enough and stability matters.
  • Use SAC when online data is expensive but tuning is acceptable.
  • Use DQN-style Q-learning for discrete action spaces.
  • Use offline RL when you only have a static dataset with rewards.
  • Use reward learning when the reward is hard to write down but preferences or success examples are available.
  • Use model-based RL when dynamics are easier to learn than the policy or when test-time planning is valuable.
  • Use goal-conditioned RL/HER when many tasks can be expressed as reaching goals.
  • Use meta-RL when deployment requires fast adaptation from a few examples drawn from a known task family.
  • Use hierarchy when long-horizon tasks have reusable subtasks or when high-level reasoning and low-level control need different time scales.
  • Use sim2real RL when real robot data is expensive but the simulator can be randomized, calibrated, or adapted enough for deployment.

Common Exam Traps

  • "Off-policy" does not mean "works with arbitrary data." You still need coverage of useful state-action pairs.
  • A baseline in policy gradients must not depend on the sampled action. If it does, the estimator can become biased.
  • $V^\pi$ and $Q^\pi$ evaluate a fixed policy. $V^*$ and $Q^*$ assume optimal future behavior.
  • Q-learning's max is over the learned Q-function, not over the behavior policy's sampled next action.
  • PPO is often called on-policy even though it reuses a recent batch for several updates; the updates are constrained so the data is still approximately on-policy.
  • Offline RL is not just "run SAC on a fixed replay buffer"; the actor and critic can exploit unsupported actions without conservatism.
  • A learned model that predicts one step well may still be poor for long-horizon planning.
  • A learned reward that classifies examples well may still be exploitable by an RL policy.
  • Meta-RL is not ordinary multi-task RL with a hidden task ID; the test-time policy must infer the task from experience.
  • A hierarchical policy is only useful if the subgoal interface is actually used and both levels are trained on compatible distributions.
  • Domain randomization does not make the real world disappear; it only helps if the real deployment lies inside the randomized training envelope.
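
The baseline trap above can be verified in one line. A sketch for a discrete action space, where $b(s)$ is any function of the state only:

$$ \mathbb{E}_ {a\sim\pi_ \theta(\cdot\mid s)}[\nabla_ \theta\log\pi_ \theta(a\mid s)\,b(s)] = b(s)\sum_ a\nabla_ \theta\pi_ \theta(a\mid s) = b(s)\,\nabla_ \theta\sum_ a\pi_ \theta(a\mid s) = b(s)\,\nabla_ \theta 1 = 0. $$

If the baseline depended on $a$, it could not be pulled outside the sum, and the extra term would generally be nonzero.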