12. Cross-Cutting Study Guide

12 min · Shurui Liu

Core Formula Sheet

Trajectory probability:

$$ \tau=(s_ 1,a_ 1,\ldots,s_ T,a_ T,s_ {T+1}), \qquad p_ \theta(\tau)=\rho_ 0(s_ 1)\prod_ {t=1}^{T} \pi_ \theta(a_ t\mid s_ t)p(s_ {t+1}\mid s_ t,a_ t). $$

Return and reward-to-go:

$$ R(\tau)=\sum_ {t=1}^{T} \gamma^{t-1}r_ t, \qquad G_ t=\sum_ {k=0}^{T-t}\gamma^k r_ {t+k}. $$
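
The reward-to-go sum can be computed in one backward sweep using the recursion $G_ t=r_ t+\gamma G_ {t+1}$ with $G_ {T+1}=0$. The following is a minimal Python sketch; the function name `reward_to_go` and the 0-indexed list layout are illustrative assumptions, not course code.

```python
def reward_to_go(rewards, gamma):
    """Return [G_1, ..., G_T] for rewards [r_1, ..., r_T] (stored 0-indexed)."""
    returns = [0.0] * len(rewards)
    running = 0.0
    # Sweep backwards: G_t = r_t + gamma * G_{t+1}, with G_{T+1} = 0.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


rewards = [1.0, 0.0, 2.0]
returns = reward_to_go(rewards, gamma=0.5)
print(returns)  # [1.5, 1.0, 2.0]
```

The first entry equals the full discounted return $R(\tau)$, since $G_ 1$ and $R(\tau)$ are the same sum.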

Policy-gradient estimator:

$$ \nabla_ \theta J(\theta) \approx \frac{1}{N}\sum_ {i,t} \nabla_ \theta\log\pi_ \theta(a_ {i,t}\mid s_ {i,t})\hat{A}_ {i,t}. $$

TD residual and one-step advantage:

$$ \delta_ t=r_ t+\gamma V(s_ {t+1})-V(s_ t). $$

Q-learning target:

$$ y=r+\gamma\max_ {a'}Q_ {\bar{\phi}}(s',a'). $$
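
As a small illustration, the target can be written as a function of the reward and the target network's values at the next state. This is a sketch under stated assumptions: the `done` flag for terminal transitions and the function names are additions for illustration, and `sarsa_target` is included only as a contrast that bootstraps from the sampled next action instead of the max.

```python
def q_learning_target(reward, next_q_values, gamma, done=False):
    """y = r + gamma * max_a' Q_target(s', a'); no bootstrap if s' is terminal."""
    if done:
        return reward
    return reward + gamma * max(next_q_values)


def sarsa_target(reward, next_q_values, next_action, gamma, done=False):
    """Contrast: bootstraps from the action actually taken at s'."""
    if done:
        return reward
    return reward + gamma * next_q_values[next_action]


# Hypothetical target-network values Q(s', a') for three discrete actions.
next_q = [0.5, 2.0, -1.0]
print(q_learning_target(1.0, next_q, gamma=0.9))             # about 2.8
print(sarsa_target(1.0, next_q, next_action=0, gamma=0.9))   # about 1.45
```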

PPO ratio and clipped objective:

$$ \rho_ t(\theta)= \frac{\pi_ \theta(a_ t\mid s_ t)} {\pi_ {\mathrm{old}}(a_ t\mid s_ t)}, $$

$$ L^{\mathrm{PPO}}= \mathbb{E}\left[ \min(\rho_ t\hat{A}_ t, \operatorname{clip}(\rho_ t,1-\epsilon,1+\epsilon)\hat{A}_ t) \right]. $$
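
To see which samples the clip actually affects, here is a minimal pure-Python sketch of the clipped objective. It assumes log-probabilities and advantages are already computed; the names `ppo_clip_term` and `ppo_clip_objective` and the default `eps=0.2` are illustrative choices, not values fixed by the text.

```python
import math


def ppo_clip_term(logp_new, logp_old, advantage, eps=0.2):
    """min(rho * A, clip(rho, 1 - eps, 1 + eps) * A) for one sample."""
    ratio = math.exp(logp_new - logp_old)
    clipped = min(max(ratio, 1.0 - eps), 1.0 + eps)
    return min(ratio * advantage, clipped * advantage)


def ppo_clip_objective(logp_new, logp_old, advantages, eps=0.2):
    """Batch average of the per-sample clipped terms (to be maximized)."""
    terms = [ppo_clip_term(n, o, a, eps)
             for n, o, a in zip(logp_new, logp_old, advantages)]
    return sum(terms) / len(terms)


logp_old = [math.log(0.5)] * 4
logp_new = [math.log(0.75), math.log(0.75), math.log(0.25), math.log(0.25)]
advantages = [1.0, -1.0, 1.0, -1.0]  # ratios are 1.5, 1.5, 0.5, 0.5
for n, o, a in zip(logp_new, logp_old, advantages):
    print(round(ppo_clip_term(n, o, a), 6))  # 1.2, -1.5, 0.5, -0.8
```

The clip only binds in the first and last samples, where the ratio has already moved in the direction the advantage favors. In the other two cases the unclipped term is the smaller one, so the objective still penalizes moves in the wrong direction.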

SAC soft actor objective:

$$ \max_ \theta \mathbb{E}_ {s\sim\mathcal{D},a\sim\pi_ \theta} [Q(s,a)-\alpha\log\pi_ \theta(a\mid s)]. $$

IQL policy extraction:

$$ \max_ \theta \mathbb{E}_ {(s,a)\sim\mathcal{D}} \left[ \log\pi_ \theta(a\mid s) \exp\left(\frac{Q(s,a)-V(s)}{\alpha}\right) \right]. $$
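
The extraction objective is a weighted behavior-cloning loss in which the exponentiated advantages act as fixed per-sample weights. The sketch below assumes precomputed $Q$, $V$, and log-probabilities; the cap `max_weight` is a commonly used numerical safeguard added here as an assumption, and the function names are hypothetical.

```python
import math


def advantage_weights(q_values, v_values, alpha, max_weight=100.0):
    """exp((Q - V) / alpha), capped at max_weight (the cap is an assumption)."""
    log_cap = math.log(max_weight)
    # Cap the exponent before calling exp so large advantages cannot overflow.
    return [math.exp(min((q - v) / alpha, log_cap))
            for q, v in zip(q_values, v_values)]


def weighted_bc_loss(log_probs, weights):
    """Negative of the extraction objective, averaged over the batch."""
    return -sum(w * lp for w, lp in zip(weights, log_probs)) / len(log_probs)


q_values = [1.0, 0.0, -1.0]
v_values = [0.0, 0.0, 0.0]
weights = advantage_weights(q_values, v_values, alpha=1.0)
print(weights)  # roughly [2.718, 1.0, 0.368]

log_probs = [-1.0, -1.0, -1.0]  # hypothetical log pi(a | s) for dataset actions
print(weighted_bc_loss(log_probs, weights))
```

Only dataset actions appear in the loss, so the policy is never evaluated on actions outside the data. A smaller $\alpha$ concentrates the weight on the highest-advantage samples.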

Preference reward model:

$$ \Pr(\tau_ a\succ\tau_ b)= \sigma(r_ \theta(\tau_ a)-r_ \theta(\tau_ b)). $$
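
Fitting this model amounts to logistic regression on score differences. The following minimal sketch takes the two scalar scores $r_ \theta(\tau_ a)$ and $r_ \theta(\tau_ b)$ as given; the function names are illustrative, and no particular reward-model architecture is implied.

```python
import math


def preference_prob(score_a, score_b):
    """Pr(tau_a preferred over tau_b) = sigmoid(score_a - score_b)."""
    return 1.0 / (1.0 + math.exp(-(score_a - score_b)))


def preference_nll(score_preferred, score_rejected):
    """Negative log-likelihood of one labeled comparison."""
    return -math.log(preference_prob(score_preferred, score_rejected))


print(preference_prob(0.0, 0.0))   # 0.5: equal scores give a coin flip
print(preference_nll(2.0, 0.0))    # small loss: model agrees with the label
print(preference_nll(0.0, 2.0))    # larger loss: model disagrees with the label
```

Only the score difference matters, so the learned reward is identified only up to an additive constant.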

Policy-gradient theorem:

$$ \nabla_ \theta J(\theta)= \frac{1}{1-\gamma} \mathbb{E}_ {s\sim d^{\pi_ \theta},a\sim\pi_ \theta} [\nabla_ \theta\log\pi_ \theta(a\mid s)Q^{\pi_ \theta}(s,a)]. $$

GAE:

$$ \hat{A}^{\mathrm{GAE}}_ t= \sum_ {l=0}^{\infty}(\gamma\lambda)^l\delta_ {t+l}, \qquad \delta_ t=r_ t+\gamma V(s_ {t+1})-V(s_ t). $$
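
In practice the infinite sum is truncated at the end of a rollout and evaluated backwards with $\hat{A}_ t=\delta_ t+\gamma\lambda\hat{A}_ {t+1}$. The sketch below assumes a single rollout of length $T$ and a `values` list with $T+1$ entries whose last entry is the bootstrap value (zero for a terminal state); these conventions and the function name are assumptions for illustration.

```python
def gae_advantages(rewards, values, gamma, lam):
    """Truncated GAE; `values` holds V(s_1), ..., V(s_{T+1})."""
    T = len(rewards)
    advantages = [0.0] * T
    running = 0.0
    for t in reversed(range(T)):
        # TD residual: delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        # A_t = delta_t + gamma * lambda * A_{t+1}
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages


rewards = [1.0, 0.0, 2.0]
values = [0.5, 1.0, 0.0, 0.0]
print(gae_advantages(rewards, values, gamma=0.5, lam=0.0))  # [1.0, -1.0, 2.0]
print(gae_advantages(rewards, [0.0] * 4, gamma=0.5, lam=1.0))  # [1.5, 1.0, 2.0]
```

With $\lambda=0$ the result is the one-step TD residual (low variance, more critic bias). With $\lambda=1$ and a zero value function it reduces to the Monte Carlo reward-to-go.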

TRPO-style constrained surrogate:

$$ \max_ \theta \mathbb{E}\left[ \frac{\pi_ \theta(a_ t\mid s_ t)} {\pi_ {\mathrm{old}}(a_ t\mid s_ t)} \hat{A}_ t \right] \quad \text{s.t.}\quad \mathbb{E}_ s[D_ {\mathrm{KL}}(\pi_ {\mathrm{old}}(\cdot\mid s)\,\Vert\,\pi_ \theta(\cdot\mid s))]\le\delta. $$

IQL expectile/value and Q losses:

$$ L_ \tau(u)=\left|\tau-\mathbf{1}\lbrace u<0\rbrace\right|u^2, $$

$$ \min_ \psi \mathbb{E}_ {(s,a)\sim\mathcal{D}} [L_ \tau(Q_ \phi(s,a)-V_ \psi(s))], \qquad \min_ \phi \mathbb{E}[(Q_ \phi(s,a)-r-\gamma V_ \psi(s'))^2]. $$
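
A small numerical sketch can show what the expectile loss does to the value estimate. Here $\tau$ is the expectile level, not a trajectory. The example assumes three hypothetical $Q$ samples at one state and finds the minimizing $V$ by grid search; the helper names and the grid search itself are for illustration only, since in practice $V_ \psi$ is fit by gradient descent.

```python
def expectile_loss(u, tau):
    """|tau - 1{u < 0}| * u^2: weight tau when u >= 0, (1 - tau) when u < 0."""
    weight = tau if u >= 0 else 1.0 - tau
    return weight * u * u


def value_loss(v, q_samples, tau):
    """Average expectile loss of u = Q - V over the samples."""
    return sum(expectile_loss(q - v, tau) for q in q_samples) / len(q_samples)


def best_value(q_samples, tau, grid_size=201):
    """Grid search for the V minimizing the loss (illustration only)."""
    lo, hi = min(q_samples), max(q_samples)
    grid = [lo + (hi - lo) * i / (grid_size - 1) for i in range(grid_size)]
    return min(grid, key=lambda v: value_loss(v, q_samples, tau))


q_samples = [0.0, 1.0, 2.0]
print(best_value(q_samples, tau=0.5))  # 1.0, the mean
print(best_value(q_samples, tau=0.9))  # close to 1.73, pulled toward the max
```

With $\tau=0.5$ the loss is ordinary squared error and $V$ is the mean. As $\tau\to 1$, $V$ approaches the largest $Q$ seen in the data, which approximates a max over in-dataset actions without querying unseen ones.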

DPO loss:

$$ \mathcal{L}_ {\mathrm{DPO}}=- \mathbb{E} \left[ \log\sigma\left( \beta \left[ \log\frac{\pi_ \theta(y_ w\mid x)}{\pi_ {\mathrm{ref}}(y_ w\mid x)} - \log\frac{\pi_ \theta(y_ l\mid x)}{\pi_ {\mathrm{ref}}(y_ l\mid x)} \right] \right) \right]. $$
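
For a single preference pair, the loss depends only on four sequence log-probabilities. The sketch below assumes those are already summed over tokens; the function name, the argument names, and the default `beta=0.1` are illustrative assumptions.

```python
import math


def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """-log sigmoid(beta * [(logp_w - ref_logp_w) - (logp_l - ref_logp_l)])."""
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # -log(sigmoid(m)) equals log(1 + exp(-m)).
    return math.log1p(math.exp(-margin))


# Policy identical to the reference: margin is 0, loss is log 2.
at_init = dpo_loss(-5.0, -5.0, -5.0, -5.0)
# Policy raises the preferred response and lowers the rejected one.
improved = dpo_loss(-4.0, -6.0, -5.0, -5.0)
print(at_init, improved)
```

The loss falls only when the policy increases the log-ratio of the preferred response relative to the rejected one, measured against the reference model. This form can overflow for very negative margins, so a production implementation would use a numerically stable log-sigmoid.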

Meta-RL adaptation policy:

$$ a\sim\pi_ \theta(\cdot\mid s,\mathcal{D}_ {\mathrm{train}}), \qquad \mathcal{D}_ {\mathrm{train}} =\lbrace(s_ t,a_ t,r_ t,s_ {t+1})\rbrace_ {t=1}^{k}. $$

Black-box meta-RL memory:

$$ h_ t=f_ \theta(h_ {t-1},s_ t,a_ {t-1},r_ {t-1}), \qquad a_ t\sim\pi_ \theta(\cdot\mid s_ t,h_ t). $$

Hierarchical policy:

$$ g_ t\sim\pi_ {\mathrm{HL}}(\cdot\mid o_ t,c), \qquad a_ t\sim\pi_ {\mathrm{LL}}(\cdot\mid o_ t,g_ t). $$

Domain randomization objective:

$$ \max_ \theta \mathbb{E}_ {e\sim p(e)} [J_ {\mathrm{sim}}(\pi_ \theta;e)], \qquad x_ {t+1}=f_ {\mathrm{sim}}(x_ t,u_ t,e). $$

Bellman operators:

$$ (\mathcal{B}^\pi V)(s)= \mathbb{E}_ {a\sim\pi,s'\sim p}[r(s,a)+\gamma V(s')], $$

$$ (\mathcal{B}^*Q)(s,a)= r(s,a)+\gamma\mathbb{E}_ {s'\sim p}[\max_ {a'}Q(s',a')]. $$

Data Regimes

The course repeatedly compares algorithms by data regime:

Offline imitation learning: fixed expert dataset, no reward needed.
Online imitation learning: expert interventions or labels after rollout.
On-policy RL: use only current policy data.
Less off-policy RL: reuse one recent batch for multiple updates.
Fully off-policy RL: replay buffer from many past policies.
Offline RL: fixed reward-labeled dataset, no new policy data.

More online data usually permits more aggressive improvement. Less online data requires stronger conservatism.

Method Comparison Table

Method	Data regime	Learned objects	Main update	Main failure mode
BC	Offline demos	Policy	Supervised likelihood	Covariate shift, multimodal averaging
DAgger	Online expert labels	Policy	Aggregate learner-state labels	Requires expert intervention
REINFORCE	On-policy RL	Policy	Score-function gradient	High variance
Actor-critic	Usually on-policy or lightly off-policy	Policy and value/critic	PG with learned advantage	Critic bias/instability
TRPO	Recent on-policy batches	Policy and value	KL-constrained surrogate	Harder implementation, conservative updates
PPO	Recent on-policy batches	Policy and value	Clipped/KL-constrained actor update	Data inefficiency
SAC	Replay buffer	Policy and soft Q	Entropy-regularized off-policy AC	Tuning and critic extrapolation
DQN	Replay buffer	Q-function	Bellman optimality backup	Overestimation, discrete-action bias
AWR/AWAC	Offline or replay data	Policy plus value/Q	Advantage-weighted BC	Weak or noisy advantage estimates
IQL	Fixed reward dataset	V, Q, policy	Expectile value + weighted BC	Depends on data support and expectile tuning
CQL	Fixed reward dataset	Conservative Q and policy	Penalize unsupported high Q	Too much pessimism
Model-based RL	Real plus model data or planning	Dynamics, often reward/value	Synthetic rollouts or MPC	Model error exploitation
Goal-conditioned RL	Multi-goal data	Policy/Q conditioned on goal	Relabeling and off-policy RL	Invalid relabeling, sparse exploration
Black-box meta-RL	Many training tasks, few-shot test tasks	Memory/context-conditioned policy	Optimize adaptation over task distribution	Hard exploration, poor task-distribution generalization
PEARL-style meta-RL	Many training tasks, replay data	Latent task posterior and policy	Infer $z$ from context and act with $\pi(a\mid s,z)$	Posterior sampling may not gather task-identifying information
Hierarchical IL/RL	Demonstrations or RL with subgoals	High-level and low-level policies	Predict subgoals, then actions conditioned on subgoals	Level mismatch, bad termination/replanning
Sim2Real RL	Simulated plus sometimes real data	Robot policy, sometimes adaptation/residual models	Train in randomized/calibrated simulation	Simulator mismatch and reward misspecification

What Each Learned Object Does

Policy $\pi(a\mid s)$: chooses actions.
Value $V(s)$: estimates how good a state is under a policy.
Q-function $Q(s,a)$: estimates how good an action is in a state.
Advantage $A(s,a)$: estimates whether an action is better than expected.
Dynamics model $p(s'\mid s,a)$: predicts the next state.
Reward model $r(s,a)$ or $r(x,y)$: predicts task success or human preference.

Most deep RL algorithms are combinations of these learned objects.

Bellman and DP Checklist

Exact dynamic programming assumes known $p(s'\mid s,a)$ and tractable state/action spaces.

Policy evaluation: repeatedly apply $\mathcal{B}^\pi$ until convergence.
Policy improvement: set $\pi_ {\mathrm{new}}(s)\in\arg\max_ a Q^\pi(s,a)$.
Policy iteration: alternate evaluation and improvement.
Value iteration: repeatedly apply the optimality backup.

The contraction property

$$ \lVert\mathcal{B}V-\mathcal{B}U\rVert_ \infty\le\gamma\lVert V-U\rVert_ \infty $$

is why tabular backups converge. Deep RL is harder because function approximation couples errors across states, samples replace exact expectations, and bootstrapped targets move during training.
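
The contraction and the resulting convergence can be checked numerically on a toy problem. The sketch below uses the state-value form of the optimality backup, $(\mathcal{B}^*V)(s)=\max_ a[r(s,a)+\gamma\mathbb{E}_ {s'}V(s')]$, on a hypothetical two-state, two-action MDP in which action $a$ moves deterministically to state $a$. The MDP and all names are assumptions made up for illustration.

```python
def optimality_backup(V, P, R, gamma):
    """One application of the optimality backup to a tabular V."""
    new_V = []
    for s in range(len(V)):
        action_values = []
        for a in range(len(R[s])):
            expected_next = sum(p * v for p, v in zip(P[s][a], V))
            action_values.append(R[s][a] + gamma * expected_next)
        new_V.append(max(action_values))
    return new_V


def sup_norm(V, U):
    return max(abs(v - u) for v, u in zip(V, U))


# P[s][a][s']: action a always leads to state a. R[s][a]: immediate reward.
P = [[[1.0, 0.0], [0.0, 1.0]],
     [[1.0, 0.0], [0.0, 1.0]]]
R = [[0.0, 1.0], [0.0, 2.0]]
gamma = 0.9

V, U = [0.0, 0.0], [10.0, -5.0]
gap_before = sup_norm(V, U)
gap_after = sup_norm(optimality_backup(V, P, R, gamma),
                     optimality_backup(U, P, R, gamma))
print(gap_after <= gamma * gap_before)  # True

V_star = [0.0, 0.0]
for _ in range(500):  # value iteration
    V_star = optimality_backup(V_star, P, R, gamma)
print(V_star)  # approaches [19.0, 20.0]
```

Staying in state 1 earns 2 per step, so $V^*(1)=2/(1-\gamma)=20$ and $V^*(0)=1+\gamma\cdot 20=19$, which matches the iterate.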

Bias, Variance, and Distribution Shift

Many exam questions can be solved by identifying which side of these tradeoffs an algorithm chooses:

Choice	Lower variance	Lower bias
Return estimator	TD target	Monte Carlo return
Policy gradient	Baseline/advantage	Raw return remains unbiased but noisy
RL data	Off-policy replay	Fresh on-policy data
Offline RL	Conservative policy/value	Aggressive policy improvement
Model rollouts	Short rollouts from real states	Long rollouts if model is accurate

Distribution shift appears in different forms:

State shift: learned policy visits states absent from demonstrations.
Action shift: offline/off-policy actor queries Q-values for actions absent from data.
Model shift: planner rolls learned dynamics into unreal states.
Reward shift: RL discovers states or outputs where the learned reward is miscalibrated.
Task shift: meta-test tasks differ from the meta-training task distribution.
Level shift: a high-level policy outputs subgoals outside the low-level policy's training support.
Sim2Real shift: real dynamics, sensors, actuators, contacts, or delays differ from simulation.

Meta-RL, Hierarchy, and Sim2Real Checklist

Meta-RL:

What is the task distribution $p(\mathcal{T})$?
What data is available at meta-test time: full episodes, a few transitions, rewards, demonstrations, or context only?
Does the method learn a memory state, a latent task posterior, or an explicit update rule?
Is exploration trained only through downstream return, or through an auxiliary task-inference objective?

Hierarchy:

What is the high-level action $g$: language, image, state goal, option ID, or latent skill?
Is $g$ expressive enough but still achievable by the low-level policy?
Are the two levels trained on compatible distributions?
Is replanning event-based, fixed-interval, or learned?

Sim2Real:

Which mismatch matters most: dynamics parameters, missing physics, sensing, actuation, latency, perception, or reward?
Does the method use robustness, adaptation, calibration, residual modeling, or retargeting?
Which signals are privileged during simulation but unavailable at deployment?
Does the deployed actor depend only on real-world available observations?

Probability Tools from the TA Notes

Total variation distance between distributions $\mu$ and $\nu$ is

$$ D_ {\mathrm{TV}}(\mu,\nu)= \sup_ A |\mu(A)-\nu(A)| =\frac{1}{2}\lVert\mu-\nu\rVert_ 1 $$

in the discrete case. It measures the largest possible difference in probability assigned to the same event.
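
The two expressions for total variation can be compared directly on a small discrete example. The sketch below enumerates every event $A$ by brute force, which is feasible only for tiny supports; the distributions and function names are made up for illustration.

```python
from itertools import combinations


def total_variation(mu, nu):
    """Half the L1 distance between two discrete distributions."""
    return 0.5 * sum(abs(m - n) for m, n in zip(mu, nu))


def tv_by_events(mu, nu):
    """sup over events A of |mu(A) - nu(A)|, by enumerating all subsets."""
    indices = range(len(mu))
    best = 0.0
    for size in range(len(mu) + 1):
        for event in combinations(indices, size):
            gap = abs(sum(mu[i] for i in event) - sum(nu[i] for i in event))
            best = max(best, gap)
    return best


mu = [0.5, 0.3, 0.2]
nu = [0.2, 0.3, 0.5]
print(total_variation(mu, nu))  # about 0.3
print(tv_by_events(mu, nu))     # about 0.3, attained by the event {0}
```

The maximizing event collects exactly the outcomes where $\mu$ assigns more mass than $\nu$ (or its complement).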

A coupling of $\mu$ and $\nu$ is a joint distribution over $(X,Y)$ whose marginals are $\mu$ and $\nu$. Coupling is useful for policy-change arguments because it lets us reason about the probability that two policies take different actions under a shared random draw. The common argument runs as follows: if the two action distributions are close in total variation at every state, the policies can be coupled so that they rarely choose different actions, and the two trajectories stay identical for many steps. Only once the trajectories decouple can the state distributions diverge.

This is the intuition behind trust regions: small per-state policy changes limit how much the state distribution can change, which keeps old-policy advantage estimates meaningful for the new policy.
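
A short worked version of this argument follows, as a sketch under assumptions not stated above. Both policies start from the same $\rho_ 0$ and share the same dynamics, and $D_ {\mathrm{TV}}(\pi_ \theta(\cdot\mid s),\pi_ {\mathrm{old}}(\cdot\mid s))\le\epsilon$ at every state. Here $\epsilon$ is a per-state total-variation budget, not the PPO clip range, and $p_ t^{\pi}$ denotes the state distribution after $t$ actions, a notation introduced only for this example. Under a maximal coupling the two policies choose the same action with probability at least $1-\epsilon$ at each step, so

$$ \Pr(\text{same actions for } t \text{ steps})\ge(1-\epsilon)^t, \qquad D_ {\mathrm{TV}}(p_ t^{\pi_ \theta},p_ t^{\pi_ {\mathrm{old}}})\le 1-(1-\epsilon)^t\le\epsilon t. $$

The middle inequality holds because the two state distributions can differ only on the event that the trajectories have decoupled. The last step is Bernoulli's inequality. The bound grows linearly in the horizon, which is why the per-state change must be kept small for long rollouts.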

Common Failure Modes

Behavior cloning: covariate shift and action averaging.
Policy gradients: high variance and data inefficiency.
Actor-critic: biased or unstable critic misleads policy.
Q-learning: moving targets, overestimation, poor action coverage.
Offline RL: OOD action exploitation and value overestimation.
Reward learning: reward hacking and classifier/reward-model exploitation.
Model-based RL: compounding model error and model exploitation.
Multi-task RL: negative transfer and data imbalance.
Meta-RL: learning to explore and learning to exploit are coupled, so sparse rewards can create poor local optima.
Hierarchical policies: the high-level policy can issue subgoals that the low-level policy cannot reliably complete.
Sim2Real: a policy can exploit simulator details that are absent or different on the real robot.

Practical Algorithm Map

Start with behavior cloning if demonstrations exist.
Use DAgger or intervention data if BC fails due to compounding errors and expert online feedback is feasible.
Use PPO when online data is cheap enough and stability matters.
Use SAC when online data is expensive but tuning is acceptable.
Use DQN-style Q-learning for discrete action spaces.
Use offline RL when you only have a static dataset with rewards.
Use reward learning when the reward is hard to write down but preferences or success examples are available.
Use model-based RL when dynamics are easier to learn than the policy or when test-time planning is valuable.
Use goal-conditioned RL/HER when many tasks can be expressed as reaching goals.
Use meta-RL when deployment requires fast adaptation from a few examples drawn from a known task family.
Use hierarchy when long-horizon tasks have reusable subtasks or when high-level reasoning and low-level control need different time scales.
Use sim2real RL when real robot data is expensive but the simulator can be randomized, calibrated, or adapted enough for deployment.

Common Exam Traps

"Off-policy" does not mean "works with arbitrary data." You still need coverage of useful state-action pairs.
A baseline in policy gradients must not depend on the sampled action. If it does, the estimator can become biased.
$V^\pi$ and $Q^\pi$ evaluate a fixed policy. $V^*$ and $Q^*$ assume optimal future behavior.
Q-learning's max is over the learned Q-function, not over the behavior policy's sampled next action.
PPO is often called on-policy even though it reuses a recent batch for several updates; the updates are constrained so the data is still approximately on-policy.
Offline RL is not just "run SAC on a fixed replay buffer"; the actor and critic can exploit unsupported actions without conservatism.
A learned model that predicts one step well may still be poor for long-horizon planning.
A learned reward that classifies examples well may still be exploitable by an RL policy.
Meta-RL is not ordinary multi-task RL with a hidden task ID; the test-time policy must infer the task from experience.
A hierarchical policy is only useful if the subgoal interface is actually used and both levels are trained on compatible distributions.
Domain randomization does not make the real world disappear; it only helps if the real deployment lies inside the randomized training envelope.
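
The baseline trap in the list above can be checked with a one-line derivation. This is a sketch for a discrete action space with a state-only baseline $b(s)$; the continuous case replaces the sum with an integral.

$$ \mathbb{E}_ {a\sim\pi_ \theta(\cdot\mid s)}[\nabla_ \theta\log\pi_ \theta(a\mid s)\,b(s)] =b(s)\sum_ a\pi_ \theta(a\mid s)\nabla_ \theta\log\pi_ \theta(a\mid s) =b(s)\nabla_ \theta\sum_ a\pi_ \theta(a\mid s) =b(s)\nabla_ \theta 1=0. $$

The first step pulls $b(s)$ out of the sum over actions, which is valid only because it does not depend on $a$. If the baseline depends on the sampled action, it stays inside the sum and the term no longer vanishes in general.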
