3. Actor-Critic Methods
Motivation
Vanilla policy gradients use sampled returns as noisy estimates of action quality. Actor-critic methods reduce this noise by learning a value function or Q-function. The actor is the policy $\pi_ \theta$; the critic estimates $V^\pi$, $Q^\pi$, or $A^\pi$.
Instead of asking "what return happened after this action once?", actor-critic tries to estimate "what return should we expect after this action on average?"
Policy Evaluation
Policy evaluation means estimating the value of the current policy.
Monte Carlo value fitting uses observed reward-to-go:
$$ y_ t^{\text{MC}}=\sum_ {t'=t}^{T}\gamma^{t'-t}r(s_ {t'},a_ {t'}), $$
and trains:
$$ \min_ \phi \mathbb{E}_ {(s_ t,y_ t)} \left[(V_ \phi(s_ t)-y_ t^{\text{MC}})^2\right]. $$
Monte Carlo targets are unbiased but high variance and require waiting for future rewards.
Bootstrapping, or temporal-difference learning, uses:
$$ y_ t^{\text{TD}}=r(s_ t,a_ t)+\gamma V_ \phi(s_ {t+1}). $$
TD targets have lower variance but are biased because they bootstrap from the critic's current, imperfect prediction $V_ \phi(s_ {t+1})$.
N-step returns interpolate between the two:
$$ y_ t^{(n)}= \sum_ {k=0}^{n-1}\gamma^k r(s_ {t+k},a_ {t+k}) +\gamma^n V_ \phi(s_ {t+n}). $$
The bias-variance tradeoff is central:
- Monte Carlo targets: no bootstrap bias, high variance, require full rollouts.
- One-step TD targets: lower variance and online updates, but biased when $V_ \phi$ is wrong.
- N-step targets: intermediate; larger $n$ uses more real rewards, smaller $n$ relies more on the critic.
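The three target types above can be compared on a single short episode. The following is a minimal sketch in plain Python, not a reference implementation; the function names, the toy rewards, and the critic outputs in `values` are all hypothetical, and `values` is assumed to hold one extra entry for the state after the last reward (0.0 if the episode terminated there).

```python
def mc_targets(rewards, gamma):
    """Discounted reward-to-go for every step of one finished episode."""
    targets = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        targets[t] = running
    return targets


def n_step_targets(rewards, values, gamma, n):
    """n-step targets; `values` must have len(rewards) + 1 entries.

    values[t] stands for the critic estimate V(s_t); values[-1] is the value
    of the state reached after the last reward.
    """
    T = len(rewards)
    targets = []
    for t in range(T):
        end = min(t + n, T)  # truncate the reward sum at the end of the episode
        y = sum(gamma ** (k - t) * rewards[k] for k in range(t, end))
        y += gamma ** (end - t) * values[end]  # bootstrap from the critic
        targets.append(y)
    return targets


rewards = [1.0, 0.0, 2.0]        # toy episode (assumed)
values = [0.5, 0.4, 1.5, 0.0]    # hypothetical critic outputs, terminal value 0
gamma = 0.9

print(mc_targets(rewards, gamma))
print(n_step_targets(rewards, values, gamma, n=1))  # one-step TD targets
print(n_step_targets(rewards, values, gamma, n=3))  # matches Monte Carlo here
```

With `n=1` every target leans on the critic after a single real reward; with `n` at least the episode length and a terminal value of zero, the targets coincide with the Monte Carlo reward-to-go.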
When fitting a neural critic, the target is usually treated as a constant for that gradient step. In code this means detaching the bootstrap value or using a target network. Otherwise gradients flow through both the prediction $V_ \phi(s_ t)$ and the bootstrapped target $V_ \phi(s_ {t+1})$, so the critic chases a target that moves with every update.
Advantage Estimates
A simple one-step advantage estimate is the TD residual:
$$ \hat{A}_ t = r(s_ t,a_ t)+\gamma V_ \phi(s_ {t+1})-V_ \phi(s_ t). $$
This measures whether the observed action led to a better outcome than the critic expected from the state.
In practice, many implementations use generalized advantage estimation (GAE), which exponentially averages multi-step TD residuals:
$$ \hat{A}^{\text{GAE}(\gamma,\lambda)}_ t = \sum_ {l=0}^{\infty}(\gamma\lambda)^l\delta_ {t+l}, \qquad \delta_ t=r_ t+\gamma V_ \phi(s_ {t+1})-V_ \phi(s_ t). $$
GAE was not the central focus of the slides, but it is the standard practical extension behind PPO-style actor-critic.
The parameter $\lambda$ controls the bias-variance tradeoff. $\lambda=0$ gives the one-step TD residual. $\lambda=1$ gives the Monte Carlo return minus the baseline $V_ \phi(s_ t)$, because the critic terms in the sum telescope. PPO commonly uses GAE because it gives smoother, lower-variance advantages while retaining some multi-step credit assignment.
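The GAE sum is usually computed with a single backward pass over a trajectory. The sketch below assumes one finite trajectory segment with no episode boundary inside it, and a `values` list with one extra entry used to bootstrap after the last reward; the function name and the toy numbers are illustrative only.

```python
def gae_advantages(rewards, values, gamma, lam):
    """Truncated GAE for one trajectory segment.

    values has len(rewards) + 1 entries; values[-1] bootstraps the tail
    (use 0.0 if the segment ends in a terminal state).
    """
    T = len(rewards)
    advantages = [0.0] * T
    running = 0.0
    for t in reversed(range(T)):
        # One-step TD residual delta_t.
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        # A_t = delta_t + (gamma * lambda) * A_{t+1}
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages


rewards = [1.0, 0.0, 2.0]        # toy data (assumed)
values = [0.5, 0.4, 1.5, 0.0]    # hypothetical critic outputs
print(gae_advantages(rewards, values, gamma=0.9, lam=0.0))   # TD residuals
print(gae_advantages(rewards, values, gamma=0.9, lam=1.0))   # return minus V(s_t)
print(gae_advantages(rewards, values, gamma=0.9, lam=0.95))  # in between
```

Implementations that batch several episodes together typically also reset `running` at episode boundaries; that bookkeeping is omitted here.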
Basic Actor-Critic Algorithm
- Sample trajectories from the current policy $\pi_ \theta$.
- Fit $V_ \phi$ using Monte Carlo, TD, or N-step targets.
- Estimate advantages $\hat{A}_ t$.
- Update the actor:
$$ \nabla_ \theta J(\theta) \approx \sum_ t \nabla_ \theta\log\pi_ \theta(a_ t\mid s_ t)\hat{A}_ t. $$
- Repeat with newly collected data.
The critic reduces variance; the actor still performs policy-gradient improvement.
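To make the actor update concrete, here is a minimal sketch for a tabular softmax policy, where the parameters are per-state logits and the score function has the closed form $\mathbb{1}[i=a]-\pi(i\mid s)$. The dictionary layout, the function names, and the choice to average the gradient over the batch (the formula above sums) are assumptions made for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def actor_step(logits, batch, lr):
    """One policy-gradient ascent step for a tabular softmax policy.

    logits: dict mapping state -> list of per-action logits (the parameters).
    batch:  list of (state, action, advantage) tuples from the current policy.
    """
    grads = {s: [0.0] * len(row) for s, row in logits.items()}
    for s, a, adv in batch:
        probs = softmax(logits[s])
        for i, p in enumerate(probs):
            # d log pi(a|s) / d logit_i = 1[i == a] - pi(i|s)
            grads[s][i] += ((1.0 if i == a else 0.0) - p) * adv
    n = len(batch)
    return {s: [x + lr * g / n for x, g in zip(logits[s], grads[s])]
            for s in logits}


logits = {"s0": [0.0, 0.0]}
batch = [("s0", 0, 1.0), ("s0", 1, -1.0)]  # hypothetical advantage estimates
new_logits = actor_step(logits, batch, lr=0.5)
print(softmax(new_logits["s0"]))  # probability of action 0 rises above 0.5
```

Actions with positive estimated advantage gain probability and actions with negative advantage lose it, which is the whole effect of the update once the critic has supplied $\hat{A}_ t$.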
Performance Difference and Trust Regions
The performance difference lemma formalizes why advantages are the right object for policy improvement. In a discounted MDP,
$$ J(\pi')-J(\pi)= \frac{1}{1-\gamma} \mathbb{E}_ {s\sim d^{\pi'},\,a\sim\pi'(\cdot\mid s)} \left[A^\pi(s,a)\right]. $$
This says a new policy improves over $\pi$ if, under the states it actually visits, it chooses actions with positive old-policy advantage. The catch is that $d^{\pi'}$ is hard to know before deploying $\pi'$. A local surrogate replaces $d^{\pi'}$ with $d^\pi$:
$$ L_ \pi(\pi')= J(\pi)+ \frac{1}{1-\gamma} \mathbb{E}_ {s\sim d^\pi,\,a\sim\pi'(\cdot\mid s)} [A^\pi(s,a)]. $$
This surrogate is accurate only when $\pi'$ is close to $\pi$. If the new policy moves too far, it can visit different states where the old advantages are irrelevant. This is the motivation for trust-region methods: improve the surrogate while constraining the policy change.
TRPO expresses this idea as a constrained optimization problem:
$$ \max_ \theta \mathbb{E}_ {s\sim d^{\pi_ {\mathrm{old}}},\,a\sim\pi_ {\mathrm{old}}(\cdot\mid s)} \left[ \frac{\pi_ \theta(a\mid s)} {\pi_ {\mathrm{old}}(a\mid s)} \hat{A}^{\pi_ {\mathrm{old}}}(s,a) \right] \quad \text{s.t.}\quad \mathbb{E}_ {s\sim d^{\pi_ {\mathrm{old}}}} \left[ D_ {\mathrm{KL}}(\pi_ {\mathrm{old}}(\cdot\mid s)\,\|\,\pi_ \theta(\cdot\mid s)) \right]\le \delta. $$
Natural policy gradient is the local quadratic version of the same idea. Ordinary gradients measure parameter-space distance; natural gradients measure distance in distribution space using the Fisher information matrix. TRPO is more complicated to implement, but it is the conceptual bridge to PPO.
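The probability ratio in the TRPO objective comes from rewriting the surrogate's inner expectation over actions with importance sampling. A short derivation for discrete actions, assuming $\pi(a\mid s)>0$ wherever $\pi'(a\mid s)>0$:

$$ \mathbb{E}_ {a\sim\pi'(\cdot\mid s)}[A^\pi(s,a)] = \sum_ a \pi'(a\mid s)A^\pi(s,a) = \sum_ a \pi(a\mid s)\frac{\pi'(a\mid s)}{\pi(a\mid s)}A^\pi(s,a) = \mathbb{E}_ {a\sim\pi(\cdot\mid s)}\left[\frac{\pi'(a\mid s)}{\pi(a\mid s)}A^\pi(s,a)\right]. $$

After this step both the state and the action are drawn from the old policy $\pi$, so the surrogate can be estimated entirely from data that $\pi$ collected.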
Off-Policy Actor-Critic
To reuse more data, actor-critic can become off-policy.
With multiple gradient steps on one batch, importance ratios adjust for the policy changing during the update. This is the path toward PPO.
With a replay buffer, data from many past policies can be reused. But the value estimate must be compatible with the current policy. It is often easier to fit $Q(s,a)$ than $V(s)$ from replay data because replay transitions contain actions:
$$ Q^\pi(s,a)=r(s,a)+ \gamma\mathbb{E}_ {s'\sim p,\,a'\sim\pi(\cdot\mid s')} [Q^\pi(s',a')]. $$
The replay-buffer critic target is:
$$ y=r+\gamma Q_ {\bar{\phi}}(s',a'), \qquad a'\sim\pi_ \theta(\cdot\mid s'). $$
Here $Q_ {\bar{\phi}}$ is often a slowly updated target critic. The actor can be updated toward actions with high critic value:
$$ \max_ \theta \mathbb{E}_ {s\sim\mathcal{D},\,a\sim\pi_ \theta(\cdot\mid s)} [Q_ \phi(s,a)]. $$
This is the conceptual route to soft actor-critic (SAC).
The important off-policy actor-critic distinction is that the dataset contains states and actions from old behavior policies, but the Bellman target uses actions from the current policy at the next state. That lets the critic evaluate the current actor using replay data:
$$ y=r+\gamma \mathbb{E}_ {a'\sim\pi_ \theta(\cdot\mid s')} [Q_ {\bar{\phi}}(s',a')]. $$
This is more data efficient than on-policy policy gradients, but the target becomes sensitive to critic extrapolation and policy drift.
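For a discrete action space the expectation in this target can be computed exactly. The sketch below is illustrative: the function name, the toy numbers, and the `done` flag (which zeroes the bootstrap term at terminal states and does not appear in the formula above) are assumptions.

```python
def replay_critic_target(reward, done, next_q_values, next_action_probs, gamma):
    """y = r + gamma * E_{a' ~ pi_theta(.|s')}[Q_target(s', a')], discrete actions.

    next_q_values:     target-critic values Q_target(s', a') for each action a'.
    next_action_probs: CURRENT policy probabilities pi_theta(a' | s').
    """
    expected_next_q = sum(p * q for p, q in zip(next_action_probs, next_q_values))
    return reward + gamma * (0.0 if done else expected_next_q)


# A replayed transition supplies only (s, a, r, s', done). The next action is
# never read from the buffer; it comes from the current policy.
y = replay_critic_target(
    reward=1.0,
    done=False,
    next_q_values=[2.0, 4.0],         # hypothetical target-critic outputs
    next_action_probs=[0.25, 0.75],   # hypothetical current-policy probabilities
    gamma=0.9,
)
print(y)
```

Because only the next-state action distribution depends on the policy, the same stored transition yields a different target each time the actor changes, which is what lets old data evaluate the current actor.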
PPO and SAC in Context
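PPO is a "less off-policy" actor-critic algorithm. It collects a batch with the current policy, then takes multiple constrained updates on that batch. With the probability ratio $\rho_ t(\theta)=\pi_ \theta(a_ t\mid s_ t)/\pi_ {\mathrm{old}}(a_ t\mid s_ t)$, the usual clipped surrogate is: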
$$ L^{\text{PPO}}(\theta)= \mathbb{E}_ t\left[ \min\left( \rho_ t(\theta)\hat{A}_ t, \operatorname{clip}(\rho_ t(\theta),1-\epsilon,1+\epsilon)\hat{A}_ t \right)\right]. $$
The clipping prevents the new policy from moving too far from the data-collecting policy.
The clipping should be interpreted by the sign of the advantage:
- If $\hat{A}_ t>0$, PPO wants to increase the probability of $a_ t$, but the objective stops rewarding increases once the ratio $\rho_ t(\theta)$ exceeds $1+\epsilon$.
- If $\hat{A}_ t<0$, PPO wants to decrease the probability of $a_ t$, but the objective stops rewarding decreases once $\rho_ t(\theta)$ falls below $1-\epsilon$.
This is why the objective uses a minimum of unclipped and clipped terms. The clipped objective removes the incentive to push probability ratios farther once the update is already too large in the useful direction.
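The four sign and direction cases can be checked numerically on a single sample. This is a small sketch of the per-sample clipped term only; the function name and the default $\epsilon=0.2$ are illustrative assumptions.

```python
def clipped_term(ratio, advantage, eps=0.2):
    """Per-sample PPO clipped surrogate: min(rho * A, clip(rho, 1-eps, 1+eps) * A)."""
    clipped_ratio = max(1.0 - eps, min(ratio, 1.0 + eps))
    return min(ratio * advantage, clipped_ratio * advantage)


cases = [
    (1.5, +1.0),  # A > 0, ratio already above 1 + eps: value is capped
    (0.5, +1.0),  # A > 0, ratio too low: unclipped, raising the ratio still helps
    (0.5, -1.0),  # A < 0, ratio already below 1 - eps: value is capped
    (1.5, -1.0),  # A < 0, ratio too high: unclipped, lowering the ratio still helps
]
for ratio, adv in cases:
    print(ratio, adv, clipped_term(ratio, adv))
```

In the capped cases the term no longer depends on the ratio, so its gradient with respect to $\theta$ is zero. In the uncapped cases the policy has moved in the harmful direction, and the minimum keeps the full penalty so the update can still correct it.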
A practical PPO update usually includes three losses:
$$ \mathcal{L}(\theta,\phi)= -L^{\mathrm{PPO}}(\theta) +c_ v\mathbb{E}_ t[(V_ \phi(s_ t)-y_ t)^2] -c_ e\mathbb{E}_ t[\mathcal{H}(\pi_ \theta(\cdot\mid s_ t))]. $$
The value loss trains the critic, and the entropy bonus discourages premature collapse to a near-deterministic policy. PPO is often called on-policy because after several epochs on the current batch, it discards that batch and collects fresh rollouts. The clipping only makes short reuse tolerable; it does not make arbitrary old replay data safe.
SAC is a "more off-policy" actor-critic algorithm. It uses a replay buffer and optimizes a maximum-entropy objective:
$$ \mathbb{E}\left[\sum_ t \gamma^t \left(r(s_ t,a_ t)+\alpha\mathcal{H}(\pi(\cdot\mid s_ t))\right) \right], $$
encouraging both high reward and exploration. SAC is usually more data efficient than PPO but can be harder to tune.
A common soft Bellman target is
$$ y=r+\gamma \mathbb{E}_ {a'\sim\pi_ \theta(\cdot\mid s')} \left[ Q_ {\bar{\phi}}(s',a')-\alpha\log\pi_ \theta(a'\mid s') \right]. $$
The actor is updated to maximize the soft Q-value while keeping entropy high:
$$ \max_ \theta \mathbb{E}_ {s\sim\mathcal{D},a\sim\pi_ \theta} \left[ Q_ \phi(s,a)-\alpha\log\pi_ \theta(a\mid s) \right]. $$
Many SAC implementations use two Q-functions and take the minimum in the target to reduce overestimation. The temperature $\alpha$ controls the reward-exploration tradeoff; larger $\alpha$ favors higher entropy.
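A single-sample version of the soft Bellman target with the two-critic minimum might look like the sketch below. The function and argument names are hypothetical, the `done` flag is an added assumption for terminal transitions, and the inputs stand in for target-critic outputs and the log-probability of an action $a'$ freshly sampled from the current policy at $s'$.

```python
def soft_target(reward, done, q1_next, q2_next, logp_next, gamma, alpha):
    """y = r + gamma * (min(Q1, Q2)(s', a') - alpha * log pi(a'|s')).

    q1_next, q2_next: the two target critics evaluated at (s', a').
    logp_next:        log pi_theta(a' | s') for the sampled next action a'.
    """
    soft_value = min(q1_next, q2_next) - alpha * logp_next
    return reward + gamma * (0.0 if done else soft_value)


y = soft_target(
    reward=1.0, done=False,
    q1_next=3.0, q2_next=2.5,   # hypothetical target-critic outputs
    logp_next=-1.0,             # hypothetical log-probability of a'
    gamma=0.9, alpha=0.2,
)
print(y)
```

Taking the minimum makes the target pessimistic when the two critics disagree, and the $-\alpha\log\pi$ term raises the target for actions the policy considers unlikely, which is how the entropy bonus enters the critic.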
The soft value function corresponding to SAC is
$$ V^\pi(s)= \mathbb{E}_ {a\sim\pi(\cdot\mid s)} \left[ Q^\pi(s,a)-\alpha\log\pi(a\mid s) \right]. $$
This makes the soft Bellman backup:
$$ Q^\pi(s,a)= r(s,a)+\gamma \mathbb{E}_ {s'\sim p} [V^\pi(s')]. $$
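For a small discrete problem the soft value and the soft backup can be evaluated directly. The sketch below assumes a tabular setting with a known next-state distribution, which is an assumption made for illustration; the function names and numbers are hypothetical.

```python
import math


def soft_value(q_values, probs, alpha):
    """V(s) = sum_a pi(a|s) * (Q(s,a) - alpha * log pi(a|s)), discrete actions."""
    return sum(p * (q - alpha * math.log(p))
               for q, p in zip(q_values, probs) if p > 0.0)


def soft_q_backup(reward, next_soft_values, next_state_probs, gamma):
    """Q(s,a) = r(s,a) + gamma * E_{s' ~ p}[V(s')] for a known p(s'|s,a)."""
    expected_v = sum(p * v for p, v in zip(next_state_probs, next_soft_values))
    return reward + gamma * expected_v


q = [1.0, 1.0]  # two actions with equal Q-value (toy numbers)
print(soft_value(q, [0.5, 0.5], alpha=1.0))    # uniform policy
print(soft_value(q, [0.99, 0.01], alpha=1.0))  # nearly deterministic policy
print(soft_q_backup(0.5, [soft_value(q, [0.5, 0.5], 1.0)], [1.0], gamma=0.9))
```

When the Q-values tie, the uniform policy has the larger soft value because it earns the full entropy bonus; with `alpha=0.0` the soft value reduces to the ordinary expected Q-value.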
The entropy term has two roles. It improves exploration during data collection, and it prevents the actor from putting all mass on a single action when several actions have similar Q-values. In continuous control, SAC often uses the reparameterization trick $a=f_ \theta(\epsilon,s)$ so the actor gradient can flow through the sampled action into $Q_ \phi(s,a)$.
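The reparameterization idea can be illustrated without an autodiff library by choosing $f_ \theta$ to be a one-dimensional Gaussian sampler, $a=\mu+\sigma\epsilon$, and using a critic with a known derivative. Everything in this sketch is an assumption for illustration: the quadratic stand-in critic, the state-independent parameters `mu` and `sigma`, and the omission of the entropy term from the actor objective.

```python
import random


def q_value(action, best_action=1.0):
    """Hypothetical differentiable critic: a quadratic peaked at best_action."""
    return -(action - best_action) ** 2


def dq_daction(action, best_action=1.0):
    """Analytic derivative of the stand-in critic with respect to the action."""
    return -2.0 * (action - best_action)


def average_q(mu, sigma, noises):
    """Monte Carlo estimate of E[Q(a)] with a = mu + sigma * eps."""
    return sum(q_value(mu + sigma * eps) for eps in noises) / len(noises)


def reparam_gradients(mu, sigma, noises):
    """Pathwise gradient of average_q with respect to mu and sigma."""
    g_mu = g_sigma = 0.0
    for eps in noises:
        action = mu + sigma * eps   # a = f_theta(eps): noise is an input
        dq = dq_daction(action)     # gradient flows through the action
        g_mu += dq * 1.0            # da/dmu = 1
        g_sigma += dq * eps         # da/dsigma = eps
    n = len(noises)
    return g_mu / n, g_sigma / n


rng = random.Random(0)
noises = [rng.gauss(0.0, 1.0) for _ in range(1000)]
print(reparam_gradients(mu=0.0, sigma=0.5, noises=noises))
```

Because the randomness enters only through `eps`, the sampled action is a deterministic function of the parameters, and the chain rule through the critic replaces the score-function estimator. In practice an autodiff framework computes the same chain rule through a neural critic.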
Deterministic Policy Gradients
For continuous actions, a deterministic actor $\mu_ \theta(s)$ can be trained directly through a differentiable critic:
$$ \nabla_ \theta J(\theta) \approx \mathbb{E}_ {s\sim\mathcal{D}} \left[ \nabla_ a Q_ \phi(s,a)\vert_ {a=\mu_ \theta(s)} \nabla_ \theta \mu_ \theta(s) \right]. $$
This is the deterministic policy gradient idea behind algorithms such as DDPG and TD3. It avoids the score-function estimator but relies heavily on critic accuracy. If the critic is wrong or overestimated for actions near $\mu_ \theta(s)$, the actor follows that error directly. Stochastic methods such as SAC are often more robust because entropy and sampling reduce brittle point-estimate exploitation.
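A toy one-dimensional example shows both the chain rule and the dependence on critic accuracy. The linear actor $\mu_ \theta(s)=\theta s$, the quadratic critic $Q(s,a)=-(a-c\,s)^2$ with peak slope `critic_slope`, and all numbers below are assumptions chosen so the gradients can be written by hand; this is a sketch, not DDPG or TD3.

```python
def dpg_step(theta, states, lr, critic_slope=2.0):
    """One deterministic policy gradient ascent step for mu_theta(s) = theta * s.

    The stand-in critic is Q(s, a) = -(a - critic_slope * s)^2, so it believes
    the best action in state s is critic_slope * s.
    """
    grad = 0.0
    for s in states:
        a = theta * s                         # a = mu_theta(s)
        dq_da = -2.0 * (a - critic_slope * s) # grad_a Q(s, a) at a = mu_theta(s)
        dmu_dtheta = s                        # grad_theta mu_theta(s)
        grad += dq_da * dmu_dtheta
    return theta + lr * grad / len(states)


states = [0.5, 1.0, -1.5]  # states sampled from a replay buffer (toy values)

theta = 0.0
for _ in range(200):
    theta = dpg_step(theta, states, lr=0.1)
print(theta)  # approaches the critic's preferred slope, 2.0

theta_biased = 0.0
for _ in range(200):
    theta_biased = dpg_step(theta_biased, states, lr=0.1, critic_slope=3.0)
print(theta_biased)  # follows the shifted critic toward 3.0
```

The actor converges to whatever the critic claims is best. If the critic's peak is misplaced, as in the second run, the actor inherits that error exactly.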
PPO vs. SAC
PPO and SAC are both actor-critic methods, but they occupy different points in the data-efficiency/stability tradeoff.
| Property | PPO | SAC |
|---|---|---|
| Data use | Mostly on-policy, short reuse of recent batches | Off-policy replay buffer |
| Policy constraint | Clipping or KL penalty | Entropy regularization and replay-based critic |
| Strength | Stable, simple default | Data-efficient continuous control |
| Weakness | Uses lots of fresh interaction | More moving parts and critic sensitivity |
| Typical action spaces | Discrete or continuous | Usually continuous |
Practical Advice from the Slides
- PPO and variants: choose when stability and ease of use matter more than data efficiency.
- SAC and variants: choose when online data is expensive and you can tolerate more tuning.
- Imitation learning can seed both: initialize the PPO policy or add demonstrations to the SAC replay buffer.
- Actor-critic methods rely on value estimates; poor critics can mislead the actor.