2. Policy Gradients

Shurui Liu

From Imitation to Trial and Error

Policy gradients are the first online RL algorithm family in the course. The agent repeatedly:

  1. Runs the current policy to collect trajectories.
  2. Estimates which actions or trajectories were good.
  3. Updates the policy to increase the probability of good behavior and decrease the probability of bad behavior.

Unlike imitation learning, policy gradients can improve from practice, assuming rewards are available and online data collection is feasible.
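The three steps above can be sketched on a toy problem. The following is a minimal illustration, not the course's reference implementation: it assumes a hypothetical three-armed bandit (one-step episodes), a softmax policy over logits, and a batch-mean baseline. The names `train_bandit`, `arm_means`, `lr`, and `batch_size` are made up for this example.

```python
import math
import random

def softmax(logits):
    # Subtract the max for numerical stability.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def train_bandit(arm_means, steps=300, batch_size=64, lr=0.5, seed=0):
    """Toy policy-gradient loop; every hyperparameter here is an assumption."""
    rng = random.Random(seed)
    n_arms = len(arm_means)
    logits = [0.0] * n_arms
    for _ in range(steps):
        probs = softmax(logits)
        # 1. Run the current policy to collect (one-step) trajectories.
        actions = rng.choices(range(n_arms), weights=probs, k=batch_size)
        rewards = [arm_means[a] + rng.gauss(0.0, 1.0) for a in actions]
        # 2. Estimate which actions were good: reward relative to the batch mean.
        baseline = sum(rewards) / batch_size
        # 3. Gradient ascent on the logits using score * (reward - baseline).
        grad = [0.0] * n_arms
        for a, r in zip(actions, rewards):
            for k in range(n_arms):
                score = (1.0 if k == a else 0.0) - probs[k]  # d log pi(a) / d logit_k
                grad[k] += score * (r - baseline) / batch_size
        logits = [x + lr * g for x, g in zip(logits, grad)]
    return softmax(logits)

final_probs = train_bandit([0.0, 0.5, 1.0])
print(final_probs)  # most probability mass should end up on the last arm
```

Each iteration discards its batch after one update, which is what makes the loop on-policy and also what makes it data hungry.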

Log-Derivative Trick

The objective is:

$$ J(\theta)=\mathbb{E}_ {\tau\sim p_ \theta(\tau)}[R(\tau)], \qquad R(\tau)=\sum_ {t=1}^T r(s_ t,a_ t). $$

Using the identity

$$ \nabla_ \theta p_ \theta(\tau) =p_ \theta(\tau)\nabla_ \theta\log p_ \theta(\tau), $$

we get:

$$ \nabla_ \theta J(\theta) = \mathbb{E}_ {\tau\sim p_ \theta} \left[ \nabla_ \theta\log p_ \theta(\tau)R(\tau) \right]. $$

The environment dynamics do not depend on $\theta$, so:

$$ \nabla_ \theta\log p_ \theta(\tau) = \sum_ {t=1}^T \nabla_ \theta\log\pi_ \theta(a_ t\mid s_ t). $$
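To spell out that step, assume the usual Markov factorization of the trajectory distribution (the initial-state distribution $p(s_ 1)$ and transition model $p(s_ {t+1}\mid s_ t,a_ t)$ are named here for illustration only):

$$ \log p_ \theta(\tau) = \log p(s_ 1) + \sum_ {t=1}^T \log\pi_ \theta(a_ t\mid s_ t) + \sum_ {t=1}^T \log p(s_ {t+1}\mid s_ t,a_ t). $$

The first and third terms do not contain $\theta$, so their gradients vanish and only the policy terms remain.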

Therefore:

$$ \nabla_ \theta J(\theta) = \mathbb{E}_ {\tau\sim p_ \theta} \left[ \sum_ {t=1}^T \nabla_ \theta\log\pi_ \theta(a_ t\mid s_ t) R(\tau) \right]. $$

This is REINFORCE, also called vanilla policy gradient.

Policy Gradient Theorem

The trajectory derivation is useful for implementation, but the policy-gradient theorem gives the cleaner conceptual form:

$$ \nabla_ \theta J(\theta)= \frac{1}{1-\gamma} \mathbb{E}_ {s\sim d^{\pi_ \theta},\,a\sim\pi_ \theta(\cdot\mid s)} \left[ \nabla_ \theta\log\pi_ \theta(a\mid s)Q^{\pi_ \theta}(s,a) \right], $$

for the discounted infinite-horizon case. With normalized $d^{\pi_ \theta}$, the factor $1/(1-\gamma)$ appears; with unnormalized occupancy, it is absorbed into the measure.
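The notes do not write out $d^{\pi_ \theta}$. One common convention, stated here as an assumption and using the same time indexing $t=1,2,\dots$ as above, is the normalized discounted state occupancy:

$$ d^{\pi_ \theta}(s) = (1-\gamma)\sum_ {t=1}^{\infty}\gamma^{t-1}\Pr(s_ t=s\mid\pi_ \theta). $$

The factor $(1-\gamma)$ makes $d^{\pi_ \theta}$ sum to one over states. Dropping it gives the unnormalized occupancy, which is where the $1/(1-\gamma)$ in the theorem goes.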

This theorem hides the derivative of the state distribution. Changing $\theta$ changes which states the policy visits, but the theorem says the gradient can still be written using the score of the action probabilities and the action value. This is why the sampled estimator only differentiates $\log\pi_ \theta(a\mid s)$ and not the environment dynamics.

Replacing $Q^{\pi_ \theta}(s,a)$ with an advantage is also valid:

$$ \nabla_ \theta J(\theta)= \frac{1}{1-\gamma} \mathbb{E}_ {s,a} \left[ \nabla_ \theta\log\pi_ \theta(a\mid s)A^{\pi_ \theta}(s,a) \right], $$

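because $A^{\pi_ \theta}(s,a)=Q^{\pi_ \theta}(s,a)-V^{\pi_ \theta}(s)$, and the subtracted state-only term $V^{\pi_ \theta}(s)$ contributes zero in expectation over actions.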

Monte Carlo Estimator

With $N$ sampled trajectories:

$$ \widehat{\nabla_ \theta J} = \frac{1}{N} \sum_ {i=1}^N \sum_ {t=1}^T \nabla_ \theta\log\pi_ \theta(a_ {i,t}\mid s_ {i,t}) R(\tau_ i). $$

The intuition matches the slides: shift probability mass toward actions that occurred in higher-return trajectories. With a baseline or normalized advantages, actions whose outcomes are worse than expected get negative weights and are explicitly pushed down.

This estimator is unbiased but usually high variance. A single trajectory return is a noisy estimate of how good every action in that trajectory was, and the same reward may be assigned to many unrelated earlier actions. Most practical improvements to policy gradients reduce this variance without changing the expected gradient.
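The unbiasedness claim can be checked numerically in a case where the exact gradient is known. The sketch below assumes a one-step problem ($T=1$) with a softmax policy over three actions and deterministic rewards. The helper names `exact_gradient` and `reinforce_estimate` are hypothetical.

```python
import math
import random

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def exact_gradient(logits, rewards):
    # For a softmax policy, dJ/dlogit_k = p_k * (r_k - J) with J = sum_a p_a r_a.
    probs = softmax(logits)
    j = sum(p * r for p, r in zip(probs, rewards))
    return [p * (r - j) for p, r in zip(probs, rewards)]

def reinforce_estimate(logits, rewards, n_samples, rng):
    # Monte Carlo average of score(a) * R over sampled actions.
    probs = softmax(logits)
    n = len(logits)
    actions = rng.choices(range(n), weights=probs, k=n_samples)
    grad = [0.0] * n
    for a in actions:
        for k in range(n):
            score = (1.0 if k == a else 0.0) - probs[k]
            grad[k] += score * rewards[a] / n_samples
    return grad

logits = [0.5, 0.0, -0.5]
rewards = [1.0, 2.0, 3.0]
rng = random.Random(0)

exact = exact_gradient(logits, rewards)
estimate = reinforce_estimate(logits, rewards, 50_000, rng)
noisy = reinforce_estimate(logits, rewards, 10, rng)

print("exact   ", exact)
print("N=50000 ", estimate)
print("N=10    ", noisy)  # typically much further from the exact gradient
```

The large-sample estimate should sit close to the exact gradient, while the ten-sample estimate illustrates how noisy a small batch can be.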

Causality: Reward-to-Go

An action at time $t$ cannot affect rewards before $t$. So instead of weighting each action by full trajectory reward, use reward-to-go:

$$ G_ t=\sum_ {t'=t}^{T}\gamma^{t'-t}r(s_ {t'},a_ {t'}). $$

The gradient estimator becomes:

$$ \widehat{\nabla_ \theta J} = \frac{1}{N} \sum_ {i=1}^N \sum_ {t=1}^T \nabla_ \theta\log\pi_ \theta(a_ {i,t}\mid s_ {i,t}) G_ {i,t}. $$

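With $\gamma=1$ this remains unbiased for the undiscounted objective above and typically has lower variance than using the full return. Setting $\gamma<1$ additionally down-weights distant rewards, which reduces variance further at the cost of some bias relative to that objective.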

The formal reason is that rewards before time $t$ are already determined by the history up to $s_ t$, while the score $\nabla_ \theta\log\pi_ \theta(a_ t\mid s_ t)$ has zero mean under $a_ t\sim\pi_ \theta(\cdot\mid s_ t)$. In expectation, terms like

$$ \nabla_ \theta \log \pi_ \theta(a_ t\mid s_ t) \sum_ {t'<t} r(s_ {t'},a_ {t'}) $$

average to zero, so dropping past rewards does not bias the gradient.
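The reward-to-go weights can be computed in a single backward pass over a trajectory's rewards. This is a small sketch. The function name `reward_to_go` and the list-of-floats input format are assumptions made for illustration.

```python
def reward_to_go(rewards, gamma=1.0):
    """Return [G_1, ..., G_T] where G_t = sum_{t' >= t} gamma^(t'-t) * r_t'."""
    out = [0.0] * len(rewards)
    running = 0.0
    # Walk backward so each G_t reuses G_{t+1}: G_t = r_t + gamma * G_{t+1}.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        out[t] = running
    return out

print(reward_to_go([1.0, 2.0, 3.0]))             # [6.0, 5.0, 3.0]
print(reward_to_go([1.0, 2.0, 3.0], gamma=0.5))  # [2.75, 3.5, 3.0]
```

The backward recursion costs $O(T)$ per trajectory, whereas evaluating the sum from scratch at every step would cost $O(T^2)$.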

Baselines

Subtracting a baseline that does not depend on the sampled action keeps the estimator unbiased:

$$ \mathbb{E}_ {a\sim\pi_ \theta(\cdot\mid s)} \left[\nabla_ \theta\log\pi_ \theta(a\mid s)b(s)\right]=0. $$
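The identity above follows in a few lines. Written for a discrete action space (for continuous actions, replace the sum with an integral):

$$ \mathbb{E}_ {a\sim\pi_ \theta(\cdot\mid s)} \left[\nabla_ \theta\log\pi_ \theta(a\mid s)b(s)\right] = b(s)\sum_ a \pi_ \theta(a\mid s)\nabla_ \theta\log\pi_ \theta(a\mid s) = b(s)\sum_ a \nabla_ \theta\pi_ \theta(a\mid s) = b(s)\nabla_ \theta\sum_ a \pi_ \theta(a\mid s) = b(s)\nabla_ \theta 1 = 0. $$

The step that pulls $b(s)$ out of the sum is exactly where independence from the sampled action is needed.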

A common baseline is the value function $V^\pi(s_ t)$:

$$ \widehat{\nabla_ \theta J} = \frac{1}{N} \sum_ {i,t} \nabla_ \theta\log\pi_ \theta(a_ {i,t}\mid s_ {i,t}) \left(G_ {i,t}-V^\pi(s_ {i,t})\right). $$

The term in parentheses estimates the advantage. Baselines reduce variance and make the update depend on whether an action was better or worse than expected from that state.

The baseline must not depend on the sampled action $a_ t$. It may depend on the state, time step, or trajectory prefix. A state-value baseline is natural because it asks: was this action better than the average action the current policy would take in this state?
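As a simple stand-in for a learned $V^\pi$, the sketch below uses a baseline that depends only on the time step: the batch average of reward-to-go at each $t$. It assumes all trajectories have the same length, and the name `advantages_with_time_baseline` is hypothetical. Estimating the baseline from the same batch couples it weakly to the sampled actions, so treat this as an approximation.

```python
def advantages_with_time_baseline(returns_batch):
    """returns_batch[i][t] holds the reward-to-go G_{i,t}.

    Assumes every trajectory has the same horizon (illustrative simplification).
    """
    n = len(returns_batch)
    horizon = len(returns_batch[0])
    # Baseline b_t: average reward-to-go at time step t across the batch.
    baseline = [sum(traj[t] for traj in returns_batch) / n for t in range(horizon)]
    # Advantage estimate: how much better than the time-step average this was.
    return [[traj[t] - baseline[t] for t in range(horizon)] for traj in returns_batch]

batch = [[6.0, 5.0, 3.0], [2.0, 1.0, 1.0]]
print(advantages_with_time_baseline(batch))
# [[2.0, 2.0, 1.0], [-2.0, -2.0, -1.0]]
```

A learned critic would replace `baseline[t]` with a prediction that also depends on the state.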

There is a mathematically optimal constant baseline for variance reduction, but in practice $V^\pi(s)$ is the useful approximation because it changes across states.

The variance-minimizing constant baseline is not exactly the average return unless the score-function gradient magnitudes are constant. It is an average of returns weighted by the squared score norm $\|\nabla_ \theta\log\pi_ \theta(a\mid s)\|^2$. It is rarely used directly, but it shows that a learned value baseline is a practical approximation rather than the statistically optimal variance-reduction term.
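The gap between the average return and the variance-minimizing constant can be seen exactly in a small case. The sketch assumes a one-step problem with a softmax policy over three actions and deterministic rewards, and measures variance as the trace of the estimator's covariance. The function names are hypothetical.

```python
def score(probs, a):
    # Gradient of log softmax(a) with respect to the logits.
    return [(1.0 if k == a else 0.0) - p for k, p in enumerate(probs)]

def estimator_variance(probs, rewards, b):
    """Trace of Cov[score(a) * (r(a) - b)] under a ~ probs, computed exactly."""
    n = len(probs)
    second_moment = 0.0
    mean = [0.0] * n
    for a in range(n):
        g = score(probs, a)
        w = rewards[a] - b
        second_moment += probs[a] * sum(x * x for x in g) * w * w
        for k in range(n):
            mean[k] += probs[a] * g[k] * w
    return second_moment - sum(m * m for m in mean)

def optimal_constant_baseline(probs, rewards):
    # Returns weighted by the squared score norm, normalized by the total weight.
    num = den = 0.0
    for a in range(len(probs)):
        sq = sum(x * x for x in score(probs, a))
        num += probs[a] * sq * rewards[a]
        den += probs[a] * sq
    return num / den

probs = [0.7, 0.2, 0.1]
rewards = [1.0, 2.0, 5.0]
mean_return = sum(p * r for p, r in zip(probs, rewards))
b_star = optimal_constant_baseline(probs, rewards)

for name, b in [("none", 0.0), ("mean return", mean_return), ("weighted", b_star)]:
    print(f"{name:12s} b={b:.4f} variance={estimator_variance(probs, rewards, b):.4f}")
```

In this example the weighted baseline differs from the mean return because rare actions have larger score norms and therefore count for more.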

In code, advantage normalization is common:

$$ \tilde{A}_ t=\frac{\hat{A}_ t-\operatorname{mean}(\hat{A})} {\operatorname{std}(\hat{A})+\epsilon}. $$

This changes the scale of the gradient and can introduce small finite-sample effects, but it usually improves neural-network optimization by keeping update magnitudes well conditioned.
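A direct transcription of the normalization formula is shown below. It is a sketch that assumes the population standard deviation (dividing by the batch size) and a default `eps` of `1e-8`. Both are common choices but are not specified in the notes.

```python
import math

def normalize_advantages(advantages, eps=1e-8):
    """Shift to zero mean and scale to roughly unit standard deviation."""
    n = len(advantages)
    mean = sum(advantages) / n
    # Population standard deviation (divide by n); an assumption of this sketch.
    std = math.sqrt(sum((a - mean) ** 2 for a in advantages) / n)
    return [(a - mean) / (std + eps) for a in advantages]

normalized = normalize_advantages([1.0, 2.0, 3.0, 4.0])
print(normalized)
```

The `eps` term only matters when all advantages in the batch are nearly identical, where it prevents division by zero.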

Surrogate Objective for Implementation

Autodiff should not be applied to the sampled rollout process as if actions and rewards were differentiable through the environment. Instead, implement a surrogate loss whose gradient equals the policy-gradient estimator:

$$ \mathcal{L}_ {\mathrm{PG}}(\theta)= -\frac{1}{N}\sum_ {i,t} \log\pi_ \theta(a_ {i,t}\mid s_ {i,t})\hat{A}_ {i,t}. $$

The action samples, returns, and advantages are treated as constants for this actor update. For a discrete policy, $-\log\pi_ \theta(a\mid s)\hat{A}$ is a weighted cross-entropy term. For a Gaussian policy, it is a weighted negative log likelihood of the sampled continuous action.

This is the implementation meaning of "do more good stuff, less bad stuff": if $\hat{A}_ t>0$, gradient descent on $\mathcal{L}_ {\mathrm{PG}}$ increases the log probability of the sampled action; if $\hat{A}_ t<0$, it decreases it.
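To make the surrogate concrete without an autodiff library, the sketch below uses a state-independent softmax policy over logits, writes the surrogate loss and its hand-derived gradient, and checks the gradient against finite differences. The single-state policy and the function names are assumptions for illustration. In an autodiff framework the same loss would be written once and the advantages passed in as constants.

```python
import math

def log_softmax(logits):
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def surrogate_loss(logits, actions, advantages):
    # L_PG = -(1/N) * sum_i log pi(a_i) * A_i, with A_i treated as constants.
    logp = log_softmax(logits)
    return -sum(logp[a] * adv for a, adv in zip(actions, advantages)) / len(actions)

def surrogate_grad(logits, actions, advantages):
    # Analytic gradient: -(1/N) * sum_i A_i * (onehot(a_i) - softmax(logits)).
    probs = [math.exp(lp) for lp in log_softmax(logits)]
    grad = [0.0] * len(logits)
    for a, adv in zip(actions, advantages):
        for k in range(len(logits)):
            grad[k] -= ((1.0 if k == a else 0.0) - probs[k]) * adv / len(actions)
    return grad

def finite_difference_grad(f, x, h=1e-6):
    # Central differences, used only to sanity-check the analytic gradient.
    grad = []
    for k in range(len(x)):
        up = x[:k] + [x[k] + h] + x[k + 1:]
        down = x[:k] + [x[k] - h] + x[k + 1:]
        grad.append((f(up) - f(down)) / (2 * h))
    return grad

actions = [0, 1, 2, 0]
advantages = [1.0, -0.5, 2.0, 0.5]

# One gradient-descent step on the surrogate, starting from a uniform policy.
logits = [0.0, 0.0, 0.0]
g = surrogate_grad(logits, actions, advantages)
new_logits = [x - 0.1 * gk for x, gk in zip(logits, g)]
old_logp, new_logp = log_softmax(logits), log_softmax(new_logits)
print("log-prob change per action:", [n - o for n, o in zip(new_logp, old_logp)])
```

After the step, the action with the largest positive advantage gains log probability and the action with a negative advantage loses it.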

Off-Policy Policy Gradients and Importance Sampling

On-policy policy gradients use data from the same policy being updated. To use data from an older behavior policy $\pi_ {\theta_ {\text{old}}}$, importance ratios correct the distribution mismatch:

$$ \rho_ t(\theta)= \frac{\pi_ \theta(a_ t\mid s_ t)} {\pi_ {\theta_ {\text{old}}}(a_ t\mid s_ t)}. $$

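A typical off-policy gradient estimate, which is the gradient of the surrogate $\mathbb{E}[\rho_ t(\theta)\hat{A}_ t]$ under data from $\pi_ {\theta_ {\text{old}}}$, is: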

$$ \mathbb{E}_ {(s_ t,a_ t)\sim \pi_ {\theta_ {\text{old}}}} \left[ \rho_ t(\theta)\nabla_ \theta\log\pi_ \theta(a_ t\mid s_ t)\hat{A}_ t \right]. $$

This is a local correction, not a license to use arbitrary stale data. It corrects the sampled action probability at visited states, but the state distribution itself can also shift. Large policy changes make importance weights high variance and make state visitation mismatch worse. This motivates KL constraints or clipped objectives, which reappear in PPO.

Full trajectory importance sampling would multiply ratios across many time steps:

$$ \prod_ {t=1}^T \frac{\pi_ \theta(a_ t\mid s_ t)} {\pi_ {\theta_ {\text{old}}}(a_ t\mid s_ t)}. $$

This is usually too high variance. Practical algorithms use per-decision ratios and keep the new policy close to the data-collecting policy, so the correction is only trusted locally.
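The difference between per-decision ratios and the full trajectory product is easy to see numerically. The sketch below assumes log probabilities are stored per time step. The function names and the 0.50 versus 0.55 action probabilities are made up for illustration.

```python
import math

def per_decision_ratios(logp_new, logp_old):
    # rho_t = pi_theta(a_t | s_t) / pi_old(a_t | s_t), computed in log space.
    return [math.exp(n - o) for n, o in zip(logp_new, logp_old)]

def trajectory_ratio(logp_new, logp_old):
    # Product of all per-step ratios, i.e. exp of the summed log-ratio.
    return math.exp(sum(logp_new) - sum(logp_old))

horizon = 50
logp_old = [math.log(0.50)] * horizon  # old policy gave each taken action prob 0.50
logp_new = [math.log(0.55)] * horizon  # new policy gives it prob 0.55

ratios = per_decision_ratios(logp_new, logp_old)
product = trajectory_ratio(logp_new, logp_old)
print(ratios[0])  # each step is reweighted by about 1.1
print(product)    # the 50-step product exceeds 100
```

A 10% per-step change in action probability is a modest policy update, yet the trajectory-level weight grows exponentially with the horizon.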

Practical Limitations

Policy gradients are conceptually simple but noisy. They usually need:

  • Large batches of fresh data.
  • Dense enough rewards to distinguish useful behavior.
  • Baselines or learned critics.
  • Careful control of policy-update size.

They are most attractive when stability and simplicity are more important than data efficiency.

Implementation checklist:

  • Sample trajectories with the same policy whose gradient is being estimated, unless using importance weights.
  • Compute reward-to-go or advantages, not just total episode reward.
  • Normalize advantages within a batch when training neural networks; this changes scaling but usually improves optimization.
  • Do not backpropagate the policy loss through the sampled actions or through the Monte Carlo returns.
  • Use enough stochasticity for exploration; deterministic policies cannot use the score-function estimator in the same simple form.