4. Q-Learning and Value-Based RL

Shurui Liu

From Actor-Critic to Critic Only

If we have an accurate $Q^\pi(s,a)$ for policy $\pi$, we can improve the policy by choosing, in each state, the action with the largest Q-value:

$$ \pi_ {\text{new}}(s)=\arg\max_ a Q^\pi(s,a). $$

Q-learning removes the explicit actor and tries to directly learn the optimal Q-function:

$$ Q^*(s,a)=\max_ \pi Q^\pi(s,a). $$

The Bellman optimality equation is:

$$ Q^*(s,a) = r(s,a)+\gamma \mathbb{E}_ {s'\sim p(\cdot\mid s,a)} \left[\max_ {a'}Q^*(s',a')\right]. $$

The learned policy is greedy:

$$ \pi(s)=\arg\max_ a Q_ \phi(s,a). $$

In the exact tabular setting, the Q-learning target comes from the Bellman optimality operator:

$$ (\mathcal{B}^*Q)(s,a)= r(s,a)+\gamma \mathbb{E}_ {s'\sim p(\cdot\mid s,a)} \left[\max_ {a'}Q(s',a')\right]. $$
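When the model is known, the operator can be applied exactly to a table. The following is a minimal sketch, assuming a hypothetical two-state, two-action MDP; the arrays `rewards` and `transitions` and the function name are illustrative, not from the source.

```python
import numpy as np

def bellman_optimality_backup(q, rewards, transitions, gamma):
    """(B* Q)(s, a) = r(s, a) + gamma * sum_s' p(s'|s, a) * max_a' Q(s', a')."""
    next_values = q.max(axis=1)                      # max_a' Q(s', a'), shape (S,)
    return rewards + gamma * transitions @ next_values

# Hypothetical MDP: action 0 stays, action 1 switches state.
# Staying in state 1 pays reward 1; everything else pays 0.
rewards = np.array([[0.0, 0.0],
                    [1.0, 0.0]])
transitions = np.array([[[1.0, 0.0], [0.0, 1.0]],    # p(. | s=0, a=0), p(. | s=0, a=1)
                        [[0.0, 1.0], [1.0, 0.0]]])   # p(. | s=1, a=0), p(. | s=1, a=1)

q = np.zeros((2, 2))
for _ in range(100):                                 # repeated backups converge to Q*
    q = bellman_optimality_backup(q, rewards, transitions, gamma=0.5)
# q is now approximately [[0.5, 1.0], [2.0, 0.5]]
```

Because the operator is a $\gamma$-contraction in the max norm, repeated application converges to $Q^*$ from any starting table.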

Fitted Q iteration applies this operator approximately to a dataset:

  1. Hold a previous Q-function fixed for target construction.
  2. Compute targets $y_ i=r_ i+\gamma\max_ {a'}Q_ {\mathrm{old}}(s'_ i,a')$.
  3. Regress a new Q-function toward those targets.
  4. Repeat.
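The four steps above can be sketched for a tabular Q-function, where the regression step reduces to averaging the targets for each $(s,a)$ entry. This is an illustration under assumptions: the dataset, the `done` flag, and the function name are hypothetical.

```python
import numpy as np

def fitted_q_iteration(dataset, n_states, n_actions, gamma=0.9, n_iters=50):
    """Tabular fitted Q iteration on a fixed list of (s, a, r, s_next, done)."""
    q = np.zeros((n_states, n_actions))
    for _ in range(n_iters):
        q_old = q.copy()                      # step 1: freeze the previous Q
        target_sum = np.zeros_like(q)
        counts = np.zeros_like(q)
        for s, a, r, s_next, done in dataset:
            # step 2: bootstrap from the frozen Q; no bootstrap at terminal states
            y = r + (0.0 if done else gamma * q_old[s_next].max())
            target_sum[s, a] += y
            counts[s, a] += 1
        # step 3: least-squares regression on a table is the per-entry mean target
        seen = counts > 0
        q = q_old.copy()
        q[seen] = target_sum[seen] / counts[seen]
    return q                                  # step 4 is the outer loop

# Hypothetical dataset: action 1 moves 0 -> 1, and action 1 in state 1
# ends the episode with reward 1; action 0 always leads to state 0.
dataset = [
    (0, 0, 0.0, 0, False),
    (0, 1, 0.0, 1, False),
    (1, 0, 0.0, 0, False),
    (1, 1, 1.0, 1, True),
]
q = fitted_q_iteration(dataset, n_states=2, n_actions=2)
# q is approximately [[0.81, 0.9], [0.81, 1.0]]
```

Entries that never appear in the dataset keep their previous value, which is one concrete way the need for action coverage shows up.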

Deep Q-learning is fitted Q iteration with neural networks, replay, target networks, and online data collection. The simple tabular equality becomes an optimization problem because the network cannot set every $Q(s,a)$ independently.

Compare three related updates:

  • Policy evaluation: learn $Q^\pi$ using the next action from the same policy $\pi$.
  • SARSA: on-policy TD control, using the actually sampled next action $a'$.
  • Q-learning: off-policy TD control, using $\max_ {a'}Q(s',a')$ regardless of which action the behavior policy took.

SARSA target:

$$ y_ {\mathrm{SARSA}}=r+\gamma Q(s',a'_ {\mathrm{sampled}}). $$

Q-learning target:

$$ y_ {\mathrm{Q}}=r+\gamma\max_ {a'}Q(s',a'). $$

This is why Q-learning can learn about the greedy policy while collecting data with an exploratory behavior policy.
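The difference between the two targets is a single line. A small sketch with a hypothetical Q-table makes it concrete:

```python
import numpy as np

def sarsa_target(q, r, s_next, a_next, gamma):
    """On-policy target: bootstrap from the action actually sampled at s_next."""
    return r + gamma * q[s_next, a_next]

def q_learning_target(q, r, s_next, gamma):
    """Off-policy target: bootstrap from the best action at s_next."""
    return r + gamma * q[s_next].max()

# Hypothetical table; in state 1 the behavior policy happened to sample action 0.
q = np.array([[0.0, 0.0],
              [1.0, 5.0]])
y_sarsa = sarsa_target(q, r=1.0, s_next=1, a_next=0, gamma=0.9)   # 1 + 0.9 * 1 = 1.9
y_q = q_learning_target(q, r=1.0, s_next=1, gamma=0.9)            # 1 + 0.9 * 5 = 5.5
```

The SARSA target depends on what the behavior policy did next; the Q-learning target does not.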

Q-Learning Update

Given transition $(s,a,r,s')$, the target is:

$$ y=r+\gamma\max_ {a'}Q_ {\bar{\phi}}(s',a'). $$

If $s'$ is terminal, the bootstrap term is omitted:

$$ y=r. $$

The critic minimizes:

$$ \min_ \phi \mathbb{E}_ {(s,a,r,s')\sim\mathcal{D}} \left[ (Q_ \phi(s,a)-y)^2 \right]. $$
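For a minibatch, the target and the loss can be written as below. This is a sketch: the `dones` mask (1.0 for terminal next states) and the function names are assumptions used for illustration, and `next_q_values` stands for $Q_ {\bar{\phi}}(s',\cdot)$ evaluated on the batch.

```python
import numpy as np

def td_targets(rewards, next_q_values, dones, gamma):
    """y = r + gamma * max_a' Q(s', a'), with the bootstrap dropped at terminals."""
    bootstrap = next_q_values.max(axis=1)
    return rewards + gamma * (1.0 - dones) * bootstrap

def critic_loss(q_sa, targets):
    """Mean squared error between Q(s, a) and the (constant) targets."""
    return np.mean((q_sa - targets) ** 2)

rewards = np.array([1.0, 2.0])
next_q = np.array([[0.5, 1.5],
                   [3.0, 4.0]])
dones = np.array([0.0, 1.0])                      # second transition is terminal
y = td_targets(rewards, next_q, dones, gamma=0.9)  # [1 + 0.9 * 1.5, 2.0] = [2.35, 2.0]
loss = critic_loss(np.array([2.0, 2.5]), y)        # (0.35**2 + 0.5**2) / 2 = 0.18625
```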

Q-learning is off-policy: the transition action $a$ may come from any behavior policy, while the target uses the greedy action under the current Q-function. Because old transitions can be reused, Q-learning is data efficient, but the same off-policy bootstrapping can make it unstable with neural networks.

In tabular Q-learning, the update is often written as

$$ Q(s,a)\leftarrow Q(s,a)+\alpha \left[ r+\gamma\max_ {a'}Q(s',a')-Q(s,a) \right]. $$

The bracketed term is the TD error. With sufficient exploration and appropriate step-size decay, tabular Q-learning converges to $Q^*$. Deep Q-learning keeps the same target but replaces the table with a neural network, which removes the clean convergence guarantee.
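A minimal sketch of this update on a NumPy table follows; the function name and the `done` flag are illustrative assumptions.

```python
import numpy as np

def q_learning_update(q, s, a, r, s_next, done, alpha, gamma):
    """Apply one tabular Q-learning update in place and return the TD error."""
    target = r + (0.0 if done else gamma * q[s_next].max())
    td_error = target - q[s, a]
    q[s, a] += alpha * td_error       # only this single table entry changes
    return td_error

q = np.zeros((2, 2))
q[1] = [0.0, 2.0]
delta = q_learning_update(q, s=0, a=1, r=1.0, s_next=1, done=False,
                          alpha=0.5, gamma=0.9)
# delta = 1 + 0.9 * 2 - 0 = 2.8, and q[0, 1] becomes 0.5 * 2.8 = 1.4
```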

The standard convergence caveats from the tabular theory are:

  • Every relevant state-action pair must be visited infinitely often.
  • Learning rates should decay with $\sum_ t\alpha_ t=\infty$ and $\sum_ t\alpha_ t^2<\infty$.
  • The representation is exact, so updating one table entry does not accidentally change unrelated entries.
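As a concrete check of the step-size condition (a standard example, not taken from the source), the schedule $\alpha_ t=1/t$ satisfies both requirements:

$$ \sum_ {t=1}^{\infty}\frac{1}{t}=\infty, \qquad \sum_ {t=1}^{\infty}\frac{1}{t^2}=\frac{\pi^2}{6}<\infty. $$

A constant step size $\alpha_ t=\alpha>0$ satisfies the first condition but not the second, since $\sum_ t\alpha^2=\infty$.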

Deep Q-learning violates the last condition by design. One gradient step changes many state-action predictions at once, which is why target networks, replay, normalization, and conservative update sizes matter.

Data Collection

Although Q-learning is off-policy, it still needs action coverage. If the data never tries certain actions in important states, the algorithm cannot accurately compare them. Online Q-learning often collects data with an exploratory policy such as $\epsilon$-greedy:

$$ a= \begin{cases} \text{random action}, & \text{with probability }\epsilon,\\ \arg\max_ a Q(s,a), & \text{otherwise}. \end{cases} $$
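The rule above can be sketched in a few lines; the function name and the use of a NumPy random generator are illustrative choices.

```python
import numpy as np

def epsilon_greedy(q_values, epsilon, rng):
    """Pick a uniformly random action with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))
    return int(np.argmax(q_values))

rng = np.random.default_rng(0)
q_values = np.array([0.1, 0.9, 0.3])
greedy_action = epsilon_greedy(q_values, epsilon=0.0, rng=rng)    # always action 1
random_actions = [epsilon_greedy(q_values, epsilon=1.0, rng=rng) for _ in range(200)]
```

Note that the random branch can also return the greedy action, so the greedy action is chosen with probability $1-\epsilon+\epsilon/|\mathcal{A}|$ under this convention.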

In continuous actions, maximizing over $a'$ is difficult. This is one reason DQN-style Q-learning is most natural for discrete actions, while actor-critic methods such as SAC are common for continuous control.

Exploration is not optional. Q-learning can be off-policy, but it cannot learn values for actions that are never tried or never represented in the dataset. Coverage is the bridge between "off-policy" and "possible to improve."

Stabilizing Deep Q-Learning

Deep Q-learning has a moving-target problem: the target depends on the same network being updated. Target networks mitigate this by using delayed parameters $\bar{\phi}$ in the target:

$$ y=r+\gamma\max_ {a'}Q_ {\bar{\phi}}(s',a'). $$

The target network is updated slowly:

$$ \bar{\phi}\leftarrow \tau\phi+(1-\tau)\bar{\phi}, $$

or copied periodically. This is a core idea in DQN.
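Both update styles can be sketched on a dictionary of parameter arrays. The dictionary layout and function names are assumptions for illustration; a real network library would expose its own parameter containers.

```python
import numpy as np

def soft_update(target_params, online_params, tau):
    """Polyak averaging: target <- tau * online + (1 - tau) * target."""
    for name in target_params:
        target_params[name] = tau * online_params[name] + (1.0 - tau) * target_params[name]

def hard_update(target_params, online_params):
    """Periodic copy: target <- online."""
    for name in target_params:
        target_params[name] = online_params[name].copy()

online = {"w": np.array([1.0, 2.0])}
target = {"w": np.array([0.0, 0.0])}
soft_update(target, online, tau=0.1)      # target["w"] becomes [0.1, 0.2]
```

A small $\tau$ makes the target drift slowly; $\tau=1$ recovers the hard copy.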

Replay buffers also help by reducing correlation between updates and reusing past transitions.

The combination of function approximation, bootstrapping, and off-policy data is often called the deadly triad. DQN works in practice because replay buffers reduce sample correlation and target networks slow the moving target. These stabilizers do not make arbitrary deep Q-learning safe, but they remove the most immediate instabilities.

The critic update is usually a semi-gradient update: the target

$$ y=r+\gamma\max_ {a'}Q_ {\bar{\phi}}(s',a') $$

is treated as a constant when differentiating

$$ (Q_ \phi(s,a)-y)^2. $$

If the same parameters are differentiated through both the prediction and the bootstrap target, the objective no longer corresponds to the usual Bellman-error update and can behave poorly. Detaching the target in code is therefore not an incidental implementation detail; it is part of the algorithm.
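To make the semi-gradient explicit, here is a sketch with a hand-written gradient for a linear Q-function $Q_ w(s,a)=w_ a^\top x(s)$. The linear parameterization, feature vectors, and function name are assumptions chosen so that no autodiff library is needed; in an autodiff framework the same effect comes from stopping gradients through $y$.

```python
import numpy as np

def semi_gradient_step(w, w_target, x, a, r, x_next, done, gamma, lr):
    """One semi-gradient step for a linear Q-function Q_w(s, a) = w[a] @ x(s)."""
    # The target uses separate, frozen weights and is treated as a constant.
    y = r + (0.0 if done else gamma * np.max(w_target @ x_next))
    q_sa = w[a] @ x
    # Gradient of (q_sa - y)^2 with respect to w[a]; nothing flows through y.
    grad = 2.0 * (q_sa - y) * x
    w[a] -= lr * grad
    return y

w = np.zeros((2, 2))                                  # online weights, one row per action
w_target = np.array([[1.0, 0.0],
                     [0.0, 2.0]])                     # frozen target weights
x = np.array([1.0, 0.0])                              # features of s
x_next = np.array([0.0, 1.0])                         # features of s'
y = semi_gradient_step(w, w_target, x, a=0, r=1.0, x_next=x_next,
                       done=False, gamma=0.5, lr=0.25)
# y = 1 + 0.5 * 2 = 2.0; w[0] moves from [0, 0] to [1, 0]; w[1] is untouched
```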

Overestimation and Double Q-Learning

The max operator tends to overestimate values when Q estimates are noisy:

$$ \mathbb{E}[\max_ a \hat{Q}(s,a)] \ge \max_ a \mathbb{E}[\hat{Q}(s,a)]. $$

Double Q-learning separates action selection from action evaluation. In deep double Q-learning:

$$ a^*=\arg\max_ {a'}Q_ \phi(s',a'), $$

$$ y=r+\gamma Q_ {\bar{\phi}}(s',a^*). $$

The online network selects the action, while the target network evaluates it. This reduces maximization bias.
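A batched version of this target can be sketched as follows, assuming `next_q_online` and `next_q_target` hold $Q_ \phi(s',\cdot)$ and $Q_ {\bar{\phi}}(s',\cdot)$ for each transition; the names and the `dones` mask are illustrative.

```python
import numpy as np

def double_q_targets(rewards, next_q_online, next_q_target, dones, gamma):
    """Select a* with the online values, evaluate it with the target values."""
    a_star = next_q_online.argmax(axis=1)
    evaluated = next_q_target[np.arange(len(a_star)), a_star]
    return rewards + gamma * (1.0 - dones) * evaluated

rewards = np.array([0.0, 1.0])
next_q_online = np.array([[1.0, 2.0],
                          [3.0, 0.0]])
next_q_target = np.array([[5.0, 0.5],
                          [1.0, 9.0]])
dones = np.array([0.0, 0.0])
y = double_q_targets(rewards, next_q_online, next_q_target, dones, gamma=0.9)
# a* = [1, 0], so y = [0.9 * 0.5, 1 + 0.9 * 1.0] = [0.45, 1.9]
# The plain max over the target values would instead give [4.5, 9.1].
```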

Overestimation is an expectation effect, not just a bad implementation. If several action values have zero-mean errors, the maximum is likely to select a positive error. This is why separating selection and evaluation helps.
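A small simulation illustrates the effect. It assumes five actions that are all truly worth 0, with independent standard normal estimation errors; these numbers are arbitrary choices for the demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_trials, n_actions = 10_000, 5
true_q = np.zeros(n_actions)                          # every action is truly worth 0
noisy_q = true_q + rng.normal(0.0, 1.0, size=(n_trials, n_actions))

mean_of_max = noisy_q.max(axis=1).mean()              # estimates E[max_a Q_hat(s, a)]
max_of_mean = noisy_q.mean(axis=0).max()              # estimates max_a E[Q_hat(s, a)]

# Double estimator: select with one noisy copy, evaluate with an independent one.
noisy_q_b = true_q + rng.normal(0.0, 1.0, size=(n_trials, n_actions))
selected = noisy_q.argmax(axis=1)
double_estimate = noisy_q_b[np.arange(n_trials), selected].mean()
```

Here `mean_of_max` comes out well above zero (the expected maximum of five standard normals is about 1.16), while `max_of_mean` and `double_estimate` stay close to the true value of 0.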

N-Step Returns

Q-learning can use multi-step targets:

$$ y_ t^{(n)} = \sum_ {k=0}^{n-1}\gamma^k r_ {t+k} +\gamma^n\max_ {a'}Q_ {\bar{\phi}}(s_ {t+n},a'). $$

N-step targets can speed learning and reduce bias early, but they are only exactly correct when the intermediate trajectory follows the target policy. In practice, many algorithms still use them because the empirical benefit is often worth the approximation.

The off-policy subtlety is easy to miss. The one-step target is valid for Q-learning because the target policy only appears in the final $\max_ {a'}Q(s',a')$. In an $n$-step target, the intermediate rewards came from actions selected by the behavior policy. If those actions differ greatly from the greedy target policy, the target is no longer an exact Bellman optimality sample. Common responses are to keep $n$ small, use mostly on-policy recent data, use importance sampling corrections, or accept the approximation.
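The uncorrected $n$-step target can be computed as in the sketch below. The function name and the `terminal` flag are illustrative, and no importance-sampling correction is applied.

```python
import numpy as np

def n_step_target(rewards, bootstrap_q_values, gamma, terminal=False):
    """Sum of n discounted rewards plus gamma^n * max_a' Q(s_{t+n}, a')."""
    n = len(rewards)
    discounts = gamma ** np.arange(n)
    ret = float(np.dot(discounts, rewards))
    if not terminal:                          # drop the bootstrap if s_{t+n} is terminal
        ret += gamma ** n * float(np.max(bootstrap_q_values))
    return ret

y = n_step_target([1.0, 0.0, 2.0], np.array([0.5, 4.0]), gamma=0.5)
# 1 + 0.5 * 0 + 0.25 * 2 + 0.125 * 4 = 2.0
```

With a single reward (`n = 1`) this reduces to the one-step Q-learning target.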

DQN-Style Recipe

A practical DQN loop is:

  1. Act with an exploratory policy, usually $\epsilon$-greedy with respect to $Q_ \phi$.
  2. Store $(s,a,r,s')$ in a replay buffer.
  3. Sample a minibatch from the replay buffer.
  4. Compute targets with a target network:

$$ y_ i=r_ i+\gamma\max_ {a'}Q_ {\bar{\phi}}(s'_ i,a'). $$

  5. Minimize the Bellman error:

$$ \frac{1}{B}\sum_ i (Q_ \phi(s_ i,a_ i)-y_ i)^2. $$

  6. Periodically copy or Polyak-average $\phi$ into $\bar{\phi}$.

The method is value-based because the policy is implicit: act greedily or $\epsilon$-greedily with respect to the learned Q-values.
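The whole recipe fits in a short loop. The sketch below is a toy illustration under assumptions, not a faithful DQN: the "network" is a table, the environment is a hypothetical five-state chain with a reward of 1 at the right end, the replay buffer is an unbounded list, and all hyperparameters are arbitrary.

```python
import numpy as np

N_STATES, N_ACTIONS, GOAL = 5, 2, 4           # hypothetical chain 0..4, goal at state 4

def env_step(s, a):
    """Action 1 moves right, action 0 moves left (clamped at 0)."""
    s_next = min(s + 1, GOAL) if a == 1 else max(s - 1, 0)
    done = s_next == GOAL
    return s_next, (1.0 if done else 0.0), done

def act(q, s, epsilon, rng):
    """Epsilon-greedy with random tie-breaking among maximal actions."""
    if rng.random() < epsilon:
        return int(rng.integers(N_ACTIONS))
    best = np.flatnonzero(q[s] == q[s].max())
    return int(rng.choice(best))

def train(n_steps=3000, gamma=0.9, lr=0.5, epsilon=0.2, batch_size=32,
          target_period=50, seed=0):
    rng = np.random.default_rng(seed)
    q = np.zeros((N_STATES, N_ACTIONS))       # online parameters (a table here)
    q_target = q.copy()                       # delayed target parameters
    buffer = []
    s = 0
    for t in range(n_steps):
        # 1. act with the exploratory policy
        a = act(q, s, epsilon, rng)
        s_next, r, done = env_step(s, a)
        # 2. store the transition
        buffer.append((s, a, r, s_next, float(done)))
        s = 0 if done else s_next
        if len(buffer) >= batch_size:
            # 3. sample a minibatch
            idx = rng.integers(len(buffer), size=batch_size)
            bs, ba, br, bs2, bd = map(np.array, zip(*[buffer[i] for i in idx]))
            # 4. targets from the target table, no bootstrap at terminal states
            y = br + gamma * (1.0 - bd) * q_target[bs2].max(axis=1)
            # 5. gradient step on the mean squared Bellman error (y is a constant)
            grad = 2.0 * (q[bs, ba] - y) / batch_size
            np.subtract.at(q, (bs, ba), lr * grad)
        # 6. periodically copy the online parameters into the target
        if (t + 1) % target_period == 0:
            q_target = q.copy()
    return q

q = train()
greedy_policy = q[:GOAL].argmax(axis=1)       # greedy action in each non-terminal state
```

On this chain the greedy policy should move right in every non-terminal state. Swapping the table for a neural network changes step 5 into an optimizer step but leaves the loop structure the same.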

Algorithm Selection

The slides summarize online model-free choices:

  • PPO: stable, easy to use, data inefficient.
  • DQN and variants: good for discrete or low-dimensional action spaces.
  • SAC and variants: data efficient for continuous control, but less stable and more sensitive to hyperparameters.