5. Offline Reinforcement Learning
Problem Setup
Offline RL asks whether we can learn a reward-maximizing policy from a fixed dataset without collecting additional online data:
$$ \mathcal{D}=\{(s,a,r,s')\} \quad\text{collected by behavior policy}\quad \pi_ \beta. $$
The goal is to learn $\pi_ \theta$ with high return, possibly better than $\pi_ \beta$, using only $\mathcal{D}$.
Offline RL is useful when online exploration is expensive, risky, slow, or impossible. Examples include medical decision-making, autonomous driving logs, robot datasets, and previously collected RL runs.
The behavior policy $\pi_ \beta$ is the policy or mixture of policies that produced the dataset. It may be unknown. What matters is its support:
$$ \mathrm{supp}(\pi_ \beta(\cdot\mid s)) =\{a:\pi_ \beta(a\mid s)>0\}. $$
Offline improvement is only reliable when the learned policy stays near actions that the dataset covers, or when the value function is explicitly conservative about uncovered actions.
Why Offline RL Is Not Just Off-Policy RL
In online off-policy RL, if the learned policy exploits an error and visits bad states, future data collection can reveal and correct the mistake. In offline RL, the dataset is fixed. If the policy chooses out-of-distribution (OOD) actions, the Q-function may be unreliable there, and no new data will correct it.
The core failure mode is overestimation:
- Q-values are inaccurate for actions not well covered by the dataset.
- Policy improvement selects actions with high estimated Q.
- The selected actions may be OOD and overestimated.
- Bootstrapping propagates these errors.
So offline RL must be conservative about actions outside the data support.
This is stronger than the usual off-policy concern. Online off-policy algorithms can recover from bad extrapolation by trying actions and observing new rewards. Offline algorithms cannot. A policy that looks excellent under the learned Q-function may be excellent only because the Q-function has never seen those actions.
Another way to state the issue is: the Bellman target for an actor-critic method often requires
$$ \mathbb{E}_ {a'\sim\pi_ \theta(\cdot\mid s')}[Q(s',a')], $$
but the dataset may only contain actions from $\pi_ \beta$. If $\pi_ \theta$ puts mass on unsupported actions, both the target and the actor update query Q-values where the training data gives little constraint.
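A minimal numerical sketch of the overestimation mechanism. It assumes, purely for illustration, that every action has true value 0 and that the learned Q-values carry independent Gaussian noise; the function name and its parameters are hypothetical.

```python
import numpy as np

def max_over_noisy_q(n_actions=10, noise_std=1.0, n_trials=10_000, seed=0):
    """Compare greedy and uniform value estimates when every true Q(s, a) is 0."""
    rng = np.random.default_rng(seed)
    # Q_hat = true Q (zero everywhere) + estimation noise, one row per trial.
    q_hat = rng.normal(0.0, noise_std, size=(n_trials, n_actions))
    greedy_value = q_hat.max(axis=1)    # value a greedy policy believes it obtains
    uniform_value = q_hat.mean(axis=1)  # estimated value of a uniformly random action
    return greedy_value.mean(), uniform_value.mean()

greedy_estimate, uniform_estimate = max_over_noisy_q()
```

Averaging the noisy estimates stays close to the true value of 0, while maximizing over them is biased upward even though no action is actually better. Online data would shrink the noise on the chosen action; a fixed dataset cannot.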
Why Not Just Imitation?
Imitation learning cannot reliably outperform the behavior policy. Offline RL can use rewards to stitch together good parts of suboptimal trajectories. If one trajectory demonstrates how to reach an intermediate state and another demonstrates how to continue from there, temporal-difference learning can combine them, even if no single demonstrator performed the full optimal trajectory.
This "trajectory stitching" is a major reason offline RL can outperform behavior cloning.
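Stitching can be illustrated on a tiny tabular problem. The sketch below is an assumption-laden toy, not a method from these notes: the five states, two actions, and four transitions are invented, and the backup takes its max only over actions that appear in the dataset at the next state.

```python
import numpy as np

# Transitions (s, a, r, s_next, done) from two trajectories:
#   trajectory 1: 0 -> 1 -> 2 (terminal, return 0)
#   trajectory 2: 3 -> 1 -> 4 (terminal, reward 1 on the last step)
dataset = [
    (0, 0, 0.0, 1, False),
    (1, 0, 0.0, 2, True),
    (3, 0, 0.0, 1, False),
    (1, 1, 1.0, 4, True),
]
gamma = 0.9
n_states, n_actions = 5, 2

# Record which (state, action) pairs the dataset covers.
seen = np.zeros((n_states, n_actions), dtype=bool)
for s, a, _, _, _ in dataset:
    seen[s, a] = True

q = np.zeros((n_states, n_actions))
for _ in range(50):  # repeated sweeps over the fixed dataset
    for s, a, r, s_next, done in dataset:
        if done:
            q[s, a] = r
        else:
            # Bootstrap only from actions observed at s_next.
            q[s, a] = r + gamma * q[s_next][seen[s_next]].max()

# Greedy action at state 1, restricted to dataset actions.
greedy_at_1 = int(np.argmax(np.where(seen[1], q[1], -np.inf)))
```

No trajectory that starts in state 0 ever collects reward, yet `q[0, 0]` converges to `gamma * 1`: the backup joins the first step of trajectory 1 to the last step of trajectory 2 through the shared state 1.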
Filtered Behavior Cloning
A simple reward-aware baseline:
- Compute trajectory return:
$$ R(\tau)=\sum_ {(s,a)\in\tau}r(s,a). $$
- Keep only the top $k\%$ of trajectories, where $\eta$ is the return threshold at that cutoff:
$$ \tilde{\mathcal{D}}=\{\tau:R(\tau)>\eta\}. $$
- Behavior-clone the filtered data:
$$ \max_ \theta \sum_ {\tau\in\tilde{\mathcal{D}}}\sum_ {(s,a)\in\tau}\log\pi_ \theta(a\mid s). $$
This is primitive but often a strong baseline.
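The filtering step can be sketched as follows. The trajectory format (a list of `(state, action, reward)` tuples) and the name `filter_top_k` are assumptions made for this example, and the threshold uses `>=` so that a trajectory sitting exactly at the cutoff is kept.

```python
import numpy as np

def filter_top_k(trajectories, k_percent):
    """Return (state, action) pairs from trajectories in the top k percent by return.

    Each trajectory is a list of (state, action, reward) tuples.
    """
    returns = np.array([sum(r for _, _, r in traj) for traj in trajectories])
    # eta is the return threshold: the (100 - k)th percentile of returns.
    eta = np.percentile(returns, 100.0 - k_percent)
    kept = [traj for traj, ret in zip(trajectories, returns) if ret >= eta]
    # Flatten to (state, action) pairs, the supervised dataset for behavior cloning.
    return [(s, a) for traj in kept for (s, a, _) in traj]

example = [
    [(0, "a", 1.0)],
    [(1, "b", 2.0)],
    [(2, "c", 3.0)],
    [(3, "d", 4.0)],
]
pairs = filter_top_k(example, k_percent=50)
```

The resulting pairs can be passed to any behavior-cloning routine. Note that filtering selects whole trajectories, so it cannot combine a good prefix from one trajectory with a good suffix from another.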
Advantage-Weighted Regression
AWR keeps the policy close to dataset actions but weights actions by estimated advantage:
$$ \max_ \theta \mathbb{E}_ {(s,a)\sim\mathcal{D}} \left[ \log\pi_ \theta(a\mid s)\exp\left(\frac{\hat{A}(s,a)}{\alpha}\right) \right]. $$
High-advantage dataset actions receive larger weight. Crucially, the actor is only trained on actions actually present in the dataset, avoiding direct policy optimization over OOD actions.
The simplest Monte Carlo version estimates advantages for the behavior policy $\pi_ \beta$, because the observed returns come from behavior-policy rollouts. That is useful but limited: it can prefer better actions in the data, but it does not fully evaluate the improved policy you are extracting. AWAC-style methods use TD critics to get stronger advantage estimates while still training the actor only on dataset actions.
This objective can be viewed as an approximation to KL-constrained policy improvement:
$$ \max_ \pi \mathbb{E}_ {a\sim\pi(\cdot\mid s)}[Q(s,a)] \quad \text{s.t.} \quad D_ {\mathrm{KL}}(\pi\|\pi_ \beta)\le \epsilon. $$
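One way to see the connection, sketched under the usual Lagrangian relaxation: fix a state $s$, replace the constraint with a penalty of strength $\alpha$ (treated here as a fixed multiplier), and solve over distributions $\pi(\cdot\mid s)$. The optimum $\pi^*$ and the normalizer $Z(s)$ are notation introduced for this sketch.

$$ \pi^*=\arg\max_ \pi\; \mathbb{E}_ {a\sim\pi(\cdot\mid s)}[Q(s,a)]-\alpha D_ {\mathrm{KL}}(\pi\|\pi_ \beta) \quad\Longrightarrow\quad \pi^*(a\mid s)=\frac{1}{Z(s)}\,\pi_ \beta(a\mid s)\exp\left(\frac{Q(s,a)}{\alpha}\right). $$

Projecting $\pi^*$ onto the parametric family by minimizing $D_ {\mathrm{KL}}(\pi^*\|\pi_ \theta)$ over dataset states gives a weighted maximum-likelihood problem whose actions come from $\pi_ \beta$, that is, from the dataset:

$$ \max_ \theta \mathbb{E}_ {(s,a)\sim\mathcal{D}} \left[ \frac{1}{Z(s)}\exp\left(\frac{Q(s,a)}{\alpha}\right)\log\pi_ \theta(a\mid s) \right]. $$

Subtracting a state-dependent baseline $V(s)$ from $Q$ rescales the exponential and $Z(s)$ by the same per-state factor, so the weights can be written with advantages. Dropping the remaining per-state normalizer, which is a simplification rather than an exact step, recovers the advantage-weighted objective.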
AWAC follows the same recipe with a learned critic: an off-policy TD critic supplies the advantage estimates, and the actor is updated with advantage-weighted behavior cloning. In both AWR and AWAC, policy extraction is supervised learning on dataset actions rather than unconstrained maximization over all actions.
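A minimal sketch of the Monte Carlo variant, assuming a single trajectory and a constant mean-return baseline as a crude stand-in for a learned value function. The helper names `returns_to_go` and `awr_loss` are hypothetical.

```python
import numpy as np

def returns_to_go(rewards, gamma=0.99):
    """Discounted return from each time step of one trajectory."""
    out = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        out[t] = running
    return out

def awr_loss(log_probs, advantages, alpha=1.0):
    """Negative advantage-weighted log-likelihood over dataset actions.

    log_probs[i] is log pi_theta(a_i | s_i) for a dataset pair (s_i, a_i).
    """
    weights = np.exp(advantages / alpha)
    return -np.mean(weights * log_probs)

rewards = np.array([0.0, 0.0, 1.0])
returns = returns_to_go(rewards, gamma=0.5)
# Crude baseline (an assumption for this sketch): the mean return.
advantages = returns - returns.mean()
loss = awr_loss(np.array([-1.0, -1.0, -1.0]), advantages, alpha=1.0)
```

Because the returns come from behavior-policy rollouts, these advantages evaluate $\pi_ \beta$. An AWAC-style method would keep `awr_loss` and replace the Monte Carlo advantages with critic-based ones.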
Implicit Q-Learning
Implicit Q-learning (IQL) tries to estimate values for a policy better than the behavior policy without querying OOD actions.
The key idea is expectile regression. Instead of fitting $V(s)$ to the mean of $Q(s,a)$ over dataset actions, fit it to a high expectile, which approximates the value of the better actions in the dataset support.
Expectile loss:
$$ L_ \tau(u)=|\tau-\mathbf{1}\{u<0\}|\,u^2. $$
For a high expectile $\tau>0.5$ and residual $u=Q(s,a)-V(s)$, positive residuals (where $V$ sits below $Q$) receive weight $\tau$ and negative residuals receive weight $1-\tau$, so underestimating is penalized more heavily and $V$ is pushed toward the upper values of $Q$ over dataset actions. Some slide notation may present the asymmetry with the opposite residual sign; what matters is that the chosen asymmetry estimates a value above the behavior-policy mean but still within dataset support.
Intuitively, a mean value over dataset actions evaluates the behavior policy:
$$ V^{\pi_ \beta}(s)\approx \mathbb{E}_ {a\sim\pi_ \beta(\cdot\mid s)}[Q(s,a)]. $$
A high expectile instead tracks the better actions in the dataset support without taking a max over all possible actions. That is the "implicit" policy improvement: improve toward the best supported actions rather than explicitly optimizing over unsupported actions.
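The difference between a mean and a high expectile can be checked numerically. This sketch assumes the residual convention $u=Q-V$ and finds the minimizer by a simple fixed-point iteration; the four Q-values are invented.

```python
import numpy as np

def expectile_loss(u, tau):
    """L_tau(u) = |tau - 1{u < 0}| * u^2, elementwise."""
    weight = np.where(u < 0, 1.0 - tau, tau)
    return weight * u ** 2

def expectile(values, tau, n_iters=100):
    """Minimizer v of sum_i L_tau(values[i] - v), by fixed-point iteration."""
    values = np.asarray(values, dtype=float)
    v = values.mean()
    for _ in range(n_iters):
        # Residual u_i = values[i] - v; nonnegative residuals get weight tau.
        weight = np.where(values - v < 0, 1.0 - tau, tau)
        # Setting the derivative of the weighted squared error to zero
        # gives a weighted average of the values.
        v = np.sum(weight * values) / np.sum(weight)
    return v

q_values = np.array([0.0, 1.0, 2.0, 10.0])  # Q(s, a) for four dataset actions
v_mean = expectile(q_values, tau=0.5)  # tau = 0.5 recovers the mean
v_high = expectile(q_values, tau=0.9)  # lies between the mean and the max
```

With $\tau=0.5$ the loss is symmetric and the minimizer is the ordinary mean. Raising $\tau$ moves the estimate toward the largest observed Q-value but never past it, which is why no out-of-support action is needed.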
IQL has three components. The first two are trained together; the third extracts a policy from the learned $Q$ and $V$:
- Fit $V$ with asymmetric expectile regression:
$$ \min_ \psi \mathbb{E}_ {(s,a)\sim\mathcal{D}} \left[ L_ \tau(Q_ \phi(s,a)-V_ \psi(s)) \right]. $$
- Fit $Q$ with TD targets using $V$:
$$ \min_ \phi \mathbb{E}_ {(s,a,r,s')\sim\mathcal{D}} \left[ \left(Q_ \phi(s,a)-[r+\gamma V_ \psi(s')]\right)^2 \right]. $$
- Extract the policy by advantage-weighted behavior cloning:
$$ \max_ \theta \mathbb{E}_ {(s,a)\sim\mathcal{D}} \left[ \log\pi_ \theta(a\mid s) \exp\left(\frac{Q_ \phi(s,a)-V_ \psi(s)}{\alpha}\right) \right]. $$
IQL's strength is that it never has to evaluate $Q(s,a)$ for actions sampled from the learned policy. Policy improvement is implicit through weighted supervised learning.
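The three objectives can be written side by side on one batch. This is a NumPy sketch that only evaluates the losses; it assumes precomputed per-transition arrays, adds a `dones` mask for terminal transitions that is not in the formulas above, and ignores the gradient-stopping and target-network details a real implementation would need.

```python
import numpy as np

def iql_losses(q_sa, v_s, v_next, rewards, dones, log_probs,
               tau=0.7, gamma=0.99, alpha=1.0):
    """Evaluate the three IQL objectives on one batch of dataset transitions.

    All inputs are 1-D arrays of per-transition quantities. Only dataset
    actions appear, so Q is never queried at policy-sampled actions.
    """
    # 1. Expectile regression of V toward Q on dataset actions.
    u = q_sa - v_s
    value_loss = np.mean(np.where(u < 0, 1.0 - tau, tau) * u ** 2)
    # 2. TD regression of Q toward r + gamma * V(s'), no bootstrap at terminals.
    td_target = rewards + gamma * (1.0 - dones) * v_next
    q_loss = np.mean((q_sa - td_target) ** 2)
    # 3. Advantage-weighted behavior cloning, written as a loss to minimize.
    weights = np.exp((q_sa - v_s) / alpha)
    policy_loss = -np.mean(weights * log_probs)
    return value_loss, q_loss, policy_loss

value_loss, q_loss, policy_loss = iql_losses(
    q_sa=np.array([1.0]), v_s=np.array([0.0]), v_next=np.array([2.0]),
    rewards=np.array([0.5]), dones=np.array([0.0]),
    log_probs=np.array([-1.0]), tau=0.7, gamma=0.5, alpha=1.0,
)
```

Every quantity is indexed by a dataset transition: `q_sa` uses the logged action, and the TD target uses `v_next` rather than a Q-value at an action sampled from the learned policy.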
The policy-extraction weights are usually clipped or otherwise stabilized in implementation because
$$ \exp\left(\frac{Q(s,a)-V(s)}{\alpha}\right) $$
can become very large. The temperature $\alpha$ controls how sharply the policy focuses on high-advantage dataset actions. Small $\alpha$ behaves like selecting only top actions; large $\alpha$ behaves more like behavior cloning.
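The effect of the temperature and of clipping can be seen on three hypothetical advantages. The cap `max_weight=100.0` is an arbitrary value chosen for this sketch, not a recommended setting.

```python
import numpy as np

def extraction_weights(advantages, alpha, max_weight=100.0):
    """exp(A / alpha), capped at max_weight to keep the loss scale bounded."""
    # Clip the exponent rather than the result so that exp never overflows.
    exponent = np.minimum(np.asarray(advantages) / alpha, np.log(max_weight))
    return np.exp(exponent)

def normalized(weights):
    """Rescale weights to sum to one, for easier comparison."""
    return weights / weights.sum()

advantages = np.array([-1.0, 0.0, 1.0])
# Small alpha: almost all weight lands on the highest-advantage action.
sharp = normalized(extraction_weights(advantages, alpha=0.1))
# Large alpha: weights are nearly uniform, which is plain behavior cloning.
flat = normalized(extraction_weights(advantages, alpha=100.0))
```

Without the cap, the largest weight at `alpha=0.1` would be `exp(10)`, so a single transition could dominate a minibatch.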
Conservative Q-Learning
Another major offline-RL strategy is to make the Q-function pessimistic on actions not supported by the dataset. Conservative Q-learning (CQL) adds a regularizer that lowers Q-values for actions sampled from a broad distribution while preserving Q-values on dataset actions. A simplified form is
$$ \min_ Q \alpha \left( \mathbb{E}_ {s\sim\mathcal{D},a\sim\mu(\cdot\mid s)}[Q(s,a)] - \mathbb{E}_ {(s,a)\sim\mathcal{D}}[Q(s,a)] \right) +\text{Bellman error}, $$
where $\mu$ might sample random actions or current-policy actions. The first term pushes down values for non-data actions; the second prevents pushing down the actual dataset actions too much. CQL is useful to remember because it represents the explicit-pessimism approach, while IQL represents the avoid-OOD-action-queries approach.
The common log-sum-exp version of the conservative penalty is
$$ \alpha \left( \mathbb{E}_ {s\sim\mathcal{D}} \left[\log\int\exp(Q(s,a))\,da\right] - \mathbb{E}_ {(s,a)\sim\mathcal{D}}[Q(s,a)] \right). $$
For discrete actions, the integral becomes a sum. The log-sum-exp term is high when any action has a large Q-value, so minimizing it suppresses unsupported high values. The data-action term counterbalances this by keeping observed actions from being pushed down indiscriminately.
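For discrete actions the penalty is a few lines of array code. This sketch assumes Q-values are available as a `(batch, n_actions)` array and computes only the conservative term, not the Bellman error; the function name `cql_penalty` is hypothetical.

```python
import numpy as np

def cql_penalty(q_values, data_actions, alpha=1.0):
    """alpha * (E_s[logsumexp_a Q(s, a)] - E_(s,a)[Q(s, a)]) for discrete actions.

    q_values: float array of shape (batch, n_actions).
    data_actions: int array of shape (batch,), the logged action per state.
    """
    # Numerically stable log-sum-exp over the action axis.
    q_max = q_values.max(axis=1, keepdims=True)
    lse = q_max[:, 0] + np.log(np.exp(q_values - q_max).sum(axis=1))
    # Q-value of the action that actually appears in the dataset.
    q_data = q_values[np.arange(len(data_actions)), data_actions]
    return alpha * np.mean(lse - q_data)

q = np.array([[0.0, 5.0]])
# Large when the high Q-value belongs to an action the dataset did not take.
penalty_ood_high = cql_penalty(q, np.array([0]))
# Small when the high Q-value belongs to the logged action.
penalty_data_high = cql_penalty(q, np.array([1]))
```

The penalty is never negative, because the log-sum-exp is at least the largest Q-value in each row. It approaches its minimum when the logged action carries the dominant Q-value.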
Conservatism is a bias on purpose. If it is too weak, the learned policy exploits value errors. If it is too strong, the method becomes behavior cloning with extra machinery and cannot improve much.
Conservative Off-Policy Evaluation
Before optimizing a policy, one can ask for a lower-bound estimate of a policy's value from offline data. Conservative off-policy evaluation tries to learn $\hat{Q}$ such that
$$ \mathbb{E}_ {a\sim\pi(\cdot\mid s)}[\hat{Q}(s,a)] \le \mathbb{E}_ {a\sim\pi(\cdot\mid s)}[Q^\pi(s,a)] $$
on the relevant states. This is useful because an underestimated policy value is safer than an overestimated one when no online correction is possible. CQL-style penalties can be viewed as pushing value estimation in this pessimistic direction.
Offline RL Method Map
| Method | How it avoids OOD action exploitation | Can improve over BC? |
|---|---|---|
| Filtered BC | Only imitates high-return trajectories | Limited, no stitching beyond selected data |
| AWR/AWAC | Weighted supervised learning on dataset actions | Yes, through advantage weights |
| IQL | Learns high expectile value and extracts policy by weighted BC | Yes, without querying learned-policy actions |
| CQL | Penalizes high Q-values on non-data actions | Yes, with pessimistic value estimates |
Takeaways
Offline RL must balance improvement and conservatism. It should use reward information to outperform plain imitation, but it must avoid exploiting value errors on unsupported actions. Practical methods often constrain the policy toward data actions, lower OOD value estimates, or avoid OOD action queries entirely.