7. Model-Based Reinforcement Learning

Shurui Liu

Key Idea

Model-based RL learns or uses a model of the environment dynamics:

$$ p_ \theta(s_ {t+1}\mid s_ t,a_ t). $$

This model can be called a dynamics model, simulator, world model, or predictive model. It may be known exactly, partially known with fitted parameters, or learned end-to-end.

Model-based RL can use the model in two main ways:

  • Generate synthetic experience for policy/value learning.
  • Plan at test time by simulating candidate futures.

Often a reward model is also needed:

$$ r_ \eta(s,a). $$

Not every model-based method learns a model from scratch. In chess, the transition model is known exactly. In robotics, one may know some physics and learn residual effects. In video prediction, the model may predict future observations or latent states rather than low-dimensional physical states.

Learning a Dynamics Model

Given transitions $(s_ t,a_ t,s_ {t+1})$, train by maximum likelihood:

$$ \max_ \theta \sum_ t \log p_ \theta(s_ {t+1}\mid s_ t,a_ t). $$

For a deterministic model $f_ \theta$ (equivalently, a Gaussian model with fixed variance), maximum likelihood reduces to minimizing squared error:

$$ \min_ \theta \sum_ t \|f_ \theta(s_ t,a_ t)-s_ {t+1}\|^2. $$
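As a minimal sketch of the squared-error objective, assume a linear model $f_ \theta(s,a)=[s,a]W$ fitted by least squares to transitions from a toy linear system. The matrices `A_true` and `B_true`, the noise level, and all variable names are illustrative assumptions, not part of the lecture.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical ground-truth dynamics: s' = A_true s + B_true a + small noise.
A_true = np.array([[1.0, 0.1], [0.0, 1.0]])
B_true = np.array([[0.0], [0.1]])

# Collect random transitions (s_t, a_t, s_{t+1}).
states = rng.normal(size=(500, 2))
actions = rng.normal(size=(500, 1))
noise = 0.01 * rng.normal(size=(500, 2))
next_states = states @ A_true.T + actions @ B_true.T + noise

# Linear model f_theta(s, a) = [s, a] @ W. Least squares minimizes the
# summed squared prediction error over the dataset.
X = np.hstack([states, actions])
W, *_ = np.linalg.lstsq(X, next_states, rcond=None)

def f_theta(s, a):
    """Predict the next state for a single (s, a) pair."""
    return np.concatenate([s, a]) @ W

# Mean squared one-step error on the training transitions.
mse = np.mean(np.sum((X @ W - next_states) ** 2, axis=1))
```

No reward labels appear anywhere in the fit, which is what lets such a model be reused across tasks that share dynamics.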

In high-dimensional settings, models may operate over learned latent states rather than raw observations.

Model learning is usually supervised or self-supervised: it does not require reward labels. This is one reason models can transfer across tasks with the same dynamics but different rewards.

Good one-step prediction loss is not sufficient by itself. A model used for control must be accurate under the action sequences that the planner or policy will query. A visually plausible video prediction model, for example, may still be bad for control if it misses small contact events that determine whether a robot grasps an object.

Model-Based Policy Optimization

The Dyna-style recipe:

  1. Collect real data with the current policy and add it to $\mathcal{D}_ {\text{env}}$.
  2. Train the dynamics model on real data.
  3. Use the model to generate synthetic rollouts and add them to $\mathcal{D}_ {\text{model}}$.
  4. Train the policy and critic on real plus synthetic data.

Synthetic data can greatly improve data efficiency, but model errors compound when rollouts are long or when the policy exploits model inaccuracies.
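To make the compounding effect concrete, here is a small sketch on a hypothetical scalar system in which the model's coefficient is off by about 1%. The coefficients `1.05` and `1.06` are arbitrary assumptions chosen only for illustration.

```python
# Hypothetical scalar system: true dynamics and a slightly wrong model.
def true_step(s):
    return 1.05 * s

def model_step(s):
    return 1.06 * s

def rollout_errors(s0, horizon):
    """Absolute gap between model and true state after each open-loop step."""
    s_true, s_model, errors = s0, s0, []
    for _ in range(horizon):
        s_true, s_model = true_step(s_true), model_step(s_model)
        errors.append(abs(s_model - s_true))
    return errors

errors = rollout_errors(s0=1.0, horizon=20)
print(f"1-step error: {errors[0]:.3f}, 20-step error: {errors[-1]:.3f}")
```

Under these assumed coefficients the one-step gap is 0.01, while the 20-step gap is roughly 0.55, even though the model never gets worse at one-step prediction.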

The modeling objective is therefore not just "predict well on average": the model must be accurate in the parts of state-action space that the policy or planner actually queries. Even a small one-step prediction error can be disastrous if it creates a fake high-reward path that planning repeatedly chooses.

Handling Model Distribution Shift

A learned model is most accurate near the data it was trained on. If synthetic rollouts drift far from real data, the model may become unreliable.

The slides emphasize two mitigations:

  • Start synthetic rollouts from real states in the replay buffer.
  • Use short model rollouts rather than full trajectories.

This is the idea behind model-based policy optimization (MBPO). Short rollouts reduce compounding model error while still augmenting real experience.
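The two mitigations can be sketched as follows. This is an illustrative outline under stated assumptions, not the MBPO reference implementation: `model_step` and `policy` are hypothetical placeholders standing in for a learned model and the current policy, and the array `D_env_states` stands in for states stored in $\mathcal{D}_ {\text{env}}$.

```python
import numpy as np

rng = np.random.default_rng(0)

def model_step(s, a):
    # Placeholder learned dynamics (hypothetical): a damped linear system.
    return 0.9 * s + 0.1 * a

def policy(s):
    # Placeholder policy (hypothetical): noisy action pointing back to the origin.
    return -s + 0.1 * rng.normal(size=s.shape)

def branched_rollouts(real_states, n_starts, k):
    """Start from real replay-buffer states and roll the model forward k steps."""
    idx = rng.integers(len(real_states), size=n_starts)
    s = real_states[idx]
    synthetic = []
    for _ in range(k):
        a = policy(s)
        s_next = model_step(s, a)
        synthetic.append((s, a, s_next))  # one batch of synthetic transitions
        s = s_next
    return synthetic

D_env_states = rng.normal(size=(1000, 3))  # stand-in for states in D_env
D_model = branched_rollouts(D_env_states, n_starts=64, k=5)
```

Keeping `k` small means every synthetic transition is at most a few model steps away from a state that was actually visited.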

Model ensembles can also help. If several models are trained independently, disagreement can indicate uncertainty, and averaging can reduce idiosyncratic errors.

To summarize the MBPO intuition: anchor synthetic rollouts in the real replay buffer, and keep them short enough that the model needs to be accurate only locally, not globally.

Model uncertainty can be used in several ways:

  • Stop or shorten rollouts when ensemble disagreement becomes large.
  • Penalize rewards in uncertain model states.
  • Prefer action sequences whose predicted outcomes are robust across models.
  • Collect real data where the model is uncertain if online exploration is available.

These are different ways to avoid model exploitation: the policy should not get high return merely by finding a part of the learned simulator that is wrong.
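One way to sketch the first two uses is with ensemble disagreement as an uncertainty proxy. The disagreement measure (mean standard deviation across members), the penalty coefficient `lam`, and the three toy ensemble members are assumptions made for this example only.

```python
import numpy as np

def ensemble_predict(models, s, a):
    """Stack next-state predictions from each member: shape (M, state_dim)."""
    return np.stack([m(s, a) for m in models])

def disagreement(preds):
    """Uncertainty proxy: per-dimension std across members, averaged."""
    return float(preds.std(axis=0).mean())

def penalized_reward(r, preds, lam=1.0):
    """Subtract a disagreement penalty; lam is a hypothetical tuning coefficient."""
    return r - lam * disagreement(preds)

def should_truncate(preds, threshold=0.1):
    """Stop a synthetic rollout once disagreement exceeds an assumed threshold."""
    return disagreement(preds) > threshold

# Three hypothetical members that agree near s = 0 and diverge far from it.
models = [lambda s, a, c=c: s + a + c * s**2 for c in (-0.1, 0.0, 0.1)]

near = ensemble_predict(models, np.array([0.1]), np.array([0.0]))
far = ensemble_predict(models, np.array([3.0]), np.array([0.0]))
```

Here `far` yields much larger disagreement than `near`, so a rollout reaching that state would be truncated and its reward penalized.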

Planning with a Learned Model

At test time, the model itself can define the policy through planning:

  1. Observe current state $s_ t$.
  2. Sample or optimize candidate action sequences $a_ {t:t+H}^{(i)}$.
  3. Set $\hat{s}_ t^{(i)}=s_ t$ and roll each sequence forward in the model:

$$ \hat{s}_ {t+1:t+H+1}^{(i)} \sim p_ \theta(\cdot\mid s_ t,a_ {t:t+H}^{(i)}). $$

  4. Score each sequence:

$$ S_ i= \sum_ {t'=t}^{t+H} \gamma^{t'-t}r(\hat{s}_ {t'}^{(i)},a_ {t'}^{(i)}). $$

  5. Execute the first action from the best sequence.

Executing only the first action and replanning at the next state is receding-horizon planning, also called model predictive control (MPC).
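The planning loop above can be sketched with random shooting as the candidate generator. The 1-D dynamics, the reward, and the action bounds are hypothetical, and for simplicity the real environment is assumed to match the model exactly.

```python
import numpy as np

rng = np.random.default_rng(0)

def model_step(s, a):
    # Hypothetical learned dynamics: a 1-D point shifted by the action.
    return s + a

def reward(s, a):
    # Hypothetical reward: stay near the origin, with a small action cost.
    return -(s ** 2) - 0.01 * a ** 2

def plan(s_t, H=5, n_candidates=256, gamma=0.99):
    """Random shooting: return the first action of the best sampled sequence."""
    # Candidate sequences a_{t:t+H}, i.e. H + 1 actions each.
    actions = rng.uniform(-1.0, 1.0, size=(n_candidates, H + 1))
    s = np.full(n_candidates, s_t, dtype=float)
    scores = np.zeros(n_candidates)
    for k in range(H + 1):
        scores += gamma ** k * reward(s, actions[:, k])
        s = model_step(s, actions[:, k])
    return actions[np.argmax(scores), 0]

# Receding-horizon loop: execute one action, then replan from the new state.
s = 2.0
for _ in range(15):
    a = plan(s)
    s = model_step(s, a)  # assumption: the real environment equals the model
```

Replanning at every step lets the controller correct for the remaining actions in each sampled sequence being far from optimal.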

For long-horizon problems, add a terminal value estimate:

$$ S_ i= \sum_ {t'=t}^{t+H} \gamma^{t'-t}r(\hat{s}_ {t'}^{(i)},a_ {t'}^{(i)}) +\gamma^{H+1}V_ \psi(\hat{s}_ {t+H+1}^{(i)}). $$

The value function summarizes rewards beyond the planning horizon, reducing myopia.

Without a terminal value, short-horizon planning can be greedy in the wrong way. For example, it may refuse to take a temporarily bad action that is necessary for a large later reward. A terminal value estimate lets model-based planning reuse model-free value learning for long-horizon credit assignment.
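A small numeric sketch of this effect follows, with two hypothetical candidate sequences and assumed terminal values standing in for the output of a learned $V_ \psi$.

```python
def score(rewards, terminal_value=0.0, gamma=0.99):
    """Discounted sum of H + 1 predicted rewards plus gamma^(H+1) * terminal value."""
    n = len(rewards)  # n = H + 1
    ret = sum(gamma ** k * r for k, r in enumerate(rewards))
    return ret + gamma ** n * terminal_value

# Candidate A does nothing. Candidate B pays a cost now to reach a better state.
rewards_a = [0.0, 0.0, 0.0]
rewards_b = [-1.0, 0.0, 0.0]
v_a, v_b = 0.0, 10.0  # assumed V_psi values at each sequence's final state

myopic_choice = "A" if score(rewards_a) > score(rewards_b) else "B"
valued_choice = "A" if score(rewards_a, v_a) > score(rewards_b, v_b) else "B"
```

Without the terminal term the planner prefers A (0 versus -1). With it, B wins, because its score becomes $-1+0.99^3\cdot 10\approx 8.70$.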

Sampling and Optimization in Planning

Candidate actions can be sampled randomly, proposed by a policy network, or optimized iteratively. A common sampling-based optimizer is the cross-entropy method (CEM):

  1. Sample action sequences from a distribution.
  2. Keep the elite high-scoring sequences.
  3. Refit the sampling distribution to elites.
  4. Repeat for a few iterations.

CEM works well for short-horizon continuous control, especially when rewards are shaped.
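The four CEM steps can be sketched as below, assuming a diagonal Gaussian sampling distribution. The population size, elite count, iteration count, and the toy score function are illustrative choices; in a planner, `score_fn` would roll the sequence through the model and return $S_ i$.

```python
import numpy as np

def cem(score_fn, dim, n_iters=5, pop=200, n_elite=20, seed=0):
    """Cross-entropy method over action sequences of length dim."""
    rng = np.random.default_rng(seed)
    mu, sigma = np.zeros(dim), np.ones(dim)
    for _ in range(n_iters):                                 # 4. repeat
        samples = mu + sigma * rng.normal(size=(pop, dim))   # 1. sample
        scores = np.array([score_fn(x) for x in samples])
        elites = samples[np.argsort(scores)[-n_elite:]]      # 2. keep elites
        mu = elites.mean(axis=0)                             # 3. refit mean
        sigma = elites.std(axis=0) + 1e-6                    #    and spread
    return mu

# Hypothetical score: sequences closest to a target sequence score highest.
target = np.array([0.5, -0.3, 0.8, 0.0])
best = cem(lambda x: -np.sum((x - target) ** 2), dim=4)
```

The small constant added to `sigma` keeps the distribution from collapsing to a point before the final iteration.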

Planning methods differ in how they generate candidate actions:

  • Random shooting: sample many action sequences independently and pick the best.
  • CEM: iteratively refit a sampling distribution to elite sequences.
  • Gradient-based planning: differentiate through a differentiable model to optimize actions.
  • Policy-guided planning: sample candidate actions from a learned policy rather than from a broad random distribution.

In receding-horizon control, the planner is the policy. There may be no separate neural actor at test time:

$$ \pi(s_ t)= \text{first action of the highest-scoring planned sequence}. $$

A learned policy can still be useful as a proposal distribution for planning, especially in high-dimensional action spaces where random shooting is inefficient.

Model-Based RL Tradeoffs

Upsides:

  • Can be much more data efficient if the model is learnable.
  • Dynamics learning can be self-supervised and reward-free.
  • A model may transfer across tasks with different rewards.
  • Planning can adapt to new situations at test time.

Downsides:

  • The model is trained for prediction accuracy, which does not directly optimize task performance.
  • Model learning may be harder than policy learning.
  • Long-horizon prediction errors compound.
  • Adds hyperparameters, compute cost, and another failure mode.

Whether to use model-based RL depends on how difficult the environment is to model accurately enough for the intended use.

Other useful learned models include inverse models $p(a_ t\mid s_ t,s_ {t+1})$ and multi-step inverse models $p(a_ {t:t+n}\mid s_ t,s_ {t+n})$. They are not the main focus of the lecture, but they reinforce the broader point: "model-based" means learning predictive structure that helps decision-making, not only one-step forward dynamics.