9. Meta-Reinforcement Learning

Shurui Liu

From Multi-Task RL to Meta-RL

Multi-task RL trains one policy to solve many tasks by conditioning on a task descriptor $z_ i$. Meta-RL changes the interface. At test time, the learner is not given a clean task ID. Instead, it receives a small amount of experience in the new task and must use that experience to act well.

This is why meta-RL is often described as "learning to learn." The policy should not merely solve the training tasks. It should learn an adaptation procedure that uses a few trajectories, transitions, rewards, or observations to infer what task it is currently facing.

A task is an MDP:

$$ \mathcal{T}_ i= (\mathcal{S}_ i,\mathcal{A}_ i,p_ i(s_ 1),p_ i(s'\mid s,a),r_ i(s,a)). $$

Meta-learning assumes a task distribution:

$$ \mathcal{T}_ i\sim p(\mathcal{T}). $$

During meta-training, the algorithm sees many tasks from this distribution. During meta-testing, it sees a new task from the same distribution and must adapt quickly. The same-distribution assumption is crucial. Meta-RL can generalize to unseen tasks, but only when the new task shares enough structure with the meta-training family.

Examples:

  • Maze navigation with different goal locations or layouts.
  • Locomotion over different terrains, slopes, payloads, or robot parameters.
  • Manipulation with different objects and goals.
  • Dialogue with users who have different preferences.
  • Program testing, where each student submission defines a different environment to explore.
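As a concrete toy instance of a task distribution, the sketch below defines a hypothetical family of corridor tasks in the spirit of the maze example. The tasks share states, actions, and dynamics, and differ only in the goal cell. The names `GoalTask` and `sample_task`, the corridor size, and the set of goal cells are illustrative assumptions, not part of the lecture.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class GoalTask:
    """A tiny MDP on cells 0..size-1; only the goal differs across tasks."""
    goal: int
    size: int = 5

    def initial_state(self) -> int:
        return self.size // 2  # every task starts in the middle cell

    def step(self, state: int, action: int) -> tuple[int, float]:
        """Apply an action in {-1, +1}; return (next_state, reward)."""
        next_state = min(max(state + action, 0), self.size - 1)
        reward = 1.0 if next_state == self.goal else 0.0
        return next_state, reward


def sample_task(rng: random.Random) -> GoalTask:
    """Draw T ~ p(T): a goal in any cell except the start cell (assumed)."""
    return GoalTask(goal=rng.choice([0, 1, 3, 4]))


rng = random.Random(0)
meta_train_tasks = [sample_task(rng) for _ in range(8)]
meta_test_task = sample_task(rng)  # a fresh draw from the same p(T)
```

The start state never reveals the goal, so an agent dropped into `meta_test_task` has to gather experience before it can act well.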

Transfer Settings

The slides contrast three related settings:

  • Forward transfer: train on a source task and fine-tune on a target task.
  • Multi-task transfer: train on many tasks and use a task descriptor for zero-shot transfer.
  • Meta-learning: train on many tasks while explicitly accounting for adaptation at test time.

Forward transfer can work when the source and target are close. Multi-task transfer can work when the descriptor $z_ i$ captures the structure needed to choose the right behavior. Meta-learning is different because the "descriptor" is usually a dataset:

$$ \mathcal{D}_ {\mathrm{train}} =\{(s_ t,a_ t,r_ t,s_ {t+1})\}_ {t=1}^k. $$

The adaptation policy is therefore:

$$ a\sim\pi_ \theta(\cdot\mid s,\mathcal{D}_ {\mathrm{train}}). $$

This resembles few-shot supervised learning, where a predictor uses a few labeled examples to classify a new query. In meta-RL, the examples are interactive and sequential. The learner may choose which data to collect, so exploration is part of the problem.

Episodic and Online Meta-RL

There are two common variants.

In the episodic variant, the agent first collects $k$ rollout episodes in the new task using an exploration policy:

$$ \mathcal{D}_ {\mathrm{train}}\sim\pi_ {\mathrm{exp}}. $$

Then it acts with a task-adapted policy:

$$ a\sim\pi_ {\mathrm{task}}(\cdot\mid s,\mathcal{D}_ {\mathrm{train}}). $$

In the online variant, the policy adapts continuously from the first few timesteps:

$$ a_ t\sim\pi_ \theta(\cdot\mid s_ t,h_ t), \qquad h_ t=f_ \theta(h_ {t-1},s_ {t-1},a_ {t-1},r_ {t-1}). $$

Here $h_ t$ is the internal memory of the agent. It summarizes what the agent has learned so far about the current task.

The exploration policy and task policy may be separate, but they often share parameters. Sharing is efficient, but it couples two jobs: collect information and exploit that information.
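The online interface can be illustrated with a hand-written memory in a stateless bandit, where the state argument of $f_ \theta$ drops out. This is a minimal sketch under assumptions: the functions `update_memory`, `act`, and `run_online` are hypothetical, the memory and policy are hand-coded here, whereas in meta-RL both would be learned, and exactly one arm is assumed to pay reward 1.

```python
def update_memory(h, action, reward):
    """Hand-written f: fold the latest (a, r) into per-arm (pulls, reward sum)."""
    pulls, total = h[action]
    new_h = list(h)
    new_h[action] = (pulls + 1, total + reward)
    return tuple(new_h)


def act(h):
    """Hand-written policy on h: untried arms get an optimistic value of 1."""
    values = [total / pulls if pulls else 1.0 for pulls, total in h]
    return max(range(len(h)), key=values.__getitem__)


def run_online(good_arm, num_arms=3, horizon=5):
    """Adapt from the first timestep: act on h_t, then update the memory."""
    h = tuple((0, 0.0) for _ in range(num_arms))  # h_0: empty memory
    actions = []
    for _ in range(horizon):
        a = act(h)
        r = 1.0 if a == good_arm else 0.0  # assumed deterministic reward
        h = update_memory(h, a, r)
        actions.append(a)
    return actions


print(run_online(good_arm=2))  # [0, 1, 2, 2, 2]
```

The same function collects information in the first pulls and exploits it afterwards, which is the sharing of exploration and task behavior described above.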

Meta-RL Objective

A high-level objective is:

$$ \max_ \theta \mathbb{E}_ {\mathcal{T}\sim p(\mathcal{T})} \left[ \mathbb{E}_ {\mathcal{D}_ {\mathrm{train}}\sim\pi_ {\mathrm{exp}}} \mathbb{E}_ {\tau\sim\pi_ {\mathrm{task}}(\cdot\mid\mathcal{D}_ {\mathrm{train}})} [R_ {\mathcal{T}}(\tau)] \right]. $$

Here $\theta$ denotes the parameters of both $\pi_ {\mathrm{exp}}$ and $\pi_ {\mathrm{task}}$, whether or not they are shared. The inner data $\mathcal{D}_ {\mathrm{train}}$ is collected inside the task. The outer expectation measures whether the learned adaptation procedure works across tasks. This is different from ordinary RL, which optimizes one policy for one MDP, and different from ordinary multi-task RL, which assumes the task information is already available.
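The nested expectation can be estimated by Monte Carlo sampling. The sketch below does this for a hypothetical bandit family in which exactly one of `num_arms` arms pays reward 1. The exploration and task policies are hand-written stand-ins (`explore`, `exploit`), chosen only to make the three sampling levels visible. Nothing here is optimized over $\theta$.

```python
import random


def explore(good_arm, num_arms, k, rng):
    """pi_exp: pull k distinct arms and record (arm, reward) pairs as D_train."""
    arms = rng.sample(range(num_arms), k)
    return [(arm, 1.0 if arm == good_arm else 0.0) for arm in arms]


def exploit(d_train, num_arms, rng):
    """pi_task(. | D_train): pull a rewarded arm if one was seen, else guess."""
    for arm, reward in d_train:
        if reward > 0:
            return arm
    tried = {arm for arm, _ in d_train}
    return rng.choice([a for a in range(num_arms) if a not in tried])


def meta_objective(k, num_arms=4, num_tasks=10_000, seed=0):
    """Monte Carlo estimate of E_T E_{D_train} E_tau [R_T(tau)]."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(num_tasks):
        good_arm = rng.randrange(num_arms)             # T ~ p(T)
        d_train = explore(good_arm, num_arms, k, rng)  # D_train ~ pi_exp
        arm = exploit(d_train, num_arms, rng)          # tau ~ pi_task(. | D_train)
        total += 1.0 if arm == good_arm else 0.0       # R_T(tau)
    return total / num_tasks


for k in range(4):
    print(k, meta_objective(k))
```

For this toy family the exact value is $(k+1)/K$ with $K$ arms and $k<K$ exploration pulls, so the estimate should rise from about 0.25 at $k=0$ to exactly 1 at $k=3$. More informative exploration data directly raises the outer objective.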

Black-Box Meta-RL

Black-box meta-RL uses a neural network with memory to implement adaptation. The adaptation algorithm is not written by hand. It is represented by the network's hidden state, attention context, or latent variable.

A recurrent version can be written as:

$$ h_ t=f_ \theta(h_ {t-1},s_ t,a_ {t-1},r_ {t-1}), \qquad a_ t\sim\pi_ \theta(\cdot\mid s_ t,h_ t). $$

During a rollout, the network sees states, previous actions, and rewards. If training succeeds, the hidden state becomes an inferred task representation. For example, after seeing reward in one part of a maze, the memory can encode which goal is active.

A black-box meta-RL training loop is:

  1. Sample a task $\mathcal{T}_ i$.
  2. Roll out the meta-policy for several episodes in that task.
  3. Feed the history back into the policy as context.
  4. Optimize the total return across those episodes with an RL algorithm.

Because the task identity is hidden and must be inferred from experience, black-box meta-RL can be viewed as ordinary RL on an augmented, partially observed problem. The current physical state may not identify the task, but the history can.
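The four-step loop can be sketched end to end on a toy problem. The example below makes several assumptions that are not from the lecture: tasks are two-armed bandits with one deterministic paying arm, each pull is treated as one short episode, the memory is just the previous reward, and the "network" is two logits for repeating the previous arm. Plain REINFORCE with reward-to-go stands in for the RL algorithm. A practical system would use a recurrent or attention-based network and a stronger optimizer.

```python
import math
import random


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def rollout(theta, good_arm, horizon, rng):
    """Run one trial; return (memory, stayed, reward) for each decision."""
    arm = rng.randrange(2)  # first pull: no history to condition on yet
    reward = 1.0 if arm == good_arm else 0.0
    steps = []
    for _ in range(horizon - 1):
        h = int(reward)  # step 3: the history is fed back as context
        stay = rng.random() < sigmoid(theta[h])
        arm = arm if stay else 1 - arm
        reward = 1.0 if arm == good_arm else 0.0
        steps.append((h, stay, reward))
    return steps


def meta_train(iterations=5000, horizon=4, lr=0.1, seed=0):
    rng = random.Random(seed)
    theta = [0.0, 0.0]  # logits of staying after a loss (h=0) or a win (h=1)
    for _ in range(iterations):
        good_arm = rng.randrange(2)                     # step 1: sample a task
        steps = rollout(theta, good_arm, horizon, rng)  # step 2: roll out
        grad = [0.0, 0.0]
        reward_to_go = 0.0
        for h, stay, reward in reversed(steps):         # step 4: REINFORCE
            reward_to_go += reward
            p_stay = sigmoid(theta[h])
            grad[h] += reward_to_go * ((1.0 - p_stay) if stay else -p_stay)
        theta = [t + lr * g for t, g in zip(theta, grad)]
    return theta


theta = meta_train()
print("P(stay | loss) =", sigmoid(theta[0]))
print("P(stay | win)  =", sigmoid(theta[1]))
```

If training succeeds, the policy approaches win-stay, lose-shift. That adaptation rule is never written by hand; it emerges because the return summed over the whole trial rewards using the memory to infer which arm pays.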

Why This Is Not Just a Recurrent Policy

A recurrent policy in a single MDP may use memory for velocity estimation, partial observability, or temporal patterns. A meta-RL recurrent policy is trained across many MDPs so that its memory performs task inference and fast adaptation.

The distinction is in the training distribution and evaluation protocol:

  • Ordinary recurrent RL: train and test in the same task, with no requirement of fast adaptation to a held-out task.
  • Black-box meta-RL: train over many tasks, then evaluate how quickly the hidden state adapts to a new task.

In practice the architecture may look similar, but the intended behavior is different.

Exploration Is the Hard Part

In supervised few-shot learning, the examples are given. In meta-RL, the agent may need to gather the examples itself. That creates an exploration-exploitation tradeoff inside each new task.

In a maze, the agent may need to explore to discover the goal. In a kitchen, it may need to find ingredients before learning how to cook. In a program-testing environment, it may need to hit informative states before it can infer which bug exists.

End-to-end black-box meta-RL tries to learn exploration and exploitation together by maximizing task reward. This is conceptually clean:

$$ \max_ \theta \mathbb{E}[R_ {\mathrm{explore}}+R_ {\mathrm{exploit}}], $$

but the optimization can be difficult. Informative exploration may receive no immediate reward. If the agent never finds the information needed for the task, it also cannot learn the exploitation behavior. The two failures reinforce each other.

The slides call this a coupling problem:

  • Bad exploration prevents useful task inference.
  • Bad task execution makes exploration look useless.
  • Sparse rewards make credit assignment hard for both parts.

Posterior Sampling

One way to separate task inference from control is to introduce a latent task variable $z$:

$$ z\sim p(z), \qquad q_ \phi(z\mid\mathcal{D}_ {\mathrm{train}}), \qquad a\sim\pi_ \theta(\cdot\mid s,z). $$

PEARL is an example of this style. It learns an inference model $q_ \phi(z\mid\mathcal{D}_ {\mathrm{train}})$ and a policy conditioned on $z$. At test time, the agent samples $z$ from its current posterior (the prior $p(z)$ before any data has been collected), acts as if that sampled task hypothesis were true, and then updates the posterior with the new experience. This is posterior sampling, also called Thompson sampling.

Posterior sampling is attractive because uncertainty drives exploration. If several tasks are plausible, sampling different $z$ values leads to different behaviors. But it can be poor when task identification requires an action that is not useful under any sampled task policy. A hallway with a sign that tells the correct goal is an example: the best information-gathering action may be to read the sign, not to commit to one sampled goal.
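A minimal sketch of posterior sampling over a discrete set of goal hypotheses follows. It is a toy, not PEARL: the posterior is an exact uniform distribution over goals not yet ruled out rather than a learned $q_ \phi$, and the function name `posterior_sampling_search` and the assumption that visiting a candidate reveals whether it is the goal are illustrative.

```python
import random


def posterior_sampling_search(true_goal, candidate_goals, rng):
    """Find a hidden goal by repeatedly sampling a hypothesis and acting on it.

    The belief is uniform over goals not yet ruled out, which is the exact
    posterior here because visiting a candidate reveals whether it is the goal.
    """
    plausible = list(candidate_goals)  # support of the prior p(z)
    visited = []
    while True:
        z = rng.choice(plausible)      # z ~ q(z | D_train)
        visited.append(z)              # act as if z were the true task
        if z == true_goal:             # reward observed: task identified
            return visited
        plausible.remove(z)            # no reward: drop this hypothesis


visited = posterior_sampling_search(3, [0, 1, 2, 3, 4], random.Random(0))
print(visited)  # ends at goal 3 after at most five visits
```

Every action in this loop is optimal for some sampled hypothesis. That is why the procedure can need up to as many visits as there are candidate goals, while a single sign-reading action, which no sampled task policy would ever choose, could identify the goal at once.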

Prediction-Based Exploration Objectives

Another family of methods gives the exploration policy an explicit information objective. Instead of rewarding only the final task return, these methods train the exploration policy to collect data that makes some prediction problem easy.

A dynamics-and-reward prediction objective is:

$$ \max_ {\pi_ {\mathrm{exp}}} I(\mathcal{D}_ {\mathrm{train}};\mathcal{T}) \quad \text{or approximately minimize}\quad \ell(f_ \psi(s',r\mid s,a,\mathcal{D}_ {\mathrm{train}})). $$

MetaCURE follows this spirit: collect data that helps predict task dynamics and rewards. This is easier to optimize than sparse downstream return, but it may waste effort on irrelevant details in high-dimensional environments. If the state contains many distractors, predicting everything is not the same as discovering what matters for control.
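To make the information objective concrete, the sketch below computes the mutual information between the task and the observation produced by a single exploratory action. It reuses the hallway-with-a-sign example from the posterior sampling discussion. It assumes four equally likely goals and deterministic observations, in which case $I(O;\mathcal{T})=H(O)$. The function `information_gain` and both observation functions are hypothetical. This evaluates the exact objective on a toy problem and is not MetaCURE's prediction-based approximation.

```python
import math
from collections import defaultdict


def information_gain(prior, observe):
    """I(O; T) in bits for a deterministic observation O = observe(T).

    With deterministic observations H(O | T) = 0, so I(O; T) = H(O).
    """
    p_obs = defaultdict(float)
    for task, p in prior.items():
        p_obs[observe(task)] += p
    return -sum(p * math.log2(p) for p in p_obs.values() if p > 0)


prior = {goal: 0.25 for goal in range(4)}  # four equally likely goals


def read_sign(goal):
    return goal       # the sign names the goal


def visit_goal_0(goal):
    return goal == 0  # reward only if goal 0 is the true goal


print(information_gain(prior, read_sign))     # 2.0 bits
print(information_gain(prior, visit_goal_0))  # about 0.811 bits
```

An exploration policy trained on this objective prefers reading the sign, which posterior sampling would never do. The same calculation also shows the distractor risk: an observation with high entropy that is unrelated to reward would score well if it happened to vary with the task.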

DREAM instead predicts a compressed task representation:

$$ f_ \psi(\mathcal{D}_ {\mathrm{train}})\approx z_ {\mathrm{comp}}. $$

If the compressed representation captures the control-relevant task information, then exploration can be trained to identify exactly what the exploitation policy needs. This can be much easier than end-to-end reward optimization, but it requires access to or construction of useful task identifiers during training.

Application: Meta-Exploration for Program Feedback

The lecture's CS education example treats each student program as a new environment. The goal is to explore the program to find behavior that helps grade or give feedback. A learned exploration policy can discover informative interactions, such as what happens when an object hits a wall, floor, or target.

This fits the meta-RL pattern:

  • A task is a particular student submission.
  • Exploration collects traces from that submission.
  • The downstream objective is feedback, grading, or bug identification.
  • Meta-training over many submissions teaches which interactions are usually informative.

The broader lesson is that meta-RL is not only about robots adapting to new mazes. It applies whenever the agent must actively collect a small dataset that reveals how a new instance works.

Takeaways

Meta-RL learns an adaptation procedure over a task distribution. The key input at test time is not a clean task ID but a small amount of experience. Black-box meta-RL implements adaptation with memory or latent variables, but the hard part is often exploration: collecting data that is informative before the agent knows the task. Posterior sampling, prediction-based exploration, and compressed task representations are different ways to make that exploration problem more learnable.