8. Multi-Task and Goal-Conditioned RL

6 min · Shurui Liu

Multi-Task Setup

Multi-task RL asks for one policy that can solve many tasks. Each task can be viewed as an MDP:

$$ \mathcal{T}_ i= (\mathcal{S}_ i,\mathcal{A}_ i,p_ i(s_ 1),p_ i(s'\mid s,a),r_ i(s,a)). $$

Tasks may differ in reward, initial-state distribution, dynamics, state space, or action space.

The common strategy is to provide a task identifier $z_ i$:

$$ \pi_ \theta(a\mid \bar{s},z_ i), \qquad Q_ \phi(\bar{s},a,z_ i). $$

The task identifier can be a one-hot ID, language instruction, goal image, goal state, video, or prompt. Equivalently, treat the task identifier as part of the state:

$$ s=(\bar{s},z_ i). $$

If this augmented state is Markov, standard RL algorithms can be applied.

This is the idea behind universal policies and universal value functions:

$$ V^\pi(s,z),\qquad Q^\pi(s,a,z),\qquad \pi(a\mid s,z). $$

The task variable $z$ tells the shared network which objective it is currently solving.

The augmented-state view also explains what goes wrong without conditioning: if $z$ is missing or ambiguous, the same physical observation can require different actions, and the problem becomes partially observed.
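As a minimal sketch of the augmented state $s=(\bar{s},z_ i)$, assuming a one-hot task ID and a list-valued observation, the conditioning can be as simple as concatenation. The function names and the three-task setup here are hypothetical and chosen only for illustration.

```python
def one_hot(task_index, num_tasks):
    """Encode a task index as a one-hot list of floats."""
    if not 0 <= task_index < num_tasks:
        raise ValueError("task_index out of range")
    return [1.0 if i == task_index else 0.0 for i in range(num_tasks)]


def augment_state(raw_state, task_index, num_tasks):
    """Return s = (s_bar, z): the raw state followed by the task one-hot."""
    return list(raw_state) + one_hot(task_index, num_tasks)


raw_state = [0.5, -1.0]  # hypothetical 2-D observation
s_task0 = augment_state(raw_state, task_index=0, num_tasks=3)
s_task2 = augment_state(raw_state, task_index=2, num_tasks=3)
print(s_task0)  # [0.5, -1.0, 1.0, 0.0, 0.0]
print(s_task2)  # [0.5, -1.0, 0.0, 0.0, 1.0]
```

The two augmented states share the same physical observation but differ in the task slots, so a policy reading them can choose different actions. Without the task slots the two inputs would be identical, which is the partially observed case.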

Why Multi-Task Learning Helps

A recurring challenge in RL is data efficiency. Multi-task learning amortizes data and computation across tasks:

A legged robot can share balance and locomotion skills across walking, running, and crouching.
A language assistant can share grammar, tool use, and reasoning across user tasks.
A recommender can share representations across users.
A robot manipulator can share visual and motor representations across household tasks.

Generalist systems can be more robust than specialist policies if tasks share useful structure.

The benefit depends on shared structure. Multi-task learning can hurt when tasks conflict, when high-data tasks dominate training, or when one shared policy does not have enough capacity. This is negative transfer.

Multi-Task Imitation Learning

With task conditioning, the single-task behavior cloning objective becomes:

$$ \max_ \theta \mathbb{E}_ {(s,a,z)\sim\mathcal{D}} \left[\log\pi_ \theta(a\mid s,z)\right]. $$

Task conditioning can be implemented by concatenating a task embedding, cross-attending to a language instruction, prompting a vision-language-action model, or conditioning a diffusion/action-chunking policy.
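To make the objective concrete, here is a minimal sketch of the task-conditioned behavior cloning loss for the simplest conditioning scheme, concatenating a one-hot task embedding to the state. The linear-softmax policy, the weight layout, and the tiny dataset are all assumptions for illustration, not a recommended architecture.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def policy_probs(weights, state, task_onehot):
    """Linear-softmax policy pi(a | s, z) on the concatenated input [s, z]."""
    x = list(state) + list(task_onehot)
    logits = [sum(w * xi for w, xi in zip(row, x)) for row in weights]
    return softmax(logits)


def bc_loss(weights, dataset):
    """Mean negative log-likelihood over (s, a, z) tuples (minimized)."""
    total = 0.0
    for state, action, task_onehot in dataset:
        total -= math.log(policy_probs(weights, state, task_onehot)[action])
    return total / len(dataset)


# Same observation, two tasks, two different expert actions.
dataset = [([1.0], 0, [1.0, 0.0]), ([1.0], 1, [0.0, 1.0])]

# One weight row per action; columns are [state, task 0, task 1].
ignore_task = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
use_task = [[0.0, 5.0, -5.0], [0.0, -5.0, 5.0]]

uniform_loss = bc_loss(ignore_task, dataset)
conditioned_loss = bc_loss(use_task, dataset)
```

With the all-zero weights the policy is uniform and the loss equals $\log 2$. The weights that read the task slots drive the loss close to zero, which no policy that ignores $z$ could do on this dataset.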

Stratified sampling is useful: construct minibatches with examples from multiple tasks so gradients are not dominated by high-data tasks.
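One simple way to build such minibatches, sketched here under the assumption that the dataset is already grouped by task, is to draw a fixed number of examples from every task. The task names and sizes below are made up for illustration.

```python
import random


def stratified_minibatch(data_by_task, per_task, rng):
    """Draw `per_task` examples from every task, with replacement."""
    batch = []
    for task_id, examples in data_by_task.items():
        for _ in range(per_task):
            batch.append((task_id, rng.choice(examples)))
    rng.shuffle(batch)  # avoid task-ordered batches
    return batch


# Hypothetical imbalanced dataset: 1000 examples vs. 10 examples.
data_by_task = {
    "walk": [("s%d" % i, "a%d" % i) for i in range(1000)],
    "crouch": [("s%d" % i, "a%d" % i) for i in range(10)],
}

batch = stratified_minibatch(data_by_task, per_task=4, rng=random.Random(0))
counts = {t: sum(1 for task_id, _ in batch if task_id == t) for t in data_by_task}
print(counts)  # {'walk': 4, 'crouch': 4}
```

Sampling with replacement means examples from small tasks are revisited far more often than examples from large ones, so this choice trades imbalance for a higher risk of overfitting the low-data tasks.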

Language conditioning is especially important in modern robotics and LLM settings. The instruction is not just metadata; it changes which behavior should be imitated for the same observation.

A subtle failure mode is task imbalance. If one task has far more data, naive minibatch sampling can produce a policy that is excellent on the dominant task and weak elsewhere. Stratified sampling, per-task losses, or task-balanced replay buffers are simple but important engineering choices.
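A per-task loss is another way to counter imbalance. The sketch below, with made-up loss values, averages within each task first and then across tasks, so each task contributes equally no matter how many examples it has in the batch.

```python
from collections import defaultdict


def naive_mean(losses_with_task):
    """Average over examples: high-data tasks dominate."""
    return sum(loss for _, loss in losses_with_task) / len(losses_with_task)


def task_balanced_mean(losses_with_task):
    """Average within each task, then across tasks."""
    per_task = defaultdict(list)
    for task_id, loss in losses_with_task:
        per_task[task_id].append(loss)
    task_means = [sum(v) / len(v) for v in per_task.values()]
    return sum(task_means) / len(task_means)


# Hypothetical batch: nine well-fit "walk" examples, one poorly fit "crouch".
losses = [("walk", 0.1)] * 9 + [("crouch", 2.0)]

naive = naive_mean(losses)             # about 0.29
balanced = task_balanced_mean(losses)  # about 1.05
```

The naive average hides the poorly fit task, while the balanced average keeps it visible to the optimizer.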

Multi-Task RL

The model-free algorithms from earlier lectures extend directly:

$$ \pi_ \theta(a\mid s) \rightarrow \pi_ \theta(a\mid \bar{s},z), $$

$$ Q_ \phi(s,a) \rightarrow Q_ \phi(\bar{s},a,z). $$
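As an illustration of how little changes, here is a minimal tabular Q-learning update whose table is indexed by $(\bar{s}, z, a)$ instead of $(s, a)$. The task names, learning rate, and discount are assumptions; a function approximator would take the same three inputs.

```python
from collections import defaultdict


def q_update(Q, s, a, r, s_next, z, actions, alpha=0.5, gamma=0.9, done=False):
    """One tabular Q-learning step on the task-conditioned table Q[(s, z, a)]."""
    best_next = 0.0 if done else max(Q[(s_next, z, b)] for b in actions)
    target = r + gamma * best_next
    Q[(s, z, a)] += alpha * (target - Q[(s, z, a)])
    return Q[(s, z, a)]


Q = defaultdict(float)
actions = ["left", "right"]

# The same physical transition earns different rewards under two tasks.
q_update(Q, s=0, a="right", r=1.0, s_next=1, z="go_right", actions=actions, done=True)
q_update(Q, s=0, a="right", r=0.0, s_next=1, z="go_left", actions=actions, done=True)

print(Q[(0, "go_right", "right")])  # 0.5
print(Q[(0, "go_left", "right")])   # 0.0
```

Because $z$ is part of the key, the two tasks learn separate values for the same state-action pair. Note that the bootstrap target keeps $z$ fixed, since the task does not change within an episode.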

Per-task replay buffers can maintain balanced sampling. If dynamics and action spaces are shared, data can sometimes be reused across tasks through relabeling.
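A per-task replay buffer can be sketched as one bounded queue per task plus a sampler that picks the task first. The class name `PerTaskReplay` and its interface are hypothetical; real implementations typically store arrays and sample in batches.

```python
import random
from collections import deque


class PerTaskReplay:
    """One bounded FIFO buffer per task, sampled task-uniformly."""

    def __init__(self, capacity_per_task, seed=0):
        self.capacity = capacity_per_task
        self.buffers = {}
        self.rng = random.Random(seed)

    def add(self, task_id, transition):
        """Append a transition; the oldest one is dropped when full."""
        buf = self.buffers.setdefault(task_id, deque(maxlen=self.capacity))
        buf.append(transition)

    def sample(self, batch_size):
        """Pick a task uniformly, then a transition uniformly within it."""
        tasks = list(self.buffers)
        batch = []
        for _ in range(batch_size):
            task_id = self.rng.choice(tasks)
            batch.append((task_id, self.rng.choice(self.buffers[task_id])))
        return batch


replay = PerTaskReplay(capacity_per_task=3)
for i in range(5):
    replay.add("task_a", i)      # only the last three are kept
replay.add("task_b", "x")        # a task with a single transition
batch = replay.sample(200)
```

Choosing the task before the transition means a task with one stored transition is sampled about as often as a task with thousands, which is the balancing effect described above.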

If tasks have different action spaces or dynamics, sharing is harder. Common engineering choices include action-space adapters, task-specific heads, or restricting multi-task training to a compatible task family.

Hindsight Relabeling

Suppose data collected for one task accidentally achieves another task. Hindsight relabeling stores the same transition sequence with a different task identifier and recomputed rewards.

Generic multi-task relabeling:

Collect trajectory with task $z_ i$.
Store it normally.
Choose another task $z_ j$.
Recompute rewards using $r_ j(s_ t,a_ t)$.
Store relabeled data for task $z_ j$.

This requires:

Shared dynamics across tasks.
A reward function that can be evaluated after the fact.
An off-policy algorithm that can learn from relabeled replay data.
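The generic relabeling steps above can be sketched as follows, assuming shared dynamics and reward functions $r_ j(s_ t,a_ t)$ that are available as plain callables. The transition layout `(s, a, r, s_next, z)` and the two toy tasks are hypothetical.

```python
def relabel_trajectory(trajectory, new_task, reward_fns):
    """Copy a trajectory under `new_task`, recomputing every reward.

    trajectory: list of (s, a, r, s_next, z) tuples.
    reward_fns: dict mapping a task ID to a function r(s, a).
    """
    reward_fn = reward_fns[new_task]
    return [
        (s, a, reward_fn(s, a), s_next, new_task)
        for (s, a, _old_reward, s_next, _old_task) in trajectory
    ]


# Two toy tasks on a shared 1-D chain (same dynamics, different rewards).
reward_fns = {
    "move_right": lambda s, a: float(a),
    "stay_near_origin": lambda s, a: -abs(s),
}

# Collected while commanded to "move_right".
trajectory = [(0, 1, 1.0, 1, "move_right"), (1, 1, 1.0, 2, "move_right")]

replay = list(trajectory)  # store it normally
replay += relabel_trajectory(trajectory, "stay_near_origin", reward_fns)
```

States, actions, and next states are copied unchanged; only the reward and the task label differ. That is exactly why the approach depends on shared dynamics.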

Relabeling is not valid for arbitrary tasks. It works when the observed transition would have been physically the same under the relabeled task and only the reward or goal label changes. If the task changes the dynamics, the relabeled transition may be impossible under that task.

Relabeling also assumes that the recomputed reward faithfully reflects what the relabeled task requires. If a task requires an intention that affected unobserved choices, simply changing the label may be misleading. Goal reaching is the clean case because the same trajectory really did reach some future state, even if it was not the commanded one.

Goal-Conditioned RL and HER

Goal-conditioned RL is a special case where the task identifier is a goal state:

$$ z=s_ g. $$

The policy is:

$$ \pi_ \theta(a\mid s,s_ g). $$

Rewards often measure goal reaching:

$$ r(s,a,s_ g)=\mathbf{1}\{\|s-s_ g\|\le \epsilon\}, $$

$$ r(s,a,s_ g)=-d(s,s_ g). $$
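Both reward forms are easy to evaluate after the fact, which matters for relabeling. The sketch below assumes Euclidean distance for both the norm and $d$, and a tolerance of `eps=0.05`; neither choice comes from the text above.

```python
import math


def distance(s, g):
    """Euclidean distance between a state and a goal (an assumed metric)."""
    return math.sqrt(sum((si - gi) ** 2 for si, gi in zip(s, g)))


def sparse_reward(s, g, eps=0.05):
    """Indicator reward: 1 inside the eps-ball around the goal, else 0."""
    return 1.0 if distance(s, g) <= eps else 0.0


def dense_reward(s, g):
    """Negative distance to the goal."""
    return -distance(s, g)


print(sparse_reward((1.0, 1.0), (1.0, 1.01)))  # 1.0
print(sparse_reward((0.0, 0.0), (3.0, 4.0)))   # 0.0
print(dense_reward((0.0, 0.0), (3.0, 4.0)))    # -5.0
```

The action argument is omitted because neither form depends on it. The sparse form gives no gradient of progress away from the goal, which is the exploration problem that relabeling is meant to ease.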

Hindsight experience replay (HER) relabels a failed trajectory using a state it actually reached as the goal. If a robot failed to reach the commanded goal but ended at $s_ T$, then replay the trajectory as if the goal had been $s_ T$:

$$ (s_ {1:T},a_ {1:T},s_ g) \rightarrow (s_ {1:T},a_ {1:T},s_ T). $$

Future states from the same trajectory can also be used as goals. This turns failed attempts into a useful learning signal and eases exploration under sparse rewards.

HER is most useful when rewards are sparse and many attempted trajectories reach some meaningful state, even if not the commanded goal. It converts "failed at goal $g$" into "successful at goal $g'$" for a goal $g'$ that was actually achieved.

Common relabeling choices include the final state, a random future state, or several future states from the same trajectory. Future-state relabeling is usually stronger than arbitrary-state relabeling because the trajectory prefix actually leads to that goal under the recorded actions.
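A minimal sketch of the final-state and future-state relabeling choices is shown below. It assumes transitions stored as `(s, a, s_next)` and, as a labeled assumption, evaluates the reward on the next state so that the transition that actually arrives at the relabeled goal is rewarded. The function name `her_relabel` and its parameters are hypothetical.

```python
import random


def her_relabel(trajectory, reward_fn, strategy="final", k=1, rng=None):
    """Relabel a trajectory with goals it actually reached.

    trajectory: list of (s, a, s_next) tuples.
    reward_fn: function r(achieved_state, goal).
    Returns a list of (s, a, r, s_next, goal) tuples.
    """
    rng = rng or random.Random(0)
    horizon = len(trajectory)
    relabeled = []
    for t, (s, a, s_next) in enumerate(trajectory):
        if strategy == "final":
            goals = [trajectory[-1][2]]  # last state reached
        elif strategy == "future":
            # k states reached at or after step t in the same trajectory
            goals = [trajectory[rng.randrange(t, horizon)][2] for _ in range(k)]
        else:
            raise ValueError("unknown strategy: %r" % strategy)
        for g in goals:
            relabeled.append((s, a, reward_fn(s_next, g), s_next, g))
    return relabeled


def reached(achieved, goal):
    """Sparse goal-reaching reward for exact matches."""
    return 1.0 if achieved == goal else 0.0


# The agent was commanded to reach state 5 but ended at state 3.
trajectory = [(0, +1, 1), (1, +1, 2), (2, +1, 3)]

final = her_relabel(trajectory, reached, strategy="final")
future = her_relabel(trajectory, reached, strategy="future", k=2)
print([r for (_, _, r, _, _) in final])  # [0.0, 0.0, 1.0]
```

Under the commanded goal every reward in this trajectory would be zero. After final-state relabeling, the last transition carries a positive reward, which gives an off-policy learner something to bootstrap from.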

Takeaways

Multi-task learning shares weights; relabeling shares data. Goal-conditioned RL is especially suitable for relabeling because goals and rewards can often be recomputed after data collection. The main limitation is that relabeling assumes compatible dynamics and rewards that can be evaluated after the fact.
