10. Hierarchical Imitation and Reinforcement Learning
Why Hierarchy
Long-horizon tasks are hard because the agent visits many states, has many chances to make irreversible mistakes, and may get stuck far from useful feedback. Cooking a meal, driving to a distant location, fixing a software bug, or giving feedback on a long report all require many coordinated decisions.
Hierarchy tries to decompose a hard task into easier subtasks. Instead of mapping directly from the current observation to low-level actions for the entire task, a high-level policy proposes an intermediate goal and a low-level policy tries to achieve it.
The usual two-level form is:
$$ g_ t\sim\pi_ {\mathrm{HL}}(\cdot\mid o_ t,c), \qquad a_ t\sim\pi_ {\mathrm{LL}}(\cdot\mid o_ t,g_ t). $$
Here $c$ is the task command or prompt, $g_ t$ is a subgoal, and $a_ t$ is the primitive action. The low-level policy usually runs at a higher frequency than the high-level policy.
Subgoals have many names: skills, options, subtasks, waypoints, high-level actions, or intermediate goals. The vocabulary differs across robotics, RL, and language-model systems, but the idea is the same: choose a simpler target that makes the next part of the problem easier.
Rolling Out a Hierarchical Policy
A simple rollout procedure is:
- Observe $o_ 1$ and receive the task command $c$.
- Query the high-level policy:
$$ g_ t\sim\pi_ {\mathrm{HL}}(\cdot\mid o_ t,c). $$
- Execute the low-level policy for some duration:
$$ a_ t\sim\pi_ {\mathrm{LL}}(\cdot\mid o_ t,g_ t). $$
- Decide whether to keep pursuing the same subgoal or ask for a new one.
This creates two coupled problems. The high-level policy must issue subgoals that are useful for the task and achievable by the low-level policy. The low-level policy must be trained on the kind of subgoals that the high-level policy will actually produce.
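The rollout steps above can be sketched in a few lines of Python. Everything here is an illustrative assumption rather than a method from the lecture: the environment is a toy 1-D line where the observation is an integer position, the command `c` is a target position, the hypothetical `high_level` function proposes a waypoint at most three units away, and the hypothetical `low_level` function moves one unit toward the subgoal. Both are deterministic stand-ins for the sampled policies in the text.

```python
def high_level(obs: int, command: int, max_jump: int = 3) -> int:
    """Toy pi_HL: propose a waypoint at most `max_jump` units toward the command."""
    delta = max(-max_jump, min(max_jump, command - obs))
    return obs + delta


def low_level(obs: int, subgoal: int) -> int:
    """Toy pi_LL: step one unit toward the subgoal, or stay put if already there."""
    if subgoal > obs:
        return 1
    if subgoal < obs:
        return -1
    return 0


def rollout(obs: int, command: int, max_steps: int = 50):
    """Run the two-level loop; return the final observation and the subgoals issued."""
    subgoals = []
    subgoal = None
    for _ in range(max_steps):
        if obs == command:
            break
        # Ask for a new subgoal at the start, or once the current one is reached.
        if subgoal is None or obs == subgoal:
            subgoal = high_level(obs, command)
            subgoals.append(subgoal)
        action = low_level(obs, subgoal)
        obs = obs + action  # toy deterministic dynamics (an assumption)
    return obs, subgoals


final_obs, issued = rollout(obs=0, command=7)
print(final_obs, issued)  # 7 [3, 6, 7]
```

In this toy the low level can always reach what the high level asks for. If `max_jump` produced waypoints the low level could not reach, the loop would stall, which is the coordination problem described above.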
Why It Can Help
Hierarchy can help for several reasons.
First, subgoals provide a supervision signal about how the task should be completed. A demonstration labeled with "open the drawer," "pick up the bowl," and "pour into the bowl" contains more structure than a flat stream of actions.
Second, low-level skills can be reused across tasks. The same reaching, navigation, grasping, or tool-use behavior may appear in many long-horizon tasks.
Third, in RL, exploration can happen in a smaller high-level space. Searching over subgoals can be easier than searching over every primitive action sequence.
Fourth, hierarchy can match system constraints. A robot controller may need actions at 50 Hz, while a language or planning model may only need to run every few seconds. Separate policies allow different latency and compute budgets at different levels.
Flat Policies and Chain-of-Thought Policies
Hierarchy is not the only way to expose intermediate structure. A flat policy can map directly from command and observation to action:
$$ a_ t\sim\pi(\cdot\mid o_ t,c). $$
A chain-of-thought-style policy can produce both an intermediate plan and an action:
$$ (g_ t,a_ t)\sim\pi(\cdot\mid o_ t,c). $$
This can benefit from similar supervision, but generating a plan alongside every action may be computationally expensive when low-level actions are needed at high frequency. There is no universal rule that multiple policies are always better. The point is to choose the decomposition that gives useful supervision, reusable skills, or easier exploration without introducing unnecessary coordination failures.
Options Background
Classic hierarchical RL often uses the options framework. An option $\omega$ includes:
- An initiation set $\mathcal{I}_ \omega$ where the option can start.
- An intra-option policy $\pi_ \omega(a\mid s)$.
- A termination rule $\beta_ \omega(s)$.
The high-level policy chooses an option:
$$ \omega\sim\pi_ {\mathrm{HL}}(\cdot\mid s), $$
and the option produces primitive actions until it terminates. Modern hierarchical policies often replace discrete options with language, images, continuous goal states, or learned latent skills, but the same design questions remain: what should the high-level action mean, how is the low-level policy trained, and when does control return to the high level?
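The three components of an option map naturally onto a small data structure. The sketch below is a minimal illustration under stated assumptions: states are integers, the intra-option policy and termination rule are deterministic functions (in the general framework $\pi_ \omega$ and $\beta_ \omega$ are probabilistic), and the names `Option`, `run_option`, and `step` are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List, Set, Tuple


@dataclass
class Option:
    """An option: initiation set, intra-option policy, and termination rule."""
    name: str
    initiation_set: Set[int]            # states where the option may start
    policy: Callable[[int], int]        # deterministic stand-in for pi_omega(a | s)
    termination: Callable[[int], bool]  # deterministic stand-in for beta_omega(s)


def run_option(option: Option, state: int, step: Callable[[int, int], int],
               max_steps: int = 100) -> Tuple[int, List[int]]:
    """Execute primitive actions until the option terminates or max_steps is hit."""
    if state not in option.initiation_set:
        raise ValueError(f"option {option.name!r} cannot start in state {state}")
    actions = []
    for _ in range(max_steps):
        if option.termination(state):
            break
        action = option.policy(state)
        actions.append(action)
        state = step(state, action)
    return state, actions


# Toy chain of integer states; the option walks right until it reaches state 5.
go_to_five = Option(
    name="go_to_five",
    initiation_set=set(range(0, 5)),
    policy=lambda s: 1,
    termination=lambda s: s >= 5,
)
end_state, actions = run_option(go_to_five, state=2, step=lambda s, a: s + a)
print(end_state, actions)  # 5 [1, 1, 1]
```

Control returns to the caller when `run_option` returns, which is where a high-level policy would choose the next option.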
Choosing a Goal Representation
Good subgoal representations are domain-specific. A navigation system might use coordinates or waypoints. A robot manipulation system might use object poses, goal images, or language commands. A legal-writing assistant might use section plans or claims to support. A vacation-planning assistant might use constraints, reservations, and itinerary milestones.
The lecture highlights three properties:
- Expressive: $g$ should be able to communicate many different low-level behaviors.
- Structured: similar behaviors should have similar goals.
- Appropriately abstract: $g$ should be neither too hard for the low-level policy nor too low-level for the high-level policy.
If the subgoal is too abstract, the low-level policy cannot reliably execute it. If it is too detailed, the high-level policy is almost solving the original action-selection problem.
Supervising the Levels
Training everything end-to-end with a latent goal can collapse into a flat policy. If the latent variable is not constrained or supervised, the model may ignore it or use it in a way that does not provide reusable structure.
A low-level policy should be trained to accomplish the subgoal, not the original long-horizon command:
$$ \max_ \theta \mathbb{E}_ {(o,a,g)\sim\mathcal{D}_ {\mathrm{LL}}} [\log\pi_ {\mathrm{LL},\theta}(a\mid o,g)]. $$
A high-level policy should be trained to choose subgoals that complete the original task:
$$ \max_ \phi \mathbb{E}_ {(o,c,g)\sim\mathcal{D}_ {\mathrm{HL}}} [\log\pi_ {\mathrm{HL},\phi}(g\mid o,c)]. $$
The circular dependency is important:
- The low-level policy should train on goals the high-level policy will output.
- The high-level policy should train against the low-level policy that will actually execute its goals.
A practical recipe is to train the two levels separately at first, then adapt at least one level to the other. If both levels are frozen after independent training, small mismatches can compound over long horizons.
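Both objectives above are log-likelihood maximization over a dataset, differing only in what counts as context and target. As a minimal sketch, assume discrete observations, goals, and actions, so each policy is a table; in that case the maximizer is simply the normalized count within each context. The datasets and the helper names `fit_tabular` and `avg_log_likelihood` are hypothetical, and a real system would use a neural network trained by gradient descent.

```python
import math
from collections import Counter, defaultdict


def fit_tabular(pairs):
    """Maximum-likelihood categorical policy from (context, target) pairs.

    For a tabular model, the average log-likelihood is maximized by
    normalized counts within each context.
    """
    counts = defaultdict(Counter)
    for context, target in pairs:
        counts[context][target] += 1
    return {
        context: {t: n / sum(ctr.values()) for t, n in ctr.items()}
        for context, ctr in counts.items()
    }


def avg_log_likelihood(policy, pairs):
    """Average log pi(target | context) over the dataset."""
    return sum(math.log(policy[ctx][tgt]) for ctx, tgt in pairs) / len(pairs)


# Hypothetical low-level data: context = (o, g), target = a.
d_ll = [(("far", "reach cup"), "move"), (("far", "reach cup"), "move"),
        (("near", "reach cup"), "grasp"), (("far", "reach cup"), "grasp")]
# Hypothetical high-level data: context = (o, c), target = g.
d_hl = [(("far", "make tea"), "reach cup"), (("near", "make tea"), "pour water")]

pi_ll = fit_tabular(d_ll)  # conditioned on the subgoal, not the command
pi_hl = fit_tabular(d_hl)  # conditioned on the command, outputs subgoals
print(pi_ll[("far", "reach cup")])
print(avg_log_likelihood(pi_ll, d_ll))
```

Note that `pi_ll` never sees the command "make tea" and `pi_hl` never sees primitive actions, which mirrors the separation of supervision described above.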
When to Replan
The high-level policy can be queried in two main ways.
One option is to replan when the low-level policy has completed $g_ t$. This is ideal when completion can be estimated reliably, but detection errors are costly in both directions. If the agent falsely believes a subgoal is incomplete, it may get stuck forever. If it falsely believes a subgoal is complete, the high-level policy may build on a state that was never reached.
Another option is fixed-interval replanning. Every $n$ timesteps, query:
$$ g_ {t+n}\sim\pi_ {\mathrm{HL}}(\cdot\mid o_ {t+n},c). $$
This is simple and robust to completion-detector failures. The tradeoff is choosing $n$. A large $n$ delays recovery from bad subgoals. A small $n$ puts more burden on the high-level policy and may increase compute cost.
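The effect of $n$ on high-level query count can be made concrete with a small calculation. The sketch below assumes zero-indexed timesteps and a fixed horizon; `replan_steps` and `should_replan` are hypothetical helpers. The second function combines the interval rule with an optional completion signal, which is a hybrid added here for illustration and not something the text prescribes.

```python
def replan_steps(horizon: int, n: int):
    """Timesteps at which the high-level policy is queried under fixed-interval replanning."""
    if n <= 0:
        raise ValueError("n must be positive")
    return list(range(0, horizon, n))


def should_replan(t_since_query: int, n: int, subgoal_done: bool = False) -> bool:
    """Replan when the interval has elapsed, or earlier if a (hypothetical)
    completion detector reports that the subgoal is done."""
    return subgoal_done or t_since_query >= n


for n in (2, 5, 10):
    steps = replan_steps(horizon=20, n=n)
    print(n, len(steps), steps)
# 2 10 [0, 2, 4, 6, 8, 10, 12, 14, 16, 18]
# 5 4 [0, 5, 10, 15]
# 10 2 [0, 10]
```

With a 20-step horizon, going from $n=10$ to $n=2$ raises the number of high-level queries from 2 to 10, while a bad subgoal issued at a query step persists for $n$ steps before the next query can replace it.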
Hierarchical Imitation with Language Subgoals
With segmented demonstrations, each segment can have a language label:
$$ (o_ {t:t+k},a_ {t:t+k},g). $$
The low-level language-conditioned behavior cloning policy learns:
$$ \pi_ {\mathrm{LL}}(a\mid o,g). $$
The high-level policy learns to issue the language command $g$ from observations and the task prompt:
$$ \pi_ {\mathrm{HL}}(g\mid o,c). $$
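One way to turn labeled segments into the two training sets is sketched below. The segment format, the `build_datasets` helper, and the example strings are assumptions for illustration. In particular, the sketch emits a high-level example at every timestep of a segment; a system that only queries the high level at segment boundaries might keep just the first observation of each segment instead.

```python
from typing import List, Tuple

# One segment: (observations, actions, language label g).
Segment = Tuple[List[str], List[str], str]


def build_datasets(segments: List[Segment], command: str):
    """Split language-labeled segments into low-level and high-level training tuples."""
    d_ll = []  # (o, g, a) examples for pi_LL(a | o, g)
    d_hl = []  # (o, c, g) examples for pi_HL(g | o, c)
    for observations, actions, label in segments:
        if len(observations) != len(actions):
            raise ValueError("each observation needs a matching action")
        for o, a in zip(observations, actions):
            d_ll.append((o, label, a))
            d_hl.append((o, command, label))
    return d_ll, d_hl


segments = [
    (["o0", "o1"], ["a0", "a1"], "open the drawer"),
    (["o2"], ["a2"], "pick up the bowl"),
]
d_ll, d_hl = build_datasets(segments, command="put the bowl in the drawer")
print(d_ll[0])   # ('o0', 'open the drawer', 'a0')
print(d_hl[-1])  # ('o2', 'put the bowl in the drawer', 'pick up the bowl')
```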
This setup is useful because corrections given in language can supervise the high-level policy alone. In the DAgger-style version from the slides, a human correction overrides a bad high-level language prediction, the low-level policy executes the corrected command, and the high-level policy is updated on the correction. The low-level controller can remain frozen.
This is a clean example of hierarchy reducing supervision cost. Instead of providing dense robot actions, the human can intervene with a language instruction such as "move closer to the cup" or "avoid pouring outside the bag."
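The correction loop can be sketched as a data-collection routine. This is a loose illustration under assumptions, not the procedure from the slides: `high_level` is a toy policy that always predicts the same command, and `human_correction` is a hypothetical callback that returns a corrected language command or `None` when the prediction is acceptable. Only high-level tuples are collected; nothing here touches the low-level policy.

```python
def collect_corrections(observations, command, high_level, human_correction):
    """Gather (o, c, g) tuples where a human overrode the high-level prediction.

    Returns the corrections (new high-level training data) and the list of
    commands that a frozen low-level policy would have been asked to execute.
    """
    corrections = []
    executed = []
    for o in observations:
        predicted = high_level(o, command)
        corrected = human_correction(o, predicted)
        g = corrected if corrected is not None else predicted
        executed.append(g)
        if corrected is not None:
            corrections.append((o, command, corrected))
    return corrections, executed


def toy_high_level(o, c):
    return "pour"  # deliberately naive: always predicts the same command


def toy_human(o, predicted):
    # Intervene only when pouring would be premature.
    if o == "cup far away" and predicted == "pour":
        return "move closer to the cup"
    return None


corrections, executed = collect_corrections(
    ["cup far away", "cup in reach"], "pour water", toy_high_level, toy_human
)
print(corrections)  # [('cup far away', 'pour water', 'move closer to the cup')]
print(executed)     # ['move closer to the cup', 'pour']
```

The returned `corrections` would be appended to the high-level training set before the next round of fine-tuning, in the usual data-aggregation style.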
Hierarchical Imitation with Image Subgoals
Language labels are not always available. Another approach is to use image subgoals: a high-level policy proposes a goal image, and the low-level policy is conditioned on that image.
Conceptually:
$$ g_ t^{\mathrm{img}}\sim\pi_ {\mathrm{HL}}(\cdot\mid o_ t,c), \qquad a_ t\sim\pi_ {\mathrm{LL}}(\cdot\mid o_ t,g_ t^{\mathrm{img}}). $$
The SuSIE example in the slides uses image editing to synthesize plausible visual subgoals. The benefit is that high-level training can incorporate unlabeled video data and does not require every segment to have a human-written language annotation.
The limitation is that image goals must be physically meaningful and controllable. A generated goal image that violates geometry, contacts, or object affordances can mislead the low-level policy.
Hierarchical RL
In hierarchical RL, the high-level actions are often goals in state space. A low-level goal-conditioned policy receives a goal-reaching reward:
$$ r_ {\mathrm{LL}}(s,a,g)=\mathbf{1}\{d(s,g)\le \epsilon\}. $$
The high-level policy chooses goal states:
$$ g_ t\sim\pi_ {\mathrm{HL}}(\cdot\mid s_ t,c). $$
Because the high-level action is itself a goal, off-policy relabeling can be useful. If the low-level policy failed to reach the commanded goal but reached another state, the transition can sometimes be relabeled as an attempt to reach the achieved state. This connects hierarchical RL back to hindsight experience replay (HER) from goal-conditioned RL.
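The goal-reaching reward and hindsight relabeling can be illustrated on a toy trajectory. The sketch assumes scalar states with $d$ taken as absolute difference, evaluates the reward on the next state of each transition (a common convention, chosen here as an assumption), and relabels every transition with the final achieved state. The function names are hypothetical.

```python
def goal_reward(state: float, goal: float, eps: float = 0.1) -> float:
    """Sparse goal-reaching reward: 1 if within eps of the goal, else 0."""
    return 1.0 if abs(state - goal) <= eps else 0.0


def relabel_with_achieved(transitions, eps: float = 0.1):
    """Hindsight relabeling of low-level transitions (s, a, s_next, g).

    The commanded goal is replaced by the state actually reached at the end
    of the segment, and the reward is recomputed for the new goal.
    """
    achieved = transitions[-1][2]
    return [
        (s, a, s_next, achieved, goal_reward(s_next, achieved, eps))
        for s, a, s_next, _ in transitions
    ]


# The high level commanded goal 2.0, but the low level only got to 1.0.
segment = [(0.0, +1, 0.5, 2.0), (0.5, +1, 1.0, 2.0)]
original_rewards = [goal_reward(s_next, g) for _, _, s_next, g in segment]
relabeled = relabel_with_achieved(segment)
print(original_rewards)            # [0.0, 0.0]
print([t[4] for t in relabeled])   # [0.0, 1.0]
```

Under the commanded goal the segment carries no reward signal; after relabeling, the same data becomes a successful example of reaching state 1.0.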
Language can also be used as the high-level action for semantic tasks. The low-level policy is language-conditioned, and hindsight language relabeling can reinterpret outcomes in terms of commands that were actually achieved.
Skill Discovery
Another active direction is unsupervised skill discovery. Instead of manually specifying subgoals, learn a diverse set of behaviors:
$$ z\sim p(z), \qquad a\sim\pi(a\mid s,z). $$
Methods such as DIAYN reward a skill when the states it visits make that skill easy to identify, which pushes different skills toward distinguishable parts of the state space. The goal is to produce a library of reusable behaviors before knowing the downstream task.
Skill discovery is appealing because it can use unsupervised environment interaction. The hard part is ensuring that diversity is useful. A policy can learn many distinguishable but irrelevant behaviors unless the skill objective is aligned with future tasks.
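A DIAYN-style pseudo-reward has the form $\log q(z\mid s)-\log p(z)$, where $q$ is a learned discriminator that guesses the skill from the state. The sketch below is a toy version under assumptions: states are discrete strings, $p(z)$ is uniform, and the discriminator is a smoothed count table (real implementations train a neural classifier). The helper names are hypothetical.

```python
import math
from collections import Counter, defaultdict


def fit_discriminator(visits, num_skills: int, smoothing: float = 1.0):
    """Tabular q(z | s) from (state, skill) visits with additive smoothing."""
    counts = defaultdict(Counter)
    for s, z in visits:
        counts[s][z] += 1

    def q(z, s):
        total = sum(counts[s].values()) + smoothing * num_skills
        return (counts[s][z] + smoothing) / total

    return q


def diversity_reward(q, z, s, num_skills: int) -> float:
    """log q(z | s) - log p(z), with a uniform prior p(z) = 1 / num_skills."""
    return math.log(q(z, s)) - math.log(1.0 / num_skills)


# Skill 0 mostly visits "left", skill 1 mostly visits "right"; both pass "start".
visits = [("start", 0), ("left", 0), ("left", 0), ("left", 0),
          ("start", 1), ("right", 1), ("right", 1), ("right", 1)]
q = fit_discriminator(visits, num_skills=2)

print(diversity_reward(q, 0, "left", 2))   # positive: "left" identifies skill 0
print(diversity_reward(q, 0, "start", 2))  # zero: "start" is uninformative
print(diversity_reward(q, 1, "left", 2))   # negative: skill 1 looks like skill 0 here
```

The reward only measures how identifiable a skill is. Nothing in it checks whether "left" or "right" matters for any downstream task, which is the alignment concern raised above.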
Takeaways
Hierarchy is useful when intermediate structure makes long-horizon behavior easier to learn, supervise, reuse, or execute under latency constraints. The central design choices are the subgoal representation, the supervision for each level, and the replanning rule. The main risk is mismatch: the high-level policy may output goals the low-level policy cannot execute, while the low-level policy may be trained on goals the high-level policy never uses.