1. Imitation Learning

Shurui Liu

Behavior Cloning

Imitation learning starts with demonstrations:

$$ \mathcal{D}=\{\tau_ i\},\qquad \tau_ i=(s_ 1,a_ 1,\ldots,s_ T,a_ T), $$

sampled from an unknown expert policy $\pi_ {\text{expert}}$.

For a deterministic policy, behavior cloning is supervised regression:

$$ \min_ \theta \frac{1}{|\mathcal{D}|} \sum_ {(s,a)\in\mathcal{D}}\|a-\pi_ \theta(s)\|^2. $$
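As a minimal sketch of this regression objective, the snippet below fits a one-parameter linear policy $\pi_ \theta(s)=w s$ by gradient descent on the mean squared error. The expert gain of 2.0, the scalar state, and the learning rate are assumptions chosen for illustration only.

```python
import random

random.seed(0)

# Hypothetical expert: a linear controller a = 2.0 * s (an assumption for illustration).
EXPERT_GAIN = 2.0
states = [random.uniform(-1.0, 1.0) for _ in range(200)]
dataset = [(s, EXPERT_GAIN * s) for s in states]


def bc_loss(w, data):
    """Mean squared error between demonstrated actions and pi_theta(s) = w * s."""
    return sum((a - w * s) ** 2 for s, a in data) / len(data)


def bc_grad(w, data):
    """Gradient of bc_loss with respect to the single parameter w."""
    return sum(-2.0 * (a - w * s) * s for s, a in data) / len(data)


w = 0.0
for _ in range(300):
    w -= 0.5 * bc_grad(w, dataset)  # plain gradient descent, step size 0.5

print(f"learned gain: {w:.4f}, loss: {bc_loss(w, dataset):.2e}")
```

Because the toy expert is itself linear, the learned gain recovers it almost exactly; with a real expert the same loop would only match it as closely as the policy class allows.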

For a stochastic policy, the usual objective is maximum likelihood:

$$ \max_ \theta \mathbb{E}_ {(s,a)\sim\mathcal{D}}\left[\log \pi_ \theta(a\mid s)\right]. $$

Behavior cloning is simple, scalable, and avoids reward design. Its core limitation is that it learns to match the behavior distribution in the dataset, so it cannot reliably improve beyond the demonstrator and can fail under distribution shift.

For continuous actions, squared error corresponds to maximum likelihood under a fixed-variance Gaussian policy:

$$ \pi_ \theta(a\mid s)=\mathcal{N}(a;\mu_ \theta(s),\sigma^2 I). $$

Then maximizing $\log \pi_ \theta(a\mid s)$ is equivalent to minimizing $\|a-\mu_ \theta(s)\|^2$, up to a positive scale factor and an additive constant. This explains why $\ell_ 2$ regression produces averages: the model assumes a single Gaussian mode centered at the mean.
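Writing out the negative log density makes the dropped terms explicit. Here $d$ denotes the action dimension, a symbol introduced only for this derivation:

$$ -\log \pi_ \theta(a\mid s)=\frac{1}{2\sigma^2}\|a-\mu_ \theta(s)\|^2+\frac{d}{2}\log(2\pi\sigma^2). $$

With $\sigma$ fixed, the second term does not depend on $\theta$ and the factor $1/(2\sigma^2)$ only rescales the loss, so the minimizer is the same as for squared error.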

Why Expressive Policy Distributions Matter

The slides highlight a common failure mode: if multiple demonstrators or multiple valid strategies appear in the dataset, an $\ell_ 2$ regression policy predicts the mean action. In driving, if both turning left and turning right are valid in similar observations, averaging steering commands can produce an invalid "drive straight" action.

The issue is not only neural network expressivity. A large network with a unimodal Gaussian output can still represent only one main mode per state. The policy distribution itself must be expressive.
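A small numeric illustration of the averaging effect, assuming a made-up dataset in which half the demonstrations steer left ($-1$) and half steer right ($+1$) for the same observation:

```python
# Hypothetical steering demonstrations for one observation:
# half the demonstrators turn left (-1.0), half turn right (+1.0).
actions = [-1.0] * 50 + [1.0] * 50


def mse(prediction, targets):
    """Mean squared error of a single constant prediction."""
    return sum((a - prediction) ** 2 for a in targets) / len(targets)


# The constant prediction that minimizes squared error is the sample mean.
best = sum(actions) / len(actions)

print("l2-optimal action:", best)
print("MSE at the mean:  ", mse(best, actions))
print("MSE at left mode: ", mse(-1.0, actions))
print("MSE at right mode:", mse(1.0, actions))
```

The squared-error optimum is the straight-ahead action, even though no demonstrator ever chose it and each actual mode scores worse under the loss.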

Common policy-output choices:

  • Discrete actions: neural network outputs categorical probabilities.
  • Continuous unimodal actions: network outputs Gaussian mean and variance.
  • Mixture density policies: network outputs mixture weights, means, and variances.
  • Autoregressive policies: discretize action dimensions and predict them sequentially.
  • Diffusion or flow policies: sample actions by iterative denoising or vector-field integration.

The supervised objective for expressive imitation is still likelihood:

$$ \min_ \theta -\mathbb{E}_ {(s,a)\sim \mathcal{D}}\log\pi_ \theta(a\mid s), $$

but $\pi_ \theta(\cdot\mid s)$ is now a richer conditional generative model.
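To make the benefit of a richer output family concrete, the sketch below compares the average negative log likelihood of a single Gaussian and a two-component Gaussian mixture on the same bimodal 1D actions. The data and the mixture parameters are hand-picked assumptions, not fitted values.

```python
import math


def gaussian_pdf(a, mean, std):
    z = (a - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def mixture_nll(actions, weights, means, stds):
    """Average negative log likelihood under a 1D Gaussian mixture."""
    total = 0.0
    for a in actions:
        density = sum(w * gaussian_pdf(a, m, s)
                      for w, m, s in zip(weights, means, stds))
        total -= math.log(density)
    return total / len(actions)


# Hypothetical bimodal demonstrations for one state.
actions = [-1.0] * 50 + [1.0] * 50

# Single Gaussian with the sample mean (0) and sample standard deviation (1).
unimodal = mixture_nll(actions, [1.0], [0.0], [1.0])
# Two-component mixture centred on the two demonstrated modes.
bimodal = mixture_nll(actions, [0.5, 0.5], [-1.0, 1.0], [0.1, 0.1])

print(f"unimodal NLL: {unimodal:.3f}")
print(f"bimodal NLL:  {bimodal:.3f}")
```

The mixture attains a much lower negative log likelihood because it places density on the two modes instead of spreading it over the gap between them.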

Remark: two kinds of expressivity

  • Neural-network expressivity: can the model compute complicated features from $s$?
  • Distributional expressivity: can the output distribution represent multiple valid actions for the same $s$?

Without distributional expressivity, a large network still fails on multimodal demonstrations. The steering example from the slides is the canonical intuition: left-turn and right-turn demonstrations average into an unsafe straight-ahead action.

There is a useful likelihood interpretation behind the common losses:

  • Squared error is negative log likelihood for a fixed-variance Gaussian.
  • Cross entropy is negative log likelihood for a categorical distribution.
  • Mixture-density, autoregressive, and diffusion policies are still trained by likelihood-like supervised objectives, but they change the family of conditional distributions.
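The categorical case can be checked numerically. In this sketch the three logits and the demonstrated action index are arbitrary assumptions; the point is only that cross entropy against a one-hot target equals the negative log probability of the demonstrated action.

```python
import math


def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(one_hot, probs):
    return -sum(t * math.log(p) for t, p in zip(one_hot, probs))


logits = [2.0, 0.5, -1.0]   # hypothetical policy-head outputs for 3 discrete actions
expert_action = 1           # index of the demonstrated action (assumed)

probs = softmax(logits)
one_hot = [1.0 if i == expert_action else 0.0 for i in range(len(logits))]

ce = cross_entropy(one_hot, probs)
nll = -math.log(probs[expert_action])
print(f"cross entropy: {ce:.6f}, negative log likelihood: {nll:.6f}")
```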

The algorithmic lesson is that "just make the network bigger" does not fix the wrong output distribution. If the policy head collapses every state to one mean action, more hidden layers can make a more complicated mean, but they still cannot represent multiple valid action modes.

Diffusion and Flow Matching for Actions

The slides introduce action diffusion through flow matching. The goal is to transform simple noise into samples from the action distribution conditioned on state.

At training time:

  1. Sample a demonstration action $x_ 1$ and Gaussian noise $x_ 0\sim\mathcal{N}(0,I)$.
  2. Sample an interpolation time $t\in[0,1]$, for example uniformly.
  3. Construct $x_ t=t x_ 1+(1-t)x_ 0$.
  4. Train a vector field $v_ \theta(x_ t,t,s)$, conditioned on the state $s$, to predict the velocity $x_ 1-x_ 0$, typically with a squared-error loss.
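The training steps above can be sketched as a function that builds one regression example. The names `make_training_example` and `flow_matching_loss` are hypothetical, the time is drawn uniformly as an assumption, and the model $v_ \theta$ itself is left out.

```python
import random


def make_training_example(action, rng):
    """Build one flow-matching regression example for a demonstrated action x_1."""
    x1 = list(action)
    x0 = [rng.gauss(0.0, 1.0) for _ in x1]           # Gaussian noise sample
    t = rng.random()                                  # interpolation time in [0, 1)
    xt = [t * b + (1.0 - t) * a for a, b in zip(x0, x1)]
    target = [b - a for a, b in zip(x0, x1)]          # velocity the model should predict
    return xt, t, target


def flow_matching_loss(predicted_velocity, target):
    """Squared error between a predicted velocity and the target x_1 - x_0."""
    return sum((p - q) ** 2 for p, q in zip(predicted_velocity, target))


rng = random.Random(0)
demo_action = [0.3, -0.7]   # hypothetical 2D demonstrated action
xt, t, target = make_training_example(demo_action, rng)

# Following the target velocity from x_t for the remaining time 1 - t recovers x_1.
recovered = [x + (1.0 - t) * v for x, v in zip(xt, target)]
print("recovered action:", recovered)
```

In a full training loop, the state $s$ paired with the demonstrated action would be passed to the model alongside `xt` and `t`, and `flow_matching_loss` would be averaged over a batch.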

At test time:

  1. Sample noise $x_ 0$.
  2. Integrate the learned vector field from $t=0$ to $t=1$.
  3. Return the resulting action.
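A minimal sketch of the sampling procedure, using forward Euler integration. Since no trained model is available here, `toy_vector_field` is a stand-in: it is the exact field for the special case where each state has a single expert action, assumed to be $a=-s$ purely for illustration.

```python
def sample_action(vector_field, state, x0, n_steps=10):
    """Integrate dx/dt = v(x, t, s) from t=0 to t=1 with forward Euler."""
    x = list(x0)
    dt = 1.0 / n_steps
    for k in range(n_steps):
        t = k * dt  # t stays strictly below 1, so 1 - t never reaches zero
        v = vector_field(x, t, state)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


def toy_vector_field(x, t, state):
    """Stand-in for a trained v_theta (assumption: the expert action is -state)."""
    target = [-si for si in state]
    return [(ai - xi) / (1.0 - t) for ai, xi in zip(target, x)]


# The starting noise x0 is fixed here so the example is deterministic.
action = sample_action(toy_vector_field, state=[0.5, -2.0], x0=[1.3, 0.4], n_steps=10)
print("sampled action:", action)
```

The number of model evaluations equals `n_steps`, which is the sampling cost a real diffusion or flow policy pays at every decision.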

For imitation learning, this is a conditional generative model:

$$ a_ t \sim \pi_ \theta(\cdot\mid s_ t), $$

where the subscript $t$ now indexes the environment time step rather than the interpolation time, and sampling happens through iterative denoising. Diffusion policies have been especially effective for visuomotor robot learning because they can represent multimodal continuous controls.

Two implementation details matter:

  • The denoising or vector-field model must be conditioned strongly on the current observation. Otherwise it learns a broad action prior rather than the right action distribution for this state.
  • Sampling time is part of the deployment cost. A diffusion action policy may need several denoising steps, which motivates action chunking or faster samplers.

Action Chunking

Instead of predicting one action every control step, a policy can predict a chunk:

$$ \pi_ \theta(a_ {t:t+k}\mid s_ t). $$

The system may execute the chunk open loop for several steps, or repeatedly replan and blend overlapping chunks. Action chunking often improves robotics policies because it:

  • Reduces the burden of making a new decision at very high frequency.
  • Gives the policy more computation time per decision.
  • Encodes short-horizon temporal consistency in the action sequence.

The tradeoff is less feedback within the chunk. If the environment changes unexpectedly, fully open-loop chunks can be brittle.

In practice, chunking is often combined with receding-horizon execution: predict $a_ {t:t+k}$, execute only the first few actions, observe again, and predict a new chunk. This keeps the temporal smoothness benefit while reducing open-loop brittleness.
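The difference between open-loop and receding-horizon execution can be sketched on a toy 1D system. Everything here is an assumption for illustration: the state is a scalar that should reach 0, `predict_chunk` is a hypothetical hand-written chunk policy, and a single disturbance hits at step 5.

```python
def predict_chunk(state, k):
    """Hypothetical chunk policy: plan k actions, each closing 20% of the gap to 0."""
    chunk, s = [], state
    for _ in range(k):
        a = -0.2 * s
        chunk.append(a)
        s = s + a  # the policy's own prediction of the next state
    return chunk


def run(state, total_steps, k, execute, disturbance):
    """Predict a chunk of k actions, execute the first `execute`, then replan."""
    policy_calls, step = 0, 0
    while step < total_steps:
        chunk = predict_chunk(state, k)
        policy_calls += 1
        for a in chunk[:execute]:
            if step >= total_steps:
                break
            state = state + a + disturbance(step)
            step += 1
    return state, policy_calls


def push(step):
    return 1.0 if step == 5 else 0.0  # unexpected disturbance at step 5


open_loop_final, open_loop_calls = run(1.0, 20, k=8, execute=8, disturbance=push)
receding_final, receding_calls = run(1.0, 20, k=8, execute=2, disturbance=push)

print(f"open loop:        final state {open_loop_final:.4f}, {open_loop_calls} policy calls")
print(f"receding horizon: final state {receding_final:.4f}, {receding_calls} policy calls")
```

Executing only two actions per chunk reacts to the disturbance sooner and ends closer to the goal, at the price of more policy calls.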

Compounding Errors and DAgger

Behavior cloning trains on states from the expert distribution $p_ {\text{expert}}(s)$ but deploys under the learned-policy distribution $p_ {\pi_ \theta}(s)$. Small action errors can move the agent into states that are rare or absent in the demonstrations. Errors then compound because the policy is forced to act in unfamiliar states.

This is covariate shift:

$$ p_ {\text{expert}}(s)\neq p_ {\pi_ \theta}(s). $$

A useful rule of thumb is that small supervised errors can become large sequential errors. In the standard analysis, behavior cloning with per-step error $\epsilon$ on expert states can suffer $O(T^2\epsilon)$ total cost over a horizon of $T$ steps, because early mistakes push the agent into unfamiliar states where further mistakes become more likely. DAgger improves this dependence to roughly $O(T\epsilon)$ by training on states induced by the learner, not only by the expert.
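One common way to see where the quadratic term comes from is a worst-case sketch under two simplifying assumptions: the policy makes its first mistake at any given step with probability at most $\epsilon$, and after that first mistake it may incur cost 1 at every remaining step. A first mistake at step $t$ then costs at most $T-t+1$, so

$$ \mathbb{E}[\text{total cost}]\le\sum_ {t=1}^{T}\epsilon\,(T-t+1)=\epsilon\,\frac{T(T+1)}{2}=O(T^2\epsilon). $$

If the policy could instead recover after each mistake at constant cost, the same sum would give only $O(T\epsilon)$, which is the regime corrective data aims for.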

Dataset Aggregation (DAgger) addresses this by collecting corrective labels on states visited by the learned policy:

  1. Train an initial policy on demonstrations.
  2. Roll out the learned policy.
  3. Ask the expert for the correct action at visited states.
  4. Aggregate these corrections into the dataset.
  5. Retrain or finetune the policy.
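The loop above can be sketched on a toy 1D problem. The expert (a proportional controller), the linear learner, the noisy dynamics, and all constants are assumptions for illustration; a real system would query a human or a planner in `expert_action`.

```python
import random


def expert_action(s):
    """Hypothetical expert: proportional controller driving the state toward 0."""
    return -0.5 * s


def fit_linear_policy(data):
    """Least-squares fit of a = w * s (closed form for one scalar weight)."""
    num = sum(s * a for s, a in data)
    den = sum(s * s for s, _ in data)
    return num / den


def rollout(w, s0, horizon, rng):
    """Run the policy a = w * s with small dynamics noise; return visited states."""
    states, s = [], s0
    for _ in range(horizon):
        states.append(s)
        s = s + w * s + rng.gauss(0.0, 0.05)
    return states


rng = random.Random(0)

# Step 1: initial demonstrations from expert rollouts, then an initial policy.
dataset = [(s, expert_action(s)) for s in rollout(-0.5, 1.0, 10, rng)]
w = fit_linear_policy(dataset)
initial_size = len(dataset)

for _ in range(3):
    visited = rollout(w, rng.uniform(-2.0, 2.0), 10, rng)    # step 2: learner acts
    corrections = [(s, expert_action(s)) for s in visited]   # step 3: expert labels
    dataset.extend(corrections)                               # step 4: aggregate
    w = fit_linear_policy(dataset)                            # step 5: retrain

print(f"dataset size: {len(dataset)}, learned gain: {w:.4f}")
```

In this toy the expert is exactly representable by the learner, so the fit is already correct after step 1; the sketch is meant to show the data flow, in which the labeled states come from the learner's own rollouts.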

Human-gated DAgger is a practical variant: the expert watches the policy, intervenes when needed, and the intervention trajectory is added to the dataset. This interface is often easier than asking for labels at every visited state.

The key assumption behind DAgger is access to an expert or supervisor during online rollouts. DAgger can be more data-efficient than collecting many additional full demonstrations, because the new labels land on states the learner actually visits, but it may be unsafe or expensive in domains where letting the current policy act can cause damage.

A practical subtlety is mixing old demonstrations and new corrections. If the aggregated dataset is dominated by early demonstrations, rare corrective states may be underweighted. If corrections dominate too much, the policy may overfit to recovery behavior and lose nominal-task performance. Many systems use balanced sampling, intervention weighting, or separate buffers for initial demonstrations and corrections.
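One simple form of balanced sampling keeps demonstrations and corrections in separate buffers and fixes the share of each in every training batch. The function name `sample_batch` and the `correction_fraction` parameter are hypothetical, and the 25% share is an arbitrary choice rather than a recommended value.

```python
import random


def sample_batch(demo_buffer, correction_buffer, batch_size, correction_fraction, rng):
    """Draw a batch with a fixed share of corrective samples, whatever the buffer sizes."""
    n_corr = round(batch_size * correction_fraction) if correction_buffer else 0
    n_demo = batch_size - n_corr
    batch = [rng.choice(correction_buffer) for _ in range(n_corr)]
    batch += [rng.choice(demo_buffer) for _ in range(n_demo)]
    rng.shuffle(batch)
    return batch


rng = random.Random(0)
demos = [("demo", i) for i in range(1000)]       # many initial demonstrations
corrections = [("corr", i) for i in range(20)]   # few corrective labels

batch = sample_batch(demos, corrections, batch_size=32, correction_fraction=0.25, rng=rng)
n_corr_in_batch = sum(1 for tag, _ in batch if tag == "corr")
print(f"{n_corr_in_batch} of {len(batch)} samples are corrections")
```

Sampling uniformly from the merged data would give corrections only about 2% of each batch here, so the fixed share keeps rare recovery states from being underweighted.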

Demonstration Collection

Demonstrations may come from naturally logged behavior, such as driving or text writing, or from purpose-built collection. In robotics, common interfaces include:

  • Kinesthetic teaching: physically guide the robot.
  • Remote control: joystick, keyboard, VR, or teleoperation.
  • Puppeteering: use paired hardware or a master device to control the learner.

Human videos can help guide exploration, but direct imitation is hard because of the embodiment gap: humans and robots differ in appearance, degrees of freedom, dynamics, and action spaces.

The demonstration interface changes the data distribution. Kinesthetic teaching can be easy for precise robot motion but may put the human body or hands in camera view. Teleoperation is scalable but can have latency and control mismatch. Puppeteering can produce high-quality actions but requires paired hardware. These are not just data-collection inconveniences; they affect what the policy sees and therefore what it can safely imitate.

Takeaways

Behavior cloning is often the first practical baseline. It works best when demonstrations cover the states the learned policy will visit and when the policy distribution can represent the demonstrated behavior. Expressive generative policies and action chunking are major practical tools. DAgger-style intervention reduces compounding errors but requires online expert interaction.