11. Sim2Real Robot Learning

Shurui Liu

Why Use Simulation

Sim2Real robot learning trains or improves a policy in simulation and then deploys it on a physical robot. The motivation is practical: collecting real-world robot data is slow, expensive, and potentially unsafe, and it can damage hardware. Simulation can generate large amounts of experience quickly and expose ground-truth labels that would be hard to measure in the real world.

The basic recipe is:

  1. Build or choose a simulator.
  2. Create many parallel training environments.
  3. Train a policy with RL, often PPO.
  4. Use randomization, adaptation, or real-data calibration to reduce the sim2real gap.
  5. Deploy the policy on the real robot.

The main promise is scale. A simulator can run faster than real time, reset automatically, randomize conditions, and provide privileged quantities such as contact forces, terrain shape, object pose, or exact robot state.

Simulators and Frameworks

The lecture distinguishes simulators from robot-learning frameworks. A simulator computes the physical evolution of a scene. A framework wraps simulation with task definitions, environment managers, parallelization, reward code, logging, and training infrastructure.

Examples:

  • MuJoCo is a physics simulator.
  • IsaacSim is a simulator.
  • IsaacLab is a robot-learning framework on top of IsaacSim.
  • MJX and MuJoCo Warp are accelerator-oriented members of the MuJoCo family.

For RL, the framework matters because training performance is often limited by environment throughput, reset logic, observation construction, reward computation, and batching. A good simulator alone is not enough.
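As a rough illustration of what a framework adds around a simulator, the sketch below steps a batch of toy environments in one call, builds observations, computes rewards, and resets finished environments automatically. `ToyVecEnv`, its point-mass dynamics, and its reward are hypothetical and are not taken from IsaacLab or any other framework.

```python
import random


class ToyVecEnv:
    """Hypothetical batched environment: N independent 1-D point masses."""

    def __init__(self, num_envs, max_steps=50, dt=0.05, seed=0):
        self.num_envs, self.max_steps, self.dt = num_envs, max_steps, dt
        self.rng = random.Random(seed)
        self.pos = [0.0] * num_envs
        self.vel = [0.0] * num_envs
        self.steps = [0] * num_envs
        for i in range(num_envs):
            self._reset_one(i)

    def _reset_one(self, i):
        # Reset logic: random start position, zero velocity.
        self.pos[i] = self.rng.uniform(-1.0, 1.0)
        self.vel[i] = 0.0
        self.steps[i] = 0

    def observe(self):
        # Observation construction: one (pos, vel) pair per environment.
        return list(zip(self.pos, self.vel))

    def step(self, actions):
        rewards, dones = [], []
        for i, u in enumerate(actions):
            # "Simulator" part: one Euler step of a unit point mass.
            self.vel[i] += self.dt * u
            self.pos[i] += self.dt * self.vel[i]
            self.steps[i] += 1
            # "Framework" part: reward, termination, automatic reset.
            rewards.append(-(self.pos[i] ** 2) - 0.01 * u * u)
            done = self.steps[i] >= self.max_steps or abs(self.pos[i]) > 2.0
            dones.append(done)
            if done:
                self._reset_one(i)
        return self.observe(), rewards, dones


env = ToyVecEnv(num_envs=8)
obs = env.observe()
for _ in range(120):
    actions = [-2.0 * p - 1.0 * v for p, v in obs]  # simple PD policy
    obs, rewards, dones = env.step(actions)
```

Real frameworks run this loop on an accelerator over thousands of environments, but the division of labor between the physics step and the surrounding bookkeeping is the same.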

The Sim2Real Gap

A simulator is only a model of the real world. The sim2real gap is the mismatch between simulated transitions and real transitions:

$$ x_{t+1}^{\mathrm{sim}} = f_{\mathrm{sim}}(x_t, u_t, e), \qquad x_{t+1}^{\mathrm{real}} = f_{\mathrm{real}}(x_t, u_t). $$

Here $e$ represents environment and simulator parameters such as mass, inertia, friction, delay, terrain, sensor noise, actuator strength, or lighting.

The slides separate two kinds of mismatch:

  • Parametric mismatch: the simulator has the right form, but parameters are wrong.
  • Non-parametric mismatch: the simulator omits effects entirely.

Parametric mismatch includes incorrect mass, inertia, friction, damping, delay, or motor constants. Non-parametric mismatch includes effects such as complex aerodynamics, fluid dynamics, deformable contacts, tire-soil interactions, cable dynamics, or unmodeled link flex.
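The two kinds of mismatch can be contrasted on a toy velocity model. In this minimal sketch, the "real" systems, the damping value 0.8, and the quadratic drag term are all made up for illustration.

```python
DT = 0.1


def f_sim(v, u, e):
    """Simulated velocity update with linear damping coefficient e."""
    return v + DT * (u - e * v)


def f_real_parametric(v, u):
    # Same functional form as the simulator, but the true damping is 0.8.
    return v + DT * (u - 0.8 * v)


def f_real_nonparametric(v, u):
    # Adds quadratic drag, an effect the simulator has no term for.
    return v + DT * (u - 0.8 * v - 0.5 * v * abs(v))


def max_gap(f_real, e, samples):
    """Largest one-step prediction error over a set of (v, u) samples."""
    return max(abs(f_real(v, u) - f_sim(v, u, e)) for v, u in samples)


samples = [(v / 2.0, u / 2.0) for v in range(-4, 5) for u in range(-4, 5)]

gap_wrong_param = max_gap(f_real_parametric, 0.5, samples)
gap_right_param = max_gap(f_real_parametric, 0.8, samples)
gap_missing_effect = max_gap(f_real_nonparametric, 0.8, samples)

# Sweep e over a grid: no single damping value removes the missing-drag gap.
best_tuned_gap = min(
    max_gap(f_real_nonparametric, k / 100.0, samples) for k in range(0, 301)
)
```

With the right parameter the parametric gap vanishes, while the gap caused by the missing drag term stays positive for every value of `e` in the sweep.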

Reward design is a further difficulty, separate from dynamics mismatch. Simulation may make it cheap to evaluate a reward, but specifying the right reward for dexterous manipulation, whole-body interaction, or long-horizon tasks can still be hard.

Domain Randomization

Domain randomization trains a policy over a distribution of simulator parameters:

$$ e \sim p(e), \qquad x_{t+1} = f_{\mathrm{sim}}(x_t, u_t, e). $$

The objective is:

$$ \max_\theta \mathbb{E}_{e \sim p(e)} [J_{\mathrm{sim}}(\pi_\theta; e)]. $$

The policy is not told the exact real parameters. It must work robustly across the randomized family:

$$ u_t \sim \pi_\theta(\cdot \mid x_t). $$

This is close in spirit to robust control with a learned policy, although the objective above optimizes average performance over $p(e)$ rather than the worst case. Randomization can be applied to dynamics, observations, sensors, delays, terrain, textures, lighting, camera poses, action latency, contact properties, external pushes, and initial conditions.

The distribution $p(e)$ is an engineering choice. If it is too narrow, the real world may fall outside the training envelope. If it is too broad, training may become overly conservative or fail because the policy must solve incompatible worlds.
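The randomized objective can be estimated by Monte Carlo over sampled parameters. The sketch below is a toy under stated assumptions: the policy is a PD controller with two gains, the environment is a 1-D mass with random mass and friction, the ranges in `sample_e` are invented, and a brute-force grid search stands in for RL.

```python
import random


def rollout_return(gains, e, steps=100, dt=0.05):
    """Return of a PD policy on a 1-D mass with parameters e = (mass, friction)."""
    kp, kd = gains
    mass, friction = e
    pos, vel, total = 1.0, 0.0, 0.0
    for _ in range(steps):
        u = -kp * pos - kd * vel          # policy sees the state only, not e
        acc = (u - friction * vel) / mass
        vel += dt * acc
        pos += dt * vel
        total += -(pos * pos) - 0.001 * u * u
    return total


def sample_e(rng):
    # p(e) is an engineering choice; these ranges are illustrative assumptions.
    return (rng.uniform(0.5, 2.0), rng.uniform(0.0, 1.0))


def randomized_objective(gains, envs):
    # Monte Carlo estimate of E_{e ~ p(e)}[J_sim(pi_theta; e)].
    return sum(rollout_return(gains, e) for e in envs) / len(envs)


rng = random.Random(0)
envs = [sample_e(rng) for _ in range(64)]
candidates = [(kp, kd) for kp in (1.0, 4.0, 16.0) for kd in (0.5, 2.0, 8.0)]
best_gains = max(candidates, key=lambda g: randomized_objective(g, envs))
best_value = randomized_objective(best_gains, envs)
```

Widening or narrowing the ranges in `sample_e` is the code-level counterpart of choosing $p(e)$: the selected gains are the ones that do best on average across whatever family is sampled.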

Learning to Adapt

Domain randomization asks for one robust policy. Adaptation asks the policy to infer hidden environment parameters online.

If the true parameters were available, one could train:

$$ u_t \sim \pi_\theta(\cdot \mid x_t, e). $$

But $e$ is usually not directly observable on the real robot. A common solution is privileged teacher-student learning.

First, train a teacher in simulation with privileged information:

$$ u_t \sim \pi_{\mathrm{teach}}(\cdot \mid x_t, e). $$

Then train a student that only sees deployable observations, such as proprioceptive history:

$$ u_t \sim \pi_{\mathrm{stud}}(\cdot \mid o_{t-k:t}). $$

The second stage is imitation learning. The student learns to infer what matters from history: slipping, payload changes, terrain effects, actuator weakness, or external disturbance.
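A minimal sketch of the two stages, assuming a scalar system $x_{t+1} = x_t + e\,u_t$ with a hidden actuator gain $e$. The teacher, the hand-designed history feature, and the one-parameter student are all hypothetical; a real student is usually a neural network over a longer history.

```python
import random


def teacher(x, e):
    # Privileged policy: uses the true actuator gain e.
    return -0.5 * x / e


def history_feature(x_prev, u_prev, x_now):
    # Deployable feature: x_now / e_hat, with e_hat inferred from one past transition.
    e_hat = (x_now - x_prev) / u_prev
    return x_now / e_hat


def collect(rng, episodes=50, steps=5):
    """Roll out the teacher in simulation and record (feature, action) pairs."""
    feats, labels = [], []
    for _ in range(episodes):
        e = rng.uniform(0.5, 2.0)                    # hidden, simulation only
        x_prev, u_prev = rng.uniform(0.5, 1.5), 0.1  # small probing action
        x = x_prev + e * u_prev
        for _ in range(steps):
            u = teacher(x, e)
            feats.append(history_feature(x_prev, u_prev, x))
            labels.append(u)
            x_prev, u_prev, x = x, u, x + e * u
    return feats, labels


feats, labels = collect(random.Random(0))
# Behavior cloning with a one-parameter student u = w * feature (least squares).
w = sum(f * y for f, y in zip(feats, labels)) / sum(f * f for f in feats)


def student(x_prev, u_prev, x_now):
    return w * history_feature(x_prev, u_prev, x_now)


# Deployment on an unseen gain; the student never reads e_real directly.
e_real = 1.7
x_prev, u_prev = 1.0, 0.1
x = x_prev + e_real * u_prev
for _ in range(8):
    u = student(x_prev, u_prev, x)
    x_prev, u_prev, x = x, u, x + e_real * u
```

The student recovers the teacher's behavior because one past transition is enough to reveal the gain in this toy; on a real robot the relevant effects are only partially visible in the history, which is why longer windows and learned encoders are used.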

RMA (Rapid Motor Adaptation) is an example where student learning happens in a latent space rather than by directly imitating actions. The teacher encodes the privileged parameters $e$ into a latent context, and an adaptation module learns to infer that latent from deployable observation history.

Asymmetric Actor-Critic

Another common pattern is asymmetric actor-critic. The actor is constrained to use deployable observations:

$$ u_t \sim \pi_\theta(\cdot \mid o_t), $$

while the critic can use privileged simulation state:

$$ V_\phi(x_t, e). $$

This is useful because the critic is only needed during training. Privileged information can reduce value-estimation noise and improve policy gradients without making the deployed actor depend on unavailable sensors.
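The asymmetry can be sketched with a one-step actor-critic on a toy system. Everything here is an assumption made for illustration: the scalar dynamics, the linear Gaussian actor, the quadratic critic features, and the step sizes, which are not tuned and come with no convergence claim.

```python
import random

GAMMA, SIGMA, DT = 0.95, 0.3, 0.1


def actor_mean(theta, obs):
    # Deployable actor: depends on the observation only.
    return theta * obs


def critic_features(x, e):
    # Privileged critic input: true state x and hidden parameter e.
    return [1.0, x, e, x * x, x * e, e * e]


def critic_value(phi, x, e):
    return sum(w * f for w, f in zip(phi, critic_features(x, e)))


rng = random.Random(0)
theta, phi = 0.0, [0.0] * 6
for episode in range(200):
    e = rng.uniform(-0.5, 0.5)   # hidden constant push, known only in simulation
    x = rng.uniform(-1.0, 1.0)
    for t in range(10):
        obs = x                  # actor input (no access to e)
        u = rng.gauss(actor_mean(theta, obs), SIGMA)
        x_next = x + DT * (u + e)
        reward = -x_next * x_next
        # One-step TD error from the privileged critic, used as the advantage.
        td = reward + GAMMA * critic_value(phi, x_next, e) - critic_value(phi, x, e)
        feats = critic_features(x, e)
        phi = [w + 0.01 * td * f for w, f in zip(phi, feats)]
        # Policy-gradient step: gradient of log N(u; theta * obs, SIGMA^2) w.r.t. theta.
        grad_logp = (u - actor_mean(theta, obs)) / (SIGMA * SIGMA) * obs
        theta += 0.002 * td * grad_logp
        x = x_next
```

Only `actor_mean` and `theta` would be shipped to the robot; `critic_value`, `phi`, and every use of `e` stay in the training loop.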

The FALCON example from the slides follows this spirit: the actor uses proprioception and recent history, while the critic can access extra quantities such as root velocity or end-effector force.

Real2Sim2Real

Real2Sim2Real uses real data to improve the simulator before training or redeploying the policy. The simplest version is system identification:

$$ e^* = \arg\min_e \sum_t \left\| x_{t+1}^{\mathrm{real}} - f_{\mathrm{sim}}(x_t^{\mathrm{real}}, u_t, e) \right\|^2. $$

The fitted parameters $e^*$ are then used to train or fine-tune the policy in simulation. Active system identification chooses real-world actions that make the unknown parameters easier to estimate, for example by maximizing information about $e$.
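When the simulator is linear in the unknown parameter, the system-identification objective has a closed-form solution. The sketch below assumes a toy model with a single damping coefficient; the "real" transitions are generated synthetically with $e = 0.8$ plus noise, standing in for logged robot data.

```python
import random

DT = 0.1


def f_sim(x, u, e):
    """Toy simulator: velocity update with linear damping e."""
    return x + DT * (u - e * x)


def fit_damping(transitions):
    """Closed-form least squares for e in f_sim, which is linear in e."""
    num = den = 0.0
    for x, u, x_next in transitions:
        target = x_next - x - DT * u   # equals -DT * e * x if the model is exact
        regressor = -DT * x
        num += regressor * target
        den += regressor * regressor
    return num / den


# Stand-in for logged real data: generated here with e = 0.8 plus noise.
rng = random.Random(0)
transitions, x = [], 1.0
for _ in range(200):
    u = rng.uniform(-1.0, 1.0)
    x_next = f_sim(x, u, 0.8) + rng.gauss(0.0, 1e-3)
    transitions.append((x, u, x_next))
    x = x_next

e_star = fit_damping(transitions)
```

For simulators that are not linear in $e$, the same objective is typically minimized with a gradient-based or sampling-based optimizer instead of a closed form.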

A more flexible version learns residual models. Instead of assuming only a few parameters are wrong, learn a correction:

$$ x_{t+1}^{\mathrm{real}} \approx f_{\mathrm{sim}}(x_t, u_t, e) + \Delta_\psi(x_t, u_t). $$

Residuals can model actuator errors, perception errors, contact effects, or other missing dynamics. The risk is overfitting limited real data or learning corrections that behave poorly outside the calibration distribution.
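A residual fit can be sketched on the same kind of toy model. Here the stand-in "real" system has a quadratic drag term that the simulator lacks, and the residual $\Delta_\psi$ is a single hand-chosen basis function with one coefficient; both choices are assumptions made to keep the example short.

```python
DT = 0.1


def f_sim(x, u, e=0.8):
    return x + DT * (u - e * x)


def f_real(x, u):
    # Stand-in for the robot: includes quadratic drag missing from f_sim.
    return x + DT * (u - 0.8 * x - 0.5 * x * abs(x))


def basis(x, u):
    # Hypothetical one-term residual basis; in practice often a small network.
    return x * abs(x)


# Calibration data from a limited range of states (|x| <= 1).
data = [(x / 10.0, u / 2.0) for x in range(-10, 11) for u in range(-2, 3)]
errors = [f_real(x, u) - f_sim(x, u) for x, u in data]
feats = [basis(x, u) for x, u in data]
psi = sum(f * err for f, err in zip(feats, errors)) / sum(f * f for f in feats)


def f_corrected(x, u):
    """Simulator prediction plus the fitted residual."""
    return f_sim(x, u) + psi * basis(x, u)


gap_before = max(abs(f_real(x, u) - f_sim(x, u)) for x, u in data)
gap_after = max(abs(f_real(x, u) - f_corrected(x, u)) for x, u in data)
```

The fit is exact here only because the basis happens to match the missing effect. With a flexible residual and little data, agreement on the calibration set says nothing about states outside it.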

Actuator Models

Actuators are often a major source of mismatch. The commanded action may not equal the realized torque or motion because of motor dynamics, delays, saturation, temperature, backlash, control firmware, or low-level PD loops.

An actuator model predicts realized control effects from commands and history:

$$ \hat{u}_t = f_\psi(u_{t-k:t}, x_{t-k:t}). $$

If torque labels are available, this can be supervised. If not, the lecture notes an "unsupervised" approach: use RL or trajectory matching to learn a residual torque model that makes simulated trajectories match real trajectories.
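For the supervised case, a minimal sketch is a first-order lag fitted to logged command and torque pairs. The lag structure, the true coefficient 0.7, and the synthetic log are assumptions; this model uses only the command history and ignores the state inputs that a fuller $f_\psi$ would take.

```python
import random


def fit_lag(commands, torques):
    """Least-squares fit of a in torque[t] = a * torque[t-1] + (1 - a) * command[t]."""
    num = den = 0.0
    for t in range(1, len(commands)):
        reg = torques[t - 1] - commands[t]
        num += reg * (torques[t] - commands[t])
        den += reg * reg
    return num / den


def predict(commands, a, torque0=0.0):
    """Roll the lag model forward over a command sequence."""
    out, tau = [], torque0
    for u in commands:
        tau = a * tau + (1.0 - a) * u
        out.append(tau)
    return out


# Stand-in for logged (command, measured torque) pairs; the true lag is 0.7 here.
rng = random.Random(0)
commands = [rng.uniform(-1.0, 1.0) for _ in range(300)]
torques = [t + rng.gauss(0.0, 0.01) for t in predict(commands, 0.7)]

a_hat = fit_lag(commands, torques)
```

Once fitted, `predict` can sit between the policy output and the simulator's torque input so that training sees the same lag the hardware exhibits.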

Human Data and Physics Grounding

Simulation also helps use human data. Human videos or motions provide intent and task structure, but there is a physics gap between human motion and robot action. Robots have different bodies, joints, contacts, strength limits, and dynamics.

A common pipeline is:

  1. Retarget human motion to the robot kinematics.
  2. Train a policy in simulation to track the retargeted behavior.
  3. Use RL to make the behavior dynamically feasible and robust.
  4. Deploy the learned policy on the robot.
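Step 1 of the pipeline above can be sketched for a planar two-link arm: scale the human wrist trajectory by the ratio of arm lengths, then solve the robot's inverse kinematics. The link lengths and the motion are made up, and this is only one simple retargeting choice, not the method used by any of the systems named in the lecture.

```python
import math


def fk(q1, q2, l1, l2):
    """Wrist position of a planar two-link arm."""
    return (l1 * math.cos(q1) + l2 * math.cos(q1 + q2),
            l1 * math.sin(q1) + l2 * math.sin(q1 + q2))


def ik(px, py, l1, l2):
    """Analytic inverse kinematics, one elbow branch (q2 >= 0)."""
    c2 = (px * px + py * py - l1 * l1 - l2 * l2) / (2.0 * l1 * l2)
    q2 = math.acos(max(-1.0, min(1.0, c2)))  # clamp guards unreachable targets
    q1 = math.atan2(py, px) - math.atan2(l2 * math.sin(q2), l1 + l2 * math.cos(q2))
    return q1, q2


HUMAN = (0.30, 0.25)  # illustrative link lengths in metres
ROBOT = (0.20, 0.20)
scale = sum(ROBOT) / sum(HUMAN)

# Hypothetical human joint-angle trajectory (shoulder, elbow) in radians.
human_motion = [(0.2 + 0.05 * t, 0.6 + 0.05 * t) for t in range(10)]

robot_motion = []
for q1, q2 in human_motion:
    hx, hy = fk(q1, q2, *HUMAN)
    robot_motion.append(ik(scale * hx, scale * hy, *ROBOT))
```

The result is kinematically consistent but says nothing about torque limits or contact, which is what the simulation and RL steps are there to check.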

Retargeting is easiest when only the robot body matters. It is harder for loco-manipulation and scene interaction because the object and environment must be retargeted consistently with the robot. OmniRetarget, PHP, and SPIDER in the slides are examples of methods in this broader family.

The key idea is physics grounding. Human data supplies high-level behavior; simulation tests whether the robot can actually execute it under dynamics and contact.

RL Algorithms for Sim2Real

The lecture emphasizes that on-policy policy-gradient methods, especially PPO, remain highly effective for many sim2real robot policies. PPO works well with massively parallel simulation because it can collect large fresh batches and perform stable policy updates.
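The "stable policy updates" come from PPO's clipped surrogate objective. The sketch below writes out the standard form from the PPO paper for a batch of log-probabilities and advantages; it is not specific to the lecture, and the function name and default `clip_eps` are illustrative.

```python
import math


def ppo_clip_objective(logp_new, logp_old, advantages, clip_eps=0.2):
    """Mean clipped surrogate over a batch (a quantity to be maximized)."""
    total = 0.0
    for ln, lo, adv in zip(logp_new, logp_old, advantages):
        ratio = math.exp(ln - lo)  # pi_new(a|s) / pi_old(a|s)
        clipped = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
        # Taking the minimum removes the incentive to push the ratio
        # outside [1 - clip_eps, 1 + clip_eps] in the favorable direction.
        total += min(ratio * adv, clipped * adv)
    return total / len(advantages)


# Example batch: the first sample has a ratio well above 1 + clip_eps.
value = ppo_clip_objective(
    logp_new=[math.log(1.5), 0.0],
    logp_old=[0.0, 0.0],
    advantages=[1.0, 2.0],
)
```

Because the objective stops rewarding large ratio changes, the same fresh batch can be reused for a few gradient steps, which suits the large batches that parallel simulation produces.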

Other directions try to improve sample efficiency or better exploit parallelism:

  • SAPG: split-and-aggregate policy gradient for massively parallel environments.
  • FastTD3 and FastSAC: off-policy methods adapted for fast simulation loops.
  • FPO and FPO++: policy-gradient methods for flow-matching policies.
  • BFM-Zero: unsupervised RL with forward-backward representations for promptable humanoid control.

The algorithm choice is tied to the simulator. If simulation is cheap and parallel, on-policy methods can be competitive despite low sample reuse. If simulation is slower or real data is involved, off-policy reuse becomes more valuable.

Sim2Real 1.0 to 4.0

The lecture's zoomed-out view places modern sim2real in a longer control history.

Sim2Real 1.0 used reduced-order models and online reasoning, such as model predictive control. These systems relied less on offline pretraining and more on fast control-time optimization.

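Sim2Real 2.0 uses full simulators and offline training with RL, where "offline" means before deployment rather than offline RL from a fixed dataset. The policy is trained in simulation before it runs on the robot, usually with domain randomization and massively parallel simulation.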

Sim2Real 3.0 adds real2sim calibration, better policy learning, and stronger use of real data.

Sim2Real 4.0 points toward better models, generative simulation, world models, better RL algorithms, and more online reasoning. The future direction is not just "more simulation"; it is more faithful, diverse, calibrated, and compute-aware models of the robot's world.

Takeaways

Sim2Real works when the training distribution covers the real deployment conditions or when the policy can adapt to the differences. Domain randomization buys robustness, teacher-student learning buys online adaptation, Real2Sim buys calibration, and human-data retargeting buys task structure. The central engineering problem is deciding which mismatch matters most for the robot and task at hand.