Sim-to-real transfer (also written as sim2real) is the process of taking a policy, controller, or perception model trained inside a physics simulator and deploying it on a physical robot in the real world. Because collecting training data on real hardware is slow, expensive, and risky, researchers first train agents in simulation, where millions of trials can run in parallel and failures cost nothing. The core difficulty is the reality gap: differences between the simulated and physical environments that cause a policy that works perfectly in simulation to fail, sometimes catastrophically, on actual hardware.
Sim-to-real transfer sits at the intersection of reinforcement learning, robot learning, computer vision, and control theory. It has become one of the most active research areas in robotics and embodied AI, with applications ranging from quadruped locomotion and dexterous manipulation to autonomous driving and drone navigation.
Training robots directly in the real world presents several practical problems. Physical hardware wears out, collisions can damage expensive equipment, and each trial takes real time. A single reinforcement learning run might require millions of episodes, which could take years of continuous wall-clock time on a real robot. Simulation removes all of these constraints. Modern GPU-accelerated simulators can run thousands of parallel environments at speeds far exceeding real time.
The idea of training in simulation and transferring to reality dates back to early work in neural networks and control systems in the 1990s. However, the field accelerated rapidly after 2015 due to three converging trends: advances in deep reinforcement learning (deep RL), the availability of fast GPU-based physics engines, and the growing interest in deep learning-based perception for robotics.
The reality gap refers to the mismatch between what a simulator models and what actually happens in the physical world. This mismatch arises from several sources:
| Source of gap | Description | Example |
|---|---|---|
| Physics modeling errors | Simulators approximate physical phenomena with simplified equations. Contact dynamics, friction, and deformation are especially hard to model accurately. | A simulated gripper slides smoothly along a surface, but the real gripper sticks due to unmodeled surface roughness. |
| Actuator dynamics | Real motors have delays, backlash, nonlinear torque curves, and temperature-dependent behavior that simulators often ignore or approximate. | A simulated joint responds instantly to a torque command, but the real servo has a 10 ms delay and a deadband region. |
| Sensor noise and bias | Real cameras, IMUs, and force sensors produce noisy, sometimes biased readings. Simulated sensors are often idealized. | A depth camera produces accurate point clouds in simulation but returns noisy data with missing pixels on real hardware. |
| Visual appearance | Rendered images differ from real photographs in lighting, texture, reflections, and color. | A policy trained on procedurally generated textures in simulation fails when it encounters glossy or transparent real-world objects. |
| Unmodeled phenomena | Many real-world effects, such as cable tangling, air currents, or compliant contacts, are absent from simulation entirely. | A drone policy ignores ground effect because the simulator does not model it. |
Studies have documented performance drops of 24 to 30 percent when policies are transferred without any mitigation, with some tasks failing entirely. The research community has developed multiple strategies to close or work around this gap.
Domain randomization is the most widely used technique for sim-to-real transfer. Rather than trying to make the simulator perfectly match reality, domain randomization varies the simulation parameters during training so that the policy learns to handle a wide range of conditions. The hope is that the real world falls within the distribution of randomized environments.
Josh Tobin and colleagues at OpenAI introduced the term in 2017, applying it to object detection for robotic grasping. The original paper randomized visual properties: object positions, textures, lighting, and camera angles. When trained across enough visual variation, a neural network learned features robust enough to work on real camera images without fine-tuning.
Domain randomization falls into two main categories:
Visual randomization varies the appearance of the simulated scene. Parameters include object colors and textures, lighting direction and intensity, camera position and field of view, background imagery, and image noise patterns. This approach is most relevant for vision-based policies that take RGB or depth images as input.
Dynamics randomization varies the physical properties of the simulation. Parameters include masses and inertias of objects and robot links, friction coefficients and restitution (bounciness), actuator gains and delays, joint damping and stiffness, and observation noise levels. Xue Bin Peng and colleagues demonstrated dynamics randomization for locomotion transfer in 2018, showing that a policy trained with randomized physics could transfer zero-shot to a real robot arm for an object-pushing task.
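In practice, per-episode dynamics randomization can be as simple as resampling a dictionary of physics parameters at every environment reset. The following sketch illustrates the idea; the parameter names, ranges, and the `sim.set_params` call are illustrative assumptions, not any particular simulator's API:

```python
import random

# Illustrative parameter ranges for dynamics randomization; the names and
# bounds are made up for this sketch, not taken from a specific simulator.
DYNAMICS_RANGES = {
    "link_mass_scale": (0.8, 1.2),      # multiplier on nominal link masses
    "friction_coeff": (0.5, 1.1),       # ground-contact friction
    "restitution": (0.0, 0.2),          # bounciness of contacts
    "actuator_gain_scale": (0.9, 1.1),  # multiplier on motor gains
    "action_delay_steps": (0, 3),       # integer control latency in sim steps
}

def sample_dynamics(rng: random.Random) -> dict:
    """Draw one set of physics parameters for a new training episode."""
    params = {}
    for name, (lo, hi) in DYNAMICS_RANGES.items():
        if isinstance(lo, int) and isinstance(hi, int):
            params[name] = rng.randint(lo, hi)   # discrete parameters
        else:
            params[name] = rng.uniform(lo, hi)   # continuous parameters
    return params

# At the start of every episode the simulator would be reconfigured, e.g.:
#   sim.set_params(sample_dynamics(rng)); obs = sim.reset(); ...
```

Because the policy never sees the same dynamics twice, it cannot overfit to one (inevitably wrong) set of physical constants.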
Setting randomization ranges by hand is tedious and error-prone. If ranges are too narrow, the real world falls outside the training distribution. If they are too wide, the learning problem becomes needlessly difficult. Automatic Domain Randomization, developed by OpenAI in 2019, addresses this by progressively expanding randomization ranges during training. When the agent achieves a performance threshold at the current difficulty level, ADR widens the parameter distributions. This creates a curriculum of increasing difficulty and was a central innovation behind OpenAI's Rubik's Cube result.
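The core ADR loop can be sketched as follows. This is a deliberate simplification of OpenAI's method, which evaluates each range boundary with dedicated rollouts; the class name, step size, and threshold here are illustrative:

```python
class AutoDomainRandomizer:
    """Minimal sketch of Automatic Domain Randomization (ADR).

    Each parameter starts at a point range around its nominal value; whenever
    recent performance meets a threshold, both boundaries of that parameter's
    range are pushed outward by a fixed step, creating a widening curriculum.
    """

    def __init__(self, nominal, step=0.05, threshold=0.8, limits=(0.0, 10.0)):
        self.low = dict(nominal)    # lower bound per parameter
        self.high = dict(nominal)   # upper bound per parameter
        self.step = step            # how much to widen per successful update
        self.threshold = threshold  # success rate required to widen
        self.limits = limits        # hard global bounds on any parameter

    def update(self, name, success_rate):
        """Widen the range for `name` if the agent performs well enough."""
        if success_rate >= self.threshold:
            lo_min, hi_max = self.limits
            self.low[name] = max(lo_min, self.low[name] - self.step)
            self.high[name] = min(hi_max, self.high[name] + self.step)

    def range_of(self, name):
        return self.low[name], self.high[name]
```

Training samples each episode's parameters uniformly from the current ranges, so difficulty grows only as fast as the agent can keep up.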
Bayesian Domain Randomization (BayRn) uses Bayesian optimization to search the space of randomization distribution parameters. Instead of uniform sampling, BayRn adapts the source domain distribution by collecting data from the real target domain and finding parameter settings that maximize real-world performance.
Domain adaptation takes a different approach: instead of randomizing the source domain, it transforms simulated data to look more like real data (or vice versa). The goal is to learn representations that are invariant to the domain shift.
GAN-based adaptation uses generative adversarial networks to translate simulated images into realistic-looking images. CycleGAN, which performs unpaired image-to-image translation, has been widely applied. The cycle-consistency loss ensures that an image translated from simulation to reality and back again matches the original, preserving the underlying structure. More recent variants include RL-CycleGAN, which jointly trains the image translator with a reinforcement learning agent, and RetinaGAN, which adds an object detection consistency loss to preserve semantic content during translation.
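The cycle-consistency loss that anchors this translation can be written as follows, with $G$ mapping simulated images toward real appearance and $F$ mapping back:

```latex
% Cycle-consistency loss (CycleGAN), G: sim -> real, F: real -> sim
\mathcal{L}_{\text{cyc}}(G, F) =
  \mathbb{E}_{x \sim p_{\text{sim}}}\!\left[\lVert F(G(x)) - x \rVert_1\right]
+ \mathbb{E}_{y \sim p_{\text{real}}}\!\left[\lVert G(F(y)) - y \rVert_1\right]
```

Minimizing this loss alongside the usual adversarial losses forces the translators to change appearance without destroying scene content, which is what makes the translated images usable as policy training data.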
Feature-level adaptation learns to map simulated and real observations into a shared latent space where they are indistinguishable. This can be done with adversarial training (a discriminator tries to tell simulated from real features) or with explicit feature matching losses. Language-based pretraining has proven effective here: using natural language to guide image encoders toward learning domain-invariant visual features while ignoring domain-specific details such as texture or lighting. Zero-shot performance improvements of 25 to 40 percent have been reported for object manipulation tasks using this approach.
System identification (SysID) takes the opposite approach from domain randomization. Instead of training the policy to be robust to uncertainty, SysID tries to make the simulator as accurate as possible by measuring the real system's physical parameters.
Traditional SysID involves carefully measuring masses, moments of inertia, friction coefficients, actuator transfer functions, and sensor characteristics, then configuring the simulator to match. When done well, this can produce highly accurate simulations. The downside is that it requires significant manual effort and specialized equipment, and even careful measurements cannot capture all real-world effects.
Modern approaches automate this process. Iterative Residual Tuning (IRT) is a deep learning-based SysID method that adjusts simulator parameters to better match real-world observations using minimal data. A 2024 approach uses in-context learning to dynamically adjust simulation environment parameters online, leveraging past interaction histories as context to adapt simulation dynamics to real-world dynamics without requiring gradient updates.
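A toy illustration of trajectory-matching SysID: fit a single friction coefficient by searching for the value whose simulated rollout best reproduces a recorded velocity trace. The sliding-block model and grid search below are deliberate simplifications; real SysID pipelines fit many coupled parameters with more sophisticated optimizers:

```python
import numpy as np

def simulate_velocity(v0, mu, g=9.81, dt=0.01, steps=50):
    """Forward-simulate a block sliding to rest under Coulomb friction."""
    v = np.empty(steps)
    cur = v0
    for i in range(steps):
        cur = max(0.0, cur - mu * g * dt)  # friction decelerates the block
        v[i] = cur
    return v

def identify_friction(real_velocities, v0, candidates):
    """Pick the friction coefficient whose simulated trajectory best
    matches the recorded real one (grid search over candidates)."""
    errors = [np.mean((simulate_velocity(v0, mu) - real_velocities) ** 2)
              for mu in candidates]
    return candidates[int(np.argmin(errors))]

# Here, synthetic data with mu = 0.42 stands in for measurements that
# would come from the physical robot.
real = simulate_velocity(v0=1.0, mu=0.42)
mu_hat = identify_friction(real, v0=1.0, candidates=np.linspace(0.1, 0.9, 81))
```

Once `mu_hat` is recovered, the simulator is reconfigured with it, shrinking the gap for every policy trained afterwards.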
A common sim-to-real pipeline involves training two policies. The teacher policy trains in simulation with access to privileged information: perfect state knowledge, ground-truth object poses, and exact physical parameters that would be unavailable on a real robot. With this privileged information, the teacher can learn a high-quality policy relatively quickly.
The student policy then learns to imitate the teacher's behavior using only observations available from real-world sensors (cameras, joint encoders, IMUs). The distillation process uses behavior cloning, minimizing the difference between teacher and student actions. This approach was used in ETH Zurich's work on ANYmal quadruped locomotion and has become standard practice in legged robotics.
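A minimal sketch of the distillation step, using linear policies so the mechanics are visible: the teacher acts on the full (privileged) state, while the student regresses onto the teacher's actions from the observable dimensions only. All names, dimensions, and learning rates here are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for a teacher trained with privileged state: a fixed linear
# map from the full 6-D state (including quantities real sensors cannot see).
W_teacher = rng.normal(size=(2, 6))

# The student only observes the first 4 state dimensions ("sensor" data).
W_student = np.zeros((2, 4))

def distill_step(full_states, lr=0.05):
    """One behavior-cloning step: fit student actions to teacher actions,
    using only the observable part of the state. Returns the BC loss
    measured before the update."""
    global W_student
    obs = full_states[:, :4]             # what real sensors would provide
    targets = full_states @ W_teacher.T  # teacher action labels
    preds = obs @ W_student.T
    grad = 2 * (preds - targets).T @ obs / len(obs)
    W_student -= lr * grad
    return np.mean((preds - targets) ** 2)

states = rng.normal(size=(256, 6))
losses = [distill_step(states) for _ in range(200)]
```

The student's loss cannot reach zero because part of the teacher's input is unobservable; the residual is exactly the price of giving up privileged information.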
The TWIST framework (Teacher-Student World Model Distillation) extends this to model-based RL by distilling not just the policy but an entire world model from privileged state observations to image observations.
Curriculum learning structures the training process from simple to complex scenarios. For sim-to-real transfer, this means the agent first learns basic skills in easy environments and gradually encounters harder, more realistic conditions. This approach helps avoid the problem of learning degenerate strategies in overly randomized environments.
Reward shaping provides additional reward signals that guide the agent toward behaviors that transfer well. For locomotion, this might include penalties for jerky motions or high joint velocities, which tend to exploit simulation artifacts. Combining domain randomization with curriculum learning and careful reward shaping has produced some of the most reliable sim-to-real results.
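A reward-shaping term of this kind might look like the following sketch, where the weights are illustrative and would be tuned per robot:

```python
import numpy as np

def shaped_reward(task_reward, joint_vel, prev_action, action,
                  w_vel=0.01, w_smooth=0.1):
    """Task reward plus transfer-oriented penalties (illustrative weights):
    high joint velocities and abrupt action changes tend to exploit
    simulator artifacts and to stress real actuators, so both are
    quadratically penalized."""
    vel_penalty = w_vel * float(np.sum(np.square(joint_vel)))
    smooth_penalty = w_smooth * float(np.sum(np.square(action - prev_action)))
    return task_reward - vel_penalty - smooth_penalty
```

Because the penalties are differentiable and dense, they steer the policy toward smooth, hardware-friendly motions at every step rather than only at task completion.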
A growing trend is the real-to-sim-to-real pipeline, where the simulator is constructed from real-world data rather than hand-authored. This can involve 3D scanning of the environment, calibrating physics parameters from recorded robot trajectories, and continuously updating the simulation to match reality.
MIT's RialTo system creates digital twins on the fly using computer vision, allowing robots to train in environments that closely match their actual deployment setting. The Real-is-Sim framework from 2025 uses an Embodied Gaussian simulator that synchronizes with the real world at 60 Hz, allowing policies to seamlessly switch between running on real hardware and running in simulation. This dynamic digital twin approach has shown promise for evaluation and rapid iteration.
The choice of simulator affects both the quality and efficiency of sim-to-real transfer. The following table compares the major platforms used in the field as of 2025.
| Simulator | Developer | Physics engine | GPU acceleration | Open source | Key strengths |
|---|---|---|---|---|---|
| Isaac Sim / Isaac Lab | NVIDIA | PhysX 5 | Yes | Yes (since 2025) | Photorealistic rendering, thousands of parallel environments, tight integration with NVIDIA hardware |
| MuJoCo | Google DeepMind (originally Todorov et al.) | Custom | Yes (via MJX/JAX) | Yes (Apache 2.0, since 2022) | Fast, accurate contact dynamics, lightweight, widely used in RL research |
| PyBullet | Erwin Coumans | Bullet | No | Yes | Easy to use, large community, good documentation, widely used for benchmarks |
| SAPIEN / ManiSkill | UC San Diego / Hillbot | PhysX 5 + Warp | Yes | Yes | Articulated object manipulation, heterogeneous GPU simulation, tactile sensing |
| Genesis | Genesis-Embodied-AI | Custom (differentiable) | Yes | Yes | Extremely fast (claims 10-80x over Isaac Gym), differentiable physics, generative capabilities |
| Newton | NVIDIA, Google DeepMind, Disney Research | Built on NVIDIA Warp | Yes | Yes (Linux Foundation, 2025) | Contact-rich simulation, open governance, built specifically for robot learning |
MuJoCo (Multi-Joint dynamics with Contact) was originally developed by Emanuel Todorov, Tom Erez, and Yuval Tassa at the University of Washington and described in a 2012 paper. It was commercialized under Roboti LLC in 2015 and became the de facto standard for RL research. Google DeepMind acquired MuJoCo in October 2021 and released it as open source under the Apache 2.0 license in May 2022. MuJoCo's strengths are fast, stable contact simulation and a lightweight C codebase. MJX, a JAX-based reimplementation, enables GPU-accelerated parallel simulation.
NVIDIA's Isaac platform provides a full simulation and training pipeline for robotics. Isaac Sim offers photorealistic rendering through ray tracing and can simulate complex scenes with deformable objects, fluids, and cloth. Isaac Lab is a lightweight, GPU-accelerated application built on Isaac Sim that is optimized for running thousands of parallel robot learning environments. Isaac Sim 5.0 was released as open source in 2025.
Newton is an open-source, GPU-accelerated physics engine co-developed by NVIDIA, Google DeepMind, and Disney Research. Announced at GTC 2025 by Jensen Huang, Newton was contributed to the Linux Foundation in September 2025. Built on NVIDIA Warp and OpenUSD, it is designed for contact-rich robot behaviors such as walking on varied terrain and manipulating delicate objects. Disney uses Newton to power its next-generation entertainment robots, including the Star Wars-inspired BDX droids.
Genesis is an open-source physics simulation platform designed for general-purpose robotics and embodied AI. It integrates multiple physics solvers (rigid body, soft body, cloth, fluid) into a unified framework. Genesis claims extremely high simulation speeds, citing 43 million FPS for a manipulation scene with a Franka arm, which would be 430,000 times faster than real time. Its differentiable physics engine supports gradient-based optimization. Genesis also includes a generative data engine that can produce training data from natural language descriptions.
SAPIEN focuses on articulated object manipulation, providing GPU-parallelized simulation of robots interacting with drawers, faucets, and other jointed objects. The ManiSkill framework, built on SAPIEN, is one of the fastest GPU-parallelized robotics simulators for contact-rich manipulation tasks, supporting RGBD data collection at 30,000+ FPS on a single RTX 4090. ManiSkill3 is unique in supporting heterogeneous GPU simulation, meaning different parallel environments can contain different object geometries and articulation structures.
One of the most widely publicized sim-to-real demonstrations was OpenAI's use of a Shadow Dexterous Hand to solve a Rubik's Cube using a single robotic hand. The entire control policy was trained in simulation using approximately 13,000 simulated years of experience. The breakthrough relied on Automatic Domain Randomization, which progressively expanded the range of physical parameters during training. The system used a PhaseSpace motion capture setup and RGB cameras for state estimation.
The robot solved the Rubik's Cube about 60 percent of the time overall and 20 percent of the time for maximally difficult scrambles. The trained policy was robust to significant perturbations: the robot could still solve the cube while wearing a rubber glove, with several fingers taped together, or while being poked with a stuffed toy. This demonstrated that aggressive domain randomization could produce policies with genuine robustness rather than narrow simulation-specific skills.
Joonho Lee and colleagues at ETH Zurich's Robotic Systems Lab demonstrated sim-to-real transfer for the ANYmal quadruped robot over challenging natural terrain. Their approach used a two-stage teacher-student framework: a teacher policy trained in simulation with privileged terrain information, and a student policy that used only proprioceptive feedback (joint positions, velocities, and IMU data). The student policy was deployed zero-shot on the real robot.
The ANYmal robot demonstrated locomotion over terrain never encountered during training, including mud, snow, rubble, thick vegetation, and flowing water. The work was published in Science Robotics and has become a benchmark for sim-to-real locomotion research. Subsequent work has extended this approach to parkour-style agile locomotion for quadrupeds.
Google Brain's QT-Opt demonstrated a different philosophy: using both simulated and real data for large-scale robotic grasping. The system trained a deep neural network Q-function on over 580,000 real-world grasp attempts collected from seven robots, supplemented with simulated data. The resulting policy achieved a 96 percent grasp success rate on previously unseen objects.
While QT-Opt was not purely sim-to-real (it used substantial real data), it demonstrated how simulation could augment real-world data collection and showed the potential of scalable robot learning pipelines.
The Unitree A1, Go2, and other quadrupeds have become popular platforms for sim-to-real locomotion research. The Unified Locomotion Transformer (ULT), published in 2025, uses a transformer architecture for simultaneous optimization of teacher and student policies, significantly reducing the data needed for sim-to-real transfer. The policy was validated on a Unitree A1 with a Jetson AGX Orin. Other work has demonstrated loco-manipulation (locomotion combined with manipulation) on the Unitree B1 quadruped with a Z1 arm, using Isaac Gym for training and deploying through a hardware abstraction layer.
The emergence of humanoid robots has created strong demand for sim-to-real methods. NVIDIA announced Isaac GR00T N1 in March 2025 as the first open, fully customizable foundation model for humanoid robot reasoning and skills. The GR00T N1.6 update integrates multimodal vision-language-action policies with world models such as NVIDIA Cosmos Reason, enabling end-to-end loco-manipulation and reasoning tasks.
The sim-to-real workflow for GR00T leverages whole-body reinforcement learning in Isaac Lab and synthetic data-driven navigation. NVIDIA reported generating 780,000 synthetic trajectories (equivalent to 6,500 hours of human demonstration data) in just 11 hours. Combining this synthetic data with real data improved the GR00T N1 performance by 40 percent compared to using only real data, demonstrating the value of simulation-generated training data even when real demonstrations are available.
The convergence of large language models, vision-language models, and robotics has introduced new approaches to sim-to-real transfer. Vision-Language-Action (VLA) models like Google's RT-2 and Gemini Robotics (2025) process multimodal data (text, images, video, and audio) and output robot actions directly. These models can leverage foundation model representations that provide consistent semantic features across simulation and reality, potentially reducing the visual domain gap.
Simulation plays a growing role in training these models. Skild AI, for example, reported training on 100,000 different robot embodiments generated in simulation, aiming to build policies that generalize across robot body types. The use of large-scale simulation data to pretrain or augment VLA models represents a new frontier in sim-to-real research.
The autonomous vehicle industry relies heavily on simulation. Companies such as Waymo conduct billions of virtual kilometers of driving before deploying on real roads. The simulation provides a way to encounter rare but dangerous scenarios (pedestrian jaywalking, sensor failures, unusual weather) that would be impractical to collect in real driving data. Domain randomization is used to vary weather conditions, traffic patterns, and sensor characteristics.
Sim-to-real transfer for unmanned aerial vehicles uses platforms such as AirSim (built on Unreal Engine by Microsoft) and custom simulators. Challenges specific to aerial robots include aerodynamic effects (ground effect, turbulence), wind disturbance, and the need for very low-latency control. Domain randomization over wind direction, magnitude, and flight conditions has been shown to help agents learn general policies that transfer to physical drones. RL-based flight controllers trained in simulation have been successfully deployed on fixed-wing aircraft and demonstrated superior performance compared to commercial flight controllers in some tests.
Sim-to-real transfer has been applied to robot-assisted surgery, where physical experiments on real tissue are limited by ethical and practical constraints. Researchers have trained visual reinforcement learning policies for tasks such as suture knot-tying in simulation and transferred them to real surgical robots. The challenge of simulating deformable tissue contact dynamics makes this domain particularly difficult.
Factory environments benefit from sim-to-real because they involve repetitive tasks where a small improvement in automation yields large economic gains. NVIDIA's AutoMate system demonstrated a mean success rate of 84.5 percent for real-world assembly tasks, with policies trained primarily in simulation. The R2D2 project combines simulation with language models to improve robotic manipulation capabilities.
Despite significant progress, several fundamental challenges remain.
Contact dynamics fidelity. Contact simulation remains one of the weakest points of current physics engines. Real-world contact involves complex phenomena (microslip, elastic deformation, surface roughness) that are computationally expensive to simulate accurately. This is especially problematic for contact-rich tasks like assembly, tool use, and manipulation of deformable or fragile objects.
Deformable and soft objects. Manipulating cloth, rope, food, and other deformable materials remains extremely difficult to transfer from simulation to reality. The computational demands of soft-body simulation limit the scale of parallel training, and the parameter space of deformable object properties is vast.
Visual fidelity. While rendering quality has improved substantially, traditional simulators still struggle to reproduce real-world lighting, reflections on glossy surfaces, transparency, and fine textures. Photorealistic rendering (ray tracing) helps but is computationally expensive and slows training.
Physics exploitation. Agents sometimes learn to exploit artifacts of the physics engine, discovering "cheats" that work in simulation but have no real-world equivalent. For example, a locomotion policy might learn to gain energy from numerical integration errors, or a manipulation policy might exploit unrealistic friction models. These failure modes are difficult to detect until deployment.
Scalability of system identification. While automated SysID methods are improving, they still require real-world data collection, which limits scalability. The trade-off between making the simulator more accurate (SysID) and making the policy more robust (domain randomization) remains an active area of research.
Long-horizon tasks. Most successful sim-to-real demonstrations involve relatively short tasks (grasping, stepping, pushing). Longer task horizons compound small errors at each step, making transfer progressively harder. Hierarchical approaches and task decomposition show promise but introduce their own transfer challenges.
Benchmark standardization. The field lacks standardized benchmarks for measuring sim-to-real transfer quality. Performance is typically reported on specific robot-task combinations, making it difficult to compare methods across labs. Recent efforts such as the RoboVerse platform, which provides a unified API across multiple simulators, aim to address this.
Researchers assess sim-to-real transfer along several dimensions:
| Metric | What it measures |
|---|---|
| Zero-shot success rate | Task completion on real hardware without any real-world fine-tuning |
| Generalization error | Performance difference between simulation and reality (often measured by MSE or success rate drop) |
| Robustness | Ability to maintain performance under varied real-world conditions (different objects, lighting, disturbances) |
| Sample efficiency | Amount of simulation data and real data needed to achieve a given performance level |
| Computational cost | Training time, GPU hours, and simulation throughput required |
| Training stability | Consistency of results across random seeds and randomization distributions |
The following timeline summarizes key milestones in sim-to-real research:

| Year | Development |
|---|---|
| 2012 | Todorov, Erez, and Tassa publish the MuJoCo physics engine paper |
| 2016 | Sadeghi and Levine demonstrate collision-free flight by training only in simulation (CAD2RL) |
| 2017 | Tobin et al. introduce domain randomization for object detection transfer (IROS) |
| 2018 | Peng et al. demonstrate dynamics randomization for robotic control transfer (ICRA); Google's QT-Opt scales RL grasping to 580K real attempts |
| 2018 | OpenAI trains a Shadow Hand to rotate a block using sim-to-real transfer |
| 2019 | OpenAI solves the Rubik's Cube with a robot hand using Automatic Domain Randomization |
| 2020 | Lee et al. (ETH Zurich) demonstrate ANYmal locomotion over challenging natural terrain via teacher-student distillation |
| 2021 | Google DeepMind acquires MuJoCo; Hofer et al. publish "Sim2Real in Robotics and Automation" survey |
| 2022 | MuJoCo released as open source; parkour-style locomotion demonstrated on quadrupeds |
| 2024 | TRANSIC introduces human-in-the-loop corrections for sim-to-real furniture assembly |
| 2025 | NVIDIA releases Isaac GR00T N1 and Isaac Sim 5.0 as open source; Newton physics engine announced at GTC; Genesis claims 10-80x speedup over Isaac Gym |
| 2025 | Real-is-Sim introduces dynamic digital twins with 60 Hz real-world synchronization; Unified Locomotion Transformer reduces sim-to-real data requirements for quadrupeds |