MimicGen is a data generation system developed by researchers at NVIDIA's Seattle Robotics Lab and Learning and Perception Research group that automatically produces large-scale robot manipulation datasets from a small number of human demonstrations. The paper was presented at the Conference on Robot Learning (CoRL) 2023 (Proceedings of Machine Learning Research, volume 229, pages 1820-1864), and the codebase was publicly released in July 2024. The system addresses one of the persistent bottlenecks in robot learning: getting enough diverse training data without requiring a human to demonstrate the same task hundreds of times across hundreds of different scene configurations.
The core idea is straightforward. Many manipulation tasks decompose into a sequence of motions that are each defined relative to a single object's coordinate frame. Picking up a mug is picking up a mug regardless of where the mug sits on the table; the relative motion between hand and mug is approximately the same. MimicGen exploits this structure to transform recorded human trajectories into new trajectories that work for different object poses. From roughly 10 to 200 source demonstrations, the system generated over 50,000 demonstrations across 18 tasks, spanning diverse object poses, different object instances, and multiple robot arms. Robots trained on this synthetic data via imitation learning achieved strong performance on tasks including multi-part assembly and coffee preparation across broad initial state distributions.
The paper was authored by Ajay Mandlekar, Soroush Nasiriany, Bowen Wen, Iretiayo Akinola, Yashraj Narang, Linxi Fan, Yuke Zhu, and Dieter Fox, affiliated with NVIDIA Research and the University of Texas at Austin at the time of publication. Mandlekar is a Staff Research Scientist at NVIDIA's GEAR Lab. The data generation code is released under the NVIDIA Source Code License; the datasets are released under CC-BY 4.0 and are hosted on Hugging Face. The system has since been integrated into NVIDIA's Isaac GR00T training pipeline, where it helped generate 780,000 synthetic trajectories in 11 hours to improve GR00T N1 policy performance by 40%.
Imitation learning trains robot policies by showing them examples of correct behavior. Behavioral cloning, the simplest form, treats the problem as supervised learning: the policy learns to predict the correct action given the current observation, using demonstration data as the training set. The approach works reasonably well when training data covers the conditions the robot will encounter during deployment, but it fails when deployed scenes differ substantially from training scenes.
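To make the supervised-learning framing concrete, here is a minimal behavioral-cloning sketch in PyTorch. The observation and action dimensions and the random placeholder tensors are illustrative assumptions, not the MimicGen or robomimic API; the point is simply that the policy regresses the demonstrated action from the current observation.

```python
# Minimal behavioral-cloning sketch (illustrative; dimensions and the demo
# tensors are placeholders, not MimicGen/robomimic code).
import torch
import torch.nn as nn

OBS_DIM, ACT_DIM = 32, 7  # hypothetical low-dimensional observation and action sizes

policy = nn.Sequential(
    nn.Linear(OBS_DIM, 256), nn.ReLU(),
    nn.Linear(256, 256), nn.ReLU(),
    nn.Linear(256, ACT_DIM),
)
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-4)

# Demonstration data: observations paired with the actions the human took.
demo_obs = torch.randn(10_000, OBS_DIM)
demo_act = torch.randn(10_000, ACT_DIM)
loader = torch.utils.data.DataLoader(
    torch.utils.data.TensorDataset(demo_obs, demo_act), batch_size=256, shuffle=True
)

for epoch in range(10):
    for obs, act in loader:
        loss = nn.functional.mse_loss(policy(obs), act)  # predict the demonstrated action
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```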
The problem is that robot manipulation scenes vary along many dimensions at once. Object positions change. Objects of the same category come in different shapes, colors, and sizes. Different robot arms have different kinematics and workspace configurations. Collecting demonstrations that cover all this variation by hand is expensive. A human demonstrating a task through a teleoperation interface might produce one demonstration every two to five minutes for simple tasks, longer for contact-rich sequences where failed attempts have to be discarded. Scaling to 1,000 demonstrations across 10 object variants and 3 robot arms would require thousands of person-hours per task.
This has led to a range of approaches to scaling robot data. Domain randomization in simulation generates visual diversity by perturbing simulator parameters like textures and lighting, but it does not address kinematic variation in demonstrations. Model-based planning can script trajectories for some tasks, but it requires precise scene models and does not generalize well to contact-rich manipulation where contacts are hard to model analytically. Reinforcement learning sidesteps demonstrations but needs dense reward functions that are difficult to design for multi-step manipulation, and it requires many millions of simulation steps to converge on complex tasks.
MimicGen sits in a different position from all of these. It starts from human demonstrations that already encode task structure, object-relative motion, and contact strategies. The human shows the system what to do once; MimicGen figures out how to replicate that behavior across a large distribution of new situations. Unlike domain randomization, it produces varied robot trajectories rather than just varied scene appearances. Unlike planning, it requires no task model. Unlike reinforcement learning, it converges on useful behavior quickly because it builds on the structure already present in human data.
The algorithm has three main components: parsing source demonstrations into subtask segments, transforming each segment to fit a new scene, and stitching segments together into a complete trajectory.
MimicGen assumes that a task can be described as an ordered sequence of object-centric subtasks:
(S_1(o_1), S_2(o_2), ..., S_m(o_m))
where each subtask S_i involves manipulating the robot relative to a single reference object o_i. A user specifies this sequence once per task type. For a three-piece assembly task, the sequence might be: grasp peg (reference: peg), insert peg into bracket (reference: bracket), place assembly into frame (reference: frame). The specification is intentionally coarse; it captures which object matters for each stage of the task without encoding anything about object positions.
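A specification of this kind can be captured in a few lines. The sketch below is illustrative only; the field names are our own, not MimicGen's configuration schema.

```python
from dataclasses import dataclass

@dataclass
class SubtaskSpec:
    name: str               # human-readable label for the subtask
    reference_object: str   # the single object o_i this subtask is defined relative to
    completion_signal: str  # binary metric that flips 0 -> 1 when the subtask succeeds

# Ordered object-centric subtask sequence for a three-piece assembly task.
THREE_PIECE_ASSEMBLY = [
    SubtaskSpec("grasp_peg", reference_object="peg", completion_signal="peg_grasped"),
    SubtaskSpec("insert_peg", reference_object="bracket", completion_signal="peg_inserted"),
    SubtaskSpec("place_assembly", reference_object="frame", completion_signal="assembly_placed"),
]
```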
Given this specification and a source human demonstration, MimicGen automatically detects where each subtask ends and the next begins. It uses task-specific completion metrics: binary signals that go from 0 to 1 when a subtask succeeds. The first 0-to-1 transition during execution marks the subtask boundary. Each demonstration is thereby split into a sequence of trajectory segments, one per subtask, stored in what the codebase calls a DataGenInfoPool.
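Boundary detection then amounts to finding the first 0-to-1 transition of each completion signal. A minimal sketch, assuming per-timestep signal arrays and one firing per subtask (this is not the actual DataGenInfoPool interface):

```python
import numpy as np

def split_into_subtask_segments(eef_poses, completion_signals):
    """Split one demonstration into per-subtask trajectory segments.

    eef_poses:          (T, 4, 4) end-effector poses over the demonstration
    completion_signals: list of (T,) binary arrays, one per subtask, in task order,
                        each assumed to flip from 0 to 1 exactly once
    """
    segments, start = [], 0
    for signal in completion_signals:
        end = int(np.argmax(signal > 0)) + 1  # first timestep where the signal fires
        segments.append(eef_poses[start:end])
        start = end
    return segments
```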
The core transformation takes a recorded trajectory segment for subtask S_i and produces a new trajectory valid for a different pose of the reference object o_i. Let T_WO be the object's pose in the world frame during the source demonstration, T_WO' its pose in the new target scene, and T_WCt the end-effector pose at time t in the source segment. The transformation re-expresses each end-effector pose relative to the new object pose:
T_WC't = T_WO' * (T_WO)^-1 * T_WCt
This SE(3) frame change says: take the end-effector pose relative to the old object position, and place it relative to the new object position. If the human reached down and to the left to pick up the mug, and the mug in the new scene is to the right, the transformed trajectory reaches down and to the right by the appropriate amount. The relative geometry between hand and mug is preserved.
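In code, the frame change is a few matrix multiplies over 4x4 homogeneous transforms. The function below is a sketch with our own naming, assuming poses are supplied as numpy arrays:

```python
import numpy as np

def transform_segment(T_WC_src, T_WO_src, T_WO_new):
    """Re-express a recorded end-effector segment relative to a new object pose.

    T_WC_src: (T, 4, 4) end-effector poses from the source demonstration
    T_WO_src: (4, 4) reference object pose in the source demonstration
    T_WO_new: (4, 4) reference object pose in the new scene

    Returns (T, 4, 4) transformed poses: T_WC_new[t] = T_WO_new @ inv(T_WO_src) @ T_WC_src[t]
    """
    rel = T_WO_new @ np.linalg.inv(T_WO_src)
    return np.einsum("ij,tjk->tik", rel, T_WC_src)
```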
The transformation requires knowing the object's pose at the start of each subtask in both the source demonstration and the new target scene. During data generation in simulation, the physics engine provides this directly from the simulator state. No external pose estimation hardware is needed during the generation process itself.
After transforming a segment, the robot needs to move from wherever it ended up after the previous subtask to the starting configuration of the transformed new segment. MimicGen prepends a bridge: a linearly interpolated trajectory from the current end-effector pose to the start of the transformed segment. This bridge handles the kinematic discontinuity between segments that were originally recorded in different scenes.
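A simple bridge can be built by interpolating positions linearly and orientations by spherical interpolation. The sketch below uses scipy's rotation utilities; the number of interpolation steps is an arbitrary choice for illustration.

```python
import numpy as np
from scipy.spatial.transform import Rotation, Slerp

def interpolation_bridge(T_start, T_goal, num_steps=25):
    """Interpolated end-effector poses from T_start to T_goal (both 4x4 homogeneous)."""
    alphas = np.linspace(0.0, 1.0, num_steps)
    # Linear interpolation of positions, slerp for orientations.
    positions = (1 - alphas)[:, None] * T_start[:3, 3] + alphas[:, None] * T_goal[:3, 3]
    slerp = Slerp([0.0, 1.0], Rotation.from_matrix([T_start[:3, :3], T_goal[:3, :3]]))
    rotations = slerp(alphas).as_matrix()

    bridge = np.tile(np.eye(4), (num_steps, 1, 1))
    bridge[:, :3, :3] = rotations
    bridge[:, :3, 3] = positions
    return bridge
```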
Once the bridge and transformed segment are assembled, an end-effector controller executes the full trajectory. If the subtask succeeds (the completion metric fires), the system moves to the next subtask and repeats. If a subtask fails, the attempt is discarded. Only demonstrations where all subtasks succeed are retained.
This rejection sampling is central to the approach. Not every transformed trajectory works: sometimes the transformed approach angle causes a collision, sometimes the bridge trajectory puts the robot in a configuration where the transformed segment cannot be executed cleanly. The system treats failed attempts as noise and keeps generating until the target dataset size is reached. Because simulation is fast, even a low per-attempt success rate is acceptable.
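Putting the pieces together, the generation loop is essentially rejection sampling over whole episodes. This is a high-level sketch reusing the transform and bridge functions above; the environment methods, segment attributes, and select_source_segment helper are placeholders for simulator- and task-specific code, not MimicGen's API.

```python
import numpy as np

def generate_dataset(env, source_segments_per_demo, subtask_specs, target_size):
    """Keep attempting generated episodes until target_size successes accumulate."""
    dataset, attempts = [], 0
    while len(dataset) < target_size:
        attempts += 1
        env.reset()  # new object poses sampled by the simulator
        episode, failed = [], False
        for i, spec in enumerate(subtask_specs):
            src = select_source_segment(source_segments_per_demo, i)  # e.g. random choice
            T_WO_new = env.get_object_pose(spec.reference_object)
            segment = transform_segment(src.eef_poses, src.object_pose, T_WO_new)
            bridge = interpolation_bridge(env.current_eef_pose(), segment[0])
            episode += env.execute(np.concatenate([bridge, segment]), src.gripper_actions)
            if not env.subtask_succeeded(spec.completion_signal):
                failed = True  # rejection: discard the whole attempt
                break
        if not failed:
            dataset.append(episode)
    return dataset, attempts
```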
When multiple source demonstrations are available, MimicGen can select which one to adapt for each subtask of a given target scene. The default is random selection. An alternative is selecting the source demonstration whose object pose is most similar to the target scene's object pose, which tends to produce smoother transformations when source demonstrations span a narrow range of configurations. The paper found that random selection works well in practice, especially with 10 or more source demonstrations per task.
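The nearest-pose alternative to random selection reduces to an argmin over object-pose distance. A short illustrative sketch, using position-only distance (one reasonable choice among several):

```python
import numpy as np

def select_nearest_source(source_segments, T_WO_new):
    """Pick the source segment whose reference-object position is closest to the new scene's."""
    dists = [np.linalg.norm(seg.object_pose[:3, 3] - T_WO_new[:3, 3])
             for seg in source_segments]
    return source_segments[int(np.argmin(dists))]
```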
MimicGen's transformation requires object poses at subtask boundaries. During simulation-based data generation, this is trivially satisfied by querying the physics engine. During policy deployment on a real robot, no object poses are required: the trained policy operates from image observations and proprioceptive state, with no privileged information about object locations. This separation is a key design feature. The system imposes a strong requirement (known object poses) only during an automated offline process; the deployed policy needs none of it.
For real-world settings where someone wants to run MimicGen directly against a physical robot rather than a simulation, object pose estimation from cameras would be needed. The NVIDIA GR00T-Mimic deployment keeps everything inside Isaac Sim, so simulator state is always available and this requirement never becomes a practical constraint.
The paper evaluated MimicGen across 18 tasks in robosuite (MuJoCo) and Isaac Gym Factory. Source datasets used 10 to 100 human demonstrations per task. Target datasets were generated to 1,000 demonstrations per task variant. Policies were trained using behavioral cloning with recurrent architectures (BC-RNN, an LSTM-based policy from the robomimic framework) on image observations, with three random seeds and 50 evaluation rollouts each.
Generation success rates varied substantially across tasks, reflecting the varying precision requirements:
| Task | Generation success rate |
|---|---|
| Nut Assembly | ~82% |
| Kitchen | ~75% |
| Coffee | ~52% |
| Square | ~45% |
| Threading | ~60% |
| Three Piece Assembly | ~8% |
Three Piece Assembly's low rate reflects its extreme contact requirements: millimeter-level insertion tolerances mean that most transformed trajectories miss the targets and the attempt fails. The pipeline still generates 1,000 successful demonstrations; it just requires roughly 12,000 attempts.
Policy performance on the default distribution (D0, same object configuration range as source demos):
| Task | Source-only (10 demos) | MimicGen 1000 demos | Human 200 demos |
|---|---|---|---|
| Square | 11.3% | 90.7% | ~84% |
| Threading | 19.3% | 98.0% | ~90% |
| Three Piece Assembly | 1.3% | 82.0% | ~78% |
| Kitchen | 54.7% | 100.0% | ~97% |
| Coffee | ~30% | 96.0% | ~94% |
The comparison to human data is the most informative result. Two hundred MimicGen-generated demonstrations from 10 source demos achieved policy performance comparable to 200 human-collected demonstrations (79% vs. 84% average success across tasks). This means that a researcher who spends time collecting 10 careful human demonstrations can, through MimicGen, approximate the data value of collecting 200 demonstrations by hand. The practical implication: data collection effort scales much more slowly than dataset size.
On broader initial state distributions (D1, larger object pose ranges not represented in source demos), policies maintained 42% to 99% success rates across tasks, indicating that generated data genuinely covers a range of scene configurations rather than just interpolating densely around the source examples.
Real-robot evaluations were also conducted, with success rates lower than in simulation, which is typical of sim-to-real transfer. The authors attribute part of the gap to the simplified contact models in the simulators used and to visual domain shift between simulation rendering and real camera images.
MimicGen datasets are designed to work with the robomimic framework, also developed by the same research group. Robomimic implements behavioral cloning variants including BC-RNN (the default for MimicGen experiments), as well as more advanced algorithms such as IRIS and Diffusion Policy. The standard pipeline is to generate or download a MimicGen dataset, train a policy on it with a robomimic training configuration, and evaluate the policy with simulation rollouts.
Performance generally plateaued between 1,000 and 5,000 generated demonstrations. Generating 10,000 demonstrations provided minimal additional policy improvement beyond 5,000 for the tasks studied, suggesting a practical target dataset size for each task type.
One finding worth noting: the quality of the source demonstrations matters less than expected. Policies trained on datasets generated from lower-quality source operators performed comparably to those generated from higher-quality operators, provided the generation process itself succeeded (i.e., the transformed trajectories executed correctly). The rejection filter effectively compensates for operator noise by discarding attempts that fail.
MimicGen ships with environments for robosuite (MuJoCo) and Isaac Gym Factory. The robosuite environments cover the 12 tasks in the official released dataset. The Isaac Gym Factory environments were used in the paper to demonstrate simulator-agnostic operation.
The 12 tasks in the released dataset include the six evaluated above (Nut Assembly, Kitchen, Coffee, Square, Threading, and Three Piece Assembly), among others. Each task has multiple distribution variants (D0, D1, D2) with progressively larger initial state ranges and object diversity.
Isaac Lab is NVIDIA's unified robot learning framework built on Isaac Sim. Since version 1.x, Isaac Lab has included a native MimicGen-style workflow called Isaac Lab Mimic, implemented in the isaaclab_mimic module. The integration brings MimicGen's algorithm inside the Isaac Sim physics environment, so object poses from the simulator are available directly without any additional infrastructure.
The isaaclab_mimic module provides the environment interfaces and data generation tooling for this workflow.
To add MimicGen support to a new Isaac Lab task, a researcher subclasses ManagerBasedRLMimicEnv, defines the subtask sequence and completion metrics, collects source demonstrations via teleoperation, and runs the data generator. Isaac Lab 2.3 (released in 2025) extended the workflow to support both MimicGen and SkillGen segmentation strategies side by side, and added APIs for loco-manipulation tasks that combine a wheeled mobile base with a manipulator arm.
The Isaac Lab Mimic documentation describes the system as "inspired by the MimicGen system" rather than a direct port, reflecting several implementation adaptations for Isaac Sim's parallel simulation architecture and Isaac Lab's manager-based environment design.
NVIDIA announced Isaac GR00T N1, described as the world's first open humanoid robot foundation model, at GTC 2025. Training GR00T N1 required large volumes of manipulation demonstration data for the kinds of dexterous tasks humanoid robots need to perform. Human teleoperation alone could not produce data at the required scale. The solution was a pipeline NVIDIA calls the GR00T-Mimic blueprint.
The blueprint is a reference workflow that takes a handful of human demonstrations and converts them into hundreds of thousands of synthetic trajectories. It operates in four stages:
Teleoperation data collection. A human operator uses an Apple Vision Pro headset with NVIDIA CloudXR to portal into an Isaac Sim digital twin of the robot workspace. Hand-tracking data from the headset controls the simulated robot, and demonstrations are recorded directly in Isaac Sim. This approach captures natural human manipulation motions without requiring a physical robot during the data collection phase.
Synthetic trajectory generation. GR00T-Mimic (the NVIDIA production packaging of the MimicGen algorithm) takes the recorded demonstrations and generates large numbers of new trajectories by adapting them to different object poses and scene configurations within Isaac Sim. The object poses required for the SE(3) transformation are provided by the physics engine.
Visual domain randomization via Cosmos Transfer. The synthetic trajectories are rendered with photorealistic variation using NVIDIA Cosmos Transfer, a World Foundation Model component that applies randomized lighting, background textures, and scene materials based on text prompts. This step compresses what would otherwise be hours of manual scene authoring into minutes and reduces the appearance gap between simulation and real camera imagery.
Policy training in Isaac Lab. The visually augmented synthetic dataset is used to train GR00T N1 policies via imitation learning in Isaac Lab.
Using this pipeline, NVIDIA generated 780,000 synthetic trajectories in 11 hours, equivalent to approximately 6,500 hours (roughly nine months) of continuous human teleoperation. Combining this synthetic data with real-world demonstration data improved GR00T N1 performance by 40% compared to training on real data alone. The complete blueprint is available as an NVIDIA NIM microservice.
The original MimicGen framework assumes a single manipulator and a fixed-sequence subtask graph where one arm operates relative to one object at a time. This assumption breaks down for bimanual manipulation, where two arms may act simultaneously, in coordination, or in dependent sequences on the same or different objects. Threading a needle while holding the fabric steady, sorting objects with both hands, or transferring an object from one hand to the other all require coordination mechanisms that the single-arm subtask model cannot represent.
DexMimicGen (arXiv 2410.24185, ICRA 2025) extends MimicGen to bimanual dexterous manipulation. The key algorithmic contributions:
Per-arm subtask segmentation. Rather than a single task-level subtask sequence, DexMimicGen defines separate subtask sequences for each arm. Each arm's trajectory is parsed and transformed independently.
Subtask type classification. DexMimicGen categorizes subtasks into three types. Parallel subtasks have the arms operating independently; an asynchronous execution queue manages each arm separately. Coordination subtasks require both arms to execute synchronized trajectories generated with the same SE(3) transformation. Sequential subtasks enforce ordering through dependency constraints that ensure prerequisites complete before dependent arm actions begin.
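One way to picture the three subtask types is as a per-arm scheduling constraint. The sketch below is our own illustration of that idea under assumed state structures, not DexMimicGen's implementation:

```python
from enum import Enum, auto

class SubtaskType(Enum):
    PARALLEL = auto()       # each arm proceeds independently through its own queue
    COORDINATION = auto()   # both arms execute segments built with the same SE(3) transform
    SEQUENTIAL = auto()     # this arm waits on a prerequisite subtask of the other arm

def ready_to_start(subtask, arm_state, other_arm_state):
    """Decide whether an arm may begin its next subtask under the three-type model."""
    if subtask.type is SubtaskType.PARALLEL:
        return True
    if subtask.type is SubtaskType.COORDINATION:
        # Start only once the other arm is also waiting at its coordination point.
        return other_arm_state.waiting_for_coordination
    if subtask.type is SubtaskType.SEQUENTIAL:
        return subtask.prerequisite in other_arm_state.completed_subtasks
    return False
```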
DexMimicGen was evaluated on three robot embodiments: bimanual Panda arms with parallel-jaw grippers, bimanual Panda arms with dexterous five-fingered hands, and the GR-1 humanoid robot. It generated 21,000 demonstrations across 9 tasks from 60 source human demonstrations. On a real humanoid, training on DexMimicGen-generated data from just 4 source demonstrations achieved 90% task success on can sorting; a policy trained on the 4 source demonstrations alone achieved 0%.
Selected task results:
| Task | Policy success rate |
|---|---|
| Can Sorting | 97.3% |
| Tray Lift | 88.7% |
| Transport | 83.3% |
| Piece Assembly | 80.7% |
| Pouring | 79.3% |
| Drawer Cleanup | ~76% |
| Threading | ~69% |
The DexMimicGen codebase is available at github.com/NVlabs/dexmimicgen, and datasets are hosted on Hugging Face under the MimicGen organization.
RoboCasa is a large-scale simulation benchmark for household robot learning published by Nasiriany, Mandlekar, and collaborators from UT Austin and NVIDIA at RSS 2024. It provides 120 kitchen environments built from 10 floor plans and 12 architectural styles, with 100 canonical manipulation tasks grouped into atomic skills (25 tasks) and composite multi-step sequences (75 tasks).
RoboCasa and MimicGen are designed to work together. The benchmark's data pipeline uses MimicGen to expand a human seed set into large training datasets: human operators collect 50 demonstrations per atomic task with a SpaceMouse device, and MimicGen then segments and adapts those demonstrations to the diverse kitchen configurations, generating over 100,000 total trajectories. The combination addresses what neither system can do alone: MimicGen provides the data amplification mechanism, while RoboCasa provides the diverse, realistic household environment that makes the generated data meaningful for generalist robot training.
The two systems differ along several dimensions:
| Dimension | MimicGen | RoboCasa |
|---|---|---|
| Primary role | Data generation algorithm | Environment suite and benchmark |
| Input | ~10 to 200 human source demos | 50 human demos per task (seed) |
| Output | 1,000 to 50,000+ synthetic demos | 100,000+ demos across 100 tasks |
| Task scope | General manipulation, 18 tasks | Kitchen household, 100 tasks |
| Simulator | robosuite (MuJoCo), Isaac Gym | robosuite extended (MuJoCo) |
| Scene diversity | Object pose, instance, robot arm | Kitchen layout, style, texture |
| Foundation model role | GR00T N1 training data pipeline | GR00T N1 evaluation benchmark |
| Code license | NVIDIA Source Code License | MIT |
| Dataset license | CC-BY 4.0 | CC-BY 4.0 |
RoboCasa expanded into RoboCasa365 in 2025, adding seasonal and thematic kitchen variations and further integration with the Isaac GR00T N1 evaluation stack.
The MimicGen codebase was published on GitHub at github.com/NVlabs/mimicgen in July 2024, approximately eight months after the CoRL 2023 paper presentation. The repository contains the data generation code, the simulation task environments, and scripts for reproducing the released datasets.
Datasets (version 1.0.1, released September 2024) are available on Hugging Face at amandlek/mimicgen_datasets. The released dataset contains over 48,000 demonstrations across 12 tasks. Each dataset is labeled with source configuration, object distribution variant, and robot arm.
The companion robomimic framework (also from NVIDIA/Stanford, github.com/ARISE-Initiative/robomimic) provides the policy learning side: behavioral cloning variants, dataset loading utilities, and evaluation scripts compatible with MimicGen data formats.
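MimicGen and robomimic datasets are distributed as HDF5 files; a quick way to inspect one is with h5py. The file path below is a placeholder, and the group layout shown (`data/demo_*/obs` plus `actions`) follows the robomimic convention as we understand it:

```python
import h5py

# Placeholder path: any downloaded MimicGen hdf5 dataset.
with h5py.File("square_d0.hdf5", "r") as f:
    demos = list(f["data"].keys())            # e.g. ["demo_0", "demo_1", ...]
    print(f"{len(demos)} demonstrations")
    demo = f["data"][demos[0]]
    print("actions:", demo["actions"].shape)  # (T, action_dim)
    for obs_key, arr in demo["obs"].items():  # image and low-dim observation streams
        print("obs/" + obs_key, arr.shape)
```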
Several research groups have extended the MimicGen approach since the original paper.
DynaMimicGen (arXiv 2511.16223, November 2025) extends MimicGen to dynamic tasks where objects or scene elements move during execution. The core modification replaces the static SE(3) transform with Dynamic Movement Primitives (DMPs) per subtask, enabling real-time adaptation of the trajectory as objects move. Standard MimicGen assumes scenes are static during execution; DynaMimicGen relaxes this constraint.
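For readers unfamiliar with DMPs, a minimal one-dimensional discrete DMP rollout looks like the sketch below. This is a generic textbook formulation, not DynaMimicGen's code; the relevant property is that the goal g can be updated at every step, so the trajectory adapts as the target object moves.

```python
import numpy as np

def rollout_dmp(y0, goal_fn, weights, centers, widths, tau=1.0, dt=0.01, alpha=25.0):
    """Roll out a 1-D discrete DMP; goal_fn(t) may change over time (moving target)."""
    beta, alpha_s = alpha / 4.0, alpha / 3.0
    y, v, s, t, traj = y0, 0.0, 1.0, 0.0, []
    while s > 1e-3:
        g = goal_fn(t)                              # goal can move mid-execution
        psi = np.exp(-widths * (s - centers) ** 2)  # RBF basis over the phase variable s
        forcing = (psi @ weights) / (psi.sum() + 1e-10) * s * (g - y0)
        v += dt / tau * (alpha * (beta * (g - y) - v) + forcing)  # transformation system
        y += dt / tau * v
        s += dt / tau * (-alpha_s * s)              # canonical system decays the phase
        t += dt
        traj.append(y)
    return np.array(traj)
```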
SoftMimicGen (arXiv 2603.25725) adapts the framework to deformable object manipulation. Rigid object pose tracking is straightforward; deformable objects like ropes, towels, and surgical tissue do not have a single frame to track. SoftMimicGen introduces deformation-aware segmentation and transformation strategies for tasks including high-precision threading, dynamic whipping, folding, and pick-and-place of soft objects, across robot arms, humanoids, and surgical robot embodiments.
SkillGen is an alternative data generation approach integrated alongside MimicGen in Isaac Lab 2.3. Rather than segmenting demonstrations into object-centric subtasks, SkillGen segments them into motion primitives (skills) that can be recombined. Isaac Lab 2.3 allows users to choose either MimicGen or SkillGen for the mimic workflow depending on which segmentation strategy fits their task.
MimicLabs (Georgia Tech, 2025) built a scalable data collection and generation pipeline for tabletop manipulation on top of MimicGen, adding real-time operator feedback and multi-operator session management for physical robot setups.
MimicGen has several practical limitations that define where the approach works well and where it does not.
The single-object-per-subtask assumption constrains which tasks the system can handle. Tasks requiring simultaneous contact with two distinct objects by a single arm within a single subtask cannot be represented in the basic framework. DexMimicGen addresses the bimanual case, but novel contact configurations that do not fit the object-centric model still require manual extension.
Object pose knowledge is required at generation time. Deploying MimicGen in a real-robot loop outside of simulation requires object pose estimation from cameras or other sensors. For tasks where accurate pose estimation is difficult (transparent objects, deformable objects, objects with rotational symmetry), generation quality degrades.
Generation success rates vary from under 10% to over 80% depending on task precision requirements. Very low generation rates mean the system needs to attempt many more trajectories to accumulate a fixed dataset size, increasing simulation compute time. For extremely tight-tolerance tasks, generation may be slow enough to affect practical usability.
Static scene assumption during execution. The bridge-and-execute model runs transformed trajectories open-loop. If a contact event perturbs the robot or object during a subtask, the remaining trajectory misaligns and the attempt fails the rejection filter. The system cannot react to disturbances mid-trajectory. DynaMimicGen addresses this with DMPs, but the base MimicGen system does not.
Source demonstration biases propagate. If human operators consistently used a suboptimal strategy (a particular grasp orientation, a specific approach direction), generated data inherits that strategy at scale. More generated demonstrations cannot correct a systematic flaw in the source data.
Policy generalization remains bounded by the generated distribution. While MimicGen substantially expands coverage compared to source demonstrations alone, the generated data still reflects the object pose ranges, object instances, and robot configurations sampled during generation. Deploying in significantly out-of-distribution conditions requires either regenerating data with appropriate distributions or augmenting with other training approaches.