RoboCasa is a large-scale simulation framework for training generalist robots to perform everyday household tasks. Developed by researchers at The University of Texas at Austin and NVIDIA Research, it was introduced in a paper presented at Robotics: Science and Systems (RSS) 2024 under the title "RoboCasa: Large-Scale Simulation of Everyday Tasks for Generalist Robots." The framework centers on kitchen environments and provides a standardized benchmark of 100 tasks (25 atomic, 75 composite), 120 procedurally generated scenes, 2,509 unique 3D objects, and over 100,000 automatically generated training trajectories produced via MimicGen. Since its initial release, RoboCasa has been adopted by NVIDIA as a benchmark for its Isaac GR00T foundation model series and has been extended into RoboCasa365, which scales the framework to 365 tasks and 2,500 kitchen scenes.
Teaching robots to operate in human homes has long been a goal of embodied AI research. Real kitchen environments present compounding difficulties: high clutter, articulated appliances, objects with similar appearances but different functions, and tasks that require chaining many sub-skills in sequence. Collecting sufficient real-world robot data is slow, expensive, and difficult to scale. A single skilled operator can teleoperate a robot to produce perhaps a few hundred demonstrations per day, far short of the tens of thousands typically needed to train robust policies.
Simulation offers a path around data scarcity, but earlier simulation frameworks for household robotics had their own limitations. Many provided only tabletop manipulation tasks in sparse, texturally uniform environments. Others offered realistic scenes but few tasks. None combined all three of the ingredients the RoboCasa authors identify as essential: diverse scenes, a large curated task library, and a scalable dataset-generation pipeline.
RoboCasa builds on RoboSuite, an earlier simulation framework for robot learning developed by the same research group and open-sourced through the ARISE Initiative. RoboSuite provided a modular, MuJoCo-based environment for tabletop manipulation tasks, a clean Python API, and support for multiple robot arm models. It became a widely used baseline in the manipulation literature.
The principal limitation of RoboSuite was scope. Its environments were essentially table-surface scenes; there was no room-scale context, no navigation, and no structured household semantics. Moving from RoboSuite to RoboCasa meant adding mobile manipulator support, room-scale kitchens, articulated appliances, an AI-assisted asset pipeline, and a dataset generation backend based on MimicGen.
The paper "RoboCasa: Large-Scale Simulation of Everyday Tasks for Generalist Robots" was authored by Soroush Nasiriany, Abhiram Maddukuri, Lance Zhang, Adeet Parikh, Aaron Lo, and Abhishek Joshi from UT Austin, together with Ajay Mandlekar and Yuke Zhu from NVIDIA Research. It was published at RSS 2024 (Robotics: Science and Systems, the 20th edition) and is available at arXiv:2406.02523.
The paper frames the central challenge as a scaling problem. The authors argue that simulation-based robot learning requires three things at the same time: diverse environments (so policies do not overfit to a single visual context), diverse tasks (so skills generalize across activities), and large datasets (so the learned policies are statistically robust). Prior work, they note, had addressed at most two of these simultaneously.
Their proposed solution is a framework that generates diverse environments using a combination of procedural design and generative AI, generates tasks using large language models (LLMs), and generates datasets using MimicGen starting from a small seed of human demonstrations. The paper validates this pipeline empirically, showing that scaling the MimicGen-generated dataset from 100 to 3,000 trajectories per task raises average atomic-task success rates from about 26% to about 48%.
RoboCasa uses MuJoCo as its physics engine, inherited from RoboSuite. Simulation runs at approximately 25 frames per second with full rendering enabled. For photorealistic imagery, the framework supports NVIDIA Omniverse as an optional rendering backend. This combination provides physically accurate contact dynamics for drawer handles, knob twisting, and object grasping while still producing images detailed enough for vision-based policies.
The Python codebase is structured as an extension of RoboSuite and is installed via pip after cloning the robocasa repository. Roughly 10 GB of kitchen assets must be downloaded separately during setup.
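For orientation, a minimal usage sketch is shown below, assuming RoboCasa environments register with robosuite's standard `make` interface. The environment name, robot identifier, and exact keyword arguments are illustrative assumptions and may differ between RoboCasa versions.

```python
# Minimal sketch: creating a RoboCasa environment through the robosuite-style API.
# The environment name "PnPCounterToCab" and robot name "PandaMobile" are
# illustrative assumptions, not guaranteed identifiers.
import numpy as np
import robosuite as suite
import robocasa  # importing is assumed to register the kitchen environments

env = suite.make(
    env_name="PnPCounterToCab",  # an atomic pick-and-place task (illustrative)
    robots="PandaMobile",        # mobile manipulator variant (illustrative)
    has_renderer=False,
    use_camera_obs=False,
)

obs = env.reset()
for _ in range(10):
    # Random actions as a simple smoke test of the environment loop.
    action = np.random.uniform(-1, 1, env.action_dim)
    obs, reward, done, info = env.step(action)
env.close()
```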
The original RoboCasa paper describes 120 distinct kitchen scenes generated through a multi-step process:
The team began by surveying architecture and interior design publications to identify diverse real-world kitchen layouts. They selected 10 distinct floor plans, ranging from small apartment kitchens to larger open-plan arrangements. Each floor plan was then combined with each of 12 named kitchen styles (Industrial, Scandinavian, Coastal, Modern, Traditional, Mediterranean, Rustic, and several others), yielding the 120 base scenes. Each style specifies a consistent palette of fixture finishes, appliance selections, and material textures.
To achieve visual variety, the team used MidJourney to generate 400 AI textures (100 each for walls, floors, countertops, and cabinet panels). These textures were mapped onto the procedural scene geometry, giving every kitchen a distinct look without requiring a separate 3D artist pass for each scene.
RoboCasa365 later extended the scene count to 2,500. Those additional scenes were derived from 50 layouts modeled after real Zillow real-estate listings across multiple US cities, combined with 50 visual styles. The design ensured that pretraining scenes and evaluation scenes share no visual style overlap, reducing the risk that policies learn superficial appearance cues rather than underlying task structure.
The 75 composite tasks in the original RoboCasa benchmark were generated with assistance from GPT-4 and Gemini 1.5. The process had two steps.
First, GPT-4 was prompted to enumerate high-level kitchen activities that a person might carry out in a typical week. It produced 20 categories: brewing coffee or tea, washing dishes, restocking supplies, chopping produce, making toast, defrosting items, boiling water, preparing meat, setting the table, clearing the table, sanitizing surfaces, preparing snacks, tidying up, washing produce, frying, reheating food, mixing or blending, baking, serving, and steaming.
Second, for each category, GPT-4 and Gemini 1.5 were asked to propose specific robot task implementations that could be expressed as state-machine programs. The authors then reviewed the proposed tasks, filtered out those with logical flaws, and coded the accepted tasks as Python environment definitions within the RoboCasa framework. The result was 75 composite tasks that cover a meaningful range of kitchen activity while remaining implementable in simulation.
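As a rough illustration of what coding a task as a Python environment definition involves, the sketch below expresses a composite task's success condition as a sequence of checkable stages. The class name, simulator-query helpers, and thresholds are hypothetical and do not reproduce RoboCasa's actual task API.

```python
# Hypothetical sketch of a composite kitchen task expressed as a staged
# success condition; all names and thresholds are illustrative.
class RestockPantryTask:
    """Robot must open the cabinet, place both cans inside, then close the door."""

    STAGES = ("cabinet_open", "cans_inside", "cabinet_closed")

    def __init__(self, sim_state):
        self.sim = sim_state      # handle for querying simulator state (assumed)
        self.completed = set()

    def check_success(self) -> bool:
        # Stage 1: the cabinet door joint must have been opened past a threshold.
        if self.sim.joint_pos("cabinet_door") > 0.5:
            self.completed.add("cabinet_open")
        # Stage 2: every target object must lie inside the cabinet volume.
        if all(self.sim.inside("cabinet", obj) for obj in ("can_0", "can_1")):
            self.completed.add("cans_inside")
        # Stage 3: the door must be closed again after the cans were placed.
        if "cans_inside" in self.completed and self.sim.joint_pos("cabinet_door") < 0.05:
            self.completed.add("cabinet_closed")
        return self.completed == set(self.STAGES)
```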
The original framework defines 25 atomic tasks organized around eight foundational sensorimotor skills:
| Skill | Description |
|---|---|
| Pick and place | Grasping an object and moving it to a target location |
| Opening and closing doors | Interacting with hinged cabinet and appliance doors |
| Opening and closing drawers | Pulling and pushing drawer handles |
| Twisting knobs | Rotating stovetop or oven control knobs |
| Turning levers | Operating lever-style handles on faucets or appliances |
| Pressing buttons | Activating push-button controls on appliances |
| Insertion | Placing items into slots, receptacles, or holders |
| Navigation | Moving the robot base to a target position in the kitchen |
Of these, the RoboCasa paper found pick-and-place and insertion to be consistently hardest to learn due to object diversity and the need for precise end-effector alignment, respectively.
The 75 composite tasks require the robot to sequence multiple atomic skills. Brewing coffee, for example, chains navigation, mug insertion, and button pressing, while restocking supplies chains door opening, pick-and-place, and door closing.
In the RSS 2024 paper, single-task policies trained from scratch on composite tasks showed low success rates (0 to 12%, depending on the task). However, fine-tuning from a policy pretrained on the atomic tasks improved performance across the board, suggesting that the atomic skill library serves as a useful prior for composite task learning.
RoboCasa365 expands this to 365 tasks across 60 distinct kitchen activity categories, adding tasks that test semantic reasoning, memory-dependent behavior (such as remembering where an item was stored), and longer-horizon planning.
RoboCasa's original release includes 2,509 unique 3D objects spanning 153 categories. These cover the full range of items found in a residential kitchen: fruits, vegetables, packaged foods, containers, utensils, small appliances, and miscellaneous household items.
A major portion of the object library was generated with AI-based tools. The team used Luma.ai's text-to-3D generation service to produce 1,592 objects, giving the library substantial variety without requiring individual 3D modeling for each asset. Remaining objects came from Objaverse 1.0, a large community-curated 3D model collection, and from LightWheel AI.
RoboCasa365 expanded the object library to over 3,200 assets across more than 150 categories, adding 57 new object categories beyond the original 153.
The simulation includes a rich set of interactable fixtures representing the built-in elements of a kitchen. These include cabinets with hinged doors, drawers with pull handles, microwaves, stovetops with articulated knobs, coffee machines, dishwashers, refrigerators, and sinks. In RoboCasa365, the fixture library was expanded from 20 to 456 interactive fixtures and appliances across 12 categories, adding toasters, blenders, electric kettles, and other countertop devices.
Fixtures in RoboCasa support multi-state behavior. Turning a stove knob activates the corresponding burner element. Opening a microwave door reveals the interior cavity. This means tasks can include state-change verification as part of their success condition, encouraging policies to interact with objects in semantically correct ways rather than merely moving them.
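A hedged sketch of what such a state-change check might look like is given below; the joint names and threshold are illustrative assumptions, not values taken from the framework.

```python
# Hypothetical sketch of multi-state fixture logic: a stove knob's joint angle
# determines whether its burner counts as "on", so a task can verify the state
# change rather than mere contact with the knob. Names are illustrative.
KNOB_ON_THRESHOLD = 0.35  # radians of twist past which the burner is lit (assumed)

def burner_is_on(sim, knob_joint="stove_knob_front_left"):
    """Return True if the knob has been twisted far enough to activate the burner."""
    return abs(sim.joint_pos(knob_joint)) > KNOB_ON_THRESHOLD

def turn_on_stove_success(sim):
    # Success requires the burner state change, not just reaching the knob.
    return burner_is_on(sim)
```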
MimicGen is a data generation system developed by Ajay Mandlekar and colleagues at NVIDIA that can synthesize large demonstration datasets from a small seed of human demonstrations. Given roughly 50 human-collected trajectories for a task, MimicGen decomposes each trajectory into object-centric segments, adapts those segments to new scene configurations, and uses rejection sampling to keep only trajectories where the robot successfully completed the task. The result is a dataset orders of magnitude larger than what a human operator could collect directly.
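The sketch below illustrates the general shape of this recipe: object-centric segments of a seed demonstration are retargeted to the objects' poses in a new scene, replayed, and kept only if the task succeeds. All names are hypothetical; this is not MimicGen's actual implementation.

```python
# Hypothetical sketch of a MimicGen-style generation loop: adapt object-centric
# segments of a seed demo to a new scene, roll them out, and keep successes only.
# Segment methods, env.execute, and env.check_success are illustrative names.
import random

def generate_dataset(seed_demos, env, target_count):
    generated = []
    while len(generated) < target_count:
        demo = random.choice(seed_demos)
        env.reset()  # new randomized scene configuration
        trajectory = []
        for segment in demo.object_centric_segments():
            # Re-express the segment's end-effector poses relative to the
            # object's new pose in this scene, then replay the adapted motion.
            new_obj_pose = env.get_object_pose(segment.object_name)
            adapted = segment.transform_to_pose(new_obj_pose)
            trajectory.extend(env.execute(adapted))
        # Rejection sampling: keep the trajectory only if the task succeeded.
        if env.check_success():
            generated.append(trajectory)
    return generated
```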
RoboCasa uses MimicGen as its primary dataset scaling mechanism. The RSS 2024 paper describes the following process:
Four operators collected 1,250 human demonstrations in total for the atomic tasks (50 demonstrations per task) using 3D SpaceMouse teleoperation devices. MimicGen then generated roughly 100,000 additional trajectories from this seed data: 72,000 trajectories covering 24 atomic tasks (3,000 trajectories per task) and a further 28,000 trajectories using AI-generated object variants.
The scaling experiment in the paper trained behavioral cloning policies on datasets of four sizes (50 human demos, 100 generated, 300 generated, 3,000 generated) and measured average success across 24 atomic tasks:
| Dataset | Demos per task | Avg. success rate |
|---|---|---|
| Human-50 | 50 | 28.8% |
| Generated-100 | 100 | 26.3% |
| Generated-300 | 300 | 35.0% |
| Generated-3000 | 3,000 | 47.6% |
This result demonstrates a clear positive scaling trend: more MimicGen-generated data consistently improves policy performance. The 3,000-demo policy outperformed the 50-demo human baseline by nearly 19 percentage points in absolute terms.
RoboCasa365 extended MimicGen usage further, providing 1,615 hours of synthetic data (10,000 demonstrations for each of 60 atomic task categories) alongside 404 hours of human pretraining demonstrations (30,000 demonstrations across 300 tasks).
The RSS 2024 paper includes real-world transfer experiments conducted on a Franka Emika Panda arm mounted on a wheeled mobile base in an actual kitchen. Policies were trained on real data alone or on a combination of real and simulated data, then evaluated on physical hardware.
Co-training with RoboCasa simulation data yielded meaningful improvements over policies trained on real data alone.
RoboCasa365 extended these transfer experiments, reporting that sim-and-real co-training achieves 79.8% average success on real kitchen tasks compared to 61.8% with real data alone.
These results suggest that even imperfect simulation data with a domain gap to real hardware can substantially improve policy robustness, particularly for generalization to new objects.
RoboCasa has become a key benchmark and training environment within NVIDIA's robotics stack. Isaac GR00T N1, NVIDIA's foundation model for generalist humanoid robots, lists RoboCasa as one of its evaluation benchmarks. On the RoboCasa leaderboard maintained at robocasa.ai, GR00T N1 achieves 49.6% average success using 300 demonstrations per task.
NVIDIA's Cosmos-Policy model uses RoboCasa training data more directly. The Cosmos-Policy-RoboCasa-Predict2-2B model, released on Hugging Face, is fine-tuned from the Cosmos-Predict2-2B Video2World foundation model using 50 human-teleoperated demonstrations per task across 24 kitchen manipulation tasks. The model takes multi-view RGB images, proprioceptive state, and a natural language task description as input, and outputs a 32-timestep action sequence along with future state predictions and a value estimate.
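The described input/output contract can be summarized schematically as follows; the field names, shapes, and structure are assumptions made for illustration and are not the released model's API.

```python
# Illustrative summary of the Cosmos-Policy interface described above.
# Field names and shapes are assumptions, not the model's published API.
from dataclasses import dataclass
import numpy as np

@dataclass
class PolicyInput:
    rgb_views: dict[str, np.ndarray]  # multi-view RGB images, e.g. {"front": HxWx3, "wrist": HxWx3}
    proprio_state: np.ndarray         # proprioceptive state (joint positions, gripper)
    instruction: str                  # natural-language task description

@dataclass
class PolicyOutput:
    actions: np.ndarray               # (32, action_dim) action chunk, one per timestep
    predicted_states: np.ndarray      # model's forecast of future states
    value: float                      # scalar value estimate
```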
On the RoboCasa benchmark, Cosmos Policy achieves 67.1% average success with only 50 demonstrations per task, outperforming methods that use 300 or more demonstrations:
| Method | Demos per task | Avg. success rate |
|---|---|---|
| GR00T-N1 | 300 | 49.6% |
| UVA | 50 | 50.0% |
| DP-VLA | 3,000 | 57.3% |
| π0 | 300 | 62.5% |
| GR00T-N1.5 | 300 | 64.1% |
| Video Policy | 300 | 66.0% |
| FLARE | 300 | 66.4% |
| Cosmos Policy | 50 | 67.1% |
These numbers are averaged over 50 trials per task across 5 evaluation scenes (3,600 total trials).
NVIDIA's DreamGen project, which uses Cosmos world foundation models to generate synthetic robot training videos, also uses RoboCasa as a downstream evaluation target. The research showed that video world models which score higher on a generation quality benchmark (DreamGen Bench) consistently produce higher success rates when robots trained on their outputs are evaluated on RoboCasa tasks.
RoboCasa also serves as a benchmark environment in Isaac Lab-Arena, NVIDIA's standardized robotics testing infrastructure, which evaluates robot skills in simulation before deployment to physical hardware.
Several simulation frameworks address overlapping problems in household robotics. RoboCasa occupies a distinct position in this space, particularly in its combination of curated tasks, AI-assisted asset generation, and an integrated demonstration dataset.
| Framework | Developer | Primary focus | Scene type | Sim speed | Data pipeline |
|---|---|---|---|---|---|
| RoboCasa | UT Austin / NVIDIA | Manipulation in kitchens | Room-scale | ~25 FPS (CPU) | MimicGen augmentation |
| MimicGen | NVIDIA | Data generation for tabletop | Tabletop | Varies | Human seed + synthesis |
| RoboGen | CMU | Open-ended skill generation | Varied | Varies | Fully automatic (RL/motion planning) |
| Habitat 3.0 | Meta AI | Navigation and rearrangement | Full home | GPU-accelerated | Human-in-the-loop demos |
| ManiSkill3 | UCSD / Berkeley | Manipulation benchmarking | Varied | 2,000+ FPS (GPU) | RLPD / RFCL online |
| AI2-THOR | Allen Institute | Visual navigation, interaction | Full home | CPU | Human annotation |
| OmniGibson | Stanford | Daily activities | Full home | CPU | Varies |
A few distinctions are worth noting in more detail.
RoboGen, developed at Carnegie Mellon, uses a fully automated self-guided loop: a language model proposes tasks, a generative pipeline creates simulation environments, and the system selects between reinforcement learning, motion planning, or trajectory optimization to acquire each skill. This approach can produce an essentially unlimited variety of skills with minimal human input.
RoboCasa takes the opposite design philosophy. Its tasks are curated by human researchers who have filtered LLM suggestions for logical consistency and coded them as structured Python environment definitions. Its datasets rely on human demonstrations as a quality anchor before MimicGen amplification. The tradeoff is a smaller but more reliable and reproducible task suite, with standardized evaluation conditions across all groups using the benchmark.
RoboGen is better suited for exploring the breadth of skills a robot might acquire autonomously; RoboCasa is better suited for measuring progress on a stable, shared benchmark.
Meta AI's Habitat platform emphasizes navigation and social rearrangement in full home environments, with particular strength in human-robot co-habitation tasks. Habitat 3.0 includes accurate humanoid simulation and infrastructure for real humans to interact with simulated robots.
RoboCasa focuses more narrowly on kitchen manipulation: picking, placing, opening, closing, twisting, and inserting objects within a fixed spatial context. It does not include whole-house navigation or explicit human-robot interaction scenarios. Habitat uses GPU-accelerated simulation that can run far faster than RoboCasa's CPU-based MuJoCo backend, which gives Habitat an advantage for reinforcement learning workflows that require millions of environment steps.
ManiSkill3, developed at UCSD and Berkeley, achieves GPU-parallelized simulation speeds exceeding 2,000 frames per second, enabling online reinforcement learning at a scale that CPU-based simulators cannot match. Its design favors algorithms like RLPD and RFCL, which combine online exploration with offline demonstration data.
RoboCasa does not support GPU-parallelized simulation and runs at roughly 25 FPS with rendering. This limits it primarily to imitation learning and offline methods. However, RoboCasa's photorealistic kitchen environments, rich fixture library, and large curated dataset make it more representative of real-world household conditions than most ManiSkill3 task suites at the time of the RSS 2024 paper.
RoboCasa was released as open source under a dual license: MIT for the code base and CC BY 4.0 for the asset library and datasets. The code is hosted at github.com/robocasa/robocasa and the datasets are available on Hugging Face under the haosulab/RoboCasa namespace.
The release includes the simulation framework and task environments, the kitchen scene and object asset library, and the human and MimicGen-generated demonstration datasets.
Version 0.2, released October 31, 2024, updated the backend to RoboSuite v1.5 with improved support for custom robot configurations, composite controllers, additional teleoperation devices, and photorealistic rendering.
Version 1.0, released February 18, 2026, corresponds to RoboCasa365 and brings the full 365-task suite, 2,500+ kitchen scenes, 3,200+ objects, and the 2,200+ hour demonstration dataset along with a community leaderboard at robocasa.ai/leaderboard.html for submitting and comparing policy models.
RoboCasa addresses several distinct use cases in robotics research:
Pre-training robot foundation models. The large and diverse demonstration dataset makes RoboCasa suitable as a pretraining corpus for vision-language-action models that will be fine-tuned on specific downstream tasks. RoboCasa365's benchmarking experiments showed that pretraining on its data allows models to match the performance of target-only training with roughly a third as many target demonstrations.
Measuring simulation-to-real transfer. The framework provides a reproducible environment for studying how policy quality in simulation predicts real-world performance. The kitchen setting, with its grounded object types and plausible task structure, is closer to real deployment conditions than abstract tabletop benchmarks.
Evaluating generalist manipulation policies. The public leaderboard enables direct comparison across research groups using identical evaluation conditions: same tasks, same scenes, same trial counts, and standardized success metrics.
Studying data scaling laws. The integration of MimicGen means researchers can systematically vary dataset size (from tens to tens of thousands of demonstrations) and measure the resulting policy performance, contributing to an empirical understanding of how robot learning scales with data.
Domain randomization and visual generalization. The 400 AI-generated textures, combined with the procedural variation across 120 or 2,500 scenes, provide a natural testbed for visual domain randomization approaches that try to close the sim-to-real gap.
The RSS 2024 paper is candid about several limitations of the original framework.
Composite task performance remains poor. Even with pretraining, success rates on composite tasks in the original paper topped out around 12%, suggesting that the jump from atomic skills to multi-step activities is not yet solved.
MimicGen-generated trajectories can exhibit artifacts. Because MimicGen adapts demonstrations programmatically rather than physically replanning from scratch, the resulting trajectories sometimes contain jerky motions, near-collisions, or brief interpenetrations that would not appear in human-collected data. Policies trained on these trajectories may inherit some of these motion artifacts.
LLM task generation still requires human implementation. GPT-4 can propose task descriptions, but a human researcher must translate each proposal into a Python environment definition within RoboCasa's task coding framework. This is a bottleneck for scaling the task library beyond what the authors have built.
The dataset lacks coverage of highly dexterous skills, deformable object manipulation, and bimanual tasks. All original RoboCasa tasks use a single-arm mobile manipulator. Tasks requiring two hands, soft materials, or fine-grained fingertip control are absent.
Simulation speed limits reinforcement learning. At 25 FPS, RoboCasa cannot support the millions of environment steps that model-free RL typically requires. The framework is effectively restricted to imitation learning and offline methods.
The environments are limited to kitchens. While kitchens are a natural testbed given their complexity, they do not cover the full range of household environments (bedrooms, bathrooms, living rooms) that a deployable home robot would need to navigate.