Open X-Embodiment is a large-scale collaborative robotics research initiative led by Google DeepMind that produced the largest open-source real-robot dataset and a family of cross-embodiment robot learning models. Announced in October 2023, the project brought together 21 institutions and 34 research labs to pool robotic demonstration data from 22 different robot platforms into a single unified dataset. Two models trained on this data, RT-1-X and RT-2-X, demonstrated that a single policy can transfer learned skills across physically different robots, achieving a roughly 50% average improvement over robot-specific baselines. The accompanying paper won the Best Conference Paper Award at ICRA 2024 in Yokohama, Japan.
The Open X-Embodiment (OXE) dataset contains more than one million real-world robot trajectories spanning 527 distinct skills and 160,266 tasks. It was assembled from 60 pre-existing datasets contributed by 34 robotics laboratories worldwide. By standardizing these heterogeneous data sources into a common format, the project enabled the first systematic study of cross-embodiment transfer learning at scale. The dataset, model checkpoints, and code are publicly available under open-source licenses.
Before Open X-Embodiment, the robotics research community faced a fundamental data problem. Each laboratory typically collected its own demonstrations on its own robot hardware, trained policies for that specific platform, and published results that were difficult to compare or reproduce elsewhere. Unlike computer vision and natural language processing, where massive shared datasets (ImageNet, Common Crawl) enabled foundation models to learn general representations, robotics lacked a comparable shared data resource.
This fragmentation had several consequences. Individual lab datasets were too small to train high-capacity models. Skills learned on one robot could not be transferred to another. And every new robot platform essentially required starting from scratch. The question at the heart of the OXE project was whether diverse robot data, pooled together despite differences in hardware, sensors, and task domains, could produce policies that generalize across embodiments.
The project built on two prior Google models: RT-1 (Robotics Transformer 1), a transformer-based architecture for real-world robotic control published in December 2022, and RT-2, a vision-language-action model from July 2023 that represented robot actions as text tokens within a large vision-language model. RT-1 was trained on 130,000 episodes from a fleet of 13 Google robots performing over 700 tasks. RT-2 showed that web-scale pretraining on images and text could transfer useful knowledge to robot control. Open X-Embodiment extended both models to learn from many different robot types simultaneously.
The OXE dataset is the largest open-source collection of real robot demonstrations. Its key statistics are:
| Statistic | Value |
|---|---|
| Total trajectories | 1,000,000+ |
| Robot embodiments | 22 |
| Distinct skills | 527 |
| Total tasks | 160,266 |
| Component datasets | 60 |
| Contributing institutions | 21 |
| Contributing research labs | 34 |
The 22 robot embodiments range from single-arm manipulators to bimanual systems and legged platforms. The dataset covers a wide spectrum of manipulation behaviors, from common tasks like pick-and-place and drawer opening to specialized skills like cable routing and wiping.
The dataset includes demonstrations from diverse robotic hardware:
| Category | Robot platforms |
|---|---|
| Single-arm manipulators | Franka Emika Panda, WidowX, UR5, xArm, Sawyer, Kinova Jaco, Kinova Gen3 |
| Google platforms | Google Robot (Everyday Robots) |
| Industrial arms | KUKA |
| Bimanual systems | ALOHA |
| Mobile manipulators | Hello Robot Stretch |
| Legged platforms | Unitree A1, Boston Dynamics Spot |
| Tabletop systems | Language Table |
The most heavily represented platforms in terms of trajectory count are the Google Robot (from the Fractal/RT-1 dataset), Bridge V2 (WidowX), and Language Table. The dataset is notably imbalanced; the top four robot types account for over 85% of all real demonstration data.
The 60 component datasets were collected independently by different research groups over multiple years. Some of the larger and more widely known datasets in the collection include:
| Dataset | Robot | Source institution | Approximate episodes |
|---|---|---|---|
| Fractal (RT-1) | Google Robot | Google DeepMind | ~130,000 |
| Bridge V2 | WidowX | UC Berkeley | ~60,000 |
| Language Table | xArm (tabletop) | Google Research | ~440,000 |
| QT-Opt | KUKA | Google Research | ~60,000 |
| TACO Play | Various | University of Freiburg | ~3,600 |
| Jaco Play | Kinova Jaco | USC | ~1,000 |
| Berkeley Cable Routing | Franka | UC Berkeley | ~1,500 |
| RoboTurk | Sawyer | Stanford | ~2,100 |
| NYU Door Opening | Franka | NYU | ~500 |
| Berkeley Autolab UR5 | UR5 | UC Berkeley | ~1,000 |
| TOTO | Franka | CMU | ~1,000 |
The remaining datasets contribute anywhere from a few hundred to several thousand episodes each, covering tasks on platforms including Franka, WidowX, xArm, and others.
All datasets in OXE are stored in the RLDS (Reinforcement Learning Datasets) format, which is built on top of TensorFlow Datasets. RLDS organizes data as sequences of episodes, where each episode contains a series of timesteps with observations, actions, and metadata.
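The episode-of-timesteps layout can be sketched with plain Python structures. This is an illustrative mock only: real OXE data is served through TensorFlow Datasets, and the exact field names (`observation`, `natural_language_instruction`, `is_last`, `episode_metadata`) vary across component datasets.

```python
# Illustrative sketch of the RLDS episode layout; field names are
# representative, not the authoritative schema.

def make_step(image, instruction, action, is_last=False):
    """One timestep: an observation dict, an action vector, and a flag."""
    return {
        "observation": {
            "image": image,                              # RGB workspace frame
            "natural_language_instruction": instruction,  # task description
        },
        "action": action,                                # end-effector command
        "is_last": is_last,                              # True on final step
    }

def make_episode(steps, metadata=None):
    """An episode is an ordered sequence of steps plus optional metadata."""
    return {"steps": steps, "episode_metadata": metadata or {}}

# Build a toy two-step episode.
episode = make_episode([
    make_step(image="frame_0", instruction="pick up the cup",
              action=[0.1, 0.0, 0.2, 0.0, 0.0, 0.0, 1.0]),
    make_step(image="frame_1", instruction="pick up the cup",
              action=[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0], is_last=True),
])
```

Consumers iterate over `episode["steps"]` in order, which is what makes the format a natural fit for sequence models trained on demonstrations.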
The standardized observation space includes RGB workspace camera images and natural language task descriptions. Wrist cameras, depth sensors, and point clouds are not included in the unified schema, though some component datasets originally contained them. Images are resized to a common resolution and served at approximately 3 Hz.
The action space is a 7-dimensional end-effector representation: x, y, z position; roll, pitch, yaw orientation; and gripper opening. An eighth dimension indicates episode termination. Actions are discretized into 256 uniform bins per dimension. Importantly, the action values are normalized per dataset but are not aligned to a shared coordinate frame across robots. This means the same numerical action vector may produce very different physical motions on different platforms.
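Uniform discretization into 256 bins per dimension can be sketched as follows. The `[-1, 1]` normalization range here is an assumption for illustration; in practice the ranges are computed per component dataset, which is exactly why the same bin index can mean different physical motions on different robots.

```python
import numpy as np

def discretize(action, low=-1.0, high=1.0, bins=256):
    """Map continuous values in [low, high] to integer bins 0..bins-1."""
    action = np.clip(np.asarray(action, dtype=np.float64), low, high)
    scaled = (action - low) / (high - low)              # -> [0, 1]
    return np.minimum((scaled * bins).astype(int), bins - 1)

def undiscretize(tokens, low=-1.0, high=1.0, bins=256):
    """Map bin indices back to bin-center continuous values."""
    return low + (np.asarray(tokens) + 0.5) / bins * (high - low)

# A 7-D end-effector action: x, y, z, roll, pitch, yaw, gripper.
a = [0.0, 0.5, -1.0, 1.0, 0.25, -0.5, 0.9]
tokens = discretize(a)
recovered = undiscretize(tokens)   # within half a bin width of `a`
```

The round-trip error is bounded by half the bin width, so 256 bins trade a small loss of precision for a compact, token-friendly action representation.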
Data is accessible through TensorFlow Datasets (TFDS) or direct download from Google Cloud Storage. A public Google Sheets spreadsheet lists all component datasets with their metadata, citations, and download instructions. The repository also provides Colab notebooks for data loading and model inference.
RT-1-X is a 35-million-parameter model based on the RT-1 architecture, retrained on the OXE data mixture. Its pipeline follows RT-1: camera images are encoded by an EfficientNet conditioned on the language instruction through FiLM layers, the resulting feature map is compressed into a small set of tokens by TokenLearner, and a decoder-only transformer consumes those tokens and outputs discretized action tokens.
RT-1-X is designed to be lightweight enough to run inference on a single robot in real time. Its relatively small parameter count means it can be deployed on-robot without requiring external GPU servers, but this limited capacity also constrains how much diverse data it can absorb before underfitting.
RT-2-X is based on RT-2, which builds on the PaLI-X vision-language model. The version evaluated in the paper uses a 55-billion parameter model that combines a Vision Transformer (ViT) visual encoder with a UL2 language model, both pretrained on the WebLI dataset.
The core idea behind RT-2 (and by extension RT-2-X) is to represent robot actions as natural language tokens. During training, action values are discretized and expressed as text strings (for example, "1 128 91 241 5 101 127"), which are interleaved with standard vision-language data. This co-fine-tuning approach lets the model retain its web-scale knowledge about objects, spatial relationships, and language while simultaneously learning to output motor commands.
RT-2-X was co-fine-tuned with an approximately one-to-one split between VLM tasks and robotics data, using the OXE data mixture as the robotics component. The model takes camera images and a natural language instruction as input and outputs the 7-dimensional action vector.
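The one-to-one mixture can be sketched as a sampler that draws each training batch from either stream with equal probability. The batch contents and function names below are stand-ins, not the project's actual training code:

```python
import random

def sample_batch(vlm_data, robot_data, rng, p_robot=0.5):
    """Pick a source stream (robotics vs. VLM), then a batch from it."""
    source = robot_data if rng.random() < p_robot else vlm_data
    return rng.choice(source)

rng = random.Random(0)
vlm = ["caption_batch_%d" % i for i in range(3)]
robot = ["trajectory_batch_%d" % i for i in range(3)]

draws = [sample_batch(vlm, robot, rng) for _ in range(1000)]
robot_frac = sum(d.startswith("trajectory") for d in draws) / len(draws)
# robot_frac is close to 0.5: roughly half of all batches are robotics data
```

Keeping the web-data stream in the mixture, rather than fine-tuning on robotics data alone, is what preserves the model's web-scale visual and semantic knowledge.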
The RT-X models were evaluated through 3,600 real-world trials across six different robot platforms at multiple institutions, including UC Berkeley (RAIL and AUTOLab), the University of Freiburg, NYU, Stanford, and USC.
RT-1-X was evaluated on five datasets where the original training sets were relatively small (typically hundreds to a few thousand episodes): Kitchen Manipulation, Cable Routing, NYU Door Opening, AUTOLab UR5, and Robot Play. Across these five settings, RT-1-X achieved an average success rate of approximately 63%, compared to 41% for the original methods that had been developed specifically for each robot and task. This represents a roughly 50% relative improvement.
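The relative-improvement figure follows directly from the two averages:

```python
# Worked check of the reported gain: moving from a 41% baseline success
# rate to 63% is a relative improvement of (63 - 41) / 41, i.e. ~54%,
# which the paper rounds to "roughly 50%".

baseline, rt1x = 0.41, 0.63
relative_gain = (rt1x - baseline) / baseline
```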
The gains were most pronounced when the original dataset was small. By supplementing a limited single-robot dataset with diverse multi-robot data from OXE, the model learned more robust visual and behavioral features than could be acquired from the small dataset alone.
On larger datasets, RT-1-X showed diminishing returns. For example, on the Bridge/WidowX evaluation, RT-1-X achieved 27% success compared to 13% for original methods, but RT-2-X with its much larger capacity achieved 50% success on the same tasks. On the Google Robot RT-1 evaluation (a large dataset where RT-1 was originally trained), RT-1 scored 92% and RT-2-X achieved 91%. These results suggest that the 35M-parameter RT-1-X model cannot fully absorb the diversity of the OXE data mixture; larger models are needed to benefit from the added data without underfitting.
The most striking finding of the paper concerns RT-2-X's emergent capabilities. When evaluated on skills that existed in other robots' datasets within OXE but were never present in the evaluation robot's own training data, RT-2-X achieved 75.8% success on a Google Robot, compared to 27.3% for the standard RT-2 model. This roughly 3x improvement demonstrates that the model was transferring behavioral knowledge learned from other robot embodiments.
When the Bridge dataset (collected on WidowX robots) was removed from the training mixture, performance on these emergent skills dropped from 75.8% to 42.8%, confirming that specific data from other embodiments was contributing to the transfer.
RT-2-X also demonstrated improved spatial reasoning compared to RT-2, better distinguishing between instructions like "move near the object" versus "move on top of the object." This suggests that training on diverse embodiments, each with different workspace configurations and object arrangements, helps the model develop more precise spatial understanding.
The paper includes ablation experiments for RT-2-X that highlight several factors:
| Factor | Condition | Success rate |
|---|---|---|
| Web pretraining | With web pretraining | 48.7% |
| Web pretraining | From scratch (no web pretraining) | 0% |
| Model size | 55B parameters | Full performance |
| Model size | 5B parameters | Substantially lower on emergent skills |
| Image history | With history | 44.4% |
| Image history | Without history | 14.5% |
| Training method | Co-fine-tuning | Similar to fine-tuning when data is diverse |
These results show that web pretraining is not optional; the model simply does not learn without it. Model capacity is also critical, as smaller models cannot effectively leverage the cross-embodiment data. Image history (seeing multiple recent frames rather than just the current one) provides a large performance boost.
The Open X-Embodiment paper lists over 150 authors from more than 20 institutions. The contributing organizations include:
| Institution | Country |
|---|---|
| Google DeepMind | United States / United Kingdom |
| Google Research | United States |
| UC Berkeley | United States |
| Stanford University | United States |
| Carnegie Mellon University | United States |
| New York University | United States |
| University of Southern California | United States |
| University of Texas at Austin | United States |
| University of Illinois Urbana-Champaign | United States |
| UC San Diego | United States |
| Columbia University | United States |
| Georgia Institute of Technology | United States |
| Arizona State University | United States |
| ETH Zurich | Switzerland |
| University of Freiburg | Germany |
| Technische Universität Darmstadt | Germany |
| German Aerospace Center (DLR) | Germany |
| Max Planck Institute | Germany |
| University of Technology Nuremberg | Germany |
| Imperial College London | United Kingdom |
| Istituto Italiano di Tecnologia | Italy |
| KAIST | South Korea |
| Shanghai Jiao Tong University | China |
| University of Tokyo | Japan |
| RIKEN | Japan |
| Queensland University of Technology | Australia |
| NVIDIA | United States |
| Intrinsic (Alphabet) | United States |
| Flexiv Robotics | China |
Open X-Embodiment has had a significant effect on the robotics research community since its release. It won the ICRA 2024 Best Conference Paper Award and has become the standard pretraining corpus for generalist robot policies.
Octo is an open-source generalist robot policy released in May 2024, trained on 800,000 trajectories from the OXE dataset. Developed primarily by researchers at UC Berkeley, Octo uses a modular transformer architecture with a diffusion action head. It comes in two sizes: 27M and 93M parameters.
Octo supports flexible input configurations (workspace cameras, wrist cameras, or both) and can be guided by either language instructions or goal images. In benchmarks, Octo outperformed RT-1-X and performed comparably to the 55B-parameter RT-2-X on several tasks despite being orders of magnitude smaller. Octo was designed for easy fine-tuning; it can be adapted to new robots, sensors, and action spaces using small target-domain datasets on consumer GPUs.
OpenVLA is a 7-billion parameter open-source vision-language-action model released in June 2024. It was trained on 970,000 robot manipulation trajectories from the OXE dataset. OpenVLA builds on a Llama 2 language model backbone combined with DINOv2 and SigLIP visual encoders.
Despite having 7x fewer parameters than RT-2-X (55B), OpenVLA outperformed RT-2-X by 16.5% in absolute task success rate across 29 tasks and multiple robot embodiments. It supports parameter-efficient fine-tuning through low-rank adaptation (LoRA) and can be served with quantization for efficient deployment. The OpenVLA team released model checkpoints, fine-tuning notebooks, and a full PyTorch training codebase.
OXE-AugE, published in December 2025, addresses one of OXE's key limitations: data imbalance across embodiments. The augmentation pipeline expands the dataset with 9 additional robot embodiments, producing over 4.4 million trajectories (more than triple the original OXE). Experiments showed that generalist policies like OpenVLA and pi0 improved success rates by 24-45% on previously unseen robot-gripper combinations when fine-tuned on OXE-AugE, compared to fine-tuning on the original OXE alone.
The OXE dataset and methodology directly influenced subsequent robot foundation model efforts, including the generalist policies Octo and OpenVLA described above.
The OXE project demonstrated an "embodiment scaling law": expanding the number of training embodiments yields more effective generalization than simply increasing the trajectory count for any single embodiment. This finding has informed how subsequent projects prioritize data collection across diverse platforms rather than maximizing data volume on a single robot.
The central technical contribution of Open X-Embodiment is the empirical demonstration that cross-embodiment transfer works. Before this project, it was unclear whether a model could learn useful behavioral patterns from a robot arm with 6 joints, a different gripper type, and different camera placement, and apply that knowledge to a completely different robot. The OXE experiments showed that high-capacity models (like the 55B RT-2-X) can extract abstract manipulation concepts, such as the idea of "picking up" or "opening," that transfer across embodiments even without explicit coordinate frame alignment.
This transfer is not automatic or trivial. The 35M-parameter RT-1-X showed positive transfer mainly in the small-data regime. On large datasets, its limited capacity became a bottleneck, and it could not simultaneously absorb the diversity of behaviors from 22 different robots. The 55B RT-2-X, by contrast, had enough capacity to benefit from the added data. This finding underscored the importance of model scale for cross-embodiment learning.
The ablation results for RT-2-X showed that web pretraining (training the underlying VLM on billions of image-text pairs from the internet before fine-tuning on robotics data) was critical. Without web pretraining, the model achieved 0% success. This confirms that the visual and semantic knowledge from web data provides a foundation that robot data alone cannot supply. The model needs to "understand" what a cup, a drawer, or a table is before it can learn to manipulate them.
Beyond the models, OXE's data standardization work has had lasting infrastructure impact. The RLDS format and the data conversion tools released with the project created a common data pipeline for the robotics community. Subsequent datasets like DROID adopted RLDS specifically for OXE compatibility. Tools like the LeRobot library from Hugging Face provide converters between OXE's RLDS format and other training frameworks. This data infrastructure lowers the barrier for any lab to contribute its data to the shared pool and benefit from models trained on the combined collection.
The OXE dataset and models have several known limitations. The data mixture is heavily imbalanced: the top four robot types account for over 85% of all real demonstration data. Actions are normalized per dataset but not aligned to a shared coordinate frame, so the same action vector can mean very different physical motions on different robots. The unified observation schema includes only RGB workspace images and language instructions, discarding wrist cameras, depth sensors, and point clouds present in some component datasets. Finally, the benefits of cross-embodiment training depend on model capacity: smaller models such as the 35M-parameter RT-1-X underfit the full data mixture.
Since the initial release in October 2023, the OXE ecosystem has continued to grow. The dataset has been adopted as the default pretraining resource for several major robot foundation model projects. The RLDS format has become a de facto standard for sharing robot demonstration data.
Key developments through 2025 and into 2026 include generalist policies such as Octo and OpenVLA trained on the dataset, and augmentation efforts such as OXE-AugE that expand the number of embodiments and trajectories.
The OXE project established the principle that robot learning benefits from data diversity across embodiments, much as language models benefit from text diversity across domains. As more labs contribute their data and as data augmentation techniques improve, the dataset is likely to continue growing. The long-term vision is a shared data resource for embodied AI comparable to what ImageNet was for computer vision or what Common Crawl is for language modeling.