# Open X-Embodiment

> Source: https://aiwiki.ai/wiki/open_x_embodiment
> Updated: 2026-06-23
> Categories: Data & Datasets, Google DeepMind, Robotics
> License: CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/)
> From AI Wiki (https://aiwiki.ai), the free encyclopedia of artificial intelligence. Reuse freely with attribution to "AI Wiki (aiwiki.ai)".

**Open X-Embodiment** (OXE) is a large-scale collaborative robotics research initiative, led by [Google DeepMind](/wiki/google_deepmind) and announced in October 2023, that produced the largest open-source real robot dataset and a family of cross-embodiment [robot learning](/wiki/robot_learning) models called RT-X.[1][3] The project pooled robotic demonstration data from 22 different robot platforms, gathered through a collaboration between 21 institutions and 33 research labs, into a single standardized dataset of more than one million real-world trajectories spanning 527 skills and 160,266 tasks.[1][3] Two models trained on this data, RT-1-X and RT-2-X, demonstrated that a single policy can transfer learned skills across physically different robots: RT-1-X delivered a 50% average success-rate improvement over robot-specific baselines, and RT-2-X tripled performance on emergent cross-embodiment skills.[3] The accompanying paper, "Open X-Embodiment: Robotic Learning Datasets and RT-X Models," won the Best Conference Paper Award at ICRA 2024 in Yokohama, Japan.[2]

The central claim of the project is that robotics can undergo the same consolidation around shared, pretrained models that reshaped language and vision. As the paper states, "We assemble a dataset from 22 different robots collected through a collaboration between 21 institutions, demonstrating 527 skills (160266 tasks). We show that a high-capacity model trained on this data, which we call RT-X, exhibits positive transfer and improves the capabilities of multiple robots by leveraging experience from other platforms."[1] By standardizing 60 pre-existing datasets into a common format, OXE enabled the first systematic study of cross-embodiment [transfer learning](/wiki/transfer_learning) at scale.[1][8] The dataset, model checkpoints, and code are publicly available under open-source licenses.[8][9]

## What problem does Open X-Embodiment solve?

Before Open X-Embodiment, the robotics research community faced a fundamental data problem. Each laboratory typically collected its own demonstrations on its own robot hardware, trained policies for that specific platform, and published results that were difficult to compare or reproduce elsewhere. Unlike [computer vision](/wiki/computer_vision) and [natural language processing](/wiki/natural_language_processing), where massive shared datasets (ImageNet, Common Crawl) enabled [foundation models](/wiki/foundation_model) to learn general representations, robotics lacked a comparable shared data resource.[1]

The paper frames the goal directly: "Conventionally, robotic learning methods train a separate model for every application, every robot, and even every environment. Can we instead train generalist X-robot policy that can be adapted efficiently to new robots, tasks, and environments?"[1] This fragmentation had several consequences. Individual lab datasets were too small to train high-capacity models. Skills learned on one robot could not be transferred to another, so every new robot platform essentially required starting from scratch. The question at the heart of the OXE project was whether diverse robot data, pooled together despite differences in hardware, sensors, and task domains, could produce policies that generalize across embodiments.

The project built on two prior Google models: [RT-1](/wiki/rt_1) (Robotics Transformer 1), a [transformer](/wiki/transformer)-based architecture for real-world robotic control published in December 2022, and [RT-2](/wiki/rt_2), a [vision-language-action model](/wiki/vision-language-action_model) from July 2023 that represented robot actions as text tokens within a large [vision-language model](/wiki/vision_language_model).[4][5] RT-1 was trained on 130,000 episodes from a fleet of 13 Google robots performing over 700 tasks.[4] RT-2 showed that web-scale pretraining on images and text could transfer useful knowledge to robot control.[5] Open X-Embodiment extended both models to learn from many different robot types simultaneously.

## The Open X-Embodiment dataset

### How large is the OXE dataset?

The OXE dataset is the largest open-source collection of real robot demonstrations.[1][3] Its key statistics are:

| Statistic | Value |
|---|---|
| Total trajectories | 1,000,000+ |
| Robot embodiments | 22 |
| Distinct skills | 527 |
| Total tasks | 160,266 |
| Component datasets | 60 |
| Contributing institutions | 21 |
| Contributing research labs | 33 |

The figures above come from the published paper (21 institutions, 527 skills, 160,266 tasks across 22 robots), while Google DeepMind's announcement described the same collaboration as spanning "33 academic labs" with "more than 500 skills and 150,000 tasks across more than 1 million episodes."[1][3] The 22 robot embodiments range from single-arm manipulators to bimanual systems and legged platforms. The dataset covers a wide spectrum of manipulation behaviors, from common tasks like pick-and-place and drawer opening to specialized skills like cable routing and wiping.

### Which robot platforms are included?

The dataset includes demonstrations from diverse robotic hardware:

| Category | Robot platforms |
|---|---|
| Single-arm manipulators | [Franka Emika Panda](/wiki/franka_emika), [WidowX](/wiki/widowx), [UR5](/wiki/universal_robots), [xArm](/wiki/ufactory_xarm), Sawyer, Kinova Jaco, Kinova Gen3 |
| Google platforms | Google Robot (Everyday Robots) |
| Industrial arms | [KUKA](/wiki/kuka) |
| Bimanual systems | [ALOHA](/wiki/aloha_robot) |
| Mobile manipulators | [Hello Robot Stretch](/wiki/hello_robot_stretch) |
| Legged platforms | Unitree A1, [Boston Dynamics Spot](/wiki/boston_dynamics) |
| Tabletop systems | Language Table |

The most heavily represented sources by trajectory count are the Google Robot (Fractal/RT-1 dataset), WidowX (Bridge V2 dataset), and Language Table. The dataset is notably imbalanced: the top four robot types account for over 85% of all real demonstration data.[1]

### Component datasets

The 60 component datasets were collected independently by different research groups over multiple years.[1] Some of the larger and more widely known datasets in the collection include:

| Dataset | Robot | Source institution | Approximate episodes |
|---|---|---|---|
| Fractal (RT-1) | Google Robot | Google DeepMind | ~130,000 |
| Bridge V2 | WidowX | UC Berkeley | ~60,000 |
| Language Table | xArm (tabletop) | Google Research | ~440,000 |
| QT-Opt | Kuka | Google Research | ~60,000 |
| TACO Play | Various | TU Darmstadt | ~3,600 |
| Jaco Play | Kinova Jaco | University of Freiburg | ~1,000 |
| Berkeley Cable Routing | Franka | UC Berkeley | ~1,500 |
| RoboTurk | Sawyer | Stanford | ~2,100 |
| NYU Door Opening | Franka | NYU | ~500 |
| Berkeley Autolab UR5 | UR5 | UC Berkeley | ~1,000 |
| TOTO | Franka | CMU | ~1,000 |

The remaining datasets contribute anywhere from a few hundred to several thousand episodes each, covering tasks on platforms including Franka, WidowX, xArm, and others.

### How is the data formatted and standardized?

All datasets in OXE are stored in the RLDS (Reinforcement Learning Datasets) format, which is built on [TensorFlow](/wiki/tensorflow) Datasets and Apache Arrow.[1][8] RLDS organizes data as sequences of episodes, where each episode contains a series of timesteps with observations, actions, and metadata.
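
The episode-and-timestep nesting described above can be pictured with a small sketch. The dataclasses below are a hypothetical stand-in, not the RLDS schema itself: the field names (`observation`, `is_first`, `is_last`, `dataset_name`) and the helper `make_toy_episode` are assumptions chosen for illustration, and real RLDS data is read through TensorFlow Datasets rather than plain Python objects.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Step:
    """One timestep: an observation, the action taken, and boundary flags."""
    observation: Dict[str, object]  # e.g. an image and a language instruction
    action: List[float]             # continuous action values for this step
    is_first: bool = False
    is_last: bool = False


@dataclass
class Episode:
    """One trajectory: an ordered list of steps plus episode-level metadata."""
    steps: List[Step] = field(default_factory=list)
    metadata: Dict[str, str] = field(default_factory=dict)


def make_toy_episode(num_steps: int, instruction: str) -> Episode:
    """Build a placeholder episode with dummy frames and all-zero actions."""
    steps = [
        Step(
            observation={"image": f"frame_{t}", "language_instruction": instruction},
            action=[0.0] * 7,
            is_first=(t == 0),
            is_last=(t == num_steps - 1),
        )
        for t in range(num_steps)
    ]
    return Episode(steps=steps, metadata={"dataset_name": "toy_dataset"})


episode = make_toy_episode(num_steps=4, instruction="open the drawer")
print(len(episode.steps), episode.steps[0].observation["language_instruction"])
```

Keeping the instruction inside each step's observation is only one possible layout; the point of the sketch is the two-level structure of episodes containing steps.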

The standardized observation space includes RGB workspace camera images and natural language task descriptions. Wrist cameras, depth sensors, and point clouds are not included in the unified schema, though some component datasets originally contained them. Images are resized to a common resolution and served at approximately 3 Hz.

The action space is a 7-dimensional end-effector representation: x, y, z position; roll, pitch, yaw orientation; and gripper opening. An eighth dimension indicates episode termination. Actions are discretized into 256 uniform bins per dimension. Importantly, the action values are normalized per dataset but are not aligned to a shared coordinate frame across robots.[1] This means the same numerical action vector may produce very different physical motions on different platforms.
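
To make the consequence of per-dataset normalization concrete, here is a minimal sketch. It assumes min/max scaling to [-1, 1] followed by uniform binning; the source only states that actions are normalized per dataset and split into 256 uniform bins, so the exact scaling rule, the function names, and the two example ranges are all hypothetical.

```python
from typing import List, Sequence

NUM_BINS = 256  # bins per action dimension, as stated in the text


def normalize(action: Sequence[float], low: Sequence[float], high: Sequence[float]) -> List[float]:
    """Scale each dimension from a dataset's own [low, high] range to [-1, 1]."""
    return [2.0 * (a - lo) / (hi - lo) - 1.0 for a, lo, hi in zip(action, low, high)]


def denormalize(values: Sequence[float], low: Sequence[float], high: Sequence[float]) -> List[float]:
    """Invert normalize() using one specific dataset's range."""
    return [lo + (v + 1.0) / 2.0 * (hi - lo) for v, lo, hi in zip(values, low, high)]


def discretize(values: Sequence[float]) -> List[int]:
    """Map values in [-1, 1] to integer bin indices in [0, NUM_BINS - 1]."""
    bins = []
    for v in values:
        v = min(1.0, max(-1.0, v))             # clip anything out of range
        idx = int((v + 1.0) / 2.0 * NUM_BINS)  # equal-width bins
        bins.append(min(idx, NUM_BINS - 1))    # v == 1.0 belongs to the last bin
    return bins


# Hypothetical x-displacement ranges (metres) for two different datasets.
ROBOT_A_RANGE = ([-0.05], [0.05])
ROBOT_B_RANGE = ([-0.20], [0.20])

shared_value = [0.5]  # the same normalized number fed to both robots
move_a = denormalize(shared_value, *ROBOT_A_RANGE)
move_b = denormalize(shared_value, *ROBOT_B_RANGE)
print(discretize(shared_value), move_a, move_b)
```

Under these assumed ranges the same normalized value corresponds to a motion four times larger on the second robot, which is the kind of mismatch a cross-embodiment policy has to resolve from its observations.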

Data is accessible through TensorFlow Datasets (TFDS) or direct download from Google Cloud Storage.[8] A public Google Sheets spreadsheet lists all component datasets with their metadata, citations, and download instructions. The repository also provides Colab notebooks for data loading and model inference.[8]

## What are the RT-X models?

### RT-1-X

RT-1-X is a 35-million parameter model based on the [RT-1](/wiki/rt_1) architecture, retrained on the OXE data mixture.[1][4] Its architecture is a decoder-only [transformer](/wiki/transformer) with the following pipeline:

1. An ImageNet-pretrained [EfficientNet](/wiki/efficientnet) processes camera images, conditioned on a Universal Sentence Encoder (USE) language embedding through FiLM (Feature-wise Linear Modulation) layers
2. A TokenLearner module compresses the resulting visual-language features into 81 tokens
3. A Transformer backbone attends over these tokens using a history of 15 images (approximately 5 seconds at 3 Hz)
4. The model outputs 8 discretized action tokens (7 end-effector dimensions plus termination), with each dimension quantized into 256 bins
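
The image history in step 3 amounts to a fixed-length sliding window over recent frames. The sketch below shows one way such a window could be maintained; the `FrameHistory` class and the choice to pad short histories by repeating the oldest frame are assumptions for illustration, not details taken from the RT-1-X implementation.

```python
from collections import deque

HISTORY_LEN = 15   # frames of history, from the pipeline above
CONTROL_HZ = 3.0   # approximate frame rate, from the text


class FrameHistory:
    """Keeps the most recent frames and always returns a full-length window."""

    def __init__(self, maxlen: int = HISTORY_LEN):
        self._frames = deque(maxlen=maxlen)  # oldest frames fall off automatically

    def add(self, frame):
        self._frames.append(frame)

    def window(self):
        """Return maxlen frames, left-padded with the oldest frame if needed."""
        frames = list(self._frames)
        if not frames:
            raise ValueError("no frames have been added yet")
        padding = [frames[0]] * (self._frames.maxlen - len(frames))
        return padding + frames


history = FrameHistory()
for t in range(20):          # integers stand in for camera images
    history.add(t)

window_seconds = HISTORY_LEN / CONTROL_HZ
print(history.window(), window_seconds)
```

With 15 frames at roughly 3 Hz, the window covers about 5 seconds of experience, matching the figure quoted in the list.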

RT-1-X is designed to be lightweight enough to run inference on a single robot in real time. Its relatively small parameter count means it can be deployed on-robot without requiring external GPU servers, but this limited capacity also constrains how much diverse data it can absorb before underfitting.

### RT-2-X

RT-2-X is based on [RT-2](/wiki/rt_2), which builds on the PaLI-X [vision-language model](/wiki/vision_language_model).[5] The version evaluated in the paper uses a 55-billion parameter model that combines a [Vision Transformer](/wiki/vision_transformer) (ViT) visual encoder with a UL2 language model, both pretrained on the WebLI dataset.[1][5]

The core idea behind RT-2 (and by extension RT-2-X) is to represent robot actions as natural language tokens. During training, action values are discretized and expressed as text strings (for example, "1 128 91 241 5 101 127"), which are interleaved with standard vision-language data.[5] This co-fine-tuning approach lets the model retain its web-scale knowledge about objects, spatial relationships, and language while simultaneously learning to output motor commands.
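
The text-string encoding can be illustrated with a short round trip between bin indices and a space-separated string. This is a simplified sketch: the function names are hypothetical, and it treats the action as literal text, whereas how RT-2 maps these integers onto its tokenizer vocabulary is not covered here.

```python
from typing import List, Sequence

NUM_BINS = 256  # each action dimension is an integer bin index in [0, 255]


def bins_to_text(bins: Sequence[int]) -> str:
    """Render discretized action bins as a space-separated string."""
    for b in bins:
        if not 0 <= b < NUM_BINS:
            raise ValueError(f"bin index out of range: {b}")
    return " ".join(str(b) for b in bins)


def text_to_bins(text: str) -> List[int]:
    """Parse a model-generated action string back into bin indices."""
    bins = [int(token) for token in text.split()]
    for b in bins:
        if not 0 <= b < NUM_BINS:
            raise ValueError(f"bin index out of range: {b}")
    return bins


action_text = bins_to_text([1, 128, 91, 241, 5, 101, 127])
print(action_text)
print(text_to_bins(action_text))
```

Validating the range on both sides is a defensive choice for the sketch, since a language model can in principle emit any string.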

RT-2-X was co-fine-tuned with an approximately one-to-one split between VLM tasks and robotics data, using the OXE data mixture as the robotics component.[1] The model takes camera images and a natural language instruction as input and outputs the 7-dimensional action vector.
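
An approximately one-to-one mixture can be sketched as a sampler that draws each training example from one of two pools with equal probability. The generator below is an illustrative assumption about how such mixing could be done, not the training pipeline used for RT-2-X; `mixed_stream`, `robot_fraction`, and the placeholder example lists are hypothetical.

```python
import random
from itertools import islice
from typing import Iterator, Sequence, Tuple


def mixed_stream(
    vlm_examples: Sequence[str],
    robot_examples: Sequence[str],
    robot_fraction: float = 0.5,
    seed: int = 0,
) -> Iterator[Tuple[str, str]]:
    """Yield (source, example) pairs, picking the robot pool with the given probability."""
    rng = random.Random(seed)  # seeded so the stream is reproducible
    while True:
        if rng.random() < robot_fraction:
            yield "robot", rng.choice(robot_examples)
        else:
            yield "vlm", rng.choice(vlm_examples)


vlm_pool = ["caption_0", "vqa_0", "caption_1"]    # stand-ins for web vision-language data
robot_pool = ["oxe_episode_0", "oxe_episode_1"]   # stand-ins for OXE trajectories

sample = list(islice(mixed_stream(vlm_pool, robot_pool), 10_000))
robot_share = sum(1 for source, _ in sample if source == "robot") / len(sample)
print(f"robot share of sampled examples: {robot_share:.3f}")
```

Sampling per example rather than alternating strictly keeps the ratio approximate, which matches the "approximately one-to-one" wording above.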

## How well do the RT-X models perform?

The RT-X models were evaluated through 3,600 real-world trials across six different robot platforms at multiple institutions, including UC Berkeley (RAIL and AUTOLab), the University of Freiburg, NYU, Stanford, and USC.[1]

### RT-1-X: small-data domain performance

RT-1-X was evaluated on five datasets where the original training sets were relatively small (typically hundreds to a few thousand episodes): Kitchen Manipulation, Cable Routing, NYU Door Opening, AUTOLab UR5, and Robot Play. Across these five settings, RT-1-X achieved an average success rate of approximately 63%, compared to 41% for the original methods that had been developed specifically for each robot and task.[1] This represents a roughly 50% relative improvement; Google DeepMind summarized the result as a "50% success rate improvement on average across five different commonly used robots."[3]
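
Because "50% improvement" can be read as either an absolute or a relative change, a quick calculation with the two averages quoted above helps separate them. The helper name below is hypothetical, and the inputs are the rounded figures from this section rather than per-task results.

```python
def relative_improvement(new: float, baseline: float) -> float:
    """Fractional gain of `new` over `baseline`."""
    return (new - baseline) / baseline


RT1X_AVG = 0.63      # approximate RT-1-X average success rate
ORIGINAL_AVG = 0.41  # approximate average of the original per-robot methods

absolute_gain = RT1X_AVG - ORIGINAL_AVG                       # in success-rate units
relative_gain = relative_improvement(RT1X_AVG, ORIGINAL_AVG)  # as a fraction of the baseline

print(f"absolute gain: {absolute_gain * 100:.0f} percentage points")
print(f"relative gain: {relative_gain * 100:.1f}%")
```

The gain is about 22 percentage points in absolute terms and roughly 54% relative to the baseline, so the "50%" headline refers to the relative figure.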

The gains were most pronounced when the original dataset was small. By supplementing a limited single-robot dataset with diverse multi-robot data from OXE, the model learned more robust visual and behavioral features than could be acquired from the small dataset alone.

### RT-1-X: large-data domain performance

On larger datasets, RT-1-X showed diminishing returns. For example, on the Bridge/WidowX evaluation, RT-1-X achieved 27% success compared to 13% for original methods, but RT-2-X with its much larger capacity achieved 50% success on the same tasks.[1] On the Google Robot RT-1 evaluation (a large dataset where RT-1 was originally trained), RT-1 scored 92% and RT-2-X achieved 91%.[1] These results suggest that the 35M-parameter RT-1-X model cannot fully absorb the diversity of the OXE data mixture; larger models are needed to benefit from the added data without underfitting.

### RT-2-X: emergent cross-embodiment skills

The most striking finding of the paper concerns RT-2-X's emergent capabilities. When evaluated on skills that existed in other robots' datasets within OXE but were never present in the evaluation robot's own training data, RT-2-X achieved 75.8% success on a Google Robot, compared to 27.3% for the standard RT-2 model.[1] This roughly 3x improvement demonstrates that the model was transferring behavioral knowledge learned from other robot embodiments; as the DeepMind team put it, "RT-2-X was three times as successful as our previous best model, RT-2, for emergent skills."[3]

When the Bridge dataset (collected on WidowX robots) was removed from the training mixture, performance on these emergent skills dropped from 75.8% to 42.8%, confirming that specific data from other embodiments was contributing to the transfer.[1]
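
The emergent-skill figures above can be put side by side with a few lines of arithmetic. This sketch only restates the three reported percentages; the variable names are hypothetical, and it does not attempt to attribute the gain to any single cause.

```python
RT2X_FULL = 75.8       # RT-2-X trained on the full OXE mixture (%)
RT2_BASELINE = 27.3    # standard RT-2 (%)
RT2X_NO_BRIDGE = 42.8  # RT-2-X with the Bridge dataset removed (%)

improvement_factor = RT2X_FULL / RT2_BASELINE  # how many times better than RT-2
drop_points = RT2X_FULL - RT2X_NO_BRIDGE       # percentage points lost without Bridge
drop_relative = drop_points / RT2X_FULL        # share of full performance lost

print(f"RT-2-X vs RT-2: {improvement_factor:.2f}x")
print(f"Removing Bridge: -{drop_points:.1f} points ({drop_relative * 100:.1f}% of full performance)")
```

The ratio works out to about 2.8, which is what "roughly 3x" rounds from, and removing Bridge costs 33 percentage points.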

RT-2-X also demonstrated improved spatial reasoning compared to RT-2, better distinguishing between instructions like "move near the object" versus "move on top of the object."[1] This suggests that training on diverse embodiments, each with different workspace configurations and object arrangements, helps the model develop more precise spatial understanding.

### Ablation studies

The paper includes ablation experiments for RT-2-X that highlight several factors:[1]

| Factor | Condition | Success rate |
|---|---|---|
| Web pretraining | With web pretraining | 48.7% |
| Web pretraining | From scratch (no web pretraining) | 0% |
| Model size | 55B parameters | Full performance |
| Model size | 5B parameters | Substantially lower on emergent skills |
| Image history | With history | 44.4% |
| Image history | Without history | 14.5% |
| Training method | Co-fine-tuning | Similar to fine-tuning when data is diverse |

These results show that web pretraining is not optional: trained from scratch, the model reached 0% success in this ablation. Model capacity is also critical, as smaller models cannot effectively leverage the cross-embodiment data. Image history (seeing multiple recent frames rather than just the current one) roughly triples the success rate, from 14.5% to 44.4%.

## Who built Open X-Embodiment?

The Open X-Embodiment paper lists over 150 authors from more than 20 institutions, making it one of the largest collaborations in robotics research.[1] The contributing organizations include:

| Institution | Country |
|---|---|
| [Google DeepMind](/wiki/google_deepmind) | United States / United Kingdom |
| Google Research | United States |
| UC Berkeley | United States |
| [Stanford University](/wiki/stanford_university) | United States |
| Carnegie Mellon University | United States |
| New York University | United States |
| University of Southern California | United States |
| University of Texas at Austin | United States |
| University of Illinois Urbana-Champaign | United States |
| UC San Diego | United States |
| [Columbia University](/wiki/columbia_university) | United States |
| Georgia Institute of Technology | United States |
| Arizona State University | United States |
| [ETH Zurich](/wiki/eth_zurich) | Switzerland |
| University of Freiburg | Germany |
| Technische Universität Darmstadt | Germany |
| German Aerospace Center (DLR) | Germany |
| Max Planck Institute | Germany |
| University of Technology Nuremberg | Germany |
| [Imperial College London](/wiki/imperial_college_london) | United Kingdom |
| Istituto Italiano di Tecnologia | Italy |
| KAIST | South Korea |
| Shanghai Jiao Tong University | China |
| University of Tokyo | Japan |
| RIKEN | Japan |
| Queensland University of Technology | Australia |
| [NVIDIA](/wiki/nvidia) | United States |
| Intrinsic (Alphabet) | United States |
| Flexiv Robotics | China |

Reflecting on the collaboration, Google DeepMind framed the effort as community infrastructure: "The future of robotics relies on enabling robots to learn from each other, and most importantly, allowing researchers to learn from one another."[3]

## Impact and downstream models

Open X-Embodiment has had a significant effect on the robotics research community since its release. It won the ICRA 2024 Best Conference Paper Award and has become the standard pretraining corpus for [generalist robot policies](/wiki/robot_foundation_model).[2]

### Octo

[Octo](/wiki/octo_model) is an open-source generalist robot policy released in May 2024, trained on 800,000 trajectories from the OXE dataset.[6] Developed primarily by researchers at UC Berkeley, Octo uses a modular transformer architecture with a diffusion action head. It comes in two sizes: 27M and 93M parameters.[6]

Octo supports flexible input configurations (workspace cameras, wrist cameras, or both) and can be guided by either language instructions or goal images. In benchmarks, Octo outperformed RT-1-X and performed comparably to the 55B-parameter RT-2-X on several tasks despite being orders of magnitude smaller.[6] Octo was designed for easy fine-tuning; it can be adapted to new robots, sensors, and action spaces using small target-domain datasets on consumer GPUs.

### OpenVLA

[OpenVLA](/wiki/openvla) is a 7-billion parameter open-source [vision-language-action model](/wiki/vision-language-action_model) released in June 2024. It was trained on 970,000 robot manipulation trajectories from the OXE dataset.[7] OpenVLA builds on a Llama 2 language model backbone combined with DINOv2 and SigLIP visual encoders.[7]

Despite having 7x fewer parameters than RT-2-X (55B), OpenVLA outperformed RT-2-X by 16.5% in absolute task success rate across 29 tasks and multiple robot embodiments.[7] It supports parameter-efficient fine-tuning through low-rank adaptation (LoRA) and can be served with quantization for efficient deployment. The OpenVLA team released model checkpoints, fine-tuning notebooks, and a full PyTorch training codebase.[7]

### OXE-AugE

OXE-AugE, published in December 2025, addresses one of OXE's key limitations: data imbalance across embodiments.[10] The augmentation pipeline expands the dataset with 9 additional robot embodiments, producing over 4.4 million trajectories (more than triple the original OXE).[10] Experiments showed that generalist policies like OpenVLA and pi0 improved success rates by 24-45% on previously unseen robot-gripper combinations when fine-tuned on OXE-AugE, compared to fine-tuning on the original OXE alone.[10]

### Broader influence on robot foundation models

The OXE dataset and methodology directly influenced several subsequent robot [foundation model](/wiki/foundation_model) efforts:

- **[Gemini Robotics](/wiki/gemini_robotics)**: Google DeepMind's Gemini Robotics and Gemini Robotics-ER (2025-2026), built on the Gemini 2.0 [VLM](/wiki/vision_language_model), incorporate cross-embodiment training principles pioneered by OXE. These models can transfer learned motions between the ALOHA 2 bimanual platform, Franka-based systems, and Apptronik's Apollo humanoid robot.
- **[pi0](/wiki/pi0)** (Physical Intelligence): This cross-embodiment VLA model uses diverse robot data (including OXE-format datasets) as its pretraining base, then fine-tunes for specific tasks. It demonstrated that a single model can serve as a base for complex, long-horizon manipulation across different hardware.
- **[NVIDIA GR00T](/wiki/gr00t)**: NVIDIA's foundation model for humanoid robots draws on the open data ecosystem that OXE helped establish, using RLDS-compatible data pipelines for multi-robot pretraining.
- **DROID**: This dataset of 76,000 trajectories across diverse scenes (published at RSS 2024) was released in RLDS format specifically to be compatible with OXE, enabling joint training with OXE data.

The OXE project demonstrated an "embodiment scaling law": expanding the number of training embodiments yields more effective generalization than simply increasing the trajectory count for any single embodiment. This finding has informed how subsequent projects prioritize data collection across diverse platforms rather than maximizing data volume on a single robot.

## Technical significance

### Does cross-embodiment transfer actually work?

The central technical contribution of Open X-Embodiment is the empirical demonstration that cross-embodiment transfer works.[1] Before this project, it was unclear whether a model could learn useful behavioral patterns from a robot arm with 6 joints, a different gripper type, and different camera placement, and apply that knowledge to a completely different robot. The OXE experiments showed that high-capacity models (like the 55B RT-2-X) can extract abstract manipulation concepts, such as the idea of "picking up" or "opening," that transfer across embodiments even without explicit coordinate frame alignment.[1] Google DeepMind summarized the headline result this way: "Training a single model on data from multiple embodiments leads to significantly better performance across many robots than those trained on data from individual embodiments."[3]

This transfer is not automatic or trivial. The 35M-parameter RT-1-X showed positive transfer mainly in the small-data regime. On large datasets, its limited capacity became a bottleneck, and it could not simultaneously absorb the diversity of behaviors from 22 different robots. The 55B RT-2-X, by contrast, had enough capacity to benefit from the added data. This finding underscored the importance of model scale for cross-embodiment learning.

### The role of web pretraining

The ablation results for RT-2-X showed that web pretraining (training the underlying VLM on billions of image-text pairs from the internet before fine-tuning on robotics data) was critical.[1] Without web pretraining, the model achieved 0% success.[1] This confirms that the visual and semantic knowledge from web data provides a foundation that robot data alone cannot supply. The model needs to "understand" what a cup, a drawer, or a table is before it can learn to manipulate them.

### Data standardization as infrastructure

Beyond the models, OXE's data standardization work has had lasting infrastructure impact. The RLDS format and the data conversion tools released with the project created a common data pipeline for the robotics community.[8] Subsequent datasets like DROID adopted RLDS specifically for OXE compatibility. Tools like the LeRobot library from Hugging Face provide converters between OXE's RLDS format and other training frameworks. This data infrastructure lowers the barrier for any lab to contribute its data to the shared pool and benefit from models trained on the combined collection.

## What are the limitations of Open X-Embodiment?

The OXE dataset and models have several known limitations:

- **Data imbalance**: The top four robot types represent over 85% of all data.[1] Robots with fewer episodes benefit less from cross-embodiment training, and the overall data distribution is heavily skewed toward Google's own platforms.
- **Action space simplification**: By projecting all robots' actions into a common 7-DoF end-effector space, the dataset loses platform-specific degrees of freedom. Robots with mobile bases, multiple arms, or dexterous hands have their action spaces compressed or truncated.
- **No coordinate frame alignment**: The action normalization is per-dataset, meaning the model must implicitly learn to map between different robots' coordinate frames without explicit calibration.[1]
- **Single camera view**: The standardized schema uses only one RGB camera per dataset. Wrist cameras, depth sensors, and multi-view setups are dropped, which limits the model's ability to handle fine-grained manipulation tasks that depend on close-up views.
- **Limited task diversity in evaluation**: While the dataset contains 527 skills, the evaluation focused primarily on tabletop manipulation. Tasks involving locomotion, whole-body control, or dexterous grasping with multi-fingered hands are underrepresented.
- **Compute requirements**: RT-2-X at 55B parameters requires substantial compute for both training and inference. Real-time deployment on a physical robot typically requires cloud inference or expensive on-device accelerators.

## Current state and future directions

Since the initial release in October 2023, the OXE ecosystem has continued to grow. The dataset has been adopted as the default pretraining resource for several major [robot foundation model](/wiki/robot_foundation_model) projects. The RLDS format has become a de facto standard for sharing robot demonstration data.

Key developments through 2025 and into 2026 include:

- **Dataset expansion**: Projects like OXE-AugE have more than tripled the data volume and added new embodiments to address the original imbalance problem.[10]
- **Embodiment scaling laws**: Researchers have used OXE's diversity to quantify how the number of training embodiments affects generalization, finding that embodiment diversity matters more than raw trajectory count.
- **World model pretraining**: New approaches using optic-flow action representations on OXE data enable embodiment-agnostic world models that can improve downstream policy performance by over 50% with minimal target-robot data.
- **Integration with simulation**: The community has begun augmenting OXE's real-world data with simulated trajectories on matching robot platforms, further expanding the effective dataset size.
- **Hugging Face ecosystem**: The LeRobot project provides OXE data in formats compatible with PyTorch-based training pipelines, broadening access beyond the TensorFlow ecosystem.

The OXE project established the principle that robot learning benefits from data diversity across embodiments, much as language models benefit from text diversity across domains. As more labs contribute their data and as data augmentation techniques improve, the dataset is likely to continue growing. The long-term vision is a shared data resource for [embodied AI](/wiki/embodied_ai) comparable to what ImageNet was for computer vision or what Common Crawl is for language modeling.

## See also

- [RT-1](/wiki/rt_1)
- [RT-2](/wiki/rt_2)
- [Robot learning](/wiki/robot_learning)
- [Robot foundation model](/wiki/robot_foundation_model)
- [Vision-language-action model](/wiki/vision-language-action_model)
- [Transfer learning](/wiki/transfer_learning)
- [Google DeepMind](/wiki/google_deepmind)
- [Embodied AI](/wiki/embodied_ai)

## References

1. Open X-Embodiment Collaboration. "Open X-Embodiment: Robotic Learning Datasets and RT-X Models." arXiv preprint arXiv:2310.08864, October 2023. https://arxiv.org/abs/2310.08864

2. Open X-Embodiment Collaboration. "Open X-Embodiment: Robotic Learning Datasets and RT-X Models." In Proceedings of the 2024 IEEE International Conference on Robotics and Automation (ICRA), pp. 6892-6903, Yokohama, Japan, May 2024. https://ieeexplore.ieee.org/document/10611477/

3. Google DeepMind. "Scaling up learning across many different robot types." Google DeepMind Blog, October 2023. https://deepmind.google/blog/scaling-up-learning-across-many-different-robot-types/

4. Brohan, A., Brown, N., Carbajal, J., et al. "RT-1: Robotics Transformer for Real-World Control at Scale." arXiv preprint arXiv:2212.06817, December 2022. https://arxiv.org/abs/2212.06817

5. Zitkovich, B., Yu, T., Xu, S., et al. "RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control." In Proceedings of the 7th Conference on Robot Learning (CoRL), 2023. https://arxiv.org/abs/2307.15818

6. Ghosh, D., Walke, H., et al. "Octo: An Open-Source Generalist Robot Policy." In Proceedings of Robotics: Science and Systems (RSS), 2024. https://arxiv.org/abs/2405.12213

7. Kim, M.J., Pertsch, K., et al. "OpenVLA: An Open-Source Vision-Language-Action Model." In Proceedings of the International Conference on Machine Learning (ICML), 2025. https://arxiv.org/abs/2406.09246

8. Open X-Embodiment GitHub Repository. https://github.com/google-deepmind/open_x_embodiment

9. Open X-Embodiment Project Page. https://robotics-transformer-x.github.io/

10. OXE-AugE: A Large-Scale Robot Augmentation of OXE for Scaling Cross-Embodiment Policy Learning. arXiv preprint arXiv:2512.13100, December 2025. https://arxiv.org/abs/2512.13100