# Robotics Models

> Source: https://aiwiki.ai/wiki/robotics_models
> Updated: 2026-05-31
> Categories: AI Models, Reinforcement Learning
> License: CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/)
> From AI Wiki (https://aiwiki.ai), the free encyclopedia of artificial intelligence. Reuse freely with attribution to "AI Wiki (aiwiki.ai)".

**Robotics models** are [machine learning](/wiki/machine_learning) systems that give robots the ability to perceive their surroundings, plan actions, and execute motor control. They span perception networks (vision, depth, lidar), state estimation, motion planning, [reinforcement learning](/wiki/reinforcement_learning) policies, and the more recent class of [vision-language-action models](/wiki/vision_language_action_model) (VLAs) such as [RT-2](/wiki/rt_2) and [OpenVLA](/wiki/openvla). The field has shifted from classical sense-plan-act pipelines, where each module is engineered by hand, toward end-to-end learned policies trained on large robot demonstration datasets. This article catalogs the major learning-based [robotics](/wiki/robotics) models; for the general field see [robotics](/wiki/robotics) and [embodied AI](/wiki/embodied_ai), and for the dominant model class see [vision-language-action model](/wiki/vision_language_action_model).

*See also: [Reinforcement Learning Models](/wiki/reinforcement_learning_models), [Multimodal Models](/wiki/multimodal_models), [Embodied AI](/wiki/embodied_ai), and [Humanoid robot](/wiki/humanoid_robot)*

## Overview

A robotics model maps sensor observations to actuator commands. Classical robotics decomposes this into modules: a perception stack builds a scene representation, a planner produces a trajectory in configuration space, and a controller tracks that trajectory with feedback loops. Learning-based robotics replaces some or all of these stages with neural networks trained from data, including raw demonstrations, simulated rollouts, or rewards collected during interaction. Modern robotics models typically share components with [computer vision](/wiki/computer_vision) and [natural language processing](/wiki/natural_language_processing), reusing [transformer](/wiki/transformer) and [diffusion model](/wiki/diffusion_model) architectures developed for images and text.

The push toward general-purpose policies accelerated in late 2022, when Google's RT-1 showed that a single [transformer](/wiki/transformer) trained on around 130,000 demonstrations of more than 700 tasks could control a mobile manipulator at 3 Hz (Brohan et al., 2022).[^rt1] RT-2 then demonstrated that co-fine-tuning a vision-language model such as PaLI-X on robot trajectories, with actions emitted as text tokens, produced emergent semantic reasoning (Brohan et al., 2023).[^rt2] By 2025 the field had converged on the [vision-language-action model](/wiki/vision_language_action_model) recipe, and humanoid platforms from Figure, 1X, Tesla, and Apptronik began running such VLA models in commercial pilots.

## Core components

Most robotics stacks combine the following building blocks, regardless of whether they are hand-engineered or learned end-to-end:

* **Perception**: RGB cameras, depth sensors, lidar, and proprioception (joint angles, torques) feed into encoders such as ResNet, ViT, or fused SigLIP and DINOv2 backbones.
* **State estimation**: maps, occupancy grids, SLAM, object pose estimation, and tracking.
* **Planning**: motion planners such as RRT and CHOMP, task-and-motion planners, or learned trajectory generators.
* **Policy**: a function from observations to actions, expressed as a neural network in the learning setting.
* **Manipulation**: grasping, pick-and-place, peg-in-hole, tool use, and dexterous in-hand control.
* **Locomotion**: walking, running, and balance control for quadrupeds, bipeds, and wheeled bases.
* **Sim-to-real transfer**: techniques that close the gap between simulated training environments and physical hardware.

## Learning paradigms

### Reinforcement learning

[Reinforcement learning](/wiki/reinforcement_learning) (RL) trains a policy to maximize cumulative reward. Robotic RL typically uses model-free methods such as [Proximal Policy Optimization](/wiki/ppo) (PPO), an on-policy policy-gradient algorithm introduced by Schulman et al. in 2017, and [Soft Actor-Critic](/wiki/soft_actor_critic) (SAC), an off-policy maximum-entropy actor-critic algorithm from Haarnoja et al. in 2018. Both algorithms have been used for legged locomotion on quadrupeds like ANYmal and humanoid balance control. Methods built on Q-learning are also common: [DQN](/wiki/dqn) handles discrete action spaces, while DDPG and TD3 extend the idea to continuous actions with a learned deterministic actor. Trust Region Policy Optimization (TRPO) preceded PPO; it enforces an explicit KL-divergence trust region with a monotonic improvement guarantee, which PPO approximates with a simpler clipped objective.
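As an illustration of the policy-gradient family, the sketch below computes the clipped surrogate objective from the PPO paper (Schulman et al., 2017) for a small batch. It is a minimal sketch, not a training loop: the function name `ppo_clip_objective`, the plain-list inputs, and the default `clip_eps=0.2` are assumptions made for this example.

```python
import math

def ppo_clip_objective(new_logp, old_logp, advantages, clip_eps=0.2):
    """Mean clipped surrogate objective (to be maximized).

    new_logp / old_logp: log-probabilities of the taken actions under the
    current and the data-collecting policy; advantages: advantage estimates.
    """
    total = 0.0
    for lp_new, lp_old, adv in zip(new_logp, old_logp, advantages):
        ratio = math.exp(lp_new - lp_old)  # pi_new(a|s) / pi_old(a|s)
        clipped = max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps))
        # Take the more pessimistic of the unclipped and clipped terms.
        total += min(ratio * adv, clipped * adv)
    return total / len(advantages)
```

With a positive advantage, a probability ratio above `1 + clip_eps` earns no extra objective, which removes the incentive to move the policy far from the one that collected the data.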

Model-based RL learns a predictive model of the environment and plans inside it. Dreamer and its successor DreamerV3 (Hafner et al., 2019, 2023) learn latent dynamics models and act through imagined rollouts. [MuZero](/wiki/muzero) (Schrittwieser et al., 2020) extended AlphaZero-style planning to environments without known rules, mastering Atari, Go, chess, and shogi with a single algorithm that plans inside a learned model.
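The idea of planning inside a model can be sketched with random-shooting model-predictive control, a simpler technique than Dreamer's learned actor-critic or MuZero's tree search. Everything here is a toy assumption: `model_step` stands in for a learned dynamics model, the state is a single number, and the reward is the negative distance to a goal.

```python
import random

def model_step(state, action):
    """Stand-in for a learned dynamics model: a 1-D point moved by the action."""
    return state + action

def reward(state, goal):
    return -abs(state - goal)

def plan_random_shooting(state, goal, horizon=5, n_candidates=200, rng=None):
    """Return the first action of the best imagined action sequence."""
    rng = rng or random.Random(0)
    best_return, best_first_action = float("-inf"), 0.0
    for _ in range(n_candidates):
        actions = [rng.uniform(-1.0, 1.0) for _ in range(horizon)]
        s, total = state, 0.0
        for a in actions:  # imagined rollout: only the model is queried
            s = model_step(s, a)
            total += reward(s, goal)
        if total > best_return:
            best_return, best_first_action = total, actions[0]
    return best_first_action

def run_episode(start=0.0, goal=3.0, steps=10):
    """Replan at every step and execute only the first planned action."""
    rng = random.Random(0)
    state = start
    for _ in range(steps):
        action = plan_random_shooting(state, goal, rng=rng)
        state = model_step(state, action)  # toy case: the world equals the model
    return state
```

On a real robot the executed step would come from hardware rather than from `model_step`, and the mismatch between the two is what makes model accuracy matter.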

### Imitation learning

[Imitation learning](/wiki/imitation_learning) trains policies from expert demonstrations. Behavioral cloning fits a supervised model that maps observations to expert actions. DAgger (Ross et al., 2011) addresses compounding errors by repeatedly querying the expert on states the learner visits. Generative Adversarial Imitation Learning, or GAIL (Ho and Ermon, 2016), frames imitation as a GAN-style game between the policy and a discriminator. Inverse reinforcement learning recovers the reward function the expert appears to optimize, then trains a policy against that reward.
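The DAgger loop described above can be sketched on a toy problem. This is a simplified illustration, not the full algorithm of Ross et al.: the expert, the linear learner, and the 1-D dynamics are all hypothetical, and the mixing between expert and learner actions used in the original paper is omitted.

```python
def expert_action(state):
    """Hypothetical expert: push the 1-D state toward zero."""
    return -0.5 * state

def fit_linear_policy(dataset):
    """Behavioral cloning step: least-squares fit of action = w * state."""
    num = sum(s * a for s, a in dataset)
    den = sum(s * s for s, _ in dataset)
    return num / den if den > 0 else 0.0

def rollout(w, start, steps=10):
    """Run the policy action = w * state and return the states it visits."""
    states, s = [], start
    for _ in range(steps):
        states.append(s)
        s = s + w * s
    return states

def dagger(n_iters=3, start=2.0):
    # Iteration 0: plain behavioral cloning on states the expert visits.
    dataset = [(s, expert_action(s)) for s in rollout(-0.5, start)]
    w = fit_linear_policy(dataset)
    for _ in range(n_iters):
        visited = rollout(w, start)  # states from the learner's own rollouts
        dataset += [(s, expert_action(s)) for s in visited]  # expert relabels them
        w = fit_linear_policy(dataset)  # retrain on the aggregated dataset
    return w
```

Because the toy learner can represent the expert exactly, the fit recovers the expert's gain immediately; the point of the sketch is the data flow, in which labels always come from the expert but states come from the learner.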

Action Chunking with Transformers (ACT), released alongside the Aloha low-cost teleoperation rig by Zhao et al. in 2023, predicts short sequences of actions rather than single steps and reached 80 to 90 percent success on tasks like opening condiment cups with only ten minutes of demonstration data.[^act] [Diffusion Policy](/wiki/diffusion_policy) (Chi et al., 2023) generates action sequences through conditional denoising and reported a 46.9 percent average improvement over prior methods across 12 tasks drawn from 4 robot manipulation benchmarks.[^dp]
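A minimal sketch of chunked execution follows, assuming a hypothetical `predict_chunk` function in place of a trained policy and a toy environment whose state is simply the last action. It shows only the control flow of predicting several actions per policy call; ACT's temporal ensembling of overlapping chunks is left out.

```python
def predict_chunk(observation, chunk_size):
    """Hypothetical stand-in for a trained policy: returns the next
    `chunk_size` actions given the current observation."""
    return [observation + i + 1 for i in range(chunk_size)]

def run_with_chunking(total_steps, chunk_size):
    observation, executed, policy_calls = 0, [], 0
    while len(executed) < total_steps:
        chunk = predict_chunk(observation, chunk_size)
        policy_calls += 1
        # Execute the chunk open-loop, truncating at the episode end.
        for action in chunk[: total_steps - len(executed)]:
            executed.append(action)
            observation = action  # toy environment: state is the last action
    return executed, policy_calls
```

For example, `run_with_chunking(10, 4)` queries the policy 3 times instead of 10, which shortens the effective decision horizon over which single-step errors can compound.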

### World models and self-supervision

Self-supervised approaches learn dynamics or representations from unlabeled video and interaction logs. [V-JEPA](/wiki/v_jepa) (Bardes et al., 2024) predicts masked features in video to learn embeddings useful for downstream control. World models such as Dreamer and Google DeepMind's Genie 2 simulate plausible futures conditioned on actions, supporting planning and data augmentation. NVIDIA's GR00T-Dreams blueprint applies the same idea to data generation, prompting a video world model with a single image and an instruction to synthesize training "dreams"; NVIDIA reported using it to produce data for GR00T N1.5 in about 36 hours rather than months of manual collection.[^groot15]

### Foundation models for robotics

Foundation models pretrain on large mixed datasets and adapt to many tasks. The current generation of robotics foundation models is dominated by [vision-language-action models](/wiki/vision_language_action_model) that take an image and a language instruction and emit either tokenized or continuous actions. RT-1 used a FiLM-conditioned EfficientNet plus TokenLearner and a Transformer; RT-2 replaced that stack with a web-pretrained vision-language model, either PaLI-X (up to 55B) or PaLM-E (12B).[^rt2] RT-X and the [Open X-Embodiment](/wiki/open_x_embodiment) collaboration (October 2023) trained on 60 datasets from 22 robot embodiments across 21 institutions and showed positive transfer across platforms.[^oxe] Octo (May 2024) is an open-source generalist trained on 800,000 trajectories from Open X-Embodiment.[^octo] [OpenVLA](/wiki/openvla) (Kim et al., June 2024) is a 7-billion-parameter open-source VLA combining SigLIP, DINOv2, and Llama 2, trained on 970,000 Open X-Embodiment demonstrations, that the authors report outperformed RT-2-X (55B) by 16.5 absolute points in task success across 29 tasks while using roughly 7 times fewer parameters.[^openvla] The sections below detail the leading 2024 and 2025 robot foundation models.

## Robot foundation and VLA models

This section surveys the major robot foundation models in roughly chronological order. Most are vision-language-action models; the [vision-language-action model](/wiki/vision_language_action_model) article covers the shared architecture in depth.

### Google RT-1, RT-2, and RT-X

RT-1 (Robotics Transformer 1, December 2022) is a 35M-parameter transformer that tokenizes camera images and a language instruction and outputs discretized arm and base actions at 3 Hz. It was trained on roughly 130,000 episodes spanning more than 700 tasks, collected over 17 months with 13 Everyday Robots mobile manipulators, and Google reported a 97 percent success rate on training instructions.[^rt1] RT-2 (July 2023) reframed control as a vision-language task by co-fine-tuning a web-pretrained vision-language model (PaLI-X up to 55B, or PaLM-E 12B) on robot trajectories, with actions emitted as text tokens; Google described this as transferring web knowledge to robotic control and reported improved generalization to novel objects and instructions.[^rt2] RT-X (October 2023), part of the [Open X-Embodiment](/wiki/open_x_embodiment) collaboration, retrained RT-1 and RT-2 on the pooled cross-embodiment dataset (the RT-1-X and RT-2-X models) and demonstrated positive transfer across robot platforms.[^oxe]

### Octo and OpenVLA

Octo (May 2024, Berkeley, Stanford, and collaborators) is an open-source generalist policy built on a transformer with a diffusion action head, trained on 800,000 trajectories from the Open X-Embodiment mixture, and designed to be fine-tuned quickly to new sensors and action spaces.[^octo] [OpenVLA](/wiki/openvla) (June 2024, Stanford, Berkeley, Toyota Research Institute, Google DeepMind) is a 7-billion-parameter open-source VLA that fuses SigLIP and DINOv2 visual features into a Llama 2 backbone and predicts discretized action tokens. Trained on 970,000 real-world Open X-Embodiment demonstrations, its authors report it surpasses the closed RT-2-X (55B) by 16.5 absolute percentage points in success rate across 29 tasks with about 7 times fewer parameters, and that it supports parameter-efficient fine-tuning and quantization.[^openvla]

### Physical Intelligence pi-0 and pi-0.5

Physical Intelligence's pi-0 (also written π0, October 2024) builds on the 3B-parameter PaliGemma vision-language model and adds a roughly 300M-parameter "action expert" trained with flow matching, for about 3.3 billion parameters total. Rather than discrete tokens, the action expert generates continuous action chunks, which the company reports enables high-frequency control of up to 50 Hz for dexterous tasks such as folding laundry and bussing tables; it was pretrained on over 10,000 hours of robot data across multiple embodiments. The weights were later released as part of the open-source openpi repository.[^pi0] pi-0.5 (April 2025) extends pi-0 toward open-world generalization through co-training on a heterogeneous mixture of data, including multimodal web data, verbal instructions, high-level semantic subtask prediction, and low-level robot actions from several robot types. Physical Intelligence reported that pi-0.5 was the first end-to-end learned system to perform long-horizon dexterous tasks such as cleaning kitchens and bedrooms in homes never seen during training.[^pi05]
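How a flow-matching head turns noise into a continuous action chunk can be sketched with forward Euler integration of a velocity field. This is a toy under stated assumptions, not pi-0's implementation: `toy_velocity` replaces the learned action expert with the exact velocity for a single known target, and the function names, step count, and seed are hypothetical.

```python
import random

def toy_velocity(x, t, target):
    """Stand-in for the learned action expert. For one target and straight-line
    paths x_t = (1 - t) * noise + t * target, the exact velocity field is
    (target - x_t) / (1 - t)."""
    return [(g - xi) / (1.0 - t) for xi, g in zip(x, target)]

def sample_action_chunk(target, n_steps=10, seed=0):
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in target]  # start from Gaussian noise
    dt = 1.0 / n_steps
    for k in range(n_steps):  # forward Euler from t = 0 toward t = 1
        t = k * dt
        v = toy_velocity(x, t, target)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x
```

In a trained model the velocity would come from a network conditioned on images, language, and robot state rather than on a known target, and the output would be real-valued actions with no binning step.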

### NVIDIA Isaac GR00T

NVIDIA Isaac GR00T N1 (March 2025) was announced as the first open, customizable foundation model for generalist humanoid robots. It uses a dual-system design in which a vision-language module (System 2) interprets the scene and instruction and a diffusion-transformer module (System 1) generates continuous motor actions; the GR00T-N1-2B checkpoint is published on Hugging Face. It was trained on a mixture of real-robot trajectories, human videos, and synthetic data, and NVIDIA reported that adding synthetic data improved performance by 40 percent over real data alone.[^groot1] GR00T N1.5 (June 2025) is a 3B-parameter update that freezes its Eagle-2.5 vision-language backbone, adds a Future Latent Representation Alignment (FLARE) objective, and improves language following; NVIDIA reported large gains over N1, for example raising the real GR-1 humanoid language-following rate from 46.6 percent to 93.3 percent.[^groot15] NVIDIA has continued the series with GR00T N1.6 and an early-access GR00T N1.7 (2026), the latter built on a Cosmos-Reason VLM backbone.[^grootrepo]

### Google DeepMind Gemini Robotics

Google DeepMind's [Gemini Robotics](/wiki/gemini_robotics) (March 2025) is a vision-language-action model built on Gemini 2.0 that adds physical action outputs, paired with Gemini Robotics-ER, an embodied reasoning model with enhanced spatial and temporal understanding for perception, state estimation, planning, and code generation; DeepMind reported that Gemini Robotics-ER achieved a 2 to 3 times higher success rate than Gemini 2.0 on end-to-end embodied tasks, and named Apptronik as a humanoid partner.[^gemrob] Gemini Robotics On-Device (June 2025) is a VLA optimized to run locally on the robot for low-latency, offline operation; DeepMind described it as its first VLA available for developer fine-tuning, adapting to new tasks from as few as 50 to 100 demonstrations, and shipped a Gemini Robotics SDK with MuJoCo support.[^gemod] Gemini Robotics 1.5 and Gemini Robotics-ER 1.5 (September 2025) split the stack into an orchestrating embodied-reasoning model (ER 1.5) that plans, calls digital tools such as web search, and estimates progress, and a VLA (1.5) that "thinks before acting" by generating natural-language reasoning before motion; DeepMind reported skill transfer across the ALOHA 2, Apptronik Apollo, and Franka platforms without per-robot specialization.[^gem15]

### Figure Helix

Figure's Helix (February 2025) is a vision-language-action model for generalist humanoid control that uses a System 1 plus System 2 split. System 2 is an internet-pretrained 7B vision-language model running at 7 to 9 Hz for scene and language understanding, and System 1 is an 80M-parameter visuomotor policy that produces continuous actions at 200 Hz over a 35-degree-of-freedom humanoid upper body, including individual finger control. Figure reported training on roughly 500 hours of teleoperated demonstrations augmented with auto-generated hindsight instructions, and showed two robots collaborating on novel objects without task-specific programming.[^helix]
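The two-rate structure can be illustrated with a tick-based simulation. This is a simplified, synchronous sketch: `slow_vlm` and `fast_policy` are hypothetical placeholders, 8 Hz is picked from within the reported 7 to 9 Hz range, and a deployed system may run the two modules asynchronously.

```python
def slow_vlm(step):
    """Hypothetical System 2: returns a latent goal for the fast policy."""
    return {"goal": step}

def fast_policy(latent, tick):
    """Hypothetical System 1: maps the latest latent and current tick to an action."""
    return (latent["goal"], tick)

def run_dual_system(duration_s=1.0, fast_hz=200, slow_hz=8):
    ticks_per_slow_update = fast_hz // slow_hz  # 25 fast ticks per slow update
    latent, actions, slow_updates = None, [], 0
    for tick in range(int(duration_s * fast_hz)):
        if tick % ticks_per_slow_update == 0:  # System 2 refreshes its latent
            latent = slow_vlm(slow_updates)
            slow_updates += 1
        actions.append(fast_policy(latent, tick))  # System 1 acts on every tick
    return actions, slow_updates
```

Over one simulated second the fast loop emits 200 actions while the slow module updates 8 times, so each latent is reused for 25 consecutive control steps.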

### Open and efficient VLAs: RDT-1B and SmolVLA

RDT-1B (Robotics Diffusion Transformer, October 2024, Tsinghua University) is a 1.2-billion-parameter diffusion foundation model for bimanual manipulation, described by its authors as the largest diffusion-based foundation model for robotic manipulation at release. It was pretrained on a collection of 46 datasets with more than one million episodes and fine-tuned on a self-collected bimanual dataset, and the authors report zero-shot generalization to unseen objects and few-shot learning of new skills from one to five demonstrations.[^rdt] SmolVLA (June 2025, Hugging Face) is a compact 450M-parameter open VLA trained on community-contributed LeRobot datasets. It pairs a SmolVLM-2 backbone (SigLIP vision encoder plus a SmolLM2 language model) with a roughly 100M-parameter flow-matching action expert, targets consumer-grade hardware, and uses an asynchronous inference scheme that Hugging Face reports cuts task time by about 30 percent on average; the company states it matches or exceeds much larger models on its benchmarks.[^smolvla]

## Notable models

| Model | Year | Group | Type |
| --- | --- | --- | --- |
| DQN | 2013/2015 | DeepMind | Deep Q-learning, Atari |
| AlphaGo / AlphaZero | 2016/2017 | DeepMind | Self-play planning |
| OpenAI Five | 2018 | OpenAI | Dota 2 PPO at scale |
| AlphaStar | 2019 | DeepMind | StarCraft II |
| MuZero | 2020 | DeepMind | Model-based planning |
| DreamerV3 | 2023 | DeepMind | World-model RL |
| RT-1 | Dec 2022 | Google | Robotics transformer, 35M |
| Diffusion Policy | Mar 2023 | Columbia, Toyota | Diffusion control |
| ACT / Aloha | Apr 2023 | Stanford | Bimanual imitation |
| RT-2 | Jul 2023 | Google | First VLA at scale |
| RT-X | Oct 2023 | RT-X collaboration | Cross-embodiment |
| Octo | May 2024 | Berkeley, Stanford | Open generalist policy |
| OpenVLA | Jun 2024 | Stanford, Berkeley | Open-source 7B VLA |
| pi-0 | Oct 2024 | Physical Intelligence | Flow-matching VLA, 3.3B |
| RDT-1B | Oct 2024 | Tsinghua University | Bimanual diffusion VLA, 1.2B |
| Helix | Feb 2025 | Figure | Humanoid VLA, 7B + 80M, 200 Hz |
| Gemini Robotics | Mar 2025 | Google DeepMind | Gemini 2.0 with actions |
| GR00T N1 | Mar 2025 | NVIDIA | Open humanoid foundation model |
| pi-0.5 | Apr 2025 | Physical Intelligence | Open-world VLA |
| SmolVLA | Jun 2025 | Hugging Face | Compact open VLA, 450M |
| GR00T N1.5 | Jun 2025 | NVIDIA | 3B humanoid VLA, FLARE |
| Gemini Robotics On-Device | Jun 2025 | Google DeepMind | On-robot, fine-tunable VLA |
| Gemini Robotics 1.5 / ER 1.5 | Sep 2025 | Google DeepMind | Agentic VLA plus reasoning |

## Datasets and benchmarks

Large robot datasets are typically collected by human teleoperation, scripted policies, or online RL agents. The table below lists widely cited resources.

| Resource | Year | Scale | Notes |
| --- | --- | --- | --- |
| RoboNet | 2019 | 15M video frames | Multi-robot, multi-lab |
| Bridge Data | 2022/2023 | ~60K demos | Single-arm tabletop |
| RT-1 dataset | 2022 | ~130K demos, 700 tasks | Google kitchen environments |
| Open X-Embodiment | 2023 | 60 datasets, 22 embodiments | Cross-institution union |
| DROID | 2024 | 76K demos, 350 hours | 13 institutions, Franka arms |
| Aloha and Mobile Aloha | 2023/2024 | Teleop bimanual | Low-cost rig |
| BEHAVIOR-1K | 2023 | 1,000 household tasks | Stanford simulation |
| RLBench | 2019 | 100 tasks | CoppeliaSim |
| Meta-World | 2019 | 50 manipulation tasks | Multi-task RL benchmark |
| LIBERO | 2023 | Lifelong learning suite | Procedural tasks |
| ManiSkill | 2021 | Generalizable manipulation | SAPIEN simulator |
| SIMPLER | 2024 | Sim-to-real evaluation | Reproduces real benchmarks |

Simulators feeding these benchmarks include NVIDIA Isaac Sim and Isaac Gym, [MuJoCo](/wiki/mujoco) maintained by Google DeepMind, PyBullet, RoboCasa, Habitat for embodied navigation, and AI2-THOR. Sim-to-real transfer commonly relies on domain randomization (Tobin et al., 2017), which trains policies across randomized visual and dynamics parameters so that real-world physics falls inside the training distribution.[^domainrand]
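Domain randomization amounts to resampling simulator parameters at the start of each training episode. The sketch below assumes hypothetical parameter names and ranges; real values depend on the robot, the task, and the simulator's own API, none of which is modeled here.

```python
import random

# Hypothetical parameter ranges; real ranges depend on the robot and simulator.
PARAM_RANGES = {
    "friction": (0.5, 1.5),
    "object_mass_kg": (0.05, 0.5),
    "light_intensity": (0.3, 1.0),
    "camera_yaw_deg": (-5.0, 5.0),
}

def sample_episode_params(rng):
    """Draw one set of visual and dynamics parameters for a training episode."""
    return {name: rng.uniform(lo, hi) for name, (lo, hi) in PARAM_RANGES.items()}

def covers(real_params):
    """True if every measured real-world value lies inside its sampled range."""
    return all(
        PARAM_RANGES[name][0] <= value <= PARAM_RANGES[name][1]
        for name, value in real_params.items()
    )
```

A check like `covers({"friction": 1.0})` expresses the goal stated above: the ranges should be wide enough that the real system looks like one more sample from the training distribution.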

## Skills and embodiments

Robotics models target several categories of skills. Manipulation covers pick-and-place, peg-in-hole, tool use, cloth folding, and dexterous in-hand reorientation. Navigation includes point-goal, object-goal, and instruction-following tasks. Bimanual manipulation, which the Aloha platform popularized, requires synchronized two-arm control. Locomotion spans flat ground walking, dynamic running, and recovery from disturbances. Many recent VLAs are deployed on [humanoid robot](/wiki/humanoid_robot) platforms, listed below alongside quadrupeds and arms.

| Embodiment | Company or lab | Type |
| --- | --- | --- |
| Spot | Boston Dynamics | Quadruped |
| ANYmal | ANYbotics, ETH Zurich | Quadruped |
| Go2 / B2 | Unitree | Quadruped |
| Atlas | Boston Dynamics | Humanoid |
| Optimus | Tesla | Humanoid |
| Figure 02 / Figure 03 | Figure | Humanoid |
| NEO | 1X Technologies | Humanoid |
| H1 / G1 | Unitree | Humanoid |
| Apollo | Apptronik | Humanoid |
| Phoenix | Sanctuary AI | Humanoid |
| Aloha 2 | Stanford, Google | Bimanual stationary |
| Franka Panda | Franka Robotics | 7-DoF arm |

## Vision-language-action paradigm

The [vision-language-action model](/wiki/vision_language_action_model) recipe inherits a [pretrained](/wiki/pretraining) vision-language backbone, attaches a robot action head, and fine-tunes on demonstration data. RT-2 and OpenVLA discretize each action dimension into bins and predict them as tokens, reusing the language model's softmax head. pi-0 instead pairs PaliGemma with a flow-matching expert that emits continuous trajectories, giving 50 Hz control with smoother motion, an approach also taken by SmolVLA's action expert.[^pi0][^smolvla] Helix and GR00T N1 split inference between a slower internet-pretrained VLM for scene understanding (System 2) and a fast visuomotor policy (System 1) that runs the high-rate control loop.[^helix][^groot1] Across these systems the design pattern is the same: web-scale [pretraining](/wiki/pretraining) supplies general world knowledge, and a smaller robot-specific fine-tune supplies grounded motor control.
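The discretization step used by token-based VLAs can be sketched as uniform binning of each action dimension. The bounds of -1 to 1, the default of 256 bins, and the function names are assumptions for illustration; published models choose their own ranges and map bin indices onto vocabulary tokens in model-specific ways.

```python
def action_to_token(value, low=-1.0, high=1.0, n_bins=256):
    """Map a continuous action value to a bin index in [0, n_bins - 1]."""
    clipped = max(low, min(high, value))
    idx = int((clipped - low) / (high - low) * n_bins)
    return min(idx, n_bins - 1)  # the upper bound falls into the last bin

def token_to_action(token, low=-1.0, high=1.0, n_bins=256):
    """Map a bin index back to the center of its bin."""
    width = (high - low) / n_bins
    return low + (token + 0.5) * width
```

A round trip loses at most half a bin width (about 0.0039 with these defaults), which is the precision cost that continuous action heads avoid.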

## Modern landscape

Several trends define the 2024 and 2025 wave of robotics models. [Humanoid robot](/wiki/humanoid_robot) development has surged with commercial deliveries from Figure, 1X, Agility Robotics, and Apptronik, plus ongoing programs at Tesla and Boston Dynamics. Generalist VLAs are replacing per-task policies, supported by collaborations such as Open X-Embodiment and by open releases such as OpenVLA, GR00T, SmolVLA, and the openpi pi-0 weights. Low-cost data-collection hardware, including the Aloha and Mobile Aloha teleoperation rigs and the handheld Universal Manipulation Interface (UMI), reduces the expense of gathering demonstrations. [Diffusion](/wiki/diffusion_model) and flow-matching action heads can represent multimodal action distributions, which mean-squared-error regression collapses toward an average of the modes. Agentic stacks such as Gemini Robotics 1.5 add an explicit reasoning model that plans and calls external tools before the VLA acts.[^gem15] Edge inference platforms such as NVIDIA Jetson Thor and Orin, together with on-robot models like Gemini Robotics On-Device, run multibillion-parameter policies on the robot itself, which removes the need for a cloud round trip during control.[^gemod]
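The problem with mean-squared-error regression on multimodal demonstrations can be shown with a small worked example. The data are hypothetical: half of the demonstrators steer left (-1) and half steer right (+1) around an obstacle.

```python
def mse(pred, targets):
    """Mean squared error of a single constant prediction."""
    return sum((pred - t) ** 2 for t in targets) / len(targets)

# Hypothetical demonstrations: two equally valid modes.
demos = [-1.0, -1.0, 1.0, 1.0]

# The constant that minimizes MSE is the mean of the targets.
best_constant = sum(demos) / len(demos)

loss_at_mean = mse(best_constant, demos)  # loss when predicting 0.0
loss_at_mode = mse(1.0, demos)            # loss when committing to one mode
```

Here the loss-minimizing prediction is 0.0, which steers straight at the obstacle and matches no demonstration, yet it scores a lower loss (1.0) than committing to either valid mode (2.0). A generative action head samples one of the modes instead of averaging them.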

## Applications

Robotics models reach production in several sectors. Industrial assembly cells use learned bin picking and insertion. Warehouse logistics is led by Amazon Robotics, which runs hundreds of thousands of mobile platforms. Service robots address cleaning, last-mile delivery, and food preparation. Surgical robotics combines teleoperation with learned assistance, as in the Intuitive da Vinci platform. [Autonomous driving](/wiki/autonomous_driving) shares perception and planning components with robotics and is an adjacent field. Agriculture uses fruit-picking arms and autonomous tractors. Search and rescue, prosthetics, and exoskeletons round out the application stack.

## Limitations

Despite rapid progress, several issues remain open. The sim-to-real gap means policies trained in simulation often degrade on hardware, and domain randomization adds variance to training. Sample inefficiency is acute for real-world RL because robot rollouts are slow and risk hardware damage. Generalization across embodiments is partial: a policy trained on a Franka Panda may not transfer to a different gripper without fine-tuning, which is one motivation for cross-embodiment efforts like Open X-Embodiment and GR00T. Safety guarantees are weak compared with classical controllers because neural policies are not easily certified. Real-time inference imposes latency budgets that constrain model size on the robot. Data collection is expensive: 76,000 DROID trajectories required 12 months and 13 institutions.[^droid] Evaluation reproducibility is hampered by hardware variation, which has motivated reproducible suites like SIMPLER. Vendor performance figures cited above are self-reported and have not always been independently replicated.

## References

[^rt1]: Brohan, A. et al. (2022). RT-1: Robotics Transformer for Real-World Control at Scale. https://arxiv.org/abs/2212.06817 Accessed 2026-05-31.
[^rt2]: Brohan, A. et al. (2023). RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control. https://arxiv.org/abs/2307.15818 Accessed 2026-05-31.
[^oxe]: Open X-Embodiment Collaboration (2023). Open X-Embodiment: Robotic Learning Datasets and RT-X Models. https://arxiv.org/abs/2310.08864 Accessed 2026-05-31.
[^act]: Zhao, T. et al. (2023). Learning Fine-Grained Bimanual Manipulation with Low-Cost Hardware (ACT and Aloha). https://arxiv.org/abs/2304.13705 Accessed 2026-05-31.
[^dp]: Chi, C. et al. (2023). Diffusion Policy: Visuomotor Policy Learning via Action Diffusion. https://arxiv.org/abs/2303.04137 Accessed 2026-05-31.
[^octo]: Octo Model Team (2024). Octo: An Open-Source Generalist Robot Policy. https://arxiv.org/abs/2405.12213 Accessed 2026-05-31.
[^openvla]: Kim, M. J. et al. (2024). OpenVLA: An Open-Source Vision-Language-Action Model. https://arxiv.org/abs/2406.09246 Accessed 2026-05-31.
[^pi0]: Physical Intelligence (2024). pi-0: A Vision-Language-Action Flow Model for General Robot Control. https://arxiv.org/abs/2410.24164 Accessed 2026-05-31.
[^pi05]: Physical Intelligence (2025). pi-0.5: a Vision-Language-Action Model with Open-World Generalization. https://arxiv.org/abs/2504.16054 Accessed 2026-05-31.
[^groot1]: NVIDIA (2025). NVIDIA Isaac GR00T N1: An Open Foundation Model for Humanoid Robots. https://arxiv.org/abs/2503.14734 Accessed 2026-05-31.
[^groot15]: NVIDIA (2025). GR00T N1.5: An Improved Open Foundation Model for Generalist Humanoid Robots. https://research.nvidia.com/labs/gear/gr00t-n1_5/ Accessed 2026-05-31.
[^grootrepo]: NVIDIA (2026). Isaac-GR00T repository (GR00T N1.6 and N1.7). https://github.com/NVIDIA/Isaac-GR00T Accessed 2026-05-31.
[^gemrob]: Google DeepMind (2025). Gemini Robotics: Bringing AI into the Physical World. https://deepmind.google/discover/blog/gemini-robotics-brings-ai-into-the-physical-world/ Accessed 2026-05-31.
[^gemod]: Google DeepMind (2025). Gemini Robotics On-Device brings AI to local robotic devices. https://deepmind.google/blog/gemini-robotics-on-device-brings-ai-to-local-robotic-devices/ Accessed 2026-05-31.
[^gem15]: Google DeepMind (2025). Gemini Robotics 1.5 brings AI agents into the physical world. https://deepmind.google/discover/blog/gemini-robotics-15-brings-ai-agents-into-the-physical-world/ Accessed 2026-05-31.
[^helix]: Figure (2025). Helix: A Vision-Language-Action Model for Generalist Humanoid Control. https://www.figure.ai/news/helix Accessed 2026-05-31.
[^rdt]: Liu, S. et al. (2024). RDT-1B: a Diffusion Foundation Model for Bimanual Manipulation. https://arxiv.org/abs/2410.07864 Accessed 2026-05-31.
[^smolvla]: Shukor, M. et al. (2025). SmolVLA: A Vision-Language-Action Model for Affordable and Efficient Robotics. https://arxiv.org/abs/2506.01844 Accessed 2026-05-31.
[^droid]: Khazatsky, A. et al. (2024). DROID: A Large-Scale In-The-Wild Robot Manipulation Dataset. https://arxiv.org/abs/2403.12945 Accessed 2026-05-31.
[^domainrand]: Tobin, J. et al. (2017). Domain Randomization for Transferring Deep Neural Networks from Simulation to the Real World. https://arxiv.org/abs/1703.06907 Accessed 2026-05-31.

Additional references for works cited above by author and year:

1. Schulman, J. et al. (2017). Proximal Policy Optimization Algorithms. https://arxiv.org/abs/1707.06347 Accessed 2026-05-31.
2. Haarnoja, T. et al. (2018). Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor. https://arxiv.org/abs/1801.01290 Accessed 2026-05-31.
3. Schrittwieser, J. et al. (2020). Mastering Atari, Go, Chess and Shogi by Planning with a Learned Model (MuZero). https://arxiv.org/abs/1911.08265 Accessed 2026-05-31.
4. Ho, J., Ermon, S. (2016). Generative Adversarial Imitation Learning. https://arxiv.org/abs/1606.03476 Accessed 2026-05-31.