Robotics Models
Last reviewed
May 31, 2026
Sources
4 citations
Review status
Source-backed
Revision
v3 · 3,691 words
Robotics models are machine learning systems that give robots the ability to perceive their surroundings, plan actions, and execute motor control. They span perception networks (vision, depth, lidar), state estimation, motion planning, reinforcement learning policies, and the more recent class of vision-language-action models (VLAs) such as RT-2 and OpenVLA. The field has shifted from classical sense-plan-act pipelines, where each module is engineered by hand, toward end-to-end learned policies trained on large robot demonstration datasets. This article catalogs the major learning-based robotics models; for the general field see robotics and embodied AI, and for the dominant model class see vision-language-action model.
See also: Reinforcement Learning Models, Multimodal Models, Embodied AI, and Humanoid robot
A robotics model maps sensor observations to actuator commands. Classical robotics decomposes this into modules: a perception stack builds a scene representation, a planner produces a trajectory in configuration space, and a controller tracks that trajectory with feedback loops. Learning-based robotics replaces some or all of these stages with neural networks trained from data, including raw demonstrations, simulated rollouts, or rewards collected during interaction. Modern robotics models typically share components with computer vision and natural language processing, reusing transformer and diffusion model architectures developed for images and text.
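As a rough illustration of the classical decomposition described above, the toy sketch below wires a perception step, a planner, and a feedback controller around a one-dimensional point robot. All function names, the gain, and the plant model (the command is applied directly as a displacement) are assumptions made for illustration and are not taken from any particular robotics stack.

```python
def perceive(raw_reading):
    """Perception: turn a raw sensor reading into a state estimate (identity here)."""
    return float(raw_reading)


def plan(start, goal, steps):
    """Planner: evenly spaced straight-line waypoints from start to goal."""
    return [start + (goal - start) * (i + 1) / steps for i in range(steps)]


def control(estimate, waypoint, gain=0.8):
    """Controller: proportional feedback on the tracking error."""
    return gain * (waypoint - estimate)


def run_pipeline(start, goal, steps=20, hold=20):
    """Sense-plan-act loop; the final waypoint is held so the controller can settle."""
    position = start
    trajectory = plan(perceive(position), goal, steps) + [goal] * hold
    for waypoint in trajectory:
        command = control(perceive(position), waypoint)
        position += command  # toy plant: the command is applied as a displacement
    return position


final_position = run_pipeline(0.0, 1.0)
```

A learned end-to-end policy would replace all three functions with a single network that maps the raw reading to the command.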
The push toward general-purpose policies accelerated after 2022, when Google's RT-1 showed that a single transformer trained on around 130,000 demonstrations of 700 tasks could control a mobile manipulator at 3 Hz (Brohan et al., 2022).[^rt1] RT-2 then demonstrated that fine-tuning a vision-language model like PaLI-X on robot trajectories produced emergent semantic reasoning, treating action tokens like words (Brohan et al., 2023).[^rt2] By 2025 the field had converged on the vision-language-action model recipe, and humanoid platforms from Figure, 1X, Tesla, and Apptronik began running such VLA models in commercial pilots.
Most robotics stacks combine the following building blocks, regardless of whether they are hand-engineered or learned end-to-end:
Reinforcement learning (RL) trains a policy to maximize cumulative reward. Robotic RL typically uses model-free methods such as the on-policy policy-gradient algorithm Proximal Policy Optimization (PPO), introduced by Schulman et al. in 2017, and the off-policy actor-critic algorithm Soft Actor-Critic (SAC) from Haarnoja et al. in 2018. Both algorithms have been used for legged locomotion on quadrupeds like ANYmal and for humanoid balance control. Among value-based and deterministic actor-critic methods, DQN handles discrete action spaces, while DDPG and TD3 are common for continuous control. Trust Region Policy Optimization (TRPO) preceded PPO and enforces an explicit trust-region constraint with a monotonic improvement guarantee, which PPO approximates with a simpler clipped objective.
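The clipped surrogate objective that PPO maximizes can be sketched in a few lines. This is a minimal pure-Python illustration, assuming per-sample log-probabilities and advantage estimates are already available; the function name and the default `clip_eps=0.2` are illustrative choices, and real implementations add value and entropy terms and operate on batched tensors.

```python
import math


def ppo_clip_objective(new_logp, old_logp, advantages, clip_eps=0.2):
    """Mean clipped surrogate objective over a batch (to be maximized)."""
    total = 0.0
    for lp_new, lp_old, adv in zip(new_logp, old_logp, advantages):
        # Probability ratio between the updated policy and the data-collecting policy.
        ratio = math.exp(lp_new - lp_old)
        clipped = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
        # Taking the minimum removes any incentive to push the ratio outside the clip range.
        total += min(ratio * adv, clipped * adv)
    return total / len(advantages)


# A ratio of 1.5 with a positive advantage is capped at 1 + clip_eps.
capped = ppo_clip_objective([math.log(1.5)], [0.0], [1.0])
```

With a negative advantage the unclipped term is the smaller one, so the penalty for moving toward a bad action is not capped.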
Model-based RL learns a predictive model of the environment and plans inside it. Dreamer and its successor DreamerV3 (Hafner et al., 2019, 2023) learn latent dynamics models and act through imagined rollouts. MuZero (Schrittwieser et al., 2020) extended AlphaZero-style planning to environments without known rules, mastering Atari, Go, chess, and shogi with a single algorithm.
Imitation learning trains policies from expert demonstrations. Behavioral cloning fits a supervised model that maps observations to expert actions. DAgger (Ross et al., 2011) addresses compounding errors by repeatedly querying the expert on states the learner visits. Generative Adversarial Imitation Learning, or GAIL (Ho and Ermon, 2016), frames imitation as a GAN-style game between the policy and a discriminator. Inverse reinforcement learning recovers the reward function the expert appears to optimize, then trains a policy against that reward.
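The DAgger loop described above can be sketched on a toy one-dimensional problem: roll out the current learner, ask the expert to label the states the learner actually visited, aggregate, and refit. The expert, the dynamics, and the linear least-squares learner are all hypothetical stand-ins; because this toy expert is linear and noise-free, the sketch shows only the data-aggregation structure, not the accuracy benefit over plain behavioral cloning.

```python
import random


def expert_action(state):
    """Hypothetical expert: proportional controller toward the origin."""
    return -0.5 * state


def fit_linear(data):
    """Behavioral cloning step: least-squares slope w for the policy a = w * s."""
    num = sum(s * a for s, a in data)
    den = sum(s * s for s, _ in data)
    return num / den if den > 0 else 0.0


def rollout(weight, start, steps=10):
    """Run the linear policy in toy dynamics s' = s + a and return visited states."""
    states, s = [], start
    for _ in range(steps):
        states.append(s)
        s = s + weight * s
    return states


def dagger(iterations=5, seed=0):
    rng = random.Random(seed)
    # Start from expert demonstrations, as in plain behavioral cloning.
    dataset = [(s, expert_action(s)) for s in rollout(-0.5, 1.0)]
    weight = fit_linear(dataset)
    for _ in range(iterations):
        # Visit states under the learner's own policy, then query the expert on them.
        visited = rollout(weight, rng.uniform(-2.0, 2.0))
        dataset += [(s, expert_action(s)) for s in visited]
        weight = fit_linear(dataset)
    return weight


learned_weight = dagger()
```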
Action Chunking with Transformers (ACT), released alongside the Aloha low-cost teleoperation rig by Zhao et al. in 2023, predicts short sequences of actions rather than single steps and reached 80 to 90 percent success on tasks like opening condiment cups with only ten minutes of demonstration data.[^act] Diffusion Policy (Chi et al., 2023) generates action sequences through conditional denoising and reported a 46.9 percent average improvement over prior methods across 12 tasks drawn from 4 manipulation benchmarks.[^dp]
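The core idea of action chunking, querying the policy once and executing several actions before querying again, can be sketched as follows. The policy and environment here are toy stand-ins (hypothetical names `toy_policy` and `toy_step`), and the sketch executes each chunk open-loop; it does not reproduce ACT's transformer architecture or any smoothing across overlapping chunks.

```python
def run_chunked(policy, initial_obs, step_env, horizon, chunk_size):
    """Execute a policy that returns `chunk_size` actions per query."""
    obs, executed, policy_calls = initial_obs, [], 0
    while len(executed) < horizon:
        chunk = policy(obs, chunk_size)
        policy_calls += 1
        # Execute the chunk open-loop, truncating at the episode horizon.
        for action in chunk[: horizon - len(executed)]:
            obs = step_env(obs, action)
            executed.append(action)
    return executed, policy_calls


def toy_policy(obs, chunk_size):
    """Hypothetical policy: split the remaining distance to 1.0 into equal steps."""
    return [(1.0 - obs) / chunk_size] * chunk_size


def toy_step(obs, action):
    """Hypothetical environment: the action is a displacement."""
    return obs + action


actions, calls = run_chunked(toy_policy, 0.0, toy_step, horizon=10, chunk_size=4)
```

With a horizon of 10 and chunks of 4, the policy is queried 3 times instead of 10, which shortens the effective decision horizon over which single-step errors can compound.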
Self-supervised approaches learn dynamics or representations from unlabeled video and interaction logs. V-JEPA (Bardes et al., 2024) predicts masked features in video to learn embeddings useful for downstream control. World models such as Dreamer and Google DeepMind's Genie 2 simulate plausible futures conditioned on actions, supporting planning and data augmentation. NVIDIA's GR00T-Dreams blueprint applies the same idea to data generation, prompting a video world model with a single image and an instruction to synthesize training "dreams"; NVIDIA reported using it to produce data for GR00T N1.5 in about 36 hours rather than months of manual collection.[^groot15]
Foundation models pretrain on large mixed datasets and adapt to many tasks. The current generation of robotics foundation models is dominated by vision-language-action models that take an image and a language instruction and emit either tokenized or continuous actions. RT-1 used a FiLM-conditioned EfficientNet plus TokenLearner and Transformer; RT-2 swapped in PaLI-X (55B) and PaLM-E (12B).[^rt2] RT-X and the Open X-Embodiment collaboration (October 2023) trained on 60 datasets from 22 robot embodiments across 21 institutions and showed positive transfer across platforms.[^oxe] OpenVLA (Kim et al., June 2024) is a 7-billion-parameter open-source VLA combining SigLIP, DINOv2, and Llama 2, trained on 970,000 Open X-Embodiment demonstrations, that the authors report outperformed RT-2-X (55B) by 16.5 absolute points in task success across 29 tasks while using roughly 7 times fewer parameters.[^openvla] Octo (May 2024) is an open-source generalist trained on 800,000 trajectories from Open X-Embodiment.[^octo] The sections below detail the leading 2024 and 2025 robot foundation models.
This section surveys the major robot foundation models in roughly chronological order. Most are vision-language-action models; the vision-language-action model article covers the shared architecture in depth.
RT-1 (Robotics Transformer 1, December 2022) is a 35M-parameter transformer that tokenizes camera images and a language instruction and outputs discretized arm and base actions at 3 Hz. It was trained on roughly 130,000 episodes spanning more than 700 tasks, collected over 17 months with 13 Everyday Robots mobile manipulators, and Google reported a 97 percent success rate on training instructions.[^rt1] RT-2 (July 2023) reframed control as a vision-language task by co-fine-tuning a web-pretrained vision-language model (PaLI-X up to 55B, or PaLM-E 12B) on robot trajectories, with actions emitted as text tokens; Google described this as transferring web knowledge to robotic control and reported improved generalization to novel objects and instructions.[^rt2] RT-X (October 2023), part of the Open X-Embodiment collaboration, retrained RT-1 and RT-2 on the pooled cross-embodiment dataset (the RT-1-X and RT-2-X models) and demonstrated positive transfer across robot platforms.[^oxe]
Octo (May 2024, Berkeley, Stanford, and collaborators) is an open-source generalist policy built on a transformer with a diffusion action head, trained on 800,000 trajectories from the Open X-Embodiment mixture, and designed to be fine-tuned quickly to new sensors and action spaces.[^octo] OpenVLA (June 2024, Stanford, Berkeley, Toyota Research Institute, Google DeepMind) is a 7-billion-parameter open-source VLA that fuses SigLIP and DINOv2 visual features into a Llama 2 backbone and predicts discretized action tokens. Trained on 970,000 real-world Open X-Embodiment demonstrations, its authors report it surpasses the closed RT-2-X (55B) by 16.5 absolute percentage points in success rate across 29 tasks with about 7 times fewer parameters, and that it supports parameter-efficient fine-tuning and quantization.[^openvla]
Physical Intelligence's pi-0 (also written π0, October 2024) builds on the 3B-parameter PaliGemma vision-language model and adds a roughly 300M-parameter "action expert" trained with flow matching, for about 3.3 billion parameters total. Rather than discrete tokens, the action expert generates continuous action chunks, which the company reports enables high-frequency control of up to 50 Hz for dexterous tasks such as folding laundry and bussing tables; it was pretrained on over 10,000 hours of robot data across multiple embodiments. The weights were later released as part of the open-source openpi repository.[^pi0] pi-0.5 (April 2025) extends pi-0 toward open-world generalization through co-training on a heterogeneous mixture of data, including multimodal web data, verbal instructions, high-level semantic subtask prediction, and low-level robot actions from several robot types. Physical Intelligence reported that pi-0.5 was the first end-to-end learned system to perform long-horizon dexterous tasks such as cleaning kitchens and bedrooms in homes never seen during training.[^pi05]
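To make the flow-matching idea concrete, the sketch below generates a continuous action vector by integrating a velocity field from Gaussian noise with fixed Euler steps. It is a toy under stated assumptions: the learned, observation-conditioned velocity network is replaced by a closed-form field that transports any point to one known target, and the time convention (noise at t = 0, action at t = 1) and the 10 integration steps are illustrative choices rather than details confirmed by the source.

```python
import random


def toy_velocity(x, t, target):
    """Stand-in for a learned velocity network.

    For a single known target, this field moves x along a straight line
    so that it arrives at the target exactly at t = 1.
    """
    return [(g - xi) / (1.0 - t) for xi, g in zip(x, target)]


def sample_action(target, n_steps=10, seed=0):
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in target]  # start from Gaussian noise
    dt = 1.0 / n_steps
    for k in range(n_steps):
        t = k * dt  # t stays strictly below 1, so the division is safe
        v = toy_velocity(x, t, target)
        x = [xi + dt * vi for xi, vi in zip(x, v)]  # Euler step
    return x


action = sample_action([0.1, -0.2, 0.3])
```

Because the output is a real-valued vector rather than a sequence of discrete tokens, an entire chunk of actions can be produced in a small, fixed number of integration steps.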
NVIDIA Isaac GR00T N1 (March 2025) was announced as the first open, customizable foundation model for generalist humanoid robots. It uses a dual-system design in which a vision-language module (System 2) interprets the scene and instruction and a diffusion-transformer module (System 1) generates continuous motor actions; the GR00T-N1-2B checkpoint is published on Hugging Face. It was trained on a mixture of real-robot trajectories, human videos, and synthetic data, and NVIDIA reported that adding synthetic data improved performance by 40 percent over real data alone.[^groot1] GR00T N1.5 (June 2025) is a 3B-parameter update that freezes its Eagle-2.5 vision-language backbone, adds a Future Latent Representation Alignment (FLARE) objective, and improves language following; NVIDIA reported large gains over N1, for example raising the real GR-1 humanoid language-following rate from 46.6 percent to 93.3 percent.[^groot15] NVIDIA has continued the series with GR00T N1.6 and an early-access GR00T N1.7 (2026), the latter built on a Cosmos-Reason VLM backbone.[^grootrepo]
Google DeepMind's Gemini Robotics (March 2025) is a vision-language-action model built on Gemini 2.0 that adds physical action outputs, paired with Gemini Robotics-ER, an embodied reasoning model with enhanced spatial and temporal understanding for perception, state estimation, planning, and code generation; DeepMind reported that Gemini Robotics-ER achieved a 2 to 3 times higher success rate than Gemini 2.0 on end-to-end embodied tasks, and named Apptronik as a humanoid partner.[^gemrob] Gemini Robotics On-Device (June 2025) is a VLA optimized to run locally on the robot for low-latency, offline operation; DeepMind described it as its first VLA available for developer fine-tuning, adapting to new tasks from as few as 50 to 100 demonstrations, and shipped a Gemini Robotics SDK with MuJoCo support.[^gemod] Gemini Robotics 1.5 and Gemini Robotics-ER 1.5 (September 2025) split the stack into an orchestrating embodied-reasoning model (ER 1.5) that plans, calls digital tools such as web search, and estimates progress, and a VLA (1.5) that "thinks before acting" by generating natural-language reasoning before motion; DeepMind reported skill transfer across the ALOHA 2, Apptronik Apollo, and Franka platforms without per-robot specialization.[^gem15]
Figure's Helix (February 2025) is a vision-language-action model for generalist humanoid control that uses a System 1 plus System 2 split. System 2 is an internet-pretrained 7B vision-language model running at 7 to 9 Hz for scene and language understanding, and System 1 is an 80M-parameter visuomotor policy that produces continuous actions at 200 Hz over a 35-degree-of-freedom humanoid upper body, including individual finger control. Figure reported training on roughly 500 hours of teleoperated demonstrations augmented with auto-generated hindsight instructions, and showed two robots collaborating on novel objects without task-specific programming.[^helix]
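The two-rate split can be illustrated with a simple scheduling sketch in which a slow module refreshes a latent at roughly 8 Hz while a fast module emits an action on every 200 Hz tick. This is a simplified, synchronous illustration: the function name, the placeholder latent and action, and the exact 8 Hz figure (chosen from within the reported 7 to 9 Hz range) are assumptions, and a real system would likely run the two modules asynchronously on separate hardware.

```python
def run_dual_rate(duration_s=1.0, fast_hz=200, slow_hz=8):
    """Count how often each module runs in a fixed-rate control loop."""
    ticks_per_slow_update = fast_hz // slow_hz
    latent, slow_calls, fast_calls = None, 0, 0
    for tick in range(int(duration_s * fast_hz)):
        if tick % ticks_per_slow_update == 0:
            # Placeholder for the slow vision-language module (System 2).
            latent = ("scene-latent", tick)
            slow_calls += 1
        # Placeholder for the fast visuomotor policy (System 1), which
        # conditions on the most recent latent at every control tick.
        _action = (latent, tick)
        fast_calls += 1
    return slow_calls, fast_calls


slow_calls, fast_calls = run_dual_rate()
```

In this sketch the fast policy reuses each latent for 25 consecutive ticks, which is why it must stay reactive to fresh sensor input on its own.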
RDT-1B (Robotics Diffusion Transformer, October 2024, Tsinghua University) is a 1.2-billion-parameter diffusion foundation model for bimanual manipulation, described by its authors as the largest diffusion-based foundation model for robotic manipulation at release. It was pretrained on a collection of 46 datasets with more than one million episodes and fine-tuned on a self-collected bimanual dataset, and the authors report zero-shot generalization to unseen objects and few-shot learning of new skills from one to five demonstrations.[^rdt] SmolVLA (June 2025, Hugging Face) is a compact 450M-parameter open VLA trained on community-contributed LeRobot datasets. It pairs a SmolVLM-2 backbone (SigLIP vision encoder plus a SmolLM2 language model) with a roughly 100M-parameter flow-matching action expert, targets consumer-grade hardware, and uses an asynchronous inference scheme that Hugging Face reports cuts task time by about 30 percent on average; the company states it matches or exceeds much larger models on its benchmarks.[^smolvla]
| Model | Year | Group | Type |
|---|---|---|---|
| DQN | 2013/2015 | DeepMind | Deep Q-learning, Atari |
| AlphaGo / AlphaZero | 2016/2017 | DeepMind | Self-play planning |
| OpenAI Five | 2018 | OpenAI | Dota 2 PPO at scale |
| AlphaStar | 2019 | DeepMind | StarCraft II |
| MuZero | 2020 | DeepMind | Model-based planning |
| DreamerV3 | 2023 | DeepMind | World-model RL |
| RT-1 | Dec 2022 | Google | Robotics transformer, 35M |
| Diffusion Policy | Mar 2023 | Columbia, Toyota | Diffusion control |
| ACT / Aloha | Apr 2023 | Stanford | Bimanual imitation |
| RT-2 | Jul 2023 | Google DeepMind | First VLA at scale |
| RT-X | Oct 2023 | Open X-Embodiment collaboration | Cross-embodiment |
| Octo | May 2024 | Berkeley, Stanford | Open generalist policy |
| OpenVLA | Jun 2024 | Stanford, Berkeley | Open-source 7B VLA |
| pi-0 | Oct 2024 | Physical Intelligence | Flow-matching VLA, 3.3B |
| RDT-1B | Oct 2024 | Tsinghua University | Bimanual diffusion VLA, 1.2B |
| Helix | Feb 2025 | Figure | Humanoid VLA, 7B + 80M, 200 Hz |
| Gemini Robotics | Mar 2025 | Google DeepMind | Gemini 2.0 with actions |
| GR00T N1 | Mar 2025 | NVIDIA | Open humanoid foundation model |
| pi-0.5 | Apr 2025 | Physical Intelligence | Open-world VLA |
| SmolVLA | Jun 2025 | Hugging Face | Compact open VLA, 450M |
| GR00T N1.5 | Jun 2025 | NVIDIA | 3B humanoid VLA, FLARE |
| Gemini Robotics On-Device | Jun 2025 | Google DeepMind | On-robot, fine-tunable VLA |
| Gemini Robotics 1.5 / ER 1.5 | Sep 2025 | Google DeepMind | Agentic VLA plus reasoning |
Large robot datasets are typically collected by human teleoperation, scripted policies, or online RL agents. The table below lists widely cited resources.
| Resource | Year | Scale or scope | Notes |
|---|---|---|---|
| RoboNet | 2019 | 15M video frames | Multi-robot, multi-lab |
| Bridge Data | 2022/2023 | ~60K demos | Single-arm tabletop |
| RT-1 dataset | 2022 | ~130K demos, 700 tasks | Google kitchen environments |
| Open X-Embodiment | 2023 | 60 datasets, 22 embodiments | Cross-institution union |
| DROID | 2024 | 76K demos, 350 hours | 13 institutions, Franka arms |
| Aloha and Mobile Aloha | 2023/2024 | Teleop bimanual | Low-cost rig |
| BEHAVIOR-1K | 2023 | 1,000 household tasks | Stanford simulation |
| RLBench | 2019 | 100 tasks | CoppeliaSim |
| Meta-World | 2019 | 50 manipulation tasks | Multi-task RL benchmark |
| LIBERO | 2023 | Lifelong learning suite | Procedural tasks |
| ManiSkill | 2021 | Generalizable manipulation | SAPIEN simulator |
| SIMPLER | 2024 | Sim-to-real evaluation | Reproduces real benchmarks |
Simulators feeding these benchmarks include NVIDIA Isaac Sim and Isaac Gym, MuJoCo maintained by Google DeepMind, PyBullet, RoboCasa, Habitat for embodied navigation, and AI2-THOR. Sim-to-real transfer commonly relies on domain randomization (Tobin et al., 2017), which trains policies across randomized visual and dynamics parameters so that real-world conditions fall inside the training distribution.[^domainrand]
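Domain randomization amounts to resampling simulator parameters at the start of every training episode. The sketch below shows that sampling step only; the parameter names and ranges are hypothetical and would in practice be mapped onto a specific simulator's configuration.

```python
import random

# Hypothetical randomization ranges for visual and dynamics parameters.
RANGES = {
    "friction": (0.5, 1.5),
    "mass_scale": (0.8, 1.2),
    "light_intensity": (0.3, 1.0),
    "camera_yaw_deg": (-5.0, 5.0),
}


def sample_episode_params(rng, ranges=RANGES):
    """Draw one independent uniform sample per parameter for a new episode."""
    return {name: rng.uniform(lo, hi) for name, (lo, hi) in ranges.items()}


rng = random.Random(0)
episode_params = [sample_episode_params(rng) for _ in range(3)]
```

Wider ranges make it more likely that real hardware falls inside the sampled distribution, at the cost of a harder training problem.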
Robotics models target several categories of skills. Manipulation covers pick-and-place, peg-in-hole, tool use, cloth folding, and dexterous in-hand reorientation. Navigation includes point-goal, object-goal, and instruction-following tasks. Bimanual manipulation, which the Aloha platform popularized, requires synchronized two-arm control. Locomotion spans flat ground walking, dynamic running, and recovery from disturbances. Many recent VLAs are deployed on humanoid robot platforms, listed below alongside quadrupeds and arms.
| Embodiment | Company or lab | Type |
|---|---|---|
| Spot | Boston Dynamics | Quadruped |
| ANYmal | ANYbotics, ETH Zurich | Quadruped |
| Go2 / B2 | Unitree | Quadruped |
| Atlas | Boston Dynamics | Humanoid |
| Optimus | Tesla | Humanoid |
| Figure 02 / Figure 03 | Figure | Humanoid |
| NEO | 1X Technologies | Humanoid |
| H1 / G1 | Unitree | Humanoid |
| Apollo | Apptronik | Humanoid |
| Phoenix | Sanctuary AI | Humanoid |
| Aloha 2 | Stanford, Google | Bimanual stationary |
| Franka Panda | Franka Robotics | 7-DoF arm |
The vision-language-action model recipe inherits a pretrained vision-language backbone, attaches a robot action head, and fine-tunes on demonstration data. RT-2 and OpenVLA discretize each action dimension into bins and predict them as tokens, reusing the language model's softmax head. pi-0 instead pairs PaliGemma with a flow-matching expert that emits continuous trajectories, giving 50 Hz control with smoother motion, an approach also taken by SmolVLA's action expert.[^pi0][^smolvla] Helix and GR00T N1 split inference between a slower internet-pretrained VLM for scene understanding (System 2) and a fast visuomotor policy (System 1) that runs the high-rate control loop.[^helix][^groot1] Across these systems the design pattern is the same: web-scale pretraining supplies general world knowledge, and a smaller robot-specific fine-tune supplies grounded motor control.
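The binning step used by token-based VLAs can be sketched as a pair of functions that map each continuous action dimension to an integer bin and back. This is a minimal illustration assuming uniform bins over fixed per-dimension bounds and a default of 256 bins; the function names, the bounds, and the bin-center decoding are illustrative assumptions, not the exact scheme of any particular model.

```python
def tokenize(action, low, high, n_bins=256):
    """Map each continuous action dimension to an integer bin index."""
    tokens = []
    for a, lo, hi in zip(action, low, high):
        a = min(max(a, lo), hi)  # clamp to the supported range
        frac = (a - lo) / (hi - lo)
        tokens.append(min(int(frac * n_bins), n_bins - 1))
    return tokens


def detokenize(tokens, low, high, n_bins=256):
    """Map bin indices back to continuous values at the bin centers."""
    return [lo + (t + 0.5) * (hi - lo) / n_bins
            for t, lo, hi in zip(tokens, low, high)]


low, high = [-1.0, -1.0, -1.0], [1.0, 1.0, 1.0]
tokens = tokenize([-1.0, 0.0, 1.0], low, high)
recovered = detokenize(tokens, low, high)
```

The round trip loses at most half a bin width per dimension, which is the quantization error that continuous flow-matching or diffusion heads avoid.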
Several trends define the 2024 and 2025 wave of robotics models. Humanoid robot development has surged with commercial deliveries from Figure, 1X, Agility Robotics, and Apptronik, plus ongoing programs at Tesla and Boston Dynamics. Generalist VLAs are replacing per-task policies, supported by collaborations such as Open X-Embodiment and by open releases such as OpenVLA, GR00T, SmolVLA, and the openpi pi-0 weights. Low-cost data-collection rigs including Aloha, Mobile Aloha, and the Universal Manipulation Interface (UMI) lower the cost of data collection. Diffusion and flow-matching action heads handle multimodal action distributions that a classical mean-squared-error loss would average into a single, possibly invalid, action. Agentic stacks such as Gemini Robotics 1.5 add an explicit reasoning model that plans and calls external tools before the VLA acts.[^gem15] Edge inference platforms such as NVIDIA Jetson Thor and Orin, together with on-robot models like Gemini Robotics On-Device, run multibillion-parameter policies on the robot itself, which removes the need for a cloud round trip during control.[^gemod]
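The problem that multimodal action heads address can be seen in a tiny numerical example. Suppose half of the demonstrations steer left (-1.0) and half steer right (+1.0) around an obstacle; the data and the candidate grid below are hypothetical. The single prediction that minimizes mean-squared error is the average of the two modes, which drives straight at the obstacle.

```python
def mse(prediction, targets):
    """Mean-squared error of one constant prediction against all targets."""
    return sum((prediction - t) ** 2 for t in targets) / len(targets)


# Hypothetical demonstrations: two equally valid ways around an obstacle.
demos = [-1.0] * 50 + [1.0] * 50

# Search a coarse grid of constant predictions for the MSE minimizer.
candidates = [i / 10 for i in range(-10, 11)]
best = min(candidates, key=lambda p: mse(p, demos))
```

Here `best` lands at the midpoint between the two modes even though no demonstrator ever took that action, whereas a generative head can sample either mode.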
Robotics models reach production in several sectors. Industrial assembly cells use learned bin picking and insertion. Warehouse logistics is led by Amazon Robotics, which runs hundreds of thousands of mobile platforms. Service robots address cleaning, last-mile delivery, and food preparation. Surgical robotics combines teleoperation with learned assistance, as in the Intuitive da Vinci platform. Autonomous driving shares perception and planning components with robotics and is an adjacent field. Agriculture uses fruit-picking arms and autonomous tractors. Search and rescue, prosthetics, and exoskeletons round out the application stack.
Despite rapid progress, several issues remain open. The sim-to-real gap means policies trained in simulation often degrade on hardware, and domain randomization adds variance to training. Sample inefficiency is acute for real-world RL because robot rollouts are slow and risk hardware damage. Generalization across embodiments is partial: a policy trained on a Franka Panda may not transfer to a different gripper without fine-tuning, which is one motivation for cross-embodiment efforts like Open X-Embodiment and GR00T. Safety guarantees are weak compared with classical controllers because neural policies are not easily certified. Real-time inference imposes latency budgets that constrain model size on the robot. Data collection is expensive: 76,000 DROID trajectories required 12 months and 13 institutions.[^droid] Evaluation reproducibility is hampered by hardware variation, which has motivated reproducible suites like SIMPLER. Vendor performance figures cited above are self-reported and have not always been independently replicated.