Robotics Models
Last reviewed
May 11, 2026
Sources
16 citations
Review status
Source-backed
Revision
v2 · 2,223 words
Robotics models are machine learning systems that give robots the ability to perceive their surroundings, plan actions, and execute motor control. They span perception networks (vision, depth, lidar), state estimation, motion planning, reinforcement learning policies, and the more recent class of vision-language-action models (VLAs) such as RT-2 and OpenVLA. The field has shifted from classical sense-plan-act pipelines, where each module is engineered by hand, toward end-to-end learned policies trained on large robot demonstration datasets.
See also: Reinforcement Learning Models, Multimodal Models, and Robotics Tasks
A robotics model maps sensor observations to actuator commands. Classical robotics decomposes this into modules: a perception stack builds a scene representation, a planner produces a trajectory in configuration space, and a controller tracks that trajectory with feedback loops. Learning-based robotics replaces some or all of these stages with neural networks trained from data, including raw demonstrations, simulated rollouts, or rewards collected during interaction. Modern robotics models typically share components with computer vision and natural language processing, reusing transformer and diffusion model architectures developed for images and text.
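The classical decomposition described above can be illustrated with a toy one-dimensional example. This is only a sketch under stated assumptions: the function names (`perceive`, `plan_trajectory`, `control`, `run_pipeline`), the identity perception step, and the proportional gain are all hypothetical and chosen for illustration, not taken from any particular robotics stack.

```python
def perceive(sensor_reading: float) -> float:
    """Perception: turn a raw reading into a state estimate (identity here)."""
    return sensor_reading


def plan_trajectory(start: float, goal: float, n_waypoints: int) -> list:
    """Planning: evenly spaced waypoints from start to goal."""
    step = (goal - start) / n_waypoints
    return [start + step * (i + 1) for i in range(n_waypoints)]


def control(position: float, waypoint: float, gain: float = 0.5) -> float:
    """Control: proportional feedback toward the current waypoint."""
    return gain * (waypoint - position)


def run_pipeline(start: float, goal: float,
                 n_waypoints: int = 5, steps_per_waypoint: int = 20) -> float:
    """Sense, plan once, then track each waypoint with a feedback loop."""
    position = start
    for waypoint in plan_trajectory(perceive(position), goal, n_waypoints):
        for _ in range(steps_per_waypoint):
            # Toy dynamics: the command is applied directly as a displacement.
            position += control(perceive(position), waypoint)
    return position


final_position = run_pipeline(start=0.0, goal=1.0)
```

A learned end-to-end policy would replace all three functions with a single network that maps the sensor reading directly to the command.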
The push toward general-purpose policies accelerated after 2022, when Google's RT-1 showed that a single transformer trained on around 130,000 demonstrations of 700 tasks could control a mobile manipulator at 3 Hz (Brohan et al., 2022). RT-2 then demonstrated that fine-tuning a vision-language model like PaLI-X on robot trajectories produced emergent semantic reasoning, treating action tokens like words (Brohan et al., 2023). By 2025 humanoid platforms from Figure, 1X, Tesla, and Apptronik began running such VLA models in commercial pilots.
Most robotics stacks combine the following building blocks, regardless of whether they are hand-engineered or learned end-to-end:
Reinforcement learning (RL) trains a policy to maximize cumulative reward. Robotic RL typically uses model-free methods such as Proximal Policy Optimization (PPO), an on-policy policy gradient algorithm introduced by Schulman et al. in 2017, and Soft Actor-Critic (SAC), an off-policy actor-critic algorithm from Haarnoja et al. in 2018. Both algorithms have been used for legged locomotion on quadrupeds like ANYmal and humanoid balance control. Methods built on Q-learning are also common: DQN for discrete action spaces, and DDPG and TD3 for low-dimensional continuous control. Trust Region Policy Optimization (TRPO) preceded PPO with a tighter monotonic improvement guarantee.
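As an illustration of the PPO update, the sketch below computes the clipped surrogate objective from Schulman et al. (2017) for a batch of samples: the mean of min(r·A, clip(r, 1−ε, 1+ε)·A), where r is the probability ratio between the new and old policy and A is the advantage estimate. The function name and the default `clip_eps=0.2` are assumptions made for this example. A real implementation would compute this on tensors and add value and entropy terms.

```python
import math


def ppo_clip_objective(new_logps, old_logps, advantages, clip_eps=0.2):
    """Mean clipped surrogate objective (a quantity to be maximized)."""
    total = 0.0
    for new_lp, old_lp, adv in zip(new_logps, old_logps, advantages):
        ratio = math.exp(new_lp - old_lp)  # pi_new(a|s) / pi_old(a|s)
        clipped = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
        # Taking the minimum removes any incentive to move the ratio
        # outside [1 - clip_eps, 1 + clip_eps] in the favorable direction.
        total += min(ratio * adv, clipped * adv)
    return total / len(advantages)


# With identical policies the ratio is 1, so the objective is the mean advantage.
unchanged = ppo_clip_objective([0.0, 0.0], [0.0, 0.0], [1.0, -1.0])
```

When the ratio doubles on a sample with positive advantage, the clipped term caps that sample's contribution at 1.2 times its advantage, which is what keeps each update close to the data-collecting policy.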
Model-based RL learns a predictive model of the environment and plans inside it. Dreamer and its successor DreamerV3 (Hafner et al., 2019, 2023) learn latent dynamics models and act through imagined rollouts. MuZero (Schrittwieser et al., 2020) extended AlphaZero-style planning to environments without known rules, mastering Atari, Go, chess, and shogi with a single algorithm.
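The idea of planning inside a model can be sketched with a deliberately simple example. This is not Dreamer's or MuZero's method (those use learned latent dynamics with imagined rollouts and tree search respectively). It is a generic exhaustive-search planner over a stand-in model, replanning after every step. The one-dimensional dynamics, the three-action set, and all names are hypothetical.

```python
from itertools import product

ACTIONS = (-1.0, 0.0, 1.0)  # hypothetical discrete action set


def learned_model(state: float, action: float) -> float:
    """Stand-in for a learned dynamics model: predicts the next state."""
    return state + action


def plan_first_action(state: float, goal: float, horizon: int = 5) -> float:
    """Score every action sequence inside the model; return the best first action."""
    best_cost, best_first = float("inf"), 0.0
    for seq in product(ACTIONS, repeat=horizon):
        s, cost = state, 0.0
        for a in seq:
            s = learned_model(s, a)      # imagined rollout, no real interaction
            cost += (s - goal) ** 2      # squared distance to the goal
        if cost < best_cost:
            best_cost, best_first = cost, seq[0]
    return best_first


def run_mpc(state: float, goal: float, steps: int = 5) -> float:
    """Replan at every step and execute only the first planned action."""
    for _ in range(steps):
        # Here the "real" environment is the model itself; on hardware the
        # two differ, which is why replanning each step matters.
        state = learned_model(state, plan_first_action(state, goal))
    return state


final_state = run_mpc(state=0.0, goal=3.0)  # moves right for three steps, then holds
```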
Imitation learning trains policies from expert demonstrations. Behavioral cloning fits a supervised model that maps observations to expert actions. DAgger (Ross et al., 2011) addresses compounding errors by repeatedly querying the expert on states the learner visits. Generative Adversarial Imitation Learning, or GAIL (Ho and Ermon, 2016), frames imitation as a GAN-style game between the policy and a discriminator. Inverse reinforcement learning recovers the reward function the expert appears to optimize, then trains a policy against that reward.
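The DAgger loop can be sketched as follows: start from behavioral cloning on expert rollouts, then repeatedly roll out the learner, ask the expert to label the states the learner actually visited, aggregate, and refit. Everything here is a toy assumption: the proportional-controller expert, the scalar linear learner, the dynamics, and the function names are hypothetical. Because the expert and the learner are both linear, the fit recovers the expert exactly, so the example shows the structure of the loop rather than its benefit.

```python
import random


def expert_action(x: float) -> float:
    """Hypothetical expert: a proportional controller driving x toward 0."""
    return -0.5 * x


def fit_linear(data) -> float:
    """Behavioral cloning step: least-squares fit of a = w * x."""
    num = sum(x * a for x, a in data)
    den = sum(x * x for x, _ in data)
    return num / den if den > 0 else 0.0


def rollout(w: float, x0: float, steps: int = 10) -> list:
    """States visited when running the policy a = w * x with dynamics x' = x + a."""
    states, x = [], x0
    for _ in range(steps):
        states.append(x)
        x = x + w * x
    return states


def dagger(iterations: int = 5, seed: int = 0):
    rng = random.Random(seed)
    # Iteration 0: clone the expert on states from the expert's own rollout.
    data = [(x, expert_action(x)) for x in rollout(-0.5, 1.0)]
    w = fit_linear(data)
    for _ in range(iterations):
        visited = rollout(w, rng.uniform(-2.0, 2.0))             # learner acts
        data.extend((x, expert_action(x)) for x in visited)      # expert relabels
        w = fit_linear(data)                                     # refit on aggregate
    return w, len(data)


learned_w, dataset_size = dagger()
```

The key difference from plain behavioral cloning is the `visited` line: training states come from the learner's own distribution, which is what counters compounding errors.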
Action Chunking with Transformers (ACT), released alongside the Aloha low-cost teleoperation rig by Zhao et al. in 2023, predicts short sequences of actions rather than single steps and reached 80 to 90 percent success on tasks like opening condiment cups with only ten minutes of demonstration data. Diffusion Policy (Chi et al., 2023) generates action sequences through conditional denoising and reported a 46.9 percent average improvement over prior methods across 12 tasks drawn from 4 manipulation benchmarks.
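Action chunking can be sketched in a few lines. At each timestep the policy predicts the next `chunk_size` actions, and overlapping predictions for the current timestep are combined with exponentially decaying weights, oldest first, in the spirit of the temporal ensembling described for ACT. The scalar actions, the placeholder `policy` callable, and the value of `m` are assumptions for illustration. Real actions are joint-space vectors and the policy is a transformer.

```python
import math


def temporal_ensemble(predictions, m: float = 0.1) -> float:
    """Weighted average of predictions for one timestep, oldest first.

    Weights follow exp(-m * i), so i = 0 (the oldest prediction) weighs most.
    """
    weights = [math.exp(-m * i) for i in range(len(predictions))]
    return sum(w * p for w, p in zip(weights, predictions)) / sum(weights)


def run_chunked_policy(policy, horizon: int, chunk_size: int, m: float = 0.1) -> list:
    """Query the policy every step and execute the ensembled action."""
    # buffer[t] collects every prediction made so far for timestep t.
    buffer = [[] for _ in range(horizon + chunk_size)]
    executed = []
    for t in range(horizon):
        chunk = policy(t, chunk_size)          # actions for t .. t + chunk_size - 1
        for offset, action in enumerate(chunk):
            buffer[t + offset].append(action)
        executed.append(temporal_ensemble(buffer[t], m))
    return executed


# Hypothetical policy whose prediction for timestep t is always t.
def toy_policy(t: int, k: int) -> list:
    return [float(t + i) for i in range(k)]


actions = run_chunked_policy(toy_policy, horizon=6, chunk_size=3)
```

Predicting chunks shortens the effective decision horizon, and the ensembling smooths the transitions between successive chunks.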
Self-supervised approaches learn dynamics or representations from unlabeled video and interaction logs. V-JEPA (Bardes et al., 2024) predicts masked features in video to learn embeddings useful for downstream control. World models such as Dreamer and Google DeepMind's Genie 2 simulate plausible futures conditioned on actions, supporting planning and data augmentation.
Foundation models pretrain on large mixed datasets and adapt to many tasks. The current generation of robotics foundation models is dominated by VLAs that take an image and a language instruction and emit either tokenized or continuous actions. RT-1 used a FiLM-conditioned EfficientNet plus TokenLearner and Transformer; RT-2 swapped in PaLI-X (55B) and PaLM-E (12B). RT-X and the Open X-Embodiment collaboration (October 2023) trained on 60 datasets from 22 robot embodiments across 21 institutions and showed positive transfer across platforms.
OpenVLA (Kim et al., June 2024) is a 7-billion-parameter open-source VLA combining SigLIP, DINOv2, and Llama 2 that outperformed RT-2-X by 16.5 absolute points despite using roughly 7 times fewer parameters. Octo (May 2024) is an open-source generalist trained on 800,000 trajectories from Open X-Embodiment.
Physical Intelligence's pi-0 (October 2024) augments PaliGemma with flow matching to emit 50 Hz continuous action trajectories. Figure's Helix (February 2025) is a System 1 plus System 2 VLA that controls a 35-DoF humanoid upper body at 200 Hz, and Google DeepMind's Gemini Robotics (March 2025) adds physical action outputs to Gemini 2.0 and ships with an embodied reasoning variant.
| Model | Year | Group | Type |
|---|---|---|---|
| DQN | 2013/2015 | DeepMind | Deep Q-learning, Atari |
| AlphaGo / AlphaZero | 2016/2017 | DeepMind | Self-play planning |
| OpenAI Five | 2018 | OpenAI | Dota 2 PPO at scale |
| AlphaStar | 2019 | DeepMind | StarCraft II |
| MuZero | 2020 | DeepMind | Model-based planning |
| DreamerV3 | 2023 | DeepMind | World-model RL |
| RT-1 | Dec 2022 | Google | Robotics transformer |
| Diffusion Policy | Mar 2023 | Columbia, Toyota | Diffusion control |
| ACT / Aloha | Apr 2023 | Stanford | Bimanual imitation |
| RT-2 | Jul 2023 | Google DeepMind | First VLA at scale |
| RT-X | Oct 2023 | RT-X collaboration | Cross-embodiment |
| Octo | May 2024 | Berkeley, Stanford, CMU | Open generalist policy |
| OpenVLA | Jun 2024 | Stanford, Berkeley | Open-source 7B VLA |
| pi-0 | Oct 2024 | Physical Intelligence | Flow-matching VLA |
| Helix | Feb 2025 | Figure | Humanoid VLA at 200 Hz |
| Gemini Robotics | Mar 2025 | Google DeepMind | Gemini 2.0 with actions |
Large robot datasets are typically collected by human teleoperation, scripted policies, or online RL agents. The table below lists widely cited resources.
| Resource | Year | Scale | Notes |
|---|---|---|---|
| RoboNet | 2019 | 15M video frames | Multi-robot, multi-lab |
| Bridge Data | 2022/2023 | ~60K demos | Single-arm tabletop |
| RT-1 dataset | 2022 | ~130K demos, 700 tasks | Google kitchen environments |
| Open X-Embodiment | 2023 | 60 datasets, 22 embodiments | Cross-institution union |
| DROID | 2024 | 76K demos, 350 hours | 13 institutions, Franka arms |
| Aloha and Mobile Aloha | 2023/2024 | Teleop bimanual | Low-cost rig |
| BEHAVIOR-1K | 2023 | 1,000 household tasks | Stanford simulation |
| RLBench | 2019 | 100 tasks | CoppeliaSim |
| Meta-World | 2019 | 50 manipulation tasks | Multi-task RL benchmark |
| LIBERO | 2023 | Lifelong learning suite | Procedural tasks |
| ManiSkill | 2021 | Generalizable manipulation | SAPIEN simulator |
| SIMPLER | 2024 | Sim-to-real evaluation | Reproduces real benchmarks |
Simulators feeding these benchmarks include NVIDIA Isaac Sim and Isaac Gym, MuJoCo maintained by Google DeepMind, PyBullet, RoboCasa, Habitat for embodied navigation, and AI2-THOR. Sim-to-real transfer commonly relies on domain randomization (Tobin et al., 2017), which trains policies across randomized visual and dynamics parameters so that real-world physics falls inside the training distribution.
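Domain randomization amounts to resampling simulator parameters at the start of every training episode. The sketch below shows only that sampling step. The parameter names and ranges are hypothetical and not tied to any specific simulator's API. In practice the sampled values would be written into the simulator before each reset.

```python
import random
from dataclasses import dataclass


@dataclass
class SimParams:
    friction: float           # contact friction coefficient
    mass_scale: float         # multiplier on nominal object mass
    light_intensity: float    # relative scene brightness
    camera_jitter_deg: float  # random camera tilt in degrees


# Hypothetical ranges, chosen wide enough that the real system plausibly
# falls somewhere inside them.
RANGES = {
    "friction": (0.5, 1.5),
    "mass_scale": (0.8, 1.2),
    "light_intensity": (0.3, 1.0),
    "camera_jitter_deg": (-5.0, 5.0),
}


def sample_params(rng: random.Random) -> SimParams:
    """Draw one independent uniform sample per parameter."""
    return SimParams(**{name: rng.uniform(lo, hi) for name, (lo, hi) in RANGES.items()})


def randomized_episodes(n: int, seed: int = 0) -> list:
    """One parameter set per training episode."""
    rng = random.Random(seed)
    return [sample_params(rng) for _ in range(n)]


episode_params = randomized_episodes(100)
```

Wider ranges improve the chance of covering the real system but make the learning problem harder, so the ranges themselves become a tuning decision.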
Robotics models target several categories of skills. Manipulation covers pick-and-place, peg-in-hole, tool use, cloth folding, and dexterous in-hand reorientation. Navigation includes point-goal, object-goal, and instruction-following tasks. Bimanual manipulation, which the Aloha platform popularized, requires synchronized two-arm control. Locomotion spans flat ground walking, dynamic running, and recovery from disturbances.
| Embodiment | Company or lab | Type |
|---|---|---|
| Spot | Boston Dynamics | Quadruped |
| ANYmal | ANYbotics, ETH Zurich | Quadruped |
| Go2 / B2 | Unitree | Quadruped |
| Atlas | Boston Dynamics | Humanoid |
| Optimus | Tesla | Humanoid |
| Figure 02 / Figure 03 | Figure | Humanoid |
| NEO | 1X Technologies | Humanoid |
| H1 / G1 | Unitree | Humanoid |
| Apollo | Apptronik | Humanoid |
| Phoenix | Sanctuary AI | Humanoid |
| Aloha 2 | Stanford, Google | Bimanual stationary |
| Franka Panda | Franka Robotics | 7-DoF arm |
The VLA recipe inherits a pretrained vision-language backbone, attaches a robot action head, and fine-tunes on demonstration data. RT-2 and OpenVLA discretize each action dimension into bins and predict them as tokens, reusing the language model's softmax head. Pi-0 instead pairs PaliGemma with a flow-matching expert that emits continuous trajectories, giving 50 Hz control with smoother motion. Helix splits inference between a slower internet-pretrained VLM for scene understanding (System 2) and a fast visuomotor policy (System 1) that runs the 200 Hz control loop. Across these systems the design pattern is the same: web-scale pretraining supplies general world knowledge, and a smaller robot-specific fine-tune supplies grounded motor control.
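The binning step used by token-based VLAs can be sketched as a pair of functions that map a continuous action vector to integer tokens and back. This is a minimal illustration with uniform bins: the function names and the per-dimension bounds are hypothetical, and 256 bins is used as the default here. Published systems differ in how they choose the bounds and how they map bin indices onto vocabulary entries.

```python
def tokenize(action, low, high, n_bins: int = 256) -> list:
    """Map each action dimension to a bin index in [0, n_bins - 1]."""
    tokens = []
    for a, lo, hi in zip(action, low, high):
        a = min(max(a, lo), hi)                 # clip to the valid range
        frac = (a - lo) / (hi - lo)             # normalize to [0, 1]
        tokens.append(min(int(frac * n_bins), n_bins - 1))
    return tokens


def detokenize(tokens, low, high, n_bins: int = 256) -> list:
    """Map bin indices back to continuous values at the bin centers."""
    return [lo + (t + 0.5) * (hi - lo) / n_bins
            for t, lo, hi in zip(tokens, low, high)]


low, high = [-1.0, -1.0, -1.0], [1.0, 1.0, 1.0]
tokens = tokenize([-1.0, 0.0, 1.0], low, high)   # [0, 128, 255]
recovered = detokenize(tokens, low, high)
```

The round trip loses at most half a bin width per dimension, which is the quantization error that continuous action heads such as flow matching avoid.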
Several trends define the 2024 and 2025 wave of robotics models. Humanoid robotics has surged with commercial deliveries from Figure, 1X, Agility Robotics, and Apptronik, plus ongoing programs at Tesla and Boston Dynamics. Generalist VLAs are replacing per-task policies, supported by collaborations such as Open X-Embodiment. Large-scale teleoperation rigs including Aloha, Mobile Aloha, and the Universal Manipulation Interface (UMI) lower the cost of data collection. Diffusion and flow-matching action heads handle multimodal action distributions that classical mean-squared-error losses smooth out. Edge inference platforms such as NVIDIA Jetson Thor and Orin run multibillion-parameter policies on the robot itself, which removes the need for a cloud round trip during control.
Robotics models reach production in several sectors. Industrial assembly cells use learned bin picking and insertion. Warehouse logistics is led by Amazon Robotics, which runs hundreds of thousands of mobile platforms. Service robots address cleaning, last-mile delivery, and food preparation. Surgical robotics combines teleoperation with learned assistance, as in the Intuitive da Vinci platform. Autonomous driving shares perception and planning components with robotics and is an adjacent field. Agriculture uses fruit-picking arms and autonomous tractors. Search and rescue, prosthetics, and exoskeletons round out the application stack.
Despite rapid progress, several issues remain open. The sim-to-real gap means policies trained in simulation often degrade on hardware, and domain randomization adds variance to training. Sample inefficiency is acute for real-world RL because robot rollouts are slow and risk hardware damage. Generalization across embodiments is partial: a policy trained on a Franka Panda may not transfer to a different gripper without fine-tuning. Safety guarantees are weak compared with classical controllers because neural policies are not easily certified. Real-time inference imposes latency budgets that constrain model size on the robot. Data collection is expensive: 76,000 DROID trajectories required 12 months and 13 institutions. Evaluation reproducibility is hampered by hardware variation, which has motivated reproducible suites like SIMPLER.