Robotics Models

Updated May 31, 2026

Robotics models are machine learning systems that give robots the ability to perceive their surroundings, plan actions, and execute motor control. They span perception networks (vision, depth, lidar), state estimation, motion planning, reinforcement learning policies, and the more recent class of vision-language-action models (VLAs) such as RT-2 and OpenVLA. The field has shifted from classical sense-plan-act pipelines, where each module is engineered by hand, toward end-to-end learned policies trained on large robot demonstration datasets. This article catalogs the major learning-based robotics models; for the general field see robotics and embodied AI, and for the dominant model class see vision-language-action model.

Overview

A robotics model maps sensor observations to actuator commands. Classical robotics decomposes this into modules: a perception stack builds a scene representation, a planner produces a trajectory in configuration space, and a controller tracks that trajectory with feedback loops. Learning-based robotics replaces some or all of these stages with neural networks trained from data, including raw demonstrations, simulated rollouts, or rewards collected during interaction. Modern robotics models typically share components with computer vision and natural language processing, reusing transformer and diffusion model architectures developed for images and text.
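As a rough illustration of the modular decomposition described above, the following Python sketch runs a sense-plan-act loop for a hypothetical one-dimensional point robot. The function names (`perceive`, `plan`, `control`), the proportional gain, and the waypoint counts are assumptions chosen for illustration and do not come from any real robotics stack.

```python
def perceive(true_position):
    """Perception: a perfect sensor here; a real stack would fuse cameras, lidar, etc."""
    return true_position


def plan(start, goal, n_waypoints):
    """Planning: evenly spaced straight-line waypoints from start to goal."""
    return [start + (goal - start) * (i + 1) / n_waypoints for i in range(n_waypoints)]


def control(position, target, gain=0.5):
    """Control: proportional feedback toward the current waypoint."""
    return gain * (target - position)


def run_pipeline(start=0.0, goal=1.0, n_waypoints=5, steps_per_waypoint=10):
    """Run the hand-engineered pipeline and return the final position."""
    position = start
    for waypoint in plan(perceive(position), goal, n_waypoints):
        for _ in range(steps_per_waypoint):
            observed = perceive(position)
            position += control(observed, waypoint)
    return position


final_position = run_pipeline()
```

In a learned end-to-end policy, the three hand-written functions would be replaced by a single trained network that maps the observation directly to a command.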

The push toward general-purpose policies accelerated in late 2022, when Google's RT-1 showed that a single transformer trained on around 130,000 demonstrations of more than 700 tasks could control a mobile manipulator at 3 Hz (Brohan et al., 2022).[^rt1] RT-2 then demonstrated that fine-tuning a vision-language model like PaLI-X on robot trajectories, with actions represented as text tokens, produced emergent semantic reasoning (Brohan et al., 2023).[^rt2] By 2025 the field had converged on the vision-language-action model recipe, and humanoid platforms from Figure, 1X, Tesla, and Apptronik began running such VLA models in commercial pilots.

Core components

Most robotics stacks combine the following building blocks, regardless of whether they are hand-engineered or learned end-to-end:

Perception: RGB cameras, depth sensors, lidar, and proprioception (joint angles, torques) feed into encoders such as ResNet, ViT, or fused SigLIP and DINOv2 backbones.
State estimation: maps, occupancy grids, SLAM, object pose estimation, and tracking.
Planning: motion planners such as RRT and CHOMP, task-and-motion planners, or learned trajectory generators.
Policy: a function from observations to actions, expressed as a neural network in the learning setting.
Manipulation: grasping, pick-and-place, peg-in-hole, tool use, and dexterous in-hand control.
Locomotion: walking, running, and balance control for quadrupeds, bipeds, and wheeled bases.
Sim-to-real transfer: techniques that close the gap between simulated training environments and physical hardware.

Learning paradigms

Reinforcement learning

Reinforcement learning (RL) trains a policy to maximize cumulative reward. Robotic RL typically uses model-free methods such as the on-policy policy gradient algorithm Proximal Policy Optimization (PPO), introduced by Schulman et al. in 2017, and the off-policy actor-critic algorithm Soft Actor-Critic (SAC) from Haarnoja et al. in 2018. Both algorithms have been used for legged locomotion on quadrupeds like ANYmal and humanoid balance control. The value-based DQN suits discrete action spaces, while the actor-critic methods DDPG and TD3 extend Q-learning ideas to continuous actions. Trust Region Policy Optimization (TRPO) preceded PPO; it enforces an explicit KL-divergence trust region with a monotonic improvement guarantee, which PPO approximates with a simpler clipped objective.
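The clipped surrogate objective used by PPO can be sketched in a few lines of plain Python. This is a minimal illustration, assuming per-sample log-probabilities and advantage estimates have already been computed; the function name and the clip range of 0.2 are illustrative assumptions.

```python
import math


def ppo_clip_objective(new_logp, old_logp, advantages, clip_eps=0.2):
    """Mean clipped surrogate objective (to be maximized).

    new_logp / old_logp: log-probabilities of the taken actions under the
    current and the data-collecting policy. clip_eps is an assumed default.
    """
    total = 0.0
    for nl, ol, adv in zip(new_logp, old_logp, advantages):
        ratio = math.exp(nl - ol)  # probability ratio pi_new / pi_old
        clipped = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
        # Taking the minimum removes any incentive to push the ratio
        # outside the [1 - eps, 1 + eps] band.
        total += min(ratio * adv, clipped * adv)
    return total / len(advantages)


# A ratio of 2 with a positive advantage is capped at 1 + clip_eps.
capped = ppo_clip_objective([math.log(2.0)], [0.0], [1.0])
```

The clipping only limits updates that would improve the objective; when a large ratio coincides with a negative advantage, the unclipped (more pessimistic) term is kept.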

Model-based RL learns a predictive model of the environment and plans inside it. Dreamer and its successor DreamerV3 (Hafner et al., 2019, 2023) learn latent dynamics models and act through imagined rollouts. MuZero (Schrittwieser et al., 2020) extended AlphaZero-style planning to environments without known rules, mastering Atari, Go, chess, and shogi with a single algorithm.
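The idea of planning inside a model can be illustrated with a toy example. Dreamer trains an actor-critic on imagined rollouts and MuZero uses tree search; the sketch below substitutes a much simpler exhaustive search over short action sequences. The one-dimensional dynamics, the reward, and every name here are hypothetical stand-ins for a trained model.

```python
from itertools import product

ACTIONS = (-1.0, 0.0, 1.0)  # hypothetical discrete action set


def learned_model(state, action, goal):
    """Stand-in for a learned dynamics and reward model (not trained here)."""
    next_state = state + action
    reward = -abs(goal - next_state)
    return next_state, reward


def imagined_return(state, action_sequence, goal):
    """Roll the model forward without touching the real environment."""
    total = 0.0
    for action in action_sequence:
        state, reward = learned_model(state, action, goal)
        total += reward
    return total


def plan_first_action(state, goal, horizon=3):
    """Score every action sequence in imagination and return its first action."""
    best = max(
        product(ACTIONS, repeat=horizon),
        key=lambda seq: imagined_return(state, seq, goal),
    )
    return best[0]


first_action = plan_first_action(state=0.0, goal=3.0)
```

Executing only the first action and replanning at the next step is the usual receding-horizon pattern; exhaustive search is feasible here only because the action set and horizon are tiny.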

Imitation learning

Imitation learning trains policies from expert demonstrations. Behavioral cloning fits a supervised model that maps observations to expert actions. DAgger (Ross et al., 2011) addresses compounding errors by repeatedly querying the expert on states the learner visits. Generative Adversarial Imitation Learning, or GAIL (Ho and Ermon, 2016), frames imitation as a GAN-style game between the policy and a discriminator. Inverse reinforcement learning recovers the reward function the expert appears to optimize, then trains a policy against that reward.
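The difference between behavioral cloning and DAgger is mostly in where the training states come from. The sketch below is a toy version under strong assumptions: a one-dimensional state, a hypothetical linear expert, and a learner with a single gain fitted by least squares. Because the toy expert is itself linear, the learner matches it exactly; the point is the data aggregation loop, not the result.

```python
def expert(state):
    """Hypothetical expert: drives the state toward zero."""
    return -0.5 * state


def fit_linear(dataset):
    """Least-squares fit of action = gain * state (the supervised step)."""
    numerator = sum(s * a for s, a in dataset)
    denominator = sum(s * s for s, _ in dataset)
    return numerator / denominator


def rollout(gain, start, steps):
    """States visited when acting with action = gain * state."""
    states, state = [], start
    for _ in range(steps):
        states.append(state)
        state = state + gain * state
    return states


def dagger(n_iters=3, start=1.0, steps=5):
    # Behavioral cloning: fit on states visited by the expert.
    dataset = [(s, expert(s)) for s in rollout(-0.5, start, steps)]
    gain = fit_linear(dataset)
    for _ in range(n_iters):
        # DAgger: visit states with the learner, label them with the expert,
        # aggregate, and refit on everything collected so far.
        visited = rollout(gain, start, steps)
        dataset += [(s, expert(s)) for s in visited]
        gain = fit_linear(dataset)
    return gain, len(dataset)


learned_gain, n_samples = dagger()
```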

Action Chunking with Transformers (ACT), released alongside the Aloha low-cost teleoperation rig by Zhao et al. in 2023, predicts short sequences of actions rather than single steps and reached 80 to 90 percent success on tasks like opening condiment cups with only ten minutes of demonstration data.[^act] Diffusion Policy (Chi et al., 2023) generates action sequences through conditional denoising and reported a 46.9 percent average improvement over prior methods across 12 tasks drawn from 4 robot manipulation benchmarks.[^dp]
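Action chunking changes how often the policy is queried rather than what it outputs per step. The following sketch, with a hypothetical `chunk_policy` and a one-dimensional state, executes each predicted chunk open-loop and counts policy queries; it is a simplified illustration and not the ACT implementation.

```python
def chunk_policy(observation, chunk_size):
    """Hypothetical policy: predicts chunk_size actions from one observation."""
    return [-0.1 * observation for _ in range(chunk_size)]


def run_chunked(start, total_steps, chunk_size):
    """Execute chunks back to back; return the actions taken and query count."""
    state, actions, queries = start, [], 0
    while len(actions) < total_steps:
        chunk = chunk_policy(state, chunk_size)
        queries += 1
        # Truncate the last chunk so exactly total_steps actions are executed.
        for action in chunk[: total_steps - len(actions)]:
            state += action
            actions.append(action)
    return actions, queries


actions, queries = run_chunked(start=1.0, total_steps=10, chunk_size=4)
```

With a chunk size of 4, ten control steps need three policy queries instead of ten, which shortens the effective decision horizon at the cost of reacting less often to new observations.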

World models and self-supervision

Self-supervised approaches learn dynamics or representations from unlabeled video and interaction logs. V-JEPA (Bardes et al., 2024) predicts masked features in video to learn embeddings useful for downstream control. World models such as Dreamer and Google DeepMind's Genie 2 simulate plausible futures conditioned on actions, supporting planning and data augmentation. NVIDIA's GR00T-Dreams blueprint applies the same idea to data generation, prompting a video world model with a single image and an instruction to synthesize training "dreams"; NVIDIA reported using it to produce data for GR00T N1.5 in about 36 hours rather than months of manual collection.[^groot15]

Foundation models for robotics

Foundation models pretrain on large mixed datasets and adapt to many tasks. The current generation of robotics foundation models is dominated by vision-language-action models that take an image and a language instruction and emit either tokenized or continuous actions. RT-1 used a FiLM-conditioned EfficientNet plus TokenLearner and Transformer; RT-2 swapped in PaLI-X (55B) or PaLM-E (12B).[^rt2] RT-X and the Open X-Embodiment collaboration (October 2023) trained on 60 datasets from 22 robot embodiments across 21 institutions and showed positive transfer across platforms.[^oxe] OpenVLA (Kim et al., June 2024) is a 7-billion-parameter open-source VLA combining SigLIP, DINOv2, and Llama 2, trained on 970,000 Open X-Embodiment demonstrations, that the authors report outperformed RT-2-X (55B) by 16.5 absolute points in task success across 29 tasks while using roughly 7 times fewer parameters.[^openvla] Octo (May 2024) is an open-source generalist trained on 800,000 trajectories from Open X-Embodiment.[^octo] The sections below detail the leading robot foundation models released from 2022 through 2025.

Robot foundation and VLA models

This section surveys the major robot foundation models in roughly chronological order. Most are vision-language-action models; the vision-language-action model article covers the shared architecture in depth.

Google RT-1, RT-2, and RT-X

RT-1 (Robotics Transformer 1, December 2022) is a 35M-parameter transformer that tokenizes camera images and a language instruction and outputs discretized arm and base actions at 3 Hz. It was trained on roughly 130,000 episodes spanning more than 700 tasks, collected over 17 months with 13 Everyday Robots mobile manipulators, and Google reported a 97 percent success rate on training instructions.[^rt1] RT-2 (July 2023) reframed control as a vision-language task by co-fine-tuning a web-pretrained vision-language model (PaLI-X up to 55B, or PaLM-E 12B) on robot trajectories, with actions emitted as text tokens; Google described this as transferring web knowledge to robotic control and reported improved generalization to novel objects and instructions.[^rt2] RT-X (October 2023), part of the Open X-Embodiment collaboration, retrained RT-1 and RT-2 on the pooled cross-embodiment dataset (the RT-1-X and RT-2-X models) and demonstrated positive transfer across robot platforms.[^oxe]

Octo and OpenVLA

Octo (May 2024, Berkeley, Stanford, and collaborators) is an open-source generalist policy built on a transformer with a diffusion action head, trained on 800,000 trajectories from the Open X-Embodiment mixture, and designed to be fine-tuned quickly to new sensors and action spaces.[^octo] OpenVLA (June 2024, Stanford, Berkeley, Toyota Research Institute, Google DeepMind) is a 7-billion-parameter open-source VLA that fuses SigLIP and DINOv2 visual features into a Llama 2 backbone and predicts discretized action tokens. Trained on 970,000 real-world Open X-Embodiment demonstrations, its authors report it surpasses the closed RT-2-X (55B) by 16.5 absolute percentage points in success rate across 29 tasks with about 7 times fewer parameters, and that it supports parameter-efficient fine-tuning and quantization.[^openvla]

Physical Intelligence pi-0 and pi-0.5

Physical Intelligence's pi-0 (also written π0, October 2024) builds on the 3B-parameter PaliGemma vision-language model and adds a roughly 300M-parameter "action expert" trained with flow matching, for about 3.3 billion parameters total. Rather than discrete tokens, the action expert generates continuous action chunks, which the company reports enables high-frequency control of up to 50 Hz for dexterous tasks such as folding laundry and bussing tables; it was pretrained on over 10,000 hours of robot data across multiple embodiments. The weights were later released as part of the open-source openpi repository.[^pi0] pi-0.5 (April 2025) extends pi-0 toward open-world generalization through co-training on a heterogeneous mixture of data, including multimodal web data, verbal instructions, high-level semantic subtask prediction, and low-level robot actions from several robot types. Physical Intelligence reported that pi-0.5 was the first end-to-end learned system to perform long-horizon dexterous tasks such as cleaning kitchens and bedrooms in homes never seen during training.[^pi05]
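Generating a continuous action chunk with flow matching amounts to integrating a velocity field from noise toward an action. The sketch below is a toy under stated assumptions: the `velocity_field` function stands in for the trained action expert and is given the target directly, whereas a real model would predict the velocity from images, language, and robot state. The 10 Euler steps and all names are illustrative choices.

```python
import random


def velocity_field(x, t, target):
    """Stand-in for the trained network: velocity of a straight-line path
    from the current point toward the target, defined for t < 1."""
    return [(tgt - xi) / (1.0 - t) for xi, tgt in zip(x, target)]


def sample_action_chunk(target, n_steps=10, seed=0):
    """Integrate the velocity field from Gaussian noise (t=0) to actions (t=1)."""
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in target]  # noisy starting chunk
    dt = 1.0 / n_steps
    for k in range(n_steps):
        t = k * dt
        v = velocity_field(x, t, target)
        x = [xi + dt * vi for xi, vi in zip(x, v)]  # forward Euler step
    return x


# Hypothetical 3-value chunk (for example, one joint over three timesteps).
chunk = sample_action_chunk([0.1, 0.2, 0.3])
```

Because the output is a vector of real numbers rather than a sequence of bin indices, no quantization step sits between the model and the motor command.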

NVIDIA Isaac GR00T

NVIDIA Isaac GR00T N1 (March 2025) was announced as the first open, customizable foundation model for generalist humanoid robots. It uses a dual-system design in which a vision-language module (System 2) interprets the scene and instruction and a diffusion-transformer module (System 1) generates continuous motor actions; the GR00T-N1-2B checkpoint is published on Hugging Face. It was trained on a mixture of real-robot trajectories, human videos, and synthetic data, and NVIDIA reported that adding synthetic data improved performance by 40 percent over real data alone.[^groot1] GR00T N1.5 (June 2025) is a 3B-parameter update that freezes its Eagle-2.5 vision-language backbone, adds a Future Latent Representation Alignment (FLARE) objective, and improves language following; NVIDIA reported large gains over N1, for example raising the real GR-1 humanoid language-following rate from 46.6 percent to 93.3 percent.[^groot15] NVIDIA has continued the series with GR00T N1.6 and an early-access GR00T N1.7 (2026), the latter built on a Cosmos-Reason VLM backbone.[^grootrepo]

Google DeepMind Gemini Robotics

Google DeepMind's Gemini Robotics (March 2025) is a vision-language-action model built on Gemini 2.0 that adds physical action outputs, paired with Gemini Robotics-ER, an embodied reasoning model with enhanced spatial and temporal understanding for perception, state estimation, planning, and code generation; DeepMind reported that Gemini Robotics-ER achieved a 2 to 3 times higher success rate than Gemini 2.0 on end-to-end embodied tasks, and named Apptronik as a humanoid partner.[^gemrob] Gemini Robotics On-Device (June 2025) is a VLA optimized to run locally on the robot for low-latency, offline operation; DeepMind described it as its first VLA available for developer fine-tuning, adapting to new tasks from as few as 50 to 100 demonstrations, and shipped a Gemini Robotics SDK with MuJoCo support.[^gemod] Gemini Robotics 1.5 and Gemini Robotics-ER 1.5 (September 2025) split the stack into an orchestrating embodied-reasoning model (ER 1.5) that plans, calls digital tools such as web search, and estimates progress, and a VLA (1.5) that "thinks before acting" by generating natural-language reasoning before motion; DeepMind reported skill transfer across the ALOHA 2, Apptronik Apollo, and Franka platforms without per-robot specialization.[^gem15]

Figure Helix

Figure's Helix (February 2025) is a vision-language-action model for generalist humanoid control that uses a System 1 plus System 2 split. System 2 is an internet-pretrained 7B vision-language model running at 7 to 9 Hz for scene and language understanding, and System 1 is an 80M-parameter visuomotor policy that produces continuous actions at 200 Hz over a 35-degree-of-freedom humanoid upper body, including individual finger control. Figure reported training on roughly 500 hours of teleoperated demonstrations augmented with auto-generated hindsight instructions, and showed two robots collaborating on novel objects without task-specific programming.[^helix]
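The two-rate structure can be sketched as a single loop in which the slow module refreshes a latent only every few ticks of the fast control loop. This is a synchronous simplification with assumed values: 8 Hz is taken as a midpoint of the reported 7 to 9 Hz range, and the latent and action are placeholder values. It is not Figure's implementation.

```python
def run_dual_system(duration_s=1.0, control_hz=200, vlm_hz=8):
    """Count how often each system runs during duration_s seconds."""
    ticks_per_vlm_update = control_hz // vlm_hz  # 25 fast ticks per slow update
    n_ticks = int(duration_s * control_hz)
    latent, vlm_calls, policy_calls = None, 0, 0
    for tick in range(n_ticks):
        if tick % ticks_per_vlm_update == 0:
            # System 2 (slow): placeholder for a scene and language latent.
            latent = f"latent@{tick}"
            vlm_calls += 1
        # System 1 (fast): placeholder action conditioned on the latest latent.
        _action = (latent, tick)
        policy_calls += 1
    return vlm_calls, policy_calls


vlm_calls, policy_calls = run_dual_system()
```

Under these assumptions the fast policy acts on a latent that can be up to 24 ticks old, which is the trade-off a dual-system design accepts in exchange for a high control rate.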

Open and efficient VLAs: RDT-1B and SmolVLA

RDT-1B (Robotics Diffusion Transformer, October 2024, Tsinghua University) is a 1.2-billion-parameter diffusion foundation model for bimanual manipulation, described by its authors as the largest diffusion-based foundation model for robotic manipulation at release. It was pretrained on a collection of 46 datasets with more than one million episodes and fine-tuned on a self-collected bimanual dataset, and the authors report zero-shot generalization to unseen objects and few-shot learning of new skills from one to five demonstrations.[^rdt] SmolVLA (June 2025, Hugging Face) is a compact 450M-parameter open VLA trained on community-contributed LeRobot datasets. It pairs a SmolVLM-2 backbone (SigLIP vision encoder plus a SmolLM2 language model) with a roughly 100M-parameter flow-matching action expert, targets consumer-grade hardware, and uses an asynchronous inference scheme that Hugging Face reports cuts task time by about 30 percent on average; the company states it matches or exceeds much larger models on its benchmarks.[^smolvla]

Notable models

Model	Year	Group	Type
DQN	2013/2015	DeepMind	Deep Q-learning, Atari
AlphaGo / AlphaZero	2016/2017	DeepMind	Self-play planning
OpenAI Five	2018	OpenAI	Dota 2 PPO at scale
AlphaStar	2019	DeepMind	StarCraft II
MuZero	2020	DeepMind	Model-based planning
DreamerV3	2023	DeepMind	World-model RL
RT-1	Dec 2022	Google	Robotics transformer, 35M
Diffusion Policy	Mar 2023	Columbia, Toyota	Diffusion control
ACT / Aloha	Apr 2023	Stanford	Bimanual imitation
RT-2	Jul 2023	Google	First VLA at scale
RT-X	Oct 2023	RT-X collaboration	Cross-embodiment
Octo	May 2024	Berkeley, Stanford	Open generalist policy
OpenVLA	Jun 2024	Stanford, Berkeley	Open-source 7B VLA
pi-0	Oct 2024	Physical Intelligence	Flow-matching VLA, 3.3B
RDT-1B	Oct 2024	Tsinghua University	Bimanual diffusion VLA, 1.2B
Helix	Feb 2025	Figure	Humanoid VLA, 7B + 80M, 200 Hz
Gemini Robotics	Mar 2025	Google DeepMind	Gemini 2.0 with actions
GR00T N1	Mar 2025	NVIDIA	Open humanoid foundation model
pi-0.5	Apr 2025	Physical Intelligence	Open-world VLA
SmolVLA	Jun 2025	Hugging Face	Compact open VLA, 450M
GR00T N1.5	Jun 2025	NVIDIA	3B humanoid VLA, FLARE
Gemini Robotics On-Device	Jun 2025	Google DeepMind	On-robot, fine-tunable VLA
Gemini Robotics 1.5 / ER 1.5	Sep 2025	Google DeepMind	Agentic VLA plus reasoning

Datasets and benchmarks

Large robot datasets are typically collected by human teleoperation, scripted policies, or online RL agents. The table below lists widely cited resources.

Resource	Year	Scale	Notes
RoboNet	2019	15M video frames	Multi-robot, multi-lab
Bridge Data	2022/2023	~60K demos	Single-arm tabletop
RT-1 dataset	2022	~130K demos, 700 tasks	Google kitchen environments
Open X-Embodiment	2023	60 datasets, 22 embodiments	Cross-institution union
DROID	2024	76K demos, 350 hours	13 institutions, Franka arms
Aloha and Mobile Aloha	2023/2024	Teleop bimanual	Low-cost rig
BEHAVIOR-1K	2023	1,000 household tasks	Stanford simulation
RLBench	2019	100 tasks	CoppeliaSim
Meta-World	2019	50 manipulation tasks	Multi-task RL benchmark
LIBERO	2023	Lifelong learning suite	Procedural tasks
ManiSkill	2021	Generalizable manipulation	SAPIEN simulator
SIMPLER	2024	Sim-to-real evaluation	Reproduces real benchmarks

Simulators feeding these benchmarks include NVIDIA Isaac Sim and Isaac Gym, MuJoCo maintained by Google DeepMind, PyBullet, RoboCasa, Habitat for embodied navigation, and AI2-THOR. Sim-to-real transfer commonly relies on domain randomization (Tobin et al., 2017), which trains policies across randomized visual and dynamics parameters so that real-world appearance and physics fall inside the training distribution.[^domainrand]
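Domain randomization comes down to drawing fresh simulator parameters for every training episode. The sketch below shows only that sampling step; the parameter names and ranges are hypothetical and are not tied to any particular simulator's API.

```python
import random

# Hypothetical parameter ranges; real values depend on the robot and simulator.
PARAM_RANGES = {
    "friction": (0.5, 1.5),
    "mass_kg": (0.8, 1.2),
    "light_intensity": (0.3, 1.0),
}


def sample_sim_params(rng):
    """Draw one set of visual and dynamics parameters uniformly at random."""
    return {name: rng.uniform(low, high) for name, (low, high) in PARAM_RANGES.items()}


def randomized_episodes(n_episodes, seed=0):
    """Return the per-episode parameter sets a training loop would apply."""
    rng = random.Random(seed)
    return [sample_sim_params(rng) for _ in range(n_episodes)]


episode_params = randomized_episodes(n_episodes=5)
```

A training loop would reset the simulator with each sampled dictionary before collecting that episode's rollout.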

Skills and embodiments

Robotics models target several categories of skills. Manipulation covers pick-and-place, peg-in-hole, tool use, cloth folding, and dexterous in-hand reorientation. Navigation includes point-goal, object-goal, and instruction-following tasks. Bimanual manipulation, which the Aloha platform popularized, requires synchronized two-arm control. Locomotion spans flat ground walking, dynamic running, and recovery from disturbances. Many recent VLAs are deployed on humanoid robot platforms, listed below alongside quadrupeds and arms.

Embodiment	Company or lab	Type
Spot	Boston Dynamics	Quadruped
ANYmal	ANYbotics, ETH Zurich	Quadruped
Go2 / B2	Unitree	Quadruped
Atlas	Boston Dynamics	Humanoid
Optimus	Tesla	Humanoid
Figure 02 / Figure 03	Figure	Humanoid
NEO	1X Technologies	Humanoid
H1 / G1	Unitree	Humanoid
Apollo	Apptronik	Humanoid
Phoenix	Sanctuary AI	Humanoid
Aloha 2	Stanford, Google	Bimanual stationary
Franka Panda	Franka Robotics	7-DoF arm

Vision-language-action paradigm

The vision-language-action model recipe inherits a pretrained vision-language backbone, attaches a robot action head, and fine-tunes on demonstration data. RT-2 and OpenVLA discretize each action dimension into bins and predict them as tokens, reusing the language model's softmax head. pi-0 instead pairs PaliGemma with a flow-matching expert that emits continuous trajectories, giving 50 Hz control with smoother motion, an approach also taken by SmolVLA's action expert.[^pi0][^smolvla] Helix and GR00T N1 split inference between a slower internet-pretrained VLM for scene understanding (System 2) and a fast visuomotor policy (System 1) that runs the high-rate control loop.[^helix][^groot1] Across these systems the design pattern is the same: web-scale pretraining supplies general world knowledge, and a smaller robot-specific fine-tune supplies grounded motor control.
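The token-based action representation can be made concrete with a short sketch of uniform binning. The action range of -1 to 1, the 256 bins, and the function names are assumptions for illustration; each model defines its own normalization and vocabulary mapping.

```python
def tokenize(value, low=-1.0, high=1.0, n_bins=256):
    """Map a continuous action value to a bin index in [0, n_bins - 1]."""
    clipped = min(max(value, low), high)
    index = int((clipped - low) / (high - low) * n_bins)
    return min(index, n_bins - 1)  # the upper edge falls into the last bin


def detokenize(index, low=-1.0, high=1.0, n_bins=256):
    """Map a bin index back to the center of its bin."""
    width = (high - low) / n_bins
    return low + (index + 0.5) * width


# One token per action dimension, for a hypothetical 3-dimensional action.
action = [0.25, -0.7, 1.0]
tokens = [tokenize(a) for a in action]
recovered = [detokenize(t) for t in tokens]
```

The round trip loses at most half a bin width per dimension, which is the quantization error that continuous action heads avoid.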

Modern landscape

Several trends define the 2024 and 2025 wave of robotics models. Humanoid robot development has surged with commercial deliveries from Figure, 1X, Agility Robotics, and Apptronik, plus ongoing programs at Tesla and Boston Dynamics. Generalist VLAs are replacing per-task policies, supported by collaborations such as Open X-Embodiment and by open releases such as OpenVLA, GR00T, SmolVLA, and the openpi pi-0 weights. Low-cost data collection hardware, including the Aloha and Mobile Aloha teleoperation rigs and the handheld Universal Manipulation Interface (UMI), reduces the expense of gathering demonstrations. Diffusion and flow-matching action heads can represent multimodal action distributions, which mean-squared-error regression tends to average into a single blurred action. Agentic stacks such as Gemini Robotics 1.5 add an explicit reasoning model that plans and calls external tools before the VLA acts.[^gem15] Edge inference platforms such as NVIDIA Jetson Thor and Orin, together with on-robot models like Gemini Robotics On-Device, run multibillion-parameter policies on the robot itself, which removes the need for a cloud round trip during control.[^gemod]

Applications

Robotics models reach production in several sectors. Industrial assembly cells use learned bin picking and insertion. Warehouse logistics is led by Amazon Robotics, which runs hundreds of thousands of mobile platforms. Service robots address cleaning, last-mile delivery, and food preparation. Surgical robotics combines teleoperation with learned assistance, as in the Intuitive da Vinci platform. Autonomous driving shares perception and planning components with robotics and is an adjacent field. Agriculture uses fruit-picking arms and autonomous tractors. Search and rescue, prosthetics, and exoskeletons round out the application stack.

Limitations

Despite rapid progress, several issues remain open. The sim-to-real gap means policies trained in simulation often degrade on hardware, and domain randomization adds variance to training. Sample inefficiency is acute for real-world RL because robot rollouts are slow and risk hardware damage. Generalization across embodiments is partial: a policy trained on a Franka Panda may not transfer to a different gripper without fine-tuning, which is one motivation for cross-embodiment efforts like Open X-Embodiment and GR00T. Safety guarantees are weak compared with classical controllers because neural policies are not easily certified. Real-time inference imposes latency budgets that constrain model size on the robot. Data collection is expensive: 76,000 DROID trajectories required 12 months and 13 institutions.[^droid] Evaluation reproducibility is hampered by hardware variation, which has motivated reproducible suites like SIMPLER. Vendor performance figures cited above are self-reported and have not always been independently replicated.

References

Schulman, J. et al. (2017). Proximal Policy Optimization Algorithms. https://arxiv.org/abs/1707.06347 Accessed 2026-05-31.
Haarnoja, T. et al. (2018). Soft Actor-Critic. https://arxiv.org/abs/1801.01290 Accessed 2026-05-31.
Schrittwieser, J. et al. (2020). MuZero: Mastering Atari, Go, Chess and Shogi by Planning with a Learned Model. https://arxiv.org/abs/1911.08265 Accessed 2026-05-31.
Ho, J., Ermon, S. (2016). Generative Adversarial Imitation Learning. https://arxiv.org/abs/1606.03476 Accessed 2026-05-31.
