A robot foundation model is a large-scale machine learning model, typically based on the transformer architecture, that is trained on broad, diverse datasets of robot interactions and then adapted to a wide range of robotic tasks and embodiments. These models apply the same scaling principles that produced large language models (LLMs) and vision-language models (VLMs) to the domain of physical robot control, aiming to create general-purpose robotic policies that can follow natural language instructions, perceive their environment through computer vision, and output low-level motor actions.
Robot foundation models represent a shift away from the traditional approach in robotics, where engineers hand-design control policies or train narrow models for individual tasks on individual robots. Instead, robot foundation models are pre-trained on massive, heterogeneous datasets spanning multiple robot types, sensor configurations, and manipulation skills, and then fine-tuned or prompted for specific downstream applications. The field gained rapid momentum starting in 2022, with Google DeepMind's Robotics Transformer (RT) family of models, and has since expanded to include efforts from NVIDIA, Physical Intelligence, Figure AI, Skild AI, and academic labs at Stanford, UC Berkeley, and elsewhere.
For decades, robotics research relied on hand-crafted controllers, classical motion planning algorithms, and task-specific reinforcement learning policies. While effective in structured factory environments, these approaches struggled to generalize. A robot trained to pick up a red cup from a specific table position could not reliably pick up a blue mug placed at a different angle. Each new task, object, or robot platform typically required starting the training process from scratch.
The success of foundation models in natural language processing and computer vision suggested an alternative path. Models like GPT, BERT, and CLIP demonstrated that pre-training on web-scale data followed by task-specific fine-tuning could produce systems with broad generalization abilities. Researchers began asking whether the same approach could work for robotics: train a single, large model on diverse robot data, and then adapt it to new tasks, environments, and even robot bodies with minimal additional data.
Two factors made this feasible starting around 2022. First, the maturation of vision-language models provided a way to ground natural language instructions in visual perception, giving robots a shared representation space for understanding both what they see and what they are asked to do. Second, large-scale collaborative data collection efforts, most notably the Open X-Embodiment project, created datasets large enough to support pre-training of generalist robot policies.
The dominant architecture for robot foundation models is the vision-language-action (VLA) model. A VLA takes an image (or short video) of the robot's surroundings and a natural language instruction as input and directly outputs low-level robot actions that can be executed to accomplish the requested task.
VLAs are generally constructed by fine-tuning a pre-trained vision-language model on a large-scale dataset that pairs visual observations and language instructions with robot trajectories. The core idea is that a VLM already understands the visual world and language; by adding an action output modality, the model can be taught to translate perception and language understanding into physical behavior.
A typical VLA architecture has three components:
Visual encoder: A vision transformer (ViT) or convolutional neural network (CNN) that processes camera images into a set of visual token embeddings. Common backbones include SigLIP, DINOv2, and EfficientNet.
Language model backbone: A pre-trained large language model that processes the language instruction and visual tokens together, producing a joint representation. Backbones range from PaLM-E (540B parameters) and PaLI-X (55B) in RT-2 down to Llama 2 7B in OpenVLA and SmolLM 1.7B in GR00T N1.
Action decoder: A head that maps the model's internal representations to motor commands. This can be a simple discretization into action tokens (as in RT-1 and RT-2), a diffusion policy head (as in Octo), or a flow-matching decoder (as in pi0).
Different models handle the action-output step differently; the sections below survey the major models and the design choices each one makes.
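As a concrete illustration of the three-component pipeline above, the following toy sketch (not drawn from any published model) wires together random, untrained stand-ins for the visual encoder, language embedding, backbone, and action head in NumPy. Every weight matrix and helper here is a placeholder for a large learned module:

```python
import numpy as np

rng = np.random.default_rng(0)
D = 64  # shared token width (toy size; real models use 1024 or more)

# Random projections stand in for the three learned components.
W_patch = rng.normal(0, 0.02, (16 * 16 * 3, D))   # visual encoder
W_vocab = rng.normal(0, 0.02, (1000, D))          # language embedding table
W_action = rng.normal(0, 0.02, (D, 7))            # action head (7-DoF arm)

def visual_encoder(image):
    """Patchify a 64x64x3 image into 16 tokens (stand-in for a ViT)."""
    patches = image.reshape(4, 16, 4, 16, 3).transpose(0, 2, 1, 3, 4)
    return patches.reshape(16, -1) @ W_patch       # (16, D) visual tokens

def embed_instruction(text):
    """Map words to ids with a toy hash (stand-in for a real tokenizer + LLM)."""
    ids = [sum(word.encode()) % 1000 for word in text.lower().split()]
    return W_vocab[ids]                            # (n_words, D) language tokens

def backbone(tokens):
    """Mean-pool as a stand-in for transformer self-attention layers."""
    return tokens.mean(axis=0)                     # (D,) joint representation

def vla_policy(image, instruction):
    tokens = np.concatenate([visual_encoder(image), embed_instruction(instruction)])
    return np.tanh(backbone(tokens) @ W_action)    # (7,) normalized action

action = vla_policy(rng.random((64, 64, 3)), "pick up the red cup")
print(action.shape)  # (7,)
```

A real VLA replaces each stand-in with a pre-trained network and trains the whole stack end-to-end on robot trajectories.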
RT-1 (Robotics Transformer 1) was one of the first large-scale transformer-based robot policies trained on real-world data. Developed by Google's Everyday Robots team and published in December 2022, RT-1 was trained on 130,000 episodes covering over 700 tasks, collected using a fleet of 13 mobile manipulators over 17 months.
The architecture uses an ImageNet-pretrained EfficientNet backbone conditioned on language instruction embeddings via FiLM (Feature-wise Linear Modulation) layers, followed by a TokenLearner compression module and a transformer decoder that outputs discretized action tokens. Each action is discretized into 256 bins across 11 dimensions: seven for arm movement, three for base movement, and one discrete mode-switching variable.
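The binning scheme described above can be sketched as follows. The per-dimension limits used here are hypothetical placeholders, since RT-1's actual action bounds are not given in this article:

```python
import numpy as np

def discretize(action, low, high, bins=256):
    """Map each continuous action dimension to one of 256 integer bins (RT-1 style)."""
    normalized = (np.clip(action, low, high) - low) / (high - low)
    return np.minimum((normalized * bins).astype(int), bins - 1)

def undiscretize(tokens, low, high, bins=256):
    """Recover the bin-center continuous value for execution on the robot."""
    return low + (tokens + 0.5) / bins * (high - low)

# Hypothetical limits for the 11-D action space described above
# (7 arm + 3 base + 1 mode), normalized to [-1, 1] per dimension.
low = np.full(11, -1.0)
high = np.full(11, 1.0)

action = np.array([0.3, -0.8, 0.0, 0.5, -0.2, 0.9, -1.0, 0.1, 0.0, -0.5, 1.0])
tokens = discretize(action, low, high)
recovered = undiscretize(tokens, low, high)
print(tokens)                              # integers in [0, 255]
print(np.max(np.abs(recovered - action)))  # quantization error below (high-low)/256
```

The policy then predicts these integer tokens with a standard cross-entropy loss, one token per action dimension.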
RT-1 achieved 97% success on seen tasks and 76% on previously unseen tasks in the same environment, demonstrating that a single model could learn hundreds of manipulation skills simultaneously.
RT-2 extended the RT-1 approach by building directly on top of large pre-trained vision-language models. Rather than training a robot-specific architecture from scratch, RT-2 fine-tuned existing VLMs (PaLI-X at 55B parameters and PaLM-E at 12B parameters) to output robot actions as text tokens alongside their normal language outputs.
The key insight was that robot actions could be represented as strings of integers, making them just another "language" that the model could learn. By co-fine-tuning on a mixture of web-scale vision-language data and robot trajectory data, RT-2 inherited the world knowledge and reasoning capabilities of the underlying VLM. For example, RT-2 could follow instructions like "pick up the object that could be used as an improvised hammer" by combining visual recognition with commonsense reasoning, a capability that emerged from the web data rather than being explicitly taught through robot demonstrations.
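A minimal sketch of this actions-as-text idea, assuming the same 256-bin normalization used by RT-1 (the exact serialization RT-2 uses depends on the backbone's tokenizer):

```python
import numpy as np

def action_to_text(action, low=-1.0, high=1.0, bins=256):
    """Serialize a continuous action as a string of integers, RT-2 style.
    The VLM emits this string through its ordinary text decoder."""
    normalized = (np.clip(np.asarray(action), low, high) - low) / (high - low)
    tokens = np.minimum((normalized * bins).astype(int), bins - 1)
    return " ".join(str(t) for t in tokens)

def text_to_action(text, low=-1.0, high=1.0, bins=256):
    """Parse the emitted string back into an executable continuous action."""
    tokens = np.array([int(t) for t in text.split()])
    return low + (tokens + 0.5) / bins * (high - low)

s = action_to_text([0.25, -0.5, 1.0])
print(s)                  # "160 64 255"
print(text_to_action(s))  # approximately [0.25, -0.5, 1.0]
```

During co-fine-tuning, strings like these simply appear as target text alongside ordinary vision-language examples.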
RT-2 achieved more than 3x the success rate of baselines on emergent skill evaluations involving symbol understanding, reasoning, and human recognition.
Gato, published in May 2022, was an early generalist agent that operated across multiple modalities and domains. With 1.2 billion parameters, Gato could play Atari games, caption images, chat, and stack blocks with a real robot arm, all using the same network weights. Data from different tasks were serialized into flat sequences of tokens and processed by a single transformer.
Gato demonstrated the basic feasibility of multi-task, multi-embodiment policies but was limited by its relatively small scale and by the limited breadth of its robot training data. It served as the foundation for RoboCat.
RoboCat, unveiled in June 2023, built on Gato's architecture to create a self-improving robotic agent. It was trained on data from multiple robotic arms and could adapt to new tasks or even entirely new robot hardware with as few as 100 demonstrations.
RoboCat's self-improvement loop worked as follows: given a small number of demonstrations for a new task, a spin-off agent would practice the task roughly 10,000 times, generating additional training data. This new data was then folded back into RoboCat's training set, and the model was retrained. Through this cycle, RoboCat improved its success rate on new tasks by an average of 2x.
RoboCat was the first agent to solve and adapt to multiple tasks across different physical robots, including adapting from two-pronged grippers to a three-fingered gripper with twice as many controllable inputs, all within a few hours.
The Open X-Embodiment (OXE) project, announced in October 2023, was a collaborative effort between Google DeepMind and over 20 research institutions to create the largest open-source real robot dataset. The dataset pooled 60 existing robot datasets from 34 labs worldwide, containing over 1 million trajectories from 22 different robot embodiments spanning single arms, bimanual systems, and mobile manipulators, covering 527 distinct manipulation skills.
Two models were trained on this data mixture: RT-1-X, which applied the RT-1 architecture to the cross-embodiment data, and RT-2-X, which did the same with the larger RT-2 vision-language backbone. Both outperformed counterparts trained only on data from a single robot.
The OXE dataset was standardized into the RLDS (Robot Learning Data Standard) format and made freely available, becoming an essential resource for the broader robot learning community.
Octo, published in May 2024, was the first fully open-source generalist robot policy designed from the ground up for broad applicability. Developed by a team at UC Berkeley and collaborators, Octo was trained on 800,000 robot episodes from 25 datasets in the Open X-Embodiment collection.
Octo comes in two sizes: Octo-Small (27M parameters) and Octo-Base (93M parameters). Its architecture tokenizes task descriptions using a pre-trained language model and observations using a lightweight CNN, then processes everything through a transformer backbone. A conditional diffusion decoding head generates continuous, multi-modal action distributions. Octo was among the first generalist robot policies to use diffusion-based action decoding.
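The following toy shows the shape of diffusion-based action decoding: DDPM-style ancestral sampling that starts from Gaussian noise and iteratively denoises into an action, conditioned on a context vector. The noise-prediction network is replaced here by an analytic oracle (so the sampler provably lands on the target); in Octo it is a learned head on the transformer's readout embedding:

```python
import numpy as np

rng = np.random.default_rng(0)
T = 50                                    # number of denoising steps
betas = np.linspace(1e-4, 0.02, T)        # standard DDPM noise schedule
alphas = 1.0 - betas
alpha_bar = np.cumprod(alphas)

def eps_model(noisy_action, t, context):
    """Oracle noise predictor: assumes the 'true' action equals the context.
    A real model learns this mapping from robot trajectory data."""
    return (noisy_action - np.sqrt(alpha_bar[t]) * context) / np.sqrt(1 - alpha_bar[t])

def sample_action(context, dim=7):
    """DDPM ancestral sampling: start from noise, denoise conditioned on context."""
    x = rng.normal(size=dim)
    for t in reversed(range(T)):
        eps = eps_model(x, t, context)
        x = (x - betas[t] / np.sqrt(1 - alpha_bar[t]) * eps) / np.sqrt(alphas[t])
        if t > 0:                         # no noise is added at the final step
            x += np.sqrt(betas[t]) * rng.normal(size=dim)
    return x

target_action = np.array([0.2, -0.1, 0.4, 0.0, 0.3, -0.2, 0.5])
sampled = sample_action(target_action)
print(np.round(sampled, 3))  # recovers the target exactly, because eps_model is an oracle
```

Because the sampler draws fresh noise each call, a learned denoiser can represent multi-modal action distributions, which is the main advantage over regressing a single action.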
Octo supports both natural language instructions and goal-image conditioning, and it can be fine-tuned for new robots with just a few hours of data. Pre-training takes 8 hours (Octo-Small) or 14 hours (Octo-Base) on a TPUv4-128 pod. Notably, Octo runs on consumer GPUs, making it accessible to research labs without large-scale compute infrastructure.
OpenVLA, published in June 2024, is a 7-billion-parameter open-source VLA trained on 970,000 real-world robot manipulation trajectories from the Open X-Embodiment dataset. It combines a fused visual encoder (SigLIP + DINOv2) with a Llama 2 7B language model backbone.
Incoming RGB observations are divided into patches and processed by both visual encoders; the resulting features are concatenated and projected into the LLM embedding space via a two-layer MLP. The model then outputs discretized action tokens autoregressively. OpenVLA was trained on 64 A100 GPUs for 15 days, and it can be fine-tuned on consumer GPUs using low-rank adaptation (LoRA) methods and served efficiently via quantization.
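Low-rank adaptation, mentioned above as OpenVLA's consumer-GPU fine-tuning route, freezes each pre-trained weight matrix and trains only a pair of small factor matrices. A minimal NumPy sketch (the layer sizes and `alpha` value are illustrative, not OpenVLA's actual configuration):

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, r = 512, 512, 16              # r << d is the low-rank bottleneck

W = rng.normal(0, 0.02, (d_out, d_in))     # frozen pre-trained weight
A = rng.normal(0, 0.02, (r, d_in))         # trainable LoRA factor
B = np.zeros((d_out, r))                   # B starts at zero: no initial change

def lora_forward(x, alpha=32):
    """y = Wx + (alpha/r) * B(Ax); only A and B are updated during fine-tuning."""
    return W @ x + (alpha / r) * (B @ (A @ x))

x = rng.normal(size=d_in)
print(np.allclose(lora_forward(x), W @ x))  # True: identical to the base model at init

full = d_out * d_in
lora = r * (d_in + d_out)
print(f"trainable params: {lora} vs {full} ({100 * lora / full:.2f}%)")
```

Because only the small factors carry gradients, optimizer state and gradient memory shrink accordingly, which is what makes fine-tuning a 7B model feasible on a single consumer GPU.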
CrossFormer, presented at CoRL 2024 (Top 4% oral), is a transformer-based policy trained on 900,000 trajectories from 30 different robot embodiments, the largest and most diverse cross-embodiment training dataset at the time. Unlike prior work, CrossFormer does not require manual alignment of observation or action spaces across different robots. It casts cross-embodied imitation learning as a sequence-to-sequence problem, using modality-specific tokenizers for observations, proprioception, and task specifications.
The same CrossFormer weights can control single-arm and dual-arm manipulation systems, wheeled robots, quadcopters, and quadrupeds. Experiments showed it matched the performance of specialist policies tailored for each embodiment while outperforming prior cross-embodiment learning methods.
Physical Intelligence released pi0 (also written as π0) in October 2024, describing it as a prototype generalist policy for robot control. The model was trained on data from seven robotic platforms covering 68 unique tasks, including laundry folding, table bussing, grocery bagging, box assembly, and object retrieval.
pi0 uses a mixture-of-experts-like architecture with a pre-trained 3B PaliGemma VLM and a separate set of action expert parameters. The VLM block, proprioception block, and action block interact through attention with block-wise causal masking: the VLM attends to itself, proprioception attends to itself and VLM tokens, and the action block attends to all tokens. Actions are generated via flow matching, a variant of diffusion that produces smooth, continuous action trajectories at up to 50 Hz.
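Flow matching generates actions by integrating a learned velocity field from noise toward data. The sketch below Euler-integrates the straight-line (rectified-flow) path, with an analytic oracle standing in for pi0's learned action expert:

```python
import numpy as np

rng = np.random.default_rng(0)

def velocity_field(x, t, target):
    """Oracle stand-in for the learned action expert. For the straight-line
    path x_t = (1-t)*noise + t*target, the conditional velocity is
    target - noise, which equals (target - x) / (1 - t)."""
    return (target - x) / (1.0 - t)

def sample_flow(target, dim=7, steps=10):
    """Euler-integrate dx/dt = v(x, t) from noise (t=0) to data (t=1)."""
    x = rng.normal(size=dim)               # start from Gaussian noise
    dt = 1.0 / steps
    for i in range(steps):
        t = i * dt                         # t stays strictly below 1
        x = x + dt * velocity_field(x, t, target)
    return x

target = np.array([0.1, -0.3, 0.2, 0.0, 0.4, -0.1, 0.25])
print(np.round(sample_flow(target), 3))    # lands on the target action
```

Compared with DDPM-style diffusion, far fewer integration steps are typically needed, which helps reach control rates like the 50 Hz described above.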
In February 2025, Physical Intelligence open-sourced pi0 through its openpi repository on GitHub, releasing code, model weights, and fine-tuning examples for platforms including ALOHA and DROID.
pi0.5, released in April 2025, introduced a hierarchical architecture for improved open-world generalization. It uses a two-stage inference procedure: first, a high-level textual subtask is predicted through discrete autoregressive token decoding; then, low-level motor commands are generated through continuous flow matching. The discrete decoding pathway uses the FAST tokenizer for action tokens during pre-training, while post-training switches to flow matching for continuous actions.
pi0.5 demonstrated meaningful generalization to entirely new environments that were not represented in its training data.
Isaac GR00T N1, announced at GTC in March 2025, is the first open foundation model designed specifically for humanoid robots. It uses a dual-system architecture inspired by theories of human cognition:
System 2: a vision-language model that interprets the scene and the language instruction and reasons about how to act, operating at a relatively slow rate.
System 1: a diffusion transformer that converts System 2's output into continuous motor actions at high frequency (around 120 Hz).
These two systems are tightly coupled and optimized together during post-training. GR00T N1 was pre-trained on a heterogeneous mixture of real-robot trajectories, human videos, and synthetic data, using roughly 50,000 H100 GPU hours.
NVIDIA rapidly iterated on GR00T N1 through 2025, releasing GR00T N1.5, which improved generalization and introduced the FLARE objective for learning from human video without action labels, along with further updates to the data pipeline and supported embodiments.
GR00T N1 and its successors are open-source and available on Hugging Face, and they are being adopted by humanoid robot companies including 1X Technologies, Agility Robotics, Apptronik, Boston Dynamics, Figure AI, Fourier Intelligence, Sanctuary AI, Unitree Robotics, and XPENG Robotics.
Helix, unveiled by Figure AI in February 2025, is a generalist VLA for humanoid robot control. Like GR00T N1, Helix uses a dual System 1/System 2 architecture for high-rate, dexterous control of the entire humanoid upper body. Robots running Helix can pick up virtually any small household object, including thousands of items never seen during training, by following natural language prompts.
Helix was the first VLA to operate simultaneously on two robots, enabling them to collaborate on a shared, long-horizon manipulation task with novel objects. An updated version, Helix 02, extended control to the full robot body, integrating walking, manipulation, and balance into a single continuous system. In a demonstration, Helix 02 autonomously unloaded and reloaded a dishwasher across a full-sized kitchen in a four-minute, uninterrupted sequence with no resets or human intervention.
Gemini Robotics, introduced in March 2025, extended Google DeepMind's Gemini 2.0 multimodal model with physical action output capabilities. Where Gemini 2.0 could process text, images, video, and audio, Gemini Robotics added the ability to directly control robots.
The model demonstrated three core qualities: generality (adapting to different situations and following open-vocabulary instructions), interactivity (responding quickly to new instructions and changing conditions), and dexterity (performing fine manipulation tasks like folding origami and manipulating playing cards). According to DeepMind, Gemini Robotics more than doubled performance on a comprehensive generalization benchmark compared to other state-of-the-art VLA models.
A companion model, Gemini Robotics-ER (Embodied Reasoning), extended Gemini's multimodal reasoning into the physical world with spatial and temporal understanding, including object detection, grasp prediction, and 3D bounding box estimation. An on-device variant was optimized for local execution on robotic hardware, adapting to new tasks with as few as 50 to 100 demonstrations. Gemini Robotics 1.5, released in September 2025, added the ability to reason about and explain its own planned actions before executing them.
Skild AI, a Pittsburgh-based startup founded in 2023, is building what it calls a "general-purpose robotic brain." The Skild Brain is designed to be omni-bodied: it can control any robot without prior knowledge of the robot's exact body form, including quadrupeds, humanoids, tabletop arms, and mobile manipulators.
Skild's approach to the data problem relies heavily on large-scale simulation and internet video for pre-training, followed by targeted real-world data for post-training. The company demonstrated a notable research breakthrough in in-context learning for robotics: when introduced to a new body or unseen environment, the model adjusts its behavior based on live experience without requiring weight updates.
Skild AI raised $300 million in a Series A at a $1.5 billion valuation, and in January 2026, the company closed a $1.4 billion round led by SoftBank, bringing its valuation above $14 billion.
| Model | Organization | Year | Parameters | Training data | Action output | Open source | Key innovation |
|---|---|---|---|---|---|---|---|
| Gato | DeepMind | 2022 | 1.2B | 604 tasks (sim + real) | Discrete tokens | No | Multi-task, multi-modal generalist agent |
| RT-1 | Google | 2022 | ~35M | 130K episodes, 700+ tasks | Discrete (256 bins) | Yes | First large-scale real-world robot transformer |
| RT-2 | Google DeepMind | 2023 | 12B / 55B | Web + robot data | Discrete tokens (as text) | No | Actions as language tokens in a VLM |
| RoboCat | DeepMind | 2023 | N/A (Gato-based) | Multi-robot, self-generated | Discrete tokens | No | Self-improving via practice data |
| RT-1-X / RT-2-X | DeepMind + 20 labs | 2023 | 35M / 55B | 1M+ OXE trajectories | Discrete tokens | Yes (data) | Cross-embodiment transfer at scale |
| Octo | UC Berkeley | 2024 | 27M / 93M | 800K OXE episodes | Diffusion head | Yes | First open-source generalist robot policy with diffusion |
| OpenVLA | Stanford | 2024 | 7B | 970K OXE trajectories | Discrete tokens | Yes | Open 7B VLA with LoRA fine-tuning |
| CrossFormer | UC Berkeley | 2024 | N/A | 900K trajectories, 30 embodiments | Seq-to-seq tokens | Yes | Cross-embodiment without manual action alignment |
| pi0 | Physical Intelligence | 2024 | 3B+ | 7 platforms, 68 tasks | Flow matching (50 Hz) | Yes | Flow-matching action generation for dexterous tasks |
| pi0.5 | Physical Intelligence | 2025 | 3B+ | Expanded dataset | Dual (discrete + flow) | Yes | Hierarchical subtask planning + motor control |
| GR00T N1 | NVIDIA | 2025 | ~2B | Real + human video + synthetic | Diffusion Transformer (120 Hz) | Yes | Dual-system architecture for humanoid robots |
| Helix | Figure AI | 2025 | N/A | Proprietary | Dual system | No | First multi-robot cooperative VLA |
| Gemini Robotics | Google DeepMind | 2025 | N/A (Gemini 2.0-based) | Web + robot data | VLA (continuous) | No | Extends Gemini VLM to physical actions |
| Skild Brain | Skild AI | 2024-2025 | N/A | Sim + video + real | N/A | No | Omni-bodied control with in-context adaptation |
The availability of large-scale, diverse robot datasets has been a critical enabler for robot foundation models. The field has converged on several data sources.
The OXE dataset, released in October 2023, is the largest open-source real robot dataset. It aggregates 60 existing datasets from 34 research labs, containing over 1 million trajectories from 22 robot embodiments. All data is standardized into the RLDS format (based on TensorFlow Datasets) with consistent episode structure, action spaces, and metadata. OXE covers tabletop manipulation, mobile manipulation, bimanual tasks, and some locomotion scenarios. It has become the standard pre-training dataset for open-source robot foundation models including Octo, OpenVLA, and CrossFormer.
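Schematically, an RLDS episode is a nested dictionary of metadata plus an ordered sequence of steps. The field names below follow the RLDS convention, while the observation keys and the example instruction are illustrative (they vary across the 60 source datasets):

```python
import numpy as np

def make_step(image, action, is_first=False, is_last=False):
    """One RLDS-style step; observation keys here are illustrative examples."""
    return {
        "observation": {
            "image": image,                                  # camera frame
            "natural_language_instruction": "pick up the cup",
        },
        "action": action,
        "reward": np.float32(0.0),
        "discount": np.float32(1.0),
        "is_first": is_first,
        "is_last": is_last,
        "is_terminal": is_last,
    }

# An episode: metadata plus an ordered list of steps.
episode = {
    "episode_metadata": {"file_path": "episode_000001"},
    "steps": [
        make_step(
            np.zeros((224, 224, 3), np.uint8),
            np.zeros(7, np.float32),
            is_first=(i == 0),
            is_last=(i == 9),
        )
        for i in range(10)
    ],
}

print(len(episode["steps"]), episode["steps"][0]["is_first"])  # 10 True
```

Standardizing every contributed dataset into this shared structure is what lets a single training pipeline mix trajectories from 22 different embodiments.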
Real-world robot data collection is expensive and slow. A single demonstration might take several minutes, and collecting hundreds of thousands of demonstrations requires months of effort across multiple robot platforms. To close this gap, several groups have turned to synthetic data generation.
NVIDIA has invested heavily in this direction through its Omniverse simulation platform and Cosmos world foundation models. The NVIDIA Isaac GR00T Blueprint for synthetic manipulation motion generation can produce 780,000 synthetic trajectories in just 11 hours from a small number of human demonstrations. Combining synthetic data with real data improved GR00T N1's performance by 40% compared to using only real data. Common training ratios hover around 80% synthetic and 20% real data, though even small amounts of real data have been shown to close the domain gap in some settings.
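Mixing synthetic and real data at a fixed ratio can be implemented with a simple batch sampler. The sketch below draws roughly 20% of each batch from real trajectories, matching the ratio described above (function and variable names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_batch(synthetic, real, batch_size=256, real_fraction=0.2):
    """Draw a training batch with a fixed synthetic/real mixing ratio.
    The ~80/20 split is the ratio discussed above; it is a tuning choice."""
    n_real = int(batch_size * real_fraction)
    idx_real = rng.integers(0, len(real), n_real)
    idx_syn = rng.integers(0, len(synthetic), batch_size - n_real)
    return [real[i] for i in idx_real] + [synthetic[i] for i in idx_syn]

# Toy stand-ins for trajectory stores of the sizes mentioned in the text.
synthetic = ["synthetic"] * 780_000
real = ["real"] * 5_000

batch = sample_batch(synthetic, real)
print(len(batch), batch.count("real"))  # 256 51
```

Sampling per batch (rather than concatenating the datasets) keeps the ratio stable even when the synthetic pool is orders of magnitude larger than the real one.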
Several models, including GR00T N1.5 and Skild Brain, also train on large amounts of human demonstration video from the internet. While these videos lack robot action labels, they provide rich information about object manipulation, task structure, and physical scene understanding. Techniques like FLARE (Future Latent Representation Alignment), introduced in GR00T N1.5, allow the model to learn useful representations from human video even without corresponding action annotations.
Robot foundation models use several training strategies, often in combination.
The most common paradigm mirrors the approach used in NLP. A model is first pre-trained on a large, diverse dataset (often the OXE mixture or a proprietary multi-robot dataset) to learn general manipulation skills and visual-language grounding. It is then fine-tuned on a smaller, task-specific dataset for the target robot and application. Octo, for instance, can be fine-tuned for a new robot with just a few hours of demonstration data.
RT-2 pioneered co-fine-tuning, where a pre-trained VLM is jointly fine-tuned on a mixture of its original vision-language data and new robot trajectory data. This preserves the model's web knowledge and reasoning capabilities while teaching it to output motor actions. The approach allows emergent skill transfer: the model can reason about novel situations using knowledge from web data that was never paired with robot actions.
RoboCat demonstrated that a robot foundation model can improve itself through iterative practice. Starting from a small set of demonstrations, the model generates many additional trajectories through autonomous practice, and these are fed back into training. This bootstrapping approach reduces the human demonstration burden and allows the model to explore and learn from its own mistakes.
Several models combine simulated and real data during training. Simulated data provides cheap, abundant trajectories with perfect labels, while real data grounds the model in the visual and physical properties of the real world. NVIDIA's pipeline generates vast quantities of synthetic trajectories in Isaac Sim and Omniverse, which are combined with smaller amounts of real-world data for training GR00T models.
A defining goal of robot foundation models is cross-embodiment generalization: the ability of a single model to control different types of robots with different sensor configurations, action spaces, and physical morphologies.
The OXE project demonstrated that cross-embodiment training consistently improves performance. RT-1-X showed positive transfer on the majority of tested robot embodiments compared to models trained on each robot's data alone. RT-2-X extended this to show emergent transfer to robot types not represented in the training data.
CrossFormer made cross-embodiment training more practical by eliminating the need for manual action space alignment. Its modality-specific tokenizers handle the variability in observation types and action dimensions across different robots automatically.
Skild Brain takes cross-embodiment furthest, claiming omni-bodied control across quadrupeds, humanoids, tabletop arms, and mobile manipulators from a single model. The company reports that the model can adapt to new bodies through in-context learning without weight updates, adjusting its behavior on the fly based on live feedback.
Despite this progress, cross-embodiment generalization remains limited in several ways. Most successful demonstrations involve variations of tabletop manipulation arms. Transfer between fundamentally different robot morphologies (for example, from an arm to a legged robot) is less reliable. The action spaces, dynamics, and perceptual requirements of different robot types are different enough that a single model often struggles to handle all of them at the level of a specialist model trained specifically for each platform.
Foundation models in NLP train on billions of text examples, and vision models train on billions of images. Robot foundation models, by contrast, have access to far less data. The largest open dataset, OXE, contains roughly 1 million trajectories. While this sounds large, it is orders of magnitude smaller than what NLP models train on, and collecting robot data is far more expensive and time-consuming than scraping the web. Simulation and synthetic data generation help, but the gap between simulated and real-world conditions (the "sim-to-real gap") remains an active research problem.
Large models are computationally expensive. Running a 55B-parameter model in real time on a robot is impractical without cloud connectivity, which introduces latency and reliability concerns. This has driven interest in smaller, more efficient models (like Octo at 93M parameters) and on-device inference (like Gemini Robotics On-Device). The tradeoff between model size and inference speed is particularly acute in robotics, where delayed actions can lead to task failure or safety hazards.
Vision-language models can hallucinate, generating outputs that are plausible-sounding but factually incorrect. In a text chatbot, this is annoying. In a robot, it can be dangerous. A model that hallucinates the presence or absence of an object may collide with obstacles or execute actions in the wrong location. No current robot foundation model provides formal safety guarantees, and certifiable safety for learned policies remains an open research problem.
Most robot foundation models are vision-centric, relying primarily on camera images as input. Real-world manipulation often requires tactile feedback, force/torque sensing, and proprioceptive awareness that current VLAs do not fully integrate. Incorporating richer sensory modalities (touch, audio, force) into unified, temporally coherent representations is an active area of research but remains immature.
The field lacks standardized, real-world evaluation frameworks. Different groups test their models on different robots, in different environments, with different task sets and success criteria. This makes direct comparison between models difficult. Metrics beyond simple task success rate, including safety, efficiency, robustness to perturbation, and compute-aware performance measures, are needed for meaningful community-wide progress.
Most demonstrations of robot foundation models occur in controlled lab settings. Few examples exist in outdoor environments, and many setups use position-controlled arms on fixed tabletops rather than the more challenging scenarios of mobile manipulation, legged locomotion, or torque-controlled robots. Bridging the gap between lab demonstrations and real-world deployment in homes, warehouses, and hospitals remains a major challenge.
The development of robot foundation models has attracted substantial venture capital investment. Total funding across the robotics industry surpassed $10.3 billion in 2025, with a significant portion directed at foundation model companies.
| Company | Total funding | Valuation | Focus |
|---|---|---|---|
| Physical Intelligence | $470M+ | N/A | General-purpose robot foundation model (pi0) |
| Skild AI | $1.7B+ | $14B+ (Jan 2026) | Omni-bodied robotic brain |
| Figure AI | $675M+ | N/A | Humanoid robots with Helix VLA |
| 1X Technologies | $100M+ | N/A | Humanoid robots (NEO) |
NVIDIA, Google DeepMind, and other large technology companies fund robot foundation model research internally without requiring venture capital. The market is expected to consolidate in the coming years, with investment likely concentrating on three or four perceived leaders while smaller companies are acquired or struggle to raise additional capital.
As of early 2026, robot foundation models have progressed rapidly but remain far from the generalist capability levels that foundation models have achieved in language and vision. The state of the field can be characterized along several axes.
What works well: Tabletop manipulation with single arms, following open-vocabulary natural language instructions, picking and placing familiar objects, and transferring skills across similar robot arm platforms. Models like Gemini Robotics and pi0 have demonstrated dexterous tasks like laundry folding and origami that were considered very difficult just a few years ago.
What is emerging: Full-body humanoid control (GR00T, Helix), cooperative multi-robot tasks (Helix), learning from human video, and on-device inference for edge deployment.
What remains difficult: Robust outdoor operation, contact-rich assembly tasks requiring force feedback, long-horizon planning over minutes or hours, guaranteed safety, and truly omni-embodiment transfer that works reliably across arms, legs, wheels, and drones.
Key research directions for 2026 and beyond include closing the data gap through simulation, synthetic generation, and learning from human video; integrating tactile and force sensing into unified multimodal representations; developing formal safety guarantees for learned policies; establishing standardized real-world evaluation benchmarks; and achieving reliable cross-embodiment transfer beyond tabletop manipulation.