A robot foundation model is a large-scale machine learning model, typically based on the transformer architecture, that is trained on broad, diverse datasets of robot interactions and then adapted to a wide range of robotic tasks and embodiments. These models apply the same scaling principles that produced large language models (LLMs) and vision-language models (VLMs) to the domain of physical robot control, aiming to create general-purpose robotic policies that can follow natural language instructions, perceive their environment through computer vision, and output low-level motor actions.
Robot foundation models represent a shift away from the traditional approach in robotics, where engineers hand-design control policies or train narrow models for individual tasks on individual robots. Instead, robot foundation models are pre-trained on massive, heterogeneous datasets spanning multiple robot types, sensor configurations, and manipulation skills, and then fine-tuned or prompted for specific downstream applications. The field gained rapid momentum starting in 2022, with Google DeepMind's Robotics Transformer (RT) family of models, and has since expanded to include efforts from NVIDIA, Physical Intelligence, Figure AI, Skild AI, and academic labs at Stanford, UC Berkeley, and elsewhere.
For decades, robotics research relied on hand-crafted controllers, classical motion planning algorithms, and task-specific reinforcement learning policies. While effective in structured factory environments, these approaches struggled to generalize. A robot trained to pick up a red cup from a specific table position could not reliably pick up a blue mug placed at a different angle. Each new task, object, or robot platform typically required starting the training process from scratch.
The success of foundation models in natural language processing and computer vision suggested an alternative path. Models like GPT, BERT, and CLIP demonstrated that pre-training on web-scale data followed by task-specific fine-tuning could produce systems with broad generalization abilities. Researchers began asking whether the same approach could work for robotics: train a single, large model on diverse robot data, and then adapt it to new tasks, environments, and even robot bodies with minimal additional data.
Two factors made this feasible starting around 2022. First, the maturation of vision-language models provided a way to ground natural language instructions in visual perception, giving robots a shared representation space for understanding both what they see and what they are asked to do. Second, large-scale collaborative data collection efforts, most notably the Open X-Embodiment project, created datasets large enough to support pre-training of generalist robot policies.
The dominant architecture for robot foundation models is the vision-language-action (VLA) model. A VLA takes an image (or short video) of the robot's surroundings and a natural language instruction as input and directly outputs low-level robot actions that can be executed to accomplish the requested task.
VLAs are generally constructed by fine-tuning a pre-trained vision-language model on a large-scale dataset that pairs visual observations and language instructions with robot trajectories. The core idea is that a VLM already understands the visual world and language; by adding an action output modality, the model can be taught to translate perception and language understanding into physical behavior.
A typical VLA architecture has three components:
Visual encoder: A vision transformer (ViT) or convolutional neural network (CNN) that processes camera images into a set of visual token embeddings. Common backbones include SigLIP, DINOv2, and EfficientNet.
Language model backbone: A pre-trained large language model that processes the language instruction and visual tokens together, producing a joint representation. Backbones range from PaLM-E (540B parameters) and PaLI-X (55B) in RT-2 down to Llama 2 7B in OpenVLA and SmolLM 1.7B in GR00T N1.
Action decoder: A head that maps the model's internal representations to motor commands. This can be a simple discretization into action tokens (as in RT-1 and RT-2), a diffusion policy head (as in Octo), or a flow-matching decoder (as in pi0).
Different models handle the action-output step differently; the sections below survey the major models and the design choices each one makes.
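As a concrete illustration of the three-component pipeline above, the following toy sketch (not drawn from any published model) wires together random, untrained stand-ins for the visual encoder, language embedding, backbone, and action head in NumPy. Every weight matrix and helper here is a placeholder for a large learned module:

```python
import numpy as np

rng = np.random.default_rng(0)
D = 64  # shared token width (toy size; real models use 1024 or more)

# Random projections stand in for the three learned components.
W_patch = rng.normal(0, 0.02, (16 * 16 * 3, D))   # visual encoder
W_vocab = rng.normal(0, 0.02, (1000, D))          # language embedding table
W_action = rng.normal(0, 0.02, (D, 7))            # action head (7-DoF arm)

def visual_encoder(image):
    """Patchify a 64x64x3 image into 16 tokens (stand-in for a ViT)."""
    patches = image.reshape(4, 16, 4, 16, 3).transpose(0, 2, 1, 3, 4)
    return patches.reshape(16, -1) @ W_patch       # (16, D) visual tokens

def embed_instruction(text):
    """Map words to ids with a toy hash (stand-in for a real tokenizer + LLM)."""
    ids = [sum(word.encode()) % 1000 for word in text.lower().split()]
    return W_vocab[ids]                            # (n_words, D) language tokens

def backbone(tokens):
    """Mean-pool as a stand-in for transformer self-attention layers."""
    return tokens.mean(axis=0)                     # (D,) joint representation

def vla_policy(image, instruction):
    tokens = np.concatenate([visual_encoder(image), embed_instruction(instruction)])
    return np.tanh(backbone(tokens) @ W_action)    # (7,) normalized action

action = vla_policy(rng.random((64, 64, 3)), "pick up the red cup")
print(action.shape)  # (7,)
```

A real VLA replaces each stand-in with a pre-trained network and trains the whole stack end-to-end on robot trajectories.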
RT-1 (Robotics Transformer 1) was one of the first large-scale transformer-based robot policies trained on real-world data. Developed by Google's Everyday Robots team and published in December 2022, RT-1 was trained on 130,000 episodes covering over 700 tasks, collected using a fleet of 13 mobile manipulators over 17 months.
The architecture uses an ImageNet-pretrained EfficientNet backbone conditioned on language instruction embeddings via FiLM (Feature-wise Linear Modulation) layers, followed by a TokenLearner compression module and a transformer decoder that outputs discretized action tokens. Each action is discretized into 256 bins across 11 dimensions: seven for arm movement, three for base movement, and one discrete mode-switching variable.
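The binning scheme described above can be sketched as follows. The per-dimension limits used here are hypothetical placeholders, since RT-1's actual action bounds are not given in this article:

```python
import numpy as np

def discretize(action, low, high, bins=256):
    """Map each continuous action dimension to one of 256 integer bins (RT-1 style)."""
    normalized = (np.clip(action, low, high) - low) / (high - low)
    return np.minimum((normalized * bins).astype(int), bins - 1)

def undiscretize(tokens, low, high, bins=256):
    """Recover the bin-center continuous value for execution on the robot."""
    return low + (tokens + 0.5) / bins * (high - low)

# Hypothetical limits for the 11-D action space described above
# (7 arm + 3 base + 1 mode), normalized to [-1, 1] per dimension.
low = np.full(11, -1.0)
high = np.full(11, 1.0)

action = np.array([0.3, -0.8, 0.0, 0.5, -0.2, 0.9, -1.0, 0.1, 0.0, -0.5, 1.0])
tokens = discretize(action, low, high)
recovered = undiscretize(tokens, low, high)
print(tokens)                              # integers in [0, 255]
print(np.max(np.abs(recovered - action)))  # quantization error below (high-low)/256
```

The policy then predicts these integer tokens with a standard cross-entropy loss, one token per action dimension.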
RT-1 achieved 97% success on seen tasks and 76% on previously unseen tasks in the same environment, demonstrating that a single model could learn hundreds of manipulation skills simultaneously.
RT-2 extended the RT-1 approach by building directly on top of large pre-trained vision-language models. Rather than training a robot-specific architecture from scratch, RT-2 fine-tuned existing VLMs (PaLI-X at 55B parameters and PaLM-E at 12B parameters) to output robot actions as text tokens alongside their normal language outputs.
The key insight was that robot actions could be represented as strings of integers, making them just another "language" that the model could learn. By co-fine-tuning on a mixture of web-scale vision-language data and robot trajectory data, RT-2 inherited the world knowledge and reasoning capabilities of the underlying VLM. For example, RT-2 could follow instructions like "pick up the object that could be used as an improvised hammer" by combining visual recognition with commonsense reasoning, a capability that emerged from the web data rather than being explicitly taught through robot demonstrations.
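A minimal sketch of this actions-as-text idea, assuming the same 256-bin normalization used by RT-1 (the exact serialization RT-2 uses depends on the backbone's tokenizer):

```python
import numpy as np

def action_to_text(action, low=-1.0, high=1.0, bins=256):
    """Serialize a continuous action as a string of integers, RT-2 style.
    The VLM emits this string through its ordinary text decoder."""
    normalized = (np.clip(np.asarray(action), low, high) - low) / (high - low)
    tokens = np.minimum((normalized * bins).astype(int), bins - 1)
    return " ".join(str(t) for t in tokens)

def text_to_action(text, low=-1.0, high=1.0, bins=256):
    """Parse the emitted string back into an executable continuous action."""
    tokens = np.array([int(t) for t in text.split()])
    return low + (tokens + 0.5) / bins * (high - low)

s = action_to_text([0.25, -0.5, 1.0])
print(s)                  # "160 64 255"
print(text_to_action(s))  # approximately [0.25, -0.5, 1.0]
```

During co-fine-tuning, strings like these simply appear as target text alongside ordinary vision-language examples.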
RT-2 achieved more than 3x the success rate of baselines on emergent skill evaluations involving symbol understanding, reasoning, and human recognition.
Gato, published in May 2022, was an early generalist agent that operated across multiple modalities and domains. With 1.2 billion parameters, Gato could play Atari games, caption images, chat, and stack blocks with a real robot arm, all using the same network weights. Data from different tasks were serialized into flat sequences of tokens and processed by a single transformer.
Gato demonstrated the basic feasibility of multi-task, multi-embodiment policies but was limited by its relatively small scale and by the limited breadth of its robot training data. It served as the foundation for RoboCat.
RoboCat, unveiled in June 2023, built on Gato's architecture to create a self-improving robotic agent. It was trained on data from multiple robotic arms and could adapt to new tasks or even entirely new robot hardware with as few as 100 demonstrations.
RoboCat's self-improvement loop worked as follows: given a small number of demonstrations for a new task, a spin-off agent would practice the task roughly 10,000 times, generating additional training data. This new data was then folded back into RoboCat's training set, and the model was retrained. Through this cycle, RoboCat improved its success rate on new tasks by an average of 2x.
RoboCat was the first agent to solve and adapt to multiple tasks across different physical robots, including adapting from two-pronged grippers to a three-fingered gripper with twice as many controllable inputs, all within a few hours.
The Open X-Embodiment (OXE) project, announced in October 2023, was a collaborative effort between Google DeepMind and over 20 research institutions to create the largest open-source real robot dataset. The dataset pooled 60 existing robot datasets from 34 labs worldwide, containing over 1 million trajectories from 22 different robot embodiments spanning single arms, bimanual systems, and mobile manipulators, covering 527 distinct manipulation skills.
Two models were trained on this data mixture: RT-1-X, which applied the RT-1 architecture to the cross-embodiment data, and RT-2-X, which did the same with the larger RT-2 vision-language backbone. Both outperformed counterparts trained only on data from a single robot.
The OXE dataset was standardized into the RLDS (Robot Learning Data Standard) format and made freely available, becoming an essential resource for the broader robot learning community.
Octo, published in May 2024, was the first fully open-source generalist robot policy designed from the ground up for broad applicability. Developed by a team at UC Berkeley and collaborators, Octo was trained on 800,000 robot episodes from 25 datasets in the Open X-Embodiment collection.
Octo comes in two sizes: Octo-Small (27M parameters) and Octo-Base (93M parameters). Its architecture tokenizes task descriptions using a pre-trained language model and observations using a lightweight CNN, then processes everything through a transformer backbone. A conditional diffusion decoding head generates continuous, multi-modal action distributions. Octo was among the first generalist robot policies to use diffusion-based action decoding.
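The following toy shows the shape of diffusion-based action decoding: DDPM-style ancestral sampling that starts from Gaussian noise and iteratively denoises into an action, conditioned on a context vector. The noise-prediction network is replaced here by an analytic oracle (so the sampler provably lands on the target); in Octo it is a learned head on the transformer's readout embedding:

```python
import numpy as np

rng = np.random.default_rng(0)
T = 50                                    # number of denoising steps
betas = np.linspace(1e-4, 0.02, T)        # standard DDPM noise schedule
alphas = 1.0 - betas
alpha_bar = np.cumprod(alphas)

def eps_model(noisy_action, t, context):
    """Oracle noise predictor: assumes the 'true' action equals the context.
    A real model learns this mapping from robot trajectory data."""
    return (noisy_action - np.sqrt(alpha_bar[t]) * context) / np.sqrt(1 - alpha_bar[t])

def sample_action(context, dim=7):
    """DDPM ancestral sampling: start from noise, denoise conditioned on context."""
    x = rng.normal(size=dim)
    for t in reversed(range(T)):
        eps = eps_model(x, t, context)
        x = (x - betas[t] / np.sqrt(1 - alpha_bar[t]) * eps) / np.sqrt(alphas[t])
        if t > 0:                         # no noise is added at the final step
            x += np.sqrt(betas[t]) * rng.normal(size=dim)
    return x

target_action = np.array([0.2, -0.1, 0.4, 0.0, 0.3, -0.2, 0.5])
sampled = sample_action(target_action)
print(np.round(sampled, 3))  # recovers the target exactly, because eps_model is an oracle
```

Because the sampler draws fresh noise each call, a learned denoiser can represent multi-modal action distributions, which is the main advantage over regressing a single action.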
Octo supports both natural language instructions and goal-image conditioning, and it can be fine-tuned for new robots with just a few hours of data. Pre-training takes 8 hours (Octo-Small) or 14 hours (Octo-Base) on a TPUv4-128 pod. Notably, Octo runs on consumer GPUs, making it accessible to research labs without large-scale compute infrastructure.
OpenVLA, published in June 2024, is a 7-billion-parameter open-source VLA trained on 970,000 real-world robot manipulation trajectories from the Open X-Embodiment dataset. It combines a fused visual encoder (SigLIP + DINOv2) with a Llama 2 7B language model backbone.
Incoming RGB observations are divided into patches and processed by both visual encoders; the resulting features are concatenated and projected into the LLM embedding space via a two-layer MLP. The model then outputs discretized action tokens autoregressively. OpenVLA was trained on 64 A100 GPUs for 15 days, and it can be fine-tuned on consumer GPUs using low-rank adaptation (LoRA) methods and served efficiently via quantization.
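Low-rank adaptation, mentioned above as OpenVLA's consumer-GPU fine-tuning route, freezes each pre-trained weight matrix and trains only a pair of small factor matrices. A minimal NumPy sketch (the layer sizes and `alpha` value are illustrative, not OpenVLA's actual configuration):

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, r = 512, 512, 16              # r << d is the low-rank bottleneck

W = rng.normal(0, 0.02, (d_out, d_in))     # frozen pre-trained weight
A = rng.normal(0, 0.02, (r, d_in))         # trainable LoRA factor
B = np.zeros((d_out, r))                   # B starts at zero: no initial change

def lora_forward(x, alpha=32):
    """y = Wx + (alpha/r) * B(Ax); only A and B are updated during fine-tuning."""
    return W @ x + (alpha / r) * (B @ (A @ x))

x = rng.normal(size=d_in)
print(np.allclose(lora_forward(x), W @ x))  # True: identical to the base model at init

full = d_out * d_in
lora = r * (d_in + d_out)
print(f"trainable params: {lora} vs {full} ({100 * lora / full:.2f}%)")
```

Because only the small factors carry gradients, optimizer state and gradient memory shrink accordingly, which is what makes fine-tuning a 7B model feasible on a single consumer GPU.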
CrossFormer, presented at CoRL 2024 (Top 4% oral), is a transformer-based policy trained on 900,000 trajectories from 30 different robot embodiments, the largest and most diverse cross-embodiment training dataset at the time. Unlike prior work, CrossFormer does not require manual alignment of observation or action spaces across different robots. It casts cross-embodied imitation learning as a sequence-to-sequence problem, using modality-specific tokenizers for observations, proprioception, and task specifications.
The same CrossFormer weights can control single-arm and dual-arm manipulation systems, wheeled robots, quadcopters, and quadrupeds. Experiments showed it matched the performance of specialist policies tailored for each embodiment while outperforming prior cross-embodiment learning methods.
Physical Intelligence released pi0 (also written as π0) in October 2024, describing it as a prototype generalist policy for robot control. The model was trained on data from seven robotic platforms covering 68 unique tasks, including laundry folding, table bussing, grocery bagging, box assembly, and object retrieval.
pi0 uses a mixture-of-experts-like architecture with a pre-trained 3B PaliGemma VLM and a separate set of action expert parameters. The VLM block, proprioception block, and action block interact through attention with block-wise causal masking: the VLM attends to itself, proprioception attends to itself and VLM tokens, and the action block attends to all tokens. Actions are generated via flow matching, a variant of diffusion that produces smooth, continuous action trajectories at up to 50 Hz.
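Flow matching generates actions by integrating a learned velocity field from noise toward data. The sketch below Euler-integrates the straight-line (rectified-flow) path, with an analytic oracle standing in for pi0's learned action expert:

```python
import numpy as np

rng = np.random.default_rng(0)

def velocity_field(x, t, target):
    """Oracle stand-in for the learned action expert. For the straight-line
    path x_t = (1-t)*noise + t*target, the conditional velocity is
    target - noise, which equals (target - x) / (1 - t)."""
    return (target - x) / (1.0 - t)

def sample_flow(target, dim=7, steps=10):
    """Euler-integrate dx/dt = v(x, t) from noise (t=0) to data (t=1)."""
    x = rng.normal(size=dim)               # start from Gaussian noise
    dt = 1.0 / steps
    for i in range(steps):
        t = i * dt                         # t stays strictly below 1
        x = x + dt * velocity_field(x, t, target)
    return x

target = np.array([0.1, -0.3, 0.2, 0.0, 0.4, -0.1, 0.25])
print(np.round(sample_flow(target), 3))    # lands on the target action
```

Compared with DDPM-style diffusion, far fewer integration steps are typically needed, which helps reach control rates like the 50 Hz described above.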
In February 2025, Physical Intelligence open-sourced pi0 through its openpi repository on GitHub, releasing code, model weights, and fine-tuning examples for platforms including ALOHA and DROID.
pi0.5, released in April 2025, introduced a hierarchical architecture for improved open-world generalization. It uses a two-stage inference procedure: first, a high-level textual subtask is predicted through discrete autoregressive token decoding; then, low-level motor commands are generated through continuous flow matching. The discrete decoding pathway uses the FAST tokenizer for action tokens during pre-training, while post-training switches to flow matching for continuous actions.
pi0.5 demonstrated meaningful generalization to entirely new environments that were not represented in its training data.
Isaac GR00T N1, announced at GTC in March 2025, is the first open foundation model designed specifically for humanoid robots. It uses a dual-system architecture inspired by theories of human cognition:
System 2: a vision-language model that interprets the scene and the language instruction and reasons about how to act, operating at a relatively slow rate.
System 1: a diffusion transformer that converts System 2's output into continuous motor actions at high frequency (around 120 Hz).
These two systems are tightly coupled and optimized together during post-training. GR00T N1 was pre-trained on a heterogeneous mixture of real-robot trajectories, human videos, and synthetic data, using roughly 50,000 H100 GPU hours.
NVIDIA rapidly iterated on GR00T N1 through 2025, releasing GR00T N1.5, which improved generalization and introduced the FLARE objective for learning from human video without action labels, along with further updates to the data pipeline and supported embodiments.
GR00T N1 and its successors are open-source and available on Hugging Face, and they are being adopted by humanoid robot companies including 1X Technologies, Agility Robotics, Apptronik, Boston Dynamics, Figure AI, Fourier Intelligence, Sanctuary AI, Unitree Robotics, and XPENG Robotics.
Helix, unveiled by Figure AI in February 2025, is a generalist VLA for humanoid robot control. Like GR00T N1, Helix uses a dual System 1/System 2 architecture for high-rate, dexterous control of the entire humanoid upper body. Robots running Helix can pick up virtually any small household object, including thousands of items never seen during training, by following natural language prompts.
Helix was the first VLA to operate simultaneously on two robots, enabling them to collaborate on a shared, long-horizon manipulation task with novel objects. An updated version, Helix 02, extended control to the full robot body, integrating walking, manipulation, and balance into a single continuous system. In a demonstration, Helix 02 autonomously unloaded and reloaded a dishwasher across a full-sized kitchen in a four-minute, uninterrupted sequence with no resets or human intervention.
Gemini Robotics, introduced in March 2025, extended Google DeepMind's Gemini 2.0 multimodal model with physical action output capabilities. Where Gemini 2.0 could process text, images, video, and audio, Gemini Robotics added the ability to directly control robots.
The model demonstrated three core qualities: generality (adapting to different situations and following open-vocabulary instructions), interactivity (responding quickly to new instructions and changing conditions), and dexterity (performing fine manipulation tasks like folding origami and manipulating playing cards). According to DeepMind, Gemini Robotics more than doubled performance on a comprehensive generalization benchmark compared to other state-of-the-art VLA models.
A companion model, Gemini Robotics-ER (Embodied Reasoning), extended Gemini's multimodal reasoning into the physical world with spatial and temporal understanding, including object detection, grasp prediction, and 3D bounding box estimation. An on-device variant was optimized for local execution on robotic hardware, adapting to new tasks with as few as 50 to 100 demonstrations. Gemini Robotics 1.5, released in September 2025, added the ability to reason about and explain its own planned actions before executing them.
Skild AI, a Pittsburgh-based startup founded in 2023, is building what it calls a "general-purpose robotic brain." The Skild Brain is designed to be omni-bodied: it can control any robot without prior knowledge of the robot's exact body form, including quadrupeds, humanoids, tabletop arms, and mobile manipulators.
Skild's approach to the data problem relies heavily on large-scale simulation and internet video for pre-training, followed by targeted real-world data for post-training. The company demonstrated a notable research breakthrough in in-context learning for robotics: when introduced to a new body or unseen environment, the model adjusts its behavior based on live experience without requiring weight updates.
Skild AI raised $300 million in a Series A at a $1.5 billion valuation, and in January 2026, the company closed a $1.4 billion round led by SoftBank, bringing its valuation above $14 billion.
| Model | Organization | Year | Parameters | Training data | Action output | Open source | Key innovation |
|---|---|---|---|---|---|---|---|
| Gato | DeepMind | 2022 | 1.2B | 604 tasks (sim + real) | Discrete tokens | No | Multi-task, multi-modal generalist agent |
| RT-1 | Google | 2022 | ~35M | 130K episodes, 700+ tasks | Discrete (256 bins) | Yes | First large-scale real-world robot transformer |
| RT-2 | Google DeepMind | 2023 | 12B / 55B | Web + robot data | Discrete tokens (as text) | No | Actions as language tokens in a VLM |
| RoboCat | DeepMind | 2023 | N/A (Gato-based) | Multi-robot, self-generated | Discrete tokens | No | Self-improving via practice data |
| RT-1-X / RT-2-X | DeepMind + 20 labs | 2023 | 35M / 55B | 1M+ OXE trajectories | Discrete tokens | Yes (data) | Cross-embodiment transfer at scale |
| Octo | UC Berkeley | 2024 | 27M / 93M | 800K OXE episodes | Diffusion head | Yes | First open-source generalist robot policy with diffusion |
| OpenVLA | Stanford | 2024 | 7B | 970K OXE trajectories | Discrete tokens | Yes | Open 7B VLA with LoRA fine-tuning |
| CrossFormer | UC Berkeley | 2024 | N/A | 900K trajectories, 30 embodiments | Seq-to-seq tokens | Yes | Cross-embodiment without manual action alignment |
| pi0 | Physical Intelligence | 2024 | 3B+ | 7 platforms, 68 tasks | Flow matching (50 Hz) | Yes | Flow-matching action generation for dexterous tasks |
| pi0.5 | Physical Intelligence | 2025 | 3B+ | Expanded dataset | Dual (discrete + flow) | Yes | Hierarchical subtask planning + motor control |
| GR00T N1 | NVIDIA | 2025 | ~2B | Real + human video + synthetic | Diffusion Transformer (120 Hz) | Yes | Dual-system architecture for humanoid robots |
| Helix | Figure AI | 2025 | N/A | Proprietary | Dual system | No | First multi-robot cooperative VLA |
| Gemini Robotics | Google DeepMind | 2025 | N/A (Gemini 2.0-based) | Web + robot data | VLA (continuous) | No | Extends Gemini VLM to physical actions |
| Skild Brain | Skild AI | 2024-2025 | N/A | Sim + video + real | N/A | No | Omni-bodied control with in-context adaptation |
The availability of large-scale, diverse robot datasets has been a critical enabler for robot foundation models. The field has converged on several data sources.
The OXE dataset, released in October 2023, is the largest open-source real robot dataset. It aggregates 60 existing datasets from 34 research labs, containing over 1 million trajectories from 22 robot embodiments. All data is standardized into the RLDS format (based on TensorFlow Datasets) with consistent episode structure, action spaces, and metadata. OXE covers tabletop manipulation, mobile manipulation, bimanual tasks, and some locomotion scenarios. It has become the standard pre-training dataset for open-source robot foundation models including Octo, OpenVLA, and CrossFormer.
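Schematically, an RLDS episode is a nested dictionary of metadata plus an ordered sequence of steps. The field names below follow the RLDS convention, while the observation keys and the example instruction are illustrative (they vary across the 60 source datasets):

```python
import numpy as np

def make_step(image, action, is_first=False, is_last=False):
    """One RLDS-style step; observation keys here are illustrative examples."""
    return {
        "observation": {
            "image": image,                                  # camera frame
            "natural_language_instruction": "pick up the cup",
        },
        "action": action,
        "reward": np.float32(0.0),
        "discount": np.float32(1.0),
        "is_first": is_first,
        "is_last": is_last,
        "is_terminal": is_last,
    }

# An episode: metadata plus an ordered list of steps.
episode = {
    "episode_metadata": {"file_path": "episode_000001"},
    "steps": [
        make_step(
            np.zeros((224, 224, 3), np.uint8),
            np.zeros(7, np.float32),
            is_first=(i == 0),
            is_last=(i == 9),
        )
        for i in range(10)
    ],
}

print(len(episode["steps"]), episode["steps"][0]["is_first"])  # 10 True
```

Standardizing every contributed dataset into this shared structure is what lets a single training pipeline mix trajectories from 22 different embodiments.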
Real-world robot data collection is expensive and slow. A single demonstration might take several minutes, and collecting hundreds of thousands of demonstrations requires months of effort across multiple robot platforms. To close this gap, several groups have turned to synthetic data generation.
NVIDIA has invested heavily in this direction through its Omniverse simulation platform and Cosmos world foundation models. The NVIDIA Isaac GR00T Blueprint for synthetic manipulation motion generation can produce 780,000 synthetic trajectories in just 11 hours from a small number of human demonstrations. Combining synthetic data with real data improved GR00T N1's performance by 40% compared to using only real data. Common training ratios hover around 80% synthetic and 20% real data, though even small amounts of real data have been shown to close the domain gap in some settings.
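Mixing synthetic and real data at a fixed ratio can be implemented with a simple batch sampler. The sketch below draws roughly 20% of each batch from real trajectories, matching the ratio described above (function and variable names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_batch(synthetic, real, batch_size=256, real_fraction=0.2):
    """Draw a training batch with a fixed synthetic/real mixing ratio.
    The ~80/20 split is the ratio discussed above; it is a tuning choice."""
    n_real = int(batch_size * real_fraction)
    idx_real = rng.integers(0, len(real), n_real)
    idx_syn = rng.integers(0, len(synthetic), batch_size - n_real)
    return [real[i] for i in idx_real] + [synthetic[i] for i in idx_syn]

# Toy stand-ins for trajectory stores of the sizes mentioned in the text.
synthetic = ["synthetic"] * 780_000
real = ["real"] * 5_000

batch = sample_batch(synthetic, real)
print(len(batch), batch.count("real"))  # 256 51
```

Sampling per batch (rather than concatenating the datasets) keeps the ratio stable even when the synthetic pool is orders of magnitude larger than the real one.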
Several models, including GR00T N1.5 and Skild Brain, also train on large amounts of human demonstration video from the internet. While these videos lack robot action labels, they provide rich information about object manipulation, task structure, and physical scene understanding. Techniques like FLARE (Future Latent Representation Alignment), introduced in GR00T N1.5, allow the model to learn useful representations from human video even without corresponding action annotations.
Robot foundation models use several training strategies, often in combination.
The most common paradigm mirrors the approach used in NLP. A model is first pre-trained on a large, diverse dataset (often the OXE mixture or a proprietary multi-robot dataset) to learn general manipulation skills and visual-language grounding. It is then fine-tuned on a smaller, task-specific dataset for the target robot and application. Octo, for instance, can be fine-tuned for a new robot with just a few hours of demonstration data.
RT-2 pioneered co-fine-tuning, where a pre-trained VLM is jointly fine-tuned on a mixture of its original vision-language data and new robot trajectory data. This preserves the model's web knowledge and reasoning capabilities while teaching it to output motor actions. The approach allows emergent skill transfer: the model can reason about novel situations using knowledge from web data that was never paired with robot actions.
RoboCat demonstrated that a robot foundation model can improve itself through iterative practice. Starting from a small set of demonstrations, the model generates many additional trajectories through autonomous practice, and these are fed back into training. This bootstrapping approach reduces the human demonstration burden and allows the model to explore and learn from its own mistakes.
Several models combine simulated and real data during training. Simulated data provides cheap, abundant trajectories with perfect labels, while real data grounds the model in the visual and physical properties of the real world. NVIDIA's pipeline generates vast quantities of synthetic trajectories in Isaac Sim and Omniverse, which are combined with smaller amounts of real-world data for training GR00T models.
A defining goal of robot foundation models is cross-embodiment generalization: the ability of a single model to control different types of robots with different sensor configurations, action spaces, and physical morphologies.
The OXE project demonstrated that cross-embodiment training consistently improves performance. RT-1-X showed positive transfer on the majority of tested robot embodiments compared to models trained on each robot's data alone. RT-2-X extended this to show emergent transfer to robot types not represented in the training data.
CrossFormer made cross-embodiment training more practical by eliminating the need for manual action space alignment. Its modality-specific tokenizers handle the variability in observation types and action dimensions across different robots automatically.
Skild Brain takes cross-embodiment furthest, claiming omni-bodied control across quadrupeds, humanoids, tabletop arms, and mobile manipulators from a single model. The company reports that the model can adapt to new bodies through in-context learning without weight updates, adjusting its behavior on the fly based on live feedback.
Despite this progress, cross-embodiment generalization remains limited in several ways. Most successful demonstrations involve variations of tabletop manipulation arms. Transfer between fundamentally different robot morphologies (for example, from an arm to a legged robot) is less reliable. The action spaces, dynamics, and perceptual requirements of different robot types are different enough that a single model often struggles to handle all of them at the level of a specialist model trained specifically for each platform.
Foundation models in NLP train on billions of text examples, and vision models train on billions of images. Robot foundation models, by contrast, have access to far less data. The largest open dataset, OXE, contains roughly 1 million trajectories. While this sounds large, it is orders of magnitude smaller than what NLP models train on, and collecting robot data is far more expensive and time-consuming than scraping the web. Simulation and synthetic data generation help, but the gap between simulated and real-world conditions (the "sim-to-real gap") remains an active research problem.
Large models are computationally expensive. Running a 55B-parameter model in real time on a robot is impractical without cloud connectivity, which introduces latency and reliability concerns. This has driven interest in smaller, more efficient models (like Octo at 93M parameters) and on-device inference (like Gemini Robotics On-Device). The tradeoff between model size and inference speed is particularly acute in robotics, where delayed actions can lead to task failure or safety hazards.
Vision-language models can hallucinate, generating outputs that are plausible-sounding but factually incorrect. In a text chatbot, this is annoying. In a robot, it can be dangerous. A model that hallucinates the presence or absence of an object may collide with obstacles or execute actions in the wrong location. No current robot foundation model provides formal safety guarantees, and certifiable safety for learned policies remains an open research problem.
Most robot foundation models are vision-centric, relying primarily on camera images as input. Real-world manipulation often requires tactile feedback, force/torque sensing, and proprioceptive awareness that current VLAs do not fully integrate. Incorporating richer sensory modalities (touch, audio, force) into unified, temporally coherent representations is an active area of research but remains immature.
The field lacks standardized, real-world evaluation frameworks. Different groups test their models on different robots, in different environments, with different task sets and success criteria. This makes direct comparison between models difficult. Metrics beyond simple task success rate, including safety, efficiency, robustness to perturbation, and compute-aware performance measures, are needed for meaningful community-wide progress.
Most demonstrations of robot foundation models occur in controlled lab settings. Few examples exist in outdoor environments, and many setups use position-controlled arms on fixed tabletops rather than the more challenging scenarios of mobile manipulation, legged locomotion, or torque-controlled robots. Bridging the gap between lab demonstrations and real-world deployment in homes, warehouses, and hospitals remains a major challenge.
The development of robot foundation models has attracted substantial venture capital investment. Total funding across the robotics industry surpassed $10.3 billion in 2025, with a significant portion directed at foundation model companies.
| Company | Total funding | Valuation | Focus |
|---|---|---|---|
| Physical Intelligence | $470M+ | N/A | General-purpose robot foundation model (pi0) |
| Skild AI | $1.7B+ | $14B+ (Jan 2026) | Omni-bodied robotic brain |
| Figure AI | $675M+ | N/A | Humanoid robots with Helix VLA |
| 1X Technologies | $100M+ | N/A | Humanoid robots (NEO) |
NVIDIA, Google DeepMind, and other large technology companies fund robot foundation model research internally without requiring venture capital. The market is expected to consolidate in the coming years, with investment likely concentrating on three or four perceived leaders while smaller companies are acquired or struggle to raise additional capital.
As of early 2026, robot foundation models have progressed rapidly but remain far from the generalist capability levels that foundation models have achieved in language and vision. The state of the field can be characterized along several axes.
What works well: Tabletop manipulation with single arms, following open-vocabulary natural language instructions, picking and placing familiar objects, and transferring skills across similar robot arm platforms. Models like Gemini Robotics and pi0 have demonstrated dexterous tasks like laundry folding and origami that were considered very difficult just a few years ago.
What is emerging: Full-body humanoid control (GR00T, Helix), cooperative multi-robot tasks (Helix), learning from human video, and on-device inference for edge deployment.
What remains difficult: Robust outdoor operation, contact-rich assembly tasks requiring force feedback, long-horizon planning over minutes or hours, guaranteed safety, and truly omni-embodiment transfer that works reliably across arms, legs, wheels, and drones.
Key research directions for 2026 and beyond include closing the data gap through simulation, synthetic generation, and learning from human video; integrating tactile and force sensing into unified multimodal representations; developing formal safety guarantees for learned policies; establishing standardized real-world evaluation benchmarks; and achieving reliable cross-embodiment transfer beyond tabletop manipulation.