# Embodied AI

> Source: https://aiwiki.ai/wiki/embodied_ai
> Updated: 2026-06-20
> Categories: Artificial Intelligence, Deep Learning, Reinforcement Learning, Robotics
> From AI Wiki (https://aiwiki.ai), a free encyclopedia of artificial intelligence. Quote with attribution.

**Embodied AI** is artificial intelligence that perceives, reasons about, and acts within physical or simulated environments through a body, using sensors to observe the world, actuators to manipulate it, and control policies that connect perception to action in a closed loop. Unlike conventional AI that operates on static datasets or text-based interactions, embodied [AI agents](/wiki/ai_agents) confront the physical world directly: noisy sensors, unpredictable physics, latency between thought and motion, and consequences that cannot be undone with a reset button. NVIDIA CEO Jensen Huang has called this the next phase of the field, stating at GTC 2024 that "the next wave of AI is physical AI."[19] Systems range from robotic arms performing assembly tasks to humanoid robots navigating kitchens to virtual agents exploring photorealistic 3D scenes. Goldman Sachs Research projects the global market for humanoid robots, one of the most visible embodiments, could reach $38 billion by 2035, with shipments of roughly 1.4 million units, and up to $154 billion in a blue-sky scenario.[20]

The core premise of embodied AI is that intelligence cannot be fully understood or replicated in isolation from physical interaction. An agent that must balance on two legs, grasp fragile objects, or navigate a cluttered room has to cope with imperfect sensing, imperfect actuation, and irreversible outcomes at every step, demands that purely digital systems never face.

## Historical Background

### Classical AI and the Representational Paradigm

For much of the twentieth century, [artificial intelligence](/wiki/artificial_intelligence) research followed a symbolic or representational paradigm. Systems like SHRDLU (Terry Winograd, 1971) could discuss and manipulate blocks in a simulated world, but only because every aspect of that world was explicitly encoded in symbolic form. The physical grounding problem, how abstract symbols acquire meaning through sensory experience, remained unresolved.

Shakey the Robot, developed at the Stanford Research Institute between 1966 and 1972, was one of the earliest attempts to build an AI system that could perceive and act in a real environment. Shakey combined logical planning (using the STRIPS planner) with computer vision and motor control, but its deliberative architecture made it slow and brittle in practice.

### Rodney Brooks and Behavior-Based Robotics

The conceptual foundations of modern embodied AI trace back to Rodney Brooks at the MIT Artificial Intelligence Laboratory. In his landmark 1991 paper "Intelligence without Representation," Brooks argued that intelligent behavior could emerge from direct coupling between sensors and actuators, without requiring the elaborate internal world models favored by classical AI.[1] The paper crystallized the embodied stance in two now-famous claims: "Representation is the wrong unit of abstraction in building the bulkiest parts of intelligent systems," and "It turns out to be better to use the world as its own model."[1] He proposed that complex behavior arises from layers of simple reactive modules, each independently linking sensory input to motor output. The preceding year, in "Elephants Don't Play Chess" (1990), Brooks had argued that the higher cognitive abilities of a robot must be grounded in primary sensory-motor coupling rather than abstract symbolic reasoning.

Brooks introduced the **subsumption architecture** in 1986, a layered control system in which higher-level behaviors can suppress or override lower-level ones.[2] A robot built on this architecture might have a base layer for obstacle avoidance, a second layer for wandering, and a third layer for goal-directed navigation. Each layer runs concurrently, and the system operates in real time without waiting for a central planner.
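
A minimal sketch of this layering follows, under simplifying assumptions. The layer functions, command strings, sensor keys, and the 0.3 m threshold are all hypothetical, and the layers run one after another here rather than as the concurrent modules of Brooks's design. Each layer sees the command proposed by the layers below it and may pass it through or replace it.

```python
from typing import Callable, Optional

Sensors = dict
Command = Optional[str]
Layer = Callable[[Sensors, Command], Command]


def avoid_obstacles(sensors: Sensors, lower: Command) -> Command:
    # Layer 0: reflex that fires only when something is too close.
    return "turn_away" if sensors["obstacle_distance_m"] < 0.3 else None


def wander(sensors: Sensors, lower: Command) -> Command:
    # Layer 1: drive forward unless a lower-level reflex is already active.
    return lower if lower is not None else "drive_forward"


def seek_goal(sensors: Sensors, lower: Command) -> Command:
    # Layer 2: suppress wandering when the goal is visible,
    # but never suppress the avoidance reflex.
    if sensors["goal_visible"] and lower != "turn_away":
        return "steer_to_goal"
    return lower


LAYERS: list[Layer] = [avoid_obstacles, wander, seek_goal]


def arbitrate(sensors: Sensors, layers: list[Layer] = LAYERS) -> Command:
    """Pass the proposed command up through the layers, lowest first."""
    command: Command = None
    for layer in layers:
        command = layer(sensors, command)
    return command
```

Calling `arbitrate({"obstacle_distance_m": 1.0, "goal_visible": True})` yields the goal-seeking command, while the same call with a nearby obstacle yields the avoidance reflex. There is no central planner: the behavior comes from the fixed wiring between layers.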

Several robots demonstrated these ideas in practice:

- **Herbert** (late 1980s): A wheeled robot with 30 infrared proximity sensors, a laser light striper, and a two-degree-of-freedom arm that autonomously collected empty soda cans from desks and tables in the MIT AI Lab. Herbert used roughly 15 behavior layers, including obstacle avoidance, wandering, can detection, and arm manipulation.[1]
- **Genghis** (1989): A six-legged walking robot weighing under 1 kilogram, built to demonstrate that realistic locomotion gaits could emerge from distributed control across 12 actuators, with no central coordinator.

Brooks's work drew intellectual support from the **situated cognition** and **embodied cognition** movements in philosophy and cognitive science. Researchers like Francisco Varela, Evan Thompson, and Eleanor Rosch argued in *The Embodied Mind* (1991) that cognition is inseparable from bodily experience.[3] George Lakoff and Mark Johnson's earlier work, *Metaphors We Live By* (1980), had similarly proposed that even abstract thought is grounded in physical and sensory experience.[18] In robotics, Rolf Pfeifer and Hans Moravec championed the view that true artificial intelligence requires a physical body situated in a real environment.

### From Reactive Systems to Learning-Based Approaches

While behavior-based robotics demonstrated that useful behavior could emerge without explicit planning, these systems were hand-engineered and lacked the ability to learn from experience. The resurgence of [deep learning](/wiki/deep_learning) in the 2010s opened new possibilities. [Convolutional neural networks](/wiki/convolutional_neural_network) enabled robots to interpret raw camera images, and [reinforcement learning](/wiki/reinforcement_learning) provided a framework for learning control policies through trial and error. The combination of learned perception, simulation-based training, and large-scale data collection set the stage for the foundation model era in robotics.

## Key Areas of Embodied AI

### Robot Manipulation

Manipulation involves controlling a robotic arm and gripper (or hand) to grasp, move, place, and assemble objects. This is one of the most studied problems in embodied AI because it combines perception (identifying objects and estimating their poses), planning (determining a sequence of grasps and placements), and control (executing smooth, force-appropriate motions).

Dexterous manipulation, where a multi-fingered hand must reposition an object using finger gaits, represents a particularly difficult frontier. [OpenAI](/wiki/openai) demonstrated in 2019 that a physical Shadow Dexterous Hand could solve a Rubik's Cube using a policy trained entirely in simulation through reinforcement learning with extensive domain randomization, though training consumed the equivalent of thousands of years of simulated experience.

### Navigation

Autonomous navigation requires an agent to move through an environment while avoiding obstacles, mapping its surroundings, and reaching specified goals. Navigation benchmarks range from PointGoal ("go to coordinates X, Y") to ObjectGoal ("find a chair") to Vision-Language Navigation ("go to the kitchen and pick up the red mug on the counter").

Classical approaches relied on simultaneous localization and mapping (SLAM) combined with path planning algorithms. Modern approaches increasingly use end-to-end [deep learning](/wiki/deep_learning), where a neural network maps camera images and goal specifications directly to movement commands.
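
In the classical pipeline, planning starts once a map exists. The sketch below assumes the map is a small occupancy grid (0 = free, 1 = occupied) and searches it breadth-first on a 4-connected lattice. The grid, the function name, and the choice of breadth-first search are illustrative assumptions. Deployed planners more commonly use A* or sampling-based methods on much larger maps.

```python
from collections import deque


def plan_path(grid, start, goal):
    """Return the shortest 4-connected path from start to goal, or None."""
    rows, cols = len(grid), len(grid[0])
    parents = {start: None}          # also serves as the visited set
    frontier = deque([start])
    while frontier:
        cell = frontier.popleft()
        if cell == goal:
            path = []
            while cell is not None:  # walk parent links back to the start
                path.append(cell)
                cell = parents[cell]
            return path[::-1]
        r, c = cell
        for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            inside = 0 <= nr < rows and 0 <= nc < cols
            if inside and grid[nr][nc] == 0 and (nr, nc) not in parents:
                parents[(nr, nc)] = cell
                frontier.append((nr, nc))
    return None


# Hypothetical map: a wall across the middle row with a gap on the right.
GRID = [
    [0, 0, 0, 0],
    [1, 1, 1, 0],
    [0, 0, 0, 0],
]
```

With this map, `plan_path(GRID, (0, 0), (2, 0))` has to detour through the gap at column 3. An end-to-end learned policy replaces both the map and this search with a single network.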

### Locomotion

Locomotion research focuses on getting legged robots (bipeds, quadrupeds, hexapods) to walk, run, climb, and recover from disturbances. The challenge is acute because legged locomotion involves underactuated dynamics, intermittent ground contact, and the constant threat of falling.

[Boston Dynamics](/wiki/boston_dynamics) demonstrated impressive hand-engineered locomotion with [Atlas](/wiki/atlas_robot) (bipedal) and Spot (quadrupedal), but reinforcement learning approaches have increasingly matched or surpassed these results. Researchers at ETH Zurich trained ANYmal quadruped robots to traverse rough terrain, climb stairs, and recover from pushes using policies learned entirely in simulation and transferred to the real robot.

## Simulation Environments

Training embodied AI agents in the real world is expensive, slow, and potentially dangerous. Simulation provides a scalable alternative: agents can accumulate millions of episodes of experience in hours rather than months, and failures carry no cost. Several simulation platforms have become central to embodied AI research.

| Platform | Developer | Open Source | Physics Engine | Primary Use Cases | GPU Acceleration |
|---|---|---|---|---|---|
| [MuJoCo](/wiki/mujoco) | DeepMind (originally Emo Todorov) | Yes (Apache 2.0, 2022) | Custom (MuJoCo) | RL for locomotion, manipulation, biomechanics | Limited (MJX backend) |
| PyBullet | Erwin Coumans (Bullet Physics) | Yes (zlib License) | Bullet | RL, manipulation, legged robots | No |
| Isaac Sim / Isaac Lab | [NVIDIA](/wiki/nvidia) | Proprietary (free for research) | PhysX 5 | Manipulation, locomotion, industrial automation | Yes (thousands of parallel envs) |
| Habitat | [Meta AI](/wiki/meta_ai) (FAIR) | Yes (MIT License) | Custom | Visual navigation, embodied QA, human-robot interaction | Yes |
| SAPIEN | UC San Diego | Yes | PhysX | Manipulation, articulated objects | Yes |
| AI2-THOR | Allen Institute for AI | Yes | Unity | Indoor navigation, interaction | Limited |
| RoboCasa | UT Austin + NVIDIA | Yes | MuJoCo | Household tasks, large-scale environments | Limited |

### MuJoCo

[MuJoCo](/wiki/mujoco) (Multi-Joint dynamics with Contact) was originally developed by Emanuel Todorov, Tom Erez, and Yuval Tassa at the University of Washington and first described in 2012.[16] It was commercially available through Roboti LLC from 2015 until DeepMind acquired it in October 2021. DeepMind released MuJoCo as open-source software under the Apache 2.0 license in May 2022. As of 2024, the original MuJoCo publication had been cited over 5,300 times, making it one of the most widely used physics simulators in robotics research.

MuJoCo excels at fast, accurate simulation of articulated rigid bodies with contact, making it the standard platform for reinforcement learning research in locomotion and manipulation. Its MJX backend, introduced with MuJoCo 3.0 in 2023, provides [JAX](/wiki/jax)-based GPU acceleration for massively parallel simulation.

### PyBullet

PyBullet is a Python interface to the Bullet Physics SDK, originally created by Erwin Coumans. It provides real-time simulation of rigid body dynamics, collision detection, forward and inverse kinematics, and inverse dynamics computation. PyBullet gained popularity in the robotics and reinforcement learning communities due to its ease of use, zero cost, and straightforward Python API. While it lacks the GPU parallelism of newer simulators, PyBullet remains widely used for prototyping and small-scale experiments.

### NVIDIA Isaac Sim and Isaac Lab

NVIDIA Isaac Sim, built on the Omniverse platform, combines high-fidelity GPU-accelerated physics (PhysX 5) with photorealistic rendering for robotics simulation. Isaac Lab extends Isaac Sim with a modular, composable framework for designing training environments, incorporating actuator models, sensor simulation, and domain randomization tools.

Isaac Lab's key advantage is massive parallelism: it can simulate thousands of robot environments simultaneously on a single GPU, dramatically accelerating reinforcement learning training. NVIDIA has demonstrated zero-shot sim-to-real transfer of policies trained in Isaac Lab to physical robots, including industrial assembly tasks on UR10e arms.

### Habitat

Habitat, developed by Meta AI's Fundamental AI Research (FAIR) team, is a simulation platform optimized for visual navigation and embodied reasoning. It renders photorealistic 3D scenes from real-world scan datasets (such as Matterport3D) at thousands of frames per second on a single GPU, enabling rapid training of navigation and interaction policies.[14]

Habitat 3.0 (2023) extended the platform to support both robot and humanoid avatars, enabling research on human-robot collaboration. Agents trained in Habitat 3.0 learn to find and collaborate with human partners on household tasks like tidying a room.

## Sim-to-Real Transfer

Policies trained in simulation often fail when deployed on physical robots due to the "reality gap," the differences between simulated and real-world physics, sensor noise, lighting, and dynamics. Several techniques have been developed to bridge this gap.

**Domain randomization** is the most widely used approach. During training in simulation, the system randomly varies physical parameters (friction, mass, joint damping), visual properties (lighting, textures, camera placement), and dynamics parameters. With sufficient variation, the real world appears to the trained policy as just another variation within the training distribution.[17] This technique was notably demonstrated by OpenAI in training dexterous manipulation policies that transferred to a physical Shadow Hand.
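
The per-episode sampling step can be sketched as follows. The parameter names and ranges are hypothetical placeholders rather than values from any particular simulator or paper. In practice each sampled set would be written into the simulator before the episode starts.

```python
import random
from dataclasses import dataclass


@dataclass
class SimParams:
    friction: float
    mass_kg: float
    joint_damping: float
    light_intensity: float


# Hypothetical randomization ranges: (low, high) for each parameter.
RANGES = {
    "friction": (0.5, 1.5),
    "mass_kg": (0.8, 1.2),
    "joint_damping": (0.01, 0.1),
    "light_intensity": (0.3, 1.0),
}


def sample_params(rng: random.Random) -> SimParams:
    """Draw one set of physical and visual parameters uniformly from RANGES."""
    return SimParams(**{name: rng.uniform(lo, hi) for name, (lo, hi) in RANGES.items()})


def randomized_episodes(num_episodes: int, seed: int = 0):
    """Yield a fresh parameter set for every training episode."""
    rng = random.Random(seed)
    for _ in range(num_episodes):
        yield sample_params(rng)
```

Widening the ranges makes the policy more robust but usually harder to train, so the ranges themselves become a tuning decision.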

**System identification** takes the opposite approach: instead of randomizing broadly, it measures the real-world physical parameters as accurately as possible and configures the simulator to match. This produces a tighter match between simulation and reality but requires careful calibration.
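
As a toy illustration, suppose a joint is modeled as `torque = damping * velocity` and the damping coefficient is unknown. The model, the function name, and the sample data are assumptions made for this sketch. Real calibration fits many coupled parameters at once.

```python
def estimate_damping(velocities, torques):
    """Least-squares fit of torque = damping * velocity (no intercept)."""
    numerator = sum(v * t for v, t in zip(velocities, torques))
    denominator = sum(v * v for v in velocities)
    if denominator == 0:
        raise ValueError("need at least one non-zero velocity sample")
    return numerator / denominator


# Hypothetical measurements logged from the real robot.
measured_velocities = [1.0, 2.0, 3.0]   # rad/s
measured_torques = [0.5, 1.0, 1.5]      # N*m
damping = estimate_damping(measured_velocities, measured_torques)
```

The fitted value would then be written into the simulator's joint model so that simulated and real dynamics agree more closely.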

**Sim-to-real fine-tuning** involves training initially in simulation and then fine-tuning the policy with a smaller amount of real-world data. This combines the data efficiency of simulation with the fidelity of real-world experience.

**Adaptive methods** train policies that can explicitly adapt to different dynamics at test time, often using meta-learning or online system identification. The agent learns not a single fixed policy but a policy that can quickly adjust its behavior based on observed dynamics.

## Foundation Models for Robotics

The success of large-scale [foundation models](/wiki/foundation_models) in [natural language processing](/wiki/nlu) and [computer vision](/wiki/computer_vision) has inspired analogous efforts in robotics. A robotics foundation model is a large neural network, pre-trained on diverse data, that can generalize across tasks, objects, environments, and even robot hardware.

### What is a Vision-Language-Action model?

Vision-Language-Action ([VLA](/wiki/vla)) models represent a convergence of [large language models](/wiki/llama), vision encoders, and robotic action prediction. A VLA typically takes camera images and a natural language instruction as input and outputs low-level robot actions (joint velocities, end-effector positions, or gripper commands). By inheriting semantic knowledge from internet-scale pretraining, VLAs can generalize to novel objects and instructions that were never seen during robot data collection.

The VLA paradigm emerged from work at Google DeepMind starting with RT-2 in 2023,[5] and has since been adopted by virtually every major robotics research group.

### Learning from Demonstration

[Imitation learning](/wiki/supervised_learning) is a central training strategy for robotics foundation models. Rather than defining reward functions for reinforcement learning, researchers collect demonstrations of desired behavior and train the policy to reproduce them. Demonstrations are typically gathered through teleoperation, where a human operator controls the robot remotely.

Key teleoperation systems include:

- **ALOHA / ALOHA 2** (Stanford / Google DeepMind): A low-cost bimanual teleoperation platform where a human operator controls two robot arms by physically manipulating matched leader arms. ALOHA 2 was the primary data collection platform for Google's [Gemini](/wiki/gemini) Robotics models.[12]
- **Universal Manipulation Interface (UMI)**: A portable data collection framework using hand-held grippers, enabling demonstrations to be collected "in the wild" rather than in a fixed lab setup.
- **Kinesthetic teaching**: The human physically guides the robot's arm through the desired motion. Research in 2024-2025 found that kinesthetic teaching produces the cleanest data for downstream learning, though it scales less efficiently than teleoperation.
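
However the demonstrations are gathered, the simplest way to learn from them is behavior cloning: treat each recorded (observation, action) pair as a supervised example. The sketch below fits a one-dimensional linear policy by ordinary least squares. The scalar observation, the scalar action, and the demonstration values are hypothetical stand-ins for the images and joint commands a real system would use.

```python
def fit_linear_policy(observations, actions):
    """Fit action = weight * observation + bias by ordinary least squares."""
    n = len(observations)
    mean_o = sum(observations) / n
    mean_a = sum(actions) / n
    cov = sum((o - mean_o) * (a - mean_a) for o, a in zip(observations, actions))
    var = sum((o - mean_o) ** 2 for o in observations)
    if var == 0:
        raise ValueError("demonstrations must cover more than one observation value")
    weight = cov / var
    return weight, mean_a - weight * mean_o


def act(policy, observation):
    """Apply the cloned policy to a new observation."""
    weight, bias = policy
    return weight * observation + bias


# Hypothetical demonstrations: distance to target (m) -> commanded speed (m/s).
demo_observations = [0.0, 0.2, 0.4, 0.8]
demo_actions = [0.0, 0.1, 0.2, 0.4]
policy = fit_linear_policy(demo_observations, demo_actions)
```

Here `act(policy, 0.6)` interpolates a speed the operator never demonstrated. Foundation models replace the linear fit with a large network, but the training signal is the same.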

## Major Foundation Models for Robotics

The following table summarizes the principal foundation models that have shaped embodied AI research and development.

| Model | Organization | Year | Parameters | Architecture | Training Data | Key Innovation |
|---|---|---|---|---|---|---|
| RT-1 | Google | 2022 | 35M | EfficientNet + TokenLearner + [Transformer](/wiki/attention) | 130K episodes, 700+ tasks, 13 robots | First large-scale real-world robot transformer |
| RT-2 | [Google DeepMind](/wiki/google_deepmind) | 2023 | 12B / 55B | PaLM-E / PaLI-X (VLM) fine-tuned on robot data | RT-1 data + web-scale vision-language data | First VLA; web knowledge transfer to robot actions |
| RT-X | Google DeepMind | 2023 | 35M (RT-1-X) / 55B (RT-2-X) | Same as RT-1 / RT-2 | Open X-Embodiment (1M+ episodes, 22 embodiments) | Cross-embodiment positive transfer |
| Octo | UC Berkeley, Stanford, CMU, DeepMind | 2024 | 27M / 93M | Transformer-based diffusion policy | 800K episodes from Open X-Embodiment | Open-source; flexible fine-tuning to new robots |
| [OpenVLA](/wiki/openvla) | Stanford, UC Berkeley, DeepMind, TRI | 2024 | 7B | SigLIP + DINOv2 + Llama 2 | 970K episodes from Open X-Embodiment | Open-source 7B VLA outperforming 55B RT-2-X[8] |
| pi0 | [Physical Intelligence](/wiki/physical_intelligence) | 2024 | Not disclosed | PaliGemma VLM + action expert (flow matching) | 7 robot configs, 68 tasks + Open X-Embodiment | First generalist policy across diverse robot platforms |
| Helix | Figure AI | 2025 | Not disclosed | Dual-system VLA (slow VLM + fast visuomotor policy) | Humanoid demonstration data | First VLA for full humanoid upper-body control |
| GR00T N1 | [NVIDIA](/wiki/nvidia) | 2025 | 2B | Dual-system VLA (vision-language + diffusion transformer) | Real robot + human video + synthetic data | Open humanoid foundation model; 780K synthetic trajectories |
| Gemini Robotics | Google DeepMind | 2025 | Not disclosed | Gemini 2.0 extended with action output modality | ALOHA 2 teleoperation + multi-embodiment data | 2x generalization over prior SOTA VLAs |
| SmolVLA | Hugging Face | 2025 | 450M | Compact VLM + flow-matching action transformer | 10M frames from 487 community datasets (LeRobot) | Open-source compact VLA running on consumer hardware |

### RT-1 (Google, 2022)

RT-1 (Robotics Transformer 1) was a turning point for data-driven robot learning. Published in December 2022, it demonstrated that a transformer-based model trained on a large, diverse, real-world dataset could perform over 700 training instructions at a 97% success rate.[4] The architecture combined an ImageNet-pretrained [EfficientNet](/wiki/efficientnet)-B3 for visual processing, FiLM conditioning for language instruction integration, a TokenLearner module for token compression, and a Transformer backbone for action prediction.

The training dataset consisted of 130,000 episodes collected over 17 months using a fleet of 13 mobile manipulator robots from Everyday Robots (a Google/Alphabet project). The data covered kitchen-environment tasks like picking, placing, opening drawers, and moving objects to specified locations. RT-1 showed strong generalization, outperforming baselines by 25% on new tasks, 36% with novel distractors, and 18% in unseen backgrounds; it executed 76% of never-before-seen instructions, 24 points more than the next best baseline.[4]

### RT-2 (Google DeepMind, 2023)

RT-2, published in July 2023, introduced the Vision-Language-Action model concept.[5] Instead of building a robot-specific architecture from scratch, RT-2 fine-tuned pre-existing large vision-language models ([PaLM](/wiki/palm)-E at 12B parameters and PaLI-X at 55B parameters) on the same robot demonstration data used for RT-1. By encoding robot actions as text tokens appended to the language model's output vocabulary, RT-2 could leverage the semantic and visual knowledge acquired during web-scale pretraining.
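
A rough sketch of action tokenization follows. The RT-2 paper describes discretizing each action dimension into 256 bins. The normalized value range, the space-separated integer string, and the function names below are illustrative assumptions rather than the model's actual vocabulary mapping.

```python
NUM_BINS = 256


def encode_action(action, low=-1.0, high=1.0, num_bins=NUM_BINS):
    """Map each continuous action dimension to a bin index and join as text."""
    tokens = []
    for value in action:
        clipped = min(max(value, low), high)
        fraction = (clipped - low) / (high - low)
        # The top edge of the range falls into the last bin.
        tokens.append(min(int(fraction * num_bins), num_bins - 1))
    return " ".join(str(token) for token in tokens)


def decode_action(text, low=-1.0, high=1.0, num_bins=NUM_BINS):
    """Recover approximate continuous values from bin indices (bin centers)."""
    width = (high - low) / num_bins
    return [low + (int(token) + 0.5) * width for token in text.split()]
```

Under these assumptions, `encode_action([-1.0, 0.0, 1.0])` gives the string `"0 128 255"`, and decoding recovers each value to within half a bin width. Because actions become ordinary tokens, the language model can be fine-tuned on robot data with its usual next-token objective.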

This transfer proved powerful. RT-2 could interpret instructions involving concepts never seen in robot training data. For example, when asked to "pick up the object that can be used as an improvised hammer," the model correctly identified a rock, reasoning from general world knowledge rather than explicit robot training.[5]

### Open X-Embodiment and RT-X (Google DeepMind, 2023)

The Open X-Embodiment project, announced in October 2023, assembled the largest open-source real robot dataset to date. Created through a collaboration of 21 institutions and pooling 60-plus existing datasets, it contains more than one million real robot trajectories spanning 22 robot embodiments (single arms, bimanual systems, quadrupeds) and 527 distinct skills across 160,266 tasks.[6]

Models trained on this combined dataset (RT-1-X and RT-2-X) demonstrated positive transfer: the diverse, cross-embodiment training data improved performance on individual robots compared to training on that robot's data alone.[6] The Open X-Embodiment dataset has since served as the primary training resource for most subsequent open-source robotics foundation models, much as [ImageNet](/wiki/image_recognition) catalyzed progress in computer vision.

### Octo (2024)

Octo is an open-source generalist robot policy developed collaboratively by researchers at UC Berkeley, Stanford, Carnegie Mellon University, and Google DeepMind. Published in May 2024, it is a transformer-based diffusion policy pretrained on 800,000 episodes from the Open X-Embodiment dataset.[7]

Octo was released in two sizes: Octo-Small (27M parameters) and Octo-Base (93M parameters). Its distinguishing feature is flexibility: Octo supports both language-conditioned and goal-image-conditioned task specification, and it can be fine-tuned to new robot setups with different sensors and action spaces within a few hours on standard consumer GPUs.[7] All model weights, training code, and fine-tuning scripts are publicly available.

### pi0 (Physical Intelligence, 2024)

Physical Intelligence, a startup founded in 2024, announced pi0 (pi-zero) as the first generalist robotic foundation model capable of controlling diverse robot platforms across qualitatively different tasks, including folding laundry, assembling boxes, bussing tables, and making coffee.[9] The model starts from a pre-trained vision-language model (PaliGemma, which pairs a SigLIP vision encoder with a [Gemma](/wiki/gemma) language model) and adds an action expert trained using flow matching on robot trajectory data.
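
Flow matching can be illustrated generically. The sketch below uses a straight-line path from Gaussian noise to a demonstrated action: the network would be trained to regress the constant velocity along that path, and sampling integrates the learned velocity field from noise to an action. The function names and the time convention (noise at t = 0, action at t = 1) are assumptions of this sketch and are not taken from the pi0 implementation, whose conditioning on images and language is omitted here.

```python
import random


def flow_matching_example(action, rng):
    """Build one training example for a straight-line flow-matching path."""
    noise = [rng.gauss(0.0, 1.0) for _ in action]
    t = rng.random()  # interpolation time in [0, 1)
    x_t = [(1 - t) * n + t * a for n, a in zip(noise, action)]
    # Regression target: the velocity that carries noise to the action.
    target_velocity = [a - n for n, a in zip(noise, action)]
    return x_t, t, target_velocity


def sample_action(velocity_fn, dim, rng, steps=10):
    """Integrate dx/dt = velocity_fn(x, t) from t=0 to t=1 with Euler steps."""
    x = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    dt = 1.0 / steps
    for k in range(steps):
        v = velocity_fn(x, k * dt)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x
```

In a trained model, `velocity_fn` is the action expert. For checking the sampler by hand, an oracle field `(target - x) / (1 - t)` for a single known target drives any noise sample onto that target after the final step.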

pi0 was pre-trained on diverse data from seven distinct robot configurations and 68 tasks, and can be directly prompted or fine-tuned for complex downstream tasks.[9] Physical Intelligence raised $400 million in a 2024 round and an additional $600 million in November 2025 at a $5.6 billion valuation; counting its earlier seed round, total funding came to about $1.1 billion, with investors including Alphabet's CapitalG, Thrive Capital, Lux Capital, and Jeff Bezos.[21]

In February 2025, Physical Intelligence open-sourced the pi0 model weights and code. Version 0.6, released in November 2025, doubled throughput on tasks such as espresso filter insertion and laundry folding.

### Helix (Figure AI, 2025)

[Figure AI](/wiki/figure_ai) unveiled Helix in February 2025 as the first VLA designed specifically for humanoid upper-body control.[10] Helix uses a dual-system architecture: System 2 is a slower, internet-pretrained vision-language model that handles scene understanding and language comprehension, while System 1 is a fast visuomotor policy that converts System 2's representations into real-time motor commands.
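
The timing relationship between the two systems can be sketched as a loop in which the fast policy acts on every control tick while the slow model refreshes its output only occasionally. The function names, the callable interfaces, and the 10-to-1 rate ratio are hypothetical. A real system would run the two models asynchronously on separate hardware rather than in one sequential loop.

```python
def run_dual_system(observations, slow_model, fast_policy, slow_every=10):
    """Run the fast policy every tick, refreshing the slow latent periodically."""
    latent = None
    actions = []
    for tick, obs in enumerate(observations):
        if tick % slow_every == 0:
            # System 2: slow scene and language understanding.
            latent = slow_model(obs)
        # System 1: fast motor command from the current observation
        # and the most recent (possibly stale) latent.
        actions.append(fast_policy(obs, latent))
    return actions
```

The design choice this illustrates is that the motor loop never waits on the large model: between refreshes, the fast policy keeps reacting to new observations using the last available latent.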

Helix is notable for being the first VLA to output high-rate continuous control of the entire humanoid upper body, including wrists, torso, head, and individual fingers.[10] It runs entirely onboard, on embedded low-power GPUs, making it suitable for commercial deployment. In demonstrations, Figure robots equipped with Helix picked up thousands of novel household objects following natural language prompts. A flagship demo showed the robot loading and unloading a dishwasher in a four-minute end-to-end sequence.

### NVIDIA GR00T and GR00T N1 (2025)

NVIDIA announced Project GR00T (Generalist Robot 00 Technology) at GTC in March 2024 as a foundation model initiative for humanoid robots. GR00T N1, released in March 2025, is a 2-billion-parameter open VLA model for generalized humanoid robot skills.[11]

GR00T N1 employs a dual-system architecture similar to Helix: a vision-language module (System 2) interprets the environment through vision and language instructions, while a diffusion transformer module (System 1) generates fluid motor actions in real time. The model was trained on a heterogeneous mixture of real-robot trajectories, human videos, and synthetic data.[11]

A particularly significant result involved synthetic data generation: NVIDIA produced 780,000 synthetic trajectories, the equivalent of 6,500 hours (about nine continuous months) of human demonstration, in just 11 hours using its simulation infrastructure. Combining this synthetic data with real data improved GR00T N1's performance by 40% compared to real data alone.[11]
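
The reported figures can be checked with simple arithmetic. The inputs below are the numbers quoted above. The derived speedup and the average trajectory length are consequences of those numbers, not values reported by NVIDIA, and the month conversion assumes 30-day months.

```python
SYNTHETIC_TRAJECTORIES = 780_000
EQUIVALENT_DEMO_HOURS = 6_500
GENERATION_HOURS = 11

# 6,500 hours of round-the-clock demonstration, expressed in days and months.
continuous_days = EQUIVALENT_DEMO_HOURS / 24
continuous_months = continuous_days / 30  # assumes 30-day months

# How much faster generation was than real-time human demonstration.
speedup = EQUIVALENT_DEMO_HOURS / GENERATION_HOURS

# Implied average duration of one trajectory.
seconds_per_trajectory = EQUIVALENT_DEMO_HOURS * 3600 / SYNTHETIC_TRAJECTORIES
```

The result is about 271 days (roughly nine months), a speedup near 590x, and an implied average of 30 seconds per trajectory.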

The GR00T platform also includes [Jetson Thor](/wiki/jetson_thor), a system-on-chip with a Blackwell-architecture GPU delivering 800 teraflops of 8-bit floating point performance, designed to run the complete robot AI stack onboard.

### Gemini Robotics (Google DeepMind, 2025)

Google DeepMind introduced Gemini Robotics as an extension of the Gemini 2.0 family, adding physical action as a new output modality.[12] Alongside it, Gemini Robotics-ER (Embodied [Reasoning](/wiki/reasoning)) provides enhanced spatial understanding for robotic applications.

Gemini Robotics was trained primarily on data from the ALOHA 2 bimanual teleoperation platform and demonstrated control across multiple robot form factors, including ALOHA 2 arms, Franka bimanual setups, and the [Apptronik](/wiki/apptronik) Apollo humanoid. On a comprehensive generalization benchmark, Gemini Robotics more than doubled performance compared to prior state-of-the-art VLAs.[12]

Gemini Robotics 1.5, announced later in 2025, showed the ability to transfer learned motions from one robot embodiment to another without per-embodiment specialization. Gemini Robotics On-Device was optimized to run locally on robotic hardware, enabling deployment without cloud connectivity.

### SmolVLA (Hugging Face, 2025)

[SmolVLA](/wiki/smolvla), released by [Hugging Face](/wiki/hugging_face) in June 2025, represents the opposite end of the scale spectrum from models like Gemini Robotics. At 450 million parameters, SmolVLA is a compact, open-source VLA designed to run on consumer hardware. It was pretrained on 10 million frames curated from 487 community datasets tagged under "lerobot" on the Hugging Face Hub.[13]

Despite its small size and modest training data (fewer than 30,000 episodes), SmolVLA matched or outperformed larger models on both simulated (LIBERO, Meta-World) and real-world benchmarks.[13] Its architecture combines a compact vision-language backbone with a flow-matching transformer for action prediction, supporting asynchronous inference for improved throughput.

### How do open-source VLAs compare to proprietary ones?

A defining feature of the 2024-2025 wave was that small, fully open models began to match or beat much larger closed systems. OpenVLA, a 7-billion-parameter open model trained on 970,000 episodes from Open X-Embodiment, outperformed the closed-source 55-billion-parameter RT-2-X by 16.5 percentage points in absolute task success rate across 29 tasks and multiple robot embodiments, despite having roughly seven times fewer parameters.[8] Octo, OpenVLA, pi0, and SmolVLA all released their weights and training code, and community-driven data collection through Hugging Face's LeRobot project (the source of SmolVLA's 487 datasets) is steadily expanding the shared data pool that open models depend on.

## Google DeepMind Robotics Program

Google DeepMind has maintained one of the most sustained robotics research efforts among major AI labs. Key milestones include:

- **2022**: RT-1 demonstrated large-scale transformer-based robot control.
- **2023**: Google Brain and DeepMind merged to form Google DeepMind, consolidating their robotics research. RT-2 pioneered the VLA paradigm. The Open X-Embodiment collaboration assembled the largest open robot dataset. AutoRT explored large-scale autonomous robot data collection.
- **2024**: Advances in robot dexterity, including fine manipulation tasks. Continued ALOHA 2 development for bimanual research.
- **2025**: Gemini Robotics and Gemini Robotics-ER brought Gemini's capabilities to physical robot control. Gemini Robotics 1.5 and On-Device variants addressed multi-embodiment transfer and edge deployment.

Google DeepMind's approach has consistently emphasized scaling through data (the Open X-Embodiment philosophy), leveraging pretrained web-scale models (the VLA paradigm), and building general-purpose systems that work across multiple robot platforms.

## Benchmarks and Evaluation

Evaluating embodied AI systems is challenging because performance depends on the physical robot, the environment, the task distribution, and the evaluation protocol. Several benchmarks have been developed to enable reproducible comparison.

| Benchmark | Focus | Tasks | Environment | Notable Feature |
|---|---|---|---|---|
| BEHAVIOR-1K | Household activities | 1,000 | OmniGibson simulation, 50 scenes | 9,000+ objects with rich physical properties |
| RLBench | Manipulation | 100 | CoppeliaSim | 28 object categories; vision-based control |
| CALVIN | Language-conditioned manipulation | 34 | Custom simulation | Long-horizon multi-step tasks |
| LIBERO | Manipulation with diverse knowledge | 130 | Custom simulation | Procedural generation of task suites |
| Habitat Challenge | Navigation and interaction | Multiple tracks | Habitat simulator | Annual competition since 2019 |
| Open X-Embodiment Eval | Cross-embodiment manipulation | Varies by robot | Real-world | Standardized eval across 22 robot types |
| Meta-World | Multi-task manipulation | 50 | MuJoCo | Benchmark for multi-task and meta-RL |

**BEHAVIOR-1K** is among the most comprehensive embodied AI benchmarks. Developed at Stanford, it includes 1,000 everyday activities grounded in 50 scenes (houses, gardens, restaurants, offices) with more than 9,000 objects annotated with physical and semantic properties.[15] It uses the OmniGibson simulation environment, which supports rigid bodies, deformable bodies, and liquids. As of 2024, no existing AI system can autonomously solve all 1,000 activities, making it a long-term research target.
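
Benchmark results are usually reported as success rates over a finite number of trials, so small differences between systems can fall within sampling noise. One common way to express that uncertainty is a Wilson score interval, sketched below. The function name and the example trial counts are illustrative and do not come from any of the benchmarks listed.

```python
import math


def success_rate_interval(successes, trials, z=1.96):
    """Return (rate, lower, upper) using the Wilson score interval.

    z = 1.96 corresponds to an approximately 95% interval.
    """
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = successes / trials
    denom = 1 + z * z / trials
    center = (p + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    return p, center - half, center + half
```

For a hypothetical policy that succeeds in 18 of 20 real-robot trials, the interval spans roughly 0.70 to 0.97, which shows why evaluations with few trials support only coarse comparisons.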

## Applications

### Manufacturing and Industrial Automation

Embodied AI enables robots to handle tasks that traditional fixed-automation cannot: picking irregularly shaped objects from bins, assembling components with varying tolerances, performing quality inspection through active perception, and adapting to product changeovers without manual reprogramming. NVIDIA Isaac Lab has been used to train reinforcement learning policies for industrial gear assembly tasks that transfer directly to UR10e robot arms.

### Warehouse and Logistics

Warehouse robots equipped with embodied AI perform goods-to-person picking, palletizing, and sorting at scale. Amazon's Sequoia system speeds inventory processing by 75%, and DHL's sorting robots have increased capacity by 40%. Humanoid and wheeled robots navigate warehouse floors, avoid human workers, and adapt to changing layouts without centralized orchestration.

### Healthcare

Robotic-assisted surgery systems use embodied AI for improved precision and stability. Rehabilitation robots adapt exercise regimens to patient progress in real time. Service robots in hospitals deliver medications, transport lab samples, and disinfect rooms. Research published in 2024 showed nearly sevenfold growth in embodied AI healthcare publications compared to 2019.

### Domestic Robotics

Home robots represent a long-standing goal of embodied AI. Recent foundation models have brought this closer to reality: pi0 demonstrated laundry folding and coffee making, Figure's Helix performed dishwasher loading, and Google's Gemini Robotics handled a range of kitchen manipulation tasks. The challenge remains achieving the reliability and robustness needed for unsupervised operation in unstructured home environments.

### Autonomous Vehicles

[Self-driving cars](/wiki/autonomous_driving) are a form of embodied AI that must perceive road conditions, predict the behavior of other road users, plan trajectories, and execute smooth control, all in real time with safety-critical constraints. While autonomous driving has developed somewhat independently from the robotics foundation model community, the underlying challenges of perception, prediction, and control are shared.

## Challenges and Open Problems

### Why is robot data the main bottleneck?

Robotics still lacks the equivalent of the internet-scale text and image datasets that powered progress in language and vision AI. The Open X-Embodiment dataset contains over one million episodes,[6] but this is modest compared to the trillions of tokens used to train large language models. Scaling robot data collection remains expensive and time-consuming, even with teleoperation systems. This data gap is the central reason NVIDIA, Physical Intelligence, and others have turned to synthetic data: the 780,000 simulated trajectories NVIDIA produced for GR00T N1 were generated in 11 hours, against the roughly nine months the same volume would take to record from human operators.[11]
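
To put the synthetic-data figures above in perspective, the following back-of-the-envelope sketch converts them into throughput. Only the 780,000 trajectories, the 11 hours, and the nine months come from the text; the conversion of nine months into hours (30-day months, recording around the clock) is an assumption made for illustration, and the function name is hypothetical.

```python
# Figures quoted in the text.
SYNTHETIC_TRAJECTORIES = 780_000
GENERATION_HOURS = 11
HUMAN_MONTHS = 9

# Assumption (not from the source): 30-day months of continuous recording.
HOURS_PER_MONTH = 30 * 24


def throughput_per_hour(trajectories: int, hours: float) -> float:
    """Average number of trajectories produced per hour."""
    return trajectories / hours


human_hours = HUMAN_MONTHS * HOURS_PER_MONTH
synthetic_rate = throughput_per_hour(SYNTHETIC_TRAJECTORIES, GENERATION_HOURS)
human_rate = throughput_per_hour(SYNTHETIC_TRAJECTORIES, human_hours)
speedup = human_hours / GENERATION_HOURS

print(f"Human recording time: {human_hours} hours")
print(f"Synthetic throughput: {synthetic_rate:,.0f} trajectories/hour")
print(f"Human throughput:     {human_rate:,.0f} trajectories/hour")
print(f"Wall-clock speedup:   {speedup:.0f}x")
```

Under these assumptions the synthetic pipeline is several hundred times faster in wall-clock terms. The ratio says nothing about whether a simulated trajectory is as useful for training as a teleoperated one.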

### Generalization

Current foundation models generalize well to new objects and instructions within familiar environments, but struggle with substantially different physical settings, lighting conditions, or robot hardware. True general-purpose robotics, where a single model can perform any physical task in any environment, remains distant.

### Safety and Reliability

Robots that interact with people and fragile objects must operate with extremely high reliability. A language model that occasionally produces an incorrect answer is an inconvenience; a robot that occasionally drops a heavy object on a person is a safety hazard. Ensuring safe behavior, especially when using learned policies that can behave unpredictably outside their training distribution, is an unsolved problem.

### Real-Time Inference

Many VLA models are computationally expensive. Running a 7-billion-parameter model at the 10-50 Hz control frequencies required for dexterous manipulation demands powerful onboard compute. Solutions include model distillation, quantization, and the dual-system architectures used by Helix and GR00T N1, where a lightweight action module runs at high frequency while a heavier reasoning module runs at lower frequency.
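
The dual-system idea can be illustrated with a minimal, deterministic sketch. It is not the implementation used by Helix or GR00T N1: the class name, both placeholder modules, and the 5 Hz reasoning rate are assumptions chosen only to show how a slow module can refresh a goal while a fast module acts on every control tick.

```python
from dataclasses import dataclass
from typing import List

FAST_HZ = 50  # action-module rate, at the top of the 10-50 Hz range cited above
SLOW_HZ = 5   # reasoning-module rate; an illustrative assumption


@dataclass
class DualSystemController:
    """Toy two-rate controller: slow goal updates, fast action updates."""
    fast_hz: int = FAST_HZ
    slow_hz: int = SLOW_HZ
    latent_goal: float = 0.0
    slow_calls: int = 0
    fast_calls: int = 0

    def slow_reason(self, observation: float) -> float:
        # Stand-in for a large vision-language model: pick a target
        # one unit ahead of the current observation (hypothetical rule).
        self.slow_calls += 1
        return observation + 1.0

    def fast_act(self, observation: float) -> float:
        # Stand-in for a lightweight policy: proportional step toward
        # the most recent goal from the slow module.
        self.fast_calls += 1
        return 0.1 * (self.latent_goal - observation)

    def run(self, observations: List[float]) -> List[float]:
        ticks_per_slow_update = self.fast_hz // self.slow_hz
        actions = []
        for tick, obs in enumerate(observations):
            # The slow module runs only on every Nth fast tick.
            if tick % ticks_per_slow_update == 0:
                self.latent_goal = self.slow_reason(obs)
            actions.append(self.fast_act(obs))
        return actions


controller = DualSystemController()
actions = controller.run([0.0] * FAST_HZ)  # one simulated second
print(f"fast calls: {controller.fast_calls}, slow calls: {controller.slow_calls}")
print(f"per-step budget at {FAST_HZ} Hz: {1000 / FAST_HZ:.0f} ms")
```

In this sketch the heavy module is invoked once for every ten action steps, so its latency does not have to fit inside the per-step control budget. A real system would run the two modules asynchronously rather than in a single loop.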

### Long-Horizon Planning

Most current systems excel at short-horizon tasks (pick up this object, move it there) but struggle with extended sequences of actions that require maintaining state, recovering from errors, and adapting plans. Cooking a meal, cleaning a house, or assembling furniture requires chaining dozens of subtasks over minutes or hours.
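
One way to see why chaining is hard is to look at how per-subtask reliability compounds. The sketch below assumes that every subtask succeeds independently with the same probability and that there is no error recovery. Both are simplifying assumptions, not claims from the source, and the 95% figure is purely illustrative.

```python
def chain_success_probability(per_step_success: float, num_subtasks: int) -> float:
    """Probability that all subtasks succeed, assuming independent,
    identically reliable subtasks and no recovery from failure."""
    if not 0.0 <= per_step_success <= 1.0:
        raise ValueError("per_step_success must be in [0, 1]")
    if num_subtasks < 0:
        raise ValueError("num_subtasks must be non-negative")
    return per_step_success ** num_subtasks


# Illustrative reliability of 95% per subtask.
for n in (1, 10, 30):
    p = chain_success_probability(0.95, n)
    print(f"{n:>2} subtasks -> {p:.3f}")
# prints 0.950, 0.599, 0.215
```

Under these assumptions, a policy that looks reliable on a single pick-and-place step completes a 30-step task only about one time in five, which is why error recovery and replanning matter as much as raw per-step accuracy.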

## Future Directions

The field is converging on several trends. Foundation models will continue to grow in scale and capability, with synthetic data generation (as demonstrated by NVIDIA's approach with GR00T N1) partially addressing the data bottleneck. The dual-system architecture, with a slow reasoning module and a fast action module, is becoming a standard design pattern. Open-source models like Octo, OpenVLA, pi0, and SmolVLA are democratizing access to robotics AI, and community-driven data collection through platforms like Hugging Face's LeRobot is building shared datasets.

The integration of embodied AI with the broader AI ecosystem, connecting robot agents to the same foundation models that power chatbots, code generators, and creative tools, points toward a future where physical and digital intelligence are two facets of a single system. As [Jensen Huang](/wiki/jensen_huang), CEO of NVIDIA, put it at GTC 2024: "The next wave of AI is physical AI."[19] Goldman Sachs analysts, who revised their humanoid market forecast more than sixfold to $38 billion by 2035, cite progress in AI, and robotic large language models in particular, as a key reason for the change.[20]

## See Also

- [π*0.6 (pi-star-0.6)](/wiki/pi_star_0_6)
- [Drone AI](/wiki/drone_ai)
- [Reinforcement Learning](/wiki/reinforcement_learning)
- [Computer Vision](/wiki/computer_vision)
- [Autonomous Driving](/wiki/autonomous_driving)
- [NVIDIA](/wiki/nvidia)
- [Deep Learning](/wiki/deep_learning)
- [Diffusion Models](/wiki/diffusion_models)

## References

1. Brooks, R. A. (1991). "Intelligence without Representation." *Artificial Intelligence*, 47(1-3), 139-159.
2. Brooks, R. A. (1986). "A Robust Layered Control System for a Mobile Robot." *IEEE Journal on Robotics and Automation*, 2(1), 14-23.
3. Varela, F. J., Thompson, E., & Rosch, E. (1991). *The Embodied Mind: Cognitive Science and Human Experience*. MIT Press.
4. Brohan, A., et al. (2022). "RT-1: Robotics Transformer for Real-World Control at Scale." *arXiv:2212.06817*.
5. Brohan, A., et al. (2023). "RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control." *arXiv:2307.15818*.
6. Open X-Embodiment Collaboration. (2023). "Open X-Embodiment: Robotic Learning Datasets and RT-X Models." *arXiv:2310.08864*.
7. Octo Model Team. (2024). "Octo: An Open-Source Generalist Robot Policy." *Robotics: Science and Systems (RSS) 2024*. *arXiv:2405.12213*.
8. Kim, M. J., et al. (2024). "OpenVLA: An Open-Source Vision-Language-Action Model." *arXiv:2406.09246*.
9. Black, K., et al. (2024). "pi0: A Vision-Language-Action Flow Model for General Robot Control." *arXiv:2410.24164*.
10. Figure AI. (2025). "Helix: A Vision-Language-Action Model for Generalist Humanoid Control." *figure.ai/news/helix*.
11. NVIDIA. (2025). "GR00T N1: An Open Foundation Model for Generalist Humanoid Robots." *arXiv:2503.14734*; NVIDIA Newsroom, "NVIDIA Announces Isaac GR00T N1," March 2025.
12. Google DeepMind. (2025). "Gemini Robotics brings AI into the physical world." *deepmind.google/blog/gemini-robotics-brings-ai-into-the-physical-world*.
13. Hugging Face. (2025). "SmolVLA: Efficient Vision-Language-Action Model trained on Lerobot Community Data." *huggingface.co/blog/smolvla*.
14. Savva, M., et al. (2019). "Habitat: A Platform for Embodied AI Research." *ICCV 2019*.
15. Li, C., et al. (2024). "BEHAVIOR-1K: A Human-Centered, Embodied AI Benchmark with 1,000 Everyday Activities and Realistic Simulation." *arXiv:2403.09227*.
16. Todorov, E., Erez, T., & Tassa, Y. (2012). "MuJoCo: A physics engine for model-based control." *IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS)*.
17. Tobin, J., et al. (2017). "Domain Randomization for Transferring Deep Neural Networks from Simulation to the Real World." *arXiv:1703.06907*.
18. Lakoff, G. & Johnson, M. (1980). *Metaphors We Live By*. University of Chicago Press.
19. NVIDIA. (2024). Jensen Huang keynote, GTC 2024. NVIDIA Newsroom and GTC 2024 keynote coverage.
20. Goldman Sachs Research. (2024). "The global market for humanoid robots could reach $38 billion by 2035." *goldmansachs.com/insights*.
21. Bloomberg / The Robot Report. (2025). "Robotics startup Physical Intelligence valued at $5.6 billion in new funding," November 2025.

