Robot learning is a field at the intersection of robotics and machine learning that focuses on enabling robots to acquire new skills, adapt to new environments, and improve their performance through experience rather than explicit programming. Instead of hand-coding every behavior a robot must execute, robot learning methods allow machines to develop competencies from data, demonstrations, trial and error, or a combination of these approaches.
The field draws on techniques from supervised learning, reinforcement learning, and unsupervised learning, applying them to physical systems that must operate in complex, unstructured, real-world environments. Because robots must deal with noisy sensors, imprecise actuators, and continuous state and action spaces, robot learning presents unique challenges beyond those found in purely digital domains.
The roots of robot learning stretch back to the earliest days of artificial intelligence research. Before the advent of learning-based approaches, robots were controlled entirely through manual programming, where engineers specified every motion, sensor reading, and decision rule by hand. This approach worked for structured factory environments but failed in unstructured settings where the world could not be fully predicted in advance.
The first industrial robot, Unimate, was installed at General Motors' Inland Fisher Guide Plant in Ewing Township, New Jersey, in 1961. Weighing 4,000 pounds, it extracted die-cast metal parts from a die-casting machine by following step-by-step commands stored on a magnetic drum. Unimate did not learn; it simply replayed programmed sequences. However, it demonstrated the potential of automating physical tasks and set the stage for later research into more flexible robotic systems.
Shakey, developed at the Stanford Research Institute (SRI) from 1966 to 1972, was the first general-purpose mobile robot that could reason about its own actions. Funded by DARPA, the project was led by Charles Rosen, Nils Nilsson, and Peter Hart. Shakey could visually perceive its environment, navigate between locations, communicate using plain English, and make its own action plans to solve problems.
Shakey's planning system relied on STRIPS (Stanford Research Institute Problem Solver), which represented the world as a set of logical propositions, defined actions as operators that changed which facts were true, and searched through possible action sequences to reach goal conditions. The project also produced the A* search algorithm and the Hough transform, both of which remain widely used today. Shakey was inducted into the Carnegie Mellon Robot Hall of Fame in 2004 and now resides at the Computer History Museum.
The Stanford Cart was a long-running research project at Stanford University between 1960 and 1980. In 1979, the Cart successfully navigated a chair-filled room autonomously using a camera mounted on a sliding track to create stereo image pairs. It processed these images to build three-dimensional environmental models, planned paths around obstacles, and executed movements. This was among the earliest demonstrations of a robot navigating through visual perception alone in an unstructured environment.
During the 1980s and 1990s, researchers began integrating machine learning into robotic systems. Work on neural networks, Q-learning, and behavior-based robotics (pioneered by Rodney Brooks at MIT) moved the field away from purely symbolic planning toward systems that could adapt through interaction with the environment. The TD-Gammon system (1992) by Gerald Tesauro, though not a robot, demonstrated that reinforcement learning could master complex tasks, inspiring similar approaches in robotics.
Learning from Demonstration (LfD), also called imitation learning, is a paradigm in which a robot learns to perform tasks by observing expert demonstrations. Rather than designing reward functions or programming control policies by hand, the robot extracts a mapping from observations to actions based on what a human (or another robot) has shown it.
Behavioral cloning (BC) is the simplest form of imitation learning. It treats demonstration data as a standard supervised learning problem: the robot collects observation-action pairs from an expert and trains a policy network to predict the expert's action given the current observation. Despite its simplicity, behavioral cloning has proven effective in many robotic manipulation tasks.
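To make the supervised framing concrete, the sketch below fits a linear policy to synthetic expert demonstrations using ordinary least squares. The expert controller and data are invented for illustration; a real system would train a neural network on camera images and teleoperated actions.

```python
import numpy as np

# Toy behavioral cloning: fit a linear policy to expert (observation, action)
# pairs. The "expert" here is a hypothetical linear controller u = K @ x; in
# practice the pairs would come from teleoperation or kinesthetic teaching.
rng = np.random.default_rng(0)
K_expert = np.array([[1.5, -0.3],
                     [0.2,  0.8]])            # unknown expert gains

observations = rng.normal(size=(500, 2))      # states the expert visited
actions = observations @ K_expert.T           # the expert's actions there

# The supervised learning step: regress actions on observations.
solution, *_ = np.linalg.lstsq(observations, actions, rcond=None)
K_learned = solution.T                        # recovered policy parameters
```

With noiseless demonstrations the regression recovers the expert's gains exactly; the interesting failures (distribution shift) only appear once the learned policy is rolled out, as discussed below.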
A well-known early example is ALVINN (Autonomous Land Vehicle In a Neural Network), developed by Dean Pomerleau at Carnegie Mellon University in 1989, which learned to steer a vehicle by watching a human driver. The system trained a neural network to map camera images directly to steering commands.
However, behavioral cloning suffers from a fundamental problem known as distribution shift (sometimes called covariate shift). Because the learned policy is imperfect, it will occasionally take actions that differ slightly from the expert's. These small errors accumulate over time, pushing the robot into states that were never represented in the training data. In those unfamiliar states, the policy may make increasingly poor decisions, leading to compounding errors.
DAgger (Dataset Aggregation), introduced by Stéphane Ross, Geoffrey J. Gordon, and J. Andrew Bagnell at AISTATS 2011 in their paper "A Reduction of Imitation Learning and Structured Prediction to No-Regret Online Learning," directly addresses the distribution shift problem. The algorithm works iteratively:

1. Train an initial policy on the expert's demonstrations.
2. Run the learned policy in the environment and record the states it visits.
3. Query the expert for the correct action in each visited state.
4. Aggregate these new state-action pairs into the dataset and retrain the policy.
5. Repeat until performance converges.
By iteratively collecting data under the distribution of states induced by the learned policy (rather than the expert's), DAgger ensures that the training data better represents what the robot will actually encounter during deployment. The authors proved that this procedure guarantees that the learned policy's performance converges to that of the best policy in the policy class, given enough iterations.
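A minimal sketch of the DAgger loop, using a hypothetical one-dimensional system in which the policy is a single linear gain and the "expert" is a hand-coded stabilizing controller; all numbers are illustrative, and the point is the loop structure, not the toy dynamics.

```python
import numpy as np

rng = np.random.default_rng(1)

def expert(s):
    """Hypothetical expert: a stabilizing linear controller a = -0.9 * s."""
    return -0.9 * s

def rollout(gain, s0=1.0, horizon=20):
    """Run the current policy and record the states it actually visits."""
    states, s = [], s0
    for _ in range(horizon):
        states.append(s)
        s = s + gain * s + rng.normal(scale=0.01)  # simple drift dynamics
    return np.array(states)

# Initial dataset: states from an expert-like trajectory, with expert labels.
dataset_s = rollout(gain=-0.9)
dataset_a = expert(dataset_s)

for _ in range(5):
    # Fit the policy (here just a scalar gain) to the aggregated dataset.
    gain = float(np.dot(dataset_s, dataset_a) / np.dot(dataset_s, dataset_s))
    # Collect states under the *learned* policy, then ask the expert what it
    # would have done in those states -- the key DAgger step.
    visited = rollout(gain)
    dataset_s = np.concatenate([dataset_s, visited])
    dataset_a = np.concatenate([dataset_a, expert(visited)])
```

Because each round adds expert labels for states the learner itself reaches, the training distribution tracks the deployment distribution.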
Inverse reinforcement learning (IRL) takes a different approach to learning from demonstrations. Rather than directly cloning the expert's actions, IRL infers the underlying reward function that the expert appears to be optimizing. Once the reward function is recovered, standard reinforcement learning can be used to find an optimal policy.
Pieter Abbeel and Andrew Ng pioneered much of the foundational work in IRL for robotics. A notable early success was learning helicopter aerobatics from pilot demonstrations, where IRL recovered a reward function that captured the pilot's implicit objectives, and reinforcement learning then produced autonomous aerobatic maneuvers.
IRL is particularly useful when the expert's behavior reflects complex, hard-to-articulate preferences. Applications range from surgical robotics, where IRL replicates decision-making patterns of expert surgeons, to autonomous driving, where it captures nuanced driving styles.
Reinforcement learning (RL) trains agents to maximize cumulative rewards through trial-and-error interaction with an environment. In robotics, RL allows robots to discover control policies without requiring demonstrations, instead learning from the consequences of their own actions.
Several deep reinforcement learning algorithms have become standard tools for robot learning:
| Algorithm | Type | Key Properties | Typical Robotics Use |
|---|---|---|---|
| PPO (Proximal Policy Optimization) | On-policy, actor-critic | Clipped surrogate objective for stable training; good for high-dimensional action spaces | Locomotion, whole-body control |
| SAC (Soft Actor-Critic) | Off-policy, actor-critic | Maximum entropy framework; encourages exploration; data-efficient | Manipulation, real-robot training |
| DDPG (Deep Deterministic Policy Gradient) | Off-policy, actor-critic | Continuous action spaces; deterministic policy | Continuous control tasks |
| TD3 (Twin Delayed DDPG) | Off-policy, actor-critic | Addresses overestimation bias in DDPG; dual critic networks | Precise manipulation |
PPO has been widely adopted for locomotion tasks due to its stability, while SAC is often preferred for manipulation because of its sample efficiency and ability to train directly on physical hardware.
A persistent challenge in applying RL to robotics is the design of reward functions. Sparse rewards (for example, giving a reward of +1 only when a task is completed) make learning extremely slow because the robot must randomly stumble upon success before it can begin improving. Reward shaping addresses this by providing intermediate feedback that guides the agent toward the goal.
For example, when training a robot to walk, instead of rewarding only completed strides, a shaped reward might include terms for maintaining balance, moving the center of mass forward, and minimizing energy consumption. Potential-based reward shaping, introduced by Andrew Ng, Daishi Harada, and Stuart Russell in 1999, provides a formal guarantee that shaping will not change the optimal policy while accelerating learning.
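The potential-based scheme is compact in code. This sketch assumes a hypothetical distance-to-goal potential; the key formula, r' = r + γΦ(s') − Φ(s), is the one shown by Ng, Harada, and Russell to preserve the optimal policy.

```python
import numpy as np

GAMMA = 0.99
GOAL = np.array([2.0, 0.0])

def potential(state):
    """Phi(s): negative distance to the goal (one reasonable choice)."""
    return -np.linalg.norm(state - GOAL)

def shaped_reward(state, next_state, task_reward):
    """Potential-based shaping: r' = r + gamma * Phi(s') - Phi(s)."""
    return task_reward + GAMMA * potential(next_state) - potential(state)

# A step toward the goal earns an immediate bonus even when the sparse task
# reward is still zero, giving the learner a gradient to follow.
r = shaped_reward(np.array([0.0, 0.0]), np.array([1.0, 0.0]), task_reward=0.0)
```

Here Φ goes from −2 to −1, so the shaped reward is 0 + 0.99·(−1) − (−2) = 1.01, rewarding progress long before the task itself succeeds.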
However, reward shaping must be applied carefully. Poorly designed shaping can cause the robot to exploit the auxiliary rewards instead of solving the actual task, a phenomenon sometimes called reward hacking.
Training robots through RL in the real world is expensive, slow, and potentially dangerous. Simulation offers a safe, fast, and cheap alternative: policies can be trained on millions of episodes in a physics simulator before being deployed on physical hardware. However, the sim-to-real gap, the discrepancy between simulated and real-world physics, remains a central challenge.
Domain randomization addresses the sim-to-real gap by training the policy across a wide distribution of simulated environments with randomized physical parameters (friction, mass, damping, visual textures, lighting, and sensor noise). The idea is that if a policy succeeds across many different simulated conditions, it will be robust enough to handle the real world, which is just one more point in the distribution.
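A minimal sketch of how an episode's physical parameters might be sampled. The parameter names and ranges are hypothetical; a real pipeline would write the sampled values into the simulator's model before resetting the environment.

```python
import random

# Hypothetical randomization ranges for a manipulation task.
RANGES = {
    "friction":       (0.5, 1.5),
    "object_mass_kg": (0.05, 0.5),
    "joint_damping":  (0.8, 1.2),
    "latency_ms":     (0.0, 40.0),
}

def sample_environment(rng):
    """Draw one randomized environment configuration per training episode."""
    return {name: rng.uniform(lo, hi) for name, (lo, hi) in RANGES.items()}

rng = random.Random(0)
episode_params = sample_environment(rng)
```

Training across thousands of such draws forces the policy to succeed under any plausible physics, rather than overfitting to one simulator configuration.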
OpenAI's landmark 2019 demonstration of solving a Rubik's Cube with a dexterous robot hand (the Shadow Dexterous Hand) relied on a technique called Automatic Domain Randomization (ADR). Rather than manually specifying randomization ranges, ADR automatically expanded the range of environmental parameters during training as the policy improved. The system started with narrow randomization ranges and gradually widened them, increasing the difficulty of the training environments over time without human intervention.
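The range-expansion idea behind ADR can be sketched as follows; the thresholds and step size here are illustrative choices, not OpenAI's published values.

```python
def update_range(lo, hi, success_rate, step=0.05,
                 expand_above=0.8, shrink_below=0.4):
    """ADR-style curriculum: widen the randomization range when the policy
    succeeds at its current boundary, narrow it when the policy struggles.
    Thresholds and step size are illustrative."""
    if success_rate >= expand_above:
        return lo - step, hi + step          # harder: broader randomization
    if success_rate <= shrink_below and hi - lo > 2 * step:
        return lo + step, hi - step          # easier: tighter randomization
    return lo, hi

# Example: a friction range grows as the policy keeps succeeding.
lo, hi = 0.9, 1.1
for measured_success in [0.9, 0.85, 0.95]:
    lo, hi = update_range(lo, hi, measured_success)
```

Each parameter gets its own range, so the curriculum automatically concentrates difficulty on whichever aspects of the environment the policy already handles well.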
The robot solved the Rubik's Cube approximately 60% of the time (and 20% of the time for maximally difficult scrambles). Remarkably, the system could handle situations it never encountered during training, such as being prodded with objects or having fingers tied together, demonstrating the robustness that domain randomization can achieve.
Beyond domain randomization, several other approaches help bridge the sim-to-real gap: system identification, which calibrates simulator parameters against measurements from the physical robot; domain adaptation, which aligns simulated and real observation distributions; and real-world fine-tuning, which adapts a simulation-trained policy with a small amount of data collected on hardware.
Physics simulators play a critical role in robot learning by providing environments where policies can be trained safely and at scale. The choice of simulator affects the fidelity of the physics, the speed of training, and the ease of sim-to-real transfer.
| Simulator | Developer | Key Strengths | License |
|---|---|---|---|
| MuJoCo | Google DeepMind (originally Emo Todorov) | Fast contact dynamics; widely used in RL research; GPU-accelerated (MJX) | Apache 2.0 (open source since May 2022) |
| Isaac Sim / Isaac Lab | NVIDIA | GPU-accelerated PhysX; photorealistic rendering via RTX; supports ROS/ROS2 | Apache 2.0 |
| SAPIEN | UC San Diego (Hao Su Lab) | Part-level articulated object interaction; GPU-parallelized; PartNet-Mobility dataset | Open source |
| PyBullet | Erwin Coumans | Lightweight; easy to use; good for prototyping | zlib license |
| Genesis | Various | Differentiable simulation; gradient-based optimization | Open source |
MuJoCo (Multi-Joint dynamics with Contact) was originally developed by Emanuel Todorov, Tom Erez, and Yuval Tassa at the University of Washington and described in a 2012 paper. It was commercialized through startup Roboti LLC in 2015 before being acquired by Google DeepMind in October 2021 and released as open-source software under the Apache 2.0 license in May 2022.
MuJoCo is one of the most cited simulation platforms in robot learning research, with over 9,000 citations as of early 2024. Its speed and accuracy in simulating contact dynamics make it particularly well-suited for manipulation and locomotion research. The MuJoCo Playground, released in 2025, enables researchers to train policies in minutes on a single GPU across diverse robotic platforms including quadrupeds, humanoids, dexterous hands, and robotic arms, supporting zero-shot sim-to-real transfer.
NVIDIA Isaac Sim is built on the Omniverse platform and uses PhysX for GPU-accelerated multi-physics simulation with RTX-based physically accurate sensor simulation (cameras, LiDARs). Isaac Lab, built on top of Isaac Sim, provides a unified framework for robot learning with over 16 robot models and more than 30 ready-to-train environments. It integrates with popular RL frameworks including RSL RL, SKRL, RL Games, and Stable Baselines. Isaac Lab supports procedural terrain generation, actuator dynamics modeling, domain randomization, and data collection from human demonstrations.
SAPIEN (SimulAted Part-based Interactive ENvironment) is a PhysX-based simulator developed at UC San Diego that specializes in manipulation tasks involving articulated objects. Its companion PartNet-Mobility dataset contains more than 2,000 articulated object models with over 14,000 movable parts. The ManiSkill framework, built on SAPIEN, focuses on manipulation skill benchmarks and can collect RGBD and segmentation data at over 30,000 frames per second on a single GPU.
The success of large-scale foundation models in natural language processing and computer vision has inspired a parallel effort in robotics: training large, general-purpose models on diverse robot data that can then be adapted to new tasks, environments, and robot embodiments.
RT-1, published by Google in December 2022, was one of the first large-scale transformer-based models for real-world robotic control. The architecture processes a short history of camera images along with natural language task descriptions and outputs tokenized actions.
The model uses an ImageNet-pretrained EfficientNet backbone conditioned on language instructions via FiLM (Feature-wise Linear Modulation) layers, followed by a TokenLearner module that compresses visual tokens, and a Transformer that attends over these tokens to produce discretized action outputs. With only 35 million parameters, the model is compact enough to run inference at 3 Hz on real robot hardware.
RT-1 was trained on 130,000 episodes covering over 700 tasks, collected using a fleet of 13 robots from Everyday Robots over 17 months. The action space includes seven dimensions for arm movement (x, y, z, roll, pitch, yaw, gripper opening), three dimensions for base movement (x, y, yaw), and a mode-switching dimension.
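RT-1 discretizes each continuous action dimension into 256 bins so the Transformer can emit actions as tokens. Below is a toy version of such binning, assuming a uniform [-1, 1] range per dimension; the real system's per-dimension ranges and binning details differ.

```python
import numpy as np

NUM_BINS = 256

def tokenize(action, low, high):
    """Map each continuous action dimension to one of 256 discrete bins
    (uniform binning over an assumed [low, high] range)."""
    normalized = (np.asarray(action) - low) / (high - low)
    return np.clip((normalized * NUM_BINS).astype(int), 0, NUM_BINS - 1)

def detokenize(tokens, low, high):
    """Recover bin-center continuous values from the discrete tokens."""
    return low + (tokens + 0.5) / NUM_BINS * (high - low)

low, high = -1.0, 1.0
arm_action = np.array([0.1, -0.4, 0.0, 0.25, 0.0, 0.0, 1.0])  # x,y,z,r,p,y,grip
tokens = tokenize(arm_action, low, high)
recovered = detokenize(tokens, low, high)
```

The quantization error is at most half a bin width, small enough for control, while letting the policy treat action prediction as next-token classification.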
RT-2, introduced by Google DeepMind in mid-2023, established the vision-language-action (VLA) model paradigm. Unlike RT-1, which was trained only on robot data, RT-2 is built on top of large vision-language models (VLMs) and learns from both web-scale data and robotic demonstrations.
RT-2 translates visual and language understanding into robotic actions, demonstrating generalization capabilities that go beyond its robotic training data. It can interpret novel commands, reason about object categories, and respond to high-level descriptions. With chain-of-thought reasoning, RT-2 can perform multi-stage semantic reasoning, such as deciding which object could serve as an improvised hammer (selecting a rock) or which drink would help a tired person (choosing an energy drink).
The Open X-Embodiment project, announced by Google DeepMind in October 2023, assembled the largest open-source real robot dataset to date. Created through collaboration with over 20 research institutions, it contains more than one million real robot trajectories spanning 22 robot embodiments, over 500 skills, and 150,000 tasks. The dataset was built by pooling 60 existing robot datasets from 34 robotic research labs worldwide.
Models trained on this data, called RT-X, demonstrate positive transfer across robot platforms. RT-1-X and RT-2-X showed that training on diverse multi-embodiment data improves performance compared to training on data from a single robot, with RT-2-X achieving triple the performance of its single-embodiment counterpart on real-world robotic skills.
Octo, developed by a team from UC Berkeley, Stanford, Carnegie Mellon University, and Google DeepMind, is an open-source generalist robot policy built on a transformer backbone with a diffusion-based action head. Released in 2024, Octo was pretrained on 800,000 robot episodes from the Open X-Embodiment dataset.
Two versions were released: Octo-Small (27 million parameters) and Octo-Base (93 million parameters). The model supports natural language instructions and goal images, processes observation histories, and generates multi-modal action distributions via diffusion decoding. Octo can be quickly fine-tuned to accommodate new observation modalities (such as force-torque inputs), new action spaces (such as joint position control), and entirely new robot embodiments.
In evaluations across 9 real robot setups at 4 institutions, Octo outperformed RT-1-X (the previous best open generalist policy) when using natural language task specification, and performed comparably to RT-2-X despite being orders of magnitude smaller.
Physical Intelligence, a company founded in 2024 by researchers from Google DeepMind, Stanford, and UC Berkeley, released pi-zero (written as the Greek letter, also stylized as pi0) as a general-purpose robot foundation model. The model is built on the PaliGemma vision-language model and was further trained on data from 7 different robots performing 68 tasks, as well as the Open X-Embodiment dataset.
A distinguishing feature of pi-zero is its use of flow matching to produce smooth, real-time action trajectories at 50 Hz, making it efficient and precise for real-world deployment. The model demonstrated capabilities that no prior robot learning system had achieved, including folding laundry from a hamper and assembling a cardboard box. In benchmarks, pi-zero achieved large improvements over OpenVLA and Octo across five different tasks.
Physical Intelligence open-sourced pi-zero's code and weights through its openpi repository on GitHub, and reported that between 1 and 20 hours of task-specific data was sufficient to fine-tune the model for new tasks.
OpenVLA, introduced by Stanford researchers in June 2024, is a 7-billion-parameter open-source vision-language-action model trained on 970,000 real-world robot manipulation trajectories from the Open X-Embodiment dataset. It combines a Llama 2 language model backbone with a visual encoder that fuses pretrained features from DINOv2 and SigLIP.
OpenVLA outperformed the much larger RT-2-X (55 billion parameters) by 16.5% in absolute task success rate across 29 tasks and multiple robot embodiments while being seven times smaller. The model can be fine-tuned on consumer GPUs using low-rank adaptation (LoRA) and served efficiently via quantization. All checkpoints and code are released under the MIT License.
A significant trend in robot learning is the use of natural language to specify tasks, enabling non-expert users to instruct robots through verbal commands rather than programming.
SayCan, developed by Google Research, combines the broad knowledge of a large language model with the grounded capabilities of learned robotic skills. The system works by having the language model propose candidate actions in natural language, while a learned value function scores each candidate based on its feasibility in the current physical context. The action with the highest combined score (language plausibility multiplied by physical feasibility) is selected and executed.
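The scoring rule can be sketched in a few lines. The candidate skills and scores below are invented for illustration; a real SayCan system obtains them by querying a language model and learned value functions.

```python
# Toy SayCan-style action selection: combine an LLM's plausibility score for
# each candidate skill with a value function's feasibility estimate for the
# current scene. All numbers below are made up for illustration.
candidates = {
    "pick up the sponge":  {"llm": 0.70, "affordance": 0.90},
    "pick up the apple":   {"llm": 0.25, "affordance": 0.10},  # not visible
    "go to the trash can": {"llm": 0.05, "affordance": 0.95},
}

def saycan_select(scores):
    """Choose the skill maximizing P(skill | instruction) * P(success | state)."""
    return max(scores, key=lambda a: scores[a]["llm"] * scores[a]["affordance"])

best = saycan_select(candidates)
```

The product grounds the language model: a skill the LLM finds plausible is rejected if the value function says it cannot currently succeed, and vice versa.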
The extended PaLM-SayCan system incorporated chain-of-thought prompting to handle tasks requiring multi-step reasoning. It also demonstrated multilingual capabilities, with almost no performance drop when switching queries from English to Chinese, French, or Spanish.
PaLM-E (2023), developed by Google, is an embodied multimodal language model that directly integrates visual and sensor observations into a large language model. By encoding continuous observations such as images and state estimates as tokens in the language model's embedding space, PaLM-E can apply vision-language reasoning to robot planning, and it demonstrated transfer across different robot embodiments and tasks.
The VLA paradigm, established by RT-2 and extended by models like OpenVLA and pi-zero, represents the current frontier of language-conditioned manipulation. These models take natural language instructions and camera images as input and directly output low-level robot actions. By pretraining on web-scale vision-language data before fine-tuning on robot demonstrations, VLA models inherit broad semantic understanding that enables them to generalize to novel objects, scenes, and instructions that were not present in the robot training data.
Dexterous manipulation, the ability to perform precise, contact-rich tasks with multi-fingered hands or complex end-effectors, represents one of the most challenging frontiers in robot learning.
Diffusion Policy, developed at Columbia University, represents a robot's visuomotor policy as a conditional denoising diffusion process. This approach models the action distribution as a diffusion model, generating action sequences through iterative denoising. The advantage of this formulation is its ability to represent multi-modal action distributions, which is critical for dexterous tasks where multiple valid strategies may exist for a single observation.
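The sampling procedure, stripped to a caricature: start from Gaussian noise and repeatedly apply a denoiser that nudges the sample toward the data distribution. Here the learned, observation-conditioned network is replaced by a hand-coded stand-in that pulls toward a single known action, so this shows only the iterative-refinement structure, not a faithful diffusion model with a noise schedule.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for a learned denoiser. In a real Diffusion Policy this is a
# neural network conditioned on the observation; here it simply predicts
# the noise component pulling the sample away from a known target action.
EXPERT_ACTION = np.array([0.3, -0.2])

def denoiser(noisy_action, observation):
    return noisy_action - EXPERT_ACTION       # predicted noise

def sample_action(observation, steps=50, step_size=0.2):
    """Generate an action by iterative denoising, starting from pure noise."""
    a = rng.normal(size=2)                    # initial Gaussian sample
    for _ in range(steps):
        a = a - step_size * denoiser(a, observation)
    return a

action = sample_action(observation=None)
```

A real diffusion policy keeps the stochasticity at each step, which is what lets the same observation yield several distinct valid action sequences, the multi-modality the paragraph above describes.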
Diffusion Policy has become a watershed result for dexterous manipulation in robotics, enabling a wide range of contact-rich tasks that were previously intractable with simpler policy representations.
ALOHA (A Low-cost Open-source Hardware System for Bimanual Teleoperation) was originally developed at Stanford University as an affordable bimanual teleoperation platform. The system enables human operators to collect high-quality demonstration data for two-armed manipulation tasks.
ALOHA Unleashed (2024), developed at Google DeepMind using the ALOHA 2 platform, demonstrated that large-scale data collection combined with expressive models such as Diffusion Policies is sufficient to learn extremely challenging bimanual manipulation tasks. The robot successfully learned to tie a shoelace, hang a shirt on a hanger, repair another robot, insert a gear, and clean a kitchen. These tasks involve deformable objects and complex contact-rich dynamics that had previously been beyond the reach of learned policies.
OpenAI's demonstration of solving a Rubik's Cube with the Shadow Dexterous Hand remains one of the most celebrated achievements in dexterous manipulation. The system used reinforcement learning combined with Automatic Domain Randomization, training entirely in simulation before transferring to the physical robot. The five-fingered hand had 24 degrees of freedom and was controlled by a neural network that processed fingertip positions and the cube's orientation from vision.
Learning locomotion policies for legged robots has become one of the most successful applications of sim-to-real reinforcement learning.
Quadruped robots such as the ANYmal series (from ETH Zurich / ANYbotics), Boston Dynamics' Spot, and Unitree's Go1 have become popular platforms for learned locomotion. Deep reinforcement learning enables these robots to traverse rough terrain, climb stairs, navigate dense vegetation, and perform agile maneuvers including jumping.
A typical pipeline involves training a policy in simulation using PPO or similar algorithms with domain randomization, then deploying it zero-shot on the physical robot. The policy takes proprioceptive observations (joint angles, angular velocities, body orientation) as input and outputs desired joint positions or torques.
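The observation-to-action interface of such a policy can be sketched as follows. The dimensions, random network weights, and scaling are placeholders standing in for a trained network, shown only to make the input and output shapes concrete.

```python
import numpy as np

rng = np.random.default_rng(0)

NUM_JOINTS = 12  # a typical quadruped: 3 actuated joints per leg

# Proprioceptive observation, as described above: joint angles, joint
# velocities, and base orientation (gravity direction in the body frame).
obs = np.concatenate([
    rng.uniform(-0.5, 0.5, NUM_JOINTS),   # joint angles (rad)
    rng.uniform(-1.0, 1.0, NUM_JOINTS),   # joint velocities (rad/s)
    np.array([0.0, 0.0, -1.0]),           # projected gravity vector
])

# Placeholder for a trained policy: a small two-layer MLP whose tanh output
# is interpreted as normalized joint position targets in [-1, 1].
W1 = rng.normal(scale=0.1, size=(64, obs.size))
W2 = rng.normal(scale=0.1, size=(NUM_JOINTS, 64))

def policy(o):
    hidden = np.tanh(W1 @ o)
    return np.tanh(W2 @ hidden)

joint_targets = policy(obs)  # sent to per-joint PD controllers on the robot
```

On hardware, these normalized targets are typically rescaled to joint limits and tracked by low-level PD controllers running at a much higher rate than the policy.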
Humanoid locomotion learning experienced rapid growth starting in 2023. A notable 2024 result from UC Berkeley demonstrated real-world humanoid locomotion learned entirely through reinforcement learning. The controller used a causal transformer that processed a history of proprioceptive observations and actions, trained with large-scale RL on thousands of randomized simulation environments. The resulting policy could walk over various outdoor terrains, remain robust to external disturbances, and adapt its gait in context, all via zero-shot sim-to-real transfer.
Imitation learning has also transformed legged locomotion research. Rather than hand-engineering reward functions, researchers can now train policies to replicate motion capture data from humans or animals, producing natural-looking gaits. More recent approaches use diffusion models to generate diverse locomotion behaviors from demonstration data.
Self-supervised learning allows robots to generate their own training signals through interaction with the environment, reducing dependence on labeled data or human demonstrations.
Self-supervised methods in robotics typically involve one or more of the following strategies:

- Forward prediction: learning to predict future observations from current observations and actions, yielding a dynamics model without external labels.
- Contrastive representation learning: learning state representations by distinguishing related observations from unrelated ones.
- Curiosity-driven exploration: using prediction error as an intrinsic reward that drives the robot toward novel states.
- Automatic outcome labeling: using the results of the robot's own attempts (for example, whether a grasp succeeded) as supervision.
Self-supervised approaches have dramatically improved reinforcement learning's practicality in industrial and commercial settings by reducing the amount of human supervision required.
| Approach | Learning Signal | Key Advantages | Key Challenges | Representative Methods |
|---|---|---|---|---|
| Behavioral Cloning | Expert demonstrations (state-action pairs) | Simple to implement; no environment interaction needed | Distribution shift; compounding errors | ALVINN, ACT, Diffusion Policy |
| DAgger | Expert demonstrations + online queries | Addresses distribution shift; provable guarantees | Requires interactive expert access | DAgger, HG-DAgger, MEGA-DAgger |
| Inverse RL | Expert demonstrations (infer reward) | Captures underlying objectives; generalizes beyond demonstrations | Computationally expensive; reward ambiguity | MaxEnt IRL, GAIL, AIRL |
| Model-Free RL | Environment reward signal | No model needed; can discover novel strategies | Sample inefficient; reward design is difficult | PPO, SAC, DDPG, TD3 |
| Model-Based RL | Environment reward + learned dynamics model | Sample efficient; enables planning | Model errors can compound; higher complexity | MBPO, Dreamer, PETS |
| Sim-to-Real Transfer | Simulated environment reward | Safe; cheap; massively parallelizable | Sim-to-real gap; calibration challenges | Domain randomization, ADR, system ID |
| Self-Supervised | Self-generated signals (prediction, curiosity) | No labels needed; scalable data collection | May learn task-irrelevant features | Forward prediction, contrastive learning |
| Foundation Model Fine-Tuning | Web data + robot demonstrations | Broad generalization; language grounding | Requires large compute; data diversity | RT-2, OpenVLA, pi-zero, Octo |
Robot learning has made remarkable progress, but several open challenges remain: