Robot learning is a field at the intersection of robotics and machine learning that focuses on enabling robots to acquire new skills, adapt to new environments, and improve their performance through experience rather than explicit programming. Instead of hand-coding every behavior a robot must execute, robot learning methods allow machines to develop competencies from data, demonstrations, trial and error, or a combination of these approaches.
The field draws on techniques from supervised learning, reinforcement learning, and unsupervised learning, applying them to physical systems that must operate in complex, unstructured, real-world environments. Because robots must deal with noisy sensors, imprecise actuators, and continuous state and action spaces, robot learning presents unique challenges beyond those found in purely digital domains.
The roots of robot learning stretch back to the earliest days of artificial intelligence research. Before the advent of learning-based approaches, robots were controlled entirely through manual programming, where engineers specified every motion, sensor reading, and decision rule by hand. This approach worked for structured factory environments but failed in unstructured settings where the world could not be fully predicted in advance.
The first industrial robot, Unimate, was installed at General Motors' Inland Fisher Guide Plant in Ewing Township, New Jersey, in 1961. Weighing 4,000 pounds, it extracted die-cast metal parts from a die-casting machine by following step-by-step commands stored on a magnetic drum. Unimate did not learn; it simply replayed programmed sequences. However, it demonstrated the potential of automating physical tasks and set the stage for later research into more flexible robotic systems.
Shakey, developed at the Stanford Research Institute (SRI) from 1966 to 1972, was the first general-purpose mobile robot that could reason about its own actions. Funded by DARPA, the project was led by Charles Rosen, Nils Nilsson, and Peter Hart. Shakey could visually perceive its environment, navigate between locations, communicate using plain English, and make its own action plans to solve problems.
Shakey's planning system relied on STRIPS (Stanford Research Institute Problem Solver), which represented the world as a set of logical propositions, defined actions as operators that changed which facts were true, and searched through possible action sequences to reach goal conditions. The project also produced the A* search algorithm and the Hough transform, both of which remain widely used today. Shakey was inducted into the Carnegie Mellon Robot Hall of Fame in 2004 and now resides at the Computer History Museum.
The Stanford Cart was a long-running research project at Stanford University between 1960 and 1980. In 1979, the Cart successfully navigated a chair-filled room autonomously using a camera mounted on a sliding track to create stereo image pairs. It processed these images to build three-dimensional environmental models, planned paths around obstacles, and executed movements. This was among the earliest demonstrations of a robot navigating through visual perception alone in an unstructured environment.
During the 1980s and 1990s, researchers began integrating machine learning into robotic systems. Work on neural networks, Q-learning, and behavior-based robotics (pioneered by Rodney Brooks at MIT) moved the field away from purely symbolic planning toward systems that could adapt through interaction with the environment. The TD-Gammon system (1992) by Gerald Tesauro, though not a robot, demonstrated that reinforcement learning could master complex tasks, inspiring similar approaches in robotics.
Learning from Demonstration (LfD), also called imitation learning, is a paradigm in which a robot learns to perform tasks by observing expert demonstrations. Rather than designing reward functions or programming control policies by hand, the robot extracts a mapping from observations to actions based on what a human (or another robot) has shown it.
Behavioral cloning (BC) is the simplest form of imitation learning. It treats demonstration data as a standard supervised learning problem: the robot collects observation-action pairs from an expert and trains a policy network to predict the expert's action given the current observation. Despite its simplicity, behavioral cloning has proven effective in many robotic manipulation tasks.
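To make the supervised framing concrete, the sketch below fits a linear policy to synthetic expert demonstrations using ordinary least squares. The expert controller and data are invented for illustration; a real system would train a neural network on camera images and teleoperated actions.

```python
import numpy as np

# Toy behavioral cloning: fit a linear policy to expert (observation, action)
# pairs. The "expert" here is a hypothetical linear controller u = K @ x; in
# practice the pairs would come from teleoperation or kinesthetic teaching.
rng = np.random.default_rng(0)
K_expert = np.array([[1.5, -0.3],
                     [0.2,  0.8]])            # unknown expert gains

observations = rng.normal(size=(500, 2))      # states the expert visited
actions = observations @ K_expert.T           # the expert's actions there

# The supervised learning step: regress actions on observations.
solution, *_ = np.linalg.lstsq(observations, actions, rcond=None)
K_learned = solution.T                        # recovered policy parameters
```

With noiseless demonstrations the regression recovers the expert's gains exactly; the interesting failures (distribution shift) only appear once the learned policy is rolled out, as discussed below.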
A well-known early example is ALVINN (Autonomous Land Vehicle In a Neural Network), developed by Dean Pomerleau at Carnegie Mellon University in 1989, which learned to steer a vehicle by watching a human driver. The system trained a neural network to map camera images directly to steering commands.
However, behavioral cloning suffers from a fundamental problem known as distribution shift (sometimes called covariate shift). Because the learned policy is imperfect, it will occasionally take actions that differ slightly from the expert's. These small errors accumulate over time, pushing the robot into states that were never represented in the training data. In those unfamiliar states, the policy may make increasingly poor decisions, leading to compounding errors.
DAgger (Dataset Aggregation), introduced by Stéphane Ross, Geoffrey J. Gordon, and J. Andrew Bagnell at AISTATS 2011 in their paper "A Reduction of Imitation Learning and Structured Prediction to No-Regret Online Learning," directly addresses the distribution shift problem. The algorithm works iteratively:

1. Train an initial policy on the expert's demonstrations.
2. Run the learned policy in the environment and record the states it visits.
3. Query the expert for the correct action in each visited state.
4. Aggregate these new state-action pairs into the dataset and retrain the policy.
5. Repeat until performance converges.
By iteratively collecting data under the distribution of states induced by the learned policy (rather than the expert's), DAgger ensures that the training data better represents what the robot will actually encounter during deployment. The authors proved that this procedure guarantees that the learned policy's performance converges to that of the best policy in the policy class, given enough iterations.
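A minimal sketch of the DAgger loop, using a hypothetical one-dimensional system in which the policy is a single linear gain and the "expert" is a hand-coded stabilizing controller; all numbers are illustrative, and the point is the loop structure, not the toy dynamics.

```python
import numpy as np

rng = np.random.default_rng(1)

def expert(s):
    """Hypothetical expert: a stabilizing linear controller a = -0.9 * s."""
    return -0.9 * s

def rollout(gain, s0=1.0, horizon=20):
    """Run the current policy and record the states it actually visits."""
    states, s = [], s0
    for _ in range(horizon):
        states.append(s)
        s = s + gain * s + rng.normal(scale=0.01)  # simple drift dynamics
    return np.array(states)

# Initial dataset: states from an expert-like trajectory, with expert labels.
dataset_s = rollout(gain=-0.9)
dataset_a = expert(dataset_s)

for _ in range(5):
    # Fit the policy (here just a scalar gain) to the aggregated dataset.
    gain = float(np.dot(dataset_s, dataset_a) / np.dot(dataset_s, dataset_s))
    # Collect states under the *learned* policy, then ask the expert what it
    # would have done in those states -- the key DAgger step.
    visited = rollout(gain)
    dataset_s = np.concatenate([dataset_s, visited])
    dataset_a = np.concatenate([dataset_a, expert(visited)])
```

Because each round adds expert labels for states the learner itself reaches, the training distribution tracks the deployment distribution.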
Inverse reinforcement learning (IRL) takes a different approach to learning from demonstrations. Rather than directly cloning the expert's actions, IRL infers the underlying reward function that the expert appears to be optimizing. Once the reward function is recovered, standard reinforcement learning can be used to find an optimal policy.
Pieter Abbeel and Andrew Ng pioneered much of the foundational work in IRL for robotics. A notable early success was learning helicopter aerobatics from pilot demonstrations, where IRL recovered a reward function that captured the pilot's implicit objectives, and reinforcement learning then produced autonomous aerobatic maneuvers.
IRL is particularly useful when the expert's behavior reflects complex, hard-to-articulate preferences. Applications range from surgical robotics, where IRL replicates decision-making patterns of expert surgeons, to autonomous driving, where it captures nuanced driving styles.
Reinforcement learning (RL) trains agents to maximize cumulative rewards through trial-and-error interaction with an environment. In robotics, RL allows robots to discover control policies without requiring demonstrations, instead learning from the consequences of their own actions.
Several deep reinforcement learning algorithms have become standard tools for robot learning:
| Algorithm | Type | Key Properties | Typical Robotics Use |
|---|---|---|---|
| PPO (Proximal Policy Optimization) | On-policy, actor-critic | Clipped surrogate objective for stable training; good for high-dimensional action spaces | Locomotion, whole-body control |
| SAC (Soft Actor-Critic) | Off-policy, actor-critic | Maximum entropy framework; encourages exploration; data-efficient | Manipulation, real-robot training |
| DDPG (Deep Deterministic Policy Gradient) | Off-policy, actor-critic | Continuous action spaces; deterministic policy | Continuous control tasks |
| TD3 (Twin Delayed DDPG) | Off-policy, actor-critic | Addresses overestimation bias in DDPG; dual critic networks | Precise manipulation |
PPO has been widely adopted for locomotion tasks due to its stability, while SAC is often preferred for manipulation because of its sample efficiency and ability to train directly on physical hardware.
A persistent challenge in applying RL to robotics is the design of reward functions. Sparse rewards (for example, giving a reward of +1 only when a task is completed) make learning extremely slow because the robot must randomly stumble upon success before it can begin improving. Reward shaping addresses this by providing intermediate feedback that guides the agent toward the goal.
For example, when training a robot to walk, instead of rewarding only completed strides, a shaped reward might include terms for maintaining balance, moving the center of mass forward, and minimizing energy consumption. Potential-based reward shaping, introduced by Andrew Ng, Daishi Harada, and Stuart Russell in 1999, provides a formal guarantee that shaping will not change the optimal policy while accelerating learning.
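The potential-based scheme is compact in code. This sketch assumes a hypothetical distance-to-goal potential; the key formula, r' = r + γΦ(s') − Φ(s), is the one shown by Ng, Harada, and Russell to preserve the optimal policy.

```python
import numpy as np

GAMMA = 0.99
GOAL = np.array([2.0, 0.0])

def potential(state):
    """Phi(s): negative distance to the goal (one reasonable choice)."""
    return -np.linalg.norm(state - GOAL)

def shaped_reward(state, next_state, task_reward):
    """Potential-based shaping: r' = r + gamma * Phi(s') - Phi(s)."""
    return task_reward + GAMMA * potential(next_state) - potential(state)

# A step toward the goal earns an immediate bonus even when the sparse task
# reward is still zero, giving the learner a gradient to follow.
r = shaped_reward(np.array([0.0, 0.0]), np.array([1.0, 0.0]), task_reward=0.0)
```

Here Φ goes from −2 to −1, so the shaped reward is 0 + 0.99·(−1) − (−2) = 1.01, rewarding progress long before the task itself succeeds.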
However, reward shaping must be applied carefully. Poorly designed shaping can cause the robot to exploit the auxiliary rewards instead of solving the actual task, a phenomenon sometimes called reward hacking.
Training robots through RL in the real world is expensive, slow, and potentially dangerous. Simulation offers a safe, fast, and cheap alternative: policies can be trained on millions of episodes in a physics simulator before being deployed on physical hardware. However, the sim-to-real gap, the discrepancy between simulated and real-world physics, remains a central challenge.
Domain randomization addresses the sim-to-real gap by training the policy across a wide distribution of simulated environments with randomized physical parameters (friction, mass, damping, visual textures, lighting, and sensor noise). The idea is that if a policy succeeds across many different simulated conditions, it will be robust enough to handle the real world, which is just one more point in the distribution.
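A minimal sketch of how an episode's physical parameters might be sampled. The parameter names and ranges are hypothetical; a real pipeline would write the sampled values into the simulator's model before resetting the environment.

```python
import random

# Hypothetical randomization ranges for a manipulation task.
RANGES = {
    "friction":       (0.5, 1.5),
    "object_mass_kg": (0.05, 0.5),
    "joint_damping":  (0.8, 1.2),
    "latency_ms":     (0.0, 40.0),
}

def sample_environment(rng):
    """Draw one randomized environment configuration per training episode."""
    return {name: rng.uniform(lo, hi) for name, (lo, hi) in RANGES.items()}

rng = random.Random(0)
episode_params = sample_environment(rng)
```

Training across thousands of such draws forces the policy to succeed under any plausible physics, rather than overfitting to one simulator configuration.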
OpenAI's landmark 2019 demonstration of solving a Rubik's Cube with a dexterous robot hand (the Shadow Dexterous Hand) relied on a technique called Automatic Domain Randomization (ADR). Rather than manually specifying randomization ranges, ADR automatically expanded the range of environmental parameters during training as the policy improved. The system started with narrow randomization ranges and gradually widened them, increasing the difficulty of the training environments over time without human intervention.
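The range-expansion idea behind ADR can be sketched as follows; the thresholds and step size here are illustrative choices, not OpenAI's published values.

```python
def update_range(lo, hi, success_rate, step=0.05,
                 expand_above=0.8, shrink_below=0.4):
    """ADR-style curriculum: widen the randomization range when the policy
    succeeds at its current boundary, narrow it when the policy struggles.
    Thresholds and step size are illustrative."""
    if success_rate >= expand_above:
        return lo - step, hi + step          # harder: broader randomization
    if success_rate <= shrink_below and hi - lo > 2 * step:
        return lo + step, hi - step          # easier: tighter randomization
    return lo, hi

# Example: a friction range grows as the policy keeps succeeding.
lo, hi = 0.9, 1.1
for measured_success in [0.9, 0.85, 0.95]:
    lo, hi = update_range(lo, hi, measured_success)
```

Each parameter gets its own range, so the curriculum automatically concentrates difficulty on whichever aspects of the environment the policy already handles well.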
The robot solved the Rubik's Cube approximately 60% of the time (and 20% of the time for maximally difficult scrambles). Remarkably, the system could handle situations it never encountered during training, such as being prodded with objects or having fingers tied together, demonstrating the robustness that domain randomization can achieve.
Beyond domain randomization, several other approaches help bridge the sim-to-real gap: system identification, which calibrates simulator parameters against measurements from the physical robot; domain adaptation, which aligns simulated and real observation distributions; and real-world fine-tuning, which adapts a simulation-trained policy with a small amount of data collected on hardware.
Physics simulators play a critical role in robot learning by providing environments where policies can be trained safely and at scale. The choice of simulator affects the fidelity of the physics, the speed of training, and the ease of sim-to-real transfer.
| Simulator | Developer | Key Strengths | License |
|---|---|---|---|
| MuJoCo | Google DeepMind (originally Emo Todorov) | Fast contact dynamics; widely used in RL research; GPU-accelerated (MJX) | Apache 2.0 (open source since May 2022) |
| Isaac Sim / Isaac Lab | NVIDIA | GPU-accelerated PhysX; photorealistic rendering via RTX; supports ROS/ROS2 | Apache 2.0 |
| SAPIEN | UC San Diego (Hao Su Lab) | Part-level articulated object interaction; GPU-parallelized; PartNet-Mobility dataset | Open source |
| PyBullet | Erwin Coumans | Lightweight; easy to use; good for prototyping | zlib license |
| Genesis | Various | Differentiable simulation; gradient-based optimization | Open source |
MuJoCo (Multi-Joint dynamics with Contact) was originally developed by Emanuel Todorov, Tom Erez, and Yuval Tassa at the University of Washington and described in a 2012 paper. It was commercialized through startup Roboti LLC in 2015 before being acquired by Google DeepMind in October 2021 and released as open-source software under the Apache 2.0 license in May 2022.
MuJoCo is one of the most cited simulation platforms in robot learning research, with over 9,000 citations as of early 2024. Its speed and accuracy in simulating contact dynamics make it particularly well-suited for manipulation and locomotion research. The MuJoCo Playground, released in 2025, enables researchers to train policies in minutes on a single GPU across diverse robotic platforms including quadrupeds, humanoids, dexterous hands, and robotic arms, supporting zero-shot sim-to-real transfer.
NVIDIA Isaac Sim is built on the Omniverse platform and uses PhysX for GPU-accelerated multi-physics simulation with RTX-based physically accurate sensor simulation (cameras, LiDARs). Isaac Lab, built on top of Isaac Sim, provides a unified framework for robot learning with over 16 robot models and more than 30 ready-to-train environments. It integrates with popular RL frameworks including RSL RL, SKRL, RL Games, and Stable Baselines. Isaac Lab supports procedural terrain generation, actuator dynamics modeling, domain randomization, and data collection from human demonstrations.
SAPIEN (SimulAted Part-based Interactive ENvironment) is a PhysX-based simulator developed at UC San Diego that specializes in manipulation tasks involving articulated objects. Its companion PartNet-Mobility dataset contains more than 2,000 articulated object models with over 14,000 movable parts. The ManiSkill framework, built on SAPIEN, focuses on manipulation skill benchmarks and can collect RGBD and segmentation data at over 30,000 frames per second on a single GPU.
The success of large-scale foundation models in natural language processing and computer vision has inspired a parallel effort in robotics: training large, general-purpose models on diverse robot data that can then be adapted to new tasks, environments, and robot embodiments.
RT-1, published by Google in December 2022, was one of the first large-scale transformer-based models for real-world robotic control. The architecture processes a short history of camera images along with natural language task descriptions and outputs tokenized actions.
The model uses an ImageNet-pretrained EfficientNet backbone conditioned on language instructions via FiLM (Feature-wise Linear Modulation) layers, followed by a TokenLearner module that compresses visual tokens, and a Transformer that attends over these tokens to produce discretized action outputs. With only 35 million parameters, the model is compact enough to run inference at 3 Hz on real robot hardware.
RT-1 was trained on 130,000 episodes covering over 700 tasks, collected using a fleet of 13 robots from Everyday Robots over 17 months. The action space includes seven dimensions for arm movement (x, y, z, roll, pitch, yaw, gripper opening), three dimensions for base movement (x, y, yaw), and a mode-switching dimension.
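RT-1 discretizes each continuous action dimension into 256 bins so the Transformer can emit actions as tokens. Below is a toy version of such binning, assuming a uniform [-1, 1] range per dimension; the real system's per-dimension ranges and binning details differ.

```python
import numpy as np

NUM_BINS = 256

def tokenize(action, low, high):
    """Map each continuous action dimension to one of 256 discrete bins
    (uniform binning over an assumed [low, high] range)."""
    normalized = (np.asarray(action) - low) / (high - low)
    return np.clip((normalized * NUM_BINS).astype(int), 0, NUM_BINS - 1)

def detokenize(tokens, low, high):
    """Recover bin-center continuous values from the discrete tokens."""
    return low + (tokens + 0.5) / NUM_BINS * (high - low)

low, high = -1.0, 1.0
arm_action = np.array([0.1, -0.4, 0.0, 0.25, 0.0, 0.0, 1.0])  # x,y,z,r,p,y,grip
tokens = tokenize(arm_action, low, high)
recovered = detokenize(tokens, low, high)
```

The quantization error is at most half a bin width, small enough for control, while letting the policy treat action prediction as next-token classification.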
RT-2, introduced by Google DeepMind in mid-2023, established the vision-language-action (VLA) model paradigm. Unlike RT-1, which was trained only on robot data, RT-2 is built on top of large vision-language models (VLMs) and learns from both web-scale data and robotic demonstrations.
RT-2 translates visual and language understanding into robotic actions, demonstrating generalization capabilities that go beyond its robotic training data. It can interpret novel commands, reason about object categories, and respond to high-level descriptions. With chain-of-thought reasoning, RT-2 can perform multi-stage semantic reasoning, such as deciding which object could serve as an improvised hammer (selecting a rock) or which drink would help a tired person (choosing an energy drink).
The Open X-Embodiment project, announced by Google DeepMind in October 2023, assembled the largest open-source real robot dataset to date. Created through collaboration with over 20 research institutions, it contains more than one million real robot trajectories spanning 22 robot embodiments, over 500 skills, and 150,000 tasks. The dataset was built by pooling 60 existing robot datasets from 34 robotic research labs worldwide.
Models trained on this data, called RT-X, demonstrate positive transfer across robot platforms. RT-1-X and RT-2-X showed that training on diverse multi-embodiment data improves performance compared to training on data from a single robot, with RT-2-X achieving triple the performance of its single-embodiment counterpart on real-world robotic skills.
Octo, developed by a team from UC Berkeley, Stanford, Carnegie Mellon University, and Google DeepMind, is an open-source generalist robot policy built on a transformer backbone with a diffusion-based action head. Released in 2024, Octo was pretrained on 800,000 robot episodes from the Open X-Embodiment dataset.
Two versions were released: Octo-Small (27 million parameters) and Octo-Base (93 million parameters). The model supports natural language instructions and goal images, processes observation histories, and generates multi-modal action distributions via diffusion decoding. Octo can be quickly fine-tuned to accommodate new observation modalities (such as force-torque inputs), new action spaces (such as joint position control), and entirely new robot embodiments.
In evaluations across 9 real robot setups at 4 institutions, Octo outperformed RT-1-X (the previous best open generalist policy) when using natural language task specification, and performed comparably to RT-2-X despite being orders of magnitude smaller.
Physical Intelligence, a company founded in 2024 by researchers from Google DeepMind, Stanford, and UC Berkeley, released pi-zero (written as the Greek letter, also stylized as pi0) as a general-purpose robot foundation model. The model is built on the PaliGemma vision-language model and was further trained on data from 7 different robots performing 68 tasks, as well as the Open X-Embodiment dataset.
A distinguishing feature of pi-zero is its use of flow matching to produce smooth, real-time action trajectories at 50 Hz, making it efficient and precise for real-world deployment. The model demonstrated capabilities that no prior robot learning system had achieved, including folding laundry from a hamper and assembling a cardboard box. In benchmarks, pi-zero achieved large improvements over OpenVLA and Octo across five different tasks.
Physical Intelligence open-sourced pi-zero's code and weights through its openpi repository on GitHub, and reported that between 1 and 20 hours of task-specific data was sufficient to fine-tune the model for new tasks.
OpenVLA, introduced by Stanford researchers in June 2024, is a 7-billion-parameter open-source vision-language-action model trained on 970,000 real-world robot manipulation trajectories from the Open X-Embodiment dataset. It combines a Llama 2 language model backbone with a visual encoder that fuses pretrained features from DINOv2 and SigLIP.
OpenVLA outperformed the much larger RT-2-X (55 billion parameters) by 16.5% in absolute task success rate across 29 tasks and multiple robot embodiments while being seven times smaller. The model can be fine-tuned on consumer GPUs using low-rank adaptation (LoRA) and served efficiently via quantization. All checkpoints and code are released under the MIT License.
A significant trend in robot learning is the use of natural language to specify tasks, enabling non-expert users to instruct robots through verbal commands rather than programming.
SayCan, developed by Google Research, combines the broad knowledge of a large language model with the grounded capabilities of learned robotic skills. The system works by having the language model propose candidate actions in natural language, while a learned value function scores each candidate based on its feasibility in the current physical context. The action with the highest combined score (language plausibility multiplied by physical feasibility) is selected and executed.
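The scoring rule can be sketched in a few lines. The candidate skills and scores below are invented for illustration; a real SayCan system obtains them by querying a language model and learned value functions.

```python
# Toy SayCan-style action selection: combine an LLM's plausibility score for
# each candidate skill with a value function's feasibility estimate for the
# current scene. All numbers below are made up for illustration.
candidates = {
    "pick up the sponge":  {"llm": 0.70, "affordance": 0.90},
    "pick up the apple":   {"llm": 0.25, "affordance": 0.10},  # not visible
    "go to the trash can": {"llm": 0.05, "affordance": 0.95},
}

def saycan_select(scores):
    """Choose the skill maximizing P(skill | instruction) * P(success | state)."""
    return max(scores, key=lambda a: scores[a]["llm"] * scores[a]["affordance"])

best = saycan_select(candidates)
```

The product grounds the language model: a skill the LLM finds plausible is rejected if the value function says it cannot currently succeed, and vice versa.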
The extended PaLM-SayCan system incorporated chain-of-thought prompting to handle tasks requiring multi-step reasoning. It also demonstrated multilingual capabilities, with almost no performance drop when switching queries from English to Chinese, French, or Spanish.
PaLM-E (2023), developed by Google, is an embodied multimodal language model that directly integrates visual and sensor observations into a large language model. By encoding continuous observations such as images and state estimates as tokens in the language model's embedding space, PaLM-E can apply vision-language reasoning to robot planning, and it demonstrated transfer across different robot embodiments and tasks.
The VLA paradigm, established by RT-2 and extended by models like OpenVLA and pi-zero, represents the current frontier of language-conditioned manipulation. These models take natural language instructions and camera images as input and directly output low-level robot actions. By pretraining on web-scale vision-language data before fine-tuning on robot demonstrations, VLA models inherit broad semantic understanding that enables them to generalize to novel objects, scenes, and instructions that were not present in the robot training data.
Dexterous manipulation, the ability to perform precise, contact-rich tasks with multi-fingered hands or complex end-effectors, represents one of the most challenging frontiers in robot learning.
Diffusion Policy, developed at Columbia University, represents a robot's visuomotor policy as a conditional denoising diffusion process. This approach models the action distribution as a diffusion model, generating action sequences through iterative denoising. The advantage of this formulation is its ability to represent multi-modal action distributions, which is critical for dexterous tasks where multiple valid strategies may exist for a single observation.
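The sampling procedure, stripped to a caricature: start from Gaussian noise and repeatedly apply a denoiser that nudges the sample toward the data distribution. Here the learned, observation-conditioned network is replaced by a hand-coded stand-in that pulls toward a single known action, so this shows only the iterative-refinement structure, not a faithful diffusion model with a noise schedule.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for a learned denoiser. In a real Diffusion Policy this is a
# neural network conditioned on the observation; here it simply predicts
# the noise component pulling the sample away from a known target action.
EXPERT_ACTION = np.array([0.3, -0.2])

def denoiser(noisy_action, observation):
    return noisy_action - EXPERT_ACTION       # predicted noise

def sample_action(observation, steps=50, step_size=0.2):
    """Generate an action by iterative denoising, starting from pure noise."""
    a = rng.normal(size=2)                    # initial Gaussian sample
    for _ in range(steps):
        a = a - step_size * denoiser(a, observation)
    return a

action = sample_action(observation=None)
```

A real diffusion policy keeps the stochasticity at each step, which is what lets the same observation yield several distinct valid action sequences, the multi-modality the paragraph above describes.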
Diffusion Policy has become a watershed result for dexterous manipulation in robotics, enabling a wide range of contact-rich tasks that were previously intractable with simpler policy representations.
ALOHA (A Low-cost Open-source Hardware System for Bimanual Teleoperation) was originally developed at Stanford University as an affordable bimanual teleoperation platform. The system enables human operators to collect high-quality demonstration data for two-armed manipulation tasks.
ALOHA Unleashed (2024), developed at Google DeepMind using the ALOHA 2 platform, demonstrated that large-scale data collection combined with expressive models such as Diffusion Policies is sufficient to learn extremely challenging bimanual manipulation tasks. The robot successfully learned to tie a shoelace, hang a shirt on a hanger, repair another robot, insert a gear, and clean a kitchen. These tasks involve deformable objects and complex contact-rich dynamics that had previously been beyond the reach of learned policies.
OpenAI's demonstration of solving a Rubik's Cube with the Shadow Dexterous Hand remains one of the most celebrated achievements in dexterous manipulation. The system used reinforcement learning combined with Automatic Domain Randomization, training entirely in simulation before transferring to the physical robot. The five-fingered hand had 24 degrees of freedom and was controlled by a neural network that processed fingertip positions and the cube's orientation from vision.
Learning locomotion policies for legged robots has become one of the most successful applications of sim-to-real reinforcement learning.
Quadruped robots such as the ANYmal series (from ETH Zurich / ANYbotics), Boston Dynamics' Spot, and Unitree's Go1 have become popular platforms for learned locomotion. Deep reinforcement learning enables these robots to traverse rough terrain, climb stairs, navigate dense vegetation, and perform agile maneuvers including jumping.
A typical pipeline involves training a policy in simulation using PPO or similar algorithms with domain randomization, then deploying it zero-shot on the physical robot. The policy takes proprioceptive observations (joint angles, angular velocities, body orientation) as input and outputs desired joint positions or torques.
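The observation-to-action interface of such a policy can be sketched as follows. The dimensions, random network weights, and scaling are placeholders standing in for a trained network, shown only to make the input and output shapes concrete.

```python
import numpy as np

rng = np.random.default_rng(0)

NUM_JOINTS = 12  # a typical quadruped: 3 actuated joints per leg

# Proprioceptive observation, as described above: joint angles, joint
# velocities, and base orientation (gravity direction in the body frame).
obs = np.concatenate([
    rng.uniform(-0.5, 0.5, NUM_JOINTS),   # joint angles (rad)
    rng.uniform(-1.0, 1.0, NUM_JOINTS),   # joint velocities (rad/s)
    np.array([0.0, 0.0, -1.0]),           # projected gravity vector
])

# Placeholder for a trained policy: a small two-layer MLP whose tanh output
# is interpreted as normalized joint position targets in [-1, 1].
W1 = rng.normal(scale=0.1, size=(64, obs.size))
W2 = rng.normal(scale=0.1, size=(NUM_JOINTS, 64))

def policy(o):
    hidden = np.tanh(W1 @ o)
    return np.tanh(W2 @ hidden)

joint_targets = policy(obs)  # sent to per-joint PD controllers on the robot
```

On hardware, these normalized targets are typically rescaled to joint limits and tracked by low-level PD controllers running at a much higher rate than the policy.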
Humanoid locomotion learning experienced rapid growth starting in 2023. A notable 2024 result from UC Berkeley demonstrated real-world humanoid locomotion learned entirely through reinforcement learning. The controller used a causal transformer that processed a history of proprioceptive observations and actions, trained with large-scale RL on thousands of randomized simulation environments. The resulting policy could walk over various outdoor terrains, remain robust to external disturbances, and adapt its gait in context, all via zero-shot sim-to-real transfer.
Imitation learning has also transformed legged locomotion research. Rather than hand-engineering reward functions, researchers can now train policies to replicate motion capture data from humans or animals, producing natural-looking gaits. More recent approaches use diffusion models to generate diverse locomotion behaviors from demonstration data.
Self-supervised learning allows robots to generate their own training signals through interaction with the environment, reducing dependence on labeled data or human demonstrations.
Self-supervised methods in robotics typically involve one or more of the following strategies:

- Forward prediction: learning to predict future observations from current observations and actions, yielding a dynamics model without external labels.
- Contrastive representation learning: learning state representations by distinguishing related observations from unrelated ones.
- Curiosity-driven exploration: using prediction error as an intrinsic reward that drives the robot toward novel states.
- Automatic outcome labeling: using the results of the robot's own attempts (for example, whether a grasp succeeded) as supervision.
Self-supervised approaches have dramatically improved reinforcement learning's practicality in industrial and commercial settings by reducing the amount of human supervision required.
| Approach | Learning Signal | Key Advantages | Key Challenges | Representative Methods |
|---|---|---|---|---|
| Behavioral Cloning | Expert demonstrations (state-action pairs) | Simple to implement; no environment interaction needed | Distribution shift; compounding errors | ALVINN, ACT, Diffusion Policy |
| DAgger | Expert demonstrations + online queries | Addresses distribution shift; provable guarantees | Requires interactive expert access | DAgger, HG-DAgger, MEGA-DAgger |
| Inverse RL | Expert demonstrations (infer reward) | Captures underlying objectives; generalizes beyond demonstrations | Computationally expensive; reward ambiguity | MaxEnt IRL, GAIL, AIRL |
| Model-Free RL | Environment reward signal | No model needed; can discover novel strategies | Sample inefficient; reward design is difficult | PPO, SAC, DDPG, TD3 |
| Model-Based RL | Environment reward + learned dynamics model | Sample efficient; enables planning | Model errors can compound; higher complexity | MBPO, Dreamer, PETS |
| Sim-to-Real Transfer | Simulated environment reward | Safe; cheap; massively parallelizable | Sim-to-real gap; calibration challenges | Domain randomization, ADR, system ID |
| Self-Supervised | Self-generated signals (prediction, curiosity) | No labels needed; scalable data collection | May learn task-irrelevant features | Forward prediction, contrastive learning |
| Foundation Model Fine-Tuning | Web data + robot demonstrations | Broad generalization; language grounding | Requires large compute; data diversity | RT-2, OpenVLA, pi-zero, Octo |
Robot learning has made remarkable progress, but several open challenges remain: