See also: Reinforcement Learning, Vision-Language-Action Model, Behavior Cloning, Robotics
Imitation Learning (IL), sometimes called Learning from Demonstration (LfD), is a family of machine learning methods in which an agent acquires a policy by observing examples of behavior produced by an expert rather than by optimizing an explicitly defined reward signal.[1][2] Instead of asking a designer to specify what the agent should achieve through a scalar reward function, imitation learning asks a teacher to show the agent what to do. The resulting datasets typically consist of pairs of observations and corresponding actions, and the learned policy is a function that reproduces the demonstrator's mapping from observations to actions.
Imitation learning sits at the intersection of supervised learning, reinforcement learning, and control theory. Its earliest practical demonstration was Dean Pomerleau's ALVINN system in 1989, a neural network that learned to steer an autonomous van by watching a human driver.[3] The field grew slowly through the 1990s and early 2000s, then accelerated rapidly after 2010 with the introduction of DAgger, GAIL, and a wave of deep learning techniques that allowed policies to operate directly from raw images. By the mid-2020s, imitation learning had become the dominant training paradigm for general-purpose robot foundation models such as Google DeepMind's RT-2, Physical Intelligence's Pi0 family, and NVIDIA's Project GR00T.
This article surveys the definition and motivations of imitation learning, its principal algorithmic families (behavior cloning, DAgger, inverse reinforcement learning, GAIL, AIRL, and diffusion- and transformer-based action models), its theoretical properties (especially the compounding error problem), the data infrastructure that supports modern systems, and the major laboratories and people driving the field forward.
Imitation learning concerns sequential decision problems modeled as Markov decision processes. A standard MDP is defined by states s, actions a, a transition function P(s'|s,a), and a reward function R(s,a). In ordinary reinforcement learning, the agent observes the reward and tries to find a policy that maximizes long-term return. In imitation learning, the reward function is unknown, unspecified, or considered too brittle to write down by hand. What the agent does receive is a dataset of expert trajectories, each consisting of observed states paired with actions chosen by the demonstrator.
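The contrast can be written schematically; the discount factor γ, the per-step loss ℓ, and the demonstration dataset D below are generic notation rather than definitions taken from a specific source:

```latex
% Reinforcement learning: the reward R is known and return is maximized
\pi^{\ast} \;=\; \arg\max_{\pi}\; \mathbb{E}_{\tau \sim \pi}\!\left[\sum_{t=0}^{T} \gamma^{t} R(s_t, a_t)\right]

% Imitation learning: only demonstrations D = \{(s_i, a_i^{E})\} are available
\hat{\pi} \;=\; \arg\min_{\pi}\; \mathbb{E}_{(s,\, a^{E}) \sim \mathcal{D}}\!\left[\,\ell\!\left(\pi(s),\, a^{E}\right)\right]
```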
The motivation for this setup is practical. For tasks like driving a car, folding laundry, or performing a delicate surgical maneuver, designing a numeric reward that captures every nuance of skilled behavior is extraordinarily hard. A reward that rewards reaching a destination but not avoiding collisions, or one that rewards finishing a fold but not preserving the fabric, can be silently optimized in undesirable ways. Demonstrations sidestep this problem by encoding the expert's preferences implicitly through example.[4]
Three assumptions distinguish the canonical imitation learning setting:
Within these assumptions, imitation learning admits many formulations: pure supervised regression, distribution matching, reward inference followed by reinforcement learning, adversarial training, and increasingly hybrid offline reinforcement learning that leverages demonstrations as a regularization signal.
The relationship between imitation learning and reinforcement learning is best described as complementary rather than oppositional. Both target the same kind of problem (control in a sequential environment) but they consume different signals.
| Property | Behavior Cloning (BC) | Inverse Reinforcement Learning (IRL) | Reinforcement Learning (RL) |
|---|---|---|---|
| Required signal | Expert (state, action) pairs | Expert trajectories | Scalar reward function |
| Optimization target | Match expert's policy | Recover expert's reward | Maximize expected return |
| Sample efficiency | Very high if expert data is plentiful | Moderate | Low to moderate |
| Exploration needed | None | Required after reward inference | Required throughout |
| Robustness to distribution shift | Poor (compounding error) | Good (reward generalizes) | Depends on training |
| Common environments | Robotics, autonomous driving, gameplay | Robotics, animal behavior | Games, simulation |
| Typical failure mode | Cascading mistakes off-distribution | Reward ambiguity | Sparse reward exploration |
In practice many production systems combine both. A common recipe is to bootstrap a policy from demonstrations using behavior cloning, then improve it with reinforcement learning in simulation or with offline RL on logged data. NVIDIA's Project GR00T explicitly mixes imitation learning, reinforcement learning, and synthetic video data to train humanoid policies.[5] Physical Intelligence's Pi0 builds on a vision-language base and is trained almost entirely with imitation, but evaluates against task success metrics that resemble RL rewards.[6]
The canonical first paper on imitation learning is Dean Pomerleau's ALVINN: An Autonomous Land Vehicle in a Neural Network, published at NIPS (now NeurIPS) in 1988 and demonstrated through 1989.[3] ALVINN used a three-layer fully connected network with roughly 5,000 weights to map a 30x32 camera image and an 8x32 laser range finder reading to a steering command represented as 45 output units encoding turn curvature. The system was trained on roughly 1,200 simulated road snapshots that varied lighting, road position, and noise.
The van on which ALVINN was mounted drove on a variety of road types at speeds eventually reaching 55 miles per hour. Beyond its engineering achievement, the paper made two conceptual contributions. First, it showed that a neural network could learn a sensorimotor mapping end to end, anticipating the dominance of deep learning by twenty-five years. Second, it explicitly identified the covariate shift problem: when the network slightly mis-steered, it found itself in image distributions the human had never produced, and the network's behavior on those images was unpredictable. Pomerleau introduced an early data augmentation trick (synthetically translating and rotating images to simulate off-trajectory views) which presaged DAgger by two decades.
Andrew Ng and Stuart Russell formalized inverse reinforcement learning (IRL) in 2000, framing the problem as recovering an expert's reward function from observed behavior.[7] Pieter Abbeel and Ng extended this in 2004 with Apprenticeship Learning via Inverse Reinforcement Learning, an algorithm that assumes the expert is maximizing a linear combination of known features and iteratively matches the feature counts of the expert and the learner.[8] This line of work was later applied, in follow-up research by Abbeel and colleagues, to autonomous helicopter aerobatics, where Stanford's helicopter learned to perform stunts that human pilots themselves could not consistently execute.
This era established a key intellectual frame: rather than copying the expert's actions directly, infer why the expert acted that way and then plan optimally with respect to the recovered objective. The advantage is robustness to small changes in dynamics or environment, since a reward function generalizes across situations in a way a state-conditioned action distribution does not. The disadvantages are computational cost (an MDP is solved repeatedly inside the learning loop) and reward ambiguity (many reward functions explain any given behavior).
In 2011 Stephane Ross, Geoffrey Gordon, and J. Andrew Bagnell published A Reduction of Imitation Learning and Structured Prediction to No-Regret Online Learning, introducing the DAgger (Dataset Aggregation) algorithm.[9] DAgger addresses behavior cloning's covariate shift directly. It runs the current policy in the environment, queries the expert for the correct action on every state visited, aggregates those (state, expert action) pairs into the training set, and retrains. Repeated iterations drive the learner's state distribution to coincide with the policy it produces, breaking the open loop in which behavior cloning's small errors compound.
DAgger introduced a linear (rather than quadratic) bound on the imitation loss as a function of horizon, an enormous theoretical improvement. The cost of this guarantee is access to a queryable expert during training, which is feasible when the expert is a planner or a demonstrator using teleoperation but impractical when the expert is a cost-sensitive human professional.
Jonathan Ho and Stefano Ermon's Generative Adversarial Imitation Learning (GAIL), published at NeurIPS 2016, recast imitation learning as distribution matching.[10] In GAIL, a discriminator network is trained to distinguish state-action pairs sampled from the expert from those produced by the learner, while the policy is trained (using policy gradient methods like TRPO or PPO) to fool the discriminator. The setup is structurally analogous to a generative adversarial network in image synthesis but operates over trajectories.
GAIL proved that an adversarial framing could deliver better empirical performance than behavior cloning on benchmark continuous control tasks while avoiding the explicit reward inference step inside IRL. It opened a productive research line that includes Adversarial Inverse Reinforcement Learning (AIRL) by Justin Fu, Katie Luo, and Sergey Levine in 2017, which recovers a transferable reward function rather than just a policy.[11]
From 2016 onward, imitation learning increasingly fused with deep learning. End-to-end policies were trained from raw camera pixels to motor torques. Sergey Levine's group at Berkeley demonstrated guided policy search and visuomotor learning with deep networks. Chelsea Finn introduced one-shot imitation via meta-learning, where a model learns how to learn new skills from a single demonstration.[12] Robotics datasets grew from a handful of tasks per lab to hundreds, then thousands. By 2022 the field had matured enough that the pieces for a genuine foundation model of robot control began to fall into place.
Behavior cloning is the simplest formulation of imitation learning. Given a dataset of observation-action pairs from an expert, a parametric model is fit by supervised learning to predict the expert's action given the observation. The loss function is typically mean squared error for continuous actions or cross-entropy for discrete ones. Inference at deployment time is a single forward pass through the network.
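The basic recipe is short enough to sketch. The following is a minimal, illustrative behavior-cloning training step in PyTorch; the network sizes, hyperparameters, and tensor shapes are placeholders rather than values from any cited system:

```python
# Minimal behavior-cloning sketch (illustrative; dimensions and hyperparameters
# are placeholders, not taken from any cited system).
import torch
import torch.nn as nn

obs_dim, act_dim = 32, 7                  # hypothetical observation/action sizes
policy = nn.Sequential(
    nn.Linear(obs_dim, 256), nn.ReLU(),
    nn.Linear(256, 256), nn.ReLU(),
    nn.Linear(256, act_dim),              # regression head for continuous actions
)
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()                    # cross-entropy would replace this for discrete actions

def bc_train_step(expert_obs, expert_act):
    """One supervised update on a batch of expert (observation, action) pairs."""
    pred_act = policy(expert_obs)         # single forward pass
    loss = loss_fn(pred_act, expert_act)  # match the expert's action
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```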
BC's strengths are speed (training is supervised, parallel, and well understood), simplicity (no environment interaction or rollout is required at training time), and generality (any architecture, from MLPs through transformers, can serve). Its weaknesses are also well known. The most important is covariate shift: at deployment, small errors push the agent into states it never encountered during training, where it makes larger errors, which compound. Theoretical bounds (see below) show that BC's regret can grow quadratically with the task horizon, while methods like DAgger achieve linear bounds.
In practice modern BC is far more capable than the basic recipe suggests, because of three engineering refinements. First, action chunking (predicting a sequence of future actions rather than a single one) reduces the number of decision points where errors can compound. Second, expressive policy classes (diffusion models, transformers, mixture-of-experts) capture multimodal action distributions that mean-squared-error regression smooths over. Third, massive data scaling makes the policy correct on a much larger fraction of the state space, mitigating compounding error empirically even when it cannot be eliminated theoretically.
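To make the first refinement concrete, the sketch below shows how a chunking policy might be executed at deployment time; the environment interface and the policy signature are hypothetical, and real systems add details such as temporal ensembling over overlapping chunks:

```python
# Illustrative receding-horizon execution with action chunking. Executing a whole
# chunk of k actions per policy call cuts the number of closed-loop decisions from
# T to roughly T / k, which is why chunking limits opportunities for errors to compound.
def rollout_with_chunking(env, chunk_policy, horizon, k):
    obs = env.reset()                         # hypothetical environment interface
    t = 0
    while t < horizon:
        action_chunk = chunk_policy(obs)      # predicts the next k actions at once
        for action in action_chunk[:k]:
            obs, done = env.step(action)
            t += 1
            if done or t >= horizon:
                return obs
    return obs
```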
DAgger (Dataset Aggregation) is the canonical interactive imitation learning algorithm.[9] At iteration n, the algorithm rolls out the current policy to collect states, queries the expert for the correct action on each visited state, aggregates these labels into the training set, and retrains. Variants relax the requirement of querying the expert at every step: HG-DAgger lets the human expert decide when to intervene, SafeDAgger uses a learned safety classifier to gate expert queries, and confidence-based variants query the expert only when the learner's predicted action is uncertain, all of which reduce demonstration cost.
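A minimal sketch of the core loop is below; the expert oracle, environment, and supervised fitting routine are placeholders, and the original algorithm additionally mixes expert and learner actions with a decaying coefficient, which is omitted here for brevity:

```python
# DAgger sketch. `expert(s)` returns the expert's action for any state, `fit(data)`
# performs supervised training, and `env` exposes reset()/step() -- all hypothetical
# placeholders standing in for a real implementation.
def dagger(env, expert, fit, n_iters, episodes_per_iter, horizon):
    dataset = []                              # aggregated (state, expert action) pairs
    policy = expert                           # iteration 0 effectively rolls out the expert
    for _ in range(n_iters):
        for _ in range(episodes_per_iter):
            s = env.reset()
            for _ in range(horizon):
                dataset.append((s, expert(s)))   # label every visited state
                s, done = env.step(policy(s))    # but follow the learner's own action
                if done:
                    break
        policy = fit(dataset)                 # retrain on the aggregated dataset
    return policy
```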
DAgger is widely used in driving simulation, where a human-designed planner can serve as the expert and is cheap to query, and in robotics with model-predictive control as the expert. It is less common in human-demonstrator workflows because of the burden of providing labels in the loop.
Inverse reinforcement learning frames imitation as inferring the reward function the expert is optimizing, then planning with respect to that reward.[7][8] Classic IRL algorithms include Maximum Margin Planning, Maximum Entropy IRL (Ziebart et al. 2008), and Bayesian IRL.
The attraction of IRL is generalization. A reward function describes preferences that hold across situations, so a policy derived from a learned reward can transfer to new starting states, new dynamics, or new goals more gracefully than a directly cloned policy. The challenges are computational cost (each iteration of inference may require solving a forward RL problem) and identifiability (multiple reward functions can rationalize the same behavior). Maximum entropy IRL addresses identifiability by selecting the reward that makes the expert's behavior maximally entropic among trajectories of equal value.
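The feature-matching idea behind apprenticeship learning can be sketched in a few lines; the featurization, discount factor, and trajectory format below are illustrative assumptions, and the forward RL step that alternates with this estimate is omitted:

```python
# Estimating discounted feature expectations from demonstrations, the key quantity
# matched in apprenticeship learning via IRL. `featurize` and the trajectory format
# are hypothetical placeholders.
import numpy as np

def feature_expectations(trajectories, featurize, gamma=0.99):
    mu = None
    for traj in trajectories:                                  # traj: list of states
        discounts = np.array([gamma ** t for t in range(len(traj))])
        feats = np.stack([featurize(s) for s in traj])         # (T, d)
        contrib = (discounts[:, None] * feats).sum(axis=0)     # discounted sum of features
        mu = contrib if mu is None else mu + contrib
    return mu / len(trajectories)

# Apprenticeship learning assumes a reward linear in these features, R(s) = w . f(s),
# and searches for a weight vector w under which the expert's feature expectations
# dominate the learner's, solving a forward RL problem at each iteration.
```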
GAIL combines IRL's distribution-matching intuition with the algorithmic machinery of generative adversarial networks.[10] A discriminator is trained to distinguish (state, action) pairs from the expert from those generated by the policy, and the policy is trained with reinforcement learning to maximize the probability that the discriminator labels its actions as expert. The discriminator's output thus serves as an implicit reward.
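The mechanics can be sketched as follows; this is an illustrative PyTorch fragment rather than the reference implementation, and the policy-gradient update that consumes the implicit reward is omitted:

```python
# GAIL-style discriminator and implicit reward (illustrative). Expert pairs are
# labeled 1 and policy pairs 0; the policy itself is updated separately with an
# on-policy RL algorithm such as TRPO or PPO using the reward below.
import torch
import torch.nn as nn

class Discriminator(nn.Module):
    def __init__(self, obs_dim, act_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + act_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, hidden), nn.Tanh(),
            nn.Linear(hidden, 1),
        )

    def forward(self, obs, act):
        return self.net(torch.cat([obs, act], dim=-1))        # logits

bce = nn.BCEWithLogitsLoss()

def discriminator_loss(disc, expert_obs, expert_act, policy_obs, policy_act):
    expert_logits = disc(expert_obs, expert_act)
    policy_logits = disc(policy_obs, policy_act)
    return (bce(expert_logits, torch.ones_like(expert_logits)) +
            bce(policy_logits, torch.zeros_like(policy_logits)))

def implicit_reward(disc, obs, act):
    # A common surrogate: reward is high when the discriminator thinks the pair
    # looks expert-like, so fooling the discriminator maximizes return.
    d = torch.sigmoid(disc(obs, act))
    return -torch.log(1.0 - d + 1e-8)
```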
GAIL avoids explicitly recovering a reward function, which sidesteps some IRL ambiguity, but it inherits the brittleness of adversarial training. Mode collapse, oscillation, and reward sparsity all complicate practical use. Variants include InfoGAIL, which conditions on latent intent, and option-GAIL, which factors policies into hierarchical options.
AIRL, introduced by Fu, Luo, and Levine in 2017, is a hybrid that recovers a transferable reward function via an adversarial training objective.[11] AIRL augments the GAIL discriminator with structure that lets it disentangle reward shaping from the underlying reward, with the result that the learned reward generalizes to environments with different dynamics. This makes AIRL particularly attractive for sim-to-real transfer in robotics, where the training environment differs from the deployment environment in physics or appearance.
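The structural idea can be stated compactly. The sketch below follows the paper's decomposition of the discriminator logit, with the reward network g, the shaping (potential) network h, and the policy log-probability treated as given placeholders:

```python
# AIRL discriminator structure (schematic). The discriminator is
#   D(s, a, s') = exp(f) / (exp(f) + pi(a|s)),  i.e. sigmoid(f - log pi(a|s)),
# with f decomposed into a reward term and a potential-based shaping term.
def airl_discriminator_logit(g, h, log_pi, s, a, s_next, gamma=0.99):
    f = g(s, a) + gamma * h(s_next) - h(s)    # reward term plus shaping term
    return f - log_pi(s, a)                   # logit of the discriminator

# Because potential-based shaping of the form gamma*h(s') - h(s) cannot change the
# optimal policy, training f in this factored form encourages g(s, a) to absorb the
# part of the reward that transfers to environments with different dynamics.
```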
Introduced by Tony Zhao, Vikash Kumar, Sergey Levine, and Chelsea Finn in 2023 as part of the Learning Fine-Grained Bimanual Manipulation with Low-Cost Hardware paper, the Action Chunking Transformer (ACT) is a variational transformer policy that predicts a sequence of k future actions at each timestep rather than a single action.[13] Action chunking reduces compounding error by reducing the number of decisions that have to be correct, and it captures the temporal structure of skilled behavior better than per-step prediction.
ACT was developed alongside the open-source ALOHA bimanual teleoperation rig and demonstrated dexterous tasks like opening a translucent condiment cup, slotting a battery, and threading a zip tie with success rates of 80 to 90 percent after only ten minutes of demonstrations. ACT became a baseline for almost all subsequent dexterous manipulation imitation work and is supported by the LeRobot framework distributed by Hugging Face.
Diffusion Policy, presented at Robotics: Science and Systems 2023 by Cheng Chi, Siyuan Feng, Yilun Du, Zhenjia Xu, Eric Cousineau, Benjamin Burchfiel, Russ Tedrake, and Shuran Song from Columbia University, Toyota Research Institute, and MIT, models the policy as a conditional denoising diffusion process over future action chunks.[14] Diffusion models have natural strengths for action modeling: they capture multi-modal distributions cleanly, scale to high-dimensional outputs, and are stable to train.
The paper benchmarked Diffusion Policy across twelve tasks from four manipulation benchmarks and reported an average 46.9 percent improvement over prior state-of-the-art methods. Toyota Research Institute used Diffusion Policy as the basis of their Large Behavior Models program, teaching robots more than sixty dexterous skills (pouring liquids, using tools, manipulating deformable objects) within a few hours of demonstration each.
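At inference time, actions are produced by iteratively denoising a noise sample conditioned on the current observation. The sketch below is a simplified deterministic DDIM-style sampler, not the authors' implementation; the noise-prediction network, its signature, and the noise schedule are assumed to come from training:

```python
# Simplified diffusion-policy inference: denoise a random action chunk conditioned
# on an observation embedding. `noise_pred_net` and `alphas_cumprod` are assumed
# outputs of training; this is a schematic DDIM-style (deterministic) sampler.
import torch

@torch.no_grad()
def sample_action_chunk(noise_pred_net, obs_embedding, chunk_len, act_dim, alphas_cumprod):
    a = torch.randn(1, chunk_len, act_dim)                 # start from pure Gaussian noise
    for t in reversed(range(len(alphas_cumprod))):
        eps = noise_pred_net(a, t, obs_embedding)          # predicted noise at step t
        alpha_bar = alphas_cumprod[t]
        alpha_bar_prev = alphas_cumprod[t - 1] if t > 0 else torch.tensor(1.0)
        a0_hat = (a - (1 - alpha_bar).sqrt() * eps) / alpha_bar.sqrt()   # denoised estimate
        a = alpha_bar_prev.sqrt() * a0_hat + (1 - alpha_bar_prev).sqrt() * eps
    return a                                               # (1, chunk_len, act_dim)
```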
Vision-language-action models are the most recent and arguably most consequential development. A VLA is a multimodal foundation model that takes an image (or video) of the robot's surroundings and a natural-language instruction and outputs a sequence of low-level robot actions. Training data is overwhelmingly imitation: human teleoperators or scripted policies produce thousands or millions of episodes, each annotated with a language command, and the model is fit by supervised learning over those triples.
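In the simplest token-based formulation, the training step is ordinary supervised learning. The sketch below is schematic: the model, tokenizer, and action discretization are placeholders, and systems such as RT-2, OpenVLA, and Pi0 differ substantially in architecture and action representation:

```python
# Schematic VLA training step: cross-entropy over discretized action tokens,
# conditioned on an image and a language instruction (all components hypothetical).
import torch
import torch.nn.functional as F

def vla_train_step(model, optimizer, image, instruction_tokens, action_tokens):
    logits = model(image, instruction_tokens)        # (batch, num_action_tokens, vocab)
    loss = F.cross_entropy(logits.flatten(0, 1),     # one prediction per action token
                           action_tokens.flatten())
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```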
The table below lists the most prominent VLA systems through early 2026.
| Model | Year | Lead Org | Architecture | Highlights |
|---|---|---|---|---|
| RT-1 | 2022 | Google Robotics | EfficientNet + TokenLearner + Transformer (35M) | Trained on 130k episodes, 700+ tasks across 13 robots over 17 months[15] |
| RT-2 | 2023 | Google DeepMind | PaLM-E or PaLI-X with action tokens (5B to 55B) | Co-fine-tuned VLM on robot trajectories and web vision-language tasks[16] |
| Open X-Embodiment / RT-X | 2023 | Google DeepMind plus 33 academic labs | Shared transformer backbone | Pooled data from 22 robot types; RT-1-X achieved 50% improvement on average across 5 platforms[17] |
| ACT | 2023 | Stanford | Transformer with VAE | Bimanual dexterous tasks from 10 minutes of teleoperation[13] |
| Diffusion Policy | 2023 | Columbia, TRI, MIT | Conditional denoising diffusion | 46.9% average improvement on 12 manipulation tasks[14] |
| OpenVLA | 2024 | Stanford / Google | Open-source 7B VLA on Llama 2 | Open weights and training recipe |
| Pi0 | 2024 | Physical Intelligence | VLM backbone with flow-matching action expert | Trained on 7 robot configurations, 68 tasks; first widely cited generalist VLA[18] |
| Pi0-FAST | 2025 | Physical Intelligence | Pi0 with FAST action tokenizer | Trains 5x faster than the original Pi0 |
| Pi0.5 | 2025 | Physical Intelligence | Updated architecture | Improved generalization over Pi0 |
| GR00T N1 | 2025 | NVIDIA | Dual-system: System 2 VLM + System 1 action expert | Mixes robot trajectories, human videos, synthetic data[19] |
| GR00T N1.6 | 2025 | NVIDIA | VLA + world model (NVIDIA Cosmos Reason) | Loco-manipulation with sim-to-real workflow |
Offline reinforcement learning (also called batch RL) trains a policy from a fixed dataset without further environment interaction. It overlaps significantly with imitation learning. Conservative Q-Learning (CQL), introduced by Aviral Kumar, Aurick Zhou, George Tucker, and Sergey Levine in 2020, learns a value function whose expected value lower-bounds the true value, so the policy stays close to the data distribution.[20] Implicit Q-Learning (IQL) and Decision Transformers extend the offline-RL toolkit. When the dataset consists entirely of expert demonstrations, offline RL reduces to a robust form of imitation; when it contains a mix of skill levels, offline RL can outperform imitation by extracting the best parts of each trajectory.
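The conservatism in CQL can be illustrated with the discrete-action form of its regularizer, which is added to a standard temporal-difference loss; the sketch below is schematic and omits the actor-critic machinery used for continuous actions:

```python
# CQL-style conservative penalty (discrete-action, schematic). Pushing down Q-values
# on all actions while pushing them up on dataset actions keeps the learned policy
# close to the data distribution.
import torch

def cql_penalty(q_values, data_actions):
    # q_values: (batch, num_actions) estimates Q(s, .); data_actions: (batch,) logged actions
    push_down = torch.logsumexp(q_values, dim=1)                         # soft maximum over all actions
    push_up = q_values.gather(1, data_actions.unsqueeze(1)).squeeze(1)   # Q of the logged action
    return (push_down - push_up).mean()
```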
A common modern recipe combines imitation pretraining on large mixed datasets with offline RL fine-tuning on task-specific data. NVIDIA's GR00T workflow and Physical Intelligence's Pi0.5 both adopt variants of this approach.
The central theoretical challenge in pure behavior cloning is covariate shift. Because the policy is trained on the state distribution induced by the expert but executed in the state distribution induced by itself, even small errors compound as the trajectory unrolls. If the expected one-step error of the policy is ε and the horizon is T, and if errors do not cancel out, the expected total cost of executing the policy can grow as O(T^2 ε), quadratic in the horizon.[9]
The quadratic bound, sometimes attributed to Ross and Bagnell (2010), follows from a worst-case analysis in which a small mistake at step t can land the agent in a state where it makes another mistake at step t+1, and so on. Behavior cloning's failure mode in self-driving (the classic example: a vehicle drifts a few inches off the lane center, then encounters an image distribution it has never seen, and drifts further) is the empirical face of this bound.
DAgger achieves an O(Tε) bound, linear in horizon, by training on the policy's own state distribution.[9] This is essentially optimal for the imitation problem in the worst case, and it explains why DAgger and its variants are favored in safety-critical settings whenever the expert can be queried at training time.
More recent theoretical work, particularly Feedback in Imitation Learning: The Three Regimes of Covariate Shift by Spencer, Choudhury, and colleagues (2021), characterizes when behavior cloning is fundamentally hopeless, when it is recoverable with effort, and when it is essentially benign.[21] In k-step recoverable MDPs, where the agent can return to the expert distribution within k steps, DAgger achieves O(kTε). In MDPs with no recoverability, behavior cloning's quadratic dependence is unavoidable. Subsequent work on imitation-learning regret bounds quantifies precisely how hard the imitation problem is as a function of MDP structure.
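Written schematically, with J denoting expected cost over horizon T, π^E the expert, and ε the expected one-step imitation error (constants and precise assumptions vary across papers, so these should be read as summaries of the results above rather than exact statements):

```latex
J(\pi_{\mathrm{BC}})     \;\le\; J(\pi^{E}) + O(T^{2}\,\varepsilon)  % behavior cloning, worst case
J(\pi_{\mathrm{DAgger}}) \;\le\; J(\pi^{E}) + O(T\,\varepsilon)      % interactive expert queries
J(\pi_{\mathrm{DAgger}}) \;\le\; J(\pi^{E}) + O(k\,T\,\varepsilon)   % k-step recoverable MDPs
```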
Working practitioners reduce compounding error in three main ways. First, action chunking (used by ACT and Diffusion Policy) reduces the number of decision points. Second, augmentation and recovery data (used by Tesla's FSD pipeline and many academic robotics labs) add synthetic perturbations, off-trajectory recoveries, and corrective actions to the training set. Third, massive scaling (the bet of RT-2, Pi0, and GR00T) covers a much larger fraction of the state space, so the deployment distribution shifts less far from the training distribution.
Imitation learning is bottlenecked above all by demonstration data. Modern systems rely on three major data sources.
Teleoperation is the gold standard for collecting clean, high-quality demonstration data. A human operator controls the robot directly using a device (a haptic controller, a VR headset, an exoskeleton, or in the case of ACT a puppet leader robot of the same kinematics). Teleoperation rigs in widespread use include ALOHA (Stanford), Open Teach (Cornell), GELLO (Berkeley), and proprietary setups at Toyota Research Institute, Physical Intelligence, Figure, 1X, and Tesla. Tesla's Optimus program is widely understood to involve workers wearing camera rigs while performing reference tasks (folding clothes, picking objects), with the resulting video used to train Optimus's policies.
Released in October 2023, the Open X-Embodiment dataset is the most ambitious cross-laboratory pooling of robot trajectory data to date.[17] The collaboration brought together Google DeepMind and 33 academic robotics labs to share data from 22 different robot types in a standardized format. The accompanying RT-1-X model demonstrated a 50 percent average improvement on five common robot platforms over methods designed specifically for each platform. RT-2-X tripled performance on real-world robotic skills compared to a baseline. The dataset and models are openly available.
Synthetic data complements real demonstrations. NVIDIA's Isaac Sim and Isaac Lab generate trillions of simulated steps, and pipelines like GR00T-Mimic generate motion data from a small set of teleoperated demonstrations.[19] Diffusion-based world models (NVIDIA Cosmos, Google DeepMind's Genie, and others) generate realistic video that can be used as additional training data. The bet is that sim-to-real and synthetic-to-real transfer will eventually allow training data to grow much faster than human teleoperation can generate it.
Pieter Abbeel, professor at UC Berkeley and co-founder of Covariant, completed his Stanford PhD in 2008 under Andrew Ng with a dissertation titled Apprenticeship Learning and Reinforcement Learning with Application to Robotic Control.[8] Abbeel's apprenticeship learning algorithm and its helicopter aerobatics demonstration are foundational results. He continues to lead a high-output Berkeley robot learning group and has trained many of the field's senior researchers, including Chelsea Finn, John Schulman, and Sergey Levine.
Sergey Levine, also at UC Berkeley, has been one of the most prolific contributors to imitation learning and offline RL. He coauthored AIRL,[11] CQL,[20] ACT,[13] and many of the deep visuomotor learning papers that connected imitation learning to modern deep networks. His group operates one of the largest academic robot fleets and has driven a string of advances in dexterous manipulation, sim-to-real, and offline RL.
Chelsea Finn, assistant professor at Stanford, completed her PhD in 2018 jointly under Abbeel and Levine. Her one-shot visual imitation learning via meta-learning and her work on model-agnostic meta-learning (MAML) have influenced both robotics and machine learning broadly.[12] She is co-founder of Physical Intelligence and a primary author on the Pi0 family of models.
Russ Tedrake, an MIT professor, leads the Large Behavior Models program at Toyota Research Institute (TRI), where he holds the title Senior Vice President of Large Behavior Models.[14] His group's collaboration with Shuran Song's Columbia lab produced Diffusion Policy. Tedrake is also the author of the long-running open textbook Underactuated Robotics, which contains a widely used chapter on imitation learning.
Stefano Ermon, associate professor at Stanford, coauthored GAIL with Jonathan Ho.[10] His research combines generative modeling, probabilistic methods, and reinforcement learning. GAIL remains one of the most cited imitation learning papers and inspired much of the adversarial-RL literature that followed.
Stephane Ross (now at DeepMind) and Drew Bagnell (Carnegie Mellon and Aurora) introduced DAgger and provided the dominant theoretical framework for imitation learning's regret bounds.[9] Bagnell's group at CMU produced a long line of follow-on work on no-regret imitation, structured prediction, and learning to plan.
Autonomous driving was the first application of behavior cloning (ALVINN) and remains one of the most studied. Tesla's FSD stack, NVIDIA's DRIVE platform, Wayve's end-to-end learned driving system, and academic work like CARLA-based imitation learning all leverage imitation in some form. Pure behavior cloning is rarely sufficient; modern stacks combine imitation pretraining with extensive simulation, scenario synthesis, and recovery data.
The dexterous manipulation explosion of 2023 to 2026 was largely driven by imitation learning. Diffusion Policy and ACT enabled fine-grained tasks like cup opening, cable insertion, deformable cloth manipulation, and bimanual coordination.[13][14] Pi0 extended these capabilities to multi-task generalist control across multiple robot embodiments.[18]
Humanoid robots are the highest-stakes proving ground for imitation learning today. Tesla Optimus, Figure 02 and 03, 1X NEO, Apptronik Apollo, and Agility Digit all use imitation learning as a primary training paradigm. NVIDIA's GR00T provides the pretraining infrastructure that several of these companies build on.[19]
Intuitive Surgical's da Vinci platform and academic systems like the Smart Tissue Autonomous Robot use demonstrations from expert surgeons to train autonomous subtask policies. The high cost and high stakes of surgery make imitation particularly attractive (designing reward functions for surgery is essentially impossible) but also raise the bar for safety guarantees.
Imitation learning has powered AI agents in StarCraft II (DeepMind's AlphaStar bootstrapped from human games), Dota 2 (OpenAI Five used imitation pretraining followed by self-play RL), and Minecraft (OpenAI's VPT trained an imitation policy on YouTube videos as a basis for further RL). The pattern of imitation pretraining followed by self-play or RL fine-tuning is one of the most reliably effective recipes in modern RL.
Reinforcement learning from human feedback (RLHF) and direct preference optimization (DPO) used to align large language models have intellectual ancestry in inverse reinforcement learning. The supervised fine-tuning step (SFT) of an LLM is exactly behavior cloning on a dataset of human-generated demonstrations of desired model behavior.
Despite enormous progress, several core problems remain open in imitation learning.