Diffusion policy is a method for robot behavior generation that represents a robot's visuomotor policy as a conditional denoising diffusion process. Instead of directly predicting a single action from observations, diffusion policy iteratively refines random noise into a sequence of robot actions through learned denoising steps. First introduced by Cheng Chi, Zhenjia Xu, Siyuan Feng, Eric Cousineau, Yilun Du, Benjamin Burchfiel, Russ Tedrake, and Shuran Song in 2023, the method was published at Robotics: Science and Systems (RSS) 2023 and later extended in a journal version in the International Journal of Robotics Research (IJRR) in 2024. Diffusion policy has since become one of the most widely adopted approaches in imitation learning for robot manipulation, serving as a foundational building block for numerous follow-up systems in both research and industry.
Learning robot control policies from human demonstrations, known as behavioral cloning or learning from demonstrations, is a central problem in robot learning. A key challenge in this setting is representing the policy, which is the mapping from observations to actions. Human demonstrations are often inherently multimodal: for the same observation, multiple valid action trajectories may exist. For example, when reaching around an obstacle, a robot might go left or right, and both are correct. Traditional policy representations struggle with this multimodality in different ways.
Gaussian or Gaussian Mixture Model (GMM) policies, such as LSTM-GMM (also known as BC-RNN from the RoboMimic benchmark), represent the action distribution as an explicit parametric density. While simple and fast at inference time, these methods are limited in the number of modes they can capture and tend to suffer from mode collapse or mode averaging when the demonstration data contains diverse behaviors.
Energy-based models, such as Implicit Behavioral Cloning (IBC), represent the policy implicitly through an energy function over observation-action pairs. While theoretically capable of representing arbitrary distributions, IBC requires drawing negative samples during training to approximate the intractable normalization constant. This negative sampling introduces training instability, manifesting as error spikes and inconsistent evaluation performance.
The Behavior Transformer (BET) uses a categorical distribution over discretized action bins combined with a continuous offset. Although it can represent multiple modes, BET struggles with temporal consistency and can fail to commit to a single mode during execution, leading to hesitant or erratic behavior.
Diffusion policy was motivated by the observation that diffusion models, which had achieved remarkable success in image generation (such as DALL-E 2, Stable Diffusion, and Imagen), could be adapted from denoising pixels into images to denoising noise into robot action sequences. The score-function-based formulation of diffusion models provides a natural way to represent arbitrary normalizable distributions, including highly multimodal ones, while maintaining stable training through a simple mean squared error (MSE) denoising objective.
Diffusion policy formulates the robot policy as a conditional Denoising Diffusion Probabilistic Model (DDPM). Given a set of observations, the policy generates an action sequence by starting from Gaussian noise and performing K iterations of learned denoising. Each denoising step nudges the noisy action sequence toward the distribution of demonstrated actions conditioned on the current observations.
Formally, the forward diffusion process gradually adds Gaussian noise to a clean action sequence over K steps:
q(A^k | A^(k-1)) = N(A^k; sqrt(alpha_k) * A^(k-1), (1 - alpha_k) * I)
where A^0 is the original demonstrated action sequence and alpha_k controls the noise schedule. The reverse (denoising) process is parameterized by a neural network that predicts the noise added at each step:
p_theta(A^(k-1) | A^k, O_t) = N(A^(k-1); mu_theta(O_t, A^k, k), sigma_k^2 * I)
where O_t represents the observation conditioning. The network is trained to minimize a noise prediction objective (a variant of the standard DDPM loss):
L = E[||epsilon - epsilon_theta(O_t, A^k, k)||^2]
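A single training step of this objective can be sketched in a few lines. The example below is a minimal numpy illustration, not the authors' implementation: it uses the closed-form marginal of the forward process, A^k = sqrt(alpha_bar_k) * A^0 + sqrt(1 - alpha_bar_k) * epsilon, omits the observation conditioning O_t, uses a linear noise schedule for brevity, and treats `epsilon_theta` as a placeholder for the noise-prediction network.

```python
import numpy as np

rng = np.random.default_rng(0)

# Noise schedule (linear here for brevity; the paper uses a squared-cosine schedule).
K = 100
beta = np.linspace(1e-4, 0.02, K)
alpha = 1.0 - beta
alpha_bar = np.cumprod(alpha)  # cumulative product: alpha_bar_k

def training_loss(A0, epsilon_theta):
    """One DDPM training step: corrupt A^0, predict the added noise, take MSE.

    A0            : (T_p, action_dim) clean demonstrated action sequence
    epsilon_theta : stand-in for the noise-prediction network (assumed signature)
    """
    k = rng.integers(K)                  # random diffusion step
    eps = rng.standard_normal(A0.shape)  # Gaussian noise
    # Closed-form marginal of the forward process:
    # A^k = sqrt(alpha_bar_k) * A^0 + sqrt(1 - alpha_bar_k) * eps
    Ak = np.sqrt(alpha_bar[k]) * A0 + np.sqrt(1.0 - alpha_bar[k]) * eps
    eps_pred = epsilon_theta(Ak, k)
    return np.mean((eps - eps_pred) ** 2)  # L = E||eps - eps_theta||^2

# Toy usage with an untrained "network" that predicts zero noise.
A0 = rng.standard_normal((16, 7))  # T_p = 16 steps, 7-dim action
loss = training_loss(A0, lambda Ak, k: np.zeros_like(Ak))
```

In a real implementation the loss would be backpropagated through a CNN or Transformer denoiser conditioned on O_t; the sketch only shows the shape of the objective.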
Rather than learning the action distribution directly, diffusion policy learns its score function, i.e., the gradient of the log action density. At inference time, the model performs stochastic-Langevin-dynamics-style sampling along this learned gradient field to generate action samples. This formulation sidesteps the need to compute intractable normalization constants, giving diffusion policy a significant advantage over energy-based models in training stability.
The original diffusion policy paper employs the squared cosine noise schedule proposed in improved DDPM (iDDPM). This schedule was found to be well suited for control tasks because it preserves both the high-frequency and low-frequency characteristics of action signals. The number of training diffusion steps is set to K = 100.
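The iDDPM squared cosine schedule can be written out directly. The sketch below follows the published formula (with the standard offset s = 0.008 and the usual clipping of per-step betas); it is an illustration of the schedule, not code from the diffusion policy repository.

```python
import numpy as np

def cosine_alpha_bar(K: int, s: float = 0.008) -> np.ndarray:
    """Squared-cosine noise schedule from iDDPM (Nichol & Dhariwal, 2021).

    Returns alpha_bar_k for k = 1..K, the fraction of the original signal
    variance that survives after k noising steps.
    """
    steps = np.arange(K + 1)
    f = np.cos((steps / K + s) / (1.0 + s) * np.pi / 2.0) ** 2
    alpha_bar = f / f[0]  # normalize so alpha_bar at k = 0 equals 1
    return alpha_bar[1:]

alpha_bar = cosine_alpha_bar(100)
# Per-step betas, clipped as in iDDPM to avoid singularities near k = K.
beta = np.clip(1.0 - alpha_bar[1:] / alpha_bar[:-1], 0.0, 0.999)
```

The schedule decays slowly at first (little information destroyed early) and reaches nearly pure noise only at the final step, which is part of why it retains low-frequency signal structure better than a linear schedule.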
A critical design choice in diffusion policy is that the model predicts a sequence of future actions rather than a single action. The approach uses three key horizon parameters:
| Parameter | Symbol | Description |
|---|---|---|
| Observation horizon | T_o | Number of past observation steps used as conditioning |
| Prediction horizon | T_p | Total number of future action steps predicted by the model |
| Action horizon | T_a | Number of predicted actions actually executed before re-planning |
At each control step, the policy observes T_o recent observations, generates a T_p-length action sequence through the denoising process, but only executes the first T_a actions. It then re-plans conditioned on new observations. This receding horizon control scheme balances temporal consistency (from predicting long sequences) with closed-loop reactivity (from frequent re-planning). The prediction horizon T_p is typically set to 16 steps, and the authors found that an action horizon of T_a = 8 provides a good trade-off across tasks.
This design allows diffusion policy to maintain smooth, temporally coherent trajectories while still reacting to perturbations or unexpected changes in the environment. It also implicitly handles action-level multimodality: the model commits to a mode at the start of each predicted sequence and maintains that commitment for T_a steps, preventing the mode-switching oscillation that plagues single-step policies.
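The receding horizon scheme above reduces to a short control loop. The following is a toy sketch with stand-in `ToyRobot` and `toy_policy` objects (both hypothetical, invented for illustration); the real policy would replace `toy_policy` with the full denoising process conditioned on the observation history.

```python
from collections import deque

T_o, T_p, T_a = 2, 16, 8  # observation / prediction / action horizons from the paper

class ToyRobot:
    """Minimal stand-in environment so the loop below is runnable."""
    def __init__(self, max_steps=24):
        self.t, self.max_steps = 0, max_steps
        self.executed = []
    def done(self):
        return self.t >= self.max_steps
    def observe(self):
        return float(self.t)
    def execute(self, action):
        self.executed.append(action)
        self.t += 1

def toy_policy(obs_history):
    """Placeholder for the denoising process: returns a T_p-step action sequence."""
    return [obs_history[-1]] * T_p

def receding_horizon_loop(robot, policy):
    obs_history = deque(maxlen=T_o)        # keep the T_o most recent observations
    while not robot.done():
        obs_history.append(robot.observe())
        action_seq = policy(list(obs_history))  # denoise a full T_p-length plan
        for a in action_seq[:T_a]:              # execute only the first T_a actions,
            robot.execute(a)                    # then re-plan with fresh observations

robot = ToyRobot()
receding_horizon_loop(robot, toy_policy)
```

Each iteration plans 16 steps ahead but commits to only 8, so the loop re-plans three times over the 24-step toy episode.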
Diffusion policy treats visual observations as conditioning rather than as part of the joint data distribution. The observation images are encoded through a visual encoder, and the resulting features are injected into the denoising network through conditioning mechanisms (such as FiLM for the CNN architecture or cross-attention for the Transformer architecture). This design is more sample-efficient than alternatives that model the joint distribution of observations and actions.
The default visual encoder is a ResNet-18 with spatial softmax pooling (replacing global average pooling to retain spatial information) and GroupNorm (replacing BatchNorm for compatibility with Exponential Moving Average training). Notably, the authors found that training the visual encoder end-to-end from scratch performed comparably to using pretrained representations from CLIP or R3M, though CLIP-pretrained ViT-B/16 with fine-tuning achieved the highest performance (98% on the Square task).
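Spatial softmax pooling, mentioned above as the replacement for global average pooling, converts each feature channel into an expected 2D keypoint location. A minimal numpy sketch of the operation (an illustration of the technique, not the repository's implementation):

```python
import numpy as np

def spatial_softmax(features):
    """Spatial softmax keypoint pooling.

    features: (C, H, W) feature map. For each channel, a softmax over the
    H*W spatial locations gives an attention map; its expected (x, y)
    coordinate is the output. This yields 2 numbers per channel while
    retaining spatial information, unlike global average pooling.
    """
    C, H, W = features.shape
    flat = features.reshape(C, H * W)
    flat = flat - flat.max(axis=1, keepdims=True)  # numerical stability
    attn = np.exp(flat) / np.exp(flat).sum(axis=1, keepdims=True)
    # Pixel-coordinate grids normalized to [-1, 1].
    ys, xs = np.meshgrid(np.linspace(-1, 1, H), np.linspace(-1, 1, W), indexing="ij")
    ex = (attn * xs.ravel()).sum(axis=1)  # expected x per channel
    ey = (attn * ys.ravel()).sum(axis=1)  # expected y per channel
    return np.stack([ex, ey], axis=1)     # (C, 2) keypoints

# A ResNet-18 final feature map is typically (512, 7, 7) for 224x224 input.
keypoints = spatial_softmax(np.random.default_rng(0).standard_normal((512, 7, 7)))
```

The 512 feature channels thus become 1024 keypoint coordinates, a compact observation embedding for the denoising network.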
The original paper proposes and compares two network architectures for the denoising model: a CNN-based architecture and a Transformer-based architecture.
The CNN-based noise prediction network is a modified 1D temporal U-Net adapted from Janner et al. (2022), operating on 1D temporal sequences of actions. Key modifications include conditioning on observation features through FiLM layers and predicting only the action sequence rather than a joint observation-action trajectory.
The CNN architecture works reliably across most tasks and is recommended as the default starting point for new applications. However, due to the locality bias inherent in convolutions, it can exhibit over-smoothing effects, producing overly smooth action trajectories that may miss sharp turns or rapid changes.
The Transformer-based architecture adapts the minGPT design for action denoising: noisy actions are passed in as input tokens, the diffusion iteration k is provided as an embedding token, and observation features are incorporated through cross-attention layers.
The Transformer architecture reduces the over-smoothing effect due to its lack of locality bias and can be more suitable for tasks requiring sharp or uneven trajectories. However, it requires more careful hyperparameter tuning compared to the CNN variant.
| Feature | CNN (Diffusion Policy-C) | Transformer (Diffusion Policy-T) |
|---|---|---|
| Backbone | 1D temporal U-Net | minGPT-style Transformer |
| Observation conditioning | FiLM layers | Cross-attention |
| Strengths | Reliable, minimal tuning needed | Handles sharp trajectories, reduces over-smoothing |
| Weaknesses | Over-smoothing on rapid action changes | Requires more hyperparameter tuning |
| Recommended use | Default for new tasks | High-complexity tasks with uneven trajectories |
Diffusion policy can use either DDPM or DDIM (Denoising Diffusion Implicit Models) for the inference (denoising) process. While DDPM is used during training with K=100 diffusion steps, the full 100-step denoising process is too slow for real-time robot control.
DDIM provides a critical acceleration by decoupling the number of denoising iterations used in training from those used at inference. DDIM makes the reverse process deterministic and allows skipping steps, enabling the model to generate high-quality action sequences with far fewer denoising iterations. In practice, diffusion policy uses 100 training iterations but only 10 inference iterations with DDIM, achieving approximately 0.1 seconds of inference latency on an NVIDIA RTX 3080 GPU. This makes real-time closed-loop control feasible, with the policy running at approximately 10 Hz and actions being interpolated to the robot's native control frequency (e.g., 125 Hz for a UR5 robot).
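The subsampled, deterministic DDIM sampler can be sketched as follows. This is an illustrative numpy version (eta = 0, linear schedule for brevity rather than the paper's squared cosine, and `epsilon_theta` as a placeholder for the trained denoiser), not the authors' code:

```python
import numpy as np

rng = np.random.default_rng(0)

# Schedule reused from training (linear here for brevity).
K = 100
beta = np.linspace(1e-4, 0.02, K)
alpha_bar = np.cumprod(1.0 - beta)

def ddim_sample(epsilon_theta, shape, n_steps=10):
    """Deterministic DDIM sampling over a subsampled step schedule.

    epsilon_theta(Ak, k) stands in for the trained noise-prediction network;
    here it is a placeholder, so the output is not a real action plan.
    """
    steps = np.linspace(K - 1, 0, n_steps).round().astype(int)  # e.g. 10 of 100 steps
    A = rng.standard_normal(shape)                              # start from pure noise
    for i, k in enumerate(steps):
        eps = epsilon_theta(A, k)
        # Predict the clean sequence A^0 implied by the current noise estimate.
        A0_pred = (A - np.sqrt(1.0 - alpha_bar[k]) * eps) / np.sqrt(alpha_bar[k])
        abar_prev = alpha_bar[steps[i + 1]] if i + 1 < len(steps) else 1.0
        # Deterministic (eta = 0) DDIM update toward the previous noise level.
        A = np.sqrt(abar_prev) * A0_pred + np.sqrt(1.0 - abar_prev) * eps
    return A

actions = ddim_sample(lambda A, k: np.zeros_like(A), shape=(16, 7))
```

Because the same alpha_bar schedule is reused, the network trained with 100 DDPM steps can be queried at only 10 of those noise levels, which is the source of the roughly 10x inference speedup.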
| Inference method | Training steps | Inference steps | Latency (RTX 3080) | Deterministic |
|---|---|---|---|---|
| DDPM | 100 | 100 | ~1.0 s | No (stochastic) |
| DDIM | 100 | 10 | ~0.1 s | Yes |
Diffusion policy offers several key advantages compared to prior policy representations for imitation learning:
By learning the score function of the action distribution, diffusion policy can represent arbitrary normalizable distributions, including complex multimodal distributions. This allows it to faithfully capture the full diversity of human demonstrations without mode collapse or mode averaging. In contrast, Gaussian mixture models are limited in the number of modes they can represent, and energy-based models require unstable negative sampling.
The training objective is a straightforward MSE loss on noise prediction, which avoids the instabilities associated with energy-based training (negative sampling in IBC) or adversarial training. The authors report that optimal hyperparameters are mostly consistent across tasks, requiring little task-specific tuning.
Diffusion policy naturally handles high-dimensional output spaces because the denoising process operates in the full action-sequence space. This makes it well-suited for predicting multi-step action chunks, which is essential for temporal consistency.
In demonstrations, robots often remain stationary for portions of the trajectory (e.g., waiting at the start). The Gaussian noise prior of the diffusion process naturally contracts toward zero, which aligns with idle or near-zero velocity actions. Other methods, particularly energy-based models, struggle with these near-zero regions where the energy landscape is ambiguous.
The authors discovered an important interaction between diffusion policy and action space representation. Diffusion policy shows a 15-25% performance improvement when using position control (predicting target positions) instead of velocity control (predicting velocities). This is because position control has less pronounced multimodality and fewer compounding errors when predicting action sequences. Interestingly, other baselines like LSTM-GMM and BET actually degrade with position control, making this advantage unique to the diffusion policy formulation.
Diffusion policy was evaluated extensively across four benchmark suites containing 12 tasks total. It consistently outperformed all prior state-of-the-art methods with an average improvement of 46.9% in success rate.
The RoboMimic benchmark consists of five manipulation tasks with both proficient human (PH) and mixed human (MH) demonstration datasets. Results are reported as best performance / average of last 10 checkpoints across 3 seeds.
State-based results (Proficient Human demonstrations):
| Task | Diffusion Policy-C | Diffusion Policy-T | IBC | BET | LSTM-GMM |
|---|---|---|---|---|---|
| Lift | 1.00 / 0.98 | 1.00 / 1.00 | 0.79 / 0.41 | 1.00 / 0.96 | 1.00 / 0.96 |
| Can | 1.00 / 0.97 | 1.00 / 1.00 | 0.15 / 0.02 | 1.00 / 0.99 | 1.00 / 0.93 |
| Square | 1.00 / 0.96 | 1.00 / 1.00 | 0.00 / 0.00 | 1.00 / 0.89 | 1.00 / 0.91 |
| Transport | 1.00 / 0.96 | 1.00 / 0.94 | 0.01 / 0.01 | 1.00 / 0.90 | 1.00 / 0.81 |
| ToolHang | 1.00 / 0.93 | 1.00 / 0.89 | 0.00 / 0.00 | 0.76 / 0.52 | 0.95 / 0.73 |
Image-based results (Proficient Human demonstrations):
| Task | Diffusion Policy-C | Diffusion Policy-T | IBC | LSTM-GMM |
|---|---|---|---|---|
| Lift | 1.00 / 1.00 | 1.00 / 1.00 | 0.94 / 0.73 | 1.00 / 0.95 |
| Can | 1.00 / 1.00 | 1.00 / 0.99 | 0.39 / 0.05 | 1.00 / 0.95 |
| Square | 1.00 / 0.97 | 1.00 / 0.98 | 0.08 / 0.01 | 1.00 / 0.88 |
| Transport | 1.00 / 0.96 | 1.00 / 0.98 | 0.00 / 0.00 | 0.98 / 0.90 |
| ToolHang | 0.98 / 0.92 | 1.00 / 0.90 | 0.03 / 0.00 | 0.82 / 0.59 |
Diffusion policy achieved near-perfect performance on most tasks, and its advantage was most pronounced on the hardest tasks (ToolHang and Transport) where baselines showed significant degradation.
In the Push-T task, the robot must push a T-shaped block to align with a target outline. Performance is measured by Intersection over Union (IoU) coverage rather than binary success. Diffusion policy achieved 0.95 (CNN) and 0.91 (Transformer) IoU in simulation. On a real UR5 robot, diffusion policy achieved 0.80 IoU compared to 0.84 for the human demonstrator, with a 95% success rate. IBC achieved only 20% success and LSTM-GMM 0% on this task.
The block pushing task requires pushing two blocks to target locations in sequence, testing multi-stage planning with multimodal solutions. Diffusion Policy-T achieved 0.99 success on phase 1 and 0.94 on phase 2, compared to BET at 0.96/0.71 and IBC at 0.01/0.00.
The Franka Kitchen environment requires completing four subtasks in sequence (e.g., open microwave, move kettle, turn on burner, slide cabinet). Diffusion Policy-C achieved 0.99 success on completing all four subtasks, compared to BET at 0.44 and IBC at 0.24.
Diffusion policy was validated on multiple real-world manipulation tasks across different robot platforms:
Using a top-down camera setup, diffusion policy learned to push a T-shaped block with 95% success rate across 20 trials. The policy exhibited remarkable robustness: it recovered from human-applied perturbations that displaced the block during execution and maintained performance even when the camera was deliberately occluded for 3 seconds. The policy also generalized to novel block starting positions not seen during training.
This 6-DOF task required the robot to flip a mug upright. Diffusion policy achieved 90% success (18 out of 20 trials) and notably exhibited emergent behaviors not present in any training demonstration, such as performing multiple corrective pushes to align the mug handle or executing re-grasps when the initial grasp was suboptimal.
These tasks involve scooping sauce with a ladle, pouring it onto a plate, and spreading it evenly. The sauce pouring task achieved 79% success with 0.74 IoU (human baseline: 0.79). The sauce spreading task achieved 100% success with 0.77 coverage (human: 0.79). In contrast, LSTM-GMM failed to lift the ladle in 15 out of 20 pouring trials.
Diffusion policy was also tested on bimanual tasks requiring coordinated two-arm control:
| Task | Demonstrations | Success rate |
|---|---|---|
| Egg beater | 210 | 55% |
| Mat unrolling | 162 | 75% |
| Shirt folding | 284 | 75% |
All bimanual tasks were trained without any task-specific hyperparameter adjustments, demonstrating the method's generality.
The original paper includes extensive ablation experiments that reveal important design principles:
Switching from velocity control to position control yielded a 15-25% performance improvement for diffusion policy. This improvement is attributed to two factors: position control has less pronounced multimodality (reaching a target position from the same start has fewer valid trajectories than reaching a target velocity), and position control has less compounding error when predicting sequences. Other baselines (LSTM-GMM, BET) showed 5-15% degradation with position control, suggesting this advantage is specific to the diffusion formulation.
The action execution horizon T_a controls the balance between temporal consistency and reactivity. Longer horizons produce smoother trajectories but reduce the robot's ability to react to disturbances. Shorter horizons increase reactivity but may lead to jerky movements or mode-switching. The authors found T_a = 8 to be optimal across most tasks.
An ablation on the Square task compared different visual encoder choices:
| Visual encoder | Training strategy | Success rate |
|---|---|---|
| ViT-B/16 (CLIP) | Fine-tuned | 98% |
| ResNet-34 | Fine-tuned | 94% |
| ResNet-18 | End-to-end from scratch | 94% |
| Frozen pretrained | Various | 40-70% |
Frozen pretrained encoders consistently underperformed, indicating that fine-tuning or end-to-end training of the visual backbone is important for diffusion policy.
Despite its strong performance, diffusion policy has several notable limitations:
The iterative denoising process requires multiple forward passes through the neural network at inference time. Even with DDIM acceleration (reducing from 100 to 10 steps), inference takes approximately 0.1 seconds per action chunk. This latency makes diffusion policy unsuitable for tasks requiring very high-frequency control (above approximately 10 Hz) or extremely fast reactions. Contact-rich or high-speed manipulation tasks may suffer from the delay between observation and action execution.
Diffusion policy has higher computational costs than simpler methods like LSTM-GMM, both during training and inference. Standard architectures involve millions of parameters, leading to high memory usage. Deploying on resource-constrained platforms such as mobile robots or drones remains challenging due to the computational and memory footprint.
While the ability to represent multimodal distributions is an advantage during training, it can cause issues during real-time execution. As the policy executes a trajectory, the mode should naturally collapse to a single consistent behavior. However, with limited conditional context, the policy may occasionally oscillate between different modes rather than committing to one, particularly at decision points.
Diffusion policy typically requires a substantial number of high-quality demonstrations (often 100-600+) to learn robust policies. This is more than some simpler methods require for basic tasks, although the quality and generalizability of the resulting policy tend to justify the increased data collection effort.
Diffusion policy has inspired a large body of follow-up research addressing its limitations and extending its capabilities.
3D Diffusion Policy (DP3), published at RSS 2024 by Ze et al., extends diffusion policy by incorporating 3D visual representations extracted from sparse point clouds using an efficient point encoder. DP3 handles most tasks with just 10 demonstrations and surpasses 2D baselines with a 24.2% relative improvement across 72 simulation tasks. In real-robot experiments, DP3 achieved an 85% success rate with only 40 demonstrations per task and demonstrated strong generalization across viewpoints, appearances, and spatial configurations. An improved version (iDP3) was developed for humanoid manipulation and presented at IROS 2025.
Consistency Policy, published at RSS 2024, uses the Consistency Trajectory Model (CTM) to distill a pre-trained diffusion policy into a model that generates action sequences in a single denoising step. This dramatically reduces inference time while maintaining comparable task performance, addressing the latency limitation.
One-Step Diffusion Policy applies diffusion distillation to compress the multi-step denoising process into a single forward pass, achieving inference speeds suitable for higher-frequency control loops without significant performance degradation.
FlowPolicy, presented as an oral paper at AAAI 2025, replaces the diffusion process with conditional consistency flow matching. By learning a straight-line flow that generates actions in a single step, FlowPolicy enables real-time inference while maintaining action quality comparable to multi-step diffusion policies.
DPPO, published at ICLR 2025, provides an algorithmic framework for fine-tuning pre-trained diffusion policies using reinforcement learning (specifically Proximal Policy Optimization). DPPO takes advantage of synergies between RL fine-tuning and the diffusion parameterization, enabling structured and on-manifold exploration with stable training. It outperforms alternative RL approaches (such as IDQL and DIPO) for fine-tuning diffusion-based policies on RoboMimic and other benchmarks.
UMI, developed by Cheng Chi and colleagues and a Best Systems Paper Finalist at RSS 2024, uses diffusion policy as its core policy learning component. UMI enables in-the-wild data collection using handheld grippers with GoPro cameras, with the resulting diffusion policies being hardware-agnostic and deployable across multiple robot platforms. UMI incorporates inference-time latency matching and relative-trajectory action representations for practical deployment.
RDT-1B is a 1.2 billion parameter diffusion foundation model for bimanual manipulation. It builds on diffusion policy principles at a much larger scale and can be fine-tuned to learn new skills from just 1-5 demonstrations, exhibiting zero-shot generalization to unseen objects and scenes.
Other notable extensions include:
| Extension | Key contribution | Venue |
|---|---|---|
| ManiCM | Accelerates 3D diffusion policy via consistency models | 2024 |
| Mamba Policy | Replaces backbone with Mamba selective state model | 2024 |
| Diffusion Transformer Policy | Adapts diffusion policy to Transformer-only architecture | 2024 |
| Reactive Diffusion Policy | Addresses real-time reactivity for contact-rich tasks | 2025 |
| D3P (Dynamic Denoising Diffusion Policy) | Adapts denoising schedule dynamically via RL | 2025 |
| FastDP | Optimizes on-device deployment for mobile platforms | 2025 |
Diffusion policy has seen broad adoption across the robotics research community and is increasingly used in industry:
The original implementation is available at the GitHub repository maintained by the authors at Columbia University (later Stanford University). The codebase provides both CNN-based and Transformer-based variants with support for state-based and image-based observations.
Diffusion policy is also integrated into Hugging Face's LeRobot framework, which aims to make robot learning accessible with end-to-end learning tools. LeRobot provides pre-trained diffusion policy models (e.g., for the Push-T task), standardized training pipelines, and integration with various robot hardware through a plugin system.
Several robotics companies have built upon or been influenced by diffusion policy. Physical Intelligence developed pi0, a vision-language-action flow model for general robot control that extends the core idea of using generative models for action prediction. In comparative evaluations, pi0 outperforms standard diffusion policy and other methods like ACT, OpenVLA, and Octo when fine-tuned on small datasets. Physical Intelligence raised $400M in late 2024 at a $2B valuation, reflecting the commercial potential of this line of research.
Octo, a 93M parameter generalist robot policy, uses diffusion action heads as its output representation, further demonstrating the adoption of diffusion-based action generation in foundation models for robotics.
As of 2025, the original diffusion policy paper has been cited extensively, and diffusion-based policies have become a default baseline in robot learning research. Surveys on diffusion models for robotic manipulation document dozens of follow-up methods and applications spanning grasping, locomotion, navigation, and multi-agent coordination.
Diffusion policy relates to several other lines of research in deep learning and embodied AI: