Diffusion policy is a method for robot behavior generation that represents a robot's visuomotor policy as a conditional denoising diffusion process. Instead of directly predicting a single action from observations, diffusion policy iteratively refines random noise into a sequence of robot actions through learned denoising steps. First introduced by Cheng Chi, Zhenjia Xu, Siyuan Feng, Eric Cousineau, Yilun Du, Benjamin Burchfiel, Russ Tedrake, and Shuran Song in 2023, the method was published at Robotics: Science and Systems (RSS) 2023 and later extended in a journal version in the International Journal of Robotics Research (IJRR) in 2024. Diffusion policy has since become one of the most widely adopted approaches in imitation learning for robot manipulation, serving as a foundational building block for numerous follow-up systems in both research and industry.
Learning robot control policies from human demonstrations, known as behavioral cloning or learning from demonstrations, is a central problem in robot learning. A key challenge in this setting is representing the policy, which is the mapping from observations to actions. Human demonstrations are often inherently multimodal: for the same observation, multiple valid action trajectories may exist. For example, when reaching around an obstacle, a robot might go left or right, and both are correct. Traditional policy representations struggle with this multimodality in different ways.
Gaussian or Gaussian Mixture Model (GMM) policies, such as LSTM-GMM (also known as BC-RNN from the RoboMimic benchmark), represent the action distribution as an explicit parametric density. While simple and fast at inference time, these methods are limited in the number of modes they can capture and tend to suffer from mode collapse or mode averaging when the demonstration data contains diverse behaviors.
Energy-based models, such as Implicit Behavioral Cloning (IBC), represent the policy implicitly through an energy function over observation-action pairs. While theoretically capable of representing arbitrary distributions, IBC requires drawing negative samples during training to approximate the intractable normalization constant. This negative sampling introduces training instability, manifesting as error spikes and inconsistent evaluation performance.
The Behavior Transformer (BET) uses a categorical distribution over discretized action bins combined with a continuous offset. Although it can represent multiple modes, BET struggles with temporal consistency and can fail to commit to a single mode during execution, leading to hesitant or erratic behavior.
Diffusion policy was motivated by the observation that diffusion models, which had achieved remarkable success in image generation (such as DALL-E 2, Stable Diffusion, and Imagen), could be adapted from denoising pixels into images to denoising noise into robot action sequences. The score-function-based formulation of diffusion models provides a natural way to represent arbitrary normalizable distributions, including highly multimodal ones, while maintaining stable training through a simple mean squared error (MSE) denoising objective.
Diffusion policy formulates the robot policy as a conditional Denoising Diffusion Probabilistic Model (DDPM). Given a set of observations, the policy generates an action sequence by starting from Gaussian noise and performing K iterations of learned denoising. Each denoising step nudges the noisy action sequence toward the distribution of demonstrated actions conditioned on the current observations.
Formally, the forward diffusion process gradually adds Gaussian noise to a clean action sequence over K steps:
q(A^k | A^(k-1)) = N(A^k; sqrt(alpha_k) * A^(k-1), (1 - alpha_k) * I)
where A^0 is the original demonstrated action sequence and alpha_k controls the noise schedule. The reverse (denoising) process is parameterized by a neural network that predicts the noise added at each step:
p_theta(A^(k-1) | A^k, O_t) = N(A^(k-1); mu_theta(O_t, A^k, k), sigma_k^2 * I)
where O_t represents the observation conditioning. The network is trained to minimize a noise prediction objective (a variant of the standard DDPM loss):
L = E[||epsilon - epsilon_theta(O_t, A^k, k)||^2]
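A single training step of this objective can be sketched in a few lines. The example below is a minimal numpy illustration, not the authors' implementation: it uses the closed-form marginal of the forward process, A^k = sqrt(alpha_bar_k) * A^0 + sqrt(1 - alpha_bar_k) * epsilon, omits the observation conditioning O_t, uses a linear noise schedule for brevity, and treats `epsilon_theta` as a placeholder for the noise-prediction network.

```python
import numpy as np

rng = np.random.default_rng(0)

# Noise schedule (linear here for brevity; the paper uses a squared-cosine schedule).
K = 100
beta = np.linspace(1e-4, 0.02, K)
alpha = 1.0 - beta
alpha_bar = np.cumprod(alpha)  # cumulative product: alpha_bar_k

def training_loss(A0, epsilon_theta):
    """One DDPM training step: corrupt A^0, predict the added noise, take MSE.

    A0            : (T_p, action_dim) clean demonstrated action sequence
    epsilon_theta : stand-in for the noise-prediction network (assumed signature)
    """
    k = rng.integers(K)                  # random diffusion step
    eps = rng.standard_normal(A0.shape)  # Gaussian noise
    # Closed-form marginal of the forward process:
    # A^k = sqrt(alpha_bar_k) * A^0 + sqrt(1 - alpha_bar_k) * eps
    Ak = np.sqrt(alpha_bar[k]) * A0 + np.sqrt(1.0 - alpha_bar[k]) * eps
    eps_pred = epsilon_theta(Ak, k)
    return np.mean((eps - eps_pred) ** 2)  # L = E||eps - eps_theta||^2

# Toy usage with an untrained "network" that predicts zero noise.
A0 = rng.standard_normal((16, 7))  # T_p = 16 steps, 7-dim action
loss = training_loss(A0, lambda Ak, k: np.zeros_like(Ak))
```

In a real implementation the loss would be backpropagated through a CNN or Transformer denoiser conditioned on O_t; the sketch only shows the shape of the objective.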
Rather than learning the action distribution directly, diffusion policy learns its score function, i.e., the gradient of the log action density. At inference time, the model performs stochastic-Langevin-dynamics-style sampling along this learned gradient field to generate action samples. This formulation sidesteps the need to compute intractable normalization constants, giving diffusion policy a significant advantage over energy-based models in training stability.
The original diffusion policy paper employs the squared cosine noise schedule proposed in improved DDPM (iDDPM). This schedule was found to be well suited for control tasks because it preserves both the high-frequency and low-frequency characteristics of action signals. The number of training diffusion steps is set to K = 100.
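The iDDPM squared cosine schedule can be written out directly. The sketch below follows the published formula (with the standard offset s = 0.008 and the usual clipping of per-step betas); it is an illustration of the schedule, not code from the diffusion policy repository.

```python
import numpy as np

def cosine_alpha_bar(K: int, s: float = 0.008) -> np.ndarray:
    """Squared-cosine noise schedule from iDDPM (Nichol & Dhariwal, 2021).

    Returns alpha_bar_k for k = 1..K, the fraction of the original signal
    variance that survives after k noising steps.
    """
    steps = np.arange(K + 1)
    f = np.cos((steps / K + s) / (1.0 + s) * np.pi / 2.0) ** 2
    alpha_bar = f / f[0]  # normalize so alpha_bar at k = 0 equals 1
    return alpha_bar[1:]

alpha_bar = cosine_alpha_bar(100)
# Per-step betas, clipped as in iDDPM to avoid singularities near k = K.
beta = np.clip(1.0 - alpha_bar[1:] / alpha_bar[:-1], 0.0, 0.999)
```

The schedule decays slowly at first (little information destroyed early) and reaches nearly pure noise only at the final step, which is part of why it retains low-frequency signal structure better than a linear schedule.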
A critical design choice in diffusion policy is that the model predicts a sequence of future actions rather than a single action. The approach uses three key horizon parameters:
| Parameter | Symbol | Description |
|---|---|---|
| Observation horizon | T_o | Number of past observation steps used as conditioning |
| Prediction horizon | T_p | Total number of future action steps predicted by the model |
| Action horizon | T_a | Number of predicted actions actually executed before re-planning |
At each control step, the policy observes T_o recent observations, generates a T_p-length action sequence through the denoising process, but only executes the first T_a actions. It then re-plans conditioned on new observations. This receding horizon control scheme balances temporal consistency (from predicting long sequences) with closed-loop reactivity (from frequent re-planning). The prediction horizon T_p is typically set to 16 steps, and the authors found that an action horizon of T_a = 8 provides a good trade-off across tasks.
This design allows diffusion policy to maintain smooth, temporally coherent trajectories while still reacting to perturbations or unexpected changes in the environment. It also implicitly handles action-level multimodality: the model commits to a mode at the start of each predicted sequence and maintains that commitment for T_a steps, preventing the mode-switching oscillation that plagues single-step policies.
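The receding horizon scheme above reduces to a short control loop. The following is a toy sketch with stand-in `ToyRobot` and `toy_policy` objects (both hypothetical, invented for illustration); the real policy would replace `toy_policy` with the full denoising process conditioned on the observation history.

```python
from collections import deque

T_o, T_p, T_a = 2, 16, 8  # observation / prediction / action horizons from the paper

class ToyRobot:
    """Minimal stand-in environment so the loop below is runnable."""
    def __init__(self, max_steps=24):
        self.t, self.max_steps = 0, max_steps
        self.executed = []
    def done(self):
        return self.t >= self.max_steps
    def observe(self):
        return float(self.t)
    def execute(self, action):
        self.executed.append(action)
        self.t += 1

def toy_policy(obs_history):
    """Placeholder for the denoising process: returns a T_p-step action sequence."""
    return [obs_history[-1]] * T_p

def receding_horizon_loop(robot, policy):
    obs_history = deque(maxlen=T_o)        # keep the T_o most recent observations
    while not robot.done():
        obs_history.append(robot.observe())
        action_seq = policy(list(obs_history))  # denoise a full T_p-length plan
        for a in action_seq[:T_a]:              # execute only the first T_a actions,
            robot.execute(a)                    # then re-plan with fresh observations

robot = ToyRobot()
receding_horizon_loop(robot, toy_policy)
```

Each iteration plans 16 steps ahead but commits to only 8, so the loop re-plans three times over the 24-step toy episode.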
Diffusion policy treats visual observations as conditioning rather than as part of the joint data distribution. The observation images are encoded through a visual encoder, and the resulting features are injected into the denoising network through conditioning mechanisms (such as FiLM for the CNN architecture or cross-attention for the Transformer architecture). This design is more sample-efficient than alternatives that model the joint distribution of observations and actions.
The default visual encoder is a ResNet-18 with spatial softmax pooling (replacing global average pooling to retain spatial information) and GroupNorm (replacing BatchNorm for compatibility with Exponential Moving Average training). Notably, the authors found that training the visual encoder end-to-end from scratch performed comparably to using pretrained representations from CLIP or R3M, though CLIP-pretrained ViT-B/16 with fine-tuning achieved the highest performance (98% on the Square task).
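Spatial softmax pooling, mentioned above as the replacement for global average pooling, converts each feature channel into an expected 2D keypoint location. A minimal numpy sketch of the operation (an illustration of the technique, not the repository's implementation):

```python
import numpy as np

def spatial_softmax(features):
    """Spatial softmax keypoint pooling.

    features: (C, H, W) feature map. For each channel, a softmax over the
    H*W spatial locations gives an attention map; its expected (x, y)
    coordinate is the output. This yields 2 numbers per channel while
    retaining spatial information, unlike global average pooling.
    """
    C, H, W = features.shape
    flat = features.reshape(C, H * W)
    flat = flat - flat.max(axis=1, keepdims=True)  # numerical stability
    attn = np.exp(flat) / np.exp(flat).sum(axis=1, keepdims=True)
    # Pixel-coordinate grids normalized to [-1, 1].
    ys, xs = np.meshgrid(np.linspace(-1, 1, H), np.linspace(-1, 1, W), indexing="ij")
    ex = (attn * xs.ravel()).sum(axis=1)  # expected x per channel
    ey = (attn * ys.ravel()).sum(axis=1)  # expected y per channel
    return np.stack([ex, ey], axis=1)     # (C, 2) keypoints

# A ResNet-18 final feature map is typically (512, 7, 7) for 224x224 input.
keypoints = spatial_softmax(np.random.default_rng(0).standard_normal((512, 7, 7)))
```

The 512 feature channels thus become 1024 keypoint coordinates, a compact observation embedding for the denoising network.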
The original paper proposes and compares two network architectures for the denoising model: a CNN-based architecture and a Transformer-based architecture.
The CNN-based noise prediction network is a modified 1D temporal U-Net adapted from Janner et al. (2022), operating on 1D temporal sequences of actions. Key modifications include conditioning on observation features through FiLM layers and predicting only the action sequence rather than a joint observation-action trajectory.
The CNN architecture works reliably across most tasks and is recommended as the default starting point for new applications. However, due to the locality bias inherent in convolutions, it can exhibit over-smoothing effects, producing overly smooth action trajectories that may miss sharp turns or rapid changes.
The Transformer-based architecture adapts the minGPT design for action denoising: noisy actions are passed in as input tokens, the diffusion iteration k is provided as an embedding token, and observation features are incorporated through cross-attention layers.
The Transformer architecture reduces the over-smoothing effect due to its lack of locality bias and can be more suitable for tasks requiring sharp or uneven trajectories. However, it requires more careful hyperparameter tuning compared to the CNN variant.
| Feature | CNN (Diffusion Policy-C) | Transformer (Diffusion Policy-T) |
|---|---|---|
| Backbone | 1D temporal U-Net | minGPT-style Transformer |
| Observation conditioning | FiLM layers | Cross-attention |
| Strengths | Reliable, minimal tuning needed | Handles sharp trajectories, reduces over-smoothing |
| Weaknesses | Over-smoothing on rapid action changes | Requires more hyperparameter tuning |
| Recommended use | Default for new tasks | High-complexity tasks with uneven trajectories |
Diffusion policy can use either DDPM or DDIM (Denoising Diffusion Implicit Models) for the inference (denoising) process. While DDPM is used during training with K=100 diffusion steps, the full 100-step denoising process is too slow for real-time robot control.
DDIM provides a critical acceleration by decoupling the number of denoising iterations used in training from those used at inference. DDIM makes the reverse process deterministic and allows skipping steps, enabling the model to generate high-quality action sequences with far fewer denoising iterations. In practice, diffusion policy uses 100 training iterations but only 10 inference iterations with DDIM, achieving approximately 0.1 seconds of inference latency on an NVIDIA RTX 3080 GPU. This makes real-time closed-loop control feasible, with the policy running at approximately 10 Hz and actions being interpolated to the robot's native control frequency (e.g., 125 Hz for a UR5 robot).
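The subsampled, deterministic DDIM sampler can be sketched as follows. This is an illustrative numpy version (eta = 0, linear schedule for brevity rather than the paper's squared cosine, and `epsilon_theta` as a placeholder for the trained denoiser), not the authors' code:

```python
import numpy as np

rng = np.random.default_rng(0)

# Schedule reused from training (linear here for brevity).
K = 100
beta = np.linspace(1e-4, 0.02, K)
alpha_bar = np.cumprod(1.0 - beta)

def ddim_sample(epsilon_theta, shape, n_steps=10):
    """Deterministic DDIM sampling over a subsampled step schedule.

    epsilon_theta(Ak, k) stands in for the trained noise-prediction network;
    here it is a placeholder, so the output is not a real action plan.
    """
    steps = np.linspace(K - 1, 0, n_steps).round().astype(int)  # e.g. 10 of 100 steps
    A = rng.standard_normal(shape)                              # start from pure noise
    for i, k in enumerate(steps):
        eps = epsilon_theta(A, k)
        # Predict the clean sequence A^0 implied by the current noise estimate.
        A0_pred = (A - np.sqrt(1.0 - alpha_bar[k]) * eps) / np.sqrt(alpha_bar[k])
        abar_prev = alpha_bar[steps[i + 1]] if i + 1 < len(steps) else 1.0
        # Deterministic (eta = 0) DDIM update toward the previous noise level.
        A = np.sqrt(abar_prev) * A0_pred + np.sqrt(1.0 - abar_prev) * eps
    return A

actions = ddim_sample(lambda A, k: np.zeros_like(A), shape=(16, 7))
```

Because the same alpha_bar schedule is reused, the network trained with 100 DDPM steps can be queried at only 10 of those noise levels, which is the source of the roughly 10x inference speedup.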
| Inference method | Training steps | Inference steps | Latency (RTX 3080) | Deterministic |
|---|---|---|---|---|
| DDPM | 100 | 100 | ~1.0 s | No (stochastic) |
| DDIM | 100 | 10 | ~0.1 s | Yes |
Diffusion policy offers several key advantages compared to prior policy representations for imitation learning:
By learning the score function of the action distribution, diffusion policy can represent arbitrary normalizable distributions, including complex multimodal distributions. This allows it to faithfully capture the full diversity of human demonstrations without mode collapse or mode averaging. In contrast, Gaussian mixture models are limited in the number of modes they can represent, and energy-based models require unstable negative sampling.
The training objective is a straightforward MSE loss on noise prediction, which avoids the instabilities associated with energy-based training (negative sampling in IBC) or adversarial training. The authors report that optimal hyperparameters are mostly consistent across tasks, requiring little task-specific tuning.
Diffusion policy naturally handles high-dimensional output spaces because the denoising process operates in the full action-sequence space. This makes it well-suited for predicting multi-step action chunks, which is essential for temporal consistency.
In demonstrations, robots often remain stationary for portions of the trajectory (e.g., waiting at the start). The Gaussian noise prior of the diffusion process naturally contracts toward zero, which aligns with idle or near-zero velocity actions. Other methods, particularly energy-based models, struggle with these near-zero regions where the energy landscape is ambiguous.
The authors discovered an important interaction between diffusion policy and action space representation. Diffusion policy shows a 15-25% performance improvement when using position control (predicting target positions) instead of velocity control (predicting velocities). This is because position control has less pronounced multimodality and fewer compounding errors when predicting action sequences. Interestingly, other baselines like LSTM-GMM and BET actually degrade with position control, making this advantage unique to the diffusion policy formulation.
Diffusion policy was evaluated extensively across four benchmark suites containing 12 tasks total. It consistently outperformed all prior state-of-the-art methods with an average improvement of 46.9% in success rate.
The RoboMimic benchmark consists of five manipulation tasks with both proficient human (PH) and mixed human (MH) demonstration datasets. Results are reported as best performance / average of last 10 checkpoints across 3 seeds.
State-based results (Proficient Human demonstrations):
| Task | Diffusion Policy-C | Diffusion Policy-T | IBC | BET | LSTM-GMM |
|---|---|---|---|---|---|
| Lift | 1.00 / 0.98 | 1.00 / 1.00 | 0.79 / 0.41 | 1.00 / 0.96 | 1.00 / 0.96 |
| Can | 1.00 / 0.97 | 1.00 / 1.00 | 0.15 / 0.02 | 1.00 / 0.99 | 1.00 / 0.93 |
| Square | 1.00 / 0.96 | 1.00 / 1.00 | 0.00 / 0.00 | 1.00 / 0.89 | 1.00 / 0.91 |
| Transport | 1.00 / 0.96 | 1.00 / 0.94 | 0.01 / 0.01 | 1.00 / 0.90 | 1.00 / 0.81 |
| ToolHang | 1.00 / 0.93 | 1.00 / 0.89 | 0.00 / 0.00 | 0.76 / 0.52 | 0.95 / 0.73 |
Image-based results (Proficient Human demonstrations):
| Task | Diffusion Policy-C | Diffusion Policy-T | IBC | LSTM-GMM |
|---|---|---|---|---|
| Lift | 1.00 / 1.00 | 1.00 / 1.00 | 0.94 / 0.73 | 1.00 / 0.95 |
| Can | 1.00 / 1.00 | 1.00 / 0.99 | 0.39 / 0.05 | 1.00 / 0.95 |
| Square | 1.00 / 0.97 | 1.00 / 0.98 | 0.08 / 0.01 | 1.00 / 0.88 |
| Transport | 1.00 / 0.96 | 1.00 / 0.98 | 0.00 / 0.00 | 0.98 / 0.90 |
| ToolHang | 0.98 / 0.92 | 1.00 / 0.90 | 0.03 / 0.00 | 0.82 / 0.59 |
Diffusion policy achieved near-perfect performance on most tasks, and its advantage was most pronounced on the hardest tasks (ToolHang and Transport) where baselines showed significant degradation.
In the Push-T task, the robot must push a T-shaped block to align with a target outline. Performance is measured by Intersection over Union (IoU) coverage rather than binary success. Diffusion policy achieved 0.95 (CNN) and 0.91 (Transformer) IoU in simulation. On a real UR5 robot, diffusion policy achieved 0.80 IoU compared to 0.84 for the human demonstrator, with a 95% success rate. IBC achieved only 20% success and LSTM-GMM 0% on this task.
The block pushing task requires pushing two blocks to target locations in sequence, testing multi-stage planning with multimodal solutions. Diffusion Policy-T achieved 0.99 success on phase 1 and 0.94 on phase 2, compared to BET at 0.96/0.71 and IBC at 0.01/0.00.
The Franka Kitchen environment requires completing four subtasks in sequence (e.g., open microwave, move kettle, turn on burner, slide cabinet). Diffusion Policy-C achieved 0.99 success on completing all four subtasks, compared to BET at 0.44 and IBC at 0.24.
Diffusion policy was validated on multiple real-world manipulation tasks across different robot platforms:
Using a top-down camera setup, diffusion policy learned to push a T-shaped block with 95% success rate across 20 trials. The policy exhibited remarkable robustness: it recovered from human-applied perturbations that displaced the block during execution and maintained performance even when the camera was deliberately occluded for 3 seconds. The policy also generalized to novel block starting positions not seen during training.
This 6-DOF task required the robot to flip a mug upright. Diffusion policy achieved 90% success (18 out of 20 trials) and notably exhibited emergent behaviors not present in any training demonstration, such as performing multiple corrective pushes to align the mug handle or executing re-grasps when the initial grasp was suboptimal.
These tasks involve scooping sauce with a ladle, pouring it onto a plate, and spreading it evenly. The sauce pouring task achieved 79% success with 0.74 IoU (human baseline: 0.79). The sauce spreading task achieved 100% success with 0.77 coverage (human: 0.79). In contrast, LSTM-GMM failed to lift the ladle in 15 out of 20 pouring trials.
Diffusion policy was also tested on bimanual tasks requiring coordinated two-arm control:
| Task | Demonstrations | Success rate |
|---|---|---|
| Egg beater | 210 | 55% |
| Mat unrolling | 162 | 75% |
| Shirt folding | 284 | 75% |
All bimanual tasks were trained without any task-specific hyperparameter adjustments, demonstrating the method's generality.
The original paper includes extensive ablation experiments that reveal important design principles:
Switching from velocity control to position control yielded a 15-25% performance improvement for diffusion policy. This improvement is attributed to two factors: position control has less pronounced multimodality (reaching a target position from the same start has fewer valid trajectories than reaching a target velocity), and position control has less compounding error when predicting sequences. Other baselines (LSTM-GMM, BET) showed 5-15% degradation with position control, suggesting this advantage is specific to the diffusion formulation.
The action execution horizon T_a controls the balance between temporal consistency and reactivity. Longer horizons produce smoother trajectories but reduce the robot's ability to react to disturbances. Shorter horizons increase reactivity but may lead to jerky movements or mode-switching. The authors found T_a = 8 to be optimal across most tasks.
An ablation on the Square task compared different visual encoder choices:
| Visual encoder | Training strategy | Success rate |
|---|---|---|
| ViT-B/16 (CLIP) | Fine-tuned | 98% |
| ResNet-34 | Fine-tuned | 94% |
| ResNet-18 | End-to-end from scratch | 94% |
| Frozen pretrained | Various | 40-70% |
Frozen pretrained encoders consistently underperformed, indicating that fine-tuning or end-to-end training of the visual backbone is important for diffusion policy.
Despite its strong performance, diffusion policy has several notable limitations:
The iterative denoising process requires multiple forward passes through the neural network at inference time. Even with DDIM acceleration (reducing from 100 to 10 steps), inference takes approximately 0.1 seconds per action chunk. This latency makes diffusion policy unsuitable for tasks requiring very high-frequency control (above approximately 10 Hz) or extremely fast reactions. Contact-rich or high-speed manipulation tasks may suffer from the delay between observation and action execution.
Diffusion policy has higher computational costs than simpler methods like LSTM-GMM, both during training and inference. Standard architectures involve millions of parameters, leading to high memory usage. Deploying on resource-constrained platforms such as mobile robots or drones remains challenging due to the computational and memory footprint.
While the ability to represent multimodal distributions is an advantage during training, it can cause issues during real-time execution. As the policy executes a trajectory, the mode should naturally collapse to a single consistent behavior. However, with limited conditional context, the policy may occasionally oscillate between different modes rather than committing to one, particularly at decision points.
Diffusion policy typically requires a substantial number of high-quality demonstrations (often 100-600+) to learn robust policies. This is more than some simpler methods require for basic tasks, although the quality and generalizability of the resulting policy tend to justify the increased data collection effort.
Diffusion policy has inspired a large body of follow-up research addressing its limitations and extending its capabilities.
3D Diffusion Policy (DP3), published at RSS 2024 by Ze et al., extends diffusion policy by incorporating 3D visual representations extracted from sparse point clouds using an efficient point encoder. DP3 handles most tasks with just 10 demonstrations and surpasses 2D baselines with a 24.2% relative improvement across 72 simulation tasks. In real-robot experiments, DP3 achieved an 85% success rate with only 40 demonstrations per task and demonstrated strong generalization across viewpoints, appearances, and spatial configurations. An improved version (iDP3) was developed for humanoid manipulation and presented at IROS 2025.
Consistency Policy, published at RSS 2024, uses the Consistency Trajectory Model (CTM) to distill a pre-trained diffusion policy into a model that generates action sequences in a single denoising step. This dramatically reduces inference time while maintaining comparable task performance, addressing the latency limitation.
One-Step Diffusion Policy applies diffusion distillation to compress the multi-step denoising process into a single forward pass, achieving inference speeds suitable for higher-frequency control loops without significant performance degradation.
FlowPolicy, presented as an oral paper at AAAI 2025, replaces the diffusion process with conditional consistency flow matching. By learning a straight-line flow that generates actions in a single step, FlowPolicy enables real-time inference while maintaining action quality comparable to multi-step diffusion policies.
DPPO, published at ICLR 2025, provides an algorithmic framework for fine-tuning pre-trained diffusion policies using reinforcement learning (specifically Proximal Policy Optimization). DPPO takes advantage of synergies between RL fine-tuning and the diffusion parameterization, enabling structured and on-manifold exploration with stable training. It outperforms alternative RL approaches (such as IDQL and DIPO) for fine-tuning diffusion-based policies on RoboMimic and other benchmarks.
UMI, developed by Cheng Chi and colleagues and a Best Systems Paper Finalist at RSS 2024, uses diffusion policy as its core policy learning component. UMI enables in-the-wild data collection using handheld grippers with GoPro cameras, with the resulting diffusion policies being hardware-agnostic and deployable across multiple robot platforms. UMI incorporates inference-time latency matching and relative-trajectory action representations for practical deployment.
RDT-1B is a 1.2 billion parameter diffusion foundation model for bimanual manipulation. It builds on diffusion policy principles at a much larger scale and can be fine-tuned to learn new skills from just 1-5 demonstrations, exhibiting zero-shot generalization to unseen objects and scenes.
Other notable extensions include:
| Extension | Key contribution | Venue |
|---|---|---|
| ManiCM | Accelerates 3D diffusion policy via consistency models | 2024 |
| Mamba Policy | Replaces backbone with Mamba selective state model | 2024 |
| Diffusion Transformer Policy | Adapts diffusion policy to Transformer-only architecture | 2024 |
| Reactive Diffusion Policy | Addresses real-time reactivity for contact-rich tasks | 2025 |
| D3P (Dynamic Denoising Diffusion Policy) | Adapts denoising schedule dynamically via RL | 2025 |
| FastDP | Optimizes on-device deployment for mobile platforms | 2025 |
Diffusion policy has seen broad adoption across the robotics research community and is increasingly used in industry:
The original implementation is available at the GitHub repository maintained by the authors at Columbia University (later Stanford University). The codebase provides both CNN-based and Transformer-based variants with support for state-based and image-based observations.
Diffusion policy is also integrated into Hugging Face's LeRobot framework, which aims to make robot learning accessible with end-to-end learning tools. LeRobot provides pre-trained diffusion policy models (e.g., for the Push-T task), standardized training pipelines, and integration with various robot hardware through a plugin system.
Several robotics companies have built upon or been influenced by diffusion policy. Physical Intelligence developed pi0, a vision-language-action flow model for general robot control that extends the core idea of using generative models for action prediction. In comparative evaluations, pi0 outperforms standard diffusion policy and other methods like ACT, OpenVLA, and Octo when fine-tuned on small datasets. Physical Intelligence raised $400M in late 2024 at a $2B valuation, reflecting the commercial potential of this line of research.
Octo, a 93M parameter generalist robot policy, uses diffusion action heads as its output representation, further demonstrating the adoption of diffusion-based action generation in foundation models for robotics.
As of 2025, the original diffusion policy paper has been cited extensively, and diffusion-based policies have become a default baseline in robot learning research. Surveys on diffusion models for robotic manipulation document dozens of follow-up methods and applications spanning grasping, locomotion, navigation, and multi-agent coordination.
Diffusion policy relates to several other lines of research in deep learning and embodied AI: