Diffusion model
Last reviewed
May 18, 2026
Sources
43 citations
Review status
Source-backed
Revision
v7 · 8,339 words
A diffusion model is a type of generative model that produces data by learning to reverse a gradual noising process. The model is trained on the principle that if you systematically add noise to a data sample until it becomes pure random noise, a neural network can learn to undo each step of that corruption. At generation time, the model starts from random noise and iteratively removes noise, step by step, until it produces a clean sample that resembles the training data.
Since 2020, diffusion models have become the dominant paradigm for image generation, overtaking generative adversarial networks (GANs) in both sample quality and diversity. They power the most widely used AI image generators, including Stable Diffusion, DALL-E 2 and 3, Imagen, and Midjourney. Beyond images, diffusion models have been extended to video, audio, 3D object generation, molecular design, protein structure prediction, robotic control, and even text generation.
The mathematical ideas underlying diffusion models predate their application to deep learning by several years. In 2005, Aapo Hyvärinen introduced score matching, a method for estimating the gradient of the log probability density of a distribution (called the "score function") without needing to compute an intractable normalizing constant. This technique allowed models to learn the shape of a probability distribution indirectly, by learning how the density changes at each point rather than computing the density itself.
In 2011, Pascal Vincent established a connection between score matching and denoising autoencoders. Vincent showed that training a denoising autoencoder is mathematically equivalent to performing score matching on a noise-perturbed version of the data distribution. This result, known as denoising score matching, later became one of the theoretical pillars of diffusion-based generative modeling.
The formal introduction of diffusion probabilistic models to machine learning came in 2015 with the paper "Deep Unsupervised Learning using Nonequilibrium Thermodynamics" by Jascha Sohl-Dickstein, Eric Weiss, Niru Maheswaranathan, and Surya Ganguli, published at ICML. Drawing directly from non-equilibrium statistical physics, the authors proposed a framework in which structure in a data distribution is systematically destroyed through an iterative forward diffusion process modeled as a Markov chain that gradually adds Gaussian noise. A reverse diffusion process is then learned to restore structure, yielding a tractable generative model.
While the theoretical framework was sound, the generated image quality did not match GANs at the time, and the paper received relatively limited attention for several years. The key contribution was conceptual: demonstrating that the thermodynamic principle of reversible processes could be applied to generative modeling.
In 2019, Yang Song and Stefano Ermon proposed Noise Conditional Score Networks (NCSN), which estimated the score function at multiple noise levels and used Langevin dynamics to generate samples. Their paper, "Generative Modeling by Estimating Gradients of the Data Distribution," presented at NeurIPS 2019, showed that score-based generative modeling could produce competitive image samples. The approach worked by training a single network to predict the score function conditioned on different noise levels, then using annealed Langevin dynamics at generation time to progressively move from noisy to clean samples.
The modern era of diffusion models began with "Denoising Diffusion Probabilistic Models" (DDPM) by Jonathan Ho, Ajay Jain, and Pieter Abbeel, published at NeurIPS 2020. DDPM showed that diffusion models could generate images competitive with GANs while avoiding the training instability and mode collapse problems that plagued adversarial approaches.
The key insight in DDPM was a simplified training objective: instead of predicting the clean data directly, the network predicts the noise that was added at each step. This noise prediction objective proved both stable to train and effective at producing high-quality samples. DDPM achieved an FID score of 3.17 on CIFAR-10, which was state of the art for likelihood-based models at the time.
Progress accelerated rapidly after DDPM. In late 2020, Jiaming Song, Chenlin Meng, and Stefano Ermon introduced Denoising Diffusion Implicit Models (DDIM), which generalized the DDPM sampling process to non-Markovian forward processes, enabling deterministic sampling and producing high-quality images 10 to 50 times faster than DDPM.
In early 2021, Alex Nichol and Prafulla Dhariwal published "Improved Denoising Diffusion Probabilistic Models," introducing the cosine noise schedule and learned variance parameters that yielded better log-likelihood scores and sample quality.
A major theoretical advance came in 2021 when Yang Song, Jascha Sohl-Dickstein, Diederik Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole published "Score-Based Generative Modeling through Stochastic Differential Equations." This paper unified DDPM and score-based models into a single framework based on continuous-time stochastic differential equations (SDEs). The forward noising process is described by an SDE, and generation follows the corresponding reverse-time SDE. This unification proved that DDPM and score-based approaches are two perspectives on the same underlying mathematical structure.
Also in 2021, Prafulla Dhariwal and Alex Nichol at OpenAI published "Diffusion Models Beat GANs on Image Synthesis," introducing classifier guidance and architectural improvements that let diffusion models surpass GANs on ImageNet generation for the first time. This result marked a turning point for the field.
In 2022, Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer introduced Latent Diffusion Models (LDM) at CVPR, which run the diffusion process in the compressed latent space of a pretrained variational autoencoder (VAE) rather than on raw pixels. Compressing a 512x512x3 image to a 64x64x4 latent, for example, shrinks the tensor the denoiser must process by a factor of 48, which substantially reduces computational cost while maintaining high image quality. The approach became the foundation for Stable Diffusion.
Beginning in 2022, diffusion models powered a wave of text-to-image systems that brought generative AI to mainstream attention. DALL-E 2 (OpenAI, April 2022), Imagen (Google Brain, May 2022), and Stable Diffusion (Stability AI / CompVis / Runway, August 2022) all demonstrated the ability to generate photorealistic images from text descriptions. Stable Diffusion's open-source release had a particularly large impact, enabling a broad ecosystem of fine-tuned models, LoRA adapters, ControlNet extensions, and custom pipelines.
In 2023, William Peebles and Saining Xie introduced the Diffusion Transformer (DiT), which replaced the U-Net backbone with a transformer-based architecture operating on image patches. DiT demonstrated clear scaling laws: larger models with more compute consistently achieved lower FID scores. This architecture has since been adopted by Stable Diffusion 3, Sora, and FLUX.
Diffusion models rest on two complementary processes: a forward process that gradually adds noise to data, and a reverse process that learns to remove the noise.
Given a data point x_0 sampled from the real data distribution q(x_0), the forward process defines a Markov chain that adds Gaussian noise over T steps:
q(x_t | x_{t-1}) = N(x_t; sqrt(1 - beta_t) * x_{t-1}, beta_t * I)
Here, beta_1, beta_2, ..., beta_T is a noise schedule that controls how much noise is added at each step. As t increases, the sample becomes progressively noisier. After sufficiently many steps, x_T is approximately standard Gaussian noise, and all information about the original data point has been destroyed.
A useful property of this formulation is that x_t can be sampled directly at any timestep without iterating through all previous steps. Defining alpha_t = 1 - beta_t and alpha_bar_t as the cumulative product of alpha_1 through alpha_t:
x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * epsilon
where epsilon is drawn from N(0, I). This closed-form expression is essential for efficient training, since the model can be trained on randomly sampled timesteps rather than requiring sequential computation through all steps.
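The closed-form expression can be sketched in a few lines of NumPy. This is a minimal illustration, assuming the linear schedule of the original DDPM (beta from 0.0001 to 0.02 over T = 1000 steps); the function names `linear_alpha_bar` and `q_sample` are hypothetical and not taken from any particular library.

```python
import numpy as np

def linear_alpha_bar(T=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative products alpha_bar_1..alpha_bar_T for a linear beta schedule."""
    betas = np.linspace(beta_start, beta_end, T)
    return np.cumprod(1.0 - betas)

def q_sample(x0, t, alpha_bar, rng):
    """Draw x_t given x_0 in one shot; t is 1-indexed as in the text."""
    eps = rng.standard_normal(x0.shape)
    a = alpha_bar[t - 1]
    return np.sqrt(a) * x0 + np.sqrt(1.0 - a) * eps, eps

rng = np.random.default_rng(0)
alpha_bar = linear_alpha_bar()
x0 = np.ones(4)                      # stand-in for a data sample
x_t, eps = q_sample(x0, t=500, alpha_bar=alpha_bar, rng=rng)
```

Because `alpha_bar` is precomputed once, noising a sample to any timestep costs a single multiply-add, which is what makes training on randomly chosen timesteps cheap.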
The reverse process starts from Gaussian noise x_T and iteratively denoises to recover a data sample. The true reverse conditional q(x_{t-1} | x_t) is intractable in general, but when conditioned on the original data point x_0, the posterior q(x_{t-1} | x_t, x_0) is Gaussian and can be computed in closed form.
A neural network with parameters theta is trained to approximate the reverse transitions:
p_theta(x_{t-1} | x_t) = N(x_{t-1}; mu_theta(x_t, t), sigma_t^2 * I)
In the DDPM formulation, the mean mu_theta is parameterized in terms of a noise prediction network epsilon_theta(x_t, t), which estimates the noise that was added to produce x_t from x_0. Given the predicted noise, the model can compute an estimate of x_0 and then derive the reverse step mean.
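The two steps in the last sentence can be made concrete. The sketch below assumes the linear DDPM schedule and uses hypothetical helper names; it recovers an x_0 estimate from a noise prediction and shows that the reverse-step mean written in terms of the noise agrees with the mean of the closed-form posterior q(x_{t-1} | x_t, x_0) evaluated at that estimate.

```python
import numpy as np

betas = np.linspace(1e-4, 0.02, 1000)      # linear schedule, as in DDPM
alphas = 1.0 - betas
alpha_bar = np.cumprod(alphas)

def predict_x0(x_t, eps_hat, t):
    """Invert x_t = sqrt(alpha_bar_t) x_0 + sqrt(1 - alpha_bar_t) eps for x_0."""
    a = alpha_bar[t - 1]
    return (x_t - np.sqrt(1.0 - a) * eps_hat) / np.sqrt(a)

def reverse_mean(x_t, eps_hat, t):
    """Reverse-step mean written directly in terms of the predicted noise."""
    a = alpha_bar[t - 1]
    return (x_t - betas[t - 1] / np.sqrt(1.0 - a) * eps_hat) / np.sqrt(alphas[t - 1])

def posterior_mean(x_t, x0, t):
    """Mean of the closed-form posterior q(x_{t-1} | x_t, x_0), for t >= 2."""
    a, a_prev = alpha_bar[t - 1], alpha_bar[t - 2]
    coef_x0 = np.sqrt(a_prev) * betas[t - 1] / (1.0 - a)
    coef_xt = np.sqrt(alphas[t - 1]) * (1.0 - a_prev) / (1.0 - a)
    return coef_x0 * x0 + coef_xt * x_t

rng = np.random.default_rng(0)
x0, eps, t = rng.standard_normal(3), rng.standard_normal(3), 300
x_t = np.sqrt(alpha_bar[t - 1]) * x0 + np.sqrt(1.0 - alpha_bar[t - 1]) * eps
x0_hat = predict_x0(x_t, eps, t)           # using the true noise as a perfect prediction
```

In a trained model `eps` would be replaced by the network output epsilon_theta(x_t, t), so `x0_hat` would only approximate the clean sample.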
The standard DDPM training objective is a simplified form of the variational lower bound (VLB) on the data log-likelihood:
L_simple = E[|| epsilon - epsilon_theta(x_t, t) ||^2]
The expectation is over t sampled uniformly from {1, ..., T}, x_0 sampled from the training data, and epsilon sampled from N(0, I). In practice, each training step involves: (1) selecting a random training sample, (2) selecting a random timestep, (3) adding the corresponding amount of noise using the closed-form expression, and (4) training the network to predict the noise that was added.
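Ho et al. found that this simplified mean squared error loss on noise prediction produced better sample quality than the full variational bound. Their explanation is that, relative to the variational bound, the simplified loss down-weights the terms at small t (very low noise levels), letting the network focus on the harder denoising tasks at larger t.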
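The four training steps listed above can be sketched as a single loss evaluation. This is an illustrative outline only: `eps_model` is a hypothetical placeholder that stands in for the neural network, the gradient update is omitted, and the loss is averaged over dimensions rather than summed, which only rescales it.

```python
import numpy as np

T = 1000
alpha_bar = np.cumprod(1.0 - np.linspace(1e-4, 0.02, T))

def eps_model(x_t, t):
    """Placeholder for the network epsilon_theta; here it always predicts zero noise."""
    return np.zeros_like(x_t)

def simple_loss(dataset, rng):
    x0 = dataset[rng.integers(len(dataset))]          # (1) random training sample
    t = int(rng.integers(1, T + 1))                   # (2) random timestep in {1..T}
    eps = rng.standard_normal(x0.shape)
    a = alpha_bar[t - 1]
    x_t = np.sqrt(a) * x0 + np.sqrt(1.0 - a) * eps    # (3) closed-form noising
    return float(np.mean((eps - eps_model(x_t, t)) ** 2))  # (4) MSE on the noise

rng = np.random.default_rng(0)
dataset = rng.standard_normal((16, 8))                # toy dataset: 16 samples, 8 dims
loss = simple_loss(dataset, rng)
```

In practice the same computation is run on a batch of samples with independent timesteps, and an optimizer updates the network parameters from the gradient of this loss.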
From the score matching viewpoint, the noise prediction network is closely related to the score function, defined as the gradient of the log probability density with respect to the data:
score(x) = gradient of log p(x) with respect to x
The noise prediction at timestep t is proportional to the score of the noisy data distribution at that noise level: score(x_t, t) is approximately -epsilon_theta(x_t, t) / sqrt(1 - alpha_bar_t). The score function tells the model which direction to "push" a noisy sample to move it toward higher-probability (cleaner) regions of the data distribution. Yang Song and Stefano Ermon's score-based framework directly estimates this score function and generates samples using Langevin dynamics.
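The link between noise and score can be checked numerically for the conditional distribution q(x_t | x_0), which is Gaussian and therefore has a score available in closed form. The sketch below uses an arbitrary illustrative value for alpha_bar_t; it covers only this conditional case, whereas a trained network approximates the score of the full noisy data distribution.

```python
import numpy as np

rng = np.random.default_rng(0)
alpha_bar_t = 0.3                      # illustrative value of alpha_bar at some timestep
x0 = rng.standard_normal(5)
eps = rng.standard_normal(5)
x_t = np.sqrt(alpha_bar_t) * x0 + np.sqrt(1.0 - alpha_bar_t) * eps

# Score of the Gaussian q(x_t | x_0) = N(sqrt(alpha_bar_t) x_0, (1 - alpha_bar_t) I).
score = -(x_t - np.sqrt(alpha_bar_t) * x0) / (1.0 - alpha_bar_t)

# The same quantity recovered from the noise alone.
score_from_eps = -eps / np.sqrt(1.0 - alpha_bar_t)
```

The negative sign reflects that the score points from the noisy sample back toward the clean data, opposite to the direction of the added noise.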
The SDE framework by Song et al. (2021) describes the forward process as a continuous-time stochastic differential equation:
dx = f(x, t) dt + g(t) dw
where f is the drift coefficient, g is the diffusion coefficient, and w is a standard Wiener process. The reverse-time SDE takes the form:
dx = [f(x, t) - g(t)^2 * score(x, t)] dt + g(t) dw_bar
where w_bar is a reverse-time Wiener process. This formulation allows the use of numerical SDE and ODE solvers for sampling, and it unifies DDPM and score-based models as different discretizations of the same continuous process.
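As a rough illustration of sampling with a numerical solver, the sketch below integrates the reverse-time SDE with the Euler-Maruyama method. It assumes a toy setup chosen so that the score is known exactly: one-dimensional Gaussian data, f(x, t) = -0.5 * beta * x, and g(t) = sqrt(beta) with constant beta. In a real model the `score` function would be a learned network, and all names here are hypothetical.

```python
import numpy as np

beta, mu, s0 = 1.0, 2.0, 0.5         # toy setup: 1-D data distributed as N(mu, s0^2)

def score(x, t):
    """Exact score of the noised marginal p_t, available only because the data is Gaussian."""
    decay = np.exp(-beta * t)
    mean = mu * np.sqrt(decay)
    var = s0**2 * decay + (1.0 - decay)
    return -(x - mean) / var

def reverse_sde_sample(n, t_end=10.0, steps=1000, seed=0):
    rng = np.random.default_rng(seed)
    dt = t_end / steps
    x = rng.standard_normal(n)                    # start from the N(0, 1) prior
    for i in range(steps):
        t = t_end - i * dt
        f = -0.5 * beta * x                       # drift of the forward SDE
        g2 = beta                                 # squared diffusion coefficient
        # Step backward in time: subtract the reverse drift, add fresh noise.
        x = x - (f - g2 * score(x, t)) * dt + np.sqrt(g2 * dt) * rng.standard_normal(n)
    return x

samples = reverse_sde_sample(5000)
print(samples.mean(), samples.std())              # should land near mu and s0
```

Dropping the noise term and halving the score coefficient would give the corresponding probability flow ODE, which is the deterministic counterpart that ODE solvers integrate.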
In 2022, Tero Karras, Miika Aittala, Timo Aila, and Samuli Laine of NVIDIA published "Elucidating the Design Space of Diffusion-Based Generative Models" (NeurIPS 2022), commonly referred to as the EDM framework.[27] The paper argued that the theory and practice of diffusion models had become "unnecessarily convoluted" and proposed a unified design space separating the choices of noise schedule, network preconditioning, training loss weighting, sampler, and noise distribution at training time. Using this framework, EDM achieved an FID of 1.79 on class-conditional CIFAR-10 and 1.97 unconditional, with only 35 network evaluations per image. The second-order Heun sampler and the sigma-based preconditioning introduced by EDM have been widely adopted in subsequent diffusion model implementations.
Beyond the SDE-based formulation, a family of closely related frameworks reformulates the generative problem as learning a continuous transport between a noise distribution and the data distribution.
These frameworks are mathematically equivalent to score-based diffusion in many practical settings but offer additional flexibility in choosing probability paths. Modern systems such as Stable Diffusion 3 and FLUX.1 adopt flow-matching-style training objectives in place of the Gaussian diffusion losses used by earlier latent diffusion models.
The noise schedule determines how quickly the forward process destroys the data signal. Common choices include:
| Schedule | Description | Notes |
|---|---|---|
| Linear | beta_t increases linearly from beta_1 to beta_T | Used in the original DDPM (beta_1 = 0.0001, beta_T = 0.02, T = 1000) |
| Cosine | alpha_bar_t follows a cosine curve | Proposed by Nichol and Dhariwal (2021); adds noise more gradually at early steps |
| Scaled linear | Linear schedule adapted for latent space | Common in latent diffusion models |
| Sigmoid | beta_t follows a sigmoid curve | Used in some continuous-time formulations |
The cosine schedule generally produces better results than the linear schedule because it preserves more signal in intermediate steps, where much of the perceptually meaningful structure is learned.
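The difference between the linear and cosine schedules is easiest to see by comparing alpha_bar_t, the fraction of signal variance that survives at step t. The sketch below assumes the DDPM linear endpoints from the table and the cosine form of Nichol and Dhariwal with a small offset s = 0.008; the clipping of large beta values used in practice is omitted, and the function names are hypothetical.

```python
import numpy as np

def linear_alpha_bar(T=1000, beta_1=1e-4, beta_T=0.02):
    """alpha_bar for a beta schedule that rises linearly from beta_1 to beta_T."""
    return np.cumprod(1.0 - np.linspace(beta_1, beta_T, T))

def cosine_alpha_bar(T=1000, s=0.008):
    """alpha_bar_t = f(t) / f(0) with f(t) = cos^2(((t / T + s) / (1 + s)) * pi / 2)."""
    t = np.arange(T + 1)
    f = np.cos(((t / T + s) / (1.0 + s)) * np.pi / 2.0) ** 2
    return (f / f[0])[1:]

lin_ab, cos_ab = linear_alpha_bar(), cosine_alpha_bar()
for t in (250, 500, 750):
    print(t, round(float(lin_ab[t - 1]), 3), round(float(cos_ab[t - 1]), 3))
```

At the midpoint of the chain the cosine schedule retains noticeably more signal than the linear one, which is the behavior the paragraph above describes.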
The denoising network can parameterize its output in several equivalent ways:
| Parameterization | Description | Typical use |
|---|---|---|
| Epsilon prediction | Network predicts the noise epsilon added during forward process | DDPM, Stable Diffusion 1.x/2.x |
| x_0 prediction | Network directly predicts the clean data x_0 | Some early models; useful for certain loss formulations |
| v prediction | Network predicts velocity v = sqrt(alpha_bar_t) * epsilon - sqrt(1 - alpha_bar_t) * x_0 | Progressive distillation, Stable Diffusion 2.x |
All three are mathematically interconvertible, but they have different numerical properties affecting training stability and sample quality at different noise levels.
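The interconversion can be verified directly. This minimal sketch, with hypothetical helper names and an arbitrary alpha_bar_t, builds the velocity target from x_0 and epsilon using the definition in the table, then recovers both quantities from x_t and v alone.

```python
import numpy as np

def v_from(x0, eps, alpha_bar_t):
    """Velocity target: v = sqrt(alpha_bar_t) * eps - sqrt(1 - alpha_bar_t) * x0."""
    return np.sqrt(alpha_bar_t) * eps - np.sqrt(1.0 - alpha_bar_t) * x0

def x0_eps_from_v(x_t, v, alpha_bar_t):
    """Recover (x0, eps) from the noisy sample and a velocity prediction."""
    a, s = np.sqrt(alpha_bar_t), np.sqrt(1.0 - alpha_bar_t)
    return a * x_t - s * v, s * x_t + a * v

rng = np.random.default_rng(0)
alpha_bar_t = 0.6
x0, eps = rng.standard_normal(4), rng.standard_normal(4)
x_t = np.sqrt(alpha_bar_t) * x0 + np.sqrt(1.0 - alpha_bar_t) * eps
x0_rec, eps_rec = x0_eps_from_v(x_t, v_from(x0, eps, alpha_bar_t), alpha_bar_t)
```

The recovery works because the two coefficients sqrt(alpha_bar_t) and sqrt(1 - alpha_bar_t) have squares that sum to one, so (x_t, v) is a rotation of (x_0, epsilon).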
Most diffusion models through 2023 used a U-Net architecture as the denoising network. Originally designed for biomedical image segmentation, the U-Net features an encoder-decoder structure with skip connections between corresponding encoder and decoder layers. In the diffusion context, the U-Net takes a noisy input x_t and a timestep t, then predicts the noise epsilon.
The architecture typically includes:
The encoder progressively downsamples spatial resolution while increasing channels, and the decoder upsamples back. This multi-scale structure lets the network capture both fine textures and global composition.
The Diffusion Transformer (DiT), introduced by William Peebles and Saining Xie in 2023 (ICCV), replaces the U-Net with a Vision Transformer (ViT)-style architecture operating on sequences of image patches. Timestep and class conditioning are incorporated through adaptive layer normalization (adaLN).
The key finding was that DiT models follow clear scaling laws: more compute (measured in GFLOPs) consistently yields lower FID scores. The largest model, DiT-XL/2, achieved an FID of 2.27 on class-conditional ImageNet 256x256 generation. The DiT architecture has since been adopted by Stable Diffusion 3, Sora, and FLUX, reflecting a broader shift toward transformer-based architectures across modalities.
Latent Diffusion Models (LDM), introduced by Rombach et al. (2022), run the diffusion process in the compressed latent space of a pretrained VAE rather than directly on pixels. A VAE encoder compresses an image (for example, 512x512x3 pixels) into a smaller latent representation (for example, 64x64x4), and the diffusion model operates on this compact representation. After generation, the VAE decoder converts the latent code back to pixel space.
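The size reduction implied by the example shapes is simple arithmetic, shown here as a short calculation. The shapes are the ones quoted above; other models and resolutions would give different ratios.

```python
import math

pixel_shape = (512, 512, 3)     # example image size from the text
latent_shape = (64, 64, 4)      # example latent size from the text

spatial_downsampling = pixel_shape[0] // latent_shape[0]
compression = math.prod(pixel_shape) / math.prod(latent_shape)
print(spatial_downsampling, compression)   # prints: 8 48.0
```

The denoiser therefore sees 48 times fewer values per sample than a pixel-space model at the same image resolution.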
This approach offers several benefits:
Latent diffusion became the foundation for Stable Diffusion and influenced the design of many subsequent systems.
The original DDPM sampling requires T = 1000 sequential denoising steps, making generation slow. Each step applies the learned reverse transition to produce a slightly cleaner sample.
Denoising Diffusion Implicit Models (DDIM), proposed by Jiaming Song, Chenlin Meng, and Stefano Ermon (2020), generalize DDPM by constructing non-Markovian forward processes that share the same training objective. DDIM sampling is deterministic given a fixed initial noise vector, enabling consistent image generation from the same latent code and meaningful interpolation in latent space. DDIM can use as few as 10 to 50 steps with relatively minor quality loss compared to 1000-step DDPM.
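A deterministic DDIM update can be sketched as follows. The example assumes the linear DDPM schedule, a 10-step subsequence of the 1000 training timesteps, and a toy "oracle" noise predictor for a dataset containing a single point, so that the result can be checked exactly; in a real sampler `oracle_eps` would be the trained network, and all names are hypothetical.

```python
import numpy as np

alpha_bar = np.cumprod(1.0 - np.linspace(1e-4, 0.02, 1000))

def ddim_step(x_t, eps_hat, t, t_prev):
    """One deterministic DDIM update (no added noise) from timestep t to t_prev < t."""
    a_t = alpha_bar[t - 1]
    a_prev = alpha_bar[t_prev - 1] if t_prev > 0 else 1.0
    x0_hat = (x_t - np.sqrt(1.0 - a_t) * eps_hat) / np.sqrt(a_t)
    return np.sqrt(a_prev) * x0_hat + np.sqrt(1.0 - a_prev) * eps_hat

def ddim_sample(x_T, eps_model, timesteps):
    x = x_T
    for t, t_prev in zip(timesteps, list(timesteps[1:]) + [0]):
        x = ddim_step(x, eps_model(x, t), t, t_prev)
    return x

# Toy "perfect" noise predictor for a dataset containing the single point x0_true.
x0_true = np.array([1.0, -2.0, 0.5])
def oracle_eps(x_t, t):
    a_t = alpha_bar[t - 1]
    return (x_t - np.sqrt(a_t) * x0_true) / np.sqrt(1.0 - a_t)

timesteps = [1000, 900, 800, 700, 600, 500, 400, 300, 200, 100]   # 10 of the 1000 steps
x_T = np.random.default_rng(0).standard_normal(3)
x_final = ddim_sample(x_T, oracle_eps, timesteps)
```

Since no random numbers are drawn inside the loop, the output depends only on `x_T`, which is the property that makes interpolation between initial noise vectors meaningful.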
DPM-Solver, introduced by Cheng Lu, Yuhao Zhou, Fan Bao, Jianfei Chen, Chongxuan Li, and Jun Zhu in 2022, applies high-order ODE solvers to the diffusion sampling process. By analytically computing parts of the solution and using higher-order numerical methods for the remainder, DPM-Solver can generate high-quality samples in 10 to 25 steps. DPM-Solver++ further improved results for guided sampling. These solvers are now among the most commonly used in practice.
Consistency models, introduced by Yang Song, Prafulla Dhariwal, Mark Chen, and Ilya Sutskever in 2023 (ICML), learn to map any point along the diffusion trajectory directly to the clean data point in a single step. This allows high-quality generation in one or very few steps. Consistency models can be trained either by distilling a pretrained diffusion model (consistency distillation) or from scratch (consistency training).
Improved Consistency Training (iCT), published in 2024, achieved FID scores of 2.51 on CIFAR-10 and 3.25 on ImageNet 64x64 in a single sampling step. Easy Consistency Tuning (ECT), published at ICLR 2025, achieved a 2-step FID of 2.73 on CIFAR-10 within one hour on a single A100 GPU, matching performance that previously required hundreds of GPU hours.
Rectified flow, introduced by Xingchao Liu, Chengyue Gong, and Qiang Liu in "Flow Straight and Fast: Learning to Generate and Transfer Data with Rectified Flow" (ICLR 2023 Spotlight), learns a transport map between noise and data along straight paths rather than the curved trajectories of standard diffusion.[25] Straighter paths require fewer discretization steps, making generation more efficient. The Stable Diffusion 3 paper by Esser et al. (2024) and the FLUX.1 model family from Black Forest Labs both adopt rectified flow as their formulation. When the noise distribution is Gaussian, flow matching and Gaussian diffusion can be shown to be equivalent up to the choice of noise schedule and output parameterization, though flow matching offers practical advantages in trajectory straightness and training simplicity.
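The straight-path idea can be illustrated with a single noise and data pair. The sketch below assumes the convention that t = 0 is noise and t = 1 is data (some implementations reverse it) and uses made-up values; a trained model would regress a velocity network onto `target_velocity` across many such pairs.

```python
import numpy as np

rng = np.random.default_rng(0)
x_noise = rng.standard_normal(4)             # sample from the noise distribution
x_data = np.array([0.5, -1.0, 2.0, 0.0])     # stand-in for a data sample

def interpolate(t):
    """Point on the straight path from noise (t = 0) to data (t = 1)."""
    return (1.0 - t) * x_noise + t * x_data

# Regression target for the velocity network at every t on this path.
target_velocity = x_data - x_noise

# Because the path is straight, one Euler step of size 1 with the true velocity lands on the data.
x_one_step = interpolate(0.0) + 1.0 * target_velocity
```

With a learned velocity field the paths are only approximately straight, so a few Euler steps are typically used rather than one.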
Latent Consistency Models, introduced by Simian Luo and colleagues in October 2023, distill consistency-model behavior directly in the latent space of a pretrained latent diffusion model.[26] Viewing the guided reverse process as solving an augmented probability flow ODE in latent space, LCMs predict the ODE solution directly, enabling 2 to 4 step 768x768 image synthesis. A high-quality LCM took only 32 A100 GPU hours to train. The follow-up LCM-LoRA (Luo et al., November 2023) treats consistency distillation as a LoRA adapter that can be plugged into pretrained Stable Diffusion checkpoints without modifying the base weights, acting as a "universal acceleration module" for the open-source diffusion ecosystem.
Dhariwal and Nichol (2021) introduced classifier guidance to improve conditional generation quality. A separate classifier is trained on noisy images, and its gradients steer the diffusion sampling process toward a desired class:
epsilon_guided = epsilon_theta(x_t, t) - s * sqrt(1 - alpha_bar_t) * gradient of log p(y | x_t) with respect to x_t
Higher values of the guidance scale s produce images more strongly associated with the target class but with reduced diversity. This approach requires training a separate classifier on noisy data, adding complexity.
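The guided noise prediction can be written as a one-line function. In this sketch the classifier gradient is supplied as a plain array with made-up values, since computing it would require a trained noisy-image classifier; following Dhariwal and Nichol's sampling algorithm, the gradient is scaled by sqrt(1 - alpha_bar_t) in addition to the guidance scale. The function name is hypothetical.

```python
import numpy as np

def classifier_guided_eps(eps_hat, grad_log_p_y, alpha_bar_t, scale):
    """Shift the predicted noise against the classifier gradient grad_x log p(y | x_t)."""
    return eps_hat - scale * np.sqrt(1.0 - alpha_bar_t) * grad_log_p_y

eps_hat = np.array([0.2, -0.1])        # noise predicted by the diffusion model
grad = np.array([1.0, 0.0])            # hypothetical classifier gradient at x_t
guided = classifier_guided_eps(eps_hat, grad, alpha_bar_t=0.5, scale=2.0)
```

Setting `scale` to zero returns the unguided prediction, and larger values move the sample further along the direction that raises the classifier's confidence in class y.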
Jonathan Ho and Tim Salimans proposed classifier-free guidance (CFG) in 2022, eliminating the need for a separate classifier. During training, the conditioning signal (for example, a text prompt) is randomly dropped for a fraction of examples, so the model learns both conditional and unconditional generation. At inference:
epsilon_guided = epsilon_unconditional + w * (epsilon_conditional - epsilon_unconditional)
When w = 1, this is standard conditional generation. Values of w > 1 amplify the influence of the conditioning signal, producing outputs more closely aligned with the input at the cost of reduced diversity. CFG has become the standard conditioning approach in virtually all modern text-to-image diffusion systems. Typical guidance scale values range from 7 to 15.
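The guidance formula amounts to a linear extrapolation between two network outputs. The sketch below uses made-up two-dimensional predictions to show the effect of the guidance weight; in a real sampler both predictions come from the same network, evaluated once with and once without the conditioning signal.

```python
import numpy as np

def cfg(eps_uncond, eps_cond, w):
    """Classifier-free guidance: extrapolate from the unconditional toward the conditional."""
    return eps_uncond + w * (eps_cond - eps_uncond)

eps_uncond = np.array([0.0, 1.0])
eps_cond = np.array([1.0, 1.0])
print(cfg(eps_uncond, eps_cond, 1.0))   # w = 1 returns the conditional prediction
print(cfg(eps_uncond, eps_cond, 7.5))   # w > 1 extrapolates past it
```

Only the components where the two predictions disagree are amplified, which is why guidance strengthens prompt-specific features while leaving shared structure unchanged.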
In text-conditioned diffusion models, the text prompt is first encoded into a sequence of embedding vectors using a text encoder such as CLIP or T5. These embeddings are injected into the denoising network through cross-attention layers, where image features act as queries (Q) and text embeddings provide keys (K) and values (V). This allows every spatial position in the image to attend to relevant parts of the text, enabling fine-grained alignment between the generated image and the prompt.
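A single-head version of this cross-attention step can be sketched in NumPy. The dimensions and random projection matrices below are arbitrary placeholders chosen for illustration, not the sizes used by any particular model, and real implementations use multiple heads and learned weights.

```python
import numpy as np

def cross_attention(image_feats, text_emb, W_q, W_k, W_v):
    Q = image_feats @ W_q                    # queries from image positions
    K = text_emb @ W_k                       # keys from text tokens
    V = text_emb @ W_v                       # values from text tokens
    logits = Q @ K.T / np.sqrt(Q.shape[-1])
    logits -= logits.max(axis=-1, keepdims=True)     # numerical stability
    weights = np.exp(logits)
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over text tokens
    return weights @ V, weights

rng = np.random.default_rng(0)
image_feats = rng.standard_normal((64, 32))    # e.g. an 8x8 feature map with 32 channels
text_emb = rng.standard_normal((77, 48))       # e.g. 77 tokens with 48-dim embeddings
W_q = rng.standard_normal((32, 16))
W_k = rng.standard_normal((48, 16))
W_v = rng.standard_normal((48, 16))
out, weights = cross_attention(image_feats, text_emb, W_q, W_k, W_v)
```

Each row of `weights` is a distribution over text tokens for one spatial position, so `out` holds a text-informed feature vector per position.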
Different systems use different text encoders:
| System | Text encoder(s) |
|---|---|
| Stable Diffusion 1.x | CLIP ViT-L/14 |
| Stable Diffusion 2.x | OpenCLIP ViT-H/14 |
| Stable Diffusion XL | CLIP ViT-L + OpenCLIP ViT-bigG |
| Stable Diffusion 3 | Two CLIP models + T5-XXL |
| DALL-E 2 | CLIP |
| Imagen | T5-XXL (11B parameters) |
| FLUX.1 | CLIP ViT-L + T5-XXL |
In classifier-free guidance, the unconditional prediction can be replaced with a prediction conditioned on a negative prompt describing attributes the user wants to avoid:
epsilon_guided = epsilon_negative + w * (epsilon_positive - epsilon_negative)
This allows users to steer generation away from undesired features (for example, "blurry, low quality, distorted hands") while amplifying desired attributes.
| System | Organization | Year | Architecture | Key features |
|---|---|---|---|---|
| DALL-E 2 | OpenAI | 2022 | CLIP prior + cascaded diffusion (unCLIP) | Text-to-image, inpainting, image variations |
| Imagen | Google Brain | 2022 | T5-XXL + cascaded U-Net diffusion | Text-to-image at 1024x1024; showed scaling text encoder matters most |
| Stable Diffusion 1.5 | Stability AI / CompVis / Runway | 2022 | Latent diffusion, U-Net, CLIP, VAE | Open source; text-to-image, inpainting, img2img |
| Midjourney v4 | Midjourney, Inc. | 2022 | Proprietary diffusion model | Text-to-image via Discord |
| Stable Diffusion XL | Stability AI | 2023 | Larger U-Net, dual CLIP encoders | 1024x1024 native resolution |
| DALL-E 3 | OpenAI | 2023 | Improved diffusion + recaptioning pipeline | Strong text rendering and prompt following |
| Midjourney v6 | Midjourney, Inc. | 2023 | Third-generation model | Improved photorealism, text rendering |
| Stable Diffusion 3 | Stability AI | 2024 | Multimodal DiT (MMDiT) + rectified flow | Three text encoders, improved text rendering |
| FLUX.1 | Black Forest Labs | 2024 | 12B-parameter rectified flow transformer | Pro, Dev, and Schnell variants |
| Sora | OpenAI | 2024 | Diffusion transformer on spacetime patches | Text-to-video up to 1 minute at 1080p |
| HunyuanVideo | Tencent | 2024 | DiT + 3D causal VAE + MLLM text encoder | 13B parameters; largest open-weights video model at release |
| Movie Gen | Meta | 2024 | 30B DiT, 73K-token context | Joint video + synchronized audio + editing + personalization |
| Midjourney V7 | Midjourney, Inc. | 2025 | New architecture (proprietary) | Draft mode, improved coherence |
| Imagen 3 / Imagen 4 | Google DeepMind | 2024 / 2025 | Latent diffusion (details not public) | Production text-to-image on Vertex AI and consumer apps |
| Veo 3 | Google DeepMind | 2025 | Diffusion video model with joint audio | Natively synchronized dialogue, SFX, and ambient audio |
| Sora 2 | OpenAI | 2025 | Updated diffusion video model | Improved physics and synchronized audio |
| FLUX.2 | Black Forest Labs | 2025 | Rectified flow transformer + Mistral-3 24B VLM | Unified generation and editing |
DALL-E 2, released by OpenAI in April 2022, uses an approach called unCLIP. It consists of a CLIP text encoder, a prior model that maps CLIP text embeddings to CLIP image embeddings, and a diffusion decoder that generates images conditioned on the image embedding. Two cascaded super-resolution diffusion models upsample the output from 64x64 to 256x256 and then to 1024x1024. The paper, "Hierarchical Text-Conditional Image Generation with CLIP Latents," was authored by Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen.
Imagen, introduced by Chitwan Saharia and colleagues at Google Brain in May 2022, demonstrated that scaling the text encoder (a frozen T5-XXL with 11 billion parameters) improved image quality and text alignment more effectively than scaling the diffusion model itself. Imagen uses a cascade of three diffusion models: a base model generating 64x64 images, and two super-resolution models upsampling to 256x256 and 1024x1024.
Stable Diffusion, first released in August 2022, is the most widely used open-source diffusion model. Built on the latent diffusion architecture, it operates in the latent space of a VAE using a U-Net (versions 1.x through XL) or a Diffusion Transformer (version 3 onward) as the denoising backbone.
Stable Diffusion's open-source release enabled a vast ecosystem of fine-tuned models, LoRA adapters, ControlNet extensions, and custom pipelines. Stable Diffusion 3, released in June 2024, replaced the U-Net with a Multimodal Diffusion Transformer (MMDiT) and adopted rectified flow training, as described by Esser et al. in "Scaling Rectified Flow Transformers for High-Resolution Image Synthesis."[43] Stable Diffusion 3.5 followed in late 2024.
Sora, OpenAI's text-to-video model, was first previewed in February 2024 and released publicly in December 2024. It generates video by denoising spacetime patches in a latent space using a diffusion transformer architecture. Sora can produce up to one minute of 1080p video with coherent motion and scene consistency. The technical report, "Video Generation Models as World Simulators," describes its approach to jointly modeling spatial and temporal dimensions.
The most prominent application of diffusion models is generating images from text descriptions. Modern systems handle complex multi-object scenes, specific art styles, photorealistic rendering, and even legible text within images. All major commercial systems (DALL-E, Stable Diffusion, Midjourney, Imagen, FLUX) operate in this mode.
Diffusion models can transform existing images using text prompts. The SDEdit technique starts with a partially noised version of the input image (rather than pure noise) and denoises it according to a new prompt. The amount of initial noise controls the balance between preserving the original image and following the new instruction. This enables style transfer, content modification, and creative editing. InstructPix2Pix (Brooks et al., 2023) further extended this by training a diffusion model to follow explicit editing instructions.
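The role of the initial noise level in SDEdit-style editing can be sketched as follows. The example assumes the linear DDPM schedule and a hypothetical `strength` parameter in (0, 1] that selects the starting timestep; the subsequent denoising loop, which would run the ordinary sampler from that timestep under the new prompt, is omitted.

```python
import numpy as np

T = 1000
alpha_bar = np.cumprod(1.0 - np.linspace(1e-4, 0.02, T))

def sdedit_start(x_input, strength, rng):
    """Noise the input up to an intermediate timestep chosen by `strength` in (0, 1]."""
    t_start = max(1, int(round(strength * T)))
    a = alpha_bar[t_start - 1]
    x_t = np.sqrt(a) * x_input + np.sqrt(1.0 - a) * rng.standard_normal(x_input.shape)
    return x_t, t_start

rng = np.random.default_rng(0)
x_input = np.ones(4)                                   # stand-in for an encoded input image
x_t_low, t_low = sdedit_start(x_input, strength=0.3, rng=rng)
x_t_high, t_high = sdedit_start(x_input, strength=0.8, rng=rng)
```

A low strength keeps most of the input signal, so the result stays close to the original image, while a high strength leaves the sampler freer to follow the new prompt.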
Inpainting fills in masked regions of an image guided by text and surrounding context. Outpainting extends images beyond their original boundaries. Both tasks use the diffusion model's ability to generate content that is contextually consistent with existing pixels.
Diffusion models can upsample low-resolution images while adding realistic high-frequency detail. This is used both as a standalone application and within cascaded generation pipelines, where a base model generates a small image that is progressively upsampled by specialized super-resolution diffusion models (as in Imagen and DALL-E 2).
Diffusion models have been extended to video, where the denoising process operates on sequences of frames or spacetime latent patches. The main challenge is maintaining temporal coherence, with consistent objects and smooth motion across frames.
Stable Video Diffusion (Blattmann et al., November 2023) extended the latent diffusion architecture to video by adding temporal layers and identifying three stages of training: text-to-image pretraining, video pretraining on a curated dataset, and high-quality video finetuning.[30] It became the first widely used open-weights image-to-video model.
Sora, previewed by OpenAI in February 2024 and released in December 2024, generates up to one minute of 1080p video by denoising spacetime patches in a learned latent space using a diffusion transformer. Sora 2, released in late 2025, extended these capabilities with improved physical plausibility and synchronized audio.
Veo, Veo 2, and Veo 3 are Google DeepMind's text-to-video diffusion models. Veo 3, announced at Google I/O in May 2025, was notable for natively generating synchronized audio (dialogue, sound effects, and ambient sound) jointly with video frames in the same diffusion process.
Movie Gen (Polyak et al., Meta, October 2024) introduced a 30 billion parameter diffusion transformer trained with a maximum context length of 73K video tokens, corresponding to roughly 16 seconds at 16 fps, and a separate video-to-audio model.[31] The Movie Gen suite also includes models for video editing and personalization.
HunyuanVideo (Tencent, December 2024) is a model with over 13 billion parameters whose weights were released under an open license; it combines a DiT backbone, an MLLM-based text encoder, and a 3D causal VAE.[32] At release it was the largest open-weights video generation model. Other notable systems include CogVideoX (Zhipu AI), Wan 2.1 and Wan 2.5 (Alibaba), Kling (Kuaishou), and Runway Gen-3 and Gen-4; the Runway models are used in commercial film and advertising pipelines.
AudioLDM, introduced by Haohe Liu and colleagues in 2023 (ICML), applies the latent diffusion framework to audio. Using contrastive language-audio pretraining (CLAP) embeddings, it generates speech, sound effects, and music from text descriptions. Other notable audio diffusion systems include Riffusion (which generates music through spectrogram diffusion) and various diffusion-based text-to-speech systems.
OpenAI's Point-E (2022) generates 3D point clouds from text by first producing a synthetic 2D view using a text-to-image diffusion model, then converting it to a 3D point cloud using a second diffusion model. Shap-E (2023) improved on this by generating implicit 3D representations (NeRF weights and signed distance functions) conditioned on text or images. TripoSR, developed by Stability AI and Tripo AI in 2024, uses a feed-forward transformer to produce 3D meshes from single images in under a second on an NVIDIA A100 GPU.
A separate line of work uses 2D image-diffusion priors to optimize 3D representations without 3D training data:
Hunyuan 3D (Tencent, 2024 to 2025) is an example of a more recent open-weights image-to-3D diffusion system that operates directly on 3D shape latents rather than relying on 2D distillation.
Diffusion models have found significant applications in computational chemistry and drug design. DiffDock, introduced by Gabriele Corso, Hannes Stärk, Bowen Jing, Regina Barzilay, and Tommi Jaakkola (2022), frames molecular docking as a generative modeling problem, using diffusion over translations, rotations, and torsion angles to predict how small molecules bind to protein targets. DiffDock achieved 38.2% of ligand poses with RMSD below 2 angstroms on the PDBBind benchmark, outperforming traditional docking methods. DiffDock-L, released in February 2024, further improved performance and generalization.
Other diffusion-based molecular generation systems include PMDM for structure-based drug design and various models for generating novel molecular geometries with specified physicochemical properties.
In structural biology, AlphaFold 3 (published in Nature, 2024) incorporates a diffusion-based module for predicting the structures of protein complexes, ligand-protein interactions, and nucleic acid structures. Diffusion models for protein design can generate novel protein sequences and structures with desired functional properties, with applications in drug development, vaccine research, and enzyme engineering.
Diffusion Policy, introduced by Cheng Chi, Zhenjia Xu, Siyuan Feng, and colleagues at Columbia University in 2023, applies diffusion models to visuomotor policy learning for robots. Instead of generating images, the diffusion process generates sequences of robot actions conditioned on visual observations. On benchmarks spanning 15 robot manipulation tasks, Diffusion Policy outperformed prior methods by an average of 46.9%.
Research in this area has expanded rapidly: as of 2025, diffusion-based policies have been applied to dexterous manipulation, long-horizon planning, and multi-modal input integration (combining point clouds with natural language instructions). Flow-matching-based variants have also emerged, incorporating second-order dynamics for smoother trajectories.
A line of work allows users to teach a pretrained diffusion model a new visual concept (a specific person, object, or style) from a small number of reference images:
Several methods add spatial or reference-image control beyond text:
| Method | Description | Control input |
|---|---|---|
| ControlNet | Adds conditional control to pretrained diffusion models via a trainable copy of the encoder connected through zero-initialized convolution layers; introduced by Zhang, Rao, and Agrawala at ICCV 2023[37] | Edge maps, depth maps, pose skeletons, segmentation maps |
| IP-Adapter | Decoupled cross-attention for image prompts; 22M parameters, plugs into existing checkpoints; Ye et al. (Tencent AI Lab), 2023[38] | Reference images for style or content |
| T2I-Adapter | Lightweight spatial conditioning alternative | Sketch, color, depth inputs |
| LoRA | Low-rank adaptation fine-tuning | Custom concepts, styles, or subjects with minimal data |
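To make the LoRA row concrete, the following is a minimal NumPy sketch of a low-rank adapter wrapped around one frozen linear layer. The class name, the initialization scale, and the `alpha / rank` scaling are illustrative assumptions following the common LoRA convention, not the code of any particular diffusion toolkit.

```python
import numpy as np

class LoRALinear:
    """Frozen weight W plus a trainable low-rank update (alpha / rank) * B @ A."""

    def __init__(self, weight: np.ndarray, rank: int, alpha: float, rng):
        out_dim, in_dim = weight.shape
        self.weight = weight                             # frozen pretrained weight
        self.a = rng.normal(0.0, 0.01, (rank, in_dim))   # trainable down-projection
        self.b = np.zeros((out_dim, rank))               # trainable up-projection, zero at start
        self.scale = alpha / rank

    def __call__(self, x: np.ndarray) -> np.ndarray:
        base = x @ self.weight.T
        update = (x @ self.a.T) @ self.b.T               # low-rank path, never forms a full matrix
        return base + self.scale * update

    def num_trainable(self) -> int:
        return self.a.size + self.b.size
```

Because `b` starts at zero, the adapted layer initially reproduces the pretrained layer exactly. For a 768 by 768 weight and rank 4, the adapter trains 6,144 values instead of 589,824.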
While diffusion models originated in continuous data domains like images, a parallel line of work extends the diffusion framework to discrete text generation, creating a new category sometimes called diffusion language models (d-LLMs). Because tokens are discrete symbols rather than continuous vectors, applying diffusion to text requires either embedding tokens into a continuous space and adding Gaussian noise (continuous diffusion) or defining a forward process directly on discrete tokens, typically through progressive masking or transitions between vocabulary items (discrete diffusion).
D3PM (Jacob Austin, Daniel D. Johnson, Jonathan Ho, Daniel Tarlow, and Rianne van den Berg, NeurIPS 2021) generalized DDPM to discrete data by defining the forward process via transition matrices over a vocabulary.[39] Special choices of transition matrix (uniform, absorbing-state, nearest-neighbor in embedding space) recover or connect to existing approaches, including mask-based and autoregressive models.
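The transition-matrix view can be sketched in a few lines of NumPy. This is an illustration under simplifying assumptions (a constant per-step corruption rate `beta` and hypothetical function names), not the D3PM reference implementation.

```python
import numpy as np

def uniform_transition(vocab_size: int, beta: float) -> np.ndarray:
    """Keep a token with probability 1 - beta, otherwise resample it uniformly."""
    return (1.0 - beta) * np.eye(vocab_size) + beta * np.ones((vocab_size, vocab_size)) / vocab_size

def absorbing_transition(vocab_size: int, beta: float, mask_id: int) -> np.ndarray:
    """Keep a token with probability 1 - beta, otherwise replace it with mask_id.

    The mask token maps to itself with probability 1, so it is absorbing.
    """
    q = (1.0 - beta) * np.eye(vocab_size)
    q[:, mask_id] += beta
    return q

def forward_step(tokens: np.ndarray, q: np.ndarray, rng) -> np.ndarray:
    """Sample x_t from x_{t-1}: row q[tok] is the categorical distribution for tok."""
    return np.array([rng.choice(q.shape[0], p=q[tok]) for tok in tokens])

def cumulative_transition(q: np.ndarray, t: int) -> np.ndarray:
    """With a constant rate, the t-step transition matrix is the t-th matrix power."""
    return np.linalg.matrix_power(q, t)
```

Under the absorbing choice, the probability that a token is still unmasked after `t` steps is `(1 - beta) ** t`, which is how this family connects to mask-based models.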
SEDD (Aaron Lou, Chenlin Meng, Stefano Ermon, ICML 2024 Best Paper) introduced score entropy, a loss function that extends score matching to discrete spaces by modeling ratios of the data distribution's probabilities rather than the probabilities themselves.[40] SEDD reduced perplexity by 25 to 75 percent relative to prior discrete diffusion approaches and was competitive with similarly sized GPT-2 models while supporting controllable infilling without left-to-right constraints.
MDLM, introduced by Subham Sekhar Sahoo, Marianne Arriola, Yair Schiff, Aaron Gokaslan, Edgar Marroquin, Justin T. Chiu, Alexander Rush, and Volodymyr Kuleshov (NeurIPS 2024), showed that simple masked discrete diffusion is more effective than previously believed.[22] The model corrupts text by progressively masking tokens (a discrete analog of adding noise) and learns to predict the masked tokens conditioned on the remaining ones. MDLM demonstrated that with an effective training recipe and a simplified Rao-Blackwellized objective, masked diffusion models can approach autoregressive model quality on language benchmarks.
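The masking corruption and a masked-token training loss can be sketched as follows. The sentinel `MASK_ID`, the function names, and the `1 / mask_prob` weighting (one common choice for a linear masking schedule) are assumptions made for illustration; the exact objective in the paper is parameterized differently.

```python
import numpy as np

MASK_ID = -1  # hypothetical sentinel; real models reserve a vocabulary entry

def mask_tokens(tokens, mask_prob: float, rng):
    """Independently replace each token with MASK_ID with probability mask_prob."""
    tokens = np.asarray(tokens)
    is_masked = rng.random(tokens.shape) < mask_prob
    corrupted = np.where(is_masked, MASK_ID, tokens)
    return corrupted, is_masked

def masked_nll(log_probs: np.ndarray, targets: np.ndarray,
               is_masked: np.ndarray, mask_prob: float) -> float:
    """Negative log-likelihood summed over masked positions only.

    log_probs has shape (seq_len, vocab_size). The result is divided by
    mask_prob * seq_len, so lightly masked samples are up-weighted.
    """
    picked = log_probs[np.arange(len(targets)), targets]  # log p(true token) per position
    return float(-(picked * is_masked).sum() / (mask_prob * len(targets)))
```

During training, `mask_prob` would be drawn at random for each sequence, so the network sees every corruption level from nearly clean to fully masked.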
LLaDA (Large Language Diffusion Models, Nie et al., Renmin University of China, February 2025) is an 8 billion parameter masked-diffusion language model trained from scratch on 2.3 trillion tokens with 0.13 million H800 GPU hours, followed by supervised fine-tuning on 4.5 million pairs.[41] The authors report that it is competitive with LLaMA 3 8B on standard benchmarks. LLaDA was also notable for addressing the so-called "reversal curse" (the asymmetry of autoregressive models when prompted with reversed information), in part because masked diffusion conditions on bidirectional context. The follow-up LLaDA-V (2025) extends the framework with visual instruction tuning for multimodal use.
Mercury, developed by Inception Labs, is described in their technical report as the first commercial-scale diffusion LLM family.[23] Mercury Coder Mini and Mercury Coder Small achieve throughputs of 1,109 and 737 tokens per second respectively on NVIDIA H100 GPUs, up to 10 times faster than speed-optimized autoregressive models while maintaining comparable quality. On the Copilot Arena coding benchmark, Mercury Coder ranked second in quality and was the fastest model overall.
The speed advantage of d-LLMs comes from their ability to generate or refine multiple tokens in parallel, rather than sequentially as in autoregressive models. Mercury 2, announced in February 2026, achieves approximately 1,000 tokens per second output throughput with reasoning capabilities.
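The following toy sketch illustrates why parallel refinement needs fewer network calls than one-token-at-a-time decoding. The samplers used by Mercury and other commercial systems are not public, so everything here is an assumption: `toy_denoiser` is a hypothetical stand-in that always predicts the right token with a random confidence, and the loop reveals the most confident masked positions at each pass.

```python
import numpy as np

MASK_ID = -1  # hypothetical sentinel for a masked position; real tokens must be >= 0

def toy_denoiser(tokens: np.ndarray, target: np.ndarray, rng):
    """Hypothetical stand-in for a trained network: returns a prediction and a
    confidence score for every position in a single forward pass."""
    return target.copy(), rng.random(len(tokens))

def parallel_decode(target, tokens_per_step: int, rng):
    """Start fully masked; each pass commits the most confident masked positions."""
    target = np.asarray(target)
    tokens = np.full(len(target), MASK_ID)
    passes = 0
    while (tokens == MASK_ID).any():
        pred, conf = toy_denoiser(tokens, target, rng)
        passes += 1
        conf = np.where(tokens == MASK_ID, conf, -np.inf)    # ignore committed positions
        reveal = np.argsort(conf)[::-1][:tokens_per_step]    # top-k by confidence
        reveal = reveal[tokens[reveal] == MASK_ID]           # guard the final partial step
        tokens[reveal] = pred[reveal]
    return tokens, passes
```

With 16 tokens and 4 tokens committed per pass, the loop makes 4 forward passes, where an autoregressive decoder would make 16. Real models trade some of this saving for quality, because committing many tokens at once ignores dependencies between them.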
Gemini Diffusion, announced by Google DeepMind at Google I/O on May 20, 2025, is an experimental text-diffusion language model that generates content by iteratively refining noise into coherent text or code rather than predicting one token at a time. Google reported throughputs of roughly 1,000 to 2,000 tokens per second, several times faster than the company's then-fastest production Gemini model, with comparable coding and reasoning performance. The model was initially released as a wait-listed demo.
Block Diffusion (ICLR 2025, oral), from Cornell University researchers, introduces a semi-autoregressive approach that generates blocks of tokens from left to right while allowing diffusion-based unmasking within each block. This combines the sequential coherence of autoregressive generation with the parallelism of diffusion.
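The order in which positions are revealed under such a scheme can be sketched as a schedule. This is an illustration only: the function name is hypothetical, and the random order within each block stands in for whatever order a trained model would choose.

```python
import numpy as np

def block_unmask_schedule(seq_len: int, block_size: int, steps_per_block: int, rng):
    """Return a list of steps; each step lists the positions unmasked together.

    Blocks are visited left to right (the autoregressive part), while positions
    inside a block are revealed in parallel groups (the diffusion part).
    """
    schedule = []
    for start in range(0, seq_len, block_size):
        block = np.arange(start, min(start + block_size, seq_len))
        order = rng.permutation(block)                       # placeholder for model-chosen order
        for chunk in np.array_split(order, steps_per_block):
            if len(chunk):                                   # skip empty groups in short blocks
                schedule.append(sorted(chunk.tolist()))
    return schedule
```

Setting `block_size` to 1 reduces the schedule to strict left-to-right decoding, and setting it to `seq_len` gives fully parallel unmasking, so the block size interpolates between the two regimes.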
Progressive distillation trains a student model to match the output of two teacher steps in a single step, repeatedly halving the number of required steps. After several rounds, the student can generate high-quality images in 4 to 8 steps.
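The halving arithmetic is easy to check with a short helper. The function name and the 1,024-step starting point in the usage note are assumptions chosen for illustration.

```python
def distillation_rounds(teacher_steps: int, target_steps: int) -> list[int]:
    """List the sampler step counts after each round of progressive distillation.

    Each round trains a student whose single step matches two teacher steps,
    so the step count halves until it reaches target_steps or fewer.
    """
    if teacher_steps < 1 or target_steps < 1:
        raise ValueError("step counts must be positive")
    schedule = [teacher_steps]
    while schedule[-1] > target_steps:
        schedule.append(max(schedule[-1] // 2, 1))
    return schedule
```

Starting from a 1,024-step teacher, `distillation_rounds(1024, 4)` lists nine step counts, meaning eight rounds of distillation are needed to reach a 4-step student.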
Adversarial distillation uses a GAN-like discriminator to train a few-step generator from a pretrained diffusion teacher. Notable examples include SDXL Turbo from Stability AI (single-step generation at 512x512) and SDXL-Lightning from ByteDance (high quality in 2 to 4 steps).
Distribution matching distillation minimizes the distributional distance between the teacher's multi-step output and the student's single-step output. This approach has been used to create fast variants of several production models.
As diffusion models have become production tools, methods originally developed for aligning language models have been adapted to fine-tune them on human preference data, aesthetic reward models, and prompt-following signals:
These methods have been used by commercial systems such as DALL-E 3 and Midjourney to improve prompt-following and aesthetic quality beyond what raw pretraining can achieve, although the exact recipes used are typically proprietary.
| Feature | Diffusion models | GANs | VAEs | Flow-based models |
|---|---|---|---|---|
| Training stability | Stable; single network trained with MSE loss | Unstable; requires balancing generator and discriminator | Stable; trained with ELBO | Stable; trained with exact log-likelihood |
| Sample quality | State of the art for images and video | High quality but prone to artifacts | Often blurry due to pixel-level reconstruction loss | Good but generally below diffusion and GANs |
| Sample diversity | High; good mode coverage | Susceptible to mode collapse | High diversity by design | High diversity |
| Generation speed | Slow (many iterative steps); accelerable with distillation | Fast (single forward pass) | Fast (single decoder pass) | Fast (single pass through invertible layers) |
| Likelihood estimation | Approximate (via variational bound) | Not available | Approximate (ELBO) | Exact (change of variables) |
| Conditioning | Flexible via CFG and cross-attention | Requires conditional architectures | Conditional VAE variants | Conditional flow variants |
Diffusion models have largely replaced GANs as the preferred approach for high-quality image generation. GANs remain useful for real-time applications, and GAN-style discriminators are sometimes used to distill diffusion models into faster few-step or single-step generators. VAEs continue to play a supporting role as the encoder-decoder framework in latent diffusion architectures.
Despite their strong performance, diffusion models have several known limitations: