A diffusion model is a type of generative model that produces data by learning to reverse a gradual noising process. The model is trained on the principle that if you systematically add noise to a data sample until it becomes pure random noise, a neural network can learn to undo each step of that corruption. At generation time, the model starts from random noise and iteratively removes noise, step by step, until it produces a clean sample that resembles the training data.
Since 2020, diffusion models have become the dominant paradigm for image generation, overtaking generative adversarial networks (GANs) in both sample quality and diversity. They power the most widely used AI image generators, including Stable Diffusion, DALL-E 2 and 3, Imagen, and Midjourney. Beyond images, diffusion models have been extended to video, audio, 3D object generation, molecular design, protein structure prediction, robotic control, and even text generation.
The mathematical ideas underlying diffusion models predate their application to deep learning by several years. In 2005, Aapo Hyvarinen introduced score matching, a method for estimating the gradient of the log probability density of a distribution (called the "score function") without needing to compute an intractable normalizing constant. This technique allowed models to learn the shape of a probability distribution indirectly, by learning how the density changes at each point rather than computing the density itself.
In 2011, Pascal Vincent established a connection between score matching and denoising autoencoders. Vincent showed that training a denoising autoencoder is mathematically equivalent to performing score matching on a noise-perturbed version of the data distribution. This result, known as denoising score matching, later became one of the theoretical pillars of diffusion-based generative modeling.
The formal introduction of diffusion probabilistic models to machine learning came in 2015 with the paper "Deep Unsupervised Learning using Nonequilibrium Thermodynamics" by Jascha Sohl-Dickstein, Eric Weiss, Niru Maheswaranathan, and Surya Ganguli, published at ICML. Drawing directly from non-equilibrium statistical physics, the authors proposed a framework in which structure in a data distribution is systematically destroyed through an iterative forward diffusion process modeled as a Markov chain that gradually adds Gaussian noise. A reverse diffusion process is then learned to restore structure, yielding a tractable generative model.
While the theoretical framework was sound, the generated image quality did not match GANs at the time, and the paper received relatively limited attention for several years. The key contribution was conceptual: demonstrating that the thermodynamic principle of reversible processes could be applied to generative modeling.
In 2019, Yang Song and Stefano Ermon proposed Noise Conditional Score Networks (NCSN), which estimated the score function at multiple noise levels and used Langevin dynamics to generate samples. Their paper, "Generative Modeling by Estimating Gradients of the Data Distribution," presented at NeurIPS 2019, showed that score-based generative modeling could produce competitive image samples. The approach worked by training a single network to predict the score function conditioned on different noise levels, then using annealed Langevin dynamics at generation time to progressively move from noisy to clean samples.
The modern era of diffusion models began with "Denoising Diffusion Probabilistic Models" (DDPM) by Jonathan Ho, Ajay Jain, and Pieter Abbeel, published at NeurIPS 2020. DDPM showed that diffusion models could generate images competitive with GANs while avoiding the training instability and mode collapse problems that plagued adversarial approaches.
The key insight in DDPM was a simplified training objective: instead of predicting the clean data directly, the network predicts the noise that was added at each step. This noise prediction objective proved both stable to train and effective at producing high-quality samples. DDPM achieved an FID score of 3.17 on CIFAR-10, which was state of the art for likelihood-based models at the time.
Progress accelerated rapidly after DDPM. In late 2020, Jiaming Song, Chenlin Meng, and Stefano Ermon introduced Denoising Diffusion Implicit Models (DDIM), which generalized the DDPM sampling process to non-Markovian forward processes, enabling deterministic sampling and producing high-quality images 10 to 50 times faster than DDPM.
In early 2021, Alex Nichol and Prafulla Dhariwal published "Improved Denoising Diffusion Probabilistic Models," introducing the cosine noise schedule and learned variance parameters that yielded better log-likelihood scores and sample quality.
A major theoretical advance came in 2021 when Yang Song, Jascha Sohl-Dickstein, Diederik Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole published "Score-Based Generative Modeling through Stochastic Differential Equations." This paper unified DDPM and score-based models into a single framework based on continuous-time stochastic differential equations (SDEs). The forward noising process is described by an SDE, and generation follows the corresponding reverse-time SDE. This unification proved that DDPM and score-based approaches are two perspectives on the same underlying mathematical structure.
Also in 2021, Prafulla Dhariwal and Alex Nichol at OpenAI published "Diffusion Models Beat GANs on Image Synthesis," introducing classifier guidance and architectural improvements that let diffusion models surpass GANs on ImageNet generation for the first time. This result marked a turning point for the field.
In 2022, Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Bjorn Ommer introduced Latent Diffusion Models (LDM) at CVPR, which run the diffusion process in the compressed latent space of a pretrained variational autoencoder (VAE) rather than on raw pixels. This approach reduced computational costs by roughly 48 times while maintaining high image quality, and it became the foundation for Stable Diffusion.
Beginning in 2022, diffusion models powered a wave of text-to-image systems that brought generative AI to mainstream attention. DALL-E 2 (OpenAI, April 2022), Imagen (Google Brain, May 2022), and Stable Diffusion (Stability AI / CompVis / Runway, August 2022) all demonstrated the ability to generate photorealistic images from text descriptions. Stable Diffusion's open-source release had a particularly large impact, enabling a broad ecosystem of fine-tuned models, LoRA adapters, ControlNet extensions, and custom pipelines.
In 2023, William Peebles and Saining Xie introduced the Diffusion Transformer (DiT), which replaced the U-Net backbone with a transformer-based architecture operating on image patches. DiT demonstrated clear scaling laws: larger models with more compute consistently achieved lower FID scores. This architecture has since been adopted by Stable Diffusion 3, Sora, and FLUX.
Diffusion models rest on two complementary processes: a forward process that gradually adds noise to data, and a reverse process that learns to remove the noise.
Given a data point x_0 sampled from the real data distribution q(x_0), the forward process defines a Markov chain that adds Gaussian noise over T steps:
q(x_t | x_{t-1}) = N(x_t; sqrt(1 - beta_t) * x_{t-1}, beta_t * I)
Here, beta_1, beta_2, ..., beta_T is a noise schedule that controls how much noise is added at each step. As t increases, the sample becomes progressively noisier. After sufficiently many steps, x_T is approximately standard Gaussian noise, and all information about the original data point has been destroyed.
A useful property of this formulation is that x_t can be sampled directly at any timestep without iterating through all previous steps. Defining alpha_t = 1 - beta_t and alpha_bar_t as the cumulative product of alpha_1 through alpha_t:
x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * epsilon
where epsilon is drawn from N(0, I). This closed-form expression is essential for efficient training, since the model can be trained on randomly sampled timesteps rather than requiring sequential computation through all steps.
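The closed-form expression above can be sketched in a few lines of NumPy. This is a minimal illustration, not a production implementation; the function names (`make_linear_schedule`, `q_sample`) are illustrative, and the schedule values are the ones the original DDPM paper uses:

```python
import numpy as np

def make_linear_schedule(T=1000, beta_1=1e-4, beta_T=0.02):
    """Linear beta schedule with the values from the original DDPM paper."""
    betas = np.linspace(beta_1, beta_T, T)
    alpha_bars = np.cumprod(1.0 - betas)  # cumulative product of alpha_t = 1 - beta_t
    return betas, alpha_bars

def q_sample(x0, t, alpha_bars, rng):
    """Sample x_t directly from x_0 via the closed form:
    x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps."""
    eps = rng.standard_normal(x0.shape)
    xt = np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * eps
    return xt, eps
```

Note that `alpha_bars` decreases monotonically from near 1 (almost no noise) to near 0 (almost pure noise), which is exactly the property that makes x_T approximately standard Gaussian.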
The reverse process starts from Gaussian noise x_T and iteratively denoises to recover a data sample. The true reverse conditional q(x_{t-1} | x_t) is intractable in general, but when conditioned on the original data point x_0, the posterior q(x_{t-1} | x_t, x_0) is Gaussian and can be computed in closed form.
A neural network with parameters theta is trained to approximate the reverse transitions:
p_theta(x_{t-1} | x_t) = N(x_{t-1}; mu_theta(x_t, t), sigma_t^2 * I)
In the DDPM formulation, the mean mu_theta is parameterized in terms of a noise prediction network epsilon_theta(x_t, t), which estimates the noise that was added to produce x_t from x_0. Given the predicted noise, the model can compute an estimate of x_0 and then derive the reverse step mean.
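The relationship between the noise prediction and the reverse-step mean can be written out explicitly. The following sketch (function name illustrative) implements the standard DDPM parameterization mu = (1 / sqrt(alpha_t)) * (x_t - beta_t / sqrt(1 - alpha_bar_t) * eps_theta):

```python
import numpy as np

def ddpm_reverse_mean(xt, eps_pred, t, betas, alpha_bars):
    """DDPM reverse-step mean derived from the noise prediction:
    mu = (1 / sqrt(alpha_t)) * (x_t - beta_t / sqrt(1 - alpha_bar_t) * eps_pred)."""
    alpha_t = 1.0 - betas[t]
    return (xt - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps_pred) / np.sqrt(alpha_t)
```

With a perfect noise prediction, this mean coincides algebraically with the closed-form posterior mean of q(x_{t-1} | x_t, x_0) mentioned above.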
The standard DDPM training objective is a simplified form of the variational lower bound (VLB) on the data log-likelihood:
L_simple = E[|| epsilon - epsilon_theta(x_t, t) ||^2]
The expectation is over t sampled uniformly from {1, ..., T}, x_0 sampled from the training data, and epsilon sampled from N(0, I). In practice, each training step involves: (1) selecting a random training sample, (2) selecting a random timestep, (3) adding the corresponding amount of noise using the closed-form expression, and (4) training the network to predict the noise that was added.
Ho et al. found that this simplified mean squared error loss on noise prediction produced better sample quality than the full variational bound, likely because it down-weights the loss terms at very low noise levels and focuses the network's capacity on the harder denoising tasks at higher noise levels.
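The four-step training loop described above fits in a short sketch. Here `predict_noise` is a stub standing in for the neural network epsilon_theta, and the helper names are illustrative:

```python
import numpy as np

def ddpm_training_step(x0_batch, alpha_bars, predict_noise, rng):
    """Compute L_simple for one batch. predict_noise(xt, t) is a stand-in
    for epsilon_theta; in practice it would be a trained network."""
    T = len(alpha_bars)
    loss = 0.0
    for x0 in x0_batch:                               # (1) training samples
        t = int(rng.integers(0, T))                   # (2) random timestep
        eps = rng.standard_normal(x0.shape)
        xt = np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1 - alpha_bars[t]) * eps  # (3) closed-form noising
        loss += np.mean((eps - predict_noise(xt, t)) ** 2)                   # (4) noise-prediction MSE
    return loss / len(x0_batch)
```

A real training step would then backpropagate this loss through the network; the sketch only computes the objective.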
From the score matching viewpoint, the noise prediction network is closely related to the score function, defined as the gradient of the log probability density with respect to the data:
score(x) = gradient of log p(x) with respect to x
The noise prediction at timestep t is proportional to the score of the noisy data distribution at that noise level. The score function tells the model which direction to "push" a noisy sample to move it toward higher-probability (cleaner) regions of the data distribution. Yang Song and Stefano Ermon's score-based framework directly estimates this score function and generates samples using Langevin dynamics.
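Concretely, the proportionality is score(x_t, t) = -epsilon_theta(x_t, t) / sqrt(1 - alpha_bar_t). A one-line sketch (function name illustrative), checked below against the analytic score of a Gaussian noisy marginal:

```python
import numpy as np

def score_from_eps(eps_pred, t, alpha_bars):
    """Score of the noisy marginal q(x_t), recovered from the noise
    prediction: score(x_t, t) = -eps_theta(x_t, t) / sqrt(1 - alpha_bar_t)."""
    return -eps_pred / np.sqrt(1.0 - alpha_bars[t])
```

For a single data point x_0, the noisy marginal is N(sqrt(alpha_bar_t) * x_0, (1 - alpha_bar_t) * I), whose exact score matches this formula.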
The SDE framework by Song et al. (2021) describes the forward process as a continuous-time stochastic differential equation:
dx = f(x, t) dt + g(t) dw
where f is the drift coefficient, g is the diffusion coefficient, and w is a standard Wiener process. The reverse-time SDE takes the form:
dx = [f(x, t) - g(t)^2 * score(x, t)] dt + g(t) dw_bar
where w_bar is a reverse-time Wiener process. This formulation allows the use of numerical SDE and ODE solvers for sampling, and it unifies DDPM and score-based models as different discretizations of the same continuous process.
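The simplest such numerical solver is the Euler-Maruyama method. The sketch below integrates the reverse-time SDE from t = T down to t = 0; the drift, diffusion, and score functions are passed in as arguments, and the function name is illustrative:

```python
import numpy as np

def reverse_sde_sample(score, f, g, x_T, T=1.0, n_steps=500, seed=0):
    """Euler-Maruyama integration of the reverse-time SDE
        dx = [f(x, t) - g(t)^2 * score(x, t)] dt + g(t) dw_bar
    stepping backward from t = T to t = 0."""
    rng = np.random.default_rng(seed)
    dt = T / n_steps
    x = np.array(x_T, dtype=float)
    for i in range(n_steps):
        t = T - i * dt                       # current time, moving backward
        drift = f(x, t) - g(t) ** 2 * score(x, t)
        # negative time step for the drift, plus fresh Gaussian noise
        x = x - drift * dt + g(t) * np.sqrt(dt) * rng.standard_normal(x.shape)
    return x
```

As a toy check, for the forward SDE dx = dw (f = 0, g = 1) starting from data concentrated at 0, the exact score is -x / t, and the reverse integration pulls a large initial sample back toward the data.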
The noise schedule determines how quickly the forward process destroys the data signal. Common choices include:
| Schedule | Description | Notes |
|---|---|---|
| Linear | beta_t increases linearly from beta_1 to beta_T | Used in the original DDPM (beta_1 = 0.0001, beta_T = 0.02, T = 1000) |
| Cosine | alpha_bar_t follows a cosine curve | Proposed by Nichol and Dhariwal (2021); adds noise more gradually at early steps |
| Scaled linear | Linear schedule adapted for latent space | Common in latent diffusion models |
| Sigmoid | beta_t follows a sigmoid curve | Used in some continuous-time formulations |
The cosine schedule generally produces better results than the linear schedule because it preserves more signal in intermediate steps, where much of the perceptually meaningful structure is learned.
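The difference between the two schedules is easy to see by computing alpha_bar_t directly. The sketch below follows the formulas from the DDPM and Improved DDPM papers, omitting the beta clipping that Nichol and Dhariwal apply in practice:

```python
import numpy as np

def linear_alpha_bar(T=1000, beta_1=1e-4, beta_T=0.02):
    """Cumulative signal level alpha_bar_t under the DDPM linear schedule."""
    return np.cumprod(1.0 - np.linspace(beta_1, beta_T, T))

def cosine_alpha_bar(T=1000, s=0.008):
    """Cosine schedule of Nichol and Dhariwal (2021), without beta clipping:
    alpha_bar(t) = f(t) / f(0), f(t) = cos^2(((t/T + s) / (1 + s)) * pi / 2)."""
    t = np.arange(1, T + 1)
    f = np.cos(((t / T + s) / (1 + s)) * np.pi / 2) ** 2
    f0 = np.cos((s / (1 + s)) * np.pi / 2) ** 2
    return f / f0
```

Halfway through the trajectory, the cosine schedule retains roughly half of the signal while the linear schedule has already destroyed most of it, which illustrates the "more gradual" behavior described above.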
The denoising network can parameterize its output in several equivalent ways:
| Parameterization | Description | Typical use |
|---|---|---|
| Epsilon prediction | Network predicts the noise epsilon added during forward process | DDPM, Stable Diffusion 1.x/2.x |
| x_0 prediction | Network directly predicts the clean data x_0 | Some early models; useful for certain loss formulations |
| v prediction | Network predicts velocity v = sqrt(alpha_bar_t) * epsilon - sqrt(1 - alpha_bar_t) * x_0 | Progressive distillation, Stable Diffusion 2.x |
All three are mathematically interconvertible, but they have different numerical properties affecting training stability and sample quality at different noise levels.
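The interconversions follow directly from the definitions x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * epsilon and v = sqrt(alpha_bar_t) * epsilon - sqrt(1 - alpha_bar_t) * x_0. A sketch with illustrative function names (`ab` abbreviates alpha_bar_t):

```python
import numpy as np

def eps_to_x0(xt, eps, ab):
    """Recover x_0 from a noise prediction, inverting the forward closed form."""
    return (xt - np.sqrt(1 - ab) * eps) / np.sqrt(ab)

def v_to_eps(xt, v, ab):
    """eps = sqrt(ab) * v + sqrt(1 - ab) * x_t, from the definition of v."""
    return np.sqrt(ab) * v + np.sqrt(1 - ab) * xt

def v_to_x0(xt, v, ab):
    """x_0 = sqrt(ab) * x_t - sqrt(1 - ab) * v, from the definition of v."""
    return np.sqrt(ab) * xt - np.sqrt(1 - ab) * v
```

Each identity can be verified by substituting the definitions and simplifying; the round trips are exact.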
Most diffusion models through 2023 used a U-Net architecture as the denoising network. Originally designed for biomedical image segmentation, the U-Net features an encoder-decoder structure with skip connections between corresponding encoder and decoder layers. In the diffusion context, the U-Net takes a noisy input x_t and a timestep t, then predicts the noise epsilon.
The architecture typically includes:

- Residual convolutional blocks at each resolution level
- Self-attention layers at the lower resolutions, where global context is cheap to compute
- Sinusoidal timestep embeddings injected into the residual blocks
- Skip connections carrying encoder features to the matching decoder layers

The encoder progressively downsamples spatial resolution while increasing channels, and the decoder upsamples back. This multi-scale structure lets the network capture both fine textures and global composition.
The Diffusion Transformer (DiT), introduced by William Peebles and Saining Xie in 2023 (ICCV), replaces the U-Net with a Vision Transformer (ViT)-style architecture operating on sequences of image patches. Timestep and class conditioning are incorporated through adaptive layer normalization (adaLN).
The key finding was that DiT models follow clear scaling laws: more compute (measured in GFLOPs) consistently yields lower FID scores. The largest model, DiT-XL/2, achieved an FID of 2.27 on class-conditional ImageNet 256x256 generation. The DiT architecture has since been adopted by Stable Diffusion 3, Sora, and FLUX, reflecting a broader shift toward transformer-based architectures across modalities.
Latent Diffusion Models (LDM), introduced by Rombach et al. (2022), run the diffusion process in the compressed latent space of a pretrained VAE rather than directly on pixels. A VAE encoder compresses an image (for example, 512x512x3 pixels) into a smaller latent representation (for example, 64x64x4), and the diffusion model operates on this compact representation. After generation, the VAE decoder converts the latent code back to pixel space.
This approach offers several benefits:

- Much lower compute and memory requirements, since the denoising network operates on far fewer values
- Faster training and sampling, making generation practical on consumer GPUs
- A separation of concerns: the VAE handles perceptual compression while the diffusion model focuses on semantic composition
Latent diffusion became the foundation for Stable Diffusion and influenced the design of many subsequent systems.
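Using the Stable Diffusion-style shapes mentioned above, the savings can be checked with simple arithmetic: the denoising network sees 48 times fewer values per sample, which is where the "roughly 48 times" cost reduction comes from:

```python
# Element counts before and after VAE encoding in a Stable Diffusion-style
# latent diffusion model: a 512x512 RGB image maps to a 64x64x4 latent.
pixel_elems = 512 * 512 * 3     # 786,432 values per image
latent_elems = 64 * 64 * 4      # 16,384 values per latent
ratio = pixel_elems // latent_elems
print(ratio)  # 48
```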
The original DDPM sampling requires T = 1000 sequential denoising steps, making generation slow. Each step applies the learned reverse transition to produce a slightly cleaner sample.
Denoising Diffusion Implicit Models (DDIM), proposed by Jiaming Song, Chenlin Meng, and Stefano Ermon (2020), generalize DDPM by constructing non-Markovian forward processes that share the same training objective. DDIM sampling is deterministic given a fixed initial noise vector, enabling consistent image generation from the same latent code and meaningful interpolation in latent space. DDIM can use as few as 10 to 50 steps with relatively minor quality loss compared to 1000-step DDPM.
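A single deterministic DDIM update (the eta = 0 case) first recovers the x_0 estimate implied by the noise prediction, then re-noises it to an earlier timestep, which may be many schedule steps away. A sketch with an illustrative function name, where `ab_t` and `ab_prev` are alpha_bar at the current and target timesteps:

```python
import numpy as np

def ddim_step(xt, eps_pred, ab_t, ab_prev):
    """One deterministic DDIM update (eta = 0): recover the implied x_0,
    then re-noise it to the earlier timestep using the same predicted noise."""
    x0_pred = (xt - np.sqrt(1 - ab_t) * eps_pred) / np.sqrt(ab_t)
    return np.sqrt(ab_prev) * x0_pred + np.sqrt(1 - ab_prev) * eps_pred
```

Because the step is deterministic, running it with the same initial noise always yields the same image, which is what enables latent-space interpolation.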
DPM-Solver, introduced by Cheng Lu, Yuhao Zhou, Fan Bao, Jianfei Chen, Chongxuan Li, and Jun Zhu in 2022, applies high-order ODE solvers to the diffusion sampling process. By analytically computing parts of the solution and using higher-order numerical methods for the remainder, DPM-Solver can generate high-quality samples in 10 to 25 steps. DPM-Solver++ further improved results for guided sampling. These solvers are now among the most commonly used in practice.
Consistency models, introduced by Yang Song, Prafulla Dhariwal, Mark Chen, and Ilya Sutskever in 2023 (ICML), learn to map any point along the diffusion trajectory directly to the clean data point in a single step. This allows high-quality generation in one or very few steps. Consistency models can be trained either by distilling a pretrained diffusion model (consistency distillation) or from scratch (consistency training).
Improved Consistency Training (iCT), published in 2024, achieved FID scores of 2.51 on CIFAR-10 and 3.25 on ImageNet 64x64 in a single sampling step. Easy Consistency Tuning (ECT), published at ICLR 2025, achieved a 2-step FID of 2.73 on CIFAR-10 within one hour on a single A100 GPU, matching performance that previously required hundreds of GPU hours.
Rectified flow learns a transport map between noise and data along straight paths rather than the curved trajectories of standard diffusion. Straighter paths require fewer discretization steps, making generation more efficient. Stable Diffusion 3 and FLUX both adopt rectified flow as their formulation. Recent research has shown that flow matching and Gaussian diffusion are mathematically equivalent, though flow matching offers practical advantages in trajectory straightness and training simplicity.
Dhariwal and Nichol (2021) introduced classifier guidance to improve conditional generation quality. A separate classifier is trained on noisy images, and its gradients steer the diffusion sampling process toward a desired class:
epsilon_guided = epsilon_theta(x_t, t) - s * sqrt(1 - alpha_bar_t) * (gradient of log p(y | x_t) with respect to x_t)
Higher values of the guidance scale s produce images more strongly associated with the target class but with reduced diversity. This approach requires training a separate classifier on noisy data, adding complexity.
Jonathan Ho and Tim Salimans proposed classifier-free guidance (CFG) in 2022, eliminating the need for a separate classifier. During training, the conditioning signal (for example, a text prompt) is randomly dropped for a fraction of examples, so the model learns both conditional and unconditional generation. At inference:
epsilon_guided = epsilon_unconditional + w * (epsilon_conditional - epsilon_unconditional)
When w = 1, this is standard conditional generation. Values of w > 1 amplify the influence of the conditioning signal, producing outputs more closely aligned with the input at the cost of reduced diversity. CFG has become the standard conditioning approach in virtually all modern text-to-image diffusion systems. Typical guidance scale values range from 7 to 15.
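The CFG combination is a single line of arithmetic applied to the two network outputs at every sampling step. A sketch (function name illustrative):

```python
import numpy as np

def cfg_combine(eps_uncond, eps_cond, w):
    """Classifier-free guidance: extrapolate from the unconditional noise
    prediction toward the conditional one by guidance scale w."""
    return eps_uncond + w * (eps_cond - eps_uncond)
```

The same function implements negative prompting: pass the negative-prompt prediction in place of the unconditional one.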
In text-conditioned diffusion models, the text prompt is first encoded into a sequence of embedding vectors using a text encoder such as CLIP or T5. These embeddings are injected into the denoising network through cross-attention layers, where image features act as queries (Q) and text embeddings provide keys (K) and values (V). This allows every spatial position in the image to attend to relevant parts of the text, enabling fine-grained alignment between the generated image and the prompt.
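The mechanism is ordinary attention with the Q/K/V roles split across modalities. This single-head sketch (names and shapes illustrative; real models use multiple heads and learned projections inside the U-Net or transformer blocks) shows the essential data flow:

```python
import numpy as np

def cross_attention(img_feats, txt_embs, Wq, Wk, Wv):
    """Single-head cross-attention: image features form the queries,
    text-token embeddings supply the keys and values."""
    Q = img_feats @ Wq                   # (n_pixels, d)
    K = txt_embs @ Wk                    # (n_tokens, d)
    V = txt_embs @ Wv                    # (n_tokens, d)
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    attn = np.exp(scores - scores.max(axis=-1, keepdims=True))
    attn /= attn.sum(axis=-1, keepdims=True)   # softmax over text tokens
    return attn @ V                      # each pixel is a mix of text values
```

Each spatial position ends up with a weighted combination of text-token information, which is how prompt content steers individual image regions.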
Different systems use different text encoders:
| System | Text encoder(s) |
|---|---|
| Stable Diffusion 1.x | CLIP ViT-L/14 |
| Stable Diffusion 2.x | OpenCLIP ViT-H/14 |
| Stable Diffusion XL | CLIP ViT-L + OpenCLIP ViT-bigG |
| Stable Diffusion 3 | Two CLIP models + T5-XXL |
| DALL-E 2 | CLIP |
| Imagen | T5-XXL (11B parameters) |
| FLUX.1 | T5-XXL |
In classifier-free guidance, the unconditional prediction can be replaced with a prediction conditioned on a negative prompt describing attributes the user wants to avoid:
epsilon_guided = epsilon_negative + w * (epsilon_positive - epsilon_negative)
This allows users to steer generation away from undesired features (for example, "blurry, low quality, distorted hands") while amplifying desired attributes.
| System | Organization | Year | Architecture | Key features |
|---|---|---|---|---|
| DALL-E 2 | OpenAI | 2022 | CLIP prior + cascaded diffusion (unCLIP) | Text-to-image, inpainting, image variations |
| Imagen | Google Brain | 2022 | T5-XXL + cascaded U-Net diffusion | Text-to-image at 1024x1024; showed that scaling the text encoder matters most |
| Stable Diffusion 1.5 | Stability AI / CompVis / Runway | 2022 | Latent diffusion, U-Net, CLIP, VAE | Open source; text-to-image, inpainting, img2img |
| Midjourney v4 | Midjourney, Inc. | 2022 | Proprietary diffusion model | Text-to-image via Discord |
| Stable Diffusion XL | Stability AI | 2023 | Larger U-Net, dual CLIP encoders | 1024x1024 native resolution |
| DALL-E 3 | OpenAI | 2023 | Improved diffusion + recaptioning pipeline | Strong text rendering and prompt following |
| Midjourney v6 | Midjourney, Inc. | 2023 | Third-generation model | Improved photorealism, text rendering |
| Stable Diffusion 3 | Stability AI | 2024 | Multimodal DiT (MMDiT) + rectified flow | Three text encoders, improved text rendering |
| FLUX.1 | Black Forest Labs | 2024 | 12B-parameter rectified flow transformer | Pro, Dev, and Schnell variants |
| Sora | OpenAI | 2024 | Diffusion transformer on spacetime patches | Text-to-video up to 1 minute at 1080p |
| Midjourney v7 | Midjourney, Inc. | 2025 | New architecture (proprietary) | Draft mode, improved coherence |
| FLUX.2 | Black Forest Labs | 2025 | Rectified flow transformer + Mistral-3 24B VLM | Unified generation and editing |
DALL-E 2, released by OpenAI in April 2022, uses an approach called unCLIP. It consists of a CLIP text encoder, a prior model that maps CLIP text embeddings to CLIP image embeddings, and a diffusion decoder that generates images conditioned on the image embedding. Two cascaded super-resolution diffusion models upsample the output from 64x64 to 256x256 and then to 1024x1024. The paper, "Hierarchical Text-Conditional Image Generation with CLIP Latents," was authored by Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen.
Imagen, introduced by Chitwan Saharia and colleagues at Google Brain in May 2022, demonstrated that scaling the text encoder (a frozen T5-XXL with 11 billion parameters) improved image quality and text alignment more effectively than scaling the diffusion model itself. Imagen uses a cascade of three diffusion models: a base model generating 64x64 images, and two super-resolution models upsampling to 256x256 and 1024x1024.
Stable Diffusion, first released in August 2022, is the most widely used open-source diffusion model. Built on the latent diffusion architecture, it operates in the latent space of a VAE using a U-Net (versions 1.x through XL) or a Diffusion Transformer (version 3 onward) as the denoising backbone.
Stable Diffusion's open-source release enabled a vast ecosystem of fine-tuned models, LoRA adapters, ControlNet extensions, and custom pipelines. Stable Diffusion 3, released in June 2024, replaced the U-Net with a Multimodal Diffusion Transformer (MMDiT) and adopted rectified flow.
Sora, OpenAI's text-to-video model, was first previewed in February 2024 and released publicly in December 2024. It generates video by denoising spacetime patches in a latent space using a diffusion transformer architecture. Sora can produce up to one minute of 1080p video with coherent motion and scene consistency. The technical report, "Video Generation Models as World Simulators," describes its approach to jointly modeling spatial and temporal dimensions.
The most prominent application of diffusion models is generating images from text descriptions. Modern systems handle complex multi-object scenes, specific art styles, photorealistic rendering, and even legible text within images. All major commercial systems (DALL-E, Stable Diffusion, Midjourney, Imagen, FLUX) operate in this mode.
Diffusion models can transform existing images using text prompts. The SDEdit technique starts with a partially noised version of the input image (rather than pure noise) and denoises it according to a new prompt. The amount of initial noise controls the balance between preserving the original image and following the new instruction. This enables style transfer, content modification, and creative editing. InstructPix2Pix (Brooks et al., 2023) further extended this by training a diffusion model to follow explicit editing instructions.
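The SDEdit initialization is just the forward closed form applied at an intermediate timestep chosen by a strength parameter. A sketch (function name and the exact strength-to-timestep mapping are illustrative; implementations vary in how they round and index):

```python
import numpy as np

def sdedit_start(x_input, strength, alpha_bars, rng):
    """SDEdit-style img2img initialization: noise the input image to
    timestep t_start ~ strength * T instead of starting from pure noise.
    strength in [0, 1]; higher keeps less of the original image."""
    T = len(alpha_bars)
    t_start = max(1, int(strength * T)) - 1
    eps = rng.standard_normal(x_input.shape)
    xt = np.sqrt(alpha_bars[t_start]) * x_input + np.sqrt(1 - alpha_bars[t_start]) * eps
    return xt, t_start
```

Denoising then proceeds from `t_start` down to 0 under the new prompt, so a low strength preserves the input's structure and a high strength largely regenerates it.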
Inpainting fills in masked regions of an image guided by text and surrounding context. Outpainting extends images beyond their original boundaries. Both tasks use the diffusion model's ability to generate content that is contextually consistent with existing pixels.
Diffusion models can upsample low-resolution images while adding realistic high-frequency detail. This is used both as a standalone application and within cascaded generation pipelines, where a base model generates a small image that is progressively upsampled by specialized super-resolution diffusion models (as in Imagen and DALL-E 2).
Diffusion models have been extended to video, where the denoising process operates on sequences of frames or spacetime latent patches. Sora (OpenAI), Runway Gen-2 and Gen-3, Stable Video Diffusion (Stability AI), and Kling (Kuaishou) are prominent examples. The main challenge is maintaining temporal coherence, with consistent objects and smooth motion across frames.
AudioLDM, introduced by Haohe Liu and colleagues in 2023 (ICML), applies the latent diffusion framework to audio. Using contrastive language-audio pretraining (CLAP) embeddings, it generates speech, sound effects, and music from text descriptions. Other notable audio diffusion systems include Riffusion (which generates music through spectrogram diffusion) and various diffusion-based text-to-speech systems.
OpenAI's Point-E (2022) generates 3D point clouds from text by first producing a synthetic 2D view using a text-to-image diffusion model, then converting it to a 3D point cloud using a second diffusion model. Shap-E (2023) improved on this by generating implicit 3D representations (NeRF weights and signed distance functions) conditioned on text or images. TripoSR, developed by Stability AI and Tripo AI in 2024, uses a feed-forward transformer to produce 3D meshes from single images in under a second on an NVIDIA A100 GPU.
Diffusion models have found significant applications in computational chemistry and drug design. DiffDock, introduced by Gabriele Corso, Hannes Stark, Bowen Jing, Regina Barzilay, and Tommi Jaakkola (2022), frames molecular docking as a generative modeling problem, using diffusion over translations, rotations, and torsion angles to predict how small molecules bind to protein targets. DiffDock achieved 38.2% of ligand poses with RMSD below 2 angstroms on the PDBBind benchmark, outperforming traditional docking methods. DiffDock-L, released in February 2024, further improved performance and generalization.
Other diffusion-based molecular generation systems include PMDM for structure-based drug design and various models for generating novel molecular geometries with specified physicochemical properties.
In structural biology, AlphaFold 3 (published in Nature, 2024) incorporates a diffusion-based module for predicting the structures of protein complexes, ligand-protein interactions, and nucleic acid structures. Diffusion models for protein design can generate novel protein sequences and structures with desired functional properties, with applications in drug development, vaccine research, and enzyme engineering.
Diffusion Policy, introduced by Cheng Chi, Zhenjia Xu, Siyuan Feng, and colleagues at Columbia University in 2023, applies diffusion models to visuomotor policy learning for robots. Instead of generating images, the diffusion process generates sequences of robot actions conditioned on visual observations. On benchmarks spanning 15 robot manipulation tasks, Diffusion Policy outperformed prior methods by an average of 46.9%.
Research in this area has expanded rapidly: as of 2025, diffusion-based policies have been applied to dexterous manipulation, long-horizon planning, and multi-modal input integration (combining point clouds with natural language instructions). Flow-matching-based variants have also emerged, incorporating second-order dynamics for smoother trajectories.
Several methods add spatial control beyond text:
| Method | Description | Control input |
|---|---|---|
| ControlNet | Adds conditional control to pretrained diffusion models | Edge maps, depth maps, pose skeletons, segmentation maps |
| IP-Adapter | Conditions generation on reference images | Reference images for style or content |
| T2I-Adapter | Lightweight spatial conditioning alternative | Sketch, color, depth inputs |
| LoRA | Low-rank adaptation fine-tuning | Custom concepts, styles, or subjects with minimal data |
While diffusion models originated in continuous data domains like images, recent work has extended the diffusion framework to discrete text generation, creating a new category called diffusion language models (d-LLMs).
MDLM, introduced by Subham Sekhar Sahoo, Marianne Arriola, Yair Schiff, Aaron Gokaslan, Edgar Marroquin, Justin T. Chiu, Alexander Rush, and Volodymyr Kuleshov in 2024, showed that simple masked discrete diffusion is more effective than previously believed. The model corrupts text by progressively masking tokens (a discrete analog of adding noise) and learns to predict the masked tokens conditioned on the remaining ones. MDLM demonstrated that with an effective training recipe and a simplified Rao-Blackwellized objective, masked diffusion models can approach autoregressive model quality on language benchmarks.
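The masking corruption itself is simple to sketch. The details below (independent per-token masking with probability equal to the noise level, and the `MASK` token id) are illustrative simplifications of the masked diffusion forward process, not MDLM's exact schedule:

```python
import numpy as np

MASK = -1  # hypothetical mask token id

def mask_corrupt(tokens, t, rng):
    """Forward corruption for masked discrete diffusion: each token is
    independently replaced by the mask token with probability t,
    the discrete analog of the continuous noise level."""
    tokens = np.asarray(tokens)
    out = tokens.copy()
    out[rng.random(tokens.shape) < t] = MASK
    return out
```

At t = 0 the sequence is clean; at t = 1 every token is masked, mirroring how x_T in continuous diffusion is pure noise. The reverse model learns to fill in masked positions given the rest.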
Mercury, developed by Inception Labs, is the first commercial-scale diffusion language model. Mercury Coder Mini and Mercury Coder Small achieve throughputs of 1,109 and 737 tokens per second respectively on NVIDIA H100 GPUs, outperforming speed-optimized autoregressive models by up to 10 times while maintaining comparable quality. On the Copilot Arena coding benchmark, Mercury Coder Mini tied for second place, surpassing GPT-4o Mini and Gemini-1.5-Flash.
The speed advantage of d-LLMs comes from their ability to generate multiple tokens in parallel, rather than sequentially as in autoregressive models. Mercury 2, announced in February 2026, achieves approximately 1,000 tokens per second output throughput with reasoning capabilities.
Block Diffusion (ICLR 2025, oral), proposed by researchers at Cornell University, introduces a semi-autoregressive approach that generates blocks of tokens from left to right while allowing diffusion-based unmasking within each block. This combines the sequential coherence of autoregressive generation with the parallelism of diffusion.
Progressive distillation trains a student model to match the output of two teacher steps in a single step, repeatedly halving the number of required steps. After several rounds, the student can generate high-quality images in 4 to 8 steps.
Adversarial distillation uses a GAN-like discriminator to train a few-step generator from a pretrained diffusion teacher. Notable examples include SDXL Turbo from Stability AI (single-step generation at 512x512) and SDXL Lightning from ByteDance (high quality in 2 to 4 steps).
Distribution matching distillation minimizes the distributional distance between the teacher's multi-step output and the student's single-step output. This approach has been used to create fast variants of several production models.
| Feature | Diffusion models | GANs | VAEs | Flow-based models |
|---|---|---|---|---|
| Training stability | Stable; single network trained with MSE loss | Unstable; requires balancing generator and discriminator | Stable; trained with ELBO | Stable; trained with exact log-likelihood |
| Sample quality | State of the art for images and video | High quality but prone to artifacts | Often blurry due to pixel-level reconstruction loss | Good but generally below diffusion and GANs |
| Sample diversity | High; good mode coverage | Susceptible to mode collapse | High diversity by design | High diversity |
| Generation speed | Slow (many iterative steps); accelerable with distillation | Fast (single forward pass) | Fast (single decoder pass) | Fast (single pass through invertible layers) |
| Likelihood estimation | Approximate (via variational bound) | Not available | Approximate (ELBO) | Exact (change of variables) |
| Conditioning | Flexible via CFG and cross-attention | Requires conditional architectures | Conditional VAE variants | Conditional flow variants |
Diffusion models have largely replaced GANs as the preferred approach for high-quality image generation. GANs remain useful for real-time applications and are sometimes used as discriminators or for distilling diffusion models into faster single-step generators. VAEs continue to play a supporting role as the encoder-decoder framework in latent diffusion architectures.
Despite their strong performance, diffusion models have several known limitations: