# Diffusion model

> Source: https://aiwiki.ai/wiki/diffusion_model
> Updated: 2026-07-11
> Categories: Computer Vision, Deep Learning, Generative AI, Machine Learning
> License: CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/)
> From AI Wiki (https://aiwiki.ai), the free encyclopedia of artificial intelligence. Reuse freely with attribution to "AI Wiki (aiwiki.ai)".

A **diffusion model** is a type of [generative model](/wiki/generative_model) that produces data by learning to reverse a gradual noising process. During training, Gaussian noise is added to a data sample step by step until only random noise remains, and a [neural network](/wiki/neural_network) learns to undo each step. At generation time, the model starts from random noise and iteratively denoises it until it produces a clean sample resembling the training data. Diffusion models were introduced to [machine learning](/wiki/machine_learning) in 2015 and became state of the art for image synthesis with the 2020 "Denoising Diffusion Probabilistic Models" (DDPM) paper by Jonathan Ho, Ajay Jain, and Pieter Abbeel, which reported an Inception score of 9.46 and an FID of 3.17 on unconditional CIFAR-10. [44]

The DDPM authors described the method as "a class of latent variable models inspired by considerations from nonequilibrium thermodynamics." [44] Since 2020, diffusion models have become the dominant paradigm for image generation, overtaking [generative adversarial networks](/wiki/gan) (GANs) in both sample quality and diversity. They power the most widely used AI image generators, including [Stable Diffusion](/wiki/stable_diffusion), [DALL-E](/wiki/dall-e) 2 and 3, [Imagen](/wiki/imagen), and [Midjourney](/wiki/midjourney). Stable Diffusion alone reached more than 10 million users within two months of its August 2022 open release. [48] Beyond images, diffusion models have been extended to video, audio, 3D object generation, molecular design, protein structure prediction, robotic control, and even text generation.

## What is a diffusion model in simple terms?

A diffusion model can be understood as a denoiser that has been trained at every level of corruption. During training, a clean image (or other data sample) is partially destroyed by adding a known amount of random noise, and the network learns to predict that noise. To generate a new sample, the trained network is handed pure noise and asked, repeatedly, "what noise would I remove here?", subtracting a little each step until a coherent image emerges. The two halves of the process are a fixed **forward process** that adds noise and a learned **reverse process** that removes it.

## When were diffusion models invented?

### Origins in statistical physics and score matching (2005 to 2014)

The mathematical ideas underlying diffusion models predate their application to [deep learning](/wiki/deep_learning) by several years. In 2005, Aapo Hyvärinen introduced **score matching**, a method for estimating the gradient of the log probability density of a distribution (called the "score function") without needing to compute an intractable normalizing constant. This technique allowed models to learn the shape of a probability distribution indirectly, by learning how the density changes at each point rather than computing the density itself.

In 2011, Pascal Vincent established a connection between score matching and [denoising](/wiki/denoising) autoencoders. Vincent showed that training a denoising [autoencoder](/wiki/autoencoder) is mathematically equivalent to performing score matching on a noise-perturbed version of the data distribution. This result, known as **denoising score matching**, later became one of the theoretical pillars of diffusion-based generative modeling.

### The first diffusion probabilistic model (2015)

The formal introduction of diffusion probabilistic models to [machine learning](/wiki/machine_learning) came in 2015 with the paper "Deep Unsupervised Learning using Nonequilibrium Thermodynamics" by Jascha Sohl-Dickstein, Eric Weiss, Niru Maheswaranathan, and Surya Ganguli, published at ICML. Drawing directly from non-equilibrium statistical physics, the authors proposed a framework in which structure in a data distribution is systematically destroyed through an iterative forward diffusion process modeled as a Markov chain that gradually adds Gaussian noise. A reverse diffusion process is then learned to restore structure, yielding a tractable generative model. As the paper put it, "The essential idea, inspired by non-equilibrium statistical physics, is to systematically and slowly destroy structure in a data distribution through an iterative forward diffusion process." [45]

While the theoretical framework was sound, the generated image quality did not match GANs at the time, and the paper received relatively limited attention for several years. The key contribution was conceptual: demonstrating that the thermodynamic principle of reversible processes could be applied to generative modeling.

### Score-based generative models (2019)

In 2019, Yang Song and Stefano Ermon proposed **Noise Conditional Score Networks (NCSN)**, which estimated the score function at multiple noise levels and used Langevin dynamics to generate samples. Their paper, "Generative Modeling by Estimating Gradients of the Data Distribution," presented at [NeurIPS](/wiki/neurips) 2019, showed that score-based generative modeling could produce competitive image samples. The approach worked by training a single network to predict the score function conditioned on different noise levels, then using annealed Langevin dynamics at generation time to progressively move from noisy to clean samples.

### The DDPM breakthrough (2020)

The modern era of diffusion models began with "Denoising Diffusion Probabilistic Models" (DDPM) by Jonathan Ho, Ajay Jain, and Pieter Abbeel, published at NeurIPS 2020. DDPM showed that diffusion models could generate images competitive with GANs while avoiding the training instability and mode collapse problems that plagued adversarial approaches.

The key insight in DDPM was a simplified training objective: instead of predicting the clean data directly, the network predicts the noise that was added at each step. This noise prediction objective proved both stable to train and effective at producing high-quality samples. DDPM achieved an Inception score of 9.46 and an FID score of 3.17 on unconditional CIFAR-10, which was state of the art for likelihood-based models at the time. [44] On 256x256 LSUN, the model obtained sample quality similar to ProgressiveGAN. [44]

### Rapid progress and unification (2020 to 2022)

Progress accelerated rapidly after DDPM. In late 2020, Jiaming Song, Chenlin Meng, and Stefano Ermon introduced **Denoising Diffusion Implicit Models (DDIM)**, which generalized the DDPM sampling process to non-Markovian forward processes, enabling deterministic sampling and producing high-quality images 10 to 50 times faster than DDPM.

In early 2021, Alex Nichol and Prafulla Dhariwal published "Improved Denoising Diffusion Probabilistic Models," introducing the cosine noise schedule and learned variance parameters that yielded better log-likelihood scores and sample quality.

A major theoretical advance came in 2021 when Yang Song, Jascha Sohl-Dickstein, Diederik Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole published "Score-Based Generative Modeling through Stochastic Differential Equations." This paper unified DDPM and score-based models into a single framework based on continuous-time stochastic differential equations (SDEs). The forward noising process is described by an SDE, and generation follows the corresponding reverse-time SDE. This unification showed that DDPM and score-based approaches are two perspectives on the same underlying mathematical structure.

Also in 2021, Prafulla Dhariwal and Alex Nichol at [OpenAI](/wiki/openai) published "Diffusion Models Beat GANs on Image Synthesis," introducing **classifier guidance** and architectural improvements that let diffusion models surpass GANs on [ImageNet](/wiki/imagenet) generation for the first time. The paper reported state-of-the-art FID scores of 2.97 on ImageNet 128x128, 4.59 on ImageNet 256x256, and 7.72 on ImageNet 512x512, stating: "We show that diffusion models can achieve image sample quality superior to the current state-of-the-art generative models." [46] This result marked a turning point for the field.

In 2022, Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer introduced **Latent Diffusion Models (LDM)** at CVPR, which run the diffusion process in the compressed latent space of a pretrained [variational autoencoder](/wiki/variational_autoencoder) (VAE) rather than on raw pixels. The authors framed the motivation directly: "To enable DM training on limited computational resources while retaining their quality and flexibility, we apply them in the latent space of powerful pretrained autoencoders." [47] Because the latent representation is roughly 48 times smaller than the pixel image (for example, 64x64x4 values instead of 512x512x3), this approach sharply reduced computational cost while maintaining high image quality, and it became the foundation for [Stable Diffusion](/wiki/stable_diffusion).

### The text-to-image era (2022 to present)

Beginning in 2022, diffusion models powered a wave of text-to-image systems that brought generative AI to mainstream attention. DALL-E 2 (OpenAI, April 2022), Imagen (Google Brain, May 2022), and Stable Diffusion (Stability AI / CompVis / Runway, August 2022) all demonstrated the ability to generate photorealistic images from text descriptions. Stable Diffusion's open release on August 22, 2022 had a particularly large impact, enabling a broad ecosystem of fine-tuned models, [LoRA](/wiki/lora) adapters, ControlNet extensions, and custom pipelines; the community grew to roughly 270,000 members on the Stable Diffusion Discord within the first year. [48]

In 2023, William Peebles and Saining Xie introduced the **Diffusion Transformer (DiT)**, which replaced the U-Net backbone with a [transformer](/wiki/transformer)-based architecture operating on image patches. DiT demonstrated clear scaling laws: larger models with more compute consistently achieved lower FID scores. This architecture has since been adopted by Stable Diffusion 3, [Sora](/wiki/sora), and FLUX.

## How does a diffusion model work mathematically?

Diffusion models rest on two complementary processes: a forward process that gradually adds noise to data, and a reverse process that learns to remove the noise.

### Forward diffusion process

Given a data point $$x_0$$ sampled from the real data distribution $$q(x_0)$$, the forward process defines a Markov chain that adds Gaussian noise over $$T$$ steps:

$$
q(x_t \mid x_{t-1}) = \mathcal{N}(x_t; \sqrt{1 - \beta_t} \, x_{t-1}, \beta_t I)
$$

Here, $$\beta_1, \beta_2, \ldots, \beta_T$$ is a **noise schedule** that controls how much noise is added at each step. As t increases, the sample becomes progressively noisier. After sufficiently many steps, $$x_T$$ is approximately standard Gaussian noise, and all information about the original data point has been destroyed.

A useful property of this formulation is that $$x_t$$ can be sampled directly at any timestep without iterating through all previous steps. Defining $$\alpha_t = 1 - \beta_t$$ and $$\bar\alpha_t = \prod_{s=1}^{t} \alpha_s$$ as the cumulative product of $$\alpha_1$$ through $$\alpha_t$$:

$$
x_t = \sqrt{\bar\alpha_t} \, x_0 + \sqrt{1 - \bar\alpha_t} \, \epsilon
$$

where $$\epsilon$$ is drawn from $$\mathcal{N}(0, I)$$. This closed-form expression is essential for efficient training, since the model can be trained on randomly sampled timesteps rather than requiring sequential computation through all steps.
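The closed-form expression can be checked numerically against the step-by-step Markov chain. The following is a minimal NumPy sketch, assuming the linear DDPM schedule ($$\beta_1 = 0.0001$$, $$\beta_T = 0.02$$, $$T = 1000$$); the function names and the toy data are illustrative choices, not part of any particular library.

```python
import numpy as np

def linear_beta_schedule(T=1000, beta_start=1e-4, beta_end=0.02):
    """Linear noise schedule beta_1..beta_T (index 0 holds beta_1)."""
    return np.linspace(beta_start, beta_end, T)

def q_sample(x0, t, alpha_bar, rng):
    """Sample x_t directly from x_0 with the closed form; t is 1-indexed."""
    eps = rng.standard_normal(x0.shape)
    a = alpha_bar[t - 1]
    x_t = np.sqrt(a) * x0 + np.sqrt(1.0 - a) * eps
    return x_t, eps

def q_sample_iterative(x0, t, betas, rng):
    """Reach x_t by applying the one-step transition q(x_s | x_{s-1}) t times."""
    x = np.array(x0, dtype=float)
    for beta in betas[:t]:
        x = np.sqrt(1.0 - beta) * x + np.sqrt(beta) * rng.standard_normal(x.shape)
    return x

betas = linear_beta_schedule()
alpha_bar = np.cumprod(1.0 - betas)      # cumulative product of alpha_s = 1 - beta_s

rng = np.random.default_rng(0)
x0 = np.full(50_000, 3.0)                # toy "dataset": many copies of the value 3
direct, _ = q_sample(x0, 200, alpha_bar, rng)
stepped = q_sample_iterative(x0, 200, betas, rng)

# Both routes should give mean near sqrt(alpha_bar_200) * 3
# and standard deviation near sqrt(1 - alpha_bar_200).
print(direct.mean(), stepped.mean())
print(direct.std(), stepped.std())
```

The one-shot route costs a single multiply-add per sample regardless of $$t$$, which is what makes training on randomly chosen timesteps cheap.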

### Reverse diffusion process

The reverse process starts from Gaussian noise $$x_T$$ and iteratively denoises to recover a data sample. The true reverse conditional $$q(x_{t-1} \mid x_t)$$ is intractable in general, but when conditioned on the original data point $$x_0$$, the posterior $$q(x_{t-1} \mid x_t, x_0)$$ is Gaussian and can be computed in closed form.

A neural network with parameters $$\theta$$ is trained to approximate the reverse transitions:

$$
p_\theta(x_{t-1} \mid x_t) = \mathcal{N}(x_{t-1}; \mu_\theta(x_t, t), \sigma_t^2 I)
$$

In the DDPM formulation, the mean $$\mu_\theta$$ is parameterized in terms of a noise prediction network $$\epsilon_\theta(x_t, t)$$, which estimates the noise that was added to produce $$x_t$$ from $$x_0$$. Given the predicted noise, the model can compute an estimate of $$x_0$$ and then derive the reverse step mean.
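A single reverse step can be sketched as follows. This assumes the mean used in the DDPM paper, $$\mu_\theta = \frac{1}{\sqrt{\alpha_t}}\left(x_t - \frac{\beta_t}{\sqrt{1 - \bar\alpha_t}} \, \epsilon_\theta\right)$$, and the variance choice $$\sigma_t^2 = \beta_t$$ (one of the options discussed there). The helper names are hypothetical, and the true noise stands in for a trained network so that the result can be compared with the closed-form posterior $$q(x_{t-1} \mid x_t, x_0)$$.

```python
import numpy as np

betas = np.linspace(1e-4, 0.02, 1000)    # linear schedule, index 0 holds beta_1
alphas = 1.0 - betas
alpha_bar = np.cumprod(alphas)

def reverse_mean(x_t, t, eps_pred):
    """Mean of p_theta(x_{t-1} | x_t) from a noise prediction; t is 1-indexed."""
    beta_t, alpha_t, ab_t = betas[t - 1], alphas[t - 1], alpha_bar[t - 1]
    return (x_t - beta_t / np.sqrt(1.0 - ab_t) * eps_pred) / np.sqrt(alpha_t)

def reverse_step(x_t, t, eps_pred, rng):
    """One ancestral sampling step; no noise is added at the final step t = 1."""
    mean = reverse_mean(x_t, t, eps_pred)
    if t == 1:
        return mean
    sigma_t = np.sqrt(betas[t - 1])      # assumption: sigma_t^2 = beta_t
    return mean + sigma_t * rng.standard_normal(x_t.shape)

def posterior_mean(x_t, x0, t):
    """Closed-form mean of q(x_{t-1} | x_t, x_0), used here only as a check."""
    ab_t = alpha_bar[t - 1]
    ab_prev = alpha_bar[t - 2] if t > 1 else 1.0
    coef_x0 = np.sqrt(ab_prev) * betas[t - 1] / (1.0 - ab_t)
    coef_xt = np.sqrt(alphas[t - 1]) * (1.0 - ab_prev) / (1.0 - ab_t)
    return coef_x0 * x0 + coef_xt * x_t

rng = np.random.default_rng(0)
x0 = rng.standard_normal(16)
eps = rng.standard_normal(16)
t = 500
x_t = np.sqrt(alpha_bar[t - 1]) * x0 + np.sqrt(1.0 - alpha_bar[t - 1]) * eps

# With the true noise plugged in, the network-based mean equals the posterior mean.
means_agree = np.allclose(reverse_mean(x_t, t, eps), posterior_mean(x_t, x0, t))
x_prev = reverse_step(x_t, t, eps, rng)
print(means_agree)
```

In a real sampler `eps_pred` would come from the trained network $$\epsilon_\theta(x_t, t)$$, and the step would be repeated from $$t = T$$ down to $$t = 1$$.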

### Training objective

The standard DDPM training objective is a simplified form of the variational lower bound (VLB) on the data log-likelihood:

$$
L_{\text{simple}} = \mathbb{E}\left[\lVert \epsilon - \epsilon_\theta(x_t, t) \rVert^2\right]
$$

The expectation is over t sampled uniformly from $$\{1, \ldots, T\}$$, $$x_0$$ sampled from the training data, and $$\epsilon$$ sampled from $$\mathcal{N}(0, I)$$. In practice, each training step involves: (1) selecting a random training sample, (2) selecting a random timestep, (3) adding the corresponding amount of noise using the closed-form expression, and (4) training the network to predict the noise that was added.

Ho et al. found that this simplified mean squared error [loss](/wiki/loss_function) on noise prediction produced better sample quality than the full variational bound, likely because it places more weight on the perceptually important lower noise levels. They reported that the best results came from "training on a weighted variational bound designed according to a novel connection between diffusion probabilistic models and denoising score matching with Langevin dynamics." [44]
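The four training steps listed above can be sketched as a single loss evaluation. This is a NumPy illustration under stated assumptions: `eps_model` is a hypothetical stand-in for the denoising network, the dataset is random toy data, and the gradient update is omitted because it would normally be handled by an automatic differentiation framework.

```python
import numpy as np

def ddpm_simple_loss(eps_model, dataset, alpha_bar, batch_size, rng):
    """Monte Carlo estimate of L_simple on one random batch."""
    T = len(alpha_bar)
    idx = rng.integers(0, len(dataset), size=batch_size)        # (1) random samples
    x0 = dataset[idx]
    t = rng.integers(1, T + 1, size=batch_size)                  # (2) random timesteps
    eps = rng.standard_normal(x0.shape)
    # Reshape alpha_bar_t so it broadcasts over all non-batch dimensions.
    a = alpha_bar[t - 1].reshape(-1, *([1] * (x0.ndim - 1)))
    x_t = np.sqrt(a) * x0 + np.sqrt(1.0 - a) * eps               # (3) closed-form noising
    return np.mean((eps - eps_model(x_t, t)) ** 2)               # (4) noise-prediction MSE

betas = np.linspace(1e-4, 0.02, 1000)
alpha_bar = np.cumprod(1.0 - betas)

rng = np.random.default_rng(0)
dataset = rng.standard_normal((1000, 8))                         # toy data, 8 features

def zero_model(x_t, t):
    """Untrained placeholder that always predicts zero noise."""
    return np.zeros_like(x_t)

loss = ddpm_simple_loss(zero_model, dataset, alpha_bar, batch_size=4096, rng=rng)
print(loss)   # close to 1, the variance of the standard normal noise
```

A model that always predicts zero scores a loss near 1, which gives a useful baseline: a network that has learned anything should fall below it.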

### Score matching perspective

From the score matching viewpoint, the noise prediction network is closely related to the **score function**, defined as the gradient of the log probability density with respect to the data:

$$
\mathrm{score}(x) = \nabla_x \log p(x)
$$

The noise prediction at timestep t is proportional to the score of the noisy data distribution at that noise level. The score function tells the model which direction to "push" a noisy sample to move it toward higher-probability (cleaner) regions of the data distribution. Yang Song and Stefano Ermon's score-based framework directly estimates this score function and generates samples using Langevin dynamics.
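For the forward process defined earlier, the proportionality is $$\nabla_{x_t} \log q(x_t) = -\mathbb{E}[\epsilon \mid x_t] / \sqrt{1 - \bar\alpha_t}$$. The sketch below checks this on a one-dimensional toy case where everything is Gaussian and the score is known exactly; the data standard deviation and the value of $$\bar\alpha_t$$ are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
data_std = 2.0          # toy data distribution: x_0 ~ N(0, data_std^2)
alpha_bar_t = 0.3       # an arbitrary noise level
n = 200_000

x0 = data_std * rng.standard_normal(n)
eps = rng.standard_normal(n)
x_t = np.sqrt(alpha_bar_t) * x0 + np.sqrt(1.0 - alpha_bar_t) * eps

# x_t is Gaussian with this variance, so its score is -x_t / var_t.
var_t = alpha_bar_t * data_std**2 + (1.0 - alpha_bar_t)
score_slope = -1.0 / var_t

# Because (eps, x_t) are jointly Gaussian, E[eps | x_t] is linear in x_t.
# Its slope is estimated here by least squares from the samples.
eps_slope = np.mean(eps * x_t) / np.mean(x_t**2)

# Rescaling the optimal noise prediction recovers the score.
score_from_eps = -eps_slope / np.sqrt(1.0 - alpha_bar_t)
print(score_slope, score_from_eps)
```

The two printed slopes agree up to Monte Carlo error, which is the sense in which a noise-prediction network and a score network carry the same information.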

### Continuous-time SDE formulation

The SDE framework by Song et al. (2021) describes the forward process as a continuous-time stochastic differential equation:

$$
dx = f(x, t) \, dt + g(t) \, dw
$$

where f is the drift coefficient, g is the diffusion coefficient, and w is a standard Wiener process. The reverse-time SDE takes the form:

$$
dx = \left[f(x, t) - g(t)^2 \, \mathrm{score}(x, t)\right] dt + g(t) \, d\bar{w}
$$

where $$\bar{w}$$ is a reverse-time Wiener process. This formulation allows the use of numerical SDE and ODE solvers for sampling, and it unifies DDPM and score-based models as different discretizations of the same continuous process.
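The reverse-time SDE can be simulated with a simple Euler-Maruyama scheme once the score is available. The sketch below assumes a variance-preserving SDE with constant $$\beta$$, so that $$f(x, t) = -\tfrac{1}{2}\beta x$$ and $$g(t) = \sqrt{\beta}$$, and a one-dimensional Gaussian data distribution whose score is known in closed form. In a real model the `score` function would be a trained network; all names and constants here are illustrative.

```python
import numpy as np

beta = 10.0               # constant beta(t): f(x, t) = -0.5 * beta * x, g(t) = sqrt(beta)
mu0, sigma0 = 2.0, 0.5    # toy data distribution N(mu0, sigma0^2)

def marginal(t):
    """Mean and variance of x(t) under the forward SDE."""
    decay = np.exp(-beta * t)
    return mu0 * np.sqrt(decay), sigma0**2 * decay + 1.0 - decay

def score(x, t):
    """Exact score of the Gaussian marginal at time t."""
    mean, var = marginal(t)
    return -(x - mean) / var

def reverse_sde_sample(n_samples, n_steps, rng, t_end=1.0):
    """Integrate the reverse-time SDE from t_end down to 0 with Euler-Maruyama."""
    dt = t_end / n_steps
    x = rng.standard_normal(n_samples)           # x(t_end) is approximately N(0, 1)
    for i in range(n_steps):
        t = t_end - i * dt
        drift = -0.5 * beta * x - beta * score(x, t)   # f - g^2 * score
        # Stepping backward in time flips the sign of the drift term.
        x = x - drift * dt + np.sqrt(beta * dt) * rng.standard_normal(n_samples)
    return x

rng = np.random.default_rng(0)
samples = reverse_sde_sample(20_000, 1000, rng)
print(samples.mean(), samples.std())   # should land near mu0 and sigma0
```

Dropping the noise term and halving the score coefficient would turn the same loop into a solver for the probability flow ODE, which is how deterministic samplers arise from this framework.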

### Elucidating the design space (EDM)

In 2022, Tero Karras, Miika Aittala, Timo Aila, and Samuli Laine of NVIDIA published "Elucidating the Design Space of Diffusion-Based Generative Models" (NeurIPS 2022), commonly referred to as the **EDM framework**.[^27] The paper argued that the theory and practice of diffusion models had become "unnecessarily convoluted" and proposed a unified design space separating the choices of noise schedule, network preconditioning, training loss weighting, sampler, and noise distribution at training time. Using this framework, EDM achieved an FID of 1.79 on class-conditional CIFAR-10 and 1.97 unconditional, with only 35 network evaluations per image. The second-order Heun sampler and the sigma-based preconditioning introduced by EDM have been widely adopted in subsequent diffusion model implementations.

### Flow matching and stochastic interpolants

Beyond the SDE-based formulation, a family of closely related frameworks reformulates the generative problem as learning a continuous transport between a noise distribution and the data distribution.

- **Flow matching**, introduced by Yaron Lipman, Ricky T.Q. Chen, Heli Ben-Hamu, Maximilian Nickel, and Matthew Le (ICLR 2023), trains a continuous normalizing flow by regressing onto a target vector field that defines a fixed conditional probability path between noise and data.[^28] The training objective is simulation-free, sharing the simplicity of diffusion training, but allows the use of probability paths beyond the variance-preserving Gaussian paths implied by DDPM, including optimal-transport (OT) displacement interpolants, which yield straighter trajectories and faster sampling.
- **Rectified flow** (Liu et al., 2022) can be viewed as a specific instance of the flow-matching family with linear interpolation between noise and data and an iterative "reflow" procedure that further straightens learned trajectories.
- **Stochastic interpolants**, introduced by Michael S. Albergo, Nicholas M. Boffi, and Eric Vanden-Eijnden (2023), unify flows and diffusions by constructing stochastic processes that bridge two arbitrary densities in finite time, with a tunable noise level along the bridge.[^29] Choosing pure determinism recovers a probability flow ODE; adding noise recovers an SDE.

These frameworks are mathematically equivalent to score-based diffusion in many practical settings but offer additional flexibility in choosing probability paths. Modern systems such as Stable Diffusion 3 and FLUX.1 adopt flow-matching style training objectives in place of the Gaussian diffusion losses used by earlier latent diffusion models.
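The linear-interpolation case shared by rectified flow and OT-style flow matching can be sketched in a few lines. The convention below ($$t = 0$$ is noise, $$t = 1$$ is data) is an assumption, as some systems use the opposite direction, and the function names are hypothetical. To keep the check exact, the data distribution is a single point, for which the ideal velocity field is known in closed form.

```python
import numpy as np

def flow_matching_pair(x_data, rng):
    """Build one training example: a point on the straight path and its target velocity."""
    x_noise = rng.standard_normal(x_data.shape)
    t = rng.uniform(size=(x_data.shape[0], 1))       # one time in [0, 1) per example
    x_t = (1.0 - t) * x_noise + t * x_data           # linear interpolation
    target_v = x_data - x_noise                      # constant velocity along the path
    return x_t, t, target_v

def flow_matching_loss(v_model, x_t, t, target_v):
    """Simulation-free regression loss on the velocity field."""
    return np.mean((v_model(x_t, t) - target_v) ** 2)

def euler_sample(v_model, x_noise, n_steps):
    """Integrate dx/dt = v(x, t) from t = 0 to t = 1 with fixed-size Euler steps."""
    x = np.array(x_noise, dtype=float)
    dt = 1.0 / n_steps
    for i in range(n_steps):
        t = np.full((x.shape[0], 1), i * dt)
        x = x + dt * v_model(x, t)
    return x

# Toy case: every data sample is the same point, so the ideal field is (c - x) / (1 - t).
c = np.array([1.0, -2.0])

def oracle_velocity(x, t):
    return (c - x) / (1.0 - t)

rng = np.random.default_rng(0)
x_data = np.tile(c, (1024, 1))
x_t, t, target_v = flow_matching_pair(x_data, rng)
loss = flow_matching_loss(oracle_velocity, x_t, t, target_v)

generated = euler_sample(oracle_velocity, rng.standard_normal((8, 2)), n_steps=4)
print(loss, generated[0])
```

Because the paths in this toy case are perfectly straight, four Euler steps already land on the data point; with real data the learned trajectories are only approximately straight, which is the motivation for reflow-style straightening.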

### Noise schedules

The noise schedule determines how quickly the forward process destroys the data signal. Common choices include:

| Schedule | Description | Notes |
|---|---|---|
| Linear | $$\beta_t$$ increases linearly from $$\beta_1$$ to $$\beta_T$$ | Used in the original DDPM ($$\beta_1 = 0.0001$$, $$\beta_T = 0.02$$, $$T = 1000$$) |
| Cosine | $$\bar\alpha_t$$ follows a cosine curve | Proposed by Nichol and Dhariwal (2021); adds noise more gradually at early steps |
| Scaled linear | Linear schedule adapted for latent space | Common in latent diffusion models |
| Sigmoid | $$\beta_t$$ follows a sigmoid curve | Used in some continuous-time formulations |

The cosine schedule generally produces better results than the linear schedule because it preserves more signal in intermediate steps, where much of the perceptually meaningful structure is learned.
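The difference between the linear and cosine schedules is easiest to see by comparing $$\bar\alpha_t$$ directly. The sketch below assumes the cosine form from Nichol and Dhariwal, $$\bar\alpha_t = f(t)/f(0)$$ with $$f(t) = \cos^2\left(\frac{t/T + s}{1 + s} \cdot \frac{\pi}{2}\right)$$ and offset $$s = 0.008$$, with $$\beta_t$$ clipped at 0.999; the function names are illustrative.

```python
import numpy as np

def linear_alpha_bar(T=1000, beta_start=1e-4, beta_end=0.02):
    """alpha_bar_1..alpha_bar_T for the linear beta schedule."""
    betas = np.linspace(beta_start, beta_end, T)
    return np.cumprod(1.0 - betas)

def cosine_alpha_bar(T=1000, s=0.008):
    """alpha_bar_1..alpha_bar_T for the cosine schedule."""
    steps = np.arange(T + 1) / T
    f = np.cos((steps + s) / (1.0 + s) * np.pi / 2.0) ** 2
    return (f / f[0])[1:]

def betas_from_alpha_bar(alpha_bar, max_beta=0.999):
    """Recover per-step betas; clipping avoids a singular final step."""
    prev = np.concatenate(([1.0], alpha_bar[:-1]))
    return np.clip(1.0 - alpha_bar / prev, 0.0, max_beta)

lin = linear_alpha_bar()
cos = cosine_alpha_bar()
cos_betas = betas_from_alpha_bar(cos)

# Remaining signal fraction alpha_bar_t at a few timesteps.
for t in (250, 500, 750):
    print(t, lin[t - 1], cos[t - 1])
```

At the midpoint the cosine schedule retains several times more signal than the linear one, which illustrates the claim that it preserves more information in intermediate steps.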

### Output parameterizations

The denoising network can parameterize its output in several equivalent ways:

| Parameterization | Description | Typical use |
|---|---|---|
| Epsilon prediction | Network predicts the noise $$\epsilon$$ added during forward process | DDPM, Stable Diffusion 1.x/2.x |
| x_0 prediction | Network directly predicts the clean data $$x_0$$ | Some early models; useful for certain loss formulations |
| v prediction | Network predicts velocity $$v = \sqrt{\bar\alpha_t} \, \epsilon - \sqrt{1 - \bar\alpha_t} \, x_0$$ | Progressive distillation, Stable Diffusion 2.x |

All three are mathematically interconvertible, but they have different numerical properties affecting training stability and sample quality at different noise levels.
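The conversions follow from $$x_t = \sqrt{\bar\alpha_t} \, x_0 + \sqrt{1 - \bar\alpha_t} \, \epsilon$$ and the definition of $$v$$ in the table. A small sketch with hypothetical helper names, followed by a round-trip check:

```python
import numpy as np

def x0_from_eps(x_t, eps, alpha_bar_t):
    """Clean-data estimate implied by a noise prediction."""
    return (x_t - np.sqrt(1.0 - alpha_bar_t) * eps) / np.sqrt(alpha_bar_t)

def v_from_eps_x0(eps, x0, alpha_bar_t):
    """Velocity target v = sqrt(alpha_bar) * eps - sqrt(1 - alpha_bar) * x0."""
    return np.sqrt(alpha_bar_t) * eps - np.sqrt(1.0 - alpha_bar_t) * x0

def x0_from_v(x_t, v, alpha_bar_t):
    """Clean-data estimate implied by a velocity prediction."""
    return np.sqrt(alpha_bar_t) * x_t - np.sqrt(1.0 - alpha_bar_t) * v

def eps_from_v(x_t, v, alpha_bar_t):
    """Noise estimate implied by a velocity prediction."""
    return np.sqrt(1.0 - alpha_bar_t) * x_t + np.sqrt(alpha_bar_t) * v

rng = np.random.default_rng(0)
x0 = rng.standard_normal(5)
eps = rng.standard_normal(5)
alpha_bar_t = 0.37                                   # arbitrary noise level
x_t = np.sqrt(alpha_bar_t) * x0 + np.sqrt(1.0 - alpha_bar_t) * eps

v = v_from_eps_x0(eps, x0, alpha_bar_t)
roundtrip_ok = (
    np.allclose(x0_from_eps(x_t, eps, alpha_bar_t), x0)
    and np.allclose(x0_from_v(x_t, v, alpha_bar_t), x0)
    and np.allclose(eps_from_v(x_t, v, alpha_bar_t), eps)
)
print(roundtrip_ok)
```

Note that `x0_from_eps` divides by $$\sqrt{\bar\alpha_t}$$, which becomes very small at high noise levels; this is one concrete source of the differing numerical properties mentioned above.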

## Key architectures

### U-Net backbone

Most diffusion models through 2023 used a [U-Net](/wiki/unet) architecture as the denoising network. Originally designed for biomedical image segmentation, the U-Net features an encoder-decoder structure with skip connections between corresponding encoder and decoder layers. In the diffusion context, the U-Net takes a noisy input $$x_t$$ and a timestep t, then predicts the noise $$\epsilon$$.

The architecture typically includes:

- **[ResNet](/wiki/resnet) blocks** with group normalization at each resolution level
- **Timestep embeddings** injected as additive bias into residual blocks, informing the network of the current noise level
- **[Self-attention](/wiki/self_attention) layers** at lower resolutions (typically 16x16 and 8x8) to capture long-range dependencies
- **[Cross-attention](/wiki/cross_attention) layers** for conditioning on external inputs such as text embeddings from [CLIP](/wiki/clip) or T5
- **Skip connections** between encoder and decoder that preserve spatial details

The encoder progressively downsamples spatial resolution while increasing channels, and the decoder upsamples back. This multi-scale structure lets the network capture both fine textures and global composition.
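Timestep embeddings are commonly built from sinusoidal features, in the style of transformer positional encodings, before being passed through a small MLP. The sketch below shows only the sinusoidal part; the dimension, the `max_period` value, and the cosine-then-sine ordering are assumptions that vary between implementations.

```python
import numpy as np

def timestep_embedding(t, dim, max_period=10_000.0):
    """Map integer timesteps to sinusoidal feature vectors of length dim (dim must be even)."""
    t = np.asarray(t, dtype=float).reshape(-1)
    half = dim // 2
    # Geometrically spaced frequencies from 1 down to roughly 1 / max_period.
    freqs = np.exp(-np.log(max_period) * np.arange(half) / half)
    args = t[:, None] * freqs[None, :]
    return np.concatenate([np.cos(args), np.sin(args)], axis=-1)

emb = timestep_embedding([0, 10, 999], dim=128)
print(emb.shape)
```

Each timestep gets a distinct, smoothly varying vector, so nearby noise levels receive similar embeddings while distant ones are easy for the network to tell apart.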

### Diffusion Transformer (DiT)

The **Diffusion Transformer (DiT)**, introduced by William Peebles and Saining Xie in 2023 (ICCV), replaces the U-Net with a [Vision Transformer](/wiki/vision_transformer) (ViT)-style architecture operating on sequences of image patches. Timestep and class conditioning are incorporated through adaptive layer normalization (adaLN).

The key finding was that DiT models follow clear scaling laws: more compute (measured in GFLOPs) consistently yields lower FID scores. The largest model, DiT-XL/2, achieved an FID of 2.27 on class-conditional ImageNet 256x256 generation. The DiT architecture has since been adopted by Stable Diffusion 3, [Sora](/wiki/sora), and FLUX, reflecting a broader shift toward transformer-based architectures across modalities.
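The patch-sequence input can be illustrated with plain array reshaping. The sketch assumes a 32x32x4 latent and patch size 2, which corresponds to the "/2" in DiT-XL/2 applied to a 256x256 image encoded by an 8x-downsampling VAE; the function names are hypothetical, and the linear projection and positional embeddings that follow patchification are omitted.

```python
import numpy as np

def patchify(x, p):
    """Split an (H, W, C) array into a sequence of flattened p x p patches."""
    H, W, C = x.shape
    assert H % p == 0 and W % p == 0, "height and width must be multiples of p"
    x = x.reshape(H // p, p, W // p, p, C)
    x = x.transpose(0, 2, 1, 3, 4)                   # (H/p, W/p, p, p, C)
    return x.reshape((H // p) * (W // p), p * p * C)

def unpatchify(tokens, H, W, C, p):
    """Inverse of patchify: rebuild the (H, W, C) array from the token sequence."""
    x = tokens.reshape(H // p, W // p, p, p, C)
    x = x.transpose(0, 2, 1, 3, 4)                   # (H/p, p, W/p, p, C)
    return x.reshape(H, W, C)

rng = np.random.default_rng(0)
latent = rng.standard_normal((32, 32, 4))
tokens = patchify(latent, p=2)
restored = unpatchify(tokens, 32, 32, 4, p=2)
print(tokens.shape)   # (256, 16)
```

Halving the patch size quadruples the sequence length, which is the main lever behind the compute scaling reported for DiT.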

### Latent diffusion architecture

**Latent Diffusion Models (LDM)**, introduced by Rombach et al. (2022), run the diffusion process in the compressed latent space of a pretrained [VAE](/wiki/variational_autoencoder) rather than directly on pixels. A VAE encoder compresses an image (for example, 512x512x3 pixels) into a smaller latent representation (for example, 64x64x4), and the diffusion model operates on this compact representation. After generation, the VAE decoder converts the latent code back to pixel space.

This approach offers several benefits:

- Computational cost is greatly reduced, since the diffusion model works on representations roughly 48 times smaller than the original image
- The VAE handles perceptually irrelevant detail, letting the diffusion model focus on semantically meaningful structure
- Conditioning mechanisms (text, spatial maps, class labels) integrate through cross-attention layers in the denoising network

Latent diffusion became the foundation for [Stable Diffusion](/wiki/stable_diffusion) and influenced the design of many subsequent systems.
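The factor of 48 follows directly from the example shapes given above, as this short calculation shows (the variable names are illustrative):

```python
pixel_values = 512 * 512 * 3       # RGB image
latent_values = 64 * 64 * 4        # VAE latent with 4 channels
ratio = pixel_values / latent_values
spatial_factor = 512 // 64         # downsampling per spatial axis

print(pixel_values, latent_values, ratio)   # 786432 16384 48.0
print(spatial_factor)                       # 8
```

The ratio describes the size of the tensor the denoiser processes; the actual savings in compute depend on the architecture, since attention cost grows faster than linearly with the number of spatial positions.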

## Sampling methods

### DDPM sampling

The original DDPM sampling requires T = 1000 sequential denoising steps, making generation slow. Each step applies the learned reverse transition to produce a slightly cleaner sample.

### DDIM sampling

**Denoising Diffusion Implicit Models (DDIM)**, proposed by Jiaming Song, Chenlin Meng, and Stefano Ermon (2020), generalize DDPM by constructing non-Markovian forward processes that share the same training objective. DDIM sampling is deterministic given a fixed initial noise vector, enabling consistent image generation from the same latent code and meaningful interpolation in latent space. DDIM can use as few as 10 to 50 steps with relatively minor quality loss compared to 1000-step DDPM.
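A deterministic DDIM sampler over a strided subset of timesteps can be sketched as below. It assumes the common zero-noise ($$\eta = 0$$) update: estimate $$x_0$$ from the predicted noise, then re-noise that estimate to the earlier timestep using the same predicted noise. The `oracle_eps` function is a hypothetical stand-in for a trained network, chosen to be exact for a data distribution consisting of a single point.

```python
import numpy as np

betas = np.linspace(1e-4, 0.02, 1000)
alpha_bar = np.cumprod(1.0 - betas)

def ddim_step(x_t, eps_pred, ab_t, ab_prev):
    """Deterministic DDIM update from noise level ab_t to ab_prev."""
    x0_pred = (x_t - np.sqrt(1.0 - ab_t) * eps_pred) / np.sqrt(ab_t)
    return np.sqrt(ab_prev) * x0_pred + np.sqrt(1.0 - ab_prev) * eps_pred

def ddim_sample(eps_model, x_T, alpha_bar, num_steps):
    """Run DDIM on num_steps timesteps spread evenly between T and 1."""
    T = len(alpha_bar)
    ts = np.linspace(T, 1, num_steps).round().astype(int)
    x = x_T
    for i, t in enumerate(ts):
        ab_t = alpha_bar[t - 1]
        # After the last timestep the target is clean data, i.e. alpha_bar = 1.
        ab_prev = alpha_bar[ts[i + 1] - 1] if i + 1 < len(ts) else 1.0
        x = ddim_step(x, eps_model(x, t), ab_t, ab_prev)
    return x

target = np.array([1.0, -2.0])                       # the only point in the toy dataset

def oracle_eps(x_t, t):
    """Exact noise prediction when all data equals `target`."""
    ab = alpha_bar[t - 1]
    return (x_t - np.sqrt(ab) * target) / np.sqrt(1.0 - ab)

rng = np.random.default_rng(0)
x_T = rng.standard_normal(2)
sample_20 = ddim_sample(oracle_eps, x_T, alpha_bar, num_steps=20)
sample_20_again = ddim_sample(oracle_eps, x_T, alpha_bar, num_steps=20)
print(sample_20)
```

Running the sampler twice from the same `x_T` gives identical outputs, which is the determinism that makes latent-space interpolation meaningful.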

### DPM-Solver and higher-order solvers

DPM-Solver, introduced by Cheng Lu, Yuhao Zhou, Fan Bao, Jianfei Chen, Chongxuan Li, and Jun Zhu in 2022, applies high-order ODE solvers to the diffusion sampling process. By analytically computing parts of the solution and using higher-order numerical methods for the remainder, DPM-Solver can generate high-quality samples in 10 to 25 steps. DPM-Solver++ further improved results for guided sampling. These solvers are now among the most commonly used in practice.

### Can diffusion models generate images in a single step?

**Consistency models**, introduced by Yang Song, Prafulla Dhariwal, Mark Chen, and Ilya Sutskever in 2023 (ICML), learn to map any point along the diffusion trajectory directly to the clean data point in a single step. This allows high-quality generation in one or very few steps. Consistency models can be trained either by distilling a pretrained diffusion model (consistency distillation) or from scratch (consistency training).

Improved Consistency Training (iCT), published in 2024, achieved FID scores of 2.51 on CIFAR-10 and 3.25 on ImageNet 64x64 in a single sampling step. Easy Consistency Tuning (ECT), published at ICLR 2025, achieved a 2-step FID of 2.73 on CIFAR-10 within one hour on a single A100 GPU, matching performance that previously required hundreds of GPU hours.

### Rectified flow

**Rectified flow**, introduced by Xingchao Liu, Chengyue Gong, and Qiang Liu in "Flow Straight and Fast: Learning to Generate and Transfer Data with Rectified Flow" (ICLR 2023 Spotlight), learns a transport map between noise and data along straight paths rather than the curved trajectories of standard diffusion.[^25] Straighter paths require fewer discretization steps, making generation more efficient. The Stable Diffusion 3 paper by Esser et al. (2024) and the FLUX.1 model family from Black Forest Labs both adopt rectified flow as their formulation. Recent research has shown that flow matching with a Gaussian noise source is mathematically equivalent to Gaussian diffusion up to the choice of noise schedule and parameterization, though the flow-matching formulation offers practical advantages in trajectory straightness and training simplicity.

### Latent Consistency Models (LCM)

**Latent Consistency Models**, introduced by Simian Luo and colleagues in October 2023, distill consistency-model behavior directly in the latent space of a pretrained latent diffusion model.[^26] Viewing the guided reverse process as solving an augmented probability flow ODE in latent space, LCMs predict the ODE solution directly, enabling 2 to 4 step 768x768 image synthesis. A high-quality LCM took only 32 A100 GPU hours to train. The follow-up **LCM-LoRA** (Luo et al., November 2023) treats consistency distillation as a [LoRA](/wiki/lora) adapter that can be plugged into pretrained Stable Diffusion checkpoints without modifying the base weights, acting as a "universal acceleration module" for the open-source diffusion ecosystem.

## Conditioning and guidance

### Classifier guidance

Dhariwal and Nichol (2021) introduced **classifier guidance** to improve conditional generation quality. A separate classifier is trained on noisy images, and its gradients steer the diffusion sampling process toward a desired class:

$$
\epsilon_{\text{guided}} = \epsilon_\theta(x_t, t) - s \, \sqrt{1 - \bar\alpha_t} \, \nabla_{x_t} \log p(y \mid x_t)
$$

Higher values of the guidance scale s produce images more strongly associated with the target class but with reduced diversity. This approach requires training a separate classifier on noisy data, adding complexity.

### What is classifier-free guidance?

Jonathan Ho and Tim Salimans proposed **classifier-free guidance (CFG)** in 2022, eliminating the need for a separate classifier. During training, the conditioning signal (for example, a text prompt) is randomly dropped for a fraction of examples, so the model learns both conditional and unconditional generation. At inference:

$$
\epsilon_{\text{guided}} = \epsilon_{\text{unconditional}} + w (\epsilon_{\text{conditional}} - \epsilon_{\text{unconditional}})
$$

When $$w = 1$$, this is standard conditional generation. Values of $$w > 1$$ amplify the influence of the conditioning signal, producing outputs more closely aligned with the input at the cost of reduced diversity. CFG has become the standard conditioning approach in virtually all modern text-to-image diffusion systems. Typical guidance scale values range from 7 to 15.
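The guidance rule amounts to one extrapolation per denoising step, at the cost of two network evaluations. A minimal sketch, in which `toy_eps_model` is a hypothetical stand-in for a network that accepts an optional condition and `None` represents the dropped (unconditional) input:

```python
import numpy as np

def cfg_combine(eps_uncond, eps_cond, w):
    """Classifier-free guidance: extrapolate from the unconditional prediction."""
    return eps_uncond + w * (eps_cond - eps_uncond)

def guided_eps(eps_model, x_t, t, cond, w):
    """Evaluate the model with and without the condition, then combine."""
    eps_cond = eps_model(x_t, t, cond)
    eps_uncond = eps_model(x_t, t, None)             # None = condition dropped
    return cfg_combine(eps_uncond, eps_cond, w)

def toy_eps_model(x_t, t, cond):
    """Placeholder network: the condition simply shifts the prediction."""
    shift = 0.0 if cond is None else cond
    return 0.1 * x_t + shift

x_t = np.array([1.0, 2.0, 3.0])
cond = np.array([0.5, -0.5, 0.0])

eps_w0 = guided_eps(toy_eps_model, x_t, 10, cond, w=0.0)    # unconditional
eps_w1 = guided_eps(toy_eps_model, x_t, 10, cond, w=1.0)    # plain conditional
eps_w7 = guided_eps(toy_eps_model, x_t, 10, cond, w=7.5)    # amplified condition
print(eps_w0, eps_w1, eps_w7)
```

In practice the two evaluations are usually batched into a single forward pass by stacking the conditional and unconditional inputs.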

### Text conditioning via cross-attention

In text-conditioned diffusion models, the text prompt is first encoded into a sequence of embedding vectors using a text encoder such as [CLIP](/wiki/clip) or T5. These embeddings are injected into the denoising network through cross-attention layers, where image features act as queries (Q) and text embeddings provide keys (K) and values (V). This allows every spatial position in the image to attend to relevant parts of the text, enabling fine-grained alignment between the generated image and the prompt.
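The query, key, and value roles can be made concrete with a single-head cross-attention sketch. The token counts, feature sizes, and random projection matrices below are illustrative assumptions; real models use learned weights, multiple heads, and an output projection.

```python
import numpy as np

def cross_attention(image_tokens, text_tokens, W_q, W_k, W_v):
    """Single-head cross-attention: image positions query the text sequence."""
    Q = image_tokens @ W_q                           # (n_image, d_head)
    K = text_tokens @ W_k                            # (n_text, d_head)
    V = text_tokens @ W_v                            # (n_text, d_head)
    scores = Q @ K.T / np.sqrt(Q.shape[-1])          # (n_image, n_text)
    scores = scores - scores.max(axis=-1, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights

rng = np.random.default_rng(0)
image_tokens = rng.standard_normal((64, 32))         # 64 spatial positions, 32 features
text_tokens = rng.standard_normal((77, 48))          # 77 text tokens, 48 features
d_head = 16
W_q = rng.standard_normal((32, d_head)) / np.sqrt(32)
W_k = rng.standard_normal((48, d_head)) / np.sqrt(48)
W_v = rng.standard_normal((48, d_head)) / np.sqrt(48)

out, weights = cross_attention(image_tokens, text_tokens, W_q, W_k, W_v)
print(out.shape, weights.shape)   # (64, 16) (64, 77)
```

Each row of `weights` is a probability distribution over text tokens for one image position, which is the per-location alignment described above.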

Different systems use different text encoders:

| System | Text encoder(s) |
|---|---|
| [Stable Diffusion](/wiki/stable_diffusion) 1.x | CLIP ViT-L/14 |
| Stable Diffusion 2.x | OpenCLIP ViT-H/14 |
| Stable Diffusion XL | CLIP ViT-L + OpenCLIP ViT-bigG |
| Stable Diffusion 3 | Two CLIP models + T5-XXL |
| [DALL-E 2](/wiki/dall-e) | CLIP |
| [Imagen](/wiki/imagen) | T5-XXL (11B parameters) |
| FLUX.1 | CLIP ViT-L + T5-XXL |

### Negative prompts

In classifier-free guidance, the unconditional prediction can be replaced with a prediction conditioned on a **negative prompt** describing attributes the user wants to avoid:

$$
\epsilon_{\text{guided}} = \epsilon_{\text{negative}} + w (\epsilon_{\text{positive}} - \epsilon_{\text{negative}})
$$

This allows users to steer generation away from undesired features (for example, "blurry, low quality, distorted hands") while amplifying desired attributes.

## Major diffusion-based systems

| System | Organization | Year | Architecture | Key features |
|---|---|---|---|---|
| [DALL-E 2](/wiki/dall-e) | [OpenAI](/wiki/openai) | 2022 | CLIP prior + cascaded diffusion (unCLIP) | Text-to-image, inpainting, image variations |
| [Imagen](/wiki/imagen) | Google Brain | 2022 | T5-XXL + cascaded U-Net diffusion | Text-to-image at 1024x1024; showed scaling text encoder matters most |
| [Stable Diffusion](/wiki/stable_diffusion) 1.5 | [Stability AI](/wiki/stability_ai) / CompVis / Runway | 2022 | Latent diffusion, U-Net, CLIP, VAE | Open source; text-to-image, inpainting, img2img |
| [Midjourney](/wiki/midjourney) v4 | Midjourney, Inc. | 2022 | Proprietary diffusion model | Text-to-image via Discord |
| Stable Diffusion XL | [Stability AI](/wiki/stability_ai) | 2023 | Larger U-Net, dual CLIP encoders | 1024x1024 native resolution |
| [DALL-E 3](/wiki/dall-e) | [OpenAI](/wiki/openai) | 2023 | Improved diffusion + recaptioning pipeline | Strong text rendering and prompt following |
| [Midjourney](/wiki/midjourney) v6 | Midjourney, Inc. | 2023 | Third-generation model | Improved photorealism, text rendering |
| [Stable Diffusion 3](/wiki/stable_diffusion_3) | [Stability AI](/wiki/stability_ai) | 2024 | Multimodal DiT (MMDiT) + rectified flow | Three text encoders, improved text rendering |
| [FLUX.1](/wiki/flux_1) | [Black Forest Labs](/wiki/black_forest_labs) | 2024 | 12B-parameter rectified flow transformer | Pro, Dev, and Schnell variants |
| [Sora](/wiki/sora) | [OpenAI](/wiki/openai) | 2024 | Diffusion transformer on spacetime patches | Text-to-video up to 1 minute at 1080p |
| [HunyuanVideo](/wiki/hunyuan_video) | Tencent | 2024 | DiT + 3D causal VAE + MLLM text encoder | 13B parameters; largest open-weights video model at release |
| [Movie Gen](https://arxiv.org/abs/2410.13720) | Meta | 2024 | 30B DiT, 73K-token context | Joint video + synchronized audio + editing + personalization |
| [Midjourney V7](/wiki/midjourney_v7) | Midjourney, Inc. | 2025 | New architecture (proprietary) | Draft mode, improved coherence |
| [Imagen 3](/wiki/imagen_3) / [Imagen 4](/wiki/imagen_4) | Google DeepMind | 2024 / 2025 | Latent diffusion (details not public) | Production text-to-image on Vertex AI and consumer apps |
| [Veo 3](/wiki/veo_3) | Google DeepMind | 2025 | Diffusion video model with joint audio | Natively synchronized dialogue, SFX, and ambient audio |
| [Sora 2](/wiki/sora_2) | [OpenAI](/wiki/openai) | 2025 | Updated diffusion video model | Improved physics and synchronized audio |
| [FLUX.2](/wiki/flux_2) | [Black Forest Labs](/wiki/black_forest_labs) | 2025 | Rectified flow transformer + Mistral-3 24B VLM | Unified generation and editing |

### DALL-E 2

DALL-E 2, released by [OpenAI](/wiki/openai) in April 2022, uses an approach called **unCLIP**. It consists of a [CLIP](/wiki/clip) text encoder, a prior model that maps CLIP text embeddings to CLIP image embeddings, and a diffusion decoder that generates images conditioned on the image embedding. Two cascaded super-resolution diffusion models upsample the output from 64x64 to 256x256 and then to 1024x1024. The paper, "Hierarchical Text-Conditional Image Generation with CLIP Latents," was authored by Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen.

### Imagen

Imagen, introduced by Chitwan Saharia and colleagues at Google Brain in May 2022, demonstrated that scaling the text encoder (a frozen T5-XXL with 11 billion parameters) improved image quality and text alignment more effectively than scaling the diffusion model itself. Imagen uses a cascade of three diffusion models: a base model generating 64x64 images, and two super-resolution models upsampling to 256x256 and 1024x1024.

### Stable Diffusion

[Stable Diffusion](/wiki/stable_diffusion), first released on August 22, 2022 under the permissive Creative ML OpenRAIL-M license, is the most widely used open-source diffusion model and reached more than 10 million users within roughly two months of launch. [48] Built on the latent diffusion architecture, it operates in the latent space of a VAE using a U-Net (versions 1.x through XL) or a Diffusion Transformer (version 3 onward) as the denoising backbone.

Stable Diffusion's open-source release enabled a vast ecosystem of fine-tuned models, [LoRA](/wiki/lora) adapters, [ControlNet](/wiki/controlnet) extensions, and custom pipelines. [Stable Diffusion 3](/wiki/stable_diffusion_3), released in June 2024, replaced the U-Net with a Multimodal Diffusion Transformer (MMDiT) and adopted rectified flow training, as described by Esser et al. in "Scaling Rectified Flow Transformers for High-Resolution Image Synthesis."[^43] [Stable Diffusion 3.5](/wiki/stable_diffusion_3_5) followed in late 2024.

### Sora

[Sora](/wiki/sora), OpenAI's text-to-video model, was first previewed in February 2024 and released publicly in December 2024. It generates video by denoising spacetime patches in a latent space using a diffusion transformer architecture. Sora can produce up to one minute of 1080p video with coherent motion and scene consistency. The technical report, "Video Generation Models as World Simulators," describes its approach to jointly modeling spatial and temporal dimensions.

## What are diffusion models used for?

### Text-to-image generation

The most prominent application of diffusion models is generating images from text descriptions. Modern systems handle complex multi-object scenes, specific art styles, photorealistic rendering, and even legible text within images. All major commercial systems (DALL-E, Stable Diffusion, [Midjourney](/wiki/midjourney), Imagen, FLUX) operate in this mode.

### Image editing and translation

Diffusion models can transform existing images using text prompts. The **SDEdit** technique starts with a partially noised version of the input image (rather than pure noise) and denoises it according to a new prompt. The amount of initial noise controls the balance between preserving the original image and following the new instruction. This enables style transfer, content modification, and creative editing. **InstructPix2Pix** (Brooks et al., 2023) further extended this by training a diffusion model to follow explicit editing instructions.
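
How the initial noise level trades off fidelity against editability can be sketched with the standard DDPM forward process. The snippet below is a minimal illustration, not the SDEdit authors' code: it assumes a linear beta schedule (1e-4 to 0.02 over 1,000 steps), and the `strength` parameter and function names are hypothetical. The denoiser that would run from timestep `t` back to 0 is omitted.

```python
import numpy as np

def make_alpha_bars(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative products of (1 - beta_t) for an assumed linear beta schedule."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    return np.cumprod(1.0 - betas)

def sdedit_start(image, strength, alpha_bars, rng):
    """Noise `image` up to an intermediate timestep chosen by `strength` in (0, 1].

    Returns (x_t, t): the partially noised image and the timestep index from
    which a denoiser would begin its reverse process.
    """
    num_steps = len(alpha_bars)
    t = max(int(round(strength * num_steps)) - 1, 0)
    noise = rng.standard_normal(image.shape)
    x_t = np.sqrt(alpha_bars[t]) * image + np.sqrt(1.0 - alpha_bars[t]) * noise
    return x_t, t

rng = np.random.default_rng(0)
alpha_bars = make_alpha_bars()
image = rng.uniform(-1.0, 1.0, size=(3, 64, 64))  # stand-in for a real image

for strength in (0.3, 0.8):
    x_t, t = sdedit_start(image, strength, alpha_bars, rng)
    # sqrt(alpha_bar_t) is the factor by which the original image is scaled in x_t.
    print(f"strength={strength}: start at t={t}, signal scale={np.sqrt(alpha_bars[t]):.3f}")
```

A low `strength` leaves most of the original signal in `x_t`, so the result stays close to the input; a high `strength` starts nearly from noise and gives the new prompt more freedom.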

### Inpainting and outpainting

Inpainting fills in masked regions of an image guided by text and surrounding context. Outpainting extends images beyond their original boundaries. Both tasks use the diffusion model's ability to generate content that is contextually consistent with existing pixels.
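
One simple, training-free way to keep generated content consistent with existing pixels is to overwrite the known region at every reverse step with a noised copy of the original image, in the style of RePaint-like samplers. The sketch below shows only that blending step; the function name and mask convention are assumptions made for illustration, and dedicated inpainting checkpoints may instead take the mask as an extra model input.

```python
import numpy as np

def blend_known_region(x_t_generated, original, mask, alpha_bar_t, rng):
    """Combine the sampler's current state with a noised copy of the known pixels.

    mask is 1 where content should be generated and 0 where the original
    image must be preserved. alpha_bar_t is the cumulative signal level at
    the current timestep.
    """
    noise = rng.standard_normal(original.shape)
    known_t = np.sqrt(alpha_bar_t) * original + np.sqrt(1.0 - alpha_bar_t) * noise
    return mask * x_t_generated + (1.0 - mask) * known_t

rng = np.random.default_rng(0)
original = rng.uniform(-1.0, 1.0, size=(8, 8))
mask = np.zeros((8, 8))
mask[2:6, 2:6] = 1.0                          # square hole to fill in
x_t_generated = rng.standard_normal((8, 8))   # stand-in for the sampler's current state

x_t = blend_known_region(x_t_generated, original, mask, alpha_bar_t=0.5, rng=rng)
# Inside the hole the sampler's own values are kept untouched.
assert np.array_equal(x_t[2:6, 2:6], x_t_generated[2:6, 2:6])
```

Because the known pixels are re-noised to the same level as the sample, the denoiser always sees a consistently noised image and can harmonize the hole with its surroundings.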

### Super-resolution

Diffusion models can upsample low-resolution images while adding realistic high-frequency detail. This is used both as a standalone application and within cascaded generation pipelines, where a base model generates a small image that is progressively upsampled by specialized super-resolution diffusion models (as in Imagen and DALL-E 2).

### Text-to-video generation

Diffusion models have been extended to video, where the denoising process operates on sequences of frames or spacetime latent patches. The main challenge is maintaining temporal coherence, with consistent objects and smooth motion across frames.

**Stable Video Diffusion** (Blattmann et al., November 2023) extended the latent diffusion architecture to video by adding temporal layers and identifying three stages of training: text-to-image pretraining, video pretraining on a curated dataset, and high-quality video finetuning.[^30] It became the first widely used open-weights image-to-video model.

[**Sora**](/wiki/sora), previewed by [OpenAI](/wiki/openai) in February 2024 and released in December 2024, generates up to one minute of 1080p video by denoising spacetime patches in a learned latent space using a diffusion transformer. [Sora 2](/wiki/sora_2), released in late 2025, extended these capabilities with improved physical plausibility and synchronized audio.

[**Veo**](/wiki/veo), [**Veo 2**](/wiki/veo_2), and [**Veo 3**](/wiki/veo_3) are Google DeepMind's text-to-video diffusion models. Veo 3, announced at Google I/O in May 2025, was notable for natively generating synchronized audio (dialogue, sound effects, and ambient sound) jointly with video frames in the same diffusion process.

**Movie Gen** (Polyak et al., Meta, October 2024) introduced a 30 billion parameter diffusion transformer trained with a maximum context length of 73K video tokens, corresponding to roughly 16 seconds at 16 fps, and a separate video-to-audio model.[^31] The Movie Gen suite also includes models for video editing and personalization.

[**HunyuanVideo**](/wiki/hunyuan_video) (Tencent, December 2024) is a video model with more than 13 billion parameters whose weights were released under an open license; it combines a DiT backbone, an MLLM-based text encoder, and a 3D causal VAE.[^32] At release it was the largest open-weights video generation model. Other notable systems include [**CogVideoX**](/wiki/cogvideo) (Zhipu AI), [**Wan 2.1**](/wiki/wan_2_1) and [**Wan 2.5**](/wiki/wan_2_5) (Alibaba), [**Kling**](/wiki/kling) (Kuaishou), and [**Runway Gen-3**](/wiki/runway_gen_3) and [**Runway Gen-4**](/wiki/runway_gen_4), the latter two used in commercial film and advertising pipelines.

### Text-to-audio and music generation

**AudioLDM**, introduced by Haohe Liu and colleagues in 2023 (ICML), applies the latent diffusion framework to audio. Using contrastive language-audio pretraining (CLAP) embeddings, it generates speech, sound effects, and music from text descriptions. Other notable audio diffusion systems include **Riffusion** (which generates music through spectrogram diffusion) and various diffusion-based text-to-speech systems.

### 3D object generation

OpenAI's **Point-E** (2022) generates 3D point clouds from text by first producing a synthetic 2D view using a text-to-image diffusion model, then converting it to a 3D point cloud using a second diffusion model. **Shap-E** (2023) improved on this by generating implicit 3D representations ([NeRF](/wiki/nerf) weights and signed distance functions) conditioned on text or images. **TripoSR**, developed by Stability AI and Tripo AI in 2024, uses a feed-forward transformer to produce 3D meshes from single images in under a second on an NVIDIA A100 GPU.

A separate line of work uses 2D image-diffusion priors to optimize 3D representations without 3D training data:

- **DreamFusion** (Ben Poole, Ajay Jain, Jonathan T. Barron, Ben Mildenhall, 2022) introduced **Score Distillation Sampling (SDS)**, a loss based on probability density distillation that lets a frozen pretrained 2D diffusion model act as a prior to optimize the parameters of a [NeRF](/wiki/nerf) (or any differentiable image renderer) via gradient descent.[^33] Each iteration renders 2D views of the current 3D scene, perturbs them with noise, and uses the diffusion model's predicted noise as a gradient signal to update the 3D parameters.
- **Magic3D** (NVIDIA, 2022) extended SDS with a coarse-to-fine pipeline that first optimizes a NeRF and then a textured mesh, producing higher-resolution 3D models.
- **ProlificDreamer** (Wang et al., NeurIPS 2023) introduced **Variational Score Distillation (VSD)**, treating the 3D scene as a random variable rather than a single point estimate.[^34] VSD addressed the over-saturation, over-smoothing, and low-diversity problems of SDS and produced higher-fidelity textured meshes while working with typical classifier-free guidance (CFG) weights (around 7.5) rather than the very large guidance weights SDS usually needs.
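
For reference, the SDS update introduced by DreamFusion can be written compactly. The notation here is chosen for illustration: $\theta$ are the 3D parameters, $x = g(\theta)$ is a rendered view, $x_t = \alpha_t x + \sigma_t \epsilon$ is that view perturbed with Gaussian noise $\epsilon$, $\hat{\epsilon}_\phi$ is the frozen diffusion model's noise prediction for prompt $y$, and $w(t)$ is a timestep weighting.

$$
\nabla_\theta \mathcal{L}_{\mathrm{SDS}}
= \mathbb{E}_{t,\,\epsilon}\!\left[\, w(t)\,\bigl(\hat{\epsilon}_\phi(x_t;\, y,\, t) - \epsilon\bigr)\,\frac{\partial x}{\partial \theta} \right]
$$

The difference between predicted and injected noise acts as a per-pixel gradient on the rendered image, which the renderer's Jacobian $\partial x / \partial \theta$ carries back to the 3D parameters. The diffusion model itself is never differentiated through.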

[**Hunyuan 3D**](/wiki/hunyuan_3d) (Tencent, 2024 to 2025) is an example of a more recent open-weights image-to-3D diffusion system that operates directly on 3D shape latents rather than relying on 2D distillation.

### Molecular generation and drug discovery

Diffusion models have found significant applications in computational chemistry and drug design. **DiffDock**, introduced by Gabriele Corso, Hannes Stark, Bowen Jing, Regina Barzilay, and Tommi Jaakkola (2022), frames molecular docking as a generative modeling problem, using diffusion over translations, rotations, and torsion angles to predict how small molecules bind to protein targets. DiffDock achieved a 38.2% top-1 success rate on the PDBBind benchmark, where success means a predicted ligand pose with RMSD below 2 angstroms, outperforming traditional docking methods. DiffDock-L, released in February 2024, further improved performance and generalization.

Other diffusion-based molecular generation systems include **PMDM** for structure-based drug design and various models for generating novel molecular geometries with specified physicochemical properties.

### Protein structure and design

In structural biology, [AlphaFold](/wiki/alphafold) 3 (published in Nature, 2024) incorporates a diffusion-based module for predicting the structures of protein complexes, ligand-protein interactions, and nucleic acid structures. Diffusion models for protein design can generate novel protein sequences and structures with desired functional properties, with applications in drug development, vaccine research, and enzyme engineering.

### Robotic control (Diffusion Policy)

**Diffusion Policy**, introduced by Cheng Chi, Zhenjia Xu, Siyuan Feng, and colleagues at Columbia University in 2023, applies diffusion models to visuomotor policy learning for robots. Instead of generating images, the diffusion process generates sequences of robot actions conditioned on visual observations. On benchmarks spanning 15 robot manipulation tasks, Diffusion Policy outperformed prior methods by an average of 46.9%.

Research in this area has expanded rapidly: as of 2025, diffusion-based policies have been applied to dexterous manipulation, long-horizon planning, and multi-modal input integration (combining point clouds with natural language instructions). Flow-matching-based variants have also emerged, incorporating second-order dynamics for smoother trajectories.

### Personalization and subject-driven generation

A line of work allows users to teach a pretrained diffusion model a new visual concept (a specific person, object, or style) from a small number of reference images:

- **Textual Inversion** (Gal et al., ICLR 2023) freezes the entire diffusion model and learns a single new embedding vector in the text encoder's vocabulary that represents the target concept, given only 3 to 5 reference images.[^35] The new "word" can then be composed with arbitrary text prompts.
- **DreamBooth** (Ruiz et al., CVPR 2023) fine-tunes the diffusion model itself to associate a rare unique identifier (for example "sks dog") with the target subject, using a class-specific prior preservation loss to avoid catastrophic forgetting of the broader class concept.[^36] It typically requires several minutes of fine-tuning on consumer hardware to teach a single subject.
- [**LoRA**](/wiki/lora) (low-rank adaptation) fine-tunes only small low-rank updates to the cross-attention or other weight matrices, producing compact (typically 10 to 200 MB) adapter files that can be mixed and matched at inference time. LoRA adapters are widely used to teach diffusion models specific characters, art styles, and concepts.
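
The low-rank update behind LoRA can be sketched in a few lines. This is a minimal illustration under stated assumptions, not any library's implementation: the layer sizes (a hypothetical 768-to-320 projection), the rank, and the `alpha / rank` scaling convention are chosen for the example.

```python
import numpy as np

def lora_forward(x, W, A, B, alpha):
    """Apply a frozen weight W plus a scaled low-rank update B @ A."""
    rank = A.shape[0]
    return x @ (W + (alpha / rank) * (B @ A)).T

d_out, d_in, rank = 320, 768, 4               # illustrative sizes only
rng = np.random.default_rng(0)
W = rng.standard_normal((d_out, d_in))        # frozen pretrained weight
A = rng.standard_normal((rank, d_in)) * 0.01  # trainable, small random init
B = np.zeros((d_out, rank))                   # trainable, zero init

x = rng.standard_normal((2, d_in))
# With B = 0 the adapter is a no-op, so fine-tuning starts from the base model.
assert np.allclose(lora_forward(x, W, A, B, alpha=4.0), x @ W.T)

full_params = W.size
lora_params = A.size + B.size
print(full_params, lora_params)  # 245760 4352
```

Only `A` and `B` are trained and stored, which is why adapter files stay small relative to the base checkpoint and can be swapped at inference time.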

### Controllable generation extensions

Several methods add spatial or reference-image control beyond text:

| Method | Description | Control input |
|---|---|---|
| [ControlNet](/wiki/controlnet) | Adds conditional control to pretrained diffusion models via a zero-initialized trainable copy of the encoder; introduced by Zhang, Rao, and Agrawala at ICCV 2023[^37] | Edge maps, depth maps, pose skeletons, segmentation maps |
| IP-Adapter | Decoupled cross-attention for image prompts; 22M parameters, plugs into existing checkpoints; Ye et al. (Tencent AI Lab), 2023[^38] | Reference images for style or content |
| T2I-Adapter | Lightweight spatial conditioning alternative | Sketch, color, depth inputs |
| [LoRA](/wiki/lora) | Low-rank adaptation fine-tuning | Custom concepts, styles, or subjects with minimal data |

## Can diffusion models generate text?

While diffusion models originated in continuous data domains like images, a parallel line of work extends the diffusion framework to discrete text generation, creating a new category sometimes called **diffusion language models (d-LLMs)**. Because tokens are discrete symbols rather than continuous vectors, applying diffusion to text requires either embedding tokens into a continuous space and adding Gaussian noise (continuous diffusion) or defining a forward process directly on discrete tokens, typically through progressive masking or transitions between vocabulary items (discrete diffusion).

### Discrete diffusion foundations

**D3PM** (Jacob Austin, Daniel D. Johnson, Jonathan Ho, Daniel Tarlow, and Rianne van den Berg, NeurIPS 2021) generalized DDPM to discrete data by defining the forward process via transition matrices over a vocabulary.[^39] Special choices of transition matrix (uniform, absorbing-state, nearest-neighbor in embedding space) recover or connect to existing approaches, including mask-based and autoregressive models.
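
Two of the transition-matrix choices mentioned above can be made concrete with a small sketch. The vocabulary size, the constant `beta`, and the function names are illustrative assumptions; rows index the previous token and columns the next token.

```python
import numpy as np

def uniform_transition(vocab_size, beta):
    """Keep the token with prob. 1 - beta, else resample uniformly from the vocabulary."""
    return (1.0 - beta) * np.eye(vocab_size) + beta / vocab_size

def absorbing_transition(vocab_size, beta, mask_id):
    """Keep the token with prob. 1 - beta, else jump to [MASK], which is never left."""
    Q = (1.0 - beta) * np.eye(vocab_size)
    Q[:, mask_id] += beta
    return Q

vocab_size, mask_id, beta = 5, 4, 0.1
Q_uniform = uniform_transition(vocab_size, beta)
Q_absorb = absorbing_transition(vocab_size, beta, mask_id)

# Row i holds q(x_t = j | x_{t-1} = i), so every row sums to 1.
assert np.allclose(Q_uniform.sum(axis=1), 1.0)
assert np.allclose(Q_absorb.sum(axis=1), 1.0)

# Multi-step corruption is a matrix product; with a constant beta it is a matrix power.
Q_bar = np.linalg.matrix_power(Q_absorb, 10)
print(Q_bar[0, 0])        # chance token 0 is still intact after 10 steps: 0.9 ** 10
print(Q_bar[0, mask_id])  # chance it has been masked: 1 - 0.9 ** 10
```

The absorbing-state matrix is the one that connects discrete diffusion to mask-based language models: once a token becomes `[MASK]` it stays masked, so the fully corrupted sequence is all masks.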

**SEDD** (Aaron Lou, Chenlin Meng, Stefano Ermon, ICML 2024 Best Paper) introduced **score entropy**, a loss function that extends score matching to discrete spaces by modeling the ratios of the data distribution rather than its absolute density.[^40] SEDD reduced perplexity by 25 to 75 percent relative to prior discrete diffusion approaches and was competitive with similarly sized GPT-2 models while supporting controllable infilling without left-to-right constraints.

### Masked Diffusion Language Models (MDLM)

**MDLM**, introduced by Subham Sekhar Sahoo, Marianne Arriola, Yair Schiff, Aaron Gokaslan, Edgar Marroquin, Justin T. Chiu, Alexander Rush, and Volodymyr Kuleshov (NeurIPS 2024), showed that simple masked discrete diffusion is more effective than previously believed.[^22] The model corrupts text by progressively masking tokens (a discrete analog of adding noise) and learns to predict the masked tokens conditioned on the remaining ones. MDLM demonstrated that with an effective training recipe and a simplified Rao-Blackwellized objective, masked diffusion models can approach autoregressive model quality on language benchmarks.
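
The masking corruption described above can be sketched as follows. This is a simplified illustration rather than the MDLM training code: it assumes a linear schedule in which the masking probability equals the time `t` in [0, 1], and `MASK_ID` is a hypothetical reserved token id.

```python
import numpy as np

MASK_ID = -1  # hypothetical id reserved for the [MASK] token

def mask_tokens(tokens, t, rng):
    """Forward process: mask each token independently with probability t."""
    tokens = np.asarray(tokens)
    is_masked = rng.random(tokens.shape) < t
    return np.where(is_masked, MASK_ID, tokens), is_masked

rng = np.random.default_rng(0)
tokens = np.array([12, 7, 99, 3, 41, 5, 18, 64])

for t in (0.25, 0.75):
    noisy, is_masked = mask_tokens(tokens, t, rng)
    # The denoiser would see `noisy` and be trained with cross-entropy
    # only on the positions where `is_masked` is True.
    print(f"t={t}: {noisy.tolist()}")
```

At `t = 0` the sequence is untouched and at `t = 1` it is fully masked, mirroring the clean-data and pure-noise endpoints of continuous diffusion.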

### LLaDA

**LLaDA** (Large Language Diffusion Models, Nie et al., Renmin University of China, February 2025) is an 8 billion parameter masked-diffusion language model trained from scratch on 2.3 trillion tokens with 0.13 million H800 GPU hours, followed by supervised fine-tuning on 4.5 million pairs.[^41] Reported to be competitive with LLaMA 3 8B on standard benchmarks, LLaDA was notable for addressing the so-called "reversal curse" (the asymmetry of autoregressive models when prompted with reversed information), in part because masked diffusion conditions on bidirectional context. The follow-up **LLaDA-V** (2025) extends the framework with visual instruction tuning for multimodal use.

### Mercury

**Mercury**, developed by Inception Labs, is described in their technical report as the first commercial-scale diffusion [LLM](/wiki/llm) family.[^23] Mercury Coder Mini and Mercury Coder Small achieve throughputs of 1,109 and 737 tokens per second respectively on NVIDIA H100 GPUs, up to 10 times the throughput of speed-optimized autoregressive models, while maintaining comparable quality. On the Copilot Arena coding benchmark, Mercury Coder ranked second in quality and was the fastest model overall.

The speed advantage of d-LLMs comes from their ability to generate or refine multiple tokens in parallel, rather than sequentially as in autoregressive models. **Mercury 2**, announced in February 2026, achieves approximately 1,000 tokens per second output throughput with reasoning capabilities.

### Gemini Diffusion

**Gemini Diffusion**, announced by Google DeepMind at Google I/O on May 20, 2025, is an experimental text-diffusion language model that generates content by iteratively refining noise into coherent text or code rather than predicting one token at a time. Google reported throughputs of roughly 1,000 to 2,000 tokens per second, several times faster than the company's then-fastest production [Gemini](/wiki/gemini) model, with comparable coding and reasoning performance. The model was initially released as a wait-listed demo.

### Block Diffusion

**Block Diffusion** (ICLR 2025, oral), from Cornell University researchers, introduces a semi-autoregressive approach that generates blocks of tokens from left to right while allowing diffusion-based unmasking within each block. This combines the sequential coherence of autoregressive generation with the parallelism of diffusion.

## Acceleration and distillation

### Progressive distillation

Progressive distillation trains a student model to match the output of two teacher steps in a single step, repeatedly halving the number of required steps. After several rounds, the student can generate high-quality images in 4 to 8 steps.
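
The step-halving arithmetic and the shape of the student's training target can be sketched as below. The starting count of 1,024 teacher steps and both function names are assumptions for illustration; `teacher_step` stands for any deterministic sampler step (for example a DDIM update), and the real method defines its regression target in a particular prediction space that this sketch glosses over.

```python
def halving_schedule(teacher_steps, target_steps):
    """Sampler step counts after each distillation round (each round halves them)."""
    if target_steps < 1 or teacher_steps < target_steps:
        raise ValueError("need teacher_steps >= target_steps >= 1")
    schedule = [teacher_steps]
    while schedule[-1] // 2 >= target_steps:
        schedule.append(schedule[-1] // 2)
    return schedule

def two_step_target(teacher_step, x, t, t_next):
    """Where two teacher steps (t -> midpoint -> t_next) land; the student
    is trained to reach this point in a single step from t to t_next."""
    t_mid = (t + t_next) / 2
    return teacher_step(teacher_step(x, t, t_mid), t_mid, t_next)

schedule = halving_schedule(teacher_steps=1024, target_steps=4)
print(schedule)            # [1024, 512, 256, 128, 64, 32, 16, 8, 4]
print(len(schedule) - 1)   # 8 rounds of distillation
```

After each round the student becomes the teacher for the next one, so the cost of distillation grows with the logarithm of the step reduction.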

### Adversarial distillation

Adversarial distillation uses a GAN-like discriminator to train a few-step generator from a pretrained diffusion teacher. Notable examples include **SDXL Turbo** from Stability AI (single-step generation at 512x512) and **SDXL-Lightning** from ByteDance (high quality in 2 to 4 steps).

### Distribution matching distillation

Distribution matching distillation minimizes the distributional distance between the teacher's multi-step output and the student's single-step output. This approach has been used to create fast variants of several production models.

## Reward fine-tuning and alignment

As diffusion models have become production tools, methods originally developed for aligning language models have been adapted to fine-tune them on human preference data, aesthetic reward models, and prompt-following signals:

- **ReFL** (Reward Feedback Learning, Xu et al., 2023) fine-tunes the denoiser by backpropagating reward-model gradients through a late sampling step, while **DPOK** (Diffusion Policy Optimization with KL regularization, Fan et al., 2023) treats the diffusion sampling process as a multi-step decision problem and optimizes it with policy-gradient reinforcement learning.
- **Diffusion-DPO** (Wallace et al., 2023) adapts [Direct Preference Optimization (DPO)](/wiki/direct_preference_optimization_dpo) from language model alignment to diffusion models, reformulating the DPO objective in terms of the diffusion evidence lower bound.[^42] Fine-tuned on the Pick-a-Pic dataset of 851K crowdsourced pairwise preferences, Diffusion-DPO produced an SDXL variant that significantly outperformed the base SDXL in human evaluations of visual appeal and prompt alignment.
- **Score Identity Distillation (SiD)** and related methods combine distillation with reward signals so that a fast student matches both the teacher distribution and an external reward model.
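
The preference objective used by Diffusion-DPO can be illustrated with a small sketch. It follows the simplified per-timestep form of the loss, which compares how much the fine-tuned model improves its denoising error over a frozen reference on the preferred image versus the rejected one. Folding the timestep weighting and other constants into a single `beta` is an assumption made for brevity, and the function names are hypothetical.

```python
import math

def softplus(z):
    """Numerically stable log(1 + exp(z))."""
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))

def diffusion_dpo_loss(err_w, err_w_ref, err_l, err_l_ref, beta):
    """Per-pair loss from denoising errors on the preferred (w) and rejected (l) images.

    err_*     : squared noise-prediction error of the model being fine-tuned
    err_*_ref : the same error under the frozen reference model
    beta      : strength of the implicit KL penalty (timestep weighting folded in)
    """
    improvement_gap = (err_w - err_w_ref) - (err_l - err_l_ref)
    return softplus(beta * improvement_gap)  # equals -log(sigmoid(-beta * gap))

# Identical to the reference model: the loss sits at log(2).
print(diffusion_dpo_loss(1.0, 1.0, 1.0, 1.0, beta=5.0))
# Denoising the preferred image better than the reference lowers the loss.
print(diffusion_dpo_loss(0.8, 1.0, 1.0, 1.0, beta=5.0))
```

The loss rewards the model for lowering its error on preferred images relative to rejected ones, while the reference terms keep it anchored to the pretrained distribution.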

Methods of this kind are reportedly used by commercial systems such as DALL-E 3 and Midjourney to improve prompt-following and aesthetic quality beyond what raw pretraining can achieve, although the exact recipes are typically proprietary.

## How do diffusion models differ from GANs and VAEs?

| Feature | Diffusion models | [GANs](/wiki/gan) | [VAEs](/wiki/variational_autoencoder) | Flow-based models |
|---|---|---|---|---|
| Training stability | Stable; single network trained with MSE loss | Unstable; requires balancing generator and discriminator | Stable; trained with ELBO | Stable; trained with exact log-likelihood |
| Sample quality | State of the art for images and video | High quality but prone to artifacts | Often blurry due to pixel-level reconstruction loss | Good but generally below diffusion and GANs |
| Sample diversity | High; good mode coverage | Susceptible to mode collapse | High diversity by design | High diversity |
| Generation speed | Slow (many iterative steps); accelerable with distillation | Fast (single forward pass) | Fast (single decoder pass) | Fast (single pass through invertible layers) |
| Likelihood estimation | Approximate (via variational bound) | Not available | Approximate (ELBO) | Exact (change of variables) |
| Conditioning | Flexible via CFG and cross-attention | Requires conditional architectures | Conditional VAE variants | Conditional flow variants |

Diffusion models have largely replaced GANs as the preferred approach for high-quality image generation. The 2021 "Diffusion Models Beat GANs on Image Synthesis" paper was the first to report image sample quality "superior to the current state-of-the-art generative models" on ImageNet, a result the field treats as the moment diffusion overtook adversarial methods. [46] GANs remain useful for real-time applications and are sometimes used as discriminators or for distilling diffusion models into faster single-step generators. VAEs continue to play a supporting role as the encoder-decoder framework in latent diffusion architectures.

## Limitations

Despite their strong performance, diffusion models have several known limitations:

- **Generation speed.** Even with acceleration techniques, diffusion models are slower than single-pass methods like GANs. Real-time generation remains challenging for high-resolution outputs, though distillation methods have narrowed this gap considerably.
- **Computational cost.** Training large diffusion models requires substantial GPU resources. Stable Diffusion XL was trained on clusters of hundreds of A100 GPUs, and larger models like FLUX.1 (12 billion parameters) require even more compute.
- **Anatomical and compositional artifacts.** Current models still sometimes produce anatomically incorrect human hands, garbled text, and errors in complex multi-object scenes with spatial relationships, though each generation of models has improved on these issues.
- **Memorization and copyright concerns.** Studies have shown that diffusion models can sometimes reproduce near-copies of training images, raising copyright and privacy concerns. This is particularly relevant for models trained on large web-scraped datasets.
- **Evaluation gaps.** Standard metrics like FID and CLIP score do not fully capture perceptual quality, prompt alignment, or artifact presence. Human evaluation remains important but is expensive and subjective.
- **Discrete data challenges.** Applying diffusion to discrete domains like text requires workarounds such as embedding into continuous space or using masked or absorbing-state discrete diffusion. While diffusion language models such as SEDD, MDLM, LLaDA, Mercury Coder, and Gemini Diffusion have closed much of the historical gap with autoregressive models on coding and language benchmarks, frontier general-purpose chat performance is still dominated by autoregressive [LLMs](/wiki/llm).

## See also

- [Visual Autoregressive modeling (VAR)](/wiki/visual_autoregressive_modeling)
- [Masked Autoregressive (MAR) generation](/wiki/masked_autoregressive_model)
- [Diffusion Forcing](/wiki/diffusion_forcing)

## References

1. Sohl-Dickstein, J., Weiss, E., Maheswaranathan, N., & Ganguli, S. (2015). "Deep Unsupervised Learning using Nonequilibrium Thermodynamics." Proceedings of the 32nd International Conference on Machine Learning (ICML). https://arxiv.org/abs/1503.03585

2. Ho, J., Jain, A., & Abbeel, P. (2020). "Denoising Diffusion Probabilistic Models." Advances in Neural Information Processing Systems 33 (NeurIPS). https://arxiv.org/abs/2006.11239

3. Song, Y., & Ermon, S. (2019). "Generative Modeling by Estimating Gradients of the Data Distribution." Advances in Neural Information Processing Systems 32 (NeurIPS). https://arxiv.org/abs/1907.05600

4. Song, J., Meng, C., & Ermon, S. (2020). "Denoising Diffusion Implicit Models." International Conference on Learning Representations (ICLR 2021). https://arxiv.org/abs/2010.02502

5. Song, Y., Sohl-Dickstein, J., Kingma, D.P., Kumar, A., Ermon, S., & Poole, B. (2021). "Score-Based Generative Modeling through Stochastic Differential Equations." International Conference on Learning Representations (ICLR 2021). https://arxiv.org/abs/2011.13456

6. Dhariwal, P. & Nichol, A. (2021). "Diffusion Models Beat GANs on Image Synthesis." Advances in Neural Information Processing Systems 34 (NeurIPS). https://arxiv.org/abs/2105.05233

7. Ho, J. & Salimans, T. (2022). "Classifier-Free Diffusion Guidance." NeurIPS 2021 Workshop on Deep Generative Models and Downstream Applications. https://arxiv.org/abs/2207.12598

8. Rombach, R., Blattmann, A., Lorenz, D., Esser, P., & Ommer, B. (2022). "High-Resolution Image Synthesis with Latent Diffusion Models." IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2022). https://arxiv.org/abs/2112.10752

9. Ramesh, A., Dhariwal, P., Nichol, A., Chu, C., & Chen, M. (2022). "Hierarchical Text-Conditional Image Generation with CLIP Latents." https://arxiv.org/abs/2204.06125

10. Saharia, C., Chan, W., Saxena, S., et al. (2022). "Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding." Advances in Neural Information Processing Systems 35 (NeurIPS). https://arxiv.org/abs/2205.11487

11. Peebles, W. & Xie, S. (2023). "Scalable Diffusion Models with Transformers." IEEE/CVF International Conference on Computer Vision (ICCV 2023). https://arxiv.org/abs/2212.09748

12. Song, Y., Dhariwal, P., Chen, M., & Sutskever, I. (2023). "Consistency Models." Proceedings of the 40th International Conference on Machine Learning (ICML). https://arxiv.org/abs/2303.01469

13. Betker, J., et al. (2023). "Improving Image Generation with Better Captions." OpenAI. https://cdn.openai.com/papers/dall-e-3.pdf

14. Hyvarinen, A. (2005). "Estimation of Non-Normalized Statistical Models by Score Matching." Journal of Machine Learning Research, 6, 695-709.

15. Vincent, P. (2011). "A Connection Between Score Matching and Denoising Autoencoders." Neural Computation, 23(7), 1661-1674.

16. Liu, H., et al. (2023). "AudioLDM: Text-to-Audio Generation with Latent Diffusion Models." Proceedings of the 40th International Conference on Machine Learning (ICML). https://arxiv.org/abs/2301.12503

17. Nichol, A. & Dhariwal, P. (2021). "Improved Denoising Diffusion Probabilistic Models." Proceedings of the 38th International Conference on Machine Learning (ICML). https://arxiv.org/abs/2102.09672

18. Brooks, T., et al. (2024). "Video Generation Models as World Simulators." OpenAI Technical Report. https://openai.com/index/video-generation-models-as-world-simulators/

19. Corso, G., Stark, H., Jing, B., Barzilay, R., & Jaakkola, T. (2022). "DiffDock: Diffusion Steps, Twists, and Turns for Molecular Docking." International Conference on Learning Representations (ICLR 2023). https://arxiv.org/abs/2210.01776

20. Chi, C., Xu, Z., Feng, S., Cousineau, E., Du, Y., Burchfiel, B., Tedrake, R., & Song, S. (2023). "Diffusion Policy: Visuomotor Policy Learning via Action Diffusion." Robotics: Science and Systems (RSS 2023). https://diffusion-policy.cs.columbia.edu/

21. Lu, C., Zhou, Y., Bao, F., Chen, J., Li, C., & Zhu, J. (2022). "DPM-Solver: A Fast ODE Solver for Diffusion Probabilistic Model Sampling in Around 10 Steps." Advances in Neural Information Processing Systems 35 (NeurIPS). https://arxiv.org/abs/2206.00927

22. Sahoo, S.S., Arriola, M., Schiff, Y., Gokaslan, A., Marroquin, E., Chiu, J.T., Rush, A., & Kuleshov, V. (2024). "Simple and Effective Masked Diffusion Language Models." Advances in Neural Information Processing Systems 37 (NeurIPS). https://arxiv.org/abs/2406.07524

23. Inception Labs. (2025). "Mercury: Ultra-Fast Language Models Based on Diffusion." https://arxiv.org/abs/2506.17298

24. Abramson, J., Adler, J., Dunger, J., et al. (2024). "Accurate structure prediction of biomolecular interactions with AlphaFold 3." Nature, 630, 493-500.

25. Liu, X., Gong, C., & Liu, Q. (2022). "Flow Straight and Fast: Learning to Generate and Transfer Data with Rectified Flow." International Conference on Learning Representations (ICLR 2023, Spotlight). https://arxiv.org/abs/2209.03003

26. Luo, S., Tan, Y., Huang, L., Li, J., & Zhao, H. (2023). "Latent Consistency Models: Synthesizing High-Resolution Images with Few-Step Inference." https://arxiv.org/abs/2310.04378

27. Karras, T., Aittala, M., Aila, T., & Laine, S. (2022). "Elucidating the Design Space of Diffusion-Based Generative Models." Advances in Neural Information Processing Systems 35 (NeurIPS 2022). https://arxiv.org/abs/2206.00364

28. Lipman, Y., Chen, R.T.Q., Ben-Hamu, H., Nickel, M., & Le, M. (2022). "Flow Matching for Generative Modeling." International Conference on Learning Representations (ICLR 2023). https://arxiv.org/abs/2210.02747

29. Albergo, M.S., Boffi, N.M., & Vanden-Eijnden, E. (2023). "Stochastic Interpolants: A Unifying Framework for Flows and Diffusions." https://arxiv.org/abs/2303.08797

30. Blattmann, A., Dockhorn, T., Kulal, S., et al. (2023). "Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets." https://arxiv.org/abs/2311.15127

31. Polyak, A., et al. (2024). "Movie Gen: A Cast of Media Foundation Models." Meta AI. https://arxiv.org/abs/2410.13720

32. Kong, W., et al. (2024). "HunyuanVideo: A Systematic Framework For Large Video Generative Models." Tencent. https://arxiv.org/abs/2412.03603

33. Poole, B., Jain, A., Barron, J.T., & Mildenhall, B. (2022). "DreamFusion: Text-to-3D using 2D Diffusion." International Conference on Learning Representations (ICLR 2023). https://arxiv.org/abs/2209.14988

34. Wang, Z., Lu, C., Wang, Y., Bao, F., Li, C., Su, H., & Zhu, J. (2023). "ProlificDreamer: High-Fidelity and Diverse Text-to-3D Generation with Variational Score Distillation." Advances in Neural Information Processing Systems 36 (NeurIPS 2023, Spotlight). https://arxiv.org/abs/2305.16213

35. Gal, R., Alaluf, Y., Atzmon, Y., Patashnik, O., Bermano, A.H., Chechik, G., & Cohen-Or, D. (2022). "An Image is Worth One Word: Personalizing Text-to-Image Generation using Textual Inversion." International Conference on Learning Representations (ICLR 2023). https://arxiv.org/abs/2208.01618

36. Ruiz, N., Li, Y., Jampani, V., Pritch, Y., Rubinstein, M., & Aberman, K. (2022). "DreamBooth: Fine Tuning Text-to-Image Diffusion Models for Subject-Driven Generation." IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2023). https://arxiv.org/abs/2208.12242

37. Zhang, L., Rao, A., & Agrawala, M. (2023). "Adding Conditional Control to Text-to-Image Diffusion Models." IEEE/CVF International Conference on Computer Vision (ICCV 2023). https://arxiv.org/abs/2302.05543

38. Ye, H., Zhang, J., Liu, S., Han, X., & Yang, W. (2023). "IP-Adapter: Text Compatible Image Prompt Adapter for Text-to-Image Diffusion Models." Tencent AI Lab. https://arxiv.org/abs/2308.06721

39. Austin, J., Johnson, D.D., Ho, J., Tarlow, D., & van den Berg, R. (2021). "Structured Denoising Diffusion Models in Discrete State-Spaces." Advances in Neural Information Processing Systems 34 (NeurIPS 2021). https://arxiv.org/abs/2107.03006

40. Lou, A., Meng, C., & Ermon, S. (2023). "Discrete Diffusion Modeling by Estimating the Ratios of the Data Distribution." Proceedings of the 41st International Conference on Machine Learning (ICML 2024, Best Paper). https://arxiv.org/abs/2310.16834

41. Nie, S., Zhu, F., et al. (2025). "Large Language Diffusion Models." https://arxiv.org/abs/2502.09992

42. Wallace, B., Dang, M., Rafailov, R., Zhou, L., Lou, A., Purushwalkam, S., Ermon, S., Xiong, C., Joty, S., & Naik, N. (2023). "Diffusion Model Alignment Using Direct Preference Optimization." https://arxiv.org/abs/2311.12908

43. Esser, P., Kulal, S., Blattmann, A., et al. (2024). "Scaling Rectified Flow Transformers for High-Resolution Image Synthesis." International Conference on Machine Learning (ICML 2024). https://arxiv.org/abs/2403.03206

44. Ho, J., Jain, A., & Abbeel, P. (2020). "Denoising Diffusion Probabilistic Models." Advances in Neural Information Processing Systems 33 (NeurIPS 2020). Abstract: Inception score 9.46 and state-of-the-art FID 3.17 on unconditional CIFAR-10; sample quality similar to ProgressiveGAN on 256x256 LSUN. https://arxiv.org/abs/2006.11239

45. Sohl-Dickstein, J., Weiss, E., Maheswaranathan, N., & Ganguli, S. (2015). "Deep Unsupervised Learning using Nonequilibrium Thermodynamics." Proceedings of the 32nd International Conference on Machine Learning (ICML 2015). Abstract: "The essential idea, inspired by non-equilibrium statistical physics, is to systematically and slowly destroy structure in a data distribution through an iterative forward diffusion process." https://arxiv.org/abs/1503.03585

46. Dhariwal, P., & Nichol, A. (2021). "Diffusion Models Beat GANs on Image Synthesis." Advances in Neural Information Processing Systems 34 (NeurIPS 2021). Reported FID of 2.97 (ImageNet 128x128), 4.59 (256x256), and 7.72 (512x512). https://arxiv.org/abs/2105.05233

47. Rombach, R., Blattmann, A., Lorenz, D., Esser, P., & Ommer, B. (2022). "High-Resolution Image Synthesis with Latent Diffusion Models." IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2022). Abstract: "To enable DM training on limited computational resources while retaining their quality and flexibility, we apply them in the latent space of powerful pretrained autoencoders." https://arxiv.org/abs/2112.10752

48. Stability AI. (2022-2023). "Stable Diffusion Public Release" (Aug 22, 2022, Creative ML OpenRAIL-M license) and "Celebrating one year of Stable Diffusion" (more than 10 million users globally in two months; nearly 270,000 members on the Stable Diffusion Discord). https://stability.ai/news/stable-diffusion-public-release