# DDPM

> Source: https://aiwiki.ai/wiki/ddpm
> Updated: 2026-07-11
> Categories: Deep Learning, Generative AI, Machine Learning
> License: CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/)
> From AI Wiki (https://aiwiki.ai), the free encyclopedia of artificial intelligence. Reuse freely with attribution to "AI Wiki (aiwiki.ai)".

**Denoising Diffusion Probabilistic Models (DDPM)** are a class of [generative model](/wiki/generative_model) introduced by Jonathan Ho, Ajay Jain, and Pieter Abbeel of UC Berkeley in their June 2020 paper "Denoising Diffusion Probabilistic Models" (arXiv:2006.11239)[^1]. DDPMs learn to produce data samples by reversing a gradual noising process: a fixed forward [Markov chain](/wiki/markov_chain) progressively corrupts a data point with Gaussian noise over many timesteps, and a learned reverse chain, parameterized by a neural network, is trained to subtract that noise step by step until a clean sample is recovered. The work was foundational because it demonstrated, for the first time at scale, that diffusion-based generation could match the image quality of state-of-the-art [generative adversarial networks](/wiki/generative_adversarial_network) (GANs), reporting an unconditional CIFAR-10 [FID](/wiki/frechet_inception_distance) of 3.17 and an Inception Score of 9.46[^1]. The paper's abstract states plainly: "We present high quality image synthesis results using diffusion probabilistic models, a class of latent variable models inspired by considerations from nonequilibrium thermodynamics."[^1]

DDPM combined and refined ideas from Sohl-Dickstein et al.'s 2015 nonequilibrium thermodynamics framework[^2] and Song & Ermon's 2019 score-matching with Langevin dynamics[^3] into a simple, stable training recipe: a [U-Net](/wiki/u_net) denoiser trained with a mean-squared-error loss to predict the noise added to each training example. That recipe became the template for nearly every major image, audio, video, and 3D diffusion model that followed, including [DDIM](/wiki/ddim), [classifier-free guidance](/wiki/classifier_free_guidance), [Stable Diffusion](/wiki/stable_diffusion), [DALL-E 2](/wiki/dalle_2), and [Imagen](/wiki/imagen). As of 2026, the DDPM paper remains one of the most influential generative modeling publications of the deep learning era, cited tens of thousands of times[^1][^4].

## Background

### Sohl-Dickstein's 2015 framework

The conceptual origin of diffusion-based generative models is the 2015 paper "Deep Unsupervised Learning using Nonequilibrium Thermodynamics" by Jascha Sohl-Dickstein, Eric A. Weiss, Niru Maheswaranathan, and Surya Ganguli (arXiv:1503.03585)[^2]. Drawing on ideas from non-equilibrium statistical physics, the authors proposed to "systematically and slowly destroy structure in a data distribution through an iterative forward diffusion process" and then learn a reverse diffusion process that restores structure, yielding a flexible and tractable generative model[^2]. The framework supported sampling, likelihood evaluation, and conditional inference, but the sample quality reported in 2015 was modest compared to contemporaneous GANs and [variational autoencoders](/wiki/variational_autoencoder) (VAEs), and the work did not attract sustained attention from the broader research community for several years.

### GAN dominance and the search for stable alternatives

Between 2014 and 2019, image generation was dominated by [generative adversarial networks](/wiki/gan), following Goodfellow et al.'s 2014 introduction of the GAN framework. Architectures such as DCGAN, Progressive GAN, StyleGAN, and BigGAN produced increasingly photorealistic samples, especially after the latter's class-conditional ImageNet results in 2018. However, GAN training was notoriously unstable: the alternating optimization of generator and discriminator was sensitive to hyperparameters, prone to mode collapse, and difficult to scale reliably. VAEs offered stable training but produced visibly blurrier samples. The field thus had an open need for a likelihood-based generative model that combined GAN-level fidelity with VAE-level training stability.

### Score-based models and Langevin dynamics

In parallel, Yang Song and Stefano Ermon's 2019 paper "Generative Modeling by Estimating Gradients of the Data Distribution" (arXiv:1907.05600) introduced Noise Conditional Score Networks (NCSN)[^3]. Instead of learning a density directly, NCSN learned the *score function* (the gradient of the log probability density) at multiple noise levels, and then used annealed Langevin dynamics to sample by following the score from high noise back to clean data. This score-matching approach independently arrived at many of the same structural ideas that DDPM would crystallize: a noise hierarchy, a denoising network parameterized by noise level, and an iterative sampling procedure. The DDPM and score-matching threads were later unified in a continuous-time stochastic differential equation framework by Song et al. (2021)[^5].

## When was DDPM published and who created it?

The DDPM paper had three authors, all affiliated with UC Berkeley at the time of publication[^1]:

- **Jonathan Ho** was a PhD student at UC Berkeley working with Pieter Abbeel. He later moved to Google Research, where he co-led work on classifier-free guidance, Imagen, video diffusion models, and progressive distillation. Ho's name appears on many of the most influential diffusion papers of the early 2020s.
- **Ajay Jain** was also a UC Berkeley graduate student. He subsequently worked on text-to-3D generation (DreamFusion) and co-founded Genmo, a video diffusion startup.
- **Pieter Abbeel** is a professor at UC Berkeley, co-founder of Covariant, and a prominent figure in deep reinforcement learning and robotics. The DDPM paper sits somewhat outside his primary research stream, reflecting the breadth of generative modeling work emerging from the Berkeley AI Research lab in the late 2010s.

The paper was published at NeurIPS 2020 (the 34th Conference on Neural Information Processing Systems) and the preprint was posted to arXiv on 19 June 2020[^1].

## What is the forward process in DDPM?

The forward (or diffusion) process is a fixed [Markov chain](/wiki/markov_chain) that takes a data sample $$x_0$$ drawn from the real distribution $$q(x_0)$$ and produces a sequence $$x_1, x_2, \ldots, x_T$$ of increasingly noisy versions. At each step, isotropic Gaussian noise is added according to a predefined variance schedule $$\{\beta_1, \beta_2, \ldots, \beta_T\}$$:

$$q(x_t \mid x_{t-1}) = \mathcal{N}(x_t; \sqrt{1 - \beta_t}\, x_{t-1}, \beta_t I)$$[^1]

By convention, $$T = 1000$$ in the original DDPM experiments[^1]. The schedule is chosen so that by step T the data has been almost entirely replaced by standard Gaussian noise; that is, $$q(x_T)$$ approaches $$\mathcal{N}(0, I)$$ regardless of $$x_0$$.

A crucial algebraic property of this Gaussian chain is that $$x_t$$ admits a closed-form marginal in terms of $$x_0$$. Letting $$\alpha_t = 1 - \beta_t$$ and $$\bar{\alpha}_t = \prod_{s=1}^{t} \alpha_s$$, one can derive:

$$q(x_t \mid x_0) = \mathcal{N}(x_t; \sqrt{\bar{\alpha}_t}\, x_0, (1 - \bar{\alpha}_t) I)$$[^1]
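
The derivation is a short induction. As a sketch of the first step, write one forward transition in sampled form and substitute the previous one; the symbols $$\epsilon_{t-1}$$, $$\epsilon_{t-2}$$, and $$\bar{\epsilon}$$ are auxiliary standard Gaussian variables introduced only for this illustration:

$$
\begin{aligned}
x_t &= \sqrt{\alpha_t}\, x_{t-1} + \sqrt{1 - \alpha_t}\, \epsilon_{t-1} \\
&= \sqrt{\alpha_t \alpha_{t-1}}\, x_{t-2} + \sqrt{\alpha_t (1 - \alpha_{t-1})}\, \epsilon_{t-2} + \sqrt{1 - \alpha_t}\, \epsilon_{t-1} \\
&= \sqrt{\alpha_t \alpha_{t-1}}\, x_{t-2} + \sqrt{1 - \alpha_t \alpha_{t-1}}\, \bar{\epsilon}, \quad \bar{\epsilon} \sim \mathcal{N}(0, I)
\end{aligned}
$$

The last line merges two independent zero-mean Gaussians, whose variances add: $$\alpha_t (1 - \alpha_{t-1}) + (1 - \alpha_t) = 1 - \alpha_t \alpha_{t-1}$$. Repeating the substitution down to $$x_0$$ turns the product into $$\bar{\alpha}_t$$ and recovers the marginal stated above.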

This means a noisy sample at any timestep can be generated in one shot:

$$
x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon, \quad \epsilon \sim \mathcal{N}(0, I)
$$

That closed form is the foundation of efficient training: rather than simulating a 1000-step chain for every gradient update, the model only needs to sample a random timestep t, draw fresh noise $$\epsilon$$, and compute $$x_t$$ directly[^1]. Because the forward process has no learnable parameters, it acts only as a data augmentation that pairs each clean image with a noisy counterpart at a random noise level.
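
The one-shot sampling rule can be sketched in a few lines of plain Python. The example below assumes the linear schedule of the original paper ($$\beta$$ from $$10^{-4}$$ to 0.02 over 1000 steps) and a toy three-pixel "image"; the function names `linear_schedule` and `q_sample` are illustrative, not taken from any official codebase.

```python
import math
import random

def linear_schedule(T=1000, beta_1=1e-4, beta_T=0.02):
    """Return (betas, alpha_bar), indexed 1..T; index 0 holds a placeholder / alpha_bar_0 = 1."""
    betas = [0.0] + [beta_1 + (beta_T - beta_1) * (t - 1) / (T - 1) for t in range(1, T + 1)]
    alpha_bar = [1.0]
    for t in range(1, T + 1):
        alpha_bar.append(alpha_bar[-1] * (1.0 - betas[t]))  # cumulative product of alpha_s
    return betas, alpha_bar

def q_sample(x0, t, alpha_bar, rng):
    """Draw x_t ~ q(x_t | x_0) directly, without simulating t steps. Also returns the noise."""
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    a = math.sqrt(alpha_bar[t])
    b = math.sqrt(1.0 - alpha_bar[t])
    x_t = [a * x + b * e for x, e in zip(x0, eps)]
    return x_t, eps

betas, alpha_bar = linear_schedule()
rng = random.Random(0)
x0 = [0.5, -1.0, 0.25]                      # toy data point (assumption for this sketch)
x_t, eps = q_sample(x0, 500, alpha_bar, rng)
print(alpha_bar[1], alpha_bar[500], alpha_bar[1000])
```

The printed values show how much of the original signal survives at the start, middle, and end of the chain; by the last step $$\bar{\alpha}_T$$ is close to zero, so $$x_T$$ is almost pure noise.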

## How does the reverse process generate samples?

The generative part of the model is the *reverse* Markov chain, which starts from pure Gaussian noise $$x_T \sim \mathcal{N}(0, I)$$ and iteratively denoises it back to a sample $$x_0$$ from (approximately) the data distribution. Each reverse step is a learned Gaussian transition parameterized by a [neural network](/wiki/neural_network) with weights $$\theta$$:

$$p_\theta(x_{t-1} \mid x_t) = \mathcal{N}(x_{t-1}; \mu_\theta(x_t, t), \Sigma_\theta(x_t, t))$$[^1]

In the original DDPM, the variance was *not* learned. Ho et al. fixed $$\Sigma_\theta(x_t, t) = \sigma_t^2 I$$ to one of two schedule-dependent values (either $$\beta_t$$ or $$\tilde{\beta}_t$$, the posterior variance), and trained the network to predict only the mean $$\mu_\theta$$[^1]. Nichol and Dhariwal (2021) later showed that learning a per-step interpolation between these two bounds improves log-likelihood without harming sample quality[^4].

In principle, $$\mu_\theta$$ could be regressed directly. The key practical insight of DDPM, however, is that the mean has a particularly simple form in terms of the noise that was added during the forward process. Specifically, if the network $$\epsilon_\theta(x_t, t)$$ predicts the noise $$\epsilon$$ that was injected to produce $$x_t$$ from $$x_0$$, then the reverse-step mean is determined analytically by:

$$\mu_\theta(x_t, t) = \frac{1}{\sqrt{\alpha_t}} \left(x_t - \frac{\beta_t}{\sqrt{1 - \bar{\alpha}_t}} \epsilon_\theta(x_t, t)\right)$$[^1]

Reparameterizing the network as a *noise predictor* rather than a *mean predictor* is the single architectural change that, together with a simplified loss, makes DDPM training stable and effective[^1].
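
The noise-prediction form of the mean is the forward-posterior mean of $$q(x_{t-1} \mid x_t, x_0)$$ with $$x_0$$ eliminated. The sketch below checks this numerically on a scalar example. It assumes the linear schedule of the original paper and uses the standard Gaussian posterior expressions from the DDPM paper, including the posterior variance $$\tilde{\beta}_t = \frac{1 - \bar{\alpha}_{t-1}}{1 - \bar{\alpha}_t} \beta_t$$; the function names are illustrative.

```python
import math
import random

T = 1000
betas = [0.0] + [1e-4 + (0.02 - 1e-4) * (t - 1) / (T - 1) for t in range(1, T + 1)]
alpha_bar = [1.0]
for t in range(1, T + 1):
    alpha_bar.append(alpha_bar[-1] * (1.0 - betas[t]))

def posterior_mean_from_x0(x_t, x0, t):
    """Mean of q(x_{t-1} | x_t, x_0), written in terms of the clean sample x_0."""
    c0 = math.sqrt(alpha_bar[t - 1]) * betas[t] / (1.0 - alpha_bar[t])
    ct = math.sqrt(1.0 - betas[t]) * (1.0 - alpha_bar[t - 1]) / (1.0 - alpha_bar[t])
    return c0 * x0 + ct * x_t

def mean_from_eps(x_t, eps, t):
    """The same mean, written in terms of the noise that produced x_t."""
    alpha_t = 1.0 - betas[t]
    return (x_t - betas[t] / math.sqrt(1.0 - alpha_bar[t]) * eps) / math.sqrt(alpha_t)

def posterior_variance(t):
    """beta-tilde_t, the smaller of the two fixed variance choices."""
    return (1.0 - alpha_bar[t - 1]) / (1.0 - alpha_bar[t]) * betas[t]

rng = random.Random(0)
x0, eps, t = 0.7, rng.gauss(0.0, 1.0), 300
x_t = math.sqrt(alpha_bar[t]) * x0 + math.sqrt(1.0 - alpha_bar[t]) * eps
print(posterior_mean_from_x0(x_t, x0, t), mean_from_eps(x_t, eps, t))
```

The two printed numbers agree up to floating-point error, which is why a network that predicts $$\epsilon$$ carries the same information as one that predicts the mean directly.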

## How is a DDPM trained?

### Variational lower bound

Like a VAE, DDPM is a latent-variable model and can be trained by maximizing a variational lower bound (VLB, sometimes called the ELBO) on the data log-likelihood. The negative VLB decomposes into a sum of KL divergences between the forward posteriors $$q(x_{t-1} \mid x_t, x_0)$$ and the learned reverse transitions $$p_\theta(x_{t-1} \mid x_t)$$, plus a small reconstruction term[^1]:

$$
\begin{aligned}
L_{\text{VLB}} = \mathbb{E}_q \Big[ & D_{\mathrm{KL}}(q(x_T \mid x_0) \parallel p(x_T)) \quad \text{(prior matching)} \\
&+ \sum_{t=2}^{T} D_{\mathrm{KL}}(q(x_{t-1} \mid x_t, x_0) \parallel p_\theta(x_{t-1} \mid x_t)) \quad \text{(denoising matching)} \\
&- \log p_\theta(x_0 \mid x_1) \Big] \quad \text{(reconstruction)}
\end{aligned}
$$

Each KL term is between Gaussians and has a closed-form expression in terms of the means and variances.[^1] In principle this objective is directly optimizable, and indeed Sohl-Dickstein et al. (2015) used a closely related formulation[^2].
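
As an illustration of that closed form, consider the fixed-variance case $$\Sigma_\theta(x_t, t) = \sigma_t^2 I$$. Writing $$\tilde{\mu}_t(x_t, x_0)$$ for the mean of the forward posterior $$q(x_{t-1} \mid x_t, x_0)$$ and $$C$$ for a constant that does not depend on $$\theta$$ (both symbols are introduced here for the sketch), each denoising-matching term reduces to a scaled squared distance between means:

$$
D_{\mathrm{KL}}(q(x_{t-1} \mid x_t, x_0) \parallel p_\theta(x_{t-1} \mid x_t)) = \frac{1}{2\sigma_t^2} \lVert \tilde{\mu}_t(x_t, x_0) - \mu_\theta(x_t, t) \rVert^2 + C
$$

Minimizing the VLB therefore amounts to a per-timestep regression of $$\mu_\theta$$ onto $$\tilde{\mu}_t$$, weighted by $$1 / (2\sigma_t^2)$$.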

### The simplified L_simple objective

The central practical contribution of Ho et al. (2020) was to show that an *unweighted* mean-squared-error objective on the noise prediction produces dramatically better samples than the proper VLB[^1]:

$$L_{\text{simple}}(\theta) = \mathbb{E}_{t, x_0, \epsilon} \left[ \lVert \epsilon - \epsilon_\theta\left( \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon , t \right) \rVert^2 \right]$$[^1]

In words: sample a clean image $$x_0$$, sample a random timestep t uniformly from $$\{1, \ldots, T\}$$, sample standard Gaussian noise $$\epsilon$$, build the noisy image $$x_t$$ in closed form, and train the network to recover $$\epsilon$$ in the $$L^2$$ sense. This is mathematically equivalent to a reweighted form of the VLB, specifically $$L_{\text{VLB}}$$ with each timestep's KL term rescaled so that very small t is downweighted. Ho et al. argue that this reweighting lets the network focus on the more difficult denoising tasks at larger t, and it empirically yields lower FID even though it slightly worsens log-likelihood[^1].
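
To see the reweighting concretely, the sketch below evaluates the coefficient that the VLB places on $$\lVert \epsilon - \epsilon_\theta \rVert^2$$ at timestep t, namely $$\beta_t^2 / (2 \sigma_t^2 \alpha_t (1 - \bar{\alpha}_t))$$ as given in the DDPM paper. It assumes the linear schedule and the variance choice $$\sigma_t^2 = \beta_t$$; `vlb_weight` is an illustrative name.

```python
T = 1000
betas = [0.0] + [1e-4 + (0.02 - 1e-4) * (t - 1) / (T - 1) for t in range(1, T + 1)]
alpha_bar = [1.0]
for t in range(1, T + 1):
    alpha_bar.append(alpha_bar[-1] * (1.0 - betas[t]))

def vlb_weight(t):
    """VLB coefficient on the squared noise error at step t, assuming sigma_t^2 = beta_t."""
    alpha_t = 1.0 - betas[t]
    return betas[t] / (2.0 * alpha_t * (1.0 - alpha_bar[t]))

for t in (2, 10, 100, 1000):
    print(t, vlb_weight(t))   # L_simple replaces every one of these weights with 1
```

Under these assumptions the weight is largest at the smallest timesteps and more than an order of magnitude smaller at large t, so setting every weight to 1 shifts relative emphasis away from the low-noise steps.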

The simplified training loop is just five steps:

1. Sample a data point $$x_0$$ from the training set.
2. Sample a timestep $$t \sim \text{Uniform}(\{1, \ldots, T\})$$.
3. Sample noise $$\epsilon \sim \mathcal{N}(0, I)$$.
4. Compute $$x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon$$.
5. Minimize $$\lVert \epsilon - \epsilon_\theta(x_t, t) \rVert^2$$ with stochastic gradient descent.

There is no adversarial discriminator to balance, no posterior collapse to manage, no second network. The training stability of this recipe, combined with its sample quality, is the principal reason diffusion models displaced GANs as the default for high-fidelity image synthesis between 2020 and 2022.
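
The five steps map almost line for line onto code. The following is a deliberately tiny sketch, not the paper's implementation: the data are scalars drawn from $$\mathcal{N}(0, 1)$$, and the U-Net is replaced by a hypothetical one-parameter denoiser $$\epsilon_\theta(x_t, t) = w \sqrt{1 - \bar{\alpha}_t}\, x_t$$, for which the optimal value under this toy data distribution is $$w = 1$$.

```python
import math
import random

T = 1000
betas = [0.0] + [1e-4 + (0.02 - 1e-4) * (t - 1) / (T - 1) for t in range(1, T + 1)]
alpha_bar = [1.0]
for t in range(1, T + 1):
    alpha_bar.append(alpha_bar[-1] * (1.0 - betas[t]))

def train(steps=20000, lr=0.01, seed=0):
    """Plain SGD on L_simple for the one-parameter toy denoiser."""
    rng = random.Random(seed)
    w = 0.0                                           # the only learnable parameter
    for _ in range(steps):
        x0 = rng.gauss(0.0, 1.0)                      # 1. sample a data point (toy data)
        t = rng.randint(1, T)                         # 2. uniform timestep in {1, ..., T}
        eps = rng.gauss(0.0, 1.0)                     # 3. sample noise
        c = math.sqrt(1.0 - alpha_bar[t])
        x_t = math.sqrt(alpha_bar[t]) * x0 + c * eps  # 4. closed-form noisy sample
        pred = w * c * x_t                            # toy eps_theta(x_t, t)
        grad = -2.0 * (eps - pred) * c * x_t          # gradient of (eps - pred)^2 w.r.t. w
        w -= lr * grad                                # 5. SGD step
    return w

w = train()
print(w)   # should settle near the optimum w = 1 for this toy setup
```

A real model replaces the single weight with a U-Net and the manual gradient with automatic differentiation, but the structure of the loop is unchanged.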

## What architecture does DDPM use?

DDPM uses a [U-Net](/wiki/u_net) denoiser based on the PixelCNN++ backbone introduced by Salimans et al. (2017) and refined by the Wide ResNet style of Zagoruyko and Komodakis (2016). The U-Net itself originates from Ronneberger et al.'s 2015 work on biomedical image segmentation[^1].

Architectural details specific to DDPM include[^1]:

- **Encoder-decoder structure** with skip connections at matching resolutions, so fine spatial detail can flow directly from early to late layers.
- **Residual blocks** at each resolution, with Group Normalization (Wu and He, 2018) used in place of the weight normalization of PixelCNN++, a substitution the authors made to simplify the implementation.
- **Timestep embedding.** The integer timestep t is mapped to a sinusoidal [positional embedding](/wiki/positional_encoding) (as in the [Transformer](/wiki/transformer)), passed through a small MLP, and added into every residual block. This lets a single set of weights handle all 1000 noise levels.
- **Self-attention** layers at the 16x16 feature-map resolution (and in some variants additional resolutions), enabling the model to capture global structure that pure convolutions miss[^1].
- **Multi-scale design.** The DDPM CIFAR-10 model has roughly 35 M parameters; the LSUN models are larger.
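
The timestep embedding in the list above can be sketched as follows. This is a minimal version of the Transformer-style sinusoidal encoding, assuming an even embedding dimension and a maximum period of 10000; the exact frequency spacing in any given DDPM codebase may differ slightly, and `timestep_embedding` is an illustrative name.

```python
import math

def timestep_embedding(t, dim, max_period=10000.0):
    """Map an integer timestep to a dim-dimensional sinusoidal vector (dim assumed even)."""
    half = dim // 2
    # Geometrically spaced frequencies, from 1 down towards 1 / max_period.
    freqs = [math.exp(-math.log(max_period) * i / half) for i in range(half)]
    return [math.sin(t * f) for f in freqs] + [math.cos(t * f) for f in freqs]

emb = timestep_embedding(250, 8)
print(len(emb), emb[:2])
```

In the full model this vector would be passed through a small MLP and added to the activations of each residual block, so that the same convolutional weights can behave differently at different noise levels.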

The choice of a convolutional U-Net (rather than a [transformer](/wiki/transformer)) was important historically: it tied diffusion models to a well-understood image architecture and made the field accessible to researchers without TPU-scale compute. Transformer-based denoisers, most notably the Diffusion Transformer (DiT) by Peebles and Xie (2023), gained ground only after 2022, once the diffusion paradigm itself was well established.

## How does DDPM sampling work?

Sampling from a trained DDPM follows the *ancestral* (Markov) reverse chain[^1]:

1. Draw $$x_T \sim \mathcal{N}(0, I)$$.
2. For $$t = T, T-1, \ldots, 1$$:
   a. Compute the predicted noise $$\epsilon_\theta(x_t, t)$$.
   b. Compute the reverse-step mean $$\mu_\theta(x_t, t) = \frac{1}{\sqrt{\alpha_t}} \left(x_t - \frac{\beta_t}{\sqrt{1 - \bar{\alpha}_t}} \epsilon_\theta(x_t, t)\right)$$.
   c. Sample $$z \sim \mathcal{N}(0, I)$$ for $$t > 1$$, otherwise $$z = 0$$.
   d. Set $$x_{t-1} = \mu_\theta(x_t, t) + \sigma_t z$$.
3. Return $$x_0$$.

Because $$T = 1000$$ and each step requires a full forward pass through the denoiser, generating a *single* image costs approximately 1000 network evaluations[^1]. On 2020-era GPUs this translated to tens of seconds per CIFAR-10 sample and minutes per high-resolution LSUN sample, far slower than a one-shot GAN. Reducing this sampling cost became one of the central research directions of the next several years (see *Follow-up improvements* below).

The ancestral sampler is *stochastic*: fresh noise $$z$$ is drawn at step (c) and added at step (d), so starting the same trained model from the same initial $$x_T$$ produces a different trajectory, and a different image, on every run unless all of the random draws are fixed. Replacing that stochasticity with a deterministic update yields DDIM, discussed below.
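
The sampling loop can be exercised end to end without training anything by substituting an analytic denoiser. The sketch below assumes scalar data distributed as $$\mathcal{N}(2, 0.5^2)$$, for which the optimal noise prediction $$\mathbb{E}[\epsilon \mid x_t]$$ has a closed form; `oracle_eps` stands in for a trained $$\epsilon_\theta$$, and the variance choice $$\sigma_t^2 = \beta_t$$ is one of the two options mentioned earlier. All names and the toy distribution are assumptions made for illustration.

```python
import math
import random

T = 1000
betas = [0.0] + [1e-4 + (0.02 - 1e-4) * (t - 1) / (T - 1) for t in range(1, T + 1)]
alpha_bar = [1.0]
for t in range(1, T + 1):
    alpha_bar.append(alpha_bar[-1] * (1.0 - betas[t]))

MEAN, STD = 2.0, 0.5   # toy data distribution N(MEAN, STD^2)

def oracle_eps(x_t, t):
    """Exact E[eps | x_t] when x_0 is Gaussian; replaces a trained network here."""
    ab = alpha_bar[t]
    var_t = ab * STD ** 2 + (1.0 - ab)              # marginal variance of x_t
    return math.sqrt(1.0 - ab) * (x_t - math.sqrt(ab) * MEAN) / var_t

def ancestral_sample(eps_model, rng):
    x = rng.gauss(0.0, 1.0)                         # 1. x_T ~ N(0, 1)
    for t in range(T, 0, -1):                       # 2. t = T, T-1, ..., 1
        eps_hat = eps_model(x, t)                   # (a) predicted noise
        alpha_t = 1.0 - betas[t]
        mean = (x - betas[t] / math.sqrt(1.0 - alpha_bar[t]) * eps_hat) / math.sqrt(alpha_t)  # (b)
        z = rng.gauss(0.0, 1.0) if t > 1 else 0.0   # (c) no noise on the final step
        x = mean + math.sqrt(betas[t]) * z          # (d) sigma_t^2 = beta_t
    return x                                        # 3. x_0

rng = random.Random(0)
samples = [ancestral_sample(oracle_eps, rng) for _ in range(300)]
sample_mean = sum(samples) / len(samples)
print(sample_mean)   # should land close to MEAN
```

Each sample costs 1000 denoiser calls, which is negligible for a scalar formula but dominates the runtime when the denoiser is a large U-Net.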

## Noise schedule

### Linear beta schedule

In the original DDPM, the variance schedule $$\{\beta_t\}$$ is linear in t, increasing from $$\beta_1 = 10^{-4}$$ to $$\beta_T = 0.02$$ with $$T = 1000$$[^1]. These small $$\beta$$ values were chosen so that the reverse-process Gaussian assumption (that $$q(x_{t-1} \mid x_t)$$ is approximately Gaussian) holds tightly. The resulting cumulative product $$\bar{\alpha}_t$$ starts near 1 (almost no noise) at $$t = 1$$ and approaches 0 (almost pure noise) at $$t = T$$.

### Cosine schedule

Nichol and Dhariwal (2021), in "Improved Denoising Diffusion Probabilistic Models" (arXiv:2102.09672), observed that the linear schedule destroys information *too quickly* on lower-resolution images: the final portion of the forward process is almost pure noise, and they report that skipping up to roughly 20 % of the reverse chain barely changes FID, so many of the earliest reverse steps have little to do[^4]. They proposed a *cosine* schedule defined indirectly through:

$$
\bar{\alpha}_t = \left(\frac{f(t)}{f(0)}\right)^2, \quad \text{where } f(t) = \cos\left(\frac{t/T + s}{1 + s} \cdot \frac{\pi}{2}\right)
$$

with $$s = 0.008$$ a small offset to prevent $$\beta_t$$ from being too small near $$t = 0$$. The cosine schedule changes more slowly near both endpoints, giving more uniform information destruction and meaningfully improving FID, especially at 64x64[^4].
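
A short sketch makes the difference between the two schedules tangible. It computes $$\bar{\alpha}_t$$ from the cosine formula above, recovers $$\beta_t = 1 - \bar{\alpha}_t / \bar{\alpha}_{t-1}$$, and compares the midpoint with the linear schedule. The cap on $$\beta_t$$ (0.999 here) follows the clipping Nichol and Dhariwal describe to avoid a singularity at $$t = T$$; treat the exact value and the function names as assumptions of this example.

```python
import math

def f(t, T, s):
    return math.cos((t / T + s) / (1.0 + s) * math.pi / 2.0)

def cosine_schedule(T=1000, s=0.008, max_beta=0.999):
    """Return (betas, alpha_bar) for the cosine schedule, indexed 0..T."""
    alpha_bar = [(f(t, T, s) / f(0, T, s)) ** 2 for t in range(T + 1)]
    betas = [0.0] + [min(1.0 - alpha_bar[t] / alpha_bar[t - 1], max_beta)
                     for t in range(1, T + 1)]
    return betas, alpha_bar

def linear_alpha_bar(T=1000, beta_1=1e-4, beta_T=0.02):
    """alpha_bar for the original linear schedule, for comparison."""
    out = [1.0]
    for t in range(1, T + 1):
        beta_t = beta_1 + (beta_T - beta_1) * (t - 1) / (T - 1)
        out.append(out[-1] * (1.0 - beta_t))
    return out

cos_betas, cos_ab = cosine_schedule()
lin_ab = linear_alpha_bar()
print(cos_ab[500], lin_ab[500])   # signal retained halfway through the chain
```

Halfway through the chain the cosine schedule retains far more of the signal than the linear one, which is the behavior the paper set out to achieve.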

### Other schedules

Subsequent work introduced sigmoid schedules (Jabri et al., 2022) and *learned* schedules (Kingma et al., 2021's Variational Diffusion Models), as well as resolution-dependent rescalings. The general pattern is that the noise schedule must be calibrated to the spatial resolution and effective image statistics; what works for CIFAR-10 does not necessarily work for 1024x1024 images.

| Schedule | Defining quantity | Behavior | First used in |
|---|---|---|---|
| Linear | $$\beta$$ linearly from $$10^{-4}$$ to 0.02 | Fast information loss early | DDPM (Ho et al. 2020)[^1] |
| Cosine | $$\bar{\alpha}_t = (f(t)/f(0))^2$$ with $$f(t) = \cos\left(\frac{t/T + s}{1+s} \cdot \frac{\pi}{2}\right)$$ | More uniform information loss | iDDPM (Nichol & Dhariwal 2021)[^4] |
| Sigmoid | S-shaped $$\beta$$ | Smooth midpoint transition | Jabri et al. 2022 |
| Learned | Optimized end-to-end | Adaptive | VDM (Kingma et al. 2021) |

## What results did DDPM report?

### CIFAR-10

The headline result of the DDPM paper was on unconditional CIFAR-10 (32x32 natural images): a Fréchet Inception Distance (FID) of **3.17** and an Inception Score of **9.46**[^1]. At the time of publication this FID was state-of-the-art for unconditional CIFAR-10, better than the best GAN result available, and the Inception Score was competitive with the leading GAN models. Crucially, both numbers were achieved without any adversarial training, without truncation tricks, and without hyperparameter tuning peculiar to each metric.

### LSUN

DDPM was also evaluated on 256x256 images from several LSUN categories, with the paper reporting results on LSUN Bedroom, LSUN Church (also called Church Outdoor), and LSUN Cat[^1]. The paper summarized the outcome as obtaining "sample quality similar to ProgressiveGAN" on 256x256 LSUN[^1]; quality was competitive with ProgressiveGAN and StyleGAN baselines, though not strictly state-of-the-art on every category. The visual fidelity of DDPM LSUN samples, particularly the church outdoor scenes, was an important demonstration that the diffusion framework scaled beyond toy resolutions.

### Likelihoods

Following the simplified L_simple training, DDPM's variational lower bound on test log-likelihood was *worse* than that of explicit likelihood models of the time, even though sample quality was higher[^1]. This sample-quality / likelihood tension was a recurring theme in early diffusion work and partly motivated the hybrid loss introduced by Nichol and Dhariwal (2021)[^4].

## Theoretical connections

### Score matching

There is a tight equivalence between the DDPM noise-prediction objective and *denoising score matching* (Vincent, 2011; Song & Ermon, 2019)[^3]. Given the closed-form forward distribution $$q(x_t \mid x_0)$$, the score of this conditional distribution at a noisy point $$x_t$$ is:

$$
\nabla_{x_t} \log q(x_t \mid x_0) = -\frac{x_t - \sqrt{\bar{\alpha}_t}\, x_0}{1 - \bar{\alpha}_t} = -\frac{\epsilon}{\sqrt{1 - \bar{\alpha}_t}}
$$

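Therefore predicting the noise $$\epsilon$$ is equivalent, up to a known scale factor $$-1/\sqrt{1 - \bar{\alpha}_t}$$ that depends only on t, to predicting this conditional score, and by the denoising score matching argument the minimizer of that regression is the score of the marginal $$q(x_t)$$. The DDPM noise network is, in effect, a score model at every noise level; the DDPM ancestral sampler is a discretization of a particular reverse-time SDE that uses the score[^5].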
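
The conversion between the two parameterizations is a one-liner. The sketch below, on a scalar example with arbitrarily chosen values, checks that rescaling the noise reproduces the gradient of the Gaussian log-density; the function names are illustrative.

```python
import math

def score_from_eps(eps_hat, alpha_bar_t):
    """Convert a noise prediction into a score estimate."""
    return -eps_hat / math.sqrt(1.0 - alpha_bar_t)

def conditional_score(x_t, x0, alpha_bar_t):
    """Gradient of log N(x_t; sqrt(alpha_bar_t) * x0, 1 - alpha_bar_t) with respect to x_t."""
    return -(x_t - math.sqrt(alpha_bar_t) * x0) / (1.0 - alpha_bar_t)

alpha_bar_t, x0, eps = 0.3, 0.8, -1.2          # arbitrary illustrative values
x_t = math.sqrt(alpha_bar_t) * x0 + math.sqrt(1.0 - alpha_bar_t) * eps
print(score_from_eps(eps, alpha_bar_t), conditional_score(x_t, x0, alpha_bar_t))
```

Both expressions print the same number, so a trained $$\epsilon_\theta$$ can be plugged into any score-based sampler after this rescaling.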

Song, Sohl-Dickstein, Kingma, Kumar, Ermon, and Poole (2021)'s "Score-Based Generative Modeling through Stochastic Differential Equations" (arXiv:2011.13456) made this unification explicit, deriving DDPM as a discretization of a *Variance Preserving* (VP) SDE and NCSN as a discretization of a *Variance Exploding* (VE) SDE, both governed by a single continuous-time formulation with the same score-matching loss[^5].

### Variational autoencoders

DDPM is also formally a deep hierarchical [variational autoencoder](/wiki/variational_autoencoder) with a fixed, non-learned encoder (the forward Markov chain) and a Gaussian-Markov decoder (the reverse chain)[^1]. This perspective makes the VLB derivation natural and connects DDPM to the broader VAE literature. The key innovation over earlier hierarchical VAEs is that the encoder is hand-designed and noise-only, eliminating the optimization difficulties (such as posterior collapse) that plagued learned hierarchical posteriors.

### Energy-based models and Langevin dynamics

The reverse-time SDE view also clarifies the link to energy-based models: the score is the gradient of an implicit log-density, and DDPM's iterative denoising is a stabilized, annealed analogue of Langevin sampling from an energy-based model.

## What are the limitations of DDPM?

The original DDPM has several well-documented limitations that became the agenda of subsequent diffusion research:

### Why is DDPM sampling slow?

The defining cost of DDPM is its **1000-step sampling chain**: each image requires roughly T = 1000 forward passes through the U-Net[^1]. On 2020-era hardware this made DDPM orders of magnitude slower than GANs at inference time, and it remains the principal disadvantage of diffusion-based generation. Subsequent work attacked this bottleneck through faster solvers (DDIM, DPM-Solver), distillation (progressive distillation, [consistency models](/wiki/consistency_models)), and architectural shortcuts (latent diffusion).

### Pixel-space cost

DDPM operates directly on raw pixels. For a 256x256 RGB image, every U-Net forward pass processes 196,608 input values, and the same is true for every one of the 1000 sampling steps. Scaling DDPM to 1024x1024 or video resolutions is prohibitively expensive without first compressing the data, a problem that Latent Diffusion Models (Rombach et al., 2022) solved by running the diffusion process in the latent space of a pretrained autoencoder[^6].

### Sample quality / log-likelihood mismatch

L_simple sacrifices log-likelihood for sample quality. Models trained with L_simple have worse density estimation than VAEs and PixelRNNs of comparable size, even though their samples look better[^1]. Nichol and Dhariwal (2021) partially closed this gap with a hybrid loss, but the underlying tension between perceptual fidelity and likelihood remains[^4].

### Limited controllability

The original DDPM is fully unconditional. Practical text-to-image generation required two further ingredients: classifier guidance (Dhariwal and Nichol, 2021)[^7] and especially classifier-free guidance (Ho and Salimans, 2021)[^8], the latter of which is now standard in essentially every conditional diffusion model.

### Sensitivity to noise schedule

A schedule tuned for one dataset does not necessarily transfer to other resolutions or to images with very different statistics; Nichol and Dhariwal, for example, found the original linear schedule sub-optimal at low resolutions such as 64x64. There is no principled choice of schedule from theory alone, and schedule design has remained an active research topic since 2020[^4].

## Follow-up improvements

The DDPM paper opened a research program that has now produced dozens of major follow-ups. The most influential are summarized below.

### Improved DDPM (Nichol and Dhariwal, 2021)

Improved DDPM came from different authors (Alex Nichol and Prafulla Dhariwal, then at OpenAI) but the same model family. Three changes, the cosine schedule, learned variances, and a hybrid VLB+L_simple loss, improved both FID and log-likelihood, and a strided sampling schedule cut inference cost by roughly an order of magnitude with negligible quality loss[^4].

### How does DDPM differ from DDIM?

Denoising Diffusion Implicit Models (arXiv:2010.02502), by Song, Meng, and Ermon (2020), defined a family of *non-Markovian* forward processes that share DDPM's marginals $$q(x_t \mid x_0)$$; because the training objective depends only on those marginals, the corresponding generative processes can be run with any DDPM-trained model without retraining[^9]. With the stochasticity parameter set to zero, DDIM yields a deterministic ODE-like sampler that produces high-quality samples in 50-100 steps, roughly 10x-20x fewer than DDPM[^9]. DDIM also gives the model a meaningful latent space: the same initial noise vector always maps to the same image, enabling interpolation and inversion.

| Property | DDPM | DDIM ($$\sigma = 0$$) |
|---|---|---|
| Sampling process | Stochastic (SDE) | Deterministic (ODE) |
| Typical steps | ~1000 | 50-100 (sometimes 10-20) |
| Same noise -> same image | No | Yes |
| Latent space interpolation | No | Yes |
| Retraining required | n/a (baseline) | No (reuses DDPM weights) |
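
The deterministic column of the table can be sketched as follows. The update first forms the implied clean sample $$\hat{x}_0 = (x_t - \sqrt{1 - \bar{\alpha}_t}\, \epsilon_\theta) / \sqrt{\bar{\alpha}_t}$$ and then re-noises it to the earlier timestep using the *same* predicted noise. As in a toy setting with no trained network, the example assumes scalar data from $$\mathcal{N}(2, 0.5^2)$$ and an analytic denoiser `oracle_eps`; the names, the 50-step stride, and the toy distribution are all assumptions of this sketch.

```python
import math

T = 1000
betas = [0.0] + [1e-4 + (0.02 - 1e-4) * (t - 1) / (T - 1) for t in range(1, T + 1)]
alpha_bar = [1.0]                                   # alpha_bar[0] = 1 marks clean data
for t in range(1, T + 1):
    alpha_bar.append(alpha_bar[-1] * (1.0 - betas[t]))

MEAN, STD = 2.0, 0.5   # toy data distribution N(MEAN, STD^2)

def oracle_eps(x_t, t):
    """Exact E[eps | x_t] for Gaussian data; stands in for a trained eps_theta."""
    ab = alpha_bar[t]
    var_t = ab * STD ** 2 + (1.0 - ab)
    return math.sqrt(1.0 - ab) * (x_t - math.sqrt(ab) * MEAN) / var_t

def ddim_sample(x_T, eps_model, num_steps=50):
    """Deterministic DDIM (sigma = 0) on a strided subsequence of timesteps."""
    stride = T // num_steps
    timesteps = list(range(T, 0, -stride))          # 1000, 980, ..., 20 for 50 steps
    x = x_T
    for i, t in enumerate(timesteps):
        t_prev = timesteps[i + 1] if i + 1 < len(timesteps) else 0
        eps_hat = eps_model(x, t)
        x0_hat = (x - math.sqrt(1.0 - alpha_bar[t]) * eps_hat) / math.sqrt(alpha_bar[t])
        x = math.sqrt(alpha_bar[t_prev]) * x0_hat + math.sqrt(1.0 - alpha_bar[t_prev]) * eps_hat
    return x

print(ddim_sample(0.0, oracle_eps), ddim_sample(1.0, oracle_eps))
```

No random numbers are drawn inside the loop, so the output is a deterministic function of $$x_T$$; that property is what makes interpolation between two noise vectors meaningful.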

### Score SDE framework (Song et al., 2021)

Song et al. unified DDPM and NCSN under a continuous-time SDE and derived a corresponding *probability-flow ODE*, which gives deterministic sampling with exact likelihood computation and connects diffusion models to the wider literature on normalizing flows[^5].

### Diffusion Models Beat GANs (Dhariwal and Nichol, 2021)

This paper (arXiv:2105.05233) introduced *classifier guidance*, which uses the gradient of a separately trained noise-aware classifier to push samples toward a desired class, and used it to set new ImageNet FID records, decisively beating BigGAN on class-conditional generation[^7]. The title "Diffusion Models Beat GANs" became a slogan for the broader shift.

### Classifier-free guidance (Ho and Salimans, 2021)

Classifier-Free Diffusion Guidance (arXiv:2207.12598) eliminated the separate classifier by training the denoiser jointly as both conditional and unconditional model (with the conditioning randomly dropped during training)[^8]. At sample time, the guided noise prediction is extrapolated away from the unconditional prediction toward the conditional one:

$$\epsilon_{\text{guided}} = \epsilon_\theta(x_t, t) + w \left(\epsilon_\theta(x_t, t, c) - \epsilon_\theta(x_t, t)\right)$$[^8]

where w is the guidance scale. CFG has become the universal mechanism by which text-to-image models such as [Stable Diffusion](/wiki/stable_diffusion), [DALL-E 2](/wiki/dalle_2), and [Imagen](/wiki/imagen) obtain strong prompt adherence[^6][^8].
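
In code, the guidance rule is an elementwise combination of two denoiser outputs. The sketch below assumes the two noise predictions are already available as flat lists of floats; `cfg_combine` and the example values are illustrative.

```python
def cfg_combine(eps_uncond, eps_cond, w):
    """Classifier-free guidance: move from the unconditional prediction toward
    (and, for w > 1, past) the conditional one."""
    return [u + w * (c - u) for u, c in zip(eps_uncond, eps_cond)]

eps_uncond = [0.0, 1.0]   # eps_theta(x_t, t), conditioning dropped
eps_cond = [1.0, 3.0]     # eps_theta(x_t, t, c)

print(cfg_combine(eps_uncond, eps_cond, 0.0))   # unconditional prediction
print(cfg_combine(eps_uncond, eps_cond, 1.0))   # conditional prediction
print(cfg_combine(eps_uncond, eps_cond, 2.0))   # extrapolated beyond the conditional
```

With this form of the rule, $$w = 0$$ recovers the unconditional model and $$w = 1$$ the plain conditional model; values above 1 trade sample diversity for stronger adherence to the condition, at the cost of two denoiser evaluations per sampling step.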

### Latent Diffusion / Stable Diffusion (Rombach et al., 2022)

"High-Resolution Image Synthesis with Latent Diffusion Models" (arXiv:2112.10752) ran the DDPM process not on pixels but in the latent space of a pretrained autoencoder, typically compressing 512x512 images down to 64x64 latents before any diffusion[^6]. This reduced the compute cost of every sampling step by 1-2 orders of magnitude, enabling high-resolution text-to-image generation on consumer GPUs. The publicly released **Stable Diffusion** model (Stability AI, August 2022) is the most widely used instantiation of this Latent Diffusion framework and is a direct architectural descendant of DDPM[^6].

### Consistency models and distillation

Progressive distillation (Salimans and Ho, 2022) and [consistency models](/wiki/consistency_models) (Song, Dhariwal, Chen, Sutskever, 2023) compressed multi-step diffusion samplers into models that produce high-quality samples in 1-4 steps[^10]. Consistency distillation works by training a student model to map *any* point on a diffusion trajectory directly to its endpoint, eliminating the iterative chain at inference time.

### Flow matching (Lipman et al., 2023)

Flow Matching for Generative Modeling (arXiv:2210.02747) generalized the diffusion idea to *arbitrary* transport flows: instead of fixing a noise-adding forward process, the model learns a velocity field that transports samples from a source distribution to the data distribution along any chosen interpolant[^11]. For the Gaussian-source special case, flow matching is mathematically equivalent to a Gaussian diffusion model with appropriate parameterization[^12]. Stable Diffusion 3 (Esser et al., 2024) and several Meta video models are trained with flow matching.

## Is DDPM still used in 2026?

Within roughly eighteen months of the DDPM preprint, diffusion models had displaced GANs as the default backbone for high-fidelity image generation. By mid-2022, three of the most-discussed AI systems of the year were all diffusion models: DALL-E 2 (OpenAI, April 2022), Imagen (Google, May 2022), and Stable Diffusion (Stability AI, August 2022), trained on the DDPM recipe (with classifier-free guidance and, for Stable Diffusion, latent-space efficiency)[^6][^8]. The DDPM paper itself, by 2026, is one of the most cited generative modeling publications of the deep learning era, with citations crossing into the tens of thousands[^1].

The influence has propagated well beyond static images:

- **Audio.** DiffWave (Kong et al., 2021) and Grad-TTS apply DDPM-style training to raw audio and mel-spectrograms; later audio and music generators such as Stable Audio are also diffusion-based.
- **Video.** Imagen Video, Make-A-Video, [Sora](/wiki/sora), and Veo are 3D U-Net or DiT diffusion models that extend DDPM to space-time.
- **3D.** DreamFusion (Poole et al., 2022) uses a pretrained 2D diffusion model and *score distillation sampling* (SDS) to optimize 3D radiance fields without 3D training data.
- **Molecular and protein design.** Models such as RFdiffusion (Watson et al., 2023) apply DDPM-style noising and denoising to protein backbones, producing novel functional proteins.
- **Robotics.** Diffusion Policy (Chi et al., 2023) treats action sequences as the data and uses a DDPM-style conditional denoiser to generate robot control policies.

Methodologically, DDPM cemented a broader shift: generative modeling no longer required adversarial training, and likelihood-based or score-based objectives could deliver state-of-the-art perceptual quality with much greater training stability. That recipe (a U-Net or transformer denoiser, an epsilon-prediction MSE loss, classifier-free guidance, and a latent-space backbone) is, in 2026, the most copied architecture in generative AI.

## See also

- [Diffusion model](/wiki/diffusion_model): the general class of generative model to which DDPM belongs.
- [DDIM](/wiki/ddim): faster deterministic sampler for DDPM-trained models.
- [Stable Diffusion](/wiki/stable_diffusion): latent-diffusion text-to-image system built on DDPM concepts.
- [DALL-E 2](/wiki/dalle_2): OpenAI's diffusion-based text-to-image model.
- [Imagen](/wiki/imagen): Google's diffusion-based text-to-image model.
- [Classifier-free guidance](/wiki/classifier_free_guidance): guidance technique used in nearly all modern conditional diffusion models.
- [Score matching](/wiki/score_matching): alternative formulation of diffusion-based generation.
- [U-Net](/wiki/u_net): convolutional architecture used as the original DDPM denoiser.
- [Variational autoencoder](/wiki/variational_autoencoder): generative-model class of which DDPM can be viewed as a hierarchical instance with a fixed encoder.
- [Generative adversarial network](/wiki/generative_adversarial_network): prior dominant approach to image generation that DDPM displaced.
- [Markov chain](/wiki/markov_chain): formal structure of both forward and reverse DDPM processes.
- [Consistency models](/wiki/consistency_models): few-step distillation of diffusion models.
- [Flow matching](/wiki/flow_matching): generalization of diffusion to arbitrary transport flows.

## References

[^1]: Ho, J., Jain, A., & Abbeel, P. (2020). "Denoising Diffusion Probabilistic Models." NeurIPS 2020. https://arxiv.org/abs/2006.11239

[^2]: Sohl-Dickstein, J., Weiss, E. A., Maheswaranathan, N., & Ganguli, S. (2015). "Deep Unsupervised Learning using Nonequilibrium Thermodynamics." ICML 2015. https://arxiv.org/abs/1503.03585

[^3]: Song, Y., & Ermon, S. (2019). "Generative Modeling by Estimating Gradients of the Data Distribution." NeurIPS 2019. https://arxiv.org/abs/1907.05600

[^4]: Nichol, A., & Dhariwal, P. (2021). "Improved Denoising Diffusion Probabilistic Models." ICML 2021. https://arxiv.org/abs/2102.09672

[^5]: Song, Y., Sohl-Dickstein, J., Kingma, D. P., Kumar, A., Ermon, S., & Poole, B. (2021). "Score-Based Generative Modeling through Stochastic Differential Equations." ICLR 2021. https://arxiv.org/abs/2011.13456

[^6]: Rombach, R., Blattmann, A., Lorenz, D., Esser, P., & Ommer, B. (2022). "High-Resolution Image Synthesis with Latent Diffusion Models." CVPR 2022. https://arxiv.org/abs/2112.10752

[^7]: Dhariwal, P., & Nichol, A. (2021). "Diffusion Models Beat GANs on Image Synthesis." NeurIPS 2021. https://arxiv.org/abs/2105.05233

[^8]: Ho, J., & Salimans, T. (2022). "Classifier-Free Diffusion Guidance." NeurIPS 2021 Workshop on Deep Generative Models and Downstream Applications. https://arxiv.org/abs/2207.12598

[^9]: Song, J., Meng, C., & Ermon, S. (2020). "Denoising Diffusion Implicit Models." ICLR 2021. https://arxiv.org/abs/2010.02502

[^10]: Song, Y., Dhariwal, P., Chen, M., & Sutskever, I. (2023). "Consistency Models." ICML 2023. https://arxiv.org/abs/2303.01469

[^11]: Lipman, Y., Chen, R. T. Q., Ben-Hamu, H., Nickel, M., & Le, M. (2023). "Flow Matching for Generative Modeling." ICLR 2023. https://arxiv.org/abs/2210.02747

[^12]: Lilian Weng (2021). "What are Diffusion Models?" https://lilianweng.github.io/posts/2021-07-11-diffusion-models/