DDPM
Last reviewed
May 18, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v5 · 4,502 words
Denoising Diffusion Probabilistic Models (DDPM) are a class of generative model introduced by Jonathan Ho, Ajay Jain, and Pieter Abbeel of UC Berkeley in their June 2020 paper "Denoising Diffusion Probabilistic Models" (arXiv:2006.11239)[^1]. DDPMs learn to produce data samples by reversing a gradual noising process: a fixed forward Markov chain progressively corrupts a data point with Gaussian noise over many timesteps, and a learned reverse chain — parameterized by a neural network — is trained to subtract that noise step by step until a clean sample is recovered. The work was foundational because it demonstrated, for the first time at scale, that diffusion-based generation could match the image quality of state-of-the-art generative adversarial networks (GANs), reporting an unconditional CIFAR-10 FID of 3.17 and an Inception Score of 9.46[^1].
DDPM combined and refined ideas from Sohl-Dickstein et al.'s 2015 nonequilibrium thermodynamics framework[^2] and Song & Ermon's 2019 score-matching with Langevin dynamics[^3] into a simple, stable training recipe: a U-Net denoiser trained with a mean-squared-error loss to predict the noise added to each training example. That recipe became the template for nearly every major image, audio, video, and 3D diffusion model that followed, including DDIM, classifier-free guidance, Stable Diffusion, DALL-E 2, and Imagen. As of 2026, the DDPM paper remains one of the most influential generative modeling publications of the deep learning era[^1][^4].
The conceptual origin of diffusion-based generative models is the 2015 paper "Deep Unsupervised Learning using Nonequilibrium Thermodynamics" by Jascha Sohl-Dickstein, Eric A. Weiss, Niru Maheswaranathan, and Surya Ganguli (arXiv:1503.03585)[^2]. Drawing on ideas from non-equilibrium statistical physics, the authors proposed to "systematically and slowly destroy structure in a data distribution through an iterative forward diffusion process" and then learn a reverse diffusion process that restores structure, yielding a flexible and tractable generative model[^2]. The framework supported sampling, likelihood evaluation, and conditional inference, but the sample quality reported in 2015 was modest compared to contemporaneous GANs and variational autoencoders (VAEs), and the work did not attract sustained attention from the broader research community for several years.
Between 2014 and 2019, image generation was dominated by generative adversarial networks, following Goodfellow et al.'s 2014 introduction of the GAN framework. Architectures such as DCGAN, Progressive GAN, StyleGAN, and BigGAN produced increasingly photorealistic samples, especially after the latter's class-conditional ImageNet results in 2018. However, GAN training was notoriously unstable: the alternating optimization of generator and discriminator was sensitive to hyperparameters, prone to mode collapse, and difficult to scale reliably. VAEs offered stable training but produced visibly blurrier samples. The field thus had an open need for a likelihood-based generative model that combined GAN-level fidelity with VAE-level training stability.
In parallel, Yang Song and Stefano Ermon's 2019 paper "Generative Modeling by Estimating Gradients of the Data Distribution" (arXiv:1907.05600) introduced Noise Conditional Score Networks (NCSN)[^3]. Instead of learning a density directly, NCSN learned the score function — the gradient of the log probability density — at multiple noise levels, and then used annealed Langevin dynamics to sample by following the score from high noise back to clean data. This score-matching approach independently arrived at many of the same structural ideas that DDPM would crystallize: a noise hierarchy, a denoising network parameterized by noise level, and an iterative sampling procedure. The DDPM and score-matching threads were later unified in a continuous-time stochastic differential equation framework by Song et al. (2021)[^5].
The DDPM paper had three authors, all affiliated with UC Berkeley at the time of publication: Jonathan Ho, Ajay Jain, and Pieter Abbeel[^1].
The paper was published at NeurIPS 2020 (the 34th Conference on Neural Information Processing Systems) and the preprint was posted to arXiv on 19 June 2020[^1].
The forward — or diffusion — process is a fixed Markov chain that takes a data sample x_0 drawn from the real distribution q(x_0) and produces a sequence x_1, x_2, ..., x_T of increasingly noisy versions. At each step, isotropic Gaussian noise is added according to a predefined variance schedule {β_1, β_2, ..., β_T}:
q(x_t | x_{t-1}) = N(x_t; sqrt(1 − β_t) · x_{t-1}, β_t · I)[^1]
By convention, T = 1000 in the original DDPM experiments[^1]. The schedule is chosen so that by step T the data has been almost entirely replaced by standard Gaussian noise — that is, q(x_T) approaches N(0, I) regardless of x_0.
A crucial algebraic property of this Gaussian chain is that x_t admits a closed-form marginal in terms of x_0. Letting α_t = 1 − β_t and ᾱ_t = ∏_{s=1}^{t} α_s, one can derive:
q(x_t | x_0) = N(x_t; sqrt(ᾱ_t) · x_0, (1 − ᾱ_t) · I)[^1]
This means a noisy sample at any timestep can be generated in one shot:
x_t = sqrt(ᾱ_t) · x_0 + sqrt(1 − ᾱ_t) · ε, ε ~ N(0, I)
That closed form is the foundation of efficient training: rather than simulating a 1000-step chain for every gradient update, the model only needs to sample a random timestep t, draw fresh noise ε, and compute x_t directly[^1]. Because the forward process has no learnable parameters, it acts only as a data augmentation that pairs each clean image with a noisy counterpart at a random noise level.
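The one-shot sampling described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' code: the function names `linear_beta_schedule` and `q_sample`, the toy image shape, and the choice to store β_t at array index t − 1 are assumptions made for the example. The schedule endpoints are the ones the original paper reports (10⁻⁴ to 0.02 over T = 1000 steps).

```python
import numpy as np

def linear_beta_schedule(T=1000, beta_start=1e-4, beta_end=0.02):
    """Linear variance schedule beta_1..beta_T (stored at indices 0..T-1)."""
    return np.linspace(beta_start, beta_end, T)

def q_sample(x0, t, alpha_bar, eps):
    """Draw x_t ~ q(x_t | x_0) in one shot.

    x0:  array of shape (batch, ...) holding clean samples
    t:   integer array of shape (batch,), timesteps in 1..T
    eps: standard Gaussian noise with the same shape as x0
    """
    ab = alpha_bar[t - 1]                        # abar_t for each batch element
    ab = ab.reshape(-1, *([1] * (x0.ndim - 1)))  # broadcast over data dimensions
    return np.sqrt(ab) * x0 + np.sqrt(1.0 - ab) * eps

betas = linear_beta_schedule()
alpha_bar = np.cumprod(1.0 - betas)              # abar_t = prod_{s<=t} alpha_s

rng = np.random.default_rng(0)
x0 = rng.uniform(-1.0, 1.0, size=(4, 3, 8, 8))   # stand-in "images" in [-1, 1]
t = rng.integers(1, len(betas) + 1, size=4)      # one random timestep per sample
x_t = q_sample(x0, t, alpha_bar, rng.standard_normal(x0.shape))
```

No loop over timesteps appears anywhere: the cumulative product ᾱ_t is computed once, and each noisy sample costs a single multiply-add.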
The generative part of the model is the reverse Markov chain, which starts from pure Gaussian noise x_T ~ N(0, I) and iteratively denoises it back to a sample x_0 from (approximately) the data distribution. Each reverse step is a learned Gaussian transition parameterized by a neural network with weights θ:
p_θ(x_{t-1} | x_t) = N(x_{t-1}; μ_θ(x_t, t), Σ_θ(x_t, t))[^1]
In the original DDPM, the variance was not learned. Ho et al. fixed Σ_θ(x_t, t) = σ_t² · I to one of two schedule-dependent values (either β_t or β̃_t, the posterior variance), and trained the network to predict only the mean μ_θ[^1]. Nichol and Dhariwal (2021) later showed that learning a per-step interpolation between these two bounds improves log-likelihood without harming sample quality[^4].
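To make the two fixed variance choices concrete, the sketch below computes both for the linear schedule. The closed form used for the posterior variance, β̃_t = (1 − ᾱ_{t−1}) / (1 − ᾱ_t) · β_t with ᾱ_0 = 1, is the standard one for this Gaussian chain; the variable names are illustrative assumptions.

```python
import numpy as np

betas = np.linspace(1e-4, 0.02, 1000)            # linear schedule, beta_1..beta_T
alphas = 1.0 - betas
alpha_bar = np.cumprod(alphas)
alpha_bar_prev = np.concatenate(([1.0], alpha_bar[:-1]))  # abar_{t-1}, with abar_0 = 1

sigma2_upper = betas                                               # sigma_t^2 = beta_t
sigma2_lower = (1.0 - alpha_bar_prev) / (1.0 - alpha_bar) * betas  # posterior variance

# The two choices differ mostly at small t and nearly coincide at large t.
ratio = sigma2_lower / sigma2_upper
```

Under these assumptions the ratio is 0 at t = 1 and very close to 1 at t = T, so the choice between the two values matters mainly for the last few denoising steps.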
In principle, μ_θ could be regressed directly. The key practical insight of DDPM, however, is that the mean has a particularly simple form in terms of the noise that was added during the forward process. Specifically, if the network ε_θ(x_t, t) predicts the noise ε that was injected to produce x_t from x_0, then the reverse-step mean is determined analytically by:
μ_θ(x_t, t) = (1 / sqrt(α_t)) · (x_t − (β_t / sqrt(1 − ᾱ_t)) · ε_θ(x_t, t))[^1]
Reparameterizing the network as a noise predictor rather than a mean predictor is a change of regression target, not of architecture. Together with a simplified loss, it is the change that makes DDPM training stable and effective[^1].
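A short numerical check can make the mean formula less opaque. The sketch below compares it with the mean of the forward posterior q(x_{t-1} | x_t, x_0), using the standard closed form for that posterior; when the true noise is plugged in as the prediction, the two should coincide. The function names and the test timestep t = 400 are assumptions chosen for illustration.

```python
import numpy as np

betas = np.linspace(1e-4, 0.02, 1000)
alphas = 1.0 - betas
alpha_bar = np.cumprod(alphas)

def mean_from_eps(x_t, t, eps_pred):
    """Reverse-step mean mu_theta(x_t, t) from a noise prediction (t in 1..T)."""
    i = t - 1
    coef = betas[i] / np.sqrt(1.0 - alpha_bar[i])
    return (x_t - coef * eps_pred) / np.sqrt(alphas[i])

def posterior_mean(x_t, x0, t):
    """Mean of the forward posterior q(x_{t-1} | x_t, x_0)."""
    i = t - 1
    ab_prev = alpha_bar[i - 1] if i > 0 else 1.0
    c0 = np.sqrt(ab_prev) * betas[i] / (1.0 - alpha_bar[i])
    ct = np.sqrt(alphas[i]) * (1.0 - ab_prev) / (1.0 - alpha_bar[i])
    return c0 * x0 + ct * x_t

rng = np.random.default_rng(0)
x0 = rng.standard_normal(16)
eps = rng.standard_normal(16)
t = 400
x_t = np.sqrt(alpha_bar[t - 1]) * x0 + np.sqrt(1.0 - alpha_bar[t - 1]) * eps

# With the true noise plugged in, the two means agree up to rounding error.
gap = np.max(np.abs(mean_from_eps(x_t, t, eps) - posterior_mean(x_t, x0, t)))
```

In other words, a perfect noise predictor reproduces the posterior mean exactly, which is why regressing ε is a valid substitute for regressing μ.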
Like a VAE, DDPM is a latent-variable model and can be trained by maximizing a variational lower bound (VLB, sometimes called the ELBO) on the data log-likelihood. The negative VLB decomposes into a sum of KL divergences between the forward posteriors q(x_{t-1} | x_t, x_0) and the learned reverse transitions p_θ(x_{t-1} | x_t), plus a small reconstruction term[^1]:
L_VLB = E_q [ L_T + Σ_{t=2}^{T} L_{t-1} + L_0 ][^1], where
L_T = D_KL(q(x_T | x_0) || p(x_T)) (prior matching),
L_{t-1} = D_KL(q(x_{t-1} | x_t, x_0) || p_θ(x_{t-1} | x_t)) for t = 2, ..., T (denoising matching),
L_0 = − log p_θ(x_0 | x_1) (reconstruction).
Each KL term is between Gaussians and has a closed-form expression in terms of the means and variances. In principle this objective is directly optimizable, and indeed Sohl-Dickstein et al. (2015) used a closely related formulation[^2].
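As an illustration of that closed form, the sketch below evaluates the KL divergence between two isotropic Gaussians. The helper name `kl_isotropic_gaussians` and the example values are assumptions. When the two variances are equal, as they are when the reverse variance is fixed to the posterior variance, the expression collapses to a scaled squared distance between the means.

```python
import numpy as np

def kl_isotropic_gaussians(mu1, var1, mu2, var2):
    """KL( N(mu1, var1*I) || N(mu2, var2*I) ) for scalar variances var1, var2."""
    mu1, mu2 = np.asarray(mu1, dtype=float), np.asarray(mu2, dtype=float)
    d = mu1.size
    return 0.5 * (d * (np.log(var2 / var1) + var1 / var2 - 1.0)
                  + np.sum((mu1 - mu2) ** 2) / var2)

mu_q = np.array([0.2, -0.1, 0.4])     # stand-in posterior mean
mu_p = np.array([0.0, 0.0, 0.5])      # stand-in model mean
var = 0.01                            # shared variance sigma_t^2

kl_full = kl_isotropic_gaussians(mu_q, var, mu_p, var)
kl_shortcut = np.sum((mu_q - mu_p) ** 2) / (2.0 * var)   # equal-variance special case
```

The equal-variance case is what turns each denoising-matching term into a regression problem on the mean.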
The central practical contribution of Ho et al. (2020) was to show that an unweighted mean-squared-error objective on the noise prediction produces dramatically better samples than the proper VLB[^1]:
L_simple(θ) = E_{t, x_0, ε} [ || ε − ε_θ( sqrt(ᾱ_t) · x_0 + sqrt(1 − ᾱ_t) · ε , t ) ||² ][^1]
In words: sample a clean image x_0, sample a random timestep t uniformly from {1, ..., T}, sample standard Gaussian noise ε, build the noisy image x_t in closed form, and train the network to recover ε under a squared-error loss. This is mathematically equivalent to a reweighted form of the VLB: L_VLB with each timestep's KL term rescaled so that terms at very small t, which correspond to removing tiny amounts of noise, are downweighted. Ho et al. argued that this lets the network focus on the more difficult denoising tasks at larger t, and found empirically that it yields lower FID even though it slightly worsens log-likelihood[^1].
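The simplified training loop is just five steps:
1. Sample a clean example x_0 from the training set.
2. Sample a timestep t uniformly from {1, ..., T}.
3. Sample noise ε ~ N(0, I).
4. Form x_t = sqrt(ᾱ_t) · x_0 + sqrt(1 − ᾱ_t) · ε.
5. Take a gradient step on || ε − ε_θ(x_t, t) ||².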
There is no adversarial discriminator to balance, no posterior collapse to manage, no second network. The training stability of this recipe — combined with its sample quality — is the principal reason diffusion models displaced GANs as the default for high-fidelity image synthesis between 2020 and 2022.
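The loss computation at the heart of this recipe can be sketched without any deep learning framework. In the example below, `eps_model` is a hypothetical callable standing in for the U-Net, the 16-dimensional Gaussian "data" is a toy stand-in for images, and the gradient step itself is omitted; only the construction of the training target and the loss are shown.

```python
import numpy as np

def l_simple(eps_model, x0, t, eps, alpha_bar):
    """Monte Carlo estimate of L_simple for one batch (mean over all elements).

    eps_model: hypothetical callable (x_t, t) -> predicted noise, same shape as x_t
    x0: clean data of shape (batch, dim);  t: timesteps in 1..T, shape (batch,)
    """
    ab = alpha_bar[t - 1][:, None]
    x_t = np.sqrt(ab) * x0 + np.sqrt(1.0 - ab) * eps   # closed-form forward sample
    return np.mean((eps - eps_model(x_t, t)) ** 2)

T = 1000
alpha_bar = np.cumprod(1.0 - np.linspace(1e-4, 0.02, T))
rng = np.random.default_rng(0)

x0 = rng.standard_normal((512, 16))          # 1. sample clean data (toy stand-in)
t = rng.integers(1, T + 1, size=512)         # 2. sample t uniformly from {1..T}
eps = rng.standard_normal(x0.shape)          # 3. sample Gaussian noise

# 4.-5. build x_t and score a deliberately useless model that always predicts 0;
# a real training loop would backpropagate this loss through a neural network.
loss_zero = l_simple(lambda x_t, t: np.zeros_like(x_t), x0, t, eps, alpha_bar)
```

A model that always predicts zero scores a loss near 1 (the variance of ε), which gives a rough baseline against which a learning network's progress can be read.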
DDPM uses a U-Net denoiser that follows the backbone of PixelCNN++ (Salimans et al., 2017), which is itself a U-Net built from Wide ResNet blocks (Zagoruyko and Komodakis, 2016)[^1]. The U-Net design originates from Ronneberger et al.'s 2015 work on biomedical image segmentation.
Architectural details specific to DDPM include[^1]:
The choice of a convolutional U-Net (rather than a transformer) was important historically: it tied diffusion models to a well-understood image architecture and made the field accessible to researchers without TPU-scale compute. Transformer-based denoisers — most notably the Diffusion Transformer (DiT) by Peebles and Xie (2023) — gained ground only after 2022, once the diffusion paradigm itself was well established.
Sampling from a trained DDPM follows the ancestral (Markov) reverse chain[^1]:
Because T = 1000 and each step requires a full forward pass through the denoiser, generating a single image costs approximately 1000 network evaluations[^1]. On 2020-era GPUs this translated to tens of seconds per CIFAR-10 sample and minutes per high-resolution LSUN sample — far slower than a one-shot GAN. Reducing this sampling cost became one of the central research directions of the next several years (see Follow-up improvements below).
The ancestral sampler is stochastic: fresh Gaussian noise is injected at every reverse step except the last, so running the same trained model from the same initial noise x_T still produces a different trajectory each time, unless the random seed that drives those injections is also fixed. Replacing that stochasticity with a deterministic update yields DDIM, discussed below.
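The reverse chain can be sketched as a plain loop. The version below fixes σ_t² = β_t, one of the two choices used in the original paper, and takes a hypothetical `eps_model` callable in place of a trained U-Net. To keep the example runnable without training, it uses a toy case: if the data distribution is itself N(0, I), the optimal noise predictor has the closed form sqrt(1 − ᾱ_t) · x_t.

```python
import numpy as np

def ancestral_sample(eps_model, shape, betas, rng):
    """Run the reverse chain x_T -> x_0 using sigma_t^2 = beta_t.

    eps_model: hypothetical callable (x_t, t) -> predicted noise, t in 1..T
    """
    alphas = 1.0 - betas
    alpha_bar = np.cumprod(alphas)
    x = rng.standard_normal(shape)                        # x_T ~ N(0, I)
    for t in range(len(betas), 0, -1):
        i = t - 1
        eps_pred = eps_model(x, t)
        mean = (x - betas[i] / np.sqrt(1.0 - alpha_bar[i]) * eps_pred) / np.sqrt(alphas[i])
        z = rng.standard_normal(shape) if t > 1 else 0.0  # no noise on the final step
        x = mean + np.sqrt(betas[i]) * z
    return x

betas = np.linspace(1e-4, 0.02, 1000)
alpha_bar = np.cumprod(1.0 - betas)

# Toy check: if the data distribution is N(0, I), the optimal noise predictor is
# eps(x_t, t) = sqrt(1 - abar_t) * x_t, so no trained network is needed here.
toy_model = lambda x, t: np.sqrt(1.0 - alpha_bar[t - 1]) * x

samples = ancestral_sample(toy_model, (4000, 2), betas, np.random.default_rng(0))
```

In this toy setting the samples should come out with roughly unit standard deviation, matching the assumed data distribution. Each call makes one model evaluation per timestep, which is the 1000-evaluation cost discussed above.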
In the original DDPM, the variance schedule {β_t} is linear in t, increasing from β_1 = 10⁻⁴ to β_T = 0.02 with T = 1000[^1]. These small β values were chosen so that the reverse-process Gaussian assumption, namely that q(x_{t-1} | x_t) is approximately Gaussian, holds tightly. The resulting cumulative product ᾱ_t starts near 1 (almost no noise) at t = 1 and approaches 0 (almost pure noise) at t = T.
Nichol and Dhariwal (2021), in "Improved Denoising Diffusion Probabilistic Models" (arXiv:2102.09672), observed that the linear schedule destroys information too quickly on lower-resolution images: the final stretch of the forward chain is almost pure noise, and a model trained with the linear schedule loses little FID when up to roughly 20 % of the reverse process is skipped, which suggests those steps contribute little[^4]. They proposed a cosine schedule defined indirectly through:
ᾱ_t = (f(t) / f(0))² , where f(t) = cos( ((t/T + s) / (1 + s)) · π/2 )
with s = 0.008 a small offset to prevent β_t from being too small near t = 0. The cosine schedule changes more slowly near both endpoints, giving more uniform information destruction and meaningfully improving FID, especially at 64×64[^4].
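The cosine schedule is easy to compute and to compare against the linear one. The sketch below follows the formula above; the helper names are assumptions, and the clipping of β_t at 0.999 is the safeguard Nichol and Dhariwal describe for the end of the chain, where ᾱ_t approaches zero.

```python
import numpy as np

def cosine_alpha_bar(T=1000, s=0.008):
    """abar_t for t = 0..T under the cosine schedule (abar_0 = 1)."""
    t = np.arange(T + 1)
    f = np.cos(((t / T + s) / (1.0 + s)) * np.pi / 2.0)
    return (f / f[0]) ** 2

def betas_from_alpha_bar(alpha_bar, max_beta=0.999):
    """beta_t = 1 - abar_t / abar_{t-1}, clipped near t = T to avoid singularities."""
    betas = 1.0 - alpha_bar[1:] / alpha_bar[:-1]
    return np.clip(betas, 0.0, max_beta)

T = 1000
cos_ab = cosine_alpha_bar(T)
cos_betas = betas_from_alpha_bar(cos_ab)
lin_ab = np.cumprod(1.0 - np.linspace(1e-4, 0.02, T))   # linear schedule, t = 1..T

# Fraction of signal variance remaining halfway through the chain.
halfway = {"cosine": cos_ab[T // 2], "linear": lin_ab[T // 2 - 1]}
```

Under these settings the cosine schedule retains roughly half of the signal variance at the midpoint, while the linear schedule has already dropped below a tenth, which is the faster information loss the text describes.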
Subsequent work introduced sigmoid schedules (Jabri et al., 2022) and learned schedules (Kingma et al., 2021's Variational Diffusion Models), as well as resolution-dependent rescalings. The general pattern is that the noise schedule must be calibrated to the spatial resolution and effective image statistics; what works for CIFAR-10 does not necessarily work for 1024×1024 images.
| Schedule | Defining quantity | Behavior | First used in |
|---|---|---|---|
| Linear | β linearly from 10⁻⁴ to 0.02 | Fast information loss early | DDPM (Ho et al. 2020)[^1] |
| Cosine | ᾱ_t ∝ cos²((t/T + s)/(1+s) · π/2), normalized so that ᾱ_0 = 1 | Uniform information loss | iDDPM (Nichol & Dhariwal 2021)[^4] |
| Sigmoid | S-shaped β | Smooth midpoint transition | Jabri et al. 2022 |
| Learned | Optimized end-to-end | Adaptive | VDM (Kingma et al. 2021) |
The headline result of the DDPM paper was on unconditional CIFAR-10 (32×32 natural images): a Fréchet Inception Distance (FID) of 3.17 and an Inception Score of 9.46[^1]. At the time of publication this FID was state-of-the-art for unconditional CIFAR-10 — better than the best GAN result available — and the Inception Score was competitive with the leading GAN models. Crucially, both numbers were achieved without any adversarial training, without truncation tricks, and without hyperparameter tuning peculiar to each metric.
DDPM was also evaluated on 256×256 images from several LSUN categories, with the paper reporting results on LSUN Bedroom, LSUN Church (also called Church Outdoor), and LSUN Cat[^1]. Sample quality was competitive with ProgressiveGAN and StyleGAN baselines, though not strictly state-of-the-art on every category. The visual fidelity of DDPM LSUN samples — particularly the church outdoor scenes — was an important demonstration that the diffusion framework scaled beyond toy resolutions.
Following the simplified L_simple training, DDPM's variational lower bound on test log-likelihood was worse than that of explicit likelihood models of the time, even though sample quality was higher[^1]. This sample-quality / likelihood tension was a recurring theme in early diffusion work and partly motivated the hybrid loss introduced by Nichol and Dhariwal (2021)[^4].
There is a tight equivalence between the DDPM noise-prediction objective and denoising score matching (Vincent, 2011; Song & Ermon, 2019)[^3]. Given the closed-form forward distribution q(x_t | x_0), the score of this conditional distribution at a noisy point x_t is:
∇_{x_t} log q(x_t | x_0) = − (x_t − sqrt(ᾱ_t) · x_0) / (1 − ᾱ_t) = − ε / sqrt(1 − ᾱ_t)
Therefore predicting the noise ε is equivalent (up to a constant scaling that depends on t) to predicting the score. The DDPM noise network is, in effect, a score model at every noise level — and the DDPM ancestral sampler is a discretization of a particular reverse-time SDE that uses the score[^5].
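The identity between the conditional score and the rescaled noise can be checked numerically. The sketch below differentiates log q(x_t | x_0) by central differences and compares the result with −ε / sqrt(1 − ᾱ_t); the value ᾱ_t = 0.3, the dimension, and the helper names are arbitrary choices for illustration.

```python
import numpy as np

def log_q(x_t, x0, ab):
    """log q(x_t | x_0) up to an additive constant, for abar_t = ab."""
    return -np.sum((x_t - np.sqrt(ab) * x0) ** 2) / (2.0 * (1.0 - ab))

def numerical_grad(f, x, h=1e-5):
    """Central-difference gradient of a scalar function f at x."""
    g = np.zeros_like(x)
    for i in range(x.size):
        e = np.zeros_like(x)
        e[i] = h
        g[i] = (f(x + e) - f(x - e)) / (2.0 * h)
    return g

rng = np.random.default_rng(0)
ab = 0.3                                   # an arbitrary abar_t in (0, 1)
x0, eps = rng.standard_normal(5), rng.standard_normal(5)
x_t = np.sqrt(ab) * x0 + np.sqrt(1.0 - ab) * eps

score_numeric = numerical_grad(lambda x: log_q(x, x0, ab), x_t)
score_from_eps = -eps / np.sqrt(1.0 - ab)
```

The two vectors should agree to numerical precision, so a network that outputs ε can be converted into a score estimate by dividing by −sqrt(1 − ᾱ_t).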
The 2021 paper "Score-Based Generative Modeling through Stochastic Differential Equations" by Song, Sohl-Dickstein, Kingma, Kumar, Ermon, and Poole (arXiv:2011.13456) made this unification explicit, deriving DDPM as a discretization of a Variance Preserving (VP) SDE and NCSN as a discretization of a Variance Exploding (VE) SDE, both governed by a single continuous-time formulation with the same score-matching loss[^5].
DDPM is also formally a deep hierarchical variational autoencoder with a fixed, non-learned encoder (the forward Markov chain) and a Gaussian-Markov decoder (the reverse chain)[^1]. This perspective makes the VLB derivation natural and connects DDPM to the broader VAE literature. The key innovation over earlier hierarchical VAEs is that the encoder is hand-designed and noise-only, eliminating the optimization difficulties (such as posterior collapse) that plagued learned hierarchical posteriors.
The reverse-time SDE view also clarifies the link to energy-based models: the score is the gradient of an implicit log-density, and DDPM's iterative denoising is a stabilized, annealed analogue of Langevin sampling from an energy-based model.
The original DDPM has several well-documented limitations that became the agenda of subsequent diffusion research:
The defining cost of DDPM is its 1000-step sampling chain: each image requires roughly T = 1000 forward passes through the U-Net[^1]. On 2020-era hardware this made DDPM orders of magnitude slower than GANs at inference time, and it remains the principal disadvantage of diffusion-based generation. Subsequent work attacked this bottleneck through faster solvers (DDIM, DPM-Solver), distillation (progressive distillation, consistency models), and architectural shortcuts (latent diffusion).
DDPM operates directly on raw pixels. For a 256×256 RGB image, every U-Net forward pass processes 196,608 input values, and the same is true for every one of the 1000 sampling steps. Scaling DDPM to 1024×1024 or video resolutions is prohibitively expensive without first compressing the data — a problem that Latent Diffusion Models (Rombach et al., 2022) solved by running the diffusion process in the latent space of a pretrained autoencoder[^6].
L_simple sacrifices log-likelihood for sample quality. Models trained with L_simple have worse density estimation than VAEs and PixelRNNs of comparable size, even though their samples look better[^1]. Nichol and Dhariwal (2021) partially closed this gap with a hybrid loss, but the underlying tension between perceptual fidelity and likelihood remains[^4].
The original DDPM is fully unconditional. Practical text-to-image generation required two further ingredients: classifier guidance (Dhariwal and Nichol, 2021)[^7] and especially classifier-free guidance (Ho and Salimans, 2021)[^8], the latter of which is now standard in essentially every conditional diffusion model.
The linear schedule of the original paper is not equally well suited to every resolution or set of image statistics: Nichol and Dhariwal found it suboptimal at low resolutions such as 64×64[^4], and there is no principled choice of schedule from theory alone. Schedule design has remained an active research topic since 2020[^4].
The DDPM paper opened a research program that has now produced dozens of major follow-ups. The most influential are summarized below.
Different authors (Nichol and Dhariwal), same model family. Three changes, namely the cosine schedule, learned variances, and a hybrid VLB+L_simple loss, improved both FID and log-likelihood, and a strided sampling schedule cut inference cost by roughly an order of magnitude with negligible quality loss[^4].
Denoising Diffusion Implicit Models (arXiv:2010.02502) defined a family of non-Markovian reverse processes that share DDPM's marginals q(x_t | x_0) and therefore can be sampled from any DDPM-trained model without retraining[^9]. With the stochasticity parameter set to zero, DDIM yields a deterministic ODE-like sampler that produces high-quality samples in 50–100 steps — roughly 10×–20× fewer than DDPM[^9]. DDIM also gives the model a meaningful latent space: the same initial noise vector always maps to the same image, enabling interpolation and inversion.
| Property | DDPM | DDIM (σ = 0) |
|---|---|---|
| Sampling process | Stochastic (SDE) | Deterministic (ODE) |
| Typical steps | ~1000 | 50–100 (sometimes 10–20) |
| Same noise → same image | No | Yes |
| Latent space interpolation | No | Yes |
| Retraining required | — | None |
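A deterministic DDIM update (σ = 0) can be sketched as follows. The update first forms an estimate of x_0 from the noise prediction and then re-noises it to the earlier timestep using the same predicted noise; this is the standard form of the DDIM step, while the function names, the evenly strided timestep grid, and the hypothetical `eps_model` callable are assumptions of the example.

```python
import numpy as np

def ddim_step(x_t, eps_pred, ab_t, ab_prev):
    """One deterministic DDIM update (sigma = 0) from abar_t to abar_prev."""
    x0_pred = (x_t - np.sqrt(1.0 - ab_t) * eps_pred) / np.sqrt(ab_t)
    return np.sqrt(ab_prev) * x0_pred + np.sqrt(1.0 - ab_prev) * eps_pred

def ddim_sample(eps_model, x_T, alpha_bar, num_steps=50):
    """Deterministic sampling on an evenly strided subset of timesteps 1..T."""
    T = len(alpha_bar)
    ts = np.linspace(T, 1, num_steps).round().astype(int)   # decreasing timesteps
    x = x_T
    for k, t in enumerate(ts):
        ab_t = alpha_bar[t - 1]
        ab_prev = alpha_bar[ts[k + 1] - 1] if k + 1 < len(ts) else 1.0  # abar_0 = 1
        x = ddim_step(x, eps_model(x, t), ab_t, ab_prev)
    return x

alpha_bar = np.cumprod(1.0 - np.linspace(1e-4, 0.02, 1000))
rng = np.random.default_rng(0)
x0, eps = rng.standard_normal(8), rng.standard_normal(8)
x_800 = np.sqrt(alpha_bar[799]) * x0 + np.sqrt(1.0 - alpha_bar[799]) * eps

# With the true noise, one step lands on the same (x0, eps) pair at the earlier timestep.
x_600 = ddim_step(x_800, eps, alpha_bar[799], alpha_bar[599])
```

Because no random numbers are drawn inside the loop, the output is a function of x_T alone, which is what gives DDIM the "same noise, same image" property in the table above.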
Song et al. (2021) unified DDPM and NCSN under a continuous-time SDE and derived a corresponding probability-flow ODE, which gives deterministic sampling and exact likelihood computation and connects diffusion models to the wider literature on normalizing flows[^5].
Dhariwal and Nichol's "Diffusion Models Beat GANs on Image Synthesis" (arXiv:2105.05233) introduced classifier guidance, which uses the gradient of a separately trained noise-aware classifier to push samples toward a desired class, and used it to set new ImageNet FID records, decisively beating BigGAN on class-conditional generation[^7]. The shortened title "Diffusion Models Beat GANs" became a slogan for the broader shift.
Classifier-Free Diffusion Guidance (arXiv:2207.12598) eliminated the separate classifier by training the denoiser jointly as both conditional and unconditional model (with the conditioning randomly dropped during training)[^8]. At sample time, the guided noise prediction is extrapolated away from the unconditional prediction toward the conditional one:
ε_guided = ε_θ(x_t, t) + w · (ε_θ(x_t, t, c) − ε_θ(x_t, t))[^8]
where w is the guidance scale. CFG has become the universal mechanism by which text-to-image models such as Stable Diffusion, DALL-E 2, and Imagen obtain strong prompt adherence[^6][^8].
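The guidance formula is a one-line extrapolation. In the sketch below, the two input arrays are made-up stand-ins for the unconditional and conditional outputs of a denoiser, and `cfg_eps` is an illustrative name rather than any library's API.

```python
import numpy as np

def cfg_eps(eps_uncond, eps_cond, w):
    """Guided noise prediction: eps_uncond + w * (eps_cond - eps_uncond)."""
    return eps_uncond + w * (eps_cond - eps_uncond)

eps_uncond = np.array([0.0, 1.0, -1.0])   # stand-in for eps_theta(x_t, t)
eps_cond = np.array([1.0, 1.0, 0.0])      # stand-in for eps_theta(x_t, t, c)

for w in (0.0, 1.0, 3.0):
    print(w, cfg_eps(eps_uncond, eps_cond, w))
```

With this convention, w = 0 returns the unconditional prediction, w = 1 returns the conditional one, and w > 1 pushes past the conditional prediction in the direction that separates it from the unconditional one.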
"High-Resolution Image Synthesis with Latent Diffusion Models" (arXiv:2112.10752) ran the DDPM process not on pixels but in the latent space of a pretrained autoencoder, typically compressing 512×512 images down to 64×64 latents before any diffusion[^6]. This reduced the compute cost of every sampling step by 1–2 orders of magnitude, enabling high-resolution text-to-image generation on consumer GPUs. The publicly released Stable Diffusion model (Stability AI, August 2022) is the most widely used instantiation of this Latent Diffusion framework and is a direct architectural descendant of DDPM[^6].
Progressive distillation (Salimans and Ho, 2022) and consistency models (Song, Dhariwal, Chen, Sutskever, 2023) compressed multi-step diffusion samplers into models that produce high-quality samples in 1–4 steps[^10]. Consistency distillation works by training a student model to map any point on a diffusion trajectory directly to its endpoint, eliminating the iterative chain at inference time.
Flow Matching for Generative Modeling (arXiv:2210.02747) generalized the diffusion idea to arbitrary transport flows: instead of fixing a noise-adding forward process, the model learns a velocity field that transports samples from a source distribution to the data distribution along any chosen interpolant[^11]. For the Gaussian-source special case, flow matching is mathematically equivalent to a Gaussian diffusion model with appropriate parameterization[^12]. Stable Diffusion 3 (Esser et al., 2024) and several Meta video models are trained with flow matching.
Within roughly eighteen months of the DDPM preprint, diffusion models had displaced GANs as the default backbone for high-fidelity image generation. By mid-2022, three of the most-discussed AI systems of the year — DALL-E 2 (OpenAI, April 2022), Imagen (Google, May 2022), and Stable Diffusion (Stability AI, August 2022) — were all diffusion models trained on the DDPM recipe (with classifier-free guidance and, for Stable Diffusion, latent-space efficiency)[^6][^8]. The DDPM paper itself, by 2026, is one of the most cited generative modeling publications of the deep learning era, with citations crossing into the tens of thousands[^1].
The influence has propagated well beyond static images:
Methodologically, DDPM cemented a broader shift: generative modeling no longer required adversarial training, and likelihood-based or score-based objectives could deliver state-of-the-art perceptual quality with much greater training stability. That recipe — a U-Net or transformer denoiser, an ε-prediction MSE loss, classifier-free guidance, and a latent-space backbone — is, in 2026, the most copied architecture in generative AI.