DDPM

Denoising Diffusion Probabilistic Models (DDPM) are a class of generative model introduced by Jonathan Ho, Ajay Jain, and Pieter Abbeel of UC Berkeley in their June 2020 paper "Denoising Diffusion Probabilistic Models" (arXiv:2006.11239)[^1]. DDPMs learn to produce data samples by reversing a gradual noising process: a fixed forward Markov chain progressively corrupts a data point with Gaussian noise over many timesteps, and a learned reverse chain — parameterized by a neural network — is trained to subtract that noise step by step until a clean sample is recovered. The work was foundational because it demonstrated, for the first time at scale, that diffusion-based generation could match the image quality of state-of-the-art generative adversarial networks (GANs), reporting an unconditional CIFAR-10 FID of 3.17 and an Inception Score of 9.46[^1].

DDPM combined and refined ideas from Sohl-Dickstein et al.'s 2015 nonequilibrium thermodynamics framework[^2] and Song & Ermon's 2019 score-matching with Langevin dynamics[^3] into a simple, stable training recipe: a U-Net denoiser trained with a mean-squared-error loss to predict the noise added to each training example. That recipe became the template for nearly every major image, audio, video, and 3D diffusion model that followed, including DDIM, classifier-free guidance, Stable Diffusion, DALL-E 2, and Imagen. As of 2026, the DDPM paper remains one of the most influential generative modeling publications of the deep learning era[^1][^4].

Background

Sohl-Dickstein's 2015 framework

The conceptual origin of diffusion-based generative models is the 2015 paper "Deep Unsupervised Learning using Nonequilibrium Thermodynamics" by Jascha Sohl-Dickstein, Eric A. Weiss, Niru Maheswaranathan, and Surya Ganguli (arXiv:1503.03585)[^2]. Drawing on ideas from non-equilibrium statistical physics, the authors proposed to "systematically and slowly destroy structure in a data distribution through an iterative forward diffusion process" and then learn a reverse diffusion process that restores structure, yielding a flexible and tractable generative model[^2]. The framework supported sampling, likelihood evaluation, and conditional inference, but the sample quality reported in 2015 was modest compared to contemporaneous GANs and variational autoencoders (VAEs), and the work did not attract sustained attention from the broader research community for several years.

GAN dominance and the search for stable alternatives

Between 2014 and 2019, image generation was dominated by generative adversarial networks, following Goodfellow et al.'s 2014 introduction of the GAN framework. Architectures such as DCGAN, Progressive GAN, StyleGAN, and BigGAN produced increasingly photorealistic samples, especially after the latter's class-conditional ImageNet results in 2018. However, GAN training was notoriously unstable: the alternating optimization of generator and discriminator was sensitive to hyperparameters, prone to mode collapse, and difficult to scale reliably. VAEs offered stable training but produced visibly blurrier samples. The field thus had an open need for a likelihood-based generative model that combined GAN-level fidelity with VAE-level training stability.

Score-based models and Langevin dynamics

In parallel, Yang Song and Stefano Ermon's 2019 paper "Generative Modeling by Estimating Gradients of the Data Distribution" (arXiv:1907.05600) introduced Noise Conditional Score Networks (NCSN)[^3]. Instead of learning a density directly, NCSN learned the score function — the gradient of the log probability density — at multiple noise levels, and then used annealed Langevin dynamics to sample by following the score from high noise back to clean data. This score-matching approach independently arrived at many of the same structural ideas that DDPM would crystallize: a noise hierarchy, a denoising network parameterized by noise level, and an iterative sampling procedure. The DDPM and score-matching threads were later unified in a continuous-time stochastic differential equation framework by Song et al. (2021)[^5].

Authors

The DDPM paper had three authors, all affiliated with UC Berkeley at the time of publication[^1]:

Jonathan Ho was a PhD student at UC Berkeley working with Pieter Abbeel. He later moved to Google Research, where he co-led work on classifier-free guidance, Imagen, video diffusion models, and progressive distillation. Ho's name appears on many of the most influential diffusion papers of the early 2020s.
Ajay Jain was also a UC Berkeley graduate student. He subsequently worked on text-to-3D generation (DreamFusion) and co-founded Genmo, a video diffusion startup.
Pieter Abbeel is a professor at UC Berkeley, co-founder of Covariant, and a prominent figure in deep reinforcement learning and robotics. The DDPM paper sits somewhat outside his primary research stream, reflecting the breadth of generative modeling work emerging from the Berkeley AI Research lab in the late 2010s.

The paper was published at NeurIPS 2020 (the 34th Conference on Neural Information Processing Systems) and the preprint was posted to arXiv on 19 June 2020[^1].

Forward process

The forward — or diffusion — process is a fixed Markov chain that takes a data sample x_0 drawn from the real distribution q(x_0) and produces a sequence x_1, x_2, ..., x_T of increasingly noisy versions. At each step, isotropic Gaussian noise is added according to a predefined variance schedule {β_1, β_2, ..., β_T}:

q(x_t | x_{t-1}) = N(x_t; sqrt(1 − β_t) · x_{t-1}, β_t · I)[^1]

By convention, T = 1000 in the original DDPM experiments[^1]. The schedule is chosen so that by step T the data has been almost entirely replaced by standard Gaussian noise — that is, q(x_T) approaches N(0, I) regardless of x_0.

A crucial algebraic property of this Gaussian chain is that x_t admits a closed-form marginal in terms of x_0. Letting α_t = 1 − β_t and ᾱ_t = ∏_{s=1}^{t} α_s, one can derive:

q(x_t | x_0) = N(x_t; sqrt(ᾱ_t) · x_0, (1 − ᾱ_t) · I)[^1]

This means a noisy sample at any timestep can be generated in one shot:

x_t = sqrt(ᾱ_t) · x_0 + sqrt(1 − ᾱ_t) · ε, ε ~ N(0, I)

That closed form is the foundation of efficient training: rather than simulating a 1000-step chain for every gradient update, the model only needs to sample a random timestep t, draw fresh noise ε, and compute x_t directly[^1]. Because the forward process has no learnable parameters, it acts only as a data augmentation that pairs each clean image with a noisy counterpart at a random noise level.
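The one-shot noising step can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' code: the function names `linear_beta_schedule` and `q_sample`, the 0-based array layout (index t − 1 holds the value for timestep t), and the random stand-in image are all assumptions made for the example. The schedule endpoints are the linear values quoted later in this article.

```python
import numpy as np

def linear_beta_schedule(T=1000, beta_start=1e-4, beta_end=0.02):
    """Linearly spaced variances beta_1..beta_T (array index 0 holds beta_1)."""
    return np.linspace(beta_start, beta_end, T)

def q_sample(x0, t, alpha_bar, rng):
    """Draw x_t ~ q(x_t | x_0) directly, for a 1-indexed timestep t."""
    eps = rng.standard_normal(x0.shape)          # eps ~ N(0, I)
    a = alpha_bar[t - 1]                         # cumulative product up to step t
    x_t = np.sqrt(a) * x0 + np.sqrt(1.0 - a) * eps
    return x_t, eps

betas = linear_beta_schedule()
alpha_bar = np.cumprod(1.0 - betas)              # alpha_bar[t-1] = prod_{s<=t} (1 - beta_s)

rng = np.random.default_rng(0)
x0 = rng.uniform(-1.0, 1.0, size=(3, 32, 32))    # hypothetical image scaled to [-1, 1]
x_t, eps = q_sample(x0, t=500, alpha_bar=alpha_bar, rng=rng)
```

Because `q_sample` never loops over intermediate timesteps, the cost of producing a training pair is independent of t.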

Reverse process

The generative part of the model is the reverse Markov chain, which starts from pure Gaussian noise x_T ~ N(0, I) and iteratively denoises it back to a sample x_0 from (approximately) the data distribution. Each reverse step is a learned Gaussian transition parameterized by a neural network with weights θ:

p_θ(x_{t-1} | x_t) = N(x_{t-1}; μ_θ(x_t, t), Σ_θ(x_t, t))[^1]

In the original DDPM, the variance was not learned. Ho et al. fixed Σ_θ(x_t, t) = σ_t² · I to one of two schedule-dependent values (either β_t or β̃_t, the posterior variance), and trained the network to predict only the mean μ_θ[^1]. Nichol and Dhariwal (2021) later showed that learning a per-step interpolation between these two bounds improves log-likelihood without harming sample quality[^4].

In principle, μ_θ could be regressed directly. The key practical insight of DDPM, however, is that the mean has a particularly simple form in terms of the noise that was added during the forward process. Specifically, if the network ε_θ(x_t, t) predicts the noise ε that was injected to produce x_t from x_0, then the reverse-step mean is determined analytically by:

μ_θ(x_t, t) = (1 / sqrt(α_t)) · (x_t − (β_t / sqrt(1 − ᾱ_t)) · ε_θ(x_t, t))[^1]

Reparameterizing the network as a noise predictor rather than a mean predictor changes the regression target, not the architecture. Together with a simplified loss, it is the change that makes DDPM training stable and effective[^1].
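A quick numerical check can make the mean formula less opaque. The sketch below assumes the linear schedule with T = 1000 and uses the standard closed-form mean of the forward posterior q(x_{t-1} | x_t, x_0), which is not written out in the text above; the helper names are hypothetical. If the network predicted the true noise exactly, the reverse-step mean would coincide with that posterior mean.

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bar = np.cumprod(alphas)
alpha_bar_prev = np.concatenate([[1.0], alpha_bar[:-1]])   # convention: alpha_bar_0 = 1

def mean_from_eps(x_t, eps_pred, t):
    """Reverse-step mean computed from a noise prediction (t is 1-indexed)."""
    i = t - 1
    return (x_t - betas[i] / np.sqrt(1.0 - alpha_bar[i]) * eps_pred) / np.sqrt(alphas[i])

def posterior_mean(x_t, x0, t):
    """Mean of q(x_{t-1} | x_t, x_0), from Gaussian conditioning (assumed formula)."""
    i = t - 1
    c0 = np.sqrt(alpha_bar_prev[i]) * betas[i] / (1.0 - alpha_bar[i])
    ct = np.sqrt(alphas[i]) * (1.0 - alpha_bar_prev[i]) / (1.0 - alpha_bar[i])
    return c0 * x0 + ct * x_t

rng = np.random.default_rng(0)
x0 = rng.standard_normal(8)
eps = rng.standard_normal(8)
t = 300
x_t = np.sqrt(alpha_bar[t - 1]) * x0 + np.sqrt(1.0 - alpha_bar[t - 1]) * eps

# With the true noise plugged in, the two means agree up to rounding error.
gap = np.max(np.abs(mean_from_eps(x_t, eps, t) - posterior_mean(x_t, x0, t)))
```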

Training objective

Variational lower bound

Like a VAE, DDPM is a latent-variable model and can be trained by maximizing a variational lower bound (VLB, sometimes called the ELBO) on the data log-likelihood. The negative VLB decomposes into a sum of KL divergences between the forward posteriors q(x_{t-1} | x_t, x_0) and the learned reverse transitions p_θ(x_{t-1} | x_t), plus a small reconstruction term[^1]:

L_VLB = E_q [ D_KL(q(x_T | x_0) || p(x_T)) + Σ_{t=2}^{T} D_KL(q(x_{t-1} | x_t, x_0) || p_θ(x_{t-1} | x_t)) − log p_θ(x_0 | x_1) ][^1]

The three terms inside the expectation are, in order, the prior-matching term, the denoising-matching terms, and the reconstruction term.

Each KL term is between Gaussians and has a closed-form expression in terms of the means and variances. In principle this objective is directly optimizable, and indeed Sohl-Dickstein et al. (2015) used a closely related formulation[^2].
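As a worked instance of that closed form, consider the fixed-variance choice Σ_θ = σ_t² · I described in the reverse-process section. The following is a sketch of the standard derivation rather than a quotation of the paper: μ̃_t denotes the mean of the forward posterior q(x_{t-1} | x_t, x_0), and C collects terms that do not depend on θ. The last line substitutes the noise-prediction form of μ_θ and x_0 = (x_t − sqrt(1 − ᾱ_t) · ε) / sqrt(ᾱ_t).

```latex
\begin{aligned}
D_{\mathrm{KL}}\big(q(x_{t-1}\mid x_t,x_0)\,\|\,p_\theta(x_{t-1}\mid x_t)\big)
  &= \frac{1}{2\sigma_t^2}\,\big\|\tilde\mu_t(x_t,x_0)-\mu_\theta(x_t,t)\big\|^2 + C,\\[4pt]
\tilde\mu_t(x_t,x_0)
  &= \frac{\sqrt{\bar\alpha_{t-1}}\,\beta_t}{1-\bar\alpha_t}\,x_0
   + \frac{\sqrt{\alpha_t}\,(1-\bar\alpha_{t-1})}{1-\bar\alpha_t}\,x_t,\\[4pt]
\frac{1}{2\sigma_t^2}\,\big\|\tilde\mu_t-\mu_\theta\big\|^2
  &= \frac{\beta_t^2}{2\sigma_t^2\,\alpha_t\,(1-\bar\alpha_t)}\,
     \big\|\varepsilon-\varepsilon_\theta(x_t,t)\big\|^2 .
\end{aligned}
```

Each denoising-matching term is therefore a squared error on the noise, multiplied by a prefactor that depends only on t.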

The simplified L_simple objective

The central practical contribution of Ho et al. (2020) was to show that an unweighted mean-squared-error objective on the noise prediction produces dramatically better samples than the proper VLB[^1]:

L_simple(θ) = E_{t, x_0, ε} [ || ε − ε_θ( sqrt(ᾱ_t) · x_0 + sqrt(1 − ᾱ_t) · ε , t ) ||² ][^1]

In words: sample a clean image x_0, sample a random timestep t uniformly from {1, ..., T}, sample standard Gaussian noise ε, build the noisy image x_t in closed form, and train the network to recover ε in the L² sense. This is mathematically equivalent to a reweighted form of the VLB: each timestep's KL term is rescaled by a t-dependent factor, which down-weights the terms at very small t. Ho et al. argued that this reweighting lets the network focus on the more difficult denoising tasks at larger t, and found empirically that it yields lower FID even though it slightly worsens log-likelihood[^1].

The simplified training loop is just five steps:

Sample a data point x_0 from the training set.
Sample a timestep t ~ Uniform({1, ..., T}).
Sample noise ε ~ N(0, I).
Compute x_t = sqrt(ᾱ_t) · x_0 + sqrt(1 − ᾱ_t) · ε.
Minimize || ε − ε_θ(x_t, t) ||² with stochastic gradient descent.
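The first four steps, plus the loss evaluation of the fifth, can be sketched as follows. This is an illustration under stated assumptions: `eps_model` is a hypothetical callable standing in for the U-Net, `zero_model` is a placeholder that always predicts zero noise, and the gradient update itself is left to whatever autodiff framework hosts the real network.

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)            # linear schedule, assumed here
alpha_bar = np.cumprod(1.0 - betas)

def l_simple_batch(eps_model, x0_batch, rng):
    """Monte Carlo estimate of L_simple on one minibatch of clean samples."""
    n = x0_batch.shape[0]
    t = rng.integers(1, T + 1, size=n)                       # t ~ Uniform({1, ..., T})
    eps = rng.standard_normal(x0_batch.shape)                # eps ~ N(0, I)
    a = alpha_bar[t - 1].reshape(n, *([1] * (x0_batch.ndim - 1)))
    x_t = np.sqrt(a) * x0_batch + np.sqrt(1.0 - a) * eps     # closed-form noising
    err = eps - eps_model(x_t, t)
    return np.mean(np.sum(err.reshape(n, -1) ** 2, axis=1))  # mean squared L2 norm

def zero_model(x_t, t):
    """Placeholder predictor; a trained network would go here."""
    return np.zeros_like(x_t)

rng = np.random.default_rng(0)
x0_batch = rng.uniform(-1.0, 1.0, size=(64, 3, 8, 8))        # hypothetical tiny images
loss = l_simple_batch(zero_model, x0_batch, rng)
```

With the zero predictor the loss is just the squared norm of the noise, whose expectation equals the number of dimensions per example (3 · 8 · 8 = 192 here); a useful model has to do better than that baseline.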

There is no adversarial discriminator to balance, no posterior collapse to manage, no second network. The training stability of this recipe — combined with its sample quality — is the principal reason diffusion models displaced GANs as the default for high-fidelity image synthesis between 2020 and 2022.

U-Net architecture used

DDPM uses a U-Net denoiser that follows the PixelCNN++ backbone of Salimans et al. (2017), itself a U-Net built on the Wide ResNet of Zagoruyko and Komodakis (2016). The U-Net design originates from Ronneberger et al.'s 2015 work on biomedical image segmentation[^1].

Architectural details specific to DDPM include[^1]:

Encoder–decoder structure with skip connections at matching resolutions, so fine spatial detail can flow directly from early to late layers.
Residual blocks at each resolution, with Group Normalization (Wu and He, 2018) used in place of Batch Normalization to behave well at small batch sizes.
Timestep embedding. The integer timestep t is mapped to a sinusoidal positional embedding (as in the Transformer), passed through a small MLP, and added into every residual block. This lets a single set of weights handle all 1000 noise levels.
Self-attention layers at the 16×16 feature-map resolution (and in some variants additional resolutions), enabling the model to capture global structure that pure convolutions miss[^1].
Multi-scale design. The DDPM CIFAR-10 model has roughly 35 M parameters; the LSUN models are larger.
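The timestep embedding in the list above can be illustrated with a short function. This is a sketch of a Transformer-style sinusoidal embedding, not the exact layout of the released DDPM code: the frequency spacing, the choice to concatenate all sines before all cosines, the `max_period` default, and the requirement that `dim` be even are assumptions made for the example.

```python
import numpy as np

def timestep_embedding(t, dim, max_period=10000.0):
    """Map integer timesteps to sinusoidal features of shape (len(t), dim); dim must be even."""
    t = np.asarray(t, dtype=np.float64).reshape(-1)
    half = dim // 2
    # Geometrically spaced frequencies from 1 down towards 1 / max_period.
    freqs = np.exp(-np.log(max_period) * np.arange(half) / half)
    args = t[:, None] * freqs[None, :]
    return np.concatenate([np.sin(args), np.cos(args)], axis=1)

emb = timestep_embedding([0, 1, 999], dim=128)
```

In a full model this vector would pass through a small MLP before being added inside each residual block, so that the same convolutional weights can behave differently at different noise levels.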

The choice of a convolutional U-Net (rather than a transformer) was important historically: it tied diffusion models to a well-understood image architecture and made the field accessible to researchers without TPU-scale compute. Transformer-based denoisers — most notably the Diffusion Transformer (DiT) by Peebles and Xie (2023) — gained ground only after 2022, once the diffusion paradigm itself was well established.

Sampling procedure

Sampling from a trained DDPM follows the ancestral (Markov) reverse chain[^1]:

Draw x_T ~ N(0, I).
For t = T, T−1, ..., 1:
a. Compute the predicted noise ε_θ(x_t, t).
b. Compute the reverse-step mean μ_θ(x_t, t) = (1 / sqrt(α_t)) · (x_t − (β_t / sqrt(1 − ᾱ_t)) · ε_θ(x_t, t)).
c. Sample z ~ N(0, I) for t > 1, otherwise z = 0.
d. Set x_{t-1} = μ_θ(x_t, t) + σ_t · z.
Return x_0.
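The procedure above can be sketched as a short loop. Everything named here is illustrative: `eps_model` is a hypothetical callable standing in for the trained U-Net, and σ_t² = β_t is used because it is one of the two fixed variance choices mentioned earlier. To keep the example runnable without a trained network, it uses a toy oracle for a data distribution concentrated at the origin, for which the true noise is x_t / sqrt(1 − ᾱ_t).

```python
import numpy as np

def ddpm_sample(eps_model, shape, betas, rng):
    """Ancestral sampling with sigma_t^2 = beta_t (timesteps are 1-indexed)."""
    alphas = 1.0 - betas
    alpha_bar = np.cumprod(alphas)
    x = rng.standard_normal(shape)                       # x_T ~ N(0, I)
    for t in range(len(betas), 0, -1):
        i = t - 1
        eps_pred = eps_model(x, t)                       # one network evaluation per step
        mean = (x - betas[i] / np.sqrt(1.0 - alpha_bar[i]) * eps_pred) / np.sqrt(alphas[i])
        z = rng.standard_normal(shape) if t > 1 else np.zeros(shape)
        x = mean + np.sqrt(betas[i]) * z                 # x_{t-1}
    return x

betas = np.linspace(1e-4, 0.02, 1000)
alpha_bar = np.cumprod(1.0 - betas)

def origin_oracle(x_t, t):
    """Exact noise for toy data that is always x_0 = 0 (illustration only)."""
    return x_t / np.sqrt(1.0 - alpha_bar[t - 1])

sample = ddpm_sample(origin_oracle, (4, 4), betas, np.random.default_rng(0))
```

With the oracle, the chain returns to the origin up to rounding error, which is a convenient sanity check on the indexing before a real denoiser is plugged in.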

Because T = 1000 and each step requires a full forward pass through the denoiser, generating a single image costs approximately 1000 network evaluations[^1]. On 2020-era GPUs this translated to tens of seconds per CIFAR-10 sample and minutes per high-resolution LSUN sample — far slower than a one-shot GAN. Reducing this sampling cost became one of the central research directions of the next several years (see Follow-up improvements below).

The ancestral sampler is stochastic: because fresh noise z is drawn at step (c) and added at step (d), running the same trained model from the same initial noise x_T produces a different trajectory each time, unless the random seed that drives z is also fixed. Replacing that stochasticity with a deterministic update yields DDIM, discussed below.

Noise schedule

Linear β schedule

In the original DDPM, the variance schedule {β_t} is linear in t, increasing from β_1 = 10⁻⁴ to β_T = 0.02 with T = 1000[^1]. These small β values were chosen so that the reverse-process Gaussian assumption — that q(x_{t-1} | x_t) is approximately Gaussian — holds tightly. The resulting cumulative product ᾱ_t starts near 1 (almost no noise) at t = 0 and approaches 0 (almost pure noise) at t = T.

Cosine schedule

Nichol and Dhariwal (2021), in "Improved Denoising Diffusion Probabilistic Models" (arXiv:2102.09672), observed that the linear schedule destroys information too quickly on lower-resolution images such as 64×64 and 32×32: the final stretch of the forward process is almost pure noise, and they reported that skipping up to roughly 20 % of the reverse process barely changes FID, so many early reverse steps have little to do[^4]. They proposed a cosine schedule defined indirectly through:

ᾱ_t = (f(t) / f(0))² , where f(t) = cos( ((t/T + s) / (1 + s)) · π/2 )

with s = 0.008 a small offset to prevent β_t from being too small near t = 0. The cosine schedule changes more slowly near both endpoints, giving more uniform information destruction and meaningfully improving FID, especially at 64×64[^4].
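The definition can be turned into per-step variances with a few lines of NumPy. This is a sketch: the function names are hypothetical, the conversion β_t = 1 − ᾱ_t / ᾱ_{t−1} follows from the definition of ᾱ_t as a cumulative product, and the cap at 0.999 follows the clipping described by Nichol and Dhariwal, which is not mentioned in the text above. The linear schedule is included only for comparison.

```python
import numpy as np

def cosine_alpha_bar(T=1000, s=0.008):
    """alpha_bar_t for t = 0..T under the cosine schedule (index t holds alpha_bar_t)."""
    t = np.arange(T + 1)
    f = np.cos(((t / T + s) / (1.0 + s)) * np.pi / 2.0)
    return (f / f[0]) ** 2                       # normalised so alpha_bar_0 = 1

def betas_from_alpha_bar(alpha_bar, max_beta=0.999):
    """Recover beta_1..beta_T from consecutive ratios, capped to avoid beta_t near 1."""
    betas = 1.0 - alpha_bar[1:] / alpha_bar[:-1]
    return np.clip(betas, 0.0, max_beta)

alpha_bar_cos = cosine_alpha_bar()
betas_cos = betas_from_alpha_bar(alpha_bar_cos)

# Linear DDPM schedule for comparison (0-based: index t-1 holds alpha_bar_t).
alpha_bar_lin = np.cumprod(1.0 - np.linspace(1e-4, 0.02, 1000))
```

Comparing `alpha_bar_cos[500]` with `alpha_bar_lin[499]` shows that the cosine schedule retains noticeably more signal at the midpoint of the chain.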

Other schedules

Subsequent work introduced sigmoid schedules (Jabri et al., 2022) and learned schedules (Kingma et al., 2021's Variational Diffusion Models), as well as resolution-dependent rescalings. The general pattern is that the noise schedule must be calibrated to the spatial resolution and effective image statistics; what works for CIFAR-10 does not necessarily work for 1024×1024 images.

Schedule	Defining quantity	Behavior	First used in
Linear	β linearly from 10⁻⁴ to 0.02	Fast information loss early	DDPM (Ho et al. 2020)[^1]
Cosine	ᾱ_t ∝ cos²((t/T + s)/(1+s) · π/2), normalized so ᾱ_0 = 1	More uniform information loss	iDDPM (Nichol & Dhariwal 2021)[^4]
Sigmoid	S-shaped β	Smooth midpoint transition	Jabri et al. 2022
Learned	Optimized end-to-end	Adaptive	VDM (Kingma et al. 2021)

Empirical results

CIFAR-10

The headline result of the DDPM paper was on unconditional CIFAR-10 (32×32 natural images): a Fréchet Inception Distance (FID) of 3.17 and an Inception Score of 9.46[^1]. At the time of publication this FID was state-of-the-art for unconditional CIFAR-10 — better than the best GAN result available — and the Inception Score was competitive with the leading GAN models. Crucially, both numbers were achieved without any adversarial training, without truncation tricks, and without hyperparameter tuning peculiar to each metric.

LSUN

DDPM was also evaluated on 256×256 images from several LSUN categories, with the paper reporting results on LSUN Bedroom, LSUN Church (also called Church Outdoor), and LSUN Cat[^1]. Sample quality was competitive with ProgressiveGAN and StyleGAN baselines, though not strictly state-of-the-art on every category. The visual fidelity of DDPM LSUN samples — particularly the church outdoor scenes — was an important demonstration that the diffusion framework scaled beyond toy resolutions.

Likelihoods

Following the simplified L_simple training, DDPM's variational lower bound on test log-likelihood was worse than that of explicit likelihood models of the time, even though sample quality was higher[^1]. This sample-quality / likelihood tension was a recurring theme in early diffusion work and partly motivated the hybrid loss introduced by Nichol and Dhariwal (2021)[^4].

Theoretical connections

Score matching

There is a tight equivalence between the DDPM noise-prediction objective and denoising score matching (Vincent, 2011; Song & Ermon, 2019)[^3]. Given the closed-form forward distribution q(x_t | x_0), its score with respect to the noisy point x_t is:

∇_{x_t} log q(x_t | x_0) = − (x_t − sqrt(ᾱ_t) · x_0) / (1 − ᾱ_t) = − ε / sqrt(1 − ᾱ_t)

Therefore predicting the noise ε is equivalent (up to a constant scaling that depends on t) to predicting the score. The DDPM noise network is, in effect, a score model at every noise level — and the DDPM ancestral sampler is a discretization of a particular reverse-time SDE that uses the score[^5].
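The identity between the conditional score and the rescaled noise can be checked numerically. The sketch below is illustrative only: it differentiates the Gaussian log-density log q(x_t | x_0) by central finite differences and compares the result with −ε / sqrt(1 − ᾱ_t). The function names, the value ᾱ_t = 0.3, and the five-dimensional toy vectors are arbitrary assumptions.

```python
import numpy as np

def log_q(x_t, x0, a_bar):
    """Log-density of N(x_t; sqrt(a_bar) * x0, (1 - a_bar) * I) for 1-D vectors."""
    diff = x_t - np.sqrt(a_bar) * x0
    var = 1.0 - a_bar
    return -0.5 * np.sum(diff ** 2) / var - 0.5 * x_t.size * np.log(2.0 * np.pi * var)

def numerical_score(x_t, x0, a_bar, h=1e-5):
    """Central-difference gradient of log_q with respect to x_t."""
    grad = np.zeros_like(x_t)
    for i in range(x_t.size):
        e = np.zeros_like(x_t)
        e[i] = h
        grad[i] = (log_q(x_t + e, x0, a_bar) - log_q(x_t - e, x0, a_bar)) / (2.0 * h)
    return grad

def score_from_eps(eps, a_bar):
    """Convert a noise vector (or noise prediction) into a score estimate."""
    return -eps / np.sqrt(1.0 - a_bar)

rng = np.random.default_rng(0)
a_bar = 0.3
x0 = rng.standard_normal(5)
eps = rng.standard_normal(5)
x_t = np.sqrt(a_bar) * x0 + np.sqrt(1.0 - a_bar) * eps

score_fd = numerical_score(x_t, x0, a_bar)
score_eps = score_from_eps(eps, a_bar)    # agrees with score_fd up to finite-difference error
```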

Song, Sohl-Dickstein, Kingma, Kumar, Ermon, and Poole (2021)'s "Score-Based Generative Modeling through Stochastic Differential Equations" (arXiv:2011.13456) made this unification explicit, deriving DDPM as a discretization of a Variance Preserving (VP) SDE and NCSN as a discretization of a Variance Exploding (VE) SDE, both governed by a single continuous-time formulation with the same score-matching loss[^5].

Variational autoencoders

DDPM is also formally a deep hierarchical variational autoencoder with a fixed, non-learned encoder (the forward Markov chain) and a Gaussian-Markov decoder (the reverse chain)[^1]. This perspective makes the VLB derivation natural and connects DDPM to the broader VAE literature. The key innovation over earlier hierarchical VAEs is that the encoder is hand-designed and noise-only, eliminating the optimization difficulties (such as posterior collapse) that plagued learned hierarchical posteriors.

Energy-based models and Langevin dynamics

The reverse-time SDE view also clarifies the link to energy-based models: the score is the gradient of an implicit log-density, and DDPM's iterative denoising is a stabilized, annealed analogue of Langevin sampling from an energy-based model.

Limitations of DDPM

The original DDPM has several well-documented limitations that became the agenda of subsequent diffusion research:

Slow sampling

The defining cost of DDPM is its 1000-step sampling chain: each image requires roughly T = 1000 forward passes through the U-Net[^1]. On 2020-era hardware this made DDPM orders of magnitude slower than GANs at inference time, and it remains the principal disadvantage of diffusion-based generation. Subsequent work attacked this bottleneck through faster solvers (DDIM, DPM-Solver), distillation (progressive distillation, consistency models), and architectural shortcuts (latent diffusion).

Pixel-space cost

DDPM operates directly on raw pixels. For a 256×256 RGB image, every U-Net forward pass processes 196,608 input values, and the same is true for every one of the 1000 sampling steps. Scaling DDPM to 1024×1024 or video resolutions is prohibitively expensive without first compressing the data — a problem that Latent Diffusion Models (Rombach et al., 2022) solved by running the diffusion process in the latent space of a pretrained autoencoder[^6].

Sample quality / log-likelihood mismatch

L_simple sacrifices log-likelihood for sample quality. Models trained with L_simple have worse density estimation than VAEs and PixelRNNs of comparable size, even though their samples look better[^1]. Nichol and Dhariwal (2021) partially closed this gap with a hybrid loss, but the underlying tension between perceptual fidelity and likelihood remains[^4].

Limited controllability

The original DDPM is fully unconditional. Practical text-to-image generation required two further ingredients: classifier guidance (Dhariwal and Nichol, 2021)[^7] and especially classifier-free guidance (Ho and Salimans, 2021)[^8], the latter of which is now standard in essentially every conditional diffusion model.

Sensitivity to noise schedule

The linear schedule is not equally well suited to every resolution or to images with very different statistics (Nichol and Dhariwal found it sub-optimal at 64×64 and 32×32), and there is no principled choice of schedule from theory alone. Schedule design has remained an active research topic since 2020[^4].

Follow-up improvements

The DDPM paper opened a research program that has now produced dozens of major follow-ups. The most influential are summarized below.

Improved DDPM (Nichol and Dhariwal, 2021)

Different authors (Alex Nichol and Prafulla Dhariwal, then at OpenAI), same model family. Three changes (the cosine schedule, learned variances, and a hybrid VLB + L_simple loss) improved both FID and log-likelihood, and a strided sampling schedule cut inference cost by roughly an order of magnitude with negligible quality loss[^4].

DDIM (Song, Meng, Ermon, 2020)

Denoising Diffusion Implicit Models (arXiv:2010.02502) defined a family of non-Markovian reverse processes that share DDPM's marginals q(x_t | x_0) and therefore can be sampled from any DDPM-trained model without retraining[^9]. With the stochasticity parameter set to zero, DDIM yields a deterministic ODE-like sampler that produces high-quality samples in 50–100 steps — roughly 10×–20× fewer than DDPM[^9]. DDIM also gives the model a meaningful latent space: the same initial noise vector always maps to the same image, enabling interpolation and inversion.

Property	DDPM	DDIM (σ = 0)
Sampling process	Stochastic (SDE)	Deterministic (ODE)
Typical steps	~1000	50–100 (sometimes 10–20)
Same noise → same image	No	Yes
Latent space interpolation	No	Yes
Retraining required	—	None
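The deterministic (σ = 0) DDIM update compared in the table can be sketched as follows. This is a minimal illustration under assumptions: `eps_model` is a hypothetical stand-in for a DDPM-trained network, the timestep subsequence is simply evenly spaced, and the toy oracle assumes every data point equals a fixed vector `c` so that the true noise is known in closed form.

```python
import numpy as np

def ddim_step(x_t, eps_pred, a_bar_t, a_bar_prev):
    """One deterministic DDIM update from noise level a_bar_t to a_bar_prev."""
    x0_pred = (x_t - np.sqrt(1.0 - a_bar_t) * eps_pred) / np.sqrt(a_bar_t)
    return np.sqrt(a_bar_prev) * x0_pred + np.sqrt(1.0 - a_bar_prev) * eps_pred

def ddim_sample(eps_model, x_T, alpha_bar, num_steps=50):
    """Run DDIM over an evenly spaced, descending subsequence of 1-indexed timesteps."""
    T = len(alpha_bar)
    ts = np.linspace(T, 1, num_steps).round().astype(int)
    x = x_T
    for k, t in enumerate(ts):
        a_t = alpha_bar[t - 1]
        # The last update jumps to alpha_bar_0 = 1, i.e. to the clean sample.
        a_prev = alpha_bar[ts[k + 1] - 1] if k + 1 < len(ts) else 1.0
        x = ddim_step(x, eps_model(x, t), a_t, a_prev)
    return x

alpha_bar = np.cumprod(1.0 - np.linspace(1e-4, 0.02, 1000))
c = np.array([0.5, -0.25, 1.0])                 # toy "dataset": every sample equals c

def oracle_eps(x_t, t):
    """Exact noise when x_0 is always c (illustration only)."""
    a = alpha_bar[t - 1]
    return (x_t - np.sqrt(a) * c) / np.sqrt(1.0 - a)

x_T = np.random.default_rng(0).standard_normal(3)
out = ddim_sample(oracle_eps, x_T, alpha_bar, num_steps=50)
```

No random numbers are drawn inside the loop, so the same `x_T` always maps to the same output; that property is what enables the interpolation and inversion uses listed in the table.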

Score SDE framework (Song et al., 2021)

This work unified DDPM and NCSN under a continuous-time SDE and derived a corresponding probability-flow ODE, which gives deterministic sampling with exact likelihood computation and connects diffusion models to the wider literature on normalizing flows[^5].

Diffusion Models Beat GANs (Dhariwal and Nichol, 2021)

This paper (arXiv:2105.05233) introduced classifier guidance — using the gradient of a separately trained noise-aware classifier to push samples toward a desired class — and used it to set new ImageNet FID records, decisively beating BigGAN on class-conditional generation[^7]. The title "Diffusion Models Beat GANs" became a slogan for the broader shift.

Classifier-free guidance (Ho and Salimans, 2021)

Classifier-Free Diffusion Guidance (arXiv:2207.12598) eliminated the separate classifier by training the denoiser jointly as both conditional and unconditional model (with the conditioning randomly dropped during training)[^8]. At sample time, the guided noise prediction is extrapolated away from the unconditional prediction toward the conditional one:

ε_guided = ε_θ(x_t, t) + w · (ε_θ(x_t, t, c) − ε_θ(x_t, t))[^8]

where w is the guidance scale. CFG has become the universal mechanism by which text-to-image models such as Stable Diffusion, DALL-E 2, and Imagen obtain strong prompt adherence[^6][^8].
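In code, the guidance rule is a single line applied to two outputs of the same network. The sketch below assumes the two predictions have already been computed (one with the conditioning dropped, one with it supplied); the function name and the toy arrays are hypothetical.

```python
import numpy as np

def cfg_eps(eps_uncond, eps_cond, w):
    """Classifier-free guidance: move from the unconditional prediction toward the conditional one."""
    return eps_uncond + w * (eps_cond - eps_uncond)

eps_uncond = np.array([0.0, 1.0, -1.0])   # stand-in for eps_theta(x_t, t)
eps_cond = np.array([1.0, 1.0, 0.0])      # stand-in for eps_theta(x_t, t, c)

guided = cfg_eps(eps_uncond, eps_cond, w=2.0)
```

Under this form, w = 0 recovers the unconditional prediction, w = 1 recovers the conditional one, and w > 1 extrapolates past it, at the price of two network evaluations per sampling step.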

Latent Diffusion / Stable Diffusion (Rombach et al., 2022)

"High-Resolution Image Synthesis with Latent Diffusion Models" (arXiv:2112.10752) ran the DDPM process not on pixels but in the latent space of a pretrained autoencoder, typically compressing 512×512 images down to 64×64 latents before any diffusion[^6]. This reduced the compute cost of every sampling step by 1–2 orders of magnitude, enabling high-resolution text-to-image generation on consumer GPUs. The publicly released Stable Diffusion model (Stability AI, August 2022) is the most widely used instantiation of this Latent Diffusion framework and is a direct architectural descendant of DDPM[^6].

Consistency models and distillation

Progressive distillation (Salimans and Ho, 2022) and consistency models (Song, Dhariwal, Chen, Sutskever, 2023) compressed multi-step diffusion samplers into models that produce high-quality samples in 1–4 steps[^10]. Consistency distillation works by training a student model to map any point on a diffusion trajectory directly to its endpoint, eliminating the iterative chain at inference time.

Flow matching (Lipman et al., 2023)

Flow Matching for Generative Modeling (arXiv:2210.02747) generalized the diffusion idea to arbitrary transport flows: instead of fixing a noise-adding forward process, the model learns a velocity field that transports samples from a source distribution to the data distribution along any chosen interpolant[^11]. For the Gaussian-source special case, flow matching is mathematically equivalent to a Gaussian diffusion model with appropriate parameterization[^12]. Stable Diffusion 3 (Esser et al., 2024) and several Meta video models are trained with flow matching.

Legacy and impact

Within roughly eighteen months of the DDPM preprint, diffusion models had displaced GANs as the default backbone for high-fidelity image generation. By mid-2022, three of the most-discussed AI systems of the year — DALL-E 2 (OpenAI, April 2022), Imagen (Google, May 2022), and Stable Diffusion (Stability AI, August 2022) — were all diffusion models trained on the DDPM recipe (with classifier-free guidance and, for Stable Diffusion, latent-space efficiency)[^6][^8]. The DDPM paper itself, by 2026, is one of the most cited generative modeling publications of the deep learning era, with citations crossing into the tens of thousands[^1].

The influence has propagated well beyond static images:

Audio. DiffWave (Kong et al., 2021) and Grad-TTS apply DDPM-style training to raw audio and mel-spectrograms; later text-to-speech and music systems, such as Stable Audio, are also diffusion-based.
Video. Imagen Video, Make-A-Video, Sora, and Veo are 3D U-Net or DiT diffusion models that extend DDPM to space-time.
3D. DreamFusion (Poole et al., 2022) uses a pretrained 2D diffusion model and score distillation sampling (SDS) to optimize 3D radiance fields without 3D training data.
Molecular and protein design. Models such as RFdiffusion (Watson et al., 2023) apply DDPM-style noising and denoising to protein backbones, producing novel functional proteins.
Robotics. Diffusion Policy (Chi et al., 2023) treats action sequences as the data and uses a DDPM-style conditional denoiser to generate robot control policies.

Methodologically, DDPM cemented a broader shift: generative modeling no longer required adversarial training, and likelihood-based or score-based objectives could deliver state-of-the-art perceptual quality with much greater training stability. That recipe — a U-Net or transformer denoiser, an ε-prediction MSE loss, classifier-free guidance, and a latent-space backbone — is, in 2026, the most copied architecture in generative AI.

References
