A Denoising Diffusion Probabilistic Model (DDPM) is a class of generative models that learns to produce data samples by reversing a gradual noising process. Introduced by Jonathan Ho, Ajay Jain, and Pieter Abbeel in their 2020 paper "Denoising Diffusion Probabilistic Models," DDPMs demonstrated that diffusion-based generation could achieve image quality competitive with generative adversarial networks (GANs) while offering more stable training and better mode coverage. The DDPM framework has since become the foundation for many of the most successful image generation systems, including DALL-E 2, Imagen, and Stable Diffusion, and its principles have influenced audio, video, and 3D generation as well.
The idea of using diffusion processes for generative modeling predates DDPMs. Sohl-Dickstein et al. (2015) introduced "Deep Unsupervised Learning using Nonequilibrium Thermodynamics," which proposed the core concept of gradually adding noise to data and then learning to reverse the process. However, the generated samples from this early work were of limited quality, and the approach did not gain widespread attention [1].
Song and Ermon (2019) developed a related approach called score matching with Langevin dynamics (SMLD), which trained a neural network to estimate the gradient of the log probability density (the "score function") at various noise levels and then used Langevin dynamics to generate samples by following the estimated score. This work demonstrated competitive image generation results and established an important theoretical connection that would later be unified with the DDPM framework [2].
Ho et al. (2020) built on both of these foundations, introducing specific design choices (a particular noise schedule, a simplified training objective, and a U-Net architecture as the denoiser) that together produced high-quality samples and established the template that subsequent diffusion model research would follow [3].
The DDPM framework consists of two processes: a forward (diffusion) process that gradually destroys data by adding noise, and a reverse (denoising) process that learns to reconstruct the data from noise.
The forward process takes a data sample x_0 drawn from the real data distribution and produces a sequence of increasingly noisy versions x_1, x_2, ..., x_T, where T is the total number of timesteps (typically T = 1000 in the original paper). At each step, a small amount of Gaussian noise is added according to a fixed schedule:
q(x_t | x_{t-1}) = N(x_t; sqrt(1 - beta_t) * x_{t-1}, beta_t * I)
Here, beta_t is a variance schedule parameter that controls how much noise is added at each step. The values beta_1, beta_2, ..., beta_T increase gradually so that by the final step T, the data has been almost entirely replaced by standard Gaussian noise.
An important property of this process is that the noisy sample at any arbitrary timestep t can be computed directly from the original data x_0 without iterating through all intermediate steps:
q(x_t | x_0) = N(x_t; sqrt(alpha_bar_t) * x_0, (1 - alpha_bar_t) * I)
where alpha_t = 1 - beta_t and alpha_bar_t = product of alpha_1 through alpha_t. This closed-form expression is crucial for efficient training, as it allows the model to be trained on randomly sampled timesteps without simulating the full forward chain [3].
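The closed-form expression can be sketched in a few lines of NumPy. This is an illustrative sketch, not code from the paper; the helper names `make_linear_schedule` and `q_sample` are ours, and the schedule values follow the linear schedule described later in this article.

```python
import numpy as np

def make_linear_schedule(T=1000, beta_1=1e-4, beta_T=0.02):
    """Linear beta schedule; returns betas and the cumulative product alpha_bar."""
    betas = np.linspace(beta_1, beta_T, T)
    alphas = 1.0 - betas
    alpha_bar = np.cumprod(alphas)
    return betas, alpha_bar

def q_sample(x0, t, alpha_bar, rng):
    """Sample x_t ~ q(x_t | x_0) in one shot using the closed form."""
    eps = rng.standard_normal(x0.shape)
    xt = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps
    return xt, eps

rng = np.random.default_rng(0)
betas, alpha_bar = make_linear_schedule()
x0 = rng.standard_normal((4, 4))   # toy "image"
# Jump directly to the last timestep without simulating the chain:
xt, eps = q_sample(x0, t=999, alpha_bar=alpha_bar, rng=rng)
```

Note that `alpha_bar` at the final timestep is close to zero, so `xt` there is almost pure Gaussian noise, matching the description of the forward process above.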
The reverse process starts from pure Gaussian noise x_T ~ N(0, I) and iteratively denoises it to produce a sample from the data distribution. Each denoising step is parameterized by a neural network that predicts the distribution:
p_theta(x_{t-1} | x_t) = N(x_{t-1}; mu_theta(x_t, t), sigma_t^2 * I)
The network takes the noisy sample x_t and the timestep t as input and predicts the mean mu_theta of the Gaussian distribution from which x_{t-1} should be drawn. The variance sigma_t^2 can either be fixed to a schedule-dependent value or learned by the network (Nichol and Dhariwal, 2021, showed that learning the variance improves sample quality) [4].
Generation proceeds by sampling x_T from a standard Gaussian, then iteratively applying the learned reverse step T times to arrive at x_0. This sequential sampling process is one of the main drawbacks of DDPMs, as generating a single sample requires T forward passes through the neural network.
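The sequential reverse loop can be sketched as follows. This is a minimal illustration assuming a generic noise-prediction network passed in as a callable; a stand-in "network" that predicts zero noise is used so the snippet runs without trained weights.

```python
import numpy as np

def ddpm_sample(eps_theta, betas, shape, rng):
    """Ancestral DDPM sampling: start from N(0, I), apply T denoising steps.

    eps_theta(x_t, t) stands for the trained noise-prediction network;
    any callable with that signature works here.
    """
    alphas = 1.0 - betas
    alpha_bar = np.cumprod(alphas)
    T = len(betas)
    x = rng.standard_normal(shape)              # x_T ~ N(0, I)
    for t in reversed(range(T)):
        eps = eps_theta(x, t)
        # Posterior mean computed from the predicted noise.
        mean = (x - betas[t] / np.sqrt(1.0 - alpha_bar[t]) * eps) / np.sqrt(alphas[t])
        if t > 0:
            x = mean + np.sqrt(betas[t]) * rng.standard_normal(shape)
        else:
            x = mean                            # no noise added at the final step
    return x

rng = np.random.default_rng(0)
betas = np.linspace(1e-4, 0.02, 50)             # short chain for illustration
sample = ddpm_sample(lambda x, t: np.zeros_like(x), betas, (4, 4), rng)
```

The loop makes the drawback concrete: one network evaluation per timestep, executed strictly in sequence.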
The theoretical training objective for DDPMs is derived from the variational lower bound (VLB) on the data log-likelihood. However, Ho et al. found that a simplified objective produces better sample quality. Rather than predicting the mean of x_{t-1} directly, the network is trained to predict the noise epsilon that was added to x_0 to produce x_t:
L_simple = E[||epsilon - epsilon_theta(x_t, t)||^2]
where epsilon ~ N(0, I) is the noise sampled during training, x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * epsilon is the noisy input, and epsilon_theta is the neural network's prediction. This is equivalent to a weighted form of the VLB but with a simpler weighting that empirically produces better results [3].
The training procedure is straightforward:

1. Sample a data point x_0 from the training set.
2. Sample a timestep t uniformly from {1, ..., T}.
3. Sample noise epsilon ~ N(0, I).
4. Compute the noisy sample x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * epsilon.
5. Take a gradient step on ||epsilon - epsilon_theta(x_t, t)||^2.
This simplicity is a major advantage of the DDPM training procedure compared to GANs, which require careful balancing of generator and discriminator training, or variational autoencoders (VAEs), which require trading off reconstruction quality against KL divergence.
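A single evaluation of L_simple on a batch can be sketched as below. This is an illustrative sketch with a made-up helper name (`ddpm_loss`) and a zero-predicting stand-in network, operating on flat vectors rather than images for brevity.

```python
import numpy as np

def ddpm_loss(eps_theta, x0_batch, alpha_bar, rng):
    """Monte-Carlo estimate of L_simple on one batch: draw a random t per
    example, noise the data with the closed form, and score the prediction."""
    B = x0_batch.shape[0]
    t = rng.integers(0, len(alpha_bar), size=B)
    eps = rng.standard_normal(x0_batch.shape)
    ab = alpha_bar[t][:, None]                  # broadcast over feature dims
    xt = np.sqrt(ab) * x0_batch + np.sqrt(1.0 - ab) * eps
    return np.mean((eps - eps_theta(xt, t)) ** 2)

rng = np.random.default_rng(0)
alpha_bar = np.cumprod(1.0 - np.linspace(1e-4, 0.02, 1000))
x0 = rng.standard_normal((8, 16))               # batch of toy data vectors
loss = ddpm_loss(lambda xt, t: np.zeros_like(xt), x0, alpha_bar, rng)
```

There is no adversary and no balancing term: the gradient of this single scalar with respect to the network parameters is the whole update.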
Song et al. (2021) unified DDPMs and score-based models into a single framework based on stochastic differential equations (SDEs). The forward diffusion process can be described by a continuous-time SDE:
dx = f(x, t) dt + g(t) dw
where f is the drift coefficient, g is the diffusion coefficient, and w is a standard Wiener process. The reverse process is given by a reverse-time SDE that depends on the score function (the gradient of the log probability density of the noisy data at time t).
The noise prediction network epsilon_theta in DDPMs is directly related to the score function: the score is the negative predicted noise rescaled by the noise level, score(x_t, t) ≈ -epsilon_theta(x_t, t) / sqrt(1 - alpha_bar_t). This means that a DDPM trained to predict noise is implicitly learning the score function at each noise level, and the DDPM sampling process is a discretization of the reverse-time SDE [5].
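The noise-to-score relation can be verified numerically for the conditional distribution q(x_t | x_0), whose score has a closed form since it is Gaussian. The helper name `score_from_noise_pred` is ours, not from the literature.

```python
import numpy as np

def score_from_noise_pred(eps_pred, t, alpha_bar):
    """Convert a noise prediction into a score estimate:
    grad_x log q(x_t) ~= -eps_theta(x_t, t) / sqrt(1 - alpha_bar_t)."""
    return -eps_pred / np.sqrt(1.0 - alpha_bar[t])

alpha_bar = np.cumprod(1.0 - np.linspace(1e-4, 0.02, 1000))
rng = np.random.default_rng(0)
x0, eps = rng.standard_normal(5), rng.standard_normal(5)
t = 500
xt = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps
# Exact score of the Gaussian q(x_t | x_0) at this x_t:
true_score = -(xt - np.sqrt(alpha_bar[t]) * x0) / (1.0 - alpha_bar[t])
est = score_from_noise_pred(eps, t, alpha_bar)  # matches true_score
```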
This unified perspective, often called the score-based generative modeling or SDE framework, opened the door to continuous-time formulations, alternative ODE-based samplers, and a richer theoretical understanding of diffusion models.
The variance schedule {beta_1, ..., beta_T} controls the rate at which noise is added during the forward process. Ho et al. used a linear schedule with T = 1000, where beta values increase linearly from beta_1 = 10^-4 to beta_T = 0.02 [3].
| Schedule | Range | Behavior | Introduced By |
|---|---|---|---|
| Linear | beta: 10^-4 to 0.02 | Uniform noise addition; destroys information quickly at early steps | Ho et al. (2020) [3] |
| Cosine | alpha_bar_t = cos^2(((t/T + s) / (1 + s)) * pi/2), s = 0.008 | Slower destruction at start and end; more uniform information loss | Nichol & Dhariwal (2021) [4] |
| Sigmoid | S-shaped schedule | Smooth transition with controlled midpoint | Various researchers |
| Learned | Optimized during training | Adaptive to the specific dataset | Kingma et al. (2021) |
The cosine schedule, introduced in "Improved Denoising Diffusion Probabilistic Models" by Nichol and Dhariwal (2021), was motivated by the observation that the linear schedule destroys information too quickly in the early steps. With the linear schedule, the signal-to-noise ratio drops rapidly at first, meaning many early timesteps are nearly indistinguishable. The cosine schedule changes more slowly near both the beginning and end of the process, resulting in a more uniform rate of information destruction and better use of the full range of timesteps [4].
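The difference between the two schedules is easy to see by computing alpha_bar for each. This sketch uses our own helper names; the cosine formula follows the alpha_bar_t = cos^2 form with offset s = 0.008 from Nichol and Dhariwal.

```python
import numpy as np

def linear_alpha_bar(T=1000, beta_1=1e-4, beta_T=0.02):
    """alpha_bar for the linear beta schedule of Ho et al."""
    return np.cumprod(1.0 - np.linspace(beta_1, beta_T, T))

def cosine_alpha_bar(T=1000, s=0.008):
    """alpha_bar for the cosine schedule: a squared cosine in t,
    normalized so that alpha_bar is 1 at t = 0."""
    t = np.arange(T + 1) / T
    f = np.cos((t + s) / (1 + s) * np.pi / 2) ** 2
    return (f / f[0])[1:]

lin = linear_alpha_bar()
cos = cosine_alpha_bar()
# Early and mid-chain, the cosine schedule retains far more signal,
# e.g. compare lin[200] vs cos[200] and lin[500] vs cos[500].
```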
Ho et al. chose a U-Net architecture, originally developed by Ronneberger et al. (2015) for medical image segmentation, as the denoising network. The U-Net consists of an encoder path that progressively downsamples the input through a series of convolutional blocks, a bottleneck, and a decoder path that progressively upsamples back to the original resolution. Skip connections between corresponding encoder and decoder layers allow fine-grained spatial information to flow directly to the decoder [3].
The DDPM U-Net incorporates several modifications compared to the original medical imaging version:

- Sinusoidal timestep embeddings (as in the Transformer positional encoding), injected into each residual block so the network knows the current noise level.
- Self-attention layers at lower-resolution feature maps.
- Group normalization in place of batch normalization.
- Residual blocks in the style of modern ResNet-type architectures.
The U-Net architecture has remained the dominant choice for diffusion model denoisers, though recent work has explored replacing it with transformer-based architectures (the "DiT" or Diffusion Transformer approach introduced by Peebles and Xie, 2023) [6].
Nichol and Dhariwal (2021) published "Improved Denoising Diffusion Probabilistic Models," which made several practical improvements to the original DDPM [4]:

- A cosine noise schedule that destroys information more uniformly across timesteps.
- Learning the reverse-process variance with a hybrid objective (the simplified loss plus a weighted VLB term), improving log-likelihoods.
- Sampling with a strided subset of timesteps, reducing the number of steps needed for good samples to around 100.
These improvements collectively made DDPMs competitive with GANs on metrics like FID (Frechet Inception Distance) while maintaining the diversity and training stability advantages of diffusion models.
Denoising Diffusion Implicit Models (DDIM), introduced by Song, Meng, and Ermon (2020), addressed the slow sampling speed of DDPMs by defining a family of non-Markovian diffusion processes that share the same training objective as DDPMs but allow for much faster generation [7].
The key insight is that the DDPM training objective depends only on the marginal distributions q(x_t | x_0), not on the full joint distribution q(x_1, ..., x_T | x_0). DDIM defines a different joint distribution that has the same marginals but is non-Markovian, meaning each reverse step can depend on x_0 (predicted from x_t) rather than only on x_t.
The DDIM update rule introduces a parameter sigma_t that controls the stochasticity of the reverse process. When sigma_t matches the DDPM value, DDIM reduces to DDPM sampling. When sigma_t = 0, the reverse process becomes completely deterministic, and the mapping from noise to image is fixed. This deterministic variant corresponds to solving an ordinary differential equation (ODE), while DDPM sampling corresponds to solving a stochastic differential equation (SDE) [7].
The deterministic nature of DDIM has several advantages:
| Property | DDPM | DDIM (sigma = 0) |
|---|---|---|
| Sampling process | Stochastic (SDE) | Deterministic (ODE) |
| Steps needed for quality results | ~1000 | 50-100 (sometimes as few as 10-20) |
| Same noise produces same image | No (different each time) | Yes (deterministic mapping) |
| Meaningful latent space | No | Yes (latent space interpolation is possible) |
| Speed | Slow | Significantly faster |
Because DDIM sampling follows an ODE trajectory, it can use adaptive step-size ODE solvers and skip many timesteps while maintaining sample quality. In practice, DDIM can produce high-quality samples in 50 to 100 steps, compared to the 1000 steps required by DDPM, representing a 10-20x speedup [7].
The relationship between DDPM and DDIM sampling can be understood geometrically: one DDPM sampling step is equivalent to performing one DDIM step to a further point and then adding noise back ("renoising") through forward diffusion. This renoising undoes part of the progress made by the DDIM step, which explains why DDPM requires more steps to traverse the same trajectory [7].
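The deterministic (sigma = 0) DDIM update and the ability to skip timesteps can be sketched as follows. This is an illustrative sketch with our own helper names and a zero-predicting stand-in network; the update follows the generalized reverse step given in the summary table later in this article.

```python
import numpy as np

def ddim_step(x_t, eps_pred, t, t_prev, alpha_bar):
    """One deterministic DDIM update (sigma = 0), jumping from
    timestep t directly to an earlier timestep t_prev."""
    ab_t = alpha_bar[t]
    ab_prev = alpha_bar[t_prev] if t_prev >= 0 else 1.0
    x0_pred = (x_t - np.sqrt(1.0 - ab_t) * eps_pred) / np.sqrt(ab_t)
    return np.sqrt(ab_prev) * x0_pred + np.sqrt(1.0 - ab_prev) * eps_pred

def ddim_sample(eps_theta, alpha_bar, shape, n_steps, rng):
    """Sample using a strided subset of timesteps (e.g. 50 of 1000)."""
    T = len(alpha_bar)
    ts = np.linspace(T - 1, 0, n_steps).astype(int)
    x = rng.standard_normal(shape)
    for i, t in enumerate(ts):
        t_prev = ts[i + 1] if i + 1 < len(ts) else -1
        x = ddim_step(x, eps_theta(x, t), t, t_prev, alpha_bar)
    return x

alpha_bar = np.cumprod(1.0 - np.linspace(1e-4, 0.02, 1000))
zero_net = lambda x, t: np.zeros_like(x)
# With sigma = 0 the noise-to-sample mapping is deterministic:
# the same starting noise always produces the same output.
a = ddim_sample(zero_net, alpha_bar, (4,), 50, np.random.default_rng(1))
b = ddim_sample(zero_net, alpha_bar, (4,), 50, np.random.default_rng(1))
```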
Classifier-free guidance, introduced by Ho and Salimans (2022), is a technique for improving the quality of conditional generation with diffusion models without requiring a separate classifier network [8].
The predecessor technique, classifier guidance (Dhariwal and Nichol, 2021), used the gradient of a pretrained classifier to steer the diffusion sampling process toward images of a specified class. While effective, this required training a separate classifier on noisy images, adding complexity and computational cost [9].
Classifier-free guidance eliminates the separate classifier entirely. During training, the model is jointly trained as both a conditional and unconditional diffusion model by randomly dropping the conditioning information (e.g., class label or text prompt) with some probability (typically 10-20%). At sampling time, the model produces two predictions: a conditional prediction epsilon_theta(x_t, t, c) and an unconditional prediction epsilon_theta(x_t, t). The final prediction is an extrapolation away from the unconditional prediction toward the conditional one:
epsilon_guided = epsilon_theta(x_t, t) + w * (epsilon_theta(x_t, t, c) - epsilon_theta(x_t, t))
where w is the guidance scale. When w = 1, this reduces to standard conditional sampling. When w > 1, the model generates samples that are more strongly aligned with the conditioning signal, at the cost of reduced diversity [8].
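The guidance computation itself is a one-line extrapolation, shown here as a sketch (the helper name `cfg_noise` is ours):

```python
import numpy as np

def cfg_noise(eps_cond, eps_uncond, w):
    """Classifier-free guidance: extrapolate from the unconditional
    prediction toward the conditional one with guidance scale w."""
    return eps_uncond + w * (eps_cond - eps_uncond)

# Toy predictions standing in for two network evaluations:
eps_c = np.array([1.0, 2.0])   # epsilon_theta(x_t, t, c)
eps_u = np.array([0.0, 0.0])   # epsilon_theta(x_t, t)
# w = 1 recovers the conditional prediction; w > 1 pushes further
# away from the unconditional one.
```

In practice the two predictions are usually obtained in a single batched forward pass with and without the conditioning input.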
Classifier-free guidance has become a standard component of essentially all modern conditional diffusion models. It is the mechanism by which text-to-image models like Stable Diffusion, DALL-E 2, and Imagen produce images that closely match text prompts. Typical guidance scale values range from 3 to 15, depending on the application and desired trade-off between fidelity and diversity [8].
Rombach, Blattmann, et al. (2022) introduced Latent Diffusion Models (LDMs) in their paper "High-Resolution Image Synthesis with Latent Diffusion Models." The key insight was that running the diffusion process in pixel space is computationally wasteful because much of the high-frequency detail is perceptually irrelevant. LDMs first compress images into a lower-dimensional latent space using a pretrained autoencoder (specifically, a VQ-VAE or KL-regularized autoencoder), then apply the DDPM forward and reverse processes in that latent space [10].
This approach offers substantial computational savings: the latent representations are typically 8x to 16x smaller in each spatial dimension than the original images, reducing the cost of the diffusion process by orders of magnitude. The U-Net denoiser operates on latent vectors rather than pixels, and a decoder converts the final latent back to pixel space.
Stable Diffusion, released by Stability AI in August 2022, is the most widely known implementation of the LDM architecture. All versions of Stable Diffusion (1.1 through XL) are direct instantiations of the LDM framework, using a DDPM-style diffusion process in the latent space of a pretrained autoencoder, with a U-Net denoiser conditioned on text embeddings from a CLIP or OpenCLIP text encoder [10].
Flow matching, introduced by Lipman et al. (2023), represents a more recent generalization of diffusion models. Instead of defining a fixed forward noising process and learning to reverse it, flow matching directly learns a velocity field that transports samples from a noise distribution to the data distribution along a continuous-time flow [11].
A common misconception is that flow matching always produces straight paths while diffusion models produce curved paths, or that flow matching is always deterministic while diffusion sampling is always stochastic. In reality, for the common special case where the source distribution is Gaussian, diffusion models and flow matching are mathematically equivalent. The two frameworks produce the same learned transport, and after training, you can use either stochastic (SDE) or deterministic (ODE) sampling with either approach [12].
Flow matching has been adopted in several recent systems, including Stable Diffusion 3 (Esser et al., 2024) and Meta's movie generation models. The practical advantages of flow matching include simpler training objectives, more flexible design choices, and often straighter sampling trajectories that require fewer integration steps.
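The flow matching training objective, for the common linear (optimal-transport) conditional path, can be sketched as below. This is an illustrative sketch with our own names and the convention that t = 0 is data and t = 1 is noise; conventions vary across papers.

```python
import numpy as np

def flow_matching_loss(v_theta, x0_batch, rng):
    """Conditional flow matching with the linear path:
    x_t = (1 - t) * x0 + t * noise, target velocity = noise - x0."""
    B = x0_batch.shape[0]
    t = rng.random(B)[:, None]                  # one t per example
    noise = rng.standard_normal(x0_batch.shape)
    xt = (1.0 - t) * x0_batch + t * noise
    target = noise - x0_batch                   # constant along the path
    return np.mean((v_theta(xt, t) - target) ** 2)

rng = np.random.default_rng(0)
x0 = rng.standard_normal((8, 16))               # batch of toy data vectors
loss = flow_matching_loss(lambda xt, t: np.zeros_like(xt), x0, rng)
```

Structurally this mirrors the simplified DDPM loss: interpolate between data and noise, then regress a network output against a known target.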
| Model | Year | Authors | Sampling | Steps for Quality | Key Innovation |
|---|---|---|---|---|---|
| DDPM | 2020 | Ho, Jain, Abbeel | Stochastic (SDE) | ~1000 | Simplified noise prediction loss, quality matching GANs [3] |
| DDIM | 2020 | Song, Meng, Ermon | Deterministic (ODE) or stochastic | 50-100 | Non-Markovian formulation, accelerated sampling [7] |
| Improved DDPM | 2021 | Nichol, Dhariwal | Stochastic | ~100 (with strided schedule) | Cosine schedule, learned variance [4] |
| Score SDE | 2021 | Song, Sohl-Dickstein, et al. | SDE or ODE (probability flow) | Variable | Unified SDE framework for diffusion and score models [5] |
| Guided Diffusion | 2021 | Dhariwal, Nichol | Stochastic + classifier guidance | ~250 | Classifier guidance, upsampling diffusion [9] |
| Latent Diffusion (LDM) | 2022 | Rombach, Blattmann, et al. | Either | Variable | Diffusion in latent space for efficiency [10] |
| Classifier-Free Guidance | 2022 | Ho, Salimans | Either (typically DDIM) | Variable | Conditional generation without separate classifier [8] |
| Consistency Models | 2023 | Song, Dhariwal, et al. | Deterministic | 1-2 steps | Direct mapping from noise to data in one or few steps [13] |
| Flow Matching | 2023 | Lipman, Chen, et al. | ODE | 20-50 | Optimal transport paths, simpler training [11] |
While DDPMs were originally developed for image generation, the framework has been extended to numerous other domains:

- Audio, including waveform synthesis models such as DiffWave and WaveGrad.
- Video generation, where the diffusion process is applied jointly across space and time.
- 3D content generation.
- Scientific applications such as molecular and protein structure generation.
For reference, the core mathematical components of the DDPM framework are summarized below:
| Component | Formula | Description |
|---|---|---|
| Forward process (single step) | q(x_t \| x_{t-1}) = N(sqrt(1-beta_t) x_{t-1}, beta_t I) | Add noise at each step |
| Forward process (closed form) | q(x_t \| x_0) = N(sqrt(alpha_bar_t) x_0, (1-alpha_bar_t) I) | Jump directly to any timestep |
| Reverse process | p_theta(x_{t-1} \| x_t) = N(mu_theta(x_t, t), sigma_t^2 I) | Learned denoising step |
| Simplified loss | L = E[\|\|epsilon - epsilon_theta(x_t, t)\|\|^2] | Predict noise added to data |
| DDIM update | x_{t-1} = sqrt(alpha_bar_{t-1}) x_0_pred + sqrt(1-alpha_bar_{t-1}-sigma_t^2) epsilon_theta + sigma_t z | Generalized reverse step (sigma=0 for deterministic) |
As of early 2026, the DDPM framework and its descendants remain central to generative modeling. Several trends characterize the current landscape:
Transformer-based architectures. The DiT (Diffusion Transformer) architecture, introduced by Peebles and Xie (2023), has increasingly replaced U-Nets as the denoiser in state-of-the-art systems. Sora, Stable Diffusion 3, and other recent models use transformer backbones, which scale more predictably and integrate more naturally with text conditioning [6].
Flow matching as the default. Many new systems adopt flow matching rather than the original DDPM formulation, taking advantage of simpler training objectives and faster sampling. The mathematical equivalence between the two frameworks means that insights from DDPM research remain directly applicable.
Distillation for speed. Consistency distillation and progressive distillation techniques have enabled one-step or few-step generation from models that were originally trained as multi-step diffusion models, bridging the speed gap with GANs while maintaining diffusion model quality.
Integration with language models. Diffusion models are increasingly being combined with large language models for multimodal generation. Models like DALL-E 3 use LLMs to rewrite user prompts before passing them to the diffusion model, improving prompt adherence.
The DDPM paper by Ho et al. (2020) has become one of the most cited papers in generative modeling, and its core framework of forward noising and learned reverse denoising continues to define the field.