DDIM (Denoising Diffusion Implicit Models)

Updated Jul 11, 2026

Denoising Diffusion Implicit Models (DDIM) are a class of iterative generative models, introduced by Jiaming Song, Chenlin Meng, and Stefano Ermon of Stanford University in October 2020, that accelerate sampling from a pretrained denoising diffusion probabilistic model by generalizing its Markovian forward process to a family of non-Markovian forward processes that share the same training objective.^[1] DDIM reuses an existing DDPM network with no retraining and produces sample quality comparable to a 1000-step DDPM in as few as 20 to 100 sampling steps, a wall-clock speedup of roughly 10x to 50x.^[1] The paper states plainly that "we present denoising diffusion implicit models (DDIMs), a more efficient class of iterative implicit probabilistic models with the same training procedure as DDPMs," which "can produce high quality samples 10x to 50x faster in terms of wall-clock time compared to DDPMs, allow us to trade off computation for sample quality, and can perform semantically meaningful image interpolation directly in the latent space."^[1] In its deterministic limit (the $\eta=0$ case) DDIM follows a fixed, reproducible ODE-like trajectory from noise to image, which makes its latents consistent enough to support semantic interpolation and approximate inversion of real images; it served as a default fast sampler in the original Stable Diffusion release in 2022 and remains a baseline scheduler in the Hugging Face diffusers library, even though later higher-order ODE solvers such as DPM-Solver and the Karras EDM family generally match or exceed its quality at fewer steps.^[2]^[3]^[4]

Attribute	Value
Type	Sampler for diffusion models
First arXiv submission	2020-10-06
Authors	Jiaming Song, Chenlin Meng, Stefano Ermon
Affiliation	Stanford University
Publication venue	ICLR 2021 (poster)
OpenReview ID	St1giarCHLP
Reference implementation	github.com/ermongroup/ddim (MIT License)
Default stochasticity parameter	$\eta = 0$ (deterministic)
Headline speedup	10x to 50x faster wall-clock vs DDPM
Training requirement	None beyond a standard DDPM
Typical inference budget	10 to 100 network function evaluations

What problem does DDIM solve?

Denoising diffusion models are latent-variable generative models that learn to invert a fixed forward process which gradually corrupts data into Gaussian noise.^[1]^[5] The original Markovian formulation of Sohl-Dickstein et al. (2015) and the DDPM training recipe of Ho, Jain, and Abbeel (2020) define a forward chain that adds a small amount of Gaussian noise at each of T discrete steps, and a generative reverse chain that progressively denoises a sample drawn from a standard Gaussian back to the data distribution.^[1]^[5] DDPM, in particular, demonstrated that image quality on CIFAR-10 (FID 3.17 at the time of publication, with T = 1000) could rival that of GANs without adversarial training, but at a substantial sampling cost: the entire 1000-step chain must be simulated sequentially to draw a single image.^[1]^[5]

Song, Meng, and Ermon's October 2020 preprint identified this sampling cost as the principal practical drawback of diffusion models relative to single-pass generators. As they note in the introduction, "it takes around 20 hours to sample 50k images of size 32 x 32 from a DDPM, but less than a minute to do so from a GAN on a Nvidia 2080 Ti GPU," and the gap widens for larger images.^[1] DDIM was conceived specifically to close this gap without modifying the training procedure.

When was DDIM published?

The paper was first posted to arXiv on 6 October 2020 (preprint number 2010.02502) and accepted as a poster at the International Conference on Learning Representations 2021, with the OpenReview camera-ready dated 12 January 2021.^[1]^[6] The final arXiv revision (v4) is dated 5 October 2022.^[1] The original code release accompanying the paper is hosted at github.com/ermongroup/ddim under an MIT license, and provides reference implementations for CIFAR-10, CelebA 64x64, and LSUN church and bedroom datasets.^[7]

The non-Markovian intuition

The standard derivation of the DDPM reverse process treats the forward chain as a strictly Markovian Gaussian random walk: at each step, $x_t$ is obtained by adding fresh Gaussian noise to $x_{t-1}$. The reverse chain, then, must approximate a Markovian denoiser. The key conceptual leap of the DDIM paper is to ask whether one can replace the Markovian forward chain with a non-Markovian one, in which the noise injected at step $t$ may depend on both $x_{t-1}$ and the underlying clean sample $x_0$, while retaining the same marginals $q(x_t \mid x_0)$. If this is possible, then a different (and potentially shorter) reverse chain can be designed without changing the training loss.

The non-Markovian forward process $q_\sigma$ constructed in the paper does exactly this: each $x_t$ is allowed to depend on $(x_{t-1}, x_0)$ via Bayes' rule applied to the inference posterior $q_\sigma(x_{t-1} \mid x_t, x_0)$. This dependence is harmless from the perspective of training, because the noise-prediction loss never inspects intermediate joint distributions, only marginals. But it lets the noise scale $\sigma_t$ (the standard deviation of the noise injected at each reverse step) be a free parameter and, because only the marginals are constrained, it allows the reverse chain to take large jumps in $t$.^[1]

The DDPM training objective

DDIM does not modify training. To explain the relationship to DDPM, the same notation is used here. DDPM defines a Markovian forward process

$q(x_t \mid x_{t-1}) = \mathcal{N}\left(\sqrt{\alpha_t / \alpha_{t-1}}\, x_{t-1},\ (1 - \alpha_t / \alpha_{t-1}) I\right)$,

where $\alpha_1, \ldots, \alpha_T$ is a decreasing schedule in $(0, 1]$ (this $\alpha_t$ is the cumulative quantity written $\bar{\alpha}_t$ by Ho et al.).^[1] A key property is that the marginal $q(x_t \mid x_0)$ admits a closed-form Gaussian, so any noisy latent can be sampled directly without iterating:

$q(x_t \mid x_0) = \mathcal{N}(\sqrt{\alpha_t}\, x_0, (1 - \alpha_t) I)$.^[1]

The DDPM training objective (Equation 5 of the DDIM paper, which with unit weights $\gamma = 1$ is the simple loss of Ho et al. 2020) reduces to a weighted denoising score-matching loss in which a neural network $\epsilon_\theta$ is trained to predict the noise variable $\epsilon$ used to corrupt a clean sample $x_0$ at a given timestep $t$.^[1]^[5] Critically, this objective depends only on the marginals $q(x_t \mid x_0)$, not on the full joint $q(x_{1:T} \mid x_0)$.^[1] Multiple joint distributions, including non-Markovian ones, can share the same marginals while inducing different reverse-time generative processes.
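The closed-form marginal and the noise-prediction loss can be sketched in a few lines. This is a minimal illustration using only the Python standard library and lists of floats in place of image tensors; the linear beta range, the helper names (`make_alphas`, `q_sample`, `simple_loss`), and the all-zeros stand-in network are assumptions made for the example, not part of the paper's code.

```python
import math
import random

def make_alphas(T=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative alphas in the DDIM paper's notation: alpha_t = prod_{s<=t} (1 - beta_s).

    The linear beta range is an assumed DDPM-style choice for illustration.
    Index 0 holds the convention alpha_0 = 1.
    """
    alphas = [1.0]
    for i in range(T):
        beta = beta_start + (beta_end - beta_start) * i / max(T - 1, 1)
        alphas.append(alphas[-1] * (1.0 - beta))
    return alphas

def q_sample(x0, alpha_t, eps):
    """Closed-form draw x_t = sqrt(alpha_t) * x_0 + sqrt(1 - alpha_t) * eps."""
    return [math.sqrt(alpha_t) * x + math.sqrt(1.0 - alpha_t) * e
            for x, e in zip(x0, eps)]

def simple_loss(eps_model, x0, alphas, rng):
    """One Monte Carlo term of the loss: mean of (eps - eps_theta(x_t, t))^2."""
    t = rng.randint(1, len(alphas) - 1)          # uniform timestep in 1..T
    eps = [rng.gauss(0.0, 1.0) for _ in x0]      # noise used to corrupt x_0
    x_t = q_sample(x0, alphas[t], eps)           # only the marginal is needed
    pred = eps_model(x_t, t)
    return sum((e - p) ** 2 for e, p in zip(eps, pred)) / len(x0)

rng = random.Random(0)
alphas = make_alphas()
zero_model = lambda x_t, t: [0.0] * len(x_t)     # hypothetical stand-in network
loss = simple_loss(zero_model, [0.5, -0.25, 1.0], alphas, rng)
```

Note that `simple_loss` touches only `q_sample`, never a step-by-step forward chain, which is the property the non-Markovian construction relies on.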

How does DDIM work?

Non-Markovian forward processes

The central observation of the DDIM paper is that, since the DDPM loss depends only on the marginals $q(x_t \mid x_0)$, one can construct a parametric family of inference distributions $Q$ indexed by a vector $\sigma \in \mathbb{R}^T_{\ge 0}$ that all share the DDPM marginals but differ in their joint structure.^[1] The family is defined by

$q_\sigma(x_{1:T} \mid x_0) = q_\sigma(x_T \mid x_0) \prod_{t=2}^{T} q_\sigma(x_{t-1} \mid x_t, x_0)$,

where $q_\sigma(x_T \mid x_0)$ is the DDPM terminal marginal and each posterior is the Gaussian

$q_\sigma(x_{t-1} \mid x_t, x_0) = \mathcal{N}\left( \sqrt{\alpha_{t-1}}\, x_0 + \sqrt{1 - \alpha_{t-1} - \sigma_t^2} \cdot \frac{x_t - \sqrt{\alpha_t}\, x_0}{\sqrt{1 - \alpha_t}},\ \sigma_t^2 I \right)$.^[1]

The mean is chosen so that the induced marginals $q_\sigma(x_t \mid x_0)$ match the DDPM marginals exactly, while the standard deviation $\sigma_t$ is a free parameter.^[1] When all $\sigma_t$ are zero, the conditional $q_\sigma(x_{t-1} \mid x_t, x_0)$ becomes a deterministic linear function of $x_t$ and $x_0$. The corresponding forward process $q_\sigma(x_t \mid x_{t-1}, x_0)$, obtained from Bayes' rule, is non-Markovian because each $x_t$ depends on both $x_{t-1}$ and $x_0$.^[1] The Hugging Face documentation describes this concisely: DDIMScheduler "extends the denoising procedure introduced in denoising diffusion probabilistic models (DDPMs) with non-Markovian guidance."^[3]
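The marginal-matching claim can be checked in one inductive step, sketched here in the notation above. Assume $q_\sigma(x_t \mid x_0) = \mathcal{N}(\sqrt{\alpha_t}\, x_0, (1 - \alpha_t) I)$ holds at step $t$. Because $x_{t-1}$ is a linear-Gaussian function of $x_t$ given $x_0$, its marginal is again Gaussian, with

$\mathbb{E}[x_{t-1} \mid x_0] = \sqrt{\alpha_{t-1}}\, x_0 + \sqrt{1 - \alpha_{t-1} - \sigma_t^2} \cdot \frac{\mathbb{E}[x_t \mid x_0] - \sqrt{\alpha_t}\, x_0}{\sqrt{1 - \alpha_t}} = \sqrt{\alpha_{t-1}}\, x_0$,

$\mathrm{Cov}[x_{t-1} \mid x_0] = \sigma_t^2 I + \frac{1 - \alpha_{t-1} - \sigma_t^2}{1 - \alpha_t}\,(1 - \alpha_t)\, I = (1 - \alpha_{t-1})\, I$,

which is the DDPM marginal at step $t-1$ for any admissible noise scale, that is, any $\sigma_t$ with $\sigma_t^2 \le 1 - \alpha_{t-1}$. The $\sigma_t^2$ terms cancel in the covariance, which is why $\sigma_t$ can be varied freely without affecting the marginals.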

Theorem 1 of the paper proves that for every $\sigma > 0$ the resulting variational training objective $J_\sigma$ is equal, up to a constant, to a reweighting $L_\gamma$ of the original DDPM loss. Consequently, any model trained with the DDPM "simple" loss ($L_1$, that is, $\gamma = 1$) is, simultaneously, a valid model for every member of the non-Markovian family. No retraining is required to switch sampling regimes.^[1]

The DDIM sampling update

Given a noisy latent $x_t$ and the noise prediction $\epsilon_\theta^{(t)}(x_t)$, the DDIM update rule (Equation 12 of the paper) is

$x_{t-1} = \sqrt{\alpha_{t-1}} \left( \frac{x_t - \sqrt{1 - \alpha_t}\, \epsilon_\theta^{(t)}(x_t)}{\sqrt{\alpha_t}} \right) + \sqrt{1 - \alpha_{t-1} - \sigma_t^2}\, \epsilon_\theta^{(t)}(x_t) + \sigma_t \epsilon_t$,

where $\epsilon_t \sim \mathcal{N}(0, I)$ is independent of $x_t$ and by convention $\alpha_0 = 1$.^[1] The three terms have intuitive labels in the paper: a "predicted $x_0$" term, a "direction pointing to $x_t$," and a "random noise" term.^[1]

The noise scale $\sigma_t$ is parameterized for experiments by a scalar $\eta \in [0, \infty)$:

$\sigma_{\tau_i}(\eta) = \eta \sqrt{\frac{1 - \alpha_{\tau_{i-1}}}{1 - \alpha_{\tau_i}}} \sqrt{1 - \frac{\alpha_{\tau_i}}{\alpha_{\tau_{i-1}}}}$.^[1]

Two special cases are highlighted in the paper. When $\eta = 1$, $\sigma_t$ equals the posterior standard deviation of the original DDPM reverse process and the update reduces to the DDPM sampler.^[1] When $\eta = 0$, the random-noise term vanishes and $x_{t-1}$ is a deterministic function of $x_t$ alone: this is DDIM proper, an implicit probabilistic model that maps a fixed $x_T$ to a unique sample.^[1] Intermediate values of $\eta$ interpolate between these two regimes. In the Hugging Face diffusers library, this is exposed as the eta argument of the DDIMScheduler.step method, which defaults to 0.0 and is documented as: "A value of 0 corresponds to DDIM (deterministic) and 1 corresponds to DDPM (fully stochastic)."^[3]
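A single update step, including the $\eta$ parameterization of the noise scale, might be written as follows. This is a sketch under simplifying assumptions: samples are plain lists of floats, the noise prediction is passed in as an argument rather than computed by a network, and the function names `ddim_sigma` and `ddim_step` are chosen for this example.

```python
import math
import random

def ddim_sigma(alpha_t, alpha_prev, eta):
    """Equation 16: noise scale, 0 for deterministic DDIM and the DDPM value at eta = 1."""
    return (eta
            * math.sqrt((1.0 - alpha_prev) / (1.0 - alpha_t))
            * math.sqrt(1.0 - alpha_t / alpha_prev))

def ddim_step(x_t, eps_pred, alpha_t, alpha_prev, eta=0.0, rng=None):
    """Equation 12 applied elementwise; alpha_prev = 1.0 on the final step."""
    rng = rng or random
    sigma = ddim_sigma(alpha_t, alpha_prev, eta)
    # Guard against tiny negative values from floating-point rounding.
    dir_coeff = math.sqrt(max(1.0 - alpha_prev - sigma ** 2, 0.0))
    out = []
    for x, e in zip(x_t, eps_pred):
        pred_x0 = (x - math.sqrt(1.0 - alpha_t) * e) / math.sqrt(alpha_t)  # "predicted x_0"
        direction = dir_coeff * e                                          # "direction pointing to x_t"
        noise = sigma * rng.gauss(0.0, 1.0) if sigma > 0.0 else 0.0        # "random noise"
        out.append(math.sqrt(alpha_prev) * pred_x0 + direction + noise)
    return out

# With the exact noise and eta = 0, the step lands on the marginal at alpha_prev.
x0, eps = 2.0, 0.5
x_t = [math.sqrt(0.36) * x0 + math.sqrt(0.64) * eps]
x_prev = ddim_step(x_t, [eps], alpha_t=0.36, alpha_prev=0.64)
```

When the supplied noise prediction equals the true noise, the deterministic step returns exactly $\sqrt{\alpha_{t-1}}\, x_0 + \sqrt{1 - \alpha_{t-1}}\, \epsilon$, so the sample stays on the same noise realization at the new level.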

How does DDIM make sampling faster?

Because the training objective is independent of the specific forward process (only the marginals matter), DDIM allows the user to choose any increasing sub-sequence $\tau = [\tau_1, \ldots, \tau_S]$ of $[1, \ldots, T]$ as the sampling schedule and apply the DDIM update only on the timesteps in $\tau$.^[1] The marginals $q(x_{\tau_i} \mid x_0)$ remain Gaussian with the closed form fixed at training time, so the model continues to denoise correctly without modification. The total cost of sampling becomes proportional to $S$ rather than $T$, which is the source of the 10x to 50x wall-clock speedup: a 1000-step DDPM chain is replaced by a chain of only 20 to 100 network evaluations.^[1]

The paper considers two simple sub-sampling schedules: a "linear" schedule, in which $\tau_i$ is roughly proportional to $i$, and a "quadratic" schedule, in which $\tau_i$ is proportional to $i^2$.^[1] The quadratic schedule is found to give slightly better FID for CIFAR-10 in low-step regimes, while the linear schedule is preferred for CelebA. Subsequent work (notably the 2023 paper "Common Diffusion Noise Schedules and Sample Steps are Flawed") compares the "leading," "linspace," and "trailing" timestep spacings, all available in the diffusers DDIMScheduler, and recommends "trailing," which materially improves quality at very low step counts ($S = 5$) when combined with v-prediction training and zero-terminal-SNR noise schedules.^[8]^[3]
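The two sub-sequences from the paper can be generated as below. This is an illustrative sketch: the paper only states that the constant $c$ is picked so that the last timestep is close to $T$, so the specific choices $c = T/S$ and $c = T/S^2$, and the function names, are assumptions of this example rather than the reference implementation.

```python
def linear_tau(T, S):
    """Linear sub-sequence: tau_i = floor(c * i), with c = T / S (assumed)."""
    c = T / S
    return [max(1, int(c * i)) for i in range(1, S + 1)]

def quadratic_tau(T, S):
    """Quadratic sub-sequence: tau_i = floor(c * i^2), with c = T / S^2 (assumed).

    Duplicates are dropped, so fewer than S steps may be returned when S is
    large relative to T.
    """
    c = T / S ** 2
    return sorted({max(1, int(c * i * i)) for i in range(1, S + 1)})

tau = linear_tau(1000, 10)
# Sampling walks the sub-sequence backwards; each pair is (t, previous t),
# with 0 standing for the clean-image level where alpha_0 = 1.
steps = list(zip(reversed(tau), reversed([0] + tau[:-1])))
```

The quadratic sequence places more of its steps at small $t$, where the sample is close to the data, and takes larger jumps through the high-noise region.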

Relation to neural ODEs and the probability-flow ODE

The DDIM update can be rewritten in a form that exposes its structure as an Euler discretization of an ordinary differential equation. Equation 13 of the paper rearranges the iterate as

$\frac{x_{t - \Delta t}}{\sqrt{\alpha_{t - \Delta t}}} = \frac{x_t}{\sqrt{\alpha_t}} + \left( \sqrt{\frac{1 - \alpha_{t - \Delta t}}{\alpha_{t - \Delta t}}} - \sqrt{\frac{1 - \alpha_t}{\alpha_t}} \right) \epsilon_\theta^{(t)}(x_t)$.^[1]

Reparameterizing with $\sigma(t) = \sqrt{1 - \alpha(t)} / \sqrt{\alpha(t)}$ and $\bar{x}(t) = x(t) / \sqrt{\alpha(t)}$ (this $\sigma(t)$ is a rescaled noise level, distinct from the sampling noise scale $\sigma_t$ used earlier) yields the limiting ODE

$d\bar{x}(t) = \epsilon_\theta^{(t)}\left( \bar{x}(t) / \sqrt{\sigma^2 + 1} \right) d\sigma(t)$.^[1]

Proposition 1 of the paper proves that with the optimal noise-prediction network, this ODE is equivalent to a special case of the "probability flow ODE" derived concurrently by Yang Song, Sohl-Dickstein, Kingma, Kumar, Ermon, and Poole in "Score-Based Generative Modeling through Stochastic Differential Equations," corresponding to the variance-exploding diffusion SDE.^[1] The two methods take Euler steps with respect to different parameterizations (DDIM with respect to $d\sigma(t)$, the probability-flow Euler method with respect to $dt$), which gives different update equations in the discrete-step regime; in the continuous limit they coincide.^[1]
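The claim that the deterministic update is an Euler step in the rescaled variables can be checked numerically. The sketch below, on scalars and with function names chosen for this example, computes one step both ways: directly from the sampling update with zero noise scale, and as an Euler step of the ODE in $\bar{x}$ and $\sigma$.

```python
import math

def ddim_step_direct(x_t, eps, alpha_t, alpha_prev):
    """Deterministic DDIM update (zero noise scale) on a scalar."""
    pred_x0 = (x_t - math.sqrt(1.0 - alpha_t) * eps) / math.sqrt(alpha_t)
    return math.sqrt(alpha_prev) * pred_x0 + math.sqrt(1.0 - alpha_prev) * eps

def ddim_step_euler(x_t, eps, alpha_t, alpha_prev):
    """Same step as an Euler update of d(xbar) = eps * d(sigma)."""
    sigma_t = math.sqrt((1.0 - alpha_t) / alpha_t)
    sigma_prev = math.sqrt((1.0 - alpha_prev) / alpha_prev)
    xbar_t = x_t / math.sqrt(alpha_t)                   # rescaled sample
    xbar_prev = xbar_t + (sigma_prev - sigma_t) * eps   # Euler step in sigma
    return xbar_prev * math.sqrt(alpha_prev)            # undo the rescaling

direct = ddim_step_direct(0.7, -0.3, alpha_t=0.4, alpha_prev=0.6)
euler = ddim_step_euler(0.7, -0.3, alpha_t=0.4, alpha_prev=0.6)
gap = abs(direct - euler)   # agrees up to floating-point rounding
```

The two functions are algebraically identical; the second form simply makes explicit that the step size is the change in $\sigma$, not the change in $t$.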

The ODE viewpoint has two practical consequences. First, deterministic DDIM is, formally, a first-order ODE solver, which means that higher-order solvers can in principle produce better samples for the same number of network evaluations: this is the route taken by subsequent samplers such as DPM-Solver, PNDM, Heun's method, and the Karras EDM family.^[2]^[4]^[9] Second, the ODE is invertible: by running the DDIM iterate "in reverse" (the encoding direction), one obtains a deterministic map from a real image x_0 to a latent code x_T. Iterating the encoding and decoding directions reconstructs the image, an operation called "DDIM inversion" that became a standard primitive for image editing.^[1]^[10]
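Approximate invertibility can be illustrated with a toy model in which the optimal noise prediction is known in closed form. The sketch below assumes scalar data distributed as $\mathcal{N}(0, 1)$, for which the optimal prediction is $\sqrt{1 - \alpha_t}\, x_t$; the linear beta schedule, the evenly spaced timesteps, and all function names are likewise assumptions made for the example. It encodes a value to the terminal level, decodes it back, and reports the reconstruction error.

```python
import math

def make_alphas(T=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative alphas for an assumed linear beta schedule; alphas[0] = 1."""
    alphas = [1.0]
    for i in range(T):
        beta = beta_start + (beta_end - beta_start) * i / max(T - 1, 1)
        alphas.append(alphas[-1] * (1.0 - beta))
    return alphas

def toy_eps(x, alpha):
    """Optimal noise prediction when the data distribution is N(0, 1) (toy assumption)."""
    return math.sqrt(1.0 - alpha) * x

def ddim_move(x, alpha_from, alpha_to):
    """Deterministic DDIM update between two noise levels, in either direction."""
    eps = toy_eps(x, alpha_from)
    pred_x0 = (x - math.sqrt(1.0 - alpha_from) * eps) / math.sqrt(alpha_from)
    return math.sqrt(alpha_to) * pred_x0 + math.sqrt(1.0 - alpha_to) * eps

def roundtrip_error(x0, S, alphas):
    """Encode x_0 -> x_T, decode x_T -> x_0 with S steps each; return |x_0' - x_0|."""
    T = len(alphas) - 1
    tau = [0] + [T * i // S for i in range(1, S + 1)]
    x = x0
    for a, b in zip(tau[:-1], tau[1:]):                       # encoding direction
        x = ddim_move(x, alphas[a], alphas[b])
    for a, b in zip(reversed(tau[1:]), reversed(tau[:-1])):   # decoding direction
        x = ddim_move(x, alphas[a], alphas[b])
    return abs(x - x0)

alphas = make_alphas()
errors = {S: roundtrip_error(1.0, S, alphas) for S in (10, 100, 1000)}
```

In this toy setting the error shrinks as the number of steps grows, which is the expected behavior of a first-order discretization; it does not reproduce the paper's reported figures, which come from trained image models.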

How much faster is DDIM than DDPM?

The DDIM paper benchmarks the $\eta$ parameter and the trajectory length $S$ on two unconditional image-generation datasets, CIFAR-10 (32 x 32) and CelebA (64 x 64), using exactly the same network weights trained with the DDPM $L_1$ loss.^[1] Fréchet Inception Distance (FID) is reported for $S \in \{10, 20, 50, 100, 1000\}$ and $\eta \in \{0.0, 0.2, 0.5, 1.0\}$, with an additional "sigma-hat" row corresponding to the implementation used by Ho et al. (2020) for CIFAR-10 samples.^[1]

$S$	DDIM ($\eta=0$) CIFAR-10	DDIM ($\eta=0$) CelebA	DDPM ($\eta=1$) CIFAR-10	DDPM ($\eta=1$) CelebA
10	13.36	17.33	41.07	33.12
20	6.84	13.73	18.36	26.03
50	4.67	9.17	8.01	18.48
100	4.16	6.53	5.78	13.93
1000	4.04	3.51	4.73	5.98

(FID scores reproduced from Table 1 of Song, Meng, and Ermon, 2020.^[1])

The headline finding is that deterministic DDIM achieves dramatically better FID than DDPM at low step counts. On CIFAR-10 with $S = 10$, DDIM scores 13.36 while DDPM scores 41.07; on CelebA with $S = 20$, DDIM scores 13.73 while DDPM scores 26.03. As $S$ grows the gap narrows: at $S = 1000$ the two are within 1 FID unit on CIFAR-10 (4.04 versus 4.73), although DDIM keeps a larger lead on CelebA (3.51 versus 5.98), and the DDPM "sigma-hat" parameterization is marginally better than DDIM on CIFAR-10 (3.17 versus 4.04).^[1] In wall-clock terms, the paper quotes a 10x to 50x speedup, defined operationally as the ratio of steps required by DDPM to those required by DDIM to reach comparable FID.^[1]

A second experiment in Section 5.2 establishes that the deterministic DDIM map preserves "high-level features" across sampling-trajectory lengths: starting from the same $x_T$ and varying $S$ between 10 and 1000, the generated images share the same coarse semantic structure, with only fine details differing.^[1] This consistency is a direct consequence of determinism: in DDPM the random-noise injections at each step erase any informative structure that x_T might encode about x_0, so identical $x_T$ do not yield identical outputs.

A third experiment (Section 5.3) exploits the same consistency to perform image interpolation in the latent space $x_T$. Linear or spherical interpolations between two latent codes $x_T^{(1)}$ and $x_T^{(2)}$ produce smooth, semantically meaningful image interpolations on CelebA, a behavior previously associated with implicit generative models such as GANs and not with diffusion models.^[1] A fourth experiment (Table 2 of the paper) measures reconstruction error after encoding and decoding through DDIM at various step counts; the mean squared error on CIFAR-10 falls from 0.014 at $S = 10$ to 0.0009 at $S = 100$, confirming that the deterministic map is approximately invertible in practice.^[1]
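Spherical interpolation between two latent codes can be sketched as follows. The example works on plain lists of floats, the function name `slerp` and the small-angle fallback are choices made here for illustration, and each interpolated latent would still need to be decoded by the deterministic sampler to produce an image.

```python
import math

def slerp(z0, z1, lam):
    """Spherical interpolation between latent vectors z0 and z1, with 0 <= lam <= 1."""
    dot = sum(a * b for a, b in zip(z0, z1))
    n0 = math.sqrt(sum(a * a for a in z0))
    n1 = math.sqrt(sum(b * b for b in z1))
    # Clamp to [-1, 1] so rounding error cannot push acos out of its domain.
    theta = math.acos(max(-1.0, min(1.0, dot / (n0 * n1))))
    if theta < 1e-8:
        # Nearly parallel vectors: fall back to linear interpolation.
        return [(1.0 - lam) * a + lam * b for a, b in zip(z0, z1)]
    w0 = math.sin((1.0 - lam) * theta) / math.sin(theta)
    w1 = math.sin(lam * theta) / math.sin(theta)
    return [w0 * a + w1 * b for a, b in zip(z0, z1)]

# Interpolating halfway between two orthogonal unit vectors keeps unit norm.
mid = slerp([1.0, 0.0], [0.0, 1.0], 0.5)
mid_norm = math.sqrt(sum(v * v for v in mid))
```

Spherical rather than linear interpolation is the usual choice for Gaussian latents because a straight-line average of two high-dimensional Gaussian samples has a smaller norm than either endpoint, which moves it away from the region the model was trained to decode.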

Where is DDIM used?

Stable Diffusion and the diffusers library

When CompVis released the original Stable Diffusion 1.x family of latent diffusion checkpoints in 2022, the inference pipeline shipped DDIM as a default sampler, alongside the higher-order PLMS/PNDM sampler that became the default in the Hugging Face diffusers StableDiffusionPipeline.^[11]^[3] Many of the early Stable Diffusion tutorials and grid comparisons on community sites (AUTOMATIC1111's stable-diffusion-webui, ComfyUI, InvokeAI) treat DDIM as the reference sampler against which newer samplers are benchmarked.^[2]

The diffusers DDIMScheduler class exposes the full eta parameter (defaulting to 0.0 for deterministic sampling), supports v-prediction and epsilon-prediction parameterizations, and provides "leading," "trailing," and "linspace" timestep spacings.^[3] The default beta schedule for Stable Diffusion is the "scaled_linear" schedule with beta_start = 0.00085 and beta_end = 0.012, which is the schedule under which Stable Diffusion 1.4 and 1.5 were trained.^[3] The companion class DDIMInverseScheduler provides the encoding direction used by DDIM-inversion editing pipelines such as Prompt-to-Prompt and Null-Text Inversion.^[10]^[12]

The 2023 paper "Common Diffusion Noise Schedules and Sample Steps are Flawed" by Lin, Liu, Li, and Yang documented several subtle implementation bugs in the default DDIM configuration of Stable Diffusion 1.x: the noise schedule does not reach zero terminal signal-to-noise ratio, and the sampler does not start from the final timestep, biasing generations toward medium-brightness outputs.^[8] The paper proposes four fixes (rescaling the betas to zero terminal SNR, v-prediction training, trailing timestep spacing, and guidance rescaling). The first three were subsequently exposed as configuration options on DDIMScheduler (rescale_betas_zero_snr, prediction_type, and timestep_spacing), while guidance rescaling is applied at the pipeline level rather than in the scheduler.^[3]^[8]

Practical configuration in `diffusers`

The reference diffusers implementation exposes the full set of DDIM hyperparameters relevant to modern usage. Key configuration arguments include num_train_timesteps (default 1000, matching DDPM training), beta_start and beta_end (the boundary values of the noise schedule), beta_schedule (one of "linear", "scaled_linear", or "squaredcos_cap_v2", with "scaled_linear" being the Stable Diffusion default), clip_sample (whether to clip the predicted x_0 to [-1, 1] for pixel-space models), set_alpha_to_one (whether the "previous" cumulative alpha used at the final denoising step is fixed to 1 or taken from the first training step), prediction_type (one of "epsilon", "sample", or "v_prediction"), timestep_spacing (one of "leading", "trailing", or "linspace"), and rescale_betas_zero_snr (the 2023 fix to enforce zero terminal SNR).^[3]

At inference, the set_timesteps(num_inference_steps) method discretizes the training-time noise schedule down to the requested number of evaluation points, and the step(model_output, timestep, sample, eta=0.0, ...) method performs one DDIM update on a given latent. The eta parameter on step is the runtime knob exposed by Equation 16 of the original paper: a value of 0.0 produces deterministic DDIM, 1.0 produces DDPM-style stochastic sampling, and intermediate values give a continuous family of stochastic samplers, all sharing the same trained model.^[3]

The inverse direction (encoding) is provided by a companion DDIMInverseScheduler class, which iterates the DDIM update in the forward-time direction so that a clean image can be encoded into an x_T latent for editing tasks. The combination of DDIMScheduler and DDIMInverseScheduler is the substrate on which most diffusion-editing pipelines in diffusers are built.^[3]

What is DDIM inversion used for?

Because deterministic DDIM defines an approximately invertible map between a clean image x_0 and a latent code x_T, it supports a family of editing techniques collectively known as "DDIM inversion." Given a real image, one runs the DDIM iterate in the encoding direction to obtain a latent x_T such that decoding with the same model recovers the original image to high accuracy; one can then modify the conditioning signal (the text prompt, an attention map, or a ControlNet structure signal) and decode to obtain an edited image while preserving structural content.^[10]^[12] The Prompt-to-Prompt (Hertz et al., 2022) and Null-Text Inversion (Mokady et al., CVPR 2023) papers were among the first to use DDIM inversion as the substrate for text-driven editing of real images with Stable Diffusion.^[12]

How does DDIM compare with later samplers?

DDIM is a first-order ODE solver in the variance-exploding parameterization, and as such it is now considered a baseline that is consistently outperformed at very low step counts by higher-order solvers and by improved time discretizations. The most important successors are:

Sampler	Order	Reference	Typical NFE for SD-quality images
DDIM	1	Song, Meng, Ermon (ICLR 2021)^[1]	50 to 100
DPM-Solver	1-3 (multistep)	Lu et al. (NeurIPS 2022)^[4]	10 to 20
DPM-Solver++	1-3 (multistep)	Lu et al. (arXiv 2022)^[13]	10 to 20
Karras Heun	2 (Heun)	Karras, Aittala, Aila, Laine (NeurIPS 2022)^[9]	35
PNDM	multistep	Liu, Ren, Lin, Zhao (ICLR 2022)^[14]	50

DPM-Solver and DPM-Solver++ exploit the semi-linear structure of the probability-flow ODE to handle the linear component analytically, achieving FID 4.70 on CIFAR-10 with only 10 function evaluations and 2.87 with 20 evaluations, a 4x to 16x speedup over previous training-free samplers including DDIM.^[4] The Karras EDM family (Karras et al., NeurIPS 2022) re-derives the noise schedule and sampler design space from scratch, reaching state-of-the-art FID on CIFAR-10 with 35 network evaluations per image.^[9] By 2023, the diffusers library and most production text-to-image systems had switched their default scheduler away from DDIM and PNDM toward DPM-Solver++, EulerDiscrete, or Karras-style Heun samplers.^[2]^[3]

Why does DDIM matter?

DDIM was an early demonstration that the slow sampling of diffusion models was not intrinsic to the framework but a consequence of the specific reverse process. By separating the choice of forward process (which determines training) from the choice of generative chain (which determines sampling), it opened the door to a now-large literature on training-free fast samplers for diffusion models, including DPM-Solver, PNDM, EDM-style Heun samplers, k-LMS, UniPC, and many others.^[4]^[9]^[14]^[2] The connection between deterministic DDIM and the probability-flow ODE made the bridge between score-based diffusion and neural ODE literature explicit, framing diffusion sampling as a numerical ODE-solving problem and motivating high-order solver design.^[1]^[4]^[9]

DDIM was also the first widely deployed sampler that produced a meaningful "latent code" for diffusion models. The fact that a clean image x_0 could be encoded as a latent x_T, manipulated, and decoded, with high fidelity, made diffusion models usable for the same kinds of attribute manipulation, interpolation, and image-to-image editing that had previously been associated with GANs and variational autoencoders.^[1]^[12] DDIM inversion became the standard tool for editing real photographs with Stable Diffusion before being supplanted, and complemented, by null-text inversion, edit-friendly DDPM inversion, and inversion-free editing methods.^[12]

A third, methodological contribution of the paper is its framing of training and inference as separable design choices for an entire family of diffusion-like models. The DDIM proof that all members of the non-Markovian family share a surrogate objective with DDPM (Theorem 1) showed that the simple noise-prediction loss of Ho et al. (2020) is, in a precise sense, "universal" across many different generative procedures. This decoupling has been repeatedly used in subsequent work: progressive distillation (Salimans and Ho, 2022), consistency models (Song et al., 2023), and rectified flow (Liu et al., 2022) all build on the observation that a single pretrained noise-prediction network supports many sampling algorithms.^[1]^[4]^[9] The DDIM training-free aspect, namely that no fine-tuning or auxiliary training is required to switch sampling schemes, has been a defining feature of the diffusion-sampler literature ever since.

The paper is one of the most heavily cited diffusion-sampling works of all time. Semantic Scholar records more than 11,000 citations by 2026, and the paper is consistently listed alongside the original DDPM paper and the score-based SDE paper of Yang Song et al. as one of the three foundational works of the modern diffusion era.^[1]^[5]^[6]^[15]

Use in other generative tasks

Beyond image generation, the DDIM sampler has been adopted for a wide range of diffusion-based modalities and tasks. Audio diffusion models built on top of DDPM-style training, including those used by various text-to-audio systems, frequently expose a DDIM scheduler as a default fast sampler. Video diffusion models, which inherit large per-step compute costs because the U-Net must process a full sequence of frames, were among the early beneficiaries of DDIM-style acceleration, since cutting the step count by 10 to 50 times directly cuts video-synthesis cost in proportion. Imitation learning and robotic policy diffusion, in which a diffusion model is trained to denoise action trajectories rather than pixels, similarly use DDIM as a low-step sampler in real-time control loops.^[3]

DDIM-based encoding is used as a primitive for editing tasks even when the editing model itself is more sophisticated than a vanilla diffusion model: for example, ControlNet-conditioned editing pipelines often invert a source image through DDIM, modify the structural conditioning (a Canny edge map, a depth map, a pose skeleton, etc.), and decode through the same DDIM trajectory to produce an edited image whose layout and identity are preserved.^[3]^[10] The same primitive supports null-text inversion, prompt tuning, plug-and-play diffusion features, and a long list of similar real-image editing methods.^[12]

What are the limitations of DDIM?

Several limitations of DDIM, some discussed in the paper itself and others identified by later work, are now well understood.

The deterministic DDIM map is only approximately invertible. As Table 2 of the paper shows, the mean squared reconstruction error on CIFAR-10 is 0.014 at $S = 10$, falling to 0.0009 at $S = 100$; the residual error reflects the first-order discretization of the underlying ODE.^[1] For real-image editing tasks this discretization error compounds with classifier-free guidance error, and additional tricks (null-text optimization, prompt-tuning inversion) are often required to obtain acceptable reconstructions.^[12]

DDIM, like all first-order ODE solvers, requires more sampling steps than higher-order solvers to reach the same FID. On CIFAR-10, DPM-Solver reaches FID 4.70 with 10 evaluations and 2.87 with 20 evaluations, while DDIM reaches 13.36 and 6.84 respectively.^[1]^[4] For Stable Diffusion 1.x at typical 25-50 step budgets, DDIM and DPM-Solver++ produce comparable quality, but at very low step counts ($S = 5$ to $10$) DPM-Solver++ and EulerDiscrete samplers are noticeably better.^[2]^[4]

The default DDIM configuration in Stable Diffusion 1.x had two implementation-level issues identified in 2023: the noise schedule did not enforce zero terminal signal-to-noise ratio, and the inference timesteps did not begin at the final training timestep, both of which biased generations toward medium-brightness outputs.^[8] These issues are not intrinsic to DDIM itself (they apply equally to most schedulers in the affected pipelines), and they can be addressed with the rescale_betas_zero_snr and timestep_spacing="trailing" options of the DDIMScheduler.^[3]^[8]

The stochastic DDPM end of the $\eta$ family ($\eta = 1$) is generally worse than DDIM for fewer than about 100 steps, and the "sigma-hat" parameterization used by Ho et al. (2020) on CIFAR-10 is dramatically worse at low step counts, with FID 367.43 at $S = 10$ on CIFAR-10.^[1] This is consistent with the broader observation that SDE-style stochastic samplers need many more steps than their ODE-style deterministic counterparts to reach the same quality, and the asymptotic advantage of stochastic samplers (slightly better FID at the full 1000 steps) does not survive aggressive step-count reduction.^[1]

Related work

DDIM stands in a small family of fast diffusion samplers and is most usefully understood relative to its neighbors. The original DDPM (Ho, Jain, Abbeel, 2020) is the Markovian baseline against which DDIM was designed.^[5] The concurrent score-based SDE framework of Yang Song et al. (2020) provides the probability-flow ODE viewpoint that DDIM is a special case of.^[1] PNDM (Liu et al., ICLR 2022) extends DDIM by using a pseudo linear multi-step method for the underlying ODE.^[14] DPM-Solver (Lu et al., NeurIPS 2022) and DPM-Solver++ (Lu et al., 2022) provide dedicated semi-linear ODE solvers achieving high quality in roughly 10 to 20 steps.^[4]^[13] The EDM design space of Karras, Aittala, Aila, and Laine (NeurIPS 2022) re-parameterizes the noise schedule and uses a second-order Heun method to reach state-of-the-art FID on CIFAR-10 at 35 network evaluations.^[9] The DDIM-inversion line of work, including Prompt-to-Prompt and Null-Text Inversion, treats deterministic DDIM as a primitive for real-image editing on top of Stable Diffusion.^[10]^[12]

In application, DDIM was used as the original sampler for the public Stable Diffusion 1.x family from CompVis (Rombach, Blattmann, Lorenz, Esser, Ommer at LMU Munich) and remains a baseline scheduler in the Hugging Face diffusers library.^[11]^[3] It has been cited by virtually every subsequent diffusion-sampler paper as the first-order baseline.

References

1. Jiaming Song, Chenlin Meng, Stefano Ermon, "Denoising Diffusion Implicit Models", arXiv preprint, 2020-10-06 (v4 dated 2022-10-05). https://arxiv.org/abs/2010.02502. Accessed 2026-06-24.
2. Andrew Zhu et al., "Stable Diffusion Samplers: A Comprehensive Guide", Stable Diffusion Art, 2024-09-01. https://stable-diffusion-art.com/samplers/. Accessed 2026-05-20.
3. Hugging Face, "DDIMScheduler", diffusers documentation v0.38.0, 2025-04-01. https://huggingface.co/docs/diffusers/api/schedulers/ddim. Accessed 2026-06-24.
4. Cheng Lu, Yuhao Zhou, Fan Bao, Jianfei Chen, Chongxuan Li, Jun Zhu, "DPM-Solver: A Fast ODE Solver for Diffusion Probabilistic Model Sampling in Around 10 Steps", arXiv preprint, 2022-06-02. https://arxiv.org/abs/2206.00927. Accessed 2026-05-20.
5. Jonathan Ho, Ajay Jain, Pieter Abbeel, "Denoising Diffusion Probabilistic Models", arXiv preprint, 2020-06-19. https://arxiv.org/abs/2006.11239. Accessed 2026-05-20.
6. OpenReview, "Denoising Diffusion Implicit Models (ICLR 2021 poster)", OpenReview.net, 2021-01-12. https://openreview.net/forum?id=St1giarCHLP. Accessed 2026-05-20.
7. Jiaming Song, Chenlin Meng, Stefano Ermon, "ermongroup/ddim: Denoising Diffusion Implicit Models", GitHub repository, 2020-10-06. https://github.com/ermongroup/ddim. Accessed 2026-05-20.
8. Shanchuan Lin, Bingchen Liu, Jiashi Li, Xiao Yang, "Common Diffusion Noise Schedules and Sample Steps are Flawed", arXiv preprint, 2023-05-15. https://arxiv.org/abs/2305.08891. Accessed 2026-05-20.
9. Tero Karras, Miika Aittala, Timo Aila, Samuli Laine, "Elucidating the Design Space of Diffusion-Based Generative Models", arXiv preprint, 2022-06-01. https://arxiv.org/abs/2206.00364. Accessed 2026-05-20.
10. Hugging Face, "DDIM Inversion (Unit 4)", Diffusion Models Course, 2024-01-01. https://huggingface.co/learn/diffusion-course/en/unit4/2. Accessed 2026-05-20.
11. CompVis, "Stable Diffusion v1-4 Model Card", Hugging Face, 2022-08-22. https://huggingface.co/CompVis/stable-diffusion-v1-4. Accessed 2026-05-20.
12. Ron Mokady, Amir Hertz, Kfir Aberman, Yael Pritch, Daniel Cohen-Or, "Null-text Inversion for Editing Real Images using Guided Diffusion Models", CVPR 2023 proceedings, 2023-06-01. https://openaccess.thecvf.com/content/CVPR2023/papers/Mokady_NULL-Text_Inversion_for_Editing_Real_Images_Using_Guided_Diffusion_Models_CVPR_2023_paper.pdf. Accessed 2026-05-20.
13. Cheng Lu, Yuhao Zhou, Fan Bao, Jianfei Chen, Chongxuan Li, Jun Zhu, "DPM-Solver++: Fast Solver for Guided Sampling of Diffusion Probabilistic Models", arXiv preprint, 2022-11-02. https://arxiv.org/abs/2211.01095. Accessed 2026-05-20.
14. Luping Liu, Yi Ren, Zhijie Lin, Zhou Zhao, "Pseudo Numerical Methods for Diffusion Models on Manifolds", arXiv preprint, 2022-02-20. https://arxiv.org/abs/2202.09778. Accessed 2026-05-20.
15. Semantic Scholar, "Denoising Diffusion Implicit Models (Song, Meng, Ermon)", citation record. https://www.semanticscholar.org/paper/Denoising-Diffusion-Implicit-Models-Song-Meng/014576b866078524286802b1d0e18628520aa886. Accessed 2026-06-24.
