DDIM (Denoising Diffusion Implicit Models)
Denoising Diffusion Implicit Models (DDIM) are a class of iterative generative models, introduced by Jiaming Song, Chenlin Meng, and Stefano Ermon of Stanford University in October 2020, that accelerate sampling from pretrained denoising diffusion probabilistic models by generalizing the underlying Markovian forward process to a family of non-Markovian forward processes sharing the same training objective.[1] The method preserves all marginal distributions of the original DDPM forward process while introducing a free variance parameter that interpolates between a deterministic ordinary-differential-equation sampler and the original stochastic DDPM sampler.[1] Empirically, DDIM produces sample quality comparable to a 1000-step DDPM with as few as 20 to 100 sampling steps, a wall-clock speedup of roughly 10x to 50x without retraining.[1] In its deterministic limit (the eta=0 case) DDIM corresponds to a first-order Euler discretization of the probability-flow ordinary differential equation associated with the variance-exploding stochastic differential equation of score-based diffusion.[1] DDIM served as the default sampler in the original Stable Diffusion release in 2022 and remains a baseline scheduler in the Hugging Face diffusers library, even though later higher-order ODE solvers such as DPM-Solver and the Karras EDM family generally match or exceed its quality at fewer steps.[2][3][4]
| Attribute | Value |
|---|---|
| Type | Sampler for diffusion models |
| First arXiv submission | 2020-10-06 |
| Authors | Jiaming Song, Chenlin Meng, Stefano Ermon |
| Affiliation | Stanford University |
| Publication venue | ICLR 2021 (poster) |
| OpenReview ID | St1giarCHLP |
| Reference implementation | github.com/ermongroup/ddim (MIT License) |
| Default stochasticity parameter | eta = 0 (deterministic) |
| Training requirement | None beyond a standard DDPM |
| Typical inference budget | 10 to 100 network function evaluations |
Background
Denoising diffusion models are latent-variable generative models that learn to invert a fixed forward process which gradually corrupts data into Gaussian noise.[1][5] The original Markovian formulation of Sohl-Dickstein et al. (2015) and the DDPM training recipe of Ho, Jain, and Abbeel (2020) define a forward chain that adds a small amount of Gaussian noise at each of T discrete steps, and a generative reverse chain that progressively denoises a sample drawn from a standard Gaussian back to the data distribution.[1][5] DDPM, in particular, demonstrated that image quality on CIFAR-10 (FID 3.17 at the time of publication, with T = 1000) could rival that of GANs without adversarial training, but at a substantial sampling cost: the entire 1000-step chain must be simulated sequentially to draw a single image.[1][5]
Song, Meng, and Ermon's October 2020 preprint identified this sampling cost as the principal practical drawback of diffusion models relative to single-pass generators. As they note in the introduction, "it takes around 20 hours to sample 50k images of size 32 x 32 from a DDPM, but less than a minute to do so from a GAN on a Nvidia 2080 Ti GPU," and the gap widens for larger images.[1] DDIM was conceived specifically to close this gap without modifying the training procedure.
The paper was first posted to arXiv on 6 October 2020 (preprint number 2010.02502) and accepted as a poster at the International Conference on Learning Representations 2021, with the OpenReview camera-ready dated 12 January 2021.[1][6] The final arXiv revision (v4) is dated 5 October 2022.[1] The original code release accompanying the paper is hosted at github.com/ermongroup/ddim under an MIT license, and provides reference implementations for CIFAR-10, CelebA 64x64, and LSUN church and bedroom datasets.[7]
The non-Markovian intuition
The standard derivation of the DDPM reverse process treats the forward chain as a strictly Markovian Gaussian random walk: at each step, x_t is obtained by adding fresh Gaussian noise to x_(t-1). The reverse chain, then, must approximate a Markovian denoiser. The key conceptual leap of the DDIM paper is to ask whether one can replace the Markovian forward chain with a non-Markovian one, in which the noise injected at step t may depend on both x_(t-1) and the underlying clean sample x_0, while retaining the same marginals q(x_t | x_0). If this is possible, then a different (and potentially shorter) reverse chain can be designed without changing the training loss.
The non-Markovian forward process q_sigma constructed in the paper does exactly this: each x_t is allowed to depend on (x_(t-1), x_0) via Bayes' rule applied to the inference posterior q_sigma(x_(t-1) | x_t, x_0). This dependence is harmless from the perspective of training, because the noise-prediction loss never inspects intermediate joint distributions, only marginals. But it lets the standard deviation sigma_t at each reverse step be a free parameter, which in turn allows the reverse chain to take large jumps in t while still targeting the same marginals.[1]
The DDPM training objective
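DDIM does not modify training. To make the relationship to DDPM explicit, this article follows the notation of the DDIM paper, in which alpha_t denotes the cumulative signal coefficient (written alpha-bar_t in Ho et al. 2020) rather than the per-step coefficient. In this notation DDPM defines a Markovian forward process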
q(x_t | x_(t-1)) = N(sqrt(alpha_t / alpha_(t-1)) x_(t-1), (1 - alpha_t / alpha_(t-1)) I),
where alpha_1, ..., alpha_T is a decreasing schedule in (0, 1].[1] A key property is that the marginal q(x_t | x_0) admits a closed-form Gaussian, so any noisy latent can be sampled directly without iterating:
q(x_t | x_0) = N(sqrt(alpha_t) x_0, (1 - alpha_t) I).[1]
The DDPM training objective (Equation 5 of the DDIM paper, equivalent to the simple loss of Ho et al. 2020) reduces to a weighted denoising score-matching loss in which a neural network epsilon_theta is trained to predict the noise variable epsilon used to corrupt a clean sample x_0 at a given timestep t.[1][5] Critically, this objective depends only on the marginals q(x_t | x_0), not on the full joint q(x_(1:T) | x_0).[1] Multiple joint distributions, including non-Markovian ones, can share the same marginals while inducing different reverse-time generative processes.
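The closed-form marginal can be checked numerically by chaining the one-step kernels. The following is a minimal sketch in plain Python; the linear beta schedule (1e-4 to 0.02 over 1000 steps) and the helper names `make_alphas` and `compose_markov_steps` are assumptions chosen for illustration, not something fixed by the DDIM paper.

```python
import math

def make_alphas(T=1000, beta_start=1e-4, beta_end=0.02):
    """Return [alpha_0, ..., alpha_T], cumulative products in DDIM notation.

    Assumption: a linear beta schedule; any decreasing schedule would do.
    """
    alphas = [1.0]  # alpha_0 = 1 by convention
    for t in range(1, T + 1):
        beta_t = beta_start + (beta_end - beta_start) * (t - 1) / (T - 1)
        alphas.append(alphas[-1] * (1.0 - beta_t))
    return alphas

def compose_markov_steps(alphas, t):
    """Chain t one-step kernels q(x_s | x_(s-1)).

    Returns (coefficient on x_0, variance) of the resulting Gaussian.
    """
    coef, var = 1.0, 0.0
    for s in range(1, t + 1):
        ratio = alphas[s] / alphas[s - 1]
        coef = math.sqrt(ratio) * coef        # mean is scaled at every step
        var = ratio * var + (1.0 - ratio)     # old variance scaled, new noise added
    return coef, var

alphas = make_alphas()
coef, var = compose_markov_steps(alphas, 500)
print(coef, math.sqrt(alphas[500]))   # both give the mean coefficient sqrt(alpha_t)
print(var, 1.0 - alphas[500])         # both give the variance 1 - alpha_t
```

Because the two routes agree, a training loop can draw x_t in one shot from x_0 instead of simulating t noising steps.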
How DDIM works
Non-Markovian forward processes
The central observation of the DDIM paper is that, since the DDPM loss depends only on the marginals q(x_t | x_0), one can construct a parametric family of inference distributions Q indexed by a vector sigma in R^T_(>=0) that all share the DDPM marginals but differ in their joint structure.[1] The family is defined by
q_sigma(x_(1:T) | x_0) = q_sigma(x_T | x_0) * prod_(t=2)^T q_sigma(x_(t-1) | x_t, x_0),
where q_sigma(x_T | x_0) is the DDPM terminal marginal and each posterior is the Gaussian
q_sigma(x_(t-1) | x_t, x_0) = N( sqrt(alpha_(t-1)) x_0 + sqrt(1 - alpha_(t-1) - sigma_t^2) * (x_t - sqrt(alpha_t) x_0) / sqrt(1 - alpha_t), sigma_t^2 I ).[1]
The mean is chosen so that the induced marginals q_sigma(x_t | x_0) match the DDPM marginals exactly, while the variance sigma_t^2 is a free parameter.[1] When all sigma_t are zero, the conditional q_sigma(x_(t-1) | x_t, x_0) becomes a deterministic linear function of x_t and x_0. The corresponding forward process q_sigma(x_t | x_(t-1), x_0), obtained from Bayes' rule, is non-Markovian because each x_t depends on both x_(t-1) and x_0.[1]
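The marginal-matching property can be verified with a few lines of arithmetic: pushing x_t ~ N(sqrt(alpha_t) x_0, (1 - alpha_t) I) through the posterior above should give variance 1 - alpha_(t-1) for every admissible sigma_t. The sketch below works on scalars; the numeric values of alpha_t and alpha_(t-1) and the function names are illustrative assumptions.

```python
import math

def posterior_coeffs(alpha_t, alpha_prev, sigma_t):
    """Write the posterior mean as a * x_0 + b * x_t and return (a, b)."""
    b = math.sqrt(1.0 - alpha_prev - sigma_t ** 2) / math.sqrt(1.0 - alpha_t)
    a = math.sqrt(alpha_prev) - b * math.sqrt(alpha_t)
    return a, b

def induced_marginal(alpha_t, alpha_prev, sigma_t):
    """Marginal of x_(t-1) given x_0 when x_t follows the DDPM marginal.

    Returns (coefficient on x_0, variance).
    """
    a, b = posterior_coeffs(alpha_t, alpha_prev, sigma_t)
    mean_coef = a + b * math.sqrt(alpha_t)
    variance = b ** 2 * (1.0 - alpha_t) + sigma_t ** 2
    return mean_coef, variance

alpha_t, alpha_prev = 0.30, 0.55            # illustrative values, alpha_t < alpha_prev
results = {s: induced_marginal(alpha_t, alpha_prev, s) for s in (0.0, 0.2, 0.5)}
for sigma_t, (m, v) in results.items():
    print(sigma_t, m, v)                    # m = sqrt(0.55), v = 0.45 in every case
```

The sigma_t^2 added by the posterior is exactly offset by the sigma_t^2 subtracted under the square root, which is why the marginal does not depend on sigma_t.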
Theorem 1 of the paper proves that for every sigma > 0 the resulting variational training objective J_sigma is equal, up to a constant, to a reweighting L_gamma of the original DDPM loss. Consequently, any model trained with the DDPM "simple" loss (L_1, gamma = 1) is, simultaneously, a valid model for every member of the non-Markovian family. No retraining is required to switch sampling regimes.[1]
The DDIM sampling update
Given a noisy latent x_t and the noise prediction epsilon_theta^(t)(x_t), the DDIM update rule (Equation 12 of the paper) is
x_(t-1) = sqrt(alpha_(t-1)) * ((x_t - sqrt(1 - alpha_t) * epsilon_theta^(t)(x_t)) / sqrt(alpha_t)) + sqrt(1 - alpha_(t-1) - sigma_t^2) * epsilon_theta^(t)(x_t) + sigma_t * epsilon_t,
where epsilon_t ~ N(0, I) is independent of x_t and by convention alpha_0 = 1.[1] The three terms have intuitive labels in the paper: a "predicted x_0" term, a "direction pointing to x_t," and a "random noise" term.[1]
The standard deviation sigma_t is parameterized for experiments by a scalar eta in [0, infinity):
sigma_(tau_i)(eta) = eta * sqrt((1 - alpha_(tau_(i-1))) / (1 - alpha_(tau_i))) * sqrt(1 - alpha_(tau_i) / alpha_(tau_(i-1))).[1]
Two special cases are highlighted in the paper. When eta = 1, sigma_t equals the posterior standard deviation of the original DDPM reverse process and the update reduces to the DDPM sampler.[1] When eta = 0, the random-noise term vanishes and x_(t-1) is a deterministic function of x_t alone: this is DDIM proper, an implicit probabilistic model that maps a fixed x_T to a unique sample.[1] Intermediate values of eta interpolate between these two regimes. In the Hugging Face diffusers library, this is exposed as the eta argument of the DDIMScheduler.step method, with a default of 0.0.[3]
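A single update can be sketched on scalars as follows. This is an illustrative reading of Equations 12 and 16, not the reference implementation; `ddim_sigma` and `ddim_step` are hypothetical names, and the noise prediction is passed in as a number rather than computed by a network.

```python
import math
import random

def ddim_sigma(alpha_t, alpha_prev, eta):
    """Standard deviation sigma_t as a function of eta (Equation 16)."""
    return (eta
            * math.sqrt((1.0 - alpha_prev) / (1.0 - alpha_t))
            * math.sqrt(1.0 - alpha_t / alpha_prev))

def ddim_step(x_t, eps_pred, alpha_t, alpha_prev, eta=0.0, rng=random):
    """One DDIM update x_t -> x_(t-1) given the predicted noise (Equation 12)."""
    sigma_t = ddim_sigma(alpha_t, alpha_prev, eta)
    # "predicted x_0" term
    pred_x0 = (x_t - math.sqrt(1.0 - alpha_t) * eps_pred) / math.sqrt(alpha_t)
    # "direction pointing to x_t" term
    direction = math.sqrt(1.0 - alpha_prev - sigma_t ** 2) * eps_pred
    # "random noise" term, absent when eta = 0
    noise = sigma_t * rng.gauss(0.0, 1.0) if sigma_t > 0.0 else 0.0
    return math.sqrt(alpha_prev) * pred_x0 + direction + noise

# If the noise prediction is exact, the eta = 0 step lands on the point
# with the same x_0 and the same noise at the lower noise level.
x0, eps = 0.8, -0.3
alpha_t, alpha_prev = 0.30, 0.55
x_t = math.sqrt(alpha_t) * x0 + math.sqrt(1.0 - alpha_t) * eps
x_prev = ddim_step(x_t, eps, alpha_t, alpha_prev, eta=0.0)
expected = math.sqrt(alpha_prev) * x0 + math.sqrt(1.0 - alpha_prev) * eps
print(x_prev, expected)
```

With eta = 1, `ddim_sigma` returns the square root of the DDPM posterior variance (1 - alpha_(t-1)) / (1 - alpha_t) * (1 - alpha_t / alpha_(t-1)).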
Accelerated sampling trajectories
Because the training objective is independent of the specific forward process (only the marginals matter), DDIM allows the user to choose any increasing sub-sequence tau = [tau_1, ..., tau_S] of [1, ..., T] as the sampling schedule and apply the DDIM update only on the timesteps in tau.[1] The marginals q(x_(tau_i) | x_0) remain Gaussian with the closed form fixed at training time, so the model continues to denoise correctly without modification. The total cost of sampling becomes proportional to S rather than T.
The paper considers two simple sub-sampling schedules: a "linear" schedule, in which tau_i is roughly proportional to i, and a "quadratic" schedule, in which tau_i is proportional to i^2.[1] The quadratic schedule is found to give slightly better FID for CIFAR-10 in low-step regimes, while the linear schedule is preferred for CelebA. Subsequent work (notably the 2023 paper "Common Diffusion Noise Schedules and Sample Steps are Flawed") proposes "trailing" and "linspace" alternatives in the diffusers DDIMScheduler, which materially improve quality at very low step counts (S = 5) when combined with v-prediction training and zero-terminal-SNR noise schedules.[8][3]
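The two sub-sampling rules can be sketched with integer arithmetic. The constant of proportionality below is an assumption, chosen so that the last entry equals T; the paper only requires the last timestep to be close to T, and the reference implementation may pick the constant differently. The function names are hypothetical.

```python
def linear_tau(T, S):
    """Sub-sequence with tau_i proportional to i, ending at T."""
    return sorted({max(1, (T * i) // S) for i in range(1, S + 1)})

def quadratic_tau(T, S):
    """Sub-sequence with tau_i proportional to i^2, ending at T.

    Early timesteps are sampled more densely; duplicates produced by
    flooring are dropped, so the result can be shorter than S.
    """
    return sorted({max(1, (T * i * i) // (S * S)) for i in range(1, S + 1)})

print(linear_tau(1000, 10))     # [100, 200, ..., 1000]
print(quadratic_tau(1000, 10))  # [10, 40, 90, 160, 250, 360, 490, 640, 810, 1000]
```

Sampling then visits the chosen timesteps in decreasing order, applying one DDIM update per consecutive pair.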
Relation to neural ODEs and the probability-flow ODE
The DDIM update can be rewritten in a form that exposes its structure as an Euler discretization of an ordinary differential equation. Equation 13 of the paper rearranges the iterate as
x_(t - dt) / sqrt(alpha_(t - dt)) = x_t / sqrt(alpha_t) + ( sqrt((1 - alpha_(t - dt)) / alpha_(t - dt)) - sqrt((1 - alpha_t) / alpha_t) ) * epsilon_theta^(t)(x_t).[1]
Reparameterizing with sigma(t) = sqrt(1 - alpha(t)) / sqrt(alpha(t)) and x_bar(t) = x(t) / sqrt(alpha(t)) yields the limiting ODE
d x_bar(t) = epsilon_theta^(t)( x_bar(t) / sqrt(sigma^2 + 1) ) d sigma(t).[1]
Proposition 1 of the paper proves that with the optimal noise-prediction network, this ODE is equivalent to a special case of the "probability flow ODE" derived concurrently by Yang Song, Sohl-Dickstein, Kingma, Kumar, Ermon, and Poole in "Score-Based Generative Modeling through Stochastic Differential Equations," corresponding to the variance-exploding diffusion SDE.[1] The two methods take Euler steps with respect to different parameterizations (DDIM with respect to dsigma(t), the probability-flow Euler method with respect to dt), which gives different update equations in the discrete-step regime; in the continuous limit they coincide.[1]
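That the rearranged iterate is the same update as the eta = 0 sampling rule can be confirmed numerically. The sketch below evaluates both forms on scalars; the function names and test values are illustrative assumptions.

```python
import math

def ddim_step_eq12(x_t, eps, alpha_t, alpha_prev):
    """Deterministic DDIM update in its original form (sigma_t = 0)."""
    pred_x0 = (x_t - math.sqrt(1.0 - alpha_t) * eps) / math.sqrt(alpha_t)
    return math.sqrt(alpha_prev) * pred_x0 + math.sqrt(1.0 - alpha_prev) * eps

def ddim_step_eq13(x_t, eps, alpha_t, alpha_prev):
    """Same update as an Euler step in x_bar = x / sqrt(alpha) over the noise level."""
    noise_level_t = math.sqrt((1.0 - alpha_t) / alpha_t)
    noise_level_prev = math.sqrt((1.0 - alpha_prev) / alpha_prev)
    x_bar_prev = x_t / math.sqrt(alpha_t) + (noise_level_prev - noise_level_t) * eps
    return math.sqrt(alpha_prev) * x_bar_prev   # map back from x_bar to x

cases = [(0.4, -1.2, 0.30, 0.55), (-2.0, 0.7, 0.05, 0.10), (1.5, 0.0, 0.90, 1.00)]
for x_t, eps, a_t, a_prev in cases:
    print(ddim_step_eq12(x_t, eps, a_t, a_prev), ddim_step_eq13(x_t, eps, a_t, a_prev))
```

The step size of the Euler form is the change in noise level between the two timesteps, which is why the natural integration variable is that noise level rather than t.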
The ODE viewpoint has two practical consequences. First, deterministic DDIM is, formally, a first-order ODE solver, which means that higher-order solvers can in principle produce better samples for the same number of network evaluations: this is the route taken by subsequent samplers such as DPM-Solver, PNDM, Heun's method, and the Karras EDM family.[2][4][9] Second, the ODE is invertible: by running the DDIM iterate "in reverse" (the encoding direction), one obtains a deterministic map from a real image x_0 to a latent code x_T. Iterating the encoding and decoding directions reconstructs the image, an operation called "DDIM inversion" that became a standard primitive for image editing.[1][10]
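The encode-then-decode round trip can be illustrated on a toy problem where the optimal noise prediction is known in closed form. The sketch below assumes scalar data x_0 ~ N(0, data_std^2), for which E[epsilon | x_t] = sqrt(1 - alpha_t) x_t / (alpha_t data_std^2 + 1 - alpha_t), and a linear beta schedule; both are illustrative assumptions, as are all function names. No neural network is involved.

```python
import math

def make_alphas(T=1000, beta_start=1e-4, beta_end=0.02):
    """[alpha_0, ..., alpha_T] for an assumed linear beta schedule."""
    alphas = [1.0]
    for t in range(1, T + 1):
        beta_t = beta_start + (beta_end - beta_start) * (t - 1) / (T - 1)
        alphas.append(alphas[-1] * (1.0 - beta_t))
    return alphas

def gaussian_eps(x, alpha, data_std):
    """Optimal noise prediction when x_0 ~ N(0, data_std^2) (toy assumption)."""
    return math.sqrt(1.0 - alpha) * x / (alpha * data_std ** 2 + 1.0 - alpha)

def ddim_move(x, alpha_from, alpha_to, data_std):
    """Deterministic (eta = 0) DDIM update between two noise levels.

    The same formula is used for decoding (towards alpha = 1) and for
    encoding (away from it); the noise is always predicted at the
    current point, which is what makes inversion only approximate.
    """
    eps = gaussian_eps(x, alpha_from, data_std)
    pred_x0 = (x - math.sqrt(1.0 - alpha_from) * eps) / math.sqrt(alpha_from)
    return math.sqrt(alpha_to) * pred_x0 + math.sqrt(1.0 - alpha_to) * eps

def round_trip_error(x0, S, T=1000, data_std=1.0):
    """Encode x_0 -> x_T and decode back using S steps each way."""
    alphas = make_alphas(T)
    tau = [0] + [(T * i) // S for i in range(1, S + 1)]
    x = x0
    for a, b in zip(tau[:-1], tau[1:]):        # encoding direction
        x = ddim_move(x, alphas[a], alphas[b], data_std)
    rev = tau[::-1]
    for a, b in zip(rev[:-1], rev[1:]):        # decoding direction
        x = ddim_move(x, alphas[a], alphas[b], data_std)
    return abs(x - x0)

errors = {S: round_trip_error(0.7, S) for S in (10, 100, 1000)}
for S, err in errors.items():
    print(S, err)   # the reconstruction error shrinks as S grows
```

In this toy setting the residual error comes entirely from the first-order discretization, which is the effect that reconstruction experiments with real networks also measure.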
Experimental results
The DDIM paper benchmarks the eta parameter and the trajectory length S on two unconditional image-generation datasets, CIFAR-10 (32 x 32) and CelebA (64 x 64), using exactly the same network weights trained with the DDPM L_1 loss.[1] Frechet Inception Distance (FID) is reported for S in {10, 20, 50, 100, 1000} and eta in {0.0, 0.2, 0.5, 1.0}, with an additional "sigma-hat" row corresponding to the implementation used by Ho et al. (2020) for CIFAR-10 samples.[1]
| S | DDIM (eta=0) CIFAR-10 | DDIM (eta=0) CelebA | DDPM (eta=1) CIFAR-10 | DDPM (eta=1) CelebA |
|---|---|---|---|---|
| 10 | 13.36 | 17.33 | 41.07 | 33.12 |
| 20 | 6.84 | 13.73 | 18.36 | 26.03 |
| 50 | 4.67 | 9.17 | 8.01 | 18.48 |
| 100 | 4.16 | 6.53 | 5.78 | 13.93 |
| 1000 | 4.04 | 3.51 | 4.73 | 5.98 |
(FID scores reproduced from Table 1 of Song, Meng, and Ermon, 2020.[1])
The headline finding is that deterministic DDIM achieves dramatically better FID than DDPM at low step counts. On CIFAR-10 with S = 10, DDIM scores 13.36 while DDPM scores 41.07; on CelebA with S = 20, DDIM scores 13.73 while DDPM scores 26.03. As S grows the gap narrows: at S = 1000 the two differ by less than 1 FID unit on CIFAR-10 (4.04 versus 4.73) and by about 2.5 on CelebA (3.51 versus 5.98), and the DDPM "sigma-hat" parameterization is marginally better than DDIM on CIFAR-10 (3.17 versus 4.04).[1] In wall-clock terms, the paper quotes a 10x to 50x speedup, defined operationally as the ratio of steps required by DDPM to those required by DDIM to reach comparable FID.[1]
A second experiment in Section 5.2 establishes that the deterministic DDIM map preserves "high-level features" across sampling-trajectory lengths: starting from the same x_T and varying S between 10 and 1000, the generated images share the same coarse semantic structure, with only fine details differing.[1] This consistency is a direct consequence of determinism: in DDPM the random-noise injections at each step erase any informative structure that x_T might encode about x_0, so identical x_T do not yield identical outputs.
A third experiment (Section 5.3) exploits the same consistency to perform image interpolation in the latent space x_T. Linear or spherical interpolations between two latent codes x_T^(1) and x_T^(2) produce smooth, semantically meaningful image interpolations on CelebA, a behavior previously associated with implicit generative models such as GANs and not with diffusion models.[1] A fourth experiment (Table 2 of the paper) measures reconstruction error after encoding and decoding through DDIM at various step counts; the mean squared error on CIFAR-10 falls from 0.014 at S = 10 to 0.0009 at S = 100, confirming that the deterministic map is approximately invertible in practice.[1]
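Spherical interpolation between two latent codes can be sketched as follows. This is a generic slerp on plain Python lists, with a fallback to linear interpolation when the two vectors are nearly parallel or antiparallel; the function name and the fallback threshold are assumptions.

```python
import math

def slerp(z0, z1, a):
    """Spherical linear interpolation between latent vectors z0 and z1, a in [0, 1]."""
    dot = sum(p * q for p, q in zip(z0, z1))
    n0 = math.sqrt(sum(p * p for p in z0))
    n1 = math.sqrt(sum(q * q for q in z1))
    cos_theta = max(-1.0, min(1.0, dot / (n0 * n1)))
    theta = math.acos(cos_theta)
    if math.sin(theta) < 1e-6:
        # Angle too close to 0 or pi for a stable division: use a straight line.
        return [(1.0 - a) * p + a * q for p, q in zip(z0, z1)]
    w0 = math.sin((1.0 - a) * theta) / math.sin(theta)
    w1 = math.sin(a * theta) / math.sin(theta)
    return [w0 * p + w1 * q for p, q in zip(z0, z1)]

z0, z1 = [1.0, 0.0], [0.0, 1.0]
mid = slerp(z0, z1, 0.5)
print(mid, math.hypot(*mid))   # the midpoint keeps unit norm
```

Keeping the norm close to that of the endpoints matters because high-dimensional Gaussian latents concentrate near a sphere; a straight-line midpoint of two such latents has a noticeably smaller norm than either endpoint.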
Adoption in downstream systems
Stable Diffusion and the diffusers library
When CompVis released the original Stable Diffusion 1.x family of latent diffusion checkpoints in 2022, the inference pipeline shipped DDIM as the default sampler, alongside the higher-order PLMS/PNDM sampler that became the default in the Hugging Face diffusers StableDiffusionPipeline.[11][3] Many of the early Stable Diffusion tutorials and grid comparisons on community sites (AUTOMATIC1111's stable-diffusion-webui, ComfyUI, InvokeAI) treat DDIM as the reference sampler against which newer samplers are benchmarked.[2]
The diffusers DDIMScheduler class exposes the full eta parameter (defaulting to 0.0 for deterministic sampling), supports v-prediction and epsilon-prediction parameterizations, and provides "leading," "trailing," and "linspace" timestep spacings.[3] The default beta schedule for Stable Diffusion is the "scaled_linear" schedule with beta_start = 0.00085 and beta_end = 0.012, which is the schedule under which Stable Diffusion 1.4 and 1.5 were trained.[3] The companion class DDIMInverseScheduler provides the encoding direction used by DDIM-inversion editing pipelines such as Prompt-to-Prompt and Null-Text Inversion.[10][12]
The 2023 paper "Common Diffusion Noise Schedules and Sample Steps are Flawed" by Lin, Liu, Li, and Yang documented several subtle implementation flaws in the default DDIM configuration of Stable Diffusion 1.x: the noise schedule does not reach zero terminal signal-to-noise ratio, and the sampler does not start from the final timestep, biasing generations toward medium-brightness outputs.[8] The paper proposes four fixes (rescaling the betas to zero terminal SNR, v-prediction training, trailing timestep spacing, and guidance rescaling). In diffusers the first three correspond to the DDIMScheduler options rescale_betas_zero_snr, prediction_type, and timestep_spacing, while guidance rescaling is applied at the pipeline level rather than in the scheduler.[3][8]
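The terminal-SNR issue can be reproduced with a short calculation. The sketch below builds the "scaled_linear" schedule with the Stable Diffusion values quoted above, checks that the final signal coefficient is not zero, and then applies the shift-and-scale rescaling described by Lin et al.; it is written in plain Python for illustration, and the function names are assumptions rather than library API.

```python
import math

def scaled_linear_betas(T=1000, beta_start=0.00085, beta_end=0.012):
    """Betas that are linear in sqrt(beta), as in the 'scaled_linear' schedule."""
    lo, hi = math.sqrt(beta_start), math.sqrt(beta_end)
    return [(lo + (hi - lo) * i / (T - 1)) ** 2 for i in range(T)]

def sqrt_alpha_bar(betas):
    """Square roots of the cumulative products of (1 - beta)."""
    out, prod = [], 1.0
    for b in betas:
        prod *= 1.0 - b
        out.append(math.sqrt(prod))
    return out

def rescale_zero_terminal_snr(betas):
    """Shift and scale sqrt(alpha_bar) so the last value is 0 and the first is unchanged."""
    s = sqrt_alpha_bar(betas)
    first, last = s[0], s[-1]
    s = [(v - last) * first / (first - last) for v in s]
    alpha_bar = [v * v for v in s]
    alphas = [alpha_bar[0]] + [alpha_bar[i] / alpha_bar[i - 1]
                               for i in range(1, len(alpha_bar))]
    return [1.0 - a for a in alphas]

betas = scaled_linear_betas()
terminal = sqrt_alpha_bar(betas)[-1]
fixed = rescale_zero_terminal_snr(betas)
print(terminal)                     # roughly 0.068: some signal survives at t = T
print(sqrt_alpha_bar(fixed)[-1])    # 0.0 after rescaling
```

A nonzero terminal coefficient means the model never sees pure noise during training, while sampling starts from pure noise; the rescaling removes that mismatch at the cost of a final beta equal to 1.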
Practical configuration in diffusers
The reference diffusers implementation exposes the full set of DDIM hyperparameters relevant to modern usage. Key configuration arguments include num_train_timesteps (default 1000, matching DDPM training), beta_start and beta_end (the boundary values of the noise schedule), beta_schedule (one of "linear", "scaled_linear", or "squaredcos_cap_v2", with "scaled_linear" being the Stable Diffusion default), clip_sample (whether to clip the predicted x_0 to [-1, 1] for pixel-space models), set_alpha_to_one (whether the "previous" alpha product used at the final denoising step is fixed at 1, matching the alpha_0 = 1 convention, or taken from the first training step), prediction_type (one of "epsilon", "sample", or "v_prediction"), timestep_spacing (one of "leading", "trailing", or "linspace"), and rescale_betas_zero_snr (the 2023 fix to enforce zero terminal SNR).[3]
At inference, the set_timesteps(num_inference_steps) method discretizes the training-time noise schedule down to the requested number of evaluation points, and the step(model_output, timestep, sample, eta=0.0, ...) method performs one DDIM update on a given latent. The eta parameter on step is the runtime knob exposed by Equation 16 of the original paper: a value of 0.0 produces deterministic DDIM, 1.0 produces DDPM-style stochastic sampling, and intermediate values give a continuous family of stochastic samplers, all sharing the same trained model.[3]
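The effect of the three timestep spacings can be sketched without the library. The functions below mimic the arithmetic of each option for zero-based training timesteps 0 to T - 1; they are an illustrative approximation, not the diffusers source, and the `offset` parameter stands in for the scheduler's step offset.

```python
def leading(T, N, offset=0):
    """Multiples of T // N, counted from 0; never reaches the last training step."""
    ratio = T // N
    return [i * ratio + offset for i in range(N)][::-1]

def trailing(T, N):
    """Evenly spaced steps counted back from the last training step."""
    ratio = T / N
    return [round(T - i * ratio) - 1 for i in range(N)]

def linspace(T, N):
    """N evenly spaced steps including both 0 and T - 1 (assumes N > 1)."""
    return [round(i * (T - 1) / (N - 1)) for i in range(N)][::-1]

print(leading(1000, 10))    # [900, 800, ..., 0]
print(trailing(1000, 10))   # [999, 899, ..., 99]
print(linspace(1000, 10))   # [999, 888, ..., 0]
```

With "leading" spacing the first denoising step is taken at timestep 900 rather than 999, so the model is asked to treat pure noise as if it were a partially noised image; "trailing" and "linspace" both start at the final training timestep.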
The inverse direction (encoding) is provided by a companion DDIMInverseScheduler class, which iterates the DDIM update in the forward-time direction so that a clean image can be encoded into an x_T latent for editing tasks. The combination of DDIMScheduler and DDIMInverseScheduler is the substrate on which most diffusion-editing pipelines in diffusers are built.[3]
DDIM inversion and image editing
Because deterministic DDIM defines an approximately invertible map between a clean image x_0 and a latent code x_T, it supports a family of editing techniques collectively known as "DDIM inversion." Given a real image, one runs the DDIM iterate in the encoding direction to obtain a latent x_T such that decoding with the same model recovers the original image to high accuracy; one can then modify the conditioning signal (the text prompt, an attention map, or a ControlNet structure signal) and decode to obtain an edited image while preserving structural content.[10][12] The Prompt-to-Prompt (Hertz et al., 2022) and Null-Text Inversion (Mokady et al., CVPR 2023) papers were among the first to use DDIM inversion as the substrate for text-driven editing of real images with Stable Diffusion.[12]
Comparison with later samplers
DDIM is a first-order ODE solver in the variance-exploding parameterization, and as such it is now considered a baseline that is consistently outperformed at very low step counts by higher-order solvers and by improved time discretizations. The most important successors are:
| Sampler | Order | Reference | Typical NFE for SD-quality images |
|---|---|---|---|
| DDIM | 1 | Song, Meng, Ermon (ICLR 2021)[1] | 50 to 100 |
| DPM-Solver | 1-3 (single-step) | Lu et al. (NeurIPS 2022)[4] | 10 to 20 |
| DPM-Solver++ | 1-3 (multistep) | Lu et al. (arXiv 2022)[13] | 10 to 20 |
| Karras Heun | 2 (Heun) | Karras, Aittala, Aila, Laine (NeurIPS 2022)[9] | 35 |
| PNDM | multistep | Liu, Ren, Lin, Zhao (ICLR 2022)[14] | 50 |
DPM-Solver and DPM-Solver++ exploit the semi-linear structure of the probability-flow ODE to handle the linear component analytically; the DPM-Solver paper reports FID 4.70 on CIFAR-10 with only 10 function evaluations and 2.87 with 20 evaluations, a 4x to 16x speedup over previous training-free samplers including DDIM.[4] The Karras EDM family (Karras et al., NeurIPS 2022) re-derives the noise schedule and sampler design space from scratch, reaching state-of-the-art FID on CIFAR-10 with 35 network evaluations per image and reporting further gains on ImageNet 64.[9] By 2023, the diffusers library and most production text-to-image systems had switched their default scheduler away from DDIM and PNDM toward DPM-Solver++, EulerDiscrete, or Karras-style Heun samplers.[2][3]
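The benefit of a higher-order solver can be seen on a toy problem with a known solution. The sketch below assumes scalar data x_0 ~ N(0, data_std^2), for which the ODE in the rescaled variable x_bar and noise level s reads d x_bar / d s = s x_bar / (data_std^2 + s^2) and has the closed-form solution x_bar proportional to sqrt(data_std^2 + s^2). It compares a first-order Euler solver (the DDIM case) with a generic second-order Heun solver at the same number of slope evaluations. The uniform grid, the starting noise level of 3, and all names are illustrative assumptions; this is not the EDM or DPM-Solver algorithm.

```python
import math

def slope(x_bar, s, data_std):
    """Right-hand side of the toy ODE, using the exact noise prediction."""
    return s * x_bar / (data_std ** 2 + s ** 2)

def exact(x_start, s_start, s_end, data_std):
    """Closed-form solution of the toy ODE."""
    return x_start * math.sqrt((data_std ** 2 + s_end ** 2) / (data_std ** 2 + s_start ** 2))

def euler(x, grid, data_std):
    """First-order solver: one slope evaluation per step."""
    for s_cur, s_next in zip(grid[:-1], grid[1:]):
        x = x + (s_next - s_cur) * slope(x, s_cur, data_std)
    return x

def heun(x, grid, data_std):
    """Second-order solver: two slope evaluations per step."""
    for s_cur, s_next in zip(grid[:-1], grid[1:]):
        d1 = slope(x, s_cur, data_std)
        x_pred = x + (s_next - s_cur) * d1          # Euler predictor
        d2 = slope(x_pred, s_next, data_std)
        x = x + (s_next - s_cur) * 0.5 * (d1 + d2)  # trapezoidal corrector
    return x

def uniform_grid(s_max, n):
    """Noise levels from s_max down to 0 in n equal steps."""
    return [s_max * (1.0 - i / n) for i in range(n + 1)]

x_start, s_max, data_std = 2.0, 3.0, 1.0
truth = exact(x_start, s_max, 0.0, data_std)
euler_err = abs(euler(x_start, uniform_grid(s_max, 40), data_std) - truth)  # 40 evaluations
heun_err = abs(heun(x_start, uniform_grid(s_max, 20), data_std) - truth)    # 40 evaluations
print(euler_err, heun_err)   # Heun is more accurate at equal cost
```

Real samplers differ in how they space the noise levels and in how they exploit the structure of the ODE, but the trade-off between solver order and number of network evaluations is the same.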
Significance
DDIM was an early demonstration that the slow sampling of diffusion models was not intrinsic to the framework but a consequence of the specific reverse process. By separating the choice of forward process (which determines training) from the choice of generative chain (which determines sampling), it opened the door to a now-large literature on training-free fast samplers for diffusion models, including DPM-Solver, PNDM, EDM-style Heun samplers, k-LMS, UniPC, and many others.[4][9][14][2] The connection between deterministic DDIM and the probability-flow ODE made the bridge between score-based diffusion and neural ODE literature explicit, framing diffusion sampling as a numerical ODE-solving problem and motivating high-order solver design.[1][4][9]
DDIM was also the first widely deployed sampler that produced a meaningful "latent code" for diffusion models. The fact that a clean image x_0 could be encoded as a latent x_T, manipulated, and decoded, with high fidelity, made diffusion models usable for the same kinds of attribute manipulation, interpolation, and image-to-image editing that had previously been associated with GANs and variational autoencoders.[1][12] DDIM inversion became the standard tool for editing real photographs with Stable Diffusion before being supplanted, and complemented, by null-text inversion, edit-friendly DDPM inversion, and inversion-free editing methods.[12]
A third, methodological contribution of the paper is its framing of training and inference as separable design choices for an entire family of diffusion-like models. The DDIM proof that all members of the non-Markovian family share a surrogate objective with DDPM (Theorem 1) showed that the simple noise-prediction loss of Ho et al. (2020) is, in a precise sense, "universal" across many different generative procedures. This decoupling has been repeatedly used in subsequent work: progressive distillation (Salimans and Ho, 2022), consistency models (Song et al., 2023), and rectified flow (Liu et al., 2022) all build on the observation that a single pretrained noise-prediction network supports many sampling algorithms.[1][4][9] The DDIM training-free aspect, namely that no fine-tuning or auxiliary training is required to switch sampling schemes, has been a defining feature of the diffusion-sampler literature ever since.
The paper has accumulated tens of thousands of citations on Google Scholar and Semantic Scholar by 2026 and is one of the most heavily cited diffusion-sampling papers of all time. It is consistently listed alongside the original DDPM paper and the score-based SDE paper of Yang Song et al. as one of the three foundational works of the modern diffusion era.[1][5][6]
Use in other generative tasks
Beyond image generation, the DDIM sampler has been adopted for a wide range of diffusion-based modalities and tasks. Audio diffusion models built on top of DDPM-style training, including those used by various text-to-audio systems, frequently expose a DDIM scheduler as a default fast sampler. Video diffusion models, which inherit large per-step compute costs because the U-Net must process a full sequence of frames, were among the early beneficiaries of DDIM-style acceleration, since cutting the step count by 10 to 50 times directly cuts video-synthesis cost in proportion. Imitation learning and robotic policy diffusion, in which a diffusion model is trained to denoise action trajectories rather than pixels, similarly use DDIM as a low-step sampler in real-time control loops.[3]
DDIM-based encoding is used as a primitive for editing tasks even when the editing model itself is more sophisticated than a vanilla diffusion model: for example, ControlNet-conditioned editing pipelines often invert a source image through DDIM, modify the structural conditioning (a Canny edge map, a depth map, a pose skeleton, etc.), and decode through the same DDIM trajectory to produce an edited image whose layout and identity are preserved.[3][10] The same primitive supports null-text inversion, prompt tuning, plug-and-play diffusion features, and a long list of similar real-image editing methods.[12]
Limitations
Several limitations of DDIM, some discussed in the paper itself and others identified by later work, are now well understood.
The deterministic DDIM map is only approximately invertible. As Table 2 of the paper shows, reconstruction error on CIFAR-10 with S = 10 is 0.014 per pixel, falling to 0.0009 at S = 100; the residual error reflects the first-order discretization of the underlying ODE.[1] For real-image editing tasks this discretization error compounds with classifier-free guidance error, and additional tricks (null-text optimization, prompt-tuning inversion) are often required to obtain acceptable reconstructions.[12]
DDIM, like all first-order ODE solvers, requires more sampling steps than higher-order solvers to reach the same FID. At S in {10, 20} on CIFAR-10, DPM-Solver reaches FID 4.70 with 10 evaluations and 2.87 with 20 evaluations, while DDIM reaches 13.36 and 6.84 respectively.[1][4] For Stable Diffusion 1.x at typical 25-50 step budgets, DDIM and DPM-Solver++ produce comparable quality, but at very low step counts (S = 5 to 10) DPM-Solver++ and EulerDiscrete samplers are noticeably better.[2][4]
The default DDIM configuration in Stable Diffusion 1.x had two implementation-level issues identified in 2023: the noise schedule did not enforce zero terminal signal-to-noise ratio, and the inference timesteps did not begin at the final training timestep, both of which biased generations toward medium-brightness outputs.[8] These issues are not intrinsic to DDIM itself (they apply equally to most schedulers in the affected pipelines), and they can be addressed with the rescale_betas_zero_snr and timestep_spacing="trailing" options of the DDIMScheduler.[3][8]
The stochastic DDPM end of the eta family (eta = 1) has a worse FID than deterministic DDIM at every step count reported in Table 1 of the paper, with the largest gaps below roughly 100 steps, and the "sigma-hat" parameterization used by Ho et al. (2020) on CIFAR-10 is dramatically worse at low step counts, with FID 367.43 at S = 10.[1] This is consistent with the broader observation that SDE-style stochastic samplers need many more steps than their ODE-style deterministic counterparts to reach the same quality, and the advantage of the sigma-hat sampler at the full 1000 steps (FID 3.17 versus 4.04 on CIFAR-10) does not survive aggressive step-count reduction.[1]
Related work
DDIM stands in a small family of fast diffusion samplers and is most usefully understood relative to its neighbors. The original DDPM (Ho, Jain, Abbeel, 2020) is the Markovian baseline against which DDIM was designed.[5] The concurrent score-based SDE framework of Yang Song et al. (2020) provides the probability-flow ODE viewpoint, of which deterministic DDIM is a special case.[1] PNDM (Liu et al., ICLR 2022) extends DDIM by using a pseudo linear multi-step method for the underlying ODE.[14] DPM-Solver (Lu et al., NeurIPS 2022) and DPM-Solver++ (Lu et al., 2022) provide dedicated semi-linear ODE solvers achieving high quality in roughly 10 to 20 steps.[4][13] The EDM design space of Karras, Aittala, Aila, and Laine (NeurIPS 2022) re-parameterizes the noise schedule and uses a second-order Heun method to reach state-of-the-art FID on CIFAR-10 at 35 network evaluations per image.[9] The DDIM-inversion line of work, including Prompt-to-Prompt and Null-Text Inversion, treats deterministic DDIM as a primitive for real-image editing on top of Stable Diffusion.[10][12]
In application, DDIM was used as the original sampler for the public Stable Diffusion 1.x family from CompVis (Rombach, Blattmann, Lorenz, Esser, Ommer at LMU Munich) and remains a baseline scheduler in the Hugging Face diffusers library.[11][3] It has been cited by virtually every subsequent diffusion-sampler paper as the first-order baseline.
References
1. Jiaming Song, Chenlin Meng, Stefano Ermon, "Denoising Diffusion Implicit Models", arXiv preprint, 2020-10-06 (v4 dated 2022-10-05). https://arxiv.org/abs/2010.02502. Accessed 2026-05-20.
2. Andrew Zhu et al., "Stable Diffusion Samplers: A Comprehensive Guide", Stable Diffusion Art, 2024-09-01. https://stable-diffusion-art.com/samplers/. Accessed 2026-05-20.
3. Hugging Face, "DDIMScheduler", diffusers documentation v0.38.0, 2025-04-01. https://huggingface.co/docs/diffusers/api/schedulers/ddim. Accessed 2026-05-20.
4. Cheng Lu, Yuhao Zhou, Fan Bao, Jianfei Chen, Chongxuan Li, Jun Zhu, "DPM-Solver: A Fast ODE Solver for Diffusion Probabilistic Model Sampling in Around 10 Steps", arXiv preprint, 2022-06-02. https://arxiv.org/abs/2206.00927. Accessed 2026-05-20.
5. Jonathan Ho, Ajay Jain, Pieter Abbeel, "Denoising Diffusion Probabilistic Models", arXiv preprint, 2020-06-19. https://arxiv.org/abs/2006.11239. Accessed 2026-05-20.
6. OpenReview, "Denoising Diffusion Implicit Models (ICLR 2021 poster)", OpenReview.net, 2021-01-12. https://openreview.net/forum?id=St1giarCHLP. Accessed 2026-05-20.
7. Jiaming Song, Chenlin Meng, Stefano Ermon, "ermongroup/ddim: Denoising Diffusion Implicit Models", GitHub repository, 2020-10-06. https://github.com/ermongroup/ddim. Accessed 2026-05-20.
8. Shanchuan Lin, Bingchen Liu, Jiashi Li, Xiao Yang, "Common Diffusion Noise Schedules and Sample Steps are Flawed", arXiv preprint, 2023-05-15. https://arxiv.org/abs/2305.08891. Accessed 2026-05-20.
9. Tero Karras, Miika Aittala, Timo Aila, Samuli Laine, "Elucidating the Design Space of Diffusion-Based Generative Models", arXiv preprint, 2022-06-01. https://arxiv.org/abs/2206.00364. Accessed 2026-05-20.
10. Hugging Face, "DDIM Inversion (Unit 4)", Diffusion Models Course, 2024-01-01. https://huggingface.co/learn/diffusion-course/en/unit4/2. Accessed 2026-05-20.
11. CompVis, "Stable Diffusion v1-4 Model Card", Hugging Face, 2022-08-22. https://huggingface.co/CompVis/stable-diffusion-v1-4. Accessed 2026-05-20.
12. Ron Mokady, Amir Hertz, Kfir Aberman, Yael Pritch, Daniel Cohen-Or, "Null-text Inversion for Editing Real Images using Guided Diffusion Models", CVPR 2023 proceedings, 2023-06-01. https://openaccess.thecvf.com/content/CVPR2023/papers/Mokady_NULL-Text_Inversion_for_Editing_Real_Images_Using_Guided_Diffusion_Models_CVPR_2023_paper.pdf. Accessed 2026-05-20.
13. Cheng Lu, Yuhao Zhou, Fan Bao, Jianfei Chen, Chongxuan Li, Jun Zhu, "DPM-Solver++: Fast Solver for Guided Sampling of Diffusion Probabilistic Models", arXiv preprint, 2022-11-02. https://arxiv.org/abs/2211.01095. Accessed 2026-05-20.
14. Luping Liu, Yi Ren, Zhijie Lin, Zhou Zhao, "Pseudo Numerical Methods for Diffusion Models on Manifolds", arXiv preprint, 2022-02-20. https://arxiv.org/abs/2202.09778. Accessed 2026-05-20.