EDM (Elucidating Diffusion Models)
Last reviewed
May 20, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v1 · 4,228 words
EDM is the common shorthand for the paper "Elucidating the Design Space of Diffusion-Based Generative Models" by Tero Karras, Miika Aittala, Timo Aila, and Samuli Laine of NVIDIA, presented at NeurIPS 2022.[^1][^2] The paper recasts the previously fragmented literature on score-based and denoising diffusion models into a single design space and shows that, with the right preconditioning, noise schedule, training-loss weighting, and sampler, existing diffusion models can reach state-of-the-art image-generation quality with a small number of network evaluations.[^1][^2] EDM also introduced the "Karras sigma schedule" (parameterized by sigma_min, sigma_max, and a power rho) and a deterministic sampler based on Heun's 2nd-order method for ordinary differential equations (ODEs), both of which have become de facto defaults in the open-source image diffusion stack, including ComfyUI, the Hugging Face Diffusers library, k-diffusion, and AUTOMATIC1111's Stable Diffusion WebUI.[^3][^4][^5] A follow-up paper at CVPR 2024, commonly called EDM2, kept the EDM formulation but added magnitude-preserving network layers and a post-hoc EMA scheme, lowering the best reported diffusion-model FID on ImageNet 512x512 to 1.81 with fast deterministic sampling.[^6][^7]
By mid-2022 the diffusion literature already contained three large families of methods that looked superficially different but were known to be closely related. Denoising diffusion probabilistic models (DDPM) trained a neural network to predict noise added to images at a discrete sequence of timesteps; DDIM generalized the same training objective into a deterministic ODE-style sampler. The score-based "Variance Preserving" (VP) and "Variance Exploding" (VE) stochastic differential equation formulations re-derived the same models as continuous-time stochastic processes. Each line of work used its own notation, parameterization of the noise level, network preconditioning, and sampler, which made comparisons difficult and obscured which design choices actually mattered for sample quality.[^1][^8]
Karras et al. argued that the design choices visible in the literature were a confused mix of "what the model is" and "how it is parameterized," and that a clean separation would expose simple recipes capable of beating then-state-of-the-art results.[^1] Their EDM paper presents that separation: every diffusion model is treated as a denoiser D(x; sigma) of a noised input x = y + n where n is Gaussian with standard deviation sigma, and the differences between DDPM, DDIM, VE, VP, etc. are isolated into orthogonal choices of preconditioning, schedule, loss weighting, and sampler.[^1][^8]
The same NVIDIA group had previously produced the StyleGAN line of generative adversarial networks, and EDM was in part a response to the observation that diffusion models had overtaken GANs on standard image-generation benchmarks but were widely seen as slow and finicky to train. EDM aimed to make diffusion models both better and faster without changing the underlying score network architectures.[^1]
The paper's stance on prior work is unusual in that it does not introduce a new model class. Instead, it argues that the existing model classes are equivalent up to a change of variables, and that the right way to make progress is to ablate each component of the recipe independently. The result is a kind of recipe book: a small number of named "configurations" (labelled A through F in the paper, where F is the proposed combination) that can be mixed and matched on any of the supported backbones. This made EDM unusually easy to integrate into other codebases, because adopting EDM does not require rewriting the network; it requires rewriting only the loss, the schedule, the preconditioning, and the sampler.[^1][^8]
| Field | Value |
|---|---|
| Title | Elucidating the Design Space of Diffusion-Based Generative Models |
| Authors | Tero Karras, Miika Aittala, Timo Aila, Samuli Laine |
| Affiliation | NVIDIA |
| First arXiv submission | 2022-06-01 (arXiv:2206.00364) |
| Conference | NeurIPS 2022 |
| Reference implementation | github.com/NVlabs/edm (CC BY-NC-SA 4.0) |
| Headline result | CIFAR-10 class-conditional FID 1.79, ImageNet 64x64 FID 1.36 |
| Default sampler | Heun's 2nd-order deterministic ODE (Algorithm 1) |
| Default schedule | "Karras sigmas": rho-power schedule with sigma_min=0.002, sigma_max=80, rho=7 |
| Sigma_data default | 0.5 |
The paper's headline empirical claim is that, by changing only the design-space choices and keeping the underlying networks the same, it reaches new state-of-the-art Frechet Inception Distance (FID) scores on CIFAR-10 and ImageNet 64x64. On CIFAR-10 this takes roughly 35 network evaluations per image at inference time, far fewer than the hundreds typical of contemporary DDPM-style samplers.[^1][^2]
The central conceptual contribution of EDM is the modular decomposition of a diffusion model into four orthogonal "axes" that can be designed independently.[^1][^8]
EDM views the network output as predicting a denoised image D(x; sigma) rather than predicting the noise directly. To keep network inputs and outputs at roughly unit variance across a very wide range of noise levels (sigma from approximately 0.002 to 80), EDM wraps the raw network F_theta with sigma-dependent scaling functions:[^1][^4]
D(x; sigma) = c_skip(sigma) * x + c_out(sigma) * F_theta( c_in(sigma) * x ; c_noise(sigma) )
with the explicit choices
c_skip(sigma) = sigma_data^2 / (sigma^2 + sigma_data^2)
c_out(sigma) = sigma * sigma_data / sqrt(sigma_data^2 + sigma^2)
c_in(sigma) = 1 / sqrt(sigma_data^2 + sigma^2)
c_noise(sigma) = (1/4) * log(sigma)

The skip connection means at very low noise the network mostly returns the input, and at very high noise it returns the predicted clean image scaled by the data standard deviation. This preconditioning fixes a long-standing instability in score-based training, where the network was forced to produce outputs spanning many orders of magnitude across sigma values.[^1][^4] The Hugging Face Diffusers implementation of EDM exposes these as precondition_inputs, precondition_noise, and precondition_outputs methods on its EDMEulerScheduler, with c_noise(sigma) = 0.25 * log(sigma) matching the EDM paper exactly.[^4]
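The four scaling functions can be sketched in a few lines of NumPy. This is a minimal illustration rather than the reference implementation: the function names and the `zero_net` stand-in for F_theta are hypothetical, and only the formulas themselves come from the text above.

```python
import numpy as np

SIGMA_DATA = 0.5  # EDM default for the data standard deviation


def edm_precond_coeffs(sigma, sigma_data=SIGMA_DATA):
    """Return (c_skip, c_out, c_in, c_noise) for a noise level sigma."""
    sigma = np.asarray(sigma, dtype=np.float64)
    total_var = sigma**2 + sigma_data**2
    c_skip = sigma_data**2 / total_var
    c_out = sigma * sigma_data / np.sqrt(total_var)
    c_in = 1.0 / np.sqrt(total_var)
    c_noise = 0.25 * np.log(sigma)
    return c_skip, c_out, c_in, c_noise


def denoise(raw_net, x, sigma, sigma_data=SIGMA_DATA):
    """D(x; sigma) = c_skip * x + c_out * F(c_in * x; c_noise)."""
    c_skip, c_out, c_in, c_noise = edm_precond_coeffs(sigma, sigma_data)
    return c_skip * x + c_out * raw_net(c_in * x, c_noise)


def zero_net(x_scaled, c_noise):
    # Hypothetical stand-in for F_theta that always outputs zeros,
    # so the result shows the skip path in isolation.
    return np.zeros_like(x_scaled)


x = np.ones(4)
low = denoise(zero_net, x, sigma=0.002)  # skip path dominates: close to x
high = denoise(zero_net, x, sigma=80.0)  # skip path almost switched off
```

With the zero network, `low` stays very close to the input while `high` is nearly zero, which is the behaviour the skip connection is meant to produce at the two ends of the noise range.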
EDM separates the choice of which sigma values to use during sampling from the training distribution of sigmas. For sampling, the paper proposes a closed-form schedule, often called "Karras sigmas," defined for i = 0, ..., N-1 by[^1][^3]
sigma_i = ( sigma_max^(1/rho) + (i / (N-1)) * ( sigma_min^(1/rho) - sigma_max^(1/rho) ) )^rho
with sigma_N = 0. The schedule is monotonically decreasing from sigma_max at i=0 to sigma_min at i=N-1, and the power rho controls how strongly the steps near sigma_min are shortened at the expense of longer steps near sigma_max. Karras et al. recommend rho = 7 (with sigma_min = 0.002, sigma_max = 80). Setting rho = 1 gives steps that are uniform in sigma, and the paper's analysis finds that rho = 3 nearly equalizes the truncation error across steps but that values between 5 and 10 give better sample quality; much higher values like rho = 100, which squeeze the steps toward small sigma, also hurt FID at low step counts.[^1][^3] The intuition is that the change in the image per unit of sigma is large near the end of the trajectory, so the schedule should take many small steps there.[^1]
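The schedule formula translates directly into code. The sketch below assumes NumPy and uses a hypothetical function name, `karras_sigmas`; the defaults are the values quoted in the text.

```python
import numpy as np


def karras_sigmas(n, sigma_min=0.002, sigma_max=80.0, rho=7.0):
    """Return sigma_0 .. sigma_{n-1} from the rho-power formula, plus sigma_n = 0."""
    ramp = np.linspace(0.0, 1.0, n)  # i / (n - 1) for i = 0 .. n-1
    min_inv_rho = sigma_min ** (1.0 / rho)
    max_inv_rho = sigma_max ** (1.0 / rho)
    sigmas = (max_inv_rho + ramp * (min_inv_rho - max_inv_rho)) ** rho
    return np.append(sigmas, 0.0)


sigmas = karras_sigmas(18)
step_sizes = -np.diff(sigmas)  # positive step lengths, largest at high noise
```

Printing `step_sizes` shows the steps shrinking steadily toward the low-noise end, which is the effect rho is meant to control.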
For training, EDM samples the noise level itself from a log-normal distribution ln(sigma) ~ N(P_mean, P_std^2) with default values P_mean = -1.2 and P_std = 1.2. The median training sigma is therefore exp(-1.2), about 0.30, which concentrates training on intermediate noise levels, where the paper finds that the loss can be reduced the most.[^1][^8]
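Drawing training noise levels from this distribution takes one line. The following is a minimal sketch, assuming NumPy; the function name is hypothetical and the seed is only there to make the example repeatable.

```python
import numpy as np


def sample_training_sigmas(batch_size, p_mean=-1.2, p_std=1.2, rng=None):
    """Draw sigma with ln(sigma) ~ N(p_mean, p_std^2)."""
    rng = np.random.default_rng() if rng is None else rng
    return np.exp(p_mean + p_std * rng.standard_normal(batch_size))


train_sigmas = sample_training_sigmas(100_000, rng=np.random.default_rng(0))
median_sigma = float(np.median(train_sigmas))  # close to exp(-1.2)
```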
The training objective in EDM is a denoising score-matching loss of the form
E_{y ~ p_data} E_{sigma ~ p(sigma)} E_{n ~ N(0, sigma^2 I)} [ lambda(sigma) * || D_theta(y + n; sigma) - y ||^2 ]
The weighting lambda(sigma) is chosen to equal 1 / c_out(sigma)^2, so that the effective weight on the raw network's output error, lambda(sigma) * c_out(sigma)^2, is one at every noise level. Combined with the input/output preconditioning, this means the network never has to learn signals across vastly different magnitudes. Substituting c_out gives lambda(sigma) = (sigma^2 + sigma_data^2) / (sigma * sigma_data)^2, which keeps the optimization problem well conditioned.[^1][^4][^8]
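The relationship between the loss weight and the output scaling can be checked numerically. This is a sketch under the formulas given in this article, assuming NumPy; `edm_loss_weight` and `edm_loss` are hypothetical helper names, and the loss sums squared errors over all non-batch dimensions.

```python
import numpy as np


def edm_loss_weight(sigma, sigma_data=0.5):
    """lambda(sigma) = (sigma^2 + sigma_data^2) / (sigma * sigma_data)^2."""
    return (sigma**2 + sigma_data**2) / (sigma * sigma_data) ** 2


def edm_loss(denoised, clean, sigma, sigma_data=0.5):
    """Weighted squared error for a batch; sigma has shape (batch,)."""
    non_batch_axes = tuple(range(1, clean.ndim))
    per_sample = np.sum((denoised - clean) ** 2, axis=non_batch_axes)
    return float(np.mean(edm_loss_weight(sigma, sigma_data) * per_sample))


sigma = np.array([0.002, 0.3, 1.0, 80.0])
c_out = sigma * 0.5 / np.sqrt(sigma**2 + 0.5**2)
effective_weight = edm_loss_weight(sigma) * c_out**2  # equals 1 at every sigma
```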
The fourth axis is the sampler. EDM provides two algorithms in the paper:[^1][^9]
- Algorithm 1, the deterministic sampler, is a Heun 2nd-order ODE integrator. Each step from sigma_i to sigma_{i+1} evaluates the denoiser at sigma_i, takes an Euler step, evaluates the denoiser at the candidate sigma_{i+1}, and averages the two derivatives to form a corrected step. This costs two network evaluations per step (except for the final step, where the correction is skipped), and reaches near-optimal FID at roughly 18 sampler steps (about 35 network evaluations) on CIFAR-10.[^1][^9]
- Algorithm 2, the stochastic sampler, adds fresh noise during sampling and is controlled by four parameters: S_churn, S_min, S_max, and S_noise. Stochastic sampling helps for some datasets (notably ImageNet 64) but for others the deterministic ODE is already at or beyond GAN-level FID.[^1][^9]

The Heun integrator is a classical 2nd-order Runge-Kutta method whose use here is sometimes called the "Karras sampler" in popular frameworks even though the integrator predates the EDM paper by more than a century; the contribution is the demonstration that, paired with the EDM schedule, it is the most efficient deterministic sampler tested in the paper for a given quality budget.[^1][^9]
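The noise-injection ("churn") part of the stochastic sampler can be sketched as a single function. This follows the structure of Algorithm 2 in the EDM paper as best understood here: the noise level is raised by a factor (1 + gamma) before each step, with gamma capped at sqrt(2) - 1 and set to zero outside [S_min, S_max]. The function name and the parameter values in the usage lines are illustrative assumptions, not tuned settings.

```python
import numpy as np


def churn(x, sigma, n_steps, s_churn, s_min, s_max, s_noise, rng):
    """Raise the noise level of x from sigma to sigma_hat by adding fresh noise."""
    in_range = s_min <= sigma <= s_max
    gamma = min(s_churn / n_steps, np.sqrt(2.0) - 1.0) if in_range else 0.0
    sigma_hat = sigma * (1.0 + gamma)
    eps = s_noise * rng.standard_normal(x.shape)
    # Add exactly enough variance to move from sigma^2 to sigma_hat^2
    # (when s_noise is 1).
    x_hat = x + np.sqrt(sigma_hat**2 - sigma**2) * eps
    return x_hat, sigma_hat


rng = np.random.default_rng(0)
x = np.zeros(3)
# Hypothetical settings, chosen only to exercise both branches.
x_on, sigma_on = churn(x, 1.0, 18, 40.0, 0.05, 50.0, 1.003, rng)
x_off, sigma_off = churn(x, 60.0, 18, 40.0, 0.05, 50.0, 1.003, rng)
```

After the churn, the sampler takes its usual ODE step from `sigma_hat` rather than from `sigma`; setting `s_churn` to zero recovers the deterministic sampler.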
Algorithm 1 in the paper proceeds as follows for i = 0, ..., N-1. Starting from a sample x_i at noise level sigma_i, EDM evaluates the denoiser D(x_i; sigma_i) once, computes the Euler derivative d_i = (x_i - D(x_i; sigma_i)) / sigma_i, and takes a tentative step x_{i+1}_tilde = x_i + (sigma_{i+1} - sigma_i) * d_i. If sigma_{i+1} > 0, EDM evaluates the denoiser again at the tentative endpoint to compute a corrected derivative d_i_prime = (x_{i+1}_tilde - D(x_{i+1}_tilde; sigma_{i+1})) / sigma_{i+1} and replaces the Euler step with the average x_{i+1} = x_i + (sigma_{i+1} - sigma_i) * 0.5 * (d_i + d_i_prime). If sigma_{i+1} = 0, EDM skips the correction (since computing d_i_prime would divide by zero) and takes the Euler step as the final result. The total number of network evaluations is therefore 2 * N - 1, which is why EDM cites roughly 35 evaluations for 18 sampler steps.[^1][^9]
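The loop described above can be sketched as follows. To keep the example self-contained it uses a toy setting that is an assumption of this sketch, not something from the paper: the data is taken to be zero-mean Gaussian with standard deviation sigma_data, for which the ideal denoiser has a closed form and the ODE has an exact solution to compare against. The names `heun_sample`, `gaussian_denoiser`, and `calls` are hypothetical.

```python
import numpy as np

SIGMA_DATA = 0.5
calls = {"n": 0}  # counts denoiser evaluations


def gaussian_denoiser(x, sigma):
    # Ideal denoiser when the data is N(0, SIGMA_DATA^2); toy assumption.
    calls["n"] += 1
    return x * SIGMA_DATA**2 / (SIGMA_DATA**2 + sigma**2)


def heun_sample(denoise, x, sigmas):
    """Deterministic Heun sampler; sigmas is decreasing and ends at 0."""
    for sigma, sigma_next in zip(sigmas[:-1], sigmas[1:]):
        d = (x - denoise(x, sigma)) / sigma
        x_euler = x + (sigma_next - sigma) * d  # tentative Euler step
        if sigma_next > 0:
            d_prime = (x_euler - denoise(x_euler, sigma_next)) / sigma_next
            x = x + (sigma_next - sigma) * 0.5 * (d + d_prime)
        else:
            x = x_euler  # final step: no correction
    return x


n_steps, rho = 18, 7.0
ramp = np.linspace(0.0, 1.0, n_steps)
hi, lo = 80.0 ** (1 / rho), 0.002 ** (1 / rho)
sigmas = np.append((hi + ramp * (lo - hi)) ** rho, 0.0)

rng = np.random.default_rng(0)
x0 = sigmas[0] * rng.standard_normal(1000)  # start from pure noise
samples = heun_sample(gaussian_denoiser, x0, sigmas)
# Exact ODE solution for this toy problem, for comparison.
exact = x0 * SIGMA_DATA / np.sqrt(SIGMA_DATA**2 + sigmas[0] ** 2)
```

After the run, `calls["n"]` is 2 * 18 - 1 = 35, matching the evaluation count quoted in the text, and `samples` tracks the exact solution to within a few percent.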
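The EDM paper reports the following FID numbers on standard benchmarks. Rows for models trained by the authors use the proposed "Config F" combination of preconditioning, schedule, weighting, and sampler; the "improved pre-trained" row applies the EDM sampling improvements to a previously trained network:[^1][^2]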
| Dataset | Resolution | FID |
|---|---|---|
| CIFAR-10 (unconditional) | 32x32 | 1.97 |
| CIFAR-10 (class-conditional) | 32x32 | 1.79 |
| FFHQ (unconditional) | 64x64 | 2.39 |
| AFHQv2 (unconditional) | 64x64 | 1.96 |
| ImageNet (class-conditional, improved pre-trained) | 64x64 | 1.55 |
| ImageNet (class-conditional, re-trained) | 64x64 | 1.36 |
These numbers were state-of-the-art at the time of submission for the corresponding benchmarks. The CIFAR-10 results were achieved with only about 35 network function evaluations per generated image, compared with hundreds or thousands typical of contemporary DDPM-derived samplers; the 64x64 results use larger evaluation budgets.[^1][^2]
The paper also reports several useful ablations. Swapping the EDM schedule into an existing DDPM or VE-trained network (without retraining) already improves FID, which is direct evidence that the schedule and sampler choice are largely orthogonal to the training objective. Conversely, retraining with the EDM preconditioning and loss weighting but keeping a legacy schedule recovers most of the training benefit but not the sampling-efficiency benefit. These results make the modular argument concrete: each of the four axes contributes a measurable share of the improvement, and the gains roughly add up rather than overlapping.[^1][^8]
The official NVIDIA reference code is released at github.com/NVlabs/edm under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International license; baseline network code derives from earlier projects with their own permissive licenses.[^2][^10] The repository ships:
- train.py for training with three configurable backbone architectures: ddpmpp and ncsnpp (DDPM++/NCSN++ from earlier score-based work) and adm (the ImageNet backbone from OpenAI's "Diffusion Models Beat GANs on Image Synthesis").
- generate.py for sampling, exposing --solver=heun|euler, --steps, scheduling/scaling modes (vp, ve, edm, linear, none), and the stochastic controls S_churn, S_min, S_max, S_noise.
- fid.py for FID evaluation, dataset_tool.py for preprocessing, and example.py as a minimal standalone use of the pre-trained Config F checkpoints.

The official repo is intentionally narrow in scope: it covers the small-image benchmarks of the paper rather than text-to-image generation. Most downstream adoption goes through third-party libraries that re-implement the EDM equations on top of larger backbones.[^10]
The repository's training entry point train.py accepts a "preconditioner" type (vp, ve, edm) and a separate "schedule" type, which makes it straightforward to reproduce the legacy DDPM++ and NCSN++ recipes within the EDM codebase and compare them against the EDM recipe on identical data and networks. Multi-GPU training is supported via PyTorch Distributed Data Parallel and the repository includes scripts that compute FID over 50,000 generated samples to match the standard evaluation protocol used by the wider diffusion literature. Pretrained checkpoints, including the Config F models that produced the headline FID numbers, are hosted on NVIDIA's content delivery network and are referenced directly by URL in the example scripts.[^10]
A direct follow-up by the same core team, published as "Analyzing and Improving the Training Dynamics of Diffusion Models" (arXiv:2312.02696, CVPR 2024 oral), is commonly referred to as EDM2.[^6][^7] The author list adds Jaakko Lehtinen and Janne Hellsten alongside Karras, Aittala, Aila, and Laine.[^6]
EDM2 observes that, even with EDM's input/output preconditioning, internal activations and weights in the ADM backbone drift in magnitude during training in uncontrolled ways. The paper redesigns essentially every layer of the ADM network so that, in expectation, the magnitude of activations, weights, and parameter updates is preserved through the layer.[^6][^7] Concretely this involves "forced" weight normalizations (rescaling weights at each step so their effective magnitude matches a target), magnitude-preserving versions of attention and convolution, and an Adam-like step where the update is divided by its empirical magnitude. Magnitude preservation is applied without changing the high-level structure of the ADM U-Net, so the model remains a U-Net-style backbone.[^6][^11]
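The basic idea behind magnitude preservation can be illustrated with a single linear layer. The sketch below is a simplified illustration and not EDM2's actual layer code: it assumes unit-variance, uncorrelated inputs and rescales each output row of the weight matrix to unit L2 norm, so that the output also has roughly unit variance regardless of how large the stored weights have grown. The name `normalize_rows` is hypothetical.

```python
import numpy as np


def normalize_rows(w, eps=1e-4):
    """Rescale each output row of w to (approximately) unit L2 norm."""
    norm = np.linalg.norm(w, axis=1, keepdims=True)
    return w / (norm + eps)


rng = np.random.default_rng(0)
w = 3.0 * rng.standard_normal((256, 512))  # stored weights at an arbitrary scale
x = rng.standard_normal((2048, 512))       # unit-variance input activations

y_raw = x @ w.T                   # magnitude depends on the weight scale
y_mp = x @ normalize_rows(w).T    # magnitude stays near 1
raw_std, mp_std = float(y_raw.std()), float(y_mp.std())
```

Here `raw_std` is far above one while `mp_std` stays close to one, which is the property that keeps activations from drifting as the weights change during training.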
EDM2 also introduces a "post-hoc EMA" method that allows the choice of exponential-moving-average decay rate to be deferred to after training. During training, several EMA snapshots of the weights at different short half-lives are kept; at test time, an arbitrary EMA profile can be reconstructed by a closed-form linear combination of these snapshots.[^6][^7] Because the EMA half-life interacts strongly with FID but is normally fixed before training begins (and only its long-term effect is visible near convergence), post-hoc EMA replaces a costly hyperparameter sweep with a cheap post-processing step.[^6]
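Post-hoc reconstruction works because an EMA is a linear function of the weight trajectory, so a linear combination of stored EMAs is itself a weighted average of the training weights. The toy sketch below shows that linearity with plain exponential EMAs of a scalar "weight" and a least-squares fit of the averaging profiles. This setup is an assumption made for brevity; EDM2's own averaging profiles and snapshot scheme differ, and all names here are hypothetical.

```python
import numpy as np


def run_ema(theta, beta):
    """Standard EMA recursion, started from zero."""
    e = 0.0
    for value in theta:
        e = beta * e + (1.0 - beta) * value
    return e


def ema_profile(beta, steps):
    """Weight that each training step receives in the final EMA value."""
    t = np.arange(steps)
    return (1.0 - beta) * beta ** (steps - 1 - t)


rng = np.random.default_rng(0)
theta = np.cumsum(rng.standard_normal(2000))  # toy scalar weight trajectory
stored_betas = [0.9, 0.99, 0.999]             # EMAs tracked during training
target_beta = 0.97                            # decay chosen after training

profiles = np.stack([ema_profile(b, len(theta)) for b in stored_betas], axis=1)
target = ema_profile(target_beta, len(theta))
coef, *_ = np.linalg.lstsq(profiles, target, rcond=None)

snapshots = np.array([run_ema(theta, b) for b in stored_betas])
reconstructed = float(coef @ snapshots)  # approximates the target-beta EMA
```

The combination `coef @ snapshots` equals the trajectory averaged with the fitted profile `profiles @ coef`; how closely that profile matches the target depends on how many stored EMAs are available.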
On ImageNet 512x512, EDM2 improves the prior best FID for diffusion models from 2.41 to 1.81 using fast deterministic sampling, at the same training compute budget as the ADM baseline it builds on, and reports FID and FDDINOv2 numbers across model sizes from "XS" through "XXL."[^6][^7] The official repo at github.com/NVlabs/edm2 ships training code, sampling code, post-hoc EMA reconstruction utilities, and pretrained checkpoints. The same repository also contains the code for the related NeurIPS 2024 oral paper "Guiding a Diffusion Model with a Bad Version of Itself" (autoguidance), which uses a deliberately weaker auxiliary diffusion model in place of an unconditional model to provide diffusion guidance without the artifacts that classifier-free guidance introduces.[^11]
As of May 2026 there is no published paper officially branded "EDM3" by the same NVIDIA group. Community references to "EDM3" in forum posts or third-party blog posts should not be treated as referring to a specific NVIDIA paper; the EDM line as published consists of EDM (NeurIPS 2022) and EDM2 (CVPR 2024), plus the autoguidance paper that shares the same codebase.[^11]
Autoguidance, presented at NeurIPS 2024 by the same group, is a guidance technique that pairs a strong target diffusion model with a deliberately weaker auxiliary diffusion model trained on the same task. At sampling time the strong model's score is pushed away from the weak model's score, which has the effect of guiding samples toward "the things the strong model can do but the weak model cannot," yielding sharper conditional samples without the over-saturation and reduced diversity associated with classifier-free guidance. Because autoguidance shares the EDM2 codebase, the same magnitude-preserving layers and post-hoc EMA pipeline are reused for both the strong and the auxiliary network.[^11]
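The guidance step itself is a one-line extrapolation. The sketch below writes it in terms of denoiser outputs, which is equivalent to extrapolating scores at a fixed noise level; the function name and the example arrays are hypothetical.

```python
import numpy as np


def guided_denoise(d_strong, d_weak, guidance):
    """Extrapolate from the weak model's output past the strong model's output."""
    return d_weak + guidance * (d_strong - d_weak)


d_strong = np.array([1.0, 2.0])  # hypothetical strong-model denoiser output
d_weak = np.array([0.5, 2.5])    # hypothetical weak-model denoiser output
unguided = guided_denoise(d_strong, d_weak, guidance=1.0)  # equals d_strong
pushed = guided_denoise(d_strong, d_weak, guidance=2.0)    # moves away from d_weak
```

A guidance value of one returns the strong model unchanged, and larger values move the result further from the weak model's prediction.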
EDM's design-space ideas, especially the Karras sigma schedule and the Heun deterministic sampler, were absorbed into the broader image-diffusion ecosystem almost immediately after publication and have remained defaults for several years.
The library k-diffusion, maintained by Katherine Crowson, is an early and influential PyTorch re-implementation of the EDM samplers and training equations, released under the MIT license.[^3][^12] It implements EDM's Heun-style Algorithm 2 as sample_heun, plus a family of compatible Euler and DPM-Solver-based samplers that all consume the same Karras sigma schedule. k-diffusion provides wrappers that adapt models trained in other formulations (v-diffusion, OpenAI's guided-diffusion, CompVis) to the EDM-style denoiser interface, so that the same samplers can be reused across model families.[^12]
Because Stable Diffusion 1.x and 2.x checkpoints from CompVis and StabilityAI were originally trained with the DDPM objective rather than the EDM objective, k-diffusion's wrapping mechanism is how those models acquired Karras-schedule samplers in popular UIs. Sampler menu entries like "DPM++ 2M Karras," "Euler a Karras," and "DPM++ SDE Karras" in AUTOMATIC1111's Stable Diffusion WebUI and similar tools refer to a sampler implemented in k-diffusion combined with a Karras-style sigma schedule from the EDM paper.[^3][^5][^13]
The Hugging Face Diffusers library implements EDM directly as a scheduler family. The EDMEulerScheduler documents itself as "the Karras formulation of the Euler scheduler (Algorithm 2)" with default parameters sigma_min=0.002, sigma_max=80.0, sigma_data=0.5, rho=7.0, and sigma_schedule="karras", all matching the EDM paper.[^4] The library also ships EDMDPMSolverMultistepScheduler, a Karras-style version of DPM-Solver++ used by some recent text-to-image models, and the EDM scheduler is the default for several non-Stable-Diffusion pipelines including Playground v2.5 and the StabilityAI CosXL release that swap in an "exponential" Karras-compatible schedule.[^4]
ComfyUI exposes EDM-style scheduling under the KarrasScheduler node, which returns sigmas computed by the EDM rho-power formula and is the natural counterpart to its sampler nodes. ComfyUI's bundled sampling code under comfy/k_diffusion/sampling.py is a derivative of k-diffusion and supports the Heun and DPM-Solver variants on top of the Karras schedule.[^14] AUTOMATIC1111's WebUI, vladmandic's SDNext, and InvokeAI all expose "Karras" variants of their samplers via the same underlying schedule and integrator pattern.[^5][^13]
OpenAI's consistency models codebase reuses the EDM denoiser interface and Karras sigmas for distillation and for evaluation of the resulting student models, which is one reason that "Karras sigmas" appear in many distilled-sampler papers whose underlying methods are otherwise unrelated to EDM.[^15] Several subsequent papers on faster diffusion sampling, on flow matching and rectified flow, on better noise schedules, and on alternative network architectures explicitly use the EDM design space as the baseline they compare against.[^1][^7][^15]
The naming convention in user interfaces is worth a note. In AUTOMATIC1111's WebUI and similar tools, sampler entries such as Euler a Karras, DPM++ 2M Karras, DPM++ SDE Karras, and Heun Karras decompose into an integrator (Euler ancestral, DPM-Solver++ multistep, etc.) and the EDM rho-power sigma schedule. The "Karras" suffix refers only to the schedule; the integrator itself can be older or newer than EDM. This labeling can be confusing because in some places "Karras sampler" colloquially refers to the EDM Heun integrator specifically and in others it refers to any sampler using the EDM schedule.[^3][^5][^13]
EDM's significance lies less in any single new technique than in the way it reorganized the diffusion-model literature.[^1][^8] Before EDM, papers tended to bundle a particular preconditioning, schedule, weighting, and sampler into "the model" and to compare end to end, which made it hard to know which knob was responsible for an improvement. EDM showed that those four axes are essentially independent and that the right combination, with no changes to the underlying score network, could move FID by a factor of two or more on standard benchmarks.[^1][^2] This decomposition framework, more than the specific equations, is what subsequent work built on.
Practically, the Karras sigma schedule and Heun sampler made small numbers of sampling steps (20-30) competitive with hundreds-of-steps sampling, which was important for the usability of Stable Diffusion and related text-to-image systems on consumer hardware.[^3][^5][^13] The widespread "(...) Karras" sampler labels that users now see across ComfyUI, AUTOMATIC1111, InvokeAI, and others are direct adoption of the EDM schedule.[^3][^4][^5]
For training, EDM made the case that the right preconditioning and loss weighting are not cosmetic choices but determine whether a diffusion model is stable to train across the full noise range. EDM2 extended that argument to layer-by-layer activation and weight magnitudes, and the post-hoc EMA technique introduced in EDM2 has been picked up by other research projects that need to tune EMA aggressively.[^6][^7][^11]
EDM is a paper about small-image diffusion models, both unconditional and class-conditional. It does not address text-to-image generation, classifier-free guidance, latent-diffusion architectures, or conditional generation beyond class labels; downstream systems that use EDM-style schedules in those settings inherit the schedule and sampler but use their own training recipes.[^1][^10] The default sigma_max=80 is appropriate for pixel-space models at the resolutions tested in the paper, and Diffusers documents [0.2, 80.0] as a reasonable range for sigma_max; downstream latent-space models often re-derive their own sigma extremes, so values copied from EDM should not be assumed appropriate for arbitrary backbones.[^4][^10]
The Heun deterministic sampler is a 2nd-order method and so costs two network evaluations per step (except the final step). Comparing EDM-Heun at N steps with a first-order sampler at N steps is therefore not like-for-like: a Heun run of N steps should be compared against an Euler or DDIM run of roughly 2N steps.[^1][^9] EDM's Algorithm 2 stochastic sampler adds four hyperparameters (S_churn, S_min, S_max, S_noise) that need to be tuned per dataset to get the best FID, which weakens the "design space without surprises" framing for stochastic sampling.[^1][^9]
EDM2 keeps the ADM U-Net backbone and the EDM loss but rewrites the layers extensively; downstream practitioners adopting magnitude preservation report that the changes interact non-trivially with mixed-precision training and with non-square attention shapes, and that careful re-tuning is needed when porting the technique to other architectures or to diffusion transformers.[^6][^7][^11]
Finally, the EDM and EDM2 codebases are released under CC BY-NC-SA 4.0 (non-commercial, share-alike) rather than a permissive license, which has practical implications for direct reuse of the reference code in commercial systems even though most downstream adoption uses independent re-implementations (k-diffusion, Diffusers, ComfyUI) under MIT or Apache licenses.[^10][^11][^12]
EDM sits at the center of a tight cluster of diffusion-model design papers and tools.
| Work | Relation to EDM | Reference |
|---|---|---|
| DDPM (Ho et al., 2020) | Discrete-time predecessor; EDM unifies and supersedes its noise-prediction parameterization. | [^1] |
| DDIM (Song et al., 2021) | Deterministic sampler predecessor; EDM's Heun method targets the same probability-flow ODE more accurately. | [^1] |
| Score SDE (VP/VE) (Song et al., 2021) | Continuous-time predecessor; EDM derives from the same SDE framework but uses a different parameterization of sigma. | [^1] |
| ADM (Dhariwal & Nichol, 2021) | The "Ablated Diffusion Model" U-Net backbone that EDM and EDM2 use as their ImageNet network. | [^1][^6] |
| Consistency models (Song et al., 2023) | Distill EDM-style teachers using EDM denoiser interface and Karras sigmas. | [^15] |
| Flow matching / rectified flow | Alternative training objectives over the same probability-flow ODE; often compared against EDM as a baseline. | [^7] |
| k-diffusion | Third-party PyTorch implementation of EDM samplers used in most open-source UIs. | [^12] |
| ComfyUI / AUTOMATIC1111 | UIs whose "Karras" sampler entries refer to EDM's rho-power sigma schedule. | [^5][^13][^14] |