SDEdit

Updated Jun 8, 2026

Overview

SDEdit (Stochastic Differential Editing) is a method for guided image synthesis and editing that turns a rough user guide, such as a stroke painting, a coarse collage, or a real photograph with edits pasted in, into a realistic image without any task-specific training, paired data, or hand-drawn masks. The core idea is simple: take the guide, add an intermediate amount of Gaussian noise to it, and then denoise the result with a pretrained diffusion model by running its reverse stochastic differential equation (SDE). The added noise is enough to wash out the unrealistic artifacts of the guide while preserving its overall structure, and the reverse diffusion process then projects the noisy input back onto the manifold of realistic images ^[1]^[2].

The technique was introduced in "SDEdit: Guided Image Synthesis and Editing with Stochastic Differential Equations" by Chenlin Meng, Yutong He, Yang Song, Jiaming Song, Jiajun Wu, Jun-Yan Zhu, and Stefano Ermon. The paper was first posted to arXiv on 2 August 2021 and published at the International Conference on Learning Representations (ICLR) in 2022. Most of the authors were at Stanford University; Jun-Yan Zhu was at Carnegie Mellon University ^[1]^[3]^[4]. SDEdit is historically important as the direct conceptual basis of the "img2img" or image-to-image editing mode in Stable Diffusion and many other text-to-image systems, where a "strength" or "denoising strength" slider exposes exactly the noise-level tradeoff that SDEdit formalized ^[2]^[5].

Background: the faithfulness versus realism tradeoff

Guided image synthesis lets everyday users create or alter photo-realistic images with minimal effort, for example by drawing colored strokes or pasting a patch from one photo onto another. The central difficulty is balancing two competing objectives: faithfulness to the user input (the output should respect the strokes, colors, and layout the user provided) and realism (the output should look like a natural photograph rather than a crude drawing or an obvious cut-and-paste) ^[1].

Before SDEdit, the dominant approaches were built on generative adversarial networks (GANs). These came in two broad families, both with drawbacks the paper highlights:

- Conditional GANs learn a direct mapping from a guide (for example, a segmentation map or an edge sketch) to a realistic image. They typically require paired training data and a separately trained model for each application.
- GAN inversion methods project a guide into the latent space of a pretrained GAN and then resynthesize. Finding a latent code that is both faithful and realistic is difficult, and these methods often need task-specific loss functions or regularizers.

In both cases, supporting a new editing task generally means new training data, new losses, or new models. SDEdit's contribution was to show that a single pretrained diffusion model, used as a generative prior, can handle stroke-based synthesis, stroke-based editing, and image compositing out of the box, with no additional training and no inversion, while navigating the faithfulness-realism tradeoff through a single intuitive knob ^[1].

How SDEdit works: noise then denoise

SDEdit builds on score-based generative models defined through SDEs, the framework of Song et al. (2021) that unifies denoising diffusion and score matching. In that framework, a forward SDE gradually perturbs a clean image into pure Gaussian noise over a continuous time variable t that runs from 0 (clean data) to 1 (pure noise). A learned score network approximates the gradient of the log data density at each noise level, and simulating the corresponding reverse-time SDE starting from noise generates new samples. SDEdit was demonstrated with both the Variance Exploding (VE) and Variance Preserving (VP) SDE formulations ^[1].
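
For reference, the two SDEs in that framework are usually written in the general form below. The symbols f (drift), g (diffusion coefficient), w and \bar{w} (forward and reverse-time Wiener processes) follow the standard notation of the score-based SDE literature and are not defined elsewhere in this article.

```latex
% Forward SDE: perturbs data x(0) ~ p_0 toward noise as t goes from 0 to 1
\mathrm{d}\mathbf{x} = \mathbf{f}(\mathbf{x}, t)\,\mathrm{d}t + g(t)\,\mathrm{d}\mathbf{w}

% Reverse-time SDE: integrated from t = 1 down to t = 0, with the score
% \nabla_{\mathbf{x}} \log p_t(\mathbf{x}) replaced by the learned network s_\theta(\mathbf{x}, t)
\mathrm{d}\mathbf{x} = \left[\mathbf{f}(\mathbf{x}, t) - g(t)^2\,\nabla_{\mathbf{x}} \log p_t(\mathbf{x})\right]\mathrm{d}t + g(t)\,\mathrm{d}\bar{\mathbf{w}}

% VE SDE: f = 0,                   g(t) = \sqrt{\mathrm{d}[\sigma^2(t)]/\mathrm{d}t}
% VP SDE: f = -\tfrac{1}{2}\beta(t)\,\mathbf{x},  g(t) = \sqrt{\beta(t)}
```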

The forward process can be written schematically as x(t) = alpha(t) x(0) + sigma(t) z, where z is standard Gaussian noise, alpha(t) scales the signal (alpha(t) = 1 for the VE SDE), and sigma(t) sets the noise scale. The key move in SDEdit is not to start the reverse process from pure noise at t = 1. Instead, it picks an intermediate time t0 strictly between 0 and 1 and proceeds in two steps ^[1]^[2]:

1. Perturb the guide. Take the user-provided guide image and add Gaussian noise at the noise level sigma(t0). This pushes the guide partway up the diffusion trajectory: far enough that its low-level artifacts (visible brushstrokes, hard collage seams, flat colors) become statistically hard to distinguish from a real image noised to the same level, but not so far that the overall layout and color composition are destroyed.
2. Run the reverse SDE from t0. Initialize the reverse-time SDE with this noisy guide and integrate it down from t0 to 0 using the pretrained score model. The result is a clean image that lies on the model's manifold of realistic images while remaining close to the structure of the guide.
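
The two steps can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: it uses a VE-style schedule with made-up `SIGMA_MIN`/`SIGMA_MAX` values, a simple Euler-Maruyama discretization, and a hypothetical `toy_score` function that stands in for a pretrained score network by assuming "realistic" pixels are Gaussian.

```python
import numpy as np

SIGMA_MIN, SIGMA_MAX = 0.01, 10.0  # assumed VE-SDE noise range (illustrative values)

def sigma(t):
    """Geometric VE noise schedule: sigma(0) = SIGMA_MIN, sigma(1) = SIGMA_MAX."""
    return SIGMA_MIN * (SIGMA_MAX / SIGMA_MIN) ** t

def toy_score(x, t, data_mean=0.5, data_std=0.1):
    """Stand-in for a pretrained score network (assumption for this sketch).

    Pretends realistic images have i.i.d. Gaussian pixels, so the noised
    marginal is N(data_mean, data_std^2 + sigma(t)^2) and its score is exact.
    """
    return -(x - data_mean) / (data_std ** 2 + sigma(t) ** 2)

def sdedit(guide, t0, score_fn=toy_score, n_steps=500, rng=None):
    """Noise the guide to level sigma(t0), then integrate the reverse SDE to t = 0."""
    rng = np.random.default_rng() if rng is None else rng
    # Step 1: perturb the guide.
    x = guide + sigma(t0) * rng.standard_normal(guide.shape)
    # Step 2: run the reverse SDE from t0 down to 0 (Euler-Maruyama).
    ts = np.linspace(t0, 0.0, n_steps + 1)
    for t, t_next in zip(ts[:-1], ts[1:]):
        eps2 = sigma(t) ** 2 - sigma(t_next) ** 2  # noise variance removed in this step
        x = x + eps2 * score_fn(x, t) + np.sqrt(eps2) * rng.standard_normal(x.shape)
    return x

# Usage: a crude two-tone "stroke painting" as the guide.
guide = np.zeros((8, 8))
guide[:, 4:] = 1.0
edited = sdedit(guide, t0=0.5, rng=np.random.default_rng(0))
```

With a real model, `score_fn` would be the pretrained network and `guide` an image tensor; the control flow stays the same.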

The hyperparameter t0 is the central control. It directly trades off the two objectives ^[1]^[2]:

Choice of t0	Noise added to the guide	Effect on output
Small t0 (near 0)	Little noise	More faithful to the guide, but artifacts may survive (less realistic)
Intermediate t0	Moderate noise	Balanced: realistic output that still follows the guide
Large t0 (near 1)	Much noise	More realistic, but less faithful; structure of the guide can be lost

In other words, adding more noise and running the SDE for longer yields more realistic but less faithful images, while adding less noise yields more faithful but less realistic ones. The paper notes there is generally a "sweet spot" range of t0 (it reports good results for t0 roughly in the 0.3 to 0.6 range, depending on the task) where outputs are both realistic and faithful. Because the procedure is stochastic, running it several times from the same guide produces a set of distinct plausible outputs ^[1].
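
The tradeoff can be made concrete with a small sweep over t0. The sketch below reuses the same toy assumptions as before (a VE-style schedule with illustrative constants and an exact Gaussian score in place of a trained network), so the numbers it prints say nothing about real images. It only shows the direction of the effect: distance to the guide grows with t0 while distance to the toy prior shrinks.

```python
import numpy as np

SIGMA_MIN, SIGMA_MAX = 0.01, 10.0   # assumed VE-SDE noise range
DATA_MEAN, DATA_STD = 0.5, 0.1      # toy "realistic image" distribution (assumption)

def sigma(t):
    return SIGMA_MIN * (SIGMA_MAX / SIGMA_MIN) ** t

def score(x, t):
    # Exact score of the noised toy distribution N(DATA_MEAN, DATA_STD^2 + sigma(t)^2).
    return -(x - DATA_MEAN) / (DATA_STD ** 2 + sigma(t) ** 2)

def noise_then_denoise(guide, t0, rng, n_steps=300):
    x = guide + sigma(t0) * rng.standard_normal(guide.shape)
    ts = np.linspace(t0, 0.0, n_steps + 1)
    for t, t_next in zip(ts[:-1], ts[1:]):
        eps2 = sigma(t) ** 2 - sigma(t_next) ** 2
        x = x + eps2 * score(x, t) + np.sqrt(eps2) * rng.standard_normal(x.shape)
    return x

def sweep(t0_values=(0.1, 0.3, 0.5, 0.7), seed=0):
    rng = np.random.default_rng(seed)
    guide = np.zeros((64, 64))
    guide[:, 32:] = 1.0  # crude two-tone guide, far from the toy prior
    rows = []
    for t0 in t0_values:
        out = noise_then_denoise(guide, t0, rng)
        dist_to_guide = float(np.mean((out - guide) ** 2))      # lower = more faithful
        dist_to_prior = float(np.mean((out - DATA_MEAN) ** 2))  # lower = closer to the toy prior
        rows.append((t0, dist_to_guide, dist_to_prior))
    return rows

for t0, d_guide, d_prior in sweep():
    print(f"t0={t0:.1f}  dist_to_guide={d_guide:.4f}  dist_to_prior={d_prior:.4f}")
```

The two distances here are simple proxies chosen for the toy setting; they are not the faithfulness and realism metrics used in the paper.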

For local edits, SDEdit can also incorporate a user-specified region. The unedited pixels are kept fixed by repeatedly replacing them with an appropriately noised version of the original image at each reverse step, so that only the targeted region is regenerated while the rest of the image is preserved. This makes the same algorithm usable for editing and compositing, not just full-image synthesis ^[1].
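
A sketch of that replacement step, assuming VE-style noising (original plus sigma times Gaussian noise) and a binary mask; the function name `keep_unedited` and its arguments are hypothetical and chosen only for illustration.

```python
import numpy as np

def keep_unedited(x, original, edit_mask, sigma_t, rng):
    """Overwrite pixels outside the edit region with a noised copy of the original.

    edit_mask is 1 where the user wants content regenerated and 0 where the
    original must be preserved; sigma_t is the noise level of the current
    reverse step, so the preserved pixels match the noise level of the rest.
    """
    noised_original = original + sigma_t * rng.standard_normal(original.shape)
    return edit_mask * x + (1.0 - edit_mask) * noised_original

# Usage: regenerate only the top half of a 4x4 image.
rng = np.random.default_rng(0)
original = rng.random((4, 4))
current = rng.random((4, 4))          # state of the reverse process at this step
mask = np.zeros((4, 4))
mask[:2] = 1.0
blended = keep_unedited(current, original, mask, sigma_t=0.3, rng=rng)
```

In a full sampler this function would be called once per reverse step, with `sigma_t` shrinking toward zero so that the preserved region converges to the original pixels.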

Applications

The original paper demonstrates three main tasks, all using the same pretrained diffusion model and the same noise-then-denoise procedure ^[1]^[2]:

- Stroke-based image synthesis. The user provides a coarse colored stroke painting, and SDEdit turns it into a realistic image (for example, a landscape) whose layout and colors follow the strokes.
- Stroke-based image editing. The user draws strokes over part of a real photograph to indicate a desired change, and SDEdit regenerates that region realistically while leaving the rest of the image intact.
- Image compositing. The user pastes a patch or object from one image onto another, producing a guide with visible seams, and SDEdit harmonizes the composite into a single coherent, realistic image.

To evaluate these tasks, the authors measured faithfulness with metrics such as the L2 distance between the guide and the output and a masked LPIPS perceptual distance for unintended changes, and measured realism with the Kernel Inception Distance (KID) and human studies on Amazon Mechanical Turk. Across tasks, the human evaluations reported that SDEdit outperformed GAN-based baselines by up to 98.09 percent on realism and 91.72 percent on overall satisfaction scores ^[1].
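
For readers unfamiliar with KID, the sketch below computes the standard unbiased estimate of squared maximum mean discrepancy with a cubic polynomial kernel. It assumes the feature vectors (normally taken from an Inception network) are already computed, and it uses random arrays in their place. Practical KID implementations usually also average over several subsets, which is omitted here.

```python
import numpy as np

def poly_kernel(a, b):
    """Cubic polynomial kernel k(x, y) = (x . y / d + 1)^3 used by KID."""
    d = a.shape[1]
    return (a @ b.T / d + 1.0) ** 3

def kid(feats_x, feats_y):
    """Unbiased MMD^2 estimate between two sets of feature vectors (rows)."""
    m, n = len(feats_x), len(feats_y)
    k_xx = poly_kernel(feats_x, feats_x)
    k_yy = poly_kernel(feats_y, feats_y)
    k_xy = poly_kernel(feats_x, feats_y)
    # Within-set terms exclude the diagonal (a sample paired with itself).
    term_xx = (k_xx.sum() - np.trace(k_xx)) / (m * (m - 1))
    term_yy = (k_yy.sum() - np.trace(k_yy)) / (n * (n - 1))
    return term_xx + term_yy - 2.0 * k_xy.mean()

# Usage with placeholder features (random vectors, not real Inception features).
rng = np.random.default_rng(0)
real_feats = rng.standard_normal((200, 16))
similar_feats = rng.standard_normal((200, 16))
shifted_feats = rng.standard_normal((200, 16)) + 1.0
print(kid(real_feats, similar_feats), kid(real_feats, shifted_feats))
```

Lower values indicate that the two feature distributions are closer; because the estimator is unbiased, it can come out slightly negative for matching distributions.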

Relationship to img2img and other editing methods

SDEdit is the conceptual core of the now-ubiquitous image-to-image (img2img) editing mode in modern diffusion tools. In Stable Diffusion's img2img pipeline, an input image is encoded, partially noised according to a "strength" (or "denoising strength") parameter in the range 0 to 1, and then denoised under a text prompt. That strength parameter plays exactly the role of SDEdit's t0: a strength near 0 barely changes the image (high faithfulness), while a strength near 1 discards almost all input information and behaves like text-to-image generation from scratch (high realism, low faithfulness). The same mechanism underlies image-conditioned editing in systems such as GLIDE and various distilled Stable Diffusion variants ^[2]^[5]^[6].
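
The correspondence between strength and t0 is easiest to see as step arithmetic. The helper below is a hypothetical illustration written for this article, modeled on the way img2img pipelines commonly truncate the sampling schedule; exact behavior differs across libraries, versions, and schedulers, so treat it as an assumption rather than any library's API.

```python
def img2img_schedule(num_inference_steps, strength):
    """Return (first_step_index, steps_actually_run) for a given strength.

    strength acts like SDEdit's t0: the input is noised to the level of the
    step at first_step_index, and only the remaining steps are denoised.
    """
    if not 0.0 <= strength <= 1.0:
        raise ValueError("strength must be in [0, 1]")
    steps_to_run = min(int(num_inference_steps * strength), num_inference_steps)
    first_step_index = max(num_inference_steps - steps_to_run, 0)
    return first_step_index, steps_to_run

print(img2img_schedule(50, 0.5))  # -> (25, 25): skip the 25 noisiest steps, run 25
print(img2img_schedule(50, 1.0))  # -> (0, 50): start from (almost) pure noise
print(img2img_schedule(50, 0.0))  # -> (50, 0): no denoising, input returned unchanged
```

One practical consequence of this arithmetic is that a lower strength runs fewer denoising steps as well as preserving more of the input, so low-strength edits are also cheaper.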

SDEdit belongs to a broader family of training-free diffusion editing methods, and it is useful to contrast how each one steers the reverse process ^[1]^[7]^[8]:

Method	Core mechanism	What it needs
SDEdit	Add intermediate noise to a guide, then run the reverse SDE	A pretrained diffusion model; no extra training or masks
ILVR (Choi et al., 2021)	Iteratively refine low-frequency content to match a reference at each denoising step	A reference image and a chosen downsampling (low-pass) function
RePaint (2022)	Diffusion-based inpainting that fills a masked region while resampling the known background	A pretrained diffusion model and an explicit mask
Prompt-to-Prompt (2022)	Edit images by manipulating the cross-attention maps that bind words to image regions	A text-to-image model and paired source/target prompts

Compared with ILVR, which conditions on a known reference through a fixed low-pass measurement at every step, SDEdit conditions on the guide only once, through the noised initial state at t0, and otherwise lets the unconstrained reverse SDE run, which is what makes it agnostic to the type of guide. Compared with mask-based inpainting methods like RePaint, plain SDEdit needs no mask for global tasks such as stroke-to-image synthesis, although it can use a region mask when only a local edit is wanted. Compared with Prompt-to-Prompt and other attention-based editors, which operate in the semantic space of a text prompt, SDEdit operates directly in pixel or latent space on a spatial guide, so the two approaches are complementary and are often combined in practice ^[1]^[7]^[8].
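
To make the contrast with ILVR concrete, the sketch below shows an ILVR-style per-step correction: the low-frequency part of the sampler's proposal is swapped for that of the (noised) reference. The low-pass operator here is an assumption made for simplicity (average pooling followed by pixel repetition, where ILVR itself uses image resizing), and the function names are hypothetical. SDEdit has no counterpart to this step; after the initial noising it applies no correction at all.

```python
import numpy as np

def low_pass(x, factor=4):
    """Assumed low-pass filter: average-pool by `factor`, then repeat pixels back up.

    Requires both image dimensions to be divisible by `factor`.
    """
    h, w = x.shape
    pooled = x.reshape(h // factor, factor, w // factor, factor).mean(axis=(1, 3))
    return np.repeat(np.repeat(pooled, factor, axis=0), factor, axis=1)

def ilvr_correction(x_proposed, ref_noised, factor=4):
    """Keep the proposal's high frequencies, take low frequencies from the reference."""
    return x_proposed + low_pass(ref_noised, factor) - low_pass(x_proposed, factor)

# Usage on random 16x16 arrays standing in for one denoising step.
rng = np.random.default_rng(0)
proposal = rng.standard_normal((16, 16))
reference = rng.standard_normal((16, 16))
corrected = ilvr_correction(proposal, reference)
```

Because this particular low-pass operator is a projection, the corrected sample matches the reference exactly in its low-frequency component while its high-frequency detail comes entirely from the proposal.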

Limitations

SDEdit's simplicity is also the source of its limitations. Because the realism-faithfulness balance is governed by the single hyperparameter t0, no single setting is optimal for every guide: a low t0 may leave visible artifacts, while a high t0 may discard structure the user wanted to keep, so the noise level often must be tuned per image or per task ^[1].

The method also conditions on the guide only through the initial noised state. Once the reverse SDE begins, no further constraint pulls the output back toward the guide, so fine details and exact positions are not guaranteed to be preserved, and high noise levels can drift away from the user's intent. For local editing and compositing, SDEdit keeps unedited regions fixed only when the user supplies a region to preserve ^[1].

Finally, SDEdit inherits the constraints of the underlying diffusion model: output quality, resolution, and the range of depictable content are bounded by the pretrained prior, and like other iterative diffusion samplers it requires many sequential denoising steps, making it slower than a single GAN forward pass. Later work on faster samplers and on methods that condition more strongly on structure, text, or reference images has built on the basic SDEdit recipe rather than replaced it ^[1]^[2]^[6].

References

1. C. Meng, Y. He, Y. Song, J. Song, J. Wu, J.-Y. Zhu, S. Ermon, "SDEdit: Guided Image Synthesis and Editing with Stochastic Differential Equations," arXiv:2108.01073, 2021. https://arxiv.org/abs/2108.01073
2. "SDEdit: Image Synthesis and Editing with Stochastic Differential Equations," project page. https://sde-image-editing.github.io/
3. "SDEdit: Guided Image Synthesis and Editing with Stochastic Differential Equations," OpenReview (ICLR 2022). https://openreview.net/forum?id=aBsCjcPu_tE
4. "SDEdit," ICLR 2022 poster page. https://iclr.cc/virtual/2022/poster/6268
5. "Image-to-image," Hugging Face Diffusers documentation. https://huggingface.co/docs/diffusers/api/pipelines/stable_diffusion/img2img
6. CompVis, "Stable Diffusion," GitHub repository (img2img / image-to-image). https://github.com/CompVis/stable-diffusion
7. J. Choi, S. Kim, Y. Jeong, Y. Gwon, S. Yoon, "ILVR: Conditioning Method for Denoising Diffusion Probabilistic Models," arXiv:2108.02938, 2021. https://arxiv.org/abs/2108.02938
8. A. Hertz, R. Mokady, J. Tenenbaum, K. Aberman, Y. Pritch, D. Cohen-Or, "Prompt-to-Prompt Image Editing with Cross Attention Control," arXiv:2208.01626, 2022. https://arxiv.org/abs/2208.01626
