SDEdit
Last reviewed
Sources
8 citations
Review status
Source-backed
Revision
v1 · 1,933 words
SDEdit (Stochastic Differential Editing) is a method for guided image synthesis and editing that turns a rough user guide, such as a stroke painting, a coarse collage, or a real photograph with edits pasted in, into a realistic image without any task-specific training, paired data, or hand-drawn masks. The core idea is simple: take the guide, add an intermediate amount of Gaussian noise to it, and then denoise the result with a pretrained diffusion model by running its reverse stochastic differential equation (SDE). The added noise is enough to wash out the unrealistic artifacts of the guide while preserving its overall structure, and the reverse diffusion process then projects the noisy input back onto the manifold of realistic images [1][2].
The technique was introduced in "SDEdit: Guided Image Synthesis and Editing with Stochastic Differential Equations" by Chenlin Meng, Yutong He, Yang Song, Jiaming Song, Jiajun Wu, Jun-Yan Zhu, and Stefano Ermon. The paper was first posted to arXiv on 2 August 2021 and published at the International Conference on Learning Representations (ICLR) in 2022. Most of the authors were at Stanford University; Jun-Yan Zhu was at Carnegie Mellon University [1][3][4]. SDEdit is historically important as the direct conceptual basis of the "img2img" or image-to-image editing mode in Stable Diffusion and many other text-to-image systems, where a "strength" or "denoising strength" slider exposes exactly the noise-level tradeoff that SDEdit formalized [2][5].
Guided image synthesis lets everyday users create or alter photo-realistic images with minimal effort, for example by drawing colored strokes or pasting a patch from one photo onto another. The central difficulty is balancing two competing objectives: faithfulness to the user input (the output should respect the strokes, colors, and layout the user provided) and realism (the output should look like a natural photograph rather than a crude drawing or an obvious cut-and-paste) [1].
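Before SDEdit, the dominant approaches were built on generative adversarial networks (GANs). These came in two broad families, conditional GANs and GAN inversion, and the paper highlights drawbacks of both: conditional GANs learn a direct mapping from guides to images and so typically need task-specific training data and retraining, while GAN inversion reuses a pretrained GAN but typically needs hand-designed loss functions for each editing task [1].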
In both cases, supporting a new editing task generally means new training data, new losses, or new models. SDEdit's contribution was to show that a single pretrained diffusion model, used as a generative prior, can handle stroke-based synthesis, stroke-based editing, and image compositing out of the box, with no additional training and no inversion, while navigating the faithfulness-realism tradeoff through a single intuitive knob [1].
SDEdit builds on score-based generative models defined through SDEs, the framework of Song et al. (2021) that unifies denoising diffusion and score matching. In that framework, a forward SDE gradually perturbs a clean image into pure Gaussian noise over a continuous time variable t that runs from 0 (clean data) to 1 (pure noise). A learned score network approximates the gradient of the log data density at each noise level, and simulating the corresponding reverse-time SDE starting from noise generates new samples. SDEdit was demonstrated with both the Variance Exploding (VE) and Variance Preserving (VP) SDE formulations [1].
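For reference, the forward and reverse-time SDEs of the score-based framework can be written as follows. This is the standard notation of Song et al. (2021), not notation taken from this article: f is the drift, g the diffusion coefficient, w and \bar{w} forward and reverse-time Wiener processes, s_\theta the learned score network, and \beta(t) the VP noise rate.

```latex
% Forward SDE (data -> noise), t from 0 to 1
\mathrm{d}\mathbf{x} = f(\mathbf{x}, t)\,\mathrm{d}t + g(t)\,\mathrm{d}\mathbf{w}

% Reverse-time SDE (noise -> data), with the score approximated by s_\theta
\mathrm{d}\mathbf{x} = \left[ f(\mathbf{x}, t) - g(t)^2 \, \nabla_{\mathbf{x}} \log p_t(\mathbf{x}) \right] \mathrm{d}t + g(t)\,\mathrm{d}\bar{\mathbf{w}},
\qquad s_\theta(\mathbf{x}, t) \approx \nabla_{\mathbf{x}} \log p_t(\mathbf{x})

% Variance Exploding (VE) and Variance Preserving (VP) special cases
\text{VE:}\quad \mathrm{d}\mathbf{x} = \sqrt{\frac{\mathrm{d}\,[\sigma^2(t)]}{\mathrm{d}t}}\;\mathrm{d}\mathbf{w}
\qquad
\text{VP:}\quad \mathrm{d}\mathbf{x} = -\tfrac{1}{2}\beta(t)\,\mathbf{x}\,\mathrm{d}t + \sqrt{\beta(t)}\,\mathrm{d}\mathbf{w}
```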
The forward process can be written schematically as x(t) = alpha(t) x(0) + sigma(t) z, where z is standard Gaussian noise, alpha(t) scales the clean signal (alpha(t) = 1 in the VE case), and sigma(t) sets the noise scale. The key move in SDEdit is not to start the reverse process from pure noise at t = 1. Instead, it picks an intermediate time t0 strictly between 0 and 1 and proceeds in two steps [1][2]:
1. Perturb the guide. Take the user-provided guide image and add Gaussian noise corresponding to the noise level sigma(t0). This pushes the guide partway up the diffusion trajectory: far enough that its low-level artifacts (visible brushstrokes, hard collage seams, flat colors) become statistically indistinguishable from those of a noised real image, but not so far that the overall layout and color composition are destroyed.
2. Run the reverse SDE from t0. Initialize the reverse-time SDE with this noisy guide and integrate it down from t0 to 0 using the pretrained score model. The result is a clean image that lies on the model's manifold of realistic images while remaining close to the structure of the guide.
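The two steps above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: it uses the VE formulation with an assumed geometric noise schedule, a simple Euler-Maruyama-style discretization, and a hypothetical `toy_score` function (the exact score of Gaussian "images") standing in for a trained score network. The names `sigma`, `toy_score`, and `sdedit`, and the constants `SIGMA_MIN` and `SIGMA_MAX`, are illustrative choices.

```python
import numpy as np

SIGMA_MIN, SIGMA_MAX = 0.01, 10.0  # assumed VE noise range, chosen for illustration


def sigma(t):
    """Geometric VE noise schedule sigma(t) for t in [0, 1] (an assumed choice)."""
    return SIGMA_MIN * (SIGMA_MAX / SIGMA_MIN) ** t


def toy_score(x, t, mean=0.5, data_std=0.1):
    """Exact score of Gaussian data N(mean, data_std^2) after VE noising.

    Hypothetical stand-in for a trained score network s_theta(x, t).
    """
    return -(x - mean) / (data_std**2 + sigma(t) ** 2)


def sdedit(guide, t0, score_fn=toy_score, n_steps=200, rng=None):
    """Noise the guide to level t0, then integrate the reverse VE SDE to t = 0."""
    rng = np.random.default_rng() if rng is None else rng
    # Step 1: perturb the guide with Gaussian noise of scale sigma(t0).
    x = guide + sigma(t0) * rng.standard_normal(guide.shape)
    # Step 2: discretized reverse-time steps from t0 down to 0.
    dt = t0 / n_steps
    for n in range(n_steps, 0, -1):
        t = n * dt
        # Noise variance removed when moving from t to t - dt.
        step_var = sigma(t) ** 2 - sigma(t - dt) ** 2
        z = rng.standard_normal(guide.shape)
        x = x + step_var * score_fn(x, t) + np.sqrt(step_var) * z
    return x


guide = np.zeros((16, 16))  # a flat "guide" far from the toy data mean of 0.5
faithful = sdedit(guide, t0=0.05, rng=np.random.default_rng(0))
realistic = sdedit(guide, t0=0.9, rng=np.random.default_rng(0))
```

With a small t0 the output stays close to the guide, while a large t0 pulls it toward the toy data distribution. In a real system `toy_score` would be replaced by a pretrained network.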
The hyperparameter t0 is the central control. It directly trades off the two objectives [1][2]:
| Choice of t0 | Noise added to the guide | Effect on output |
|---|---|---|
| Small t0 (near 0) | Little noise | More faithful to the guide, but artifacts may survive (less realistic) |
| Intermediate t0 | Moderate noise | Balanced: realistic output that still follows the guide |
| Large t0 (near 1) | Much noise | More realistic, but less faithful; structure of the guide can be lost |
In other words, adding more noise and running the SDE for longer yields more realistic but less faithful images, while adding less noise yields more faithful but less realistic ones. The paper notes there is generally a "sweet spot" range of t0 (it reports good results for t0 roughly in the 0.3 to 0.6 range, depending on the task) where outputs are both realistic and faithful. Because the procedure is stochastic, running it several times from the same guide produces a set of distinct plausible outputs [1].
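To make the tradeoff concrete, the following sketch computes how much of the guide's signal survives at several values of t0 under a VP formulation. It assumes a linear noise-rate schedule with illustrative endpoints `BETA_MIN` and `BETA_MAX`; these values and the helper names `vp_alpha` and `vp_sigma` are assumptions for illustration, not settings reported in this article.

```python
import math

BETA_MIN, BETA_MAX = 0.1, 20.0  # assumed linear VP schedule endpoints (illustrative)


def vp_alpha(t):
    """Signal scale alpha(t) = exp(-0.5 * integral of beta from 0 to t)."""
    return math.exp(-0.25 * t**2 * (BETA_MAX - BETA_MIN) - 0.5 * t * BETA_MIN)


def vp_sigma(t):
    """Noise scale sigma(t); the VP SDE keeps alpha^2 + sigma^2 = 1."""
    return math.sqrt(1.0 - vp_alpha(t) ** 2)


for t0 in (0.1, 0.3, 0.5, 0.6, 0.9):
    a, s = vp_alpha(t0), vp_sigma(t0)
    # Signal-to-noise ratio of the noised guide x(t0) = a * guide + s * z.
    print(f"t0={t0:.1f}  alpha={a:.3f}  sigma={s:.3f}  snr={a * a / (s * s):.3f}")
```

Under this assumed schedule the signal-to-noise ratio falls monotonically as t0 grows, which is the quantitative counterpart of the faithful-to-realistic progression in the table.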
For local edits, SDEdit can also incorporate a user-specified region. The unedited pixels are kept fixed by repeatedly replacing them with an appropriately noised version of the original image at each reverse step, so that only the targeted region is regenerated while the rest of the image is preserved. This makes the same algorithm usable for editing and compositing, not just full-image synthesis [1].
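The region-preserving variant can be sketched as follows. It is a simplified illustration, not the paper's exact algorithm: it reuses an assumed VE schedule and a hypothetical per-pixel `toy_score` in place of a trained network, and the final exact paste of the original pixels is a choice made here for clarity. The names `sdedit_masked` and `edit_mask` are illustrative.

```python
import numpy as np

SIGMA_MIN, SIGMA_MAX = 0.01, 10.0  # assumed VE noise range (illustrative)


def sigma(t):
    """Geometric VE noise schedule (an assumed choice)."""
    return SIGMA_MIN * (SIGMA_MAX / SIGMA_MIN) ** t


def toy_score(x, t, mean=0.5, data_std=0.1):
    """Hypothetical stand-in for a trained score network."""
    return -(x - mean) / (data_std**2 + sigma(t) ** 2)


def sdedit_masked(original, guide, edit_mask, t0, score_fn=toy_score,
                  n_steps=200, rng=None):
    """Regenerate pixels where edit_mask is True; keep the rest tied to `original`."""
    rng = np.random.default_rng() if rng is None else rng
    x = guide + sigma(t0) * rng.standard_normal(guide.shape)
    dt = t0 / n_steps
    for n in range(n_steps, 0, -1):
        t = n * dt
        step_var = sigma(t) ** 2 - sigma(t - dt) ** 2
        z = rng.standard_normal(x.shape)
        x = x + step_var * score_fn(x, t) + np.sqrt(step_var) * z
        # Replace the kept region with the original noised to the new level t - dt.
        known = original + sigma(t - dt) * rng.standard_normal(x.shape)
        x = np.where(edit_mask, x, known)
    # Assumption made here: paste the clean original back outside the edit region.
    return np.where(edit_mask, x, original)


original = np.full((8, 8), 0.5)
guide = original.copy()
guide[2:5, 2:5] = 0.0  # a crude pasted-in patch
edit_mask = np.zeros((8, 8), dtype=bool)
edit_mask[2:5, 2:5] = True
out = sdedit_masked(original, guide, edit_mask, t0=0.5,
                    rng=np.random.default_rng(0))
```

Because the toy score acts on each pixel independently, the preserved region does not influence the edited one here. With a real network, the re-noised context is what lets the regenerated region blend with its surroundings.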
The original paper demonstrates three main tasks, stroke-based image synthesis, stroke-based image editing, and image compositing, all using a pretrained diffusion model and the same noise-then-denoise procedure [1][2].
To evaluate these tasks, the authors measured faithfulness with metrics such as the L2 distance between the guide and the output and a masked LPIPS perceptual distance for unintended changes, and measured realism with the Kernel Inception Distance (KID) and human studies on Amazon Mechanical Turk. In the human evaluations for stroke-based synthesis, SDEdit was reported to outperform prior GAN-based baselines by large margins, by up to 98.09 percent on realism and 91.72 percent on overall satisfaction [1].
SDEdit is the conceptual core of the now-ubiquitous image-to-image (img2img) editing mode in modern diffusion tools. In Stable Diffusion's img2img pipeline, an input image is encoded into the model's latent space, partially noised according to a "strength" (or "denoising strength") parameter in the range 0 to 1, and then denoised under a text prompt. That strength parameter plays the role of SDEdit's t0: a strength near 0 barely changes the image (high faithfulness), while a strength near 1 discards almost all input information and behaves like text-to-image generation from scratch (high realism, low faithfulness). The same mechanism underlies image-conditioned editing in systems such as GLIDE and various distilled Stable Diffusion variants [2][5][6].
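One common way to turn a strength value into a sampling plan is to skip the earliest (noisiest) part of a fixed step schedule. The helper below is a hypothetical sketch of that arithmetic. The function name `img2img_schedule` and its return convention are assumptions for illustration and are not quoted from any particular library.

```python
def img2img_schedule(num_inference_steps, strength):
    """Map a strength in [0, 1] to (start_index, steps_to_run).

    start_index counts how many of the noisiest steps are skipped;
    steps_to_run is how many denoising steps are actually executed.
    """
    if not 0.0 <= strength <= 1.0:
        raise ValueError("strength must be in [0, 1]")
    steps_to_run = min(int(num_inference_steps * strength), num_inference_steps)
    start_index = num_inference_steps - steps_to_run
    return start_index, steps_to_run


print(img2img_schedule(50, 0.5))  # (25, 25): noise halfway, run the last 25 steps
print(img2img_schedule(50, 1.0))  # (0, 50): start from pure noise, run all 50 steps
print(img2img_schedule(50, 0.0))  # (50, 0): no noise added, no steps run
```

Under this mapping, strength acts as a discretized t0: it fixes both how much noise is added to the encoded input and how many reverse steps follow.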
SDEdit belongs to a broader family of training-free diffusion editing methods, and it is useful to contrast how each one steers the reverse process [1][7][8]:
| Method | Core mechanism | What it needs |
|---|---|---|
| SDEdit | Add intermediate noise to a guide, then run the reverse SDE | A pretrained diffusion model; no extra training or masks |
| ILVR (Choi et al., 2021) | Iteratively refine low-frequency content to match a reference at each denoising step | A reference image and a chosen downsampling (low-pass) function |
| RePaint (2022) | Diffusion-based inpainting that fills a masked region while resampling the known background | A pretrained diffusion model and an explicit mask |
| Prompt-to-Prompt (2022) | Edit images by manipulating the cross-attention maps that bind words to image regions | A text-to-image model and paired source/target prompts |
Compared with ILVR, which conditions on a known reference through a fixed low-pass measurement at every step, SDEdit conditions only once by its choice of starting noise level and otherwise lets the unconstrained reverse SDE run, which is what makes it agnostic to the type of guide. Compared with mask-based inpainting methods like RePaint, plain SDEdit needs no mask for global tasks such as stroke-to-image synthesis, although it can use a region mask when only a local edit is wanted. Compared with Prompt-to-Prompt and other attention-based editors, which operate in the semantic space of a text prompt, SDEdit operates directly in pixel or latent space on a spatial guide, so the two approaches are complementary and are often combined in practice [1][7][8].
SDEdit's simplicity is also the source of its limitations. Because the realism-faithfulness balance is governed by the single hyperparameter t0, no single setting is optimal for every guide: a low t0 may leave visible artifacts, while a high t0 may discard structure the user wanted to keep, so the noise level often must be tuned per image or per task [1].
The method also conditions on the guide only through the initial noised state. Once the reverse SDE begins, no further constraint pulls the output back toward the guide, so fine details and exact positions are not guaranteed to be preserved, and high noise levels can drift away from the user's intent. For local editing and compositing, SDEdit keeps unedited regions fixed only when the user supplies a region to preserve [1].
Finally, SDEdit inherits the constraints of the underlying diffusion model: output quality, resolution, and the range of depictable content are bounded by the pretrained prior, and like other iterative diffusion samplers it requires many sequential denoising steps, making it slower than a single GAN forward pass. Later work on faster samplers and on methods that condition more strongly on structure, text, or reference images has built on the basic SDEdit recipe rather than replaced it [1][2][6].