Prompt-to-Prompt

Updated Jun 8, 2026

Overview

Prompt-to-Prompt is a training-free image editing technique for text-conditioned diffusion models that edits a generated image by manipulating the model's cross-attention maps when the text prompt is changed ^[1]. The central observation is that, in a text-to-image diffusion model, the cross-attention maps that bind each text token to a region of the image largely determine the spatial layout of the result. By holding those maps fixed (or selectively modifying them) while swapping a word, adding a phrase, or changing a token's weight, Prompt-to-Prompt produces an edited image that preserves the geometry and composition of the original, without any per-image fine-tuning and without a user-supplied mask ^[1].

The method was introduced in the paper "Prompt-to-Prompt Image Editing with Cross Attention Control" by Amir Hertz, Ron Mokady, Jay Tenenbaum, Kfir Aberman, Yael Pritch, and Daniel Cohen-Or, researchers at Google Research and Tel Aviv University. The paper was posted to arXiv on 2 August 2022 and was published at the International Conference on Learning Representations (ICLR) in 2023 ^[1]. The original work used Google's Imagen as the backbone model, while the public reference implementation released by Google targets Stable Diffusion and latent diffusion ^[2]^[3].

Prompt-to-Prompt became an influential primitive in diffusion-based editing. Its attention-control idea underlies a family of later "attention injection" editing methods, it was paired with Null-text Inversion to extend editing to real photographs, and it was used to synthesize the paired training data for the instruction-following editor InstructPix2Pix ^[4]^[5].

Background: cross-attention controls layout

A text-to-image diffusion model generates an image by iteratively denoising a noisy latent (or pixel) tensor, conditioned at every step on a text prompt. The conditioning is injected through cross-attention layers placed throughout the denoising U-Net. At each such layer, the spatial features of the image act as queries and the per-token text embeddings act as keys and values. The resulting attention map is a set of per-token spatial heatmaps: for every word in the prompt, there is a 2D map over image locations indicating how strongly each location attends to that word (and therefore how much the word influences that region) ^[1].
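The query/key computation described above can be sketched in a few lines of NumPy. This is a minimal single-head illustration, not code from any particular model: the projection matrices `w_q` and `w_k`, the feature sizes, and the random inputs are all assumptions chosen only to show the shapes involved.

```python
import numpy as np

def cross_attention_maps(pixel_feats, token_embs, w_q, w_k):
    """Return attention weights of shape (num_pixels, num_tokens).

    pixel_feats: (num_pixels, d_img) flattened spatial features (source of queries)
    token_embs:  (num_tokens, d_txt) per-token text embeddings (source of keys)
    w_q, w_k:    hypothetical projections into a shared dimension d
    """
    q = pixel_feats @ w_q                             # (num_pixels, d)
    k = token_embs @ w_k                              # (num_tokens, d)
    scores = q @ k.T / np.sqrt(q.shape[-1])           # scaled dot products
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    return weights / weights.sum(axis=-1, keepdims=True)  # softmax over tokens

rng = np.random.default_rng(0)
height, width = 4, 4
maps = cross_attention_maps(
    rng.normal(size=(height * width, 8)),  # 16 image locations
    rng.normal(size=(3, 6)),               # 3 prompt tokens
    rng.normal(size=(8, 5)),
    rng.normal(size=(6, 5)),
)
# Column j, reshaped to the image grid, is the spatial heatmap for token j.
heatmap_token_1 = maps[:, 1].reshape(height, width)
```

Each row of `maps` sums to one because the softmax runs over tokens; the per-token heatmap is obtained by reading a column and reshaping it to the spatial grid.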

Hertz and colleagues showed that these maps are interpretable and that they encode the layout of the scene early in the denoising process. The pixels that attend most strongly to a token such as "bear" correspond closely to where the bear ends up in the image, and the rough structure of the composition is fixed in the first denoising steps before fine texture is filled in ^[1]. In Imagen the composition is determined mainly at the low 64x64 base resolution, where the cross-attention sits in the network's bottleneck, so the maps are spatially coarse but semantically meaningful ^[1]. This insight is the foundation of the method: if the spatial layout is carried by the cross-attention maps, then to keep the layout while changing content it is sufficient to keep (or controllably edit) those maps rather than the image itself.

A related lever is classifier-free guidance, which the underlying diffusion model uses to amplify the influence of the text prompt during sampling. Guidance matters for editing in two ways: it affects how strongly an edited prompt changes the result, and it amplifies reconstruction errors in the inversion step used for real images (see "Editing real images" below) ^[1]^[4].
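For reference, the standard classifier-free guidance combination can be written as a one-line function. This is a generic sketch of the usual formula; the function and argument names are illustrative and do not come from any specific library.

```python
import numpy as np

def guided_noise(eps_uncond, eps_cond, guidance_scale):
    """Extrapolate from the unconditional prediction toward the conditional one."""
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)

eps_uncond = np.array([0.0, 1.0])   # prediction with the empty ("null") prompt
eps_cond = np.array([1.0, 3.0])     # prediction with the text prompt
no_guidance = guided_noise(eps_uncond, eps_cond, 1.0)  # equals eps_cond
amplified = guided_noise(eps_uncond, eps_cond, 7.5)    # pushes past eps_cond
```

A scale of 1 reproduces the conditional prediction; larger scales exaggerate the difference between the conditional and unconditional predictions, which is also why small errors in an inverted trajectory can grow under guidance.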

How Prompt-to-Prompt works

Prompt-to-Prompt edits an image by running two synchronized denoising passes: one with the original prompt and one with the edited prompt, starting from the same random seed (the same initial noise). During the edited pass, the cross-attention maps computed from the original prompt are injected into the network, overriding or blending with the maps that the edited prompt would otherwise produce ^[1]. The choice of which maps to inject, and for how many denoising steps, defines three editing operations.
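The control flow of the two synchronized passes can be illustrated with a toy loop. Everything below is a stand-in under stated assumptions: `toy_attention` and `toy_step` are hypothetical replacements for a real U-Net, the latent is a small matrix, and `edit_fn` is a placeholder for whichever editing rule is in use. Only the structure (shared initial noise, source maps handed to the edited pass, values taken from the edited prompt) mirrors the description above.

```python
import numpy as np

def toy_attention(latent, token_embs):
    """Softmax over tokens of latent-token dot products: (pixels, tokens)."""
    scores = latent @ token_embs.T
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    return weights / weights.sum(axis=-1, keepdims=True)

def toy_step(latent, token_embs, attn):
    """Hypothetical stand-in for one denoising step driven by an attention map."""
    return 0.9 * latent + 0.1 * (attn @ token_embs)

def prompt_to_prompt(src_tokens, edit_tokens, edit_fn, num_steps=10, seed=0):
    rng = np.random.default_rng(seed)
    z_src = rng.normal(size=(16, src_tokens.shape[1]))
    z_edit = z_src.copy()                      # both passes share the initial noise
    for step in range(num_steps):
        attn_src = toy_attention(z_src, src_tokens)
        attn_edit = toy_attention(z_edit, edit_tokens)
        z_src = toy_step(z_src, src_tokens, attn_src)
        # The edited pass uses maps chosen by edit_fn but the edited prompt's values.
        z_edit = toy_step(z_edit, edit_tokens, edit_fn(attn_src, attn_edit, step))
    return z_src, z_edit

rng = np.random.default_rng(1)
src = rng.normal(size=(3, 4))                  # 3 tokens, embedding size 4
edit = src.copy()
edit[2] = rng.normal(size=4)                   # swap the third token's embedding
inject_all = lambda attn_src, attn_edit, step: attn_src
z_src, z_edit = prompt_to_prompt(src, edit, inject_all)
```

With `inject_all`, the edited pass reuses the source maps at every step, so where each token acts is identical in both passes and only what the swapped token contributes differs.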

Operation	Prompt change	Attention manipulation	Typical effect
Word swap	Replace a token (for example "bicycle" to "car")	Inject the original maps for the shared tokens; the swapped token receives the original token's map for the early steps, then is released	Replace an object while keeping pose, position, and scene layout
Adding a phrase	Append new words ("a tree" to "a tree in the snow")	Keep (freeze) the original maps for the unchanged tokens; let the new tokens form fresh attention	Add or change an attribute or context while preserving existing content
Re-weighting	Same tokens	Scale a chosen token's cross-attention values up or down by a factor	Strengthen or weaken how much a word affects the image (more or less "snow", "fluffy", and so on)

For a word swap, the original and edited prompts share most tokens, so the original attention maps are injected for the unchanged words. For the replaced word, the original map is injected for the first fraction of the denoising steps (controlled by a parameter the paper denotes tau) to lock in the location and shape, after which the model uses its own attention so the new object can adapt its texture and details. Because the two prompts may have different token positions, an alignment function maps tokens of the new prompt to the corresponding tokens of the old one ^[1].
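The step-dependent switch for a word swap can be sketched as a small function. This is an illustration only: here `tau` is treated as the fraction of sampling steps, counted from the start of sampling, during which the source maps are injected. That convention is an assumption for readability and may not match the paper's indexing of timesteps.

```python
def word_swap_edit(attn_src, attn_edit, step, num_steps, tau=0.5):
    """Use the source map for the first tau fraction of steps, then release."""
    inject_until = int(tau * num_steps)
    return attn_src if step < inject_until else attn_edit

# Strings stand in for attention maps to make the schedule visible.
schedule = [word_swap_edit("source", "edited", s, num_steps=10, tau=0.5)
            for s in range(10)]
```

A larger `tau` keeps the original geometry for longer at the cost of giving the new object less freedom to change shape; a smaller `tau` does the opposite.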

For adding a phrase (prompt refinement), the edited prompt is a superset of the original. The attention of the previously existing tokens is frozen to the original maps so the existing content does not move, while attention is allowed to flow freely to the newly added tokens, which introduces the new element or style ^[1]^[2].
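A simplified version of the alignment and freezing logic for prompt refinement is shown below. It aligns whitespace-separated words using Python's standard `difflib`, whereas a real implementation works on tokenizer tokens; the helper names `align_tokens` and `refine_edit` are hypothetical and chosen for this sketch.

```python
import difflib
import numpy as np

def align_tokens(src_words, edit_words):
    """For each edited-prompt word, the index of its source word, or -1 if new."""
    alignment = [-1] * len(edit_words)
    matcher = difflib.SequenceMatcher(None, src_words, edit_words, autojunk=False)
    for block in matcher.get_matching_blocks():
        for offset in range(block.size):
            alignment[block.b + offset] = block.a + offset
    return alignment

def refine_edit(attn_src, attn_edit, alignment):
    """Freeze columns of shared tokens to the source maps; keep new tokens' maps."""
    out = attn_edit.copy()
    for edit_idx, src_idx in enumerate(alignment):
        if src_idx >= 0:
            out[:, edit_idx] = attn_src[:, src_idx]
    return out

alignment = align_tokens("a tree".split(), "a tree in the snow".split())
rng = np.random.default_rng(0)
attn_src = rng.random((16, 2))     # maps for the 2 source words
attn_edit = rng.random((16, 5))    # maps for the 5 edited-prompt words
attn_used = refine_edit(attn_src, attn_edit, alignment)
```

In this example the columns for "a" and "tree" are copied from the source pass, while "in", "the", and "snow" keep the attention that the edited prompt produced.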

For attention re-weighting, the prompt is unchanged but the cross-attention values associated with a particular token are multiplied by a scalar. Increasing the factor strengthens that word's effect and decreasing it (including to negative values) attenuates or removes it, giving continuous control over an attribute that is hard to express in words ^[1].
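Re-weighting amounts to scaling one column of the attention map. The sketch below assumes a map of shape (pixels, tokens) and uses an illustrative function name; it leaves the other columns untouched and does not renormalize.

```python
import numpy as np

def reweight(attn, token_index, scale):
    """Multiply the attention column of one token by a scalar factor."""
    out = attn.copy()
    out[:, token_index] *= scale
    return out

attn = np.array([[0.2, 0.8],
                 [0.5, 0.5]])
stronger = reweight(attn, token_index=1, scale=2.0)   # emphasize token 1
weaker = reweight(attn, token_index=1, scale=0.0)     # suppress token 1
```

Sweeping `scale` over a range of values gives a family of images in which only the strength of the chosen word varies.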

All three operations are training-free, run at inference time on a frozen model, and require no segmentation mask from the user; the spatial extent of an edit is derived automatically from the attention maps ^[1]. The public implementation exposes parameters such as cross_replace_steps (the fraction of steps for which cross-attention maps are replaced) and self_replace_steps (the analogous fraction for self-attention maps), plus an optional LocalBlend object that restricts an edit to the region indicated by selected words, improving locality for difficult edits ^[3].
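The idea behind restricting an edit to the region indicated by selected words can be sketched as a mask-and-blend step. This is a simplified illustration, not the repository's `LocalBlend` code: the function name, the array layout, and the threshold value are assumptions, and details such as smoothing or resizing the maps are omitted.

```python
import numpy as np

def local_blend(z_src, z_edit, attn_src, attn_edit, src_token, edit_token,
                threshold=0.3):
    """Keep the edited latent only where the selected words attend strongly.

    z_*:    (h, w, c) latents from the source and edited passes
    attn_*: (h, w, num_tokens) attention maps at the latent resolution
    """
    def mask_for(attn, token):
        heat = attn[..., token]
        return heat / heat.max() > threshold   # normalize, then threshold

    mask = mask_for(attn_src, src_token) | mask_for(attn_edit, edit_token)
    return np.where(mask[..., None], z_edit, z_src), mask

attn = np.full((4, 4, 2), 0.01)
attn[:2, :2, 1] = 1.0                          # token 1 concentrated top-left
z_src = np.zeros((4, 4, 3))
z_edit = np.ones((4, 4, 3))
blended, mask = local_blend(z_src, z_edit, attn, attn, src_token=1, edit_token=1)
```

Outside the mask the latent is copied from the source pass, so regions unrelated to the selected words are left unchanged by the edit.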

Editing real images (Null-text Inversion)

Prompt-to-Prompt as described edits images that the model itself generated, because it needs the initial noise that produced them. To edit a real photograph, the image must first be inverted into the diffusion model's noise space so that re-running the denoiser reconstructs it. Deterministic DDIM inversion provides such a trajectory, but the authors found that DDIM inversion is not accurate enough once classifier-free guidance is applied, producing visible distortion and a distortion-versus-editability trade-off that degrades subsequent edits ^[1]^[4].

To address this, Mokady, Hertz, Aberman, Pritch, and Cohen-Or introduced Null-text Inversion in the paper "Null-text Inversion for Editing Real Images using Guided Diffusion Models," posted to arXiv in November 2022 and published at CVPR 2023 ^[4]. The method takes a real image and a caption, computes an initial DDIM inversion as a pivot trajectory, and then optimizes only the null-text (unconditional) embedding used in classifier-free guidance, per timestep, so that the guided denoising reconstructs the input image faithfully. The model weights and the conditional text embedding are left untouched. After this reconstruction is achieved, the image is edited by changing the caption and applying Prompt-to-Prompt as usual ^[4]. Null-text Inversion was built on Stable Diffusion and made high-fidelity, text-only editing of arbitrary real images practical, and it became a common front-end for attention-control editing pipelines ^[4].
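The per-timestep optimization described above can be written compactly. The notation below is chosen for this illustration: $z^*_T, \dots, z^*_0$ is the pivot trajectory from DDIM inversion, $C$ is the caption embedding, $\varnothing_t$ is the null-text embedding at step $t$, $w$ is the guidance scale, and $z_{t-1}(\cdot)$ denotes one DDIM sampling step using the guided noise prediction.

```latex
\tilde{\varepsilon}_\theta(z_t, t, C, \varnothing_t)
  = w \,\varepsilon_\theta(z_t, t, C) + (1 - w)\,\varepsilon_\theta(z_t, t, \varnothing_t)

\min_{\varnothing_t} \;
  \bigl\lVert z^*_{t-1} - z_{t-1}(\bar{z}_t, \varnothing_t, C) \bigr\rVert_2^2,
  \qquad t = T, \dots, 1

\bar{z}_T = z^*_T, \qquad
\bar{z}_{t-1} = z_{t-1}(\bar{z}_t, \varnothing_t, C)
```

Only $\varnothing_t$ is updated in each minimization; the network weights $\theta$ and the caption embedding $C$ stay fixed, which is what allows the same optimized embeddings to be reused when the caption is later edited.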

Relationship to other editing methods

Prompt-to-Prompt sits within a broader landscape of diffusion editing techniques:

- SDEdit (Meng et al., ICLR 2022) edits by adding a controlled amount of noise to an input image and then denoising it under a new condition, trading off realism against faithfulness through the noise level. SDEdit is also training-free and mask-free, but it perturbs the image globally and does not explicitly preserve the original structure the way attention injection does ^[6].
- InstructPix2Pix (Brooks, Holynski, Efros, CVPR 2023) turns editing into a supervised, instruction-following task. It uses GPT-3 to generate edit instructions and uses Stable Diffusion together with Prompt-to-Prompt to synthesize a dataset of over 450,000 before-and-after image pairs, then trains a conditional diffusion model on those pairs. Prompt-to-Prompt is what keeps each synthetic pair structurally consistent so the model learns the intended edit rather than an unrelated regeneration ^[5].
- ControlNet and similar adapters constrain generation with explicit spatial signals (edges, depth, pose) and require training auxiliary networks, a different mechanism from Prompt-to-Prompt's training-free reuse of internal attention.
- Personalization methods such as Textual Inversion and DreamBooth target a complementary problem, teaching a model a new subject or concept, and can be combined with attention-based editing.

Prompt-to-Prompt also seeded a line of attention-control editing research. Later methods generalize the idea by injecting or sharing self-attention as well as cross-attention features (for example Plug-and-Play diffusion features and MasaCtrl), and by combining attention control with improved real-image inversion. These approaches inherit the core principle that the internal attention of a diffusion model is an editable, structure-carrying representation ^[1]^[4].

Limitations

The authors and subsequent work note several limitations ^[1]^[4]:

- Real-image editing depends on inversion quality. DDIM inversion with guidance can distort the reconstruction; Null-text Inversion mitigates but does not fully eliminate this, and it adds a per-image optimization step.
- The cross-attention maps are low resolution because they live in the U-Net bottleneck (in Imagen, layout is set at 64x64), which limits fine spatial control of edits.
- The method cannot spatially relocate an existing object; it preserves layout rather than rearranging it, so moving or reposing an object is outside its scope.
- Word swaps work best when the new object is geometrically compatible with the original. Replacing an object with one of very different shape can leave the injected structure looking unnatural.
- Edits are expressed through prompt changes, so complex compositions can require carefully chosen prompts, and a poorly aligned caption for a real image can reduce edit fidelity.

References

1. Hertz, A., Mokady, R., Tenenbaum, J., Aberman, K., Pritch, Y., Cohen-Or, D. "Prompt-to-Prompt Image Editing with Cross Attention Control." arXiv:2208.01626 (2 August 2022); ICLR 2023. https://arxiv.org/abs/2208.01626
2. Prompt-to-Prompt project page. https://prompt-to-prompt.github.io/
3. google/prompt-to-prompt, GitHub repository (Apache-2.0 license). https://github.com/google/prompt-to-prompt
4. Mokady, R., Hertz, A., Aberman, K., Pritch, Y., Cohen-Or, D. "Null-text Inversion for Editing Real Images using Guided Diffusion Models." arXiv:2211.09794 (November 2022); CVPR 2023. https://arxiv.org/abs/2211.09794
5. Brooks, T., Holynski, A., Efros, A. A. "InstructPix2Pix: Learning to Follow Image Editing Instructions." arXiv:2211.09800 (November 2022); CVPR 2023. https://arxiv.org/abs/2211.09800
6. Meng, C., He, Y., Song, Y., Song, J., Wu, J., Zhu, J.-Y., Ermon, S. "SDEdit: Guided Image Synthesis and Editing with Stochastic Differential Equations." arXiv:2108.01073 (2021); ICLR 2022. https://arxiv.org/abs/2108.01073
