Prompt-to-Prompt
Last reviewed
Jun 8, 2026
Sources
6 citations
Review status
Source-backed
Revision
v1 · 1,798 words
Prompt-to-Prompt is a training-free image editing technique for text-conditioned diffusion models that edits a generated image by manipulating the model's cross-attention maps when the text prompt is changed [1]. The central observation is that, in a text-to-image diffusion model, the cross-attention maps that bind each text token to a region of the image largely determine the spatial layout of the result. By holding those maps fixed (or selectively modifying them) while swapping a word, adding a phrase, or changing a token's weight, Prompt-to-Prompt produces an edited image that preserves the geometry and composition of the original, without any per-image fine-tuning and without a user-supplied mask [1].
The method was introduced in the paper "Prompt-to-Prompt Image Editing with Cross Attention Control" by Amir Hertz, Ron Mokady, Jay Tenenbaum, Kfir Aberman, Yael Pritch, and Daniel Cohen-Or, researchers at Google Research and Tel Aviv University. The paper was posted to arXiv on 2 August 2022 and was published at the International Conference on Learning Representations (ICLR) in 2023 [1]. The original work used Google's Imagen as the backbone model, while the public reference implementation released by Google targets Stable Diffusion and latent diffusion [2][3].
Prompt-to-Prompt became an influential primitive in diffusion-based editing. Its attention-control idea underlies a family of later "attention injection" editing methods, it was paired with Null-text Inversion to extend editing to real photographs, and it was used to synthesize the paired training data for the instruction-following editor InstructPix2Pix [4][5].
A text-to-image diffusion model generates an image by iteratively denoising a noisy latent (or pixel) tensor, conditioned at every step on a text prompt. The conditioning is injected through cross-attention layers placed throughout the denoising U-Net. At each such layer, the spatial features of the image act as queries and the per-token text embeddings act as keys and values. The resulting attention map is a set of per-token spatial heatmaps: for every word in the prompt, there is a 2D map over image locations indicating how strongly that word attends to (and therefore influences) each region [1].
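The per-token heatmaps described above can be illustrated with a minimal NumPy sketch of a single cross-attention head. The function name, the projection matrices `w_q` and `w_k`, and all tensor sizes are assumptions chosen for illustration; a real U-Net uses learned projections, multiple heads, and much larger feature maps.

```python
import numpy as np

def cross_attention_maps(image_feats, text_embeds, w_q, w_k):
    """Return attention weights of shape (num_pixels, num_tokens).

    Image features are projected to queries and token embeddings to keys.
    Each row is a softmax over tokens for one image location.
    """
    q = image_feats @ w_q                      # (pixels, d)
    k = text_embeds @ w_k                      # (tokens, d)
    logits = q @ k.T / np.sqrt(q.shape[-1])    # scaled dot product
    logits -= logits.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(logits)
    return weights / weights.sum(axis=-1, keepdims=True)

# Toy sizes (hypothetical): a 4x4 feature map and a 3-token prompt.
rng = np.random.default_rng(0)
height = width = 4
num_tokens, c_img, c_txt, d = 3, 8, 6, 5
image_feats = rng.normal(size=(height * width, c_img))
text_embeds = rng.normal(size=(num_tokens, c_txt))
w_q = rng.normal(size=(c_img, d))
w_k = rng.normal(size=(c_txt, d))

maps = cross_attention_maps(image_feats, text_embeds, w_q, w_k)
# One 2D heatmap per token: heatmaps[j] shows where token j attends.
heatmaps = maps.T.reshape(num_tokens, height, width)
```

Reshaping the column for one token back to the spatial grid gives the heatmap that Prompt-to-Prompt inspects and edits.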
Hertz and colleagues showed that these maps are interpretable and that they encode the layout of the scene early in the denoising process. The pixels that a token such as "bear" attends to correspond closely to where the bear ends up in the image, and the rough structure of the composition is fixed in the first denoising steps before fine texture is filled in [1]. In Imagen the composition is determined mainly at the low 64x64 base resolution, where the cross-attention sits in the network's bottleneck, so the maps are spatially coarse but semantically meaningful [1]. This insight is the foundation of the method: if the spatial layout is carried by the cross-attention maps, then to keep the layout while changing content it is sufficient to keep (or controllably edit) those maps rather than the image itself.
A related lever is classifier-free guidance, which the underlying diffusion model uses to amplify the influence of the text prompt during sampling. Guidance interacts with the editing operations and with the inversion step used for real images (see below) [1][4].
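As a rough illustration, classifier-free guidance is commonly written as an extrapolation from the unconditional noise prediction toward the text-conditioned one. The sketch below assumes that standard formulation; the function name and the example values are hypothetical and are not taken from the cited sources.

```python
import numpy as np

def guided_noise(eps_uncond, eps_cond, guidance_scale):
    """Combine unconditional and text-conditioned noise predictions.

    A scale of 1 returns the conditional prediction unchanged; larger
    scales push the result further in the direction of the prompt.
    """
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)

# Hypothetical predictions for a 3-element latent.
eps_uncond = np.array([0.10, -0.20, 0.05])
eps_cond = np.array([0.30, -0.10, 0.00])
eps_guided = guided_noise(eps_uncond, eps_cond, guidance_scale=7.5)
```

Because the unconditional branch enters every guided step, changing how it is computed changes the whole sampling trajectory, which is why guidance matters for inverting real images.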
Prompt-to-Prompt edits an image by running two synchronized denoising passes: one with the original prompt and one with the edited prompt, starting from the same random seed (the same initial noise). During the edited pass, the cross-attention maps computed from the original prompt are injected into the network, overriding or blending with the maps that the edited prompt would otherwise produce [1]. The choice of which maps to inject, and for how many denoising steps, defines three editing operations.
| Operation | Prompt change | Attention manipulation | Typical effect |
|---|---|---|---|
| Word swap | Replace a token (for example "bicycle" to "car") | Inject the original maps for the shared tokens; the swapped token receives the original token's map for the early steps, then is released | Replace an object while keeping pose, position, and scene layout |
| Adding a phrase | Append new words ("a tree" to "a tree in the snow") | Keep (freeze) the original maps for the unchanged tokens; let the new tokens form fresh attention | Add or change an attribute or context while preserving existing content |
| Re-weighting | Same tokens | Scale a chosen token's cross-attention values up or down by a factor | Strengthen or weaken how much a word affects the image (more or less "snow", "fluffy", and so on) |
For a word swap, the original and edited prompts share most tokens, so the original attention maps are injected for the unchanged words. For the replaced word, the original map is injected for the first fraction of the denoising steps (controlled by a parameter the paper denotes tau) to lock in the location and shape, after which the model uses its own attention so the new object can adapt its texture and details. Because the two prompts may have different token positions, an alignment function maps tokens of the new prompt to the corresponding tokens of the old one [1].
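The step-dependent injection for a word swap can be sketched as follows. This is a simplified illustration, not the reference implementation: it assumes the two prompts have the same number of tokens (a one-for-one swap, so no alignment is needed), treats `tau` as a fraction of steps counted from the start of sampling, and uses random arrays as stand-ins for the attention maps of the two synchronized passes.

```python
import numpy as np

def word_swap_attention(orig_maps, edit_maps, step, num_steps, tau):
    """Pick the cross-attention maps the edited pass uses at one step.

    tau is the fraction of denoising steps (counted from the start of
    sampling) during which the original prompt's maps are injected.
    """
    if step < int(tau * num_steps):
        return orig_maps   # early steps: lock in location and shape
    return edit_maps       # later steps: the new token adapts its details

rng = np.random.default_rng(0)
num_steps, tau = 10, 0.5
injected_steps = []
for step in range(num_steps):
    # Stand-ins for the maps produced by the two passes at this step.
    orig_maps = rng.random((16, 4))
    edit_maps = rng.random((16, 4))
    used = word_swap_attention(orig_maps, edit_maps, step, num_steps, tau)
    if used is orig_maps:
        injected_steps.append(step)
```

A larger `tau` keeps the edited image closer to the original layout, while a smaller value gives the new object more freedom to change shape.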
For adding a phrase (prompt refinement), the edited prompt is a superset of the original. The attention of the previously existing tokens is frozen to the original maps so the existing content does not move, while attention is allowed to flow freely to the newly added tokens, which introduces the new element or style [1][2].
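A minimal sketch of this refinement rule is shown below. The greedy left-to-right word matcher is a simplifying assumption made for illustration (a real pipeline aligns tokenizer outputs), and both function names are hypothetical.

```python
import numpy as np

def align_tokens(orig_tokens, new_tokens):
    """Map each new-prompt token index to an original index, or None.

    Simplified greedy matching: tokens are consumed in order, and any
    new token that does not match the next original token is 'added'.
    """
    mapping, pos = [], 0
    for tok in new_tokens:
        if pos < len(orig_tokens) and tok == orig_tokens[pos]:
            mapping.append(pos)
            pos += 1
        else:
            mapping.append(None)
    return mapping

def refine_attention(orig_maps, edit_maps, mapping):
    """Freeze maps of shared tokens; keep fresh maps for added tokens."""
    out = edit_maps.copy()
    for j, src in enumerate(mapping):
        if src is not None:
            out[:, j] = orig_maps[:, src]
    return out

orig_tokens = "a tree".split()
new_tokens = "a tree in the snow".split()
mapping = align_tokens(orig_tokens, new_tokens)

rng = np.random.default_rng(0)
orig_maps = rng.random((16, len(orig_tokens)))   # (pixels, tokens)
edit_maps = rng.random((16, len(new_tokens)))
refined = refine_attention(orig_maps, edit_maps, mapping)
```

Here "a" and "tree" reuse the original maps, while "in", "the", and "snow" keep the attention that the edited pass computed for them.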
For attention re-weighting, the prompt is unchanged but the cross-attention values associated with a particular token are multiplied by a scalar. Increasing the factor strengthens that word's effect and decreasing it (including to negative values) attenuates or removes it, giving continuous control over an attribute that is hard to express in words [1].
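Re-weighting amounts to scaling one column of the attention map, as in the following sketch. The function name, the token index, and the scale values are illustrative assumptions.

```python
import numpy as np

def reweight_attention(maps, token_index, scale):
    """Scale the attention that every pixel assigns to one token.

    maps has shape (num_pixels, num_tokens); other tokens are untouched.
    """
    out = maps.copy()
    out[:, token_index] *= scale
    return out

rng = np.random.default_rng(0)
maps = rng.random((16, 5))   # stand-in attention maps for a 5-token prompt
snow_index = 4               # hypothetical position of the word "snow"

more_snow = reweight_attention(maps, snow_index, 2.0)
less_snow = reweight_attention(maps, snow_index, 0.25)
```

Sweeping the scale over a range of values produces a graded series of images in which only the targeted attribute changes in strength.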
All three operations are training-free, run at inference time on a frozen model, and require no segmentation mask from the user; the spatial extent of an edit is derived automatically from the attention maps [1]. The public implementation exposes parameters such as cross_replace_steps (the fraction of steps for which cross-attention maps are replaced) and self_replace_steps (the analogous fraction for self-attention maps), plus an optional LocalBlend object that restricts an edit to the region indicated by selected words, improving locality for difficult edits [3].
As described so far, Prompt-to-Prompt can edit only images that the model itself generated, because it needs the initial noise that produced them. To edit a real photograph, the image must first be inverted into the diffusion model's noise space so that re-running the denoiser reconstructs it. Deterministic DDIM inversion provides such a trajectory, but the authors found that it is not accurate enough once classifier-free guidance is applied: reconstructions show visible distortion, and the resulting trade-off between distortion and editability degrades subsequent edits [1][4].
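The idea of running the deterministic DDIM update in reverse can be sketched on a toy problem. Everything here is an assumption for illustration: `toy_eps` is a stand-in for the noise-prediction network, the `alphas` schedule is arbitrary, and no guidance is applied.

```python
import numpy as np

def ddim_move(x, eps, alpha_from, alpha_to):
    """Deterministic DDIM update between two noise levels.

    alpha_* are cumulative signal coefficients (alpha-bar). The same
    formula serves sampling (toward less noise) and inversion (toward
    more noise), depending on the direction of travel.
    """
    x0_pred = (x - np.sqrt(1 - alpha_from) * eps) / np.sqrt(alpha_from)
    return np.sqrt(alpha_to) * x0_pred + np.sqrt(1 - alpha_to) * eps

def toy_eps(x, t):
    """Stand-in for the noise-prediction network (hypothetical)."""
    return np.full_like(x, 0.3)

alphas = np.linspace(0.99, 0.05, 6)   # index 0 is nearly clean

def invert(x0):
    x = x0
    for t in range(len(alphas) - 1):          # clean -> noisy
        x = ddim_move(x, toy_eps(x, t), alphas[t], alphas[t + 1])
    return x

def sample(x_noisy):
    x = x_noisy
    for t in range(len(alphas) - 1, 0, -1):   # noisy -> clean
        x = ddim_move(x, toy_eps(x, t), alphas[t], alphas[t - 1])
    return x

x0 = np.array([0.5, -1.0, 2.0])
reconstruction = sample(invert(x0))
```

The round trip is exact in this toy only because the stand-in predictor ignores its input. With a real network, inversion evaluates the predictor at a slightly different point than sampling does, so a small error is introduced at every step.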
To address this, Mokady, Hertz, Aberman, Pritch, and Cohen-Or introduced Null-text Inversion in the paper "Null-text Inversion for Editing Real Images using Guided Diffusion Models," posted to arXiv in November 2022 and published at CVPR 2023 [4]. The method takes a real image and a caption, computes an initial DDIM inversion as a pivot trajectory, and then optimizes only the null-text (unconditional) embedding used in classifier-free guidance, per timestep, so that the guided denoising reconstructs the input image faithfully. The model weights and the conditional text embedding are left untouched. After this reconstruction is achieved, the image is edited by changing the caption and applying Prompt-to-Prompt as usual [4]. Null-text Inversion was built on Stable Diffusion and made high-fidelity, text-only editing of arbitrary real images practical, and it became a common front-end for attention-control editing pipelines [4].
Prompt-to-Prompt sits within a broader landscape of diffusion editing techniques:
Prompt-to-Prompt also seeded a line of attention-control editing research. Later methods generalize the idea by injecting or sharing self-attention as well as cross-attention features (for example Plug-and-Play diffusion features and MasaCtrl), and by combining attention control with improved real-image inversion. These approaches inherit the core principle that the internal attention of a diffusion model is an editable, structure-carrying representation [1][4].
The authors and subsequent work note several limitations [1][4]: