Emu Edit
Last reviewed
Jun 3, 2026
Sources
6 citations
Review status
Source-backed
Revision
v1 · 1,638 words
Emu Edit is an instruction-based image editing model from Meta AI, announced on November 16, 2023 alongside the text-to-video model Emu Video. It edits a source image according to a free-form natural-language instruction, such as "Dress the emu with a fireman outfit," while leaving the rest of the image untouched. The work was published as "Emu Edit: Precise Image Editing via Recognition and Generation Tasks" by Shelly Sheynin, Adam Polyak, Uriel Singer, Yuval Kirstain, Amit Zohar, Oron Ashual, Devi Parikh, and Yaniv Taigman of Meta's GenAI group, and was later presented at CVPR 2024.[1][2][3]
The central idea is that an image editor should change only the pixels relevant to the request. Meta's example was adding the word "Aloha!" to a baseball cap: the text appears, but the cap itself stays the same.[3] Earlier instruction-based editors, including InstructPix2Pix, could often follow an instruction in spirit but altered the whole image or failed on operations slightly outside their training distribution. Emu Edit addresses this through two design choices that the authors describe as essential to its performance: training a single model across a very wide range of tasks, and giving the model a learned embedding for each task so it knows which operation to perform.[1]
Emu Edit is built on top of Emu, Meta's text-to-image foundation model. Emu is a latent diffusion model trained in two stages: a large pre-training phase followed by a quality fine-tuning phase on a small, highly curated set of images. Emu Edit adapts that base architecture: a large U-Net with roughly 2.8 billion parameters, text conditioning from a CLIP ViT-L encoder together with the T5-XXL text encoder, a 16-channel autoencoder, and a pre-training set of about 1.1 billion images.[2]
To turn Emu from a text-to-image generator into an instruction-based editor, the model is conditioned on two extra inputs: the image to be modified and the editing instruction. The image is supplied to the U-Net by increasing the number of input channels, with the new weights initialized to zero so the model starts close to the original Emu behavior. At inference the model uses classifier-free guidance on both the image and the text condition. The same lineage links the broader family: Emu Video also fine-tunes an Emu foundation model, in that case on 34 million video-text pairs.[2][3]
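The two mechanisms in this paragraph can be illustrated with a toy sketch. It is not Meta's code: the convolution is reduced to a per-pixel matrix product, the names `expand_input_weights` and `dual_cfg` are hypothetical, the guidance scales are arbitrary, and the two-scale guidance formula follows the form popularized by InstructPix2Pix, which is an assumption here rather than something the article states.

```python
# Sketch only: names and numbers are illustrative, not from any Emu Edit release.

def expand_input_weights(weights, extra_in):
    """Append `extra_in` zero columns to each row of a 1x1-conv weight matrix.

    weights[o][i] maps input channel i to output channel o.
    """
    return [row + [0.0] * extra_in for row in weights]


def conv1x1(weights, pixel):
    """Apply the 1x1 convolution to a single pixel (a list of channel values)."""
    return [sum(w * x for w, x in zip(row, pixel)) for row in weights]


def dual_cfg(eps_uncond, eps_image, eps_full, scale_image, scale_text):
    """Combine three noise predictions with separate image and text guidance scales.

    eps_uncond: prediction with neither condition
    eps_image:  prediction with the input image only
    eps_full:   prediction with the input image and the instruction
    """
    return [
        u + scale_image * (i - u) + scale_text * (f - i)
        for u, i, f in zip(eps_uncond, eps_image, eps_full)
    ]


# Original layer: 2 noisy-latent channels in, 2 channels out.
base = [[0.5, -1.0], [2.0, 0.25]]
# Edited layer: 2 extra input channels carry the encoded source image.
expanded = expand_input_weights(base, extra_in=2)

noisy_latent = [1.0, 3.0]
source_latent = [7.0, -4.0]

before = conv1x1(base, noisy_latent)
after = conv1x1(expanded, noisy_latent + source_latent)
# Zero-initialized weights mean the new channels contribute nothing at first.
assert before == after

guided = dual_cfg([0.0, 0.0], [1.0, 0.0], [1.0, 2.0], scale_image=1.5, scale_text=7.5)
```

With both scales set to 1.0 the combination reduces to the fully conditioned prediction, so the scales can be read as how strongly each condition is amplified beyond its default influence.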
Rather than training a specialist editor, the authors train Emu Edit jointly on sixteen distinct tasks, all reformulated as generative (image-to-image) problems. The paper groups these into three families.[2]
| Task family | Tasks | Description |
|---|---|---|
| Region-based editing | Local, Remove, Add, Texture, Background | Substituting or altering a specific object, erasing an object, inserting a new object, changing an object's texture without changing its structure, and changing the scene background |
| Free-form editing | Global, Style, Text editing | Edits affecting the whole image or that cannot be described by a mask, style changes, and adding, removing, swapping, or restyling text within the image |
| Vision tasks | Detect, Segment, Color, Image-to-image translation | Marking an object with a bounding box, isolating and marking an object, adjustments such as sharpening or blurring, and translations such as sketch-to-image, depth-to-image, normal-to-image, pose-to-image, and segmentation-map-to-image |
A core finding is that the recognition-style computer vision tasks (detection, segmentation, and image-to-image translation) improve editing quality rather than merely adding capabilities. In an ablation, removing detection and segmentation lowered performance on region-based editing, and removing image-to-image translation lowered performance on free-form editing. The authors hypothesize that recognition tasks sharpen the model's ability to localize edits, while image-to-image tasks help it understand overall image structure for global edits.[2] The phrase "Recognition and Generation Tasks" in the paper's title refers to this combination.
Because no existing corpus covered this task range with enough diversity and quality, the team built its own dataset of ten million examples. Each example consists of an input image, an input caption, an instruction text, a target image, and a task index (one of the sixteen). Editing instructions and captions were generated using a dialogue-optimized 70-billion-parameter Llama 2 variant, prompted separately per task to avoid the bias a single prompt introduces. Input and edited image pairs were synthesized so that only the relevant regions differ, using a mask-extraction step driven by the language model. A filtering pass removed roughly 70% of generated pairs using a task predictor, CLIP filtering, depth-based structure-preservation checks, and object detectors, leaving the final ten million samples. Meta described this as, to its knowledge, the largest dataset of its kind.[2][3]
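The record layout and the multi-stage filtering pass described above might be sketched as follows. The field names, the `EditExample` class, and the two filter functions are hypothetical stand-ins; the real pipeline used a task predictor, CLIP scores, depth checks, and object detectors, none of which are reproduced here.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class EditExample:
    """One training record; field names are illustrative, not Meta's schema."""
    input_image: str      # path or identifier of the source image
    input_caption: str
    instruction: str
    target_image: str     # path or identifier of the edited image
    task_index: int       # which of the sixteen tasks this example belongs to


# A filter takes an example and returns True if the example should be kept.
Filter = Callable[[EditExample], bool]


def apply_filters(examples: List[EditExample], filters: List[Filter]) -> List[EditExample]:
    """Keep only the examples that pass every filter."""
    return [ex for ex in examples if all(f(ex) for f in filters)]


# Hypothetical stand-ins for the real quality checks.
def has_instruction(ex: EditExample) -> bool:
    return bool(ex.instruction.strip())


def valid_task(ex: EditExample) -> bool:
    return 0 <= ex.task_index < 16


raw = [
    EditExample("a.png", "an emu", "add a red hat", "a_edit.png", 2),
    EditExample("b.png", "a cap", "   ", "b_edit.png", 7),
    EditExample("c.png", "a dog", "remove the dog", "c_edit.png", 42),
]
kept = apply_filters(raw, [has_instruction, valid_task])
```

Expressing each check as an independent predicate keeps the stages composable, so an additional check can be appended to the list without touching the others.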
The mechanism that tells Emu Edit which operation to apply is a learned task embedding. For each of the tasks, the model learns a unique embedding vector, stored in an embedding table and optimized jointly with the U-Net weights during training. The embedding is integrated into the network in two ways: through cross-attention interactions, and by adding it to the timestep embeddings.[2]
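A minimal sketch of the two integration routes follows, using plain Python lists in place of tensors. The toy dimension, the random initialization, and the function name `condition_on_task` are assumptions; in particular, appending the task vector as an extra cross-attention context token is one plausible reading of "cross-attention interactions", not a confirmed implementation detail.

```python
import random

NUM_TASKS = 16      # task count given in the article
EMBED_DIM = 4       # toy size, chosen for readability

random.seed(0)
# Embedding table: one learnable vector per task (randomly initialized here).
task_table = [
    [random.gauss(0.0, 0.02) for _ in range(EMBED_DIM)]
    for _ in range(NUM_TASKS)
]


def condition_on_task(task_index, timestep_emb, text_tokens):
    """Return the (timestep embedding, cross-attention context) pair for one task.

    Route 1: the task vector is added element-wise to the timestep embedding,
             so both must share the same dimension.
    Route 2: the task vector is appended to the text tokens as an extra
             context token for cross-attention (an assumed realization).
    """
    v = task_table[task_index]
    new_timestep = [t + x for t, x in zip(timestep_emb, v)]
    new_context = text_tokens + [v]   # builds a new list; the input is not mutated
    return new_timestep, new_context


timestep_emb = [0.1, 0.2, 0.3, 0.4]
text_tokens = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
t_emb, context = condition_on_task(3, timestep_emb, text_tokens)
```

In a real training loop the rows of the table would receive gradients along with the U-Net weights, which is what "optimized jointly" refers to.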
The motivation is that complex or ambiguous instructions can leave a model "perplexed" about what kind of edit is wanted. The paper illustrates this with a model trained without task embeddings, which might perform a global edit when a texture edit was intended, or segment an image when a global change was requested. At inference time the correct task index is not known in advance, so the authors fine-tune a Flan-T5-XL model to predict the task from the instruction. In an ablation on the validation set, conditioning on the task embedding improved results over no conditioning, and the learned task predictor nearly closed the gap to using the ground-truth task.[2]
The same embedding design supports adaptation to new tasks through a procedure the authors call task inversion. Given a few examples of an unseen task, the U-Net weights are frozen and only a new task embedding is learned to fit that task. The model then handles the new operation while retaining its original abilities. Reported new tasks include image inpainting, 4x super-resolution, object contour detection, and compositions of editing operations such as add-then-detect. A single example was enough to improve performance noticeably, and around 100 examples nearly matched an expert model trained on 100,000 examples, which is useful when labeled data is scarce.[1][2]
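The idea of freezing the network and fitting only a new task vector can be shown on a toy problem. Everything below is an illustrative assumption: the "network" is a fixed linear map plus an additive task vector, the unseen task is a constant output shift, and the optimizer is plain gradient descent on mean squared error. It shows the shape of the procedure, not the paper's training setup.

```python
def predict(W, v, x):
    """Toy 'network': frozen linear map W plus a task vector v added to the output."""
    return [sum(w * xi for w, xi in zip(row, x)) + vi for row, vi in zip(W, v)]


def loss(W, v, examples):
    """Mean squared error over (input, target) pairs."""
    total = 0.0
    for x, y in examples:
        total += sum((p - t) ** 2 for p, t in zip(predict(W, v, x), y))
    return total / len(examples)


def task_inversion(W, examples, steps=200, lr=0.1):
    """Fit a new task vector by gradient descent while leaving W untouched."""
    v = [0.0] * len(W)
    for _ in range(steps):
        grad = [0.0] * len(v)
        for x, y in examples:
            pred = predict(W, v, x)
            for k in range(len(v)):
                # Derivative of the mean squared error with respect to v[k].
                grad[k] += 2.0 * (pred[k] - y[k]) / len(examples)
        v = [vk - lr * gk for vk, gk in zip(v, grad)]
    return v


W = [[1.0, 0.0], [0.0, 1.0]]              # frozen "pre-trained" weights
frozen_copy = [row[:] for row in W]
# A few examples of an unseen task: here, "shift the output by (2, -1)".
few_shot = [
    ([0.0, 0.0], [2.0, -1.0]),
    ([1.0, 1.0], [3.0, 0.0]),
    ([2.0, -1.0], [4.0, -2.0]),
]

loss_before = loss(W, [0.0, 0.0], few_shot)
v_new = task_inversion(W, few_shot)
loss_after = loss(W, v_new, few_shot)
```

Because only `v` is updated, the frozen weights are identical before and after fitting, which is why the original tasks (each with its own stored vector) are unaffected by learning the new one.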
To support more rigorous evaluation, Meta released the Emu Edit Test Set, a benchmark spanning seven editing categories: background alteration (background), comprehensive image changes (global), style alteration (style), object removal (remove), object addition (add), localized modifications (local), and color or texture alterations (texture). The benchmark draws input images from the MagicBrush collection (which itself uses MS-COCO images), and for each operation crowd workers wrote relevant, creative, and challenging instructions, with a verification stage filtering out off-task examples. Each item also includes an input caption and an output caption so that methods needing image descriptions can be evaluated as well.[2][4]
The public dataset on Hugging Face (facebook/emu_edit_test_set) contains 5,611 examples, split into 3,591 test and 2,020 validation rows, and is released under a CC-BY-NC 4.0 license. A companion dataset, facebook/emu_edit_test_set_generations, provides Emu Edit's own outputs on the benchmark.[4]
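Loading the benchmark with the Hugging Face `datasets` library might look like the sketch below. The split names `test` and `validation` are inferred from the description above and may differ on the Hub, and the helper names are hypothetical. The download itself needs network access and the `datasets` package, so it is wrapped in a function rather than run at import time.

```python
# Split sizes as stated in the article for facebook/emu_edit_test_set.
EXPECTED_ROWS = {"test": 3591, "validation": 2020}


def load_emu_edit_test_set():
    """Download the benchmark from the Hugging Face Hub (needs network access)."""
    from datasets import load_dataset  # lazy import; requires `pip install datasets`
    return load_dataset("facebook/emu_edit_test_set")


def check_split_sizes(dataset):
    """Return True if every expected split is present with the expected row count."""
    return all(
        name in dataset and len(dataset[name]) == rows
        for name, rows in EXPECTED_ROWS.items()
    )


total_rows = sum(EXPECTED_ROWS.values())
```

Calling `check_split_sizes(load_emu_edit_test_set())` is a quick way to confirm that a downloaded copy matches the published counts before running an evaluation.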
The paper reports automatic metrics and human judgments on two benchmarks: the Emu Edit Test Set and the MagicBrush test set. Automatic metrics include CLIP text-image direction similarity (CLIP_dir), CLIP image similarity (CLIP_img), CLIP output similarity (CLIP_out), L1 pixel distance, and DINO similarity. Human raters were asked two questions per comparison: which result best preserves elements of the input image (image faithfulness), and which best follows the instruction (text alignment).[2]
| Method | CLIP_dir | CLIP_img | L1 | DINO |
|---|---|---|---|---|
| InstructPix2Pix | 0.078 | 0.834 | 0.121 | 0.762 |
| MagicBrush | 0.090 | 0.838 | 0.100 | 0.776 |
| Emu Edit | 0.109 | 0.859 | 0.094 | 0.819 |
Selected automatic metrics on the Emu Edit Test Set (higher is better except L1). Source: Table 2 of the paper.[2]
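For readers unfamiliar with these metrics, the sketch below shows how a directional similarity and an L1 distance can be computed once embeddings and pixels are available. The vectors are toy stand-ins for CLIP features, the function names are hypothetical, and the exact preprocessing used in the paper is not reproduced.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def clip_dir(img_in, img_out, cap_in, cap_out):
    """Directional similarity: does the image change align with the caption change?"""
    d_img = [o - i for i, o in zip(img_in, img_out)]
    d_txt = [o - i for i, o in zip(cap_in, cap_out)]
    return cosine(d_img, d_txt)


def l1_distance(pixels_in, pixels_out):
    """Mean absolute difference between two flattened images with values in [0, 1]."""
    return sum(abs(a - b) for a, b in zip(pixels_in, pixels_out)) / len(pixels_in)


# Toy embeddings standing in for CLIP image and caption features.
img_in, img_out = [1.0, 0.0], [1.0, 1.0]
cap_in, cap_out = [0.0, 1.0], [0.0, 3.0]
direction_score = clip_dir(img_in, img_out, cap_in, cap_out)

pixel_gap = l1_distance([0.0, 0.5, 1.0, 1.0], [0.0, 0.5, 0.5, 1.0])
```

Under the same assumptions, CLIP_img would be `cosine(img_in, img_out)` and CLIP_out would be `cosine(img_out, cap_out)`; a lower L1 value indicates that fewer pixels changed relative to the input.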
In the human evaluations on the Emu Edit Test Set, raters preferred Emu Edit over InstructPix2Pix in 77.33% of cases for text alignment and 76.71% for image faithfulness, and over MagicBrush in 74.50% and 74.10% of cases, respectively. Against the text-based methods Plug-and-Play and a Null-Text Inversion variant, which were given access to ground-truth captions, raters still preferred Emu Edit by wide margins (for example 98.95% / 99.00% over Plug-and-Play). Emu Edit also led on most automatic metrics on the MagicBrush test set.[2]
Coverage at launch (VentureBeat, SiliconANGLE, InfoQ, Decrypt) framed Emu Edit and Emu Video as research milestones building on Meta's Emu model, and noted that the editor was presented as research rather than a shipping product. Reports highlighted the ten-million-example dataset and the model's focus on precise, instruction-only edits.[3][5][6] Within the research community the paper's notable contributions were the demonstration that recognition tasks like detection and segmentation can improve generative editing when trained together, the learned task-embedding mechanism, and the release of a cleaner, more diverse evaluation benchmark to complement InstructPix2Pix and MagicBrush.[2]