Emu (Meta AI)

Emu is a text-to-image generation foundation model developed by Meta AI and unveiled at the Meta Connect conference in September 2023. The name expands to "Expressive Media Universe." Emu is a latent diffusion model best known for introducing "quality-tuning," a recipe that fine-tunes a pre-trained model on only a few thousand carefully selected, visually appealing images to sharply improve aesthetic quality. Meta describes Emu as its first foundational model for image generation; it powers consumer image features across Meta's apps and is the basis for the follow-on research models Emu Video and Emu Edit.[1][2][5]

The model was described in the paper "Emu: Enhancing Image Generation Models Using Photogenic Needles in a Haystack" by Xiaoliang Dai and colleagues at Meta, submitted to arXiv on 27 September 2023.[1] It should not be confused with unrelated models that share the "Emu" name, including the multimodal generative model "Emu: Generative Pretraining in Multimodality" from the Beijing Academy of Artificial Intelligence (BAAI) and various open-source projects.

What is Emu?

Emu is a foundation model that turns a text prompt into a high-quality image. Its distinguishing idea is that aesthetic quality can be added to a broadly capable model with very little data: after pre-training on web-scale image-text pairs, Emu is fine-tuned on a tiny, hand-curated set of exceptionally appealing images. The paper's abstract states the result plainly: "as little as a few thousand carefully selected images," later quantified at roughly 2,000, "can significantly improve the visual appeal of the generated images, without sacrificing generality."[1] The same abstract argues the recipe is not architecture-specific: "we show that quality-tuning is a generic approach that is also effective for other architectures, including pixel diffusion and masked generative transformer models."[1]

Background

Text-to-image models trained on web-scale image-text pairs can generate a wide range of visual concepts, but the resulting images are often inconsistent in aesthetic quality, with problems in composition, lighting, color, and focus. The Emu authors framed this as a need for "aesthetic alignment" after the main pre-training stage, analogous to how large language models are aligned to human preferences after pre-training. Their central claim is that this alignment can be achieved with a surprisingly small amount of data, provided that data is of very high visual quality.[1]

What architecture does Emu use?

Emu uses the latent diffusion architecture popularized by Stable Diffusion: an autoencoder compresses an image into a lower-dimensional latent representation, a diffusion model (a U-Net) is trained to denoise in that latent space, and the autoencoder's decoder reconstructs the final pixels. Emu generates images at a target resolution of 1024x1024.[1][3]

A notable design choice concerns the autoencoder. The standard latent diffusion autoencoder compresses an RGB image into four latent channels, which the authors found to be a bottleneck on reconstruction quality at high resolution. Emu increases the latent channel count from 4 to 16, which significantly improves reconstruction; the team also tried 32 channels but found little additional benefit over 16. The autoencoder uses three 2x2 downsampling blocks, which shrink each side of the image by 8x (f8) and so reduce the number of spatial positions by a factor of 64. To further improve fidelity, the autoencoder adds an adversarial loss and a non-learnable Fourier Feature Transform pre-processing step. On reconstruction metrics, the 16-channel autoencoder reaches an SSIM around 0.92 and a PSNR around 34, with the Fourier-feature variant slightly higher.[1][3]
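The shape arithmetic behind these numbers can be checked with a few lines of Python. This is only a sketch of the bookkeeping, not Emu's autoencoder: the function names `latent_shape` and `value_compression` are made up for illustration, and the only inputs are the figures quoted above (1024x1024 RGB input, three 2x2 downsampling blocks, 4 or 16 latent channels).

```python
def latent_shape(height, width, channels=16, num_down_blocks=3):
    """Return the (C, H, W) latent shape for an image of the given size.

    Each 2x2 downsampling block halves both spatial dimensions.
    """
    factor = 2 ** num_down_blocks  # 3 blocks -> 8x per side (f8)
    if height % factor or width % factor:
        raise ValueError(f"image sides must be multiples of {factor}")
    return (channels, height // factor, width // factor)


def value_compression(height, width, channels=16, num_down_blocks=3):
    """Ratio of RGB values in the image to values in the latent."""
    c, h, w = latent_shape(height, width, channels, num_down_blocks)
    return (3 * height * width) / (c * h * w)


if __name__ == "__main__":
    print(latent_shape(1024, 1024))                   # (16, 128, 128)
    print(value_compression(1024, 1024))              # 12.0
    print(value_compression(1024, 1024, channels=4))  # 48.0
```

Under these assumptions, moving from 4 to 16 channels means the latent stores four times as many values per image (a 12x rather than 48x reduction in raw values), which is consistent with the reported gain in reconstruction quality.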

How does quality-tuning work?

Quality-tuning is the contribution that gives the paper its title. The process has two phases:

  1. Pre-training. A latent diffusion model is pre-trained on 1.1 billion image-text pairs, giving it broad coverage of visual concepts but uneven aesthetic quality.[1][2]
  2. Quality-tuning. The pre-trained model is fine-tuned with supervised learning on a tiny, manually curated dataset of only a few thousand exceptionally high-quality images, described in the paper as "photogenic needles in a haystack."[1][4]

The curated set was reduced to 2,000 images through a multi-stage funnel. Automatic filters first cut billions of candidates down to roughly 200,000 by removing offensive content and enforcing image-text alignment and basic quality thresholds. Generalist human annotators then narrowed that pool to about 20,000 images, and specialist annotators applied photographic principles (composition, lighting, color, focus) to select the final set of roughly 2,000.[3][4]
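How selective each stage of the funnel is can be worked out from the stage sizes quoted above. The sketch below is plain arithmetic; the names `FUNNEL` and `retention_rates` are illustrative, and the starting pool ("billions") is left out because no exact figure is given.

```python
# Stage sizes as reported in the text above. The starting pool ("billions")
# has no exact figure, so it is not part of the arithmetic.
FUNNEL = [
    ("after automatic filters", 200_000),
    ("after generalist annotators", 20_000),
    ("after specialist annotators", 2_000),
]


def retention_rates(stages):
    """Return (stage name, fraction kept) for each step after the first."""
    rates = []
    for (_, before), (name, after) in zip(stages, stages[1:]):
        rates.append((name, after / before))
    return rates


if __name__ == "__main__":
    for name, rate in retention_rates(FUNNEL):
        print(f"{name}: keeps {rate:.0%}")
    overall = FUNNEL[-1][1] / FUNNEL[0][1]
    print(f"kept from the automatically filtered pool: {overall:.0%}")
```

Each human-review stage keeps about one image in ten, so roughly 1% of the automatically filtered pool survives into the final set.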

The key empirical finding is that fine-tuning on this small set dramatically improves visual appeal without sacrificing the model's generality, as measured by faithfulness to the text prompt. The paper also reports that quality-tuning is not specific to latent diffusion: the same recipe improves pixel diffusion and masked generative transformer models.[1]

How good is Emu? (Evaluation)

Emu was evaluated with human preference studies on two prompt sets: the public PartiPrompts benchmark and an internal "Open User Input" benchmark built from real-world usage of text-to-image tools. The headline results compare visual appeal:

Comparison | Benchmark | Emu win rate
--- | --- | ---
Emu vs. its pre-trained-only counterpart | PartiPrompts | 82.9%
Emu vs. its pre-trained-only counterpart | Open User Input | 91.2%
Emu vs. SDXL v1.0 | PartiPrompts | 68.4%
Emu vs. SDXL v1.0 | Open User Input | 71.3%

The large margin over the pre-trained baseline (the same model without quality-tuning) is the central evidence that the small curated dataset, not additional capacity or pre-training data, drives the quality gain.[1][2] Mark Zuckerberg stated at Connect that Emu could generate an image in about five seconds.[5]
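A win rate in a pairwise preference study is the share of comparisons in which one model's image was preferred. The following is a minimal sketch of that calculation, not the paper's evaluation code: the function name `win_rate`, the vote labels, and the choice to keep ties in the denominator are all assumptions, and the votes are made up.

```python
from collections import Counter


def win_rate(votes, model="emu"):
    """Share of pairwise comparisons in which `model` was preferred.

    `votes` holds one label per comparison: the name of the preferred
    model, or "tie". Ties count in the denominator here; that is an
    assumption of this sketch, not a rule taken from the paper.
    """
    counts = Counter(votes)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no votes")
    return counts[model] / total


if __name__ == "__main__":
    # Made-up votes for illustration only; not data from the paper.
    votes = ["emu"] * 7 + ["baseline"] * 2 + ["tie"]
    print(f"{win_rate(votes):.1%}")  # 70.0%
```

Whether ties are counted, dropped, or split changes the reported percentage, so comparisons across papers should check which convention each one uses.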

What products use Emu?

Meta positioned Emu as its "first foundational model for image generation" and wired it into consumer features across its apps. These features combine Emu with other systems, such as the Llama 2 language model for prompt handling and the Segment Anything Model for masking.[5][6]

Feature | What it does | Where it appears
--- | --- | ---
Imagine (image generation) | Generates images from a text prompt, invoked in chat with "@Meta AI /imagine" | WhatsApp, Messenger, Instagram (via Meta AI assistant)
AI stickers | Creates custom stickers from a text description in seconds | Instagram, Facebook, Messenger, WhatsApp
Restyle | Applies a text-described visual style to an existing photo | Instagram
Backdrop | Changes the background or scene while keeping the subject, using Segment Anything | Instagram

On 6 December 2023, Meta launched a standalone web experience for the technology at imagine.meta.com, branded "Imagine with Meta," which generated four images per prompt and was initially free in the United States. Meta said it would add visible and invisible watermarks to AI-generated images from these tools to signal that they were machine-made.[6][7]

What is in the Emu family?

On 16 November 2023, Meta announced two research models that build on Emu:[8]

  • Emu Video generates short videos from a text prompt using a factorized, two-step process: first it generates a still image conditioned on the prompt, then it generates video conditioned on both the prompt and that image. It uses two diffusion models rather than a long pipeline of distinct models, producing 512x512 four-second clips at 16 frames per second. In human evaluations, 96% of raters preferred it over Meta's earlier Make-A-Video on quality and 85% preferred it on faithfulness to the prompt. It can also animate user-supplied images.[8]
  • Emu Edit performs instruction-based image editing, letting a user describe an edit in plain language. It handles local and global edits, adding or removing backgrounds, color and geometry changes, and even detection and segmentation tasks, while leaving unrelated pixels untouched. It was trained on a dataset of 10 million synthesized examples, each consisting of an input image, a task instruction, and the target output.[8]
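The two-step factorization described for Emu Video can be sketched as plain control flow. This is a sketch under stated assumptions, not Meta's implementation: `generate_video`, `text_to_image`, and `image_to_video` are hypothetical names, the toy stand-ins exist only so the code runs, and the frame count simply multiplies the clip length and frame rate quoted above (how frames are produced internally is not described here).

```python
from typing import Callable, List

Image = List[List[float]]  # stand-in type; a real image would be a tensor
Video = List[Image]

CLIP_SECONDS = 4
FRAMES_PER_SECOND = 16


def frames_per_clip(seconds: int = CLIP_SECONDS, fps: int = FRAMES_PER_SECOND) -> int:
    """Number of frames in a clip of the given length and frame rate."""
    return seconds * fps


def generate_video(
    prompt: str,
    text_to_image: Callable[[str], Image],
    image_to_video: Callable[[str, Image, int], Video],
) -> Video:
    """Factorized generation: text -> image, then (text, image) -> video."""
    first_image = text_to_image(prompt)
    return image_to_video(prompt, first_image, frames_per_clip())


if __name__ == "__main__":
    # Toy stand-ins so the sketch runs; they are not real models.
    def toy_text_to_image(prompt: str) -> Image:
        return [[float(len(prompt))]]

    def toy_image_to_video(prompt: str, image: Image, num_frames: int) -> Video:
        return [image for _ in range(num_frames)]

    clip = generate_video("a red kite", toy_text_to_image, toy_image_to_video)
    print(len(clip))  # 64
```

Because the second step takes an image as input, the same function shape also covers animating a user-supplied image: the first step is skipped and the supplied image is passed in directly.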

Significance

Emu's main contribution is methodological rather than architectural: it demonstrated that a relatively small, tightly curated dataset can realign a large pre-trained image model toward high aesthetic quality, a result that influenced later work on data curation for generative models. For Meta, Emu also marked the company's shift from publishing image-generation research, such as Make-A-Scene and Make-A-Video, toward shipping image generation directly inside its consumer apps.[1][5][8]

References

  1. Dai, Xiaoliang et al. "Emu: Enhancing Image Generation Models Using Photogenic Needles in a Haystack." arXiv:2309.15807, 27 September 2023. https://arxiv.org/abs/2309.15807
  2. "Emu: Enhancing Image Generation Models Using Photogenic Needles in a Haystack." AI Papers Academy. https://aipapersacademy.com/emu/
  3. "Emu: The Most Advanced Next-Generation Image Model From Meta." School of Machine Learning, 29 September 2023. https://www.schoolofmachinelearning.com/2023/09/29/emu-image-generation-model-from-meta/
  4. "A Deep Dive Inside Emu, Meta's New Image Generation AI Model." Maginative. https://www.maginative.com/article/a-deep-dive-inside-emu-metas-new-image-generation-ai-model/
  5. "Introducing New AI Experiences Across Our Family of Apps and Devices." Meta Newsroom, 27 September 2023. https://about.fb.com/news/2023/09/introducing-ai-powered-assistants-characters-and-creative-tools/
  6. "Meta publicly launches AI image generator trained on your Facebook, Instagram photos." VentureBeat, 6 December 2023. https://venturebeat.com/ai/meta-publicly-launches-ai-image-generator-trained-on-your-facebook-instagram-photos
  7. "Meta launches a standalone AI-powered image generator." TechCrunch, 6 December 2023. https://techcrunch.com/2023/12/06/meta-launches-a-standalone-ai-powered-image-generator/
  8. "Emu Video and Emu Edit: Our latest generative AI research milestones." Meta AI Blog, 16 November 2023. https://ai.meta.com/blog/emu-text-to-video-generation-image-editing-research/
