Transfusion

Updated Jun 8, 2026

Overview

Transfusion is a recipe, introduced by Meta in 2024, for training a single Transformer over a mixture of discrete text and continuous image data using two training objectives simultaneously: a next-token-prediction (language modeling) loss on text tokens and a diffusion (denoising) loss on image patches. It unifies autoregressive language modeling and diffusion-based image generation inside one set of parameters, so that the same model can both read and write text and natively generate high-quality images without quantizing those images into discrete tokens. The method was presented in the paper "Transfusion: Predict the Next Token and Diffuse Images with One Multi-Modal Model" by Chunting Zhou, Lili Yu, Arun Babu, Kushal Tirumala, Michihiro Yasunaga, Leonid Shamis, Jacob Kahn, Xuezhe Ma, Luke Zettlemoyer, and Omer Levy, posted to arXiv on 2024-08-20 ^[1]^[2].

Most authors were affiliated with Meta (FAIR); Arun Babu was at Waymo and Xuezhe Ma at the University of Southern California, with Levy's contribution made while at Meta ^[1]. Transfusion is best understood as a direct response to Meta's earlier Chameleon model, which unified modalities by converting images into discrete tokens. Transfusion keeps each modality in its native representation, discrete for text and continuous for images, and was shown to scale substantially better than discrete-token approaches on image tasks ^[1]^[3].

Motivation: unifying text and image generation

Building a single model that can both understand and generate multiple modalities is complicated by the fact that language models operate on discrete tokens trained with a cross-entropy objective, while the strongest image generators are continuous diffusion models. Before Transfusion, two broad strategies dominated ^[1]:

Quantize images into discrete tokens (for example with a VQ-VAE or VQGAN) so that a standard language model can treat image tokens just like words. This is the approach taken by Chameleon. It yields a clean, fully autoregressive recipe but discards information during quantization, which caps achievable image quality.
Bolt a separate diffusion model onto a language model, connecting the two with adapters. This preserves image quality but means the system is really two models, and the diffusion component is not trained jointly with the language objective.

Transfusion's goal was to avoid both compromises: a single Transformer that natively handles high-fidelity image generation through diffusion while retaining a strong text language model, trained end to end from scratch on interleaved data ^[1].

How Transfusion works

Mixed sequences and modality-specific objectives

Transfusion processes a single sequence that interleaves discrete text tokens and continuous image-patch vectors. Each modality is trained with its own loss ^[1]:

Text uses the standard language modeling objective: predict the next token with cross-entropy, under causal (left-to-right) attention.
Images use a diffusion objective in the style of DDPM: noise is added to the latent image patches, and the model is trained to predict that noise (denoise the patches). Within each image, the patches attend to each other bidirectionally rather than causally, since all patches of an image are present at once.
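The noising step behind the image objective can be sketched in a few lines. This is a minimal illustration of the standard DDPM forward process, not the paper's code; the function name `add_noise`, the patch count, and the patch width are assumptions chosen for the example.

```python
import numpy as np

def add_noise(x0, alpha_bar, rng):
    """Forward diffusion: mix clean latents x0 with Gaussian noise.

    alpha_bar in [0, 1] is the cumulative signal level at the sampled
    timestep (1 = clean, 0 = pure noise). Returns the noisy input the
    model sees and the noise it is trained to predict.
    """
    eps = rng.standard_normal(x0.shape)
    x_t = np.sqrt(alpha_bar) * x0 + np.sqrt(1.0 - alpha_bar) * eps
    return x_t, eps

rng = np.random.default_rng(0)
# Hypothetical sizes: 16 latent patches with 512 values each.
x0 = rng.standard_normal((16, 512))
x_t, eps = add_noise(x0, alpha_bar=0.5, rng=rng)
```

The model receives `x_t` at the image positions and is trained to recover `eps`, which is the regression target of the diffusion loss.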

The two losses are combined into a single training objective with a weighting coefficient lambda:

L_Transfusion = L_LM + lambda * L_DDPM

The paper uses lambda = 5, balancing the text and image gradients ^[1]. A single Transformer backbone is shared across both modalities; only the input/output layers are modality-aware.
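As a rough sketch of how the combined objective could be computed, the snippet below adds a cross-entropy term over text positions to a weighted noise-prediction error over image positions. The function names and the simple averaging are assumptions for illustration; the paper's exact normalization (for example per token versus per image) may differ.

```python
import numpy as np

LAMBDA = 5.0  # weighting coefficient reported in the text

def lm_loss(logits, targets):
    """Mean next-token cross-entropy. logits: (n, vocab); targets: (n,) ids."""
    shifted = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    return -log_probs[np.arange(len(targets)), targets].mean()

def ddpm_loss(eps_pred, eps):
    """Mean squared error between predicted and true noise."""
    return np.mean((eps_pred - eps) ** 2)

def transfusion_loss(logits, targets, eps_pred, eps, lam=LAMBDA):
    """L_Transfusion = L_LM + lambda * L_DDPM."""
    return lm_loss(logits, targets) + lam * ddpm_loss(eps_pred, eps)

# Toy inputs: uniform logits over a 4-word vocabulary, noise off by exactly 1.
logits = np.zeros((3, 4))
targets = np.array([0, 1, 2])
eps = np.zeros((2, 8))
eps_pred = np.ones((2, 8))
total = transfusion_loss(logits, targets, eps_pred, eps)
```

With these toy inputs the text term is log(4) and the image term is 1, so the total is log(4) + 5.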

Attention and modality delimiters

A key design choice is the attention mask. Text tokens use causal attention, so each text token sees only earlier elements in the sequence. The patches of a given image use bidirectional attention among themselves, which matches how diffusion denoises a whole image jointly, while the image as a whole still attends causally to preceding context ^[1]. Special begin-of-image (BOI) and end-of-image (EOI) tokens delimit each image, marking where the model should switch from autoregressive text behavior to bidirectional diffusion behavior and back ^[1].
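The mask described above can be sketched as a causal mask with bidirectional blocks added for each image. This is an illustrative construction only: the `image_id` encoding (-1 for text, a non-negative id per image) is an assumption, as is treating the BOI and EOI delimiters as ordinary causal tokens.

```python
import numpy as np

def transfusion_mask(image_id):
    """Return a boolean mask where mask[i, j] means position i may attend to j.

    image_id[i] is -1 for text-like positions (including BOI/EOI here) and a
    non-negative image index for the patches of that image.
    """
    image_id = np.asarray(image_id)
    n = image_id.shape[0]
    causal = np.tril(np.ones((n, n), dtype=bool))  # everyone sees the past
    is_image = image_id >= 0
    same_image = (
        is_image[:, None] & is_image[None, :]
        & (image_id[:, None] == image_id[None, :])
    )
    return causal | same_image  # patches also see later patches of their image

# Sequence: text, text, BOI, patch, patch, patch, EOI, text
ids = [-1, -1, -1, 0, 0, 0, -1, -1]
mask = transfusion_mask(ids)
```

In this example the first patch (position 3) can attend to the last patch (position 5) of the same image, but the text before the image cannot see any patch, and text after the image sees all of it.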

Image encoding

Images are not fed in as raw pixels. A pretrained variational autoencoder (VAE) compresses each 256 by 256 image into a latent tensor of shape 32 by 32 by 8 (8 channels per spatial location). These latents are then grouped into patches and projected into the Transformer's embedding space. Transfusion explores two patchification schemes for turning latent patches into and out of model vectors ^[1]:

A simple linear layer applied to each k by k patch.
U-Net down and up blocks, which add a small amount of convolutional inductive bias at the model's input and output.
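The simpler linear scheme can be sketched as a reshape followed by a matrix multiply. This is a minimal illustration, not the paper's implementation: the embedding width `d_model = 64` and the random weights are hypothetical, and the U-Net variant is not shown. By plain arithmetic, a 32 by 32 latent grid yields (32 / k)^2 patches, so k = 2 gives 256 patches and k = 8 gives 16.

```python
import numpy as np

def patchify(latent, k):
    """Split an (H, W, C) latent into non-overlapping k-by-k patches.

    Returns an array of shape ((H // k) * (W // k), k * k * C), with patches
    ordered row by row.
    """
    h, w, c = latent.shape
    assert h % k == 0 and w % k == 0, "patch size must divide the latent grid"
    patches = latent.reshape(h // k, k, w // k, k, c)
    patches = patches.transpose(0, 2, 1, 3, 4)  # (H/k, W/k, k, k, C)
    return patches.reshape((h // k) * (w // k), k * k * c)

def linear_embed(patches, weight, bias):
    """Project flattened patches into the Transformer embedding space."""
    return patches @ weight + bias

rng = np.random.default_rng(0)
latent = rng.standard_normal((32, 32, 8))  # VAE latent for one 256x256 image
d_model = 64                               # hypothetical embedding width

patches = patchify(latent, k=8)            # shape (16, 512)
weight = rng.standard_normal((patches.shape[1], d_model)) * 0.02
bias = np.zeros(d_model)
tokens = linear_embed(patches, weight, bias)  # shape (16, 64)
```

Larger patches shorten the sequence quadratically but force each vector to carry more of the image, which is the trade-off the two patchification schemes address.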

Using modality-specific encoding/decoding layers (the U-Net variant) improved results and let each image be compressed down to as few as 16 patches in the Transformer sequence, sharply reducing the sequence length spent on images ^[1]. At inference, text is generated autoregressively, while images are generated by running the diffusion sampler over the image positions; the paper uses a cosine noise schedule, 1,000 training timesteps, and 250 sampling steps. Image quality is further improved with classifier-free guidance: the paper uses a guidance coefficient of 5 in its controlled comparisons with the discrete-token baseline and tunes the coefficient per benchmark for its final large-scale results ^[1].
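The sampling settings above can be illustrated with a short sketch. Several details here are assumptions rather than facts from the paper: the cosine schedule is written in the commonly used Nichol and Dhariwal form with offset `s = 0.008`, the 250 sampling steps are taken as evenly spaced over the 1,000 training timesteps, and classifier-free guidance uses the convention where a scale of 1 recovers the conditional prediction.

```python
import numpy as np

TRAIN_STEPS = 1000   # training timesteps reported in the text
SAMPLE_STEPS = 250   # sampling steps reported in the text

def cosine_alpha_bar(t, num_steps=TRAIN_STEPS, s=0.008):
    """Cumulative signal level at timestep t under a cosine schedule (assumed form)."""
    def f(u):
        return np.cos((u / num_steps + s) / (1.0 + s) * np.pi / 2.0) ** 2
    return f(np.asarray(t, dtype=float)) / f(0.0)

def sampling_timesteps(num_train=TRAIN_STEPS, num_sample=SAMPLE_STEPS):
    """Evenly spaced timesteps from most noisy to clean (assumed spacing)."""
    return np.round(np.linspace(num_train - 1, 0, num_sample)).astype(int)

def guided_noise(eps_cond, eps_uncond, scale):
    """Classifier-free guidance: push the prediction toward the conditional one."""
    return eps_uncond + scale * (eps_cond - eps_uncond)

timesteps = sampling_timesteps()
signal_levels = cosine_alpha_bar(timesteps)
```

At each of the sampled timesteps the model would be run twice, with and without the text condition, and `guided_noise` would combine the two predictions before the denoising update.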

Results

The authors trained a family of Transfusion models at 0.16B, 0.37B, 0.76B, 1.4B, and 7B parameters, pretraining on a roughly 50/50 mixture of text and image data and establishing scaling laws across unimodal and cross-modal benchmarks ^[1]. The flagship 7B model was trained on about 2 trillion multi-modal tokens (roughly 1 trillion text tokens plus image-patch tokens drawn from about 3.5 billion images) ^[1].

At equal model size and training budget, Transfusion scaled significantly better than the discrete-token (Chameleon-style) recipe on image generation and was also more compute-efficient on text. In a controlled comparison at 7B parameters and 0.5T tokens, Transfusion reached a zero-shot MS-COCO Frechet Inception Distance (FID; lower is better) of 16.8 against 29.6 for the discrete-token baseline, and the paper reports that Transfusion reaches FID parity with roughly 3% of the discrete approach's compute (about 34 times less) ^[1]. On text and image-to-text tasks, Transfusion reached perplexity parity with the discrete baseline at roughly half its FLOPs and image-to-text parity at about 21.8% of its FLOPs ^[1].

The final 7B Transfusion model generates images on par with dedicated diffusion systems at similar scale while also functioning as a competent language model ^[1]. Selected reported figures:

System	Type	MS-COCO FID (zero-shot)	GenEval
Transfusion 7B	Unified (text + diffusion)	6.78	0.63
DALL-E 2	Dedicated diffusion	10.39	0.52
SDXL	Dedicated diffusion	--	0.55
Stable Diffusion 3	Dedicated diffusion	--	0.68

Values as reported in the Transfusion paper; a dash means the figure was not reported for that metric ^[1]. By these measures Transfusion's image generation outperformed DALL-E 2 and SDXL on the cited benchmarks and approached the strongest contemporaneous dedicated text-to-image systems, despite carrying a full language model in the same parameters ^[1].

Relationship to Chameleon and other multimodal models

Transfusion is most directly contrasted with Chameleon, Meta's mixed-modal early-fusion model released earlier in 2024. Chameleon tokenizes 512 by 512 images into 1,024 discrete codes from an 8,192-entry VQ codebook and trains a single Transformer purely autoregressively over the combined text-plus-image vocabulary ^[4]. This is elegant and fully token-based, but quantization is lossy. Transfusion instead leaves images as continuous VAE latents and trains them with diffusion, so no quantization step throws away image detail; the headline experimental claim is that this continuous, diffusion-based recipe exhibits consistently better scaling on image benchmarks than the discrete recipe while preserving text quality ^[1]^[3].
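A back-of-the-envelope calculation makes the quantization bottleneck concrete. The sketch below uses only the Chameleon figures quoted above; the square token grid and the 8-bit RGB input are assumptions, and the function name is hypothetical.

```python
import math

def discrete_image_budget(image_side=512, num_codes=1024, codebook_size=8192):
    """Information budget of a VQ-tokenized image (assumes a square grid, 8-bit RGB)."""
    grid_side = math.isqrt(num_codes)             # codes per side of the grid
    bits_per_code = math.log2(codebook_size)      # bits needed to pick one entry
    token_bits = num_codes * bits_per_code        # upper bound on bits per image
    raw_bits = image_side * image_side * 3 * 8    # uncompressed RGB pixels
    return {
        "grid_side": grid_side,
        "pixels_per_code_side": image_side // grid_side,
        "bits_per_code": bits_per_code,
        "token_bits": token_bits,
        "compression_ratio": raw_bits / token_bits,
    }

budget = discrete_image_budget()
```

Under these assumptions each code covers a 16 by 16 pixel region and carries at most 13 bits, so a whole image is described by at most 13,312 bits, several hundred times fewer than the raw pixels. Continuous latents are not subject to this hard cap, which is the intuition behind the scaling gap reported in the paper.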

More broadly, Transfusion sits within a wave of work on unified models that aim to both understand and generate across modalities. It contrasts with fully autoregressive token-based unified models such as Chameleon and Emu3, and with systems that connect a frozen language model to an external diffusion decoder; Transfusion's distinguishing feature is that a single jointly trained Transformer carries both the autoregressive and the diffusion objectives, with the diffusion operating on continuous latents rather than discrete tokens ^[1]^[3]. Its hybrid autoregressive-plus-diffusion design over continuous tokens has been frequently cited as a representative architecture in surveys of unified multimodal models ^[3].

Significance

Transfusion demonstrated that next-token prediction and diffusion are not mutually exclusive training paradigms but can coexist productively in one Transformer, removing the need to either degrade images through quantization or maintain a separate generator. By showing favorable scaling against the discrete-token alternative and image quality competitive with dedicated diffusion models at the 7B scale, it provided concrete evidence for natively multimodal generation from a single model trained end to end ^[1].

The recipe is also reproducible: an open-source PyTorch implementation of the method was released by the community, and the technique has been widely referenced as one of the canonical templates for combining autoregressive language modeling with continuous-latent diffusion in subsequent unified multimodal models ^[2]^[3]. As an idea, "predict the next token and diffuse images" has become shorthand for the continuous-token branch of unified model research, alongside the fully discrete branch exemplified by Chameleon ^[1]^[3].

References

[1] Zhou, C., Yu, L., Babu, A., Tirumala, K., Yasunaga, M., Shamis, L., Kahn, J., Ma, X., Zettlemoyer, L., and Levy, O. "Transfusion: Predict the Next Token and Diffuse Images with One Multi-Modal Model." arXiv:2408.11039, 2024-08-20. https://arxiv.org/abs/2408.11039
[2] AI at Meta. "Transfusion: Predict the Next Token and Diffuse Images with One Multi-Modal Model" (research publication page). https://ai.meta.com/research/publications/transfusion-predict-the-next-token-and-diffuse-images-with-one-multi-modal-model/
[3] "Towards Unified Multimodal Models: Trends and Insights." ICLR 2025 Blogposts. https://d2jud02ci9yv69.cloudfront.net/2025-04-28-unified-models-47/blog/unified-models/
[4] Chameleon Team (Meta FAIR). "Chameleon: Mixed-Modal Early-Fusion Foundation Models." arXiv:2405.09818, 2024. https://arxiv.org/abs/2405.09818
