Transfusion
Last reviewed
Jun 8, 2026
Sources
4 citations
Review status
Source-backed
Revision
v1 · 1,576 words
Transfusion is a recipe, introduced by Meta in 2024, for training a single Transformer over a mixture of discrete text and continuous image data using two training objectives simultaneously: a next-token-prediction (language modeling) loss on text tokens and a diffusion (denoising) loss on image patches. It unifies autoregressive language modeling and diffusion-based image generation inside one set of parameters, so that the same model can both read and write text and natively generate high-quality images without quantizing those images into discrete tokens. The method was presented in the paper "Transfusion: Predict the Next Token and Diffuse Images with One Multi-Modal Model" by Chunting Zhou, Lili Yu, Arun Babu, Kushal Tirumala, Michihiro Yasunaga, Leonid Shamis, Jacob Kahn, Xuezhe Ma, Luke Zettlemoyer, and Omer Levy, posted to arXiv on 2024-08-20 [1][2].
Most authors were affiliated with Meta (FAIR); Arun Babu was at Waymo and Xuezhe Ma at the University of Southern California, with Levy's contribution made while at Meta [1]. Transfusion is best understood as a direct response to Meta's earlier Chameleon model, which unified modalities by converting images into discrete tokens. Transfusion keeps each modality in its native representation, discrete for text and continuous for images, and was shown to scale substantially better than discrete-token approaches on image tasks [1][3].
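Building a single model that can both understand and generate multiple modalities is complicated by the fact that language models operate on discrete tokens trained with a cross-entropy objective, while the strongest image generators are continuous diffusion models. Before Transfusion, two broad strategies dominated [1]: quantizing images into discrete tokens so that one autoregressive model can be trained over a combined text-plus-image vocabulary, as Chameleon does, which loses image detail in the quantization step; and connecting a language model to a separate diffusion generator, which means maintaining a second model rather than one jointly trained set of parameters.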
Transfusion's goal was to avoid both compromises: a single Transformer that natively handles high-fidelity image generation through diffusion while retaining a strong text language model, trained end to end from scratch on interleaved data [1].
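Transfusion processes a single sequence that interleaves discrete text tokens and continuous image-patch vectors. Each modality is trained with its own loss [1]: text tokens with the next-token-prediction (language modeling) loss, L_LM, and image patches with the diffusion (denoising) loss, L_DDPM.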
The two losses are combined into a single training objective with a weighting coefficient lambda:
L_Transfusion = L_LM + lambda * L_DDPM
The paper uses lambda = 5, balancing the text and image gradients [1]. A single Transformer backbone is shared across both modalities; only the input/output layers are modality-aware.
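As a rough illustration of the combined objective above, the following Python sketch computes a cross-entropy term over text positions and a noise-prediction mean-squared-error term over image positions, then adds them with lambda = 5. It is a toy under stated assumptions, not the authors' implementation: the function names, the input layout, and the choice to average each term over its own positions are all hypothetical.

```python
import math

LAMBDA = 5.0  # weighting coefficient reported in the paper


def cross_entropy(logits, target):
    """Negative log-probability of `target` under softmax(logits)."""
    m = max(logits)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[target]


def mse(pred, target):
    """Mean squared error between predicted and true noise vectors."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def transfusion_loss(text_items, image_items, lam=LAMBDA):
    """Combine the two losses as L_LM + lam * L_DDPM.

    text_items:  list of (logits, target_token_id) pairs, one per text position.
    image_items: list of (predicted_noise, true_noise) pairs, one per image patch.
    Averaging each term over its own positions is an assumption of this sketch.
    """
    l_lm = sum(cross_entropy(lg, t) for lg, t in text_items) / len(text_items)
    l_ddpm = sum(mse(p, n) for p, n in image_items) / len(image_items)
    return l_lm + lam * l_ddpm


if __name__ == "__main__":
    text = [([0.0, 0.0, 0.0, 0.0], 2)]    # uniform logits over 4 tokens
    image = [([0.0, 0.0], [1.0, 1.0])]    # prediction off by 1 in each dimension
    print(transfusion_loss(text, image))  # ln(4) + 5 * 1
```

With uniform logits over four tokens the text term equals ln 4, and a unit error in every noise dimension gives an image term of 1, so the example makes the effect of the weighting visible: the image term contributes five times its raw value.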
A key design choice is the attention mask. Text tokens use causal attention, so each text token sees only earlier elements in the sequence. The patches of a given image use bidirectional attention among themselves, which matches how diffusion denoises a whole image jointly, while the image as a whole still attends causally to preceding context [1]. Special begin-of-image (BOI) and end-of-image (EOI) tokens delimit each image, marking where the model should switch from autoregressive text behavior to bidirectional diffusion behavior and back [1].
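The masking rule described above can be sketched in a few lines of Python. The representation is an assumption made for illustration: each sequence position is labelled `None` if it is a text, BOI, or EOI position, or with an integer image id if it is a patch of that image. The function name `transfusion_mask` is hypothetical.

```python
def transfusion_mask(kinds):
    """Build a boolean attention mask for an interleaved sequence.

    kinds[i] is None for a text/BOI/EOI position, or an integer image id
    for a patch belonging to that image (a labelling assumed for this sketch).
    mask[i][j] is True when position i may attend to position j.
    """
    n = len(kinds)
    # Start from a standard causal mask: every position sees itself and the past.
    mask = [[j <= i for j in range(n)] for i in range(n)]
    # Additionally let patches of the same image see each other's future positions.
    for i in range(n):
        for j in range(i + 1, n):
            if kinds[i] is not None and kinds[i] == kinds[j]:
                mask[i][j] = True
    return mask


if __name__ == "__main__":
    # text, BOI, three patches of image 0, EOI, text
    kinds = [None, None, 0, 0, 0, None, None]
    for row in transfusion_mask(kinds):
        print("".join("x" if allowed else "." for allowed in row))
```

In the printed grid, the three image rows share a fully filled 3 by 3 block, while every other row is lower-triangular: text before the image cannot see the patches, and text after the image sees all of them.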
Images are not fed in as raw pixels. A pretrained variational autoencoder (VAE) compresses each 256 by 256 image into a latent tensor of shape 32 by 32 by 8 (8 channels per spatial location). These latents are then grouped into patches and projected into the Transformer's embedding space. Transfusion explores two patchification schemes for turning latent patches into and out of model vectors [1]: a simple linear layer, and the up and down blocks of a U-Net.
Using modality-specific encoding/decoding layers (the U-Net variant) improved results and let each image be compressed down to as few as 16 patches in the Transformer sequence, sharply reducing the sequence length spent on images [1]. At inference, image generation is performed by running the diffusion sampler over the image positions (the paper uses a cosine noise schedule, 1,000 training timesteps, and 250 sampling steps), while text is generated autoregressively. Image quality is further improved with classifier-free guidance, reported with a guidance coefficient of 5 in the main results [1].
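The sequence-length saving follows from simple arithmetic on the latent grid, which the short Python sketch below works through. The downsampling factor of 8 is implied by the 256-to-32 compression stated above; the specific patch sides tried in the loop are illustrative assumptions, and the helper names are hypothetical.

```python
def latent_side(image_side, vae_downsample=8):
    """Side length of the VAE latent grid (256 -> 32 for a factor of 8)."""
    return image_side // vae_downsample


def patches_per_image(image_side, patch_side, vae_downsample=8):
    """Number of Transformer positions one image occupies.

    patch_side is measured in latent cells, so a patch covers
    patch_side * patch_side cells of the latent grid.
    """
    side = latent_side(image_side, vae_downsample)
    if side % patch_side != 0:
        raise ValueError("patch_side must divide the latent grid side")
    return (side // patch_side) ** 2


if __name__ == "__main__":
    for k in (1, 2, 4, 8):  # illustrative patch sides, in latent cells
        print(k, patches_per_image(256, k))
```

Under these assumptions, a patch covering 8 by 8 latent cells yields (32 / 8)^2 = 16 positions per image, compared with 1,024 positions if every latent cell were its own patch.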
The authors trained a family of Transfusion models at 0.16B, 0.37B, 0.76B, 1.4B, and 7B parameters, pretraining on a roughly 50/50 mixture of text and image data and establishing scaling laws across unimodal and cross-modal benchmarks [1]. The flagship 7B model was trained on about 2 trillion multi-modal tokens (roughly 1 trillion text tokens plus image-patch tokens drawn from about 3.5 billion images) [1].
At equal model size and training budget, Transfusion scaled significantly better than the discrete-token (Chameleon-style) recipe on image generation, and matched it on text. In a controlled comparison at 7B parameters and 0.5T tokens, Transfusion reached a zero-shot MS-COCO Frechet Inception Distance (FID) of 16.8 against 29.6 for the discrete-token baseline, and the paper reports that Transfusion reaches FID parity using only a small fraction of the compute of the discrete approach [1]. On text and image-to-text tasks, Transfusion reached perplexity parity with the discrete baseline at roughly half its FLOPs and image-to-text parity at about 21.8% of its FLOPs [1].
The final 7B Transfusion model generates images on par with dedicated diffusion systems at similar scale while also functioning as a competent language model [1]. Selected reported figures:
| System | Type | MS-COCO FID (zero-shot) | GenEval |
|---|---|---|---|
| Transfusion 7B | Unified (text + diffusion) | 6.78 | 0.63 |
| DALL-E 2 | Dedicated diffusion | 10.39 | 0.52 |
| SDXL | Dedicated diffusion | -- | 0.55 |
| Stable Diffusion 3 | Dedicated diffusion | -- | 0.68 |
Values as reported in the Transfusion paper; a dash means the figure was not reported for that metric [1]. By these measures Transfusion's image generation outperformed DALL-E 2 and SDXL on the cited benchmarks and approached the strongest contemporaneous dedicated text-to-image systems, despite carrying a full language model in the same parameters [1].
Transfusion is most directly contrasted with Chameleon, Meta's mixed-modal early-fusion model released earlier in 2024. Chameleon tokenizes 512 by 512 images into 1,024 discrete codes from an 8,192-entry VQ codebook and trains a single Transformer purely autoregressively over the combined text-plus-image vocabulary [4]. This is elegant and fully token-based, but quantization is lossy. Transfusion instead leaves images as continuous VAE latents and trains them with diffusion, so no quantization step throws away image detail; the headline experimental claim is that this continuous, diffusion-based recipe exhibits consistently better scaling on image benchmarks than the discrete recipe while preserving text quality [1][3].
More broadly, Transfusion sits within a wave of work on unified models that aim to both understand and generate across modalities. It contrasts with fully autoregressive token-based unified models such as Chameleon and Emu3, and with systems that connect a frozen language model to an external diffusion decoder; Transfusion's distinguishing feature is that a single jointly trained Transformer carries both the autoregressive and the diffusion objectives, with the diffusion operating on continuous latents rather than discrete tokens [1][3]. Its hybrid autoregressive-plus-diffusion design over continuous tokens has been frequently cited as a representative architecture in surveys of unified multimodal models [3].
Transfusion demonstrated that next-token prediction and diffusion are not mutually exclusive training paradigms but can coexist productively in one Transformer, removing the need to either degrade images through quantization or maintain a separate generator. By showing favorable scaling against the discrete-token alternative and image quality competitive with dedicated diffusion models at the 7B scale, it provided concrete evidence for natively multimodal generation from a single model trained end to end [1].
The recipe is also reproducible: an open-source PyTorch implementation of the method was released by the community, and the technique has been widely referenced as one of the canonical templates for combining autoregressive language modeling with continuous-latent diffusion in subsequent unified multimodal models [2][3]. As an idea, "predict the next token and diffuse images" has become shorthand for the continuous-token branch of unified model research, alongside the fully discrete branch exemplified by Chameleon [1][3].