CM3leon
Last reviewed
Jun 3, 2026
Sources
4 citations
Review status
Source-backed
Revision
v1 · 1,588 words
CM3leon (pronounced "chameleon") is a multimodal generative model from Meta AI, introduced in July 2023, that handles both text-to-image and image-to-text generation in a single architecture. It is a retrieval-augmented, token-based, decoder-only transformer: unlike the diffusion models that dominated text-to-image work at the time (such as Stable Diffusion, DALL-E 2, and Midjourney), CM3leon generates images one discrete token at a time, in the same way a language model generates words. Its headline result was a state-of-the-art zero-shot Fréchet inception distance (FID) on the MS-COCO benchmark while using roughly five times less training compute than comparable methods [1][2].
The work was described in the paper "Scaling Autoregressive Multi-Modal Models: Pretraining and Instruction Tuning" by Lili Yu and colleagues at Meta, posted to arXiv on 5 September 2023 [2]. Meta presented CM3leon as the first multimodal model trained with a recipe borrowed directly from text-only large language models: a large-scale retrieval-augmented pretraining stage followed by multitask supervised fine-tuning (SFT) [1][2].
CM3leon should not be confused with Chameleon, a separate and later Meta model released in May 2024. The two names are pronounced the same, and both models treat images and text as a common token stream, but they are distinct systems with different architectures and goals (see Relationship to Chameleon below).
CM3leon is built on the CM3 (Causally Masked Multimodal) architecture, an earlier line of Meta research on decoder-only models trained over documents that interleave text and image tokens [1][2]. The CM3 family combines standard left-to-right (causal) language-model training with a masking objective: spans of the document are masked out and moved to the end of the sequence, so the model learns to fill them in using both preceding and following context. Applied to mixed-modal documents, this lets one decoder-only model both continue and infill text and images.
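The span-moving step described above can be illustrated with a minimal sketch. The sentinel name `<mask:0>`, the `<break>` and `img_*` token strings, and both function names are assumptions made for illustration; they are not taken from the CM3leon code, which was not released.

```python
def causally_mask(tokens, start, end, mask_id="<mask:0>"):
    """Replace tokens[start:end] with a sentinel and append the span at the end.

    A left-to-right model trained on the result sees the text on both sides
    of the gap before it has to predict the moved span.
    """
    span = tokens[start:end]
    body = tokens[:start] + [mask_id] + tokens[end:]
    return body + [mask_id] + span


def restore(masked, mask_id="<mask:0>"):
    """Invert causally_mask for a sequence that contains one masked span."""
    first = masked.index(mask_id)
    second = masked.index(mask_id, first + 1)
    span = masked[second + 1:]
    body = masked[:second]
    return body[:first] + span + body[first + 1:]


# Hypothetical mixed-modal document: caption, modality break, image tokens.
doc = ["a", "photo", "<break>", "img_17", "img_402", "img_9"]
masked = causally_mask(doc, 3, 5)
print(masked)
print(restore(masked) == doc)
```

Because the moved span can cover text tokens, image tokens, or both, the same transformation supports infilling in either modality.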
To turn pictures into tokens, CM3leon uses the image tokenizer from Gafni et al. (2022), the same component behind Meta's earlier Make-A-Scene system. It encodes a 256x256 image into 1,024 discrete tokens drawn from a learned vocabulary of 8,192 image codes [2]. Text is handled by a separate tokenizer with a vocabulary of 56,320, and a special break token marks the boundary between modalities. The model operates over sequences of up to 4,096 tokens [2].
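The figures above imply some simple arithmetic, sketched below. It assumes the 1,024 tokens form a square grid and that each image costs one break token; both are illustrative assumptions, not details confirmed by the text, and the function names are hypothetical.

```python
import math

IMAGE_SIDE = 256          # pixels per side, from the text
TOKENS_PER_IMAGE = 1024   # discrete tokens per image, from the text
CONTEXT_LEN = 4096        # maximum sequence length, from the text


def grid_side(tokens_per_image=TOKENS_PER_IMAGE):
    """Side length of the token grid, assuming the grid is square."""
    side = math.isqrt(tokens_per_image)
    if side * side != tokens_per_image:
        raise ValueError("token count is not a perfect square")
    return side


def pixels_per_token_side(image_side=IMAGE_SIDE, tokens_per_image=TOKENS_PER_IMAGE):
    """Width in pixels of the image patch summarized by one token."""
    return image_side // grid_side(tokens_per_image)


def remaining_text_tokens(num_images, breaks_per_image=1, context_len=CONTEXT_LEN):
    """Context left for text after num_images images (negative if they do not fit).

    breaks_per_image is an assumed overhead of one break token per image.
    """
    used = num_images * (TOKENS_PER_IMAGE + breaks_per_image)
    return context_len - used


print(grid_side())               # 32 x 32 token grid
print(pixels_per_token_side())   # each token covers an 8 x 8 pixel patch
print(remaining_text_tokens(3))  # text budget alongside three images
```

Under these assumptions, three images plus their break tokens leave about a thousand tokens for text, and a fourth image no longer fits in the context.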
Three model sizes were trained, scaling the same decoder-only design from 350 million up to 7 billion parameters:
| Model | Parameters | Layers | Model dimension |
|---|---|---|---|
| CM3leon-350M | 350M | 24 | 1,024 |
| CM3leon-760M | 760M | 24 | 1,536 |
| CM3leon-7B | 7B | 32 | 4,096 |
The architecture follows decoder-only transformer conventions used for contemporary LLMs, and the 7B configuration is the one used for the headline benchmark numbers [2].
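As a rough sanity check on the table, the common rule of thumb of about 12·L·d² weights for a decoder-only transformer can be applied to each row. This is a back-of-the-envelope sketch, not the paper's accounting: it assumes a standard attention block plus a 4x-wide feed-forward layer, a single embedding matrix over the combined text and image vocabularies, and ignores biases, layer norms, and special tokens.

```python
TEXT_VOCAB = 56_320   # from the tokenizer description
IMAGE_VOCAB = 8_192   # from the tokenizer description


def approx_params(layers, d_model, vocab=TEXT_VOCAB + IMAGE_VOCAB):
    """Estimate parameters: ~4*d^2 for attention and ~8*d^2 for the MLP per layer."""
    blocks = 12 * layers * d_model ** 2
    embeddings = vocab * d_model
    return blocks + embeddings


configs = {
    "CM3leon-350M": (24, 1024),
    "CM3leon-760M": (24, 1536),
    "CM3leon-7B": (32, 4096),
}

for name, (layers, d_model) in configs.items():
    print(f"{name}: ~{approx_params(layers, d_model) / 1e9:.2f}B")
```

The estimates come out at roughly 0.37B, 0.78B, and 6.71B, each within a few percent of the nominal model size, which suggests the layer counts and dimensions in the table are internally consistent.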
The first stage adapts retrieval-augmented pretraining, a technique from text language modeling, to the multimodal setting. During training, each input document is paired with related documents fetched by a retriever, so the model conditions on external, in-context examples rather than relying only on parameters [1][2]. The retriever is a CLIP-based dense bi-encoder using a ViT-B/32 backbone, which embeds text and images into a shared space; candidate documents are scored by relevance and selected with constraints that encourage diversity across both modality and source so the retrieved set is not redundant [2]. The paper reports retrieving on the order of two to three documents per training example and applies query dropout (dropping a fraction of the query tokens) as regularization. Prepending retrieved documents effectively multiplies the data the model sees during pretraining without enlarging the base corpus.
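The retrieval step can be sketched in a few lines. This is a toy illustration under stated assumptions, not the paper's implementation: plain Python vectors stand in for CLIP embeddings, the `near_dup` threshold and dropout rate are hypothetical values, and diversity is enforced only across sources (the modality constraint is omitted for brevity).

```python
import math
import random


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def query_dropout(tokens, rate, rng):
    """Randomly drop a fraction of query tokens; always keep at least one."""
    kept = [t for t in tokens if rng.random() >= rate]
    return kept or tokens[:1]


def retrieve(query_vec, memory, k=2, near_dup=0.95):
    """Pick the k most relevant documents, skipping near-copies of the query
    and any document whose source has already been used."""
    ranked = sorted(memory, key=lambda d: cosine(query_vec, d["vec"]), reverse=True)
    chosen, seen_sources = [], set()
    for doc in ranked:
        if cosine(query_vec, doc["vec"]) > near_dup:
            continue  # too similar to the query to add information
        if doc["source"] in seen_sources:
            continue  # keep the retrieved set diverse across sources
        chosen.append(doc)
        seen_sources.add(doc["source"])
        if len(chosen) == k:
            break
    return chosen


memory = [
    {"id": "a", "vec": [1.0, 0.0], "source": "s1"},
    {"id": "b", "vec": [0.8, 0.6], "source": "s1"},
    {"id": "c", "vec": [0.7, 0.7], "source": "s1"},
    {"id": "d", "vec": [0.6, 0.8], "source": "s2"},
]
print([d["id"] for d in retrieve([1.0, 0.0], memory)])
print(query_dropout(["a", "red", "bicycle"], 0.2, random.Random(0)))
```

In this toy memory, document `a` is rejected as a near-duplicate of the query and `c` is rejected because its source was already used, so the retrieved set is `b` and `d`. The selected documents would then be prepended to the training example.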
A central claim of the work is that this stage is comparatively cheap. CM3leon reaches its results with about five times less training compute than comparable transformer-based text-to-image methods, and Meta noted that the largest model was trained on only three billion text tokens, far fewer than the corpora behind many competing systems [1][2].
After pretraining, CM3leon undergoes multitask supervised fine-tuning on a mixture of instruction-style image and text tasks, mirroring the instruction-tuning step used for chat-oriented language models [1][2]. This stage is what gives the model its broad, controllable capabilities. Reported SFT tasks include text-guided image generation and editing, image captioning, visual question answering, and structure-guided generation, where an input such as a segmentation map, sketch, or other structural condition guides the output image [1][3].
All of CM3leon's image training data comes from a licensed Shutterstock dataset of image-and-text pairs [1][3]. Meta emphasized this choice as a way to sidestep the data-ownership and attribution concerns that surrounded text-to-image systems trained on web-scraped images, and framed it as evidence that strong results are achievable without unlicensed data [1][3]. The paper does not state the exact number of licensed images.
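Because CM3leon is autoregressive, image quality depends heavily on how tokens are sampled. The paper compares several decoding strategies, including temperature sampling, top-p (nucleus) sampling, classifier-free guidance, and a contrastive-decoding variant [2].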
For the benchmark result, the model generates multiple candidate images and reranks them with a CLIP-based scorer to pick the best one [2].
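The generate-then-rerank procedure is a form of best-of-n selection, sketched below. The `generate` and `score` callables are hypothetical stand-ins for the model's sampler and a CLIP-style text-image similarity function; the toy versions in the demo exist only to make the sketch runnable.

```python
import random


def sample_and_rerank(prompt, generate, score, n=8, seed=0):
    """Draw n candidates for a prompt and return the highest-scoring one.

    generate(prompt, rng) -> candidate   (stand-in for autoregressive sampling)
    score(prompt, candidate) -> float    (stand-in for a CLIP-style scorer)
    """
    rng = random.Random(seed)
    candidates = [generate(prompt, rng) for _ in range(n)]
    return max(candidates, key=lambda cand: score(prompt, cand))


# Toy stand-ins: a "candidate" is a short list of fake image-token ids, and
# the "scorer" simply prefers candidates whose mean id is close to 4096.
def toy_generate(prompt, rng):
    return [rng.randrange(8192) for _ in range(4)]


def toy_score(prompt, candidate):
    return -abs(sum(candidate) / len(candidate) - 4096)


best = sample_and_rerank("a red bicycle", toy_generate, toy_score)
print(best)
```

Reranking trades extra inference compute for quality: the cost grows linearly with the number of candidates, while the scorer only has to be good at ranking, not at generating.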
CM3leon's flagship number is a zero-shot FID of 4.88 on MS-COCO, which Meta reported as state of the art for text-to-image generation at the time, beating Google's much larger Parti model [1][2]. "Zero-shot" means the model was not trained on MS-COCO captions; FID measures how close the distribution of generated images is to real ones, so lower is better.
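For reference, the standard definition of FID compares Gaussian fits to Inception-v3 features of real and generated images. In the formula below, $\mu_r, \Sigma_r$ are the feature mean and covariance for real images and $\mu_g, \Sigma_g$ those for generated images; the symbols are the conventional ones, not notation from the CM3leon paper.

```latex
\mathrm{FID} = \lVert \mu_r - \mu_g \rVert_2^2
  + \operatorname{Tr}\!\left( \Sigma_r + \Sigma_g - 2\,(\Sigma_r \Sigma_g)^{1/2} \right)
```

The score is zero when the two feature distributions have identical means and covariances, and it grows as either the means or the covariances drift apart.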
The table below lists figures reported in the paper's MS-COCO comparison. Numbers are zero-shot FID; lower is better.
| Model | FID (MS-COCO, zero-shot) | Parameters |
|---|---|---|
| CM3leon-7B | 4.88 | 7B |
| Re-Imagen | 5.25 | 3.6B |
| Parti | 7.23 | 20B |
| Muse | 7.88 | 3B |
| Stable Diffusion | 12.60 | 800M |
The comparison is notable on two axes: CM3leon-7B outperforms Parti with roughly a third of its parameters (7B versus 20B), and it does so as an autoregressive (not diffusion) model, undercutting a common assumption that diffusion was strictly more efficient for high-quality image synthesis [1][2]. Meta also highlighted that the autoregressive token formulation retains the controllability and inference flexibility of language models [1].
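Beyond benchmark image generation, the fine-tuned CM3leon was shown performing a range of tasks from a single model, including text-guided image editing, image captioning, visual question answering, and structure-guided generation from inputs such as segmentation maps or sketches [1][3].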
Meta released the research paper and example outputs but did not publicly release the CM3leon model weights, positioning the work as a demonstration of the autoregressive multimodal recipe rather than a shipped product [1].
CM3leon (2023) and Chameleon (2024) are often conflated because of the shared pronunciation and the common premise of one transformer over interleaved image and text tokens. They differ in important ways. CM3leon is built on the CM3 causally-masked objective and relies on retrieval-augmented pretraining plus supervised fine-tuning, with the retriever supplying external context during training [2]. Chameleon, described in "Chameleon: Mixed-Modal Early-Fusion Foundation Models" (May 2024), is an early-fusion model trained from scratch end-to-end on an interleaved mixture of images, text, and code, without the retrieval component, and is oriented toward general mixed-modal reasoning and interleaved generation rather than primarily text-to-image quality [4]. In short, CM3leon was a focused autoregressive image-generation result that demonstrated the LLM training recipe transfers to images, and Chameleon generalized the token-based, early-fusion idea into a broader foundation model.