CM3leon

Updated Jul 16, 2026

CM3leon (pronounced "chameleon") is a multimodal generative model from Meta AI, introduced in July 2023, that handles both text-to-image and image-to-text generation in a single architecture. It is a retrieval-augmented, token-based, decoder-only transformer: unlike the diffusion models that dominated text-to-image work at the time (such as Stable Diffusion, DALL-E 2, and Midjourney), CM3leon generates images one discrete token at a time, in the same way a language model generates words. Its headline result was a state-of-the-art zero-shot Fréchet inception distance (FID) on the MS-COCO benchmark while using roughly five times less training compute than comparable methods ^[1]^[2].

The work was described in the paper "Scaling Autoregressive Multi-Modal Models: Pretraining and Instruction Tuning" by Lili Yu and colleagues at Meta, posted to arXiv on 5 September 2023 ^[2]. Meta presented CM3leon as the first multimodal model trained with a recipe adapted from text-only large language models: a large-scale retrieval-augmented pretraining stage followed by multitask supervised fine-tuning (SFT) ^[1]^[2].

CM3leon should not be confused with Chameleon, a separate and later Meta model released in May 2024. The two share a pronunciation and the broad idea of treating images and text as a common token stream, but they are distinct systems with different architectures and goals (see Relationship to Chameleon below).

Background and architecture

CM3leon is built on the CM3 (Causally Masked Multimodal) architecture, an earlier line of Meta research on decoder-only models trained over documents that interleave text and image tokens ^[1]^[2]. The CM3 family combines standard left-to-right (causal) language-model training with a masking objective: spans of the document are masked out and moved to the end of the sequence, so the model learns to fill them in using both preceding and following context. Applied to mixed-modal documents, this lets one decoder-only model both continue and infill text and images.
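The masking step described above can be illustrated with a toy sketch. The sentinel string `<mask:i>`, the function name `causally_mask`, and the example tokens are hypothetical and chosen for illustration; the sources do not give the actual token names or span-sampling rules.

```python
def causally_mask(tokens, spans):
    """Move each (start, end) span to the end of the sequence.

    Each span is replaced in place by a sentinel, and the removed tokens are
    appended after a copy of the same sentinel, so a left-to-right model can
    learn to fill the gap using context from both sides.
    Spans are assumed to be non-overlapping.
    """
    body, tail = [], []
    cursor = 0
    for i, (start, end) in enumerate(sorted(spans)):
        sentinel = f"<mask:{i}>"          # hypothetical sentinel name
        body.extend(tokens[cursor:start])  # keep text before the span
        body.append(sentinel)              # mark where the span was
        tail.append(sentinel)              # announce which span follows
        tail.extend(tokens[start:end])     # the span itself, moved to the end
        cursor = end
    body.extend(tokens[cursor:])
    return body + tail


# A mixed-modal toy document: text tokens with three image tokens inside.
doc = ["a", "photo", "of", "<img_17>", "<img_902>", "<img_5>", "on", "grass"]
masked = causally_mask(doc, [(3, 5)])
print(masked)
```

Training on the rearranged sequence with an ordinary next-token loss is what lets a single decoder both continue a document and infill a gap in it.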

To turn pictures into tokens, CM3leon uses the image tokenizer from Gafni et al. (2022), the same component behind Meta's earlier Make-A-Scene system. It encodes a 256x256 image into 1,024 discrete tokens drawn from a learned vocabulary of 8,192 image codes ^[2]. Text is handled by a separate tokenizer with a vocabulary of 56,320, and a special break token marks the boundary between modalities. The model operates over sequences of up to 4,096 tokens ^[2].
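The figures above imply a simple token budget, worked through in the sketch below. Two things are assumptions rather than statements from the sources: that the 1,024 tokens form a square grid over the image, and that the raw image is 8-bit RGB.

```python
import math

IMAGE_SIDE = 256          # pixels per side (from the text)
TOKENS_PER_IMAGE = 1024   # discrete tokens per image (from the text)
IMAGE_VOCAB = 8192        # image codebook size (from the text)
CONTEXT_LENGTH = 4096     # maximum sequence length (from the text)

# Assumption: the tokens form a square grid over the image.
grid_side = math.isqrt(TOKENS_PER_IMAGE)          # tokens along one side
pixels_per_token = IMAGE_SIDE // grid_side         # side of the patch one token covers

# Information carried by the token representation.
bits_per_token = math.ceil(math.log2(IMAGE_VOCAB))
bits_per_image = TOKENS_PER_IMAGE * bits_per_token

# Assumption: the raw image is 8-bit RGB.
raw_bits = IMAGE_SIDE * IMAGE_SIDE * 3 * 8
compression = raw_bits / bits_per_image

# Upper bound on whole images per sequence, ignoring text and break tokens.
max_images_in_context = CONTEXT_LENGTH // TOKENS_PER_IMAGE

print(grid_side, pixels_per_token, bits_per_image, round(compression), max_images_in_context)
```

Under these assumptions each token stands for an 8x8 pixel patch, and at most four images fit in one sequence before any text is added.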

Three model sizes were trained, scaling the same decoder-only design from 350 million up to 7 billion parameters:

Model	Parameters	Layers	Model dimension
CM3leon-350M	350M	24	1,024
CM3leon-760M	760M	24	1,536
CM3leon-7B	7B	32	4,096

The architecture follows decoder-only transformer conventions used for contemporary LLMs, and the 7B configuration is the one used for the headline benchmark numbers ^[2].
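As a rough sanity check, the layer counts and model dimensions in the table can be turned into parameter estimates with the common rule of thumb of about 12·d² weights per transformer layer. This is a back-of-the-envelope sketch, not the paper's accounting: it assumes a 4x MLP expansion, tied input and output embeddings, a combined vocabulary equal to the sum of the text and image vocabularies, and it ignores biases, normalization weights, and special tokens.

```python
# Assumption: combined vocabulary is text (56,320) plus image (8,192) entries.
VOCAB = 56320 + 8192


def approx_params(layers, d_model, vocab=VOCAB):
    """Estimate decoder-only transformer parameters.

    Per layer: about 4*d^2 for attention projections plus 8*d^2 for an MLP
    with 4x expansion. Embeddings are counted once (assumed tied).
    """
    blocks = 12 * layers * d_model ** 2
    embeddings = vocab * d_model
    return blocks + embeddings


configs = {
    "CM3leon-350M": (24, 1024),
    "CM3leon-760M": (24, 1536),
    "CM3leon-7B": (32, 4096),
}
estimates = {name: approx_params(*cfg) for name, cfg in configs.items()}

for name, count in estimates.items():
    print(f"{name}: ~{count / 1e9:.2f}B parameters")
```

The estimates come out near 0.37B, 0.78B, and 6.7B, which is consistent with the nominal sizes given the simplifications.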

Training recipe

Retrieval-augmented pretraining

The first stage adapts retrieval-augmented pretraining, a technique from text language modeling, to the multimodal setting. During training, each input document is paired with related documents fetched by a retriever, so the model conditions on external, in-context examples rather than relying only on parameters ^[1]^[2]. The retriever is a CLIP-based dense bi-encoder using a ViT-B/32 backbone, which embeds text and images into a shared space; candidate documents are scored by relevance and selected with constraints that encourage diversity across both modality and source so the retrieved set is not redundant ^[2]. The paper reports retrieving on the order of two to three documents per training example and applies query dropout (dropping a fraction of the query tokens) as regularization. Prepending retrieved documents effectively multiplies the data the model sees during pretraining without enlarging the base corpus.
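The retrieval step can be sketched in a few lines. This is a simplified illustration, not the paper's implementation: embeddings are plain lists standing in for CLIP vectors, the similarity threshold `max_sim`, the dropout `rate`, and all function names are hypothetical, and the modality and source diversity constraints are reduced to a single near-duplicate filter.

```python
import math
import random


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def retrieve(query_vec, memory, k=2, max_sim=0.9):
    """Greedy top-k retrieval that skips near-duplicates.

    memory: list of (doc_id, embedding) pairs. Embeddings are assumed to come
    from a shared text-image encoder such as CLIP (not computed here).
    """
    ranked = sorted(memory, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    chosen = []
    for doc_id, vec in ranked:
        if cosine(query_vec, vec) > max_sim:
            continue  # almost identical to the query: adds little new context
        if any(cosine(vec, other) > max_sim for _, other in chosen):
            continue  # redundant with a document already selected
        chosen.append((doc_id, vec))
        if len(chosen) == k:
            break
    return [doc_id for doc_id, _ in chosen]


def query_dropout(tokens, rate=0.2, rng=random):
    """Randomly drop a fraction of query tokens; always keep at least one."""
    kept = [t for t in tokens if rng.random() >= rate]
    return kept or tokens[:1]


memory = [
    ("dup", [1.0, 0.0]),     # near-copy of the query
    ("near", [0.8, 0.6]),    # relevant
    ("near2", [0.79, 0.61]), # relevant but redundant with "near"
    ("far", [0.0, 1.0]),     # different content
]
picked = retrieve([1.0, 0.0], memory, k=2)
print(picked)
```

In this toy memory the near-copy and the redundant neighbour are both skipped, so the retrieved set stays relevant without repeating itself.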

A central claim of the work is that this stage is comparatively cheap. CM3leon reaches its results with about five times less training compute than comparable transformer-based text-to-image methods, and Meta noted that the largest model was trained on only three billion text tokens, far fewer than the corpora behind many competing systems ^[1]^[2].

Supervised fine-tuning

After pretraining, CM3leon undergoes multitask supervised fine-tuning on a mixture of instruction-style image and text tasks, mirroring the instruction-tuning step used for chat-oriented language models ^[1]^[2]. This stage is what gives the model its broad, controllable capabilities. Reported SFT tasks include text-guided image generation and editing, image captioning, visual question answering, and structure-guided generation, where an input such as a segmentation map, sketch, or other structural condition guides the output image ^[1]^[3].

Training data

All of CM3leon's image training data comes from a licensed Shutterstock dataset of image-and-text pairs ^[1]^[3]. Meta emphasized this choice as a way to sidestep the data-ownership and attribution concerns that surrounded text-to-image systems trained on web-scraped images, and framed it as evidence that strong results are achievable without unlicensed data ^[1]^[3]. The paper does not state the exact number of licensed images.

Generation and decoding

Because CM3leon is autoregressive, image quality depends heavily on how tokens are sampled. The paper studies several decoding strategies ^[2]:

Classifier-free guidance (CFG): at each step the model's conditional prediction (given the text prompt) is combined with an unconditional prediction, with a scaling factor that trades off prompt fidelity against diversity. This is the same idea widely used in diffusion models, adapted to token sampling.
Contrastive decoding (CD-K): a top-k variant of contrastive decoding that the authors introduce as an alternative to CFG. Rather than always favoring the single most probable token, it contrasts the model's conditional and unconditional predictions, and its outputs are complementary to those from CFG. Because both predictions come from the same model, CM3leon can build these contrastive signals internally without a second network.
Nucleus (top-p) sampling: standard threshold-based token sampling.
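Two of the strategies above, classifier-free guidance and nucleus sampling, can be sketched on a single decoding step. The blend `uncond + scale * (cond - uncond)` is one common way to write guidance on logits; the function names, the toy logits, and the values of `scale` and `p` are illustrative assumptions, not settings from the paper.

```python
import math
import random


def cfg_logits(cond, uncond, scale):
    """Blend conditional and unconditional logits; scale=1 recovers cond."""
    return [u + scale * (c - u) for c, u in zip(cond, uncond)]


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_sample(probs, p=0.9, rng=random):
    """Sample from the smallest set of tokens whose cumulative mass reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    nucleus, mass = [], 0.0
    for i in order:
        nucleus.append(i)
        mass += probs[i]
        if mass >= p:
            break
    threshold = rng.random() * mass  # renormalize within the nucleus
    running = 0.0
    for i in nucleus:
        running += probs[i]
        if threshold <= running:
            return i
    return nucleus[-1]


cond = [2.0, 1.0, 0.0, -1.0]    # logits given the text prompt (toy values)
uncond = [1.0, 1.0, 1.0, 1.0]   # logits with the prompt removed (toy values)
guided = cfg_logits(cond, uncond, scale=3.0)
token = top_p_sample(softmax(guided), p=0.9, rng=random.Random(0))
print(guided, token)
```

A larger `scale` pushes probability toward tokens the prompt favors, which tightens prompt fidelity at the cost of diversity; `p` then limits sampling to the high-probability head of the guided distribution.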

For the benchmark result, the model generates multiple candidate images and reranks them with a CLIP-based scorer to pick the best one ^[2].
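The generate-then-rerank step amounts to a best-of-n selection. In the sketch below, `generate` and `score` are hypothetical callables standing in for the image sampler and the CLIP-based scorer; the value of `n` and the stand-in functions are illustrative only.

```python
def best_of_n(prompt, generate, score, n=8):
    """Sample n candidates and return the one the scorer ranks highest."""
    candidates = [generate(prompt, seed) for seed in range(n)]
    return max(candidates, key=lambda image: score(prompt, image))


# Stand-ins so the sketch runs without a model: a "generated image" is a
# labeled string, and the "scorer" prefers the candidate from seed 5.
def fake_generate(prompt, seed):
    return f"{prompt}#{seed}"


def fake_score(prompt, image):
    seed = int(image.rsplit("#", 1)[1])
    return -abs(seed - 5)


best = best_of_n("a cat", fake_generate, fake_score, n=8)
print(best)
```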

Benchmark results

CM3leon's flagship number is a zero-shot FID of 4.88 on MS-COCO, which Meta reported as state of the art for text-to-image generation at the time, beating Google's much larger Parti model ^[1]^[2]. "Zero-shot" means the model was not trained on MS-COCO captions; FID measures how close the distribution of generated images is to real ones, so lower is better.
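For reference, FID is conventionally computed as the Fréchet distance between two Gaussians fitted to Inception-network features of the real and generated images. The notation below is the standard one and is not taken from the CM3leon paper: \mu_r, \Sigma_r are the mean and covariance of the real-image features, and \mu_g, \Sigma_g those of the generated-image features.

```latex
\mathrm{FID} = \lVert \mu_r - \mu_g \rVert_2^2
  + \operatorname{Tr}\!\left( \Sigma_r + \Sigma_g - 2\left( \Sigma_r \Sigma_g \right)^{1/2} \right)
```

Both terms vanish when the two feature distributions match, which is why a lower score indicates generated images that are statistically closer to real ones.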

The table below lists figures reported in the paper's MS-COCO comparison. Numbers are zero-shot FID; lower is better.

Model	FID (MS-COCO, zero-shot)	Parameters
CM3leon-7B	4.88	7B
Re-Imagen	5.25	3.6B
Parti	7.23	20B
Muse	7.88	3B
Stable Diffusion	12.60	800M

The comparison is notable on two axes: CM3leon-7B outperforms Parti with roughly a third of its parameters (7B versus 20B), and it does so as an autoregressive (not diffusion) model, undercutting a common assumption that diffusion was strictly more efficient for high-quality image synthesis ^[1]^[2]. Meta also highlighted that the autoregressive token formulation retains the controllability and inference flexibility of language models ^[1].

Capabilities

Beyond benchmark image generation, the fine-tuned CM3leon was shown performing a range of tasks from a single model ^[1]^[3]:

Text-to-image generation: producing images from natural-language prompts, including prompts that require following several attributes at once.
Text-guided image editing: modifying an existing image according to an instruction (for example changing the color of an object or the time of day in a scene).
Structure-guided and image-to-image generation: generating an image conditioned on structural inputs such as segmentation maps or object outlines, covering controllability tasks such as segmentation-to-image and object-to-image generation.
Image captioning and visual question answering: generating text descriptions of images and answering questions about their content, using the same decoder running image-to-text.
Super-resolution: Meta described a separate super-resolution stage that takes outputs to higher resolution for the demonstrations in its blog post ^[1].

Meta released the research paper and example outputs but did not publicly release the CM3leon model weights, positioning the work as a demonstration of the autoregressive multimodal recipe rather than a shipped product ^[1].

Relationship to Chameleon

CM3leon (2023) and Chameleon (2024) are often conflated because of the shared pronunciation and the common premise of one transformer over interleaved image and text tokens. They differ in important ways. CM3leon is built on the CM3 causally-masked objective and relies on retrieval-augmented pretraining plus supervised fine-tuning, with the retriever supplying external context during training ^[2]. Chameleon, described in "Chameleon: Mixed-Modal Early-Fusion Foundation Models" (May 2024), is an early-fusion model trained from scratch end-to-end on an interleaved mixture of images, text, and code, without the retrieval component, and is oriented toward general mixed-modal reasoning and interleaved generation rather than primarily text-to-image quality ^[4]. In short, CM3leon was a focused autoregressive image-generation result that demonstrated the LLM training recipe transfers to images, and Chameleon generalized the token-based, early-fusion idea into a broader foundation model.

References

1. Meta AI. "Introducing CM3leon, a more efficient, state-of-the-art generative model for text and images." 14 July 2023. https://ai.meta.com/blog/generative-ai-text-images-cm3leon/
2. Lili Yu, Bowen Shi, Ramakanth Pasunuru, Benjamin Muller, Olga Golovneva, Tianlu Wang, Arun Babu, Binh Tang, Brian Karrer, Shelly Sheynin, Candace Ross, Adam Polyak, Russell Howes, Vasu Sharma, Puxin Xu, Hovhannes Tamoyan, Oron Ashual, Uriel Singer, Shang-Wen Li, Susan Zhang, Richard James, Gargi Ghosh, Yaniv Taigman, Maryam Fazel-Zarandi, Asli Celikyilmaz, Luke Zettlemoyer, Armen Aghajanyan. "Scaling Autoregressive Multi-Modal Models: Pretraining and Instruction Tuning." arXiv:2309.02591, 5 September 2023. https://arxiv.org/abs/2309.02591
3. Maximilian Schreiner. "Meta's new state-of-the-art, versatile image model is trained solely on licensed data." The Decoder, 14 July 2023. https://the-decoder.com/metas-new-state-of-the-art-versatile-image-model-is-trained-solely-on-licensed-data/
4. Chameleon Team (FAIR at Meta). "Chameleon: Mixed-Modal Early-Fusion Foundation Models." arXiv:2405.09818, 16 May 2024. https://arxiv.org/abs/2405.09818
