Chameleon (Meta AI)
Last reviewed
Jun 3, 2026
Sources
7 citations
Review status
Source-backed
Revision
v1 · 1,538 words
Chameleon is a family of mixed-modal foundation models released by Meta AI's Fundamental AI Research (FAIR) group in 2024. It was introduced in the paper "Chameleon: Mixed-Modal Early-Fusion Foundation Models," credited to the "Chameleon Team" and posted to arXiv on 16 May 2024 [1]. The models represent both images and text as sequences of discrete tokens and process them with a single transformer trained from scratch over the combined token stream. This design, which the authors call early fusion, lets one network read and write documents that interleave text and images in arbitrary order. Meta released two sizes, with 7 billion and 34 billion parameters, under a research license in June 2024, but withheld the image-generation capability of the research model [2][3].
This article concerns the Meta FAIR model. Several unrelated products share the name "Chameleon," including database and software tools; they are not connected to the work described here.
Most multimodal systems of the early 2020s used what the Chameleon authors describe as late fusion. In that approach a separate, often frozen, image encoder produces embeddings that are projected into a pretrained large language model, which keeps text and vision largely in distinct pathways and merges them only partway through the network. Flamingo and IDEFICS are examples of this pattern.
Chameleon instead converts images into discrete tokens drawn from a fixed vocabulary, the same kind of object as a text token, and concatenates them with text tokens into one sequence. A single transformer is then trained autoregressively to predict the next token regardless of its modality. Because there is no modality-specific encoder bolted onto a text backbone, and no late merge step, the authors call the architecture early fusion and token-based. The approach builds on two earlier Meta efforts, CM3 (2022) and CM3Leon (2023), which applied autoregressive token modeling to mixed-modal documents and to text-to-image generation respectively [1].
Chameleon's transformer takes the LLaMA-2 recipe (RMSNorm, the SwiGLU activation, and rotary position embeddings) as a starting point, with a 4,096-token context window [1]. Training a single model over both modalities turned out to be unstable at scale, so the team made two main architectural changes. First, they applied query-key normalization (QK-Norm), which applies layer normalization to the query and key vectors before the attention dot product and so limits the growth of the attention logits. Second, for the 34B model they reordered the layer normalizations to sit after the attention and feed-forward blocks, following the scheme used in the Swin Transformer. They also added a z-loss term (a small penalty of the form 10^-5 times the squared log of the softmax partition function) to keep the output logits well behaved. The 7B model used dropout of 0.1; the 34B model used no dropout, adopted grouped-query attention, and relied on the norm reordering for stability [1].
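The two stabilizers described above can be illustrated with a minimal pure-Python sketch. It is not the paper's implementation: the layer normalization here has no learned gain or bias, the vectors are toy values, and the function names are hypothetical. It only shows why normalizing queries and keys keeps attention scores from growing with the activation norm, and how a z-loss of the stated form penalizes a large softmax partition function.

```python
import math

def layer_norm(vec, eps=1e-5):
    """Normalize a vector to zero mean and unit variance (no learned gain or bias)."""
    mean = sum(vec) / len(vec)
    var = sum((x - mean) ** 2 for x in vec) / len(vec)
    return [(x - mean) / math.sqrt(var + eps) for x in vec]

def attention_logit(q, k, qk_norm=True):
    """Scaled dot-product score for one query/key pair, optionally with QK-Norm."""
    if qk_norm:
        q, k = layer_norm(q), layer_norm(k)
    return sum(a * b for a, b in zip(q, k)) / math.sqrt(len(q))

def log_partition(logits):
    """Numerically stable log of the softmax partition function Z = sum(exp(logit))."""
    m = max(logits)
    return m + math.log(sum(math.exp(x - m) for x in logits))

def z_loss(logits, coeff=1e-5):
    """Penalty coeff * (log Z)^2, added to the training loss to keep logits small."""
    return coeff * log_partition(logits) ** 2

# Toy query and key; big_q mimics activations whose norm has drifted upward.
q = [1.0, 2.0, 3.0, 4.0]
k = [0.5, -1.0, 2.0, 0.0]
big_q = [1000.0 * x for x in q]

plain_small = attention_logit(q, k, qk_norm=False)
plain_big = attention_logit(big_q, k, qk_norm=False)
normed_small = attention_logit(q, k)
normed_big = attention_logit(big_q, k)

print("without QK-Norm:", plain_small, plain_big)   # score scales with the query norm
print("with QK-Norm:   ", normed_small, normed_big) # score is almost unchanged
print("z-loss:", z_loss([0.0, 0.0]), z_loss([50.0, 50.0]))
```

Without normalization the score grows in proportion to the query norm, which is the kind of drift that can destabilize the softmax; with QK-Norm the two scores agree up to the small `eps` term.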
Images are turned into tokens by an image tokenizer based on the approach of Gafni et al. (2022), the Make-A-Scene work. It encodes a 512 by 512 image into 1,024 discrete tokens from a learned codebook of 8,192 entries, and it was trained only on licensed images [1]. Text and image tokens share one vocabulary built with the SentencePiece BPE library; the combined vocabulary contains 65,536 tokens, of which 8,192 are the image codebook entries [1].
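As a rough illustration of the token-based design, the sketch below builds one mixed-modal sequence from text token IDs and image codebook indices, then splits it into next-token inputs and targets. The sizes come from the figures above; everything else is an assumption made for illustration. In particular, placing the image codes at the top of the shared ID range (`IMAGE_ID_OFFSET`), the made-up token IDs, and the absence of any image-boundary markers are hypothetical and do not describe the released tokenizer.

```python
VOCAB_SIZE = 65_536           # shared text + image vocabulary
IMAGE_CODEBOOK_SIZE = 8_192   # learned image codebook entries
TOKENS_PER_IMAGE = 1_024      # tokens produced for one 512 x 512 image
CONTEXT_LENGTH = 4_096        # model context window

# Hypothetical layout: image codes occupy the last 8,192 IDs of the vocabulary.
IMAGE_ID_OFFSET = VOCAB_SIZE - IMAGE_CODEBOOK_SIZE

def image_codes_to_ids(codes):
    """Map codebook indices (0..8191) for one image into shared-vocabulary IDs."""
    if len(codes) != TOKENS_PER_IMAGE:
        raise ValueError(f"expected {TOKENS_PER_IMAGE} codes, got {len(codes)}")
    if any(not 0 <= c < IMAGE_CODEBOOK_SIZE for c in codes):
        raise ValueError("image code outside the codebook range")
    return [IMAGE_ID_OFFSET + c for c in codes]

def build_sequence(segments):
    """Concatenate ("text", ids) and ("image", codes) segments into one token stream."""
    sequence = []
    for kind, values in segments:
        if kind == "text":
            sequence.extend(values)
        elif kind == "image":
            sequence.extend(image_codes_to_ids(values))
        else:
            raise ValueError(f"unknown segment kind: {kind}")
    if len(sequence) > CONTEXT_LENGTH:
        raise ValueError("sequence exceeds the context window")
    return sequence

# Made-up token IDs and image codes, standing in for real tokenizer output.
caption_ids = [17, 42, 99]
image_codes = [(i * 37) % IMAGE_CODEBOOK_SIZE for i in range(TOKENS_PER_IMAGE)]
followup_ids = [7, 8]

sequence = build_sequence([
    ("text", caption_ids),
    ("image", image_codes),
    ("text", followup_ids),
])

# Next-token prediction uses the same shifted targets at every position,
# whether the target is a text token or an image token.
inputs, targets = sequence[:-1], sequence[1:]

print(len(sequence))                        # 3 + 1024 + 2 = 1029 tokens
print(CONTEXT_LENGTH // TOKENS_PER_IMAGE)   # at most 4 images fit, leaving no room for text
```

The budget arithmetic is the practical point: at 1,024 tokens per image, a 4,096-token window holds only a few images alongside any text.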
| Property | Chameleon-7B | Chameleon-34B |
|---|---|---|
| Parameters | 7 billion | 34 billion |
| Context length | 4,096 tokens | 4,096 tokens |
| Grouped-query attention | No | Yes |
| Dropout | 0.1 | 0.0 |
| QK-Norm | Yes | Yes |
| Training tokens | 4.4 trillion | 4.4 trillion |
| Epochs over data | 2.1 | 2.1 |
| Peak learning rate | 1.0 × 10^-4 | 1.0 × 10^-4 |
Sources: paper Table 1 and the training section [1].
Pretraining used a large corpus of mixed-modal data. The paper describes roughly 2.9 trillion text-only tokens, about 1.4 billion text-image pairs (tokenized into roughly 1.5 trillion tokens), and around 400 billion tokens of naturally interleaved text and image data [1]. For both released sizes, the paper's Table 1 lists 4.4 trillion training tokens and 2.1 epochs over the data. Training proceeded in two stages, the second of which upweighted higher-quality and instruction-style data. After pretraining, the models went through supervised fine-tuning on instruction data, and the released checkpoints were additionally safety tuned [1][2].
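The data figures quoted above are rounded, and a short calculation shows what they imply. The sketch below only does arithmetic on those approximate numbers; the variable names are illustrative and the results should be read as rough estimates.

```python
# Approximate figures as quoted above (all rounded in the source text).
text_only_tokens = 2.9e12     # text-only data
pair_tokens = 1.5e12          # tokens from text-image pairs
pair_count = 1.4e9            # number of text-image pairs
interleaved_tokens = 0.4e12   # naturally interleaved text and image data

mixture = {
    "text-only": text_only_tokens,
    "text-image pairs": pair_tokens,
    "interleaved": interleaved_tokens,
}

total = sum(mixture.values())
shares = {name: tokens / total for name, tokens in mixture.items()}
tokens_per_pair = pair_tokens / pair_count

print(f"sum of quoted components: {total:.2e} tokens")
for name, share in shares.items():
    print(f"{name}: {share:.1%}")
print(f"average tokens per text-image pair: {tokens_per_pair:.0f}")
```

The rounded components sum to about 4.8 trillion tokens, somewhat above the 4.4 trillion listed in the table, so the individual figures are best treated as approximate. The average of roughly 1,070 tokens per pair is consistent with 1,024 image tokens plus a short caption, although that reading is an inference rather than something the text states.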
In its full research form Chameleon can take any mix of text and images as input and produce any mix of text and images as output, including long-form documents that alternate between the two. On text-only tasks it was reported to outperform LLaMA-2 and to be competitive with Mixtral 8x7B and Gemini Pro; on image understanding it reached strong results in image captioning and visual question answering [1].
| Benchmark | Chameleon-34B | Comparison point |
|---|---|---|
| MMLU (5-shot) | 65.8 | Mixtral 8x7B: 70.6 |
| GSM8k (maj@32) | 77.0 | Mixtral 8x7B: 75.1 |
| COCO captioning (CIDEr, fine-tuned) | 140.8 | Flamingo-80B (fine-tuned): 138.1 |
| Flickr30k captioning (CIDEr, fine-tuned) | 82.3 | Flamingo-80B (fine-tuned): ~78.4 |
| VQAv2 (multitask) | 69.6 | Gemini Pro: 71.2 |
Numbers from the paper's evaluation tables [1].
The most novel evaluation concerned long-form mixed-modal generation, where a single prompt expects an interleaved answer of text and pictures. The team ran human preference studies comparing Chameleon-34B against Gemini Pro and against GPT-4V. Because those systems did not generate images directly at the time, their outputs were augmented with separately generated images for the comparison. In pairwise judgments Chameleon-34B had a reported win rate of 60.4 percent over the augmented Gemini Pro (435 wins, 362 ties, 251 losses) and 51.6 percent over the augmented GPT-4V (375 wins, 331 ties, 342 losses) [1]. In an absolute task-fulfillment rating, Chameleon completely fulfilled 55.2 percent of tasks, against 37.6 percent for the Gemini setup and 44.7 percent for the GPT-4V setup [1]. These results are subject to caveats about prompt selection and about the artificial image augmentation used for the baselines.
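The win, tie, and loss counts above can be turned into percentages directly. The sketch below reports the raw share of each outcome and a win rate that counts each tie as half a win. The half-tie convention is an assumption here, since the text above does not say how the reported win rates were derived, and the function name is hypothetical.

```python
def summarize(wins, ties, losses):
    """Return outcome shares and a half-tie win rate for one pairwise comparison."""
    total = wins + ties + losses
    return {
        "total": total,
        "win_share": wins / total,
        "tie_share": ties / total,
        "loss_share": losses / total,
        # Assumed convention: a tie counts as half a win.
        "half_tie_win_rate": (wins + 0.5 * ties) / total,
    }

gemini = summarize(wins=435, ties=362, losses=251)
gpt4v = summarize(wins=375, ties=331, losses=342)

for name, result in (("vs Gemini Pro (augmented)", gemini), ("vs GPT-4V (augmented)", gpt4v)):
    print(name)
    print(f"  judgments: {result['total']}")
    print(f"  win / tie / loss: {result['win_share']:.1%} / "
          f"{result['tie_share']:.1%} / {result['loss_share']:.1%}")
    print(f"  win rate with ties as half: {result['half_tie_win_rate']:.1%}")
```

Both comparisons cover 1,048 judgments. Under the half-tie assumption the GPT-4V comparison comes out at about 51.6 percent, matching the reported figure, while the Gemini Pro comparison comes out at about 58.8 percent, slightly below the reported 60.4 percent, so the paper's exact convention for that figure may differ from the one assumed here.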
Meta announced the public release of Chameleon on 18 June 2024 as part of a batch of FAIR research artifacts [2]. The released items were the 7B and 34B model weights, fast GPU inference code, the tokenizers, and evaluation prompts, distributed under the Chameleon Research License after an access request [3].
The key restriction was on image output. Although the research model could generate images, Meta did not ship that capability. The FAIR blog stated plainly: "At this time, we are not releasing the Chameleon image generation model," and described the released models as supporting "mixed-modal inputs and text-only output to be used for research purposes" [2]. In other words, the public checkpoints accept interleaved text and images as input but emit only text. Meta framed this as a safety-motivated decision, noting that the released models were safety tuned and that risks remained despite mitigation. On Hugging Face the checkpoints are published as facebook/chameleon-7b and facebook/chameleon-30b; the larger one is the 34B-class model from the paper, listed under the rounded "30b" name and tagged for image-text-to-text use [4][5]. The GitHub repository was later archived and made read-only on 1 September 2025 [3].
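For readers who want to try the released checkpoints, the following is a minimal sketch of image-plus-text input with text-only output through the Hugging Face `transformers` library. It assumes a library version that includes the `ChameleonProcessor` and `ChameleonForConditionalGeneration` classes, that `torch`, `Pillow`, and `accelerate` are installed, that access to the gated checkpoint has been granted, and that a suitable GPU is available. The helper name `describe_image` and its parameters are hypothetical.

```python
def describe_image(image_path, question, model_id="facebook/chameleon-7b", max_new_tokens=50):
    """Ask a released Chameleon checkpoint a question about one local image.

    Imports are done lazily so this module can be loaded without the heavy
    dependencies; calling the function downloads the model weights.
    """
    import torch
    from PIL import Image
    from transformers import ChameleonForConditionalGeneration, ChameleonProcessor

    processor = ChameleonProcessor.from_pretrained(model_id)
    model = ChameleonForConditionalGeneration.from_pretrained(
        model_id,
        torch_dtype=torch.bfloat16,
        device_map="auto",  # requires the accelerate package
    )

    image = Image.open(image_path).convert("RGB")
    # The "<image>" placeholder marks where the image tokens are inserted.
    prompt = f"{question}<image>"

    inputs = processor(images=image, text=prompt, return_tensors="pt").to(
        model.device, dtype=torch.bfloat16
    )
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return processor.decode(output_ids[0], skip_special_tokens=True)

# Example call (not run here; "photo.jpg" is a placeholder path):
# print(describe_image("photo.jpg", "What do you see in this image?"))
```

The returned string contains only text, in line with the restriction described above: the public checkpoints accept images as input but do not generate them.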
Chameleon drew attention as one of the more committed attempts to unify modalities inside a single token-based transformer rather than stitching a vision module onto a language model. Commentators noted both the elegance of the early-fusion formulation and the practical limits of the public release, since the withheld image generation was the part that distinguished it most from existing open vision-language models.
Within Meta's own research, Chameleon sits in a clear line. Its discrete-token approach descended from CM3 and CM3Leon, and it was soon followed by Transfusion (Zhou et al., arXiv, August 2024), which kept a single transformer backbone but generated images with a diffusion objective on continuous image patches instead of predicting discrete image tokens. The Transfusion paper used Chameleon as its main point of comparison and reported better text-to-image results at a fraction of the compute, suggesting that discrete image tokens were a limiting factor for generation quality [6]. The early-fusion idea, jointly embedding text and vision tokens into one backbone from the start of pretraining, later reappeared in Meta's Llama 4 models (April 2025), which Meta described as natively multimodal with text-vision early fusion, though Llama 4 used a continuous vision encoder based on MetaCLIP rather than Chameleon's discrete tokenizer [7].