# Chameleon (Meta AI)

> Source: https://aiwiki.ai/wiki/chameleon
> Updated: 2026-06-27
> Categories: AI Models, Meta AI, Multimodal AI
> From AI Wiki (https://aiwiki.ai), a free encyclopedia of artificial intelligence. Quote with attribution.

**Chameleon** is a family of early-fusion, token-based mixed-modal foundation models from [Meta](/wiki/meta) AI's Fundamental AI Research (FAIR) group. It represents both images and text as discrete tokens in a single unified [transformer](/wiki/transformer), so one network can reason over and generate interleaved text and images in any order [1]. It was introduced in the paper "Chameleon: Mixed-Modal Early-Fusion Foundation Models" (arXiv, 16 May 2024) and released publicly on 18 June 2024 in 7-billion-parameter and 34-billion-parameter sizes under a research license. Meta withheld the image-generation capability, so the public checkpoints accept mixed-modal input but emit text only [1][2][3]. The paper's central claim is direct: "We present Chameleon, a family of early-fusion token-based mixed-modal models capable of understanding and generating images and text in any arbitrary sequence." [1]

This article concerns the Meta FAIR model. Several unrelated products share the name "Chameleon," including database and software tools; they are not connected to the work described here.

## What is Chameleon?

Chameleon is a [multimodal](/wiki/multimodal) [large language model](/wiki/large_language_model) that does away with the separate vision encoder used by most multimodal systems. Instead of bolting an image model onto a text backbone, it converts images into sequences of discrete tokens, the same kind of object as a text token, and concatenates them with text tokens into one stream that a single [transformer](/wiki/transformer) predicts autoregressively, token by token, regardless of modality [1]. The authors call this design early fusion and token-based, and they summarize the core idea plainly: "By quantizing images into discrete tokens, analogous to words in text, we can apply the same transformer architecture to sequences of both image and text tokens." [1] Because text and images live in the same token space, one network can read and write documents that interleave the two in arbitrary order. Meta released two sizes, Chameleon-7B and Chameleon-34B, in June 2024 under a research license [2][3].
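The quantization step in that quote can be illustrated with a generic nearest-neighbour codebook lookup. This is a minimal sketch of vector quantization in general, not Chameleon's actual tokenizer: the two-dimensional toy codebook, the `patches` values, and the function names are all assumptions chosen for illustration.

```python
def nearest_code(vector, codebook):
    """Return the index of the codebook entry closest to `vector` (squared L2 distance)."""
    best_index, best_dist = 0, float("inf")
    for index, code in enumerate(codebook):
        dist = sum((v - c) ** 2 for v, c in zip(vector, code))
        if dist < best_dist:
            best_index, best_dist = index, dist
    return best_index


def quantize(patch_vectors, codebook):
    """Map each continuous patch vector to one discrete token ID."""
    return [nearest_code(vec, codebook) for vec in patch_vectors]


# Toy codebook (hypothetical): 8 entries of dimension 2.
# A real image tokenizer learns its codebook; these values are made up.
codebook = [[float(i), float(-i)] for i in range(8)]

# Pretend encoder outputs: slightly perturbed copies of codes 5, 2, 2 and 7.
patches = [[c + 0.1 for c in codebook[i]] for i in (5, 2, 2, 7)]

image_tokens = quantize(patches, codebook)
print(image_tokens)  # [5, 2, 2, 7]
```

Once an image is a list of integer IDs like this, it can be handled by the same sequence model that handles text token IDs.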

## What is early fusion, and how does it differ from late fusion?

Most [multimodal](/wiki/multimodal) systems of the early 2020s used what the Chameleon authors describe as late fusion. In that approach a separate, often frozen, image encoder produces embeddings that are projected into a pretrained [large language model](/wiki/large_language_model), which keeps text and vision largely in distinct pathways and merges them only partway through the network. Flamingo and IDEFICS are examples of this pattern.

Chameleon instead maps each image to discrete tokens drawn from a fixed vocabulary and concatenates them with text tokens into one sequence. A single transformer is then trained autoregressively to predict the next token regardless of its modality. Because there is no modality-specific encoder and no late merge step, the authors call the architecture early fusion and token-based [1]. The approach builds on two earlier Meta efforts: CM3 (2022), which applied autoregressive token modeling to mixed-modal documents, and CM3Leon (2023), which applied it to text-to-image generation [1].
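The single-stream idea can be sketched as follows. The ID layout is an assumption made for illustration: placing image codes at the top of the vocabulary (`IMAGE_OFFSET`) and the sentinel IDs `BOI` and `EOI` are hypothetical and do not describe the released tokenizer's real layout. Only the vocabulary sizes come from the paper.

```python
VOCAB_SIZE = 65_536            # combined vocabulary size reported in the paper
IMAGE_CODEBOOK_SIZE = 8_192    # image codebook entries reported in the paper
IMAGE_OFFSET = VOCAB_SIZE - IMAGE_CODEBOOK_SIZE  # hypothetical: image codes occupy the top IDs
BOI, EOI = 1, 2                # hypothetical begin/end-of-image sentinel IDs


def to_shared_ids(segments):
    """Flatten [("text", ids), ("image", codes), ...] into one token sequence."""
    sequence = []
    for kind, ids in segments:
        if kind == "text":
            sequence.extend(ids)
        elif kind == "image":
            sequence.append(BOI)
            sequence.extend(IMAGE_OFFSET + code for code in ids)
            sequence.append(EOI)
        else:
            raise ValueError(f"unknown segment kind: {kind!r}")
    return sequence


def modality(token_id):
    """Classify an ID under the hypothetical layout above."""
    return "image" if token_id >= IMAGE_OFFSET else "text"


def next_token_pairs(sequence):
    """Autoregressive training pairs: (prefix, target) for every position after the first."""
    return [(sequence[:i], sequence[i]) for i in range(1, len(sequence))]


# A toy interleaved document: two text tokens, a two-token "image", one text token.
document = [("text", [10, 11]), ("image", [5, 2]), ("text", [12])]
sequence = to_shared_ids(document)
print(sequence)  # [10, 11, 1, 57349, 57346, 2, 12]

for prefix, target in next_token_pairs(sequence):
    print(len(prefix), "->", target, modality(target))
```

The training objective is identical at every position: the model sees a prefix and predicts the next ID, whether that ID happens to fall in the text range or the image range.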

## How is Chameleon built, and how does it tokenize images?

Chameleon's transformer follows the [LLaMA](/wiki/llama)-2 recipe (RMSNorm, the SwiGLU activation, and rotary position embeddings) as a starting point, with a 4,096-token context window [1]. Training a single model over both modalities turned out to be unstable at scale, so the team made two main architectural changes. First, they applied query-key normalization (QK-Norm), normalizing the query and key vectors before the attention dot product. Second, for the 34B model they reordered the layer normalizations to sit after the attention and feed-forward blocks, following the scheme used in the Swin Transformer. They also added a z-loss term, a penalty of 10^-5 times the squared logarithm of the softmax partition function, to keep the output logits well behaved. The 7B model used dropout of 0.1; the 34B model used no dropout and relied on the norm reordering, together with QK-Norm and z-loss, for stability. The 34B model also uses grouped-query attention, which the 7B model does not [1].
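The three stability devices can be sketched in plain Python. This is a simplified illustration under stated assumptions, not the paper's implementation: vectors are plain lists, `layer_norm` has no learned gain or bias and stands in for whatever normalization the real model uses, and `attn` and `ffn` are placeholder callables.

```python
import math


def z_loss(logits, coeff=1e-5):
    """coeff * (log Z)^2, where Z = sum(exp(logits)) is the softmax partition function."""
    m = max(logits)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return coeff * log_z ** 2


def layer_norm(vec, eps=1e-5):
    """Normalize to zero mean and unit variance (learned parameters omitted)."""
    mean = sum(vec) / len(vec)
    var = sum((x - mean) ** 2 for x in vec) / len(vec)
    return [(x - mean) / math.sqrt(var + eps) for x in vec]


def attention_logit(q, k, qk_norm=True):
    """Scaled dot product; with QK-Norm, q and k are normalized before the product."""
    if qk_norm:
        q, k = layer_norm(q), layer_norm(k)
    return sum(a * b for a, b in zip(q, k)) / math.sqrt(len(q))


def add(a, b):
    return [x + y for x, y in zip(a, b)]


def prenorm_block(x, attn, ffn):
    """Pre-norm ordering: normalize the input of each sub-layer."""
    h = add(x, attn(layer_norm(x)))
    return add(h, ffn(layer_norm(h)))


def reordered_block(x, attn, ffn):
    """Reordered (Swin-style) variant: normalize the output of each sub-layer."""
    h = add(x, layer_norm(attn(x)))
    return add(h, layer_norm(ffn(h)))


# QK-Norm removes the dependence of the attention logit on vector magnitude.
q = [1.0, 2.0, 3.0, 4.0]
big_q = [100.0 * v for v in q]
raw = attention_logit(big_q, big_q, qk_norm=False)    # 150000.0
normed = attention_logit(big_q, big_q, qk_norm=True)  # close to 2.0
print(raw, normed)

# z-loss grows as the logits drift upward, even if their differences are unchanged.
print(z_loss([0.0, 0.0]), z_loss([50.0, 50.0]))

# Placeholder sub-layers that amplify their input by 10x.
scale = lambda v: [10.0 * x for x in v]
print(prenorm_block(q, scale, scale))
print(reordered_block(q, scale, scale))
```

In the reordered block the residual stream only ever receives normalized sub-layer outputs, which bounds how much each block can add. In training, the z-loss value would be added to the usual cross-entropy loss.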

Images are turned into tokens by an image tokenizer based on the approach of Gafni et al. (2022), the Make-A-Scene work. It encodes a 512 by 512 image into 1,024 discrete tokens from a learned codebook of 8,192 entries, and it was trained only on licensed images [1]. The paper flags a known limit of this scheme: "A core weakness of our tokenizer is in reconstructing images with a large amount of text, therefore upper bounding the capability of our models, when it comes to heavy OCR-related tasks." [1] Text and image tokens share one vocabulary built with the SentencePiece BPE library; the combined vocabulary contains 65,536 tokens, of which 8,192 are the image codebook entries [1].
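A few quantities follow directly from these figures. The short calculation below is a sketch that assumes the 1,024 tokens form a square grid over the image and ignores any delimiter tokens around an image; the variable names are illustrative.

```python
import math

IMAGE_SIDE = 512          # pixels per side
TOKENS_PER_IMAGE = 1_024
CODEBOOK_SIZE = 8_192
VOCAB_SIZE = 65_536
CONTEXT_LENGTH = 4_096

# Assuming a square token grid, each token summarizes one square pixel patch.
grid_side = math.isqrt(TOKENS_PER_IMAGE)          # 32
pixels_per_token_side = IMAGE_SIDE // grid_side   # 16

# Each image token selects one of 8,192 codes, i.e. 13 bits of information.
bits_per_token = int(math.log2(CODEBOOK_SIZE))    # 13
bits_per_image = TOKENS_PER_IMAGE * bits_per_token  # 13312 bits = 1664 bytes

# Vocabulary entries left for text and any special tokens.
non_image_vocab = VOCAB_SIZE - CODEBOOK_SIZE      # 57344

# Upper bound on whole images per context window, with no text at all.
max_images_in_context = CONTEXT_LENGTH // TOKENS_PER_IMAGE  # 4

print(grid_side, pixels_per_token_side, bits_per_image, non_image_vocab, max_images_in_context)
```

The small per-image bit budget is consistent with the paper's remark that images containing a lot of text are hard to reconstruct.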

| Property | Chameleon-7B | Chameleon-34B |
|---|---|---|
| Parameters | 7 billion | 34 billion |
| Context length | 4,096 tokens | 4,096 tokens |
| Grouped-query attention | No | Yes |
| Dropout | 0.1 | 0.0 |
| QK-Norm | Yes | Yes |
| Training tokens | 4.4 trillion | 4.4 trillion |
| Epochs over data | 2.1 | 2.1 |
| Peak learning rate | 1.0 x 10^-4 | 1.0 x 10^-4 |

Sources: paper Table 1 and the training section [1].

## What data was Chameleon trained on?

Pretraining used a large corpus of mixed-modal data. The paper describes roughly 2.9 trillion text-only tokens, about 1.4 billion text-image pairs (tokenized into roughly 1.5 trillion tokens), and around 400 billion tokens of naturally interleaved text and image data [1]. Table 1 of the paper lists 4.4 trillion tokens for both sizes; training made about 2.1 passes over that data, so each model saw roughly 9.2 trillion tokens in total [1]. Training proceeded in two stages, the second of which upweighted higher-quality and instruction-style data. After pretraining, the models went through a supervised fine-tuning stage on instruction data, and the released checkpoints were additionally safety tuned [1][2].
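Two back-of-the-envelope checks follow from these rounded figures. The sketch assumes that each text-image pair contains exactly one image of 1,024 tokens, and it reads the 4.4 trillion figure as the size of the data that the 2.1 passes run over; both readings are assumptions, and the inputs are approximate.

```python
PAIR_COUNT = 1.4e9           # text-image pairs
PAIR_TOKENS = 1.5e12         # tokens produced by those pairs
TOKENS_PER_IMAGE = 1_024     # assumed: one 512x512 image per pair
LISTED_TOKENS = 4.4e12       # tokens listed for both model sizes
EPOCHS = 2.1

# Average tokens per pair, and what is left over once the image is accounted for.
tokens_per_pair = PAIR_TOKENS / PAIR_COUNT                  # about 1071
text_tokens_per_pair = tokens_per_pair - TOKENS_PER_IMAGE   # about 47

# Total tokens processed if the listed tokens are traversed 2.1 times.
tokens_seen = LISTED_TOKENS * EPOCHS                        # about 9.24e12

print(round(tokens_per_pair), round(text_tokens_per_pair), f"{tokens_seen:.2e}")
```

Under these assumptions the average pair leaves only a few dozen tokens for text beyond the image itself, which is what one would expect from short captions.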

## What can Chameleon do, and how does it score on benchmarks?

In its full research form Chameleon can take any mix of text and images as input and produce any mix of text and images as output, including long-form documents that alternate between the two. On text-only tasks it was reported to outperform LLaMA-2 and to be competitive with Mixtral 8x7B and Gemini Pro; on image understanding it reached strong results in image captioning and visual question answering [1].

| Benchmark | Chameleon-34B | Comparison point |
|---|---|---|
| MMLU (5-shot) | 65.8 | Mixtral 8x7B: 70.6 |
| GSM8k (maj@32) | 77.0 | Mixtral 8x7B: 75.1 |
| COCO captioning (CIDEr, fine-tuned) | 140.8 | Flamingo-80B (fine-tuned): 138.1 |
| Flickr30k captioning (CIDEr, fine-tuned) | 82.3 | Flamingo-80B (fine-tuned): ~78.4 |
| VQAv2 (multitask) | 69.6 | Gemini Pro: 71.2 |

Numbers from the paper's evaluation tables [1].

The most novel evaluation concerned long-form mixed-modal generation, where a single prompt expects an interleaved answer of text and pictures. The team ran human preference studies comparing Chameleon-34B against Gemini Pro and against GPT-4V, the predecessor of [GPT-4o](/wiki/gpt_4o). Because those systems did not generate images directly at the time, their outputs were augmented with separately generated images for the comparison. In pairwise judgments the paper reports a win rate of 60.4 percent for Chameleon-34B over the augmented Gemini Pro (435 wins, 362 ties, 251 losses) and 51.6 percent over the augmented GPT-4V (375 wins, 331 ties, 342 losses) [1]. Outright wins made up only 41.5 and 35.8 percent of the 1,048 comparisons in each study, so these headline rates evidently give partial credit to ties. In an absolute task-fulfillment rating, Chameleon completely fulfilled 55.2 percent of tasks, against 37.6 percent for the Gemini setup and 44.7 percent for the GPT-4V setup [1]. These results are subject to caveats about how the prompts were selected and about the artificial image augmentation applied to the baselines.
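The relationship between the raw counts and the headline percentages can be checked with simple arithmetic. In the sketch below, counting a tie as half a win is an assumption made for illustration; the function and key names are hypothetical.

```python
def summarize(wins, ties, losses):
    """Convert pairwise judgment counts into percentages of all comparisons."""
    total = wins + ties + losses
    return {
        "total": total,
        "win_pct": 100 * wins / total,
        "tie_pct": 100 * ties / total,
        "loss_pct": 100 * losses / total,
        # Assumed convention: a tie is credited as half a win.
        "win_plus_half_ties_pct": 100 * (wins + 0.5 * ties) / total,
    }


gemini = summarize(wins=435, ties=362, losses=251)
gpt4v = summarize(wins=375, ties=331, losses=342)

for name, stats in (("Gemini Pro (augmented)", gemini), ("GPT-4V (augmented)", gpt4v)):
    print(
        f"{name}: n={stats['total']}, "
        f"wins {stats['win_pct']:.1f}%, ties {stats['tie_pct']:.1f}%, "
        f"losses {stats['loss_pct']:.1f}%, "
        f"wins + half ties {stats['win_plus_half_ties_pct']:.1f}%"
    )
```

Under the half-credit assumption the GPT-4V comparison comes out at 51.6 percent, matching the reported figure. The Gemini Pro counts give about 58.8 percent rather than 60.4, so the reported number may rest on a slightly different convention or on rounding in the source.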

## What was released, and what did Meta withhold?

Meta announced the public release of Chameleon on 18 June 2024 as part of a batch of FAIR research artifacts [2]. The released items were the 7B and 34B model weights, fast GPU inference code, the tokenizers, and evaluation prompts, distributed under the Chameleon Research License after an access request [3].

The key restriction was on image output. Although the research model could generate images, Meta did not ship that capability. The FAIR blog stated plainly: "At this time, we are not releasing the Chameleon image generation model," and described the released models as supporting "mixed-modal inputs and text-only output to be used for research purposes" [2]. In other words, the public checkpoints accept interleaved text and images as input but emit only text. Meta framed this as a safety-motivated decision, noting that the released models were safety tuned and that risks remained despite mitigation. On Hugging Face the checkpoints are published as `facebook/chameleon-7b` and `facebook/chameleon-30b`; the larger one is the 34B-class model from the paper, listed under the rounded "30b" name and tagged for image-text-to-text use [4][5]. The GitHub repository was later archived and made read-only on 1 September 2025 [3].
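For readers who want to try the text-only output path, the following is a minimal sketch of querying the 7B checkpoint through the Hugging Face `transformers` library. It assumes a recent `transformers` release that includes the Chameleon classes, an installed `torch`, approved access to the gated checkpoint, and the `<image>` placeholder convention for marking where an image goes in the prompt. The helper names `build_prompt` and `describe_image` are hypothetical.

```python
from typing import Any

MODEL_ID = "facebook/chameleon-7b"  # gated checkpoint; access must be requested first


def build_prompt(question: str, num_images: int = 1) -> str:
    """Append one `<image>` placeholder per input image (assumed prompt convention)."""
    return question + "<image>" * num_images


def describe_image(image: Any, question: str, max_new_tokens: int = 50) -> str:
    """Ask a question about one PIL image and return the decoded text answer."""
    # Heavy imports are kept local so that build_prompt works without them.
    import torch
    from transformers import ChameleonForConditionalGeneration, ChameleonProcessor

    device = "cuda" if torch.cuda.is_available() else "cpu"
    processor = ChameleonProcessor.from_pretrained(MODEL_ID)
    model = ChameleonForConditionalGeneration.from_pretrained(
        MODEL_ID, torch_dtype=torch.bfloat16
    ).to(device)

    inputs = processor(images=image, text=build_prompt(question), return_tensors="pt")
    inputs = inputs.to(device, dtype=torch.bfloat16)

    # The public checkpoints emit text only, so the output is decoded as a string.
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return processor.decode(output_ids[0], skip_special_tokens=True)


print(build_prompt("What do you see in this image?"))
```

A call such as `describe_image(Image.open("photo.jpg"), "What do you see in this image?")` would download several gigabytes of weights on first use, so it is left out of the snippet.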

## How does Chameleon compare to other multimodal models, and what came after it?

Chameleon drew attention as one of the more committed attempts to unify modalities inside a single token-based transformer rather than stitching a vision module onto a language model. Commentators noted both the elegance of the early-fusion formulation and the practical limits of the public release, since the withheld image generation was the part that distinguished it most from existing open vision-language models.

Within Meta's own research, Chameleon sits in a clear line. Its discrete-token approach descended from CM3 and CM3Leon, and it was soon followed by Transfusion (Zhou et al., arXiv, August 2024), which kept a single transformer backbone but generated images with a diffusion objective on continuous image patches instead of predicting discrete image tokens. The Transfusion paper used Chameleon as its main point of comparison and reported better text-to-image results at a fraction of the compute, suggesting that discrete image tokens were a limiting factor for generation quality [6]. The early-fusion idea, jointly embedding text and vision tokens into one backbone from the start of pretraining, later reappeared in Meta's [Llama](/wiki/llama) 4 models (April 2025), which Meta described as natively multimodal with text-vision early fusion, though Llama 4 used a continuous vision encoder based on MetaCLIP rather than Chameleon's discrete tokenizer [7].

## References

1. Chameleon Team (FAIR at Meta). "Chameleon: Mixed-Modal Early-Fusion Foundation Models." arXiv:2405.09818, 16 May 2024. https://arxiv.org/abs/2405.09818 (HTML: https://arxiv.org/html/2405.09818v1)
2. Meta AI. "Sharing new research, models, and datasets from Meta FAIR." Meta AI Blog, 18 June 2024. https://ai.meta.com/blog/meta-fair-research-new-releases/
3. facebookresearch/chameleon. "Repository for Meta Chameleon, a mixed-modal early-fusion foundation model from FAIR." GitHub. https://github.com/facebookresearch/chameleon
4. Hugging Face. "facebook/chameleon-7b" model card. https://huggingface.co/facebook/chameleon-7b
5. Hugging Face. "facebook/chameleon-30b" model card. https://huggingface.co/facebook/chameleon-30b
6. Zhou, C., et al. "Transfusion: Predict the Next Token and Diffuse Images with One Multi-Modal Model." arXiv:2408.11039, August 2024. https://arxiv.org/abs/2408.11039
7. Meta AI. "The Llama 4 herd: The beginning of a new era of natively multimodal AI innovation." Meta AI Blog, 5 April 2025. https://ai.meta.com/blog/llama-4-multimodal-intelligence/

