Llama 3.2 Vision
Last reviewed
May 21, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v1 ยท 3,357 words
Improve this article
Add missing citations, update stale details, or suggest a clearer explanation.
Last reviewed
May 21, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v1 ยท 3,357 words
Add missing citations, update stale details, or suggest a clearer explanation.
Llama 3.2 Vision is the multimodal branch of Meta's Llama 3.2 model family, released on September 25, 2024 at the Meta Connect 2024 developer conference. The release introduced two vision-capable models, Llama 3.2 11B Vision and Llama 3.2 90B Vision, alongside lightweight text-only 1B and 3B variants.[^1][^2] Llama 3.2 Vision is Meta's first openly distributed multimodal Llama series: the architecture extends a frozen Llama 3.1 text decoder with a separately trained vision encoder and image adapter consisting of cross-attention layers, rather than the simpler projection used by models such as LLaVA.[^3][^4] The models accept text and a single image as input and emit text only, and they support a 128,000-token context window inherited from the underlying Llama 3.1 backbones.[^4][^5] At launch the models were distributed under the Llama 3.2 Community License, which initially excluded individuals domiciled in the European Union and EU-based companies from the multimodal weights while keeping the text-only 1B and 3B variants available worldwide.[^6][^7]
| Item | Value |
|---|---|
| Developer | Meta AI |
| Initial release | September 25, 2024[^1] |
| Sizes | 11B (10.6B actual) and 90B (88.8B actual) parameters[^4][^5] |
| Variants | Base (pre-trained) and Instruct (instruction-tuned) for each size[^4][^5] |
| Vision encoder | ViT-H/14 with 8 added gated self-attention layers, approximately 850M parameters[^3][^8] |
| Adapter | Cross-attention layers inserted after every fourth self-attention block of the text decoder[^3][^4] |
| Text backbone | Llama 3.1 8B (for 11B Vision) and Llama 3.1 70B (for 90B Vision), frozen during vision training[^3][^9] |
| Pre-training data | 6 billion image-text pairs[^4][^5] |
| Knowledge cutoff | December 2023[^4][^5] |
| Context length | 128,000 tokens[^1][^4] |
| License | Llama 3.2 Community License, with EU restriction on multimodal weights[^6][^7] |
| Release venue | Meta Connect 2024 keynote, Menlo Park, California[^1][^10] |
Meta released the Llama 3.1 series, including a 405B-parameter dense model, on July 23, 2024. The accompanying technical report, "The Llama 3 Herd of Models," disclosed that Meta had been experimenting with multimodal extensions, integrating image, video, and speech via a compositional approach in which a frozen language model is paired with separately trained encoders and adapters. The paper noted that these multimodal models were "still under active development and not yet ready for release."[^8][^9] Llama 3.2 Vision is the productized form of the image-text branch of that effort.[^3][^4]
Meta unveiled Llama 3.2 on September 25, 2024 during the Meta Connect 2024 keynote in Menlo Park, California, where CEO Mark Zuckerberg framed the release as "the first open-source multimodal" Llama and emphasized that the same release also covered lightweight 1B and 3B text-only models tuned for on-device deployment.[^1][^10][^11] The blog post accompanying the keynote bundled the four models, the Llama Stack reference distribution, and a vision-capable safety classifier called Llama Guard 3 11B Vision into a single launch.[^1]
The 11B and 90B vision weights were posted to llama.com and to Hugging Face on launch day under repositories such as meta-llama/Llama-3.2-11B-Vision and meta-llama/Llama-3.2-11B-Vision-Instruct.[^4][^5] On the same day, Amazon Web Services made both vision models available on Amazon Bedrock in the US West (Oregon) region with cross-region inference into US East (Ohio) and US East (N. Virginia).[^12] Microsoft listed Llama-3.2-11B-Vision-Instruct and Llama-3.2-90B-Vision-Instruct in the Azure AI model catalog, deployable as managed compute or via a serverless API, and Google added all four Llama 3.2 models to Vertex AI Model Garden for self-service deployment.[^13][^14] IBM watsonx.ai added both vision models the following month.[^15]
Llama 3.2 Vision was succeeded on the text side by Llama 3.3 70B in December 2024, which did not include a multimodal variant, and then by Llama 4 Scout and Maverick in April 2025, which adopted a natively multimodal mixture-of-experts architecture rather than the cross-attention adapter design used in Llama 3.2 Vision.[^16]
The Llama 3.2 Vision image encoder follows the design described in the Llama 3 herd paper. It begins from a Vision Transformer of size ViT-H/14 with roughly 630M parameters, pre-trained contrastively on 2.5 billion image-text pairs at 224 by 224 resolution for five epochs.[^8][^17] The 14 by 14 patch projection uses a Conv2d layer with kernel size 14 and stride 14, mapping each non-overlapping image patch to a 1,280-dimensional embedding before the ViT transformer blocks.[^17] To recover spatial detail that contrastive pre-training does not preserve well, features from the 4th, 8th, 16th, 24th, and 31st transformer blocks are exposed in addition to the final layer.[^8][^17] Meta then prepends 8 additional gated self-attention layers, bringing the encoder to 40 transformer blocks and approximately 850M parameters; the attention and feed-forward sub-layers in these added blocks use tanh gating so the encoder can warm-start from the contrastive checkpoint without destabilizing the pretrained representations.[^8][^17]
Rather than projecting image features into the text token stream as a prefix (the LLaVA-style strategy used by many open vision-language models), Llama 3.2 Vision injects image information into the text decoder through dedicated cross-attention layers. New cross-attention blocks are inserted after every fourth self-attention block of the text decoder; the keys and values for these layers come from the vision encoder output, while the queries come from the text token stream.[^3][^8] The cross-attention layers also use grouped-query attention (GQA) for memory efficiency, matching the GQA configuration of the text backbones.[^4][^8]
This design has two important consequences. First, the parameter overhead of the adapter is bounded: the 11B and 90B headline parameter counts (with actual values of 10.6B and 88.8B respectively) account for the original Llama 3.1 8B and 70B backbones plus the encoder, the cross-attention layers, and the projection that connects them.[^4][^5] Second, because the text decoder weights are frozen during vision training, a Llama 3.2 11B Vision model is reported by Meta to be a drop-in replacement for Llama 3.1 8B on text-only inputs, and 90B Vision serves the same role for Llama 3.1 70B.[^1][^3]
Llama 3.2 Vision-Instruct uses 560-pixel tiles, while the 11B Vision base checkpoint uses 448-pixel tiles; large images are partitioned into tiles that are individually encoded and then concatenated.[^18] Because the cross-attention layers are organized around the most recent image in a conversation, the model attends to the last image when multiple images are provided in a multi-turn exchange; using two images simultaneously requires user-side concatenation.[^18]
Adapter pre-training, the first heavy stage, uses approximately 6 billion image-text pairs.[^4][^5] During this phase, Meta updates the image encoder parameters and the new cross-attention layers but explicitly freezes the language-model parameters: the Llama 3 paper states "we intentionally do not update the language-model parameters" during adapter training, which is the design choice that preserves text-only performance.[^8][^9] Pre-training proceeds in two sub-stages, a large-scale noisy stage followed by an annealing stage on smaller, higher-quality in-domain data.[^1][^4]
After pre-training, Llama 3.2 Vision is post-trained with multiple rounds of supervised fine-tuning (SFT), rejection sampling, and direct preference optimization (DPO), the same pipeline used for the Llama 3.1 text models.[^1][^4] The instruction-tuning data includes publicly available vision instruction datasets and more than three million synthetically generated examples; the synthetic data is filtered and reranked by Llama 3.1 acting as a judge and by a reward model trained for the vision setting.[^4][^5][^19] The combination of supervised fine-tuning, rejection sampling, and Direct Preference Optimization is summarized in the launch blog as "several rounds of alignment" and is what distinguishes the Instruct checkpoints from the base weights.[^1][^19]
Meta's model card discloses training compute in H100-80GB GPU hours. The 11B Vision model used 147,000 hours for pre-training stage 1, 98,000 hours for stage 2 annealing, 896 hours for SFT, and 224 hours for RLHF and DPO, for a total of about 246,000 GPU hours.[^4] The 90B Vision model used 885,000 hours for each of the two pre-training stages, 3,072 hours for SFT, and 2,048 hours for RLHF and DPO, totaling approximately 1.78 million GPU hours.[^5] Across both vision models the aggregate training was 2.02 million H100 GPU hours; Meta reports 584 tons of CO2-equivalent emissions on a location-based basis and 0 tons on a market-based basis after applying its renewable energy purchases.[^4][^5]
The model cards published by Meta on Hugging Face provide benchmark numbers for both base and instruction-tuned checkpoints. The instruction-tuned 90B Vision model is positioned by Meta as competitive with Claude 3 Haiku and GPT-4o-mini, while the 11B Vision model is meant to fit in the same role as those models at a smaller footprint.[^1][^4][^5]
| Benchmark | 11B Base | 11B Instruct | 90B Base | 90B Instruct |
|---|---|---|---|---|
| MMMU (val, CoT) | 41.7 | 50.7 | 49.3 | 60.3 |
| ChartQA (test, CoT, relaxed) | n/a | 83.4 | 54.2 (test) | 85.5 |
| DocVQA (test, ANLS) | 62.3 (val) | 88.4 | 70.7 (val) | 90.1 |
| AI2 Diagram (test) | 62.4 | 91.1 | 75.3 | 92.3 |
| MathVista (testmini) | n/a | 51.5 | n/a | 57.3 |
| VQAv2 (test) | 66.8 (val) | 75.2 | 73.6 (val) | 78.1 |
The 90B Instruct configuration also reports 86.0 macro-average accuracy on MMLU with chain-of-thought, reflecting that the frozen Llama 3.1 70B text backbone is preserved.[^5] Meta describes the vision models as "drop-in replacements for their corresponding text model equivalents" with comparable text-only quality and superior performance on image understanding tasks compared with closed competitors of similar size.[^1]
The 11B Vision base model is a pre-trained checkpoint suitable for fine-tuning; the 11B Vision-Instruct model is the post-trained chat checkpoint that ships in the consumer Meta AI experience and on cloud APIs. The Hugging Face repository meta-llama/Llama-3.2-11B-Vision-Instruct is the canonical source.[^4] On a single H100 the 11B Vision-Instruct model fits in roughly 20 GB of GPU memory in bf16; 4-bit quantization brings it under 10 GB, which is the configuration most third-party demos use.[^18]
The 90B variants extend the same recipe to Llama 3.1 70B as the text backbone. Hugging Face hosts the base and Instruct checkpoints at meta-llama/Llama-3.2-90B-Vision and meta-llama/Llama-3.2-90B-Vision-Instruct.[^5] The 90B Instruct model is the highest-quality member of the Llama 3.2 family and is the default option on Bedrock, Vertex AI, and Azure for users who want maximum image-understanding accuracy from an open-weight Meta model.[^12][^13][^14]
Released in parallel with the vision models were Llama 3.2 1B and Llama 3.2 3B, dense text-only models distilled from Llama 3.1 8B for edge AI and mobile inference. These were enabled on Qualcomm and MediaTek hardware on launch day and optimized for Arm processors; both support the 128k context length, though quantized variants for mobile have a reduced 8k window.[^1][^11]
Llama Guard 3 11B Vision is a fine-tuned moderation model built on the same vision architecture and released the same day, intended to filter multimodal input and output for safety. It is governed by the same Llama 3.2 Community License as the underlying vision base.[^1][^20]
Llama 3.2 is governed by the Llama 3.2 Community License Agreement, a custom commercial license rather than a recognized open-source license. The headline restriction is the 700-million-monthly-active-user threshold inherited from Llama 2, above which a separate Meta license is required.[^21] The Llama 3.2 release also added a use-policy clause specific to multimodal weights.[^6][^21]
The license states that for multimodal models included in Llama 3.2, the rights granted are not granted to individuals domiciled in the European Union or to companies whose principal place of business is in the EU. The carve-out is limited to multimodal weights; the text-only Llama 3.2 1B and 3B remained licensable in the EU under the same agreement, and earlier text-only releases such as Llama 3 and Llama 3.1 are not affected.[^6][^7][^22] End users of products and services that incorporate the vision models are exempt: a European consumer using a downstream product built on Llama 3.2 Vision is not restricted, only the development entity must be outside the EU.[^7][^22] Industry coverage attributed the carve-out to Meta's broader dispute with European data-protection regulators over training Meta AI on EU users' Facebook and Instagram content, which had been paused in June 2024 at the request of the Irish Data Protection Commission.[^22][^23]
Weights for all four Llama 3.2 sizes were posted to llama.com and to Hugging Face on launch day. Through the Meta AI consumer app and meta.ai web experience, Meta integrated Llama 3.2 Vision into image-based conversations in supported regions; in Europe the vision functionality is delivered through Meta-operated products rather than via developer access to the weights.[^1][^7]
The Hugging Face Hub repositories meta-llama/Llama-3.2-11B-Vision, meta-llama/Llama-3.2-11B-Vision-Instruct, meta-llama/Llama-3.2-90B-Vision, and meta-llama/Llama-3.2-90B-Vision-Instruct are gated under the Community License. Inference requires the MllamaForConditionalGeneration class added in transformers 4.45.0; the modeling code uses a dedicated mllama architecture name to capture the cross-attention adapter design.[^4][^5][^18]
Amazon Bedrock hosts both vision models with cross-region inference into US East. Microsoft Azure AI Foundry (formerly Azure AI Studio) lists both Instruct models in its catalog with managed compute deployment. Google Vertex AI Model Garden includes all four Llama 3.2 sizes; IBM watsonx.ai added both vision models in October 2024. Oracle Cloud Infrastructure Generative AI offers both vision models across multiple regions.[^12][^13][^14][^15]
The Llama Stack reference distribution and on-device runtimes such as PyTorch ExecuTorch target the 1B and 3B text-only models for mobile, while the 11B and 90B Vision models run on standard GPU inference stacks including llama.cpp (with experimental multimodal support), Ollama, and inference-as-a-service providers such as Together AI, Fireworks, and Groq listed in Meta's launch blog.[^1][^24]
Llama 3.2 Vision was the first time Meta released openly distributed multimodal Llama weights. Within the open-weight ecosystem the launch was significant for three reasons. First, the cross-attention adapter design demonstrated a path to multimodality that does not disturb the underlying text model, allowing Meta to ship vision capabilities without retraining the Llama 3.1 text backbones from scratch.[^3][^8] Second, the 90B Instruct model put open weights into the same benchmark range as closed mid-tier vision models like Claude 3 Haiku and GPT-4o-mini on MMMU, DocVQA, AI2D, and ChartQA, which had previously been a closed-model regime.[^1][^4][^5] Third, the release coincided with the public availability of competing open multimodal models including Qwen2-VL and Pixtral 12B from Mistral, marking the late-2024 window in which open-weight vision-language modeling closed much of the gap to proprietary systems.[^25]
Several limitations were documented at launch or surfaced in early reporting. Image inputs are restricted to English even though the text-only Llama 3.2 family officially supports eight languages, which constrains multilingual document understanding.[^4][^5] The cross-attention layers attend to the most recent image, so prompts containing multiple images in different turns lose access to earlier visuals unless concatenated on the client side.[^18] Outputs are text only: Llama 3.2 Vision cannot generate images or speech and cannot accept video or audio inputs, capabilities Meta described in the Llama 3 herd paper but did not productize in this release.[^1][^8] The EU restriction created a fragmented developer landscape in which European startups cannot directly distribute products based on Llama 3.2 Vision weights, a fact criticized by both European and US commentators at the time.[^22][^23] Finally, the 90B Instruct model still trails proprietary frontier systems such as GPT-4o on MMMU and complex multi-step visual reasoning, as noted by independent secondary coverage.[^25]
| Model | Release | Parameters | Strategy | Weights | Notes |
|---|---|---|---|---|---|
| Llama 3.2 11B Vision-Instruct | Sep 25, 2024[^1] | 10.6B[^4] | Frozen Llama 3.1 8B plus ViT-H/14 plus cross-attention adapter[^3][^8] | Open under Community License (no EU)[^6][^7] | First open Meta multimodal[^1] |
| Llama 3.2 90B Vision-Instruct | Sep 25, 2024[^1] | 88.8B[^5] | Frozen Llama 3.1 70B plus ViT-H/14 plus cross-attention adapter[^3][^8] | Open under Community License (no EU)[^6][^7] | Highest-quality member of family[^5] |
| LLaVA 1.5 | Oct 2023[^26] | 7B and 13B[^26] | MLP projection of CLIP features into Vicuna decoder[^26] | Apache 2.0[^26] | Influential prefix-projection baseline[^26] |
| Pixtral 12B | Sep 2024[^27] | 12B[^27] | Native multimodal Mistral architecture with image tokens[^27] | Apache 2.0[^27] | Released same month as Llama 3.2 Vision[^27] |
| Qwen2.5-VL | Jan 2025[^28] | 3B/7B/72B[^28] | Naive dynamic resolution ViT plus decoder-only LLM[^28] | Apache 2.0 (most sizes)[^28] | Successor to Qwen2-VL (Aug 2024) that competed with Llama 3.2 Vision[^28] |