Llama 3.2 Vision

Large Language Models Meta AI Multimodal AI

21 min read

Updated Jun 28, 2026

Suggest edit History Talk

RawGraph

Last edited

Jun 28, 2026

Fact-checked

In review queue

Sources

28 citations

Revision

v3 · 4,167 words

Fact-checks are independent of edits: a reviewer re-verifies the article against its sources and stamps the date. How we verify

Llama 3.2 Vision is the set of multimodal (image-plus-text) models in Meta's Llama 3.2 family, released on September 25, 2024 at the Meta Connect 2024 developer conference. It comprises two vision-capable vision-language models, Llama 3.2 11B Vision and Llama 3.2 90B Vision, that accept text and a single image as input and return text output, and it shipped alongside lightweight text-only 1B and 3B variants.^[1]^[2] The vision models add a separately trained image encoder and a cross-attention adapter on top of the existing Llama 3.1 text backbones (8B for the 11B Vision model and 70B for the 90B Vision model), enabling image reasoning, document, chart, and diagram understanding, image captioning, and visual grounding. Crucially, the language-model weights are frozen during vision training, so the resulting models behave as drop-in replacements for the corresponding text-only models.^[1]^[3]^[4] Llama 3.2 Vision is Meta's first openly distributed multimodal Llama series, distributed under the Llama 3.2 Community License, which initially excluded individuals domiciled in the European Union and EU-based companies from the multimodal weights while keeping the text-only 1B and 3B variants available worldwide.^[4]^[6]^[7]

Infobox

Item	Value
Developer	Meta AI
Initial release	September 25, 2024^[1]
Type	Multimodal vision-language model (text + image in, text out)^[4]
Sizes	11B (10.6B actual) and 90B (88.8B actual) parameters^[4]^[5]
Variants	Base (pre-trained) and Instruct (instruction-tuned) for each size^[4]^[5]
Vision encoder	ViT-H/14 with 8 added gated self-attention layers, approximately 850M parameters^[3]^[8]
Adapter	Cross-attention layers inserted after every fourth self-attention block of the text decoder^[3]^[4]
Text backbone	Llama 3.1 8B (for 11B Vision) and Llama 3.1 70B (for 90B Vision), frozen during vision training^[3]^[9]
Pre-training data	6 billion image-text pairs^[4]^[5]
Knowledge cutoff	December 2023^[4]^[5]
Context length	128,000 tokens^[1]^[4]
License	Llama 3.2 Community License, with EU restriction on multimodal weights^[6]^[7]
Release venue	Meta Connect 2024 keynote, Menlo Park, California^[1]^[10]

What is Llama 3.2 Vision?

Llama 3.2 Vision is the multimodal branch of Meta's Llama 3.2 model family: a pair of open-weight vision-language models, at 11B and 90B parameters, that read an image together with a text prompt and respond in text.^[1]^[4] Meta released it on September 25, 2024, describing the launch in its announcement as bringing "small and medium-sized vision LLMs (11B and 90B), and lightweight, text-only models (1B and 3B) that fit onto edge and mobile devices."^[1] Each vision size ships in two checkpoints: a pre-trained Base model intended for fine-tuning and an instruction-tuned Instruct model used for chat and image question answering.^[4]^[5] The official model card defines the collection as "a collection of pretrained and instruction-tuned image reasoning generative models in 11B and 90B sizes (text + images in / text out)," and the headline 11B and 90B figures correspond to roughly 10.6B and 88.8B actual parameters once the vision encoder and adapter are counted alongside the Llama 3.1 text backbone.^[4]^[5] The models inherit a 128,000-token context window from the underlying Llama 3.1 backbones and were trained with a December 2023 knowledge cutoff.^[4]^[5] Llama 3.2 Vision is governed by the Llama 3.2 Community License, which the model card calls "a custom, commercial license agreement" rather than a recognized open-source license.^[4]^[6]

How does Llama 3.2 Vision work?

Llama 3.2 Vision is built on top of the existing Llama 3.1 text-only models: the 11B Vision model uses the Llama 3.1 8B backbone and the 90B Vision model uses the Llama 3.1 70B backbone.^[4]^[9] Rather than feeding image features into the text stream as a prefix (the LLaVA-style strategy used by many open vision-language models), Meta integrates vision through a separately trained adapter. The model card states that "Llama 3.2-Vision is built on top of Llama 3.1 text-only model" using "a separately trained vision adapter that integrates with the pre-trained Llama 3.1 language model. The adapter consists of a series of cross-attention layers that feed image encoder representations into the core LLM."^[4] These cross-attention blocks are inserted after every fourth self-attention block of the text decoder, with keys and values coming from the vision encoder and queries from the text token stream; they also use grouped-query attention to match the text backbone.^[3]^[8] The image encoder itself is a Vision Transformer of size ViT-H/14 with 8 added gated self-attention layers, totaling roughly 850M parameters.^[3]^[8]

The defining design choice is that the text weights are frozen while the vision components are trained. The adapter was pre-trained on approximately 6 billion image-text pairs, during which Meta updated the image encoder and the new cross-attention layers but deliberately left the language model untouched. As the Llama 3 herd paper puts it, "we intentionally do not update the language-model parameters" during adapter training, a decision that preserves text-only quality and prevents catastrophic forgetting.^[4]^[8] Because of this, Meta reports in the launch blog that the vision models "are drop-in replacements for their corresponding text model equivalents, while exceeding on image understanding tasks compared to closed models, such as Claude 3 Haiku."^[1] After pre-training, the Instruct checkpoints undergo several rounds of supervised fine-tuning, rejection sampling, and Direct Preference Optimization, the same alignment pipeline used for the Llama 3.1 text models.^[1]^[4]

What can Llama 3.2 Vision do?

The Llama 3.2 Vision Instruct models are, in Meta's words, "optimized for visual recognition, image reasoning, captioning, and answering general questions about an image."^[4] The model card lists supported tasks including Visual Question Answering (VQA) and visual reasoning, Document Visual Question Answering (DocVQA), image captioning, image-text retrieval, and visual grounding, where the model can pinpoint objects in an image based on a natural-language description.^[4] Meta highlights "document-level understanding including charts and graphs," so the models can, for example, answer questions about sales figures plotted in a graph or extract information from maps and diagrams.^[1] Inputs are limited to text plus a single image; the models do not generate images, accept video or audio, and image inputs are officially supported only in English even though the text-only 1B and 3B models cover eight languages.^[4]^[5]

On standard multimodal benchmarks the 90B Instruct model reaches 60.3 on MMMU (validation, chain-of-thought), 90.1 on DocVQA, 85.5 on ChartQA, and 92.3 on AI2 Diagram, while preserving 86.0 macro-average accuracy on MMLU from the frozen Llama 3.1 70B text backbone; the 11B Instruct model scores 50.7 on MMMU, 88.4 on DocVQA, and 83.4 on ChartQA.^[4]^[5] Meta positions the 90B Vision model as competitive with closed mid-tier vision models such as Claude 3 Haiku and GPT-4o-mini.^[1]^[5] A fuller benchmark table appears in the Benchmarks section below.

History

Path from Llama 3.1 to a vision family

Meta released the Llama 3.1 series, including a 405B-parameter dense model, on July 23, 2024. The accompanying technical report, "The Llama 3 Herd of Models," disclosed that Meta had been experimenting with multimodal extensions, integrating image, video, and speech via a compositional approach in which a frozen language model is paired with separately trained encoders and adapters. The paper noted that these multimodal models were "still under active development and not yet ready for release."^[8]^[9] Llama 3.2 Vision is the productized form of the image-text branch of that effort.^[3]^[4]

Meta Connect 2024 launch

Meta unveiled Llama 3.2 on September 25, 2024 during the Meta Connect 2024 keynote in Menlo Park, California, where CEO Mark Zuckerberg framed the release as "the first open-source multimodal" Llama and emphasized that the same release also covered lightweight 1B and 3B text-only models tuned for on-device deployment.^[1]^[10]^[11] The blog post accompanying the keynote bundled the four models, the Llama Stack reference distribution, and a vision-capable safety classifier called Llama Guard 3 11B Vision into a single launch.^[1]

Distribution timeline

The 11B and 90B vision weights were posted to llama.com and to Hugging Face on launch day under repositories such as meta-llama/Llama-3.2-11B-Vision and meta-llama/Llama-3.2-11B-Vision-Instruct.^[4]^[5] On the same day, Amazon Web Services made both vision models available on Amazon Bedrock in the US West (Oregon) region with cross-region inference into US East (Ohio) and US East (N. Virginia).^[12] Microsoft listed Llama-3.2-11B-Vision-Instruct and Llama-3.2-90B-Vision-Instruct in the Azure AI model catalog, deployable as managed compute or via a serverless API, and Google added all four Llama 3.2 models to Vertex AI Model Garden for self-service deployment.^[13]^[14] IBM watsonx.ai added both vision models the following month.^[15]

Successors

Llama 3.2 Vision was succeeded on the text side by Llama 3.3 70B in December 2024, which did not include a multimodal variant, and then by Llama 4 Scout and Maverick in April 2025, which adopted a natively multimodal mixture-of-experts architecture rather than the cross-attention adapter design used in Llama 3.2 Vision.^[16]

Architecture

Vision encoder

The Llama 3.2 Vision image encoder follows the design described in the Llama 3 herd paper. It begins from a Vision Transformer of size ViT-H/14 with roughly 630M parameters, pre-trained contrastively on 2.5 billion image-text pairs at 224 by 224 resolution for five epochs.^[8]^[17] The 14 by 14 patch projection uses a Conv2d layer with kernel size 14 and stride 14, mapping each non-overlapping image patch to a 1,280-dimensional embedding before the ViT transformer blocks.^[17] To recover spatial detail that contrastive pre-training does not preserve well, features from the 4th, 8th, 16th, 24th, and 31st transformer blocks are exposed in addition to the final layer.^[8]^[17] Meta then prepends 8 additional gated self-attention layers, bringing the encoder to 40 transformer blocks and approximately 850M parameters; the attention and feed-forward sub-layers in these added blocks use tanh gating so the encoder can warm-start from the contrastive checkpoint without destabilizing the pretrained representations.^[8]^[17]

Image adapter and cross-attention insertion

Rather than projecting image features into the text token stream as a prefix (the LLaVA-style strategy used by many open vision-language models), Llama 3.2 Vision injects image information into the text decoder through dedicated cross-attention layers. New cross-attention blocks are inserted after every fourth self-attention block of the text decoder; the keys and values for these layers come from the vision encoder output, while the queries come from the text token stream.^[3]^[8] The cross-attention layers also use grouped-query attention (GQA) for memory efficiency, matching the GQA configuration of the text backbones.^[4]^[8]

This design has two important consequences. First, the parameter overhead of the adapter is bounded: the 11B and 90B headline parameter counts (with actual values of 10.6B and 88.8B respectively) account for the original Llama 3.1 8B and 70B backbones plus the encoder, the cross-attention layers, and the projection that connects them.^[4]^[5] Second, because the text decoder weights are frozen during vision training, a Llama 3.2 11B Vision model is reported by Meta to be a drop-in replacement for Llama 3.1 8B on text-only inputs, and 90B Vision serves the same role for Llama 3.1 70B.^[1]^[3]

Image processing and tile sizes

Llama 3.2 Vision-Instruct uses 560-pixel tiles, while the 11B Vision base checkpoint uses 448-pixel tiles; large images are partitioned into tiles that are individually encoded and then concatenated.^[18] Because the cross-attention layers are organized around the most recent image in a conversation, the model attends to the last image when multiple images are provided in a multi-turn exchange; using two images simultaneously requires user-side concatenation.^[18]

Training

Pre-training of the adapter

Adapter pre-training, the first heavy stage, uses approximately 6 billion image-text pairs.^[4]^[5] During this phase, Meta updates the image encoder parameters and the new cross-attention layers but explicitly freezes the language-model parameters: the Llama 3 paper states "we intentionally do not update the language-model parameters" during adapter training, which is the design choice that preserves text-only performance.^[8]^[9] Pre-training proceeds in two sub-stages, a large-scale noisy stage followed by an annealing stage on smaller, higher-quality in-domain data.^[1]^[4]

Supervised fine-tuning and preference optimization

After pre-training, Llama 3.2 Vision is post-trained with multiple rounds of supervised fine-tuning (SFT), rejection sampling, and direct preference optimization (DPO), the same pipeline used for the Llama 3.1 text models.^[1]^[4] The instruction-tuning data includes publicly available vision instruction datasets and more than three million synthetically generated examples; the synthetic data is filtered and reranked by Llama 3.1 acting as a judge and by a reward model trained for the vision setting.^[4]^[5]^[19] The combination of supervised fine-tuning, rejection sampling, and Direct Preference Optimization is summarized in the launch blog as "several rounds of alignment" and is what distinguishes the Instruct checkpoints from the base weights.^[1]^[19]

Compute and energy

Meta's model card discloses training compute in H100-80GB GPU hours. The 11B Vision model used 147,000 hours for pre-training stage 1, 98,000 hours for stage 2 annealing, 896 hours for SFT, and 224 hours for RLHF and DPO, for a total of about 246,000 GPU hours.^[4] The 90B Vision model used 885,000 hours for each of the two pre-training stages, 3,072 hours for SFT, and 2,048 hours for RLHF and DPO, totaling approximately 1.78 million GPU hours.^[5] Across both vision models the aggregate training was 2.02 million H100 GPU hours; Meta reports 584 tons of CO2-equivalent emissions on a location-based basis and 0 tons on a market-based basis after applying its renewable energy purchases.^[4]^[5]

Benchmarks

The model cards published by Meta on Hugging Face provide benchmark numbers for both base and instruction-tuned checkpoints. The instruction-tuned 90B Vision model is positioned by Meta as competitive with Claude 3 Haiku and GPT-4o-mini, while the 11B Vision model is meant to fit in the same role as those models at a smaller footprint.^[1]^[4]^[5]

Benchmark	11B Base	11B Instruct	90B Base	90B Instruct
MMMU (val, CoT)	41.7	50.7	49.3	60.3
ChartQA (test, CoT, relaxed)	n/a	83.4	54.2 (test)	85.5
DocVQA (test, ANLS)	62.3 (val)	88.4	70.7 (val)	90.1
AI2 Diagram (test)	62.4	91.1	75.3	92.3
MathVista (testmini)	n/a	51.5	n/a	57.3
VQAv2 (test)	66.8 (val)	75.2	73.6 (val)	78.1

The 90B Instruct configuration also reports 86.0 macro-average accuracy on MMLU with chain-of-thought, reflecting that the frozen Llama 3.1 70B text backbone is preserved.^[5] Meta describes the vision models as "drop-in replacements for their corresponding text model equivalents" with comparable text-only quality and superior performance on image understanding tasks compared with closed competitors of similar size.^[1]

Variants

Llama 3.2 11B Vision and 11B Vision-Instruct

The 11B Vision base model is a pre-trained checkpoint suitable for fine-tuning; the 11B Vision-Instruct model is the post-trained chat checkpoint that ships in the consumer Meta AI experience and on cloud APIs. The Hugging Face repository meta-llama/Llama-3.2-11B-Vision-Instruct is the canonical source.^[4] On a single H100 the 11B Vision-Instruct model fits in roughly 20 GB of GPU memory in bf16; 4-bit quantization brings it under 10 GB, which is the configuration most third-party demos use.^[18]

Llama 3.2 90B Vision and 90B Vision-Instruct

The 90B variants extend the same recipe to Llama 3.1 70B as the text backbone. Hugging Face hosts the base and Instruct checkpoints at meta-llama/Llama-3.2-90B-Vision and meta-llama/Llama-3.2-90B-Vision-Instruct.^[5] The 90B Instruct model is the highest-quality member of the Llama 3.2 family and is the default option on Bedrock, Vertex AI, and Azure for users who want maximum image-understanding accuracy from an open-weight Meta model.^[12]^[13]^[14]

Companion text-only releases

Released in parallel with the vision models were Llama 3.2 1B and Llama 3.2 3B, dense text-only models distilled from Llama 3.1 8B for edge AI and mobile inference. These were enabled on Qualcomm and MediaTek hardware on launch day and optimized for Arm processors; both support the 128k context length, though quantized variants for mobile have a reduced 8k window.^[1]^[11]

Llama Guard 3 11B Vision

Llama Guard 3 11B Vision is a fine-tuned moderation model built on the same vision architecture and released the same day, intended to filter multimodal input and output for safety. It is governed by the same Llama 3.2 Community License as the underlying vision base.^[1]^[20]

Licensing and the EU restriction

Is Llama 3.2 Vision open source?

Llama 3.2 is governed by the Llama 3.2 Community License Agreement, which the model card describes as "a custom, commercial license agreement" rather than a recognized open-source license, so Llama 3.2 Vision is better described as open-weight than open-source.^[4]^[21] The headline restriction is the 700-million-monthly-active-user threshold inherited from Llama 2, above which a separate Meta license is required.^[21] The Llama 3.2 release also added a use-policy clause specific to multimodal weights.^[6]^[21]

EU exclusion

The license states that for multimodal models included in Llama 3.2, the rights granted are not granted to individuals domiciled in the European Union or to companies whose principal place of business is in the EU. The carve-out is limited to multimodal weights; the text-only Llama 3.2 1B and 3B remained licensable in the EU under the same agreement, and earlier text-only releases such as Llama 3 and Llama 3.1 are not affected.^[6]^[7]^[22] End users of products and services that incorporate the vision models are exempt: a European consumer using a downstream product built on Llama 3.2 Vision is not restricted, only the development entity must be outside the EU.^[7]^[22] Industry coverage attributed the carve-out to Meta's broader dispute with European data-protection regulators over training Meta AI on EU users' Facebook and Instagram content, which had been paused in June 2024 at the request of the Irish Data Protection Commission.^[22]^[23]

Availability

Direct downloads and Meta AI

Weights for all four Llama 3.2 sizes were posted to llama.com and to Hugging Face on launch day. Through the Meta AI consumer app and meta.ai web experience, Meta integrated Llama 3.2 Vision into image-based conversations in supported regions; in Europe the vision functionality is delivered through Meta-operated products rather than via developer access to the weights.^[1]^[7]

Hugging Face

The Hugging Face Hub repositories meta-llama/Llama-3.2-11B-Vision, meta-llama/Llama-3.2-11B-Vision-Instruct, meta-llama/Llama-3.2-90B-Vision, and meta-llama/Llama-3.2-90B-Vision-Instruct are gated under the Community License. Inference requires the MllamaForConditionalGeneration class added in transformers 4.45.0; the modeling code uses a dedicated mllama architecture name to capture the cross-attention adapter design.^[4]^[5]^[18]

Cloud providers

Amazon Bedrock hosts both vision models with cross-region inference into US East. Microsoft Azure AI Foundry (formerly Azure AI Studio) lists both Instruct models in its catalog with managed compute deployment. Google Vertex AI Model Garden includes all four Llama 3.2 sizes; IBM watsonx.ai added both vision models in October 2024. Oracle Cloud Infrastructure Generative AI offers both vision models across multiple regions.^[12]^[13]^[14]^[15]

On-device and partner inference

The Llama Stack reference distribution and on-device runtimes such as PyTorch ExecuTorch target the 1B and 3B text-only models for mobile, while the 11B and 90B Vision models run on standard GPU inference stacks including llama.cpp (with experimental multimodal support), Ollama, and inference-as-a-service providers such as Together AI, Fireworks, and Groq listed in Meta's launch blog.^[1]^[24]

Significance

Llama 3.2 Vision was the first time Meta released openly distributed multimodal Llama weights. Within the open-weight ecosystem the launch was significant for three reasons. First, the cross-attention adapter design demonstrated a path to multimodality that does not disturb the underlying text model, allowing Meta to ship vision capabilities without retraining the Llama 3.1 text backbones from scratch.^[3]^[8] Second, the 90B Instruct model put open weights into the same benchmark range as closed mid-tier vision models like Claude 3 Haiku and GPT-4o-mini on MMMU, DocVQA, AI2D, and ChartQA, which had previously been a closed-model regime.^[1]^[4]^[5] Third, the release coincided with the public availability of competing open multimodal models including Qwen2-VL and Pixtral 12B from Mistral, marking the late-2024 window in which open-weight vision-language modeling closed much of the gap to proprietary systems.^[25]

Limitations

Several limitations were documented at launch or surfaced in early reporting. Image inputs are restricted to English even though the text-only Llama 3.2 family officially supports eight languages, which constrains multilingual document understanding.^[4]^[5] The cross-attention layers attend to the most recent image, so prompts containing multiple images in different turns lose access to earlier visuals unless concatenated on the client side.^[18] Outputs are text only: Llama 3.2 Vision cannot generate images or speech and cannot accept video or audio inputs, capabilities Meta described in the Llama 3 herd paper but did not productize in this release.^[1]^[8] The EU restriction created a fragmented developer landscape in which European startups cannot directly distribute products based on Llama 3.2 Vision weights, a fact criticized by both European and US commentators at the time.^[22]^[23] Finally, the 90B Instruct model still trails proprietary frontier systems such as GPT-4o on MMMU and complex multi-step visual reasoning, as noted by independent secondary coverage.^[25]

Model	Release	Parameters	Strategy	Weights	Notes
Llama 3.2 11B Vision-Instruct	Sep 25, 2024^[1]	10.6B^[4]	Frozen Llama 3.1 8B plus ViT-H/14 plus cross-attention adapter^[3]^[8]	Open under Community License (no EU)^[6]^[7]	First open Meta multimodal^[1]
Llama 3.2 90B Vision-Instruct	Sep 25, 2024^[1]	88.8B^[5]	Frozen Llama 3.1 70B plus ViT-H/14 plus cross-attention adapter^[3]^[8]	Open under Community License (no EU)^[6]^[7]	Highest-quality member of family^[5]
LLaVA 1.5	Oct 2023^[26]	7B and 13B^[26]	MLP projection of CLIP features into Vicuna decoder^[26]	Apache 2.0^[26]	Influential prefix-projection baseline^[26]
Pixtral 12B	Sep 2024^[27]	12B^[27]	Native multimodal Mistral architecture with image tokens^[27]	Apache 2.0^[27]	Released same month as Llama 3.2 Vision^[27]
Qwen2.5-VL	Jan 2025^[28]	3B/7B/72B^[28]	Naive dynamic resolution ViT plus decoder-only LLM^[28]	Apache 2.0 (most sizes)^[28]	Successor to Qwen2-VL (Aug 2024) that competed with Llama 3.2 Vision^[28]

References

Meta AI, "Llama 3.2: Revolutionizing edge AI and vision with open, customizable models", Meta AI Blog, 2024-09-25. https://ai.meta.com/blog/llama-3-2-connect-2024-vision-edge-mobile-devices/. Accessed 2026-05-21. ↩
Meta Quest Blog, "Meta Connect 2024 Keynote Recap: Quest 3S, Llama 3.2, AI Wearables, Mixed Reality", Meta, 2024-09-25. https://www.meta.com/blog/connect-2024-keynote-recap-quest-3s-llama-3-2-ai-wearables-mixed-reality/. Accessed 2026-05-21. ↩
Meta AI, "Llama 3.2 Vision Model Card", meta-llama/llama-models GitHub, 2024-09-25. https://github.com/meta-llama/llama-models/blob/main/models/llama3_2/MODEL_CARD_VISION.md. Accessed 2026-05-21. ↩
Meta AI, "meta-llama/Llama-3.2-11B-Vision (model card)", Hugging Face, 2024-09-25. https://huggingface.co/meta-llama/Llama-3.2-11B-Vision. Accessed 2026-06-28. ↩
Meta AI, "meta-llama/Llama-3.2-90B-Vision (model card)", Hugging Face, 2024-09-25. https://huggingface.co/meta-llama/Llama-3.2-90B-Vision. Accessed 2026-05-21. ↩
Meta AI, "Llama 3.2 Community License Agreement", llama.com, 2024-09-25. https://www.llama.com/llama3_2/license/. Accessed 2026-05-21. ↩
Slator, "Meta Rolls Out Multimodal Llama 3.2 but Not in Europe", Slator, 2024-09-26. https://slator.com/meta-rolls-out-multimodal-llama-3-2-but-not-in-europe/. Accessed 2026-05-21. ↩
Meta Llama Team, "The Llama 3 Herd of Models", arXiv:2407.21783, 2024-07-31. https://arxiv.org/abs/2407.21783. Accessed 2026-05-21. ↩
Meta AI, "The Llama 3 Herd of Models (research publication page)", AI at Meta, 2024-07-23. https://ai.meta.com/research/publications/the-llama-3-herd-of-models/. Accessed 2026-05-21. ↩
Tech Funding News, "Meta Connect 2024, Llama 3.2 launch: Giving AI eyes and a voice to rival OpenAI and Anthropic", Tech Funding News, 2024-09-25. https://techfundingnews.com/meta-launches-llama-3-2-multimodal-ai/. Accessed 2026-05-21. ↩
InfoQ, "Meta Releases Llama 3.2 with Vision, Voice, and Open Customizable Models", InfoQ, 2024-10-04. https://www.infoq.com/news/2024/10/llama-3-2-multimodal/. Accessed 2026-05-21. ↩
Amazon Web Services, "Introducing Llama 3.2 models from Meta in Amazon Bedrock: A new generation of multimodal vision and lightweight models", AWS News Blog, 2024-09-25. https://aws.amazon.com/blogs/aws/introducing-llama-3-2-models-from-meta-in-amazon-bedrock-a-new-generation-of-multimodal-vision-and-lightweight-models/. Accessed 2026-05-21. ↩
Microsoft, "Llama-3.2-11B-Vision-Instruct model catalog entry", Azure AI Foundry, 2024-09-25. https://ai.azure.com/catalog/models/Llama-3.2-11B-Vision-Instruct. Accessed 2026-05-21. ↩
Google Cloud, "Llama 3.2: Meta's New Generation of Models on Vertex AI", Google Cloud Blog, 2024-09-25. https://cloud.google.com/blog/products/ai-machine-learning/llama-3-2-metas-new-generation-models-vertex-ai. Accessed 2026-05-21. ↩
IBM, "Meta's Llama 3.2 models now available on watsonx, including multimodal 11B and 90B models", IBM Think, 2024-10-09. https://www.ibm.com/think/news/meta-llama-3-2-models. Accessed 2026-05-21. ↩
Meta AI, "The Llama 4 Herd: The Beginning of a New Era of Natively Multimodal AI Innovation", Meta AI Blog, 2025-04-05. https://ai.meta.com/blog/llama-4-multimodal-intelligence/. Accessed 2026-05-21. ↩
J. Qi, "Inside Multimodal LLaMA 3.2: Understanding Meta's Vision-Language Model Architecture", Medium, 2024-10-15. https://j-qi.medium.com/inside-mllama-3-2-understanding-metas-vision-language-model-architecture-ae12ad24dcbf. Accessed 2026-05-21. ↩
Hugging Face, "Llama can now see and run on your device: welcome Llama 3.2", Hugging Face Blog, 2024-09-25. https://huggingface.co/blog/llama32. Accessed 2026-05-21. ↩
R. Rastogi, "Papers Explained 187d: Llama 3.2", Medium, 2024-10-02. https://ritvik19.medium.com/papers-explained-187d-llama-3-2-e517fa1f2528. Accessed 2026-05-21. ↩
J. Chi et al., "Llama Guard 3 Vision: Safeguarding Human-AI Image Understanding Conversations", arXiv:2411.10414, 2024-11-15. https://arxiv.org/abs/2411.10414. Accessed 2026-05-21. ↩
Meta AI, "Llama 3.2 Acceptable Use Policy", llama.com, 2024-09-25. https://www.llama.com/llama3_2/use-policy/. Accessed 2026-05-21. ↩
T. Pasberg, "Why does Meta restrict the usage of Llama 3.2 in the EU?", Medium, 2024-10-04. https://medium.com/@thomas-pasberg/why-does-meta-restrict-the-usage-of-llama3-2-in-the-eu-4079946abb07. Accessed 2026-05-21. ↩
S. Zan, "Using Llama Models in the EU", zansara.dev, 2025-05-16. https://www.zansara.dev/posts/2025-05-16-llama-eu-ban/. Accessed 2026-05-21. ↩
Meta AI, "Cloud partners", llama.com getting the models docs, 2024-09-25. https://www.llama.com/docs/getting-the-models/405b-partners/. Accessed 2026-05-21. ↩
N. Lambert, "Llama 3.2 Vision and Molmo: Foundations for the multimodal open-source ecosystem", Interconnects, 2024-09-26. https://www.interconnects.ai/p/molmo-and-llama-3-vision. Accessed 2026-05-21. ↩
H. Liu et al., "Improved Baselines with Visual Instruction Tuning (LLaVA-1.5)", arXiv:2310.03744, 2023-10-05. https://arxiv.org/abs/2310.03744. Accessed 2026-05-21. ↩
Mistral AI, "Announcing Pixtral 12B", Mistral AI News, 2024-09-17. https://mistral.ai/news/pixtral-12b/. Accessed 2026-05-21. ↩
Qwen Team, "Qwen2.5-VL Technical Report", arXiv:2502.13923, 2025-02-19. https://arxiv.org/abs/2502.13923. Accessed 2026-05-21. ↩

Improve this article

Add missing citations, update stale details, or suggest a clearer explanation. Every suggestion is reviewed for sourcing before it goes live.

2 revisions by 1 contributor · full history

Suggest edit

What links here

InternVideo Pixtral Pixtral Large Vision language model

Infobox

What is Llama 3.2 Vision?

How does Llama 3.2 Vision work?

What can Llama 3.2 Vision do?

History

Path from Llama 3.1 to a vision family

Meta Connect 2024 launch

Distribution timeline

Successors

Architecture

Vision encoder

Image adapter and cross-attention insertion

Image processing and tile sizes

Training

Pre-training of the adapter

Supervised fine-tuning and preference optimization

Compute and energy

Benchmarks

Variants

Llama 3.2 11B Vision and 11B Vision-Instruct

Llama 3.2 90B Vision and 90B Vision-Instruct

Companion text-only releases

Llama Guard 3 11B Vision

Licensing and the EU restriction

Is Llama 3.2 Vision open source?

EU exclusion

Availability

Direct downloads and Meta AI

Hugging Face

Cloud providers

On-device and partner inference

Significance

Limitations

Comparison with related work

See also

References

Improve this article

Related Articles

Llama 3.2

Llama 4 Scout and Maverick

Muse Spark

Chameleon (Meta AI)

CM3leon

ImageBind

What links here

Related Articles

Llama 3.2

Llama 4 Scout and Maverick

Muse Spark

Chameleon (Meta AI)

CM3leon

ImageBind

What links here