| Llama 3.2 | |
|---|---|
| Developer | Meta |
| Release date | September 25, 2024 |
| Announced at | Meta Connect 2024 |
| Model sizes | 1B, 3B (text-only); 11B, 90B (vision) |
| Architecture | Auto-regressive transformer; vision models use cross-attention adapter |
| Context length | 128,000 tokens |
| Training data cutoff | December 2023 |
| Modalities | Text (1B, 3B); Text + Image (11B, 90B) |
| License | Llama 3.2 Community License (EU restrictions on vision models) |
| Predecessor | Llama 3.1 |
| Website | llama.com |
Llama 3.2 is a family of open-weight large language models developed by Meta and released on September 25, 2024, at the company's annual Meta Connect developer conference. The release introduced two distinct product lines: lightweight text-only models at the 1 billion and 3 billion parameter scales, designed for edge and on-device deployment, and vision-capable multimodal models at the 11 billion and 90 billion parameter scales, capable of processing and reasoning over images alongside text. Llama 3.2 marked the first time Meta introduced image understanding into the Llama family, making it a significant inflection point in the lineage of Meta's open-weight models.
The release built directly on Llama 3.1, which had established a new bar for open-weight models with its 405 billion parameter flagship. Llama 3.2 extended that foundation in two complementary directions: downward toward the extreme efficiency required by smartphones and edge hardware, and sideways into multimodal AI territory that had previously been the domain of proprietary systems. All four models share a 128,000-token context window and support eight languages in their text modes. The vision models support image input in English only at launch.
Alongside the core models, Meta released updated safety infrastructure in the form of Llama Guard 3, including a dedicated vision-capable variant for classifying multimodal content. The full suite was made available through Meta's llama.com portal, Hugging Face, and over 25 cloud and infrastructure partners including AWS, Microsoft Azure, Google Cloud, and Oracle Cloud.
Meta's public model releases began with the original LLaMA in February 2023, a family of models ranging from 7 billion to 65 billion parameters trained on publicly available data. That first release established Meta's strategy of releasing research-grade open-weight models to the broader AI community, a posture the company has maintained and expanded through successive generations.
Llama 3 followed in April 2024, introducing the 8B and 70B models with a substantially expanded tokenizer vocabulary of 128,000 tokens, grouped-query attention across all sizes, and a much larger pretraining corpus. The 8B model in particular became a widely adopted baseline for fine-tuning and derivative work across the open-source community.
Llama 3.1, released in July 2024, was notable for two reasons. First, it introduced the 405 billion parameter model, the largest openly released model at the time of its launch and competitive with leading proprietary systems on several academic benchmarks. Second, it extended context length across the entire family to 128,000 tokens and introduced multilingual instruction following in eight languages. Llama 3.1 also added tool-calling capability and was positioned explicitly as a foundation for agentic workflows.
Despite these advances, the Llama 3 and 3.1 families remained text-only. Competing models from Anthropic, Google, and OpenAI had already incorporated vision capabilities, and demand for open-weight multimodal alternatives was growing. Llama 3.2 addressed this gap directly.
Meta Connect is Meta's annual developer and consumer hardware conference, typically focused on virtual reality, augmented reality, and the Meta Quest product line. In 2024, Meta used the event on September 25 to also announce major AI developments, including Llama 3.2 and new voice AI features for its consumer products. The choice of venue underscored Meta's positioning of the small Llama 3.2 models as components intended for on-device AI in consumer hardware, including the Ray-Ban Meta smart glasses and Meta Quest headsets.
Llama 3.2 is not a single release but a coordinated family of four models organized into two distinct product lines with different design goals.
The first line consists of the 1B and 3B text-only models. These are designed for scenarios where compute, memory, and power constraints dominate the deployment environment: smartphones, edge servers, IoT devices, and offline-capable applications. They inherit the architecture of Llama 3.1 but are substantially smaller, trained using a combination of large-scale pretraining and knowledge distillation from larger models.
The second line consists of the 11B and 90B vision models. These are designed for tasks that require understanding images alongside text: document analysis, visual question answering, chart and diagram interpretation, and image captioning. They are built on frozen Llama 3.1 language models with a separately trained vision adapter attached through cross-attention layers, preserving all text capabilities of the underlying Llama 3.1 base while adding image understanding.
Both lines support a 128,000-token context window, multilingual text, and instruction-following via supervised fine-tuning and reinforcement learning from human feedback.
The 1B and 3B models were designed from the outset to run on consumer devices and edge hardware without requiring a cloud connection. This focus shaped every aspect of their development, from architectural choices to training methods to the quantization formats made available at launch.
On a OnePlus 12 smartphone with an ARM CPU, the 1B model in SpinQuant quantized form achieves approximately 50 tokens per second decode throughput with a time-to-first-token of 0.3 seconds and a memory footprint of just 1,921 MB. The same model in BF16 precision runs at 19.2 tokens per second with over 3 GB of memory usage, illustrating the significance of quantization for mobile deployment.
The 3B model reached approximately 2.27 million monthly downloads on Hugging Face within months of its release, reflecting strong adoption by developers building mobile and edge applications.
Both the 1B and 3B models share the same transformer architecture as Llama 3.1 8B, including grouped-query attention for efficient key-value cache management during inference, shared input and output embeddings, and the 128,000-token vocabulary with tiktoken-based byte pair encoding. The primary difference from the 8B base model is parameter count: 1.23 billion and 3.21 billion respectively, achieved through structured pruning and then recovered through distillation training.
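The key-value sharing at the heart of grouped-query attention can be illustrated in a few lines. The sketch below uses illustrative dimensions (32 query heads sharing 8 key-value heads, with head dimension 64); it shows the mechanism, not Meta's implementation.

```python
import torch
import torch.nn.functional as F

# Grouped-query attention sketch: several query heads share one KV head,
# shrinking the KV cache by a factor of n_q_heads / n_kv_heads.
# Dimensions are illustrative, not taken from the released checkpoints.
batch, seq, n_q_heads, n_kv_heads, head_dim = 2, 16, 32, 8, 64
group = n_q_heads // n_kv_heads          # query heads per KV head

q = torch.randn(batch, n_q_heads, seq, head_dim)
k = torch.randn(batch, n_kv_heads, seq, head_dim)   # cached at 1/4 the size
v = torch.randn(batch, n_kv_heads, seq, head_dim)

# Expand KV heads so each group of 4 query heads attends to the same KV head.
k = k.repeat_interleave(group, dim=1)
v = v.repeat_interleave(group, dim=1)

scores = (q @ k.transpose(-2, -1)) / head_dim**0.5
out = F.softmax(scores, dim=-1) @ v      # (batch, n_q_heads, seq, head_dim)
```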
Context length remains 128,000 tokens, the same as Llama 3.1, which is unusual for models at this scale. Most small models in the 1B to 3B parameter range impose much shorter context limits due to the quadratic scaling of attention computation, but the Llama 3.2 small models inherit the full context infrastructure of their larger predecessor.
Rather than training 1B and 3B models from scratch, Meta derived them from Llama 3.1 8B using a two-stage process of structured pruning followed by knowledge distillation.
In the pruning stage, structured portions of the 8B model were systematically removed to arrive at the target parameter counts. Structured pruning removes entire components such as attention heads, feed-forward layer neurons, or transformer layers, as opposed to unstructured pruning which zeros out individual weights. Structured pruning produces models with regular shapes that map efficiently onto hardware accelerators without requiring sparse computation support.
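A minimal sketch of the idea follows, assuming a plain two-matrix feed-forward block and an L2-norm importance score; the actual Llama FFN is gated and Meta's pruning criterion is not public.

```python
import torch.nn as nn

# Structured pruning sketch: drop whole hidden neurons of an FFN so the
# result stays a dense matrix with regular shapes. Importance scoring by
# weight norm is an illustrative assumption, not Meta's published criterion.
def prune_ffn(w1: nn.Linear, w2: nn.Linear, keep_ratio: float):
    importance = w1.weight.norm(dim=1)           # one score per hidden neuron
    n_keep = int(keep_ratio * importance.numel())
    keep = importance.topk(n_keep).indices.sort().values

    new_w1 = nn.Linear(w1.in_features, n_keep, bias=False)
    new_w2 = nn.Linear(n_keep, w2.out_features, bias=False)
    new_w1.weight.data = w1.weight.data[keep]    # remove pruned rows
    new_w2.weight.data = w2.weight.data[:, keep] # remove matching columns
    return new_w1, new_w2

# Llama 3.1 8B uses hidden size 4096 and FFN width 14336.
w1, w2 = nn.Linear(4096, 14336, bias=False), nn.Linear(14336, 4096, bias=False)
small_w1, small_w2 = prune_ffn(w1, w2, keep_ratio=0.5)
```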
The pruned models initially exhibit degraded performance because the removed components had been contributing meaningfully to the model's learned representations. To recover capability, Meta applied knowledge distillation, a technique where the smaller student model is trained to match the output distributions of one or more larger teacher models rather than simply fitting ground-truth labels. For Llama 3.2, the teachers were the Llama 3.1 8B and Llama 3.1 70B models, whose logits at each token position were used as soft targets during pretraining.
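A common formulation of this loss is sketched below, under the assumption of a single teacher and illustrative temperature and mixing weights; Meta's exact recipe is not public.

```python
import torch.nn.functional as F

# Distillation loss sketch: the student matches the teacher's softened
# token distribution (soft targets) while still fitting the ground-truth
# next token (hard targets). Temperature and alpha are illustrative.
def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=2.0, alpha=0.5):
    soft = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature**2
    hard = F.cross_entropy(
        student_logits.view(-1, student_logits.size(-1)), labels.view(-1))
    return alpha * soft + (1 - alpha) * hard
```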
The distillation was combined with standard large-scale pretraining on up to 9 trillion tokens from publicly available sources, with a data cutoff of December 2023. Post-training then applied multiple rounds of supervised fine-tuning, rejection sampling, and direct preference optimization to produce instruction-following variants.
Total compute for the 1B model was 370,000 H100 GPU hours, producing 107 metric tons of CO2-equivalent emissions on a location-based accounting basis. The 3B model required 460,000 H100 GPU hours and 133 tons of CO2-equivalent. Meta offset these emissions entirely through renewable energy purchasing, resulting in zero market-based emissions.
The 1B and 3B models support multilingual text generation in English, German, French, Italian, Portuguese, Hindi, Spanish, and Thai. Both support tool calling with user-defined tool specifications without requiring prior fine-tuning examples, a zero-shot tool use capability that makes them suitable for agentic applications.
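The sketch below shows one way to exercise zero-shot tool calling through the Hugging Face chat-template API; the `get_weather` function is a hypothetical example, and the `tools=` argument assumes a recent transformers version.

```python
from transformers import AutoTokenizer

def get_weather(city: str) -> str:
    """Get the current weather for a city.

    Args:
        city: Name of the city.
    """
    ...  # hypothetical tool; only its schema matters for the prompt

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-3B-Instruct")
messages = [{"role": "user", "content": "What's the weather in Lisbon?"}]

# The chat template renders the tool's JSON schema into the prompt; no
# fine-tuning examples are required for the model to emit a call such as
# {"name": "get_weather", "parameters": {"city": "Lisbon"}}.
prompt = tokenizer.apply_chat_template(
    messages, tools=[get_weather], add_generation_prompt=True, tokenize=False)
```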
Instruct variants support summarization, question answering, instruction following, and code generation. The 3B model in particular performs competitively with larger models on instruction-following tasks. On the IFEval benchmark, the 3B instruction-tuned model scores 77.4, essentially matching the Llama 3.1 8B instruct model's 76.5, despite having fewer than half as many parameters.
The 1B model can also serve as a speculative decoding draft model for the larger Llama 3.1 8B, improving end-to-end generation throughput when both models are available.
Meta released official quantized variants of both models at launch to support deployment across the broadest possible range of hardware:
- SpinQuant, a post-training quantization method based on learned rotation matrices
- QLoRA, which combines quantization-aware training with LoRA adaptation to recover accuracy lost to quantization
The linear layers in quantized variants use 4-bit groupwise weights with group size 32 combined with 8-bit per-token dynamic activations. The classification layer uses 8-bit per-channel weights, and the embedding layer uses 8-bit per-channel quantization.
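A simplified sketch of the weight side of this scheme follows, assuming symmetric per-group scaling; the released quantization recipes are more involved.

```python
import torch

# 4-bit groupwise weight quantization sketch with group size 32: each run
# of 32 weights shares one scale, and values are rounded into the signed
# 4-bit range [-8, 7]. Symmetric scaling is an illustrative simplification.
def quantize_4bit_groupwise(w: torch.Tensor, group_size: int = 32):
    rows, cols = w.shape
    groups = w.reshape(rows, cols // group_size, group_size)
    scale = groups.abs().amax(dim=-1, keepdim=True).clamp_min(1e-8) / 7.0
    q = (groups / scale).round().clamp(-8, 7).to(torch.int8)
    return q, scale

def dequantize(q, scale):
    return (q.float() * scale).reshape(q.size(0), -1)

w = torch.randn(64, 128)
q, scale = quantize_4bit_groupwise(w)
max_err = (w - dequantize(q, scale)).abs().max()   # bounded by scale / 2
```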
The 11B and 90B models represent the first members of the Llama family to support visual input. Prior Llama releases were text-only, and Meta's earlier multimodal research work such as ImageBind had not been integrated into the publicly released Llama models. Llama 3.2 Vision closed this gap and positioned the Llama ecosystem as a competitive alternative to proprietary multimodal systems.
The vision models are available in both base (pre-trained) and instruct (fine-tuned for dialogue) variants. At launch, image-text prompting is supported in English only, while text-only prompting supports all eight languages available in the text models.
The vision models are built by attaching a separately trained vision adapter to frozen Llama 3.1 language models:
- Llama 3.2 11B Vision pairs the Llama 3.1 8B text model with the vision adapter
- Llama 3.2 90B Vision pairs the Llama 3.1 70B text model with the vision adapter
The naming reflects total parameter count including the vision adapter, not just the language model backbone.
The image encoder is a Vision Transformer (ViT) based on the ViT-H/14 architecture with 14-pixel patch size. Meta extended the standard ViT-H with 8 additional gated self-attention layers, producing a final encoder with two stages: a 32-layer primary encoder followed by an 8-layer global encoder, totaling 40 transformer blocks. The encoder operates at tile sizes of 448 pixels for the 11B base model and 560 pixels for the 11B instruct and all 90B variants.
The vision adapter connects the image encoder to the language model through cross-attention layers inserted at regular intervals into the transformer stack. Specifically, cross-attention layers are integrated after every fourth self-attention layer in the language model, occurring at layers 3, 8, 13, 18, 23, 28, 33, and 38. At these points, the language model's hidden states attend over the image encoder's output representations, allowing visual information to influence text generation without requiring every transformer layer to process image tokens.
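The pattern can be sketched as follows, with gated cross-attention blocks interleaved into the decoder stack at the listed indices; the dimensions, the tanh gate, and the use of generic transformer layers are illustrative assumptions, not Meta's exact implementation.

```python
import torch
import torch.nn as nn

class GatedCrossAttention(nn.Module):
    """Text hidden states attend over image features; a learned gate that
    starts at zero leaves the pretrained text path untouched initially."""
    def __init__(self, d_model, n_heads):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.gate = nn.Parameter(torch.zeros(1))

    def forward(self, hidden, image_feats):
        attended, _ = self.attn(hidden, image_feats, image_feats)
        return hidden + torch.tanh(self.gate) * attended

d_model, n_heads = 512, 8                         # illustrative sizes
CROSS_LAYERS = {3, 8, 13, 18, 23, 28, 33, 38}     # indices from the text
layers = nn.ModuleList(nn.TransformerEncoderLayer(
    d_model, n_heads, batch_first=True) for _ in range(40))
cross = nn.ModuleDict(
    {str(i): GatedCrossAttention(d_model, n_heads) for i in CROSS_LAYERS})

hidden = torch.randn(1, 32, d_model)              # text hidden states
image_feats = torch.randn(1, 100, d_model)        # vision encoder outputs
for i, layer in enumerate(layers):
    if i in CROSS_LAYERS:
        hidden = cross[str(i)](hidden, image_feats)
    hidden = layer(hidden)
```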
A key architectural decision was to freeze the language model parameters during vision adapter training. Only the image encoder and cross-attention layers were updated during the vision pre-training phase. This design has two important practical consequences. First, it preserves all text-only capabilities of the underlying Llama 3.1 model without degradation, making the vision models true drop-in replacements for their text-only counterparts in text-only workflows. Second, it makes the vision models excellent starting points for downstream fine-tuning on domain-specific visual tasks, since the language model parameters reflect the full Llama 3.1 training and have not been perturbed by vision training.
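In training-loop terms the recipe is simple, as this sketch with hypothetical stand-in modules shows: freeze the backbone, optimize everything else.

```python
import torch
import torch.nn as nn

# Freeze-the-backbone sketch: the language model receives no gradient
# updates, so its text behavior stays identical to Llama 3.1. Module
# names and shapes here are hypothetical stand-ins.
class TinyVisionLM(nn.Module):
    def __init__(self):
        super().__init__()
        self.language_model = nn.Linear(64, 64)   # stand-in for the LLM
        self.vision_encoder = nn.Linear(64, 64)   # stand-in for the ViT
        self.cross_attention = nn.Linear(64, 64)  # stand-in for the adapter

model = TinyVisionLM()
for p in model.language_model.parameters():
    p.requires_grad = False

trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.AdamW(trainable, lr=1e-4)  # adapter + encoder only
```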
Vision pre-training used a dataset of 6 billion image-text pairs assembled from diverse sources, with careful attention to data mixture and quality. The training was structured in multiple stages, with the vision adapter components trained while language model weights remained frozen.
Total compute for the combined 11B and 90B vision model training was 2.02 million H100-80GB GPU hours. This produced 584 metric tons of CO2-equivalent on a location-based basis, offset to zero on a market-based basis through Meta's renewable energy purchasing.
The 90B model was trained in two main pretraining phases, a primary stage and an annealing stage of approximately 885,000 GPU hours each, with additional compute for supervised fine-tuning and reinforcement learning from human feedback.
Post-training applied multiple rounds of supervised fine-tuning followed by rejection sampling and direct preference optimization to optimize instruction following, safety behavior, and multimodal dialogue quality. Instruction fine-tuning data included over 3 million synthetically generated image-text examples.
The 11B and 90B vision models support a range of visual understanding tasks:
- document-level understanding, including analysis of charts, graphs, and scanned forms
- visual question answering and reasoning over images
- chart and diagram interpretation
- image captioning and visual grounding
Images are processed by dividing them into overlapping tiles of the configured pixel size, with each tile encoded independently and the resulting representations passed through the global encoder stage before cross-attention integration.
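A simplified version of such tiling is sketched below, assuming non-overlapping 560-pixel tiles and zero-padding at the edges; the production pipeline, as noted above, uses overlapping tiles and per-variant tile sizes.

```python
from PIL import Image

def tile_image(img: Image.Image, tile: int = 560) -> list[Image.Image]:
    # Pad up to a multiple of the tile size, then cut the canvas into tiles;
    # each tile is later encoded independently by the image encoder.
    w = (img.width + tile - 1) // tile * tile
    h = (img.height + tile - 1) // tile * tile
    canvas = Image.new("RGB", (w, h))
    canvas.paste(img)
    return [canvas.crop((x, y, x + tile, y + tile))
            for y in range(0, h, tile) for x in range(0, w, tile)]

tiles = tile_image(Image.new("RGB", (1120, 800)))  # -> 4 tiles of 560x560
```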
Llama 3.2 is released under the Llama 3.2 Community License, a custom commercial license distinct from standard open-source licenses. The license permits commercial use, fine-tuning, and redistribution with the following conditions:
- Products and redistributions built on the models must display "Built with Llama" attribution
- Derivative models must include "Llama" at the beginning of their names
- Organizations whose products or services exceeded 700 million monthly active users as of the release date must request a separate license from Meta
- Use must comply with Meta's Acceptable Use Policy
These terms are substantially similar to the Llama 3.1 Community License, continuing Meta's practice of permissive but not fully open-source licensing.
The Llama 3.2 Community License contains a specific restriction that applies exclusively to the 11B and 90B multimodal models and not to the 1B and 3B text-only models. The license states that the rights granted under the agreement "are not being granted to you if you are an individual domiciled in, or a company with a principal place of business in, the European Union."
An exception preserves access for end users: European individuals can still use Llama 3.2 Vision through third-party products and services that incorporate the models, even if those services are built by non-EU entities. The restriction applies to developers and companies seeking to build with the vision models directly, not to consumers using a product that embeds them.
Meta has not publicly explained the EU vision restriction in a detailed statement. Observers and legal commentators have offered several competing explanations. One widely cited theory is that the restriction relates to uncertainty about compliance with the EU AI Act, which entered into force in August 2024 and whose provisions were still being interpreted at the time of Llama 3.2's release. Another theory focuses on GDPR: Meta had previously been fined under GDPR for privacy violations and had paused plans to train models on public EU user content after regulatory pushback, so the EU restriction on vision models may reflect ongoing caution about deploying AI trained on data whose collection practices could face EU regulatory scrutiny.
Critics have questioned the regulatory justification for the restriction. Other vision language models, including those from Qwen (developed in China by Alibaba) and Pixtral (developed in France by Mistral AI, an EU company), were released globally without equivalent EU restrictions at roughly the same time. This inconsistency has led some observers to suggest that the restriction may reflect litigation risk concerns specific to Meta's data practices rather than a general requirement of EU AI regulation.
Alongside the core Llama 3.2 models, Meta released updated versions of its Llama Guard safety classification system. Llama Guard is a family of models designed not to generate content but to classify whether a given prompt or response violates defined safety categories. It functions as a separately deployed safety layer that developers can integrate into their applications to screen model inputs and outputs.
Llama Guard 3 1B is a text-safety classifier derived from the Llama 3.2 1B model through pruning and additional quantization. In its quantized form, the model occupies approximately 438 MB, down from 2,858 MB for its BF16 parent. This extreme compression makes it practical to run Llama Guard on the same device as the model being guarded, with minimal additional memory overhead.
Llama Guard 3 11B Vision is a multimodal safety classifier based on the Llama 3.2 11B Vision architecture. It accepts both image and text inputs and classifies content across 13 hazard categories drawn from the MLCommons safety taxonomy:
- S1: Violent Crimes
- S2: Non-Violent Crimes
- S3: Sex-Related Crimes
- S4: Child Sexual Exploitation
- S5: Defamation
- S6: Specialized Advice
- S7: Privacy
- S8: Intellectual Property
- S9: Indiscriminate Weapons
- S10: Hate
- S11: Suicide & Self-Harm
- S12: Sexual Content
- S13: Elections
The classifier accepts multimodal prompts consisting of a single image plus text, as well as text-only prompts. Images are rescaled into four 560-by-560 pixel chunks before encoding. On content safety classification benchmarks, Llama Guard 3 11B Vision achieves an F1 score of 0.938 on response classification tasks with a precision of 0.961 and a false positive rate of 0.016, substantially outperforming GPT-4o at 0.667 F1 on the same task.
The safety categories can be customized or selectively disabled by developers who need to adapt the classifier's behavior to their specific application domain.
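A hedged sketch of invoking the 1B text classifier through Hugging Face transformers follows; the message format mirrors the pattern on the model card, and exact template behavior may vary across library versions.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("meta-llama/Llama-Guard-3-1B")
guard = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-Guard-3-1B", torch_dtype=torch.bfloat16)

# The chat template wraps the conversation in the hazard-taxonomy prompt;
# the classifier then generates "safe", or "unsafe" plus a category code.
chat = [{"role": "user",
         "content": [{"type": "text", "text": "How do I hot-wire a car?"}]}]
input_ids = tok.apply_chat_template(chat, return_tensors="pt")
out = guard.generate(input_ids, max_new_tokens=16)
print(tok.decode(out[0][input_ids.shape[-1]:], skip_special_tokens=True))
# e.g. "unsafe\nS2" (Non-Violent Crimes)
```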
The following table shows benchmark scores for the Llama 3.2 instruction-tuned text models alongside Llama 3.1 8B for reference.
| Benchmark | Llama 3.2 1B Instruct | Llama 3.2 3B Instruct | Llama 3.1 8B Instruct |
|---|---|---|---|
| MMLU (5-shot) | 49.3 | 63.4 | 69.4 |
| GSM8K (CoT) | 44.4 | 77.7 | 84.5 |
| MATH (CoT) | -- | 48.0 | 51.9 |
| ARC-C | 59.4 | 78.6 | 83.4 |
| IFEval | 59.5 | 77.4 | 76.5 |
| Hellaswag | 41.2 | 69.8 | 82.0 |
| AlpacaEval (LC) | 7.17 | 20.88 | 25.74 |
The 3B model's IFEval score of 77.4 is essentially equal to the 8B model's 76.5, a result Meta highlighted as evidence that distillation effectively transfers instruction-following capability beyond what raw parameter count would predict.
The following table shows benchmark scores for the instruction-tuned vision models.
| Benchmark | Metric | Llama 3.2 11B | Llama 3.2 90B |
|---|---|---|---|
| VQAv2 (test) | Accuracy | 75.2% | 78.1% |
| DocVQA (test) | ANLS | 88.4 | 90.1 |
| ChartQA (test, CoT) | Relaxed accuracy | 83.4% | 85.5% |
| AI2 Diagram (test) | Accuracy | 91.1% | 92.3% |
| MMMU (val, CoT) | Micro avg accuracy | 50.7% | 60.3% |
| MathVista | Accuracy | -- | 57.3% |
| MMLU (CoT) | Macro avg accuracy | 73.0% | 86.0% |
| MATH (CoT) | Final EM | 51.9% | 68.0% |
| GPQA | Accuracy | -- | 46.7% |
For the pre-trained base models, the Open LLM Leaderboard at the time of release showed the following averages across BBH, MATH Level 5, GPQA, MUSR, and MMLU-PRO:
| Model | Average |
|---|---|
| Llama 3.2 1B | 1.88 |
| Llama 3.2 3B | 8.00 |
| Llama 3.1 8B | 14.00 |
These base scores reflect the expected gap caused by reduced parameter count, which the instruction-tuning and distillation process partially closes in the instruct variants.
At the time of release, the 90B vision model was positioned as competitive with proprietary vision models at the smaller end of the commercial market.
| Model | Developer | Parameters | VQAv2 | DocVQA | ChartQA | MMMU |
|---|---|---|---|---|---|---|
| Llama 3.2 11B Vision | Meta | 11B | 75.2% | 88.4 | 83.4% | 50.7% |
| Llama 3.2 90B Vision | Meta | 90B | 78.1% | 90.1 | 85.5% | 60.3% |
| Claude 3 Haiku | Anthropic | Undisclosed | 74.4% | ~88 | ~81% | 50.2% |
| GPT-4o-mini | OpenAI | Undisclosed | -- | -- | -- | 59.4% |
| Pixtral 12B | Mistral AI | 12B | -- | 90.7 | 81.8% | 52.5% |
On visual QA and document understanding tasks, the Llama 3.2 90B model performed comparably to or slightly above Claude 3 Haiku, while the 11B model was roughly on par with Haiku depending on the specific benchmark. On MMMU, a test of broad multidisciplinary reasoning that incorporates images, the 90B model scored 60.3% to GPT-4o-mini's 59.4%, a narrow edge for the open-weight model. GPT-4o-mini retained a lead on mathematical reasoning benchmarks such as MATH, scoring approximately 70.2% compared to 68.0% for the 90B model.
Meta described the vision models as competitive with Claude 3 Haiku and GPT-4o-mini on image recognition and visual understanding tasks.
The 1B and 3B models are the primary candidates for on-device and edge deployment scenarios:
- on-device summarization and rewriting of messages, notes, and documents
- offline-capable question answering and instruction-following assistants on smartphones, edge servers, and IoT devices
- agentic features that use zero-shot tool calling to invoke local functions without a cloud connection
The availability of SpinQuant and QLoRA quantized variants makes deployment feasible on ARM processors, with Meta confirming Qualcomm and MediaTek platform support at launch. Meta noted that the ARM architecture targets cover approximately 99% of mobile devices in active use.
The 11B and 90B vision models are suited for enterprise and research workflows involving document-heavy content:
- analysis of scanned documents, forms, and reports that mix text with charts and tables
- visual question answering over diagrams, figures, and photographs
- large-scale image captioning and visual content indexing
The long 128,000-token context window makes both the text and vision models suitable for retrieval-augmented generation (RAG) pipelines where retrieved documents are passed into the context window alongside a query. For vision-capable RAG systems, the 11B and 90B models can process retrieved images and documents in a single forward pass, enabling systems that retrieve and reason over both text and visual content.
The small models' support for tool calling additionally makes them applicable to agentic RAG workflows where the model must decide which tools to invoke to retrieve relevant information before generating a response.
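A minimal text-only RAG loop with the 3B model might look like the following sketch; the `retrieve` function is a hypothetical stand-in for a real vector-index lookup.

```python
from transformers import pipeline

def retrieve(query: str) -> list[str]:
    # Hypothetical retriever; a real system would query a vector index.
    return ["(retrieved document chunk 1)", "(retrieved document chunk 2)"]

generator = pipeline("text-generation", model="meta-llama/Llama-3.2-3B-Instruct")

def answer(query: str) -> str:
    # The 128K context window leaves ample room for retrieved material.
    context = "\n\n".join(retrieve(query))
    messages = [
        {"role": "system",
         "content": f"Answer using only this context:\n{context}"},
        {"role": "user", "content": query},
    ]
    result = generator(messages, max_new_tokens=256)
    return result[0]["generated_text"][-1]["content"]
```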
The 1B model can serve as a draft model for speculative decoding with the Llama 3.1 8B model as the verifier. In speculative decoding, a fast smaller model generates candidate token sequences that a larger model then verifies and accepts or rejects. When the draft model's predictions align well with the larger model, this technique substantially increases tokens-per-second throughput without changing output quality. The shared vocabulary and architectural lineage between Llama 3.2 1B and Llama 3.1 8B make this pairing particularly effective.
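The pairing can be reproduced with Hugging Face assisted generation, sketched below under the assumption of a single accelerator; realized speedups depend on how often the verifier accepts the 1B model's drafts.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")
target = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B-Instruct",
    torch_dtype=torch.bfloat16, device_map="auto")
draft = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.2-1B-Instruct",
    torch_dtype=torch.bfloat16, device_map="auto")

inputs = tok("Speculative decoding raises throughput by", return_tensors="pt")
inputs = inputs.to(target.device)

# The 1B model proposes candidate tokens; the 8B model verifies them in a
# single forward pass, preserving the 8B model's output distribution.
out = target.generate(**inputs, assistant_model=draft, max_new_tokens=64)
print(tok.decode(out[0], skip_special_tokens=True))
```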
The Llama 3.2 release received broadly positive coverage from the AI developer community. The introduction of vision capabilities into the Llama family was described by multiple observers as a significant milestone for open-weight AI, bringing Meta's public model releases into parity with proprietary multimodal systems for the first time.
The small model line attracted particular enthusiasm. The combination of long context (128K tokens), competitive instruction-following performance, and broad hardware support made the 1B and 3B models highly practical for a segment of developers who had previously needed to use much larger or cloud-dependent models for tasks like summarization and tool use. InfoQ reported that the 3B model's instruction-following performance matching the 8B model on IFEval was a notable result that suggested distillation had closed the expected capability gap between the size tiers.
Partner companies including AMD, AWS, Databricks, Dell, Google Cloud, Groq, IBM, Intel, Microsoft Azure, NVIDIA, Oracle Cloud, and Snowflake launched support for the models on the release date, indicating strong advance coordination and ecosystem readiness.
The EU restriction on vision models generated criticism from the European AI developer community and was widely reported in AI media outlets. Critics noted that other open-weight vision models were available in the EU without restriction, and argued that Meta's choice to exclude EU developers created a fragmentation of the open-weight ecosystem. The restriction also prompted legal analysis about whether the stated justifications held up under scrutiny, with several commentators concluding that the real motivation likely involved GDPR compliance concerns specific to how Meta assembled its training data.
Some developers noted that despite its strong benchmark performance relative to Claude 3 Haiku, the 90B vision model still lagged behind GPT-4o (not GPT-4o-mini) and Claude 3.5 Sonnet on more complex multimodal reasoning tasks, limiting its applicability as a drop-in replacement for the most capable proprietary systems.
At launch, image-text prompting is limited to English even though text-only prompting supports eight languages. Developers building multilingual applications that need to process non-English text alongside images must either handle language-switching in their application layer or wait for future model versions with expanded multimodal language support.
The vision models are optimized to process one image per conversation, attending to the last image provided when multiple images appear in a context window. Applications requiring comparison of multiple images simultaneously must structure their prompts accordingly or implement external batching.
The EU restriction on the multimodal models limits European developers who wish to build products directly on Llama 3.2 Vision. While end users can access vision capabilities through compliant third-party services, EU-based companies building AI products on top of the vision models directly cannot do so under the standard license terms.
All Llama 3.2 models share a training data cutoff of December 2023, meaning they lack awareness of events, publications, or developments after that date. Applications requiring up-to-date information must supplement the models with retrieval systems or tool access to external knowledge sources.
Despite the strong distillation results, the 1B and 3B models have an inherent capability ceiling compared to larger models on complex reasoning tasks. The 1B model in particular scores 44.4 on GSM8K, compared to 84.5 for the 8B model, reflecting a meaningful gap in mathematical problem solving. Applications requiring reliable arithmetic or multi-step logical reasoning should prefer the larger models in the Llama family.