Llama 3.2
Last reviewed
May 17, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v2 · 4,573 words
| Llama 3.2 | |
|---|---|
| Developer | Meta |
| Release date | September 25, 2024 |
| Announced at | Meta Connect 2024 |
| Model sizes | 1B, 3B (text-only); 11B, 90B (vision) |
| Architecture | Auto-regressive transformer; vision models add a cross-attention image adapter on frozen Llama 3.1 backbones |
| Context length | 128,000 tokens |
| Training data cutoff | December 2023 |
| Modalities | Text (1B, 3B); Text + Image (11B, 90B) |
| License | Llama 3.2 Community License (EU restriction on vision models) |
| Predecessor | Llama 3.1 |
| Successor | Llama 3.3 (Dec 2024, 70B refresh); Llama 4 (Apr 2025) |
| Companion release | Llama Stack, Llama Guard 3 |
| Website | llama.com |
Llama 3.2 is a family of open-weight large language models developed by Meta and released on September 25, 2024, at the company's annual Meta Connect developer conference.[^1] The release expanded the Llama family along two new axes simultaneously: downward to lightweight 1 billion and 3 billion parameter text-only models designed for on-device AI inference on smartphones and edge hardware, and outward into multimodal territory with 11 billion and 90 billion parameter vision-capable models — the first Llama models able to process images.[^1][^2]
The two families were engineered with different methods and toward different deployment targets. The 1B and 3B text models were derived from Llama 3.1 8B via structured pruning followed by knowledge distillation from Llama 3.1 8B and 70B teachers, producing compact models that nevertheless retain the 128,000-token context window of their larger ancestors.[^1][^3] The 11B and 90B vision models, by contrast, were built by attaching a separately trained image encoder to frozen Llama 3.1 8B and 70B language backbones via cross-attention layers, rather than retraining the language model end-to-end on multimodal data.[^1][^4]
Alongside the four core models, Meta released Llama Stack, an official agent, RAG, and evaluation framework, and updated its safety classifier family with Llama Guard 3 1B and Llama Guard 3 11B Vision.[^1][^5] The 11B and 90B vision models were released globally with one notable exception: Meta excluded developers domiciled or headquartered in the European Union from the license grant for the multimodal models, citing the regulatory environment around the EU AI Act and ongoing GDPR disputes over training data.[^6][^7] All four models were made available through llama.com, Hugging Face, and over 25 cloud and infrastructure partners on day one.[^1]
Meta's open-weight language model strategy began with the original LLaMA in February 2023 and accelerated through 2024. Llama 3 (April 2024) introduced 8B and 70B models with a 128,000-token vocabulary and grouped-query attention at both sizes. Llama 3.1 (July 2024) extended the family to a 405-billion-parameter flagship, raised the context window to 128,000 tokens across all sizes, added native tool calling, and shipped under a revised community license intended for agentic deployments.[^8]
Despite these advances, the Llama 3 and 3.1 families remained text-only. Competing models from Anthropic, Google DeepMind, and OpenAI — including Claude 3 Haiku, Gemini 1.5, and GPT-4o-mini — had all incorporated vision input, leaving a perceived gap in the open-weight ecosystem.[^9] Separately, the rise of small high-performing models such as Gemma 2 2B and Phi-3 mini from competitors had made it clear that sub-10-billion-parameter LLMs were a strategically important category for on-device and mobile deployments.[^10]
Meta Connect is Meta's annual developer and consumer hardware conference, historically anchored to virtual and augmented reality announcements and the Meta Quest product line. On September 25, 2024, CEO Mark Zuckerberg used the keynote to introduce Llama 3.2 alongside updates to Meta AI, the Ray-Ban Meta smart glasses, and Quest hardware.[^11] The choice of venue underscored Meta's positioning of the lightweight Llama 3.2 models as on-device components for consumer hardware, including AI features running locally on smart glasses and headsets rather than in the cloud.[^1][^11]
The release filled two gaps in Meta's open-weight lineup simultaneously: the absence of competitive small models for edge deployment, and the absence of any vision capability anywhere in the Llama family.
The 1B and 3B models were designed from the outset for scenarios in which compute, memory, and power constraints dominate the deployment environment: smartphones, edge servers, IoT devices, and offline-capable applications.[^1] This focus shaped every aspect of their development, from architectural inheritance to training methodology to the quantized formats released at launch.
Meta highlighted on-device summarization, instruction following, prompt rewriting, and local tool use as the canonical applications for these models.[^1] The goal was to enable practical, private AI experiences that run entirely on the user's device without round-tripping to a server.
Both the 1B and 3B models share the transformer architecture of Llama 3.1 8B, including grouped-query attention for efficient key-value cache management, shared input and output embeddings, and the 128,000-token tiktoken-based BPE vocabulary.[^1][^3] The primary difference from the 8B base model is parameter count — 1.23 billion and 3.21 billion respectively — achieved through structured pruning rather than independent architecture design.[^3]
Notably, the 128,000-token context window is preserved at both sizes. Most small open models in the 1B–3B range impose much shorter context limits, because attention compute grows quadratically with sequence length and the key-value cache grows with every cached token, both of which weigh heavily on memory-constrained devices. The Llama 3.2 small models nevertheless inherit the full long-context infrastructure of their larger ancestors.[^1][^3]
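To give a sense of why long context is demanding on small devices, the sketch below estimates key-value cache size for a single sequence. The layer count, head counts, and head dimension are assumed values for a model of roughly 1B parameters and are not taken from this article; only the 128,000-token context length comes from the text.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_tokens, bytes_per_value=2):
    """Bytes needed to cache keys and values for one sequence."""
    # Factor 2: one tensor for keys and one for values in every layer.
    return 2 * n_layers * n_kv_heads * head_dim * n_tokens * bytes_per_value


# Assumed, illustrative model shape (hypothetical, not from the article).
LAYERS, QUERY_HEADS, KV_HEADS, HEAD_DIM = 16, 32, 8, 64
CONTEXT = 128_000  # context length stated in the article

# Grouped-query attention caches only KV_HEADS heads per layer.
gqa = kv_cache_bytes(LAYERS, KV_HEADS, HEAD_DIM, CONTEXT)
# Without grouping, every query head would need its own cached keys and values.
mha = kv_cache_bytes(LAYERS, QUERY_HEADS, HEAD_DIM, CONTEXT)

print(f"grouped-query cache:        {gqa / 2**30:.2f} GiB")
print(f"one KV head per query head: {mha / 2**30:.2f} GiB")
```

Under these assumptions, grouping four query heads per key-value head cuts the cache to a quarter of the ungrouped size, which is the kind of saving that makes a long window plausible on a phone.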
Rather than training the 1B and 3B models from scratch, Meta derived them from Llama 3.1 8B using a two-stage process: structured pruning followed by knowledge distillation.[^1][^3]
In the pruning stage, structured portions of the 8B model — entire attention heads, feed-forward neurons, and transformer layers — were systematically removed to reach the target parameter counts. Structured pruning produces models with regular shapes that map efficiently onto hardware accelerators without requiring sparse computation support, in contrast to unstructured pruning, which zeros out individual weights.[^3]
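A minimal sketch of structured pruning on a single feed-forward block follows. Meta has not published its importance criterion in the text above, so ranking neurons by the L2 norm of their input weights is an assumption chosen for illustration, and the function name and toy matrices are hypothetical.

```python
import math


def prune_ffn_neurons(w_in, w_out, keep):
    """Drop whole hidden neurons from a two-layer feed-forward block.

    w_in  : list of rows, one row of input weights per hidden neuron
    w_out : list of rows, one row per output unit, one column per hidden neuron
    keep  : number of hidden neurons to retain
    """
    # Illustrative importance score (an assumption): L2 norm of input weights.
    scores = [math.sqrt(sum(v * v for v in row)) for row in w_in]
    ranked = sorted(range(len(w_in)), key=lambda i: scores[i], reverse=True)
    kept = sorted(ranked[:keep])  # preserve the original neuron order
    # Remove the matching rows of w_in and columns of w_out together,
    # so the result is a smaller but still dense pair of matrices.
    new_w_in = [w_in[i] for i in kept]
    new_w_out = [[row[i] for i in kept] for row in w_out]
    return new_w_in, new_w_out, kept


w_in = [[0.9, -0.8], [0.01, 0.02], [0.5, 0.4], [0.0, 0.03]]  # 4 hidden neurons
w_out = [[1.0, 2.0, 3.0, 4.0], [5.0, 6.0, 7.0, 8.0]]         # 2 output units
small_in, small_out, kept = prune_ffn_neurons(w_in, w_out, keep=2)
print(kept, small_in, small_out)
```

Because entire rows and columns disappear, the pruned block needs no sparse kernels: it is simply a narrower dense layer.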
The pruned models exhibited degraded performance because the removed components had been carrying meaningful learned representations. To recover capability, Meta applied knowledge distillation: the smaller "student" model was trained to match the output token distributions of larger "teacher" models rather than simply fitting hard ground-truth labels. For Llama 3.2 1B and 3B, the teachers were Llama 3.1 8B and Llama 3.1 70B, whose logits at each token position served as soft targets during pretraining.[^1][^3]
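The idea of training against soft targets can be sketched as a divergence between teacher and student token distributions at one position. The exact loss Meta used is not given here; the KL-divergence form and the temperature parameter below are common choices and should be read as assumptions.

```python
import math


def softmax(logits, temperature=1.0):
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, temperature=1.0):
    """KL(teacher || student) over the vocabulary at one token position."""
    p = softmax(teacher_logits, temperature)  # soft targets from the teacher
    q = softmax(student_logits, temperature)  # student's predicted distribution
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


teacher = [2.0, 1.0, 0.1]  # toy logits over a 3-token vocabulary
print(distillation_loss([2.0, 1.0, 0.1], teacher))  # student matches teacher
print(distillation_loss([0.1, 1.0, 2.0], teacher))  # student disagrees
```

Unlike a hard label, the teacher distribution also tells the student how plausible each non-top token is, which is the extra signal distillation relies on.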
Distillation was combined with standard large-scale pretraining on up to 9 trillion tokens from publicly available sources, with a December 2023 data cutoff.[^3] Post-training applied multiple rounds of supervised fine-tuning, rejection sampling, and direct preference optimization to produce instruction-following variants.[^3]
Total compute was 370,000 H100 GPU hours for the 1B model and 460,000 H100 GPU hours for the 3B model, producing 107 and 133 metric tons respectively of CO₂-equivalent on a location-based accounting basis, offset to zero on a market-based basis through Meta's renewable energy purchasing.[^3]
The 1B and 3B models were released with explicit hardware partner support for on-device deployment. Meta announced day-one optimization in collaboration with Arm, Qualcomm, and MediaTek, with Arm-based processors accounting for approximately 99% of mobile devices in active use.[^1] The companion PyTorch ExecuTorch runtime was positioned as the primary path for on-device inference, alongside Ollama for single-node deployment.[^1]
To support deployment across the broadest possible range of hardware, Meta released official quantized variants of both models at launch:
Linear layers in the quantized variants use 4-bit groupwise weights with group size 32 combined with 8-bit per-token dynamic activations. The classification layer uses 8-bit per-channel weights, and the embedding layer uses 8-bit per-channel quantization.[^3]
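The scheme described above can be approximated in a few lines. This sketch assumes symmetric quantization with one floating-point scale per group (for weights) or per token (for activations); whether Meta's released formats use symmetric scales or zero points is not stated here, and the function names are hypothetical.

```python
def quantize_groupwise_4bit(weights, group_size=32):
    """Symmetric 4-bit quantization with one scale per group of weights."""
    codes, scales = [], []
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        max_abs = max(abs(w) for w in group)
        scale = max_abs / 7 if max_abs > 0 else 1.0  # largest weight maps to +/-7
        scales.append(scale)
        # Signed 4-bit integers cover the range [-8, 7].
        codes.extend(max(-8, min(7, round(w / scale))) for w in group)
    return codes, scales


def dequantize_groupwise(codes, scales, group_size=32):
    return [c * scales[i // group_size] for i, c in enumerate(codes)]


def quantize_per_token_8bit(activations):
    """Dynamic 8-bit quantization: the scale is computed from this token's values."""
    max_abs = max(abs(a) for a in activations)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    return [max(-128, min(127, round(a / scale))) for a in activations], scale


weights = [((i * 37) % 64 - 32) / 100 for i in range(64)]  # toy data, two groups of 32
codes, scales = quantize_groupwise_4bit(weights)
restored = dequantize_groupwise(codes, scales)
worst = max(abs(a - b) for a, b in zip(weights, restored))
act_codes, act_scale = quantize_per_token_8bit([0.5, -1.0, 0.25])
print(len(scales), min(codes), max(codes), worst, act_codes)
```

Smaller groups mean more scales to store but a tighter fit to local weight ranges, which is the trade-off a group size of 32 settles.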
On a OnePlus 12 smartphone, Meta reported that the 1B model in SpinQuant form achieves approximately 50 decode tokens per second with a time-to-first-token of 0.3 seconds and a memory footprint of about 1,921 MB. The same model at BF16 precision runs at 19.2 tokens per second with a memory footprint over 3 GB, illustrating the practical importance of quantization for mobile deployment.[^1]
The 1B and 3B instruction-tuned models support multilingual text generation in English, German, French, Italian, Portuguese, Hindi, Spanish, and Thai.[^1] Both support zero-shot tool calling with user-defined tool specifications, enabling agentic applications without requiring tool-specific fine-tuning examples.[^1]
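As an illustration of zero-shot tool use, the sketch below shows a tool specification and a dispatcher for a model-emitted call. The JSON shape of both the specification and the call is assumed for illustration and is not Meta's documented prompt format; `get_weather` and the simulated model output are hypothetical stand-ins.

```python
import json


def get_weather(city: str) -> str:
    # Hypothetical tool; a real one would query a weather service.
    return f"Sunny in {city}"


TOOLS = {"get_weather": get_weather}

# Tool specification that would be placed in the prompt (assumed shape).
TOOL_SPEC = {
    "name": "get_weather",
    "description": "Return the current weather for a city.",
    "parameters": {"city": {"type": "string", "required": True}},
}


def dispatch(model_output: str) -> str:
    """Parse a JSON tool call emitted by the model and run the matching function."""
    call = json.loads(model_output)
    tool = TOOLS.get(call["name"])
    if tool is None:
        raise ValueError(f"unknown tool: {call['name']}")
    return tool(**call["parameters"])


# Stand-in for what a model might emit after seeing TOOL_SPEC and a user question.
simulated_output = '{"name": "get_weather", "parameters": {"city": "Lisbon"}}'
print(dispatch(simulated_output))
```

The point of the zero-shot setting is that only `TOOL_SPEC` changes between applications; the model itself is not fine-tuned on examples of each tool.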
The 1B model can also serve as a speculative-decoding draft model for the larger Llama 3.1 8B verifier, improving end-to-end generation throughput when both models are available, since the shared vocabulary and architectural lineage make the pairing particularly effective.[^1]
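The draft-and-verify loop can be sketched with toy models. This is the greedy variant of speculative decoding, simplified to omit the extra token a verifier can contribute when every drafted token is accepted; `draft_next` and `verify_next` are hypothetical deterministic rules standing in for the small and large models.

```python
def greedy_decode(next_token, prompt, n_new):
    """Plain greedy decoding with a single model."""
    seq = list(prompt)
    for _ in range(n_new):
        seq.append(next_token(seq))
    return seq


def speculative_decode(draft_next, verify_next, prompt, n_new, k=4):
    """Greedy speculative decoding: the draft proposes k tokens, the verifier checks them."""
    seq = list(prompt)
    target_len = len(prompt) + n_new
    while len(seq) < target_len:
        # 1. The cheap draft model proposes up to k tokens autoregressively.
        proposal = []
        for _ in range(min(k, target_len - len(seq))):
            proposal.append(draft_next(seq + proposal))
        # 2. The verifier checks each position (one batched pass in a real system)
        #    and keeps the longest prefix on which it agrees with the draft.
        for i, token in enumerate(proposal):
            expected = verify_next(seq + proposal[:i])
            if token != expected:
                seq.extend(proposal[:i])
                seq.append(expected)  # the verifier's token replaces the miss
                break
        else:
            seq.extend(proposal)  # every drafted token was accepted
    return seq


# Toy stand-ins for the two models (hypothetical rules over small integers).
def verify_next(seq):
    return (seq[-1] * 3 + 1) % 11


def draft_next(seq):
    token = (seq[-1] * 3 + 1) % 11
    # The draft is deliberately wrong whenever the true token is a multiple of 4.
    return token if token % 4 else (token + 1) % 11


prompt = [2]
print(speculative_decode(draft_next, verify_next, prompt, 12))
print(greedy_decode(verify_next, prompt, 12))
```

The output is identical to decoding with the verifier alone; the gain comes from the verifier checking several drafted tokens per pass instead of producing one token at a time. A shared vocabulary matters because the two models must propose and check the same token IDs.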
The 11B and 90B models are the first members of the Llama family to support visual input.[^1][^4] Prior Llama releases were text-only, and Meta's earlier multimodal research efforts — including ImageBind — had not been integrated into the publicly released Llama line. Llama 3.2 closed this gap and positioned the Llama ecosystem as a credible open-weight alternative to closed multimodal systems at the small-to-medium scale.[^9]
At launch, image-plus-text prompting is supported in English only, while text-only prompting on these models supports the full eight languages of the lightweight line.[^3]
The vision models are built by attaching a separately trained image adapter to a frozen Llama 3.1 language backbone, rather than retraining the language model on multimodal data:
The naming reflects total parameter count including the adapter, not just the language model backbone.
The image encoder is a Vision Transformer (ViT) based on the ViT-H/14 architecture with a 14-pixel patch size. Meta extended the standard ViT-H with 8 additional gated self-attention layers, producing a final encoder with two stages: a 32-layer primary encoder followed by an 8-layer global encoder, totaling 40 transformer blocks. The encoder operates at tile sizes of 448 pixels for the 11B base model and 560 pixels for the 11B instruct and all 90B variants.[^3]
The adapter connects the image encoder to the language model through cross-attention layers inserted at regular intervals into the language model's transformer stack, one after roughly every fourth self-attention layer. Counting positions in the combined stack, the cross-attention layers sit at indices 3, 8, 13, 18, 23, 28, 33, and 38; the spacing is five because each group of four self-attention layers is followed by one cross-attention layer. At these points, the language model's hidden states attend to the image encoder's output representations, allowing visual information to influence text generation without requiring every transformer layer to process image tokens.[^3]
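The listed indices can be reproduced with a short calculation. The sketch assumes the indices count positions in a combined stack of 40 layers, which would correspond to a 32-layer language backbone plus eight cross-attention layers; the total of 40 is an inference from the highest index, not a figure stated in the text.

```python
TOTAL_LAYERS = 40           # assumed depth of the combined stack
FIRST_CROSS, STRIDE = 3, 5  # read off the indices listed in the text

# Every fifth position, starting at index 3, holds a cross-attention layer.
cross_layers = list(range(FIRST_CROSS, TOTAL_LAYERS, STRIDE))
layout = ["cross" if i in cross_layers else "self" for i in range(TOTAL_LAYERS)]

print(cross_layers)
print(layout.count("self"), "self-attention layers,",
      layout.count("cross"), "cross-attention layers")
```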
A critical architectural decision was to freeze the language model parameters during vision adapter training. Only the image encoder and cross-attention layers were updated during the vision pre-training phase.[^1][^3] This design has two important practical consequences:
Inputs to the vision models consist of image plus text; outputs are text only. Multi-image conversations are supported, but the models are optimized to attend to the most recently provided image when several appear in a context window.[^3]
Vision pre-training used a dataset of approximately 6 billion image-text pairs, structured in multiple stages with the vision adapter components trained while language model weights remained frozen.[^3] Post-training applied multiple rounds of supervised fine-tuning, rejection sampling, and direct preference optimization to optimize instruction following, safety behavior, and multimodal dialogue quality. Instruction fine-tuning data included over 3 million synthetically generated image-text examples.[^3]
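Freezing the backbone amounts to excluding its parameters from the optimizer update. The toy sketch below uses scalar "parameters" and plain gradient descent to show the idea; the parameter names are illustrative and do not correspond to real checkpoint keys.

```python
def sgd_step(params, grads, trainable, lr=0.1):
    """Update only the parameters named in `trainable`; leave the rest untouched."""
    return {
        name: (value - lr * grads[name]) if name in trainable else value
        for name, value in params.items()
    }


# Toy scalar parameters (hypothetical names).
params = {
    "language_model.layer0": 1.0,
    "image_encoder.block0": 1.0,
    "cross_attention.layer3": 1.0,
}
grads = {name: 0.5 for name in params}

# Freeze the language backbone; the encoder and cross-attention layers stay trainable.
trainable = {n for n in params if not n.startswith("language_model.")}
updated = sgd_step(params, grads, trainable)
print(updated)
```

Because the language model's weights never change, its text-only behavior is untouched by vision training, whatever gradients the image-text data would otherwise have produced.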
Total compute for combined 11B and 90B vision training was approximately 2.02 million H100-80GB GPU hours, producing 584 metric tons of CO₂-equivalent on a location-based basis, again offset to zero on a market-based basis.[^3]
The 11B and 90B vision models support a broad range of image-understanding tasks:
Images are processed by dividing them into tiles at the configured pixel size, with each tile encoded independently and the resulting representations passed through the global encoder stage before cross-attention integration with the language model.[^3]
Across both families, Llama 3.2 follows the post-training methodology established by Llama 3.1, with refinements specific to the small-model and multimodal cases.[^3][^8]
Pretraining. The 1B and 3B models combine standard next-token prediction on up to 9 trillion publicly available tokens with knowledge-distillation losses against Llama 3.1 8B and 70B teachers. Vision models layer image-text pretraining on top of frozen Llama 3.1 backbones using 6 billion image-text pairs.[^3]
Supervised fine-tuning (SFT). Each model undergoes SFT on curated instruction-following data; vision models additionally see synthetic image-text instructions generated by larger Meta models.[^3]
Rejection sampling. Multiple completions are sampled per prompt; only those scoring above a quality threshold under reward and safety models are retained as additional SFT data.[^3]
Direct preference optimization (DPO). Pairwise preference data is used to align outputs with human preferences without requiring a separate reward model in the loss, a technique Meta adopted across the Llama 3 generation.[^3][^8]
Safety post-training. Each model receives an additional safety alignment pass to reduce harmful outputs, complemented at deployment time by the Llama Guard 3 classifier family running externally.[^1]
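Two of the stages above, rejection sampling and DPO, lend themselves to a compact sketch. The generator, reward function, and safety check are hypothetical stand-ins for real models, the threshold and `beta` values are arbitrary, and the DPO expression is the standard published form of the loss rather than anything specific to Meta's training code.

```python
import math


def rejection_sample(prompt, generate, reward, is_safe, n_samples=4, threshold=0.5):
    """Keep only sampled completions that clear the reward threshold and the safety check."""
    candidates = [generate(prompt, i) for i in range(n_samples)]
    return [c for c in candidates if reward(prompt, c) > threshold and is_safe(c)]


def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO loss for one preference pair, given sequence log-probabilities."""
    # Implicit rewards are log-ratios between the policy and a frozen reference model,
    # so no separately trained reward model appears in the loss.
    margin = (policy_chosen - ref_chosen) - (policy_rejected - ref_rejected)
    # -log(sigmoid(beta * margin)), written as log(1 + exp(-beta * margin)).
    return math.log1p(math.exp(-beta * margin))


# Hypothetical stand-ins for a sampler, a reward model, and a safety model.
CANNED = ["Paris.", "I don't know.", "The capital of France is Paris.", "lol"]


def generate(prompt, i):
    return CANNED[i % len(CANNED)]


def reward(prompt, completion):
    return 1.0 if "Paris" in completion else 0.0


def is_safe(completion):
    return True


kept = rejection_sample("What is the capital of France?", generate, reward, is_safe)
print(kept)
print(dpo_loss(-5.0, -7.0, -5.0, -7.0))  # policy equals reference
print(dpo_loss(-4.0, -8.0, -5.0, -7.0))  # policy favors the chosen answer more
```

The loss falls as the policy raises the chosen response's probability relative to the reference and lowers the rejected one's, which is how preference data shapes the model directly.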
Released alongside the core models on September 25, 2024, Llama Stack is Meta's official framework for building agentic and retrieval-augmented applications on Llama models.[^1][^12] First proposed as an RFC in July 2024, Llama Stack standardizes APIs for inference, safety, memory, tool use, agentic execution, telemetry, and evaluation.[^12]
Key elements of the launch release:
Llama Stack reached its first stable release in January 2025, by which time it had become Meta's recommended path for production agentic applications on Llama.[^13]
Meta released two updated members of its Llama Guard safety-classification family with Llama 3.2.[^1] Unlike the base Llama models, Llama Guard models are designed not to generate content but to classify whether a prompt or response violates defined safety categories, functioning as a separately deployed safety layer alongside the model being guarded.
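The deployment pattern of a separate safety layer can be sketched as a wrapper that classifies the prompt before generation and the response after it. The `classify` function here is a trivial keyword rule standing in for a real classifier, and `generate` is an echo stub; neither reflects Llama Guard's actual interface or categories.

```python
def guarded_chat(prompt, generate, classify):
    """Run a safety classifier on the prompt and on the response around a model call."""
    if classify(prompt) == "unsafe":
        return "[blocked: prompt flagged]"
    response = generate(prompt)
    if classify(response) == "unsafe":
        return "[blocked: response flagged]"
    return response


# Hypothetical stand-ins: a keyword rule instead of a classifier model.
def classify(text):
    return "unsafe" if "forbidden-topic" in text.lower() else "safe"


def generate(prompt):
    return f"Echo: {prompt}"


print(guarded_chat("Hello", generate, classify))
print(guarded_chat("Tell me about forbidden-topic", generate, classify))
```

Keeping the classifier outside the generating model means either side can be swapped or updated independently.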
Llama Guard 3 1B is a text-safety classifier derived from Llama 3.2 1B through additional pruning and quantization. In its quantized form, the model weighs approximately 438 MB, down from the 2,858 MB BF16 parent.[^3] This extreme compression makes it practical to run Llama Guard on the same device as the model being guarded.
Llama Guard 3 11B Vision is a multimodal safety classifier based on the Llama 3.2 11B Vision architecture. It accepts both image and text inputs and classifies content across 13 hazard categories drawn from the MLCommons safety taxonomy, including violent crimes, child sexual exploitation, defamation, hate speech, suicide and self-harm, and elections.[^14] Images are rescaled into four 560-by-560 pixel chunks before encoding. Meta reported Llama Guard 3 11B Vision achieving an F1 score of 0.938 on response classification, substantially outperforming GPT-4o at 0.667 F1 on the same task.[^14]
Llama 3.2 1B and 3B were benchmarked against contemporary small open models including Gemma 2 2B and Phi-3 mini variants. Selected results for instruction-tuned models:[^1][^3]
| Benchmark | Llama 3.2 1B | Llama 3.2 3B | Llama 3.1 8B |
|---|---|---|---|
| MMLU (5-shot) | 49.3 | 63.4 | 69.4 |
| GSM8K (CoT) | 44.4 | 77.7 | 84.5 |
| MATH (CoT) | — | 48.0 | 51.9 |
| ARC-C | 59.4 | 78.6 | 83.4 |
| IFEval | 59.5 | 77.4 | 76.5 |
| Hellaswag | 41.2 | 69.8 | 82.0 |
The 3B model's IFEval score of 77.4 essentially matches the 8B model's 76.5, a result Meta highlighted as evidence that pruning plus distillation transfers instruction-following capability beyond what raw parameter count would predict.[^1] Public comparisons against Gemma 2 2B IT (IFEval 61.9) and Phi-3.5-mini IT (IFEval 59.2) showed the Llama 3.2 3B leading on instruction following, while Phi-3.5-mini retained an edge on math benchmarks such as GSM8K.[^15]
Selected results for the instruction-tuned vision models:[^3]
| Benchmark | Metric | Llama 3.2 11B | Llama 3.2 90B |
|---|---|---|---|
| VQAv2 (test) | Accuracy | 75.2% | 78.1% |
| DocVQA (test) | ANLS | 88.4 | 90.1 |
| ChartQA (test, CoT) | Relaxed accuracy | 83.4% | 85.5% |
| AI2 Diagram (test) | Accuracy | 91.1% | 92.3% |
| MMMU (val, CoT) | Micro avg | 50.7% | 60.3% |
| MathVista | Accuracy | — | 57.3% |
| MMLU (CoT) | Macro avg | 73.0% | 86.0% |
| MATH (CoT) | Final EM | 51.9% | 68.0% |
Meta positioned the 11B and 90B models as competitive with Claude 3 Haiku and GPT-4o-mini on image recognition and visual understanding tasks.[^1] On visual QA and document understanding tasks, the 90B model performed comparably to or slightly above Claude 3 Haiku, while the 11B model was roughly on par with Haiku depending on the benchmark. GPT-4o-mini retained a lead on broad multidisciplinary reasoning benchmarks such as MMMU.[^1][^9]
Llama 3.2 is released under the Llama 3.2 Community License, a custom commercial license distinct from standard open-source licenses.[^16] Its core terms continue the Llama 3.1 framework: the license permits commercial use, fine-tuning, and redistribution, but requires a "Built with Llama" attribution on derived products and a "Llama" prefix on the names of derivative models. Distributions must include the full license text, and organizations whose Llama-powered products exceed 700 million monthly active users must seek an explicit commercial license from Meta.[^16]
The Llama 3.2 Community License contains a restriction that applies exclusively to the 11B and 90B multimodal models and not to the 1B and 3B text-only models. The acceptable-use policy states that the rights granted under the agreement "are not being granted to you if you are an individual domiciled in, or a company with a principal place of business in, the European Union."[^6][^17]
An exception preserves access for end users: European individuals can still use Llama 3.2 Vision through third-party products and services that incorporate the models, even when those services are built by non-EU entities.[^17] The restriction applies to developers and companies seeking to build with the vision models directly, not to consumers using a product that embeds them.
Meta has not published a fully detailed legal justification, but the company has publicly cited the "unpredictable" nature of European AI regulation and the EU AI Act in particular.[^6][^7] Independent reporting has identified two intertwined causes:
Critics have questioned whether the regulatory justification is sufficient. Pixtral 12B from Mistral AI (an EU company) and Qwen vision models from Alibaba were released globally without equivalent EU restrictions in roughly the same window, which has led some observers to suggest that the restriction reflects litigation risk concerns specific to Meta's data practices rather than a general requirement of EU AI regulation.[^6][^7]
The EU vision restriction would later set the template for Meta's release pattern: the company's Llama 4 multimodal models, released in April 2025, were similarly excluded from EU developer access, while the text-only Llama 3.3 (December 2024) was released globally without restriction.[^7][^19]
The Llama 3.2 release received broadly positive coverage from the AI developer community. The introduction of vision capabilities into the Llama family was described by multiple commentators as a significant milestone for open-weight AI, bringing Meta's public releases into approximate parity with closed multimodal systems at the small-to-medium scale for the first time.[^9][^20]
The lightweight line attracted particular enthusiasm. The combination of a 128K-token context window, competitive instruction-following performance, and broad hardware support made the 1B and 3B models highly practical for developers who had previously needed to use much larger or cloud-dependent models for tasks like summarization and tool use. InfoQ and other outlets singled out the 3B model's IFEval performance matching the 8B model as evidence that distillation had closed the expected capability gap between size tiers.[^20]
The 3B model accumulated approximately 2.27 million monthly downloads on Hugging Face within months of release, reflecting strong adoption by developers building mobile and edge applications.[^21]
Partner companies including AMD, AWS, Databricks, Dell, Google Cloud, Groq, IBM, Intel, Microsoft Azure, NVIDIA, Oracle Cloud, and Snowflake launched same-day support for the models, indicating strong advance coordination and ecosystem readiness.[^1]
The EU vision restriction generated criticism from the European AI developer community and was widely reported in AI media outlets including Slator, Silicon Republic, and DeepLearning.AI's The Batch.[^6][^7][^18] Some developers also noted that despite strong benchmark performance relative to Claude 3 Haiku, the 90B vision model still lagged behind GPT-4o (not GPT-4o-mini) and Claude 3.5 Sonnet on more complex multimodal reasoning tasks, limiting its applicability as a drop-in replacement for the most capable proprietary systems.[^9]
Llama 3.3 was released on December 6, 2024, as a single 70B instruction-tuned text-only refresh.[^22] Meta positioned it as delivering 405B-class instruction-following and multilingual quality at the inference cost of a 70B model, reporting IFEval 92.1% and MATH 77.0%, comparable to or exceeding Llama 3.1 405B on those benchmarks. Llama 3.3 retained the 128K context window, the eight-language coverage, and the December 2023 data cutoff. Because Llama 3.3 was text-only, it was released globally without an EU restriction.[^22]
Llama 4 was released on April 5, 2025, marking Meta's shift to a natively multimodal, mixture-of-experts architecture rather than the cross-attention adapter approach pioneered in Llama 3.2.[^19] The initial release included Llama 4 Scout (17B active parameters, 16 experts) and Llama 4 Maverick (17B active parameters, 128 experts), with a much larger Llama 4 Behemoth model announced as a "teacher" model still in training. Like the Llama 3.2 vision models, Llama 4 was excluded from EU developer access at launch on similar regulatory grounds.[^7][^19]
The progression from Llama 3.2 to Llama 4 represents a substantive architectural turn: Llama 3.2 used a cross-attention bolt-on to add vision to text-trained backbones, while Llama 4 was trained from the start to be multimodal, using MoE rather than dense architectures and supporting text, image, and video input in a unified model.[^19]
Llama 3.2 occupies a distinctive position in the Llama lineage as the release that simultaneously opened two major new fronts for Meta's open-weight strategy: lightweight on-device deployment and multimodal vision. Although it has been superseded for top-end quality by Llama 3.3 (text-only) and Llama 4 (natively multimodal MoE), Llama 3.2 remains heavily used in three contexts:[^22][^19]
The release also established several patterns that have persisted in Meta's later Llama releases: a strict separation between text-only and multimodal license terms, a coordinated cloud and hardware partner ecosystem on launch day, a dedicated agent and RAG framework (Llama Stack) shipping alongside the core models, and the routine release of matching safety classifiers in the Llama Guard line.[^1][^12]