LLaMA 3
Last reviewed
May 9, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v9 ยท 8,840 words
Improve this article
Add missing citations, update stale details, or suggest a clearer explanation.
Last reviewed
May 9, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v9 ยท 8,840 words
Add missing citations, update stale details, or suggest a clearer explanation.
LLaMA 3 (Large Language Model Meta AI 3) is a family of open-weight large language models developed and released by Meta beginning in April 2024. The LLaMA 3 series represents a significant leap over its predecessor, LLaMA 2, in both scale and capability, expanding training data from roughly 2 trillion tokens to more than 15 trillion, broadening multilingual coverage, lengthening context windows from 4K up to 128K tokens, and adding multimodal vision-language models alongside lightweight models designed for on-device deployment. With the release of LLaMA 3.1 405B in July 2024, Meta introduced the largest openly available language model at the time, directly challenging proprietary models from OpenAI, Google, and Anthropic. The LLaMA 3 family has become one of the most widely adopted open-weight model families in the AI ecosystem, accumulating over 1.2 million downloads in its first week alone and spawning a broad ecosystem of fine-tuned variants, developer tools, and enterprise integrations [1]. By December 2024, Meta reported that Llama models had been downloaded over 650 million times across all platforms, representing a tenfold increase over the prior year [2].
The family was developed by Meta's GenAI organization, with research and engineering effort spanning more than 500 contributors listed as authors on the accompanying technical paper, "The Llama 3 Herd of Models" (arXiv:2407.21783) [3]. Each release iteration introduced new capabilities: LLaMA 3 (April 2024) launched the redesigned tokenizer and architecture; LLaMA 3.1 (July 2024) added the 405B frontier-scale variant and 128K context windows; LLaMA 3.2 (September 2024) introduced vision-language models and edge-optimized 1B/3B variants; and LLaMA 3.3 (December 2024) closed the family with a single 70B model that approached 405B-class quality at a fraction of the inference cost. The LLaMA 3 generation was succeeded in April 2025 by Llama 4, which moved to a mixture-of-experts architecture and native multimodal pretraining.
The LLaMA 3 family follows two prior generations of Meta's open language models. The original LLaMA (February 2023) introduced models from 7B to 65B parameters under a research-only license, with weights leaked publicly within days and quickly catalyzing a wave of community fine-tunes such as Alpaca, Vicuna, and Guanaco. LLaMA 2 (July 2023) was Meta's first openly licensed flagship language model series, available for both research and most commercial use under the Llama 2 Community License. LLaMA 2 introduced grouped-query attention (only on the 70B variant), a 4,096-token context window, and 2 trillion training tokens [3].
By late 2023, the open-weight landscape had become competitive. Mistral AI released Mistral 7B and the Mixtral 8x7B mixture-of-experts model, 01.AI released the Yi series, and Alibaba released the Qwen family. At the same time, proprietary frontier models including GPT-4, Claude 2, and Gemini 1.0 demonstrated capabilities that no open-weight model could match. Meta's stated goal with LLaMA 3 was to close that gap, and the LLaMA 3.1 405B release in mid-2024 marked the first time an open-weight model met or exceeded proprietary frontier performance on most public benchmarks [3][8].
Meta released the LLaMA 3 family in four major waves over an eight month period in 2024, each introducing new model sizes, capabilities, or architectural improvements. The following table summarizes all major releases in the LLaMA 3 generation.
| Model | Release Date | Parameters | Context Length | Key Features |
|---|---|---|---|---|
| LLaMA 3 8B | April 18, 2024 | 8B | 8,192 | Dense transformer, GQA, 128K vocabulary |
| LLaMA 3 70B | April 18, 2024 | 70B | 8,192 | Dense transformer, GQA, 128K vocabulary |
| LLaMA 3.1 8B | July 23, 2024 | 8B | 128,000 | Extended context, multilingual (8 languages) |
| LLaMA 3.1 70B | July 23, 2024 | 70B | 128,000 | Extended context, multilingual (8 languages) |
| LLaMA 3.1 405B | July 23, 2024 | 405B | 128,000 | Largest open-weight model, trained on 16K H100 GPUs |
| LLaMA 3.2 1B | September 25, 2024 | 1B | 128,000 | Lightweight text model for edge/mobile |
| LLaMA 3.2 3B | September 25, 2024 | 3B | 128,000 | Lightweight text model for edge/mobile |
| LLaMA 3.2 11B-Vision | September 25, 2024 | 11B | 128,000 | Multimodal (text + image input), vision encoder |
| LLaMA 3.2 90B-Vision | September 25, 2024 | 90B | 128,000 | Multimodal (text + image input), vision encoder |
| LLaMA 3.3 70B | December 6, 2024 | 70B | 128,000 | 405B-class quality at a fraction of inference cost |
In parallel with the main model releases, Meta shipped a growing safety stack (Llama Guard 2 in April 2024, Llama Guard 3 in July 2024, Llama Guard 3 Vision in September 2024, plus Code Shield, CyberSec Eval 2 and 3, and Prompt Guard) and a developer framework called Llama Stack. The first iteration of Meta AI, Meta's consumer-facing assistant, was rebuilt on top of LLaMA 3 at the April 2024 launch and rolled out across Facebook, Instagram, WhatsApp, and Messenger [1][17].
LLaMA 3 uses a relatively standard decoder-only transformer architecture, but with several important design choices that improve efficiency and performance at scale. The core design follows the post-LayerNorm pattern used in LLaMA 2 but with several scaled-up choices: a much larger vocabulary, grouped-query attention at every model size, and a higher RoPE base frequency to enable long-context extension.
The three primary backbone sizes in the LLaMA 3 family differ in layer depth, model width, feed-forward network dimension, and attention head count. The following table details the architectural specifications for each size [3].
| Parameter | 8B | 70B | 405B |
|---|---|---|---|
| Layers | 32 | 80 | 126 |
| Model Dimension | 4,096 | 8,192 | 16,384 |
| FFN Dimension | 14,336 | 28,672 | 53,248 |
| Attention Heads | 32 | 64 | 128 |
| Key-Value Heads | 8 | 8 | 8 |
| Attention Head Dimension | 128 | 128 | 128 |
| Vocabulary Size | 128,000 | 128,000 | 128,000 |
| Activation Function | SwiGLU | SwiGLU | SwiGLU |
| Normalization | RMSNorm | RMSNorm | RMSNorm |
| Positional Encoding | RoPE (theta=500K) | RoPE (theta=500K) | RoPE (theta=500K) |
All three model sizes share the same fundamental building blocks: SwiGLU activation functions in the feed-forward layers, Root Mean Square Normalization (RMSNorm) for internal state normalization, and Rotary Positional Embeddings (RoPE) for encoding positional information. The consistent use of 8 key-value heads across all sizes, regardless of the number of query heads, is a defining characteristic of the LLaMA 3 architecture [3]. Compared to LLaMA 2, the only true architectural changes are the larger vocabulary, the universal use of GQA, the higher RoPE base frequency, and the deeper, wider 405B variant. Meta deliberately resisted introducing more exotic mechanisms (such as state-space layers, retrieval modules, or sparse experts), citing training stability and deployability across heterogeneous hardware as priorities for an open-weight release [3].
The LLaMA 3 tokenizer uses a vocabulary of 128,000 tokens based on byte pair encoding (BPE) via the tiktoken library, a fourfold increase over the 32,000-token vocabulary in LLaMA 2. This larger vocabulary encodes language more efficiently, reducing the number of tokens required to represent a given text passage and thereby improving both throughput and effective context utilization. In practice, the 128K vocabulary compresses English text by roughly 15% more tokens per passage compared to the LLaMA 2 tokenizer, which means more text fits within the same context window [1]. Meta retrained the tokenizer with explicit weighting toward non-English text, code, and mathematical symbols, which improved compression for non-Latin scripts and reduced the tokenization disparity between English and other languages that had been a noted weakness of LLaMA 2 [3]. The chat template uses special header tokens (such as <|begin_of_text|>, <|start_header_id|>, and <|eot_id|>) to mark turn boundaries, replacing the inline [INST] markers used in LLaMA 2.
Grouped-query attention (GQA) is used across all LLaMA 3 model sizes, including the 8B variant. In GQA, multiple query heads share a smaller number of key-value heads (8 key-value heads in the case of LLaMA 3), which reduces the memory footprint of the key-value cache during inference and improves decoding speed without meaningful degradation in output quality. For the 8B model with 32 query heads and 8 key-value heads, each key-value head is shared across 4 query heads. For the 70B model with 64 query heads, each key-value head is shared across 8 query heads. For the 405B model with 128 query heads, each key-value head serves 16 query heads. LLaMA 2 had used GQA only in its 70B variant, so extending it to all sizes was a notable architectural decision [4].
GQA's main practical benefit is on inference: KV cache memory scales with the number of key-value heads rather than query heads, so a 70B model with 64 query heads but 8 KV heads consumes about 1/8 the cache of a same-shape model that used full multi-head attention. This enables longer context generation on commodity hardware and reduces the cost of multi-tenant serving in cloud deployments [4].
LLaMA 3 employs Rotary Positional Embeddings (RoPE) to encode positional information. RoPE applies a rotation matrix to encode absolute position while simultaneously incorporating relative position information directly into the self-attention computation. For the LLaMA 3 series, Meta increased the RoPE base frequency hyperparameter to 500,000 (up from 10,000 in LLaMA 2), which enables better support for longer context lengths of up to 8,192 tokens in the initial release and 128,000 tokens in LLaMA 3.1 [3]. The higher base frequency stretches the rotation period, allowing the model to distinguish positions over longer ranges without running into the periodic aliasing that would occur at lower frequencies.
For the 128K context extension in LLaMA 3.1, Meta further scaled RoPE using a custom interpolation scheme combined with continued pretraining on long-context documents. The team gradually expanded context length over several training stages: 8K, then 16K, then 32K, 64K, and finally 128K, with each stage using a curated mixture of long documents (books, code repositories, research papers) to teach the model to attend over the new range. The 128K extension preserved short-context quality while delivering near-perfect retrieval scores on the needle-in-a-haystack benchmark across the full window [3].
Unlike some competing models that use mixture-of-experts (MoE) architectures, all LLaMA 3 models employ a dense transformer design in which every parameter is active during inference. Meta chose this approach for its simplicity, training stability, and ease of deployment, though it means that inference cost scales linearly with parameter count [3]. The team explicitly evaluated MoE alternatives during the design phase but concluded that the dense architecture provided better training stability at the scales they targeted and was simpler to optimize for deployment across diverse hardware platforms. The decision was reversed in the Llama 4 generation, which moved to MoE in 2025.
LLaMA 3 models were pretrained on over 15 trillion tokens of text collected from publicly available sources. This represents a roughly sevenfold increase over the 2 trillion tokens used for LLaMA 2. Over 5% of the training data (approximately 800 million tokens) consisted of text in more than 30 non-English languages, improving multilingual performance [1].
Meta developed custom data filtering pipelines using multiple stages of quality control. The pipeline included heuristic filters for removing low-quality content, NSFW classifiers, and text quality classifiers trained specifically for this purpose. For quality scoring, Meta used DistilRoBERTa models trained on web data that had been annotated by LLaMA 2 itself, creating a bootstrapping approach where the previous generation model helped curate data for the next [3]. Specialized classifiers were also trained for code and reasoning content, using prompt-tuned models to identify web pages containing mathematical deductions, STEM reasoning, and code interleaved with natural language.
The deduplication process operated at both the document and line levels. Document-level deduplication used MinHash-based near-duplicate detection to remove redundant content across the corpus. Line-level deduplication employed heuristics such as duplicated n-gram coverage ratios to strip out repetitive content like logging messages or error traces. Additionally, token-distribution Kullback-Leibler divergence was used to filter out documents containing abnormal distributions of tokens compared to the overall training corpus [3].
The approximate data mix composition for pretraining was as follows [3]:
| Data Category | Share of Training Mix |
|---|---|
| General knowledge (web text) | ~50% |
| Mathematical and reasoning data | ~25% |
| Code | ~17% |
| Multilingual data | ~8% |
Meta noted that even the multilingual budget understates the actual coverage, because much of the "general knowledge" web text contains code-switched or non-English passages that are not classified into the multilingual bucket. Internal annealing experiments, which trained smaller proxy models on candidate data mixes for short runs, helped Meta tune category weightings before committing to the full 15 trillion token run [3]. Knowledge cutoff dates differed slightly across releases: December 2023 for LLaMA 3.1 8B, 70B, and 405B, with later versions inheriting the same cutoff [16].
The LLaMA 3.1 405B model was trained on a cluster of 16,384 NVIDIA H100 80GB GPUs, making it one of the largest single training runs conducted on publicly disclosed infrastructure at the time [5]. The total training compute for the 405B model was approximately 3.8 x 10^25 FLOPs, and the run consumed a cumulative 39.3 million GPU hours on H100-80GB hardware (rated at a thermal design power of 700W per GPU) [3]. The 8B model used roughly 1.46 million GPU hours, the 70B about 7.0 million GPU hours, and the 405B about 30.84 million GPU hours, for a combined 39.3 million GPU hours across the three Llama 3.1 variants [16].
Meta drew on two purpose-built 24,576-GPU H100 clusters described in March 2024 by its data center engineering team. One cluster used a RoCE (RDMA over Converged Ethernet) fabric assembled from Arista 7800 switches with Wedge400 and Minipack2 OCP rack switches, while the second used NVIDIA Quantum-2 400 Gbps InfiniBand. Both clusters housed GPUs in Meta's Grand Teton OCP chassis and used Meta's Tectonic distributed flash storage system, accessed through a Filesystem in Userspace (FUSE) layer, for training data and checkpoints [11]. Meta's stated 2024 buildout target was to operate the equivalent of nearly 600,000 H100s by the end of the year, including roughly 350,000 H100s.
Training a model of this scale required advanced distributed training techniques. Meta employed a 4D parallelism strategy that combined four forms of parallelism simultaneously [3]:
| Parallelism Type | Description |
|---|---|
| Tensor Parallelism (TP) | Splits individual weight matrices across multiple GPUs within a node |
| Pipeline Parallelism (PP) | Distributes different layers of the model across groups of GPUs |
| Context Parallelism (CP) | Splits long input sequences across GPUs for memory efficiency |
| Data Parallelism (DP) | Replicates the model and distributes training batches across groups |
For the 405B model the dominant configuration was 8-way tensor parallelism within nodes, 16-way pipeline parallelism across racks, context parallelism for sequences longer than 8K tokens, and fully sharded data parallelism (FSDP) at the outermost level. The achieved effective utilization, expressed as model FLOPs utilization (MFU) on H100 hardware, was reported in the technical paper at over 400 TFLOPs per GPU during the 16K-GPU training run [1].
The pretraining run for the 405B model took approximately 54 days. During that period, the cluster experienced 466 job interruptions, of which 47 were planned (for automated maintenance) and 419 were unexpected. The unexpected interruptions broke down as follows: 148 were caused by faulty GPUs (30.1%), 72 by GPU HBM3 memory errors (17.2%), 35 by network switch and cable problems (8.4%), 19 by GPU SRAM memory issues (4.5%), and 17 by GPU system processor failures (4.1%). Only two CPU failures were recorded during the entire period. Despite these challenges, the team achieved over 90% effective training time through automated checkpointing and rapid recovery procedures [6]. Independent reporting interpreted these statistics as a roughly one-failure-every-three-hours cadence on the 16K-GPU cluster, underscoring how reliability engineering becomes a first-class research concern at frontier scale [6].
The models were trained on sequences of 8,192 tokens using document-level masking to prevent self-attention from crossing document boundaries. An important finding from the training process was that model performance continued to improve log-linearly well beyond the Chinchilla-optimal compute allocation. While the Chinchilla-optimal amount of training data for an 8B parameter model is roughly 200 billion tokens, Meta observed continued gains when training on two orders of magnitude more data (15 trillion tokens), which informed their decision to overtrain the smaller models significantly [1]. The same logic applied at the largest scale: compute-optimal scaling laws would have suggested a smaller dataset for a 405B parameter model, but Meta deliberately overtrained to improve inference-time efficiency, since serving costs scale with parameter count rather than with the original training token budget.
Meta reported location-based CO2 emissions of approximately 11,390 metric tons for the LLaMA 3.1 family and 12.9 metric tons for the LLaMA 3.3 70B fine-tuning run, with a market-based net of zero achieved by matching electricity consumption with renewable energy purchases [16][7].
Instruct-tuned variants of all LLaMA 3 models underwent supervised fine-tuning on publicly available instruction datasets along with over 10 million human-annotated examples. Post-training also included reinforcement learning from human feedback (RLHF) using rejection sampling and direct preference optimization (DPO) to align model outputs with human preferences.
The post-training pipeline followed multiple iterative rounds. In each round, the model was first fine-tuned on curated instruction data (SFT), then improved through rejection sampling (RS) where the model generated multiple candidate responses and a reward model selected the best ones, and finally refined through DPO where the model learned to prefer higher-quality outputs over lower-quality alternatives [3]. This iterative approach allowed each round to build on the improvements of the previous round, progressively improving both helpfulness and safety. Several rounds also incorporated synthetic data generation, using earlier checkpoints to bootstrap higher quality instruction data for later rounds.
Meta diverged from the prevailing PPO-based RLHF approach used by OpenAI and others, opting for a simpler combination of rejection sampling plus DPO. Internal experiments showed comparable or better quality with substantially less training compute and better stability, which the team credited as one reason the post-training stack scaled cleanly from 8B to 405B [3]. A reward model trained on a mix of public preference datasets and Meta's own human comparison annotations served as the scorer for rejection sampling. The team explicitly avoided training a separate reward model for safety, instead relying on the same general reward model with safety-specific preference data injected into the mix.
For LLaMA 3.3, the fine-tuning data was further expanded to over 25 million synthetically generated examples, reflecting Meta's growing investment in synthetic data generation as a post-training technique [7]. The synthetic data generation pipeline used the larger 405B teacher to create candidate prompts and responses across math, code, instruction following, and tool use, with LLM-based classifiers filtering out low-quality completions before adding them to the SFT mix.
The initial LLaMA 3 release on April 18, 2024 included two dense text models: 8B and 70B parameters. Both models were released in base (pretrained) and instruct-tuned variants. Despite the relatively short 8,192-token context window, these models set new standards for open-weight performance at their respective sizes [1].
The initial LLaMA 3 models demonstrated strong results across standard benchmarks for their parameter counts [3].
| Benchmark | LLaMA 3 8B | LLaMA 3 70B |
|---|---|---|
| MMLU (5-shot) | 69.4 | 83.6 |
| HumanEval (0-shot) | 72.6 | 80.5 |
| GSM8K (8-shot, CoT) | 84.5 | 95.1 |
| MATH (0-shot, CoT) | 51.9 | 68.0 |
| ARC-Challenge (25-shot) | 83.4 | 94.8 |
The LLaMA 3 8B outperformed the previous LLaMA 2 70B on several benchmarks despite being nearly nine times smaller, illustrating the compounding benefits of more training data, a larger vocabulary, and grouped-query attention at all scales [1]. The 70B model was competitive with the best proprietary models available in early 2024, including GPT-3.5 Turbo and early versions of Claude 3 Sonnet. Meta's own benchmark sweep at launch indicated that LLaMA 3 70B Instruct beat Gemini Pro 1.5 and Claude 3 Sonnet on most evaluations, though Anthropic and Google would soon ship Claude 3.5 Sonnet (June 2024) and updated Gemini variants [17].
The April 2024 release drew immediate attention. Within the first week, the models had been downloaded over 1.2 million times across Hugging Face, Meta's own llama.com portal, and partner platforms [1]. Coverage in IEEE Spectrum, the New York Times, the Financial Times, and Reuters framed the release as a strategic move that further legitimized open-weight frontier development. The launch was paired with the relaunch of Meta AI as a Llama-3-powered consumer assistant, integrated into Facebook, Instagram, WhatsApp, and Messenger search bars and accessible via a standalone meta.ai web interface, with real-time information drawn from Bing and Google Search integrations [17].
The LLaMA 3.1 release on July 23, 2024 updated all three model sizes (8B, 70B, and 405B) with extended context lengths of 128,000 tokens and multilingual support for eight languages: English, German, French, Italian, Portuguese, Hindi, Spanish, and Thai. This was a meaningful expansion from the English-centric focus of earlier LLaMA releases [5][16]. The release was timed deliberately to coincide with the publication of "The Llama 3 Herd of Models" preprint and a Mark Zuckerberg open letter titled "Open Source AI Is the Path Forward" [9].
The LLaMA 3.1 405B model was the largest openly available language model at the time of its release. With 405 billion parameters and a 128K token context window, it was positioned as a direct competitor to proprietary frontier models like GPT-4 and Claude 3 Opus [5]. Press estimates valued the GPU hardware required for the training run at roughly $400 million in retail H100 prices, though Meta's effective costs were lower because the GPUs were owned and amortized across multiple workloads [16].
The 405B variant was the first frontier-scale Meta model to natively support function calling and multi-step tool use, enabled by extensive tool-use data injected during post-training. It accepts JSON tool definitions and emits structured tool calls in a Python-like syntax that downstream agents can parse. Tool definitions are presented to the model through the chat template's system prompt, and the model can invoke tools either inline (mid-response) or as standalone responses, depending on the task [3].
The following table compares the LLaMA 3.1 models against leading proprietary models on standard benchmarks. All scores are for the instruct-tuned variants [3][5][16].
| Benchmark | LLaMA 3.1 8B | LLaMA 3.1 70B | LLaMA 3.1 405B | GPT-4o | Claude 3.5 Sonnet |
|---|---|---|---|---|---|
| MMLU (5-shot) | 73.0 | 86.0 | 87.3 | 88.7 | 88.3 |
| MMLU-Pro (5-shot, CoT) | -- | -- | 73.3 | -- | -- |
| HumanEval (0-shot) | 72.6 | 80.5 | 89.0 | 90.2 | 92.0 |
| GSM8K (8-shot, CoT) | 84.5 | 95.1 | 96.8 | 96.1 | 96.4 |
| MATH (0-shot, CoT) | 51.9 | 68.0 | 73.8 | 76.6 | 71.1 |
| ARC-Challenge (25-shot) | 83.4 | 94.8 | 96.9 | -- | -- |
| GPQA (0-shot) | -- | 48.0 | 50.7 | -- | -- |
| Tool Use (BFCL) | -- | 77.5 | 88.5 | 83.6 | 90.2 |
| Multilingual (MGSM) | -- | 86.9 | 91.6 | 90.5 | 91.6 |
On the Scale AI SEAL leaderboard, LLaMA 3.1 405B ranked second in math and reasoning, fourth in coding, and first in instruction following [5]. Experimental evaluations suggested the 405B model performed on par with GPT-4-0125-Preview and Claude 3.5 Sonnet, winning and losing roughly the same percentage of head-to-head comparisons, though it fell slightly behind GPT-4o in direct matchups, winning only 19.1% of comparisons [8]. Independent evaluations from Vellum and Promptfoo confirmed that the 405B model traded wins with proprietary frontier models across coding, reasoning, and math benchmarks, with no single model dominating [16].
Meta CEO Mark Zuckerberg described the 405B release as a pivotal moment for open-source AI, comparing it to the role Linux played in democratizing server operating systems. The release was accompanied by a blog post titled "Open Source AI Is the Path Forward," in which Zuckerberg argued that openly available models would ultimately outcompete closed alternatives through community-driven innovation [9]. Zuckerberg also framed open weights as strategically aligned with Meta's business: unlike OpenAI or Anthropic, Meta does not monetize model access directly, so open distribution does not cannibalize a core product line and instead lowers Meta's cost of innovation through ecosystem feedback [9].
The release marked the first time Meta had ever released model weights for a frontier-scale system, and several commentators including Andrej Karpathy and Andrew Ng described the 405B launch as a turning point that would force closed-model providers to compete more aggressively on price and quality. By the end of 2024, hosted Llama usage by token volume across major cloud partners had more than doubled in the May-July window, and monthly Llama usage at the largest cloud providers had grown roughly tenfold from January to July 2024 [10].
Released on September 25, 2024 at Meta Connect, LLaMA 3.2 expanded the family in two significant directions: multimodal vision-language models and lightweight models for edge devices [10].
The LLaMA 3.2 11B-Vision and 90B-Vision models accept both text and image inputs and produce text outputs. They were designed as drop-in replacements for their text-only counterparts while adding image reasoning capabilities. These models can perform tasks such as document understanding (including charts and graphs), image captioning, and visual grounding (identifying objects in images based on natural language descriptions) [10].
The vision capability was integrated through a separately trained vision adapter, an approach Meta describes as "late fusion." The adapter consists of a dedicated image encoder paired with a series of cross-attention layers that feed image encoder representations into the core language model. The adapter weights were trained on text-image pairs to align the visual representations with the language model's internal representations, while the pretrained language model weights remained largely frozen during this alignment stage. This design allowed the vision models to retain the full text performance of their LLaMA 3.1 counterparts while gaining image understanding capabilities, and supports tasks such as visual question answering, chart interpretation, and document analysis without disturbing the text-only behavior of the original models [10].
The image encoder was based on a vision transformer pretrained on a large corpus of image-text pairs filtered for relevance and quality. Cross-attention layers were inserted at regular intervals throughout the language model so that text tokens could attend to image patch features when relevant, but text-only inference paths remain unaffected. The total parameter count of the 11B-Vision model includes about 8B inherited from LLaMA 3.1 8B, plus the encoder and adapter weights; the 90B-Vision builds analogously on top of LLaMA 3.1 70B [10].
The following table shows benchmark scores for the instruct-tuned vision models on image understanding tasks [11].
| Benchmark | LLaMA 3.2 11B-Vision | LLaMA 3.2 90B-Vision |
|---|---|---|
| MMMU (val, CoT) | 50.7 | 60.3 |
| MMMU-Pro Standard | 33.0 | 45.2 |
| MathVista | 51.5 | 57.3 |
| ChartQA (CoT) | 83.4 | 85.5 |
| AI2 Diagram (test) | 91.1 | 92.3 |
| DocVQA (test) | 88.4 | 90.1 |
| VQAv2 (test) | 75.2 | 78.1 |
Meta reported that the 11B-Vision model outperformed Claude 3 Haiku and was competitive with GPT-4o mini on image recognition and visual understanding tasks [10].
The LLaMA 3.2 1B and 3B models were designed for on-device deployment on mobile phones and edge hardware. Despite their small size, they support the full 128K token context window and achieve competitive performance on summarization, instruction following, and text rewriting tasks [10]. Both models were created through a combination of structured pruning of LLaMA 3.1 8B and knowledge distillation, in which the smaller models learn from the logits of larger 8B and 70B teachers. This pipeline allowed Meta to compress capable behavior into much smaller parameter budgets while keeping the same tokenizer and chat template [11].
The following table compares the LLaMA 3.2 lightweight models against similarly sized competitors [10].
| Benchmark | LLaMA 3.2 1B | LLaMA 3.2 3B | Gemma 2 2.6B | Phi 3.5-mini |
|---|---|---|---|---|
| MMLU (5-shot) | 49.3 | 63.4 | 57.8 | 69.0 |
| IFEval (Instruction Following) | 59.5 | 77.4 | 61.9 | 59.2 |
| ARC-Challenge | 59.4 | 78.6 | 76.7 | 87.4 |
| Tool Use (BFCL V2) | 25.7 | 67.0 | -- | -- |
| NIH/Multi-Needle | 75.0 | 84.7 | -- | -- |
The 3B model outperformed Gemma 2 2.6B and Phi 3.5-mini on instruction following, summarization, and tool use tasks, while the 1B model was competitive with Gemma on general knowledge benchmarks [10]. Meta worked with Qualcomm and MediaTek, the two largest mobile system-on-a-chip companies, to optimize these models for mobile processors. The models were also enabled for Arm processors, which provide the foundational compute layer for 99% of mobile devices [10]. Quantized variants released in the same family (using QAT+LoRA and SpinQuant) target 4-bit weights with 8-bit activations, allowing them to run on smartphone-class hardware in 8K context windows.
The Llama 3.2 Community License Agreement carries a regional restriction not present in earlier releases: rights granted under the agreement are not extended to individuals domiciled in, or companies with their principal place of business in, the European Union for the multimodal models. The text-only 1B and 3B models are not subject to this restriction. Meta cited regulatory uncertainty under the GDPR and the EU AI Act, particularly around the use of public Facebook and Instagram data in training, as the basis for the carve-out. End users of products and services that incorporate the multimodal models remain unaffected by the restriction; only the right to develop and host the models is curtailed in the EU [12]. The carveout drew criticism from European AI policy commentators who argued that it deepened the gap between US and EU access to frontier-scale open weights.
Released on December 6, 2024, LLaMA 3.3 70B is an efficiency-optimized model that delivers performance comparable to the much larger LLaMA 3.1 405B at a fraction of the computational cost [7].
The model retains the same architecture as LLaMA 3.1 70B (80 layers, 8,192 model dimension, 64 attention heads) but incorporates advances in post-training techniques. Meta applied expanded supervised fine-tuning with over 25 million synthetically generated examples, along with improved rejection sampling and direct preference optimization rounds, which yielded substantial gains in reasoning, mathematics, instruction following, and tool use [7]. No pretraining changes were made; the underlying LLaMA 3.1 70B base checkpoint was reused, with all gains coming from the post-training stack. Meta did not release a separately fine-tuned base model in the 3.3 release; only the instruction-tuned variant is available.
The following table compares LLaMA 3.3 70B against its predecessor and several leading models [7][12].
| Benchmark | LLaMA 3.1 70B | LLaMA 3.3 70B | LLaMA 3.1 405B | GPT-4o | Gemini 1.5 Pro |
|---|---|---|---|---|---|
| MMLU (0-shot, CoT) | 86.0 | 86.0 | 88.6 | 87.5 | 87.1 |
| MMLU-Pro (5-shot, CoT) | 66.4 | 68.9 | -- | 73.8 | 76.1 |
| IFEval (Instruction Following) | 87.5 | 92.1 | 88.6 | 84.6 | 81.9 |
| HumanEval (0-shot) | 80.5 | 88.4 | 89.0 | 86.0 | -- |
| MATH (0-shot, CoT) | 68.0 | 77.0 | 73.8 | 76.9 | 82.9 |
| GPQA Diamond (0-shot, CoT) | 48.0 | 50.5 | 50.7 | -- | 53.5 |
| Multilingual MGSM (0-shot) | 86.9 | 91.1 | -- | 90.6 | -- |
| NIH/Multi-Needle (Long Context) | 97.5 | 97.5 | 98.1 | -- | 94.7 |
| Tool Use (BFCL v2, 0-shot) | 77.5 | 77.3 | -- | -- | 80.3 |
LLaMA 3.3 70B trailed LLaMA 3.1 405B by under 2% on six out of nine evaluated benchmarks and achieved higher scores on three of them (IFEval, HumanEval, and MATH). The model outperformed Google Gemini 1.5 Pro, OpenAI GPT-4o, and Amazon Nova Pro on several benchmarks [7].
With an output speed of approximately 82.9 tokens per second (median across providers) and generation costs roughly five times lower than the 405B model, LLaMA 3.3 70B quickly became one of the most cost-effective open-weight models available. Independent benchmarks measured inference speeds reaching 276 tokens per second on Groq hardware. At typical API pricing of $0.10 per million input tokens and $0.40 per million output tokens, the model was approximately 25 times cheaper than GPT-4o for equivalent workloads [12]. The release was widely interpreted as a demonstration that Meta could continue to extract value from the 405B teacher long after its initial training, by distilling its capabilities into smaller models that the broader community could actually run.
Alongside the LLaMA 3 models, Meta developed and released a layered safety stack: Llama Guard for content classification, Code Shield for filtering insecure code generation, CyberSec Eval for measuring cybersecurity behavior, and Prompt Guard for detecting prompt-injection and jailbreak attempts.
Llama Guard is a safety classification system designed to moderate inputs and outputs of language model applications [13]. It treats safety classification as a structured generation problem: given a transcript of user prompt or model response and a list of hazard categories, the model outputs whether the content is safe and, if not, which hazard categories it violated. Because Llama Guard runs as a separate model rather than as a fine-tuned safety policy on the base LLM, the same system can be used to moderate any third-party model and can be customized to enforce different category lists.
| Version | Base Model | Release | Key Capabilities |
|---|---|---|---|
| Llama Guard (original) | LLaMA 2 7B | December 2023 | Text input/output classification |
| Llama Guard 2 | LLaMA 3 8B | April 2024 | Updated taxonomy, improved accuracy |
| Llama Guard 3 (8B) | LLaMA 3.1 8B | July 2024 | 8 languages, tool call safety, MLCommons taxonomy |
| Llama Guard 3 1B | LLaMA 3.2 1B | September 2024 | Compact safety classifier for on-device use |
| Llama Guard 3 Vision (11B) | LLaMA 3.2 11B | September 2024 | Multimodal (text + image) moderation |
| Llama Guard 4 (12B) | LLaMA 4 Scout | April 2025 | Unified text + vision safety, pruned architecture |
Llama Guard 3 categorizes content according to the MLCommons AI Safety Working Group taxonomy, which includes 13 hazard categories ranging from violent crimes and sex-related crimes to defamation, indiscriminate weapons, and code interpreter abuse. Versions 3 and later support multilingual classification across the same eight languages as the base LLaMA 3.1 models [13].
Code Shield is an inference-time filter that scans model-generated code for known insecure patterns before returning it to the user. It uses static analysis tools to detect common security weaknesses such as hard-coded credentials, unsafe deserialization, command injection, SQL injection, and use of cryptographic primitives in insecure modes. Code Shield was released alongside Llama 3 and is intended to be deployed as a wrapper around any code-generating LLM, not just LLaMA models [1].
CyberSec Eval (versions 2 and 3, released alongside Llama 3 and Llama 3.1 respectively) is a benchmark suite that measures three properties of LLM behavior: propensity to generate insecure code, susceptibility to prompt injection attacks, and willingness to assist with cyberattacks. CyberSec Eval 3 added test categories for malicious code execution and exploit generation, and Meta uses the suite both internally during post-training and as a public benchmark for the broader community [3].
Prompt Guard, released in July 2024, is a small classifier model that detects jailbreak prompts and prompt-injection attempts in user input. It is designed to be deployed in front of a primary LLM in agent and retrieval-augmented setups where untrusted text might be concatenated into the model's context. The classifier outputs a label indicating whether each input segment is benign, an injection attempt, or a jailbreak attempt, and applications can choose to drop, sanitize, or pass through the input based on that label [13].
Introduced alongside LLaMA 3.1 in July 2024, Llama Stack is a developer framework that standardizes the building blocks for constructing AI applications on top of LLaMA models [14].
Llama Stack provides a unified API layer covering inference, retrieval-augmented generation (RAG), AI agents, tool use, safety moderation, and evaluation. It supports a plugin architecture that allows different backend implementations for local development, on-premises servers, cloud environments, and mobile devices. SDKs are available for Python, TypeScript, iOS, and Android [14].
The ecosystem launched with over 25 integration partners, including AWS, NVIDIA, Databricks, Groq, Dell, Microsoft Azure, Google Cloud, IBM, Intel, Oracle Cloud, AMD, and Snowflake. The framework's reference implementation is open-source on GitHub and includes example apps that demonstrate composing inference, vector retrieval, safety filtering, and tool use into agent loops without writing infrastructure code from scratch.
The LLaMA 3 family has had a substantial impact on the open-source AI landscape. The models have been downloaded hundreds of millions of times from Hugging Face and other platforms, and they form the foundation for thousands of community fine-tunes and specialized models.
LLaMA 3 weights are hosted on Meta's official llama.com portal, the Hugging Face Hub, Kaggle Models, and through cloud partner catalogs at AWS, Azure, Google Cloud, Databricks, Snowflake, and NVIDIA NIM. Hosted inference is offered by Together AI, Fireworks AI, Replicate, Groq, Cerebras, DeepInfra, Perplexity, and several others, often at substantially lower per-token cost than the closed proprietary alternatives.
The following table summarizes representative hosted-inference providers and the latency/cost profile they typically advertise for LLaMA 3.x models, as of late 2024 and early 2025.
| Provider | Notable LLaMA 3 Models Hosted | Distinguishing Feature |
|---|---|---|
| Groq | 3.1 8B, 3.1 70B, 3.3 70B | LPU custom hardware, 200-800 tokens/sec output |
| Cerebras | 3.1 8B, 3.1 70B, 3.3 70B | Wafer-scale CS-3, very low time-to-first-token |
| Together AI | Full 3.x lineup, vision models | Custom inference kernels, batch APIs |
| Fireworks AI | Full 3.x lineup | Function calling and structured output |
| DeepInfra | Full 3.x lineup | Low per-token pricing |
| Replicate | 3.x text and vision | Pay-per-second model hosting |
| Perplexity Sonar | 3.1 70B variants | Search-augmented serving |
| AWS Bedrock | 3.1, 3.2, 3.3 | Managed enterprise endpoints |
| Azure AI Foundry | 3.1, 3.2, 3.3 | Microsoft-managed enterprise endpoints |
| Google Cloud Vertex AI | 3.1, 3.2, 3.3 | Vertex Model Garden hosting |
| NVIDIA NIM | 3.1, 3.3 | Optimized TRT-LLM inference containers |
| Databricks Mosaic | 3.1, 3.3 | Foundation Model Serving with LoRA fine-tuning |
Several factors contributed to this adoption. First, the 8B and 70B models hit practical sweet spots for cost and capability, making them suitable for a wide range of applications from chatbots to code assistants. Second, Meta's partnerships with cloud providers (AWS, Azure, Google Cloud) and hardware vendors (NVIDIA, Qualcomm, MediaTek) ensured that the models were immediately deployable across diverse infrastructure. Third, the release of the 405B model demonstrated that open-weight models could compete with the best proprietary systems on standard benchmarks, lending credibility to the broader open-source AI movement [9].
Meta released torchtune alongside the LLaMA 3 family, a PyTorch-native library designed for fine-tuning and experimenting with LLMs. The library provides memory-efficient training recipes and integrates with platforms such as Hugging Face, Weights & Biases, and EleutherAI's evaluation harness, making it straightforward for researchers and developers to adapt LLaMA 3 models to custom tasks and domains [2].
The model family also seeded a large derivative ecosystem. Public fine-tunes built on LLaMA 3 backbones include code specialists such as Code Llama derivatives and dedicated tool-use variants released by enterprise vendors. Notable third-party fine-tunes include Hermes 3 from Nous Research (an instruction-following variant of the 405B and 70B with relaxed refusal behaviors), Reflection 70B (a chain-of-thought-tuned variant from earlier 2024 that drew controversy over its claims), DeepSeek-V2 distillations on LLaMA 3 backbones, and a steady stream of role-play, narrative, and domain-specific (medical, legal, finance) fine-tunes hosted on Hugging Face. The competitive pressure from LLaMA 3 is credited with influencing other organizations to open their models or adopt more permissive licensing. Even OpenAI CEO Sam Altman acknowledged in late 2024 that the company "may need to pursue a more rigorous open source strategy" in response to the rise of open models [9].
LLaMA 3 deployments span a wide range of production use cases, including coding assistance (in tools such as GitHub Copilot alternatives, Continue, and Cursor self-hosted endpoints), customer support automation, document analysis and summarization, retrieval-augmented question answering for enterprise knowledge bases, internal Slack and Microsoft Teams assistants, agent platforms (with the 405B and 70B providing the planning core), and on-device assistants for mobile and embedded products. The 1B and 3B variants, in particular, have been adopted for local applications where data residency or offline operation is required, including healthcare and government deployments.
Meta's own deployment of LLaMA 3 inside its consumer surfaces is itself the largest single use case: Meta AI, the assistant integrated into Facebook, Instagram, WhatsApp, Messenger, Ray-Ban Meta smart glasses, and the standalone meta.ai web app, runs on a variant of the LLaMA 3 family and crossed 600 million monthly active users by the end of 2024 according to Meta's earnings disclosures [17].
LLaMA 3 models are released under the Llama 3 Community License, which Meta describes as "open" but which the Open Source Initiative has determined does not meet the formal Open Source Definition [15]. Each minor release ships with its own license document; the Llama 3, 3.1, 3.2, and 3.3 community licenses are similar but differ in several important details.
Key restrictions in the license include:
The license is governed under California law. Despite these restrictions, the Llama Community License is considerably more permissive than fully proprietary licenses and has enabled widespread commercial and research adoption. The Open Source Initiative argues that the user-cap clause and the trademark requirement together disqualify the license from being considered open source under the Open Source Definition, while Meta and several legal commentators argue that, for practical purposes, the license is functionally equivalent to permissive open source for the vast majority of users [15].
The LLaMA 3 family received broadly positive reception from the AI community, but several recurring criticisms accompanied the releases. Critics highlighted that the term "open" was being used to describe a license that did not meet the OSI's formal definition, and that pretraining data sources were not disclosed in detail. The Open Source Initiative published explicit objections to the licensing language, and academic groups argued that without source data, claims about model behavior could not be fully audited [15].
The vision models drew criticism for the EU carve-out, with European AI policy commentators arguing that the restriction widened a transatlantic gap in access to frontier-scale open weights and complicated the position of European startups that wanted to build on Meta's stack [12]. Some commentators also noted that benchmark scores published by Meta favored particular evaluation prompts and few-shot configurations, and independent reproductions occasionally found smaller margins than reported in the model cards.
Despite these criticisms, the consensus among practitioners by the end of 2024 was that LLaMA 3 had reset expectations for open-weight models. Independent leaderboards including Chatbot Arena and Scale AI's SEAL placed LLaMA 3.1 405B and 3.3 70B near the top of the open-weight category and within striking distance of proprietary frontier models on most public tasks [5][8].
The LLaMA 3 family's release reshaped the competitive landscape of the AI industry in 2024. Before LLaMA 3, the prevailing view in much of the industry was that only closed-model companies with proprietary data advantages could produce frontier-quality systems. The LLaMA 3.1 405B result, matching or exceeding GPT-4 Turbo on multiple benchmarks, challenged that assumption directly.
The influence extended beyond direct adoption. Google accelerated the release cadence of its Gemma open-weight models, Mistral AI expanded its own open-weight offerings, and several Chinese AI labs (including Alibaba with Qwen, 01.AI with Yi, and DeepSeek with V2 and V3) released increasingly competitive open-weight alternatives, creating a feedback loop that accelerated the entire open-weight ecosystem. By early 2025, open-weight models from Chinese labs had begun to match or exceed LLaMA 3.x on several benchmarks, illustrating how quickly the global landscape was responding.
The LLaMA 3 family also proved that overtraining smaller models well beyond the Chinchilla-optimal compute budget could produce highly capable models at accessible sizes. The finding that an 8B model trained on 15 trillion tokens could match or exceed the performance of 70B models trained on only 2 trillion tokens influenced training strategies across the industry and contributed to the trend of prioritizing data quality and quantity over raw parameter count [3]. Industry training reports from Mistral AI, DeepSeek, and Google all converged on similar conclusions in subsequent releases.
On April 5, 2025, Meta released the first models in the Llama 4 generation, marking a significant architectural departure from LLaMA 3 [16].
Llama 4 models use a mixture-of-experts (MoE) architecture, replacing the dense transformer design of LLaMA 3. In MoE models, each layer contains multiple expert subnetworks, and a gating function routes each token to only a subset of experts, allowing the model to maintain the knowledge capacity of a much larger model while keeping inference costs proportional to only the active parameters.
Llama 4 also introduces native multimodality through early fusion, meaning that text and image inputs are processed together from the earliest layers of the model rather than being handled by separate encoders that are later combined [16].
| Model | Total Parameters | Active Parameters | Experts | Context Length |
|---|---|---|---|---|
| Llama 4 Scout | ~109B | 17B | 16 | 10,000,000 |
| Llama 4 Maverick | ~400B | 17B | 128 | 1,000,000 |
| Llama 4 Behemoth (in training) | ~2T | 288B | -- | -- |
Llama 4 Scout uses 16 experts per MoE layer with 17 billion active parameters per token out of approximately 109 billion total. Its most striking feature is an industry-leading context window of 10 million tokens, vastly exceeding any previous open model [16].
Llama 4 Maverick scales up to 128 routed experts (plus a shared expert) per MoE layer, with MoE and dense layers alternating so that experts are applied in half of the layers. Each token is routed to the shared expert and one of the 128 routed experts. Maverick supports a context window of 1 million tokens and was reported to outperform GPT-4o and Gemini 2.0 Flash across a broad range of benchmarks [16].
Llama 4 Behemoth is a forthcoming model with approximately 2 trillion total parameters and around 288 billion active parameters. Meta described it as in training at the time of the April 2025 announcement and has positioned it as a teacher for future Llama distillations rather than a primary release target.
Meta published a detailed technical report titled "The Llama 3 Herd of Models" (arXiv:2407.21783), first posted on July 31, 2024 and last revised on November 23, 2024. The paper provides comprehensive documentation of the training methodology, scaling experiments, architectural decisions, and evaluation results for the LLaMA 3 and 3.1 families [3]. The paper is 92 pages long and covers topics including data curation, pretraining scaling laws, long-context extension, multilingual training, post-training alignment, safety evaluations, and inference optimization. With over 500 listed authors led by Aaron Grattafiori and Abhimanyu Dubey, it is one of the most widely cited AI papers of 2024 and serves as the canonical reference for the LLaMA 3 series.
The paper also discusses unreleased multimodal experiments in which Meta integrated image, video, and speech encoders into LLaMA 3 backbones via compositional approaches similar to those used in LLaMA 3.2 Vision. These experimental variants achieved competitive results on image, video, and speech recognition benchmarks but were not released to the public; some of the engineering work flowed into the Llama 3.2 Vision and Llama 4 multimodal releases.
As of May 2026, the LLaMA model family remains one of the two dominant open-weight model ecosystems in the AI industry, alongside Mistral AI's model family. The LLaMA 3.3 70B and Llama 4 Scout and Maverick models are widely deployed across cloud platforms, enterprise applications, and research institutions. The Llama Stack developer framework continues to expand with new integrations and tooling.
Meta has committed to continuing its open-weight release strategy, with Llama 4 Behemoth expected as the next major release. The company's approach of releasing frontier-scale models openly has reshaped industry dynamics and established a viable alternative to the closed-model paradigm championed by OpenAI and others. The LLaMA 3 family in particular continues to see heavy use as a baseline and as a fine-tuning target, in part because the fully dense architecture is easier to fine-tune and serve than the newer MoE-based Llama 4 models on hardware without specialized routing kernels.