LLaMA 3

LLaMA 3 (Large Language Model Meta AI 3) is a family of open-weight large language models developed and released by Meta beginning in April 2024. The LLaMA 3 series represents a significant leap over its predecessor, LLaMA 2, in both scale and capability, expanding training data from roughly 2 trillion tokens to more than 15 trillion, broadening multilingual coverage, lengthening context windows from 4K up to 128K tokens, and adding multimodal vision-language models alongside lightweight models designed for on-device deployment. With the release of LLaMA 3.1 405B in July 2024, Meta introduced the largest openly available language model at the time, directly challenging proprietary models from OpenAI, Google, and Anthropic. The LLaMA 3 family has become one of the most widely adopted open-weight model families in the AI ecosystem, accumulating over 1.2 million downloads in its first week alone and spawning a broad ecosystem of fine-tuned variants, developer tools, and enterprise integrations [1]. By December 2024, Meta reported that Llama models had been downloaded over 650 million times across all platforms, representing a tenfold increase over the prior year [2].

The family was developed by Meta's GenAI organization, with research and engineering effort spanning more than 500 contributors listed as authors on the accompanying technical paper, "The Llama 3 Herd of Models" (arXiv:2407.21783) [3]. Each release iteration introduced new capabilities: LLaMA 3 (April 2024) launched the redesigned tokenizer and architecture; LLaMA 3.1 (July 2024) added the 405B frontier-scale variant and 128K context windows; LLaMA 3.2 (September 2024) introduced vision-language models and edge-optimized 1B/3B variants; and LLaMA 3.3 (December 2024) closed the family with a single 70B model that approached 405B-class quality at a fraction of the inference cost. The LLaMA 3 generation was succeeded in April 2025 by Llama 4, which moved to a mixture-of-experts architecture and native multimodal pretraining.

background and predecessors

The LLaMA 3 family follows two prior generations of Meta's open language models. The original LLaMA (February 2023) introduced models from 7B to 65B parameters under a research-only license, with weights leaked publicly within days and quickly catalyzing a wave of community fine-tunes such as Alpaca, Vicuna, and Guanaco. LLaMA 2 (July 2023) was Meta's first openly licensed flagship language model series, available for both research and most commercial use under the Llama 2 Community License. LLaMA 2 introduced grouped-query attention (only on the 70B variant), a 4,096-token context window, and 2 trillion training tokens [3].

By late 2023, the open-weight landscape had become competitive. Mistral AI released Mistral 7B and the Mixtral 8x7B mixture-of-experts model, 01.AI released the Yi series, and Alibaba released the Qwen family. At the same time, proprietary frontier models including GPT-4, Claude 2, and Gemini 1.0 demonstrated capabilities that no open-weight model could match. Meta's stated goal with LLaMA 3 was to close that gap, and the LLaMA 3.1 405B release in mid-2024 marked the first time an open-weight model met or exceeded proprietary frontier performance on most public benchmarks [3][8].

release timeline

Meta released the LLaMA 3 family in four major waves over an eight month period in 2024, each introducing new model sizes, capabilities, or architectural improvements. The following table summarizes all major releases in the LLaMA 3 generation.

Model	Release Date	Parameters	Context Length	Key Features
LLaMA 3 8B	April 18, 2024	8B	8,192	Dense transformer, GQA, 128K vocabulary
LLaMA 3 70B	April 18, 2024	70B	8,192	Dense transformer, GQA, 128K vocabulary
LLaMA 3.1 8B	July 23, 2024	8B	128,000	Extended context, multilingual (8 languages)
LLaMA 3.1 70B	July 23, 2024	70B	128,000	Extended context, multilingual (8 languages)
LLaMA 3.1 405B	July 23, 2024	405B	128,000	Largest open-weight model, trained on 16K H100 GPUs
LLaMA 3.2 1B	September 25, 2024	1B	128,000	Lightweight text model for edge/mobile
LLaMA 3.2 3B	September 25, 2024	3B	128,000	Lightweight text model for edge/mobile
LLaMA 3.2 11B-Vision	September 25, 2024	11B	128,000	Multimodal (text + image input), vision encoder
LLaMA 3.2 90B-Vision	September 25, 2024	90B	128,000	Multimodal (text + image input), vision encoder
LLaMA 3.3 70B	December 6, 2024	70B	128,000	405B-class quality at a fraction of inference cost

In parallel with the main model releases, Meta shipped a growing safety stack (Llama Guard 2 in April 2024, Llama Guard 3 in July 2024, Llama Guard 3 Vision in September 2024, plus Code Shield, CyberSec Eval 2 and 3, and Prompt Guard) and a developer framework called Llama Stack. The first iteration of Meta AI, Meta's consumer-facing assistant, was rebuilt on top of LLaMA 3 at the April 2024 launch and rolled out across Facebook, Instagram, WhatsApp, and Messenger [1][17].

architecture

LLaMA 3 uses a relatively standard decoder-only transformer architecture, but with several important design choices that improve efficiency and performance at scale. The core design follows the post-LayerNorm pattern used in LLaMA 2 but with several scaled-up choices: a much larger vocabulary, grouped-query attention at every model size, and a higher RoPE base frequency to enable long-context extension.

model dimensions

The three primary backbone sizes in the LLaMA 3 family differ in layer depth, model width, feed-forward network dimension, and attention head count. The following table details the architectural specifications for each size [3].

Parameter	8B	70B	405B
Layers	32	80	126
Model Dimension	4,096	8,192	16,384
FFN Dimension	14,336	28,672	53,248
Attention Heads	32	64	128
Key-Value Heads	8	8	8
Attention Head Dimension	128	128	128
Vocabulary Size	128,000	128,000	128,000
Activation Function	SwiGLU	SwiGLU	SwiGLU
Normalization	RMSNorm	RMSNorm	RMSNorm
Positional Encoding	RoPE (theta=500K)	RoPE (theta=500K)	RoPE (theta=500K)

All three model sizes share the same fundamental building blocks: SwiGLU activation functions in the feed-forward layers, Root Mean Square Normalization (RMSNorm) for internal state normalization, and Rotary Positional Embeddings (RoPE) for encoding positional information. The consistent use of 8 key-value heads across all sizes, regardless of the number of query heads, is a defining characteristic of the LLaMA 3 architecture [3]. Compared to LLaMA 2, the only true architectural changes are the larger vocabulary, the universal use of GQA, the higher RoPE base frequency, and the deeper, wider 405B variant. Meta deliberately resisted introducing more exotic mechanisms (such as state-space layers, retrieval modules, or sparse experts), citing training stability and deployability across heterogeneous hardware as priorities for an open-weight release [3].

tokenizer

The LLaMA 3 tokenizer uses a vocabulary of 128,000 tokens based on byte pair encoding (BPE) via the tiktoken library, a fourfold increase over the 32,000-token vocabulary in LLaMA 2. This larger vocabulary encodes language more efficiently, reducing the number of tokens required to represent a given text passage and thereby improving both throughput and effective context utilization. In practice, the 128K vocabulary compresses English text by roughly 15% more tokens per passage compared to the LLaMA 2 tokenizer, which means more text fits within the same context window [1]. Meta retrained the tokenizer with explicit weighting toward non-English text, code, and mathematical symbols, which improved compression for non-Latin scripts and reduced the tokenization disparity between English and other languages that had been a noted weakness of LLaMA 2 [3]. The chat template uses special header tokens (such as <|begin_of_text|>, <|start_header_id|>, and <|eot_id|>) to mark turn boundaries, replacing the inline [INST] markers used in LLaMA 2.

grouped-query attention

Grouped-query attention (GQA) is used across all LLaMA 3 model sizes, including the 8B variant. In GQA, multiple query heads share a smaller number of key-value heads (8 key-value heads in the case of LLaMA 3), which reduces the memory footprint of the key-value cache during inference and improves decoding speed without meaningful degradation in output quality. For the 8B model with 32 query heads and 8 key-value heads, each key-value head is shared across 4 query heads. For the 70B model with 64 query heads, each key-value head is shared across 8 query heads. For the 405B model with 128 query heads, each key-value head serves 16 query heads. LLaMA 2 had used GQA only in its 70B variant, so extending it to all sizes was a notable architectural decision [4].

GQA's main practical benefit is on inference: KV cache memory scales with the number of key-value heads rather than query heads, so a 70B model with 64 query heads but 8 KV heads consumes about 1/8 the cache of a same-shape model that used full multi-head attention. This enables longer context generation on commodity hardware and reduces the cost of multi-tenant serving in cloud deployments [4].

rotary positional embeddings

LLaMA 3 employs Rotary Positional Embeddings (RoPE) to encode positional information. RoPE applies a rotation matrix to encode absolute position while simultaneously incorporating relative position information directly into the self-attention computation. For the LLaMA 3 series, Meta increased the RoPE base frequency hyperparameter to 500,000 (up from 10,000 in LLaMA 2), which enables better support for longer context lengths of up to 8,192 tokens in the initial release and 128,000 tokens in LLaMA 3.1 [3]. The higher base frequency stretches the rotation period, allowing the model to distinguish positions over longer ranges without running into the periodic aliasing that would occur at lower frequencies.

For the 128K context extension in LLaMA 3.1, Meta further scaled RoPE using a custom interpolation scheme combined with continued pretraining on long-context documents. The team gradually expanded context length over several training stages: 8K, then 16K, then 32K, 64K, and finally 128K, with each stage using a curated mixture of long documents (books, code repositories, research papers) to teach the model to attend over the new range. The 128K extension preserved short-context quality while delivering near-perfect retrieval scores on the needle-in-a-haystack benchmark across the full window [3].

dense transformer design

Unlike some competing models that use mixture-of-experts (MoE) architectures, all LLaMA 3 models employ a dense transformer design in which every parameter is active during inference. Meta chose this approach for its simplicity, training stability, and ease of deployment, though it means that inference cost scales linearly with parameter count [3]. The team explicitly evaluated MoE alternatives during the design phase but concluded that the dense architecture provided better training stability at the scales they targeted and was simpler to optimize for deployment across diverse hardware platforms. The decision was reversed in the Llama 4 generation, which moved to MoE in 2025.

training

pretraining data

LLaMA 3 models were pretrained on over 15 trillion tokens of text collected from publicly available sources. This represents a roughly sevenfold increase over the 2 trillion tokens used for LLaMA 2. Over 5% of the training data (approximately 800 million tokens) consisted of text in more than 30 non-English languages, improving multilingual performance [1].

Meta developed custom data filtering pipelines using multiple stages of quality control. The pipeline included heuristic filters for removing low-quality content, NSFW classifiers, and text quality classifiers trained specifically for this purpose. For quality scoring, Meta used DistilRoBERTa models trained on web data that had been annotated by LLaMA 2 itself, creating a bootstrapping approach where the previous generation model helped curate data for the next [3]. Specialized classifiers were also trained for code and reasoning content, using prompt-tuned models to identify web pages containing mathematical deductions, STEM reasoning, and code interleaved with natural language.

The deduplication process operated at both the document and line levels. Document-level deduplication used MinHash-based near-duplicate detection to remove redundant content across the corpus. Line-level deduplication employed heuristics such as duplicated n-gram coverage ratios to strip out repetitive content like logging messages or error traces. Additionally, token-distribution Kullback-Leibler divergence was used to filter out documents containing abnormal distributions of tokens compared to the overall training corpus [3].

The approximate data mix composition for pretraining was as follows [3]:

Data Category	Share of Training Mix
General knowledge (web text)	~50%
Mathematical and reasoning data	~25%
Code	~17%
Multilingual data	~8%

Meta noted that even the multilingual budget understates the actual coverage, because much of the "general knowledge" web text contains code-switched or non-English passages that are not classified into the multilingual bucket. Internal annealing experiments, which trained smaller proxy models on candidate data mixes for short runs, helped Meta tune category weightings before committing to the full 15 trillion token run [3]. Knowledge cutoff dates differed slightly across releases: December 2023 for LLaMA 3.1 8B, 70B, and 405B, with later versions inheriting the same cutoff [16].

training infrastructure

The LLaMA 3.1 405B model was trained on a cluster of 16,384 NVIDIA H100 80GB GPUs, making it one of the largest single training runs conducted on publicly disclosed infrastructure at the time [5]. The total training compute for the 405B model was approximately 3.8 x 10^25 FLOPs, and the run consumed a cumulative 39.3 million GPU hours on H100-80GB hardware (rated at a thermal design power of 700W per GPU) [3]. The 8B model used roughly 1.46 million GPU hours, the 70B about 7.0 million GPU hours, and the 405B about 30.84 million GPU hours, for a combined 39.3 million GPU hours across the three Llama 3.1 variants [16].

Meta drew on two purpose-built 24,576-GPU H100 clusters described in March 2024 by its data center engineering team. One cluster used a RoCE (RDMA over Converged Ethernet) fabric assembled from Arista 7800 switches with Wedge400 and Minipack2 OCP rack switches, while the second used NVIDIA Quantum-2 400 Gbps InfiniBand. Both clusters housed GPUs in Meta's Grand Teton OCP chassis and used Meta's Tectonic distributed flash storage system, accessed through a Filesystem in Userspace (FUSE) layer, for training data and checkpoints [11]. Meta's stated 2024 buildout target was to operate the equivalent of nearly 600,000 H100s by the end of the year, including roughly 350,000 H100s.

Training a model of this scale required advanced distributed training techniques. Meta employed a 4D parallelism strategy that combined four forms of parallelism simultaneously [3]:

Parallelism Type	Description
Tensor Parallelism (TP)	Splits individual weight matrices across multiple GPUs within a node
Pipeline Parallelism (PP)	Distributes different layers of the model across groups of GPUs
Context Parallelism (CP)	Splits long input sequences across GPUs for memory efficiency
Data Parallelism (DP)	Replicates the model and distributes training batches across groups

For the 405B model the dominant configuration was 8-way tensor parallelism within nodes, 16-way pipeline parallelism across racks, context parallelism for sequences longer than 8K tokens, and fully sharded data parallelism (FSDP) at the outermost level. The achieved effective utilization, expressed as model FLOPs utilization (MFU) on H100 hardware, was reported in the technical paper at over 400 TFLOPs per GPU during the 16K-GPU training run [1].

The pretraining run for the 405B model took approximately 54 days. During that period, the cluster experienced 466 job interruptions, of which 47 were planned (for automated maintenance) and 419 were unexpected. The unexpected interruptions broke down as follows: 148 were caused by faulty GPUs (30.1%), 72 by GPU HBM3 memory errors (17.2%), 35 by network switch and cable problems (8.4%), 19 by GPU SRAM memory issues (4.5%), and 17 by GPU system processor failures (4.1%). Only two CPU failures were recorded during the entire period. Despite these challenges, the team achieved over 90% effective training time through automated checkpointing and rapid recovery procedures [6]. Independent reporting interpreted these statistics as a roughly one-failure-every-three-hours cadence on the 16K-GPU cluster, underscoring how reliability engineering becomes a first-class research concern at frontier scale [6].

The models were trained on sequences of 8,192 tokens using document-level masking to prevent self-attention from crossing document boundaries. An important finding from the training process was that model performance continued to improve log-linearly well beyond the Chinchilla-optimal compute allocation. While the Chinchilla-optimal amount of training data for an 8B parameter model is roughly 200 billion tokens, Meta observed continued gains when training on two orders of magnitude more data (15 trillion tokens), which informed their decision to overtrain the smaller models significantly [1]. The same logic applied at the largest scale: compute-optimal scaling laws would have suggested a smaller dataset for a 405B parameter model, but Meta deliberately overtrained to improve inference-time efficiency, since serving costs scale with parameter count rather than with the original training token budget.

Meta reported location-based CO2 emissions of approximately 11,390 metric tons for the LLaMA 3.1 family and 12.9 metric tons for the LLaMA 3.3 70B fine-tuning run, with a market-based net of zero achieved by matching electricity consumption with renewable energy purchases [16][7].

post-training

Instruct-tuned variants of all LLaMA 3 models underwent supervised fine-tuning on publicly available instruction datasets along with over 10 million human-annotated examples. Post-training also included reinforcement learning from human feedback (RLHF) using rejection sampling and direct preference optimization (DPO) to align model outputs with human preferences.

The post-training pipeline followed multiple iterative rounds. In each round, the model was first fine-tuned on curated instruction data (SFT), then improved through rejection sampling (RS) where the model generated multiple candidate responses and a reward model selected the best ones, and finally refined through DPO where the model learned to prefer higher-quality outputs over lower-quality alternatives [3]. This iterative approach allowed each round to build on the improvements of the previous round, progressively improving both helpfulness and safety. Several rounds also incorporated synthetic data generation, using earlier checkpoints to bootstrap higher quality instruction data for later rounds.

Meta diverged from the prevailing PPO-based RLHF approach used by OpenAI and others, opting for a simpler combination of rejection sampling plus DPO. Internal experiments showed comparable or better quality with substantially less training compute and better stability, which the team credited as one reason the post-training stack scaled cleanly from 8B to 405B [3]. A reward model trained on a mix of public preference datasets and Meta's own human comparison annotations served as the scorer for rejection sampling. The team explicitly avoided training a separate reward model for safety, instead relying on the same general reward model with safety-specific preference data injected into the mix.

For LLaMA 3.3, the fine-tuning data was further expanded to over 25 million synthetically generated examples, reflecting Meta's growing investment in synthetic data generation as a post-training technique [7]. The synthetic data generation pipeline used the larger 405B teacher to create candidate prompts and responses across math, code, instruction following, and tool use, with LLM-based classifiers filtering out low-quality completions before adding them to the SFT mix.

llama 3 (8b and 70b)

The initial LLaMA 3 release on April 18, 2024 included two dense text models: 8B and 70B parameters. Both models were released in base (pretrained) and instruct-tuned variants. Despite the relatively short 8,192-token context window, these models set new standards for open-weight performance at their respective sizes [1].

benchmark performance

The initial LLaMA 3 models demonstrated strong results across standard benchmarks for their parameter counts [3].

Benchmark	LLaMA 3 8B	LLaMA 3 70B
MMLU (5-shot)	69.4	83.6
HumanEval (0-shot)	72.6	80.5
GSM8K (8-shot, CoT)	84.5	95.1
MATH (0-shot, CoT)	51.9	68.0
ARC-Challenge (25-shot)	83.4	94.8

The LLaMA 3 8B outperformed the previous LLaMA 2 70B on several benchmarks despite being nearly nine times smaller, illustrating the compounding benefits of more training data, a larger vocabulary, and grouped-query attention at all scales [1]. The 70B model was competitive with the best proprietary models available in early 2024, including GPT-3.5 Turbo and early versions of Claude 3 Sonnet. Meta's own benchmark sweep at launch indicated that LLaMA 3 70B Instruct beat Gemini Pro 1.5 and Claude 3 Sonnet on most evaluations, though Anthropic and Google would soon ship Claude 3.5 Sonnet (June 2024) and updated Gemini variants [17].

initial reception

The April 2024 release drew immediate attention. Within the first week, the models had been downloaded over 1.2 million times across Hugging Face, Meta's own llama.com portal, and partner platforms [1]. Coverage in IEEE Spectrum, the New York Times, the Financial Times, and Reuters framed the release as a strategic move that further legitimized open-weight frontier development. The launch was paired with the relaunch of Meta AI as a Llama-3-powered consumer assistant, integrated into Facebook, Instagram, WhatsApp, and Messenger search bars and accessible via a standalone meta.ai web interface, with real-time information drawn from Bing and Google Search integrations [17].

llama 3.1

The LLaMA 3.1 release on July 23, 2024 updated all three model sizes (8B, 70B, and 405B) with extended context lengths of 128,000 tokens and multilingual support for eight languages: English, German, French, Italian, Portuguese, Hindi, Spanish, and Thai. This was a meaningful expansion from the English-centric focus of earlier LLaMA releases [5][16]. The release was timed deliberately to coincide with the publication of "The Llama 3 Herd of Models" preprint and a Mark Zuckerberg open letter titled "Open Source AI Is the Path Forward" [9].

llama 3.1 405b

The LLaMA 3.1 405B model was the largest openly available language model at the time of its release. With 405 billion parameters and a 128K token context window, it was positioned as a direct competitor to proprietary frontier models like GPT-4 and Claude 3 Opus [5]. Press estimates valued the GPU hardware required for the training run at roughly $400 million in retail H100 prices, though Meta's effective costs were lower because the GPUs were owned and amortized across multiple workloads [16].

The 405B variant was the first frontier-scale Meta model to natively support function calling and multi-step tool use, enabled by extensive tool-use data injected during post-training. It accepts JSON tool definitions and emits structured tool calls in a Python-like syntax that downstream agents can parse. Tool definitions are presented to the model through the chat template's system prompt, and the model can invoke tools either inline (mid-response) or as standalone responses, depending on the task [3].

benchmark performance (llama 3.1 family)

The following table compares the LLaMA 3.1 models against leading proprietary models on standard benchmarks. All scores are for the instruct-tuned variants [3][5][16].

Benchmark	LLaMA 3.1 8B	LLaMA 3.1 70B	LLaMA 3.1 405B	GPT-4o	Claude 3.5 Sonnet
MMLU (5-shot)	73.0	86.0	87.3	88.7	88.3
MMLU-Pro (5-shot, CoT)	--	--	73.3	--	--
HumanEval (0-shot)	72.6	80.5	89.0	90.2	92.0
GSM8K (8-shot, CoT)	84.5	95.1	96.8	96.1	96.4
MATH (0-shot, CoT)	51.9	68.0	73.8	76.6	71.1
ARC-Challenge (25-shot)	83.4	94.8	96.9	--	--
GPQA (0-shot)	--	48.0	50.7	--	--
Tool Use (BFCL)	--	77.5	88.5	83.6	90.2
Multilingual (MGSM)	--	86.9	91.6	90.5	91.6

On the Scale AI SEAL leaderboard, LLaMA 3.1 405B ranked second in math and reasoning, fourth in coding, and first in instruction following [5]. Experimental evaluations suggested the 405B model performed on par with GPT-4-0125-Preview and Claude 3.5 Sonnet, winning and losing roughly the same percentage of head-to-head comparisons, though it fell slightly behind GPT-4o in direct matchups, winning only 19.1% of comparisons [8]. Independent evaluations from Vellum and Promptfoo confirmed that the 405B model traded wins with proprietary frontier models across coding, reasoning, and math benchmarks, with no single model dominating [16].

significance

Meta CEO Mark Zuckerberg described the 405B release as a pivotal moment for open-source AI, comparing it to the role Linux played in democratizing server operating systems. The release was accompanied by a blog post titled "Open Source AI Is the Path Forward," in which Zuckerberg argued that openly available models would ultimately outcompete closed alternatives through community-driven innovation [9]. Zuckerberg also framed open weights as strategically aligned with Meta's business: unlike OpenAI or Anthropic, Meta does not monetize model access directly, so open distribution does not cannibalize a core product line and instead lowers Meta's cost of innovation through ecosystem feedback [9].

The release marked the first time Meta had ever released model weights for a frontier-scale system, and several commentators including Andrej Karpathy and Andrew Ng described the 405B launch as a turning point that would force closed-model providers to compete more aggressively on price and quality. By the end of 2024, hosted Llama usage by token volume across major cloud partners had more than doubled in the May-July window, and monthly Llama usage at the largest cloud providers had grown roughly tenfold from January to July 2024 [10].

llama 3.2: multimodal and edge models

Released on September 25, 2024 at Meta Connect, LLaMA 3.2 expanded the family in two significant directions: multimodal vision-language models and lightweight models for edge devices [10].

vision models (11b and 90b)

The LLaMA 3.2 11B-Vision and 90B-Vision models accept both text and image inputs and produce text outputs. They were designed as drop-in replacements for their text-only counterparts while adding image reasoning capabilities. These models can perform tasks such as document understanding (including charts and graphs), image captioning, and visual grounding (identifying objects in images based on natural language descriptions) [10].

The vision capability was integrated through a separately trained vision adapter, an approach Meta describes as "late fusion." The adapter consists of a dedicated image encoder paired with a series of cross-attention layers that feed image encoder representations into the core language model. The adapter weights were trained on text-image pairs to align the visual representations with the language model's internal representations, while the pretrained language model weights remained largely frozen during this alignment stage. This design allowed the vision models to retain the full text performance of their LLaMA 3.1 counterparts while gaining image understanding capabilities, and supports tasks such as visual question answering, chart interpretation, and document analysis without disturbing the text-only behavior of the original models [10].

The image encoder was based on a vision transformer pretrained on a large corpus of image-text pairs filtered for relevance and quality. Cross-attention layers were inserted at regular intervals throughout the language model so that text tokens could attend to image patch features when relevant, but text-only inference paths remain unaffected. The total parameter count of the 11B-Vision model includes about 8B inherited from LLaMA 3.1 8B, plus the encoder and adapter weights; the 90B-Vision builds analogously on top of LLaMA 3.1 70B [10].

vision model benchmarks

The following table shows benchmark scores for the instruct-tuned vision models on image understanding tasks [11].

Benchmark	LLaMA 3.2 11B-Vision	LLaMA 3.2 90B-Vision
MMMU (val, CoT)	50.7	60.3
MMMU-Pro Standard	33.0	45.2
MathVista	51.5	57.3
ChartQA (CoT)	83.4	85.5
AI2 Diagram (test)	91.1	92.3
DocVQA (test)	88.4	90.1
VQAv2 (test)	75.2	78.1

Meta reported that the 11B-Vision model outperformed Claude 3 Haiku and was competitive with GPT-4o mini on image recognition and visual understanding tasks [10].

lightweight text models (1b and 3b)

The LLaMA 3.2 1B and 3B models were designed for on-device deployment on mobile phones and edge hardware. Despite their small size, they support the full 128K token context window and achieve competitive performance on summarization, instruction following, and text rewriting tasks [10]. Both models were created through a combination of structured pruning of LLaMA 3.1 8B and knowledge distillation, in which the smaller models learn from the logits of larger 8B and 70B teachers. This pipeline allowed Meta to compress capable behavior into much smaller parameter budgets while keeping the same tokenizer and chat template [11].

lightweight model benchmarks

The following table compares the LLaMA 3.2 lightweight models against similarly sized competitors [10].

Benchmark	LLaMA 3.2 1B	LLaMA 3.2 3B	Gemma 2 2.6B	Phi 3.5-mini
MMLU (5-shot)	49.3	63.4	57.8	69.0
IFEval (Instruction Following)	59.5	77.4	61.9	59.2
ARC-Challenge	59.4	78.6	76.7	87.4
Tool Use (BFCL V2)	25.7	67.0	--	--
NIH/Multi-Needle	75.0	84.7	--	--

The 3B model outperformed Gemma 2 2.6B and Phi 3.5-mini on instruction following, summarization, and tool use tasks, while the 1B model was competitive with Gemma on general knowledge benchmarks [10]. Meta worked with Qualcomm and MediaTek, the two largest mobile system-on-a-chip companies, to optimize these models for mobile processors. The models were also enabled for Arm processors, which provide the foundational compute layer for 99% of mobile devices [10]. Quantized variants released in the same family (using QAT+LoRA and SpinQuant) target 4-bit weights with 8-bit activations, allowing them to run on smartphone-class hardware in 8K context windows.

eu restrictions

The Llama 3.2 Community License Agreement carries a regional restriction not present in earlier releases: rights granted under the agreement are not extended to individuals domiciled in, or companies with their principal place of business in, the European Union for the multimodal models. The text-only 1B and 3B models are not subject to this restriction. Meta cited regulatory uncertainty under the GDPR and the EU AI Act, particularly around the use of public Facebook and Instagram data in training, as the basis for the carve-out. End users of products and services that incorporate the multimodal models remain unaffected by the restriction; only the right to develop and host the models is curtailed in the EU [12]. The carveout drew criticism from European AI policy commentators who argued that it deepened the gap between US and EU access to frontier-scale open weights.

llama 3.3 70b

Released on December 6, 2024, LLaMA 3.3 70B is an efficiency-optimized model that delivers performance comparable to the much larger LLaMA 3.1 405B at a fraction of the computational cost [7].

The model retains the same architecture as LLaMA 3.1 70B (80 layers, 8,192 model dimension, 64 attention heads) but incorporates advances in post-training techniques. Meta applied expanded supervised fine-tuning with over 25 million synthetically generated examples, along with improved rejection sampling and direct preference optimization rounds, which yielded substantial gains in reasoning, mathematics, instruction following, and tool use [7]. No pretraining changes were made; the underlying LLaMA 3.1 70B base checkpoint was reused, with all gains coming from the post-training stack. Meta did not release a separately fine-tuned base model in the 3.3 release; only the instruction-tuned variant is available.

benchmark performance

The following table compares LLaMA 3.3 70B against its predecessor and several leading models [7][12].

Benchmark	LLaMA 3.1 70B	LLaMA 3.3 70B	LLaMA 3.1 405B	GPT-4o	Gemini 1.5 Pro
MMLU (0-shot, CoT)	86.0	86.0	88.6	87.5	87.1
MMLU-Pro (5-shot, CoT)	66.4	68.9	--	73.8	76.1
IFEval (Instruction Following)	87.5	92.1	88.6	84.6	81.9
HumanEval (0-shot)	80.5	88.4	89.0	86.0	--
MATH (0-shot, CoT)	68.0	77.0	73.8	76.9	82.9
GPQA Diamond (0-shot, CoT)	48.0	50.5	50.7	--	53.5
Multilingual MGSM (0-shot)	86.9	91.1	--	90.6	--
NIH/Multi-Needle (Long Context)	97.5	97.5	98.1	--	94.7
Tool Use (BFCL v2, 0-shot)	77.5	77.3	--	--	80.3

LLaMA 3.3 70B trailed LLaMA 3.1 405B by under 2% on six out of nine evaluated benchmarks and achieved higher scores on three of them (IFEval, HumanEval, and MATH). The model outperformed Google Gemini 1.5 Pro, OpenAI GPT-4o, and Amazon Nova Pro on several benchmarks [7].

cost efficiency

With an output speed of approximately 82.9 tokens per second (median across providers) and generation costs roughly five times lower than the 405B model, LLaMA 3.3 70B quickly became one of the most cost-effective open-weight models available. Independent benchmarks measured inference speeds reaching 276 tokens per second on Groq hardware. At typical API pricing of $0.10 per million input tokens and $0.40 per million output tokens, the model was approximately 25 times cheaper than GPT-4o for equivalent workloads [12]. The release was widely interpreted as a demonstration that Meta could continue to extract value from the 405B teacher long after its initial training, by distilling its capabilities into smaller models that the broader community could actually run.

llama guard and safety stack

Alongside the LLaMA 3 models, Meta developed and released a layered safety stack: Llama Guard for content classification, Code Shield for filtering insecure code generation, CyberSec Eval for measuring cybersecurity behavior, and Prompt Guard for detecting prompt-injection and jailbreak attempts.

llama guard

Llama Guard is a safety classification system designed to moderate inputs and outputs of language model applications [13]. It treats safety classification as a structured generation problem: given a transcript of user prompt or model response and a list of hazard categories, the model outputs whether the content is safe and, if not, which hazard categories it violated. Because Llama Guard runs as a separate model rather than as a fine-tuned safety policy on the base LLM, the same system can be used to moderate any third-party model and can be customized to enforce different category lists.

Version	Base Model	Release	Key Capabilities
Llama Guard (original)	LLaMA 2 7B	December 2023	Text input/output classification
Llama Guard 2	LLaMA 3 8B	April 2024	Updated taxonomy, improved accuracy
Llama Guard 3 (8B)	LLaMA 3.1 8B	July 2024	8 languages, tool call safety, MLCommons taxonomy
Llama Guard 3 1B	LLaMA 3.2 1B	September 2024	Compact safety classifier for on-device use
Llama Guard 3 Vision (11B)	LLaMA 3.2 11B	September 2024	Multimodal (text + image) moderation
Llama Guard 4 (12B)	LLaMA 4 Scout	April 2025	Unified text + vision safety, pruned architecture

Llama Guard 3 categorizes content according to the MLCommons AI Safety Working Group taxonomy, which includes 13 hazard categories ranging from violent crimes and sex-related crimes to defamation, indiscriminate weapons, and code interpreter abuse. Versions 3 and later support multilingual classification across the same eight languages as the base LLaMA 3.1 models [13].

code shield

Code Shield is an inference-time filter that scans model-generated code for known insecure patterns before returning it to the user. It uses static analysis tools to detect common security weaknesses such as hard-coded credentials, unsafe deserialization, command injection, SQL injection, and use of cryptographic primitives in insecure modes. Code Shield was released alongside Llama 3 and is intended to be deployed as a wrapper around any code-generating LLM, not just LLaMA models [1].

cybersec eval

CyberSec Eval (versions 2 and 3, released alongside Llama 3 and Llama 3.1 respectively) is a benchmark suite that measures three properties of LLM behavior: propensity to generate insecure code, susceptibility to prompt injection attacks, and willingness to assist with cyberattacks. CyberSec Eval 3 added test categories for malicious code execution and exploit generation, and Meta uses the suite both internally during post-training and as a public benchmark for the broader community [3].

prompt guard

Prompt Guard, released in July 2024, is a small classifier model that detects jailbreak prompts and prompt-injection attempts in user input. It is designed to be deployed in front of a primary LLM in agent and retrieval-augmented setups where untrusted text might be concatenated into the model's context. The classifier outputs a label indicating whether each input segment is benign, an injection attempt, or a jailbreak attempt, and applications can choose to drop, sanitize, or pass through the input based on that label [13].

llama stack

Introduced alongside LLaMA 3.1 in July 2024, Llama Stack is a developer framework that standardizes the building blocks for constructing AI applications on top of LLaMA models [14].

Llama Stack provides a unified API layer covering inference, retrieval-augmented generation (RAG), AI agents, tool use, safety moderation, and evaluation. It supports a plugin architecture that allows different backend implementations for local development, on-premises servers, cloud environments, and mobile devices. SDKs are available for Python, TypeScript, iOS, and Android [14].

The ecosystem launched with over 25 integration partners, including AWS, NVIDIA, Databricks, Groq, Dell, Microsoft Azure, Google Cloud, IBM, Intel, Oracle Cloud, AMD, and Snowflake. The framework's reference implementation is open-source on GitHub and includes example apps that demonstrate composing inference, vector retrieval, safety filtering, and tool use into agent loops without writing infrastructure code from scratch.

community adoption and ecosystem

The LLaMA 3 family has had a substantial impact on the open-source AI landscape. The models have been downloaded hundreds of millions of times from Hugging Face and other platforms, and they form the foundation for thousands of community fine-tunes and specialized models.

LLaMA 3 weights are hosted on Meta's official llama.com portal, the Hugging Face Hub, Kaggle Models, and through cloud partner catalogs at AWS, Azure, Google Cloud, Databricks, Snowflake, and NVIDIA NIM. Hosted inference is offered by Together AI, Fireworks AI, Replicate, Groq, Cerebras, DeepInfra, Perplexity, and several others, often at substantially lower per-token cost than the closed proprietary alternatives.

The following table summarizes representative hosted-inference providers and the latency/cost profile they typically advertise for LLaMA 3.x models, as of late 2024 and early 2025.

Provider	Notable LLaMA 3 Models Hosted	Distinguishing Feature
Groq	3.1 8B, 3.1 70B, 3.3 70B	LPU custom hardware, 200-800 tokens/sec output
Cerebras	3.1 8B, 3.1 70B, 3.3 70B	Wafer-scale CS-3, very low time-to-first-token
Together AI	Full 3.x lineup, vision models	Custom inference kernels, batch APIs
Fireworks AI	Full 3.x lineup	Function calling and structured output
DeepInfra	Full 3.x lineup	Low per-token pricing
Replicate	3.x text and vision	Pay-per-second model hosting
Perplexity Sonar	3.1 70B variants	Search-augmented serving
AWS Bedrock	3.1, 3.2, 3.3	Managed enterprise endpoints
Azure AI Foundry	3.1, 3.2, 3.3	Microsoft-managed enterprise endpoints
Google Cloud Vertex AI	3.1, 3.2, 3.3	Vertex Model Garden hosting
NVIDIA NIM	3.1, 3.3	Optimized TRT-LLM inference containers
Databricks Mosaic	3.1, 3.3	Foundation Model Serving with LoRA fine-tuning

Several factors contributed to this adoption. First, the 8B and 70B models hit practical sweet spots for cost and capability, making them suitable for a wide range of applications from chatbots to code assistants. Second, Meta's partnerships with cloud providers (AWS, Azure, Google Cloud) and hardware vendors (NVIDIA, Qualcomm, MediaTek) ensured that the models were immediately deployable across diverse infrastructure. Third, the release of the 405B model demonstrated that open-weight models could compete with the best proprietary systems on standard benchmarks, lending credibility to the broader open-source AI movement [9].

Meta released torchtune alongside the LLaMA 3 family, a PyTorch-native library designed for fine-tuning and experimenting with LLMs. The library provides memory-efficient training recipes and integrates with platforms such as Hugging Face, Weights & Biases, and EleutherAI's evaluation harness, making it straightforward for researchers and developers to adapt LLaMA 3 models to custom tasks and domains [2].

The model family also seeded a large derivative ecosystem. Public fine-tunes built on LLaMA 3 backbones include code specialists such as Code Llama derivatives and dedicated tool-use variants released by enterprise vendors. Notable third-party fine-tunes include Hermes 3 from Nous Research (an instruction-following variant of the 405B and 70B with relaxed refusal behaviors), Reflection 70B (a chain-of-thought-tuned variant from earlier 2024 that drew controversy over its claims), DeepSeek-V2 distillations on LLaMA 3 backbones, and a steady stream of role-play, narrative, and domain-specific (medical, legal, finance) fine-tunes hosted on Hugging Face. The competitive pressure from LLaMA 3 is credited with influencing other organizations to open their models or adopt more permissive licensing. Even OpenAI CEO Sam Altman acknowledged in late 2024 that the company "may need to pursue a more rigorous open source strategy" in response to the rise of open models [9].

use cases

LLaMA 3 deployments span a wide range of production use cases, including coding assistance (in tools such as GitHub Copilot alternatives, Continue, and Cursor self-hosted endpoints), customer support automation, document analysis and summarization, retrieval-augmented question answering for enterprise knowledge bases, internal Slack and Microsoft Teams assistants, agent platforms (with the 405B and 70B providing the planning core), and on-device assistants for mobile and embedded products. The 1B and 3B variants, in particular, have been adopted for local applications where data residency or offline operation is required, including healthcare and government deployments.

Meta's own deployment of LLaMA 3 inside its consumer surfaces is itself the largest single use case: Meta AI, the assistant integrated into Facebook, Instagram, WhatsApp, Messenger, Ray-Ban Meta smart glasses, and the standalone meta.ai web app, runs on a variant of the LLaMA 3 family and crossed 600 million monthly active users by the end of 2024 according to Meta's earnings disclosures [17].

licensing

LLaMA 3 models are released under the Llama 3 Community License, which Meta describes as "open" but which the Open Source Initiative has determined does not meet the formal Open Source Definition [15]. Each minor release ships with its own license document; the Llama 3, 3.1, 3.2, and 3.3 community licenses are similar but differ in several important details.

Key restrictions in the license include:

Monthly active user threshold: Organizations with services exceeding 700 million monthly active users at the time of the model release must obtain a separate commercial license from Meta. This threshold targets a small number of large platforms (Apple, Google, Microsoft, ByteDance, Amazon, Tencent, etc.) without affecting most commercial users.
Use restrictions: The license incorporates an Acceptable Use Policy that prohibits applications including weapons development, child sexual abuse material generation, mass surveillance, and content designed to incite violence.
Attribution: Distributors must include the license text and display "Built with Llama" branding on user-facing products that materially use a Llama model.
Non-competition clause (LLaMA 3 only): The original LLaMA 3 license prohibited using model outputs to improve competing language models. This restriction was relaxed starting with LLaMA 3.1, which allows such use with proper attribution [15].
Termination: Meta may terminate the license at any time if it determines the licensee is in breach, with no grace period.
EU restriction (LLaMA 3.2 multimodal only): The Llama 3.2 vision models are not licensed to entities domiciled or based in the European Union, citing GDPR and AI Act uncertainty [12].

The license is governed under California law. Despite these restrictions, the Llama Community License is considerably more permissive than fully proprietary licenses and has enabled widespread commercial and research adoption. The Open Source Initiative argues that the user-cap clause and the trademark requirement together disqualify the license from being considered open source under the Open Source Definition, while Meta and several legal commentators argue that, for practical purposes, the license is functionally equivalent to permissive open source for the vast majority of users [15].

reception and criticism

The LLaMA 3 family received broadly positive reception from the AI community, but several recurring criticisms accompanied the releases. Critics highlighted that the term "open" was being used to describe a license that did not meet the OSI's formal definition, and that pretraining data sources were not disclosed in detail. The Open Source Initiative published explicit objections to the licensing language, and academic groups argued that without source data, claims about model behavior could not be fully audited [15].

The vision models drew criticism for the EU carve-out, with European AI policy commentators arguing that the restriction widened a transatlantic gap in access to frontier-scale open weights and complicated the position of European startups that wanted to build on Meta's stack [12]. Some commentators also noted that benchmark scores published by Meta favored particular evaluation prompts and few-shot configurations, and independent reproductions occasionally found smaller margins than reported in the model cards.

Despite these criticisms, the consensus among practitioners by the end of 2024 was that LLaMA 3 had reset expectations for open-weight models. Independent leaderboards including Chatbot Arena and Scale AI's SEAL placed LLaMA 3.1 405B and 3.3 70B near the top of the open-weight category and within striking distance of proprietary frontier models on most public tasks [5][8].

impact on open-source ai

The LLaMA 3 family's release reshaped the competitive landscape of the AI industry in 2024. Before LLaMA 3, the prevailing view in much of the industry was that only closed-model companies with proprietary data advantages could produce frontier-quality systems. The LLaMA 3.1 405B result, matching or exceeding GPT-4 Turbo on multiple benchmarks, challenged that assumption directly.

The influence extended beyond direct adoption. Google accelerated the release cadence of its Gemma open-weight models, Mistral AI expanded its own open-weight offerings, and several Chinese AI labs (including Alibaba with Qwen, 01.AI with Yi, and DeepSeek with V2 and V3) released increasingly competitive open-weight alternatives, creating a feedback loop that accelerated the entire open-weight ecosystem. By early 2025, open-weight models from Chinese labs had begun to match or exceed LLaMA 3.x on several benchmarks, illustrating how quickly the global landscape was responding.

The LLaMA 3 family also proved that overtraining smaller models well beyond the Chinchilla-optimal compute budget could produce highly capable models at accessible sizes. The finding that an 8B model trained on 15 trillion tokens could match or exceed the performance of 70B models trained on only 2 trillion tokens influenced training strategies across the industry and contributed to the trend of prioritizing data quality and quantity over raw parameter count [3]. Industry training reports from Mistral AI, DeepSeek, and Google all converged on similar conclusions in subsequent releases.

llama 4

On April 5, 2025, Meta released the first models in the Llama 4 generation, marking a significant architectural departure from LLaMA 3 [16].

architecture changes

Llama 4 models use a mixture-of-experts (MoE) architecture, replacing the dense transformer design of LLaMA 3. In MoE models, each layer contains multiple expert subnetworks, and a gating function routes each token to only a subset of experts, allowing the model to maintain the knowledge capacity of a much larger model while keeping inference costs proportional to only the active parameters.

Llama 4 also introduces native multimodality through early fusion, meaning that text and image inputs are processed together from the earliest layers of the model rather than being handled by separate encoders that are later combined [16].

models

Model	Total Parameters	Active Parameters	Experts	Context Length
Llama 4 Scout	~109B	17B	16	10,000,000
Llama 4 Maverick	~400B	17B	128	1,000,000
Llama 4 Behemoth (in training)	~2T	288B	--	--

Llama 4 Scout uses 16 experts per MoE layer with 17 billion active parameters per token out of approximately 109 billion total. Its most striking feature is an industry-leading context window of 10 million tokens, vastly exceeding any previous open model [16].

Llama 4 Maverick scales up to 128 routed experts (plus a shared expert) per MoE layer, with MoE and dense layers alternating so that experts are applied in half of the layers. Each token is routed to the shared expert and one of the 128 routed experts. Maverick supports a context window of 1 million tokens and was reported to outperform GPT-4o and Gemini 2.0 Flash across a broad range of benchmarks [16].

Llama 4 Behemoth is a forthcoming model with approximately 2 trillion total parameters and around 288 billion active parameters. Meta described it as in training at the time of the April 2025 announcement and has positioned it as a teacher for future Llama distillations rather than a primary release target.

technical paper

Meta published a detailed technical report titled "The Llama 3 Herd of Models" (arXiv:2407.21783), first posted on July 31, 2024 and last revised on November 23, 2024. The paper provides comprehensive documentation of the training methodology, scaling experiments, architectural decisions, and evaluation results for the LLaMA 3 and 3.1 families [3]. The paper is 92 pages long and covers topics including data curation, pretraining scaling laws, long-context extension, multilingual training, post-training alignment, safety evaluations, and inference optimization. With over 500 listed authors led by Aaron Grattafiori and Abhimanyu Dubey, it is one of the most widely cited AI papers of 2024 and serves as the canonical reference for the LLaMA 3 series.

The paper also discusses unreleased multimodal experiments in which Meta integrated image, video, and speech encoders into LLaMA 3 backbones via compositional approaches similar to those used in LLaMA 3.2 Vision. These experimental variants achieved competitive results on image, video, and speech recognition benchmarks but were not released to the public; some of the engineering work flowed into the Llama 3.2 Vision and Llama 4 multimodal releases.

current state (may 2026)

As of May 2026, the LLaMA model family remains one of the two dominant open-weight model ecosystems in the AI industry, alongside Mistral AI's model family. The LLaMA 3.3 70B and Llama 4 Scout and Maverick models are widely deployed across cloud platforms, enterprise applications, and research institutions. The Llama Stack developer framework continues to expand with new integrations and tooling.

Meta has committed to continuing its open-weight release strategy, with Llama 4 Behemoth expected as the next major release. The company's approach of releasing frontier-scale models openly has reshaped industry dynamics and established a viable alternative to the closed-model paradigm championed by OpenAI and others. The LLaMA 3 family in particular continues to see heavy use as a baseline and as a fine-tuning target, in part because the fully dense architecture is easier to fine-tune and serve than the newer MoE-based Llama 4 models on hardware without specialized routing kernels.

references

Introducing Meta Llama 3: The most capable openly available LLM to date - Meta AI Blog, April 2024
Industry Leading, Open-Source AI | Llama - Meta, 2025
The Llama 3 Herd of Models - arXiv:2407.21783, July 2024
Welcome Llama 3 - Meta's new open LLM - Hugging Face, April 2024
Introducing Llama 3.1: Our most capable models to date - Meta AI Blog, July 2024
Faulty Nvidia H100 GPUs and HBM3 memory caused half of failures during LLama 3 training - Tom's Hardware, 2024
Meta unveils a new, more efficient Llama model - TechCrunch, December 2024
Meta Releases Llama 3.1 405B, Largest Open-Source Model to Date - InfoQ, July 2024
Open Source AI Is the Path Forward - Meta, July 2024
Llama 3.2: Revolutionizing edge AI and vision with open, customizable models - Meta AI Blog, September 2024
Building Meta's GenAI Infrastructure - Engineering at Meta, March 2024
Meta Restricts Multimodal Models in The European Union Due to Privacy Concerns - DeepLearning.AI The Batch
Llama Guard: LLM-based Input-Output Safeguard for Human-AI Conversations - Meta AI Research
Llama Stack - Composable building blocks to build LLM Apps - GitHub
Meta's LLaMa license is still not Open Source - Open Source Initiative
Llama 3.1 Model Card - GitHub, Meta
Llama 3 Establishes Meta as the Leader in "Open" AI - IEEE Spectrum, April 2024
The Llama 4 herd: The beginning of a new era of natively multimodal AI innovation - Meta AI Blog, April 2025
With 10x growth since 2023, Llama is the leading engine of AI innovation - Meta AI Blog, August 2024
Llama 3.3 Model Card - GitHub, Meta
Llama 3.2 Vision Model Card - GitHub, Meta

background and predecessors

release timeline

architecture

model dimensions

tokenizer

grouped-query attention

rotary positional embeddings

dense transformer design

training

pretraining data

training infrastructure

post-training

llama 3 (8b and 70b)

benchmark performance

initial reception

llama 3.1

llama 3.1 405b

benchmark performance (llama 3.1 family)

significance

llama 3.2: multimodal and edge models

vision models (11b and 90b)

vision model benchmarks

lightweight text models (1b and 3b)

lightweight model benchmarks

eu restrictions

llama 3.3 70b

benchmark performance

cost efficiency

llama guard and safety stack

llama guard

code shield

cybersec eval

prompt guard

llama stack

community adoption and ecosystem

use cases

licensing

reception and criticism

impact on open-source ai

llama 4

architecture changes

models

technical paper

current state (may 2026)

see also

references

Improve this article

Related Articles

LLaMA

DeepSeek 3.0

Open-source AI

OpenClaw

LLaMA 3

BART (language model)

background and predecessors

release timeline

architecture

model dimensions

tokenizer

grouped-query attention

rotary positional embeddings

dense transformer design

training

pretraining data

training infrastructure

post-training

llama 3 (8b and 70b)

benchmark performance

initial reception

llama 3.1

llama 3.1 405b

benchmark performance (llama 3.1 family)

significance

llama 3.2: multimodal and edge models

vision models (11b and 90b)

vision model benchmarks

lightweight text models (1b and 3b)

lightweight model benchmarks

eu restrictions

llama 3.3 70b