Llama 3.1
Last reviewed
May 17, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v2 · 5,599 words
Llama 3.1 is a family of open-weight large language models released by Meta on July 23, 2024.[^1] The release comprises three model sizes — 8 billion, 70 billion, and 405 billion parameters — each available in both pre-trained (base) and instruction-tuned (Instruct) variants.[^1][^2] Llama 3.1 built directly on the foundation of Llama 3 with four major expansions: a 128,000-token context window (extended sixteen-fold from the 8,192-token limit in Llama 3), formal multilingual support across eight languages, native tool-calling and function-calling capabilities, and — most consequentially — the 405B variant, which Meta described as "the world's largest and most capable openly available foundation model" and the first open-weight model to be widely benchmarked as competitive with closed frontier systems such as GPT-4o and Claude 3.5 Sonnet.[^1][^2]
The 405B model was trained on more than 15 trillion tokens using over 16,000 NVIDIA H100 GPUs in a single roughly two-month pre-training run, making it one of the most compute-intensive open-weight training runs publicly disclosed at the time.[^1][^2][^3] Meta simultaneously released the Llama Stack specification for agentic application development and a suite of safety classifiers under the Llama Guard 3 and Prompt Guard names.[^1] All models were made available for download via Hugging Face and through more than 25 commercial cloud and inference partners on launch day, including AWS, Microsoft Azure, Google Cloud, NVIDIA, Databricks, Groq, and Dell.[^1][^4]
The accompanying technical paper, "The Llama 3 Herd of Models" (arXiv:2407.21783), documents both the April 2024 Llama 3 release and the July 2024 Llama 3.1 release as a single research program.[^2] Mark Zuckerberg paired the launch with an open letter, "Open Source AI is the Path Forward," arguing that open-weight foundation models would follow a trajectory analogous to Linux displacing proprietary Unix in server infrastructure.[^5]
Meta's first Llama models appeared in February 2023 as compact, research-focused weights distributed under a non-commercial license. Llama 2 followed in July 2023 with an explicit commercial license and substantially better performance. Llama 3, released in April 2024 in 8B and 70B variants, brought a 128,000-token vocabulary, higher benchmark scores, and a new pre-training pipeline based on 15 trillion tokens of multilingual web data. Despite these advances, Llama 3 shipped with three significant ecosystem limitations: an 8,192-token context window, only English instruction-tuning at production quality, and no built-in tool-calling protocol.[^1][^2]
Llama 3.1 addressed all three of those gaps in a single release. The context limit was extended by a factor of sixteen, reaching 128,000 tokens via continued pre-training with positional-embedding adjustments.[^2] Multilingual instruction-tuning was formalized across eight languages. Tool calling, function calling, and a code-interpreter mode were incorporated directly into the Instruct variants using a documented JSON interface and a small set of new special tokens.[^6] And the 405B variant — entirely new for this generation — pushed Meta into direct competition with frontier closed models for the first time in the company's open-weight history.[^1][^2]
Until July 2024, open-weight large language models trailed leading closed systems on most public benchmarks. Mistral Large, Command R+, Qwen2-72B, and the original Llama 3 70B were collectively a meaningful step behind GPT-4-class systems. The Llama 3.1 405B — with benchmark scores within a few percentage points of GPT-4o and Claude 3.5 Sonnet — was widely characterized as the first time an organization had released a frontier-tier model with downloadable weights and a commercially usable license.[^1][^2][^7] Patrick McGuinness called it "the first time an open model is being released that matches the closed frontier."[^8] Louie Peters estimated that the 405B training run alone cost approximately $60 million in compute, on top of an estimated $850 million in capital expenditure for the underlying GPU cluster.[^9]
Mark Zuckerberg used the Llama 3.1 launch to make an extended argument for open-source AI, comparing the trajectory he expected for open models to the history of Linux displacing proprietary Unix in server infrastructure. He wrote that open source "will ensure that more people around the world have access to the benefits and opportunities of AI, that power isn't concentrated in the hands of a small number of companies, and that the technology can be deployed more evenly and safely across society."[^5]
Llama 3.1 sits between Llama 3 (April 2024) and Llama 3.2 (September 2024) in Meta's release timeline. Llama 3.2 added vision capabilities to 11B and 90B variants and introduced small text-only edge models at 1B and 3B parameters.[^10] Llama 3.3, released December 6, 2024, revisited the 70B scale with refined post-training and was reported by Meta to approximate Llama 3.1 405B-class quality on instruction following and mathematics at roughly one-fifth the inference cost.[^11] Llama 4 (April 2025) marked a major architectural inflection point: Meta shifted to a mixture-of-experts architecture with native multimodality, releasing Llama 4 Scout and Llama 4 Maverick as the first open-weight Llama models to use MoE, with a two-trillion-parameter Llama 4 Behemoth still in training as of the announcement.[^12] Within this progression, Llama 3.1 represents the high point of Meta's dense decoder-only era and the point at which open weights first achieved measured parity with closed frontier systems.[^1][^2][^12]
Llama 3.1 was released in three sizes, each available in both a pre-trained base model and an instruction-tuned variant fine-tuned for chat, tool use, and multilingual instruction following.[^1]
The 8B model is the smallest and most accessible variant. In FP16 it occupies approximately 16 GB of GPU memory, fitting comfortably on a single high-end consumer GPU such as an NVIDIA RTX 4090 or a data-center-class A10G. With 4-bit quantization the memory footprint drops to approximately 4 GB, enabling deployment on integrated and consumer-grade hardware.[^4] The 8B Instruct variant supports the full 128K-token context window, the standard tool-calling protocol, and the eight officially supported instruction languages.
Despite its size, the 8B Llama 3.1 showed meaningful improvements over its Llama 3 8B predecessor: MMLU rose from approximately 66.6 to 69.4 (5-shot), and GSM8K math reasoning improved from 79.6 to 84.5.[^2] At this scale, the model competes with 13B- and 34B-class models from earlier open-source generations, making it suitable for memory-constrained deployments and on-device applications.
The 70B model represents the mid-tier option and is widely considered the most practical of the three for organizations that need strong capability without the infrastructure demands of the 405B. In FP16, the model requires approximately 140 GB of GPU memory, necessitating at least two 80 GB A100 or H100 GPUs or a comparable multi-GPU configuration. With AWQ 4-bit quantization the requirement drops to roughly 35 GB, enabling deployment on two consumer-grade 24 GB GPUs.[^4]
The 70B Instruct variant scored 83.6 on MMLU (5-shot), 95.1 on GSM8K (8-shot CoT), and 80.5 on HumanEval (0-shot), placing it well ahead of GPT-3.5-class systems and competitive with earlier frontier models such as GPT-4-0125-Preview on many tasks.[^2] On the ZeroSCROLLS/QuALITY long-context benchmark, Llama 3.1 70B Instruct scored 90.5 — matching GPT-4o exactly on that metric.[^2]
The 405B is the flagship model and the primary reason for the Llama 3.1 release's significance. Meta described it as "the first openly available model that rivals the top AI models" in general knowledge, math reasoning, tool use, and multilingual translation.[^1] In full BF16 precision, the model weights alone occupy approximately 810 GB, more than the combined memory of ten 80 GB H100 GPUs.[^4] Meta's official FP8-quantized variant reduces this to approximately 405 GB, fitting within a single 8-way H100 node, a deliberate engineering target intended to make the model deployable on a single industry-standard server.[^1][^4]
Benchmark performance places the 405B within a few percentage points of GPT-4o and Claude 3.5 Sonnet on most tasks. On GSM8K math reasoning the 405B scored 96.8, slightly above GPT-4o's 96.1. On ARC Challenge it scored 96.9, essentially matching GPT-4o's 96.7. On HumanEval code generation it scored 89.0, modestly below GPT-4o's 90.2 and Claude 3.5 Sonnet's 92.0.[^2][^13]
Meta's FP8 quantization was applied specifically to the major linear operators of the model, covering the gate, up, and down projections in the feed-forward networks. These components account for approximately 75% of the model's inference FLOPs. The quantization produces minimal accuracy degradation relative to BF16 on standard evaluations while halving the memory footprint, with NVIDIA and Neural Magic each subsequently releasing alternative FP8 calibrations with slightly different trade-offs.[^4][^14]
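The memory figures quoted for the three sizes follow from simple arithmetic: parameter count times bytes per parameter. The following is a minimal sketch, assuming nominal parameter counts of 8, 70, and 405 billion and decimal gigabytes; it ignores activations, the KV cache, and any quantization metadata, and the names `PARAMS`, `BYTES_PER_PARAM`, and `weight_memory_gb` are illustrative.

```python
# Rough weight-memory estimate: parameters x bytes per parameter.
# Assumes nominal parameter counts and decimal gigabytes (1 GB = 1e9 bytes);
# real checkpoints differ slightly and quantized formats carry extra metadata.

PARAMS = {"8B": 8e9, "70B": 70e9, "405B": 405e9}
BYTES_PER_PARAM = {"bf16/fp16": 2.0, "fp8": 1.0, "4-bit": 0.5}


def weight_memory_gb(model: str, precision: str) -> float:
    """Return the approximate weight footprint in decimal gigabytes."""
    return PARAMS[model] * BYTES_PER_PARAM[precision] / 1e9


for model in PARAMS:
    row = ", ".join(
        f"{precision}: {weight_memory_gb(model, precision):.0f} GB"
        for precision in BYTES_PER_PARAM
    )
    print(f"{model}: {row}")
```

Under these assumptions the 405B comes to 810 GB in BF16 and 405 GB in FP8, which is why the FP8 weights fit within the 640 GB of an 8 × 80 GB node while the BF16 weights do not.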
Llama 3.1 uses a dense, decoder-only Transformer architecture across all three sizes. Meta deliberately chose not to adopt a mixture-of-experts design for this release, citing training stability and post-training simplicity as the primary reasons.[^2] The architectural specifications are summarized in the table below.
| Parameter | 8B | 70B | 405B |
|---|---|---|---|
| Transformer layers | 32 | 80 | 126 |
| Model dimension | 4,096 | 8,192 | 16,384 |
| FFN hidden dimension | 14,336 | 28,672 | 53,248 |
| Attention heads | 32 | 64 | 128 |
| Key-value heads (GQA) | 8 | 8 | 8 |
| Context window | 128,000 | 128,000 | 128,000 |
| Vocabulary size | 128,000 | 128,000 | 128,000 |
| Peak learning rate | 3 × 10⁻⁴ | 1.5 × 10⁻⁴ | 8 × 10⁻⁵ |
All three variants use Grouped-Query Attention (GQA) with 8 key-value heads per layer.[^2] GQA reduces the memory required by the key-value cache during inference, which is particularly important at 128K context lengths where the KV cache would otherwise grow very large. For the 405B at full 128K context, the KV cache alone requires approximately 123 GB in FP16, a substantial addition to the memory needed for the model weights themselves. The standardization of GQA across all three sizes, rather than using multi-head attention at smaller scales, also enables the same inference kernels to be used across variants, simplifying the deployment surface.[^4]
Llama 3.1 uses Rotary Position Embeddings (RoPE) with a base frequency of 500,000.[^2] This base is substantially higher than the typical RoPE configuration of 10,000 used in earlier Transformer models, and the higher value is necessary to maintain stable attention patterns at 128K token distances. The base models were initially trained at an 8K context window and then progressively extended to 128K through six stages of continued pre-training, each stage increasing the context length and incorporating data specifically curated to require long-range attention. Approximately 800 billion additional tokens were processed during these long-context extension stages.[^2]
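The effect of the larger base can be illustrated by computing the wavelength of each rotary frequency pair. This is a sketch of the standard RoPE formulation rather than code from Meta; the head dimension of 128 is derived from the architecture table (model dimension divided by attention heads), and `rope_wavelengths` is an illustrative name.

```python
import math

HEAD_DIM = 128  # model dimension / attention heads, e.g. 4096 / 32


def rope_wavelengths(base: float, head_dim: int = HEAD_DIM) -> list[float]:
    """Wavelength (in tokens) of each rotary frequency pair.

    Pair i rotates by base ** (-2 * i / head_dim) radians per token, so it
    completes one full turn every 2 * pi * base ** (2 * i / head_dim) tokens.
    """
    return [
        2 * math.pi * base ** (2 * i / head_dim)
        for i in range(head_dim // 2)
    ]


for base in (10_000, 500_000):
    longest = max(rope_wavelengths(base))
    print(f"base={base:>7,}: longest wavelength ~ {longest:,.0f} tokens")
```

Under these assumptions the slowest-rotating pair with base 10,000 completes a full turn in roughly 54,000 tokens, which is shorter than a 128K window, whereas with base 500,000 it takes on the order of 2.6 million tokens.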
The feed-forward networks use SwiGLU activation, a gated linear unit variant that generally produces better training dynamics than standard ReLU or GELU activations. This choice is consistent with Llama 3 and with many other recent large language models including PaLM, Mistral, and Qwen.[^2]
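A SwiGLU feed-forward block combines a gate projection, an up projection, and a down projection. The sketch below uses plain Python lists and toy dimensions purely for illustration; it assumes the common bias-free formulation, and the weights and helper names (`silu`, `matvec`, `swiglu_ffn`) are made up for the example.

```python
import math


def silu(x: float) -> float:
    """SiLU (swish) activation: x * sigmoid(x)."""
    return x / (1.0 + math.exp(-x))


def matvec(matrix: list[list[float]], vector: list[float]) -> list[float]:
    """Multiply a (rows x cols) matrix by a length-cols vector."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


def swiglu_ffn(x, w_gate, w_up, w_down):
    """Feed-forward block: down( silu(gate(x)) * up(x) )."""
    gate = [silu(g) for g in matvec(w_gate, x)]
    up = matvec(w_up, x)
    hidden = [g * u for g, u in zip(gate, up)]  # element-wise gating
    return matvec(w_down, hidden)


# Toy sizes: model dimension 2, hidden dimension 3.
x = [1.0, -2.0]
w_gate = [[0.5, 0.1], [-0.3, 0.2], [0.0, 0.4]]
w_up = [[0.2, -0.1], [0.7, 0.3], [-0.5, 0.6]]
w_down = [[0.1, 0.2, 0.3], [-0.4, 0.5, 0.6]]
print(swiglu_ffn(x, w_gate, w_up, w_down))
```

In the real models the three matrices map between the model dimension and the FFN hidden dimension listed in the architecture table (for example 4,096 and 14,336 for the 8B).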
Llama 3.1 uses the same 128,000-token vocabulary as Llama 3. The vocabulary was constructed by combining 100,000 tokens from OpenAI's tiktoken cl100k_base tokenizer with 28,000 additional tokens added to improve coverage of non-English languages.[^2][^4] The expanded vocabulary improves tokenization efficiency for languages such as Hindi, Thai, and Arabic, reducing the average number of tokens required to represent a given passage by roughly 30% on multilingual benchmarks relative to a purely English-derived vocabulary.
The Llama 3.1 models were pre-trained on over 15 trillion tokens drawn from a multilingual web corpus.[^1][^2] The dataset was assembled with several filtering stages. Hashing-based and MinHash deduplication removed exact and near-duplicate content. A heuristic and classifier-based quality filter retained high-quality text, drawing on a RoBERTa-based classifier trained on human-annotated quality labels. Domain-specific data for code and mathematics was upsampled relative to its natural frequency in the web corpus, since the paper reports that this upsampling materially improved downstream reasoning benchmarks.[^2]
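Near-duplicate detection with MinHash can be sketched in a few lines. This is a generic illustration of the technique, not Meta's pipeline: the word 3-gram shingles, the 64 hash functions, and the salted BLAKE2b hashing are all assumptions chosen for brevity.

```python
import hashlib


def shingles(text: str, size: int = 3) -> set[str]:
    """Split text into overlapping word n-grams (shingles)."""
    words = text.lower().split()
    return {" ".join(words[i:i + size]) for i in range(len(words) - size + 1)}


def minhash_signature(items: set[str], num_hashes: int = 64) -> list[int]:
    """For each seeded hash function, keep the minimum hash over all items."""
    signature = []
    for seed in range(num_hashes):
        salt = seed.to_bytes(8, "little")  # a different salt per hash function
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(item.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "big",
            )
            for item in items
        ))
    return signature


def estimated_jaccard(sig_a: list[int], sig_b: list[int]) -> float:
    """Fraction of matching positions estimates the Jaccard similarity."""
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)


doc_a = "the quick brown fox jumps over the lazy dog near the quiet river bank"
doc_b = "the quick brown fox jumps over the lazy dog near the quiet river shore"
doc_c = "large language models are trained on trillions of tokens of web text"

sig_a, sig_b, sig_c = (minhash_signature(shingles(d)) for d in (doc_a, doc_b, doc_c))
print("a vs b:", estimated_jaccard(sig_a, sig_b))
print("a vs c:", estimated_jaccard(sig_a, sig_c))
```

A deduplication pass would typically drop one document of any pair whose estimated similarity exceeds a chosen threshold; at web scale the signatures are usually bucketed with locality-sensitive hashing rather than compared pairwise.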
Approximately 8% of the pre-training tokens were non-English, covering the eight languages that the Instruct variants subsequently support along with broader linguistic diversity at the base-model level. The knowledge cutoff for the pre-training data is December 2023.[^2]
The 405B model was trained on over 16,000 NVIDIA H100 GPUs in a single cluster operated by Meta.[^1][^2] Published analyses indicate that pre-training took approximately 54 days for the 405B variant, with a total accumulated compute budget of 39.3 million GPU-hours across all three model sizes.[^15] Tom's Hardware, citing the published Llama 3 paper, reported that during the 54-day training period the cluster experienced 419 unexpected component failures — roughly one every three hours — with faulty GPUs and HBM3 memory modules collectively responsible for about half of those failures.[^15] To run training reliably at this scale, Meta developed custom optimizations across the network, storage, and compute stack, including changes to NCCL collective operations and a custom checkpoint-restart pipeline.[^2]
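The "one every three hours" figure can be checked from the two numbers quoted above; this is plain arithmetic on the reported values, with illustrative variable names.

```python
TRAINING_DAYS = 54
UNEXPECTED_FAILURES = 419

# Average wall-clock time between unexpected failures during the run.
hours_between_failures = TRAINING_DAYS * 24 / UNEXPECTED_FAILURES
failures_per_day = UNEXPECTED_FAILURES / TRAINING_DAYS

print(f"~{hours_between_failures:.1f} hours between unexpected failures")
print(f"~{failures_per_day:.1f} unexpected failures per day")
```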
The pre-training precision was BF16. To enable production-scale inference at reasonable cost, Meta separately developed FP8 quantization of the final weights, reducing storage and memory requirements by approximately half without meaningful degradation on standard benchmarks.[^4]
Post-training followed a multi-stage pipeline: supervised fine-tuning (SFT), then preference optimization, for which Meta used Direct Preference Optimization (DPO) rather than the more traditional Proximal Policy Optimization (PPO) variant of RLHF.[^2] Meta reports generating more than 25 million synthetic training examples for the post-training pipeline using a combination of earlier Llama models, human annotation, and rejection sampling against reward models.[^2]
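For readers unfamiliar with DPO, the per-pair loss can be written compactly. The sketch below implements the generic DPO objective for a single preference pair; it does not reproduce any modifications or hyperparameters from Meta's recipe, and the `beta` value of 0.1 is only an illustrative default.

```python
import math


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,
) -> float:
    """DPO loss for one preference pair.

    loss = -log(sigmoid(beta * ((pc - rc) - (pr - rr))))
    where each argument is the summed log-probability of a full response
    under the policy (p) or the frozen reference model (r).
    """
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(z)) == log(1 + exp(-z)); this form is numerically stable.
    return max(-logits, 0.0) + math.log1p(math.exp(-abs(logits)))


# Policy identical to the reference: loss is log(2).
print(dpo_loss(-10.0, -12.0, -10.0, -12.0))
# Policy has shifted probability toward the chosen response: loss is lower.
print(dpo_loss(-8.0, -14.0, -10.0, -12.0))
```

The loss falls as the policy raises the likelihood of the preferred response relative to the reference model and lowers that of the rejected one, without requiring a separate reward model at optimization time.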
Post-training addressed four areas beyond general instruction following:
The complete post-training pipeline went through several iterations of rejection sampling, SFT, and DPO. The paper describes this as a "three-stage" post-training process, with each stage progressively improving capability, safety, and alignment on the eight target languages.[^2]
The extension from 8,192 tokens in Llama 3 to 128,000 tokens in Llama 3.1 is one of the release's most consequential changes. A 128K context window can accommodate approximately 96,000–100,000 words of English text in a single input. This enables several use cases that were impractical with shorter windows:[^2][^4]
Meta achieved the 128K extension through a combination of architectural and training-time changes. The RoPE base frequency was increased to 500,000, which lengthens the wavelengths used by the positional encoding and slows the rotation that earlier RoPE configurations applied to long-distance attention pairs.[^2] On top of this architectural change, Meta performed six progressive stages of continued pre-training. Each stage increased the maximum sequence length used during training and was conducted on a corpus of long-document data specifically curated for that stage. The cumulative continued-pre-training token count was approximately 800 billion tokens.[^2]
The KV cache requirements scale linearly with context length. For the 405B at 128K tokens, the cache requires approximately 123 GB, which must be added to the weight memory. In practice this means serving the 405B at full 128K context requires careful memory planning across multiple H100 nodes. For the 8B model, the 128K KV cache requires approximately 15.6 GB, manageable alongside the 16 GB weight footprint on a dual-GPU setup or a single 32 GB or larger accelerator.[^4]
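The linear scaling can be made concrete with the usual KV-cache formula: two tensors (keys and values) per layer, each of size key-value heads × head dimension × tokens. This is a sketch under stated assumptions: layer and head counts are taken from the architecture table, the head dimension of 128 is derived as model dimension divided by attention heads, values are stored in 2 bytes, and sizes are reported in GiB (2^30 bytes).

```python
# KV-cache size = 2 (K and V) x layers x kv_heads x head_dim x bytes x tokens.
# Layer and head counts come from the architecture table; head_dim is
# model dimension / attention heads = 128 for all three sizes.

CONFIGS = {
    "8B": {"layers": 32, "kv_heads": 8, "head_dim": 128},
    "70B": {"layers": 80, "kv_heads": 8, "head_dim": 128},
    "405B": {"layers": 126, "kv_heads": 8, "head_dim": 128},
}


def kv_cache_gib(model: str, tokens: int = 128_000, bytes_per_value: int = 2) -> float:
    """KV-cache size in GiB (2**30 bytes) for one sequence of `tokens` tokens."""
    cfg = CONFIGS[model]
    per_token = 2 * cfg["layers"] * cfg["kv_heads"] * cfg["head_dim"] * bytes_per_value
    return per_token * tokens / 2**30


for name in CONFIGS:
    print(f"{name}: {kv_cache_gib(name):.1f} GiB at 128,000 tokens (FP16)")
```

With these inputs the 8B comes to about 15.6 GiB, in line with the figure quoted above. For the 405B the same formula gives roughly 61.5 GiB with 8 key-value heads; the quoted 123 GB is what the formula yields with 16 key-value heads, so that figure appears to assume a different head count than the table. Which configuration the cited source used is not established here.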
Llama 3.1's Instruct variants officially support eight languages: English, French, German, Hindi, Italian, Portuguese, Spanish, and Thai.[^1] The selection was driven by a combination of speaker population, partner demand, and availability of high-quality instruction data. Multilingual support is a meaningful capability upgrade over Llama 3, which was tuned primarily for English.
Meta evaluated multilingual capability on benchmarks including MGSM (Multilingual Grade-School Math), translation benchmarks across the eight supported languages, and dedicated multilingual MMLU variants. The 405B Instruct scored 91.6 on MGSM (0-shot CoT), matching Claude 3.5 Sonnet exactly and edging out GPT-4o's 90.5.[^2] The expanded 128,000-token vocabulary contributes to multilingual capability by tokenizing non-English languages — especially Hindi and Thai, which use non-Latin scripts — more efficiently than English-derived tokenizers, reducing the per-token computational cost of generating in those languages.[^2][^4]
Outside the eight officially supported languages, the base models contain meaningful capability across a much broader set of languages owing to the diversity of the pre-training corpus, but Meta cautions that Instruct-model quality drops substantially outside the officially supported set.[^1]
Llama 3.1 was the first Llama generation to include native tool calling as a first-class capability of the Instruct models.[^6] The Instruct variants support three modes of tool interaction.
Two tools are pre-trained into the model and can be activated via the system prompt:
- Tools: brave_search. The model calls the tool using Python-like syntax: brave_search.call(query="...").
- Tools: wolfram_alpha. Called as wolfram_alpha.call(query="...").

These built-in tools are triggered in an ipython-style environment context. The model generates a tool call, receives the result via an ipython role message, and then continues reasoning with the result incorporated into the dialogue.[^6]
Developers can define arbitrary custom tools in the system prompt using a JSON-schema format similar to the OpenAI function-calling specification. The model parses the tool definitions and generates calls in the form {"name": "function_name", "parameters": {"arg_name": "value"}}. This permits integration with any external API or service.[^6]
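On the application side, a custom tool call in the form shown above can be parsed and routed to a local function. The following is a minimal sketch: the `get_weather` tool, its schema layout, and the `REGISTRY` and `dispatch_tool_call` names are hypothetical, and the exact definition format the model expects should be taken from Meta's prompt-format documentation rather than from this example.

```python
import json

# Hypothetical tool definition in a JSON-schema style; the exact layout the
# model expects is documented by Meta and is not reproduced here.
GET_WEATHER_TOOL = {
    "name": "get_weather",
    "description": "Return the current temperature for a city.",
    "parameters": {
        "type": "object",
        "properties": {"city": {"type": "string"}},
        "required": ["city"],
    },
}


def get_weather(city: str) -> str:
    """Stand-in implementation; a real tool would call an external API."""
    return f"21 C in {city}"


REGISTRY = {"get_weather": get_weather}


def dispatch_tool_call(model_output: str) -> str:
    """Parse a {"name": ..., "parameters": {...}} call and run the tool."""
    call = json.loads(model_output)
    func = REGISTRY[call["name"]]
    return func(**call["parameters"])


print(dispatch_tool_call('{"name": "get_weather", "parameters": {"city": "Paris"}}'))
```

A production dispatcher would also validate the arguments against the declared schema and handle unknown tool names or malformed JSON before executing anything.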
The model can also generate Python code for execution when the system prompt includes Environment: ipython. In this mode the model wraps code in a <|python_tag|> block, closes with the <|eom_id|> token to signal that execution should occur, and then continues after receiving the code output. This enables agentic workflows where the model iteratively writes, runs, and debugs code.[^6]
The Llama 3.1 prompt format added new special tokens compared to Llama 3:
- <|python_tag|>: Marks the beginning of a tool or code call.
- <|eom_id|>: End of message, indicating the model expects a tool result before the turn ends.
- ipython role for tool-result messages.

These tokens allow the model to clearly signal multi-step tool interactions within the existing turn-based conversation format.[^6]
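An application consuming raw model output can use these markers to decide what to do next. The sketch below assumes the special tokens are preserved verbatim in the decoded text, which depends on the inference stack; `parse_assistant_output` and the returned dictionary layout are illustrative, not part of any official API.

```python
import re

PYTHON_TAG = "<|python_tag|>"
END_OF_MESSAGE = "<|eom_id|>"

# Matches built-in calls such as brave_search.call(query="...")
BUILTIN_CALL = re.compile(r'^(?P<tool>\w+)\.call\(query="(?P<query>.*)"\)$', re.DOTALL)


def parse_assistant_output(text: str) -> dict:
    """Classify raw assistant output as plain text, a built-in tool call, or code."""
    expects_result = text.endswith(END_OF_MESSAGE)
    body = text[: -len(END_OF_MESSAGE)] if expects_result else text
    if not body.startswith(PYTHON_TAG):
        return {"kind": "text", "content": body, "expects_result": expects_result}
    payload = body[len(PYTHON_TAG):].strip()
    match = BUILTIN_CALL.match(payload)
    if match:
        return {
            "kind": "builtin_tool",
            "tool": match["tool"],
            "query": match["query"],
            "expects_result": expects_result,
        }
    return {"kind": "code", "content": payload, "expects_result": expects_result}


print(parse_assistant_output(
    '<|python_tag|>brave_search.call(query="llama 3.1 release date")<|eom_id|>'
))
```

When `expects_result` is true, the application would run the tool or code and send the output back in an ipython role message so the model can continue the turn.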
Meta evaluated all three Llama 3.1 models on a wide range of benchmarks and compared them against GPT-4o and Claude 3.5 Sonnet. The numbers below are taken from the Llama 3 Herd of Models paper, the Hugging Face announcement post, and Meta's official model cards.[^1][^2][^4]
| Benchmark | Llama 3.1 8B | Llama 3.1 70B | Llama 3.1 405B | GPT-4o | Claude 3.5 Sonnet |
|---|---|---|---|---|---|
| MMLU (5-shot) | 69.4 | 83.6 | 87.3 | 89.1 | 89.9 |
| MMLU (0-shot CoT) | 73.0 | 86.0 | 88.6 | 88.7 | 88.3 |
| MMLU-Pro (5-shot CoT) | 48.3 | 66.4 | 73.3 | 74.4 | 77.0 |
| ARC Challenge (0-shot) | 83.4 | 94.8 | 96.9 | 96.7 | 96.7 |
| HellaSwag (0-shot) | 82.1 | 88.0 | 89.2 | 95.3 | 89.0 |
| Benchmark | Llama 3.1 8B | Llama 3.1 70B | Llama 3.1 405B | GPT-4o | Claude 3.5 Sonnet |
|---|---|---|---|---|---|
| GSM8K (8-shot, CoT) | 84.5 | 95.1 | 96.8 | 96.1 | 96.4 |
| MATH (0-shot, CoT) | 51.9 | 68.0 | 73.8 | 76.6 | 71.1 |
| Benchmark | Llama 3.1 8B | Llama 3.1 70B | Llama 3.1 405B | GPT-4o | Claude 3.5 Sonnet |
|---|---|---|---|---|---|
| HumanEval (0-shot) | 72.6 | 80.5 | 89.0 | 90.2 | 92.0 |
| MBPP EvalPlus (0-shot) | 72.8 | 86.0 | 88.6 | 87.0 | 90.7 |
| Benchmark | Llama 3.1 8B | Llama 3.1 70B | Llama 3.1 405B | GPT-4o |
|---|---|---|---|---|
| ZeroSCROLLS/QuALITY | 81.0 | 90.5 | 95.2 | 90.5 |
| InfiniteBench/En.MC | 65.1 | 78.2 | 83.4 | 82.5 |
| Benchmark | Llama 3.1 8B | Llama 3.1 70B | Llama 3.1 405B | GPT-4o | Claude 3.5 Sonnet |
|---|---|---|---|---|---|
| MGSM (0-shot, CoT) | 68.9 | 86.9 | 91.6 | 90.5 | 91.6 |
The 405B notably surpasses GPT-4o on long-context tasks (ZeroSCROLLS, InfiniteBench); on multilingual math reasoning (MGSM) it edges out GPT-4o and matches Claude 3.5 Sonnet. The remaining gap on 5-shot MMLU is approximately 1.8 to 2.6 points; on HumanEval it is 1.2 to 3.0 points; on MATH the 405B trails GPT-4o by 2.8 points but leads Claude 3.5 Sonnet by 2.7. On reasoning-heavy benchmarks such as GSM8K and ARC Challenge, the 405B matches or slightly exceeds its closed-model counterparts.[^2][^13]
Note that two protocols for MMLU are reported in this article: 5-shot (the historical convention, giving 87.3 for the 405B) and 0-shot with chain-of-thought (88.6 for the 405B). Meta's blog post emphasized the 88.6 figure, while the paper and many downstream evaluations report the 5-shot 87.3. Both refer to the same model and reflect different evaluation conventions rather than different models.[^1][^2]
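The point gaps between the 405B and the closed models can be recomputed directly from the tables. The sketch below copies a subset of the scores from this article; the `SCORES` dictionary and `gap_to_405b` helper are illustrative names.

```python
# Scores copied from the benchmark tables in this article.
SCORES = {
    "MMLU (5-shot)": {"405B": 87.3, "GPT-4o": 89.1, "Claude 3.5 Sonnet": 89.9},
    "HumanEval (0-shot)": {"405B": 89.0, "GPT-4o": 90.2, "Claude 3.5 Sonnet": 92.0},
    "MATH (0-shot, CoT)": {"405B": 73.8, "GPT-4o": 76.6, "Claude 3.5 Sonnet": 71.1},
    "GSM8K (8-shot, CoT)": {"405B": 96.8, "GPT-4o": 96.1, "Claude 3.5 Sonnet": 96.4},
    "MGSM (0-shot, CoT)": {"405B": 91.6, "GPT-4o": 90.5, "Claude 3.5 Sonnet": 91.6},
}


def gap_to_405b(benchmark: str, rival: str) -> float:
    """Positive means the rival leads; negative means the 405B leads."""
    row = SCORES[benchmark]
    return round(row[rival] - row["405B"], 1)


for benchmark, row in SCORES.items():
    gaps = ", ".join(
        f"{rival}: {gap_to_405b(benchmark, rival):+.1f}"
        for rival in row
        if rival != "405B"
    )
    print(f"{benchmark}: {gaps}")
```

A positive value means the closed model scores higher than the 405B on that benchmark, and a negative value means the 405B is ahead.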
Llama 3.1 is released under the Llama 3.1 Community License Agreement, a custom commercial license written by Meta for this release. It is not an Open Source Initiative (OSI)-approved open-source license.[^16][^17] Key provisions are as follows.
Commercial use is allowed without restriction for the vast majority of organizations. Users may run, fine-tune, modify, and distribute the model weights. The license permits using model outputs to create, train, fine-tune, or otherwise improve other AI models — a meaningful liberalization relative to the Llama 2 license, which prohibited using Llama outputs to train competing models.[^16][^17] However, any AI model that is created, trained, or improved using Llama 3.1 outputs and is subsequently distributed must include "Llama" at the beginning of its name. This requirement applies to distillation, synthetic-data training, and any other workflow that uses Llama 3.1 outputs to shape another model.[^16][^17]
Organizations whose products or services had more than 700 million monthly active users (MAU) in the calendar month preceding the Llama 3.1 release date must obtain a separate license from Meta. Meta retains sole discretion over whether to grant such a license and on what terms.[^16][^17] This threshold was widely discussed at the time of release because it potentially affects companies such as Google, Microsoft, Amazon, Apple, ByteDance, and Tencent — organizations that might wish to embed Llama 3.1 in consumer products at very large scale.[^17]
Derivatives and products built on Llama 3.1 must include attribution in product documentation, on a related website, or in a relevant user interface, and must include a "Notice" text file with the language "Llama 3.1 is licensed under the Llama 3.1 Community License, Copyright © Meta Platforms, Inc. All Rights Reserved."[^16]
Use is also governed by Meta's separate Acceptable Use Policy, which prohibits use of the models for a range of harmful, illegal, or deceptive purposes, including weapons development, child sexual abuse material, and certain categories of high-stakes automated decision-making.[^16]
The license was broadly interpreted as more permissive than Llama 2's license, particularly in allowing the use of outputs to improve other models. Critics noted that the 700 million MAU threshold, the Llama-name requirement on distillation derivatives, and the absence of OSI approval together mean the model cannot technically be described as "open source" under the canonical Open Source Initiative definition; the more accurate term in legal scholarship is "open weights" or "source available."[^17][^18]
Meta released Llama Guard 3 alongside Llama 3.1. Llama Guard 3 is an 8B safety classifier built on the Llama 3.1 8B base. It classifies model inputs and outputs against a taxonomy of harm categories defined in collaboration with the MLCommons consortium. Llama Guard 3 supports all eight Llama 3.1 languages, making it the first multilingual safety classifier in the Llama Guard series. Meta reported that Llama Guard 3 outperformed GPT-4 on English, multilingual, and tool-use safety classification benchmarks while maintaining lower false-positive rates.[^1][^19]
A 1B variant (Llama Guard 3 1B) was released subsequently for deployments where running the full 8B safety model alongside a large inference model is too expensive.
Prompt Guard is an 86-million-parameter classifier released simultaneously with Llama 3.1. It is trained to detect prompt injection attacks and jailbreak attempts. The model was built on a multilingual DeBERTa backbone to support cross-language adversarial inputs. At 86M parameters, Prompt Guard can be run as a fast pre-filter on every user input before it reaches the main model, adding minimal latency overhead.[^20]
The Llama 3.1 release was accompanied by the announcement of Llama Stack, a specification for standardized interfaces across the components needed to build agentic applications.[^21] Llama Stack defines APIs for common building blocks including:
The goal was to reduce fragmentation in the ecosystem of frameworks built around Llama models, conceptually analogous to how ONNX provides a standard format for model exchange across machine-learning frameworks. Multiple providers — including Meta itself, AWS, NVIDIA, and Fireworks AI — subsequently released Llama Stack-compatible distributions. Llama Stack was positioned as a self-hostable alternative to OpenAI-compatible APIs for organizations that want to run their own inference infrastructure end-to-end.[^21]
The Llama 3.1 release received broad attention in the AI industry. The 405B's competitive benchmark scores against GPT-4o and Claude 3.5 Sonnet were widely reported as a milestone for open-weight models.[^1][^7][^8] The combination of frontier-class performance, open weights, and an updated license permitting distillation was interpreted by many commentators as a significant strategic move against proprietary AI providers.
Mark Zuckerberg's accompanying essay generated substantial discussion.[^5] He argued that open-source models would follow a trajectory similar to Linux, eventually becoming the default infrastructure choice for most developers. Critics noted that the 700 million MAU cap in the license was a meaningful constraint that distinguished the release from true open source and that Meta's interest in commoditizing the AI-model market aligned with its business interests in advertising infrastructure rather than direct AI services.[^17][^18]
Within hours of release, major cloud providers including Amazon Bedrock, Google Cloud Vertex AI, Microsoft Azure, and Oracle Cloud Infrastructure announced Llama 3.1 availability.[^1][^4] Over 25 inference and fine-tuning partners offered access on day one, making it one of the most broadly deployed open-weight models at the time of its release. The Hugging Face model card for the 405B accumulated one of the highest download counts of any model in that size range during the second half of 2024.[^4]
NVIDIA released an official FP8-quantized version of the 405B Instruct model in collaboration with Meta, and Neural Magic released alternative FP8 quantizations with different calibration strategies, demonstrating the ecosystem's rapid adaptation to the deployment challenges posed by a 405B model.[^14]
Industry analysts noted that Llama 3.1 raised the floor of what organizations could achieve without purchasing access to proprietary frontier models. The ability to run a GPT-4o-class model on owned infrastructure, with full data privacy and at lower inference cost, altered the calculus for enterprises evaluating build-versus-buy decisions for AI capabilities.[^7] Within months, derivative fine-tunes proliferated on Hugging Face, including domain-specialized variants for medicine, law, and coding; multi-language fine-tunes extending beyond the eight official languages; and a wide range of community-quantized variants targeting consumer-grade hardware.
Llama 3.2 was released on September 25, 2024, two months after Llama 3.1. It introduced two parallel tracks: vision-capable 11B and 90B models that retain the text capability of their Llama 3.1 counterparts while adding image understanding via an adapter-based vision encoder, and small text-only 1B and 3B variants engineered for on-device deployment on phones and edge devices.[^10] Llama 3.2 used the same 128K context window and the same eight supported languages.
Llama 3.3 was released on December 6, 2024. Unlike Llama 3.1 and 3.2, the Llama 3.3 release consisted of a single new model: a refined 70B Instruct variant. Meta reported that the new 70B approximated the Llama 3.1 405B's quality on instruction following (IFEval ~92.1) and mathematics (MATH ~77.0) at roughly one-fifth the inference cost.[^11] Llama 3.3 became the recommended replacement for Llama 3.1 70B in most production deployments and partially superseded the 405B for cost-sensitive applications, though the 405B remained the recommended choice for the most demanding general-knowledge and reasoning tasks.
Llama 4 marked a major architectural inflection point in the Llama family.[^12] Released in early April 2025, Llama 4 introduced mixture-of-experts (MoE) routing for the first time in an open-weight Llama model, along with native multimodality (image and video understanding alongside text). The initial release comprised Llama 4 Scout (17B active parameters across 16 experts, optimized for single-H100 deployment) and Llama 4 Maverick (17B active parameters across 128 experts, totaling roughly 400B parameters), with a two-trillion-parameter Llama 4 Behemoth (288B active across 16 experts) still in training at the announcement.[^12] Llama 4 also dramatically extended the context window, with Scout reaching a 10-million-token context length. With Llama 4, Meta departed from the dense decoder-only design that characterized Llama 1 through Llama 3.3.
As of mid-2026, Llama 3.1 remains one of the most widely used open-weight model families in production, particularly the 8B and 70B variants. The 405B retains a niche but important role as a research and benchmark reference point and as a teacher model for synthetic-data generation workflows. For the 70B size class, however, most production deployments have migrated to Llama 3.3 70B, which delivers stronger capability at the same serving cost through its post-training improvements. For applications requiring multimodality, agentic MoE inference, or extreme long-context handling beyond 128K tokens, users have largely moved to Llama 3.2 (for vision) and Llama 4 (for multimodal MoE).
The 8B variant continues to be heavily used for on-device deployment and as the base model for a large number of community fine-tunes. Its combination of strong English instruction following, eight-language coverage, and ability to run on a single consumer GPU with 4-bit quantization made it the de facto default for small open-weight model applications throughout 2024 and 2025.
Llama 3.1's most lasting contribution may be its role in establishing that open-weight models can credibly compete with frontier closed systems. The 405B demonstrated, for the first time in publicly available weights, that the gap between open and closed at the very top of the capability frontier was not structural but rather a matter of compute budget and engineering investment. Subsequent open-weight efforts built on this foundation, and the Llama 3.1 Community License's 700M MAU cap and "Llama"-name requirement on distillation derivatives set a template that most subsequent commercially permissive open-weight licenses have followed in some form.