Llama 3.1

Llama 3.1 is a family of open-weight large language models released by Meta on July 23, 2024.[^1] The release comprises three model sizes — 8 billion, 70 billion, and 405 billion parameters — each available in both pre-trained (base) and instruction-tuned (Instruct) variants.[^1][^2] Llama 3.1 built directly on the foundation of Llama 3 with four major expansions: a 128,000-token context window (extended sixteen-fold from the 8,192-token limit in Llama 3), formal multilingual support across eight languages, native tool-calling and function-calling capabilities, and — most consequentially — the 405B variant, which Meta described as "the world's largest and most capable openly available foundation model" and the first open-weight model to be widely benchmarked as competitive with closed frontier systems such as GPT-4o and Claude 3.5 Sonnet.[^1][^2]

The 405B model was trained on more than 15 trillion tokens using over 16,000 NVIDIA H100 GPUs in a single roughly two-month pre-training run, making it one of the most compute-intensive open-weight training runs publicly disclosed at the time.[^1][^2][^3] Meta simultaneously released the Llama Stack specification for agentic application development and a suite of safety classifiers under the Llama Guard 3 and Prompt Guard names.[^1] All models were made available for download via Hugging Face and through more than 25 commercial cloud and inference partners on launch day, including AWS, Microsoft Azure, Google Cloud, NVIDIA, Databricks, Groq, and Dell.[^1][^4]

The accompanying technical paper, "The Llama 3 Herd of Models" (arXiv:2407.21783), documents both the April 2024 Llama 3 release and the July 2024 Llama 3.1 release as a single research program.[^2] Mark Zuckerberg paired the launch with an open letter, "Open Source AI is the Path Forward," arguing that open-weight foundation models would follow a trajectory analogous to Linux displacing proprietary Unix in server infrastructure.[^5]

Background

From Llama 3 to Llama 3.1

Meta's first Llama models appeared in February 2023 as compact, research-focused weights distributed under a non-commercial license. Llama 2 followed in July 2023 with an explicit commercial license and substantially better performance. Llama 3, released in April 2024 in 8B and 70B variants, brought a 128,000-token vocabulary, substantially higher benchmark scores, and a new pre-training pipeline based on 15 trillion tokens of multilingual web data. Despite these advances, Llama 3 shipped with three significant ecosystem limitations: an 8,192-token context window, only English instruction-tuning at production quality, and no built-in tool-calling protocol.[^1][^2]

Llama 3.1 addressed all three of those gaps in a single release. The context limit was extended by a factor of sixteen, reaching 128,000 tokens via continued pre-training with positional-embedding adjustments.[^2] Multilingual instruction-tuning was formalized across eight languages. Tool calling, function calling, and a code-interpreter mode were incorporated directly into the Instruct variants using a documented JSON interface and a small set of new special tokens.[^6] And the 405B variant — entirely new for this generation — pushed Meta into direct competition with frontier closed models for the first time in the company's open-weight history.[^1][^2]

Why the 405B Mattered for the Open Ecosystem

Until July 2024, open-weight large language models trailed leading closed systems on most public benchmarks. Mistral Large, Command R+, Qwen2-72B, and the original Llama 3 70B were collectively a meaningful step behind GPT-4-class systems. The Llama 3.1 405B — with benchmark scores within a few percentage points of GPT-4o and Claude 3.5 Sonnet — was widely characterized as the first time an organization had released a frontier-tier model with downloadable weights and a commercially usable license.[^1][^2][^7] Patrick McGuinness called it "the first time an open model is being released that matches the closed frontier."[^8] Louie Peters estimated that the 405B training run alone cost approximately $60 million in compute, on top of an estimated $850 million in capital expenditure for the underlying GPU cluster.[^9]

Mark Zuckerberg used the Llama 3.1 launch to make an extended argument for open-source AI, comparing the trajectory he expected for open models to the history of Linux displacing proprietary Unix in server infrastructure. He wrote that open source "will ensure that more people around the world have access to the benefits and opportunities of AI, that power isn't concentrated in the hands of a small number of companies, and that the technology can be deployed more evenly and safely across society."[^5]

Context in the Broader Llama Family

Llama 3.1 sits between Llama 3 (April 2024) and Llama 3.2 (September 2024) in Meta's release timeline. Llama 3.2 added vision capabilities to 11B and 90B variants and introduced small text-only edge models at 1B and 3B parameters.[^10] Llama 3.3, released December 6, 2024, revisited the 70B scale with refined post-training and was reported by Meta to approximate Llama 3.1 405B-class quality on instruction following and mathematics at roughly one-fifth the inference cost.[^11] Llama 4 (April 2025) marked a major architectural inflection point: Meta shifted to a mixture-of-experts architecture with native multimodality, releasing Llama 4 Scout and Llama 4 Maverick as the first open-weight Llama models to use MoE, with a two-trillion-parameter Llama 4 Behemoth still in training as of the announcement.[^12] Within this progression, Llama 3.1 represents the high point of Meta's dense decoder-only era and the point at which open weights first achieved measured parity with closed frontier systems.[^1][^2][^12]

Model Variants

Llama 3.1 was released in three sizes, each available in both a pre-trained base model and an instruction-tuned variant fine-tuned for chat, tool use, and multilingual instruction following.[^1]

8B

The 8B model is the smallest and most accessible variant. In FP16 it occupies approximately 16 GB of GPU memory, fitting comfortably on a single high-end consumer GPU such as an NVIDIA RTX 4090 or a data-center-class A10G. With 4-bit quantization the memory footprint drops to approximately 4 GB, enabling deployment on integrated and consumer-grade hardware.[^4] The 8B Instruct variant supports the full 128K-token context window, the standard tool-calling protocol, and the eight officially supported instruction languages.

Despite its size, the 8B Llama 3.1 showed meaningful improvements over its Llama 3 8B predecessor: MMLU rose from approximately 66.6 to 69.4 (5-shot), and GSM8K math reasoning improved from 79.6 to 84.5.[^2] At this scale, the model competes with 13B- and 34B-class models from earlier open-source generations, making it suitable for memory-constrained deployments and on-device applications.

70B

The 70B model represents the mid-tier option and is widely considered the most practical of the three for organizations that need strong capability without the infrastructure demands of the 405B. In FP16, the model requires approximately 140 GB of GPU memory, necessitating at least two 80 GB A100 or H100 GPUs or a comparable multi-GPU configuration. With AWQ 4-bit quantization the requirement drops to roughly 35 GB, enabling deployment on two consumer-grade 24 GB GPUs.[^4]

The 70B Instruct variant scored 83.6 on MMLU (5-shot), 95.1 on GSM8K (8-shot CoT), and 80.5 on HumanEval (0-shot), placing it well ahead of GPT-3.5-class systems and competitive with earlier frontier models such as GPT-4-0125-Preview on many tasks.[^2] On the ZeroSCROLLS/QuALITY long-context benchmark, Llama 3.1 70B Instruct scored 90.5 — matching GPT-4o exactly on that metric.[^2]

405B

The 405B is the flagship model and the primary reason for the Llama 3.1 release's significance. Meta described it as "the first openly available model that rivals the top AI models" in general knowledge, math reasoning, tool use, and multilingual translation.[^1] In full BF16 precision, the model weights alone occupy approximately 810 GB, slightly more than the 800 GB of combined memory on ten 80 GB H100 GPUs, so BF16 serving spans more than one 8-GPU node.[^4] Meta's official FP8-quantized variant reduces this to approximately 405 GB, fitting within a single 8-way H100 node; this was a deliberate engineering target intended to make the model deployable on a single industry-standard server.[^1][^4]

Benchmark performance places the 405B within a few percentage points of GPT-4o and Claude 3.5 Sonnet on most tasks. On GSM8K math reasoning the 405B scored 96.8, slightly above GPT-4o's 96.1. On ARC Challenge it scored 96.9, essentially matching GPT-4o's 96.7. On HumanEval code generation it scored 89.0, modestly below GPT-4o's 90.2 and Claude 3.5 Sonnet's 92.0.[^2][^13]

Meta's FP8 quantization was applied specifically to the major linear operators of the model, covering the gate, up, and down projections in the feed-forward networks. These components account for approximately 75% of the model's inference FLOPs. The quantization produces minimal accuracy degradation relative to BF16 on standard evaluations while halving the memory footprint, with NVIDIA and Neural Magic each subsequently releasing alternative FP8 calibrations with slightly different trade-offs.[^4][^14]
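The weight-memory figures quoted for the three sizes follow from multiplying the parameter count by the bytes stored per parameter. The sketch below reproduces that arithmetic. It assumes the nominal parameter counts (8, 70, and 405 billion) and decimal gigabytes, and it ignores activations, the KV cache, and quantization metadata, so it is an approximation rather than a deployment sizing tool. The function and dictionary names are illustrative.

```python
# Approximate weight memory: parameters x bytes per parameter.
# Assumption: nominal parameter counts; real checkpoints differ slightly.
BYTES_PER_PARAM = {"16bit": 2.0, "fp8": 1.0, "4bit": 0.5}
PARAMS_BILLIONS = {"8B": 8, "70B": 70, "405B": 405}


def weight_memory_gb(params_billions: float, precision: str) -> float:
    """Return approximate weight memory in decimal gigabytes."""
    # (params_billions * 1e9 params) * bytes / 1e9 bytes-per-GB
    return params_billions * BYTES_PER_PARAM[precision]


for name, params in PARAMS_BILLIONS.items():
    sizes = {p: weight_memory_gb(params, p) for p in BYTES_PER_PARAM}
    print(name, sizes)
```

Under these assumptions the 405B comes out at 810 GB in 16-bit precision and 405 GB in FP8, and the 70B at 35 GB with 4-bit weights, in line with the figures above.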

Architecture

Llama 3.1 uses a dense, decoder-only Transformer architecture across all three sizes. Meta deliberately chose not to adopt a mixture-of-experts design for this release, citing training stability and post-training simplicity as the primary reasons.[^2] The architectural specifications are summarized in the table below.

Parameter	8B	70B	405B
Transformer layers	32	80	126
Model dimension	4,096	8,192	16,384
FFN hidden dimension	14,336	28,672	53,248
Attention heads	32	64	128
Key-value heads (GQA)	8	8	8
Context window	128,000	128,000	128,000
Vocabulary size	128,000	128,000	128,000
Peak learning rate	3 × 10⁻⁴	1.5 × 10⁻⁴	8 × 10⁻⁵

Grouped-Query Attention

All three variants use Grouped-Query Attention (GQA) with exactly 8 key-value heads per layer.[^2] GQA reduces the memory required by the key-value cache during inference, which is particularly important at 128K context lengths where the KV cache would otherwise grow very large. For the 405B at full 128K context, the KV cache alone is reported to require approximately 123 GB in FP16, a substantial addition to the 405 GB to 810 GB already needed for the model weights. Standardizing on GQA across all three sizes, rather than using multi-head attention at smaller scales, also enables the same inference kernels to be used across variants, simplifying the deployment surface.[^4]
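A rough way to see the effect of GQA is to compute the cache size directly: each layer stores two tensors (keys and values), each holding one vector of head dimension per key-value head per token. The sketch below applies that formula to the values in the architecture table. It assumes 2 bytes per element, a 128,000-token sequence, batch size 1, and a head dimension equal to the model dimension divided by the number of attention heads; the function name is illustrative. Under these assumptions the 8B result is about 15.6 GiB. The 405B result with 8 key-value heads is about 61.5 GiB, while the roughly 123 GB figure quoted above corresponds to the same formula evaluated with 16 key-value heads, so quoted cache sizes should be checked against the configuration they assume.

```python
def kv_cache_gib(layers, model_dim, attn_heads, kv_heads, seq_len, bytes_per_elem=2):
    """Approximate KV cache size in GiB for one sequence."""
    head_dim = model_dim // attn_heads
    # Factor 2: one key tensor and one value tensor per layer.
    total_bytes = 2 * layers * kv_heads * head_dim * seq_len * bytes_per_elem
    return total_bytes / 2**30


# (layers, model dimension, attention heads) from the architecture table.
CONFIGS = {"8B": (32, 4096, 32), "70B": (80, 8192, 64), "405B": (126, 16384, 128)}

for name, (layers, dim, heads) in CONFIGS.items():
    gqa = kv_cache_gib(layers, dim, heads, kv_heads=8, seq_len=128_000)
    mha = kv_cache_gib(layers, dim, heads, kv_heads=heads, seq_len=128_000)
    print(f"{name}: GQA {gqa:.1f} GiB, full multi-head {mha:.1f} GiB")
```

The comparison with `kv_heads=heads` shows what the cache would cost if every attention head kept its own keys and values: 4 times more for the 8B and 16 times more for the 405B.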

Rotary Position Embeddings and Long-Context Extension

Llama 3.1 uses Rotary Position Embeddings (RoPE) with a base frequency of 500,000.[^2] This base is substantially higher than the value of 10,000 typically used in earlier Transformer models, and the higher value helps maintain stable attention patterns at 128K-token distances. The base models were initially trained at an 8K context window and then extended through six stages of continued pre-training, with each stage increasing the context length (a sixteen-fold increase overall, from 8K to 128K) and incorporating data specifically curated to require long-range attention. Approximately 800 billion additional tokens were processed during these long-context extension stages.[^2]
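The effect of the larger base can be illustrated by computing the rotation wavelengths directly. In the standard RoPE formulation, dimension pair i of a head with dimension d rotates with angular frequency base^(-2i/d), so its wavelength in tokens is 2π · base^(2i/d). The sketch below assumes a head dimension of 128 (model dimension divided by attention heads in the table above) and compares bases of 10,000 and 500,000. It omits any additional frequency scaling that a particular implementation may apply, and the function name is illustrative.

```python
import math


def rope_wavelengths(base, head_dim=128):
    """Wavelength in tokens of each rotary dimension pair (standard RoPE)."""
    return [2 * math.pi * base ** (2 * i / head_dim) for i in range(head_dim // 2)]


for base in (10_000, 500_000):
    longest = rope_wavelengths(base)[-1]
    print(f"base={base:>7}: longest wavelength ~ {longest:,.0f} tokens")
```

With a base of 10,000 the slowest-rotating pair completes a full turn in roughly 54,000 tokens, which is shorter than a 128K window, so distant positions can alias. With a base of 500,000 the longest wavelength exceeds two million tokens.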

SwiGLU Activation

The feed-forward networks use SwiGLU activation, a gated linear unit variant that generally produces better training dynamics than standard ReLU or GELU activations. This choice is consistent with Llama 3 and with many other recent large language models including PaLM, Mistral, and Qwen.[^2]

Tokenizer

Llama 3.1 uses the same 128,000-token vocabulary as Llama 3. The vocabulary was constructed by combining 100,000 tokens from OpenAI's tiktoken cl100k_base tokenizer with 28,000 additional tokens added to improve coverage of non-English languages.[^2][^4] The expanded vocabulary improves tokenization efficiency for languages such as Hindi, Thai, and Arabic, reducing the average number of tokens required to represent a given passage by roughly 30% on multilingual benchmarks relative to a purely English-derived vocabulary.

Training

Pre-training Data

The Llama 3.1 models were pre-trained on over 15 trillion tokens drawn from a multilingual web corpus.[^1][^2] The dataset was assembled with several filtering stages. Hashing-based and MinHash deduplication removed exact and near-duplicate content. A heuristic and classifier-based quality filter retained high-quality text, drawing on a RoBERTa-based classifier trained on human-annotated quality labels. Domain-specific data for code and mathematics was upsampled relative to its natural frequency in the web corpus, since the paper reports that this upsampling materially improved downstream reasoning benchmarks.[^2]
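To make the near-duplicate step concrete, the following is a minimal MinHash sketch. It is not Meta's pipeline: the shingle size, the number of hash functions, the use of salted BLAKE2b, and all function names are assumptions chosen for illustration. The idea is that the fraction of matching signature positions estimates the Jaccard similarity between two documents' shingle sets, so near-duplicates can be flagged without comparing full texts.

```python
import hashlib


def shingles(text: str, n: int = 3) -> set[str]:
    """Split text into overlapping n-word shingles."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(max(1, len(words) - n + 1))}


def minhash_signature(items: set[str], num_hashes: int = 64) -> list[int]:
    """For each salted hash function, keep the minimum hash over all items."""
    signature = []
    for seed in range(num_hashes):
        salt = seed.to_bytes(8, "little")  # a different salt acts as a different hash function
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(item.encode(), digest_size=8, salt=salt).digest(), "big"
            )
            for item in items
        ))
    return signature


def estimated_jaccard(sig_a: list[int], sig_b: list[int]) -> float:
    """Fraction of matching positions estimates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


doc_a = "the quick brown fox jumps over the lazy dog near the river bank today"
doc_b = "the quick brown fox jumps over the lazy dog near the river bank yesterday"
doc_c = "large language models are trained on trillions of tokens of web text"
sig_a, sig_b, sig_c = (minhash_signature(shingles(d)) for d in (doc_a, doc_b, doc_c))
print("a vs b:", estimated_jaccard(sig_a, sig_b))  # near-duplicates: high estimate
print("a vs c:", estimated_jaccard(sig_a, sig_c))  # unrelated: estimate near zero
```

A production deduplication system would typically add locality-sensitive hashing on top of the signatures so that candidate pairs can be found without comparing every pair of documents.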

Approximately 8% of the pre-training tokens were non-English, covering the eight languages that the Instruct variants subsequently support along with broader linguistic diversity at the base-model level. The knowledge cutoff for the pre-training data is December 2023.[^2]

Pre-training Infrastructure

The 405B model was trained on over 16,000 NVIDIA H100 GPUs in a single cluster operated by Meta.[^1][^2] Published analyses indicate that pre-training took approximately 54 days for the 405B variant, with a total accumulated compute budget of 39.3 million GPU-hours across all three model sizes.[^15] Tom's Hardware, citing the published Llama 3 paper, reported that during the 54-day training period the cluster experienced 419 unexpected component failures — roughly one every three hours — with faulty GPUs and HBM3 memory modules collectively responsible for about half of those failures.[^15] To run training reliably at this scale, Meta developed custom optimizations across the network, storage, and compute stack, including changes to NCCL collective operations and a custom checkpoint-restart pipeline.[^2]

The pre-training precision was BF16. To enable production-scale inference at reasonable cost, Meta separately developed FP8 quantization of the final weights, reducing storage and memory requirements by approximately half without meaningful degradation on standard benchmarks.[^4]

Post-training

Post-training followed a multi-stage pipeline of supervised fine-tuning (SFT) followed by preference optimization. For the preference stage Meta used Direct Preference Optimization (DPO) rather than the Proximal Policy Optimization (PPO) approach traditionally used in RLHF.[^2] Meta reports generating more than 25 million synthetic training examples for the post-training pipeline using a combination of earlier Llama models, human annotation, and rejection sampling against reward models.[^2]

Post-training addressed four areas beyond general instruction following:

Long-context handling: The Instruct models were fine-tuned on examples requiring comprehension and generation across the full 128K window, including multi-document summarization, long code files, and extended conversations.
Tool use: JSON-based tool definitions and multi-turn tool-call sequences were incorporated into post-training data, including data for the two built-in tools (Brave Search and Wolfram Alpha) and for generic custom-tool definitions.
Multilingual alignment: Preference data in all eight supported languages was used to align Instruct models across languages rather than only in English.
Safety: The post-training pipeline included safety-specific fine-tuning informed by red-teaming exercises across multiple harm categories, in collaboration with the MLCommons consortium.[^2]

The complete post-training pipeline went through several iterations of rejection sampling, SFT, and DPO. The paper reports applying these methods over six iterative rounds, with each round progressively improving capability, safety, and alignment on the eight target languages.[^2]
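For readers unfamiliar with DPO, the standard objective can be written in a few lines. The sketch below computes the per-example loss from sequence log-probabilities under the policy being trained and under a frozen reference model. The β value and the function name are illustrative, and the code shows the textbook objective rather than any Meta-specific variant or training code.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) response pair."""
    # How much more the policy favors each response than the reference does.
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(margin)) written as a numerically stable softplus(-margin).
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


# Policy identical to the reference: margin is 0 and the loss is log(2).
print(dpo_loss(-10.0, -12.0, -10.0, -12.0))
# Policy raises the chosen response and lowers the rejected one: loss falls.
print(dpo_loss(-8.0, -14.0, -10.0, -12.0))
```

Because the loss depends only on log-probability ratios against the reference model, no separate reward model is needed inside the optimization loop, which is the main practical difference from PPO-based RLHF.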

Long Context (128K) and How It Was Achieved

The extension from 8,192 tokens in Llama 3 to 128,000 tokens in Llama 3.1 is one of the release's most consequential changes. A 128K context window can accommodate approximately 96,000–100,000 words of English text in a single input. This enables several use cases that were impractical with shorter windows:[^2][^4]

Entire software codebases or large multi-file projects can be analyzed in a single pass.
Long legal documents, academic papers, or financial filings can be processed without chunking.
Conversations running to hundreds of turns can be maintained in context without summarization losses.
Multi-document retrieval-augmented generation (RAG) pipelines can include many source documents simultaneously, reducing the risk of relevant context being dropped during chunking.

Meta achieved the 128K extension through a combination of architectural and training-time changes. The RoPE base frequency was increased to 500,000, which substantially expands the effective wavelengths used by the positional encoding and reduces the rotation rate that earlier RoPE configurations applied to long-distance attention pairs.[^2] On top of this architectural change, Meta performed six progressive stages of continued pre-training. Each stage increased the maximum sequence length used during training, moving from 8K to 128K overall, and was conducted on a corpus of long-document data specifically curated for that stage. The cumulative continued-pre-training token count was approximately 800 billion tokens.[^2]

The KV cache requirements scale linearly with context length. For the 405B at 128K tokens, the cache requires approximately 123 GB, which must be added to the weight memory. In practice this means serving the 405B at full 128K context requires careful memory planning across multiple H100 nodes. For the 8B model, the 128K KV cache requires approximately 15.6 GB, manageable alongside the 16 GB weight footprint on a dual-GPU setup or a single 32 GB or larger accelerator.[^4]

Multilingual Support

Llama 3.1's Instruct variants officially support eight languages: English, French, German, Hindi, Italian, Portuguese, Spanish, and Thai.[^1] The selection was driven by a combination of speaker population, partner demand, and availability of high-quality instruction data. Multilingual support is a meaningful capability upgrade over Llama 3, which was tuned primarily for English.

Meta evaluated multilingual capability on benchmarks including MGSM (Multilingual Grade-School Math), translation benchmarks across the eight supported languages, and dedicated multilingual MMLU variants. The 405B Instruct scored 91.6 on MGSM (0-shot CoT), matching Claude 3.5 Sonnet exactly and edging out GPT-4o's 90.5.[^2] The expanded 128,000-token vocabulary contributes to multilingual capability by tokenizing non-English languages — especially Hindi and Thai, which use non-Latin scripts — more efficiently than English-derived tokenizers, reducing the per-token computational cost of generating in those languages.[^2][^4]

Outside the eight officially supported languages, the base models contain meaningful capability across a much broader set of languages owing to the diversity of the pre-training corpus, but Meta cautions that Instruct-model quality drops substantially outside the officially supported set.[^1]

Tool Use and Function Calling

Llama 3.1 was the first Llama generation to include native tool calling as a first-class capability of the Instruct models.[^6] The Instruct variants support three modes of tool interaction.

Built-in Tools

Two tools are built into the Instruct models through post-training and can be activated via the system prompt:

Brave Search: Enables web-search queries. Activated with the system-prompt tag Tools: brave_search. The model calls the tool using Python-like syntax: brave_search.call(query="...").
Wolfram Alpha: Enables mathematical and symbolic computation. Activated with Tools: wolfram_alpha. Called as wolfram_alpha.call(query="...").

These built-in tools are triggered in an ipython-style environment context. The model generates a tool call, receives the result via an ipython role message, and then continues reasoning with the result incorporated into the dialogue.[^6]
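An application hosting the model has to recognize these calls in the generated text and route them to a real search or computation backend. The sketch below parses the call syntax described above with a regular expression. The `parse_builtin_call` helper is hypothetical, and the pattern assumes a single double-quoted `query` argument, which may not cover every output the model produces.

```python
import re

# Matches e.g. brave_search.call(query="...") or wolfram_alpha.call(query="...").
BUILTIN_CALL = re.compile(
    r'^\s*(brave_search|wolfram_alpha)\.call\(query="(.*)"\)\s*$', re.DOTALL
)


def parse_builtin_call(text: str):
    """Return (tool_name, query) if text is a built-in tool call, else None."""
    match = BUILTIN_CALL.match(text)
    if match is None:
        return None
    return match.group(1), match.group(2)


print(parse_builtin_call('brave_search.call(query="current weather in Menlo Park")'))
print(parse_builtin_call("The weather is sunny."))
```

The host would then run the query against its own backend and return the result to the model in an ipython role message.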

JSON-Based Custom Tools

Developers can define arbitrary custom tools in the system prompt using a JSON-schema format similar to the OpenAI function-calling specification. The model parses the tool definitions and generates calls in the form {"name": "function_name", "parameters": {"arg_name": "value"}}. This permits integration with any external API or service.[^6]
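On the application side, handling such a call amounts to parsing the JSON and dispatching to a registered function. The sketch below assumes the model output is exactly one JSON object in the form shown above. The `get_weather` tool, the `TOOLS` registry, and `dispatch_tool_call` are hypothetical names introduced for illustration, not part of any Llama library.

```python
import json


def get_weather(city: str) -> str:
    """Hypothetical tool; a real one would call an external API."""
    return f"Sunny in {city}"


# Registry mapping tool names (as advertised in the system prompt) to callables.
TOOLS = {"get_weather": get_weather}


def dispatch_tool_call(model_output: str) -> str:
    """Parse a {"name": ..., "parameters": {...}} call and run the matching tool."""
    call = json.loads(model_output)
    name = call["name"]
    if name not in TOOLS:
        raise ValueError(f"Model requested unknown tool: {name}")
    return TOOLS[name](**call["parameters"])


print(dispatch_tool_call('{"name": "get_weather", "parameters": {"city": "Paris"}}'))
# prints: Sunny in Paris
```

Checking the name against an explicit registry, rather than looking functions up dynamically, keeps the model from invoking anything the developer did not deliberately expose.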

Code Interpreter

The model can also generate Python code for execution when the system prompt includes Environment: ipython. In this mode the model wraps code in a <|python_tag|> block, closes with the <|eom_id|> token to signal that execution should occur, and then continues after receiving the code output. This enables agentic workflows where the model iteratively writes, runs, and debugs code.[^6]

Special Tokens for Tool Calls

The Llama 3.1 prompt format added two special tokens and one new message role compared to Llama 3:

<|python_tag|>: Marks the beginning of a tool or code call.
<|eom_id|>: End of message, indicating the model expects a tool result before the turn ends.
The ipython role for tool-result messages.

These tokens allow the model to clearly signal multi-step tool interactions within the existing turn-based conversation format.[^6]
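A host application can use these markers to decide what to do with each generated message. The sketch below classifies raw generated text using only the two tokens listed above. The `classify_generation` helper and its return structure are assumptions for illustration, and a production integration would normally inspect token IDs from the tokenizer rather than match strings.

```python
from typing import NamedTuple

PYTHON_TAG = "<|python_tag|>"
EOM = "<|eom_id|>"


class Generation(NamedTuple):
    is_tool_call: bool        # message began with <|python_tag|>
    awaits_tool_result: bool  # message ended with <|eom_id|>
    body: str                 # text with the markers stripped


def classify_generation(raw: str) -> Generation:
    """Strip the tool-related markers and report which ones were present."""
    text = raw.strip()
    awaits = text.endswith(EOM)
    if awaits:
        text = text[: -len(EOM)]
    is_call = text.startswith(PYTHON_TAG)
    if is_call:
        text = text[len(PYTHON_TAG):]
    return Generation(is_call, awaits, text.strip())


print(classify_generation('<|python_tag|>wolfram_alpha.call(query="solve x^2 = 4")<|eom_id|>'))
print(classify_generation("The answer is x = 2 or x = -2."))
```

When `awaits_tool_result` is true, the host executes the call or code in `body` and appends the output as an ipython role message before asking the model to continue.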

Benchmarks

Meta evaluated all three Llama 3.1 models on a wide range of benchmarks and compared them against GPT-4o and Claude 3.5 Sonnet. The numbers below are taken from the Llama 3 Herd of Models paper, the Hugging Face announcement post, and Meta's official model cards.[^1][^2][^4]

General Knowledge and Reasoning

Benchmark	Llama 3.1 8B	Llama 3.1 70B	Llama 3.1 405B	GPT-4o	Claude 3.5 Sonnet
MMLU (5-shot)	69.4	83.6	87.3	89.1	89.9
MMLU (0-shot CoT)	73.0	86.0	88.6	88.7	88.3
MMLU-Pro (5-shot CoT)	48.3	66.4	73.3	74.4	77.0
ARC Challenge (0-shot)	83.4	94.8	96.9	96.7	96.7
HellaSwag (0-shot)	82.1	88.0	89.2	95.3	89.0

Mathematics

Benchmark	Llama 3.1 8B	Llama 3.1 70B	Llama 3.1 405B	GPT-4o	Claude 3.5 Sonnet
GSM8K (8-shot, CoT)	84.5	95.1	96.8	96.1	96.4
MATH (0-shot, CoT)	51.9	68.0	73.8	76.6	71.1

Code

Benchmark	Llama 3.1 8B	Llama 3.1 70B	Llama 3.1 405B	GPT-4o	Claude 3.5 Sonnet
HumanEval (0-shot)	72.6	80.5	89.0	90.2	92.0
MBPP EvalPlus (0-shot)	72.8	86.0	88.6	87.0	90.7

Long Context

Benchmark	Llama 3.1 8B	Llama 3.1 70B	Llama 3.1 405B	GPT-4o
ZeroSCROLLS/QuALITY	81.0	90.5	95.2	90.5
InfiniteBench/En.MC	65.1	78.2	83.4	82.5

Multilingual

Benchmark	Llama 3.1 8B	Llama 3.1 70B	Llama 3.1 405B	GPT-4o	Claude 3.5 Sonnet
MGSM (0-shot, CoT)	68.9	86.9	91.6	90.5	91.6

The 405B surpasses GPT-4o on the long-context tasks (ZeroSCROLLS, InfiniteBench); on multilingual math reasoning (MGSM) it exceeds GPT-4o by 1.1 points and matches Claude 3.5 Sonnet. On 5-shot MMLU the 405B trails the two closed models by 1.8 to 2.6 points, and on HumanEval by 1.2 to 3.0 points. On MATH it trails GPT-4o by 2.8 points but leads Claude 3.5 Sonnet by 2.7 points. On GSM8K and ARC Challenge, the 405B matches or slightly exceeds its closed-model counterparts.[^2][^13]

Note that two protocols for MMLU are reported in this article: 5-shot (the historical convention, giving 87.3 for the 405B) and 0-shot with chain-of-thought (88.6 for the 405B). Meta's blog post emphasized the 88.6 figure, while the paper and many downstream evaluations report the 5-shot 87.3. Both refer to the same model and reflect different evaluation conventions rather than different models.[^1][^2]

Llama 3.1 Community License

Llama 3.1 is released under the Llama 3.1 Community License Agreement, a custom commercial license written by Meta for this release. It is not an Open Source Initiative (OSI)-approved open-source license.[^16][^17] Key provisions are as follows.

Permitted Uses

Commercial use is allowed without restriction for the vast majority of organizations. Users may run, fine-tune, modify, and distribute the model weights. The license permits using model outputs to create, train, fine-tune, or otherwise improve other AI models — a meaningful liberalization relative to the Llama 2 license, which prohibited using Llama outputs to train competing models.[^16][^17] However, any AI model that is created, trained, or improved using Llama 3.1 outputs and is subsequently distributed must include "Llama" at the beginning of its name. This requirement applies to distillation, synthetic-data training, and any other workflow that uses Llama 3.1 outputs to shape another model.[^16][^17]

700M MAU Scale Restriction

Organizations whose products or services had more than 700 million monthly active users (MAU) in the calendar month preceding the Llama 3.1 release date must obtain a separate license from Meta. Meta retains sole discretion over whether to grant such a license and on what terms.[^16][^17] This threshold was widely discussed at the time of release because it potentially affects companies such as Google, Microsoft, Amazon, Apple, ByteDance, and Tencent — organizations that might wish to embed Llama 3.1 in consumer products at very large scale.[^17]

Attribution

Derivatives and products built on Llama 3.1 must include attribution in product documentation, on a related website, or in a relevant user interface, and must include a "Notice" text file with the language "Llama 3.1 is licensed under the Llama 3.1 Community License, Copyright © Meta Platforms, Inc. All Rights Reserved."[^16]

Acceptable Use Policy

Use is also governed by Meta's separate Acceptable Use Policy, which prohibits use of the models for a range of harmful, illegal, or deceptive purposes, including weapons development, child sexual abuse material, and certain categories of high-stakes automated decision-making.[^16]

The license was broadly interpreted as more permissive than Llama 2's license, particularly in allowing the use of outputs to improve other models. Critics noted that the 700 million MAU threshold, the Llama-name requirement on distillation derivatives, and the absence of OSI approval together mean the model cannot technically be described as "open source" under the canonical Open Source Initiative definition; the more accurate term in legal scholarship is "open weights" or "source available."[^17][^18]

Safety Models Released Alongside

Llama Guard 3

Meta released Llama Guard 3 alongside Llama 3.1. Llama Guard 3 is an 8B safety classifier built on the Llama 3.1 8B base. It classifies model inputs and outputs against a taxonomy of harm categories defined in collaboration with the MLCommons consortium. Llama Guard 3 supports all eight Llama 3.1 languages, making it the first multilingual safety classifier in the Llama Guard series. Meta reported that Llama Guard 3 outperformed GPT-4 on English, multilingual, and tool-use safety classification benchmarks while maintaining lower false-positive rates.[^1][^19]

A 1B variant (Llama Guard 3 1B) was released subsequently for deployments where running the full 8B safety model alongside a large inference model is too expensive.

Prompt Guard

Prompt Guard is an 86-million-parameter classifier released simultaneously with Llama 3.1. It is trained to detect prompt injection attacks and jailbreak attempts. The model was built on a multilingual DeBERTa backbone to support cross-language adversarial inputs. At 86M parameters, Prompt Guard can be run as a fast pre-filter on every user input before it reaches the main model, adding minimal latency overhead.[^20]

Llama Stack

The Llama 3.1 release was accompanied by the announcement of Llama Stack, a specification for standardized interfaces across the components needed to build agentic applications.[^21] Llama Stack defines APIs for common building blocks including:

Inference endpoints
Safety layers (integrating Llama Guard)
Memory stores
Tool integrations
Agentic loops

The goal was to reduce fragmentation in the ecosystem of frameworks built around Llama models, conceptually analogous to how ONNX provides a standard format for model exchange across machine-learning frameworks. Multiple providers — including Meta itself, AWS, NVIDIA, and Fireworks AI — subsequently released Llama Stack-compatible distributions. Llama Stack was positioned as a self-hostable alternative to OpenAI-compatible APIs for organizations that want to run their own inference infrastructure end-to-end.[^21]

Reception and Impact

The Llama 3.1 release received broad attention in the AI industry. The 405B's competitive benchmark scores against GPT-4o and Claude 3.5 Sonnet were widely reported as a milestone for open-weight models.[^1][^7][^8] The combination of frontier-class performance, open weights, and an updated license permitting distillation was interpreted by many commentators as a significant strategic move against proprietary AI providers.

Mark Zuckerberg's accompanying essay generated substantial discussion.[^5] He argued that open-source models would follow a trajectory similar to Linux, eventually becoming the default infrastructure choice for most developers. Critics noted that the 700 million MAU cap in the license was a meaningful constraint that distinguished the release from true open source and that Meta's interest in commoditizing the AI-model market aligned with its business interests in advertising infrastructure rather than direct AI services.[^17][^18]

Within hours of release, major cloud providers including Amazon Bedrock, Google Cloud Vertex AI, Microsoft Azure, and Oracle Cloud Infrastructure announced Llama 3.1 availability.[^1][^4] Over 25 inference and fine-tuning partners offered access on day one, making it one of the most broadly deployed open-weight models at the time of its release. The Hugging Face model card for the 405B accumulated one of the highest download counts of any model in that size range during the second half of 2024.[^4]

NVIDIA released an official FP8-quantized version of the 405B Instruct model in collaboration with Meta, and Neural Magic released alternative FP8 quantizations with different calibration strategies, demonstrating the ecosystem's rapid adaptation to the deployment challenges posed by a 405B model.[^14]

Industry analysts noted that Llama 3.1 raised the floor of what organizations could achieve without purchasing access to proprietary frontier models. The ability to run a GPT-4o-class model on owned infrastructure, with full data privacy and at lower inference cost, altered the calculus for enterprises evaluating build-versus-buy decisions for AI capabilities.[^7] Within months, derivative fine-tunes proliferated on Hugging Face, including domain-specialized variants for medicine, law, and coding; multi-language fine-tunes extending beyond the eight official languages; and a wide range of community-quantized variants targeting consumer-grade hardware.

Successors

Llama 3.2 (September 2024)

Llama 3.2 was released on September 25, 2024, two months after Llama 3.1. It introduced two parallel tracks: vision-capable 11B and 90B models that retain the text capability of their Llama 3.1 counterparts while adding image understanding via an adapter-based vision encoder, and small text-only 1B and 3B variants engineered for on-device deployment on phones and edge devices.[^10] Llama 3.2 used the same 128K context window and the same eight supported languages.

Llama 3.3 (December 2024)

Llama 3.3 was released on December 6, 2024. Unlike Llama 3.1 and 3.2, the Llama 3.3 release consisted of a single new model: a refined 70B Instruct variant. Meta reported that the new 70B approximated the Llama 3.1 405B's quality on instruction following (IFEval ~92.1) and mathematics (MATH ~77.0) at roughly one-fifth the inference cost.[^11] Llama 3.3 became the recommended replacement for Llama 3.1 70B in most production deployments and partially superseded the 405B for cost-sensitive applications, though the 405B remained the recommended choice for the most demanding general-knowledge and reasoning tasks.

Llama 4 (April 2025)

Llama 4 marked a major architectural inflection point in the Llama family.[^12] Released in early April 2025, Llama 4 introduced mixture-of-experts (MoE) routing for the first time in an open-weight Llama model, along with native multimodality (image and video understanding alongside text). The initial release comprised Llama 4 Scout (17B active parameters across 16 experts, optimized for single-H100 deployment) and Llama 4 Maverick (17B active parameters across 128 experts, totaling roughly 400B parameters), with a two-trillion-parameter Llama 4 Behemoth (288B active across 16 experts) still in training at the announcement.[^12] Llama 4 also dramatically extended the context window, with Scout reaching a 10-million-token context length. With Llama 4, Meta departed from the dense decoder-only design that characterized Llama 1 through Llama 3.3.

Legacy and Current Status

As of mid-2026, Llama 3.1 remains one of the most widely used open-weight model families in production, particularly the 8B and 70B variants. The 405B retains a niche but important role as a research and benchmark reference point and as a teacher model for synthetic-data generation workflows. For the 70B size class, however, most production deployments have migrated to Llama 3.3 70B, which has the same parameter count and serving cost but delivers stronger instruction following and mathematics performance through post-training improvements. Users who need image understanding have largely moved to Llama 3.2, and those who need native multimodality, MoE inference, or context windows beyond 128K tokens have largely moved to Llama 4.

The 8B variant continues to be heavily used for on-device deployment and as the base model for a large number of community fine-tunes. Its combination of strong English instruction following, eight-language coverage, and ability to run on a single consumer GPU with 4-bit quantization made it the de facto default for small open-weight model applications throughout 2024 and 2025.
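The single-GPU claim can be checked with back-of-the-envelope arithmetic: weight memory is roughly the parameter count times the bits per weight, divided by eight. The sketch below assumes a nominal 8 billion parameters and counts weights only. It ignores the KV cache, activations, and the per-group scale metadata that real quantization formats add, so actual usage will be somewhat higher. The helper name `weight_memory_gib` is hypothetical.

```python
def weight_memory_gib(num_params: float, bits_per_weight: float) -> float:
    """Approximate memory needed for the model weights alone, in GiB."""
    total_bytes = num_params * bits_per_weight / 8
    return total_bytes / 2**30


PARAMS_8B = 8e9  # nominal count (assumption); the real model has slightly more

for label, bits in [("16-bit", 16), ("8-bit", 8), ("4-bit", 4)]:
    print(f"{label:>6}: {weight_memory_gib(PARAMS_8B, bits):.1f} GiB")
```

At 4 bits per weight the estimate comes to under 4 GiB, compared with roughly 15 GiB at 16-bit precision, which is why the quantized 8B model fits on common consumer GPUs.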

Llama 3.1's most lasting contribution may be its role in establishing that open-weight models can credibly compete with frontier closed systems. The 405B demonstrated, for the first time with publicly available weights, that the gap between open and closed models at the very top of the capability frontier was not structural but rather a matter of compute budget and engineering investment. Subsequent open-weight efforts built on this foundation, and the Llama 3.1 Community License's 700M MAU cap and "Llama"-name requirement on distillation derivatives set a template that most subsequent commercially permissive open-weight licenses have followed in some form.

References
