Llama 3.1 is a family of large language models released by Meta on July 23, 2024. The release comprises three model sizes: 8 billion, 70 billion, and 405 billion parameters. Llama 3.1 built on the foundation of Llama 3 with several major expansions: a 128,000-token context window (up from 8,192 in Llama 3), robust multilingual support across eight languages, native tool-calling capabilities, and, in the 405B variant, the first openly available model widely considered competitive with closed frontier systems such as GPT-4o and Claude 3.5 Sonnet.
The 405B model was trained on more than 15.6 trillion tokens using over 16,000 NVIDIA H100 GPUs, making it one of the most compute-intensive open-weight training runs disclosed at the time. Meta simultaneously released the Llama Stack specification and a suite of safety models under the Llama Guard 3 and Prompt Guard names. All models were made available for download on Hugging Face and through more than 25 commercial cloud and inference partners.
Meta's first Llama models appeared in February 2023 as compact, research-focused releases. Llama 2 followed in July 2023 with an explicit commercial license. Llama 3, released in April 2024 in 8B and 70B variants, brought substantial quality improvements and a 128K-token tokenizer vocabulary, but shipped with an 8,192-token context limit and no built-in tool-calling framework.
Llama 3.1 addressed all three gaps. The context limit was extended by a factor of 16 to 128,000 tokens. Multilingual post-training was formalized across eight languages. Tool calling was incorporated directly into the instruction-tuned variants using a standardized JSON interface and two built-in tool integrations. And the 405B variant pushed Meta into direct competition with frontier closed models for the first time.
Mark Zuckerberg used the Llama 3.1 launch to make an extended argument for open-source AI, comparing the trajectory he expected for open models to the history of Linux displacing proprietary Unix. He wrote that open source "will ensure that more people around the world have access to the benefits and opportunities of AI, that power isn't concentrated in the hands of a small number of companies, and that the technology can be deployed more evenly and safely across society."
Llama 3.1 sits between Llama 3 (April 2024) and Llama 3.2 (September 2024) in Meta's release timeline. Llama 3.2 added vision capabilities and introduced smaller edge models at 1B and 3B parameters. Llama 3.3 (December 2024) revisited the 70B scale with further post-training improvements, and Llama 4 (April 2025) shifted the architecture to a mixture-of-experts design with native multimodality. Within this progression, Llama 3.1 represents the point at which Meta first achieved open-weight models that benchmarked alongside rather than behind frontier proprietary systems.
Llama 3.1 was released in three sizes, each available in both base (pre-trained) and instruct (instruction-tuned) versions.
The 8B model is the smallest and most accessible variant. In FP16 it occupies approximately 16 GB of GPU memory, fitting on a single high-end consumer GPU such as an NVIDIA RTX 4090 or a data-center-class A10G. With 4-bit quantization the memory footprint drops to around 4 GB. The 8B Instruct model supports the full 128K context window, tool calling, and the eight supported languages.
Despite its size, the 8B model showed meaningful improvements over its Llama 3 8B predecessor: MMLU scores rose from roughly 66 to 69.4, and GSM8K math reasoning improved from 79.6 to 84.5. At this scale, Llama 3.1 8B competes with earlier 13B or 34B class models from the broader open-source ecosystem, making it suitable for deployments on constrained hardware.
The 70B model represents the mid-tier option and is widely considered the most practical of the three for organizations that need strong capability without the infrastructure demands of the 405B. In FP16, it requires approximately 140 GB of GPU memory, necessitating at least two 80 GB A100s or a comparable configuration. With AWQ 4-bit quantization the requirement drops to around 35 GB, enabling deployment on two consumer-grade 24 GB GPUs.
The 70B Instruct model scored 83.6 on MMLU, 95.1 on GSM8K, and 80.5 on HumanEval, placing it ahead of GPT-3.5-level systems across most benchmarks. It also achieves 90.5 on ZeroSCROLLS/QuALITY, the same score as GPT-4o on that long-context benchmark.
The 405B is the flagship model and the primary reason for the Llama 3.1 release's significance. Meta described it as "the first openly available model that rivals the top AI models" in general knowledge, math, tool use, and multilingual translation. In full FP16 precision, the model requires approximately 810 GB of GPU memory (roughly 10 H100 80 GB GPUs). Meta's official FP8 quantized variant reduces this to about 405 GB, fitting within a single 8-way H100 node.
Benchmark performance places the 405B within a few percentage points of GPT-4o and Claude 3.5 Sonnet on most tasks. On GSM8K math reasoning it scores 96.8, slightly above GPT-4o's 96.1. On ARC Challenge it scores 96.9, matching GPT-4o's 96.7. On HumanEval code generation it scores 89.0, modestly below GPT-4o's 90.2 and Claude 3.5 Sonnet's 92.0.
Meta's FP8 quantization was applied specifically to the major linear operators of the model, covering the gate, up, and down projections for the feed-forward networks. These components account for approximately 75% of the model's inference FLOPs. The quantization produces minimal accuracy degradation while halving the memory footprint compared to BF16.
Llama 3.1 uses a dense, decoder-only transformer architecture. Meta chose not to adopt mixture-of-experts for this release, citing training stability as a primary reason. The architectural specification across the three model sizes is as follows:
| Parameter | 8B | 70B | 405B |
|---|---|---|---|
| Transformer layers | 32 | 80 | 126 |
| Model dimension | 4,096 | 8,192 | 16,384 |
| FFN hidden dimension | 14,336 | 28,672 | 53,248 |
| Attention heads | 32 | 64 | 128 |
| Key-value heads (GQA) | 8 | 8 | 8 |
| Context window | 128,000 | 128,000 | 128,000 |
| Vocabulary size | 128,000 | 128,000 | 128,000 |
| Peak learning rate | 3e-4 | 1.5e-4 | 8e-5 |
All three variants use Grouped-Query Attention (GQA) with 8 key-value heads. GQA reduces the memory required by the key-value cache during inference, which is particularly important at 128K context lengths where the KV cache would otherwise grow very large. For the 405B model at full 128K context, the KV cache alone requires approximately 123 GB in FP16.
Llama 3.1 uses Rotary Position Embeddings (RoPE) with a base frequency of 500,000. This is substantially higher than typical RoPE configurations and is necessary to support stable attention patterns at 128K token distances. The base models were initially trained with an 8K context and then extended to 128K through a six-stage progressive fine-tuning process. Each stage increased the context length and was trained on approximately 800 billion additional tokens, using data specifically selected for long-context quality.
The feed-forward networks use SwiGLU activation, which is a gated linear unit variant that generally produces better training dynamics than standard ReLU or GELU activations. This is consistent with Llama 3 and many other recent large language models.
Llama 3.1 uses the same 128,000-token vocabulary as Llama 3. This vocabulary was constructed by combining 100,000 tokens from the tiktoken tokenizer with 28,000 additional tokens added to improve coverage of non-English languages. The expanded vocabulary improves tokenization efficiency for languages such as Hindi, Thai, and Arabic, reducing the number of tokens needed to represent a given passage.
The Llama 3.1 models were pre-trained on 15.6 trillion tokens drawn from a multilingual web corpus. The dataset was assembled with several filtering stages. Hashing-based and MinHash deduplication removed exact and near-duplicate content. A RoBERTa-based quality classifier was used to identify and retain high-quality text, similar in spirit to the approach used in GPT-3's Pile filtering. Domain-specific data for code and mathematics was upsampled relative to its natural frequency in the web corpus.
Approximately 8% of the training tokens were non-English, covering the eight languages that the instruct variants support. The knowledge cutoff for pre-training data is December 2023.
The 405B model was trained on over 16,000 NVIDIA H100 GPUs. This required substantial infrastructure innovations. The full training run accumulated approximately 30.84 million GPU-hours for the 405B model alone (1.46 million for 8B, 7.0 million for 70B). Meta developed custom optimizations across the network, storage, and compute stack to run training at this scale reliably.
The pre-training precision was BF16. To enable production-scale inference at reasonable cost, Meta developed FP8 quantization of the final weights, reducing the storage and memory requirements by approximately half without meaningful degradation on standard benchmarks.
Post-training followed a pipeline of supervised fine-tuning (SFT) followed by Reinforcement Learning from Human Feedback (RLHF), specifically using Direct Preference Optimization (DPO) as one of the preference optimization methods. Meta generated more than 25 million synthetic training examples using a combination of earlier Llama models and human annotation pipelines.
Post-training addressed four areas beyond general instruction following:
The extension from 8,192 tokens in Llama 3 to 128,000 tokens in Llama 3.1 is one of the release's most consequential changes. A 128K context window can accommodate approximately 100,000 words of English text in a single input. This enables several use cases that were impractical with shorter windows:
The KV cache requirements scale linearly with context length. For the 405B model at 128K tokens, the cache requires approximately 123 GB, which must be added to the weight memory. In practice this means serving the 405B at full 128K context requires careful memory planning across multiple H100 nodes. For the 8B model, the 128K KV cache requires approximately 15.6 GB, manageable alongside the 16 GB weight footprint on a dual-GPU setup.
Llama 3.1 was the first Llama generation to include native tool calling as a first-class capability. The instruct models support three modes of tool interaction.
Two tools are pre-trained into the model and can be activated via the system prompt:
Tools: brave_search. The model calls the tool using Python-like syntax: brave_search.call(query="...").Tools: wolfram_alpha. Called as wolfram_alpha.call(query="...").These built-in tools are triggered in an ipython environment context. The model generates a tool call, receives the result via an ipython role message, and then continues reasoning with the result incorporated.
Developers can define arbitrary custom tools in the system prompt using a JSON schema format similar to the OpenAI function-calling specification. The model parses the tool definitions and generates calls in the form {"name": "function_name", "parameters": {"arg_name": "value"}}. This allows integrating any external API or service as a tool.
The model can also generate Python code for execution when the system prompt includes Environment: ipython. In this mode the model wraps code in a <|python_tag|> block, closes with the <|eom_id|> token to signal that execution should occur, and then continues after receiving the code output. This enables agentic workflows where the model iteratively writes and tests code.
The Llama 3.1 prompt format added new special tokens compared to Llama 3:
<|python_tag|>: Marks the beginning of a tool or code call.<|eom_id|>: End of message, indicating the model expects a tool result before the turn ends.ipython role for tool result messages.These tokens allow the model to clearly signal multi-step tool interactions within the existing turn-based conversation format.
Meta evaluated all three Llama 3.1 models on a wide range of benchmarks and compared them against GPT-4o and Claude 3.5 Sonnet. The results below reflect Meta's published evaluations from the Llama 3 Herd of Models paper (arXiv 2407.21783).
| Benchmark | Llama 3.1 8B | Llama 3.1 70B | Llama 3.1 405B | GPT-4o | Claude 3.5 Sonnet |
|---|---|---|---|---|---|
| MMLU (5-shot) | 69.4 | 83.6 | 87.3 | 89.1 | 89.9 |
| MMLU-Pro (5-shot) | 48.3 | 66.4 | 73.3 | 74.4 | 77.0 |
| ARC Challenge (0-shot) | 83.4 | 94.8 | 96.9 | 96.7 | 96.7 |
| HellaSwag (0-shot) | 82.1 | 88.0 | 89.2 | 95.3 | 89.0 |
| Benchmark | Llama 3.1 8B | Llama 3.1 70B | Llama 3.1 405B | GPT-4o | Claude 3.5 Sonnet |
|---|---|---|---|---|---|
| GSM8K (8-shot, CoT) | 84.5 | 95.1 | 96.8 | 96.1 | 96.4 |
| MATH (0-shot, CoT) | 51.9 | 68.0 | 73.8 | 76.6 | 71.1 |
| Benchmark | Llama 3.1 8B | Llama 3.1 70B | Llama 3.1 405B | GPT-4o | Claude 3.5 Sonnet |
|---|---|---|---|---|---|
| HumanEval (0-shot) | 72.6 | 80.5 | 89.0 | 90.2 | 92.0 |
| MBPP (0-shot) | 72.8 | 86.0 | 88.6 | 87.0 | 90.7 |
| Benchmark | Llama 3.1 8B | Llama 3.1 70B | Llama 3.1 405B | GPT-4o |
|---|---|---|---|---|
| ZeroSCROLLS/QuALITY | 81.0 | 90.5 | 95.2 | 90.5 |
| InfiniteBench/En.MC | 65.1 | 78.2 | 83.4 | 82.5 |
| Benchmark | Llama 3.1 8B | Llama 3.1 70B | Llama 3.1 405B | GPT-4o | Claude 3.5 Sonnet |
|---|---|---|---|---|---|
| MGSM (0-shot, CoT) | 68.9 | 86.9 | 91.6 | 90.5 | 91.6 |
The 405B model notably surpasses GPT-4o on long-context tasks and matches it on multilingual math reasoning. The gap between 405B and GPT-4o is small on most benchmarks, with GPT-4o maintaining leads on MMLU, HellaSwag, and MATH, while 405B leads on ARC Challenge and GSM8K.
Llama 3.1 is released under the Llama 3.1 Community License Agreement, a custom bespoke license that Meta wrote for this release. It is not an Open Source Initiative (OSI)-approved open-source license. Key provisions:
Permitted uses: Commercial use is allowed without restriction for most organizations. Users may run, fine-tune, modify, and distribute the model weights. The license explicitly permits using model outputs to improve other language models, including through knowledge distillation. This was an important change from earlier Llama licenses, which prohibited using Llama outputs to train competing models.
Scale restriction: Organizations with more than 700 million monthly active users must obtain a separate license from Meta. Meta is not required to grant such a license on any specific terms. This threshold was widely discussed at the time of release because it affects companies such as Google, Microsoft, and Amazon that might wish to embed Llama 3.1 in consumer products at very large scale.
Competitor restriction: The license prohibits using the model to improve competing AI foundation models in ways that go beyond the permitted output-improvement clause.
Attribution: Derivatives and products built on Llama 3.1 must include attribution and are subject to the same license terms.
The license was broadly interpreted as more permissive than Llama 2's license, particularly in allowing distillation. Critics noted that the 700M MAU threshold and absence of OSI approval mean the model cannot technically be described as open-source under the standard definition.
Meta released Llama Guard 3 alongside Llama 3.1. Llama Guard 3 is an 8B safety classifier built on the Llama 3.1 8B base. It classifies model inputs and outputs against a taxonomy of harm categories defined in collaboration with the MLCommons consortium. Llama Guard 3 supports all eight Llama 3.1 languages, making it the first multilingual safety classifier in the Llama Guard series. Meta reported that Llama Guard 3 outperforms GPT-4 on English, multilingual, and tool-use safety classification benchmarks while maintaining lower false-positive rates.
A 1B variant (Llama Guard 3 1B) was also released for deployments where running the full 8B safety model alongside a large inference model is too expensive.
Prompt Guard is an 86-million-parameter classifier released simultaneously with Llama 3.1. It is trained to detect prompt injection attacks and jailbreak attempts. The model was built on a multilingual DeBERTa backbone to support cross-language adversarial inputs. At 86M parameters, Prompt Guard can be run as a fast pre-filter on every user input before it reaches the main model, adding minimal latency overhead.
The Llama 3.1 release was accompanied by the announcement of Llama Stack, a specification for standardized interfaces for toolchain components and agentic application development. Llama Stack defines APIs for common agentic building blocks including:
The goal of Llama Stack was to reduce fragmentation in the ecosystem of frameworks built around Llama models, similar in concept to how ONNX provides a standard format for model exchange across ML frameworks. Multiple providers including Meta itself, AWS, NVIDIA, and Fireworks AI subsequently released Llama Stack-compatible distributions. Llama Stack was positioned as a drop-in alternative to OpenAI-compatible APIs for organizations that want to self-host.
The Stack's reference applications included sample multi-turn agents using the 405B with web search and code execution tools, demonstrating the end-to-end capability of the tool-calling infrastructure.
The table below summarizes the key differences between Llama 3.1 variants and contemporary models at the time of release:
| Feature | Llama 3 8B | Llama 3.1 8B | Llama 3.1 70B | Llama 3.1 405B | GPT-4o | Claude 3.5 Sonnet |
|---|---|---|---|---|---|---|
| Release date | April 2024 | July 2024 | July 2024 | July 2024 | May 2024 | June 2024 |
| Context window | 8K | 128K | 128K | 128K | 128K | 200K |
| Parameters | 8B | 8B | 70B | 405B | ~200B (est.) | Unknown |
| Open weights | Yes | Yes | Yes | Yes | No | No |
| Tool calling | No | Yes | Yes | Yes | Yes | Yes |
| Multilingual | Limited | 8 languages | 8 languages | 8 languages | Many | Many |
| MMLU | 66.6 | 69.4 | 83.6 | 87.3 | 89.1 | 89.9 |
| HumanEval | 62.2 | 72.6 | 80.5 | 89.0 | 90.2 | 92.0 |
| GSM8K | 79.6 | 84.5 | 95.1 | 96.8 | 96.1 | 96.4 |
| License | Custom | Custom | Custom | Custom | Proprietary | Proprietary |
The most significant gaps between Llama 3.1 405B and the frontier closed models are on MMLU (1.8 to 2.6 points below), HumanEval (1.2 to 3.0 points below), and MATH (2.8 to 4.7 points below). On reasoning-heavy tasks such as GSM8K and ARC Challenge, the 405B matches or slightly exceeds its closed-model counterparts. The cost difference is substantial: inference on self-hosted or third-party hosted Llama 3.1 405B was widely reported to cost approximately 50% less per token than GPT-4o at comparable quality levels.
One of the most actively promoted use cases for the 405B at release was using it as a teacher model to generate synthetic training data. The updated license explicitly permits this. Organizations can prompt the 405B to produce labeled examples, instruction-response pairs, or preference annotations, and then use that data to fine-tune a smaller model such as the 8B or 70B. AWS, NVIDIA, Snowflake, and Microsoft Azure all published tutorials for this workflow within weeks of the release.
Knowledge distillation refers to training a smaller model to mimic the probability distributions or outputs of a larger model. The Llama 3.1 405B was explicitly designed to serve as a distillation teacher. Meta's own documentation notes that the 8B and 70B Instruct models benefited from distillation from earlier internal large models during post-training, and the community was invited to replicate and extend this approach.
The 128K context window opened practical use cases for processing long documents that previously required chunking or retrieval pipelines. Legal contract analysis, scientific literature review, large codebase refactoring, and financial document summarization are prominent examples. In RAG pipelines, the large context window allows including more retrieved documents, reducing the likelihood of missing relevant context.
With eight supported languages at the instruction-tuned level, Llama 3.1 can be used for customer service applications, content localization, and translation assistance across English, German, French, Italian, Portuguese, Hindi, Spanish, and Thai without language-specific fine-tuning.
The tool-calling infrastructure and Llama Stack APIs made Llama 3.1 a practical foundation for multi-step agentic workflows. Examples include research assistants that search the web and synthesize results, coding agents that iteratively run and fix code, and customer support bots that call backend APIs to fulfill requests.
The 8B model, especially with 4-bit quantization, can run on consumer hardware. At INT4, the 8B model requires approximately 4 GB of GPU memory, enabling deployment on gaming-class GPUs. This makes it viable for on-device applications where data privacy requirements prohibit sending data to cloud APIs.
The Llama 3.1 release received broad attention in the AI industry. The 405B's competitive benchmark scores against GPT-4o and Claude 3.5 Sonnet were widely reported as a milestone for open-weight models. The combination of frontier-class performance, open weights, and an updated license permitting distillation was interpreted as a significant strategic move against proprietary AI providers.
Mark Zuckerberg's accompanying essay on open-source AI generated substantial discussion. He argued that open-source models would follow a trajectory similar to Linux, eventually becoming the default infrastructure choice for most developers. Critics noted that the 700M MAU cap in the license was a meaningful constraint that distinguished the release from true open-source, and that Meta's interest in commoditizing the AI model market aligned with its business interests in advertising infrastructure rather than AI services.
Within weeks of release, major cloud providers including Amazon Bedrock, Google Cloud, Microsoft Azure, and Oracle Cloud announced Llama 3.1 availability. Over 25 inference and fine-tuning partners offered access, making it one of the most broadly deployed open-weight models at the time of its release. The Hugging Face model card for the 405B accumulated one of the highest download counts of any model in that size range.
NVIDIA released an official FP8-quantized version of the 405B Instruct model in collaboration with Meta, and Neural Magic released alternative FP8 quantizations with different calibration strategies, demonstrating the ecosystem's quick adaptation to the deployment challenges posed by a 405B model.
Industry analysts at the time of release noted that Llama 3.1 raised the floor of what organizations could achieve without purchasing access to proprietary frontier models. The ability to run a GPT-4o-class model on owned infrastructure, with full data privacy and at lower inference cost, altered the calculus for enterprises evaluating build-versus-buy decisions for AI capabilities.
Knowledge cutoff: The pre-training data has a cutoff of December 2023. Events, publications, and developments after that date are not represented in the model's weights. Retrieval-augmented generation or tool use with web search is required to access more recent information.
Text-only modality: Llama 3.1 is a text-only model. It cannot process images, audio, or video natively. Multimodal capabilities were added in subsequent releases (Llama 3.2 introduced vision).
405B deployment complexity: Despite FP8 quantization, serving the 405B model in production requires at least 8 H100 80 GB GPUs. This represents a significant infrastructure investment that is beyond the reach of small organizations. At BF16, the hardware requirement is roughly double that.
Hallucination: Like all large language models, Llama 3.1 can generate factually incorrect information stated with apparent confidence. Meta's post-training included hallucination mitigation strategies, but the problem is not eliminated. The paper describes an approach where the model is trained to refuse questions it cannot answer reliably, though in practice this calibration is imperfect.
Limited languages: Despite multilingual improvements, Llama 3.1's instruction fine-tuning covers only eight languages. The base models contain more language diversity from pre-training, but the quality of instruction-following degrades for languages not in the fine-tuning set.
Context quality at long range: While the 128K context window is functional, performance on tasks that require attending to information at very long distances within the context window can degrade compared to tasks where relevant information is near the beginning or end of the input. This is a known limitation of Transformer attention at long ranges.
License is not OSI open-source: The 700M MAU threshold, competitor restriction, and bespoke license terms mean that Llama 3.1 cannot be incorporated into OSI-compliant open-source projects without legal review. This limits its adoption in certain academic, government, and corporate open-source contexts.