Llama 3.3
Last reviewed
May 17, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v2 · 3,961 words
Llama 3.3 is an instruction-tuned large language model released by Meta on December 6, 2024.[^1][^2] The model has 70 billion parameters, is text-only, supports eight languages, and operates with a 128,000-token context window.[^2][^3] Llama 3.3 was distributed as a single 70B Instruct checkpoint — Meta released no other parameter sizes, no base model, and no multimodal variant under the 3.3 designation.[^1][^2]
Meta's central claim for the release was that improvements concentrated in the post-training pipeline allowed a 70B model to approach the benchmark performance of Llama 3.1 405B, the largest model in the Llama 3.1 herd, on most text tasks.[^1][^3] Meta described the post-training approach as relying on "online preference optimization" in addition to the existing supervised fine-tuning, rejection sampling, and Direct Preference Optimization stages used for prior Llama 3 generations.[^4] On the IFEval instruction-following benchmark, Llama 3.3 70B scored 92.1, exceeding Llama 3.1 405B's 88.6 on the same benchmark per Meta's published numbers.[^2][^3]
The release arrived almost exactly twelve months after Meta's December 2023 Llama training-data cutoff and three months after the Llama 3.2 family had added vision and on-device variants.[^2][^5] It was Meta's final Llama release of 2024, and was succeeded in April 2025 by Llama 4 Scout and Maverick — the first Llama models built on a Mixture-of-Experts architecture and the first to be natively multimodal.[^6][^7]
Meta's open-weights model line began with the original LLaMA in February 2023, evolved through Llama 2 in July 2023, Llama 3 in April 2024, Llama 3.1 in July 2024 (introducing the 405B open-weights model and 128K context), and Llama 3.2 in September 2024 (introducing the 1B and 3B on-device models and the 11B and 90B Vision models).[^8] By the time Llama 3.3 arrived in December 2024, the Llama 3 family already spanned six base architectures across two context-window generations.
In that landscape, Llama 3.3 occupied an unusual position. It did not extend the family upward (no new 405B), it did not extend it downward (no new 1B or 3B), and it did not add a modality (no vision variant). It was a single instruct checkpoint released at the 70B size, replacing Llama 3.1 70B Instruct as the recommended default for most production deployments.[^1][^9]
Meta explained the rationale in the model card and its accompanying blog post as a focus on efficiency: post-training advances had narrowed the capability gap between the 70B model and the 405B model to the point that the larger model was no longer the obvious choice for most deployments.[^2][^4] Mark Zuckerberg's announcement on Instagram emphasized that the 70B model could be served on standard developer hardware while delivering performance comparable to the much larger 405B sibling.[^4][^10]
The 70B parameter count occupies a practical middle ground in the open-weights landscape. With 4-bit quantization, the model fits within roughly 40 gigabytes of memory, enabling deployment on two consumer-grade GPUs such as RTX 4090s or on a high-memory Apple Silicon machine. Simon Willison reported running quantized Llama 3.3 70B on his 64GB M2 MacBook Pro using a 43GB GGUF build through Ollama and a 4-bit MLX alternative on release day.[^11] Llama 3.1 405B, by contrast, requires multi-node server-class deployments for reasonable inference throughput.
The corresponding 8B model was not updated to a Llama 3.3 designation because, per available Meta communications, the post-training improvements did not transfer as cleanly to the smaller size; vision capability was already covered by the recently released Llama 3.2 11B and 90B Vision models.
Llama 3.3 70B is built on the same pretrained base as Llama 3.1 70B. The Hugging Face model card identifies Meta Llama 3.1 70B as the base model and describes the differences from Llama 3.1 70B as concentrated in the instruction-tuning post-training pipeline.[^2][^4] There is no new pretraining run for Llama 3.3; the approximately 15 trillion tokens of pretraining data, the December 2023 knowledge cutoff, and the tokenizer are all inherited from Llama 3.1.[^2]
Meta's December 2024 announcement and documentation describe the post-training improvements as combining the existing Llama 3 post-training stack — supervised fine-tuning (SFT), rejection sampling, and Direct Preference Optimization (DPO) — with "online preference optimization," a technique in which preference data is collected and applied iteratively during training rather than from a fixed offline dataset.[^4][^9] Meta cited this online preference optimization as a key driver of the gains in instruction following, mathematics, multilingual reasoning, and code.[^4]
The differences between Llama 3.3 70B and Llama 3.1 70B are therefore exclusively in the post-trained instruction-tuned weights. Because the underlying architecture and tokenizer are identical, any inference stack built for Llama 3.1 70B works without modification for Llama 3.3 70B, and the same prompt format and special tokens apply across both releases.[^9]
Llama 3.3 70B is an auto-regressive decoder-only transformer using the same architecture as Llama 3.1 70B.[^2][^8] The model has 80 transformer decoder layers, a hidden size of 8,192, and a feed-forward intermediate dimension of 28,672. It uses Grouped-Query Attention (GQA) with 64 attention heads and 8 key-value heads, reducing the size of the key-value cache during inference compared to full multi-head attention and enabling more efficient long-context decoding.[^2][^8]
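The effect of Grouped-Query Attention on cache size can be estimated from the dimensions quoted above. The following is a back-of-envelope sketch, not a measurement: it assumes 16-bit cache values, a head dimension equal to the hidden size divided by the number of attention heads, and a single sequence filling the full context; the function name `kv_cache_bytes` is illustrative.

```python
# Rough KV-cache sizing from the architecture figures quoted in the text.
LAYERS = 80
HIDDEN = 8192
QUERY_HEADS = 64
KV_HEADS = 8
HEAD_DIM = HIDDEN // QUERY_HEADS  # assumption: head_dim = hidden / heads = 128


def kv_cache_bytes(tokens: int, kv_heads: int, bytes_per_value: int = 2) -> int:
    """Bytes needed to cache keys and values for one sequence of `tokens` positions."""
    # Factor 2 covers keys plus values; bytes_per_value=2 assumes a 16-bit cache.
    per_token = 2 * LAYERS * kv_heads * HEAD_DIM * bytes_per_value
    return tokens * per_token


CONTEXT = 128_000
gqa = kv_cache_bytes(CONTEXT, KV_HEADS)      # 8 key-value heads (GQA)
mha = kv_cache_bytes(CONTEXT, QUERY_HEADS)   # hypothetical full multi-head attention
print(f"GQA: {gqa / 1e9:.1f} GB, MHA: {mha / 1e9:.1f} GB, ratio {mha // gqa}x")
# GQA: 41.9 GB, MHA: 335.5 GB, ratio 8x
```

Under these assumptions the cache shrinks by the ratio of attention heads to key-value heads, which is 8 for this configuration.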
The vocabulary contains 128,256 tokens generated by a byte-level byte-pair encoding tokenizer (tiktoken-compatible), the same vocabulary first introduced with Llama 3 in April 2024 and reused across the 3.1, 3.2, and 3.3 generations.[^8] Rotary Positional Embeddings (RoPE) are used for position encoding, with the RoPE base frequency scaled to support the 128,000-token native context.[^2][^8]
According to the Llama 3.3 model card, training "utilized a cumulative of 39.3M GPU hours" of computation on H100-80GB hardware (700W TDP), but the Llama 3.3 70B-specific line item in the training-compute table reports 7.0 million GPU hours.[^2] The larger 39.3M figure is the cumulative total for the Llama 3.1 family that supplied the pretrained base; the 7.0M figure matches the compute reported for the 70B model in the Llama 3.1 model card, which is consistent with Llama 3.3 inheriting that pretrained base rather than undergoing a new pretraining run.[^2] Estimated location-based greenhouse gas emissions for the full cumulative training were 11,390 tons CO2eq, with market-based emissions reported as 0 tons because the relevant Meta data centers operate on 100% renewable energy.[^2]
Because the prompt format and special tokens documented in the model card are identical to those of Llama 3.1, Llama 3.3 is a drop-in replacement for Llama 3.1 70B in inference stacks. The same chat template, the same tool-calling format, and the same role delimiters apply. Quantization tooling (GGUF, AWQ, GPTQ, ExLlamaV2) and serving frameworks (vLLM, TGI, llama.cpp, Ollama, MLX) that supported Llama 3.1 70B added Llama 3.3 70B support on or near release day with no architectural changes.
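As an illustration of the shared chat layout, the sketch below renders a list of messages into a single prompt string. It is a hand-written approximation: the `<|begin_of_text|>`, `<|start_header_id|>`, and `<|end_header_id|>` tokens are not named in this article and are taken from the Llama 3.1 prompt-format documentation, and the helper `render_llama3_prompt` is hypothetical. In practice the chat template shipped with the tokenizer should be preferred.

```python
def render_llama3_prompt(messages: list[dict[str, str]]) -> str:
    """Render chat messages into a Llama 3.x style instruct prompt (illustrative)."""
    parts = ["<|begin_of_text|>"]
    for message in messages:
        # Each turn is a role header, a blank line, the content, then <|eot_id|>.
        parts.append(f"<|start_header_id|>{message['role']}<|end_header_id|>\n\n")
        parts.append(f"{message['content'].strip()}<|eot_id|>")
    # Leave an assistant header open so the model generates the reply.
    parts.append("<|start_header_id|>assistant<|end_header_id|>\n\n")
    return "".join(parts)


prompt = render_llama3_prompt([
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Summarize Grouped-Query Attention in one sentence."},
])
print(prompt)
```

Because the layout is the same for both releases, a string built this way for Llama 3.1 70B should be usable unchanged with Llama 3.3 70B.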
The headline benchmark comparison published by Meta on the Hugging Face model card and the Llama 3.3 model card on GitHub places Llama 3.3 70B alongside Llama 3.1 70B (its direct predecessor) and Llama 3.1 405B (the largest sibling) across nine benchmarks.[^2][^9] All figures below come from that published comparison.
| Benchmark | Llama 3.1 70B | Llama 3.3 70B | Llama 3.1 405B |
|---|---|---|---|
| MMLU (CoT) | 86.0 | 86.0 | 88.6 |
| MMLU Pro (CoT) | 66.4 | 68.9 | 73.3 |
| IFEval (instruction following) | 87.5 | 92.1 | 88.6 |
| GPQA Diamond (CoT) | 48.0 | 50.5 | 49.0 |
| HumanEval (code) | 80.5 | 88.4 | 89.0 |
| MBPP EvalPlus (code) | 86.0 | 87.6 | 88.6 |
| MATH (CoT) | 68.0 | 77.0 | 73.8 |
| MGSM (multilingual math) | 86.9 | 91.1 | 91.6 |
| BFCL v2 (tool use) | 77.5 | 77.3 | 81.1 |
Several patterns stand out. On general world knowledge (MMLU), Llama 3.3 70B is unchanged from Llama 3.1 70B at 86.0, and trails Llama 3.1 405B at 88.6 — pretraining-bound benchmarks see no improvement, consistent with Llama 3.3 inheriting the Llama 3.1 pretrained base. On instruction following (IFEval), mathematics (MATH and MGSM), code (HumanEval and MBPP EvalPlus), and graduate-level science reasoning (GPQA Diamond), Llama 3.3 70B improves meaningfully over Llama 3.1 70B and either matches or exceeds Llama 3.1 405B in several cases. The largest absolute gain over Llama 3.1 70B is on MATH (+9.0 points), where the 70B model crosses the 405B's 73.8 to reach 77.0.
On IFEval at 92.1, Llama 3.3 70B exceeds both its 70B predecessor (87.5) and its 405B sibling (88.6) on Meta's own published numbers.[^2] On GPQA Diamond at 50.5, it also exceeds the 405B's 49.0 — a noteworthy result given that GPQA is intended to measure expert-level scientific reasoning.[^2] On BFCL v2 (Berkeley Function Calling Leaderboard v2, the tool-use evaluation), Llama 3.3 70B at 77.3 is essentially unchanged from Llama 3.1 70B at 77.5 and trails the 405B at 81.1.[^2]
Llama 3.1 405B retains a clear lead on MMLU and MMLU Pro (general knowledge), where the larger model's broader pretraining capacity contributes meaningfully and post-training improvements on the same base cannot recover the gap, and on BFCL v2 (81.1 versus 77.3). Independent evaluators including Artificial Analysis published their own benchmark comparisons within days of release; their results broadly tracked Meta's published figures.[^12]
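The deltas discussed in this section can be recomputed directly from the comparison table. The short script below copies the table's scores and derives the gain over Llama 3.1 70B and the list of benchmarks on which the 70B model exceeds the 405B; the function names are illustrative.

```python
# Scores from the comparison table: (Llama 3.1 70B, Llama 3.3 70B, Llama 3.1 405B).
SCORES = {
    "MMLU (CoT)": (86.0, 86.0, 88.6),
    "MMLU Pro (CoT)": (66.4, 68.9, 73.3),
    "IFEval (instruction following)": (87.5, 92.1, 88.6),
    "GPQA Diamond (CoT)": (48.0, 50.5, 49.0),
    "HumanEval (code)": (80.5, 88.4, 89.0),
    "MBPP EvalPlus (code)": (86.0, 87.6, 88.6),
    "MATH (CoT)": (68.0, 77.0, 73.8),
    "MGSM (multilingual math)": (86.9, 91.1, 91.6),
    "BFCL v2 (tool use)": (77.5, 77.3, 81.1),
}


def gains_over_predecessor(scores: dict) -> dict:
    """Point gain of Llama 3.3 70B over Llama 3.1 70B, rounded to one decimal."""
    return {name: round(new - old, 1) for name, (old, new, _big) in scores.items()}


def beats_405b(scores: dict) -> list:
    """Benchmarks where Llama 3.3 70B scores strictly higher than Llama 3.1 405B."""
    return [name for name, (_old, new, big) in scores.items() if new > big]


gains = gains_over_predecessor(SCORES)
largest = max(gains, key=gains.get)
print(largest, gains[largest])  # MATH (CoT) 9.0
print(beats_405b(SCORES))
# ['IFEval (instruction following)', 'GPQA Diamond (CoT)', 'MATH (CoT)']
```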
Llama 3.3 was trained with explicit multilingual coverage for eight languages: English, German, French, Italian, Portuguese, Hindi, Spanish, and Thai. This is the same set of officially supported languages first formalized for Llama 3.1 and reused for Llama 3.2 and Llama 3.3.[^2][^3]
The multilingual improvement is most clearly visible on MGSM, a multilingual grade-school math benchmark. Llama 3.3 70B scores 91.1 against Llama 3.1 70B's 86.9 and Llama 3.1 405B's 91.6, a gain of 4.2 points that nearly closes the gap to the 405B.[^2]
Meta's model card explicitly states that Llama 3.3 is not tested or evaluated for languages outside the eight supported ones, and that outputs in unsupported languages may be less accurate or inconsistent.[^2] Developers building applications for languages outside this set are advised to fine-tune on domain-specific data and to add system-level controls to constrain language coverage.
The eight-language scope is narrower than several contemporary open-weights competitors. Qwen 2.5 72B, released by Alibaba Cloud in September 2024, supports 29 languages, and Llama 4 Scout and Maverick (April 2025) advanced Meta's own multilingual coverage to 12 languages with pretraining on 200 languages.[^6][^7] Llama 3.3's eight-language scope was the same as Llama 3.1's, reflecting that the multilingual training data was not expanded in the 3.3 post-training refresh — only its post-training mix was rebalanced.
Llama 3.3 retains the tool-calling format introduced in Llama 3.1, with the same prompt-format conventions, special tokens (<|eot_id|> for zero-shot function calls, <|eom_id|> for built-in tools), and JSON-payload conventions.[^13] The model can invoke user-defined functions by generating structured output in the model card's specified format, and it supports both single-tool and parallel tool calling (multiple function calls in a single inference pass).[^13]
For zero-shot, user-defined functions, Llama 3.3 uses a bracket-based syntax of the form [function_name(parameter1=value1, parameter2=value2)]. Function schemas are supplied as JSON-style dictionaries that include a name, a description, a parameters object, and a required array, in the same spirit as the JSON Schema definitions used by OpenAI-compatible APIs. Function definitions may be placed in either the system or the user message; the prompt format is the same either way.[^13]
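On the client side, output in this bracket-based form has to be turned back into a function name and arguments before anything can be executed. One possible approach, sketched below, relies on the syntax being valid Python call notation and parses it with the standard `ast` module without evaluating any code. The helper `parse_tool_calls` and the `get_weather` function are hypothetical, and the sketch accepts only keyword arguments with literal values.

```python
import ast


def parse_tool_calls(text: str) -> list[dict]:
    """Parse '[name(arg=value, ...), ...]' into [{'name': ..., 'arguments': {...}}]."""
    tree = ast.parse(text.strip(), mode="eval")  # parses only; nothing is executed
    if not isinstance(tree.body, ast.List):
        raise ValueError("expected a bracketed list of calls")
    calls = []
    for node in tree.body.elts:
        if not isinstance(node, ast.Call) or not isinstance(node.func, ast.Name):
            raise ValueError("expected simple function calls")
        if node.args or any(kw.arg is None for kw in node.keywords):
            raise ValueError("only keyword arguments are supported")
        # literal_eval accepts AST nodes and rejects anything that is not a literal.
        arguments = {kw.arg: ast.literal_eval(kw.value) for kw in node.keywords}
        calls.append({"name": node.func.id, "arguments": arguments})
    return calls


print(parse_tool_calls('[get_weather(city="Paris", unit="celsius")]'))
# [{'name': 'get_weather', 'arguments': {'city': 'Paris', 'unit': 'celsius'}}]
```

A list with several entries, as produced by parallel tool calling, yields one dictionary per call in the order the model emitted them.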
For built-in tools, Llama 3.3 ships with native support for Brave Search, Wolfram Alpha, and Code Interpreter using the same conventions as Llama 3.1.[^13] These built-in tools are activated by including the appropriate tags in the system message.
Llama 3.3's reported BFCL v2 score of 77.3 is essentially unchanged from Llama 3.1 70B's 77.5, indicating that, on the public function-calling benchmark, the 70B improvements were not specifically targeted at raw tool-call accuracy.[^2] Meta described the practical improvement in tool calling as one of better calibration in deciding when to use a tool versus answering directly, alongside cleaner JSON output and reduced spurious tool calls — qualitative improvements that BFCL v2 does not separately measure.
The format compatibility with Llama 3.1 ensured that agentic frameworks built for the prior generation — LangChain, LlamaIndex, AutoGen, CrewAI, and others — worked with Llama 3.3 without code changes.
Llama 3.3 is distributed under the Llama 3.3 Community License Agreement, a custom commercial license published at llama.com/llama3_3/license/ and reproduced in the model's GitHub repository.[^14][^15] This license is structurally similar to the Llama 3.1 and Llama 3.2 Community Licenses with some changes specific to the 3.3 release.
The license grants a non-exclusive, worldwide, royalty-free right to use, reproduce, distribute, copy, create derivative works of, and modify the Llama 3.3 model weights for commercial purposes. The key restrictions are:
User threshold: Developers whose products or services built using Llama 3.3 exceed 700 million monthly active users as of the Llama 3.3 release date must request a separate commercial license from Meta, which Meta may grant in its sole discretion.[^14] In practice, this clause is intended to affect only the largest global technology platforms.
Attribution requirements: Distributions of the model or derivative works must include the notice "Llama 3.3 is licensed under the Llama 3.3 Community License, Copyright © Meta Platforms, Inc. All Rights Reserved." Products built with the model must display "Built with Llama" prominently. Any derivative AI model that uses the materials must include "Llama" at the beginning of its name.[^14]
Use restrictions: The license incorporates an Acceptable Use Policy, which prohibits illegal activities and other uses that violate applicable law.[^14]
Governance: The license designates Meta Platforms Ireland Limited as the contracting party for individuals or entities in the EEA or Switzerland, with Meta Platforms, Inc. as the contracting party elsewhere.[^14] Notably, unlike the Llama 3.2 Community License — which restricted the use and distribution of the Llama 3.2 multimodal weights for individuals or companies domiciled in the European Union — the Llama 3.3 license does not contain a comparable EU exclusion clause. This reflects that Llama 3.3 is text-only and so the EU AI Act considerations that prompted the Llama 3.2 multimodal restriction did not apply.
The Open Source Initiative has stated that Meta's Llama licenses, including the Llama 3.3 Community License, do not meet the OSI definition of open source, primarily because of the use restrictions, the 700M MAU threshold, and the attribution requirements.[^16] The model is therefore commonly described as "open weights" rather than "open source" in contexts where this distinction matters, such as enterprise procurement and compliance discussions.
Llama 3.3 70B was released on December 6, 2024 through several channels.
Hugging Face: The official meta-llama/Llama-3.3-70B-Instruct repository on Hugging Face hosted the BF16 weights, the tokenizer, and the model card from the day of release.[^2] Users must complete an access form on the Hugging Face page to acknowledge the license before downloading the weights.
llama.com: Meta's own model-distribution portal at llama.com offered direct downloads of the Llama 3.3 weights and configuration files.
Hosting partners: Major cloud and inference providers added Llama 3.3 70B within days. Groq published a Llama 3.3 70B Versatile endpoint and a higher-speed Llama 3.3 70B SpecDec variant using speculative decoding.[^17][^18] AWS Bedrock, Microsoft Azure AI Foundry, Google Vertex AI, Fireworks, Together.ai, DeepInfra, and OpenRouter all listed Llama 3.3 on or shortly after the release date.[^19]
GitHub Models: GitHub added Llama 3.3 70B Instruct to its GitHub Models catalog in general availability on December 13, 2024, one week after the model release.[^20]
Local-deployment tooling: Ollama, LM Studio, and llama.cpp added Llama 3.3 70B support on release day with the same GGUF tooling already used for Llama 3.1 70B. Apple's MLX framework added quantized variants for Apple Silicon devices in the same window.[^11]
Local deployment with 4-bit quantization (Q4_K_M GGUF or similar) typically requires approximately 40 to 43 gigabytes of memory and runs on hardware such as dual RTX 4090 setups or Apple Silicon machines with 64GB or more of unified memory. Full BF16 precision requires roughly 140 gigabytes of memory, which is achievable only on multi-GPU server configurations.
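The memory figures above follow from simple arithmetic on the parameter count. The sketch below estimates weight storage only, assuming a nominal 70 billion parameters and a uniform number of bits per weight; the function name `weight_memory_gb` is illustrative.

```python
PARAMETERS = 70e9  # nominal count; assumption for illustration


def weight_memory_gb(bits_per_weight: float, parameters: float = PARAMETERS) -> float:
    """Approximate weight storage in gigabytes at a uniform bit width."""
    return parameters * bits_per_weight / 8 / 1e9


for label, bits in [("BF16", 16), ("8-bit", 8), ("4-bit", 4)]:
    print(f"{label}: {weight_memory_gb(bits):.0f} GB")
# BF16: 140 GB
# 8-bit: 70 GB
# 4-bit: 35 GB
```

The 4-bit estimate of 35 GB is a lower bound. Practical 4-bit formats store scaling metadata and keep some tensors at higher precision, so they average somewhat more than 4 bits per weight, and the KV cache and runtime buffers add further overhead, which is consistent with the 40 to 43 gigabytes cited for real builds.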
Initial coverage of Llama 3.3 on December 6, 2024 emphasized the efficiency framing. TechCrunch summarized the release as "a new, more efficient Llama model" that "delivers the performance of Meta's largest Llama model, Llama 3.1 405B, at lower cost."[^21] SiliconANGLE highlighted that the model "performs almost as well" as the 405B at meaningfully lower deployment cost.[^22] InfoQ emphasized the multilingual and instruction-following improvements, situating Llama 3.3 as a practical default for enterprise deployments.[^23] Simon Willison's same-day write-up framed the release primarily as a useful local-deployment option, noting that running 70B-class models on personal laptops had become routine.[^11]
The IFEval result — 92.1 for Llama 3.3 70B against 88.6 for Llama 3.1 405B — became the most frequently cited benchmark in social discussion of the release, often presented as evidence that careful post-training could substitute for parameter scaling on certain task categories. Groq, which operated one of the highest-throughput Llama 3.3 inference endpoints, published an analysis framing the release as a challenge to the "death of scaling laws" narrative, arguing that the model demonstrated continued capability growth from post-training refinement rather than from pure parameter scaling.[^17]
Developer-platform adoption was rapid. GitHub Models general availability on December 13, 2024 made the model available to GitHub Copilot Workspace and GitHub Actions users with no separate API key.[^20] Microsoft Azure AI Foundry announced support within days, positioning Llama 3.3 as one of the recommended open-weights options for Azure customers.[^24] AWS Bedrock added Llama 3.3 70B to its Llama family lineup in the same release window.
The reception was not without criticism. Independent evaluators noted that while the headline benchmarks were favorable, Llama 3.3 inherited Llama 3.1's pretraining cutoff (December 2023) and therefore lagged proprietary models with later training cutoffs on questions about events from 2024. MMLU at 86.0 (unchanged from Llama 3.1 70B) and MMLU Pro at 68.9 (versus 73.3 for the 405B) indicated that, on broad world-knowledge benchmarks, Llama 3.3 70B did not close the gap to the 405B — the post-training improvements were concentrated in instruction following, math, code, and multilingual reasoning, not in general factual recall. Some developers reported that, on complex multi-step tasks in production deployments, Llama 3.1 405B continued to outperform Llama 3.3 70B, even where benchmark numbers were close.
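Meta's next major Llama release after 3.3 was Llama 4, announced and released on April 5, 2025.[^6] Llama 4 represented a substantial discontinuity from the Llama 3 line: its models were the first in the Llama family built on a Mixture-of-Experts architecture and the first to be natively multimodal.[^6][^7]
The Llama 4 family at launch comprised Scout and Maverick, both released as open weights, alongside the Behemoth teacher model, which was announced but not released.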
Llama 4 thus replaced Llama 3.3 70B as Meta's recommended default open-weights model for most production deployments, though the architectural shift (dense 70B versus sparse MoE) meant that infrastructure investments specifically tailored to Llama 3.3 did not transfer directly. Llama 3.3 remained available through Hugging Face, llama.com, and hosting partners after the Llama 4 release.
The reception of Llama 4 was mixed. Independent benchmarking found that Scout and Maverick underperformed Meta's claims on several public leaderboards, and the Behemoth teacher model remained unreleased through 2025 and into 2026.
Llama 3.3 70B occupied a transitional but durable position in the Llama line. It demonstrated that, at least on instruction-following and reasoning benchmarks, careful post-training could substitute substantially for raw parameter scaling — a narrative that resonated with later developments in the open-weights ecosystem, including efficiency-focused releases from competitors such as Alibaba's Qwen 2.5 and DeepSeek's V3.
For developers in production through 2025, Llama 3.3 70B served as the practical default for "best Llama you can run on commodity infrastructure," a role that Llama 4 Scout did not entirely replace because the MoE architecture introduced new memory-management complexity even though active parameters were smaller. Many deployments that had standardized on Llama 3.1 70B during the summer of 2024 transitioned directly to Llama 3.3 70B at the end of the year, then to a mix of Llama 3.3 70B and Llama 4 Scout in mid-2025 depending on context-length and modality requirements.
Because Llama 3.3 kept the prompt format documented for Llama 3.1, the migration from Llama 3.1 70B to Llama 3.3 70B was among the smoothest in the history of open-weights models: typically a weight swap with no code changes and no prompt-engineering revisions. This contributed to Llama 3.3's rapid adoption.
As of this article's last review, Llama 3.3 70B remained broadly available through Hugging Face, llama.com, and major cloud providers. Meta has subsequently consolidated some of its model-distribution strategy around Llama 4 and the work of Meta Superintelligence Labs, but Llama 3.3 has not been deprecated and continues to be served by inference partners for customers that have standardized on it.