Llama 3
Last reviewed
May 24, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v11 · 7,003 words
Llama 3 is a family of open-weight large language models released by Meta on April 18, 2024. The initial release comprised two dense decoder-only Transformer sizes, Llama 3 8B and Llama 3 70B, each available as a pretrained base model and an instruction-tuned chat variant. Both models were trained on more than 15 trillion tokens of publicly available text, used a new tokenizer with a vocabulary of roughly 128,000 tokens (about four times that of Llama 2), supported an 8,192-token context window, and applied grouped-query attention (GQA) at every model size, a change from Llama 2, where GQA had been used only in the 34B and 70B variants[1][2][3].
Meta positioned the launch as the most capable openly available LLM family at the time of release. The 8B model was reported to outperform comparably sized open models such as Mistral 7B and Gemma 7B across standard benchmarks, while the 70B Instruct model achieved scores competitive with proprietary tier-2 frontier systems such as GPT-3.5 Turbo and Claude 3 Sonnet, closing much of the gap with GPT-4[1][2]. The release coincided with the rebuild of Meta AI, Meta's consumer-facing assistant, on top of Llama 3 across Facebook, Instagram, WhatsApp, Messenger, and a new standalone meta.ai web app[1][4][5].
Llama 3 was the foundation of a rapidly expanding family. The 405B-parameter variant arrived three months later as part of Llama 3.1 (July 23, 2024), together with a 128,000-token context window for all sizes. Subsequent releases extended the lineage with multimodal vision-language models and small edge models in Llama 3.2 (September 2024), an efficiency-optimized 70B in Llama 3.3 (December 2024), and a transition to a mixture-of-experts architecture in Llama 4 (April 2025)[6][7][8][9]. This article focuses on the original Llama 3 release; the follow-up releases are covered in their own articles.
By early 2024, Meta's preceding open-weight family had become one of the most widely used model lineages in the industry. The original LLaMA (February 2023) had introduced 7B to 65B parameter models under a research-only license, with leaked weights catalyzing a wave of community fine-tunes such as Alpaca, Vicuna, and Guanaco. Llama 2 (July 2023) became Meta's first openly licensed flagship language model series, available for most commercial use under the Llama 2 Community License. Llama 2 introduced a 4,096-token context window, grouped-query attention on the larger variants, and roughly 2 trillion training tokens[10]. Between Llama 2 and Llama 3, Meta also shipped Code Llama (August 2023), a code-specialist family fine-tuned from Llama 2 weights, and Purple Llama (December 2023), a safety toolkit that included the first Llama Guard input/output classifier.
The open-weight landscape had become highly competitive. Mistral AI released Mistral 7B and Mixtral 8x7B; 01.AI shipped the Yi series; and Alibaba's Qwen family expanded to multilingual variants. At the same time, proprietary frontier models including GPT-4, Claude 2 and 3, and Gemini demonstrated capabilities no open-weight model could match. Meta's stated goal with Llama 3 was to materially close that gap, and to do so under a permissive license that enabled most commercial use[1][11]. CEO Mark Zuckerberg framed the strategy publicly as the conviction that open-weight release was Meta's long-term path to AI relevance: by widening adoption of a single model family, Meta could avoid being beholden to platform owners such as Apple or Google for distribution of AI capabilities, while still operating profitable consumer products on top of the same weights[11][12].
The Llama 3 effort was led inside Meta by Ahmad Al-Dahle, VP and head of Generative AI. Al-Dahle had previously led Meta's XR engineering team after a 16-year career at Apple, and was appointed to lead the new GenAI organization in early 2023 specifically to consolidate Meta's foundation-model work[13]. The training effort drew on contributions from hundreds of researchers and engineers across Meta's AI organization, separate from the more research-focused Fundamental AI Research (FAIR) group under Yann LeCun.
In a March 2024 engineering post, Meta also disclosed the two purpose-built 24,576-GPU H100 clusters that it was using for GenAI training: one using a RoCE-based Ethernet fabric, the other using NVIDIA Quantum-2 400 Gbps InfiniBand. Both clusters became visible parts of the Llama 3 story when the model was announced the following month[14]. Meta's then-stated 2024 buildout target was to operate compute equivalent to nearly 600,000 H100s by year-end, of which roughly 350,000 would be actual H100 GPUs. The Llama 3 release was the first major model to demonstrably benefit from this scaled-up training capacity[14].
Llama 3 launched on April 18, 2024, accompanied by a Meta AI blog post titled "Introducing Meta Llama 3: The most capable openly available LLM to date" and a Hugging Face hub launch post[1][2]. Both pretrained ("Base") and instruction-tuned ("Instruct") variants of the 8B and 70B models were released simultaneously, along with model cards, an updated Acceptable Use Policy, and a refreshed Llama Community License.
The launch was timed alongside the redesign of Meta's consumer AI assistant. The new Meta AI, powered by Llama 3, was made available across the search bars of Facebook, Instagram, WhatsApp, and Messenger, and through a new standalone web app at meta.ai. Meta described the assistant as the first time it was rolling out a single AI assistant across all of its consumer apps at scale, and noted that the assistant would begin expanding internationally beyond the United States[1][5]. Real-time information was provided through integrations with Bing and Google Search, and an image-generation feature based on Meta's own Imagine model was bundled into the assistant interface. Initial English-language availability extended to thirteen countries including Australia, Canada, Ghana, Jamaica, Malawi, New Zealand, Nigeria, Pakistan, Singapore, South Africa, Uganda, Zambia, and Zimbabwe[5][15].
Coverage by TechCrunch, The Verge, IEEE Spectrum, Reuters, and the New York Times framed the release as a strategic move that further legitimized open-weight frontier development[4][16]. Within the first week, Meta reported that Llama 3 models had been downloaded over 1.2 million times across Meta's own llama.com portal, the Hugging Face Hub, and partner cloud catalogs[1]. IEEE Spectrum noted that the 8B Instruct checkpoint topped the trending models list on Hugging Face with over 275,000 downloads in the first five days[16]. The blog post explicitly framed Llama 3 as "the most capable openly available LLM to date" and previewed the upcoming larger model, at the time still in training, that would later ship as the 405B variant in Llama 3.1[1].
The April 2024 release shipped four model checkpoints:
| Variant | Parameters | Context length | Description |
|---|---|---|---|
| Meta-Llama-3-8B | 8B | 8,192 | Pretrained base model, dense Transformer |
| Meta-Llama-3-8B-Instruct | 8B | 8,192 | Instruction-tuned chat variant of the 8B base |
| Meta-Llama-3-70B | 70B | 8,192 | Pretrained base model, dense Transformer |
| Meta-Llama-3-70B-Instruct | 70B | 8,192 | Instruction-tuned chat variant of the 70B base |
All four checkpoints used the same architecture, the same 128,000-token tokenizer, and the same 8,192-token context window. The 8B model targeted the laptop and single-GPU deployment niche, while the 70B model targeted high-end servers (8x A100 80GB or H100 80GB in BF16, or two GPUs with 4-bit quantization)[2]. The Instruct variants used a new chat template based on header tokens such as <|begin_of_text|>, <|start_header_id|>, <|end_header_id|>, and <|eot_id|>, replacing the inline [INST] markers used in Llama 2[2].
According to the official Hugging Face model cards, the 8B base model was trained with a March 2023 knowledge cutoff, while the 70B base model used a December 2023 cutoff[17]. Reported total training compute was 1.3 million GPU-hours on H100-80GB hardware for the 8B model and 6.4 million GPU-hours for the 70B, with cumulative CO2 emissions of 2,290 tCO2eq across both runs (fully offset by Meta's sustainability program)[17].
Meta also pre-announced that larger models were in training. A blog-post screenshot included an unreleased model card hinting at over 400B parameters; Meta explicitly noted that this was a work in progress and would be released later in the year. That larger model eventually arrived as Llama 3.1 405B in July 2024[6].
Alongside the base models, Meta released Llama Guard 2 8B on the same day. Llama Guard 2 is a Llama 3 fine-tune that classifies prompts and responses against the MLCommons hazards taxonomy, producing a structured safe/unsafe verdict together with violated category labels[18]. It served as a drop-in successor to the original Llama Guard built on Llama 2.
Llama 3 retained the basic decoder-only Transformer architecture used in Llama 2, with several scaled-up choices that made meaningful differences at training and serving time[1][3]. The architecture is intentionally conservative: Meta deliberately resisted introducing more exotic mechanisms (such as state-space layers, retrieval modules, or sparse experts) for the initial Llama 3 release, citing training stability and ease of deployment across heterogeneous hardware as priorities for an open-weight launch[3].
The two backbone sizes in the initial Llama 3 release have the following dimensions[3]:
| Parameter | 8B | 70B |
|---|---|---|
| Layers | 32 | 80 |
| Model dimension | 4,096 | 8,192 |
| FFN dimension | 14,336 | 28,672 |
| Attention heads | 32 | 64 |
| Key-value heads (GQA) | 8 | 8 |
| Attention head dimension | 128 | 128 |
| Vocabulary size | 128,256 | 128,256 |
| Context length | 8,192 | 8,192 |
| Peak learning rate | 3 x 10^-4 | 1.5 x 10^-4 |
| Positional encoding | RoPE (theta=500,000) | RoPE (theta=500,000) |
| Activation | SwiGLU | SwiGLU |
| Normalization | RMSNorm (pre-norm) | RMSNorm (pre-norm) |
Both sizes use SwiGLU activation functions in the feed-forward layers, Root Mean Square Normalization (RMSNorm) for internal state normalization in a pre-norm configuration, and Rotary Positional Embeddings (RoPE) for positional encoding[1][2][3]. These design choices were carried over from Llama 2.
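The dimensions in the table are enough to reproduce the headline parameter counts. The following is a minimal sketch, assuming a standard Llama-style layer layout that the table itself does not spell out: no bias terms, three SwiGLU projection matrices per feed-forward block, two RMSNorm weight vectors per layer, and separate (untied) input embedding and output projection matrices. The function name `llama_param_count` is illustrative.

```python
def llama_param_count(layers, d_model, d_ffn, n_heads, n_kv_heads, head_dim, vocab):
    """Count weights of a Llama-style dense decoder from its dimensions."""
    # Attention: Q and O are d_model x (n_heads * head_dim);
    # K and V are narrower because GQA uses fewer key-value heads.
    q_and_o = 2 * d_model * n_heads * head_dim
    k_and_v = 2 * d_model * n_kv_heads * head_dim
    # SwiGLU feed-forward: gate, up, and down projections (assumed layout).
    ffn = 3 * d_model * d_ffn
    # Two RMSNorm weight vectors per layer (before attention, before FFN).
    norms = 2 * d_model
    per_layer = q_and_o + k_and_v + ffn + norms
    # Assumed: separate (untied) input embedding and output projection.
    embeddings = 2 * vocab * d_model
    final_norm = d_model
    return layers * per_layer + embeddings + final_norm


params_8b = llama_param_count(32, 4096, 14336, 32, 8, 128, 128256)
params_70b = llama_param_count(80, 8192, 28672, 64, 8, 128, 128256)
print(f"8B:  {params_8b:,}")
print(f"70B: {params_70b:,}")
```

Under these assumptions the totals come out at roughly 8.03 billion and 70.55 billion weights. About 1.05 billion of the 8B total sits in the two vocabulary-sized matrices, which is one reason the smallest Llama 3 model is nominally "8B" rather than "7B".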
One of the most consequential architectural decisions in Llama 3 was the use of grouped-query attention at every model size, not just the largest one. In Llama 2, GQA had been applied only to the 34B and 70B variants; the 7B and 13B variants used standard multi-head attention[10]. In Llama 3, both the 8B and the 70B use 8 key-value heads regardless of the number of query heads, which means that the 8B model with 32 query heads shares each key-value head across 4 query heads, while the 70B with 64 query heads shares each key-value head across 8 query heads[1][3].
The practical effect of universal GQA is that the key-value cache scales with the number of key-value heads rather than query heads, reducing inference-time memory and bandwidth pressure even for the smallest model. This makes long-context decoding cheaper and enables higher batch sizes in multi-tenant serving, a property that was previously a 70B-only luxury in the Llama 2 family[19]. The original GQA technique was published by Ainslie et al. at Google Research in May 2023, and Llama 3 was one of the first frontier-scale models to apply it uniformly[19].
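The cache saving can be checked with simple arithmetic. The sketch below assumes a 2-byte (BF16 or FP16) cache and a single sequence filling the full 8,192-token window; the multi-head attention figures are a hypothetical baseline in which every query head keeps its own key-value head, not a configuration Meta shipped.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, bytes_per_value=2):
    """Bytes needed to cache keys and values for one sequence."""
    # The factor 2 covers the separate key and value tensors.
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_value


GIB = 2**30
# (name, layers, query heads, key-value heads) taken from the dimensions table.
for name, layers, q_heads, kv_heads in [("8B", 32, 32, 8), ("70B", 80, 64, 8)]:
    gqa = kv_cache_bytes(layers, kv_heads, 128, 8192)
    mha = kv_cache_bytes(layers, q_heads, 128, 8192)  # hypothetical MHA baseline
    print(f"{name}: {q_heads // kv_heads} query heads per KV head, "
          f"GQA {gqa / GIB:.2f} GiB vs MHA {mha / GIB:.2f} GiB")
```

The ratio between the two figures equals the group size, so the 8B cache shrinks by a factor of 4 and the 70B cache by a factor of 8 relative to the multi-head baseline.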
Llama 3 employs Rotary Positional Embeddings (RoPE) to encode position information, applying a rotation matrix that simultaneously incorporates absolute and relative position into the self-attention computation[20]. Meta raised the RoPE base hyperparameter (theta) from 10,000, as used in Llama 2, to 500,000 for Llama 3[3]. A higher base lowers the rotation frequencies and so lengthens the rotation periods of the embedding dimensions, which lets the model distinguish positions over longer sequences without the aliasing that arises when a rotation wraps around within the context. While the initial Llama 3 release retained an 8,192-token context window, the higher RoPE base was a forward-looking choice: it enabled the subsequent extension to a 128,000-token context in Llama 3.1 using continued pretraining and a custom RoPE-scaling scheme, without having to restart with a new positional encoding[3].
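To make the effect of the base concrete, the sketch below computes the rotation period of each RoPE dimension pair using the standard formulation, in which pair i rotates by theta^(-2i/d) radians per position. The head dimension of 128 comes from the dimensions table; the helper names and the choice of 131,072 tokens as the comparison length are illustrative assumptions.

```python
import math


def rope_wavelengths(head_dim, theta):
    """Wavelength, in token positions, of each rotated dimension pair."""
    # Pair i rotates by theta ** (-2i / head_dim) radians per position,
    # so one full turn takes 2 * pi * theta ** (2i / head_dim) positions.
    return [2 * math.pi * theta ** (2 * i / head_dim) for i in range(head_dim // 2)]


def pairs_longer_than(head_dim, theta, context_len):
    """Count dimension pairs that do not complete a full turn within the context."""
    return sum(w > context_len for w in rope_wavelengths(head_dim, theta))


for theta in (10_000, 500_000):
    longest = max(rope_wavelengths(128, theta))
    print(theta, round(longest), pairs_longer_than(128, theta, 131_072))
```

With a base of 10,000 every dimension pair completes at least one full rotation within 131,072 positions, whereas with a base of 500,000 several of the slowest pairs do not, so they can still encode unambiguous position information at that length.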
All Llama 3 models use a dense Transformer design in which every parameter is active during inference[1][3]. Meta chose this approach for its simplicity, training stability, and ease of deployment, even though inference cost in a dense model scales linearly with parameter count. Mixture-of-experts alternatives were explicitly considered but deferred; Meta did not move to MoE until the Llama 4 generation in April 2025[9]. The Llama 3 paper notes that the team prioritized stability at the 405B scale over conditional compute, observing that scaling existing dense recipes with more high-quality data was a more predictable engineering proposition than scaling MoE routing[3].
The Llama 3 tokenizer is a substantial departure from Llama 2. It uses a vocabulary of 128,256 tokens built with byte-pair encoding (BPE) via the tiktoken implementation, a fourfold increase from Llama 2's 32,000-token SentencePiece vocabulary[1][2]. Meta reported that the new tokenizer encodes English text using roughly 15% fewer tokens than the Llama 2 tokenizer for the same passage, which translates directly into both lower compute per character and more effective context utilization (more text fits inside the same 8K window)[1][21].
The tokenizer was deliberately retrained with greater weight on non-English text, code, and mathematical symbols, which improved compression for non-Latin scripts and reduced the tokenization disparity between English and other languages, a noted weakness of the Llama 2 tokenizer. The added vocabulary capacity was also used for new special tokens, including the chat-template header tokens (<|begin_of_text|>, <|start_header_id|>, <|end_header_id|>, <|eot_id|>) that mark turn boundaries in the Instruct variants[2]. The full token table contains 128,000 regular BPE tokens plus 256 reserved special tokens, for a total nominal size of 128,256, matching the dimension of the input embedding and output projection matrices.
Although the headline framing was "128k vocabulary versus 32k," the more important downstream consequence was a measurable reduction in per-token cost at deployment and a smoother tokenization of code-heavy and multilingual inputs. The tokenizer was carried forward unchanged into Llama 3.1, 3.2, and 3.3, becoming a stable basis for the entire Llama 3 generation[6]. The Llama 4 release in April 2025 retained the same 128k tokenizer with extensions for multimodal tokens, preserving the encoding stability of the Llama 3.x downstream ecosystem[9].
The Instruct variants use a structured chat template that wraps each conversational turn in header tokens. A typical formatted prompt looks like the following:
<|begin_of_text|><|start_header_id|>system<|end_header_id|>
{system prompt}<|eot_id|><|start_header_id|>user<|end_header_id|>
{user message}<|eot_id|><|start_header_id|>assistant<|end_header_id|>
{model response}<|eot_id|>
This template replaced the Llama 2 [INST]...[/INST] pattern with a clearer, role-labeled framing that is easier for downstream tooling to parse and for fine-tuners to extend with custom roles[2].
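A small helper shows how downstream tooling might assemble this format. It is a sketch rather than Meta's reference implementation: the function name `format_llama3_prompt` and the message-dictionary shape are illustrative, and the blank line emitted after each role header follows the convention in Meta's published prompt-format documentation, which is an assumption relative to the compact listing above.

```python
def format_llama3_prompt(messages):
    """Render a list of {"role", "content"} dicts as a Llama 3 Instruct prompt."""
    parts = ["<|begin_of_text|>"]
    for message in messages:
        # Assumed: a blank line separates the role header from the content.
        parts.append(f"<|start_header_id|>{message['role']}<|end_header_id|>\n\n")
        parts.append(f"{message['content']}<|eot_id|>")
    # Leave an open assistant header so the model writes the next turn.
    parts.append("<|start_header_id|>assistant<|end_header_id|>\n\n")
    return "".join(parts)


prompt = format_llama3_prompt([
    {"role": "system", "content": "You are a concise assistant."},
    {"role": "user", "content": "Name one change from Llama 2 to Llama 3."},
])
print(prompt)
```

Generation is then expected to stop when the model emits <|eot_id|>. In practice most runtimes ship the template with the tokenizer, so hand-built strings like this are mainly useful for debugging.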
Llama 3 models were pretrained on over 15 trillion tokens of text collected from publicly available sources[1][3]. This was a roughly sevenfold increase over the approximately 2 trillion tokens used for Llama 2[10]. Meta reported that over 5% of the training data, at least 750 billion tokens, consisted of high-quality non-English text spanning more than 30 languages, although the launch blog explicitly cautioned that the 8B and 70B Llama 3 models were not expected to deliver the same multilingual quality as English[1]. The multilingual mix was substantially expanded in Llama 3.1.
Meta developed custom data-filtering pipelines for the pretraining corpus, including heuristic filters for low-quality web content, NSFW classifiers, and text-quality classifiers built specifically for this purpose. For quality scoring, Meta used DistilRoBERTa classifiers trained on web data that had been annotated by Llama 2 itself, creating a bootstrapping pipeline in which the previous generation helped curate the training data for the next[1][3]. A separate fasttext-based classifier predicted whether a document would be referenced by Wikipedia, providing a cheap signal that could be applied to the full corpus[3]. Specialized classifiers were also trained for code and mathematical reasoning content. The deduplication process operated at both the document level (global MinHash-based near-duplicate detection) and the line level (heuristics removing lines that appeared more than six times in each bucket of 30 million documents), with additional filtering based on token-distribution divergence to remove abnormal documents[3].
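The line-level deduplication step is simple enough to sketch. The version below is a toy illustration under stated assumptions, not Meta's pipeline: it treats a bucket as an in-memory list of document strings, counts every occurrence of a line (including repeats within one document), and drops lines seen more than six times. The function name `dedup_lines` is hypothetical.

```python
from collections import Counter


def dedup_lines(bucket, max_occurrences=6):
    """Drop lines that occur more than max_occurrences times across a bucket."""
    counts = Counter(line for doc in bucket for line in doc.splitlines())
    cleaned = []
    for doc in bucket:
        kept = [line for line in doc.splitlines() if counts[line] <= max_occurrences]
        cleaned.append("\n".join(kept))
    return cleaned


# Toy bucket: a navigation line repeated in 7 documents, plus unique content.
bucket = [f"Home | About | Contact\nUnique article text {i}" for i in range(7)]
print(dedup_lines(bucket)[0])
```

In this toy bucket the navigation line appears seven times and is removed, while each document's unique text survives. A production pipeline would need to stream counts over buckets of millions of documents rather than hold them in memory.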
The pretraining mix used during the bulk training stage was approximately 50% general knowledge tokens, 25% mathematical and reasoning tokens, 17% code tokens, and 8% multilingual tokens. Meta noted that this mix was the product of an extensive scaling-law sweep at smaller model sizes, in which different data mixes were trained to compute-equivalent quality on downstream benchmarks before the optimal mix was selected for the flagship runs[3].
Llama 3 was trained on Meta's purpose-built H100 clusters disclosed in March 2024. Two parallel clusters of 24,576 H100 80GB GPUs each were used: one with a RoCE (RDMA over Converged Ethernet) fabric built from Arista 7800 switches together with Wedge400 and Minipack2 OCP rack switches, and the second with NVIDIA Quantum-2 400 Gbps InfiniBand[14]. Both clusters housed GPUs in Meta's Grand Teton OCP chassis (eight H100s per server, connected by NVLink within the chassis) and used the Tectonic distributed flash storage system, providing roughly 240 PB of storage and 2 TB/s of sustained throughput, accessed through a FUSE layer for training data and checkpoints[3][14]. Meta noted that running large training jobs across these clusters required new approaches to job scheduling (the MAST global-scale scheduler), distributed checkpointing, and reliability engineering[3][14].
For Llama 3 itself, Meta reported a 4D parallelism strategy combining tensor parallelism, pipeline parallelism, context parallelism, and fully-sharded data parallelism (FSDP)[3]. Pipeline parallelism was implemented in 16 virtual stages run in interleaved fashion across GPUs to reduce pipeline bubbles, achieving approximately a 5% bubble ratio at favorable batch sizes[3]. On the 405B configuration trained as part of Llama 3.1, the team reported between roughly 380 and 430 TFLOPs per GPU depending on the configuration (approximately 400 TFLOPs per GPU at 8K sequence length and 380 TFLOPs per GPU at 131K sequence length on 16,384 GPUs), corresponding to a BF16 Model FLOPs Utilization of 38% to 43%[3]. The training process also delivered the highest training stability Meta had seen for a model of this scale, which the company credited to automated checkpointing, rapid recovery from hardware faults, and a refined parallelism layout[1].
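Model FLOPs Utilization is the ratio of achieved throughput to the hardware's theoretical peak. The sketch below relates the per-GPU throughput figures quoted above to utilization percentages, assuming a dense BF16 peak of about 989 TFLOPs for an H100 SXM; that peak comes from NVIDIA's published specification, not from the article.

```python
# Assumed peak: dense BF16 throughput of an H100 SXM, about 989 TFLOPs.
H100_BF16_PEAK_TFLOPS = 989.0


def mfu(achieved_tflops, peak_tflops=H100_BF16_PEAK_TFLOPS):
    """Model FLOPs Utilization: achieved throughput over hardware peak."""
    return achieved_tflops / peak_tflops


for label, achieved in [("8K sequences", 400), ("131K sequences", 380)]:
    print(f"{label}: {mfu(achieved):.1%}")
```

Under that assumption, 400 and 380 TFLOPs per GPU correspond to utilization of roughly 40% and 38% respectively.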
In the pretraining recipe documented in the Llama 3 paper, the batch size started at 4 million tokens with a sequence length of 4,096, doubled to 8 million tokens with a sequence length of 8,192 after 252 million tokens, and doubled again to 16 million tokens after 2.87 trillion tokens. During the final 40 million tokens of pretraining, in the Llama 3.1 long-context annealing phase, the context length was held at 128K while the learning rate was annealed toward zero[3].
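The schedule can be written as a step function of the number of tokens seen so far. This is a sketch under two assumptions: the batch sizes are read as token counts, and "million" is taken as exactly 10^6 (the underlying values may well be powers of two). The function name `batch_config` is illustrative.

```python
def batch_config(tokens_seen):
    """Return (batch size in tokens, sequence length) at a point in pretraining."""
    M, T = 10**6, 10**12
    if tokens_seen < 252 * M:
        return 4 * M, 4096
    if tokens_seen < 2.87 * T:
        return 8 * M, 8192
    return 16 * M, 8192


for seen in (0, 10**9, 5 * 10**12):
    batch_tokens, seq_len = batch_config(seen)
    # Sequences per step follow from the token budget and sequence length.
    print(f"{seen:>16,} tokens seen: {batch_tokens:,} tokens/batch, "
          f"{batch_tokens // seq_len:,} sequences of {seq_len}")
```

Starting with a small batch and growing it is a common way to trade early training stability against later throughput, which is consistent with the staged doubling described above.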
The Instruct variants of Llama 3 went through a multi-stage post-training pipeline that combined supervised fine-tuning (SFT), rejection sampling, and direct preference optimization (DPO)[1][3]. The launch blog noted that the post-training stack was iterative: in each round, the model was first fine-tuned on curated instruction data, then improved through rejection sampling (where the model generates multiple candidate responses and a reward model selects the best ones), and finally refined through DPO to align outputs with human preferences[1][3].
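The rejection-sampling step reduces to a best-of-K selection. The sketch below uses hypothetical stand-ins for both models: `generate` plays the role of the current policy and `reward` the role of the reward model, and neither corresponds to a real API.

```python
import itertools


def rejection_sample(prompt, generate, reward, k=10):
    """Draw k candidate responses and keep the one the reward model scores highest."""
    candidates = [generate(prompt) for _ in range(k)]
    return max(candidates, key=lambda response: reward(prompt, response))


# Toy stand-ins: a "policy" that cycles through canned answers and a
# "reward model" that simply prefers longer responses.
canned = itertools.cycle(["4", "The answer is 4.", "2 + 2 = 4, so the answer is 4."])
best = rejection_sample("What is 2 + 2?", lambda p: next(canned),
                        lambda p, r: len(r), k=3)
print(best)
```

The selected responses then become supervised fine-tuning targets, so the quality of the loop depends heavily on how well the reward model tracks human preference.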
A notable design decision documented in the Llama 3 paper is the move away from proximal policy optimization (PPO) for the bulk of post-training. The team reported that RLHF/PPO was less stable and harder to scale than DPO at the data and parameter volumes Llama 3 required, and chose a simple loop of rejection sampling, SFT, and DPO repeated over six rounds (with a learning rate of 1 x 10^-5, a DPO beta of 0.1, and an auxiliary NLL loss with a 0.2 coefficient on chosen responses) as the production pipeline. This contrasts with InstructGPT-style PPO pipelines used by OpenAI and by Llama 2's original post-training stack[3][22]. Rejection sampling generated between 10 and 30 candidates per prompt using a reward model trained on a mix of public and Meta-collected preference data, and the highest-scoring candidates were added to the SFT pool for the next round[3].
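The DPO objective with the auxiliary NLL term can be sketched for a single preference pair in plain Python. The inputs are summed token log-probabilities of the chosen and rejected responses under the policy and a frozen reference model. The per-token averaging of the NLL term is an assumption of this sketch, and the function name `dpo_loss` is illustrative; only the beta of 0.1 and the 0.2 coefficient come from the text above.

```python
import math


def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected,
             chosen_len, beta=0.1, nll_coef=0.2):
    """DPO loss for one preference pair plus an NLL term on the chosen response."""
    # Implicit reward margin: how much more the policy favours the chosen
    # response over the rejected one, relative to the reference model.
    margin = (policy_chosen - ref_chosen) - (policy_rejected - ref_rejected)
    # log(1 + exp(-x)) is the same quantity as -log(sigmoid(x)).
    preference = math.log1p(math.exp(-beta * margin))
    # Assumed: NLL is averaged over the tokens of the chosen response.
    nll = -policy_chosen / chosen_len
    return preference + nll_coef * nll


neutral = dpo_loss(-10.0, -10.0, -10.0, -10.0, chosen_len=10)
improved = dpo_loss(-10.0, -30.0, -10.0, -10.0, chosen_len=10)
print(neutral, improved)
```

The loss falls as the policy lowers the likelihood of the rejected response relative to the reference. The NLL term discourages the policy from achieving that margin by also lowering the likelihood of the chosen response.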
Over 10 million human-annotated examples were used during instruction tuning, alongside a growing volume of synthetic data generated by earlier checkpoints. The Llama 3 paper reports that the SFT data composition was approximately 52.66% general English, 14.89% code, 8.14% exam-like prompts, 21.19% reasoning and tool use, 3.01% multilingual, and 0.11% long-context examples[3]. Reward modeling was conducted on dialogues averaging 4.1 turns and 1,041.6 tokens per example, dominated by English (82.0%) with smaller fractions for code (6.9%), multilingual (5.2%), and reasoning prompts (5.9%)[3]. The post-training stack used in Llama 3 became the template that Meta reused, with refinements, in Llama 3.1, Llama 3.2, and especially Llama 3.3, which derived most of its quality gains over Llama 3.1 70B from improved post-training rather than from any change to the base model[8].
Meta published a benchmark sweep at launch comparing the Llama 3 8B and 70B Instruct variants against contemporary peer models such as Gemma 7B, Mistral 7B Instruct, Gemini Pro 1.0, Claude 3 Sonnet, and Mistral Medium[1]. The headline pretrained-model results were:
| Benchmark | Llama 3 8B | Llama 3 70B |
|---|---|---|
| MMLU (5-shot) | 66.6 | 79.5 |
| AGIEval English (3 to 5 shot) | 45.9 | 63.0 |
| BIG-Bench Hard (3-shot, CoT) | 61.1 | 81.3 |
| ARC-Challenge (25-shot) | 78.6 | 93.0 |
| DROP (3-shot, F1) | 58.4 | 79.7 |
Instruction-tuned model results published at launch[1]:
| Benchmark | Llama 3 8B Instruct | Llama 3 70B Instruct |
|---|---|---|
| MMLU (5-shot) | 68.4 | 82.0 |
| GPQA (0-shot) | 34.2 | 39.5 |
| HumanEval (0-shot) | 62.2 | 81.7 |
| GSM8K (8-shot, CoT) | 79.6 | 93.0 |
| MATH (4-shot, CoT) | 30.0 | 50.4 |
In Meta's launch sweep, Llama 3 8B Instruct outscored Gemma 7B-It and Mistral 7B Instruct on every reported benchmark, often by wide margins. Llama 3 70B Instruct beat Gemini Pro 1.5 and Claude 3 Sonnet on most of the same evaluations, although the comparison shifted again over the following months as proprietary vendors released updated systems (notably Claude 3.5 Sonnet in June 2024 and GPT-4o in May 2024)[1][4]. Independent evaluations on the LMSYS Chatbot Arena and other community leaderboards corroborated the headline finding that Llama 3 70B Instruct was the strongest open-weight chat model at the time of release, while Llama 3 8B Instruct was the strongest small open model[4][23].
Meta also published a small in-house human-evaluation set covering 1,800 prompts across 12 use cases (asking for advice, brainstorming, classification, closed and open question answering, coding, creative writing, extraction, persona, summarization, reasoning, rewriting). The 70B Instruct model was reported as preferred over Claude 3 Sonnet, Mistral Medium, GPT-3.5, and Meta Llama 2 in head-to-head comparisons on that evaluation set[1].
The following table summarizes Llama 3 against the most relevant open-weight peers at the time of release. Numbers are MMLU 5-shot for the pretrained base unless otherwise noted:
| Model | Params | Tokens | License | MMLU |
|---|---|---|---|---|
| Llama 2 7B | 7B | 2T | Llama 2 Community | 45.3 |
| Llama 2 70B | 70B | 2T | Llama 2 Community | 68.9 |
| Mistral 7B | 7B | undisclosed | Apache 2.0 | 60.1 |
| Mixtral 8x7B | 46.7B (12.9B active) | undisclosed | Apache 2.0 | 70.6 |
| Gemma 7B | 7B | 6T | Gemma Terms of Use | 64.3 |
| Qwen 1.5 72B | 72B | 3T | Tongyi Qianwen | 77.5 |
| Llama 3 8B | 8B | 15T+ | Llama 3 Community | 66.6 |
| Llama 3 70B | 70B | 15T+ | Llama 3 Community | 79.5 |
The Llama 3 8B's MMLU of 66.6 came within 2.3 points of Llama 2 70B (68.9) at a fraction of the inference cost, a result Meta credited to overtraining far beyond Chinchilla-optimal token counts on a much higher-quality dataset[1][16].
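The scale of the overtraining can be estimated with a back-of-the-envelope calculation. The sketch below assumes the commonly quoted rule of thumb of about 20 training tokens per parameter for compute-optimal training; that ratio is an approximation brought in for illustration, not a figure from the article, and the nominal parameter counts are rounded.

```python
# Assumed rule of thumb: compute-optimal training uses about 20 tokens per parameter.
CHINCHILLA_TOKENS_PER_PARAM = 20


def overtraining_factor(params, training_tokens):
    """How many times the assumed compute-optimal token budget was used."""
    return training_tokens / (CHINCHILLA_TOKENS_PER_PARAM * params)


for name, params in [("Llama 3 8B", 8e9), ("Llama 3 70B", 70e9)]:
    print(f"{name}: {overtraining_factor(params, 15e12):.0f}x")
```

On this estimate the 8B model saw on the order of a hundred times its compute-optimal token budget and the 70B model roughly ten times, which helps explain why the smaller model gained the most relative to its predecessor.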
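Llama 3 was released under the Llama 3 Community License[24]. The license is described by Meta as "open" but does not meet the Open Source Initiative's formal Open Source Definition[25]. Key terms include a cap on the number of users above which a licensee must obtain a separate license from Meta, a trademark and naming requirement, a restriction on the use of model outputs, and a binding Acceptable Use Policy.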
The OSI specifically objected to the user-cap clause, the trademark/naming requirement, and the use-of-outputs restriction as incompatible with the Open Source Definition. The organization argued that the agreement fails Freedom 0 (the right to use the model for any purpose), discriminates against certain users (point 5 of the OSD), and restricts fields of endeavor (point 6 of the OSD)[25]. Meta and several legal commentators argued in response that the license is functionally equivalent to permissive open source for the vast majority of users, since the user cap affected fewer than ten companies worldwide. The practical effect was that Llama 3 weights became one of the most widely deployed model artifacts in industry, distributed through Meta's own llama.com download portal, the Hugging Face Hub, Kaggle, and a wide range of cloud catalogs.
The Acceptable Use Policy has been treated as a binding extension of the license rather than a separate document, and breach of the AUP is sufficient grounds for license termination under section 5. Coverage by Stanford's Center for Research on Foundation Models noted that the Llama 3 license is one of several "source-available" licenses (alongside the Falcon License and Yi License) that approximate but do not satisfy formal open-source standards[26].
The initial reception of Llama 3 was strongly positive across the AI research community and the broader tech press. Within the first week of release, Meta reported that the models had been downloaded more than 1.2 million times[1]. By August 2024, Meta reported that cumulative downloads of Llama models (across all generations) had crossed 350 million on the Hugging Face Hub alone, a tenfold year-over-year increase, with monthly downloads exceeding 20 million and more than 60,000 derivative fine-tunes published in the same period[27].
Commentary in IEEE Spectrum, authored by Matthew S. Smith, framed Llama 3 as the release that "established Meta as the leader in 'Open' AI," highlighting both the headline benchmark results and the strategic effect of releasing weights without monetizing direct API access[16]. The Verge and TechCrunch emphasized the Meta AI relaunch alongside the model release, and Reuters and the Financial Times covered the competitive implications for OpenAI, Anthropic, and Google[4]. The 8B model in particular was widely adopted as a baseline for derivative fine-tuning, and the 70B model rapidly displaced Llama 2 70B as the default open-weight serving target at hosted-inference providers such as Together AI, Fireworks AI, Replicate, Groq, and DeepInfra. Cloud-vendor partnerships ensured that the new models were immediately available on Amazon Bedrock, Azure AI, Google Vertex AI, IBM watsonx, Databricks, and Snowflake from day one[1].
Adoption metrics from hosted-inference benchmarks tracked by Artificial Analysis showed Llama 3 70B reaching token-generation rates above 800 tokens per second on Groq's LPU inference engine, an order of magnitude above conventional GPU baselines, which helped drive uptake among latency-sensitive applications such as voice agents and real-time coding assistants[28]. Enterprise customers cited at launch and in subsequent Meta updates included AT&T (customer support), DoorDash (engineering workflows), Goldman Sachs (document processing), Shopify (40 to 60 million daily inferences), and Zoom (AI Companion)[27].
Within months, a large derivative ecosystem had formed on top of Llama 3, including code specialists, role-play and narrative models, domain-specific medical and legal fine-tunes, and a variety of community quantization packages for llama.cpp, Ollama, vLLM, and similar local-inference runtimes. Notable downstream models include:
Llama 3 also served as the base model for Meta's own initial Llama Guard 2 safety classifier, released the same day, and was used to bootstrap subsequent safety classifiers including Llama Guard 3 and Prompt Guard introduced with Llama 3.1[6][18].
Llama 3 was the engine behind the largest single rollout of Meta's consumer AI assistant. The new Meta AI surfaced inside the search bar of Facebook, Instagram, WhatsApp, and Messenger, and on a new standalone meta.ai web app[1][5]. At launch, Meta described it as the first time a single AI assistant would be available across all of its consumer apps at scale, with rollout beginning in the United States and several other English-speaking countries on April 18, 2024 and expansions planned for additional regions through the rest of the year.
The Meta AI assistant integrated Llama 3 with real-time information retrieval (via Bing and Google Search), an image-generation capability based on Meta's Imagine model (which produces images in real time as the user types), and an animation feature that could turn generated stills into short animations[1][15]. Within enterprise platforms, Meta also began exposing Llama 3 endpoints through Meta AI Studio for selected developers and creator partners.
By the end of 2024, Meta reported that the Meta AI assistant had crossed roughly 600 million monthly active users across all of its consumer surfaces, making it one of the most widely used consumer AI assistants on the planet and a direct downstream beneficiary of the Llama 3 launch[30]. The Ray-Ban Meta smart glasses, which gained generative AI features later in 2024, also relied on backend variants of the Llama family.
On July 23, 2024, Meta released the Llama 3.1 family[6]. The release did three things that materially extended the original April 2024 Llama 3 launch:
The Llama 3.1 release was paired with the publication of the technical paper "The Llama 3 Herd of Models" (arXiv:2407.21783), which documents both the April 2024 Llama 3 release and the July 2024 Llama 3.1 release in detail[3]. A Mark Zuckerberg open letter titled "Open Source AI Is the Path Forward" accompanied the launch, framing open-weight release as Meta's deliberate long-term strategy and noting that running inference on Llama 3.1 405B on customer infrastructure was approximately 50% the cost of using closed models like GPT-4o[12]. Meta also released an official FP8 quantized variant of the 405B model that fit onto a single 8x H100 node by reducing memory requirements from 810 GB (BF16) to 405 GB (FP8) with minimal accuracy loss[32]. Crucially for understanding the Llama 3 release itself: the 405B variant is part of Llama 3.1, not the original April 2024 Llama 3 release. The two are routinely conflated in casual coverage, but Meta's own naming and the technical paper treat them as separate releases.
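The memory figures follow directly from the bytes needed per weight. The sketch below counts weights only, ignoring activations and the key-value cache, and assumes a node of eight GPUs with 80 GB each; the helper name is illustrative.

```python
def weight_memory_gb(params, bits_per_weight):
    """Memory for the weights alone, in decimal gigabytes."""
    return params * bits_per_weight / 8 / 1e9


NODE_GB = 8 * 80  # one node of eight 80 GB H100s
for label, bits in [("BF16", 16), ("FP8", 8)]:
    need = weight_memory_gb(405e9, bits)
    print(f"{label}: {need:.0f} GB, fits on one node: {need < NODE_GB}")
```

At 16 bits per weight the 405B model exceeds the 640 GB available on such a node before any working memory is counted, while at 8 bits it leaves headroom for the cache and activations.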
The Llama 3 family expanded through three further releases before the next generation:
Each successor reused the Llama 3 tokenizer, the chat template introduced in Llama 3, and the basic post-training pipeline first deployed in Llama 3, making the April 2024 release the technical foundation for the entire Llama 3.x generation.
As of May 2026, the original April 2024 Llama 3 8B and 70B models remain widely deployed even though the same model sizes have been substantially superseded by their Llama 3.1, 3.3, and Llama 4 counterparts. The reasons are partly historical and partly practical: the Llama 3 8B and 70B checkpoints are stable, well-understood, and trivially deployable in 4-bit quantization through community runtimes such as llama.cpp and Ollama; many production stacks were built and validated against these specific checkpoints; and they continue to serve as the canonical baseline against which subsequent open-weight releases are measured. A long tail of community fine-tunes (chat, code, role-play, narrative, domain) continues to use the April 2024 base weights as a starting point.
The broader legacy of Llama 3 is twofold. First, it demonstrated that overtraining smaller models far beyond Chinchilla-optimal token budgets (for example, training an 8B model on 15 trillion tokens) yields a model that materially outperforms much larger predecessors and shifts the cost-quality frontier of small open models. The same pattern was subsequently reproduced and extended by Mistral, DeepSeek, and others, contributing to the now-standard practice of training open-weight base models on at least an order of magnitude more tokens than Chinchilla scaling laws would recommend[16]. Second, the release reset expectations for what an "open" model release could look like at scale: a same-day Meta AI consumer-product launch on top of the same weights, multiple cloud partner launches, a developer SDK, model cards, an Acceptable Use Policy, a layered safety stack, and global press coverage. That template has been followed, with minor variations, by every subsequent Llama release[6][7][8][9].
The April 18, 2024 release was also the first major model trained on Meta's purpose-built 24K H100 clusters, marking the moment when those clusters transitioned from infrastructure announcement to demonstrated production capability, a story that has continued to unfold through subsequent Llama generations and that culminated in Llama 4 being trained on what Meta described as "tens of thousands more" H100 and H200 GPUs in early 2025[9][14].
A May 2025 Meta restructuring split the Generative AI organization into an AI Products team and an applied AGI Foundations unit, with the research-focused FAIR group under Yann LeCun remaining separate; Al-Dahle's role evolved into co-leading the AGI Foundations unit, with continued ownership of the Llama model family[33]. The Llama 3 herd, including the original April 2024 release, has remained Meta's reference open-weight platform through this transition.