LLaMA 3 (Large Language Model Meta AI 3) is a family of large language models developed and released by Meta beginning in April 2024. The LLaMA 3 series represents a significant leap over its predecessor, LLaMA 2, in both scale and capability, and includes dense text models, multimodal vision-language models, and lightweight models designed for on-device deployment. With the release of LLaMA 3.1 405B in July 2024, Meta introduced the largest openly available language model at the time, directly challenging proprietary models from OpenAI, Google, and Anthropic. The LLaMA 3 family has become one of the most widely adopted open-weight model families in the AI ecosystem, accumulating over 1.2 million downloads in its first week alone and spawning a broad ecosystem of fine-tuned variants, developer tools, and enterprise integrations [1]. By early 2025, Meta reported that Llama models had been downloaded over 400 million times across all platforms, representing a tenfold increase over the prior year [2].
Meta released the LLaMA 3 family in several waves throughout 2024, each introducing new model sizes, capabilities, or architectural improvements. The following table summarizes all major releases in the LLaMA 3 generation.
| Model | Release Date | Parameters | Context Length | Key Features |
|---|---|---|---|---|
| LLaMA 3 8B | April 18, 2024 | 8B | 8,192 | Dense transformer, GQA, 128K vocabulary |
| LLaMA 3 70B | April 18, 2024 | 70B | 8,192 | Dense transformer, GQA, 128K vocabulary |
| LLaMA 3.1 8B | July 23, 2024 | 8B | 128,000 | Extended context, multilingual (8 languages) |
| LLaMA 3.1 70B | July 23, 2024 | 70B | 128,000 | Extended context, multilingual (8 languages) |
| LLaMA 3.1 405B | July 23, 2024 | 405B | 128,000 | Largest open-weight model, trained on 16K H100 GPUs |
| LLaMA 3.2 1B | September 25, 2024 | 1B | 128,000 | Lightweight text model for edge/mobile |
| LLaMA 3.2 3B | September 25, 2024 | 3B | 128,000 | Lightweight text model for edge/mobile |
| LLaMA 3.2 11B-Vision | September 25, 2024 | 11B | 128,000 | Multimodal (text + image input), vision encoder |
| LLaMA 3.2 90B-Vision | September 25, 2024 | 90B | 128,000 | Multimodal (text + image input), vision encoder |
| LLaMA 3.3 70B | December 6, 2024 | 70B | 128,000 | 405B-level performance at 70B cost |
LLaMA 3 uses a relatively standard decoder-only transformer architecture, but with several important design choices that improve efficiency and performance at scale.
The three primary backbone sizes in the LLaMA 3 family differ in layer depth, model width, feed-forward network dimension, and attention head count. The following table details the architectural specifications for each size [3].
| Parameter | 8B | 70B | 405B |
|---|---|---|---|
| Layers | 32 | 80 | 126 |
| Model Dimension | 4,096 | 8,192 | 16,384 |
| FFN Dimension | 14,336 | 28,672 | 53,248 |
| Attention Heads | 32 | 64 | 128 |
| Key-Value Heads | 8 | 8 | 8 |
| Attention Head Dimension | 128 | 128 | 128 |
| Vocabulary Size | 128,000 | 128,000 | 128,000 |
| Activation Function | SwiGLU | SwiGLU | SwiGLU |
| Normalization | RMSNorm | RMSNorm | RMSNorm |
| Positional Encoding | RoPE (theta=500K) | RoPE (theta=500K) | RoPE (theta=500K) |
All three model sizes share the same fundamental building blocks: SwiGLU activation functions in the feed-forward layers, Root Mean Square Normalization (RMSNorm) for internal state normalization, and Rotary Positional Embeddings (RoPE) for encoding positional information. The consistent use of 8 key-value heads across all sizes, regardless of the number of query heads, is a defining characteristic of the LLaMA 3 architecture [3].
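The specifications in the table above can be captured in a small configuration object. The following Python sketch is purely illustrative (the class and names are not Meta's code); it also verifies the invariant that the head dimension times the query-head count equals the model dimension for every size:

```python
from dataclasses import dataclass

@dataclass
class LlamaConfig:
    # Architectural hyperparameters from the LLaMA 3 specification table.
    n_layers: int
    d_model: int
    d_ffn: int
    n_heads: int
    n_kv_heads: int = 8           # fixed at 8 across all sizes (GQA)
    head_dim: int = 128           # identical for every size
    vocab_size: int = 128_000
    rope_theta: float = 500_000.0

LLAMA3_CONFIGS = {
    "8B":   LlamaConfig(n_layers=32,  d_model=4_096,  d_ffn=14_336, n_heads=32),
    "70B":  LlamaConfig(n_layers=80,  d_model=8_192,  d_ffn=28_672, n_heads=64),
    "405B": LlamaConfig(n_layers=126, d_model=16_384, d_ffn=53_248, n_heads=128),
}

# Sanity check: head_dim * n_heads == d_model for every size in the table.
for name, cfg in LLAMA3_CONFIGS.items():
    assert cfg.head_dim * cfg.n_heads == cfg.d_model, name
```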
The LLaMA 3 tokenizer uses a vocabulary of 128,000 tokens based on byte pair encoding (BPE) via the tiktoken library, a fourfold increase over the 32,000-token vocabulary in LLaMA 2. This larger vocabulary encodes language more efficiently, reducing the number of tokens required to represent a given text passage and thereby improving both throughput and effective context utilization. In practice, the 128K vocabulary produces roughly 15% fewer tokens than the LLaMA 2 tokenizer for the same English text, which means more text fits within the same context window [1].
Grouped-query attention (GQA) is used across all LLaMA 3 model sizes, including the 8B variant. In GQA, multiple query heads share a smaller number of key-value heads (8 key-value heads in the case of LLaMA 3), which reduces the memory footprint of the key-value cache during inference and improves decoding speed without meaningful degradation in output quality. For the 8B model with 32 query heads and 8 key-value heads, each key-value head is shared across 4 query heads. For the 70B model with 64 query heads, each key-value head is shared across 8 query heads. For the 405B model with 128 query heads, each key-value head serves 16 query heads. LLaMA 2 had used GQA only in its 70B variant, so extending it to all sizes was a notable architectural decision [4].
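The KV-head sharing described above can be sketched in a few lines of NumPy. The shapes follow the 8B configuration (32 query heads, 8 KV heads); this is an illustration of the mechanism, not Meta's implementation:

```python
import numpy as np

def grouped_query_attention(q, k, v, n_kv_heads):
    """Minimal grouped-query attention over a single sequence.

    q: (n_heads, seq, head_dim); k, v: (n_kv_heads, seq, head_dim).
    Each KV head is shared by n_heads // n_kv_heads query heads.
    """
    n_heads, seq, head_dim = q.shape
    group = n_heads // n_kv_heads            # 4 for 8B, 8 for 70B, 16 for 405B
    # Broadcast each KV head across its group of query heads.
    k = np.repeat(k, group, axis=0)          # (n_heads, seq, head_dim)
    v = np.repeat(v, group, axis=0)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(head_dim)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v                       # (n_heads, seq, head_dim)

# The KV cache scales with n_kv_heads, not n_heads: for the 70B model,
# storing 8 KV heads instead of 64 shrinks the cache by 8x.
rng = np.random.default_rng(0)
q = rng.standard_normal((32, 16, 128))       # 8B-style: 32 query heads
k = rng.standard_normal((8, 16, 128))        # but only 8 KV heads
out = grouped_query_attention(q, k, v=rng.standard_normal((8, 16, 128)),
                              n_kv_heads=8)
assert out.shape == (32, 16, 128)
```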
LLaMA 3 employs Rotary Positional Embeddings (RoPE) to encode positional information. RoPE applies a rotation matrix to encode absolute position while simultaneously incorporating relative position information directly into the self-attention computation. For the LLaMA 3 series, Meta increased the RoPE base frequency hyperparameter to 500,000 (up from 10,000 in LLaMA 2), which enables better support for longer context lengths of up to 8,192 tokens in the initial release and 128,000 tokens in LLaMA 3.1 [3]. The higher base frequency stretches the rotation period, allowing the model to distinguish positions over longer ranges without running into the periodic aliasing that would occur at lower frequencies.
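A sketch of RoPE with the 500,000 base frequency, illustrating the relative-position property that self-attention relies on. The interleaved dimension pairing used here is one common convention; real implementations differ in such details:

```python
import numpy as np

def rope_angles(head_dim, seq_len, theta=500_000.0):
    """Per-position rotation angles for RoPE with LLaMA 3's base frequency."""
    # One frequency per pair of dimensions; a higher theta slows the rotation
    # of later dimension pairs, stretching the usable position range.
    inv_freq = theta ** (-np.arange(0, head_dim, 2) / head_dim)
    positions = np.arange(seq_len)
    return np.outer(positions, inv_freq)      # (seq_len, head_dim // 2)

def apply_rope(x, angles):
    """Rotate consecutive dimension pairs of x (seq_len, head_dim)."""
    x1, x2 = x[:, 0::2], x[:, 1::2]
    cos, sin = np.cos(angles), np.sin(angles)
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin
    out[:, 1::2] = x1 * sin + x2 * cos
    return out

# Relative-position property: the dot product of two rotated copies of a
# vector depends only on their position offset, which is what attention sees.
x = np.ones((1, 128))
ang = rope_angles(128, 4096)
a = apply_rope(x, ang[100:101]) @ apply_rope(x, ang[110:111]).T  # offset 10
b = apply_rope(x, ang[200:201]) @ apply_rope(x, ang[210:211]).T  # offset 10
assert np.isclose(a, b)
```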
Unlike some competing models that use mixture-of-experts (MoE) architectures, all LLaMA 3 models employ a dense transformer design in which every parameter is active during inference. Meta chose this approach for its simplicity, training stability, and ease of deployment, though it means that inference cost scales linearly with parameter count [3]. The team explicitly evaluated MoE alternatives during the design phase but concluded that the dense architecture provided better training stability at the scales they targeted and was simpler to optimize for deployment across diverse hardware platforms.
LLaMA 3 models were pretrained on over 15 trillion tokens of text collected from publicly available sources. This represents a roughly sevenfold increase over the 2 trillion tokens used for LLaMA 2. Over 5% of the training data (roughly 750 billion tokens) consisted of text in more than 30 non-English languages, improving multilingual performance [1].
Meta developed custom data filtering pipelines using multiple stages of quality control. The pipeline included heuristic filters for removing low-quality content, NSFW classifiers, and text quality classifiers trained specifically for this purpose. For quality scoring, Meta used DistilRoBERTa models trained on web data that had been annotated by LLaMA 2 itself, creating a bootstrapping approach where the previous generation model helped curate data for the next [3]. Specialized classifiers were also trained for code and reasoning content, using prompt-tuned models to identify web pages containing mathematical deductions, STEM reasoning, and code interleaved with natural language.
The deduplication process operated at both the document and line levels. Document-level deduplication used MinHash-based near-duplicate detection to remove redundant content across the corpus. Line-level deduplication employed heuristics such as duplicated n-gram coverage ratios to strip out repetitive content like logging messages or error traces. Additionally, token-distribution Kullback-Leibler divergence was used to filter out documents containing abnormal distributions of tokens compared to the overall training corpus [3].
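The two deduplication levels can be sketched as follows. The shingle size, hash count, and thresholds below are illustrative choices for the sketch, not the values Meta used:

```python
import hashlib

def minhash_signature(text, num_hashes=64, shingle_len=5):
    """MinHash signature over word shingles; near-duplicate documents
    produce signatures that agree in most positions."""
    words = text.split()
    shingles = {" ".join(words[i:i + shingle_len])
                for i in range(max(1, len(words) - shingle_len + 1))}
    sig = []
    for seed in range(num_hashes):
        sig.append(min(
            int.from_bytes(hashlib.md5(f"{seed}:{s}".encode()).digest()[:8], "big")
            for s in shingles))
    return sig

def estimated_jaccard(sig_a, sig_b):
    """Fraction of agreeing signature positions estimates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

def duplicated_ngram_ratio(line, n=4):
    """Line-level heuristic: fraction of n-grams that repeat within the line,
    which is high for boilerplate such as logging output or error traces."""
    words = line.split()
    grams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
    if not grams:
        return 0.0
    return 1.0 - len(set(grams)) / len(grams)

doc = "the quick brown fox jumps over the lazy dog " * 3
near_dup = doc + "extra tail words here"
assert estimated_jaccard(minhash_signature(doc), minhash_signature(near_dup)) > 0.5
assert duplicated_ngram_ratio("error retry error retry error retry error retry") > 0.3
```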
The approximate data mix composition for pretraining was as follows [3]:
| Data Category | Share of Training Mix |
|---|---|
| General knowledge (web text) | ~50% |
| Mathematical and reasoning data | ~25% |
| Code | ~17% |
| Multilingual data | ~8% |
The LLaMA 3.1 405B model was trained on a cluster of 16,384 NVIDIA H100 80GB GPUs, making it one of the largest single training runs conducted on publicly disclosed infrastructure at the time [5]. The total training compute for the 405B model was approximately 3.8 x 10^25 FLOPs; the 405B run accounted for roughly 31 million of the cumulative 39.3 million GPU hours consumed across the LLaMA 3.1 family on H100-80GB hardware (rated at a thermal design power of 700W per GPU) [3].
Training a model of this scale required advanced distributed training techniques. Meta employed a 4D parallelism strategy that combined four forms of parallelism simultaneously [3]:
| Parallelism Type | Description |
|---|---|
| Tensor Parallelism (TP) | Splits individual weight matrices across multiple GPUs within a node |
| Pipeline Parallelism (PP) | Distributes different layers of the model across groups of GPUs |
| Context Parallelism (CP) | Splits long input sequences across GPUs for memory efficiency |
| Data Parallelism (DP) | Replicates the model and distributes training batches across groups |
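The four parallelism degrees must multiply to the cluster size. The sketch below checks one plausible tiling of the 16,384-GPU cluster; the specific degrees shown are illustrative, not Meta's published schedule:

```python
def parallelism_layout(total_gpus, tp, cp, pp, dp):
    """Check that a 4D parallelism configuration tiles the cluster exactly,
    and report how the model and data are split across the four axes."""
    assert tp * cp * pp * dp == total_gpus, "degrees must multiply to cluster size"
    return {
        "weight_shards_per_layer": tp,   # tensor parallelism
        "sequence_chunks": cp,           # context parallelism
        "pipeline_stages": pp,           # pipeline parallelism
        "model_replicas": dp,            # data parallelism
    }

# One way to tile 16,384 H100s (illustrative degrees):
layout = parallelism_layout(16_384, tp=8, cp=1, pp=16, dp=128)
assert layout["model_replicas"] == 128
```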
The pretraining run for the 405B model took approximately 54 days. During that period, the cluster experienced 466 job interruptions, of which 47 were planned (for automated maintenance) and 419 were unexpected. The unexpected interruptions broke down as follows: 148 were caused by faulty GPUs (30.1%), 72 by GPU HBM3 memory errors (17.2%), 35 by network switch and cable problems (8.4%), 19 by GPU SRAM memory issues (4.5%), and 17 by GPU system processor failures (4.1%). Only two CPU failures were recorded during the entire period. Despite these challenges, the team achieved over 90% effective training time through automated checkpointing and rapid recovery procedures [6].
The models were trained on sequences of 8,192 tokens using document-level masking to prevent self-attention from crossing document boundaries. An important finding from the training process was that model performance continued to improve log-linearly well beyond the Chinchilla-optimal compute allocation. While the Chinchilla-optimal amount of training data for an 8B parameter model is roughly 200 billion tokens, Meta observed continued gains when training on roughly 75 times more data (15 trillion tokens), which informed their decision to overtrain the smaller models significantly [1].
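Document-level masking combines the usual causal constraint with a same-document constraint. A minimal sketch for a packed training sequence:

```python
def document_causal_mask(doc_ids):
    """Causal attention mask that also blocks attention across document
    boundaries, given a per-token document id for a packed sequence.
    mask[i][j] is True when token i may attend to token j."""
    n = len(doc_ids)
    return [[j <= i and doc_ids[i] == doc_ids[j] for j in range(n)]
            for i in range(n)]

# Two documents packed into one 6-token training sequence:
mask = document_causal_mask([0, 0, 0, 1, 1, 1])
assert mask[2][0] is True      # within-document, earlier token: allowed
assert mask[3][2] is False     # token of doc 1 cannot see doc 0
assert mask[1][2] is False     # causal: no attending to the future
```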
Instruct-tuned variants of all LLaMA 3 models underwent supervised fine-tuning on publicly available instruction datasets along with over 10 million human-annotated examples. Post-training also included reinforcement learning from human feedback (RLHF) using rejection sampling and direct preference optimization (DPO) to align model outputs with human preferences.
The post-training pipeline followed multiple iterative rounds. In each round, the model was first fine-tuned on curated instruction data (SFT), then improved through rejection sampling (RS) where the model generated multiple candidate responses and a reward model selected the best ones, and finally refined through DPO where the model learned to prefer higher-quality outputs over lower-quality alternatives [3]. This iterative approach allowed each round to build on the improvements of the previous round, progressively improving both helpfulness and safety.
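The DPO stage in each round optimizes a contrastive objective over preference pairs. The sketch below follows the standard DPO per-pair loss formulation rather than Meta's exact training code:

```python
import math

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Direct preference optimization loss for one preference pair.

    Inputs are summed log-probabilities of the chosen and rejected responses
    under the policy being trained (pi_*) and the frozen reference model
    (ref_*). Lower loss means the policy prefers the chosen response more
    strongly than the reference model does.
    """
    margin = beta * ((pi_chosen - ref_chosen) - (pi_rejected - ref_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))  # -log(sigmoid(margin))

# A policy that already favors the chosen response incurs a smaller loss:
better = dpo_loss(pi_chosen=-10.0, pi_rejected=-14.0,
                  ref_chosen=-12.0, ref_rejected=-12.0)
worse = dpo_loss(pi_chosen=-14.0, pi_rejected=-10.0,
                 ref_chosen=-12.0, ref_rejected=-12.0)
assert better < worse
```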
For LLaMA 3.3, the fine-tuning data was further expanded to over 25 million synthetically generated examples, reflecting Meta's growing investment in synthetic data generation as a post-training technique [7].
The initial LLaMA 3 release on April 18, 2024 included two dense text models: 8B and 70B parameters. Both models were released in base (pretrained) and instruct-tuned variants. Despite the relatively short 8,192-token context window, these models set new standards for open-weight performance at their respective sizes [1].
The initial LLaMA 3 models demonstrated strong results across standard benchmarks for their parameter counts [3].
| Benchmark | LLaMA 3 8B | LLaMA 3 70B |
|---|---|---|
| MMLU (5-shot) | 69.4 | 83.6 |
| HumanEval (0-shot) | 72.6 | 80.5 |
| GSM8K (8-shot, CoT) | 84.5 | 95.1 |
| MATH (0-shot, CoT) | 51.9 | 68.0 |
| ARC-Challenge (25-shot) | 83.4 | 94.8 |
The LLaMA 3 8B outperformed the previous LLaMA 2 70B on several benchmarks despite being nearly nine times smaller, illustrating the compounding benefits of more training data, a larger vocabulary, and grouped-query attention at all scales [1]. The 70B model was competitive with the best proprietary models available in early 2024, including GPT-3.5 Turbo and early versions of Claude 3 Sonnet.
The LLaMA 3.1 release on July 23, 2024 updated the 8B and 70B models and introduced the new 405B model, all with extended context lengths of 128,000 tokens and multilingual support for eight languages: English, German, French, Italian, Portuguese, Hindi, Spanish, and Thai. This was a meaningful expansion from the English-centric focus of earlier LLaMA releases [5].
The LLaMA 3.1 405B model was the largest openly available language model at the time of its release. With 405 billion parameters and a 128K token context window, it was positioned as a direct competitor to proprietary frontier models like GPT-4 and Claude 3 Opus [5].
The following table compares the LLaMA 3.1 models against leading proprietary models on standard benchmarks. All scores are for the instruct-tuned variants [3][5].
| Benchmark | LLaMA 3.1 8B | LLaMA 3.1 70B | LLaMA 3.1 405B | GPT-4o | Claude 3.5 Sonnet |
|---|---|---|---|---|---|
| MMLU (5-shot) | 73.0 | 86.0 | 87.3 | 88.7 | 88.3 |
| HumanEval (0-shot) | 72.6 | 80.5 | 89.0 | 90.2 | 92.0 |
| GSM8K (8-shot, CoT) | 84.5 | 95.1 | 96.8 | 96.1 | 96.4 |
| MATH (0-shot, CoT) | 51.9 | 68.0 | 73.8 | 76.6 | 71.1 |
| ARC-Challenge (25-shot) | 83.4 | 94.8 | 96.9 | -- | -- |
| GPQA (0-shot) | -- | 48.0 | 50.7 | -- | -- |
| Tool Use (BFCL) | -- | 77.5 | 88.5 | 83.6 | 90.2 |
| Multilingual (MGSM) | -- | 86.9 | 91.6 | 90.5 | 91.6 |
On the Scale AI SEAL leaderboard, LLaMA 3.1 405B ranked second in math and reasoning, fourth in coding, and first in instruction following [5]. Experimental evaluations suggested the 405B model performed on par with GPT-4-0125-Preview and Claude 3.5 Sonnet, winning and losing roughly the same percentage of head-to-head comparisons, though it fell slightly behind GPT-4o in direct matchups, winning only 19.1% of comparisons [8].
Meta CEO Mark Zuckerberg described the 405B release as a pivotal moment for open-source AI, comparing it to the role Linux played in democratizing server operating systems. The release was accompanied by a blog post titled "Open Source AI Is the Path Forward," in which Zuckerberg argued that openly available models would ultimately outcompete closed alternatives through community-driven innovation [9].
Released on September 25, 2024, LLaMA 3.2 expanded the family in two significant directions: multimodal vision-language models and lightweight models for edge devices [10].
The LLaMA 3.2 11B-Vision and 90B-Vision models accept both text and image inputs and produce text outputs. They were designed as drop-in replacements for their text-only counterparts while adding image reasoning capabilities. These models can perform tasks such as document understanding (including charts and graphs), image captioning, and visual grounding (identifying objects in images based on natural language descriptions) [10].
The vision capability was integrated through a separately trained vision adapter. This adapter consists of a dedicated image encoder paired with a series of cross-attention layers that feed image encoder representations into the core language model. The adapter weights were trained on text-image pairs to align the visual representations with the language model's internal representations, while the pretrained language model weights remained largely frozen during this alignment stage. This design allowed the vision models to retain the full text performance of their LLaMA 3.1 counterparts while gaining image understanding capabilities [10].
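The adapter's cross-attention can be sketched as a single-head layer in which text tokens query image-encoder outputs. Dimensions and weights below are illustrative, not the production architecture:

```python
import numpy as np

def cross_attention(text_h, image_feats, W_q, W_k, W_v):
    """Single-head cross-attention: text tokens (queries) attend to image
    encoder outputs (keys/values), as in an adapter layer inserted into a
    frozen language model."""
    q = text_h @ W_q                     # (n_text, d)
    k = image_feats @ W_k                # (n_img, d)
    v = image_feats @ W_v
    scores = q @ k.T / np.sqrt(q.shape[-1])
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    # Residual connection keeps the frozen LM's text pathway intact.
    return text_h + w @ v

rng = np.random.default_rng(1)
d = 64
out = cross_attention(rng.standard_normal((10, d)),   # 10 text tokens
                      rng.standard_normal((50, d)),   # 50 image patches
                      *(0.1 * rng.standard_normal((d, d)) for _ in range(3)))
assert out.shape == (10, d)
```

Because only the adapter weights (`W_q`, `W_k`, `W_v` here) are trained during alignment, the text-only behavior of the base model is preserved, matching the design goal described above.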
The following table shows benchmark scores for the instruct-tuned vision models on image understanding tasks [11].
| Benchmark | LLaMA 3.2 11B-Vision | LLaMA 3.2 90B-Vision |
|---|---|---|
| MMMU (val, CoT) | 50.7 | 60.3 |
| MMMU-Pro Standard | 33.0 | 45.2 |
| MathVista | 51.5 | 57.3 |
| ChartQA (CoT) | 83.4 | 85.5 |
| AI2 Diagram (test) | 91.1 | 92.3 |
| DocVQA (test) | 88.4 | 90.1 |
| VQAv2 (test) | 75.2 | 78.1 |
Meta reported that the 11B-Vision model outperformed Claude 3 Haiku and was competitive with GPT-4o mini on image recognition and visual understanding tasks [10].
The LLaMA 3.2 1B and 3B models were designed for on-device deployment on mobile phones and edge hardware. Despite their small size, they support the full 128K token context window and achieve competitive performance on summarization, instruction following, and text rewriting tasks [10].
The following table compares the LLaMA 3.2 lightweight models against similarly sized competitors [10].
| Benchmark | LLaMA 3.2 1B | LLaMA 3.2 3B | Gemma 2 2.6B | Phi 3.5-mini |
|---|---|---|---|---|
| MMLU (5-shot) | 49.3 | 63.4 | 57.8 | 69.0 |
| IFEval (Instruction Following) | 59.5 | 77.4 | 61.9 | 59.2 |
| ARC-Challenge | 59.4 | 78.6 | 76.7 | 87.4 |
| Tool Use (BFCL V2) | 25.7 | 67.0 | -- | -- |
| NIH/Multi-Needle | 75.0 | 84.7 | -- | -- |
The 3B model outperformed Gemma 2 2.6B and Phi 3.5-mini on instruction following, summarization, and tool use tasks, while the 1B model was competitive with Gemma on general knowledge benchmarks [10]. Meta worked with Qualcomm and MediaTek, the two largest mobile system-on-a-chip companies, to optimize these models for mobile processors. The models were also enabled for Arm processors, which provide the foundational compute layer for 99% of mobile devices [10].
Released on December 6, 2024, LLaMA 3.3 70B is an efficiency-optimized model that delivers performance comparable to the much larger LLaMA 3.1 405B at a fraction of the computational cost [7].
The model retains the same architecture as LLaMA 3.1 70B (80 layers, 8,192 model dimension, 64 attention heads) but incorporates advances in post-training techniques. Meta applied expanded supervised fine-tuning with over 25 million synthetically generated examples, along with improved rejection sampling and direct preference optimization rounds, which yielded substantial gains in reasoning, mathematics, instruction following, and tool use [7].
The following table compares LLaMA 3.3 70B against its predecessor and several leading models [7][12].
| Benchmark | LLaMA 3.1 70B | LLaMA 3.3 70B | LLaMA 3.1 405B | GPT-4o | Gemini 1.5 Pro |
|---|---|---|---|---|---|
| MMLU (0-shot, CoT) | 86.0 | 86.0 | 88.6 | 87.5 | 87.1 |
| MMLU-Pro (5-shot, CoT) | 66.4 | 68.9 | -- | 73.8 | 76.1 |
| IFEval (Instruction Following) | 87.5 | 92.1 | 88.6 | 84.6 | 81.9 |
| HumanEval (0-shot) | 80.5 | 88.4 | 89.0 | 86.0 | -- |
| MATH (0-shot, CoT) | 68.0 | 77.0 | 73.8 | 76.9 | 82.9 |
| GPQA Diamond (0-shot, CoT) | 48.0 | 50.5 | 50.7 | -- | 53.5 |
| Multilingual MGSM (0-shot) | 86.9 | 91.1 | -- | 90.6 | -- |
| NIH/Multi-Needle (Long Context) | 97.5 | 97.5 | 98.1 | -- | 94.7 |
| Tool Use (BFCL v2, 0-shot) | 77.5 | 77.3 | -- | -- | 80.3 |
On the six benchmarks where both models report scores, LLaMA 3.3 70B remained within a few points of LLaMA 3.1 405B and achieved higher scores on two of them (IFEval and MATH). The model outperformed Google Gemini 1.5 Pro, OpenAI GPT-4o, and Amazon Nova Pro on several benchmarks [7].
With an output speed of approximately 82.9 tokens per second (median across providers) and generation costs roughly five times lower than the 405B model, LLaMA 3.3 70B quickly became one of the most cost-effective open-weight models available. Independent benchmarks measured inference speeds reaching 276 tokens per second on Groq hardware. At typical API pricing of $0.10 per million input tokens and $0.40 per million output tokens, the model was approximately 25 times cheaper than GPT-4o for equivalent workloads [12].
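The 25x figure follows from a blended-cost comparison. In the sketch below, the Llama 3.3 prices come from the text, while the GPT-4o list price of $2.50 per million input tokens and $10.00 per million output tokens (late 2024) is an assumption introduced for illustration:

```python
def cost_per_million(input_price, output_price, input_share=0.5):
    """Blended per-million-token cost for a workload with the given
    input/output token split (an even split is assumed here)."""
    return input_share * input_price + (1 - input_share) * output_price

llama = cost_per_million(0.10, 0.40)   # pricing from the text
gpt4o = cost_per_million(2.50, 10.00)  # assumed GPT-4o list price
assert round(gpt4o / llama) == 25      # both input and output are 25x cheaper
```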
Alongside the LLaMA 3 models, Meta developed and released Llama Guard, a safety classification system designed to moderate inputs and outputs of language model applications [13].
| Version | Base Model | Release | Key Capabilities |
|---|---|---|---|
| Llama Guard (original) | LLaMA 2 7B | December 2023 | Text input/output classification |
| Llama Guard 2 | LLaMA 3 8B | April 2024 | Updated taxonomy, improved accuracy |
| Llama Guard 3 (8B) | LLaMA 3.1 8B | July 2024 | 8 languages, tool call safety, MLCommons taxonomy |
| Llama Guard 3 Vision (11B) | LLaMA 3.2 11B | September 2024 | Multimodal (text + image) moderation |
| Llama Guard 4 (12B) | LLaMA 4 Scout | April 2025 | Unified text + vision safety, pruned architecture |
Llama Guard classifies both user prompts and model responses as safe or unsafe across a predefined set of hazard categories aligned with the MLCommons standardized hazards taxonomy. It can be deployed as a filter layer in front of or behind any language model, not just Meta's own [13].
Introduced alongside LLaMA 3.1 in July 2024, Llama Stack is a developer framework that standardizes the building blocks for constructing AI applications on top of LLaMA models [14].
Llama Stack provides a unified API layer covering inference, retrieval-augmented generation (RAG), AI agents, tool use, safety moderation, and evaluation. It supports a plugin architecture that allows different backend implementations for local development, on-premises servers, cloud environments, and mobile devices. SDKs are available for Python, TypeScript, iOS, and Android [14].
The ecosystem launched with over 25 integration partners, including AWS, NVIDIA, Databricks, Groq, Dell, Microsoft Azure, Google Cloud, IBM, Intel, Oracle Cloud, AMD, and Snowflake.
The LLaMA 3 family has had a substantial impact on the open-source AI landscape. The models have been downloaded hundreds of millions of times from Hugging Face and other platforms, and they form the foundation for thousands of community fine-tunes and specialized models.
Several factors contributed to this adoption. First, the 8B and 70B models hit practical sweet spots for cost and capability, making them suitable for a wide range of applications from chatbots to code assistants. Second, Meta's partnerships with cloud providers (AWS, Azure, Google Cloud) and hardware vendors (NVIDIA, Qualcomm, MediaTek) ensured that the models were immediately deployable across diverse infrastructure. Third, the release of the 405B model demonstrated that open-weight models could compete with the best proprietary systems on standard benchmarks, lending credibility to the broader open-source AI movement [9].
Meta released torchtune alongside the LLaMA 3 family, a PyTorch-native library designed for fine-tuning and experimenting with LLMs. The library provides memory-efficient training recipes and integrates with platforms such as Hugging Face, Weights & Biases, and EleutherAI's evaluation harness, making it straightforward for researchers and developers to adapt LLaMA 3 models to custom tasks and domains [2].
The competitive pressure from LLaMA 3 is credited with influencing other organizations to open their models or adopt more permissive licensing. Even OpenAI CEO Sam Altman acknowledged in late 2024 that the company "may need to pursue a more rigorous open source strategy" in response to the rise of open models [9].
LLaMA 3 models are released under the Llama 3 Community License, which Meta describes as "open" but which the Open Source Initiative has determined does not meet the formal Open Source Definition [15].
Key restrictions in the license include:

- Licensees whose products or services exceeded 700 million monthly active users as of the release date must obtain a separate license from Meta before using the models.
- Derivative models must include "Llama" at the beginning of their names, and distributors must prominently display "Built with Llama" attribution.
- All use must comply with Meta's Acceptable Use Policy, which prohibits a range of harmful applications.

The license is governed under California law. Despite these restrictions, the Llama Community License is considerably more permissive than fully proprietary licenses and has enabled widespread commercial and research adoption.
The LLaMA 3 family's release reshaped the competitive landscape of the AI industry in 2024. Before LLaMA 3, the prevailing view in much of the industry was that only closed-model companies with proprietary data advantages could produce frontier-quality systems. The LLaMA 3.1 405B result, matching or exceeding GPT-4 Turbo on multiple benchmarks, challenged that assumption directly.
The influence extended beyond direct adoption. Google accelerated the release cadence of its Gemma open-weight models, Mistral AI expanded its own open-weight offerings, and several Chinese AI labs (including Alibaba with Qwen and 01.AI with Yi) released increasingly competitive open-weight alternatives, creating a feedback loop that accelerated the entire open-weight ecosystem.
The LLaMA 3 family also proved that overtraining smaller models well beyond the Chinchilla-optimal compute budget could produce highly capable models at accessible sizes. The finding that an 8B model trained on 15 trillion tokens could match or exceed the performance of 70B models trained on only 2 trillion tokens influenced training strategies across the industry and contributed to the trend of prioritizing data quality and quantity over raw parameter count [3].
On April 5, 2025, Meta released the first models in the LLaMA 4 generation, marking a significant architectural departure from LLaMA 3 [16].
LLaMA 4 models use a mixture-of-experts (MoE) architecture, replacing the dense transformer design of LLaMA 3. In MoE models, each layer contains multiple expert subnetworks, and a gating function routes each token to only a subset of experts, allowing the model to maintain the knowledge capacity of a much larger model while keeping inference costs proportional to only the active parameters.
LLaMA 4 also introduces native multimodality through early fusion, meaning that text and image inputs are processed together from the earliest layers of the model rather than being handled by separate encoders that are later combined [16].
| Model | Total Parameters | Active Parameters | Experts | Context Length |
|---|---|---|---|---|
| LLaMA 4 Scout | ~109B | 17B | 16 | 10,000,000 |
| LLaMA 4 Maverick | ~400B | 17B | 128 | 1,000,000 |
| LLaMA 4 Behemoth (unreleased) | ~2T | -- | -- | -- |
LLaMA 4 Scout uses 16 experts per MoE layer with 17 billion active parameters per token out of approximately 109 billion total. Its most striking feature is an industry-leading context window of 10 million tokens, vastly exceeding any previous open model [16].
LLaMA 4 Maverick scales up to 128 routed experts (plus a shared expert) per MoE layer, with MoE and dense layers alternating so that experts are applied in half of the layers. Each token is routed to the shared expert and one of the 128 routed experts. Maverick supports a context window of 1 million tokens and was reported to outperform GPT-4o and Gemini 2.0 Flash across a broad range of benchmarks [16].
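The routing scheme described for Maverick, one always-on shared expert plus one gate-selected routed expert per token, can be sketched at toy scale. The expert count and expert functions below are illustrative stand-ins:

```python
import numpy as np

def moe_layer(x, router_W, routed_experts, shared_expert):
    """Top-1 MoE layer in the style described for LLaMA 4 Maverick: every
    token passes through the shared expert plus the single routed expert
    selected by the gate, so only a fraction of parameters is active."""
    logits = x @ router_W                          # (n_tokens, n_experts)
    choice = logits.argmax(axis=-1)                # top-1 routing
    gate = np.exp(logits - logits.max(axis=-1, keepdims=True))
    gate /= gate.sum(axis=-1, keepdims=True)
    out = shared_expert(x)                         # every token, always
    for e, expert in enumerate(routed_experts):
        sel = choice == e
        if sel.any():                              # only active experts run
            out[sel] += gate[sel, e:e + 1] * expert(x[sel])
    return out

rng = np.random.default_rng(2)
n_experts, d = 4, 8                                # tiny stand-in for 128 experts
experts = [lambda h, W=rng.standard_normal((d, d)): h @ W
           for _ in range(n_experts)]
shared = lambda h: h.copy()                        # identity shared expert
y = moe_layer(rng.standard_normal((32, d)),
              rng.standard_normal((d, n_experts)), experts, shared)
assert y.shape == (32, d)
```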
LLaMA 4 Behemoth is a forthcoming model with approximately 2 trillion total parameters that Meta has teased but not yet released as of early 2026.
Meta published a detailed technical report titled "The Llama 3 Herd of Models" (arXiv:2407.21783), which provides comprehensive documentation of the training methodology, scaling experiments, architectural decisions, and evaluation results for the LLaMA 3 and 3.1 families [3]. The paper is 92 pages long and covers topics including data curation, pretraining scaling laws, long-context extension, multilingual training, post-training alignment, safety evaluations, and inference optimization. It has become one of the most widely cited AI papers of 2024.
As of March 2026, the LLaMA model family remains one of the two dominant open-weight model ecosystems in the AI industry, alongside Mistral AI's model family. The LLaMA 3.3 70B and LLaMA 4 Scout and Maverick models are widely deployed across cloud platforms, enterprise applications, and research institutions. The Llama Stack developer framework continues to expand with new integrations and tooling.
Meta has committed to continuing its open-weight release strategy, with LLaMA 4 Behemoth expected to be the next major release. The company's approach of releasing frontier-scale models openly has reshaped industry dynamics and established a viable alternative to the closed-model paradigm championed by OpenAI and others.