LLaMA 4 (Large Language Model Meta AI 4) is a family of natively multimodal large language models developed by Meta and released on April 5, 2025. It is the first generation of the Llama series to adopt a mixture of experts (MoE) architecture and the first designed from the ground up to accept both text and images as native inputs. The initial release comprised two models, Llama 4 Scout and Llama 4 Maverick, with a third and much larger model, Llama 4 Behemoth, announced but still in training at the time of release. Scout activates 17 billion parameters per token from a 109 billion total parameter pool across 16 experts, while Maverick activates 17 billion parameters from approximately 400 billion total parameters spread across 128 experts. Llama 4 Scout supports a context window of up to 10 million tokens, the longest of any openly available model at launch [1].
The release was marred by controversy. Within days, the AI community raised concerns that Meta had used a specially tuned "experimental" version of Maverick to achieve high scores on the Chatbot Arena leaderboard, a version that differed from the publicly released model. The incident, later confirmed by departing Meta AI chief scientist Yann LeCun as benchmark manipulation, led to significant reputational damage and internal upheaval at Meta [2][3].
Llama 4 was developed against the backdrop of rapidly escalating competition in the AI industry during late 2024 and early 2025. OpenAI was preparing GPT-5, Google had released Gemini 2.0 and was working on Gemini 2.5, and Anthropic had launched Claude 3.7 Sonnet with extended thinking capabilities. Meanwhile, DeepSeek had shocked the industry with its V3 (December 2024) and R1 (January 2025) models, demonstrating that competitive frontier models could be built at a fraction of the cost assumed by Western labs.
Meta's previous Llama 3 family, released in stages throughout 2024, had established the company as the leading provider of open-weight language models. Llama 3.1 405B, released in July 2024, was the largest openly available dense model and performed competitively with GPT-4 on many benchmarks. However, Llama 3 remained a text-only, dense-architecture model family, and the industry had begun moving toward multimodal and MoE designs.
Meta CEO Mark Zuckerberg reportedly set ambitious targets for Llama 4, wanting the new family to match or exceed frontier closed models while continuing Meta's open-weight strategy [3].
Llama 4 was announced with three model variants, two of which were released at launch:
| Model | Total Parameters | Active Parameters | Experts | Architecture | Max Context (Instruct) | Status at Launch |
|---|---|---|---|---|---|---|
| Llama 4 Scout | 109B | 17B | 16 | MoE (all layers) | 10M tokens | Released |
| Llama 4 Maverick | ~400B | 17B | 128 (+ 1 shared) | Alternating dense/MoE | 1M tokens | Released |
| Llama 4 Behemoth | ~2T | 288B | 16 | MoE | Not disclosed | Training (not released) |
Scout is the smaller of the two released models, designed for efficiency and deployability. With 109 billion total parameters and 16 experts, it uses a full MoE architecture where every transformer layer is a mixture-of-experts layer. Only 17 billion parameters are active per token, meaning the model's inference cost is comparable to a 17B dense model while drawing on a much larger knowledge base.
Scout's most distinctive feature is its 10-million-token context window in the Instruct variant, achieved through a combination of the iRoPE architecture and inference-time temperature scaling of attention. The base model was pretrained with a 256K token context, then extended during fine-tuning. This context length enables processing of entire codebases, lengthy legal documents, or full books within a single prompt. On the Needle-in-a-Haystack (NIAH) evaluation, Scout achieves perfect retrieval across its full 10M token context [1].
Scout is designed to fit on a single NVIDIA H100 GPU when quantized to INT4, making it accessible to a broad range of developers and organizations.
Maverick is the flagship released model, using a larger and more complex architecture. It has approximately 400 billion total parameters with 128 routed experts plus one shared expert. Unlike Scout, Maverick uses an alternating architecture where dense layers and MoE layers alternate in a 1:1 ratio; experts are applied in half of the layers, while the other half are standard dense transformer layers. Each token activates the shared expert plus exactly one of the 128 routed experts in MoE layers [4].
The Instruct variant supports a context window of up to 1 million tokens. Maverick was also co-distilled from the larger Behemoth model during training, using a novel loss function that dynamically weights the student and teacher logits. This knowledge distillation from a more capable teacher model is one of the reasons Maverick achieves performance that exceeds what might be expected from a model with only 17 billion active parameters [1].
Behemoth is the largest model in the Llama 4 family, with nearly 2 trillion total parameters, 288 billion active parameters, and 16 experts. At the time of the April 2025 announcement, Behemoth was still in training and was not released. Meta described it as a "teacher model" whose primary function is to generate high-quality synthetic data and provide the knowledge base for distilling the smaller Scout and Maverick models [1].
Meta released preliminary benchmark results showing Behemoth outperforming GPT-4.5 and Claude 3.7 Sonnet on several STEM benchmarks, including a score of 92.4 on MATH-500. However, given the later controversy around benchmark reporting, these numbers were received with some skepticism by the research community [5].
As of early 2026, Behemoth has still not been publicly released, and speculation persists that the full training run encountered difficulties or that Meta shifted priorities following the launch controversy.
Llama 4 is the first model in the Llama series to use a mixture of experts architecture. In a standard dense transformer, every parameter is involved in processing every token. In MoE, a routing network selects a small subset of specialist sub-networks (experts) for each token, dramatically reducing computational cost while maintaining a large total parameter count.
Scout uses a straightforward MoE design where all transformer layers contain expert routing with 16 experts. Maverick takes a different approach with its alternating dense/MoE design: half the layers are standard dense layers, and the other half are MoE layers with 128 routed experts plus one shared expert. The shared expert processes every token, ensuring a baseline of common knowledge is always applied, while the selected routed expert handles more specialized processing [4].
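The routing scheme described above can be sketched in a few lines. The following toy NumPy example implements a single Maverick-style MoE layer (an always-on shared expert plus top-1 routing over the routed experts); the dimensions, initialization, and gating details are illustrative assumptions, not Meta's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
D_MODEL, D_FF, N_EXPERTS = 8, 16, 4  # toy sizes; Maverick uses 128 routed experts

def make_expert():
    # each expert is a small two-layer MLP
    return (rng.standard_normal((D_MODEL, D_FF)) * 0.1,
            rng.standard_normal((D_FF, D_MODEL)) * 0.1)

def run_expert(x, weights):
    w1, w2 = weights
    return np.maximum(x @ w1, 0.0) @ w2  # ReLU MLP

shared_expert = make_expert()
routed_experts = [make_expert() for _ in range(N_EXPERTS)]
router_w = rng.standard_normal((D_MODEL, N_EXPERTS)) * 0.1

def moe_layer(tokens):
    """Shared expert always runs; the router picks exactly one routed expert per token."""
    logits = tokens @ router_w
    choice = logits.argmax(axis=-1)  # top-1 routing
    gate = np.exp(logits) / np.exp(logits).sum(-1, keepdims=True)  # softmax gate
    out = run_expert(tokens, shared_expert)  # shared-expert path, applied to every token
    for i in range(len(tokens)):
        e = choice[i]
        out[i] += gate[i, e] * run_expert(tokens[i], routed_experts[e])
    return out

x = rng.standard_normal((5, D_MODEL))  # 5 toy tokens
y = moe_layer(x)
print(y.shape)  # (5, 8)
```

Only one routed expert's weights are touched per token, which is why the per-token compute stays near the 17B-active level even though the total parameter pool is far larger.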
This architectural choice gives Maverick a sparsity ratio of 64 (128 experts with top-1 routing in half the layers), which is unusually high compared to other MoE models like Mixtral (sparsity of 4) or DeepSeek-V3 (sparsity of 32) [6].
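The sparsity figures cited above follow from a simple ratio: routed experts divided by experts activated per token, scaled by the fraction of layers that are MoE. A quick check (the expert counts are the publicly reported ones; the helper function itself is just for illustration):

```python
def sparsity(n_experts, top_k, moe_layer_fraction=1.0):
    # effective ratio of routed parameters held to routed parameters used per token
    return (n_experts / top_k) * moe_layer_fraction

mixtral  = sparsity(8,   top_k=2)                          # 8 experts, top-2 routing
deepseek = sparsity(256, top_k=8)                          # 256 routed experts, 8 active
maverick = sparsity(128, top_k=1, moe_layer_fraction=0.5)  # top-1, MoE in half the layers
print(mixtral, deepseek, maverick)  # 4.0 32.0 64.0
```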
One of Llama 4's most significant architectural innovations is its native multimodal design through early fusion. Rather than processing text and images through separate encoders and combining them at a late stage (as in models like LLaVA or previous multimodal systems), Llama 4 integrates visual information at the earliest stage of processing.
The architecture uses an enhanced MetaCLIP-based vision encoder to convert images into visual tokens. These visual tokens are then immediately concatenated with text tokens into a single unified sequence before any deep transformer processing begins. This early fusion approach means that text and image representations can interact through self-attention from the very first layer, allowing the model to develop richer cross-modal understanding than late-fusion approaches [1].
The vision encoder processes images at high resolution and produces a variable number of visual tokens depending on the image's content and resolution. This native multimodal capability means Llama 4 can handle tasks like visual question answering, image captioning, document understanding, and chart interpretation without requiring any additional adapter modules or post-hoc integration.
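In code, early fusion amounts to concatenating the two token streams into one sequence before the transformer stack. A minimal NumPy sketch, with hypothetical stand-ins for the tokenizer and vision encoder (a real system would use a MetaCLIP-style encoder projecting into the same embedding space):

```python
import numpy as np

rng = np.random.default_rng(1)
D_MODEL = 16  # toy embedding width

def embed_text(token_ids):
    # hypothetical stand-in for a real embedding table over the vocabulary
    table = rng.standard_normal((1000, D_MODEL))
    return table[token_ids]

def encode_image(image, n_patches=4):
    # hypothetical stand-in for a vision encoder: one embedding per image patch
    return rng.standard_normal((n_patches, D_MODEL))

text = embed_text(np.array([5, 42, 7]))   # 3 text tokens
vision = encode_image(None, n_patches=4)  # 4 visual tokens

# Early fusion: one unified sequence enters the transformer stack, so
# self-attention mixes modalities starting at the very first layer.
fused = np.concatenate([vision, text], axis=0)
print(fused.shape)  # (7, 16)
```

The key contrast with late fusion is that nothing downstream distinguishes visual from text positions; both are ordinary tokens in one sequence.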
Llama 4 introduces the iRoPE (interleaved Rotary Position Embedding) architecture, a modification of the standard RoPE positional encoding used in previous Llama models. The iRoPE design alternates between two types of attention layers:
- RoPE layers, which apply rotary position embeddings and focus on local context;
- NoPE layers, which apply no explicit positional encoding and attend globally across the full sequence.
By interleaving these two layer types, the model can handle extremely long sequences more effectively. The RoPE layers provide strong local context modeling, while the NoPE layers allow the model to attend to distant tokens without the degradation that typically occurs when standard positional encodings are extrapolated far beyond training lengths. This design, combined with inference-time temperature scaling of the attention logits, enables Scout's 10-million-token context window [7].
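A toy NumPy sketch of the interleaving idea, assuming standard RoPE rotation and plain softmax attention; the chunked local-attention details and the exact temperature schedule are not public, so temperature appears here as a plain parameter:

```python
import numpy as np

def softmax(x):
    x = x - x.max(-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(-1, keepdims=True)

def rope(x, base=10000.0):
    """Rotate query/key feature pairs by position-dependent angles (standard RoPE)."""
    seq, dim = x.shape
    half = dim // 2
    freqs = base ** (-np.arange(half) / half)
    angles = np.outer(np.arange(seq), freqs)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:]
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)

def attention(q, k, v, use_rope, temperature=1.0):
    if use_rope:  # "local" layers: rotary positional encoding on q and k
        q, k = rope(q), rope(k)
    # NoPE layers skip positional encoding entirely; dividing the logits by a
    # length-dependent temperature at inference is the scaling trick described above.
    scores = (q @ k.T) / (np.sqrt(q.shape[-1]) * temperature)
    return softmax(scores) @ v

rng = np.random.default_rng(2)
q = k = v = rng.standard_normal((6, 8))
# interleave: even layers use RoPE, odd layers use NoPE
layers = [attention(q, k, v, use_rope=(i % 2 == 0)) for i in range(4)]
print(layers[0].shape)  # (6, 8)
```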
Llama 4 models were pretrained on over 30 trillion tokens, more than double the approximately 15 trillion tokens used for Llama 3. The training mixture includes diverse text data in over 200 languages, as well as image and video data for multimodal training. The base models were pretrained with a context length of 256,000 tokens [1].
The pretraining data composition was not disclosed in detail, but Meta indicated it included web text, code, scientific literature, books, and multilingual content. The image and video training data covered a wide range of visual domains to support the model's native multimodal capabilities.
A notable aspect of Llama 4's training is the use of knowledge distillation from the larger Behemoth model to the smaller Maverick and Scout models. Meta developed a novel co-distillation approach where the smaller models learn not just from the training data but also from Behemoth's output distributions. The loss function dynamically adjusts the weighting between the standard language modeling loss (from the training data) and the distillation loss (from Behemoth's predictions), allowing the student models to benefit from the teacher's broader knowledge without being overly constrained by it [1].
This distillation process is one reason why Maverick, despite having only 17 billion active parameters, can compete with much larger dense models on several benchmarks.
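The general shape of such a co-distillation objective can be sketched as follows. Meta has not published the exact loss, so the dynamic weighting is replaced here by a hand-set `alpha` knob, an assumption for illustration only:

```python
import numpy as np

def softmax(x):
    x = x - x.max(-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(-1, keepdims=True)

def cross_entropy(logits, target_ids):
    # standard hard-label language-modeling loss
    probs = softmax(logits)
    return -np.log(probs[np.arange(len(target_ids)), target_ids]).mean()

def soft_cross_entropy(student_logits, teacher_logits):
    # distillation term against the teacher's full output distribution
    t = softmax(teacher_logits)
    s = softmax(student_logits)
    return -(t * np.log(s)).sum(-1).mean()

def co_distill_loss(student_logits, teacher_logits, target_ids, alpha):
    """Weighted mix of hard LM loss and soft teacher loss; in Llama 4 the
    weighting is adjusted dynamically during training rather than fixed."""
    lm = cross_entropy(student_logits, target_ids)
    kd = soft_cross_entropy(student_logits, teacher_logits)
    return (1 - alpha) * lm + alpha * kd

rng = np.random.default_rng(3)
student = rng.standard_normal((4, 10))  # 4 positions, vocabulary of 10
teacher = rng.standard_normal((4, 10))  # stand-in for Behemoth's logits
targets = np.array([1, 3, 5, 7])
loss = co_distill_loss(student, teacher, targets, alpha=0.5)
print(loss > 0)  # True
```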
The Instruct variants of Scout and Maverick underwent extensive post-training, including supervised fine-tuning (SFT) and reinforcement learning from human feedback (RLHF). The post-training process extended the context length from the 256K base pretraining context to 10M tokens for Scout and 1M tokens for Maverick. Meta also applied safety training and red-teaming to address harmful outputs [1].
Meta reported benchmark results for Scout and Maverick across a range of evaluations, comparing them primarily against other models in similar compute classes.
| Benchmark | Llama 4 Scout | Llama 3.1 8B | Gemma 3 12B | Gemini 2.0 Flash Lite |
|---|---|---|---|---|
| MMLU (0-shot, CoT) | 79.6 | 73.0 | 78.5 | 76.1 |
| GPQA Diamond | 57.2 | 32.8 | 42.4 | 50.4 |
| LiveCodeBench (10/01/2024–02/01/2025) | 32.8 | 13.0 | 24.4 | 27.2 |
| MMMU (0-shot, CoT) | 69.4 | N/A (text only) | 64.8 | 58.4 |
| MathVista | 70.7 | N/A (text only) | 68.0 | 61.5 |
Scout, with its 17B active parameters, consistently outperformed models in the sub-20B class across both text and multimodal benchmarks. Its GPQA Diamond score of 57.2 represented a particularly strong result for a model of its size, and its multimodal scores on MMMU (69.4) and MathVista (70.7) were competitive with much larger models [1].
| Benchmark | Llama 4 Maverick | GPT-4o | Gemini 2.0 Flash | Claude 3.7 Sonnet | Llama 3.1 405B |
|---|---|---|---|---|---|
| MMLU (0-shot, CoT) | 85.5 | 85.7 | 84.4 | 84.2 | 85.2 |
| GPQA Diamond | 69.8 | 53.6 | 64.6 | 68.0 | 49.0 |
| LiveCodeBench (10/01/2024–02/01/2025) | 43.4 | 32.3 | 34.5 | 38.5 | 27.7 |
| MMMU (0-shot, CoT) | 73.4 | 69.1 | 71.7 | 68.6 | N/A |
| MathVista | 73.7 | 63.8 | 73.3 | 70.4 | N/A |
| MATH-500 | 88.1 | 74.6 | 82.3 | 89.9 | 73.8 |
Maverick's results were strong across the board. On MMLU, it scored 85.5, essentially matching GPT-4o. Its GPQA Diamond score of 69.8 exceeded GPT-4o (53.6) by a wide margin. On multimodal benchmarks, Maverick scored 73.4 on MMMU and 73.7 on MathVista, outperforming GPT-4o on both. Maverick's LiveCodeBench score of 43.4 surpassed all listed competitors, including Claude 3.7 Sonnet (38.5) and GPT-4o (32.3) [1][8].
On the MATH-500 benchmark, Maverick scored 88.1, slightly below Claude 3.7 Sonnet's 89.9 but above GPT-4o (74.6) and Gemini 2.0 Flash (82.3).
Scout's 10-million-token context was evaluated on the Needle-in-a-Haystack test, where the model must retrieve a specific piece of information embedded at a random position within a very long document. Scout achieved perfect retrieval across the full 10M context [1].
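A minimal harness for a NIAH-style test can be sketched as below. The filler text, needle phrasing, and depth grid are arbitrary choices, and the stand-in model simply searches the prompt; a real evaluation would call an actual LLM endpoint at each depth.

```python
def build_haystack(filler, needle, depth_fraction, total_sentences=1000):
    """Insert the needle sentence at a given relative depth in filler text."""
    sentences = [filler] * total_sentences
    pos = int(depth_fraction * total_sentences)
    sentences.insert(pos, needle)
    return " ".join(sentences)

def run_niah(model_fn, secret="7421", depths=(0.0, 0.25, 0.5, 0.75, 1.0)):
    needle = f"The magic number is {secret}."
    hits = 0
    for d in depths:
        prompt = build_haystack("The sky is blue.", needle, d)
        prompt += "\nWhat is the magic number?"
        if secret in model_fn(prompt):
            hits += 1
    return hits / len(depths)  # retrieval accuracy across insertion depths

# Stand-in "model" that just searches the prompt string.
def fake_model(prompt):
    return "7421" if "7421" in prompt else "unknown"

print(run_niah(fake_model))  # 1.0
```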
On the MTOB (Machine Translation of Books) benchmark, which requires processing entire books for translation, both Scout and Maverick maintained coherence and accuracy across full-length books, while competitor models with 128K context windows could not process the complete texts [8].
| Feature | Llama 3 / 3.1 | Llama 4 |
|---|---|---|
| Architecture | Dense transformer | Mixture of Experts |
| Largest released model | 405B (dense) | Maverick (~400B total, 17B active) |
| Modalities | Text only (3.0/3.1); text+vision (3.2) | Native text + image (early fusion) |
| Training tokens | ~15 trillion | 30+ trillion |
| Max context (Instruct) | 128K tokens | 10M tokens (Scout) |
| MoE | No | Yes |
| Knowledge distillation | No | Yes (from Behemoth) |
| iRoPE | No | Yes |
| MMLU (best) | 85.2 (405B) | 85.5 (Maverick, 17B active) |
| Inference efficiency | Proportional to model size | 17B active params (Scout and Maverick) |
The most striking difference is efficiency. Llama 3.1 405B required activating all 405 billion parameters per token, while Llama 4 Maverick achieves comparable MMLU performance with only 17 billion active parameters, roughly 24 times fewer. This translates directly to lower inference costs and latency. Meta stated that Maverick achieves better results than GPT-4o at approximately one-ninth the cost per token [8].
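The "roughly 24 times" figure is simple arithmetic on active parameter counts, under the common approximation that per-token inference compute scales with active parameters:

```python
# Back-of-envelope: per-token compute scales with active parameters.
dense_active = 405e9  # Llama 3.1 405B: all parameters active per token
moe_active = 17e9     # Llama 4 Maverick: 17B active per token
ratio = dense_active / moe_active
print(round(ratio, 1))  # 23.8
```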
The Llama 4 launch was overshadowed by a controversy involving benchmark manipulation on the LMArena (Chatbot Arena) leaderboard, which escalated over the days and weeks following the April 5 release.
Shortly after launch, users on the LMArena platform noticed that "Llama-4-Maverick-03-26-Experimental" had appeared near the top of the Chatbot Arena leaderboard, ranking second behind only Google's Gemini 2.5 Pro. However, the publicly released version of Maverick did not perform nearly as well in users' own testing. The experimental version produced notably different outputs: verbose responses frequently peppered with emojis, a style seemingly optimized to win user preference votes in the Arena's head-to-head comparison format, rather than to be genuinely useful [2].
AI researchers and developers quickly pointed out that the model submitted to Chatbot Arena was not the same model available for download. An unverified post by someone claiming to be a former Meta employee alleged that Meta leadership had mixed benchmark test sets into the post-training process to inflate scores and meet internal targets. Although Meta denied this specific allegation, the discrepancy between the Arena submission and the public release eroded trust [9].
Independent testing found that the publicly available Maverick underperformed expectations on coding tasks and general-purpose assistance, particularly compared to its impressive paper benchmarks. The gap between reported and observed performance fueled skepticism about the benchmark numbers across the board.
Meta's VP of Generative AI, Ahmad Al-Dahle, initially denied all allegations, stating: "We've also heard claims that we trained on test sets; that's simply not true and we would never do that" [10]. Meta also published a blog post defending the release and attributing some quality issues to bugs in the model deployment.
However, in January 2026, Yann LeCun, Meta's departing chief AI scientist, confirmed the manipulation in an interview with the Financial Times. LeCun stated that "results were fudged a little bit" and that the team "used different models for different benchmarks to give better results." Rather than submitting a single consistent model for all evaluations (the standard practice), the Llama 4 team selected whichever variant of Scout or Maverick performed best on each individual benchmark [3].
The controversy had significant consequences within Meta. According to LeCun, CEO Mark Zuckerberg was "really upset and basically lost confidence in everyone who was involved" in the Llama 4 release. Zuckerberg subsequently "sidelined the entire GenAI organisation," leading to a restructuring of Meta's AI leadership. LeCun himself departed Meta after more than a decade to start a new venture called Advanced Machine Intelligence Labs, and in interviews following his departure, he criticized Meta's new AI leadership as "young and inexperienced" [3][11].
The controversy also prompted LMArena to update its leaderboard policies to reinforce commitments to fair, reproducible evaluations. The incident highlighted a broader problem in the AI industry: the lack of standardized, independently verified benchmarking procedures, which allows model developers to selectively report favorable results [12].
Llama 4 was released under the Llama 4 Community License Agreement, effective April 5, 2025. The license structure is similar to previous Llama licenses but with some notable provisions:
- Commercial and research use of the models and their outputs is permitted, subject to Meta's Acceptable Use Policy.
- Organizations whose products or services exceeded 700 million monthly active users as of the release date must request a separate license from Meta, which Meta may grant or withhold at its discretion.
- Distributors must display a "Built with Llama" attribution, and derivative model names must begin with "Llama".
The Open Source Initiative (OSI) has consistently maintained that the Llama Community License does not qualify as "open source" under its definition, citing the use restrictions, the MAU threshold, and the lack of training data release. The OSI also noted that newer versions of the license exclude certain users in the European Union from using the model under specific conditions, which violates the principle of non-discrimination [14]. Meta and supporters of the Llama license argue that "open weights" is the appropriate term and that the license provides substantially more freedom than fully closed alternatives.
Llama 4 entered a fiercely competitive landscape in 2025:
| Model | Developer | Release | Architecture | Key Strengths |
|---|---|---|---|---|
| Llama 4 Maverick | Meta | April 2025 | MoE (400B total, 17B active) | Multimodal, efficient, open weights |
| GPT-4o | OpenAI | May 2024 | Dense (proprietary) | Strong all-around, voice/vision |
| GPT-4.5 | OpenAI | February 2025 | Dense (proprietary) | Improved reasoning, reduced hallucination |
| Claude 3.7 Sonnet | Anthropic | February 2025 | Dense (proprietary) | Extended thinking, strong coding |
| Gemini 2.0 Flash/Pro | Google | December 2024 | MoE (proprietary) | Speed, multimodal, long context |
| Gemini 2.5 Pro | Google | March 2025 | MoE (proprietary) | Reasoning, 1M context |
| DeepSeek-V3 | DeepSeek | December 2024 | MoE (671B total, 37B active) | Cost-efficient training, open weights |
Maverick's benchmark results placed it competitively with these models on paper, particularly in multimodal tasks and coding. However, the benchmark controversy made direct comparisons difficult to trust. In practice, independent evaluations through 2025 found Maverick to be a capable but not clearly frontier model, with particular strengths in multimodal understanding and long-context tasks, and relative weaknesses in complex multi-step reasoning compared to Claude 3.7 Sonnet and Gemini 2.5 Pro [15].
The efficiency argument remained Maverick's strongest selling point: with only 17 billion active parameters, it could be served at much lower cost than dense models of comparable quality, making it attractive for high-volume commercial deployments where cost per token is a primary concern.
As of early 2026, the Llama 4 family occupies an uncertain position in the AI landscape. Behemoth remains unreleased, Meta's generative AI organization has been restructured in the wake of the launch controversy, and independent evaluations have settled on a view of Maverick as capable but short of the frontier.
The Llama 4 release will likely be remembered as a turning point for Meta's AI strategy: a technically ambitious model family whose impact was diminished by the benchmark manipulation controversy, leading to lasting changes in how AI companies report and verify model performance.