Llama 4 (Large Language Model Meta AI 4) is a family of natively multimodal large language models developed by Meta and released on April 5, 2025. It is the first generation of the Llama series to adopt a mixture of experts (MoE) architecture and the first to support text and images as native input modalities from the ground up. The initial release included two models, Llama 4 Scout and Llama 4 Maverick, with a third and much larger model, Llama 4 Behemoth, announced but still in training at the time of release. Scout features 17 billion active parameters drawn from a 109 billion total parameter pool across 16 experts, while Maverick uses 17 billion active parameters from approximately 400 billion total parameters spread across 128 experts. Llama 4 Scout supports a context window of up to 10 million tokens, the longest of any openly available model at launch [1].
The release was marred by controversy. Within days, the AI community raised concerns that Meta had used a specially tuned "experimental" version of Maverick to achieve high scores on the LMArena (Chatbot Arena) leaderboard, a version that differed from the publicly released model. The incident, later confirmed by departing Meta AI chief scientist Yann LeCun as benchmark manipulation, led to significant reputational damage and internal upheaval at Meta [2][3]. Despite this, Llama 4 became one of the most widely deployed open-weight model families of 2025, powering Meta AI inside WhatsApp, Messenger, Instagram, and Facebook, and reaching same-day availability on the major cloud platforms.
Llama 4 was developed against the backdrop of rapidly escalating competition in the AI industry during late 2024 and early 2025. OpenAI was preparing GPT-5, Google had released Gemini 2.0 and was working on Gemini 2.5, and Anthropic had launched Claude 3.7 Sonnet with extended thinking capabilities. Meanwhile, DeepSeek had shocked the industry with its DeepSeek V3 and R1 models in late 2024 and early 2025, demonstrating that competitive frontier models could be built at a fraction of the cost assumed by Western labs. DeepSeek-V3 in particular, a 671B-total / 37B-active MoE released under a permissive license, put direct pressure on Meta's open-weight strategy and accelerated the company's pivot away from dense architectures.
Meta's previous Llama 3 family, released in stages throughout 2024, had established the company as the leading provider of open-weight language models. Llama 3.1 405B, released in July 2024, was the largest openly available dense model and performed competitively with GPT-4 on many benchmarks. Llama 3.2 added vision-capable 11B and 90B variants in September 2024, along with small text-only 1B and 3B models for on-device use. Llama 3.3 70B, released in December 2024, distilled most of the 405B model's capabilities into a much smaller and cheaper-to-serve dense model. However, even with these incremental updates, Llama 3 remained a text-first, dense-architecture family, and the industry had begun moving aggressively toward multimodal and MoE designs.
Meta CEO Mark Zuckerberg reportedly set ambitious targets for Llama 4, wanting the new family to match or exceed frontier closed models while continuing Meta's open-weight strategy [3]. Internal pressure to match DeepSeek's apparent efficiency advantage and to ship a credible response to GPT-4o and Gemini 2.0 shaped both the architectural choices and, by some accounts, the rushed nature of the launch.
Llama 4 was announced with three model variants, two of which were released at launch:
| Model | Total parameters | Active parameters | Experts | Architecture | Max context (Instruct) | Status at launch |
|---|---|---|---|---|---|---|
| Llama 4 Scout | 109B | 17B | 16 | MoE (all layers) | 10M tokens | Released |
| Llama 4 Maverick | ~400B | 17B | 128 (+ 1 shared) | Alternating dense/MoE | 1M tokens | Released |
| Llama 4 Behemoth | ~2T | 288B | 16 | MoE | Not disclosed | Training (not released) |
Scout is the smaller of the two released models, designed for efficiency and deployability. With 109 billion total parameters and 16 experts, it uses a full MoE architecture where every transformer layer is a mixture-of-experts layer. Only 17 billion parameters are active per token, meaning the model's inference cost is comparable to a 17B dense model while drawing on a much larger knowledge base.
Scout's most distinctive feature is its 10-million-token context window in the Instruct variant, achieved through a combination of the iRoPE architecture and inference-time temperature scaling of the attention logits. The base model was pretrained with a 256K-token context, then extended during fine-tuning. This context length enables processing of entire codebases, lengthy legal documents, or full books within a single prompt. On the Needle-in-a-Haystack (NIAH) evaluation, Scout achieves perfect retrieval across its full 10M token context [1]. Scout is also the only Llama 4 variant that uses non-learnable RMS normalization on its query and key projections in the RoPE layers, a small but important stability trick for very long sequences [4].
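The query/key normalization is simple enough to state exactly. Below is a minimal sketch of the non-learnable variant (no trainable gain, unlike standard RMSNorm), with illustrative tensor shapes:

```python
import torch

def qk_rms_norm(x: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Non-learnable RMS normalization over the head dimension.

    Unlike standard RMSNorm there is no trainable gain, so the scale of
    query/key vectors cannot drift -- a stability aid at long context.
    """
    return x * torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + eps)

# Illustrative shapes: (batch, heads, seq_len, head_dim)
q = torch.randn(1, 8, 1024, 128)
k = torch.randn(1, 8, 1024, 128)
q, k = qk_rms_norm(q), qk_rms_norm(k)  # applied in the RoPE layers only
```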
Scout is designed to fit on a single NVIDIA H100 GPU when quantized to INT4, making it accessible to a broad range of developers and organizations. In bf16, the weights alone occupy roughly 218 GB and require several H100s. Combined with its fully sparse MoE design, this made Scout one of the most single-GPU-friendly frontier-class models available from any major lab as of mid-2025.
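These hardware claims follow from back-of-the-envelope arithmetic on weight storage alone; the sketch below ignores KV cache and activation memory, which real deployments must also budget for:

```python
# Weight-only memory for Llama 4 Scout (109B total parameters), ignoring
# KV cache and activations.
PARAMS = 109e9
H100_GB = 80  # H100 SXM, 80 GB HBM

for fmt, bytes_per_param in [("bf16", 2.0), ("fp8", 1.0), ("int4", 0.5)]:
    gb = PARAMS * bytes_per_param / 1e9
    print(f"{fmt}: {gb:.0f} GB -> {gb / H100_GB:.1f}x H100")
# bf16: 218 GB -> 2.7x H100 (several GPUs)
# fp8:  109 GB -> 1.4x H100 (still over a single 80 GB card)
# int4:  54 GB -> 0.7x H100 (fits one H100 with room for KV cache)
```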
Maverick is the flagship released model, using a larger and more complex architecture. It has approximately 400 billion total parameters with 128 routed experts plus one shared expert. Unlike Scout, Maverick uses an alternating architecture where dense layers and MoE layers alternate in a 1:1 ratio; experts are applied in half of the layers, while the other half are standard dense transformer layers. Each token activates the shared expert plus exactly one of the 128 routed experts in MoE layers, a top-1 routing scheme that pushes Maverick's sparsity unusually high [5].
The Instruct variant supports a context window of up to 1 million tokens. Maverick was also co-distilled from the larger Behemoth model during training, using a novel loss function that dynamically weights the student and teacher logits. This knowledge distillation from a more capable teacher model is one of the reasons Maverick achieves performance that exceeds what might be expected from a model with only 17 billion active parameters [1]. Hugging Face publishes Maverick in both BF16 and FP8 formats, and the model fits on a single NVIDIA H100 DGX node (8 H100s) for production serving [4].
Behemoth is the largest model in the Llama 4 family, with nearly 2 trillion total parameters, 288 billion active parameters, and 16 experts. At the time of the April 2025 announcement, Behemoth was still in training and was not released. Meta described it as a "teacher model" whose primary function is to generate high-quality synthetic data and provide the knowledge base for distilling the smaller Scout and Maverick models [1].
Meta released preliminary benchmark results showing Behemoth outperforming GPT-4.5, Claude 3.7 Sonnet, and Gemini 2.0 Pro on several STEM benchmarks, including a score of 95.0 on MATH-500 and 73.7 on GPQA Diamond [1]. However, given the later controversy around benchmark reporting, these numbers were received with some skepticism by the research community.
Behemoth's release was originally targeted for early summer 2025, then pushed back to the fall, then pushed back again. By mid-2025, multiple outlets reported that Meta had encountered serious training instability with Behemoth: the team had switched the MoE routing method partway through training, which disrupted expert specialization, and the chunked-attention scheme used in Llama 4 created blind spots at chunk boundaries that hurt long-form reasoning at very large scale [6][7]. A New York Times report later said the model had finished training but was being held back due to "poor internal performance," and that after Meta announced its new "superintelligence" lab in mid-2025, teams stopped running new evaluations on Behemoth altogether. As of early 2026, Behemoth has still not been publicly released, and Meta has never issued a formal cancellation.
Llama 4 is the first model in the Llama series to use a mixture of experts architecture. In a standard dense transformer, every parameter is involved in processing every token. In MoE, a routing network selects a small subset of specialist sub-networks (experts) for each token, dramatically reducing computational cost while maintaining a large total parameter count. The technique was popularized in modern LLMs by Mistral's Mixtral 8x7B in late 2023 and pushed to frontier scale by DeepSeek-V3 in late 2024.
Scout uses a straightforward MoE design where all transformer layers contain expert routing with 16 experts. Maverick takes a different approach with its alternating dense/MoE design: half the layers are standard dense layers, and the other half are MoE layers with 128 routed experts plus one shared expert. The shared expert processes every token, ensuring a baseline of common knowledge is always applied, while the routed expert handles more specialized processing [5].
This architectural choice gives Maverick a sparsity ratio of 64 (128 experts with top-1 routing in half the layers), which is unusually high compared to other MoE models like Mixtral 8x7B (sparsity of 4) or DeepSeek-V3 (sparsity of around 32) [8]. The very high sparsity is a deliberate efficiency bet: it lets Maverick keep total capacity competitive with dense 400B models while running inference at the cost of a 17B model. The trade-off is harder load balancing during training and tighter routing margins during inference.
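A minimal sketch of the routing pattern described above, a shared expert plus one top-1 routed expert per token, makes the dataflow concrete. Dimensions, the plain MLP experts, and the softmax-based gate weighting are illustrative simplifications, not Meta's implementation:

```python
import torch
import torch.nn as nn

class SharedPlusTop1MoE(nn.Module):
    """Sketch of a Maverick-style MoE layer: every token passes through a
    shared expert, and a router sends it to exactly one routed expert."""

    def __init__(self, d_model: int = 512, d_ff: int = 2048, n_experts: int = 128):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts, bias=False)
        mlp = lambda: nn.Sequential(
            nn.Linear(d_model, d_ff), nn.SiLU(), nn.Linear(d_ff, d_model)
        )
        self.shared = mlp()
        self.experts = nn.ModuleList(mlp() for _ in range(n_experts))

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (tokens, d_model)
        logits = self.router(x)                        # (tokens, n_experts)
        weight, idx = logits.softmax(-1).max(dim=-1)   # top-1 routing
        out = self.shared(x)                           # shared expert: all tokens
        for e in idx.unique().tolist():                # dispatch per expert
            mask = idx == e
            out[mask] += weight[mask, None] * self.experts[e](x[mask])
        return out

print(SharedPlusTop1MoE()(torch.randn(16, 512)).shape)  # torch.Size([16, 512])
```

Only the selected expert's weights are touched for each token, which is where the inference savings come from: the router and shared expert are small relative to the 128 mostly idle experts.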
One of Llama 4's most significant architectural innovations is its native multimodal design through early fusion. Rather than processing text and images through separate encoders and combining them at a late stage (as in models like LLaVA or previous multimodal systems), Llama 4 integrates visual information at the earliest stage of processing.
The architecture uses an enhanced MetaCLIP-based vision encoder to convert images into visual tokens. Meta retrained the encoder "in conjunction with a frozen Llama model" so that the visual tokens it produces are already aligned with the language model's representational space [1]. These visual tokens are then immediately concatenated with text tokens into a single unified sequence before any deep transformer processing begins. Because the sequence is unified, text and image representations interact through ordinary self-attention from the very first layer, allowing the model to develop richer cross-modal understanding than late-fusion approaches.
The vision encoder processes images at high resolution and produces a variable number of visual tokens depending on the image's content and resolution. The instruction-tuned variants of Scout and Maverick officially support up to 8 input images per prompt, with Meta reporting that internal testing held up well to roughly that limit. This native multimodal capability means Llama 4 can handle tasks like visual question answering, image captioning, document understanding, and chart interpretation without requiring any additional adapter modules or post-hoc integration. It also means the model can interleave images and text freely in a single conversation, which is critical for the assistant use cases that Meta AI runs in WhatsApp and Instagram.
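Conceptually, early fusion amounts to building one token sequence before the first transformer block. The sketch below is schematic: the module stand-ins, vocabulary size, and patch count are illustrative, not Llama 4's actual components:

```python
import torch
import torch.nn as nn

# Schematic early fusion: visual and text tokens share one sequence from
# layer 0 onward. All sizes and modules here are illustrative stand-ins.
d_model = 4096
vision_encoder = nn.Linear(1024, d_model)     # stand-in for the MetaCLIP-based encoder
text_embedding = nn.Embedding(200_000, d_model)

patch_features = torch.randn(1, 144, 1024)    # e.g. 144 image patches
text_ids = torch.randint(0, 200_000, (1, 32)) # 32 text tokens

visual_tokens = vision_encoder(patch_features)  # (1, 144, d_model)
text_tokens = text_embedding(text_ids)          # (1, 32, d_model)

# One unified sequence: self-attention mixes modalities in every layer,
# rather than a late adapter injecting image features into a frozen LM.
fused = torch.cat([visual_tokens, text_tokens], dim=1)  # (1, 176, d_model)
print(fused.shape)
```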
Llama 4 introduces the iRoPE (interleaved Rotary Position Embedding) architecture, a modification of the standard RoPE positional encoding used in previous Llama models. The iRoPE design alternates between two types of attention layers:

- Chunked local attention layers that apply RoPE and attend within fixed-size chunks of the sequence
- Global attention layers that use no explicit positional encoding (NoPE) and can attend across the entire sequence
By interleaving these two layer types, the model can handle extremely long sequences more effectively. The chunked RoPE layers provide strong local context modeling at modest memory cost, while the NoPE layers allow the model to attend to distant tokens without the degradation that typically occurs when standard positional encodings are extrapolated far beyond training lengths. To stop attention probabilities from collapsing toward uniformity at very long context, Meta also applies inference-time temperature scaling of the softmax in the NoPE layers, which is the trick that pushes Scout from a 256K base context to the advertised 10M tokens [4][9].
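Meta's launch blog does not spell out the scaling rule, but open-source Llama 4 inference code applies a slowly growing logarithmic factor to the query vectors in the NoPE layers. The sketch below illustrates the idea; the constants are chosen for illustration rather than taken from a released checkpoint:

```python
import torch

def scale_queries(q: torch.Tensor, positions: torch.Tensor,
                  floor_scale: float = 8192.0, attn_scale: float = 0.1) -> torch.Tensor:
    """Position-dependent query scaling for NoPE layers (sketch).

    Without positional encoding, attention softmaxes flatten toward uniform
    as context grows; multiplying queries by a log-growing factor of absolute
    position sharpens the distribution back at long range.
    """
    scale = torch.log(torch.floor(positions / floor_scale) + 1.0) * attn_scale + 1.0
    return q * scale.view(1, 1, -1, 1)  # q: (batch, heads, seq_len, head_dim)

positions = torch.tensor([0.0, 1e4, 1e6, 1e7])
q = torch.randn(1, 8, 4, 128)
scaled = scale_queries(q, positions)
# scale per position: 1.00, 1.07, 1.48, 1.71 -- grows toward the 10M limit
```

Because the scaling is applied only at inference time, it costs nothing during training and can be tuned after the fact, which is what allowed Meta to advertise a context far beyond the 256K pretraining length.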
Llama 4 models were pretrained on more than 30 trillion tokens, more than double the approximately 15 trillion tokens used for Llama 3, with some Hugging Face documentation citing a figure of up to 40 trillion tokens including multimodal data [4]. The training mixture covers more than 200 languages, including more than 100 with at least 1 billion tokens each, giving Llama 4 roughly 10 times more multilingual data than Llama 3 [1]. Image and video data were included in the pretraining mixture from the start, which is what makes the multimodal capabilities native rather than bolted on. The base models were pretrained with a context length of 256,000 tokens.
The pretraining data composition was not disclosed in detail, but Meta indicated it included web text, code, scientific literature, books, and multilingual content, plus images and short video clips. Pretraining was done in FP8 precision; Meta reported sustaining roughly 390 TFLOPs per GPU during the run, a number that is competitive with the best published numbers for FP8 training on H100s.
For a model family with three very different sizes, picking learning rates and other per-layer hyperparameters separately for each variant would have been prohibitively expensive. Meta instead developed a technique called MetaP that lets the team "reliably set critical model hyperparameters such as per-layer learning rates and initialization scales" once and have them transfer across model sizes and training budgets [1]. The approach is conceptually similar to muP and related width-transfer techniques used by other labs and is part of why Meta was able to ship two production-grade MoE variants with the same active parameter count but very different total capacity.
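MetaP itself is undisclosed, but the muP-style width transfer it resembles can be sketched in a few lines: tune hyperparameters once on a narrow proxy model, then rescale per-layer learning rates as the hidden width grows. This illustrates the general technique, not Meta's recipe:

```python
def transfer_lr(base_lr: float, base_width: int, target_width: int) -> float:
    """muP-style rule of thumb: hidden-layer learning rate scales as 1/width,
    so values tuned on a narrow proxy transfer to wider production models."""
    return base_lr * base_width / target_width

base_lr, base_width = 3e-3, 256   # tuned once on a cheap proxy model
for width in (1024, 4096, 8192):  # reused at production widths
    print(f"width={width}: lr={transfer_lr(base_lr, base_width, width):.2e}")
```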
A notable aspect of Llama 4's training is the use of knowledge distillation from the larger Behemoth model to the smaller Maverick and Scout models. Meta developed a novel co-distillation approach where the smaller models learn not just from the training data but also from Behemoth's output distributions. The loss function dynamically adjusts the weighting between the standard language modeling loss (from the training data) and the distillation loss (from Behemoth's predictions), allowing the student models to benefit from the teacher's broader knowledge without being overly constrained by it [1].
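Meta has not released the dynamic weighting function, but the general shape of such a co-distillation loss is standard; in the sketch below, the schedule for `alpha` is an explicit placeholder assumption:

```python
import torch
import torch.nn.functional as F

def co_distillation_loss(student_logits, teacher_logits, labels,
                         alpha: float, temperature: float = 2.0):
    """alpha weights the data loss vs. the distillation loss; Meta's dynamic
    schedule for this weighting is undisclosed, so alpha is a placeholder."""
    # Standard next-token cross-entropy against the training data.
    ce = F.cross_entropy(student_logits.flatten(0, 1), labels.flatten())
    # KL divergence toward the teacher's temperature-softened distribution.
    kl = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.log_softmax(teacher_logits / temperature, dim=-1),
        log_target=True, reduction="batchmean",
    ) * temperature**2
    return alpha * ce + (1.0 - alpha) * kl

student = torch.randn(2, 8, 1000, requires_grad=True)  # (batch, seq, vocab)
teacher = torch.randn(2, 8, 1000)                      # Behemoth's logits
labels = torch.randint(0, 1000, (2, 8))
print(co_distillation_loss(student, teacher, labels, alpha=0.5))
```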
This distillation process is one reason why Maverick, despite having only 17 billion active parameters, can compete with much larger dense models on several benchmarks. It also makes Behemoth's continued absence from the public release more notable: the teacher model that shaped Maverick's behavior is something the public never gets to interact with directly.
The Instruct variants of Scout and Maverick underwent extensive post-training. Meta described its post-training pipeline as "lightweight supervised fine-tuning (SFT), online reinforcement learning (RL), and lightweight direct preference optimization (DPO)" [1]. Crucially, Meta also reported aggressive data filtering before SFT, removing more than 50 percent of the SFT data tagged as "easy" for Maverick and more than 95 percent for Behemoth. The intuition is that easy examples drag the post-trained model toward shallow patterns, so concentrating on harder examples during SFT preserves the harder reasoning signal coming from RL.
During the online RL phase, Meta alternated between training the model and using it to filter and re-rank candidate prompts, a feedback loop that progressively raises the difficulty of the training distribution. The post-training process also extended the context length from the 256K base pretraining context to 10M tokens for Scout and 1M tokens for Maverick. Meta then layered safety training and red-teaming on top to address harmful outputs.
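The alternating train/filter loop can be illustrated with a deliberately toy simulation; the scalar "skill" and numeric prompt difficulties below are placeholders standing in for real RL updates and model-based difficulty judgments, and only the loop structure mirrors Meta's description:

```python
import random

def train_round(skill: float, prompts: list[float]) -> float:
    return skill + len(prompts) / 1000  # training on more prompts raises skill

skill, prompts = 0.0, [random.uniform(0, 5) for _ in range(1000)]
for rnd in range(5):
    skill = train_round(skill, prompts)
    # Re-rank with the current model and keep only the still-hard prompts,
    # so each round trains on a progressively harder distribution.
    prompts = [p for p in prompts if p - skill > 0]
    print(f"round {rnd}: skill={skill:.2f}, hard prompts left={len(prompts)}")
```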
Alongside the model weights themselves, Meta released a refreshed safety stack designed to be deployed in front of and behind any Llama 4 system in production:
| Tool | Purpose |
|---|---|
| Llama Guard | Input and output safety classifier for filtering harmful prompts and completions |
| Prompt Guard | Lightweight detector for jailbreaks and prompt injection attempts |
| CyberSecEval | Benchmark suite for measuring a model's offensive cybersecurity capability and refusal behavior |
| GOAT (Generative Offensive Agent Testing) | Multi-turn adversarial agent that simulates a persistent attacker and stress-tests model defenses |
Meta also reported deliberate work on "refusal balance." According to the official launch blog, Llama 4 Scout refuses on debated political and social topics roughly 2 percent of the time, down from about 7 percent in Llama 3.3, and the imbalance between refusals on left-coded versus right-coded prompts dropped to under 1 percent [1]. Meta described the resulting political-lean rate as comparable to xAI's Grok and roughly half that of Llama 3.3, framing the change as an explicit design goal of "being more helpful on contested topics" rather than an accidental side effect.
Meta reported benchmark results for Scout and Maverick across a range of evaluations, comparing them primarily against other models in similar compute classes. All Llama 4 numbers below were reported by Meta as 0-shot, temperature 0, with no majority voting or parallel test-time compute [1].
| Benchmark | Llama 4 Scout | Llama 3.1 8B | Gemma 3 12B | Gemini 2.0 Flash Lite |
|---|---|---|---|---|
| MMLU (0-shot, CoT) | 79.6 | 73.0 | 78.5 | 76.1 |
| MMLU Pro | 74.3 | 48.3 | 60.6 | 65.1 |
| GPQA Diamond | 57.2 | 32.8 | 42.4 | 50.4 |
| LiveCodeBench (10/01/2024–02/01/2025) | 32.8 | 13.0 | 24.4 | 27.2 |
| MMMU (0-shot, CoT) | 69.4 | N/A (text only) | 64.8 | 58.4 |
| MathVista | 70.7 | N/A (text only) | 68.0 | 61.5 |
| ChartQA | 83.4 | N/A (text only) | 74.7 | 72.2 |
| DocVQA | 89.4 | N/A (text only) | 87.1 | 84.0 |
Scout, with its 17B active parameters, consistently outperformed models in the sub-20B class across both text and multimodal benchmarks. Its GPQA Diamond score of 57.2 represented a particularly strong result for a model of its size, and its multimodal scores on MMMU (69.4) and MathVista (70.7) were competitive with much larger models [1].
| Benchmark | Llama 4 Maverick | GPT-4o | Gemini 2.0 Flash | Claude 3.7 Sonnet | Llama 3.1 405B |
|---|---|---|---|---|---|
| MMLU (0-shot, CoT) | 85.5 | 85.7 | 84.4 | 84.2 | 85.2 |
| MMLU Pro | 80.5 | 74.7 | 77.6 | 78.0 | 73.4 |
| GPQA Diamond | 69.8 | 53.6 | 64.6 | 68.0 | 49.0 |
| LiveCodeBench (10/01/2024–02/01/2025) | 43.4 | 32.3 | 34.5 | 38.5 | 27.7 |
| MMMU (0-shot, CoT) | 73.4 | 69.1 | 71.7 | 68.6 | N/A |
| MathVista | 73.7 | 63.8 | 73.3 | 70.4 | N/A |
| MATH-500 | 88.1 | 74.6 | 82.3 | 89.9 | 73.8 |
| ChartQA | 90.0 | 85.7 | N/A | 81.5 | N/A |
| DocVQA | 94.4 | 92.8 | N/A | 95.2 | N/A |
Maverick's results were strong across the board. On MMLU, it scored 85.5, essentially matching GPT-4o. Its GPQA Diamond score of 69.8 exceeded GPT-4o (53.6) by a wide margin. On multimodal benchmarks, Maverick scored 73.4 on MMMU and 73.7 on MathVista, outperforming GPT-4o on both. Its LiveCodeBench score of 43.4 surpassed all listed competitors, including Claude 3.7 Sonnet (38.5) and GPT-4o (32.3) [1][10]. On the MATH-500 benchmark, Maverick scored 88.1, slightly below Claude 3.7 Sonnet's 89.9 but above GPT-4o (74.6) and Gemini 2.0 Flash (82.3).
Meta also published preview numbers for Behemoth, drawn from an internal training checkpoint rather than a finalized release:
| Benchmark | Llama 4 Behemoth (preview) | GPT-4.5 | Claude 3.7 Sonnet | Gemini 2.0 Pro |
|---|---|---|---|---|
| MATH-500 | 95.0 | 92.4 | 92.0 | 91.8 |
| GPQA Diamond | 73.7 | 71.4 | 68.0 | 64.7 |
| MMLU Pro | 82.2 | N/A | N/A | N/A |
These numbers were widely cited at launch but were never independently reproduced because Behemoth was never released to the public. The combination of the LMArena controversy, the postponement of Behemoth's release, and reports of training instability later in 2025 has left these scores as one of the more disputed datasets in the Llama 4 story.
Scout's 10-million-token context was evaluated on the Needle-in-a-Haystack test, where the model must retrieve a specific piece of information embedded at a random position within a very long document. Scout achieved perfect retrieval across the full 10M context in Meta's reported runs [1].
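NIAH is straightforward to implement, which is part of why it is easy to saturate. Below is a minimal harness sketch in which the `answer_fn` model call is a placeholder, demonstrated with a trivial string-search stand-in:

```python
def make_haystack(needle: str, n_filler: int, depth: float) -> str:
    """Embed a needle sentence at a relative depth inside uniform filler."""
    filler = ["The sky was a pleasant shade of blue that day."] * n_filler
    filler.insert(int(depth * n_filler), needle)
    return " ".join(filler)

def niah_score(answer_fn, needle: str, question: str,
               depths=(0.0, 0.25, 0.5, 0.75, 1.0)) -> float:
    """Fraction of insertion depths at which the needle is retrieved verbatim.
    answer_fn(context, question) -> str stands in for a real model call."""
    hits = 0
    for depth in depths:
        context = make_haystack(needle, n_filler=10_000, depth=depth)
        hits += needle in answer_fn(context, question)
    return hits / len(depths)

# Trivial string-search stand-in for the model under evaluation.
needle = "The secret passphrase is 'heliotrope'."
print(niah_score(lambda ctx, q: needle if needle in ctx else "",
                 needle, "What is the secret passphrase?"))  # 1.0
```

Because the needle is lexically distinctive and the filler uniform, verbatim retrieval is far easier than reasoning that spans chunk boundaries, which is where independent results diverged from Meta's.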
Independent evaluations were less generous. On harder long-context benchmarks like Fiction.LiveBench and several reasoning-style retrieval tasks, both Scout and Maverick degraded sharply once the input exceeded a few hundred thousand tokens, even though they could technically ingest much more [11]. The chunked-attention scheme that lets iRoPE scale to 10M tokens cheaply also makes it harder for the model to follow a chain of reasoning that crosses chunk boundaries, which is exactly the kind of task long-context users care about. NIAH-style retrieval is now widely viewed as a weak proxy for genuine long-context reasoning, and Llama 4 became one of the canonical examples of why.
On the MTOB (Machine Translation of Books) benchmark, which requires processing entire books for translation, both Scout and Maverick maintained coherence and accuracy across full-length books, while competitor models with 128K context windows could not process the complete texts [10].
| Feature | Llama 3 / 3.1 / 3.3 | Llama 4 |
|---|---|---|
| Architecture | Dense transformer | Mixture of experts |
| Largest released model | 405B (dense) | Maverick (~400B total, 17B active) |
| Modalities | Text only (3.0/3.1/3.3); text + vision (3.2) | Native text + image (early fusion) |
| Training tokens | ~15 trillion | 30+ trillion (up to ~40T including multimodal) |
| Languages | ~30 with meaningful coverage | 200+, with 100+ above 1B tokens |
| Max context (Instruct) | 128K tokens | 10M tokens (Scout), 1M (Maverick) |
| Mixture of experts | No | Yes |
| Knowledge distillation | No (dense, all sizes trained from scratch) | Yes (from Behemoth) |
| iRoPE | No | Yes |
| MMLU (best) | 85.2 (405B) | 85.5 (Maverick, 17B active) |
| Inference efficiency | Proportional to model size | 17B active params for both released sizes |
The most striking difference is efficiency. Llama 3.1 405B required activating all 405 billion parameters per token, while Llama 4 Maverick achieves comparable MMLU performance with only 17 billion active parameters, roughly 24 times fewer. This translates directly to lower inference costs and latency. Meta stated that Maverick achieves better results than GPT-4o at approximately one-ninth the cost per token [10]. The other big jump is multilingual coverage and native multimodality. Llama 3.2's vision models bolted a separate image adapter onto a frozen text model; Llama 4 trains text and vision tokens together from the very first batch of pretraining, which closes a long-standing gap with GPT-4o and Gemini.
The Llama 4 launch was overshadowed by a controversy involving benchmark manipulation on the LMArena (Chatbot Arena) leaderboard, which escalated over the days and weeks following the April 5 release.
Shortly after launch, users on the LMArena platform noticed that "Llama-4-Maverick-03-26-Experimental" had appeared near the top of the Chatbot Arena leaderboard, ranking second behind only Google's Gemini 2.5 Pro, with an Elo score of 1417. However, the publicly released version of Maverick did not perform nearly as well in users' own testing. The experimental version produced notably different outputs: verbose responses frequently peppered with emojis, a style seemingly optimized to win user preference votes in the Arena's head-to-head comparison format rather than to be genuinely useful [2][12].
AI researchers and developers quickly pointed out that the model submitted to Chatbot Arena was not the same model available for download. An unverified post by someone claiming to be a former Meta employee alleged that Meta leadership had mixed benchmark test sets into the post-training process to inflate scores and meet internal targets. Although Meta denied this specific allegation, the discrepancy between the Arena submission and the public release eroded trust quickly [13].
Independent testing found that the publicly available Maverick underperformed expectations on coding tasks and general-purpose assistance, particularly compared to its impressive paper benchmarks. The gap between reported and observed performance fueled skepticism about the benchmark numbers across the board.
Two days after the release, LMArena posted a public statement on X clarifying the situation. The platform said "Meta's interpretation of our policy did not match what we expect from model providers" and that "Meta should have made it clearer that 'Llama-4-Maverick-03-26-Experimental' was a customized model to optimize for human preference." LMArena said it was updating its leaderboard policies to "reinforce our commitment to fair, reproducible evaluations so this confusion doesn't occur in the future" [12]. Within days, the platform also made the unmodified release version available for community votes for an apples-to-apples comparison.
Meta's spokesperson responded that "we experiment with all types of custom variants" and that the experimental Maverick "is a chat optimized version we experimented with that also performs well on LM Arena." The framing was that submitting a custom variant was within the spirit of the leaderboard, even if the disclosure had been weak.
Meta's VP of Generative AI, Ahmad Al-Dahle, initially denied the harder allegations, stating: "We've also heard claims that we trained on test sets; that's simply not true and we would never do that" [14]. Meta also published a blog post defending the release and attributing some quality issues to bugs in the model deployment.
In January 2026, however, Yann LeCun, Meta's departing chief AI scientist, confirmed the manipulation in an interview with the Financial Times. LeCun stated that "results were fudged a little bit" and that the team "used different models for different benchmarks to give better results." Rather than submitting a single consistent model for all evaluations, which is the standard practice, the Llama 4 team selected whichever variant of Scout or Maverick performed best on each individual benchmark [3].
When the unmodified release version, labeled "Llama-4-Maverick-17B-128E-Instruct," was added to LMArena on April 11, 2025, it ranked 32nd. It came in below months-old models like GPT-4o, Anthropic's Claude 3.5 Sonnet, and Google's Gemini 1.5 Pro [12][15]. The 30-place gap between the experimental and the released version became one of the most discussed data points in the open-weight model community for the rest of 2025 and was the strongest single piece of evidence that the launch numbers had been overstated.
The controversy had significant consequences within Meta. According to LeCun, CEO Mark Zuckerberg was "really upset and basically lost confidence in everyone who was involved" in the Llama 4 release. Zuckerberg subsequently "sidelined the entire GenAI organisation," leading to a restructuring of Meta's AI leadership. LeCun himself departed Meta after more than a decade to start a new venture called Advanced Machine Intelligence Labs, and in interviews following his departure, he criticized Meta's new AI leadership as "young and inexperienced" [3][16].
Meta also brought in Alexandr Wang, the former CEO of Scale AI, as part of a broader leadership reshuffle for its AI efforts and the new "superintelligence" research lab announced in mid-2025. The Behemoth team was reportedly folded into that lab and stopped publishing new evaluations.
The controversy also prompted LMArena to update its leaderboard policies more broadly. The incident highlighted a real problem in the AI industry: the lack of standardized, independently verified benchmarking procedures, which allows model developers to selectively report favorable results [17]. A separate study published later in April 2025 accused LMArena of giving large labs preferential testing privileges, which intensified the wider discussion about how to evaluate frontier models fairly.
Llama 4 was released under the Llama 4 Community License Agreement, effective April 5, 2025. The license structure is similar to previous Llama licenses but with some notable provisions:

- Organizations with more than 700 million monthly active users must request a separate license from Meta, which Meta may grant or withhold at its discretion
- Derivative works must display "Built with Llama" attribution, and models built on Llama 4 must include "Llama" at the beginning of their names
- Use must comply with Meta's Acceptable Use Policy
- Individuals domiciled, and companies with their principal place of business, in the European Union are not granted rights to the multimodal models
The Open Source Initiative (OSI) has consistently maintained that the Llama Community License does not qualify as "open source" under its definition, citing the use restrictions, the MAU threshold, and the lack of training data release. The OSI also noted that the Llama 4 license restricts use "with respect to any multimodal models" by individuals domiciled or with their principal place of business in the European Union, which the OSI argues violates the principle of non-discrimination [19]. Meta and supporters of the Llama license argue that "open weights" is the appropriate term and that the license provides substantially more freedom than fully closed alternatives.
Meta made Llama 4 Scout and Maverick available for download from llama.com and from the official Meta organization on Hugging Face on day one. The Hugging Face release shipped four model cards covering the base and Instruct variants of Scout and Maverick, in BF16 and FP8 formats, with day-one support in transformers v4.51.0 and Text Generation Inference (TGI) [4]. Hugging Face also rolled out its Xet storage backend for the Llama 4 weights, which deduplicated roughly 25 percent of the upload and around 40 percent of derivative fine-tunes hosted in the same repos.
Llama 4 reached unusual same-day cloud availability. AWS, Microsoft, and Google all announced support on launch weekend (Llama 4 dropped on a Saturday) [20]. Specific deployments included:
| Platform | Llama 4 availability |
|---|---|
| AWS | Llama 4 Scout and Maverick on Amazon SageMaker JumpStart at launch; fully managed serverless on Amazon Bedrock by late April 2025 |
| Microsoft Azure | Available through Azure AI Foundry and Azure Databricks |
| Google Cloud | Same-day support via Vertex AI Model Garden |
| Databricks | Maverick available across AWS, Azure, and GCP through the Mosaic AI Model Serving foundation-model catalog |
| IBM watsonx | Llama 4 Scout and Maverick added shortly after launch alongside earlier Llama models |
| Snowflake Cortex AI | Llama 4 Maverick offered as a hosted foundation model in Cortex AI |
| Cerebras and Groq | Specialized inference providers offering Maverick at higher tokens-per-second than GPU-based hosts |
The coordinated rollout was striking: the same set of clouds that had taken weeks or months to add earlier Llama generations were ready to host Llama 4 within hours, reflecting both the maturity of Meta's launch playbook and the commercial demand for an open-weight, multimodal alternative to GPT-4o.
Llama 4 also became the model behind Meta AI, the assistant feature that runs across Meta's consumer apps. At launch, Meta AI in WhatsApp, Messenger, Instagram Direct, and on the meta.ai website was upgraded to Llama 4 in 40 countries, with text capabilities available in 13 languages. The models powered features like in-conversation drafting, image generation prompts, and the visual question-answering experiences in the Meta AI mobile app released in late April 2025 [21]. For most consumers, this is the way they encountered Llama 4 in practice; the cloud and Hugging Face availability matter for developers, but the WhatsApp deployment was the first time hundreds of millions of people interacted with a frontier MoE LLM in a chat thread without realizing it.
Reception of Llama 4 was mixed. The technical scope of the release was respected: this was the first frontier-scale, natively multimodal, MoE-based open-weight model family from a major Western lab, and its same-day availability on the major clouds set a new bar for launch logistics. Commentary in the open-weight community, however, kept circling back to two issues.
First, the LMArena controversy permanently colored how the launch numbers were read. Within a week, most independent reviewers had stopped quoting Meta's reported benchmarks without caveats and were comparing the unmodified release against rivals using their own evaluation harnesses. The Decoder summarized the consensus around two months after launch: Llama 4 "shows promise on standard tests but struggles with long-context tasks" [11].
Second, the headline 10-million-token context turned out to be more of a marketing artifact than a usable feature for most workloads. NIAH retrieval is a weak proxy for real long-context behavior, and on harder retrieval-and-reasoning benchmarks Llama 4's quality dropped well before users hit the advertised limit. Setting the context question aside, Maverick was generally judged a competent but unremarkable model on standard chat and coding benchmarks, with particular strengths in vision tasks (DocVQA 94.4, ChartQA 90.0) and multilingual coverage, and notable weaknesses in chain-of-thought reasoning relative to Claude 3.7 Sonnet, Gemini 2.5 Pro, and DeepSeek-V3.
Despite this lukewarm critical reception, adoption was real. Scout in particular found a niche as the cheapest credible multimodal frontier model: it ran on a single H100, exposed a reasonable Hugging Face surface, and worked acceptably for document understanding and chart reading at production cost. Maverick was widely hosted by inference vendors and saw heavy use as a backbone for fine-tuned variants from third-party shops looking for a permissively (if not openly) licensed base model with a real vision encoder.
Several limitations have shaped the practical use of Llama 4 since launch:

- Long-context quality degrades well before the advertised limits; on harder retrieval-and-reasoning benchmarks both Scout and Maverick fall off after a few hundred thousand tokens [11]
- Multi-step, chain-of-thought reasoning lags behind Claude 3.7 Sonnet, Gemini 2.5 Pro, and DeepSeek-V3, and neither released model offers a dedicated reasoning mode
- Even Scout requires an 80 GB-class GPU at INT4, putting local use out of reach of most consumer hardware
- The LMArena controversy left lasting skepticism toward Meta-reported benchmark numbers
- License restrictions, including the EU limitation on the multimodal models, complicate deployment for some organizations
In the months after launch, Llama 4 settled into a relatively clear set of use cases:

- Document understanding, chart reading, and other vision-heavy workloads at production cost, where Scout's single-GPU footprint made it the cheapest credible option
- High-volume, cost-sensitive text serving, where Maverick's 17 billion active parameters kept per-token costs far below comparable dense models
- A base for third-party fine-tuned variants, as one of the few permissively licensed open-weight models with a real vision encoder
- Consumer assistant duty inside Meta AI across WhatsApp, Messenger, Instagram, and Facebook
Llama 4 entered a fiercely competitive landscape in 2025:
| Model | Developer | Release | Architecture | Key strengths |
|---|---|---|---|---|
| Llama 4 Maverick | Meta | April 2025 | MoE (400B total, 17B active) | Multimodal, efficient, open weights |
| GPT-4o | OpenAI | May 2024 | Dense (proprietary) | Strong all-around, voice / vision |
| GPT-4.5 | OpenAI | February 2025 | Dense (proprietary) | Improved reasoning, reduced hallucination |
| Claude 3.7 Sonnet | Anthropic | February 2025 | Dense (proprietary) | Extended thinking, strong coding |
| Gemini 2.0 Flash / Pro | Google | December 2024 | MoE (proprietary) | Speed, multimodal, long context |
| Gemini 2.5 Pro | Google | March 2025 | MoE (proprietary) | Reasoning, 1M context |
| DeepSeek-V3 | DeepSeek | December 2024 | MoE (671B total, 37B active) | Cost-efficient training, open weights |
| Mistral Large 2 | Mistral | July 2024 (refreshed 2025) | Dense (123B, weights under research license) | European compliance focus, multilingual |
Maverick's benchmark results placed it competitively with these models on paper, particularly in multimodal tasks and coding. However, the benchmark controversy made direct comparisons difficult to trust. In practice, independent evaluations through 2025 found Maverick to be a capable but not clearly frontier model, with particular strengths in multimodal understanding and long-context tasks (within reason), and relative weaknesses in complex multi-step reasoning compared to Claude 3.7 Sonnet and Gemini 2.5 Pro [22].
The efficiency argument remained Maverick's strongest selling point: with only 17 billion active parameters, it could be served at much lower cost than dense models of comparable quality, making it attractive for high-volume commercial deployments where cost per token is a primary concern. Among open-weight models, the most direct competitor was DeepSeek-V3, whose larger active parameter count (37B vs Maverick's 17B) gave it a quality edge on hard reasoning, while Maverick won on inference cost and on multimodal capability.
As of early 2026, the Llama 4 family occupies an uncertain position in the AI landscape. Several developments have shaped its trajectory since the April 2025 launch:

- Behemoth, the teacher model behind the released variants, has still not shipped, and Meta reportedly stopped running new evaluations on it after the mid-2025 reorganization
- Meta's AI organization was restructured around the new "superintelligence" lab, with Alexandr Wang brought in and Yann LeCun departing
- LMArena and other evaluation platforms tightened their submission policies in direct response to the Maverick incident
- Independent evaluations settled on a view of Maverick as capable but not clearly frontier, with inference efficiency as its main draw
The Llama 4 release will likely be remembered as a turning point for Meta's AI strategy: a technically ambitious model family whose impact was diminished by the benchmark manipulation controversy, leading to lasting changes in how AI companies report and verify model performance, and triggering one of the largest internal AI reorganizations at any major lab in years.