Llama 4 Scout and Llama 4 Maverick are open-weight multimodal large language models developed by Meta and released on April 5, 2025. They are the first two publicly available members of the Llama 4 family, which also includes Llama 4 Behemoth, a much larger model still in training at the time of the announcement. Scout and Maverick are the first open-weight models to combine a Mixture of Experts (MoE) architecture with native multimodal capabilities through early fusion, supporting both text and image inputs from the ground up.
Scout is a 109-billion total parameter model with 16 experts and 17 billion active parameters per token, offering a 10-million token context window. Maverick has 400 billion total parameters, 128 experts, and the same 17 billion active parameter budget, with a 1-million token context window. Both models were trained jointly on text, image, and video data and distilled in part from Llama 4 Behemoth, a nearly two-trillion parameter teacher model.
The release was accompanied by controversy after it emerged that the version Meta submitted to the LMArena chatbot leaderboard was a specially optimized experimental variant not available to the public, which performed significantly better on that benchmark than the publicly released model.
Meta has released successive generations of open-weight language models under the LLaMA and Llama branding since 2023. Llama 3.3 and the broader Llama 3 generation introduced strong text-only performance, but those models relied on separate vision adapters added after pretraining rather than native multimodal training.
For Llama 4, Meta pursued a different approach. The company trained a family of models from scratch with multimodal data included in pretraining, so that text and image understanding are integrated at the architectural level rather than grafted on afterward. Meta also adopted a Mixture of Experts architecture for the first time in the Llama series, following similar moves by other labs including Google DeepMind with Gemini and Mistral AI with Mixtral.
The Llama 4 family was announced under the tagline "the beginning of a new era of natively multimodal AI innovation." At launch, Scout and Maverick weights were made available for download from llama.com (subject to license acceptance), and the models were immediately integrated into Meta's own Meta.ai assistant product as the underlying backbone.
Llama 4 Scout is a sparse MoE model with 16 experts and 17 billion active parameters per forward pass. The full parameter count across all experts is approximately 109 billion. During inference, only the active 17 billion parameters are used for any given token, making the per-token compute comparable to a dense 17B model despite the much larger total capacity.
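Schematically, a sparse MoE feed-forward layer routes each token to a small subset of expert networks. The sketch below is illustrative only, with toy dimensions and a simple top-1 router; Meta's published description also routes every token through a shared, always-active expert, modeled here as one extra FFN:

```python
import torch
import torch.nn as nn

class MoELayer(nn.Module):
    """Minimal top-1 routed mixture-of-experts feed-forward layer (illustrative)."""

    def __init__(self, d_model: int, d_ff: int, n_experts: int):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts, bias=False)
        ffn = lambda: nn.Sequential(nn.Linear(d_model, d_ff), nn.SiLU(), nn.Linear(d_ff, d_model))
        self.experts = nn.ModuleList([ffn() for _ in range(n_experts)])
        # Assumption drawn from Meta's description: each token also passes
        # through one shared expert in addition to its routed expert.
        self.shared_expert = ffn()

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (n_tokens, d_model)
        gate = self.router(x).softmax(dim=-1)             # (n_tokens, n_experts)
        weight, expert_idx = gate.max(dim=-1)             # top-1 expert per token
        out = self.shared_expert(x)
        for e, expert in enumerate(self.experts):
            mask = expert_idx == e                        # tokens routed to expert e
            if mask.any():
                out[mask] = out[mask] + weight[mask, None] * expert(x[mask])
        return out

layer = MoELayer(d_model=512, d_ff=2048, n_experts=16)    # toy sizes, not Llama 4's
tokens = torch.randn(10, 512)
print(layer(tokens).shape)                                # torch.Size([10, 512])
```

Because only the routed expert (plus the shared expert) runs for each token, per-token compute tracks the 17-billion active figure rather than the 109-billion total.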
Scout was pretrained on approximately 40 trillion tokens of multimodal data, substantially more than the roughly 22 trillion used for Maverick. The pretraining context length was 256,000 tokens; the instruct-tuned version extends this to 10 million tokens through a mid-training phase on long-context data.
With Int4 quantization, Scout fits on a single NVIDIA H100 GPU, making it the more accessible of the two released models for self-hosted deployments.
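A back-of-the-envelope calculation of weight memory (ignoring KV cache, activations, and quantization overhead) illustrates why this fits:

```python
def weight_gib(params_billion: float, bytes_per_param: float) -> float:
    """Approximate weight memory in GiB (weights only; KV cache is extra)."""
    return params_billion * 1e9 * bytes_per_param / 2**30

print(f"Scout,    Int4: {weight_gib(109, 0.5):6.1f} GiB")  # ~50.8 GiB -> fits one 80 GiB H100
print(f"Scout,    BF16: {weight_gib(109, 2.0):6.1f} GiB")  # ~203 GiB  -> multi-GPU territory
print(f"Maverick, FP8:  {weight_gib(400, 1.0):6.1f} GiB")  # ~373 GiB  -> an 8x80 GiB DGX host
```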
The 10-million token context window is one of Scout's most distinctive features. At release, it was the longest context window of any publicly available open-weight model. To put the scale in perspective, 10 million tokens can accommodate roughly 7,500 pages of text, around 20 hours of transcribed speech, or large multi-file codebases in a single prompt.
Meta reports near-perfect retrieval accuracy on Needle-in-a-Haystack evaluations up to the full 10-million token limit. On MTOB (Machine Translation from One Book), a benchmark that tests translation of the low-resource language Kalamang from a single grammar book, Scout scored 42.2 on English-to-Kalamang and 36.6 on Kalamang-to-English in the half-book setting.
The extended context is enabled by the iRoPE architecture (described in its own section below), which combines layers without positional encoding, chunked attention on most layers, and inference-time temperature scaling.
On standard academic benchmarks, Scout scores 79.6 on MMLU, 74.3 on MMLU Pro, 57.2 on GPQA Diamond, 50.3 on MATH, 32.8 on LiveCodeBench, 88.8 on ChartQA, and 94.4 on DocVQA. On the multilingual math benchmark MGSM, Scout achieves 90.6.
Meta positions Scout as outperforming Gemma 3, Gemini 2.0 Flash-Lite, and Mistral 3.1 across a range of benchmarks at comparable or lower inference cost.
Meta designed Scout for applications that require processing very long documents or large codebases in a single pass. Specific scenarios include enterprise document retrieval (searching across thousands of files to answer a query), multi-document summarization, reasoning over lengthy technical manuals, and analyzing large software repositories. The single-GPU inference footprint also makes Scout well-suited for organizations that want to run a capable open model without multi-GPU server infrastructure.
Llama 4 Maverick has 128 experts and 17 billion active parameters per token, with approximately 400 billion total parameters across all experts. The instruct-tuned version supports a 1-million token context window. Maverick's architecture alternates between dense layers and MoE layers, unlike Scout, which uses MoE layers throughout.
Maverick was pretrained on approximately 22 trillion tokens. It was also codistilled from Llama 4 Behemoth (see the distillation section below), which Meta credits as a significant driver of quality gains over what pretraining alone would have produced.
Maverick requires a single NVIDIA H100 DGX host (8 GPUs) for inference in BF16 precision. FP8 weights are also available for reduced memory footprint.
Maverick scores 85.5 on MMLU, 80.5 on MMLU Pro, 69.8 on GPQA Diamond, 61.2 on MATH, 43.4 on LiveCodeBench, 90.0 on ChartQA, and 94.4 on DocVQA. On MGSM, Maverick scores 92.3. On Multilingual MMLU, it scores 84.6.
On the MMMU image reasoning benchmark, Maverick scores 73.4 and Scout scores 69.4. On MathVista, Maverick scores 73.7 and Scout scores 70.7.
Compared to other models at similar active parameter counts, Maverick's GPQA Diamond score of 69.8 is notably higher than GPT-4o's reported score of 53.6 on the same benchmark, though direct comparisons across benchmark versions and evaluation conditions require caution. Meta also claims Maverick matches or surpasses DeepSeek V3 on reasoning and coding benchmarks while using less than half the active parameters.
At launch, Meta highlighted that a version of Maverick achieved an Elo score above 1,400 on the LMArena chatbot leaderboard, placing it second overall, just behind Google's Gemini 2.5 Pro. This became controversial (see the LMArena controversy section below).
Maverick targets applications requiring strong multimodal reasoning alongside text generation quality. Described use cases include customer support systems that process user-uploaded images, multilingual assistants (Llama 4 models support 200 languages, with 100 or more having over one billion training tokens each), enterprise question-answering over rich media, and creative generation tasks where vision and language understanding combine. Meta describes Maverick as offering the best performance-to-cost ratio in its class among models available at the time of release.
Llama 4 Behemoth is the largest model in the Llama 4 family, with approximately 288 billion active parameters and 16 experts across a total of nearly two trillion parameters. At the time of the April 2025 announcement, Behemoth was still in training; Meta disclosed it as a preview to describe the broader Llama 4 research direction and its role as a teacher model.
Meta states that Behemoth, even in its incomplete training state, outperforms GPT-4.5, Claude Sonnet 3.7, and Gemini 2.0 Pro on STEM-focused benchmarks including MATH-500 and GPQA Diamond. The full model is intended for agentic and reasoning tasks that require the largest available capacity.
Behemoth's primary role in the Llama 4 release is as the teacher in the codistillation process used to train Maverick (and to a lesser extent Scout). Unlike Scout and Maverick, Behemoth weights are not publicly available, and Meta has not committed to an open-weight release.
Behemoth was pretrained on more than 30 trillion tokens using FP8 precision across 32,000 GPUs, achieving 390 teraFLOPS per GPU during training. Meta used the MetaP method to set hyperparameters, which allowed stable transfer of learning rate and batch size schedules across model scales.
All Llama 4 models use what Meta calls the iRoPE architecture for positional encoding and long-context support. The name combines "i" (for interleaved or, by Meta's framing, a nod toward infinite context) with "RoPE" (Rotary Position Embeddings), the positional encoding scheme used in Llama 3 and many other recent transformer models.
iRoPE works by interleaving two different types of attention layers throughout the model:
The majority of layers (three out of every four) use standard RoPE positional encodings with chunked attention. In chunked attention, the input sequence is divided into fixed chunks (8,192 tokens per chunk for Llama 4), and each token only attends within its chunk plus to any global context tokens. This keeps the computational cost manageable even for very long inputs.
Every fourth layer is a NoPE layer (No Positional Encoding). NoPE layers apply full causal attention across the entire input sequence without any positional information encoded. Because these layers have no position bias, they can generalize to sequence lengths far beyond those seen in training, acting as global context aggregators that can relate tokens from the beginning and end of a very long document.
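The layer interleaving and masking pattern can be sketched as follows. This is illustrative only: the published implementation fuses this logic into optimized attention kernels, and the sketch omits the global context tokens mentioned above.

```python
import torch

CHUNK = 8192  # Llama 4's chunk size in the RoPE layers

def attention_allowed(q_pos: torch.Tensor, k_pos: torch.Tensor, nope_layer: bool) -> torch.Tensor:
    """Boolean mask: which key positions each query position may attend to."""
    causal = k_pos[None, :] <= q_pos[:, None]
    if nope_layer:
        return causal                                   # full causal attention, no position encoding
    same_chunk = (k_pos[None, :] // CHUNK) == (q_pos[:, None] // CHUNK)
    return causal & same_chunk                          # attention confined to the local 8K chunk

def is_nope_layer(layer_idx: int) -> bool:
    return (layer_idx + 1) % 4 == 0                     # every fourth layer is a NoPE layer
```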
At inference time, Meta applies temperature scaling to the attention scores in the NoPE layers. Without scaling, attention scores tend to become increasingly uniform as sequences grow longer and lose their discriminative power, a failure mode known as attention entropy collapse; the scaling counteracts this and was applied post-training, without retraining.
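The Hugging Face integration of Llama 4 implements this as a position-dependent multiplier on the query vectors in NoPE layers. The sketch below follows that form, with the constants treated as illustrative defaults rather than authoritative values:

```python
import torch

def query_temperature_scale(pos: torch.Tensor,
                            floor_scale: float = 8192.0,
                            attn_scale: float = 0.1) -> torch.Tensor:
    # Grows logarithmically with absolute position, keeping attention logits
    # in the NoPE layers sharp even on inputs millions of tokens long.
    return torch.log(torch.floor((pos.float() + 1.0) / floor_scale) + 1.0) * attn_scale + 1.0

pos = torch.tensor([0, 8_192, 100_000, 1_000_000, 10_000_000])
print(query_temperature_scale(pos))  # ~1.0 near the start, rising slowly with position
# Query vectors are multiplied by this factor before the attention dot product.
```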
Scout additionally uses RMS normalization of query and key vectors in the RoPE layers (QK normalization), which provides further training stability.
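A minimal sketch of this normalization, omitting any learned scale (a simplification relative to the released code):

```python
import torch

def qk_rms_norm(x: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    # RMS-normalize along the head dimension, bounding the magnitude of the
    # query/key vectors and hence the scale of the attention logits.
    return x * torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + eps)

q, k = torch.randn(4, 128), torch.randn(4, 128)
q, k = qk_rms_norm(q), qk_rms_norm(k)  # applied to queries and keys in the RoPE layers
```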
This combination of chunked RoPE attention, NoPE full-attention layers, and temperature scaling is what allows Scout to generalize to 10 million tokens from a pretraining context of 256,000 tokens.
Previous Llama models were text-only at the base and added visual understanding through separate adapter modules trained after the language backbone was complete. This late-fusion approach, also called post-hoc multimodality, typically limits how deeply visual and textual representations are integrated.
Llama 4 uses early fusion instead. Text tokens and image tokens are fed into the same transformer backbone from the beginning, processed jointly through all layers. This means the model develops shared representations for visual and linguistic content rather than treating them as separate modalities that communicate through a narrow adapter interface.
The vision input is processed by a MetaCLIP-based vision encoder. Meta adapted the encoder specifically to work with the Llama backbone, training the encoder in a frozen-Llama configuration to align visual representations with the language model's internal representation space. The encoder converts images into a sequence of visual tokens that are interleaved with text tokens before entering the main transformer.
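Schematically, early fusion reduces to projecting the vision encoder's patch embeddings into the text embedding space and splicing them into the token sequence. Names and dimensions below are illustrative, not Meta's API:

```python
import torch
import torch.nn as nn

d_vision, d_model = 1408, 5120             # illustrative dimensions, not Llama 4's actual sizes
projector = nn.Linear(d_vision, d_model)   # maps vision-encoder outputs into text embedding space

def fuse(text_emb: torch.Tensor, image_patches: torch.Tensor, image_slot: int) -> torch.Tensor:
    """Splice projected image tokens into the text embedding sequence at a placeholder position."""
    visual_tokens = projector(image_patches)           # (n_patches, d_model)
    return torch.cat([text_emb[:image_slot],           # text before the image placeholder
                      visual_tokens,                   # image tokens, same width as text tokens
                      text_emb[image_slot:]], dim=0)   # remaining text

# From here on, the transformer backbone sees one homogeneous token sequence;
# no adapter or cross-attention bridge separates the modalities.
```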
Joint pretraining on text, images, and video data allows the model to learn natural correspondences between modalities at scale. Meta trained on diverse image-text pairs, documents with embedded figures, and video with associated transcripts. Supporting multiple images in a single prompt, Llama 4 models can also anchor responses to specific regions within an image, a capability enabled by the unified token-level processing in early fusion.
Native multimodality is one of the main reasons Meta cites for the strong performance on document understanding benchmarks like DocVQA (94.4 for both Scout and Maverick) and chart comprehension benchmarks like ChartQA (88.8 for Scout, 90.0 for Maverick).
After pretraining, Scout and Maverick went through a multi-stage post-training process. The pipeline consisted of lightweight supervised fine-tuning (SFT), followed by online reinforcement learning (RL), and then a final lightweight direct preference optimization (DPO) phase.
For Maverick, Meta removed more than 50 percent of data tagged as easy from the SFT stage, concentrating the fine-tuning on harder examples where the model needed more guidance. The online RL phase used adaptive difficulty filtering: prompts where the model had no meaningful gradient signal (because it already answered correctly with high confidence, yielding zero advantage estimates) were dropped from training batches. This continuous filtering kept the RL training focused on the edge of the model's current ability.
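Meta has not released its RL code; the following sketch, under assumed names, shows what such advantage-based difficulty filtering amounts to. A common variant, assumed here, also drops prompts the policy never solves, since those provide no reward signal either:

```python
from dataclasses import dataclass

@dataclass
class PromptStats:
    prompt: str
    pass_rate: float   # fraction of sampled rollouts the current policy already gets right

def filter_batch(batch: list[PromptStats],
                 low: float = 0.05, high: float = 0.95) -> list[PromptStats]:
    # Prompts the model always solves yield ~zero advantage estimates (no
    # gradient signal); prompts it always fails yield no reward signal.
    # Keeping the middle band focuses RL on the edge of current ability.
    return [p for p in batch if low < p.pass_rate < high]
```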
For Behemoth, the post-training process was even more aggressive. Meta pruned 95 percent of SFT data and relied heavily on large-scale RL with hard prompt sampling. The goal for Behemoth was to push STEM reasoning performance as far as possible without the distillation constraints that apply to the smaller models.
Bias and over-refusal were also targets of the post-training work. Meta reduced political and social topic refusal rates from around 7 percent in Llama 3.3 down to below 2 percent in Llama 4. Unequal response rates across demographic groups were reduced to below 1 percent.
Llama 4 Maverick was trained using a process Meta calls codistillation from Llama 4 Behemoth. Standard knowledge distillation trains a smaller student model to mimic the output distributions of a larger teacher model, typically after the teacher is fully trained. Codistillation runs the student and teacher training processes concurrently, which reduces cost by amortizing the teacher's forward passes over the student's full training run.
For the majority of training tokens, Behemoth's output distribution (the soft targets) was already available from the teacher's own training run, so the student could use those targets without requiring additional teacher inference. For any new data added specifically to the student's training mixture, Meta ran Behemoth forward passes to generate fresh distillation targets.
Meta developed a novel dynamic distillation loss function that adjusts the weighting between soft targets (Behemoth's output distribution) and hard targets (ground truth labels) over the course of training. Early in training, when the student model is far from convergence, hard targets provide stronger gradient signal. As training progresses and the student stabilizes, soft targets become more informative because they encode the teacher's nuanced uncertainty. The dynamic weighting captures this shift automatically.
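Meta has not published the exact loss. The sketch below shows a generic dynamically weighted distillation objective; the linear ramp from hard to soft targets is an assumption standing in for whatever schedule Meta actually used:

```python
import torch
import torch.nn.functional as F

def codistill_loss(student_logits: torch.Tensor,
                   teacher_logits: torch.Tensor,
                   labels: torch.Tensor,
                   progress: float) -> torch.Tensor:
    """Blend hard-label CE with soft-target KL; `progress` in [0, 1] is the training fraction."""
    alpha = progress  # weight on the teacher's soft targets grows as the student stabilizes
    hard = F.cross_entropy(student_logits, labels)
    soft = F.kl_div(F.log_softmax(student_logits, dim=-1),
                    F.softmax(teacher_logits, dim=-1),
                    reduction="batchmean")
    return (1 - alpha) * hard + alpha * soft
```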
Meta reports that codistillation produced substantial quality improvements in Maverick compared to training without a teacher, particularly on reasoning and knowledge-intensive tasks.
Llama 4 Scout and Maverick are released under the Llama 4 Community License Agreement. The license permits commercial use for most organizations, subject to several notable restrictions:
Organizations with more than 700 million monthly active users as of April 2025 must obtain a separate license from Meta, granted at Meta's sole discretion. This clause is aimed at hyperscale technology platforms that could use Llama to build competing products at scale. The 700 million MAU threshold applies to the entire legal entity, including subsidiaries and affiliates.
The multimodal (vision) capabilities of Llama 4 models cannot be used by individuals domiciled in the European Union or by companies whose principal place of business is in the EU; text-only use is not restricted. Meta did not publicly explain the specific regulatory reason for this restriction, but it coincides with obligations under the EU AI Act and related data protection frameworks. The restriction applies to the licensee, not to end users of a product built with the models.
Derivative models and products must include "Built with Llama" attribution. The license does not permit using Llama 4 outputs to train other foundational AI models that compete with Meta's products.
The table below shows performance on selected benchmarks for Llama 4 Scout and Llama 4 Maverick, compared to GPT-4o, DeepSeek V3, and Gemini 2.0 Flash.
| Benchmark | Category | Scout | Maverick | GPT-4o | DeepSeek V3 | Gemini 2.0 Flash |
|---|---|---|---|---|---|---|
| MMLU | Knowledge | 79.6 | 85.5 | 85.7 | 88.5 | 85.2 |
| MMLU Pro | Knowledge | 74.3 | 80.5 | 72.6 | 75.9 | 77.6 |
| GPQA Diamond | Science reasoning | 57.2 | 69.8 | 53.6 | 59.4 | 60.1 |
| MATH | Mathematics | 50.3 | 61.2 | 74.6 | 87.1 | 89.7 |
| LiveCodeBench | Coding | 32.8 | 43.4 | 32.3 | 49.2 | 44.5 |
| MMMU | Image reasoning | 69.4 | 73.4 | 69.1 | N/A | 70.7 |
| MathVista | Visual math | 70.7 | 73.7 | 63.8 | N/A | 73.1 |
| ChartQA | Chart understanding | 88.8 | 90.0 | 85.7 | N/A | 87.2 |
| DocVQA | Document understanding | 94.4 | 94.4 | 91.1 | N/A | 92.1 |
| MGSM | Multilingual math | 90.6 | 92.3 | 90.5 | 91.1 | 89.2 |
Note: Benchmark scores are drawn from Meta's official model card and the Hugging Face release post. Cross-benchmark comparisons require caution because evaluation conditions, prompt formats, and model versions vary across organizations.
When Meta announced Llama 4 on April 5, 2025, it cited an Elo score above 1,400 on the LMArena chatbot leaderboard as evidence of Maverick's quality. LMArena is a widely followed benchmark where human evaluators compare AI model outputs in blind side-by-side tests.
The model Meta submitted to LMArena was not the same as the model released to the public. The submitted version was labeled "Llama-4-Maverick-03-26-Experimental," a variant tuned specifically for human preference rather than general instruction-following. The experimental version consistently produced verbose, heavily emoji-laden responses, a style that tends to score well on preference-based evaluations but that many users find inappropriate for professional or technical applications.
When the publicly released Llama-4-Maverick-17B-128E-Instruct was separately evaluated on LMArena, it ranked approximately 32nd on the leaderboard, well below the experimental version's second-place finish and below established competitors including GPT-4o, Claude 3.5 Sonnet, and Gemini 1.5 Pro, all of which were older models.
Meta acknowledged the distinction without apologizing, stating that the experimental version was "a chat-optimized version we experimented with" and that testing custom variants is standard practice. LMArena updated its leaderboard policies in response, stating that "Meta's interpretation of our policy did not match what we expect from model providers" and committing to clearer standards for which model variants can be submitted. The platform also published the 2,000-plus battles from the experimental submission so researchers could examine the preference data.
The episode drew wider commentary about the reliability of preference-based benchmarks. Because LMArena rewards conversational style, verbosity, and formatting choices that appeal to casual evaluators, models can be optimized to perform well on the benchmark without those optimizations translating to better real-world performance across technical or professional tasks.
The table below positions Scout and Maverick against the three main competitors Meta cited at launch.
| Model | Total params | Active params | Context | Multimodal | Open weight |
|---|---|---|---|---|---|
| Llama 4 Scout | 109B | 17B | 10M | Yes | Yes |
| Llama 4 Maverick | 400B | 17B | 1M | Yes | Yes |
| GPT-4o | Not disclosed | Not disclosed | 128K | Yes | No |
| DeepSeek V3 | 671B | 37B | 128K | No | Yes |
| Gemini 2.0 Flash | Not disclosed | Not disclosed | 1M | Yes | No |
On reasoning benchmarks, Maverick's GPQA Diamond score (69.8) is notably stronger than GPT-4o's (53.6), while on MATH it trails DeepSeek V3 (61.2 vs. 87.1). Maverick's coding results (LiveCodeBench 43.4) are broadly comparable to Gemini 2.0 Flash and close to DeepSeek V3, despite Maverick using roughly half the active parameters.
For document and chart understanding tasks where multimodal capabilities matter, Maverick and Scout are competitive with or ahead of GPT-4o, which is notable because document AI has historically been a strength of closed models with native vision integration.
DeepSeek V3, which at the time of Llama 4's release was one of the strongest open-weight text models, does not natively process images, making direct multimodal comparison impossible for vision tasks.
The clearest practical difference is in context length. Scout's 10-million token window is substantially larger than any competitor in the table. Gemini 2.0 Flash also supports up to 1 million tokens, matching Maverick, while GPT-4o and DeepSeek V3 are limited to 128,000 tokens.
Scout is aimed at applications that require long-context reasoning within a resource-efficient deployment footprint. Practical scenarios include:
Enterprise knowledge retrieval: processing entire document libraries, SharePoint repositories, or internal knowledge bases in a single context window, without the chunking and retrieval pipeline complexity required by shorter-context models.
Code analysis: loading complete multi-file codebases for security auditing, refactoring analysis, or automated code review without losing cross-file context.
Long document processing: summarizing legal contracts, technical specifications, financial filings, or multi-volume research reports in a single pass.
Maverick targets applications where multimodal reasoning quality is the primary requirement alongside text generation:
Document AI: extracting structured data from invoices, forms, charts, and reports where text and visual layout are both relevant.
Customer support: handling support tickets that include user-uploaded screenshots, product photos, or error message images alongside text descriptions.
Content analysis: processing mixed media inputs for moderation, classification, or information extraction at scale.
Multilingual applications: Meta's training on 200 languages with 1 billion or more tokens each for over 100 of them makes both models notably stronger on non-English languages than most comparably sized models.
Both models were available on the day of announcement through multiple channels. Model weights can be downloaded from llama.com after accepting the license terms. Hugging Face hosts the models at meta-llama/Llama-4-Scout-17B-16E and meta-llama/Llama-4-Maverick-17B-128E-Instruct, with support for the Transformers library starting from version 4.51.0.
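A minimal loading sketch via Transformers (version 4.51.0 or later), assuming license access has been granted on Hugging Face and sufficient GPU memory is available. The class name follows the Llama 4 integration announced with that release, and the instruct-variant checkpoint ID follows Meta's naming pattern:

```python
from transformers import AutoProcessor, Llama4ForConditionalGeneration

# Instruct-tuned Scout checkpoint; ID assumed from Meta's naming convention.
model_id = "meta-llama/Llama-4-Scout-17B-16E-Instruct"
processor = AutoProcessor.from_pretrained(model_id)
model = Llama4ForConditionalGeneration.from_pretrained(
    model_id, device_map="auto", torch_dtype="auto"
)

messages = [{"role": "user",
             "content": [{"type": "text", "text": "Summarize the attached report."}]}]
inputs = processor.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=True,
    return_dict=True, return_tensors="pt"
).to(model.device)

output = model.generate(**inputs, max_new_tokens=256)
print(processor.batch_decode(output[:, inputs["input_ids"].shape[-1]:])[0])
```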
Cloud providers moved quickly to offer hosted access. Amazon Web Services made both models available in Amazon Bedrock as a serverless option in the US East and US West regions on the day of release, with cross-region inference available in US East (Ohio). Microsoft Azure offered the models through Azure AI Foundry and Azure Databricks. Google Cloud also announced same-day support.
Third-party inference providers including Groq, Together.ai, and Fireworks also offered API access at launch. Groq reported output speeds of over 400 tokens per second for Scout, enabled by its LPU (Language Processing Unit) hardware. IBM made both models available in its watsonx.ai platform.
For local deployment, Scout can run on a single H100 GPU with Int4 quantization using tools like llama.cpp or Ollama. Maverick requires a multi-GPU setup, with 8 GPUs recommended for BF16 inference using tensor parallelism.
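For a local run through llama.cpp's Python bindings, a sketch along these lines works; the GGUF filename is hypothetical, as community quantization names vary:

```python
from llama_cpp import Llama

# Hypothetical Int4 GGUF filename; substitute an actual community quantization.
llm = Llama(model_path="Llama-4-Scout-17B-16E-Instruct-Q4_K_M.gguf",
            n_ctx=32768,       # practical context size; anything near 10M tokens needs far more memory
            n_gpu_layers=-1)   # offload all layers to the GPU

out = llm("Summarize the key points of the following document:\n...", max_tokens=200)
print(out["choices"][0]["text"])
```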
The release of Scout and Maverick was broadly covered as a significant step in Meta's open-weight model program and as a direct competitive response to DeepSeek, which had attracted widespread attention in early 2025 by demonstrating that high-quality models could be trained at lower cost than previously assumed.
Positive reception focused on the multimodal integration quality (particularly document and chart understanding scores that matched or exceeded closed models), the long context windows, and the inference efficiency enabled by the MoE architecture. The fact that both models are open-weight drew particular interest from enterprises and researchers who want to fine-tune or self-host models without per-token API costs.
The codistillation from Behemoth was noted as an interesting training-time technique, though some observers pointed out that the teacher model being used to improve the student is itself not publicly available, which limits independent replication.
Criticism centered primarily on the LMArena benchmark controversy (see above), which led to questions about the reliability of the performance claims in Meta's launch materials. Security researchers at Virtue AI and ProtectAI published redteaming analyses shortly after release, finding that both Scout and Maverick had jailbreak vulnerability rates in the medium-risk range (52 to 58 percent success rates on standard adversarial test suites), with Maverick showing better compliance behavior than Scout.
Some practitioners noted that Scout's real-world coding performance felt weaker than benchmarks suggested, with the model occasionally struggling on complex multi-step programming tasks where DeepSeek V3 or GPT-4o performed more reliably.
Several limitations were documented at and after release:
Hallucination rate: Both models exhibit factual hallucination in multi-step reasoning chains. Maverick's rate is lower than Scout's but higher than GPT-4.5 based on third-party evaluations. Neither model should be used for high-stakes factual retrieval without external grounding.
Coding reliability: Despite competitive LiveCodeBench scores, developer feedback in the weeks after release was mixed. Scout in particular drew criticism for difficulty with complex algorithmic tasks and for producing plausible-but-incorrect code at a higher rate than DeepSeek V3.
EU restriction: The multimodal license restriction means that EU-based organizations cannot legally build or deploy products that use Llama 4's vision capabilities, substantially narrowing the open-weight advantage for European teams.
Behemoth availability: The teacher model used to improve Maverick through codistillation is not publicly available and was not fully trained at announcement. This limits both independent evaluation of the distillation claims and any attempt to reproduce the training process.
Context quality at scale: While Meta reports near-perfect needle-in-a-haystack accuracy, sustained reasoning over very long contexts (millions of tokens) introduces real-world latency and memory constraints that make the 10-million token window practically accessible only on hardware with substantial RAM and fast interconnect.