LLaMA 4 (Large Language Model Meta AI 4) is a family of natively multimodal large language models developed by Meta and released on April 5, 2025. It is the first generation of the Llama series to adopt a mixture of experts (MoE) architecture and the first designed from the ground up to accept both text and images as native inputs. The initial release comprised two models, Llama 4 Scout and Llama 4 Maverick, with a third and much larger model, Llama 4 Behemoth, announced but still in training at the time of release. Scout activates 17 billion parameters per token from a 109 billion total parameter pool across 16 experts, while Maverick activates 17 billion parameters from approximately 400 billion total parameters spread across 128 experts. Llama 4 Scout supports a context window of up to 10 million tokens, the longest of any openly available model at launch [1].
The release was marred by controversy. Within days, the AI community raised concerns that Meta had used a specially tuned "experimental" version of Maverick to achieve high scores on the Chatbot Arena leaderboard, a version that differed from the publicly released model. The incident, later confirmed by departing Meta AI chief scientist Yann LeCun as benchmark manipulation, led to significant reputational damage and internal upheaval at Meta [2][3].
Llama 4 was developed against the backdrop of rapidly escalating competition in the AI industry during late 2024 and early 2025. OpenAI was preparing GPT-5, Google had released Gemini 2.0 and was working on Gemini 2.5, and Anthropic had launched Claude 3.7 Sonnet with extended thinking capabilities. Meanwhile, DeepSeek had shocked the industry with its V3 (December 2024) and R1 (January 2025) models, demonstrating that competitive frontier models could be built at a fraction of the cost assumed by Western labs.
Meta's previous Llama 3 family, released in stages throughout 2024, had established the company as the leading provider of open-weight language models. Llama 3.1 405B, released in July 2024, was the largest openly available dense model and performed competitively with GPT-4 on many benchmarks. However, Llama 3 remained a text-only, dense-architecture model family, and the industry had begun moving toward multimodal and MoE designs.
Meta CEO Mark Zuckerberg reportedly set ambitious targets for Llama 4, wanting the new family to match or exceed frontier closed models while continuing Meta's open-weight strategy [3].
Llama 4 was announced with three model variants, two of which were released at launch:
| Model | Total Parameters | Active Parameters | Experts | Architecture | Max Context (Instruct) | Status at Launch |
|---|---|---|---|---|---|---|
| Llama 4 Scout | 109B | 17B | 16 | MoE (all layers) | 10M tokens | Released |
| Llama 4 Maverick | ~400B | 17B | 128 (+ 1 shared) | Alternating dense/MoE | 1M tokens | Released |
| Llama 4 Behemoth | ~2T | 288B | 16 | MoE | Not disclosed | Training (not released) |
Scout is the smaller of the two released models, designed for efficiency and deployability. With 109 billion total parameters and 16 experts, it uses a full MoE architecture where every transformer layer is a mixture-of-experts layer. Only 17 billion parameters are active per token, meaning the model's inference cost is comparable to a 17B dense model while drawing on a much larger knowledge base.
Scout's most distinctive feature is its 10-million-token context window in the Instruct variant, achieved through a combination of the iRoPE architecture and inference-time temperature scaling of attention. The base model was pretrained with a 256K token context, then extended during fine-tuning. This context length enables processing of entire codebases, lengthy legal documents, or full books within a single prompt. On the Needle-in-a-Haystack (NIAH) evaluation, Scout achieves perfect retrieval across its full 10M token context [1].
Scout is designed to fit on a single NVIDIA H100 GPU when quantized to INT4, making it accessible to a broad range of developers and organizations.
Maverick is the flagship released model, using a larger and more complex architecture. It has approximately 400 billion total parameters with 128 routed experts plus one shared expert. Unlike Scout, Maverick uses an alternating architecture where dense layers and MoE layers alternate in a 1:1 ratio; experts are applied in half of the layers, while the other half are standard dense transformer layers. Each token activates the shared expert plus exactly one of the 128 routed experts in MoE layers [4].
The Instruct variant supports a context window of up to 1 million tokens. Maverick was also co-distilled from the larger Behemoth model during training, using a novel loss function that dynamically weights the student and teacher logits. This knowledge distillation from a more capable teacher model is one of the reasons Maverick achieves performance that exceeds what might be expected from a model with only 17 billion active parameters [1].
Behemoth is the largest model in the Llama 4 family, with nearly 2 trillion total parameters, 288 billion active parameters, and 16 experts. At the time of the April 2025 announcement, Behemoth was still in training and was not released. Meta described it as a "teacher model" whose primary function is to generate high-quality synthetic data and provide the knowledge base for distilling the smaller Scout and Maverick models [1].
Meta released preliminary benchmark results showing Behemoth outperforming GPT-4.5 and Claude 3.7 Sonnet on several STEM benchmarks, including a score of 92.4 on MATH-500. However, given the later controversy around benchmark reporting, these numbers were received with some skepticism by the research community [5].
As of early 2026, Behemoth has still not been publicly released, and speculation persists that the full training run encountered difficulties or that Meta shifted priorities following the launch controversy.
Llama 4 is the first model in the Llama series to use a mixture of experts architecture. In a standard dense transformer, every parameter is involved in processing every token. In MoE, a routing network selects a small subset of specialist sub-networks (experts) for each token, dramatically reducing computational cost while maintaining a large total parameter count.
Scout uses a straightforward MoE design where all transformer layers contain expert routing with 16 experts. Maverick takes a different approach with its alternating dense/MoE design: half the layers are standard dense layers, and the other half are MoE layers with 128 routed experts plus one shared expert. The shared expert processes every token, ensuring a baseline of common knowledge is always applied, while the selected routed expert handles more specialized processing [4].
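The routing scheme described above can be sketched in a few lines. The following toy NumPy example implements a single Maverick-style MoE layer (an always-on shared expert plus top-1 routing over the routed experts); the dimensions, initialization, and gating details are illustrative assumptions, not Meta's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
D_MODEL, D_FF, N_EXPERTS = 8, 16, 4  # toy sizes; Maverick uses 128 routed experts

def make_expert():
    # each expert is a small two-layer MLP
    return (rng.standard_normal((D_MODEL, D_FF)) * 0.1,
            rng.standard_normal((D_FF, D_MODEL)) * 0.1)

def run_expert(x, weights):
    w1, w2 = weights
    return np.maximum(x @ w1, 0.0) @ w2  # ReLU MLP

shared_expert = make_expert()
routed_experts = [make_expert() for _ in range(N_EXPERTS)]
router_w = rng.standard_normal((D_MODEL, N_EXPERTS)) * 0.1

def moe_layer(tokens):
    """Shared expert always runs; the router picks exactly one routed expert per token."""
    logits = tokens @ router_w
    choice = logits.argmax(axis=-1)  # top-1 routing
    gate = np.exp(logits) / np.exp(logits).sum(-1, keepdims=True)  # softmax gate
    out = run_expert(tokens, shared_expert)  # shared-expert path, applied to every token
    for i in range(len(tokens)):
        e = choice[i]
        out[i] += gate[i, e] * run_expert(tokens[i], routed_experts[e])
    return out

x = rng.standard_normal((5, D_MODEL))  # 5 toy tokens
y = moe_layer(x)
print(y.shape)  # (5, 8)
```

Only one routed expert's weights are touched per token, which is why the per-token compute stays near the 17B-active level even though the total parameter pool is far larger.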
This architectural choice gives Maverick a sparsity ratio of 64 (128 experts with top-1 routing in half the layers), which is unusually high compared to other MoE models like Mixtral (sparsity of 4) or DeepSeek-V3 (sparsity of 32) [6].
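The sparsity figures cited above follow from a simple ratio: routed experts divided by experts activated per token, scaled by the fraction of layers that are MoE. A quick check (the expert counts are the publicly reported ones; the helper function itself is just for illustration):

```python
def sparsity(n_experts, top_k, moe_layer_fraction=1.0):
    # effective ratio of routed parameters held to routed parameters used per token
    return (n_experts / top_k) * moe_layer_fraction

mixtral  = sparsity(8,   top_k=2)                          # 8 experts, top-2 routing
deepseek = sparsity(256, top_k=8)                          # 256 routed experts, 8 active
maverick = sparsity(128, top_k=1, moe_layer_fraction=0.5)  # top-1, MoE in half the layers
print(mixtral, deepseek, maverick)  # 4.0 32.0 64.0
```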
One of Llama 4's most significant architectural innovations is its native multimodal design through early fusion. Rather than processing text and images through separate encoders and combining them at a late stage (as in models like LLaVA or previous multimodal systems), Llama 4 integrates visual information at the earliest stage of processing.
The architecture uses an enhanced MetaCLIP-based vision encoder to convert images into visual tokens. These visual tokens are then immediately concatenated with text tokens into a single unified sequence before any deep transformer processing begins. This early fusion approach means that text and image representations can interact through self-attention from the very first layer, allowing the model to develop richer cross-modal understanding than late-fusion approaches [1].
The vision encoder processes images at high resolution and produces a variable number of visual tokens depending on the image's content and resolution. This native multimodal capability means Llama 4 can handle tasks like visual question answering, image captioning, document understanding, and chart interpretation without requiring any additional adapter modules or post-hoc integration.
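In code, early fusion amounts to concatenating the two token streams into one sequence before the transformer stack. A minimal NumPy sketch, with hypothetical stand-ins for the tokenizer and vision encoder (a real system would use a MetaCLIP-style encoder projecting into the same embedding space):

```python
import numpy as np

rng = np.random.default_rng(1)
D_MODEL = 16  # toy embedding width

def embed_text(token_ids):
    # hypothetical stand-in for a real embedding table over the vocabulary
    table = rng.standard_normal((1000, D_MODEL))
    return table[token_ids]

def encode_image(image, n_patches=4):
    # hypothetical stand-in for a vision encoder: one embedding per image patch
    return rng.standard_normal((n_patches, D_MODEL))

text = embed_text(np.array([5, 42, 7]))   # 3 text tokens
vision = encode_image(None, n_patches=4)  # 4 visual tokens

# Early fusion: one unified sequence enters the transformer stack, so
# self-attention mixes modalities starting at the very first layer.
fused = np.concatenate([vision, text], axis=0)
print(fused.shape)  # (7, 16)
```

The key contrast with late fusion is that nothing downstream distinguishes visual from text positions; both are ordinary tokens in one sequence.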
Llama 4 introduces the iRoPE (interleaved Rotary Position Embedding) architecture, a modification of the standard RoPE positional encoding used in previous Llama models. The iRoPE design alternates between two types of attention layers:
- RoPE layers, which apply rotary position embeddings and focus on local context;
- NoPE layers, which apply no explicit positional encoding and attend globally across the full sequence.
By interleaving these two layer types, the model can handle extremely long sequences more effectively. The RoPE layers provide strong local context modeling, while the NoPE layers allow the model to attend to distant tokens without the degradation that typically occurs when standard positional encodings are extrapolated far beyond training lengths. This design, combined with inference-time temperature scaling of the attention logits, enables Scout's 10-million-token context window [7].
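A toy NumPy sketch of the interleaving idea, assuming standard RoPE rotation and plain softmax attention; the chunked local-attention details and the exact temperature schedule are not public, so temperature appears here as a plain parameter:

```python
import numpy as np

def softmax(x):
    x = x - x.max(-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(-1, keepdims=True)

def rope(x, base=10000.0):
    """Rotate query/key feature pairs by position-dependent angles (standard RoPE)."""
    seq, dim = x.shape
    half = dim // 2
    freqs = base ** (-np.arange(half) / half)
    angles = np.outer(np.arange(seq), freqs)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:]
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)

def attention(q, k, v, use_rope, temperature=1.0):
    if use_rope:  # "local" layers: rotary positional encoding on q and k
        q, k = rope(q), rope(k)
    # NoPE layers skip positional encoding entirely; dividing the logits by a
    # length-dependent temperature at inference is the scaling trick described above.
    scores = (q @ k.T) / (np.sqrt(q.shape[-1]) * temperature)
    return softmax(scores) @ v

rng = np.random.default_rng(2)
q = k = v = rng.standard_normal((6, 8))
# interleave: even layers use RoPE, odd layers use NoPE
layers = [attention(q, k, v, use_rope=(i % 2 == 0)) for i in range(4)]
print(layers[0].shape)  # (6, 8)
```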
Llama 4 models were pretrained on over 30 trillion tokens, more than double the approximately 15 trillion tokens used for Llama 3. The training mixture includes diverse text data in over 200 languages, as well as image and video data for multimodal training. The base models were pretrained with a context length of 256,000 tokens [1].
The pretraining data composition was not disclosed in detail, but Meta indicated it included web text, code, scientific literature, books, and multilingual content. The image and video training data covered a wide range of visual domains to support the model's native multimodal capabilities.
A notable aspect of Llama 4's training is the use of knowledge distillation from the larger Behemoth model to the smaller Maverick and Scout models. Meta developed a novel co-distillation approach where the smaller models learn not just from the training data but also from Behemoth's output distributions. The loss function dynamically adjusts the weighting between the standard language modeling loss (from the training data) and the distillation loss (from Behemoth's predictions), allowing the student models to benefit from the teacher's broader knowledge without being overly constrained by it [1].
This distillation process is one reason why Maverick, despite having only 17 billion active parameters, can compete with much larger dense models on several benchmarks.
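The general shape of such a co-distillation objective can be sketched as follows. Meta has not published the exact loss, so the dynamic weighting is replaced here by a hand-set `alpha` knob, an assumption for illustration only:

```python
import numpy as np

def softmax(x):
    x = x - x.max(-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(-1, keepdims=True)

def cross_entropy(logits, target_ids):
    # standard hard-label language-modeling loss
    probs = softmax(logits)
    return -np.log(probs[np.arange(len(target_ids)), target_ids]).mean()

def soft_cross_entropy(student_logits, teacher_logits):
    # distillation term against the teacher's full output distribution
    t = softmax(teacher_logits)
    s = softmax(student_logits)
    return -(t * np.log(s)).sum(-1).mean()

def co_distill_loss(student_logits, teacher_logits, target_ids, alpha):
    """Weighted mix of hard LM loss and soft teacher loss; in Llama 4 the
    weighting is adjusted dynamically during training rather than fixed."""
    lm = cross_entropy(student_logits, target_ids)
    kd = soft_cross_entropy(student_logits, teacher_logits)
    return (1 - alpha) * lm + alpha * kd

rng = np.random.default_rng(3)
student = rng.standard_normal((4, 10))  # 4 positions, vocabulary of 10
teacher = rng.standard_normal((4, 10))  # stand-in for Behemoth's logits
targets = np.array([1, 3, 5, 7])
loss = co_distill_loss(student, teacher, targets, alpha=0.5)
print(loss > 0)  # True
```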
The Instruct variants of Scout and Maverick underwent extensive post-training, including supervised fine-tuning (SFT) and reinforcement learning from human feedback (RLHF). The post-training process extended the context length from the 256K base pretraining context to 10M tokens for Scout and 1M tokens for Maverick. Meta also applied safety training and red-teaming to address harmful outputs [1].
Meta reported benchmark results for Scout and Maverick across a range of evaluations, comparing them primarily against other models in similar compute classes.
| Benchmark | Llama 4 Scout | Llama 3.1 8B | Gemma 3 12B | Gemini 2.0 Flash Lite |
|---|---|---|---|---|
| MMLU (0-shot, CoT) | 79.6 | 73.0 | 78.5 | 76.1 |
| GPQA Diamond | 57.2 | 32.8 | 42.4 | 50.4 |
| LiveCodeBench (10/01/2024–02/01/2025) | 32.8 | 13.0 | 24.4 | 27.2 |
| MMMU (0-shot, CoT) | 69.4 | N/A (text only) | 64.8 | 58.4 |
| MathVista | 70.7 | N/A (text only) | 68.0 | 61.5 |
Scout, with its 17B active parameters, consistently outperformed models in the sub-20B class across both text and multimodal benchmarks. Its GPQA Diamond score of 57.2 represented a particularly strong result for a model of its size, and its multimodal scores on MMMU (69.4) and MathVista (70.7) were competitive with much larger models [1].
| Benchmark | Llama 4 Maverick | GPT-4o | Gemini 2.0 Flash | Claude 3.7 Sonnet | Llama 3.1 405B |
|---|---|---|---|---|---|
| MMLU (0-shot, CoT) | 85.5 | 85.7 | 84.4 | 84.2 | 85.2 |
| GPQA Diamond | 69.8 | 53.6 | 64.6 | 68.0 | 49.0 |
| LiveCodeBench (10/01/2024–02/01/2025) | 43.4 | 32.3 | 34.5 | 38.5 | 27.7 |
| MMMU (0-shot, CoT) | 73.4 | 69.1 | 71.7 | 68.6 | N/A |
| MathVista | 73.7 | 63.8 | 73.3 | 70.4 | N/A |
| MATH-500 | 88.1 | 74.6 | 82.3 | 89.9 | 73.8 |
Maverick's results were strong across the board. On MMLU, it scored 85.5, essentially matching GPT-4o. Its GPQA Diamond score of 69.8 exceeded GPT-4o (53.6) by a wide margin. On multimodal benchmarks, Maverick scored 73.4 on MMMU and 73.7 on MathVista, outperforming GPT-4o on both. Maverick's LiveCodeBench score of 43.4 surpassed all listed competitors, including Claude 3.7 Sonnet (38.5) and GPT-4o (32.3) [1][8].
On the MATH-500 benchmark, Maverick scored 88.1, slightly below Claude 3.7 Sonnet's 89.9 but above GPT-4o (74.6) and Gemini 2.0 Flash (82.3).
Scout's 10-million-token context was evaluated on the Needle-in-a-Haystack test, where the model must retrieve a specific piece of information embedded at a random position within a very long document. Scout achieved perfect retrieval across the full 10M context [1].
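A minimal harness for a NIAH-style test can be sketched as below. The filler text, needle phrasing, and depth grid are arbitrary choices, and the stand-in model simply searches the prompt; a real evaluation would call an actual LLM endpoint at each depth.

```python
def build_haystack(filler, needle, depth_fraction, total_sentences=1000):
    """Insert the needle sentence at a given relative depth in filler text."""
    sentences = [filler] * total_sentences
    pos = int(depth_fraction * total_sentences)
    sentences.insert(pos, needle)
    return " ".join(sentences)

def run_niah(model_fn, secret="7421", depths=(0.0, 0.25, 0.5, 0.75, 1.0)):
    needle = f"The magic number is {secret}."
    hits = 0
    for d in depths:
        prompt = build_haystack("The sky is blue.", needle, d)
        prompt += "\nWhat is the magic number?"
        if secret in model_fn(prompt):
            hits += 1
    return hits / len(depths)  # retrieval accuracy across insertion depths

# Stand-in "model" that just searches the prompt string.
def fake_model(prompt):
    return "7421" if "7421" in prompt else "unknown"

print(run_niah(fake_model))  # 1.0
```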
On the MTOB (Machine Translation of Books) benchmark, which requires processing entire books for translation, both Scout and Maverick maintained coherence and accuracy across full-length books, while competitor models with 128K context windows could not process the complete texts [8].
| Feature | Llama 3 / 3.1 | Llama 4 |
|---|---|---|
| Architecture | Dense transformer | Mixture of Experts |
| Largest released model | 405B (dense) | Maverick (~400B total, 17B active) |
| Modalities | Text only (3.0/3.1); text+vision (3.2) | Native text + image (early fusion) |
| Training tokens | ~15 trillion | 30+ trillion |
| Max context (Instruct) | 128K tokens | 10M tokens (Scout) |
| MoE | No | Yes |
| Knowledge distillation | No | Yes (from Behemoth) |
| iRoPE | No | Yes |
| MMLU (best) | 85.2 (405B) | 85.5 (Maverick, 17B active) |
| Inference efficiency | Proportional to model size | 17B active params (Scout and Maverick) |
The most striking difference is efficiency. Llama 3.1 405B required activating all 405 billion parameters per token, while Llama 4 Maverick achieves comparable MMLU performance with only 17 billion active parameters, roughly 24 times fewer. This translates directly to lower inference costs and latency. Meta stated that Maverick achieves better results than GPT-4o at approximately one-ninth the cost per token [8].
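The "roughly 24 times" figure is simple arithmetic on active parameter counts, under the common approximation that per-token inference compute scales with active parameters:

```python
# Back-of-envelope: per-token compute scales with active parameters.
dense_active = 405e9  # Llama 3.1 405B: all parameters active per token
moe_active = 17e9     # Llama 4 Maverick: 17B active per token
ratio = dense_active / moe_active
print(round(ratio, 1))  # 23.8
```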
The Llama 4 launch was overshadowed by a controversy involving benchmark manipulation on the LMArena (Chatbot Arena) leaderboard, which escalated over the days and weeks following the April 5 release.
Shortly after launch, users on the LMArena platform noticed that "Llama-4-Maverick-03-26-Experimental" had appeared near the top of the Chatbot Arena leaderboard, ranking second behind only Google's Gemini 2.5 Pro. However, the publicly released version of Maverick did not perform nearly as well in users' own testing. The experimental version produced notably different outputs: verbose responses frequently peppered with emojis, a style seemingly optimized to win user preference votes in the Arena's head-to-head comparison format, rather than to be genuinely useful [2].
AI researchers and developers quickly pointed out that the model submitted to Chatbot Arena was not the same model available for download. An unverified post by someone claiming to be a former Meta employee alleged that Meta leadership had mixed benchmark test sets into the post-training process to inflate scores and meet internal targets. Although Meta denied this specific allegation, the discrepancy between the Arena submission and the public release eroded trust [9].
Independent testing found that the publicly available Maverick underperformed expectations on coding tasks and general-purpose assistance, particularly compared to its impressive paper benchmarks. The gap between reported and observed performance fueled skepticism about the benchmark numbers across the board.
Meta's VP of Generative AI, Ahmad Al-Dahle, initially denied all allegations, stating: "We've also heard claims that we trained on test sets; that's simply not true and we would never do that" [10]. Meta also published a blog post defending the release and attributing some quality issues to bugs in the model deployment.
However, in January 2026, Yann LeCun, Meta's departing chief AI scientist, confirmed the manipulation in an interview with the Financial Times. LeCun stated that "results were fudged a little bit" and that the team "used different models for different benchmarks to give better results." Rather than submitting a single consistent model for all evaluations (the standard practice), the Llama 4 team selected whichever variant of Scout or Maverick performed best on each individual benchmark [3].
The controversy had significant consequences within Meta. According to LeCun, CEO Mark Zuckerberg was "really upset and basically lost confidence in everyone who was involved" in the Llama 4 release. Zuckerberg subsequently "sidelined the entire GenAI organisation," leading to a restructuring of Meta's AI leadership. LeCun himself departed Meta after more than a decade to start a new venture called Advanced Machine Intelligence Labs, and in interviews following his departure, he criticized Meta's new AI leadership as "young and inexperienced" [3][11].
The controversy also prompted LMArena to update its leaderboard policies to reinforce commitments to fair, reproducible evaluations. The incident highlighted a broader problem in the AI industry: the lack of standardized, independently verified benchmarking procedures, which allows model developers to selectively report favorable results [12].
Llama 4 was released under the Llama 4 Community License Agreement, effective April 5, 2025. The license structure is similar to previous Llama licenses but with some notable provisions:
- Commercial and research use of the models and their outputs is permitted, subject to Meta's Acceptable Use Policy.
- Organizations whose products or services exceeded 700 million monthly active users as of the release date must request a separate license from Meta, which Meta may grant or withhold at its discretion.
- Distributors must display a "Built with Llama" attribution, and derivative model names must begin with "Llama".
The Open Source Initiative (OSI) has consistently maintained that the Llama Community License does not qualify as "open source" under its definition, citing the use restrictions, the MAU threshold, and the lack of training data release. The OSI also noted that newer versions of the license exclude certain users in the European Union from using the model under specific conditions, which violates the principle of non-discrimination [14]. Meta and supporters of the Llama license argue that "open weights" is the appropriate term and that the license provides substantially more freedom than fully closed alternatives.
Llama 4 entered a fiercely competitive landscape in 2025:
| Model | Developer | Release | Architecture | Key Strengths |
|---|---|---|---|---|
| Llama 4 Maverick | Meta | April 2025 | MoE (400B total, 17B active) | Multimodal, efficient, open weights |
| GPT-4o | OpenAI | May 2024 | Dense (proprietary) | Strong all-around, voice/vision |
| GPT-4.5 | OpenAI | February 2025 | Dense (proprietary) | Improved reasoning, reduced hallucination |
| Claude 3.7 Sonnet | Anthropic | February 2025 | Dense (proprietary) | Extended thinking, strong coding |
| Gemini 2.0 Flash/Pro | Google | December 2024 | MoE (proprietary) | Speed, multimodal, long context |
| Gemini 2.5 Pro | Google | March 2025 | MoE (proprietary) | Reasoning, 1M context |
| DeepSeek-V3 | DeepSeek | December 2024 | MoE (671B total, 37B active) | Cost-efficient training, open weights |
Maverick's benchmark results placed it competitively with these models on paper, particularly in multimodal tasks and coding. However, the benchmark controversy made direct comparisons difficult to trust. In practice, independent evaluations through 2025 found Maverick to be a capable but not clearly frontier model, with particular strengths in multimodal understanding and long-context tasks, and relative weaknesses in complex multi-step reasoning compared to Claude 3.7 Sonnet and Gemini 2.5 Pro [15].
The efficiency argument remained Maverick's strongest selling point: with only 17 billion active parameters, it could be served at much lower cost than dense models of comparable quality, making it attractive for high-volume commercial deployments where cost per token is a primary concern.
As of early 2026, the Llama 4 family occupies an uncertain position in the AI landscape. Behemoth remains unreleased, Meta's generative AI organization has been restructured in the wake of the launch controversy, and independent evaluations have settled on a view of Maverick as capable but short of the frontier.
The Llama 4 release will likely be remembered as a turning point for Meta's AI strategy: a technically ambitious model family whose impact was diminished by the benchmark manipulation controversy, leading to lasting changes in how AI companies report and verify model performance.