Llama 4 Behemoth
Last reviewed
May 16, 2026
Sources
13 citations
Review status
Source-backed
Revision
v1 · 3,950 words
Llama 4 Behemoth is the announced but unreleased flagship model in the Llama 4 family from Meta AI. It was unveiled on April 5, 2025 alongside its smaller siblings Llama 4 Scout and Maverick, and was positioned as a roughly 2-trillion-parameter Mixture of Experts (MoE) model with 288 billion active parameters routed across 16 experts. Meta described Behemoth not as a product for end users but as a "teacher" model whose primary job was to generate high-quality training signal for the smaller Llama 4 variants through knowledge distillation [1].
At the time of the April 2025 announcement, Behemoth was still in training. Meta published a small set of preview benchmark numbers showing it ahead of GPT-4.5, Claude Sonnet 3.7, and Gemini 2.0 Pro on several STEM evaluations, and said the model would be released after training was finalized. Through 2025 and into 2026, the launch was postponed repeatedly. By May 2026, more than a year after the initial announcement, Behemoth had still not been released to the public, and there was no firm date for when, or whether, it ever would be [2][3][4]. The model has effectively become a case study in the limits of pure scale, the costs of MoE routing at very large active parameter counts, and the internal turmoil that followed the broader Llama 4 launch controversy.
Meta had built its open-weight strategy around increasingly large dense transformer models. Llama 2, released in 2023, topped out at 70B parameters. Llama 3 extended the line in 2024 with the 405B-parameter Llama 3.1, then added vision capability through bolted-on adapters in Llama 3.2 and a distilled 70B refresh in Llama 3.3. Across all of these releases the architecture stayed dense, the modality stayed text-first, and the cost of inference scaled directly with parameter count. By late 2024 that approach was looking increasingly out of step with the rest of the industry. Mistral's Mixtral models had popularized sparse Mixture of Experts at the open-weight scale, and DeepSeek had pushed MoE all the way to the frontier with DeepSeek V3, a 671B-total / 37B-active model released in December 2024 under a permissive license.
Internally at Meta, the response was to commit to a full MoE redesign for the next generation. The official launch blog framed this as the start of "a new era of natively multimodal AI innovation," and the design choices reflected that framing: native multimodal input via early fusion, MoE in every model in the family, and an aggressive bet on extending context length far beyond what other open-weight labs were shipping [1].
Llama 4 was announced publicly on Saturday, April 5, 2025. The announcement covered three models packaged as a "herd": Llama 4 Scout, Llama 4 Maverick, and Llama 4 Behemoth. Scout and Maverick were released the same day, with weights on llama.com and Hugging Face. Behemoth was described as still in training but already useful in the role of a teacher for the other two models. The blog post stated that "we codistilled the Llama 4 Maverick model from Llama 4 Behemoth as a teacher model, resulting in substantial quality improvements across end task evaluation metrics" [1].
The choice to announce an unfinished model, and to publish benchmark results from an in-progress checkpoint, attracted some skepticism even before the launch's other controversies emerged. Meta clearly wanted the world to know that Behemoth existed and to see it as competitive with frontier proprietary systems, but in retrospect that decision raised expectations the company would later struggle to meet.
Almost everything publicly known about Behemoth's architecture comes from a small number of paragraphs in Meta's launch blog and a single illustrative comparison table. The model was never released, and Meta has not published a model card, a technical report, or detailed evaluation code. The disclosed numbers are summarized below.
| Specification | Value |
|---|---|
| Total parameters | ~2 trillion |
| Active parameters per token | 288 billion |
| Experts | 16 |
| Architecture | Mixture of Experts, transformer |
| Modalities | Multimodal (text + image, native) |
| Context window | Not publicly disclosed |
| Training tokens (family-wide mixture) | 30+ trillion |
| Status at announcement | Still in training |
| Public release | None as of May 2026 |
Sources: [1] (Meta launch blog), [5] (model card aggregators reflecting disclosed specs).
The headline architectural choice for Behemoth is the combination of an extremely large total parameter count (around 2T) with a much smaller active parameter count (288B) routed across only 16 experts. That gives a ratio of total to active parameters of roughly 7, far less sparse than Maverick, whose 400B total and 17B active parameters give a ratio near 24 and which uses 128 routed experts plus a shared expert with top-1 routing. With fewer, larger experts, Behemoth's design pushes more capacity into each expert and gives the router fewer experts to choose among in each MoE layer. The intuition Meta offered in the blog was that this configuration was suited to the role of a teacher: deep specialization within each expert, lots of parameters to memorize hard cases, and enough active capacity per token to generate high-quality reasoning traces for distillation [1].
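The ratio quoted for Behemoth is simple arithmetic on the disclosed figures. A minimal sketch in Python follows; the function name `sparsity_ratio` is illustrative, and the 2-trillion total is the approximate figure Meta gave, so the result is only as precise as that input.

```python
def sparsity_ratio(total_params: float, active_params: float) -> float:
    """Return total parameters divided by parameters active per token."""
    if active_params <= 0:
        raise ValueError("active_params must be positive")
    return total_params / active_params


# Figures from the specification table; "~2 trillion" is itself approximate.
BEHEMOTH_TOTAL = 2e12
BEHEMOTH_ACTIVE = 288e9

behemoth_ratio = sparsity_ratio(BEHEMOTH_TOTAL, BEHEMOTH_ACTIVE)
print(f"Behemoth total/active ratio: {behemoth_ratio:.1f}")
```

The same function can be applied to any MoE model whose total and active parameter counts are published.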
The trade-off is that 288 billion active parameters is itself an enormous inference footprint. Even with MoE routing, Behemoth requires roughly the compute of a 288B dense model per token, which is well beyond what most labs can comfortably serve. That decision was deliberate. Behemoth was never intended to be served to consumers; Meta described it as an internal teacher model, with Scout (17B active) and Maverick (17B active) as the user-facing products distilled from it.
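A back-of-envelope sketch makes the footprint concrete. It relies on two common rules of thumb that are assumptions here, not figures from Meta: a forward pass costs about 2 FLOPs per active parameter per token (ignoring attention and KV-cache costs), and weight memory is total parameters times bytes per parameter. The serving precisions shown are hypothetical, since Meta disclosed no serving setup.

```python
def forward_flops_per_token(active_params: float) -> float:
    """Rule-of-thumb forward-pass cost: about 2 FLOPs per active parameter."""
    return 2.0 * active_params


def weight_memory_terabytes(total_params: float, bytes_per_param: float) -> float:
    """Memory needed just to hold the weights, in terabytes (1 TB = 1e12 bytes)."""
    return total_params * bytes_per_param / 1e12


behemoth_flops = forward_flops_per_token(288e9)  # Behemoth, 288B active
student_flops = forward_flops_per_token(17e9)    # Scout or Maverick, 17B active
compute_multiple = behemoth_flops / student_flops

# Every expert must be resident in memory even though only some run per token.
weights_fp8_tb = weight_memory_terabytes(2e12, 1)   # assumed 1 byte per parameter
weights_bf16_tb = weight_memory_terabytes(2e12, 2)  # assumed 2 bytes per parameter

print(f"Per-token compute vs a 17B-active model: {compute_multiple:.1f}x")
print(f"Weights alone: {weights_fp8_tb:.0f} TB (FP8), {weights_bf16_tb:.0f} TB (BF16)")
```

Under these assumptions the per-token compute tracks the active count, while the memory requirement tracks the full 2-trillion total, which is why MoE sparsity alone does not make a model of this size cheap to host.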
Llama 4 as a family is natively multimodal through what Meta calls "early fusion." Visual tokens produced by an upgraded MetaCLIP-based vision encoder are concatenated directly into the same token sequence as text and processed through the same transformer layers from the very first block. Meta retrained the vision encoder "in conjunction with a frozen Llama model" so that the visual tokens are aligned to the language model's representational space before deep training begins [1]. Behemoth participated in this design, so its 2-trillion-parameter capacity covered text and image jointly rather than text alone. The launch blog did not specify whether Behemoth additionally trained on video, although the broader pretraining mixture for the Llama 4 family included video data.
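The core idea of early fusion, one shared token sequence for both modalities, can be illustrated with a toy sketch. This is not Meta's implementation: the function name, the plain-list embeddings, and the choice to place image tokens before text tokens are all assumptions made for illustration.

```python
from typing import List, Tuple

Vector = List[float]


def early_fusion_sequence(
    image_tokens: List[Vector], text_tokens: List[Vector]
) -> Tuple[List[Vector], List[str]]:
    """Concatenate image and text embeddings into one sequence.

    Returns the fused sequence plus a parallel list of modality labels.
    """
    widths = {len(v) for v in image_tokens + text_tokens}
    if len(widths) > 1:
        raise ValueError("all tokens must share the model's embedding width")
    fused = image_tokens + text_tokens
    modalities = ["image"] * len(image_tokens) + ["text"] * len(text_tokens)
    return fused, modalities


# Toy inputs: two image-patch embeddings and three text embeddings, width 4.
image_patches = [[0.1, 0.2, 0.3, 0.4], [0.5, 0.6, 0.7, 0.8]]
text_embeddings = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0]]

fused, modalities = early_fusion_sequence(image_patches, text_embeddings)
print(len(fused), modalities)
```

The point of the sketch is that, after fusion, the transformer sees a single sequence; the modality labels exist only for bookkeeping and are not a separate processing path.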
Meta said the overall Llama 4 pretraining mixture exceeded 30 trillion tokens, more than double the roughly 15 trillion tokens used for Llama 3, and that this mixture included diverse text, image, and video datasets. The same blog cited improvements in MoE training efficiency, with pretraining performed in FP8 precision and sustaining about 390 TFLOPs per GPU on H100 hardware. Some Hugging Face documentation for the released models cited a figure closer to 40 trillion tokens including multimodal data [1][5]. Behemoth was presumably trained on some version of this mixture, but Meta did not disclose the model's own training corpus or token budget.
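For a sense of scale, the throughput figure can be combined with the widely used "6 × parameters × tokens" estimate of training compute. The sketch below is illustrative only: Behemoth's own token budget was not disclosed, so the family-wide 30 trillion is used as a stand-in, and the GPU count is a hypothetical input rather than a figure from this article. It also assumes an uninterrupted run at the sustained rate, which real training jobs do not achieve.

```python
def training_flops(active_params: float, tokens: float) -> float:
    """Approximate training compute with the common 6 * N * D rule of thumb."""
    return 6.0 * active_params * tokens


def training_days(total_flops: float, gpu_count: int, flops_per_gpu: float) -> float:
    """Wall-clock days at a constant sustained throughput, with no downtime."""
    cluster_flops_per_second = gpu_count * flops_per_gpu
    return total_flops / cluster_flops_per_second / 86_400


ACTIVE_PARAMS = 288e9             # disclosed active parameters per token
ASSUMED_TOKENS = 30e12            # assumption: family-wide figure, not Behemoth's own
SUSTAINED_FLOPS_PER_GPU = 390e12  # about 390 TFLOPs per GPU, as cited
ASSUMED_GPU_COUNT = 32_000        # hypothetical cluster size for illustration

flops = training_flops(ACTIVE_PARAMS, ASSUMED_TOKENS)
days = training_days(flops, ASSUMED_GPU_COUNT, SUSTAINED_FLOPS_PER_GPU)
print(f"Estimated compute: {flops:.2e} FLOPs, about {days:.0f} days on the assumed cluster")
```

Changing any of the assumed inputs moves the estimate proportionally, so the output is best read as an order of magnitude rather than a reconstruction of Meta's actual run.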
Llama 4 Behemoth's most concrete function in the published record is as a teacher model for knowledge distillation. In this setup, a large, expensive model generates output distributions, intermediate logits, or rollouts that are used as additional training signal for smaller "student" models. Done well, the student can pick up reasoning patterns, factual knowledge, and stylistic behavior from the teacher that would be hard to learn directly from raw text. The technique has become standard practice across major labs for compressing capability into models that can be served at lower cost.
Meta's claim is that Maverick was "co-distilled" from Behemoth using a custom loss function that dynamically weights the standard next-token prediction loss against the distillation loss from Behemoth's predictions. The blog described the approach as letting the student benefit from the teacher's broader knowledge "without being overly constrained by it" [1]. The same teacher signal was also used during Scout training. According to Meta's framing, this is what allows Maverick, with only 17 billion active parameters, to compete with proprietary models such as GPT-4o on certain benchmarks.
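The general shape of such a weighted loss can be sketched in a few lines. Meta has not published its loss function or weighting schedule, so everything specific here is an assumption for illustration: the KL-divergence form of the soft-target term, the temperature, and the linear `distill_weight` schedule are hypothetical stand-ins, not Meta's method.

```python
import math
from typing import List


def softmax(logits: List[float], temperature: float = 1.0) -> List[float]:
    """Numerically stable softmax over temperature-scaled logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def hard_target_loss(student_logits: List[float], target_index: int) -> float:
    """Standard next-token cross-entropy against the ground-truth token."""
    return -math.log(softmax(student_logits)[target_index])


def soft_target_loss(
    student_logits: List[float], teacher_logits: List[float], temperature: float = 2.0
) -> float:
    """KL(teacher || student) on temperature-softened distributions."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def distill_weight(step: int, total_steps: int, start: float = 0.9, end: float = 0.3) -> float:
    """Hypothetical schedule: lean on the teacher early, on hard targets later."""
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    progress = min(max(step / total_steps, 0.0), 1.0)
    return start + (end - start) * progress


def codistillation_loss(
    student_logits: List[float],
    teacher_logits: List[float],
    target_index: int,
    step: int,
    total_steps: int,
) -> float:
    """Blend the hard-target and soft-target losses with a step-dependent weight."""
    w = distill_weight(step, total_steps)
    hard = hard_target_loss(student_logits, target_index)
    soft = soft_target_loss(student_logits, teacher_logits)
    return (1.0 - w) * hard + w * soft


# Toy three-token vocabulary; the logits are made up for the example.
student = [2.0, 0.5, -1.0]
teacher = [3.0, 0.2, -2.0]
early_loss = codistillation_loss(student, teacher, target_index=0, step=0, total_steps=1000)
late_loss = codistillation_loss(student, teacher, target_index=0, step=1000, total_steps=1000)
print(f"early: {early_loss:.4f}, late: {late_loss:.4f}")
```

A schedule like this is only one way to make the weighting "dynamic"; the weight could equally depend on teacher confidence or per-example difficulty, and the source does not say which approach Meta used.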
A closely related detail is Meta's reported use of aggressive data filtering during post-training, with the filter informed by the teacher model. The launch blog said Meta removed "more than 50 percent of the SFT data tagged as easy" for Maverick and "more than 95 percent" for Behemoth itself, on the theory that easy supervised fine-tuning examples drag the model toward shallow pattern-matching and dilute the harder signal coming from online RL. With Behemoth filtered to such an extreme degree, the model was effectively trained on a curriculum heavily weighted toward harder reasoning examples, which Meta argued was necessary for the teacher role [1].
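The pruning step can be pictured as dropping a fixed fraction of examples tagged easy while keeping everything else. The sketch below assumes each example carries a hypothetical `difficulty` field and that the dropped examples are chosen at random; Meta did not describe how difficulty was recorded or how the removed examples were selected.

```python
import random
from typing import Dict, List


def prune_easy_examples(
    examples: List[Dict[str, str]], drop_fraction: float, seed: int = 0
) -> List[Dict[str, str]]:
    """Drop a fraction of examples tagged "easy", preserving original order."""
    if not 0.0 <= drop_fraction <= 1.0:
        raise ValueError("drop_fraction must be between 0 and 1")
    easy_indices = [i for i, ex in enumerate(examples) if ex["difficulty"] == "easy"]
    n_drop = round(len(easy_indices) * drop_fraction)
    dropped = set(random.Random(seed).sample(easy_indices, n_drop))
    return [ex for i, ex in enumerate(examples) if i not in dropped]


# Toy SFT set: 100 easy prompts and 20 hard ones.
sft_data = [{"prompt": f"easy-{i}", "difficulty": "easy"} for i in range(100)]
sft_data += [{"prompt": f"hard-{i}", "difficulty": "hard"} for i in range(20)]

pruned = prune_easy_examples(sft_data, drop_fraction=0.95)
easy_left = sum(1 for ex in pruned if ex["difficulty"] == "easy")
hard_left = sum(1 for ex in pruned if ex["difficulty"] == "hard")
print(f"easy kept: {easy_left}, hard kept: {hard_left}")
```

In this toy set the hard examples go from a small minority to a clear majority of what remains, which is the curriculum shift the paragraph above describes.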
In practice, almost nothing about the distillation pipeline can be independently verified, because Behemoth has not been released. Researchers cannot probe the teacher directly, replicate the co-distillation, or audit which behaviors in Scout and Maverick came from the base mixture versus from Behemoth's output distributions. This has become one of the more uncomfortable aspects of the Llama 4 release: a model that was central to the published story is also one that the community cannot inspect. That matters all the more because the same blog post that introduced Behemoth was later challenged on accuracy grounds during the LMArena controversy, raising the bar for what the community is willing to accept on Meta's word alone.
Meta published a small set of preview benchmark results for Behemoth in the April 2025 launch post, drawn from an in-progress training checkpoint rather than a finished release. The reported numbers were:
| Benchmark | Llama 4 Behemoth (preview) | GPT-4.5 | Claude Sonnet 3.7 | Gemini 2.0 Pro |
|---|---|---|---|---|
| MATH-500 | 95.0 | 92.4 | 92.0 | 91.8 |
| GPQA Diamond | 73.7 | 71.4 | 68.0 | 64.7 |
| MMLU Pro | 82.2 | not reported | not reported | not reported |
Meta's framing was that Behemoth "outperforms GPT-4.5, Claude Sonnet 3.7, and Gemini 2.0 Pro on several STEM benchmarks," with MATH-500 and GPQA Diamond cited explicitly [1]. These results were never independently reproduced because Behemoth was never released. After the LMArena controversy in mid-April 2025, and after Yann LeCun's later admission that Meta had selected different model variants for different benchmarks across Llama 4, this preview Behemoth scoreline has become one of the more disputed sets of results in the Llama 4 record [6][7].
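A few caveats apply when reading these numbers. They were self-reported from an in-progress checkpoint, they were never independently reproduced, and they were measured against a comparison set that Meta chose. Even on Meta's own selected benchmarks the win margins were tight, and the comparison set aged out quickly. Any claim that Behemoth was the strongest model in the world in early 2025 has to be read against that backdrop.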
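How tight the margins were can be read directly off the preview table. The short sketch below uses the scores exactly as listed in that table and computes Behemoth's lead over the best-scoring rival on each benchmark where rival scores were reported; the helper name is illustrative.

```python
from typing import Dict

# Scores as listed in the preview benchmark table; MMLU Pro is omitted
# because no rival scores were reported for it.
PREVIEW_SCORES: Dict[str, Dict[str, float]] = {
    "MATH-500": {
        "Llama 4 Behemoth": 95.0,
        "GPT-4.5": 92.4,
        "Claude Sonnet 3.7": 92.0,
        "Gemini 2.0 Pro": 91.8,
    },
    "GPQA Diamond": {
        "Llama 4 Behemoth": 73.7,
        "GPT-4.5": 71.4,
        "Claude Sonnet 3.7": 68.0,
        "Gemini 2.0 Pro": 64.7,
    },
}


def margin_over_best_rival(scores: Dict[str, float], model: str = "Llama 4 Behemoth") -> float:
    """Return the model's lead, in points, over the highest-scoring other model."""
    rivals = [value for name, value in scores.items() if name != model]
    return round(scores[model] - max(rivals), 1)


margins = {bench: margin_over_best_rival(scores) for bench, scores in PREVIEW_SCORES.items()}
for bench, margin in margins.items():
    print(f"{bench}: +{margin} points over the best listed rival")
```

Leads of this size are small enough that differences in prompting, sampling settings, or checkpoint choice could plausibly account for them, which is part of why the lack of independent reproduction matters.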
As of May 2026, Llama 4 Behemoth has not been publicly released. There are no weights on Hugging Face, no API endpoint on any cloud platform, no commercial hosting, and no model card published by Meta beyond the original launch blog. The model exists only as an internal artifact at Meta, if it exists in any usable form at all.
The timeline of delays is well documented:
| Date | Event |
|---|---|
| April 5, 2025 | Llama 4 announced; Behemoth described as still in training, release expected after training |
| April 2025 | Original target for release at Meta's LlamaCon developer event |
| Late April 2025 | Release reportedly slipped to early summer, then to June |
| May 15, 2025 | Wall Street Journal reports the release is being postponed to fall 2025 or later, citing concerns that Behemoth's improvements over Maverick are incremental and may not justify a public launch [2][3][8] |
| Mid-2025 | Meta announces the creation of Meta Superintelligence Labs and a new "TBD Lab" under Alexandr Wang, formerly of Scale AI; the existing GenAI organization that produced Llama 4 is reportedly sidelined [6] |
| Late 2025 | Reports indicate training was finished but internal performance was "poor" enough that release was held back, and that work on Behemoth effectively stopped after the superintelligence reorganization [9] |
| November 2025 | Yann LeCun announces his departure from Meta after twelve years to start a new venture; in subsequent interviews he describes Mark Zuckerberg as having lost confidence in the team that produced Llama 4 [6][7] |
| January 2026 | Reuters requests an update from Meta on Behemoth's status and receives no response [4] |
| May 2026 | Behemoth remains unreleased; no public roadmap for release |
The key fact for any reader is straightforward: more than thirteen months after the announcement, Behemoth has not shipped. Meta has not formally canceled the model. There is no press release saying the project is dead. But there is also no credible signal that release is imminent, and the internal organization that built it has been restructured around a new leadership team focused on a separate "superintelligence" effort.
The reasons given for the postponements have come in several waves. The Wall Street Journal's May 2025 report attributed the delay to internal concerns that Behemoth's gains over Maverick were not large enough to justify a public release. According to the report, "the sentiment is split, some feel the improvements over earlier versions are incremental at best" [2][3]. Subsequent coverage added detail. Computerworld and SiliconANGLE noted that MoE routing was harder to stabilize at Behemoth's 288B active parameter scale than the team had hoped, with reports that the routing method was changed partway through training, disrupting expert specialization that had already formed [3][8]. Training costs reportedly exceeded $500 million by mid-2025, a figure that put pressure on whether the model justified continued investment [10].
The broader scaling story matters too. By mid-2025, OpenAI's GPT-4.5 had been received as an underwhelming step beyond GPT-4o, Google's Gemini 2.5 had taken a more reasoning-focused approach, and Anthropic was emphasizing extended thinking over raw parameter growth. The industry narrative shifted from "bigger is better" toward inference-time compute, reasoning models, and test-time search. In that context, a 2-trillion-parameter teacher model with a 288B-active footprint, whose marginal gain over a much smaller distilled student was uncertain, became a harder sell internally, regardless of whether its training was technically successful.
Axios summarized the broader concern in May 2025: "Meta's disappointments mirror a broader worry inside the AI industry that progress dependent on scaling up models may be plateauing" [10]. Behemoth ended up tangled in that worry rather than as a counterexample to it.
The story of Behemoth cannot be cleanly separated from the broader Llama 4 launch controversy. Within days of the April 5, 2025 release, AI researchers noticed that the version of Maverick submitted to the LMArena (Chatbot Arena) leaderboard was a custom "experimental" variant, not the model available for download. The released model ranked far lower than the leaderboard entry suggested. LMArena later said publicly that "Meta's interpretation of our policy did not match what we expect from model providers" and updated its evaluation rules in response [11].
In an interview after his departure from Meta, Yann LeCun confirmed the harder version of the story: that Meta's team had "fudged a little bit" and had "used different models for different benchmarks to give better results" [7]. The fallout reached the CEO directly. Per LeCun, "Mark was really upset and basically lost confidence in everyone who was involved in this. And so basically sidelined the entire GenAI organization" [7]. Behemoth was part of that organization's work, and the loss of executive confidence in the team is a major reason its release stalled.
In mid-2025, Mark Zuckerberg reorganized Meta's AI research under a new umbrella called Meta Superintelligence Labs, including a new "TBD Lab" focused on frontier models. The change was paired with a roughly $14.5 billion investment in Scale AI that brought Alexandr Wang, Scale AI's 28-year-old founder, in as Meta's chief AI officer running the new lab. Wang technically became LeCun's superior in the org chart, which contributed to LeCun's departure several months later [6]. Behemoth's team was reportedly folded into the broader restructuring, and the model effectively stopped being a public priority.
Yann LeCun, Meta's longtime chief AI scientist and a Turing Award winner, left Meta in November 2025 after twelve years to found his own startup, AMI (Advanced Machine Intelligence) Labs. In interviews after his departure, LeCun was sharply critical of Meta's new direction. He described Wang as "young" and "inexperienced," said the new leadership lacked "experience with research or how you practice research," and confirmed publicly that "results were fudged a little bit" during the Llama 4 launch [6][7]. He raised about $1 billion in seed funding for his new venture in early 2026 [12].
While LeCun was not directly leading Llama 4 training, his very public exit and parting commentary cemented the perception that the team and program that produced Behemoth no longer had executive backing inside Meta. That is the political and organizational context in which the model has remained shelved.
Meta's launch comparison set focused on three contemporaries. The table below summarizes how those models were positioned at the time of Behemoth's announcement and how they have evolved since.
| Model | Developer | Architecture | Release date | Public availability |
|---|---|---|---|---|
| Llama 4 Behemoth (preview) | Meta | MoE (~2T total, 288B active, 16 experts) | Announced April 2025, unreleased | Not released as of May 2026 |
| GPT-4.5 | OpenAI | Dense, undisclosed scale | February 2025 | Available via API and ChatGPT |
| Claude Sonnet 3.7 | Anthropic | Dense, undisclosed scale | February 2025 | Available via API and Claude apps |
| Gemini 2.0 Pro | Google DeepMind | MoE, undisclosed scale | February 2025 (experimental) | Available via Vertex AI and Gemini app |
By mid-2025, both Anthropic and Google had moved past the specific competitors Meta cited at announcement. Claude 4 Opus and Sonnet shipped in 2025 with extended thinking, and Gemini 2.5 Pro consolidated Google's lead on the LMArena leaderboard while raising the bar on reasoning benchmarks. DeepSeek's R1 reasoning model and subsequent open-weight releases pushed the open-weight frontier further down the cost curve. Behemoth's preview numbers from April 2025 looked competitive at the moment they were published; by the time it might have launched, the relevant comparison set had largely already moved on.
The open-weight comparison was the harder one for Meta. DeepSeek V3 and the subsequent reasoning-focused DeepSeek releases offered competitive performance with open weights, no benchmark controversies, and dramatically lower training costs than Meta's reported numbers. Even if Behemoth had shipped, it would have had to justify both its absolute capability and its enormous active parameter footprint against models that achieved most of the same scores with far less compute.
Reception of Behemoth has been distinctive in that the model is being judged largely on its non-arrival rather than on direct testing. A few threads run through the coverage.
The AI industry in 2024 and 2025 saw several large-scale models announced but quietly delayed or scaled back, including OpenAI's repeatedly delayed GPT-5 launch and reports of training-run difficulties at other major labs. Behemoth fit a recognizable pattern: an aggressive announcement aimed at preserving competitive position, followed by months of slipping deadlines and finally an awkward silence. Independent commentary on the Llama 4 release noted that publishing benchmarks from an in-progress run, while presenting the model as competitive with shipped products, set up an unforgiving comparison once external scrutiny came in [10][13].
The LMArena controversy and LeCun's later confirmation that the Llama 4 team "used different models for different benchmarks" meant that the preview numbers Meta published for Behemoth were received with significant doubt by mid-2025 [7]. The Behemoth scoreline against GPT-4.5, Claude Sonnet 3.7, and Gemini 2.0 Pro was never reproduced by an independent evaluator, and the community largely stopped quoting those numbers without caveats.
Analysts began treating Behemoth less as an upcoming release and more as a marker of where Meta wanted to be. Skywork's mid-2025 retrospective described the model as "a directional signal, not a generally released model" [13]. Industry analyses cited Behemoth alongside other large-but-unreleased systems as evidence that pure model scaling was running into harder limits than the optimistic 2024 framing had implied [10].
For users of open-weight models, the practical effect of Behemoth's absence has been minimal in the short term. Scout and Maverick are the models people actually run. The longer-term effect, however, has been to dim expectations that a major lab will keep shipping ever-larger open-weight models on the same schedule. With Behemoth shelved, Qwen and DeepSeek effectively own the upper end of the open-weight scale through 2025 and 2026, and Meta has yet to clearly signal where Llama 5 or a successor to Behemoth might land.