Llama 4 Behemoth
Last reviewed
May 18, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v2 · 5,084 words
Llama 4 Behemoth is the announced but never publicly released flagship model in the Llama 4 family from Meta AI. It was unveiled on April 5, 2025 alongside its smaller siblings Llama 4 Scout and Maverick, and was positioned as a roughly 2-trillion-parameter Mixture of Experts (MoE) model with 288 billion active parameters routed across 16 experts. Meta described Behemoth not as a product for end users but as a "teacher" model whose primary job was to generate high-quality training signal for the smaller Llama 4 variants through knowledge distillation[1].
At the time of the April 2025 announcement, Behemoth was still in training. Meta published a small set of preview benchmark numbers showing it ahead of GPT-4.5, Claude Sonnet 3.7, and Gemini 2.0 Pro on several STEM evaluations, and said the model would be released after training was finalized[1]. Through 2025 and into 2026, the launch was postponed repeatedly. By May 2026, more than a year after the initial announcement, Behemoth had still not been released to the public, and there was no firm date for when, or whether, it ever would be[2][3][4]. On April 8, 2026, Meta unveiled Muse Spark, the inaugural model from Meta Superintelligence Labs. The accompanying blog post benchmarked the new model against Llama 4 Maverick but did not mention Behemoth, and the older flagship was widely treated as effectively superseded from that point on[14][15][16]. The model has become a case study in the limits of pure scale, the costs of MoE routing at very large active parameter counts, and the organizational turmoil that followed the broader Llama 4 launch controversy.
Meta had built its open-weight strategy around increasingly large dense transformer models. Llama 2, released in 2023, topped out at 70B parameters. Llama 3 extended the line in 2024 with the 405B-parameter Llama 3.1, then added vision capability through bolted-on adapters in Llama 3.2 and a distilled 70B refresh in Llama 3.3. Across all of these releases the architecture stayed dense, the modality stayed text-first, and the cost of inference scaled directly with parameter count. By late 2024 that approach was looking increasingly out of step with the rest of the industry. Mistral's Mixtral models had popularized sparse Mixture of Experts at the open-weight scale, and DeepSeek had pushed MoE all the way to the frontier with DeepSeek V3, a 671B-total / 37B-active model released in December 2024 under a permissive license.
Internally at Meta, the response was to commit to a full MoE redesign for the next generation. The official launch blog framed this as the start of "a new era of natively multimodal AI innovation," and the design choices reflected that framing: native multimodal input via early fusion, MoE in every model in the family, and an aggressive bet on extending context length far beyond what other open-weight labs were shipping[1].
Llama 4 was announced publicly on Saturday, April 5, 2025. The announcement covered three models packaged as a "herd": Llama 4 Scout, Llama 4 Maverick, and Llama 4 Behemoth. Scout and Maverick were released the same day, with weights on llama.com and Hugging Face. Behemoth was described as still in training but already useful in the role of a teacher for the other two models. The blog post stated that "we codistilled the Llama 4 Maverick model from Llama 4 Behemoth as a teacher model, resulting in substantial quality improvements across end task evaluation metrics"[1].
The choice to announce a model that was not yet finished, and to publish benchmark results from an in-progress checkpoint, attracted some skepticism even before the launch's other controversies emerged. Meta clearly wanted the world to know that Behemoth existed and to see it as competitive with frontier proprietary systems, but in retrospect that decision raised expectations the company would later struggle to meet.
Meta has consistently referred to the model as "Llama 4 Behemoth" in its official communications, including the April 2025 launch blog, model documentation, and subsequent statements[1][5]. There was no formal rebrand to "Llama 4.5 Behemoth" or any "Llama 5" line during Behemoth's prolonged training period. By the time Meta unveiled Muse Spark in April 2026, the company had stopped using the Llama branding entirely for its frontier work, treating Muse Spark as a separate model series built from scratch by a new organization rather than as a successor to either Behemoth or the broader Llama 4 herd[14][15][16].
Almost everything publicly known about Behemoth's architecture comes from a small number of paragraphs in Meta's launch blog and a single illustrative comparison table. The model was never released, and Meta has not published a model card, a technical report, or detailed evaluation code. The disclosed numbers are summarized below.
| Specification | Value |
|---|---|
| Total parameters | ~2 trillion |
| Active parameters per token | 288 billion |
| Experts | 16 |
| Architecture | Mixture of Experts, transformer |
| Modalities | Multimodal (text + image, native) |
| Context window | Not publicly disclosed |
| Training tokens (family-wide mixture) | 30+ trillion |
| Training precision | FP8 (per Meta launch blog) |
| Status at announcement | Still in training |
| Public release | None as of May 2026 |
| Formal cancellation | None as of May 2026 |
| Effective successor program | Muse Spark (MSL, April 2026), not in Llama line |
Sources: [1] (Meta launch blog), [5] (model card aggregators reflecting disclosed specs), [14][15] (Muse Spark coverage).
The headline architectural choice for Behemoth is the combination of an extremely large total parameter count (around 2T) with a much smaller active parameter count (288B) routed across only 16 experts. That gives a total-to-active ratio of roughly 7, far less sparse than Maverick, which uses 128 routed experts plus a shared expert with top-1 routing and activates 17B of its roughly 400B total parameters, a ratio of about 24. With fewer, larger experts, Behemoth's design pushes more capacity into each expert and makes each routing decision a choice among far fewer options. The intuition Meta offered in the blog was that this configuration was suited to the role of a teacher: deep specialization within each expert, lots of parameters to memorize hard cases, and enough active capacity per token to generate high-quality reasoning traces for distillation[1].
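The routing pattern described for Maverick (one shared expert plus a single routed expert per token) can be illustrated with a toy sketch. Meta has not published Behemoth's router, so everything here is an assumption made for illustration: the linear gate, the softmax weighting, the tiny dimensions, and the function names `moe_layer` and `make_expert` are all hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def make_expert(scale):
    """Toy stand-in for an expert feed-forward block: scales every feature."""
    def expert(x):
        return [scale * v for v in x]
    return expert


def moe_layer(x, gate_weights, routed_experts, shared_expert):
    """Top-1 routing with a shared expert (illustrative, not Meta's code).

    x             -- token representation, list of floats
    gate_weights  -- one weight row per routed expert (hypothetical linear gate)
    Returns the layer output and the index of the chosen routed expert.
    """
    # One gate logit per routed expert: dot product of its row with the token.
    logits = [sum(w * v for w, v in zip(row, x)) for row in gate_weights]
    probs = softmax(logits)
    # Top-1: only the highest-scoring routed expert runs for this token.
    top = max(range(len(probs)), key=probs.__getitem__)
    routed_out = routed_experts[top](x)
    # The shared expert runs for every token regardless of routing.
    shared_out = shared_expert(x)
    out = [s + probs[top] * r for s, r in zip(shared_out, routed_out)]
    return out, top


# Usage: three routed experts, one shared expert, a 2-dimensional token.
token = [1.0, 0.0]
gates = [[0.0, 0.0], [2.0, 0.0], [1.0, 0.0]]
experts = [make_expert(1.0), make_expert(2.0), make_expert(3.0)]
output, chosen = moe_layer(token, gates, experts, make_expert(1.0))
```

Only the chosen expert's parameters are exercised for a given token, which is why active parameters, not total parameters, determine per-token compute.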
The trade-off is that 288 billion active parameters is itself an enormous inference footprint. Even with MoE routing, Behemoth requires roughly the compute of a 288B dense model per token, which is well beyond what most labs can comfortably serve. That decision was deliberate. Behemoth was never intended to be served to consumers; Meta described it as an internal teacher model, with Scout (17B active) and Maverick (17B active) as the user-facing products distilled from it.
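The arithmetic behind these figures is simple enough to check directly. The sketch below uses the disclosed parameter counts and a common rule of thumb of about 2 FLOPs per active parameter per token for a forward pass; that rule is an approximation introduced here, not a figure from Meta, and the function names are hypothetical.

```python
def total_to_active_ratio(total_params, active_params):
    """How many parameters exist for each one used on a given token."""
    return total_params / active_params


def forward_flops_per_token(active_params):
    """Rule-of-thumb forward cost: ~2 FLOPs per active parameter (assumption)."""
    return 2.0 * active_params


BEHEMOTH_TOTAL = 2e12     # ~2 trillion, as disclosed
BEHEMOTH_ACTIVE = 288e9   # 288 billion active per token, as disclosed
STUDENT_ACTIVE = 17e9     # Scout and Maverick: 17 billion active

behemoth_ratio = total_to_active_ratio(BEHEMOTH_TOTAL, BEHEMOTH_ACTIVE)
compute_multiple = (
    forward_flops_per_token(BEHEMOTH_ACTIVE)
    / forward_flops_per_token(STUDENT_ACTIVE)
)

print(f"Behemoth total/active ratio: {behemoth_ratio:.1f}")
print(f"Per-token compute vs a 17B-active student: {compute_multiple:.1f}x")
```

Under this approximation each Behemoth token costs roughly 17 times the compute of a Scout or Maverick token, which is consistent with the model being positioned as an internal teacher rather than a served product.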
Subsequent reporting indicated that this configuration also created real engineering pain. Coverage in mid-2025 said the team had to revise the routing method partway through training, disrupting expert specialization that had already started to form, and that stabilizing the routing at 288B active turned out to be harder than the team had hoped[3][8]. These reports are consistent with the picture later painted by Yann LeCun in interviews after his departure, in which he described the broader Llama 4 effort as scientifically uneven and the management decisions around release as overconfident[6][7].
Llama 4 as a family is natively multimodal through what Meta calls "early fusion." Visual tokens produced by an upgraded MetaCLIP-based vision encoder are concatenated directly into the same token sequence as text and processed through the same transformer layers from the very first block. Meta retrained the vision encoder "in conjunction with a frozen Llama model" so that the visual tokens are aligned to the language model's representational space before deep training begins[1]. Behemoth participated in this design, so its 2-trillion-parameter capacity covered text and image jointly rather than text alone. The launch blog did not specify whether Behemoth additionally trained on video, although the broader pretraining mixture for the Llama 4 family included video data.
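As a rough illustration of what early fusion means at the sequence level, the sketch below splices image patch embeddings into a text embedding sequence so that one transformer stack would see both. It is a minimal sketch, not Meta's implementation: the function name, the tagging of each position, and the toy embedding width are all assumptions.

```python
def early_fusion_sequence(text_embeddings, image_embeddings, image_position):
    """Build one token sequence containing both text and image embeddings.

    text_embeddings  -- list of equal-width float vectors, one per text token
    image_embeddings -- list of float vectors from a vision encoder, already
                        projected to the same width as the text embeddings
    image_position   -- index in the text sequence where the image appears
    Returns a list of (modality, vector) pairs in model input order.
    """
    widths = {len(v) for v in text_embeddings + image_embeddings}
    if len(widths) > 1:
        raise ValueError("text and image embeddings must share one width")
    fused = [("text", v) for v in text_embeddings[:image_position]]
    fused += [("image", v) for v in image_embeddings]
    fused += [("text", v) for v in text_embeddings[image_position:]]
    return fused


# Usage: three text tokens with a two-patch image placed after the first token.
text = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
patches = [[1.0, 1.0], [2.0, 2.0]]
sequence = early_fusion_sequence(text, patches, image_position=1)
modalities = [kind for kind, _ in sequence]
```

The contrast is with adapter-style designs such as Llama 3.2, where visual features enter through components added to an existing text model rather than through the shared input sequence.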
Meta said the overall Llama 4 pretraining mixture exceeded 30 trillion tokens, more than double the roughly 15 trillion tokens used for Llama 3, and that this mixture included diverse text, image, and video datasets. The same blog cited improvements in MoE training efficiency, with pretraining performed in FP8 precision and sustaining about 390 TFLOPs per GPU on H100 hardware[1]. Some Hugging Face documentation for the released models cited a figure closer to 40 trillion tokens including multimodal data[5]. Meta did not disclose Behemoth's own training corpus or token budget, so it is not publicly known whether the model saw all of this mixture, part of it, or additional data. Reporting in May 2025 estimated that the Behemoth training run had crossed $500 million in cumulative cost by mid-2025, a number that began to weigh on internal discussions of whether to continue investing in the model[10].
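The disclosed figures allow a back-of-envelope estimate of pretraining compute. The sketch below rests on several loudly labeled assumptions: the common 6 × parameters × tokens approximation for training FLOPs, applied to active parameters; a token budget equal to the family-wide 30 trillion figure, since Behemoth's own budget was not disclosed; and a hypothetical cluster size of 32,000 GPUs. It ignores the vision encoder, restarts, and post-training, so it should be read as an order-of-magnitude illustration only.

```python
def estimate_pretraining(active_params, tokens, flops_per_gpu_per_s, num_gpus):
    """Order-of-magnitude pretraining estimate (all inputs are assumptions)."""
    # Rule of thumb: ~6 FLOPs per active parameter per training token.
    total_flops = 6.0 * active_params * tokens
    gpu_seconds = total_flops / flops_per_gpu_per_s
    return {
        "total_flops": total_flops,
        "gpu_hours": gpu_seconds / 3600.0,
        "days_on_cluster": gpu_seconds / num_gpus / 86400.0,
    }


est = estimate_pretraining(
    active_params=288e9,        # disclosed active parameter count
    tokens=30e12,               # family-wide mixture size, assumed for Behemoth
    flops_per_gpu_per_s=390e12, # sustained throughput cited in the blog
    num_gpus=32_000,            # hypothetical cluster size
)

print(f"Total training FLOPs: {est['total_flops']:.2e}")
print(f"GPU-hours:            {est['gpu_hours']:.2e}")
print(f"Days on the cluster:  {est['days_on_cluster']:.0f}")
```

Under these assumptions the run comes to about 5.2e25 FLOPs and roughly 37 million GPU-hours. Changing the assumed token budget or cluster size moves the result proportionally.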
Llama 4 Behemoth's most concrete function in the published record is as a teacher model for knowledge distillation. In this setup, a large, expensive model generates output distributions, intermediate logits, or rollouts that are used as additional training signal for smaller "student" models. Done well, the student can pick up reasoning patterns, factual knowledge, and stylistic behavior from the teacher that would be hard to learn directly from raw text. The technique has become standard practice across major labs for compressing capability into models that can be served at lower cost.
Meta's claim is that Maverick was "co-distilled" from Behemoth using a custom loss function that dynamically weights the standard next-token prediction loss against the distillation loss from Behemoth's predictions. The blog described the approach as letting the student benefit from the teacher's broader knowledge "without being overly constrained by it"[1]. The same teacher signal was also used during Scout training. According to Meta's framing, this is what allows Maverick, with only 17 billion active parameters, to compete with proprietary models such as GPT-4o on certain benchmarks.
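A loss of this general shape can be sketched as a weighted sum of a hard-target cross-entropy term and a soft-target KL term. Meta did not publish its loss function or weighting rule, so this is a generic knowledge-distillation sketch under stated assumptions: the KL formulation, the linear schedule in `distill_weight_schedule`, and its default endpoints are all hypothetical.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def codistillation_loss(student_logits, teacher_logits, target_index, distill_weight):
    """Blend next-token cross-entropy with a teacher-matching term.

    distill_weight in [0, 1]: 0 means pure next-token loss,
    1 means pure distillation loss.
    """
    s_logp = log_softmax(student_logits)
    t_logp = log_softmax(teacher_logits)
    # Hard-target term: standard cross-entropy against the true next token.
    hard = -s_logp[target_index]
    # Soft-target term: KL(teacher || student) over the vocabulary.
    soft = sum(math.exp(t) * (t - s) for t, s in zip(t_logp, s_logp))
    return (1.0 - distill_weight) * hard + distill_weight * soft


def distill_weight_schedule(step, total_steps, start=0.9, end=0.3):
    """Hypothetical linear schedule that shifts weight away from the teacher."""
    frac = min(max(step / total_steps, 0.0), 1.0)
    return start + (end - start) * frac


# Usage: a 3-token vocabulary, true next token at index 0.
student = [2.0, 0.5, -1.0]
teacher = [3.0, 0.0, -2.0]
weight = distill_weight_schedule(step=500, total_steps=1000)
loss = codistillation_loss(student, teacher, target_index=0, distill_weight=weight)
```

A weight that varies during training is one way to let the student lean on the teacher early without being bound to it later; whether Meta's dynamic weighting worked this way is not known from the published material.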
A closely related detail is Meta's reported use of aggressive data filtering during post-training, with the filter informed by the teacher model. The launch blog said Meta removed "more than 50 percent of the SFT data tagged as easy" for Maverick and "more than 95 percent" for Behemoth itself, on the theory that easy supervised fine-tuning examples drag the model toward shallow pattern-matching and dilute the harder signal coming from online RL. With Behemoth filtered to such an extreme degree, the model was effectively trained on a curriculum heavily weighted toward harder reasoning examples, which Meta argued was necessary for the teacher role[1].
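The filtering step can be pictured as dropping a fixed fraction of the examples tagged as easy while keeping everything else. The sketch below assumes each example already carries a `difficulty` tag; that field, the random selection of which easy examples survive, and the function name are hypothetical, since Meta did not describe how the tags were assigned or which easy examples were retained.

```python
import random


def drop_easy_examples(examples, drop_fraction, seed=0):
    """Remove a fraction of 'easy' SFT examples, preserving original order.

    examples      -- list of dicts with a hypothetical "difficulty" key
    drop_fraction -- share of easy examples to discard, e.g. 0.5 or 0.95
    """
    rng = random.Random(seed)
    easy_indices = [i for i, ex in enumerate(examples) if ex["difficulty"] == "easy"]
    keep_count = round(len(easy_indices) * (1.0 - drop_fraction))
    kept_easy = set(rng.sample(easy_indices, keep_count))
    return [
        ex for i, ex in enumerate(examples)
        if ex["difficulty"] != "easy" or i in kept_easy
    ]


# Usage: 100 easy and 20 hard examples, filtered at the two reported rates.
dataset = [{"id": i, "difficulty": "easy"} for i in range(100)]
dataset += [{"id": 100 + i, "difficulty": "hard"} for i in range(20)]

maverick_style = drop_easy_examples(dataset, drop_fraction=0.50)
behemoth_style = drop_easy_examples(dataset, drop_fraction=0.95)
```

At the 95 percent rate the surviving set in this toy example is dominated by hard examples, which mirrors the curriculum effect described above.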
In practice, almost nothing about the distillation pipeline can be independently verified, because Behemoth has not been released. Researchers cannot probe the teacher directly, replicate the co-distillation, or audit which behaviors in Scout and Maverick came from the base mixture versus from Behemoth's output distributions. This has become one of the more uncomfortable aspects of the Llama 4 release: a model that was central to the published story is also one that the community cannot inspect. The distinction matters because the same blog post that introduced Behemoth was later challenged on accuracy grounds during the LMArena controversy, raising the bar for what the community is willing to accept on Meta's word alone[7][11].
Some observers also pointed out that the codistillation claim makes Behemoth load-bearing for the credibility of Scout and Maverick. If Behemoth never ships, the claim that Maverick's strong benchmark numbers come from a 2T-parameter teacher cannot be verified, and the open-weight community has to take Meta's training narrative on faith[13].
Meta published a small set of preview benchmark results for Behemoth in the April 2025 launch post, drawn from an in-progress training checkpoint rather than a finished release. The reported numbers were:
| Benchmark | Llama 4 Behemoth (preview) | GPT-4.5 | Claude Sonnet 3.7 | Gemini 2.0 Pro |
|---|---|---|---|---|
| MATH-500 | 95.0 | 92.4 | 92.0 | 91.8 |
| GPQA Diamond | 73.7 | 71.4 | 68.0 | 64.7 |
| MMLU Pro | 82.2 | not reported | not reported | not reported |
Meta's framing was that Behemoth "outperforms GPT-4.5, Claude Sonnet 3.7, and Gemini 2.0 Pro on several STEM benchmarks," with MATH-500 and GPQA Diamond cited explicitly[1]. These results were never independently reproduced because Behemoth was never released. After the LMArena controversy in April 2025, and after Yann LeCun's later admission that Meta had selected different model variants for different benchmarks across Llama 4, the preview Behemoth scores have become some of the more disputed results in the Llama 4 record[6][7].
Several caveats apply when reading these numbers. They come from an in-progress checkpoint, they were selected and reported by Meta itself, and they were never independently reproduced. Even on Meta's own selected benchmarks the win margins were tight, and the comparison set aged out quickly. Any claim that Behemoth was the strongest model in the world in early 2025 has to be read against that backdrop.
As of May 2026, Llama 4 Behemoth has not been publicly released. There are no weights on Hugging Face, no API endpoint on any cloud platform, no commercial hosting, and no model card published by Meta beyond the original launch blog. Meta has also not issued a formal cancellation. The model exists as an artifact inside Meta, but coverage through late 2025 and 2026 has converged on the view that the project has been quietly shelved rather than wound down on the record[4][17][18].
The timeline of delays and reorganization is well documented:
| Date | Event |
|---|---|
| April 5, 2025 | Llama 4 announced; Behemoth described as still in training, release expected after training[1] |
| April 2025 | Original informal target for release tied to Meta's LlamaCon developer event[2] |
| Late April 2025 | Release reportedly slipped to early summer, then to June[2][3] |
| May 15, 2025 | Wall Street Journal reports release is being postponed to fall 2025 or later, citing concerns that Behemoth's improvements over Maverick are incremental and may not justify a public launch[2][3][8] |
| June 30, 2025 | Meta announces the formation of Meta Superintelligence Labs, folding FAIR, the GenAI organization, and a new "TBD Lab" under one umbrella; Alexandr Wang of Scale AI is appointed Chief AI Officer and Nat Friedman is brought in to lead AI products[6][19] |
| Summer 2025 | Daniel Gross joins as Friedman's counterpart; the existing GenAI organization that produced Llama 4 is reportedly sidelined inside the new structure[6][19] |
| Late 2025 | Reports indicate Behemoth's training was finished but internal performance was "poor" enough that release was held back, and that work on Behemoth effectively stopped after the superintelligence reorganization[9][17] |
| November 19, 2025 | Yann LeCun announces his departure from Meta after twelve years to start a new venture; in subsequent interviews he describes Mark Zuckerberg as having lost confidence in the team that produced Llama 4[6][7] |
| December 2025 | Reports surface that Meta's next frontier models, codenamed Avocado (text and code) and Mango (multimodal image/video), will be developed inside Meta Superintelligence Labs and released as closed-source proprietary models, not in the open-weight Llama line[17][18][20] |
| January 2026 | Reuters and others request updates on Behemoth's status and receive no roadmap commitments from Meta[4] |
| April 8, 2026 | Meta unveils Muse Spark, the first model from Meta Superintelligence Labs; the Meta blog explicitly benchmarks Muse Spark against Llama 4 Maverick, claims an order-of-magnitude compute reduction at matching capability, and does not reference Behemoth as a teacher or as a successor[14][15][16] |
| May 2026 | Behemoth remains unreleased; Meta has issued no formal cancellation but has also stopped including Behemoth in roadmap statements[21] |
The key fact for any reader is straightforward: more than thirteen months after the announcement, Behemoth has not shipped. Meta has not formally canceled the model. There is no press release saying the project is dead. But there is also no credible signal that release is imminent, the open-weight Llama line has not produced a successor flagship, and the internal organization that built Behemoth has been restructured around a separate "superintelligence" effort that is producing closed-source models under a different brand.
The reasons given for the postponements have come in several waves. The Wall Street Journal's May 2025 report attributed the delay to internal concerns that Behemoth's gains over Maverick were not large enough to justify a public release. According to the report, "the sentiment is split, some feel the improvements over earlier versions are incremental at best"[2][3]. Subsequent coverage added detail. Computerworld and SiliconANGLE noted that MoE routing was harder to stabilize at Behemoth's 288B active parameter scale than the team had hoped, with reports that the routing method was changed partway through training, disrupting expert specialization that had already formed[3][8]. Training costs reportedly exceeded $500 million by mid-2025, a figure that put pressure on whether the model justified continued investment[10].
The broader scaling story matters too. By mid-2025, OpenAI's GPT-4.5 had been received as an underwhelming step beyond GPT-4o, Google's Gemini 2.5 had taken a more reasoning-focused approach, and Anthropic was emphasizing extended thinking over raw parameter growth. The industry narrative shifted from "bigger is better" toward inference-time compute, reasoning models, and test-time search. In that context, a 2-trillion-parameter teacher model with a dense-scale active footprint, and an uncertain marginal gain over a much smaller distilled student, became a harder sell internally, regardless of whether its training was technically successful.
Axios summarized the broader concern in May 2025: "Meta's disappointments mirror a broader worry inside the AI industry that progress dependent on scaling up models may be plateauing"[10]. Behemoth ended up tangled in that worry rather than as a counterexample to it.
The clearest signal that Behemoth was no longer a strategic priority came at the end of 2025 and the start of 2026. Reporting in December 2025 indicated that Meta Superintelligence Labs was working on two new frontier models, codenamed Avocado (a text and code reasoning model) and Mango (a multimodal model focused on image and video generation), with closed-source releases targeted for the first half of 2026[17][18][20]. Coverage emphasized that leadership had grown wary of releasing open weights at the frontier after watching DeepSeek and other groups iterate on top of Meta's prior releases without bearing the underlying research costs, and that Wang's TBD Lab was structured around a more proprietary posture than the GenAI organization that produced Llama 4[17][20].
The Avocado launch then slipped from its original Q1 2026 target to spring or early summer 2026[22], and on April 8, 2026 Meta instead introduced Muse Spark, billed as "the first in a new series of large language models built by Meta Superintelligence Labs"[14][15]. Meta's blog described Muse Spark as "the first step on our scaling ladder" and said that with rebuilt training infrastructure and improved data curation, the new model could "reach the same capabilities with over an order of magnitude less compute than our previous model, Llama 4 Maverick"[15]. Notably, the Muse Spark blog did not benchmark against Behemoth, did not describe Behemoth as the teacher model behind any new student, and treated Llama 4 Maverick as the explicit baseline. Several reports framed this as the moment Meta moved past the Behemoth program: a new model line, a new organization, a new naming convention, and no mention of the 2T-parameter teacher the company had spent over a year defending[14][16][21].
The story of Behemoth cannot be cleanly separated from the broader Llama 4 launch controversy. Within days of the April 5, 2025 release, AI researchers noticed that the version of Maverick submitted to the LMArena (Chatbot Arena) leaderboard was a custom "experimental" variant, not the model available for download. The released model ranked far lower than the leaderboard entry suggested. LMArena later said publicly that "Meta's interpretation of our policy did not match what we expect from model providers" and updated its evaluation rules in response[11].
In an interview after his departure from Meta, Yann LeCun confirmed the harder version of the story: that Meta's team had "fudged a little bit" and had "used different models for different benchmarks to give better results"[7]. The fallout reached the CEO directly. Per LeCun, "Mark was really upset and basically lost confidence in everyone who was involved in this. And so basically sidelined the entire GenAI organization"[7]. Behemoth was part of that organization's work, and the loss of executive confidence in the team is a major reason its release stalled.
In mid-2025, Mark Zuckerberg reorganized Meta's AI research under a new umbrella called Meta Superintelligence Labs, formally established on June 30, 2025. The lab was structured around four sub-organizations: TBD Lab (frontier language models), FAIR (long-horizon research), Products and Applied Research, and MSL Infra (training and serving infrastructure)[19]. The change was paired with a roughly $14.3 billion investment in Scale AI that brought Alexandr Wang, Scale AI's 28-year-old founder, in as Meta's chief AI officer running the new lab, with Nat Friedman joining to lead AI products and Daniel Gross joining several weeks later[6][19]. Wang technically became LeCun's superior in the org chart, which contributed to LeCun's departure several months later[6]. Behemoth's team was reportedly folded into the broader restructuring, and the model effectively stopped being a public priority.
Yann LeCun, Meta's longtime chief AI scientist and a Turing Award winner, left Meta in November 2025 after twelve years to found a startup focused on AMI (Advanced Machine Intelligence)[6]. In interviews after his departure, LeCun was sharply critical of Meta's new direction. He described Wang as "young" and "inexperienced," said the new leadership lacked "experience with research or how you practice research," and confirmed publicly that "results were fudged a little bit" during the Llama 4 launch[6][7]. He raised about $1 billion in seed funding for his new venture in early 2026[12].
While LeCun was not directly leading Llama 4 training, his very public exit and parting commentary cemented the perception that the team and program that produced Behemoth no longer had executive backing inside Meta. That is the political and organizational context in which the model has remained shelved.
Meta's launch comparison set focused on three contemporaries. The table below summarizes how those models were positioned at the time of Behemoth's announcement and how they have evolved since.
| Model | Developer | Architecture | Release year | Public availability |
|---|---|---|---|---|
| Llama 4 Behemoth (preview) | Meta | MoE (~2T total, 288B active, 16 experts) | Announced April 2025, unreleased | Not released as of May 2026 |
| GPT-4.5 | OpenAI | Not publicly disclosed | February 2025 | Available via API and ChatGPT |
| Claude Sonnet 3.7 | Anthropic | Not publicly disclosed | February 2025 | Available via API and Claude apps |
| Gemini 2.0 Pro | Google DeepMind | Not publicly disclosed | February 2025 (experimental) | Available via Vertex AI and Gemini app |
By mid-2025, both Anthropic and Google had moved past the specific competitors Meta cited at announcement. Claude Opus 4 and Claude Sonnet 4 shipped in May 2025 with extended thinking, and Gemini 2.5 Pro consolidated Google's lead on the LMArena leaderboard while raising the bar on reasoning benchmarks. DeepSeek's R1 reasoning model and subsequent open-weight releases pushed the open-weight frontier further down the cost curve. Behemoth's preview numbers from April 2025 looked competitive at the moment they were published; by the time it might have launched, the relevant comparison set had largely moved on.
The open-weight comparison was the harder one for Meta. DeepSeek V3 and the subsequent reasoning-focused DeepSeek releases offered competitive performance with open weights, no benchmark controversies, and dramatically lower training costs than Meta's reported numbers. By early 2026, Qwen and DeepSeek effectively controlled the upper end of the open-weight scale, while Meta's own frontier work was moving into closed-source territory under the Muse Spark and Avocado names[17][14]. Even if Behemoth had shipped, it would have had to justify both its absolute capability and its enormous active parameter footprint against open-weight models that were achieving most of the same scores with far less compute.
Reception of Behemoth has been distinctive in that the model is being judged largely on its non-arrival rather than on direct testing. A few threads run through the coverage.
The AI industry in 2024 and 2025 saw several large-scale models announced but quietly delayed or scaled back, including OpenAI's repeatedly postponed GPT-5 and reports of training-run difficulties at other major labs. Behemoth fit a recognizable pattern: an aggressive announcement aimed at preserving competitive position, followed by months of slipping deadlines and finally an awkward silence. Independent commentary on the Llama 4 release noted that publishing benchmarks from an in-progress run, while presenting the model as competitive with shipped products, set up an unforgiving comparison once external scrutiny came in[10][13].
The LMArena controversy and LeCun's later confirmation that the Llama 4 team "used different models for different benchmarks" meant that the preview numbers Meta published for Behemoth were received with significant doubt by mid-2025[7]. The Behemoth scoreline against GPT-4.5, Claude Sonnet 3.7, and Gemini 2.0 Pro was never reproduced by an independent evaluator, and the community largely stopped quoting those numbers without caveats.
Analysts began treating Behemoth less as an upcoming release and more as a marker of where Meta wanted to be. Skywork's mid-2025 retrospective described the model as "a directional signal, not a generally released model"[13]. Industry analyses cited Behemoth alongside other large-but-unreleased systems as evidence that pure model scaling was running into harder limits than the optimistic 2024 framing had implied[10]. By the time Muse Spark debuted in April 2026, multiple outlets characterized Behemoth retrospectively as the "shelved teacher" that Meta had moved on from rather than a model that was on the verge of release[14][16][21].
For users of open-weight models, the practical effect of Behemoth's absence has been minimal in the short term. Scout and Maverick are the models people actually run. The longer-term effect, however, has been to dim expectations that a major lab will keep shipping ever-larger open-weight models on the same schedule. With Behemoth shelved, Qwen and DeepSeek effectively own the upper end of the open-weight scale through 2025 and 2026, and Meta has signaled through the Avocado, Mango, and Muse Spark programs that its next frontier models will not be released as open weights at all[17][18][20]. Behemoth thus ended up not just unreleased but also as the marker for the end of Meta's open-weight frontier era[16][21].