Reflection 70B controversy
Last reviewed
Jun 8, 2026
Sources
8 citations
Review status
Source-backed
Revision
v1 · 1,672 words
Improve this article
Add missing citations, update stale details, or suggest a clearer explanation.
Last reviewed
Jun 8, 2026
Sources
8 citations
Review status
Source-backed
Revision
v1 · 1,672 words
Add missing citations, update stale details, or suggest a clearer explanation.
The Reflection 70B controversy was an open-source AI credibility episode that began on September 5, 2024, when Matt Shumer, co-founder and chief executive of OthersideAI (the company behind the AI writing assistant HyperWrite), announced a model called Reflection 70B and described it as "the world's top open-source model." [1][2] Shumer said the model was a fine-tune of Meta's Llama 3.1 70B that used a novel training method he called "Reflection-Tuning," and he published AI benchmark results purporting to beat leading closed models such as Claude 3.5 Sonnet and GPT-4o across multiple tests. [3]
Within roughly 72 hours the claims unraveled in public. Independent evaluators, including the firm Artificial Analysis, could not reproduce the headline scores; the weights uploaded to Hugging Face performed no better than an ordinary Llama fine-tune, and at one point scored like the older Llama 3 rather than Llama 3.1. [1][2] Researchers who probed the privately hosted API found behavior consistent with a wrapper that relayed queries to Anthropic's Claude, and later to GPT-4o, rather than to a genuine open model. [1][4] Shumer and Sahil Chaudhary, founder of the training-data company Glaive AI, attributed the discrepancies to upload corruption, hosting problems, and an evaluation bug; critics alleged deliberate misrepresentation, while others framed the affair as overhyping and mishandling rather than proven fraud. [1][5] The episode became a widely cited cautionary tale about unverified benchmark claims and the importance of independent reproduction in model releases. [5][6]
On September 5, 2024, Shumer posted on X (formerly Twitter) that Reflection 70B "holds its own against even the top closed-source models (Claude 3.5 Sonnet, GPT-4o)," asserting it was "the top LLM in (at least) MMLU, MATH, IFEval, GSM8K," that it "beats GPT-4o on every benchmark tested," and that it "clobbers Llama 3.1 405B." [3] The model was released openly on Hugging Face, with a larger 405B version promised to follow. [1]
The marketing centered on a technique branded "Reflection-Tuning," in which the model is trained to externalize its reasoning inside special tags and to catch and correct its own mistakes before answering. The published format used three XML-style tags: a <thinking> block where the model lays out its reasoning, a <reflection> block where it flags and fixes errors it notices in that reasoning, and an <output> block containing the final answer. [7] Shumer credited the synthetic training data to Glaive AI, a startup run by Sahil Chaudhary that produces datasets for model fine-tuning. [2][7]
The accompanying benchmark table claimed state-of-the-art numbers for an openly downloadable model. Reported figures included roughly 89.9% on MMLU, 99.2% on GSM8K, about 90.1% on the HumanEval coding benchmark, 79.7% on MATH, and a high IFEval score, positioning Reflection 70B at or above frontier closed models. [7][8] The claims spread quickly and the model briefly topped community attention as a possible open-source breakthrough. [2]
| Benchmark | Claimed Reflection 70B score |
|---|---|
| MMLU | ~89.9% |
| GSM8K | ~99.2% |
| HumanEval | ~90.1% |
| MATH | ~79.7% |
| IFEval | high (claimed best-in-class) |
Source: Shumer's launch materials, as reported by secondary outlets. [7][8] These figures are the disputed claims, not independently confirmed results.
Skepticism appeared almost immediately. On September 8, 2024, Artificial Analysis, an independent model-evaluation organization, reported that its own testing did not match the announcement. The group stated that "our evaluation of Reflection Llama 3.1 70B's MMLU score resulted in the same score as Llama 3 70B and significantly lower than Meta's Llama 3.1 70B," a striking discrepancy with Shumer's published numbers and one that also suggested the released weights might be based on the older Llama 3 rather than Llama 3.1. [1] Members of the open-source community on forums such as Reddit's r/LocalLLaMA likewise found the Hugging Face weights underperforming the base model, with the touted system prompt producing no measurable benefit. [2][5]
The Reflection team initially said the publicly uploaded weights were faulty. Shumer wrote that the weights had been "fucked up during the upload process" to Hugging Face and that this could explain why the public download was worse than an internal, privately hosted API version. [1] Re-uploaded copies, however, still failed to reach the advertised scores. [7]
Attention then shifted to that private API, where investigators reported behavior inconsistent with an open Llama model. When asked to identify itself, the hosted model would say it was "Claude, built by Anthropic," and it appeared to filter or refuse to emit the literal word "Claude," consistent with a thin wrapper around Anthropic's Claude 3.5 Sonnet with post-processing to hide its identity. [2][4] Testers reported that, given identical prompts, the API returned outputs matching Claude's. After the Claude-relay route appeared to be cut off, observers said the endpoint began exhibiting behavior associated with GPT-4o, suggesting the backend had been switched between providers. [4] On September 8, 2024, the X user known as Shin Megami Boson publicly accused Shumer of "fraud in the AI research community," posting screenshots and other material as evidence. [1] Some commentators noted an additional concern about transparency: Shumer had an investment in Glaive AI that was not disclosed alongside the launch, raising questions about conflicts of interest in how the model and its data were promoted. [1][5]
After a period of near-silence, Shumer apologized on September 10, 2024. He wrote: "I got ahead of myself when I announced this project, and I am sorry. That was not my intention," adding that he had decided to ship the approach "based on the information that we had at the moment" and acknowledging that many supporters had become skeptical. [1] Critics judged the statement insufficient, because it did not explain why the hosted API behaved like Claude or why the public weights could not reproduce the claims. [1][5]
Chaudhary, speaking for Glaive AI, conceded early on that "the benchmark scores I shared with Matt haven't been reproducible so far." [1] On or around October 3 to 4, 2024, he published a longer postmortem on the Glaive AI blog titled an update on Reflection-70B, and released artifacts intended to let outsiders re-run the work, including model weights, training data, and training and evaluation scripts on GitHub and Hugging Face. [4][6] In the writeup, Chaudhary said a bug in the evaluation code had inflated some scores, particularly on tasks such as MATH and GSM8K, due to an error in how the harness handled responses from an external scoring API. [4] He acknowledged the launch had been rushed: "We shouldn't have launched without testing, and with the tall claims of having the best open-source model," and said neither he nor Shumer had verified that the community could easily download and run the model. [4]
On the most serious allegation, Chaudhary denied deliberately serving Anthropic's Claude through the API and said the uploaded weights could reproduce the unusual identity behavior locally. [4][6] Reception remained skeptical: commenters on Hacker News and elsewhere noted that the original benchmark harness had not been shared, that the explanations leaned on file corruption and methodology rather than directly resolving the wrapper evidence, and that questions lingered about the quality and provenance of the training data. [4][6] The affair never produced a definitive, universally accepted account of exactly what the hosted API was doing at each moment.
The Reflection 70B episode caused notable reputational damage to those involved and became a reference point in debates over benchmark integrity. Coverage ranged from outlets reporting fraud accusations to analyses arguing the most charitable reading was severe overhyping and poor process rather than proven intent to deceive. Some commentators went further: the venture firm Air Street Capital wrote that, in its opinion, the sequence of events, an entirely different model first appearing on Hugging Face and the API then cycling between Claude, GPT-4o, and Llama responses, "gave the appearance of being a case of genuine fraud," while cautioning that this was their characterization. [5] Throughout, the strongest claims of intent remained allegations rather than adjudicated findings.
Several durable lessons were drawn. First, the case underscored how vulnerable public leaderboards are to inflated or contaminated scores, and how a single uncorrected evaluation bug can manufacture apparent state-of-the-art results. [4][5] Second, it highlighted the value of fast, independent reproduction: organizations like Artificial Analysis and the broader open-source community debunked the headline claims within days, which observers cited as a sign the ecosystem's self-correction worked, even as they argued that "bad practice, however quickly it unravels, needs to be taken seriously." [5] Third, it foregrounded the need to release verifiable artifacts (uploadable weights, datasets, and evaluation harnesses) at announcement time rather than after the fact. [4][6]
The branding also intersected with a genuine research direction. The general idea behind "reflection," prompting or training a model to reason step by step and revise its own errors, overlaps with established chain-of-thought methods and was subsequently validated far more rigorously by reasoning-focused systems such as OpenAI's o1-style models, which made deliberate test-time reasoning and self-correction a credible, reproducible capability. [5] In that sense the underlying intuition was not the problem; the controversy was about unverified claims, opaque hosting, and the gap between marketing and independently confirmed evidence. The Reflection 70B saga is now commonly invoked alongside other reproducibility disputes as a warning that extraordinary benchmark claims require independent verification before they are believed. [5][6]