The Leaderboard Illusion

Updated Jun 28, 2026

The Leaderboard Illusion is a 2025 research paper, led by Cohere and Cohere Labs with academic collaborators, that argues the most influential public ranking of large language models, Chatbot Arena, is systematically distorted in favor of a small number of large providers. Posted to arXiv as 2504.20879 on April 29, 2025, the paper finds that undisclosed private testing, selective retraction of weaker models, and asymmetric access to Arena battle data combine to inflate the scores of major laboratories such as Google, OpenAI, and Meta ^[1]. Its central claim, in the authors' words, is that these dynamics produce "a distorted playing field" that rewards "overfitting to Arena-specific dynamics rather than general model quality" ^[1].

Overview

The paper documents systematic distortions in Chatbot Arena, the crowdsourced human-preference leaderboard operated by LMArena (formerly LMSYS). Led by researchers at Cohere and Cohere Labs together with academic collaborators, it argues that the Arena's ranking is shaped by policies that systematically favor a small number of large providers ^[1]^[2].

The authors contend that three interacting mechanisms inflate the Arena scores of major laboratories: undisclosed private testing of many model variants combined with selective disclosure of only the best result; the silent retraction or deprecation of weaker models; and asymmetric access to the Arena's user-prompt and battle data, which the authors say allows large providers to overfit to the Arena distribution ^[1]. The paper presents this as a case study in Goodhart's law, the principle that a measure ceases to be a good measure once it becomes a target.

Published shortly after the Llama 4 Arena controversy, in which Meta was found to have submitted an Arena-tuned variant distinct from its public release, the paper intensified scrutiny of leaderboard gaming and prompted a detailed public rebuttal from LMArena. The debate that followed pushed the community toward greater caution in interpreting leaderboard standings and led LMArena to announce several transparency reforms ^[3]^[4].

What is "The Leaderboard Illusion"?

The Leaderboard Illusion is a 2025 study auditing how models are evaluated on Chatbot Arena, the human-preference leaderboard that pits two anonymous chatbots against each other and aggregates user votes into a ranking. The paper concludes that the leaderboard's headline numbers are not a neutral measurement of model quality but a contested signal shaped by Arena policies. Its core argument is that a handful of large providers gain an advantage by privately testing many variants and disclosing only the best, by having weaker models quietly removed, and by accumulating disproportionate amounts of Arena data they can train on. The authors frame the result as a textbook instance of Goodhart's law: once the Arena ranking became a target that providers optimize for, it stopped being a reliable measure of general capability.

Who wrote the paper and when was it published?

The paper is authored by Shivalika Singh, Yiyang Nan, Alex Wang, Daniel D'Souza, Sayash Kapoor, Ahmet Ustun, Sanmi Koyejo, Yuntian Deng, Shayne Longpre, Noah A. Smith, Beyza Ermis, Marzieh Fadaee, and Sara Hooker ^[1]. The lead and senior authors are affiliated with Cohere and Cohere Labs (Cohere's research arm), with collaborators drawn from institutions including Princeton University, Stanford University, the Allen Institute for AI, the University of Waterloo, and the University of Washington ^[2]. The arXiv version (2504.20879), first submitted April 29, 2025 and revised May 12, 2025, runs to roughly 68 pages including appendices, and the work was subsequently submitted for peer review ^[1].

Sara Hooker, who leads Cohere Labs, characterized the stakes bluntly in interviews tied to the paper's release: "One of the benchmarks that is most widely used, most highly visible, has shown a clear pattern of unreliability in rankings" ^[6]. She framed the broader problem as one of scientific rigor, saying she hoped the paper would spur "a sense of integrity" and an acknowledgement "that this is just bad science" ^[6].

Why is Chatbot Arena so influential?

Chatbot Arena launched in 2023 as a project of LMSYS, a research group associated with UC Berkeley, and later spun out under the LMArena name. The platform pits two anonymous models against each other on a user-supplied prompt; the user votes for the better response, and the aggregated votes are converted into a ranking using a Bradley-Terry-style rating model (often described in terms of Elo) ^[1]. Because it aggregates real human preferences across millions of open-ended prompts rather than a fixed test set, Chatbot Arena became widely regarded as a more contamination-resistant signal than static AI benchmarks, and its top-line ranking grew into one of the most cited indicators of frontier model quality.
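To make the vote-to-ranking step concrete, the sketch below fits a Bradley-Terry model to a list of pairwise outcomes using the classic minorization-maximization update. It is a minimal illustration, not LMArena's pipeline: the function names, the 1000-point anchor, and the 400-point base-10 scaling are assumptions chosen to resemble common Elo conventions, ties are ignored, and every model is assumed to have at least one win.

```python
import math
from collections import defaultdict

def fit_bradley_terry(battles, iterations=200):
    """Estimate Bradley-Terry strengths from (winner, loser) pairs.

    Assumes no ties, no self-battles, and at least one win per model
    (a model with zero wins would be driven to strength 0).
    """
    wins = defaultdict(float)         # total wins per model
    pair_counts = defaultdict(float)  # battles per unordered pair of models
    models = set()
    for winner, loser in battles:
        wins[winner] += 1.0
        pair_counts[frozenset((winner, loser))] += 1.0
        models.update((winner, loser))

    strength = {m: 1.0 for m in models}
    for _ in range(iterations):
        updated = {}
        for m in models:
            denom = 0.0
            for pair, n in pair_counts.items():
                if m in pair:
                    (other,) = pair - {m}
                    denom += n / (strength[m] + strength[other])
            updated[m] = wins[m] / denom
        # Strengths are only identified up to a constant factor,
        # so rescale to keep their geometric mean at 1.
        log_mean = sum(math.log(s) for s in updated.values()) / len(updated)
        strength = {m: s / math.exp(log_mean) for m, s in updated.items()}
    return strength

def to_elo_like_scale(strength, anchor=1000.0, scale=400.0):
    """Map strengths onto an Elo-like scale (hypothetical anchor and scale)."""
    return {m: anchor + scale * math.log10(s) for m, s in strength.items()}

# Toy data: model "A" beats model "B" in three of four battles.
toy_battles = [("A", "B"), ("A", "B"), ("A", "B"), ("B", "A")]
toy_ratings = to_elo_like_scale(fit_bradley_terry(toy_battles))
```

In the toy data the fitted strength ratio matches the observed 3:1 win ratio, which is the defining property of the Bradley-Terry maximum-likelihood fit for a single pair.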

That influence created strong incentives. Model providers increasingly cited Arena standings in launch announcements and marketing, and a number of laboratories openly optimized for Arena performance. The paper's central concern is that, as the leaderboard became a target, the conditions under which models are evaluated on it stopped being neutral.

What problems did it find in Chatbot Arena?

The analysis draws on roughly 2 million Arena battles spanning 243 models from 42 providers, covering data from January 2024 through April 2025, supplemented by tracking of anonymous pre-release models from January to March 2025 ^[1]. The authors report three principal findings, summarized below. These figures are the paper's own claims, and several were disputed by LMArena (see "How did LMArena respond?").

Claim (as stated by the paper)	Reported figure
Arena battles analyzed	about 2 million ^[1]
Models covered, across 42 providers	243 ^[1]
Private variants tested by Meta before Llama 4	27 ^[1]^[5]
Private variants observed from Google (Jan to Mar 2025)	about 10 ^[1]
Share of Arena test prompts to Google	19.2% ^[1]
Share of Arena test prompts to OpenAI	20.4% ^[1]
Combined share to 83 open-weight models	29.7% ^[1]
Combined share to fully open-source models	8.9% ^[1]
Models silently deprecated (vs. 47 officially listed)	205 ^[1]
Win-rate gain on ArenaHard from 0% to 70% Arena training data	23.5% to 49.9%, a 112% relative gain ^[1]
Score gap between two identical model checkpoints	about 17 points ^[1]

Undisclosed private testing and selective disclosure

The paper's most prominent claim is that some providers, especially large proprietary labs, are permitted to test many private model variants on the Arena before a public launch, then publish only the best-scoring variant while withdrawing the rest ^[1]. Because each variant is an independent draw from a noisy distribution, the authors argue that publishing the maximum of many draws is a best-of-N selection that mechanically inflates the reported score. They report identifying 27 private Meta variants in the run-up to Llama 4 and roughly 10 from Google in early 2025 ^[1]^[5]. The paper states the point directly: "At an extreme, we identify 27 private LLM variants tested by Meta in the lead-up to the Llama-4 release" ^[1]. To illustrate the variance the practice exploits, the authors say they submitted two identical checkpoints under different aliases and observed a score difference of about 17 points ^[1].
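The best-of-N argument can be illustrated with a small simulation. In the sketch below, every variant has the same true score and differs only by Gaussian measurement noise, so any gain from picking the maximum is pure selection effect. The true score of 1200, the noise level of 10 points, and the function name are hypothetical values chosen for illustration; this is an idealized model, not measured Arena data, and it does not account for votes collected after release.

```python
import random
import statistics

def simulate_best_of_n(n_variants, true_score=1200.0, noise_sd=10.0,
                       trials=5000, seed=0):
    """Average score reported when only the best of n noisy variants is published.

    All variants share the same true_score; noise_sd is a hypothetical
    measurement-noise level, not a figure from the paper.
    """
    rng = random.Random(seed)
    best_scores = []
    for _ in range(trials):
        draws = [rng.gauss(true_score, noise_sd) for _ in range(n_variants)]
        best_scores.append(max(draws))
    return statistics.mean(best_scores)

# Inflation relative to the true score for 1, 3, 10, and 27 private variants.
inflation = {n: simulate_best_of_n(n) - 1200.0 for n in (1, 3, 10, 27)}
```

Under these assumptions the inflation grows with the number of variants even though no variant is actually better than another, which is the mechanism the authors describe.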

Selective retraction and deprecation

The authors describe an asymmetry in how models leave the Arena. They report that 205 models were "silently" deprecated, meaning quietly removed or down-weighted without being marked as retired, a figure that in the paper's words "substantially exceeds the 47 models officially marked as deprecated by Chatbot Arena" ^[1]. They argue that proprietary models are sampled in more battles and removed less often than open-weight and open-source models, which both concentrates data on a few providers and allows weaker results to disappear from view rather than dragging down a provider's standing ^[1].

Asymmetric data access

The third finding concerns access to the Arena's battle data, which the paper treats as a valuable training signal. The authors estimate that Google and OpenAI each received roughly a fifth of all Arena test prompts (19.2% and 20.4% respectively), while 83 open-weight models collectively received 29.7% and fully open-source models received only 8.9% ^[1]. To quantify the value of this access, they fine-tuned a model on varying proportions of Arena data and reported that raising the Arena share from 0% to 70% more than doubled the win-rate on the ArenaHard benchmark, from 23.5% to 49.9%, a relative gain of about 112% ^[1]. The authors interpret this as evidence that privileged data access enables overfitting to the Arena distribution, a form of benchmark contamination specific to the Arena's prompt distribution rather than to any fixed test set.
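The relationship between the two win rates and the "112%" figure is simple arithmetic, shown below as a short check. The helper name is an assumption for illustration; the inputs are the win rates reported by the paper.

```python
def relative_gain_percent(before, after):
    """Relative change from before to after, expressed in percent."""
    return (after - before) / before * 100.0

win_rate_without_arena_data = 23.5  # percent, at 0% Arena data
win_rate_with_arena_data = 49.9     # percent, at 70% Arena data

# Absolute gain in percentage points versus relative gain in percent.
absolute_gain_points = win_rate_with_arena_data - win_rate_without_arena_data
relative_gain = relative_gain_percent(win_rate_without_arena_data,
                                      win_rate_with_arena_data)
```

The absolute gain is about 26.4 percentage points, and dividing that by the 23.5% starting point gives the relative gain of roughly 112%.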

What reforms did the paper propose?

The paper closes with concrete recommendations intended to reduce these distortions: prohibit the retraction of scores once a model has been evaluated; cap the number of private variants any provider may test simultaneously (it suggests a maximum of three concurrent variants per provider); make model deprecation stratified, transparent, and auditable, applied equally to proprietary, open-weight, and open-source models; adopt variance-aware sampling so that ratings reflect genuine differences; and publish regular transparency reports on data access and pre-release testing ^[1].

How did LMArena respond?

LMArena published a detailed rebuttal disputing six of the paper's central claims and defending its policies ^[3]. Its principal points were:

Open-model share. LMArena said the paper's figure of roughly 8.8% open-source representation omitted open-weight models such as Llama and Gemma, and that by its own accounting open models make up about 40.9% of the leaderboard ^[3].
The score-boost plot. LMArena stated that the figure suggesting gains of 100-plus points from pre-release testing was a simulation using Gaussian draws rather than measured Arena data, and that in practice the boost is small, on the order of about 11 Elo after 50 tests, because the platform continually collects fresh votes that wash out selection effects ^[3].
Identical checkpoints. LMArena argued the roughly 17-point gap between identical checkpoints falls within expected statistical noise once overlapping confidence intervals are accounted for ^[3].
The 112% figure. LMArena noted that this experiment was run on ArenaHard, a static benchmark of about 500 prompts, and therefore does not characterize the live Chatbot Arena itself ^[3].
Policy transparency. LMArena said its pre-release testing policy had been publicly documented since March 1, 2024, and was therefore not secret. It also said that any provider may submit models and that larger labs simply produce more of them, adding that Cohere itself had received two to three times more pre-release tests than OpenAI or xAI ^[3].
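To put the disputed point gaps in perspective, the sketch below converts a rating gap into an expected head-to-head win probability and checks whether two confidence intervals overlap. It assumes the standard Elo logistic curve with base 10 and a 400-point scale, which is a common convention rather than something confirmed by the sources here. The interval half-widths are hypothetical, and interval overlap is only a coarse heuristic, not a formal significance test.

```python
def expected_win_probability(rating_gap, scale=400.0, base=10.0):
    """Expected win probability for the higher-rated model (Elo convention)."""
    return 1.0 / (1.0 + base ** (-rating_gap / scale))

def intervals_overlap(rating_a, half_width_a, rating_b, half_width_b):
    """True if [rating - half_width, rating + half_width] intervals intersect."""
    low_a, high_a = rating_a - half_width_a, rating_a + half_width_a
    low_b, high_b = rating_b - half_width_b, rating_b + half_width_b
    return low_a <= high_b and low_b <= high_a

# A 17-point gap (identical checkpoints) and an 11-point gap (LMArena's
# estimated boost) both correspond to win probabilities close to 50%.
p_gap_17 = expected_win_probability(17.0)
p_gap_11 = expected_win_probability(11.0)

# Hypothetical ratings 17 points apart with +/-10 point intervals overlap.
overlapping = intervals_overlap(1217.0, 10.0, 1200.0, 10.0)
```

Under this convention a 17-point gap implies the higher-rated model wins roughly 52% of head-to-head votes. Whether a gap of that size counts as noise depends on the interval widths, which is the substance of the disagreement between the two sides.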

Alongside the rebuttal, LMArena announced changes: clearer marking of retired models, an explicit statement that providers may test multiple variants, and "provisional" scores for pre-release models when ten or more variants are evaluated at once. Such scores stay provisional until roughly 2,000 additional post-release votes accumulate ^[3].

Why does the paper matter?

The paper landed during a period of heightened skepticism about leaderboard integrity. Weeks earlier, the Llama 4 launch had drawn criticism when it emerged that the high-ranking "Llama-4-Maverick-Experimental" entry on Chatbot Arena was an Arena-optimized chat variant distinct from the publicly released weights, prompting LMArena to update its policies even before the paper appeared ^[4]. The Leaderboard Illusion crystallized those concerns into a quantified, academic-style critique and was widely covered in the technical press, with the head of Cohere Labs publicly characterizing the unreliability of leaderboard rankings as a crisis for the field ^[6].

Beyond the specific dispute with LMArena, the paper became a reference point in broader discussions of evaluation methodology. It reinforced the argument that any single headline metric, including a human-preference leaderboard once thought resistant to gaming, is vulnerable to Goodhart-style optimization, and it strengthened calls for transparency reforms such as disclosed pre-release testing, auditable deprecation, and reporting of data access. The episode is frequently cited as evidence that benchmark and leaderboard results should be read as contested claims rather than neutral facts, and that robust model evaluation requires multiple independent measures rather than reliance on a single ranking.

References

[1] Singh, Shivalika, et al. "The Leaderboard Illusion." arXiv:2504.20879, April 29, 2025. https://arxiv.org/abs/2504.20879
[2] "The Leaderboard Illusion." OpenReview. https://openreview.net/forum?id=4Ae8edNqm0
[3] "LMArena Response to 'The Leaderboard Illusion' Writeup." LMArena Blog. https://arena.ai/blog/our-response/
[4] Willison, Simon. "Understanding the recent criticism of the Chatbot Arena." April 30, 2025. https://simonwillison.net/2025/Apr/30/criticism-of-the-chatbot-arena/
[5] "Leaderboard illusion: How big tech skewed AI rankings on Chatbot Arena." Computerworld. https://www.computerworld.com/article/3976355/leaderboard-illusion-how-big-tech-skewed-ai-rankings-on-chatbot-arena.html
[6] "Cohere Labs head calls 'unreliable' AI leaderboard rankings a 'crisis' in the field." BetaKit. https://betakit.com/cohere-labs-head-calls-unreliable-ai-leaderboard-rankings-a-crisis-in-the-field/
