The Leaderboard Illusion
Last reviewed
Jun 8, 2026
Sources
6 citations
Review status
Source-backed
Revision
v1 · 1,664 words
The Leaderboard Illusion is a research paper, first posted to arXiv on April 29, 2025, that documents distortions in Chatbot Arena, the crowdsourced human-preference leaderboard operated by LMArena (formerly LMSYS). The paper was led by researchers at Cohere and Cohere Labs together with academic collaborators, and it argues that the most influential public ranking of large language models is shaped by policies that systematically favor a small number of large providers [1][2].
The authors contend that three interacting mechanisms inflate the Arena scores of major laboratories: undisclosed private testing of many model variants combined with selective disclosure of only the best result; the silent retraction or deprecation of weaker models; and asymmetric access to the Arena's user-prompt and battle data, which the authors say allows large providers to overfit to the Arena distribution [1]. The paper presents this as a case study in Goodhart's law, the principle that a measure ceases to be a good measure once it becomes a target.
Published shortly after the Llama 4 Arena controversy, in which Meta was found to have submitted an Arena-tuned variant distinct from its public release, the paper intensified scrutiny of leaderboard gaming and prompted a detailed public rebuttal from LMArena. The debate that followed pushed the community toward greater caution in interpreting leaderboard standings and led LMArena to announce several transparency reforms [3][4].
The paper is authored by Shivalika Singh, Yiyang Nan, Alex Wang, Daniel D'Souza, Sayash Kapoor, Ahmet Ustun, Sanmi Koyejo, Yuntian Deng, Shayne Longpre, Noah A. Smith, Beyza Ermis, Marzieh Fadaee, and Sara Hooker [1]. The lead and senior authors are affiliated with Cohere and Cohere Labs (Cohere's research arm), with collaborators drawn from institutions including Princeton University, Stanford University, the Allen Institute for AI, the University of Waterloo, and the University of Washington [2]. The arXiv version runs to roughly 68 pages including appendices, and the work was subsequently submitted for peer review.
Chatbot Arena launched in 2023 as a project of LMSYS, a research group associated with UC Berkeley, and later spun out under the LMArena name. The platform pits two anonymous models against each other on a user-supplied prompt; the user votes for the better response, and the aggregated votes are converted into a ranking using a Bradley-Terry style rating model (often described in terms of Elo) [1]. Because it aggregates real human preferences across millions of open-ended prompts rather than a fixed test set, Chatbot Arena became widely regarded as a more contamination-resistant signal than static AI benchmarks, and its top-line ranking grew into one of the most cited indicators of frontier model quality.
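To make the rating step concrete, the following is a minimal sketch of fitting Bradley-Terry strengths from pairwise vote tallies with the classic minorization-maximization iteration. It is not LMArena's actual pipeline: the model names, vote counts, iteration count, and the Elo-like mapping (base 1000, scale 400) are all illustrative assumptions.

```python
import math

def fit_bradley_terry(wins, iters=200):
    """Fit Bradley-Terry strengths from {(winner, loser): count} battle tallies.

    Sketch only: assumes every model has at least one win and one loss and
    that no model is paired against itself.
    """
    models = sorted({m for pair in wins for m in pair})
    total_wins = {m: 0 for m in models}
    games = {}  # unordered pair of models -> number of battles between them
    for (winner, loser), count in wins.items():
        total_wins[winner] += count
        pair = frozenset((winner, loser))
        games[pair] = games.get(pair, 0) + count

    strength = {m: 1.0 for m in models}
    for _ in range(iters):
        updated = {}
        for i in models:
            # MM update: wins of i divided by the sum over opponents j of
            # n_ij / (p_i + p_j), evaluated at the current strengths.
            denom = 0.0
            for pair, n in games.items():
                if i in pair:
                    (j,) = pair - {i}
                    denom += n / (strength[i] + strength[j])
            updated[i] = total_wins[i] / denom
        # Strengths are only defined up to scale, so fix the geometric mean at 1.
        log_mean = sum(math.log(v) for v in updated.values()) / len(updated)
        strength = {m: v / math.exp(log_mean) for m, v in updated.items()}
    return strength

def win_probability(strength, a, b):
    """Bradley-Terry probability that model a beats model b."""
    return strength[a] / (strength[a] + strength[b])

def elo_style(strength, base=1000.0, scale=400.0):
    """Map strengths onto an Elo-like scale (base and scale are assumed values)."""
    return {m: base + scale * math.log10(p) for m, p in strength.items()}

# Hypothetical vote tallies for three made-up models.
battles = {
    ("model_a", "model_b"): 60, ("model_b", "model_a"): 40,
    ("model_b", "model_c"): 70, ("model_c", "model_b"): 30,
    ("model_a", "model_c"): 80, ("model_c", "model_a"): 20,
}
strengths = fit_bradley_terry(battles)
for name, rating in sorted(elo_style(strengths).items(), key=lambda kv: -kv[1]):
    print(f"{name}: {rating:.0f}")
```

With this mapping, a rating difference between two models corresponds to the same win probability as the usual Elo formula, which is why Bradley-Terry scores are often discussed in Elo terms.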
That influence created strong incentives. Model providers increasingly cited Arena standings in launch announcements and marketing, and a number of laboratories openly optimized for Arena performance. The paper's central concern is that, as the leaderboard became a target, the conditions under which models are evaluated on it stopped being neutral.
The analysis draws on roughly 2 million Arena battles spanning 243 models from 42 providers, covering data from January 2024 through April 2025, supplemented by tracking of anonymous pre-release models from January to March 2025 [1]. The authors report three principal findings; the table below collects the key figures. These figures are the paper's own claims, and several were disputed by LMArena in its published rebuttal [3].
| Claim (as stated by the paper) | Reported figure |
|---|---|
| Private variants tested by Meta before Llama 4 | 27 [1][5] |
| Private variants observed from Google (Jan to Mar 2025) | about 10 [1] |
| Share of Arena test prompts to Google | 19.2% [1] |
| Share of Arena test prompts to OpenAI | 20.4% [1] |
| Combined share to 83 open-weight models | 29.7% [1] |
| Combined share to fully open-source models | 8.9% [1] |
| Models silently deprecated (vs. 47 officially listed) | 205 [1] |
| Win-rate gain on ArenaHard from 0% to 70% Arena training data | 23.5% to 49.9%, a 112% relative gain [1] |
| Score gap between two identical model checkpoints | about 17 points [1] |
The paper's most prominent claim is that some providers, especially large proprietary labs, are permitted to test many private model variants on the Arena before a public launch, then publish only the best-scoring variant while withdrawing the rest [1]. Because each variant is an independent draw from a noisy distribution, the authors argue that publishing the maximum of many draws is a best-of-N selection that mechanically inflates the reported score. They report identifying 27 private Meta variants in the run-up to Llama 4 and roughly 10 from Google in early 2025 [1][5]. To illustrate the variance the practice exploits, the authors say they submitted two identical checkpoints under different aliases and observed a score difference of about 17 points [1].
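The selection effect described above can be sketched with a small simulation. It assumes that every private variant has exactly the same true skill and that each measured score is that skill plus independent Gaussian noise. The true score of 1200, the noise standard deviation of 10 points, and the trial count are illustrative assumptions, not estimates from the paper; only the variant counts (10 and 27) echo the figures reported there.

```python
import random
import statistics

def mean_best_of_n(n_variants, true_score=1200.0, noise_sd=10.0,
                   trials=20000, seed=0):
    """Average published score when only the best of n noisy variants is kept.

    All variants share the same true_score, so any gap between the result
    and true_score comes purely from selecting the maximum.
    """
    rng = random.Random(seed)
    best_scores = [
        max(rng.gauss(true_score, noise_sd) for _ in range(n_variants))
        for _ in range(trials)
    ]
    return statistics.fmean(best_scores)

for n in (1, 10, 27):
    inflation = mean_best_of_n(n) - 1200.0
    print(f"best of {n:2d} variants: average inflation = {inflation:+.1f} points")
```

Under these assumptions a single submission shows no systematic inflation, while the expected published score rises with the number of variants tested, even though no variant is actually better than any other.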
The authors describe an asymmetry in how models leave the Arena. They report that 205 models were "silently" deprecated, meaning quietly removed or down-weighted without being marked as retired, compared with 47 models officially listed as deprecated [1]. They argue that proprietary models are sampled in more battles and removed less often than open-weight and open-source models, which both concentrates data on a few providers and allows weaker results to disappear from view rather than dragging down a provider's standing [1].
The third finding concerns access to the Arena's battle data, which the paper treats as a valuable training signal. The authors estimate that Google and OpenAI each received roughly a fifth of all Arena test prompts (19.2% and 20.4% respectively), while 83 open-weight models collectively received 29.7% and fully open-source models received only 8.9% [1]. To quantify the value of this access, they fine-tuned a model on varying proportions of Arena data and reported that raising the Arena share from 0% to 70% more than doubled the win-rate on the ArenaHard benchmark, from 23.5% to 49.9%, a relative gain of about 112% [1]. The authors interpret this as evidence that privileged data access enables overfitting to the Arena distribution, a form of benchmark contamination specific to the Arena's prompt distribution rather than to any fixed test set.
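The arithmetic behind the "more than doubled" and "about 112%" figures can be checked directly. The two win-rates below are the ones quoted above; the function and variable names are illustrative.

```python
def relative_gain(before, after):
    """Relative change from `before` to `after`, as a fraction of `before`."""
    return (after - before) / before

baseline_win_rate = 23.5  # % win-rate on ArenaHard with 0% Arena training data
arena_win_rate = 49.9     # % win-rate on ArenaHard with 70% Arena training data

absolute_gain = arena_win_rate - baseline_win_rate
gain = relative_gain(baseline_win_rate, arena_win_rate)
print(f"absolute gain: {absolute_gain:.1f} percentage points")  # 26.4
print(f"relative gain: {gain:.0%}")                             # 112%
```

The distinction matters when reading the claim: the win-rate rose by 26.4 percentage points, which is a 112% increase relative to the 23.5% baseline.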
The paper closes with concrete recommendations intended to reduce these distortions: prohibit the retraction of scores once a model has been evaluated; cap the number of private variants any provider may test simultaneously; make model deprecation stratified, transparent, and auditable; adopt variance-aware sampling so that ratings reflect genuine differences; and publish regular transparency reports on data access and pre-release testing [1].
LMArena published a detailed rebuttal disputing six of the paper's central claims and defending its policies [3].
Alongside the rebuttal, LMArena announced changes, including clearer marking of retired models, an explicit statement that providers may test multiple variants, and "provisional" scores for models tested pre-release when ten or more variants are evaluated at once, held provisional until roughly 2,000 additional post-release votes accumulate [3].
The paper landed during a period of heightened skepticism about leaderboard integrity. A few weeks earlier, the Llama 4 launch had drawn criticism when it emerged that the high-ranking "Llama-4-Maverick-Experimental" entry on Chatbot Arena was an Arena-optimized chat variant distinct from the publicly released weights, prompting LMArena to update its policies even before the paper appeared [4]. The Leaderboard Illusion crystallized those concerns into a quantified critique and was widely covered in the technical press, with the head of Cohere Labs publicly characterizing the reliability of leaderboard rankings as a crisis for the field [6].
Beyond the specific dispute with LMArena, the paper became a reference point in broader discussions of evaluation methodology. It reinforced the argument that any single headline metric, including a human-preference leaderboard once thought resistant to gaming, is vulnerable to Goodhart-style optimization, and it strengthened calls for transparency reforms such as disclosed pre-release testing, auditable deprecation, and reporting of data access. The episode is frequently cited as evidence that benchmark and leaderboard results should be read as contested claims rather than neutral facts, and that robust model evaluation requires multiple independent measures rather than reliance on a single ranking.