# LMArena

> Source: https://aiwiki.ai/wiki/lmarena_org
> Updated: 2026-06-28
> Categories: AI Benchmarks, AI Companies, Model Evaluation
> License: CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/)
> From AI Wiki (https://aiwiki.ai), the free encyclopedia of artificial intelligence. Reuse freely with attribution to "AI Wiki (aiwiki.ai)".

**LMArena** is a crowdsourced artificial intelligence evaluation platform and company that ranks [large language models](/wiki/llm) by having anonymous users vote on which of two blind, side-by-side model responses they prefer, then aggregating those pairwise human votes into a public leaderboard. It grew out of [Chatbot Arena](/wiki/lmsys_chatbot_arena), an academic research project launched in April 2023 by the Large Model Systems Organization ([LMSYS](/wiki/lmsys)) at the Sky Computing Lab of the [University of California, Berkeley](/wiki/uc_berkeley); the leaderboard has become one of the most widely referenced rankings in the AI industry.[^1][^2] The project was incorporated as a private company called Arena Intelligence Inc. (operating as LMArena) in April 2025. It raised a US$100 million seed round in May 2025, co-led by [Andreessen Horowitz](/wiki/andreessen_horowitz) and UC Investments, followed by a US$150 million Series A in January 2026 at a US$1.7 billion valuation, and it was rebranded simply as "Arena" later that month.[^3][^4][^5][^6] As of early 2026 the platform reported more than 5 million monthly users across 150 countries, more than 60 million conversations per month, and community votes on more than 400 models spanning text, vision, web development, search, video, and image generation.[^5][^6]

"In the arena, a user can chat with two anonymous models side-by-side and vote for which one is better," the LMSYS team wrote when it introduced Chatbot Arena on May 3, 2023; that single side-by-side voting mechanic remains the core of LMArena's evaluation method today.[^29]

This article covers LMArena as an organization, including its corporate history, founders, funding, product portfolio, evaluation methodology, and the controversies that have surrounded that methodology. For a deeper treatment of the underlying benchmark, see [Chatbot Arena](/wiki/lmsys_chatbot_arena).

## What is LMArena?

LMArena is both a public, free-to-use [benchmark](/wiki/benchmark) and the venture-backed company that operates it. Users submit a prompt, receive responses from two anonymized [large language models](/wiki/llm), and vote on the better answer (or declare a tie); the identities of the two models are revealed only after the vote. These crowdsourced, blind, pairwise human comparisons are then aggregated by a statistical rating model into an "Arena Score" leaderboard, originally computed with an online Elo update and later with offline Bradley-Terry maximum-likelihood estimation.[^7][^1] Because the votes come from anonymous real users rather than from a fixed set of test questions, supporters describe the Arena as measuring human preference "in the wild," capturing qualities such as helpfulness and fluency that static benchmarks can miss.[^17]
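The online Elo update mentioned above can be illustrated with a minimal sketch. It uses the standard Elo formula; the K-factor of 4, the starting rating of 1000, and the model names are assumptions chosen for illustration, not values confirmed by the sources cited here.

```python
from typing import Dict, Iterable, Tuple

Vote = Tuple[str, str, float]  # (model_a, model_b, outcome for model_a)


def expected_score(r_a: float, r_b: float, scale: float = 400.0) -> float:
    """Modelled probability that A is preferred over B, given current ratings."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / scale))


def elo_update(r_a: float, r_b: float, outcome: float, k: float = 4.0) -> Tuple[float, float]:
    """Apply one vote. outcome is 1.0 if A wins, 0.0 if B wins, 0.5 for a tie."""
    delta = k * (outcome - expected_score(r_a, r_b))
    return r_a + delta, r_b - delta


def run_elo(votes: Iterable[Vote], initial: float = 1000.0, k: float = 4.0) -> Dict[str, float]:
    """Process votes in arrival order and return the final rating of each model."""
    ratings: Dict[str, float] = {}
    for model_a, model_b, outcome in votes:
        r_a = ratings.get(model_a, initial)
        r_b = ratings.get(model_b, initial)
        ratings[model_a], ratings[model_b] = elo_update(r_a, r_b, outcome, k)
    return ratings


# Hypothetical votes between two placeholder models.
votes = [("model-x", "model-y", 1.0), ("model-x", "model-y", 0.0)]
print(run_elo(votes))
print(run_elo(list(reversed(votes))))  # same votes, different order, different ratings
```

Because each vote is applied against the ratings in force at that moment, the same set of votes processed in a different order yields slightly different final ratings. An offline fit over the whole vote set does not depend on order, which is one plausible reason such a fit gives more stable ratings.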

## Who created LMArena?

The roots of LMArena lie in a research effort begun in spring 2023 at Berkeley's Sky Computing Lab (often abbreviated SkyLab), a successor to the RISELab led by [Ion Stoica](/wiki/ion_stoica) and other faculty. A group of graduate students and faculty including Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Anastasios Nikolas Angelopoulos, Tianle Li, Dacheng Li, Hao Zhang, Banghua Zhu, Michael Jordan, Joseph E. Gonzalez, and Ion Stoica developed an open evaluation platform that asked crowdsourced users to compare anonymized model responses side by side and vote on which was better.[^7][^8] The system was hosted under a wider, multi-university collaboration called [LMSYS](/wiki/lmsys) (the Large Model Systems Organization), which had been incubating open-source AI research projects across Berkeley, Stanford, UC San Diego, Carnegie Mellon, and MBZUAI.[^9]

Chatbot Arena began operations in late April 2023 and was publicly announced on May 3, 2023.[^10] LMSYS at the time was also responsible for the [Vicuna](/wiki/vicuna) open-weight chat model and for FastChat, a multi-model serving framework that initially supported the Arena front end.[^11] Lianmin Zheng and Ying Sheng, both then PhD students at Berkeley advised by Stoica and Joseph E. Gonzalez, are commonly described as co-founders of LMSYS itself; Wei-Lin Chiang and Anastasios Angelopoulos were also PhD students working in the Sky Computing Lab.[^11][^12]

By March 2024, the Arena team had published a methodology paper, "Chatbot Arena: An Open Platform for Evaluating LLMs by Human Preference," on arXiv. Authored by Chiang, Zheng, Sheng, Angelopoulos, Tianle Li, Dacheng Li, Banghua Zhu, Hao Zhang, Michael Jordan, Joseph E. Gonzalez, and Ion Stoica, the paper described how more than 240,000 votes had been gathered and analyzed using statistical ranking methods.[^7] The paper was subsequently accepted to the International Conference on Machine Learning ([ICML](/wiki/icml)) and published in the conference's 2024 proceedings (PMLR volume 235).[^13]

LMSYS itself remained active after Chatbot Arena's growth. The organization formally incorporated as a non-profit in September 2024 with the stated aim of incubating early-stage open-source and research projects; its other flagship projects include SGLang, FastChat, and Vicuna. By 2025 LMSYS's own materials described Chatbot Arena as a "graduated" project, indicating that it had moved on from the umbrella organization to operate independently.[^9]

## When did LMArena become a company?

On April 17, 2025, the team announced that Chatbot Arena would be reorganized as a company called Arena Intelligence Inc., operating under the brand name LMArena.[^14] The new corporate structure marked the platform's formal separation from LMSYS and from Berkeley as an academic research project, though the team said the company would keep a research-first orientation.[^4]

Reporting on the spinout described it as a transition from a university-led, donation-funded effort to a venture-backed startup. Earlier financial support had come from a mix of academic resources, donations, and grants, including credits and grants from Google's Kaggle platform, Andreessen Horowitz, and Together AI.[^2] Until the company's formal incorporation, Chatbot Arena operated through Berkeley as a free, public evaluation service.

## Who are LMArena's founders?

According to filings, press releases, and company communications, LMArena's principal founders are:

- **Anastasios N. Angelopoulos**, chief executive officer. Angelopoulos was a PhD student in Berkeley's Electrical Engineering and Computer Sciences department working on topics including trustworthy AI systems, black-box decision-making, and uncertainty quantification, with prior research experience at Google DeepMind. He is a co-author of the original Chatbot Arena methodology paper.[^4][^7]
- **Wei-Lin Chiang**, chief technology officer. Chiang was also a Berkeley PhD student in SkyLab, advised by Ion Stoica, with prior research experience at Google Research, Amazon, and Microsoft. He has been the public face of Chatbot Arena since its launch in 2023 and is a co-author of the methodology paper and of the [Vicuna](/wiki/vicuna) open model.[^4][^7][^12]
- **[Ion Stoica](/wiki/ion_stoica)**, co-founder and advisor. Stoica is a Berkeley professor and a serial entrepreneur who previously co-founded [Databricks](/wiki/databricks), [Anyscale](/wiki/anyscale), and the network-monitoring company Conviva. He was both Chiang's PhD advisor and a co-author of the Chatbot Arena methodology paper.[^4][^7]

Coverage of the founding team has described Angelopoulos and Chiang as Berkeley roommates who began Chatbot Arena as a side project in 2023 while pursuing their PhDs.[^4][^15] The broader engineering and research team that built and operated Chatbot Arena before incorporation also included Lianmin Zheng, Ying Sheng, Tianle Li, Dacheng Li, Hao Zhang, Banghua Zhu, Joseph E. Gonzalez, and Michael Jordan, all co-authors on the methodology paper.[^7]

## How much funding has LMArena raised?

### Seed round (May 2025)

On May 21, 2025, approximately a month after the spinout was announced, LMArena disclosed that it had raised US$100 million in a seed round that valued the company at US$600 million.[^3][^16] The round was led by [Andreessen Horowitz](/wiki/andreessen_horowitz) and UC Investments, the investment office that manages the University of California system's endowment and other portfolios. Additional participants in the seed round were Lightspeed Venture Partners, Felicis Ventures, and Kleiner Perkins.[^3]

In its own announcement, Andreessen Horowitz framed the investment as a bet on LMArena becoming infrastructure for the AI industry, describing the platform as a "continuous integration pipeline for intelligence" and arguing that "Arena-tested" could become a recognizable seal of quality for AI models.[^17]

### Series A round (January 2026)

On January 6, 2026, LMArena announced a US$150 million Series A round that brought its post-money valuation to approximately US$1.7 billion, almost tripling the seed valuation roughly seven months earlier.[^5][^6] The Series A was led by Felicis and UC Investments, with participation from Andreessen Horowitz, The House Fund, LDVP, Kleiner Perkins, Lightspeed Venture Partners, and Laude Ventures.[^5][^6]

The Series A announcement disclosed several business metrics that had not previously been public. The company reported that its community platform served more than 5 million monthly users across 150 countries and was generating more than 60 million conversations per month. It also disclosed that its commercial AI evaluation product, launched in September 2025, had an annualized consumption run rate of more than US$30 million by December 2025.[^5][^6] In a statement accompanying the round, CEO Anastasios Angelopoulos said, "We cannot deploy AI responsibly without knowing how it delivers value to humans," adding, "To measure the real utility of AI, we need to put it in the hands of real users."[^5]

### Rebrand to "Arena"

Later in January 2026, the company announced that it would drop the "LM" prefix and operate simply as "Arena," migrating its main domain to arena.ai and redirecting traffic from lmarena.ai.[^18] The rebrand did not change the company's legal identity (Arena Intelligence Inc.) or the underlying community-driven evaluation platform.

## What arenas and products does LMArena run?

While Chatbot Arena began as a single side-by-side text comparison interface, LMArena has expanded the platform into a portfolio of category-specific "arenas," each targeting a different evaluation modality. Across these products, the community has contributed votes on more than 400 models spanning text, vision, web development, search, video, and image generation.[^6] Reported platforms include:

- **Text Chat Arena.** The original blind, side-by-side text comparison interface in which two anonymous models respond to a user's prompt and the user votes on which response is preferred. This remains the source of the headline Arena Score leaderboard.[^7]
- **Vision Arena.** A multimodal version of the Arena interface in which models receive an image and a text prompt; the platform added image support in June 2024 and video support in January 2026.[^1]
- **WebDev Arena.** A real-time coding arena focused on web development tasks. According to Contrary Research's profile of LMArena, WebDev Arena launched in December 2024 and had accumulated more than 80,000 votes by March 2025.[^4]
- **Search Arena.** A retrieval-augmented evaluation environment that tests models with web-search capabilities. Contrary Research reports that Search Arena launched in March 2025 with more than 7,000 votes across 11 models.[^4]
- **Copilot Arena.** A code-completion benchmark distributed as a Visual Studio Code extension; Contrary Research reports more than 2,500 downloads and more than 100,000 completions during the platform's growth phase.[^4]
- **RepoChat Arena.** A code-base-grounded chat arena launched in November 2024 with more than 12,000 conversations and more than 4,800 user votes by early 2025, according to Contrary Research.[^4]

The Series A announcement also described a separate, paid commercial evaluation product aimed at enterprises and AI developers, including [OpenAI](/wiki/openai), [Google](/wiki/google_deepmind), and xAI, with the company saying its evaluation work spans software engineering, law, medicine, and scientific research applications.[^5]

## How does LMArena rank models?

The Chatbot Arena methodology paper describes a pairwise comparison approach in which crowdsourced users compare two anonymous model responses and vote on which is better. To convert these votes into a leaderboard, the system originally calculated ratings with an online Elo update and later adopted offline Bradley-Terry maximum-likelihood estimation, which produces more stable ratings. The paper also discusses bootstrap confidence intervals and active sampling strategies designed to accelerate ranking convergence.[^7][^19] At the time of the original paper, more than 240,000 votes had been collected; by 2025 the platform reported having gathered millions of votes across hundreds of models.[^4][^7]
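As a rough illustration of the offline approach, the sketch below fits a Bradley-Terry model to a batch of votes and attaches percentile bootstrap intervals. It is not LMArena's implementation: the fitting routine is the textbook minorization-maximization iteration, ties are ignored, and the vote counts, model names, rating scale, and win-count floor are all assumptions made for the example.

```python
import math
import random
from collections import defaultdict


def fit_bradley_terry(votes, iters=200, scale=400.0, base=1000.0):
    """Fit Bradley-Terry strengths from (winner, loser) pairs with the classic
    minorization-maximization iteration, then map them to an Elo-like scale."""
    wins = defaultdict(float)   # total wins per model
    games = defaultdict(float)  # battles per unordered pair of models
    for winner, loser in votes:
        wins[winner] += 1.0
        games[frozenset((winner, loser))] += 1.0
    models = sorted({m for pair in votes for m in pair})
    strength = {m: 1.0 for m in models}
    for _ in range(iters):
        updated = {}
        for i in models:
            denom = sum(
                games.get(frozenset((i, j)), 0.0) / (strength[i] + strength[j])
                for j in models if j != i
            )
            # Crude guard (an assumption): floor the win count so that a model
            # with zero wins in a resample does not collapse to strength 0.
            updated[i] = max(wins[i], 1e-3) / denom
        # Strengths are only defined up to a constant; fix the geometric mean at 1.
        log_mean = sum(math.log(s) for s in updated.values()) / len(updated)
        strength = {m: s / math.exp(log_mean) for m, s in updated.items()}
    return {m: base + scale * math.log10(s) for m, s in strength.items()}


def bootstrap_intervals(votes, rounds=200, alpha=0.05, seed=0):
    """Percentile bootstrap: refit on votes resampled with replacement."""
    rng = random.Random(seed)
    samples = defaultdict(list)
    for _ in range(rounds):
        resample = rng.choices(votes, k=len(votes))
        for model, rating in fit_bradley_terry(resample).items():
            samples[model].append(rating)
    intervals = {}
    for model, ratings in samples.items():
        ratings.sort()
        lo = ratings[int((alpha / 2) * (len(ratings) - 1))]
        hi = ratings[int((1 - alpha / 2) * (len(ratings) - 1))]
        intervals[model] = (lo, hi)
    return intervals


# Hypothetical vote counts for three placeholder models.
votes = (
    [("model-x", "model-y")] * 70 + [("model-y", "model-x")] * 30
    + [("model-y", "model-z")] * 70 + [("model-z", "model-y")] * 30
    + [("model-x", "model-z")] * 80 + [("model-z", "model-x")] * 20
)
point = fit_bradley_terry(votes)
for model, (lo, hi) in sorted(bootstrap_intervals(votes).items()):
    print(f"{model}: {point[model]:.0f} (95% interval {lo:.0f} to {hi:.0f})")
```

Unlike a sequential Elo pass, the fit uses all votes at once, so shuffling the vote list leaves the ratings unchanged. The width of each bootstrap interval gives a sense of how much a rating would move if a different sample of voters had shown up.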

LMArena has continued to publish methodology updates after spinning out, including a "Prompt-to-Leaderboard" prediction method and category-specific rankings that condition on prompt types.[^4] The detailed mechanics of the leaderboard, including the use of the Bradley-Terry model and bootstrap confidence intervals, are discussed at greater length in the article on the [Chatbot Arena](/wiki/lmsys_chatbot_arena) benchmark; this article focuses on LMArena as an organization rather than restating those details.

## What controversies has LMArena faced?

### Llama 4 Maverick benchmark manipulation (April 2025)

LMArena became the focus of a major controversy in April 2025 when [Meta](/wiki/meta) released its [Llama 4](/wiki/llama_4) family. Meta announced that a version of [Llama 4 Maverick](/wiki/llama_4_scout_maverick) had reached the number-two spot on the Arena leaderboard with an Arena score of 1417.[^20] Observers quickly noticed, however, that the variant Meta had submitted to LMArena, identified on the platform as "Llama-4-Maverick-03-26-Experimental," was different from the publicly available release. The experimental version produced verbose, emoji-laden responses, while the public release was substantially more concise.[^20][^21]

LMArena responded by publishing a statement that "Meta should have made it clearer that Llama-4-Maverick-03-26-Experimental was a customized model to optimize for human preference" and said it would update its leaderboard policies.[^21] The company released more than 2,000 head-to-head battle results from the experimental version to allow public scrutiny.[^21] When the unmodified public release of Llama 4 Maverick was added to the leaderboard, reports indicated it ranked far lower, around the thirty-second position, well below older models from competing providers.[^22]

Roughly nine months later, in an interview reported in early 2026 around his departure from [Meta](/wiki/meta), [Yann LeCun](/wiki/yann_lecun), the company's outgoing chief AI scientist, told the Financial Times that the team had "fudged a little bit" by using different model variants for different benchmarks. LeCun said the practice contributed to internal turmoil at Meta after the Llama 4 launch.[^23]

### "The Leaderboard Illusion" critique

A second wave of scrutiny arrived later in April 2025 with the publication of "The Leaderboard Illusion," a research paper first posted to arXiv on April 29, 2025 (with a revised version on May 12, 2025) by a group of researchers from [Cohere](/wiki/cohere) Labs, Princeton, Stanford, MIT, the University of Waterloo, and other institutions.[^24] The authors, listed as Shivalika Singh, Yiyang Nan, Alex Wang, Daniel D'Souza, Sayash Kapoor, Ahmet Ustun, Sanmi Koyejo, Yuntian Deng, Shayne Longpre, Noah A. Smith, Beyza Ermis, Marzieh Fadaee, and Sara Hooker, analyzed roughly two million Arena battles across 243 models and 42 providers.[^24]

The paper made several claims about the Arena's evaluation pipeline:

- That a small set of proprietary providers were able to test many private variants and selectively disclose results, with Meta in particular reportedly testing 27 private LLM variants in the lead-up to the [Llama 4](/wiki/llama_4) release.[^24]
- That proprietary models from [OpenAI](/wiki/openai) and [Google](/wiki/google_deepmind) received approximately 19 to 20 percent of all Arena data each, while a combined group of 83 open-weight models received only about 29.7 percent.[^24]
- That hundreds of models had been "silently deprecated" from the leaderboard, with open-source models removed disproportionately.[^24]
- That, in simulation, the authors estimated that limited additional Arena data could lead to relative performance gains of up to 112 percent on Arena-like distributions, raising concerns about overfitting.[^24]

Sara Hooker, head of Cohere Labs, characterized the situation as a "crisis" in AI evaluation.[^25]

LMArena responded with a detailed blog post pushing back on several of the paper's claims.[^26] The company said that official Arena statistics published on April 27, 2025 showed open models accounted for 40.9 percent of the leaderboard, not the much lower share cited in the paper, arguing that the authors' calculation had excluded open-weight families such as Llama and Gemma.[^26] LMArena described one of the paper's plots as a "simulation using Gaussians with mean 1200 and an arbitrarily chosen variance," and said that, by its own estimates, pre-release testing yielded an increase of roughly "+11 Elo after 50 tests and 3000 votes," far less than the figure highlighted in the paper.[^26] The company also said its testing policies had been "publicly available for over a year, published on March 1, 2024," and disputed the characterization that there was an unstated policy.[^26]
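The statistical effect at the center of this disagreement, that reporting only the best of several noisy score estimates inflates the reported score, can be shown with a toy simulation. The sketch below is an illustration only and reproduces neither the paper's figures nor LMArena's: the mean of 1200 echoes the value LMArena quoted, while the standard deviation of 15 points, the trial count, and the assumption that all variants are equally strong are arbitrary choices.

```python
import random
import statistics


def expected_best_of_n(n_variants, mean=1200.0, sd=15.0, trials=5000, seed=0):
    """Average of the highest score among n_variants equally strong variants,
    each measured with independent Gaussian noise (sd is an arbitrary assumption)."""
    rng = random.Random(seed)
    best_scores = [
        max(rng.gauss(mean, sd) for _ in range(n_variants))
        for _ in range(trials)
    ]
    return statistics.fmean(best_scores)


# Compare testing a single variant with testing several and publishing the top one.
for n in (1, 5, 10, 27):
    print(n, round(expected_best_of_n(n), 1))
```

The size of the inflation scales directly with the assumed noise level, which is why the choice of variance matters so much to the two sides: a small measurement error yields a gain of a few points, while a large one yields a much bigger jump.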

At the same time, LMArena's response did promise several changes. It said it would explicitly state that providers can test multiple variants before release, increase clarity around model retirement and the marking of retired models, and, when 10 or more models had been pre-release tested simultaneously, mark new model scores as "provisional" until 2,000 fresh post-release votes had accumulated.[^26]

Independent commentators, including blogger and developer Simon Willison, framed the dispute as fundamentally one about transparency: the question was less whether LMArena had violated a documented rule than whether vendors should be required to disclose how many variants they had tested before publishing a leaderboard score.[^27]

### Vote rigging and sampling bias research

The "Leaderboard Illusion" paper was not the only academic critique. Earlier research, including a paper titled "Improving Your Model Ranking on Chatbot Arena by Vote Rigging," demonstrated that an attacker controlling only a few hundred votes could meaningfully shift Arena rankings under certain assumptions, raising concerns about the platform's resilience to coordinated voting.[^28] These concerns predate the spinout into LMArena and have been part of the broader academic conversation about preference-based benchmarks.
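To see why a modest number of votes can matter when two models are close, consider a deliberately simplified two-model case. With only two models, the Bradley-Terry rating gap has a closed form, 400 times the base-10 logarithm of the ratio of wins. The sketch below uses that formula with hypothetical vote counts; it is not the attack studied in the paper, which targets a full multi-model leaderboard and must first identify the anonymous model it wants to promote.

```python
import math


def rating_gap(wins_a: int, wins_b: int, scale: float = 400.0) -> float:
    """Two-model Bradley-Terry rating gap (A minus B) on an Elo-like scale."""
    return scale * math.log10(wins_a / wins_b)


def votes_needed_to_flip(wins_a: int, wins_b: int) -> int:
    """Smallest number of extra wins for B that puts B strictly ahead of A."""
    return max(0, wins_a - wins_b + 1)


# Hypothetical head-to-head record: A wins 51% of 5,000 honest battles.
honest_a, honest_b = 2550, 2450
extra = votes_needed_to_flip(honest_a, honest_b)
print(f"gap before: {rating_gap(honest_a, honest_b):.1f}")
print(f"votes needed to flip: {extra}")
print(f"gap after: {rating_gap(honest_a, honest_b + extra):.2f}")
```

In this toy setting the number of rigged votes needed equals the honest win margin plus one, so the closer two models are, the cheaper it is to swap their order.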

## Why does LMArena matter for the AI industry?

By the time of its Series A round, LMArena had become one of the most widely cited references for ranking general-purpose [large language models](/wiki/llm), with leading AI providers including [OpenAI](/wiki/openai), [Google](/wiki/google_deepmind), [Anthropic](/wiki/anthropic), xAI, and [Meta](/wiki/meta) either submitting models to the platform or being tracked on its public leaderboards.[^2][^5] Marketing claims about "Arena Score" or about ranking on the leaderboard have appeared in release materials from major model providers. The platform has also hosted pre-release entries such as [DeepSeek's](/wiki/deepseek) [R1](/wiki/deepseek_r1), variants of [GPT-5](/wiki/gpt-5) (which appeared on the platform under code names), and pre-release variants of Google's Gemini 2.5 family that surfaced under names such as "Nano Banana."[^1]

LMArena's defenders argue that, while imperfect, large-scale human-preference voting captures dimensions of model quality that static benchmarks miss, particularly around fluency, helpfulness, and alignment with user expectations.[^17] Critics, including some authors of the "Leaderboard Illusion" paper and commentators in the open-source community, argue that the platform's outsized influence has created a Goodhart-style dynamic, in which providers may optimize for Arena preferences rather than for general capability, and that structural disparities in sampling and access risk encoding a bias toward well-resourced proprietary labs.[^24][^25] The combination of LMArena's commercial growth, its expanding suite of specialized arenas, and the active critique of its methodology has made the company an important reference point in the broader debate over how to evaluate frontier AI systems.

## References

[^1]: "LMArena," Wikipedia. https://en.wikipedia.org/wiki/LMArena Accessed 2026-05-19.

[^2]: Maxwell Zeff, "LM Arena, the organization behind popular AI leaderboards, lands $100M," TechCrunch, May 21, 2025. https://techcrunch.com/2025/05/21/lm-arena-the-organization-behind-popular-ai-leaderboards-lands-100m/ Accessed 2026-05-19.

[^3]: "Chatbot Arena Group Goes From Academic Project to $600 Million Startup," Bloomberg News, May 21, 2025. https://www.bloomberg.com/news/articles/2025-05-21/lmarena-goes-from-academic-project-to-600-million-startup Accessed 2026-05-19.

[^4]: "Report: LMArena Business Breakdown & Founding Story," Contrary Research. https://research.contrary.com/company/lmarena Accessed 2026-05-19.

[^5]: "LMArena Raises $150 Million to Build the World's Most Trusted AI Evaluation Platform," PR Newswire, January 6, 2026. https://www.prnewswire.com/news-releases/lmarena-raises-150-million-to-build-the-worlds-most-trusted-ai-evaluation-platform-302653012.html Accessed 2026-05-19.

[^6]: "Fueling the World's Most Trusted AI Evaluation Platform," Arena (LMArena) blog, January 6, 2026. https://arena.ai/blog/series-a/ Accessed 2026-05-19.

[^7]: Wei-Lin Chiang, Lianmin Zheng, Ying Sheng, Anastasios Nikolas Angelopoulos, Tianle Li, Dacheng Li, Banghua Zhu, Hao Zhang, Michael Jordan, Joseph E. Gonzalez, and Ion Stoica, "Chatbot Arena: An Open Platform for Evaluating LLMs by Human Preference," arXiv:2403.04132, March 7, 2024. https://arxiv.org/abs/2403.04132 Accessed 2026-05-19.

[^8]: "Chatbot Arena," UC Berkeley Sky Computing Lab project page. https://sky.cs.berkeley.edu/project/chatbot-arena/ Accessed 2026-05-19.

[^9]: "About," LMSYS Org. https://www.lmsys.org/about/ Accessed 2026-05-19.

[^10]: "Arena (AI platform)," Grokipedia. https://grokipedia.com/page/LMSYS_Chatbot_Arena Accessed 2026-05-19.

[^11]: "The Sequence Chat: Lianmin Zheng, UC Berkeley About Vicuna, Chatbot Arena and the Open Source LLM Revolution," The Sequence. https://thesequence.substack.com/p/the-sequence-chat-lianmin-zheng-uc Accessed 2026-05-19.

[^12]: "About me," Wei-Lin Chiang personal website. https://infwinston.github.io/ Accessed 2026-05-19.

[^13]: Wei-Lin Chiang, Lianmin Zheng, Ying Sheng, Anastasios Nikolas Angelopoulos, Tianle Li, Dacheng Li, Banghua Zhu, Hao Zhang, Michael Jordan, Joseph E. Gonzalez, and Ion Stoica, "Chatbot Arena: An Open Platform for Evaluating LLMs by Human Preference," Proceedings of the 41st International Conference on Machine Learning, PMLR 235, 2024. https://proceedings.mlr.press/v235/chiang24b.html Accessed 2026-05-19.

[^14]: Chris McKay, "Chatbot Arena Rebrands as LMArena, Becomes a Company," Maginative, April 17, 2025. https://www.maginative.com/article/chatbot-arena-rebrands-as-lmarena-becomes-a-company/ Accessed 2026-05-19.

[^15]: "How two Berkeley roommates built a $1.7B startup that helps you decide which AI to use," Founded. https://www.founded.com/lmarena-arena-ai-ranking-tool-startup-founders/ Accessed 2026-05-19.

[^16]: "AI Evaluation Platform LMArena Raises Series A At Valuation Of $1.7 Billion," OfficeChai. https://officechai.com/ai/ai-evaluation-platform-lmarena-raises-series-a-at-valuation-of-1-7-billion/ Accessed 2026-05-19.

[^17]: "Investing in LMArena: The Reliability Layer for AI," Andreessen Horowitz. https://a16z.com/announcement/investing-in-lmarena-the-reliability-layer-for-ai/ Accessed 2026-05-19.

[^18]: "LMArena is now Arena," Arena blog. https://arena.ai/blog/lmarena-is-now-arena/ Accessed 2026-05-19.

[^19]: "Chatbot Arena," Proceedings of the 41st International Conference on Machine Learning, ACM Digital Library. https://dl.acm.org/doi/abs/10.5555/3692070.3692401 Accessed 2026-05-19.

[^20]: Tobias Mann, "Meta accused of Llama 4 bait-n-switch to juice LMArena rank," The Register, April 8, 2025. https://www.theregister.com/2025/04/08/meta_llama4_cheating/ Accessed 2026-05-19.

[^21]: "Llama 4 Scandal: Meta's release of Llama 4 overshadowed by cheating allegations on AI benchmark," Tech Startups, April 8, 2025. https://techstartups.com/2025/04/08/llama-4-scandal-metas-release-of-llama-4-overshadowed-by-cheating-allegations-on-ai-benchmark/ Accessed 2026-05-19.

[^22]: "Unmodified Llama 4 Maverick ranks below rivals following Meta cheating allegations," Neowin. https://www.neowin.net/news/unmodified-llama-4-maverick-ranks-below-rivals-following-meta-cheating-allegations/ Accessed 2026-05-19.

[^23]: "'Results Were Fudged': Departing Meta AI Chief Confirms Llama 4 Benchmark Manipulation," Slashdot. https://tech.slashdot.org/story/26/01/02/1449227/results-were-fudged-departing-meta-ai-chief-confirms-llama-4-benchmark-manipulation Accessed 2026-05-19.

[^24]: Shivalika Singh, Yiyang Nan, Alex Wang, Daniel D'Souza, Sayash Kapoor, Ahmet Ustun, Sanmi Koyejo, Yuntian Deng, Shayne Longpre, Noah A. Smith, Beyza Ermis, Marzieh Fadaee, and Sara Hooker, "The Leaderboard Illusion," arXiv:2504.20879, April 29, 2025 (revised May 12, 2025). https://arxiv.org/abs/2504.20879 Accessed 2026-05-19.

[^25]: "Cohere Labs head calls 'unreliable' AI leaderboard rankings a 'crisis' in the field," BetaKit. https://betakit.com/cohere-labs-head-calls-unreliable-ai-leaderboard-rankings-a-crisis-in-the-field/ Accessed 2026-05-19.

[^26]: "LMArena Response to 'The Leaderboard Illusion' Writeup," Arena (LMArena) blog. https://arena.ai/blog/our-response/ Accessed 2026-05-19.

[^27]: Simon Willison, "Understanding the recent criticism of the Chatbot Arena," April 30, 2025. https://simonwillison.net/2025/Apr/30/criticism-of-the-chatbot-arena/ Accessed 2026-05-19.

[^28]: "Improving Your Model Ranking on Chatbot Arena by Vote Rigging," arXiv:2501.17858. https://arxiv.org/html/2501.17858v1 Accessed 2026-05-19.

[^29]: "Chatbot Arena: Benchmarking LLMs in the Wild with Elo Ratings," LMSYS Org blog, May 3, 2023. https://lmsys.org/blog/2023-05-03-arena/ Accessed 2026-06-28.