LMArena
Last reviewed
May 19, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v1 · 3,256 words
LMArena is an artificial intelligence evaluation company and crowdsourced benchmarking platform that grew out of an academic research project at the University of California, Berkeley. The platform hosts blind, head-to-head comparisons of large language models in which anonymous users vote on which of two model responses they prefer, producing leaderboards that have become one of the most widely referenced rankings in the AI industry.[^1][^2] The project began in April 2023 as "Chatbot Arena" under the Large Model Systems Organization (LMSYS) at Berkeley's Sky Computing Lab. It was incorporated as a private company, Arena Intelligence Inc. (operating as LMArena), in April 2025, raised a US$100 million seed round in May 2025 and a US$150 million Series A round in January 2026, and was rebranded simply as "Arena" later that month.[^3][^4][^5][^6]
This article covers LMArena as an organization, including its corporate history, founders, funding, product portfolio, and the controversies that have surrounded its evaluation methodology. For a deeper treatment of the underlying benchmark, see Chatbot Arena.
The roots of LMArena lie in a research effort begun in spring 2023 at Berkeley's Sky Computing Lab (often abbreviated SkyLab), a successor to the RISELab led by Ion Stoica and other faculty. A group of graduate students and faculty including Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Anastasios Nikolas Angelopoulos, Tianle Li, Dacheng Li, Hao Zhang, Banghua Zhu, Michael Jordan, Joseph E. Gonzalez, and Ion Stoica developed an open evaluation platform that asked crowdsourced users to compare anonymized model responses side by side and vote on which was better.[^7][^8] The system was hosted under a wider, multi-university collaboration called LMSYS (the Large Model Systems Organization), which had been incubating open-source AI research projects across Berkeley, Stanford, UC San Diego, Carnegie Mellon, and MBZUAI.[^9]
Chatbot Arena began operations in late April 2023 and was publicly announced on May 3, 2023.[^10] LMSYS at the time was also responsible for the Vicuna open-weight chat model and for FastChat, a multi-model serving framework that initially supported the Arena front end.[^11] Lianmin Zheng and Ying Sheng, both then PhD students at Berkeley advised by Stoica and Joseph E. Gonzalez, are commonly described as co-founders of LMSYS itself; Wei-Lin Chiang and Anastasios Angelopoulos were also PhD students working in the Sky Computing Lab.[^11][^12]
By March 2024, the Arena team had published a methodology paper, "Chatbot Arena: An Open Platform for Evaluating LLMs by Human Preference," on arXiv. Authored by Chiang, Zheng, Sheng, Angelopoulos, Tianle Li, Dacheng Li, Banghua Zhu, Hao Zhang, Michael Jordan, Joseph E. Gonzalez, and Ion Stoica, the paper described how more than 240,000 votes had been gathered and analyzed using statistical ranking methods.[^7] The paper was subsequently accepted to the International Conference on Machine Learning (ICML) and published in the conference's 2024 proceedings (PMLR volume 235).[^13]
LMSYS itself remained active after Chatbot Arena's growth. The organization formally incorporated as a non-profit in September 2024 with the stated aim of incubating early-stage open-source and research projects; its other flagship projects include SGLang, FastChat, and Vicuna. By 2025 LMSYS's own materials described Chatbot Arena as a "graduated" project, indicating that it had moved on from the umbrella organization to operate independently.[^9]
On April 17, 2025, the team announced that Chatbot Arena would be reorganized as a company called Arena Intelligence Inc., operating under the brand name LMArena.[^14] The corporate structure marked the platform's formal separation from LMSYS and from Berkeley as an academic research project, though the team continued to maintain that the company would retain a research-first orientation.[^4]
Reporting on the spinout described it as a transition from a university-led, donation-funded effort to a venture-backed startup. Earlier financial support had come from a mix of academic resources, donations, and grants, including credits and grants from Google's Kaggle platform, Andreessen Horowitz, and Together AI.[^2] Until the company's formal incorporation, Chatbot Arena operated through Berkeley as a free, public evaluation service.
According to filings, press releases, and company communications, LMArena's principal founders are:
Coverage of the founding team has described Angelopoulos and Chiang as Berkeley roommates who began Chatbot Arena as a side project in 2023 while pursuing their PhDs.[^4][^15] The broader engineering and research team that built and operated Chatbot Arena before incorporation also included Lianmin Zheng, Ying Sheng, Tianle Li, Dacheng Li, Hao Zhang, Banghua Zhu, Joseph E. Gonzalez, and Michael Jordan, all co-authors on the methodology paper.[^7]
On May 21, 2025, approximately a month after the spinout was announced, LMArena disclosed that it had raised US$100 million in a seed round that valued the company at US$600 million.[^3][^16] The round was led by Andreessen Horowitz and UC Investments, the investment office that manages the University of California system's endowment and other portfolios. Additional participants in the seed round were Lightspeed Venture Partners, Felicis Ventures, and Kleiner Perkins.[^3]
In its own announcement, Andreessen Horowitz framed the investment as a bet on LMArena becoming infrastructure for the AI industry, describing the platform as a "continuous integration pipeline for intelligence" and arguing that "Arena-tested" could become a recognizable seal of quality for AI models.[^17]
On January 6, 2026, LMArena announced a US$150 million Series A round that brought its post-money valuation to approximately US$1.7 billion, almost tripling the seed valuation roughly seven months earlier.[^5][^6] The Series A was led by Felicis and UC Investments, with participation from Andreessen Horowitz, The House Fund, LDVP, Kleiner Perkins, Lightspeed Venture Partners, and Laude Ventures.[^5][^6]
The Series A announcement disclosed several business metrics that had not previously been public. The company reported that its community platform served more than 5 million monthly users across 150 countries and was generating more than 60 million conversations per month. It also disclosed that its commercial AI evaluation product, launched in September 2025, had an annualized consumption run rate of more than US$30 million by December 2025.[^5][^6] In a statement accompanying the round, CEO Anastasios Angelopoulos said, "We cannot deploy AI responsibly without knowing how it delivers value to humans."[^5]
Later in January 2026, the company announced that it would drop the "LM" prefix and operate simply as "Arena," migrating its main domain to arena.ai and redirecting traffic from lmarena.ai.[^18] The rebrand did not change the company's legal identity (Arena Intelligence Inc.) or the underlying community-driven evaluation platform.
While Chatbot Arena began as a single side-by-side text comparison interface, LMArena has expanded the platform into a portfolio of category-specific "arenas," each targeting a different evaluation modality. Across these products, the community has contributed votes on more than 400 models spanning text, vision, web development, search, video, and image generation.[^6] Reported platforms include:
The Series A announcement also described a separate, paid commercial evaluation product aimed at enterprises and AI developers, including OpenAI, Google, and xAI, with the company saying its evaluation work spans software engineering, law, medicine, and scientific research applications.[^5]
The Chatbot Arena methodology paper describes a pairwise comparison approach in which crowdsourced users compare two anonymous model responses and vote on which is better. To convert these votes into a leaderboard, the system originally calculated ratings using an online Elo update and later adopted offline Bradley-Terry maximum-likelihood estimation, which fits all votes jointly and so produces more stable ratings that do not depend on the order in which votes arrive. The paper also discusses bootstrap confidence intervals and active sampling strategies designed to accelerate ranking convergence.[^7][^19] As of the original paper, more than 240,000 votes had been collected; by 2025 the platform reported having gathered millions of votes across hundreds of models.[^4][^7]
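The difference between the two rating schemes can be illustrated with a minimal Python sketch. It is not LMArena's implementation: the function names, the K-factor of 32, the 1000-point base, the 400-point scale, and the toy vote data are all assumptions chosen for illustration, ties are ignored, and the Bradley-Terry fit uses the textbook minorization-maximization (Zermelo) iteration rather than whatever solver the platform runs.

```python
import math
import random
from collections import defaultdict


def elo_online(battles, k=32.0, base=1000.0):
    """Sequential Elo: each (winner, loser) vote nudges both ratings."""
    ratings = defaultdict(lambda: base)
    for winner, loser in battles:
        # Probability the winner was expected to win under the Elo model.
        expected = 1.0 / (1.0 + 10 ** ((ratings[loser] - ratings[winner]) / 400.0))
        delta = k * (1.0 - expected)
        ratings[winner] += delta
        ratings[loser] -= delta
    return dict(ratings)


def bradley_terry(battles, iters=200, base=1000.0, scale=400.0):
    """Offline Bradley-Terry fit via the classic MM (Zermelo) iteration."""
    wins = defaultdict(float)
    games = defaultdict(float)  # games played per unordered pair of models
    for winner, loser in battles:
        wins[winner] += 1.0
        games[frozenset((winner, loser))] += 1.0
    models = sorted({m for pair in games for m in pair})
    strength = {m: 1.0 for m in models}
    for _ in range(iters):
        updated = {}
        for m in models:
            denom = 0.0
            for pair, n in games.items():
                if m in pair:
                    (other,) = pair - {m}
                    denom += n / (strength[m] + strength[other])
            # Floor keeps a winless model's strength positive so the log is defined.
            updated[m] = max(wins[m], 1e-9) / denom
        # Strengths are identified only up to scale: fix the geometric mean at 1.
        log_mean = sum(math.log(s) for s in updated.values()) / len(updated)
        strength = {m: s / math.exp(log_mean) for m, s in updated.items()}
    return {m: base + scale * math.log10(strength[m]) for m in models}


def bootstrap_interval(battles, model, rounds=200, seed=0):
    """Percentile bootstrap: refit on resampled votes, report a 95% interval."""
    rng = random.Random(seed)
    scores = []
    for _ in range(rounds):
        resample = rng.choices(battles, k=len(battles))
        fitted = bradley_terry(resample, iters=50)
        if model in fitted:  # a resample can miss a rarely seen model
            scores.append(fitted[model])
    scores.sort()
    return scores[int(0.025 * len(scores))], scores[int(0.975 * len(scores)) - 1]


# Hypothetical votes: each tuple is (winner, loser) between distinct models.
battles = (
    [("A", "B")] * 7 + [("B", "A")] * 3
    + [("B", "C")] * 6 + [("C", "B")] * 4
    + [("A", "C")] * 8 + [("C", "A")] * 2
)
```

Calling `elo_online` on the same votes in a different order generally returns different ratings, whereas `bradley_terry` depends only on the win counts, which is one way to read the stability claim above. `bootstrap_interval(battles, "A")` gives a rough uncertainty band for one model under these toy assumptions.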
LMArena has continued to publish methodology updates after spinning out, including a "Prompt-to-Leaderboard" prediction method and category-specific rankings that condition on prompt types.[^4] The detailed mechanics of the leaderboard, including the use of the Bradley-Terry model and bootstrap confidence intervals, are discussed at greater length in the article on the Chatbot Arena benchmark; this article focuses on LMArena as an organization rather than restating those details.
LMArena became the focus of a major controversy in April 2025 when Meta released its Llama 4 family. Meta announced that a version of Llama 4 Maverick had reached the number-two spot on the Arena leaderboard with an Arena score of 1417.[^20] Observers quickly noticed, however, that the variant Meta had submitted to LMArena, identified on the platform as "Llama-4-Maverick-03-26-Experimental," was different from the publicly available release. The experimental version produced verbose, emoji-laden responses, while the public release was substantially more concise.[^20][^21]
LMArena responded by publishing a statement that "Meta should have made it clearer that Llama-4-Maverick-03-26-Experimental was a customized model to optimize for human preference" and said it would update its leaderboard policies.[^21] The company released more than 2,000 head-to-head battle results from the experimental version to allow public scrutiny.[^21] When the unmodified public release of Llama 4 Maverick was added to the leaderboard, reports indicated it ranked far lower, around the thirty-second position, well below older models from competing providers.[^22]
Months later, in an interview reported in early 2026 around his departure from Meta, Yann LeCun, Meta's outgoing chief AI scientist, told the Financial Times that the team had "fudged a little bit" by using different model variants for different benchmarks. LeCun said the practice contributed to internal turmoil at Meta after the Llama 4 launch.[^23]
A second wave of scrutiny arrived later in April 2025 with the publication of "The Leaderboard Illusion," a research paper first posted to arXiv on April 29, 2025 (with a revised version on May 12, 2025) by a group of researchers from Cohere Labs, Princeton, Stanford, MIT, the University of Waterloo, and other institutions.[^24] The authors, listed as Shivalika Singh, Yiyang Nan, Alex Wang, Daniel D'Souza, Sayash Kapoor, Ahmet Ustun, Sanmi Koyejo, Yuntian Deng, Shayne Longpre, Noah A. Smith, Beyza Ermis, Marzieh Fadaee, and Sara Hooker, analyzed roughly two million Arena battles across 243 models and 42 providers.[^24]
The paper made several claims about the Arena's evaluation pipeline:
Sara Hooker, head of Cohere Labs, characterized the situation as a "crisis" in AI evaluation.[^25]
LMArena responded with a detailed blog post pushing back on several of the paper's claims.[^26] The company said that official Arena statistics published on April 27, 2025 showed open models accounted for 40.9 percent of the leaderboard, not the much lower share cited in the paper, arguing that the authors' calculation had excluded open-weight families such as Llama and Gemma.[^26] LMArena described one of the paper's plots as a "simulation using Gaussians with mean 1200 and an arbitrarily chosen variance," and said that, by its own estimates, pre-release testing yielded an increase of roughly "+11 Elo after 50 tests and 3000 votes," far less than the figure highlighted in the paper.[^26] The company also said its testing policies had been "publicly available for over a year, published on March 1, 2024," and disputed the characterization that there was an unstated policy.[^26]
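The disagreement over the Gaussian simulation turns on how much a provider can gain by testing several private variants and publishing only the best score. The following Python sketch is a toy version of that selection effect and reproduces neither the paper's plot nor LMArena's estimate. It assumes every variant has the same true score of 1200 and that each measured score carries independent Gaussian noise; the noise levels, trial count, and function names are illustrative assumptions.

```python
import random
import statistics


def mean_best_of_n(n_variants, sd, mean=1200.0, trials=20000, seed=0):
    """Average published score when only the best of n noisy measurements is kept."""
    rng = random.Random(seed)
    best = [
        max(rng.gauss(mean, sd) for _ in range(n_variants))
        for _ in range(trials)
    ]
    return statistics.fmean(best)


def selection_lift(n_variants, sd):
    """Score gain from picking the best of n variants over a single submission."""
    return mean_best_of_n(n_variants, sd) - mean_best_of_n(1, sd)


# Assumed noise levels (in rating points), chosen only to show the scaling.
for sd in (5.0, 10.0, 20.0):
    print(f"sd={sd:>4}: best-of-10 lift = {selection_lift(10, sd):.1f}")
```

In this toy model the lift grows in direct proportion to the assumed standard deviation, so the headline number depends heavily on the variance chosen, which is the point LMArena's response pressed.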
At the same time, LMArena's response did promise several changes: it said it would explicitly state that providers can test multiple variants before release, increase clarity around model retirement and the marking of retired models, and mark new model scores as "provisional" until 2,000 fresh post-release votes had accumulated when 10 or more models had been pre-release tested simultaneously.[^26]
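The provisional-score rule can be written as a simple check. The Python sketch below is one reading of the policy as summarized here, not LMArena's code: the `ModelEntry` type and its field names are hypothetical, and it assumes the count of "10 or more" refers to the variants tested together before the model's release.

```python
from dataclasses import dataclass


@dataclass
class ModelEntry:
    name: str
    pre_release_variants: int  # variants tested together before release (assumed meaning)
    fresh_votes: int           # post-release votes collected so far


def is_provisional(entry, variant_threshold=10, vote_threshold=2000):
    """Provisional while heavily pre-tested and short of fresh post-release votes."""
    return (
        entry.pre_release_variants >= variant_threshold
        and entry.fresh_votes < vote_threshold
    )
```

Under this reading, a model released after 12 private variants stays provisional until its 2,000th fresh vote, while a model tested with only three variants is never marked provisional.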
Independent commentators, including blogger and developer Simon Willison, framed the dispute as fundamentally one about transparency: the question was less whether LMArena had violated a documented rule than whether vendors should be required to disclose how many variants they had tested before publishing a leaderboard score.[^27]
The "Leaderboard Illusion" paper was not the only academic critique. Earlier research, including a paper titled "Improving Your Model Ranking on Chatbot Arena by Vote Rigging," demonstrated that an attacker controlling only a few hundred votes could meaningfully shift Arena rankings under certain assumptions, raising concerns about the platform's resilience to coordinated voting.[^28] These concerns predate the spinout into LMArena and have been part of the broader academic conversation about preference-based benchmarks.
By the time of its Series A round, LMArena had become one of the most widely cited references for ranking general-purpose large language models, with leading AI providers including OpenAI, Google, Anthropic, xAI, and Meta either submitting models to the platform or being tracked on its public leaderboards.[^2][^5] Major model providers have cited their "Arena Score" or leaderboard rank in release materials, and several models have appeared on the platform before release, including DeepSeek's R1, variants of GPT-5 (under code names), and pre-release variants of Google's Gemini 2.5 family that surfaced under names such as "Nano Banana."[^1]
LMArena's defenders argue that, while imperfect, large-scale human-preference voting captures dimensions of model quality that static benchmarks miss, particularly around fluency, helpfulness, and alignment with user expectations.[^17] Critics, including some authors of the "Leaderboard Illusion" paper and commentators in the open-source community, argue that the platform's outsized influence has created a Goodhart-style dynamic, in which providers may optimize for Arena preferences rather than for general capability, and that structural disparities in sampling and access risk encoding a bias toward well-resourced proprietary labs.[^24][^25] The combination of LMArena's commercial growth, its expanding suite of specialized arenas, and the active critique of its methodology has made the company an important reference point in the broader debate over how to evaluate frontier AI systems.