Chatbot Arena (now branded simply as Arena, and previously known as LMArena) is a crowdsourced evaluation platform for large language models that ranks AI systems based on human preferences through anonymous pairwise comparisons. Created by researchers at UC Berkeley in May 2023, the platform has grown from a small academic project into one of the most widely cited and trusted sources for comparing the capabilities of frontier AI models. By early 2026, Arena had collected over 6 million user votes across hundreds of models and rebranded as an independent company valued at $1.7 billion.
Chatbot Arena emerged from UC Berkeley's Sky Computing Lab in the spring of 2023. The project was led by a group of graduate students and faculty, including Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Anastasios Nikolas Angelopoulos, Hao Zhang, and professors Joseph E. Gonzalez and Ion Stoica. The team collectively operated under the name LMSYS Org (Large Model Systems Organization), which also maintained related open-source projects like Vicuna and FastChat.
The original motivation was straightforward: existing benchmarks for language models were too narrow. Benchmarks like MMLU or HellaSwag tested specific capabilities under controlled conditions, but they did not capture how well a model actually performed in open-ended conversation with real users. Zheng and his collaborators wanted a system that would let ordinary people compare models head-to-head on whatever questions they cared about, with the results aggregated into a global ranking.
The first version of Chatbot Arena went live in May 2023. Within months, the platform had deployed over 45 models and collected more than 130,000 valid votes. By early 2024, the research paper describing the platform (Chiang, Zheng et al., 2024) had been published at ICML 2024, and the leaderboard was being cited in press releases by OpenAI, Google, Anthropic, and other major AI labs as evidence of their models' performance.
The core mechanic of Chatbot Arena is a blind pairwise comparison. When a user visits the platform, they can type any prompt they want. The system sends the prompt to two randomly selected models and displays both responses side by side. The user does not know which models they are evaluating. After reading both responses, the user votes for whichever answer they prefer, or they can declare a tie. Only after voting are the model identities revealed.
This design addresses several problems with traditional benchmarks. Because users write their own prompts, the evaluation covers a vast and continually refreshed distribution of real-world use cases, from creative writing and coding to math problems and factual questions. Because the comparison is blind, there is no brand-name bias influencing votes. And because the voting is binary (A vs. B), users do not need to calibrate an absolute quality scale.
In summary, each battle follows a fixed workflow: the user submits a prompt, two anonymized models respond side by side, the user votes for a winner or declares a tie, and only then are the model identities revealed.
With millions of votes from anonymous users, quality control is essential. The Arena employs several strategies:
Spam and bot detection. Automated systems filter out votes from bots, users who vote without reading the responses, and other low-quality interactions.
Prompt diversity tracking. The platform monitors the distribution of prompt topics to ensure broad coverage and flag periods where a particular type of prompt (for example, "which is the better model?") dominates.
Vote deduplication. Repeated votes from the same user on the same model pair within short time windows are deduplicated to prevent individual users from disproportionately influencing rankings.
Demographic awareness. The platform acknowledges that its user base skews toward English-speaking, technically sophisticated users. This is a known limitation rather than a controlled variable; the ratings reflect this population's preferences, which may not generalize to other demographics.
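The deduplication step above can be sketched as a simple time-window filter. This is an illustrative sketch, not Arena's actual implementation: the field names and the 24-hour window are assumptions.

```python
from datetime import datetime, timedelta

# Hypothetical dedup window; Arena's actual window is not public.
DEDUP_WINDOW = timedelta(hours=24)

def deduplicate(votes):
    """Keep a user's first vote on a model pair; drop repeats in the window.

    votes: list of dicts with 'user', 'pair' (frozenset of two model
    names), and 'time' (datetime). Processed in chronological order.
    """
    last_kept = {}  # (user, pair) -> timestamp of last kept vote
    kept = []
    for v in sorted(votes, key=lambda v: v["time"]):
        key = (v["user"], v["pair"])
        if key in last_kept and v["time"] - last_kept[key] < DEDUP_WINDOW:
            continue  # repeat vote inside the window: discard
        last_kept[key] = v["time"]
        kept.append(v)
    return kept
```

A repeat vote from the same user on the same pair within the window is discarded, while votes on different pairs, from different users, or outside the window all pass through.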
The platform originally used the Elo rating system, borrowed from competitive chess, to convert pairwise vote outcomes into a single numerical score for each model. In the Elo system, when a lower-rated model beats a higher-rated one, the ratings shift more dramatically than when the expected winner prevails.
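The Elo update rule can be written in a few lines. This is the standard chess formulation; the K factor of 32 is a common chess default, not a value documented for Arena.

```python
# Standard Elo update: an upset (lower-rated side wins) shifts both
# ratings further than an expected result does. K=32 is a common chess
# default and an assumption here.
def elo_update(r_winner, r_loser, k=32):
    expected_win = 1 / (1 + 10 ** ((r_loser - r_winner) / 400))
    delta = k * (1 - expected_win)  # small when the win was expected
    return r_winner + delta, r_loser - delta
```

For example, a 1000-rated model beating a 1200-rated one gains about 24 points, while the reverse (expected) outcome moves the ratings by only about 8.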
LMSYS later transitioned to the Bradley-Terry (BT) model, a statistical framework from 1952 that estimates the probability of one player beating another based on their latent "strength" parameters. The BT model works by assigning each model i a strength parameter p_i, and the probability that model i beats model j is calculated as:
P(i beats j) = p_i / (p_i + p_j)
The strength parameters are estimated via maximum likelihood estimation (MLE) across all collected matchups. This approach is mathematically related to Elo (Elo ratings can be viewed as an online approximation of BT coefficients), but fitting BT via MLE across all data produces more stable and statistically principled rankings.
The key difference is that the Elo system weights recent games more heavily (it is an online algorithm that updates after each game), while the BT model treats all games equally and recomputes from scratch. For a platform where model capabilities do not change over time (unlike chess players who improve or decline), the BT model is more appropriate.
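Fitting BT coefficients by MLE can be sketched as follows. Using the log-parameterization p_i = exp(beta_i), the win probability becomes a logistic function of the strength difference; one coefficient is pinned to zero for identifiability. This is a minimal sketch of the technique, not Arena's production pipeline.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import expit  # logistic sigmoid

def fit_bradley_terry(battles, n_models):
    """MLE fit of Bradley-Terry strengths.

    battles: list of (winner_index, loser_index) pairs.
    With p_i = exp(beta_i), P(i beats j) = sigmoid(beta_i - beta_j).
    """
    w = np.array([b[0] for b in battles])
    l = np.array([b[1] for b in battles])

    def nll_free(free):
        # Pin beta[0] = 0 so the strengths are identifiable.
        beta = np.concatenate(([0.0], free))
        return -np.sum(np.log(expit(beta[w] - beta[l])))

    res = minimize(nll_free, np.zeros(n_models - 1), method="BFGS")
    return np.concatenate(([0.0], res.x))
```

In a two-model example where model 1 wins 8 of 10 battles, the fitted coefficients recover an implied win probability of about 0.8. The resulting betas can then be rescaled onto the familiar Elo-like scale for display.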
The platform computes 95% confidence intervals through bootstrapping: the rating calculation is repeated many times with random subsamples of the vote data, and the spread of results indicates how reliable each model's ranking is. Models with fewer votes have wider confidence intervals.
The Arena also maintains separate category-specific leaderboards, including rankings for:
| Category | What it measures |
|---|---|
| Overall | General conversational quality |
| Coding | Programming and debugging ability |
| Math | Mathematical reasoning and problem-solving |
| Hard Prompts | Performance on difficult, complex queries |
| Instruction Following | Ability to follow precise instructions |
| Multilingual | Performance in non-English languages |
| Creative Writing | Quality of creative and narrative outputs |
| Vision | Multimodal image understanding |
This lets users see which models excel in particular domains rather than relying solely on the overall ranking.
A significant methodological innovation introduced in August 2024 was style control, which attempts to separate what a model says from how it says it. Research by the LMSYS team found that user preferences were significantly influenced by stylistic factors, particularly response length.
The team identified four primary style features and their influence coefficients:
| Style Factor | Influence Coefficient | Impact |
|---|---|---|
| Answer token length | 0.249 | Dominant factor |
| Markdown lists | 0.031 | Second-order |
| Markdown headers | 0.024 | Second-order |
| Markdown bold text | 0.019 | Second-order |
Response length emerged as the overwhelmingly dominant style factor, with all other markdown formatting effects being second-order. Models that generated longer, more formatted responses tended to win more battles, not necessarily because their content was better, but because users preferred the appearance of thoroughness.
The style control approach uses a modified Bradley-Terry regression model that treats style as an independent variable. Each style feature is measured as a normalized difference between the two responses:
style_difference = normalize((feature_A - feature_B) / (feature_A + feature_B))
This proportional normalization accounts for context: a 20-token difference matters more when comparing 20-token versus 40-token responses than when comparing 500-token versus 520-token responses.
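The proportional difference from the formula above can be computed directly. The outer normalize() is taken here to be standardization (z-scoring) across all battles; that choice is an assumption, not Arena's documented implementation.

```python
import numpy as np

def proportional_diff(feature_a, feature_b):
    """Per-battle proportional difference of a style feature."""
    a = np.asarray(feature_a, dtype=float)
    b = np.asarray(feature_b, dtype=float)
    return (a - b) / (a + b)

def style_difference(feature_a, feature_b):
    """Standardize the proportional differences across all battles.

    z-scoring is an assumed choice for normalize(); other scalings
    would work similarly.
    """
    raw = proportional_diff(feature_a, feature_b)
    return (raw - raw.mean()) / raw.std()
```

A 20-token gap between 20- and 40-token responses yields a proportional difference of about -0.33, while the same gap between 500- and 520-token responses yields only about -0.02, matching the intuition above.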
When style was controlled for, notable ranking shifts occurred: models whose wins rested on substance rose, while models whose wins rested on stylistic presentation fell.
These shifts revealed that some models were achieving high Arena scores partly through verbose, well-formatted responses rather than superior reasoning or factual accuracy.
Recognizing that live human evaluation is slow and expensive, the LMSYS team developed Arena-Hard-Auto as a complementary offline benchmark. Arena-Hard-Auto consists of 500 challenging prompts sampled from real Chatbot Arena conversations, specifically chosen from 250 high-quality topic clusters (two prompts per cluster) where the questions were difficult enough to reliably separate strong models from weaker ones.
The evaluation uses an LLM-as-a-judge approach: a powerful model (originally GPT-4-1106-preview) scores each test model's responses against a baseline model. To mitigate position bias, every comparison is run twice with the response order swapped, yielding 1,000 judgments per model. The results are then converted to Bradley-Terry scores.
Arena-Hard-Auto has the highest correlation with live Chatbot Arena rankings among popular open-ended LLM benchmarks, making it a useful proxy when live testing is impractical. The prompts and evaluation code are publicly available on GitHub.
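The order-swapping step can be sketched as follows. Here `judge` is a hypothetical stand-in for a call to the judge model; it returns "A", "B", or "tie" for the pair as presented. With 500 prompts and two orderings each, this yields the 1,000 judgments per model described above.

```python
def evaluate(prompts, model_answer, baseline_answer, judge):
    """Judge each prompt twice with the answer order swapped.

    judge(prompt, answer_a, answer_b) -> "A", "B", or "tie".
    model_answer / baseline_answer map a prompt to a response string.
    Returns (wins, losses, ties) for the test model.
    """
    wins = losses = ties = 0
    for p in prompts:
        for model_first in (True, False):
            if model_first:
                verdict = judge(p, model_answer(p), baseline_answer(p))
                model_label = "A"
            else:
                verdict = judge(p, baseline_answer(p), model_answer(p))
                model_label = "B"
            if verdict == model_label:
                wins += 1
            elif verdict == "tie":
                ties += 1
            else:
                losses += 1
    return wins, losses, ties
```

A judge that always prefers the first position ends up splitting its verdicts 50/50 under this scheme, which is exactly the position bias the swap is designed to cancel.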
The Arena leaderboard is updated continuously as new votes come in. The following table shows the top models on the overall text leaderboard as of early March 2026.
| Rank | Model | Organization | Arena Score |
|---|---|---|---|
| 1 | claude-opus-4-6 | Anthropic | 1504 |
| 2 | gemini-3.1-pro-preview | Google | 1500 |
| 3 | claude-opus-4-6-thinking | Anthropic | 1500 |
| 4 | grok-4.20-beta1 | xAI | 1493 |
| 5 | gemini-3-pro | Google | 1492 |
| 6 | grok-4.1-thinking | xAI | 1482 |
| 7 | gemini-3-flash | Google | 1470 |
| 8 | claude-opus-4.5-thinking-32k | Anthropic | 1466 |
| 9 | GPT-5.2-high | OpenAI | 1465 |
| 10 | GPT-5.1-high | OpenAI | 1464 |
| 11 | grok-4.1 | xAI | 1463 |
| 12 | claude-opus-4.5 | Anthropic | 1462 |
| 13 | ERNIE-5.0 | Baidu | 1461 |
| 14 | gemini-2.5-pro | Google | 1460 |
| 15 | GLM-4.7 | Zhipu AI | 1445 |
Ratings shift as new votes accumulate. Models marked as "Preliminary" have fewer votes and wider confidence intervals.
In April 2025, a research paper titled "The Leaderboard Illusion" (authored by researchers from Cohere, Stanford, MIT, and the Allen Institute for AI) accused LMArena of enabling systematic manipulation of the leaderboard by major AI companies. The paper's central claim was that LMArena had allowed companies like Meta, OpenAI, Google, and Amazon to privately test multiple model variants on the platform, then publicly reveal only the scores of the best-performing variants while burying the rest.
The most high-profile case involved Meta's release of Llama 4 in early 2025. According to the paper, Meta had privately tested 27 different model variants on Chatbot Arena between January and March 2025. Only one variant, which had been specifically optimized for "conversationality" (a quality that tends to perform well in Arena's preference-based voting), was publicly listed near the top of the leaderboard. Meta never released this optimized model to the public; the vanilla Llama 4 version that users could actually download performed significantly worse.
The specific model submitted was "Llama-4-Maverick-03-26-Experimental," a custom variant of Llama 4 Maverick that had been tuned for human preference. Analysis revealed striking differences between the Arena version and the public release:
| Characteristic | Arena version | Public release |
|---|---|---|
| Response style | Verbose, emoji-heavy | Concise, no emojis |
| Average response length | ~6,978 characters | ~2,982 characters |
| Longer response rate | 98.6% of comparisons | Baseline |
| Leaderboard position | Near top | Significantly lower |
The Arena version's responses were longer than the public version in 98.6% of comparisons, averaging more than twice the character count. Given the documented length bias in Arena voting, this optimization alone would have significantly inflated the model's ranking.
LMSYS acknowledged that "style and model response tone was an important factor" and that this information "was not made sufficiently clear by the Meta team." Meta's VP of Generative AI, Ahmad Al-Dahle, denied foul play, stating that the company did not train on the Arena test sets and that differences were "platform-specific quirks."
This is a textbook example of Goodhart's Law: when a measure becomes a target, it ceases to be a good measure. By optimizing specifically for Arena performance rather than general capability, companies could inflate their rankings without delivering corresponding improvements to users.
The Leaderboard Illusion paper documented similar private-testing patterns at other major companies, including OpenAI, Google, and Amazon.
LMArena co-founder Ion Stoica pushed back against the paper, calling its claims "full of inaccuracies" and "questionable." The platform subsequently introduced new transparency measures, including public logs of all model submissions and a policy requiring that any model tested on the platform be made available for public use within a specified time window.
The controversy highlighted a broader tension in AI evaluation: the more influential a benchmark becomes, the greater the incentive to game it.
Chatbot Arena's trajectory from academic side project to venture-backed startup unfolded rapidly.
| Date | Event |
|---|---|
| May 2023 | Chatbot Arena launched at UC Berkeley's Sky Computing Lab |
| September 2024 | Platform moved to its own domain, lmarena.ai |
| April 2025 | LMArena incorporated as an independent company |
| May 2025 | Raised $100 million seed round at $600 million valuation (led by Andreessen Horowitz) |
| January 6, 2026 | Raised $150 million Series A at $1.7 billion valuation |
| January 28, 2026 | Rebranded from LMArena to "Arena" |
The founding team shifted into corporate roles: Anastasios Angelopoulos became CEO, Wei-Lin Chiang became CTO, and Ion Stoica remained involved in an advisory capacity. Investors included Andreessen Horowitz, UC Investments, Lightspeed Venture Partners, Felicis Ventures, and Kleiner Perkins.
The rebranding to "Arena" in January 2026 reflected the platform's expansion beyond text-only language models. By that point, Arena had added support for image generation models, video generation, and multimodal systems, making the "LM" (language model) prefix too restrictive.
The Arena computes confidence intervals by repeatedly resampling the vote data with replacement (bootstrapping). For each bootstrap iteration, a new set of Bradley-Terry coefficients is estimated, producing a distribution of possible ratings for each model. The 2.5th and 97.5th percentiles of this distribution form the 95% confidence interval.
Models with fewer votes have wider confidence intervals. A model with 10,000 votes might have a confidence interval of plus or minus 5 points, while a model with only 500 votes might have an interval of plus or minus 20 points. The platform marks models with wide intervals as "Preliminary" to signal that their rankings are less reliable.
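The bootstrap procedure can be sketched as follows. Here `fit_rating` stands in for the full Bradley-Terry fit; for illustration it is any function mapping a resampled set of battle outcomes (1 = win, 0 = loss for the model of interest) to a rating.

```python
import numpy as np

def bootstrap_ci(outcomes, fit_rating, n_boot=1000, seed=0):
    """95% bootstrap confidence interval for a model's rating.

    Resamples the vote data with replacement, refits the rating each
    time, and takes the 2.5th and 97.5th percentiles of the results.
    """
    rng = np.random.default_rng(seed)
    outcomes = np.asarray(outcomes)
    ratings = []
    for _ in range(n_boot):
        sample = rng.choice(outcomes, size=len(outcomes), replace=True)
        ratings.append(fit_rating(sample))
    return np.percentile(ratings, 2.5), np.percentile(ratings, 97.5)
```

Running this with a toy rating function on 10,000 simulated votes versus 500 reproduces the pattern described above: the smaller sample yields a markedly wider interval.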
Not all votes contribute equally to the rankings. The platform uses several weighting schemes:
Conversation quality. Votes from longer, more substantive conversations may receive higher weight than votes from single-turn, trivial prompts.
User diversity. The system tracks whether a disproportionate number of votes for a particular model come from a small number of users, which could indicate coordinated manipulation.
Temporal decay. While the BT model treats all votes equally by default, the platform periodically resets or downweights older votes when models are updated, since votes collected against an older version of a model are no longer relevant.
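One way the weighting schemes above can be folded into the Bradley-Terry fit is through a weighted log-likelihood, where each vote's contribution is scaled by a quality weight and a temporal decay factor. The weight components and the 90-day half-life below are illustrative assumptions, not Arena's documented parameters.

```python
import numpy as np
from scipy.special import expit  # logistic sigmoid

def weighted_nll(beta, winners, losers, quality, age_days, half_life=90.0):
    """Negative log-likelihood of BT outcomes with per-vote weights.

    quality: per-vote quality weight (assumption).
    age_days: vote age; older votes decay with the given half-life
    (also an assumption).
    """
    decay = 0.5 ** (np.asarray(age_days, float) / half_life)
    weights = np.asarray(quality, float) * decay
    logp = np.log(expit(beta[np.asarray(winners)] - beta[np.asarray(losers)]))
    return -np.sum(weights * logp)
```

Under this scheme a fresh, high-quality vote contributes more to the objective than the same vote six months old, which is the effect the temporal-decay policy is after.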
Chatbot Arena's influence on the AI industry is hard to overstate. It became the single most-referenced benchmark in AI company announcements and press coverage, partly because it was the only major evaluation that reflected real user preferences at scale rather than narrow academic tasks.
Several factors contributed to this influence. First, the evaluation is grounded in human judgment rather than automated metrics, which makes it resistant to the kind of benchmark-specific optimization that plagues other evaluations (though, as the Llama 4 scandal showed, not entirely immune). Second, the prompts come from real users rather than a fixed test set, which means the benchmark continuously evolves with user needs. Third, the pairwise comparison design is intuitive: anyone can understand "Model A beat Model B" without needing to interpret abstract accuracy scores.
At the same time, the platform has limitations. The user population skews toward English-speaking, technically sophisticated users, which means the rankings may not reflect performance for other demographics. The preference signal is noisy; users vote based on subjective criteria that vary from person to person. And as the Leaderboard Illusion controversy showed, the system is not immune to strategic manipulation by well-resourced companies.
As of early 2026, Arena operates at arena.ai with support for text, image, and video model evaluation. The platform has collected over 6 million human preference votes and supports hundreds of models from dozens of organizations. Arena-Hard-Auto continues to serve as an offline complement for rapid model comparison.
The company is exploring new evaluation formats, including multi-turn conversation evaluation, tool-use benchmarks, and domain-specific arenas for code, math, and scientific reasoning. With its $1.7 billion valuation and continued adoption by major AI labs, Arena has firmly established itself as the central venue for public AI model comparison.