Chatbot Arena (now branded simply as Arena, and previously known as LMArena) is a crowdsourced evaluation platform for large language models that ranks AI systems based on human preferences through anonymous pairwise comparisons. Created by researchers at UC Berkeley in May 2023, the platform has grown from a small academic project into one of the most widely cited and trusted sources for comparing the capabilities of frontier AI models. By early 2026, Arena had collected over 6 million user votes across hundreds of models and rebranded as an independent company valued at $1.7 billion.
Chatbot Arena emerged from UC Berkeley's Sky Computing Lab in the spring of 2023. The project was led by a group of graduate students and faculty, including Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Anastasios Nikolas Angelopoulos, Hao Zhang, and professors Joseph E. Gonzalez and Ion Stoica. The team collectively operated under the name LMSYS Org (Large Model Systems Organization), which also maintained related open-source projects like Vicuna and FastChat.
The original motivation was straightforward: existing benchmarks for language models were too narrow. Benchmarks like MMLU or HellaSwag tested specific capabilities under controlled conditions, but they did not capture how well a model actually performed in open-ended conversation with real users. Zheng and his collaborators wanted a system that would let ordinary people compare models head-to-head on whatever questions they cared about, with the results aggregated into a global ranking.
The first version of Chatbot Arena went live in May 2023. Within months, the platform had deployed over 45 models and collected more than 130,000 valid votes. By early 2024, the research paper describing the platform (Chiang, Zheng et al., 2024) had been published at ICML 2024, and the leaderboard was being cited in press releases by OpenAI, Google, Anthropic, and other major AI labs as evidence of their models' performance.
The core mechanic of Chatbot Arena is a blind pairwise comparison. When a user visits the platform, they can type any prompt they want. The system sends the prompt to two randomly selected models and displays both responses side by side. The user does not know which models they are evaluating. After reading both responses, the user votes for whichever answer they prefer, or they can declare a tie. Only after voting are the model identities revealed.
This design addresses several problems with traditional benchmarks. Because users write their own prompts, the evaluation covers a vast and continually refreshed distribution of real-world use cases, from creative writing and coding to math problems and factual questions. Because the comparison is blind, there is no brand-name bias influencing votes. And because the voting is binary (A vs. B), users do not need to calibrate an absolute quality scale.
In summary, each battle follows a fixed workflow: the user submits a prompt, two anonymized models respond side by side, the user votes for a winner or declares a tie, and only then are the model identities revealed.
With millions of votes from anonymous users, quality control is essential. The Arena employs several strategies:
Spam and bot detection. Automated systems filter out votes from bots, users who vote without reading the responses, and other low-quality interactions.
Prompt diversity tracking. The platform monitors the distribution of prompt topics to ensure broad coverage and flag periods where a particular type of prompt (for example, "which is the better model?") dominates.
Vote deduplication. Repeated votes from the same user on the same model pair within short time windows are deduplicated to prevent individual users from disproportionately influencing rankings.
Demographic awareness. The platform acknowledges that its user base skews toward English-speaking, technically sophisticated users. This is a known limitation rather than a controlled variable; the ratings reflect this population's preferences, which may not generalize to other demographics.
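The deduplication step above can be sketched as a simple time-window filter. This is an illustrative sketch, not Arena's actual implementation: the field names and the 24-hour window are assumptions.

```python
from datetime import datetime, timedelta

# Hypothetical dedup window; Arena's actual window is not public.
DEDUP_WINDOW = timedelta(hours=24)

def deduplicate(votes):
    """Keep a user's first vote on a model pair; drop repeats in the window.

    votes: list of dicts with 'user', 'pair' (frozenset of two model
    names), and 'time' (datetime). Processed in chronological order.
    """
    last_kept = {}  # (user, pair) -> timestamp of last kept vote
    kept = []
    for v in sorted(votes, key=lambda v: v["time"]):
        key = (v["user"], v["pair"])
        if key in last_kept and v["time"] - last_kept[key] < DEDUP_WINDOW:
            continue  # repeat vote inside the window: discard
        last_kept[key] = v["time"]
        kept.append(v)
    return kept
```

A repeat vote from the same user on the same pair within the window is discarded, while votes on different pairs, from different users, or outside the window all pass through.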
The platform originally used the Elo rating system, borrowed from competitive chess, to convert pairwise vote outcomes into a single numerical score for each model. In the Elo system, when a lower-rated model beats a higher-rated one, the ratings shift more dramatically than when the expected winner prevails.
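The Elo update rule can be written in a few lines. This is the standard chess formulation; the K factor of 32 is a common chess default, not a value documented for Arena.

```python
# Standard Elo update: an upset (lower-rated side wins) shifts both
# ratings further than an expected result does. K=32 is a common chess
# default and an assumption here.
def elo_update(r_winner, r_loser, k=32):
    expected_win = 1 / (1 + 10 ** ((r_loser - r_winner) / 400))
    delta = k * (1 - expected_win)  # small when the win was expected
    return r_winner + delta, r_loser - delta
```

For example, a 1000-rated model beating a 1200-rated one gains about 24 points, while the reverse (expected) outcome moves the ratings by only about 8.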
LMSYS later transitioned to the Bradley-Terry (BT) model, a statistical framework from 1952 that estimates the probability of one player beating another based on their latent "strength" parameters. The BT model works by assigning each model i a strength parameter p_i, and the probability that model i beats model j is calculated as:
P(i beats j) = p_i / (p_i + p_j)
The strength parameters are estimated via maximum likelihood estimation (MLE) across all collected matchups. This approach is mathematically related to Elo (Elo ratings can be viewed as an online approximation of BT coefficients), but fitting BT via MLE across all data produces more stable and statistically principled rankings.
The key difference is that the Elo system weights recent games more heavily (it is an online algorithm that updates after each game), while the BT model treats all games equally and recomputes from scratch. For a platform where model capabilities do not change over time (unlike chess players who improve or decline), the BT model is more appropriate.
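Fitting BT coefficients by MLE can be sketched as follows. Using the log-parameterization p_i = exp(beta_i), the win probability becomes a logistic function of the strength difference; one coefficient is pinned to zero for identifiability. This is a minimal sketch of the technique, not Arena's production pipeline.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import expit  # logistic sigmoid

def fit_bradley_terry(battles, n_models):
    """MLE fit of Bradley-Terry strengths.

    battles: list of (winner_index, loser_index) pairs.
    With p_i = exp(beta_i), P(i beats j) = sigmoid(beta_i - beta_j).
    """
    w = np.array([b[0] for b in battles])
    l = np.array([b[1] for b in battles])

    def nll_free(free):
        # Pin beta[0] = 0 so the strengths are identifiable.
        beta = np.concatenate(([0.0], free))
        return -np.sum(np.log(expit(beta[w] - beta[l])))

    res = minimize(nll_free, np.zeros(n_models - 1), method="BFGS")
    return np.concatenate(([0.0], res.x))
```

In a two-model example where model 1 wins 8 of 10 battles, the fitted coefficients recover an implied win probability of about 0.8. The resulting betas can then be rescaled onto the familiar Elo-like scale for display.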
The platform computes 95% confidence intervals through bootstrapping: the rating calculation is repeated many times with random subsamples of the vote data, and the spread of results indicates how reliable each model's ranking is. Models with fewer votes have wider confidence intervals.
The Arena also maintains separate category-specific leaderboards, including rankings for:
| Category | What it measures |
|---|---|
| Overall | General conversational quality |
| Coding | Programming and debugging ability |
| Math | Mathematical reasoning and problem-solving |
| Hard Prompts | Performance on difficult, complex queries |
| Instruction Following | Ability to follow precise instructions |
| Multilingual | Performance in non-English languages |
| Creative Writing | Quality of creative and narrative outputs |
| Vision | Multimodal image understanding |
This lets users see which models excel in particular domains rather than relying solely on the overall ranking.
A significant methodological innovation introduced in August 2024 was style control, which attempts to separate what a model says from how it says it. Research by the LMSYS team found that user preferences were significantly influenced by stylistic factors, particularly response length.
The team identified four primary style features and their influence coefficients:
| Style Factor | Influence Coefficient | Impact |
|---|---|---|
| Answer token length | 0.249 | Dominant factor |
| Markdown lists | 0.031 | Second-order |
| Markdown headers | 0.024 | Second-order |
| Markdown bold text | 0.019 | Second-order |
Response length emerged as the overwhelmingly dominant style factor, with all other markdown formatting effects being second-order. Models that generated longer, more formatted responses tended to win more battles, not necessarily because their content was better, but because users preferred the appearance of thoroughness.
The style control approach uses a modified Bradley-Terry regression model that treats style as an independent variable. Each style feature is measured as a normalized difference between the two responses:
style_difference = normalize((feature_A - feature_B) / (feature_A + feature_B))
This proportional normalization accounts for context: a 20-token difference matters more when comparing 20-token versus 40-token responses than when comparing 500-token versus 520-token responses.
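The proportional difference from the formula above can be computed directly. The outer normalize() is taken here to be standardization (z-scoring) across all battles; that choice is an assumption, not Arena's documented implementation.

```python
import numpy as np

def proportional_diff(feature_a, feature_b):
    """Per-battle proportional difference of a style feature."""
    a = np.asarray(feature_a, dtype=float)
    b = np.asarray(feature_b, dtype=float)
    return (a - b) / (a + b)

def style_difference(feature_a, feature_b):
    """Standardize the proportional differences across all battles.

    z-scoring is an assumed choice for normalize(); other scalings
    would work similarly.
    """
    raw = proportional_diff(feature_a, feature_b)
    return (raw - raw.mean()) / raw.std()
```

A 20-token gap between 20- and 40-token responses yields a proportional difference of about -0.33, while the same gap between 500- and 520-token responses yields only about -0.02, matching the intuition above.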
When style was controlled for, notable ranking shifts occurred: models whose wins rested on substance rose, while models whose wins rested on stylistic presentation fell.
These shifts revealed that some models were achieving high Arena scores partly through verbose, well-formatted responses rather than superior reasoning or factual accuracy.
Recognizing that live human evaluation is slow and expensive, the LMSYS team developed Arena-Hard-Auto as a complementary offline benchmark. Arena-Hard-Auto consists of 500 challenging prompts sampled from real Chatbot Arena conversations, specifically chosen from 250 high-quality topic clusters (two prompts per cluster) where the questions were difficult enough to reliably separate strong models from weaker ones.
The evaluation uses an LLM-as-a-judge approach: a powerful model (originally GPT-4-1106-preview) scores each test model's responses against a baseline model. To mitigate position bias, every comparison is run twice with the response order swapped, yielding 1,000 judgments per model. The results are then converted to Bradley-Terry scores.
Arena-Hard-Auto has the highest correlation with live Chatbot Arena rankings among popular open-ended LLM benchmarks, making it a useful proxy when live testing is impractical. The prompts and evaluation code are publicly available on GitHub.
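The order-swapping step can be sketched as follows. Here `judge` is a hypothetical stand-in for a call to the judge model; it returns "A", "B", or "tie" for the pair as presented. With 500 prompts and two orderings each, this yields the 1,000 judgments per model described above.

```python
def evaluate(prompts, model_answer, baseline_answer, judge):
    """Judge each prompt twice with the answer order swapped.

    judge(prompt, answer_a, answer_b) -> "A", "B", or "tie".
    model_answer / baseline_answer map a prompt to a response string.
    Returns (wins, losses, ties) for the test model.
    """
    wins = losses = ties = 0
    for p in prompts:
        for model_first in (True, False):
            if model_first:
                verdict = judge(p, model_answer(p), baseline_answer(p))
                model_label = "A"
            else:
                verdict = judge(p, baseline_answer(p), model_answer(p))
                model_label = "B"
            if verdict == model_label:
                wins += 1
            elif verdict == "tie":
                ties += 1
            else:
                losses += 1
    return wins, losses, ties
```

A judge that always prefers the first position ends up splitting its verdicts 50/50 under this scheme, which is exactly the position bias the swap is designed to cancel.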
The Arena leaderboard is updated continuously as new votes come in. The following table shows the top models on the overall text leaderboard as of early March 2026.
| Rank | Model | Organization | Arena Score |
|---|---|---|---|
| 1 | claude-opus-4-6 | Anthropic | 1504 |
| 2 | gemini-3.1-pro-preview | Google | 1500 |
| 3 | claude-opus-4-6-thinking | Anthropic | 1500 |
| 4 | grok-4.20-beta1 | xAI | 1493 |
| 5 | gemini-3-pro | Google | 1492 |
| 6 | grok-4.1-thinking | xAI | 1482 |
| 7 | gemini-3-flash | Google | 1470 |
| 8 | claude-opus-4.5-thinking-32k | Anthropic | 1466 |
| 9 | GPT-5.2-high | OpenAI | 1465 |
| 10 | GPT-5.1-high | OpenAI | 1464 |
| 11 | grok-4.1 | xAI | 1463 |
| 12 | claude-opus-4.5 | Anthropic | 1462 |
| 13 | ERNIE-5.0 | Baidu | 1461 |
| 14 | gemini-2.5-pro | Google | 1460 |
| 15 | GLM-4.7 | Zhipu AI | 1445 |
Ratings shift as new votes accumulate. Models marked as "Preliminary" have fewer votes and wider confidence intervals.
In April 2025, a research paper titled "The Leaderboard Illusion" (authored by researchers from Cohere, Stanford, MIT, and the Allen Institute for AI) accused LMArena of enabling systematic manipulation of the leaderboard by major AI companies. The paper's central claim was that LMArena had allowed companies like Meta, OpenAI, Google, and Amazon to privately test multiple model variants on the platform, then publicly reveal only the scores of the best-performing variants while burying the rest.
The most high-profile case involved Meta's release of Llama 4 in early 2025. According to the paper, Meta had privately tested 27 different model variants on Chatbot Arena between January and March 2025. Only one variant, which had been specifically optimized for "conversationality" (a quality that tends to perform well in Arena's preference-based voting), was publicly listed near the top of the leaderboard. Meta never released this optimized model to the public; the vanilla Llama 4 version that users could actually download performed significantly worse.
The specific model submitted was "Llama-4-Maverick-03-26-Experimental," a custom variant of Llama 4 Maverick that had been tuned for human preference. Analysis revealed striking differences between the Arena version and the public release:
| Characteristic | Arena version | Public release |
|---|---|---|
| Response style | Verbose, emoji-heavy | Concise, no emojis |
| Average response length | ~6,978 characters | ~2,982 characters |
| Longer response rate | 98.6% of comparisons | Baseline |
| Leaderboard position | Near top | Significantly lower |
The Arena version's responses were longer than the public version in 98.6% of comparisons, averaging more than twice the character count. Given the documented length bias in Arena voting, this optimization alone would have significantly inflated the model's ranking.
LMSYS acknowledged that "style and model response tone was an important factor" and that this information "was not made sufficiently clear by the Meta team." Meta's VP of Generative AI, Ahmad Al-Dahle, denied foul play, stating that the company did not train on the Arena test sets and that differences were "platform-specific quirks."
This is a textbook example of Goodhart's Law: when a measure becomes a target, it ceases to be a good measure. By optimizing specifically for Arena performance rather than general capability, companies could inflate their rankings without delivering corresponding improvements to users.
The Leaderboard Illusion paper documented similar private-testing patterns at other major companies, including OpenAI, Google, and Amazon.
LMArena co-founder Ion Stoica pushed back against the paper, calling its claims "full of inaccuracies" and "questionable." The platform subsequently introduced new transparency measures, including public logs of all model submissions and a policy requiring that any model tested on the platform be made available for public use within a specified time window.
The controversy highlighted a broader tension in AI evaluation: the more influential a benchmark becomes, the greater the incentive to game it.
Chatbot Arena's trajectory from academic side project to venture-backed startup unfolded rapidly.
| Date | Event |
|---|---|
| May 2023 | Chatbot Arena launched at UC Berkeley's Sky Computing Lab |
| September 2024 | Platform moved to its own domain, lmarena.ai |
| April 2025 | LMArena incorporated as an independent company |
| May 2025 | Raised $100 million seed round at $600 million valuation (led by Andreessen Horowitz) |
| January 6, 2026 | Raised $150 million Series A at $1.7 billion valuation |
| January 28, 2026 | Rebranded from LMArena to "Arena" |
The founding team shifted into corporate roles: Anastasios Angelopoulos became CEO, Wei-Lin Chiang became CTO, and Ion Stoica remained involved in an advisory capacity. Investors included Andreessen Horowitz, UC Investments, Lightspeed Venture Partners, Felicis Ventures, and Kleiner Perkins.
The rebranding to "Arena" in January 2026 reflected the platform's expansion beyond text-only language models. By that point, Arena had added support for image generation models, video generation, and multimodal systems, making the "LM" (language model) prefix too restrictive.
The Arena computes confidence intervals by repeatedly resampling the vote data with replacement (bootstrapping). For each bootstrap iteration, a new set of Bradley-Terry coefficients is estimated, producing a distribution of possible ratings for each model. The 2.5th and 97.5th percentiles of this distribution form the 95% confidence interval.
Models with fewer votes have wider confidence intervals. A model with 10,000 votes might have a confidence interval of plus or minus 5 points, while a model with only 500 votes might have an interval of plus or minus 20 points. The platform marks models with wide intervals as "Preliminary" to signal that their rankings are less reliable.
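The bootstrap procedure can be sketched as follows. Here `fit_rating` stands in for the full Bradley-Terry fit; for illustration it is any function mapping a resampled set of battle outcomes (1 = win, 0 = loss for the model of interest) to a rating.

```python
import numpy as np

def bootstrap_ci(outcomes, fit_rating, n_boot=1000, seed=0):
    """95% bootstrap confidence interval for a model's rating.

    Resamples the vote data with replacement, refits the rating each
    time, and takes the 2.5th and 97.5th percentiles of the results.
    """
    rng = np.random.default_rng(seed)
    outcomes = np.asarray(outcomes)
    ratings = []
    for _ in range(n_boot):
        sample = rng.choice(outcomes, size=len(outcomes), replace=True)
        ratings.append(fit_rating(sample))
    return np.percentile(ratings, 2.5), np.percentile(ratings, 97.5)
```

Running this with a toy rating function on 10,000 simulated votes versus 500 reproduces the pattern described above: the smaller sample yields a markedly wider interval.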
Not all votes contribute equally to the rankings. The platform uses several weighting schemes:
Conversation quality. Votes from longer, more substantive conversations may receive higher weight than votes from single-turn, trivial prompts.
User diversity. The system tracks whether a disproportionate number of votes for a particular model come from a small number of users, which could indicate coordinated manipulation.
Temporal decay. While the BT model treats all votes equally by default, the platform periodically resets or downweights older votes when models are updated, since votes collected against an older version of a model are no longer relevant.
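One way the weighting schemes above can be folded into the Bradley-Terry fit is through a weighted log-likelihood, where each vote's contribution is scaled by a quality weight and a temporal decay factor. The weight components and the 90-day half-life below are illustrative assumptions, not Arena's documented parameters.

```python
import numpy as np
from scipy.special import expit  # logistic sigmoid

def weighted_nll(beta, winners, losers, quality, age_days, half_life=90.0):
    """Negative log-likelihood of BT outcomes with per-vote weights.

    quality: per-vote quality weight (assumption).
    age_days: vote age; older votes decay with the given half-life
    (also an assumption).
    """
    decay = 0.5 ** (np.asarray(age_days, float) / half_life)
    weights = np.asarray(quality, float) * decay
    logp = np.log(expit(beta[np.asarray(winners)] - beta[np.asarray(losers)]))
    return -np.sum(weights * logp)
```

Under this scheme a fresh, high-quality vote contributes more to the objective than the same vote six months old, which is the effect the temporal-decay policy is after.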
Chatbot Arena's influence on the AI industry is hard to overstate. It became the single most-referenced benchmark in AI company announcements and press coverage, partly because it was the only major evaluation that reflected real user preferences at scale rather than narrow academic tasks.
Several factors contributed to this influence. First, the evaluation is grounded in human judgment rather than automated metrics, which makes it resistant to the kind of benchmark-specific optimization that plagues other evaluations (though, as the Llama 4 scandal showed, not entirely immune). Second, the prompts come from real users rather than a fixed test set, which means the benchmark continuously evolves with user needs. Third, the pairwise comparison design is intuitive: anyone can understand "Model A beat Model B" without needing to interpret abstract accuracy scores.
At the same time, the platform has limitations. The user population skews toward English-speaking, technically sophisticated users, which means the rankings may not reflect performance for other demographics. The preference signal is noisy; users vote based on subjective criteria that vary from person to person. And as the Leaderboard Illusion controversy showed, the system is not immune to strategic manipulation by well-resourced companies.
As of early 2026, Arena operates at arena.ai with support for text, image, and video model evaluation. The platform has collected over 6 million human preference votes and supports hundreds of models from dozens of organizations. Arena-Hard-Auto continues to serve as an offline complement for rapid model comparison.
The company is exploring new evaluation formats, including multi-turn conversation evaluation, tool-use benchmarks, and domain-specific arenas for code, math, and scientific reasoning. With its $1.7 billion valuation and continued adoption by major AI labs, Arena has firmly established itself as the central venue for public AI model comparison.