# Chatbot Arena

> Source: https://aiwiki.ai/wiki/lmsys_chatbot_arena
> Updated: 2026-06-21
> Categories: AI Benchmarks, Large Language Models
> From AI Wiki (https://aiwiki.ai), a free encyclopedia of artificial intelligence. Quote with attribution.

**Chatbot Arena** (now branded simply as **Arena**, and previously known as **LMArena**) is a crowdsourced evaluation platform for [large language models](/wiki/large_language_model) that ranks AI systems based on human preferences through anonymous, blind, side-by-side comparisons.[1] Users type any prompt, two unnamed models answer, the user votes for the better response, and the votes are aggregated into a single numerical rating per model. Created by researchers at UC Berkeley in May 2023, it grew from a small academic project into the most widely cited public benchmark for comparing frontier AI models.[3] By early 2026 the platform had collected over 6 million human preference votes across hundreds of models, drew more than 5 million monthly users across 150 countries generating over 60 million conversations per month, and had spun out as an independent company valued at $1.7 billion.[7][8][11]

The founding paper defines it as "an open platform for evaluating LLMs based on human preference" that "uses a pairwise comparison approach and leverages input from a diverse user base through crowdsourcing."[1]

## Origins and founding

Chatbot Arena emerged from UC Berkeley's Sky Computing Lab in the spring of 2023. The project was led by a group of graduate students and faculty, including Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Anastasios Nikolas Angelopoulos, Hao Zhang, and professors Joseph E. Gonzalez and Ion Stoica.[1] The team collectively operated under the name **LMSYS Org** (Large Model Systems Organization), which also maintained related open-source projects like Vicuna and FastChat.

The original motivation was straightforward: existing benchmarks for language models were too narrow. [Benchmarks](/wiki/benchmarks) like [MMLU](/wiki/mmlu) or [HellaSwag](/wiki/hellaswag) tested specific capabilities under controlled conditions, but they did not capture how well a model actually performed in open-ended conversation with real users.[3] Zheng and his collaborators wanted a system that would let ordinary people compare models head-to-head on whatever questions they cared about, with the results aggregated into a global ranking.[1]

The first version of Chatbot Arena went live in May 2023.[3] Within months, the platform had deployed over 45 models and collected more than 130,000 valid votes.[10] In March 2024 the team posted the research paper describing the platform (Chiang, Zheng et al., 2024), which reported more than 240,000 votes collected over several months and was later published at ICML 2024. By then the leaderboard was being cited in press releases by OpenAI, Google, Anthropic, and other major AI labs as evidence of their models' performance.[1]

## How does Chatbot Arena work?

The core mechanic of Chatbot Arena is a blind pairwise comparison. When a user visits the platform, they can type any prompt they want. The system sends the prompt to two randomly selected models and displays both responses side by side. The user does not know which models they are evaluating. After reading both responses, the user votes for whichever answer they prefer, or they can declare a tie. Only after voting are the model identities revealed.[3]

This design addresses several problems with traditional benchmarks. Because users write their own prompts, the evaluation covers a vast and continually refreshed distribution of real-world use cases, from creative writing and coding to math problems and factual questions. Because the comparison is blind, there is no brand-name bias influencing votes. And because each vote is a relative judgment between two responses (A vs. B) rather than an absolute score, users do not need to calibrate an absolute quality scale.[1]

### How a battle works step by step

Each battle proceeds in six steps:

1. **[Prompt](/wiki/prompt) entry.** The user types any prompt into a single input box. There are no preset prompts or restricted topics; users are free to ask anything.
2. **Model sampling.** The system selects two models from the pool. The sampling is not uniform: models with similar ratings and higher uncertainty are paired more frequently to maximize the information gained from each vote.[1] This non-uniform sampling concentrates votes where they are most needed.
3. **Response generation.** Both models receive the identical prompt and generate responses simultaneously. The responses are displayed side by side, labeled only as "Model A" and "Model B."
4. **Position randomization.** The left-right assignment of models is randomized to mitigate position bias (the tendency for users to prefer whichever response appears on a particular side).[1]
5. **Voting.** The user selects one of four options: "A is better," "B is better," "Tie," or "Both are bad." The user can also continue the conversation with multi-turn exchanges before voting.[3]
6. **Reveal.** After voting, the model identities are revealed, and the user can see which models they compared.
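
The steps above can be sketched in a few lines of Python. This is a minimal illustration, not the platform's code: the function names, the record layout, and in particular the pair-weighting heuristic in `sample_pair` are assumptions made up for this example. The source only says that close ratings and high uncertainty make a pairing more likely.

```python
import random

VOTE_OPTIONS = ("model_a", "model_b", "tie", "both_bad")

def sample_pair(ratings, uncertainty, rng):
    """Pick two distinct models, favoring close ratings and high uncertainty.

    The weight formula is a hypothetical heuristic chosen for illustration.
    """
    names = sorted(ratings)
    pairs, weights = [], []
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            closeness = 1.0 / (1.0 + abs(ratings[a] - ratings[b]) / 100.0)
            pairs.append((a, b))
            weights.append(closeness * (uncertainty[a] + uncertainty[b]))
    return rng.choices(pairs, weights=weights, k=1)[0]

def run_battle(prompt, ratings, uncertainty, rng):
    """Create an anonymized battle with a randomized left/right assignment."""
    a, b = sample_pair(ratings, uncertainty, rng)
    if rng.random() < 0.5:  # position randomization
        a, b = b, a
    return {"prompt": prompt, "model_a": a, "model_b": b, "vote": None}

def cast_vote(battle, choice):
    """Record one of the four vote options and return the revealed battle."""
    if choice not in VOTE_OPTIONS:
        raise ValueError(f"unknown vote option: {choice}")
    return dict(battle, vote=choice)

# Example data (invented ratings and uncertainties).
ratings = {"a": 1200, "b": 1205, "c": 900}
uncertainty = {"a": 10, "b": 10, "c": 10}
battle = run_battle("Explain recursion.", ratings, uncertainty, random.Random(0))
revealed = cast_vote(battle, "tie")
```

With these invented numbers the closely rated pair `a`/`b` is drawn far more often than either pairing with `c`, which is the behavior the sampling step describes.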

### Crowdsourcing quality control

With millions of votes from anonymous users, quality control is essential. The Arena employs several strategies:

**Spam and bot detection.** Automated systems filter out votes from bots, users who vote without reading the responses, and other low-quality interactions.[1]

**Prompt diversity tracking.** The platform monitors the distribution of prompt topics to ensure broad coverage and flag periods where a particular type of prompt (for example, "which is the better model?") dominates.

**Vote deduplication.** Repeated votes from the same user on the same model pair within short time windows are deduplicated to prevent individual users from disproportionately influencing rankings.

**Demographic awareness.** The platform acknowledges that its user base skews toward English-speaking, technically sophisticated users. This is a known limitation rather than a controlled variable; the ratings reflect this population's preferences, which may not generalize to other demographics.[1]
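
As an illustration of the vote deduplication rule described above, the following sketch keeps only the first vote a user casts on a given model pair within a time window. The field names (`user`, `model_a`, `model_b`, `ts`) and the one-hour window are assumptions for this example; the source does not specify the actual window or data layout.

```python
def deduplicate_votes(votes, window_seconds=3600):
    """Keep the first vote per (user, unordered model pair) in each window.

    Each vote is a dict with hypothetical keys: user, model_a, model_b, ts
    (timestamp in seconds).
    """
    last_kept = {}
    kept = []
    for vote in sorted(votes, key=lambda v: v["ts"]):
        # The pair is unordered so that A-vs-B and B-vs-A count as the same matchup.
        key = (vote["user"], frozenset((vote["model_a"], vote["model_b"])))
        previous = last_kept.get(key)
        if previous is None or vote["ts"] - previous >= window_seconds:
            kept.append(vote)
            last_kept[key] = vote["ts"]
    return kept

votes = [
    {"user": "u1", "model_a": "x", "model_b": "y", "ts": 0},
    {"user": "u2", "model_a": "x", "model_b": "y", "ts": 50},
    {"user": "u1", "model_a": "y", "model_b": "x", "ts": 100},   # repeat, dropped
    {"user": "u1", "model_a": "x", "model_b": "y", "ts": 4000},  # outside window, kept
]
kept = deduplicate_votes(votes)
```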

## Elo and Bradley-Terry ratings

The platform originally used the [Elo rating system](/wiki/elo_rating_system), borrowed from competitive chess, to convert pairwise vote outcomes into a single numerical score for each model.[3] In the Elo system, when a lower-rated model beats a higher-rated one, the ratings shift more dramatically than when the expected winner prevails.
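
The asymmetry can be checked with the standard Elo update. The sketch below uses the conventional 400-point logistic scale; the K-factor of 32 and the starting ratings are assumptions for illustration, not the values the platform used.

```python
def expected_score(rating_a, rating_b):
    """Probability that A beats B under the standard Elo logistic curve."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))

def elo_update(rating_a, rating_b, score_a, k=32.0):
    """Return updated (rating_a, rating_b); score_a is 1 for a win, 0.5 tie, 0 loss."""
    delta = k * (score_a - expected_score(rating_a, rating_b))
    return rating_a + delta, rating_b - delta

# Upset: the 1000-rated model beats the 1200-rated model.
upset_gain = elo_update(1000.0, 1200.0, 1.0)[0] - 1000.0
# Expected result: the 1200-rated model beats the 1000-rated model.
expected_gain = elo_update(1200.0, 1000.0, 1.0)[0] - 1200.0
```

Here the underdog gains about 24.3 points for the upset, while the favorite gains only about 7.7 points for the expected win.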

### The Bradley-Terry model

LMSYS later transitioned to the **Bradley-Terry (BT) model**, a statistical framework from 1952 that estimates the probability of one player beating another based on their latent "strength" parameters.[1] The BT model works by assigning each model i a strength parameter p_i, and the probability that model i beats model j is calculated as:

P(i beats j) = p_i / (p_i + p_j)

The strength parameters are estimated via maximum likelihood estimation (MLE) across all collected matchups. This approach is mathematically related to Elo (Elo ratings can be viewed as an online approximation of BT coefficients), but fitting BT via MLE across all data produces more stable and statistically principled rankings.[1]
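
A minimal sketch of the maximum likelihood fit follows. It uses the classic MM (minorization-maximization) iteration, which is one standard way to compute the BT estimate; the source does not say which optimizer the platform uses. The function names, the win counts, and the 400/1000 constants used to map strengths onto an Elo-like scale are assumptions for this example, and ties are ignored.

```python
import math

def fit_bradley_terry(wins, iterations=200):
    """Estimate BT strengths by maximum likelihood using MM updates.

    wins[(i, j)] is the number of times model i beat model j. Assumes every
    model has at least one win and one loss; ties are ignored.
    """
    models = sorted({m for pair in wins for m in pair})
    p = {m: 1.0 for m in models}
    for _ in range(iterations):
        updated = {}
        for i in models:
            total_wins = sum(wins.get((i, j), 0) for j in models if j != i)
            denom = sum(
                (wins.get((i, j), 0) + wins.get((j, i), 0)) / (p[i] + p[j])
                for j in models if j != i
            )
            updated[i] = total_wins / denom
        # Strengths are only defined up to a constant factor, so rescale each round.
        scale = len(models) / sum(updated.values())
        p = {m: value * scale for m, value in updated.items()}
    return p

def win_probability(p, i, j):
    """P(i beats j) = p_i / (p_i + p_j)."""
    return p[i] / (p[i] + p[j])

def to_elo_scale(p, base=1000.0):
    """Map strengths to an Elo-like scale (constants are illustrative)."""
    return {m: base + 400.0 * math.log10(value) for m, value in p.items()}

wins = {("A", "B"): 60, ("B", "A"): 40,
        ("B", "C"): 70, ("C", "B"): 30,
        ("A", "C"): 80, ("C", "A"): 20}
strengths = fit_bradley_terry(wins)
ratings = to_elo_scale(strengths)
```

Because the fit uses all matchups at once, the result does not depend on the order in which the votes arrived.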

The key difference is that the Elo system weights recent games more heavily (it is an online algorithm that updates after each game), while the BT model treats all games equally and recomputes from scratch. For a platform where model capabilities do not change over time (unlike chess players who improve or decline), the BT model is more appropriate.[1]
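
The order dependence of online Elo is easy to demonstrate. In the sketch below (K-factor and starting rating are assumed values), two models split six games 3 to 3, yet the final ratings depend on which model won most recently. A BT fit on the same six games would give both models equal strength regardless of order.

```python
def elo_ratings(games, k=32.0, start=1000.0):
    """Run online Elo over a sequence of (winner, loser) games."""
    ratings = {}
    for winner, loser in games:
        rating_w = ratings.setdefault(winner, start)
        rating_l = ratings.setdefault(loser, start)
        expected_w = 1.0 / (1.0 + 10 ** ((rating_l - rating_w) / 400.0))
        delta = k * (1.0 - expected_w)
        ratings[winner] = rating_w + delta
        ratings[loser] = rating_l - delta
    return ratings

# A wins the first three games, then B wins the last three.
games = [("A", "B")] * 3 + [("B", "A")] * 3
forward = elo_ratings(games)
# Same games in reverse order: B wins first, A wins last.
backward = elo_ratings(list(reversed(games)))
```

In `forward`, B ends above A because its wins came last; in `backward` the positions are mirrored.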

The platform computes 95% confidence intervals through bootstrapping: the rating calculation is repeated many times on random resamples of the vote data (drawn with replacement), and the spread of results indicates how reliable each model's ranking is. Models with fewer votes have wider confidence intervals.[1]

### Category-specific leaderboards

The Arena also maintains separate category-specific leaderboards, including rankings for:

| Category | What it measures |
|----------|------------------|
| Overall | General conversational quality |
| Coding | Programming and debugging ability |
| Math | Mathematical reasoning and problem-solving |
| Hard Prompts | Performance on difficult, complex queries |
| Instruction Following | Ability to follow precise instructions |
| Multilingual | Performance in non-English languages |
| Creative Writing | Quality of creative and narrative outputs |
| Vision | Multimodal image understanding |

This lets users see which models excel in particular domains rather than relying solely on the overall ranking.

## Style control

A significant methodological innovation introduced in August 2024 was **style control**, which attempts to separate what a model says from how it says it.[5] Research by the LMSYS team found that user preferences were significantly influenced by stylistic factors, particularly response length. As the team put it, "when analyzing the style coefficients, we found that length was the dominant style factor."[5]

### Style factors and their impact

The team identified four primary style features and their influence coefficients:

| Style Factor | Influence Coefficient | Impact |
|--------------|----------------------|--------|
| Answer token length | 0.249 | Dominant factor |
| Markdown lists | 0.031 | Second-order |
| Markdown headers | 0.024 | Second-order |
| Markdown bold text | 0.019 | Second-order |

Response length emerged as the overwhelmingly dominant style factor, with all other markdown formatting effects being second-order.[5] Models that generated longer, more formatted responses tended to win more battles, not necessarily because their content was better, but because users preferred the appearance of thoroughness.[5]

### Methodology

The style control approach uses a modified Bradley-Terry regression model that treats style as an independent variable. Each style feature is measured as a normalized difference between the two responses:

style_difference = normalize((feature_A - feature_B) / (length_A + length_B))

This proportional normalization accounts for context: a 20-token difference matters more when comparing 20-token versus 40-token responses (a relative difference of 20/60, about 0.33) than when comparing 500-token versus 520-token responses (20/1020, about 0.02).[5]
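
The idea of treating style as an extra regression variable can be sketched as a logistic regression in which the log-odds of A winning equal the strength difference plus a style term. The code below is a toy illustration under stated assumptions, not the LMSYS implementation: the function name, the two invented models, the aggregated vote counts, the single style feature, and the plain gradient-ascent fit with a small L2 penalty are all made up for this example.

```python
import math

def fit_style_controlled_bt(rows, steps=3000, lr=0.5, l2=1e-3):
    """Fit logit P(A wins) = beta[A] - beta[B] + gamma * style_diff.

    rows holds (model_a, model_b, style_diff, wins_a, games) tuples.
    Plain gradient ascent on the L2-regularized mean log-likelihood.
    """
    models = sorted({m for a, b, *_ in rows for m in (a, b)})
    beta = {m: 0.0 for m in models}
    gamma = 0.0
    total = sum(games for *_, games in rows)
    for _ in range(steps):
        grad_beta = {m: -l2 * beta[m] for m in models}
        grad_gamma = -l2 * gamma
        for a, b, style, wins_a, games in rows:
            z = beta[a] - beta[b] + gamma * style
            p = 1.0 / (1.0 + math.exp(-z))
            residual = (wins_a - games * p) / total
            grad_beta[a] += residual
            grad_beta[b] -= residual
            grad_gamma += residual * style
        for m in models:
            beta[m] += lr * grad_beta[m]
        gamma += lr * grad_gamma
    return beta, gamma

# Invented data: "verbose" wins 70% when its answer is much longer
# (style_diff = 0.5) but only 40% when both answers have equal length.
rows = [
    ("verbose", "terse", 0.5, 70, 100),
    ("verbose", "terse", 0.0, 40, 100),
]
raw_win_rate = (70 + 40) / 200
beta, gamma = fit_style_controlled_bt(rows)
```

In this toy data set the raw win rate favors `verbose` (55%), but once the style coefficient `gamma` absorbs the length effect, `terse` receives the higher strength. This mirrors the kind of reordering that style control is meant to expose.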

### Impact on rankings

When style was controlled for on the overall leaderboard, notable ranking shifts occurred:

**Models that rose (substance over style):**
- [Claude](/wiki/claude) 3.5 Sonnet: climbed from rank 6 to rank 4
- Claude 3 Opus: climbed substantially

**Models that fell (style over substance):**
- GPT-4o-mini: dropped from rank 6 to rank 11
- [Grok](/wiki/grok)-2-mini: dropped from rank 6 to rank 18

Llama-3.1-405B held its position at rank 6 with no change.[5] These shifts revealed that some models were achieving high Arena scores partly through verbose, well-formatted responses rather than superior reasoning or factual accuracy.[5]

## Arena-Hard-Auto

Recognizing that live human evaluation is slow and expensive, the LMSYS team developed **Arena-Hard-Auto** as a complementary offline benchmark.[4] Arena-Hard-Auto consists of 500 challenging prompts sampled from real Chatbot Arena conversations, specifically chosen from 250 high-quality topic clusters (two prompts per cluster) where the questions were difficult enough to reliably separate strong models from weaker ones.[4]

The evaluation uses an LLM-as-a-judge approach: a powerful model (originally [GPT-4](/wiki/gpt-4)-1106-preview) scores each test model's responses against a baseline model.[4] To mitigate position bias, every comparison is run twice with the response order swapped, yielding 1,000 judgments per model. The results are then converted to Bradley-Terry scores.[4]
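
The order-swapping step can be sketched as follows. The `judge` callable is a hypothetical stand-in for the LLM judge and is assumed to return `"first"`, `"second"`, or `"tie"`; the function name, the tally layout, and the dummy answers are likewise assumptions for this example, and the later conversion to Bradley-Terry scores is omitted.

```python
def judge_with_swap(prompts, candidate_answers, baseline_answers, judge):
    """Judge every prompt twice, once in each presentation order."""
    tally = {"candidate": 0, "baseline": 0, "tie": 0}
    for prompt in prompts:
        cand, base = candidate_answers[prompt], baseline_answers[prompt]
        for candidate_first in (True, False):
            first, second = (cand, base) if candidate_first else (base, cand)
            verdict = judge(prompt, first, second)
            if verdict == "tie":
                tally["tie"] += 1
            elif (verdict == "first") == candidate_first:
                tally["candidate"] += 1
            else:
                tally["baseline"] += 1
    return tally

def always_first(prompt, first, second):
    """A maximally position-biased stub judge: it always picks the first answer."""
    return "first"

prompts = [f"prompt-{i}" for i in range(500)]
candidate_answers = {p: "candidate answer" for p in prompts}
baseline_answers = {p: "baseline answer" for p in prompts}
tally = judge_with_swap(prompts, candidate_answers, baseline_answers, always_first)
```

With 500 prompts the loop produces 1,000 judgments, and the purely position-biased stub splits them evenly between candidate and baseline, so its bias cancels out instead of favoring whichever answer happened to be shown first.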

The LMSYS team reported that Arena-Hard-Auto had the highest correlation with live Chatbot Arena rankings among the popular open-ended LLM benchmarks it compared, making it a useful proxy when live testing is impractical.[4] The prompts and evaluation code are publicly available on GitHub.

## Current leaderboard

The Arena leaderboard is updated continuously as new votes come in. The following table shows the top models on the overall text leaderboard as of early March 2026.[9]

| Rank | Model | Organization | Arena Score |
|------|-------|-------------|------------|
| 1 | claude-opus-4-6 | [Anthropic](/wiki/anthropic) | 1504 |
| 2 | gemini-3.1-pro-preview | [Google](/wiki/google_deepmind) | 1500 |
| 3 | claude-opus-4-6-thinking | Anthropic | 1500 |
| 4 | grok-4.20-beta1 | [xAI](/wiki/xai) | 1493 |
| 5 | gemini-3-pro | Google | 1492 |
| 6 | grok-4.1-thinking | xAI | 1482 |
| 7 | gemini-3-flash | Google | 1470 |
| 8 | claude-opus-4.5-thinking-32k | Anthropic | 1466 |
| 9 | GPT-5.2-high | [OpenAI](/wiki/openai) | 1465 |
| 10 | GPT-5.1-high | OpenAI | 1464 |
| 11 | grok-4.1 | xAI | 1463 |
| 12 | claude-opus-4.5 | Anthropic | 1462 |
| 13 | ERNIE-5.0 | [Baidu](/wiki/baidu_ai) | 1461 |
| 14 | gemini-2.5-pro | Google | 1460 |
| 15 | GLM-4.7 | Zhipu AI | 1445 |

Ratings shift as new votes accumulate. Models marked as "Preliminary" have fewer votes and wider confidence intervals.[9]

## The Leaderboard Illusion controversy

In April 2025, a research paper titled "The Leaderboard Illusion" (authored by Shivalika Singh and colleagues from [Cohere](/wiki/cohere) Labs, the [Allen Institute for AI](/wiki/ai2), Princeton, Stanford, the University of Waterloo, and the University of Washington) accused LMArena of enabling systematic manipulation of the leaderboard by major AI companies.[6] The paper's central claim was that LMArena had allowed companies like [Meta](/wiki/meta), OpenAI, Google, and Amazon to privately test multiple model variants on the platform, then publicly reveal only the scores of the best-performing variants while retracting or burying the rest.[6]

The paper also documented uneven access to Arena data: it estimated that Google and OpenAI received roughly 19.2% and 20.4% of all data on the platform respectively, while a combined 83 open-weight models received only about 29.7% of the total.[6]

### The Llama 4 scandal

The most high-profile case involved Meta's release of [Llama 4](/wiki/llama_4) in April 2025. According to the paper, Meta had privately tested 27 different model variants on Chatbot Arena in the lead-up to the Llama 4 release.[6] Only one variant, which had been specifically optimized for human preference (a trait that Arena's preference-based voting tends to reward), was publicly listed near the top of the leaderboard. Meta never released this optimized model to the public; the vanilla Llama 4 version that users could actually download performed significantly worse.[7]

The specific model submitted was "Llama-4-Maverick-03-26-Experimental," a custom variant of Llama 4 Maverick that had been tuned for human preference. It ranked second on the overall leaderboard, just behind Google's Gemini-2.5-Pro experimental release.[7] Analysis revealed striking differences between the Arena version and the public release:

| Characteristic | Arena version | Public release |
|----------------|---------------|----------------|
| Response style | Verbose, emoji-heavy | Concise, no emojis |
| Average response length | ~6,978 characters | ~2,982 characters |
| Longer response rate | 98.6% of comparisons | Baseline |
| Leaderboard position | Near top | Significantly lower |

The Arena version's responses were longer than the public version in 98.6% of comparisons, averaging more than twice the character count.[6] Given the documented length bias in Arena voting, this optimization alone would have significantly inflated the model's ranking.[5] When the unmodified Maverick was later listed, it ranked below models including OpenAI's GPT-4o, [Claude](/wiki/claude) 3.5 Sonnet, and Gemini 1.5 Pro.[7]

LMArena acknowledged that "style and model response tone was an important factor" and that this information "was not made sufficiently clear by the Meta team."[7] Meta's VP of [Generative AI](/wiki/generative_ai), Ahmad Al-Dahle, denied foul play, stating that the company did not train on the Arena test sets and that differences were "platform-specific quirks."[7]

This is a textbook example of Goodhart's Law: when a measure becomes a target, it ceases to be a good measure. By optimizing specifically for Arena performance rather than general capability, companies could inflate their rankings without delivering corresponding improvements to users.

### Broader manipulation findings

The Leaderboard Illusion paper documented similar patterns with other companies:

- Multiple companies submitted "experimental" model variants that differed from their public releases.
- Some companies tested dozens of variants privately before announcing the highest-scoring one.
- The practice of optimizing for Arena-specific metrics (especially response length and formatting) was widespread.[6]
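
Why reporting only the best of many private variants inflates a score can be shown with a small simulation. This is an illustration of the general selection effect, not a reproduction of the paper's analysis: the true score, the noise level, the trial count, and the function name are assumptions, and every variant is given the same true skill so that any gain comes purely from picking the luckiest measurement.

```python
import random

def simulate_best_of_n(n_variants, trials=2000, true_score=1200.0,
                       noise_sd=15.0, seed=0):
    """Compare one honest submission with the best of n noisy measurements.

    Returns (average single score, average best-of-n score).
    """
    rng = random.Random(seed)
    single_total = 0.0
    best_total = 0.0
    for _ in range(trials):
        scores = [rng.gauss(true_score, noise_sd) for _ in range(n_variants)]
        single_total += scores[0]   # what a single submission would report
        best_total += max(scores)   # what selective reporting would show
    return single_total / trials, best_total / trials

# 27 is the number of private Llama 4 variants cited in the paper.
single_avg, best_avg = simulate_best_of_n(27)
inflation = best_avg - single_avg
```

Under these assumed settings the best-of-27 score lands roughly two noise standard deviations above the true score, even though no variant is actually better than the others.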

LMArena co-founder Ion Stoica pushed back against the paper, calling its claims "full of inaccuracies" and "questionable."[7] The platform subsequently introduced new transparency measures, including public logs of all model submissions and a policy requiring that any model tested on the platform be made available for public use within a specified time window.

The controversy highlighted a broader tension in AI evaluation: the more influential a benchmark becomes, the greater the incentive to game it.

## Transition from research project to company

Chatbot Arena's trajectory from academic side project to venture-backed startup unfolded rapidly.

| Date | Event |
|------|-------|
| May 2023 | Chatbot Arena launched at UC Berkeley's Sky Computing Lab |
| September 2024 | Platform moved to its own domain, lmarena.ai |
| April 2025 | LMArena incorporated as an independent company |
| May 2025 | Raised $100 million seed round at $600 million valuation (co-led by Andreessen Horowitz and UC Investments) |
| January 6, 2026 | Raised $150 million Series A at $1.7 billion valuation (led by Felicis and UC Investments) [11] |
| January 28, 2026 | Rebranded from LMArena to "Arena" |

The founding team shifted into corporate roles: Anastasios Angelopoulos became CEO, Wei-Lin Chiang became CTO, and Ion Stoica remained involved.[8] Investors across the two rounds included Andreessen Horowitz, UC Investments, Felicis, Lightspeed Venture Partners, Kleiner Perkins, Laude Ventures, and The House Fund.[7][8]

The rebranding to "Arena" in January 2026 reflected the platform's expansion beyond text-only language models. By that point, Arena had added support for image generation models, video generation, and multimodal systems, making the "LM" (language model) prefix too restrictive.[8] The company framed the change in its announcement: "LMArena is now Arena. A name that takes us back to our roots with a powerful mission: to measure and advance the frontier of AI for real-world use."[8]

## Statistical methodology details

### Bootstrapping for confidence intervals

The Arena computes confidence intervals by repeatedly resampling the vote data with replacement (bootstrapping). For each bootstrap iteration, a new set of Bradley-Terry coefficients is estimated, producing a distribution of possible ratings for each model. The 2.5th and 97.5th percentiles of this distribution form the 95% confidence interval.[1]

Models with fewer votes have wider confidence intervals. A model with 10,000 votes might have a confidence interval of plus or minus 5 points, while a model with only 500 votes might have an interval of plus or minus 20 points. The platform marks models with wide intervals as "Preliminary" to signal that their rankings are less reliable.[1]
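
A percentile bootstrap can be sketched for the simplest case of two models, where the BT estimate has a closed form: the rating gap on an Elo-like scale is 400 times the base-10 logarithm of wins over losses. The vote counts, the number of bootstrap rounds, and the function names below are assumptions for this example; the platform's multi-model computation refits the full BT model on every resample.

```python
import math
import random

def rating_gap(votes):
    """Elo-scale BT gap between two models from 1 (A won) / 0 (B won) votes."""
    wins = sum(votes)
    losses = len(votes) - wins
    # Guard against log(0) in a degenerate resample.
    wins, losses = max(wins, 0.5), max(losses, 0.5)
    return 400.0 * math.log10(wins / losses)

def bootstrap_interval(votes, rounds=300, seed=0):
    """95% percentile interval from resampling the votes with replacement."""
    rng = random.Random(seed)
    n = len(votes)
    gaps = sorted(rating_gap(rng.choices(votes, k=n)) for _ in range(rounds))
    return gaps[int(0.025 * rounds)], gaps[int(0.975 * rounds) - 1]

# Same 60% win rate, ten times as many votes in the second sample.
small_ci = bootstrap_interval([1] * 300 + [0] * 200)
large_ci = bootstrap_interval([1] * 3000 + [0] * 2000)
small_width = small_ci[1] - small_ci[0]
large_width = large_ci[1] - large_ci[0]
```

Both intervals bracket the same point estimate, but the interval from 500 votes is several times wider than the one from 5,000 votes.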

### Vote weighting and sampling

Not all votes contribute equally to the rankings. The platform uses several weighting schemes:

**Conversation quality.** Votes from longer, more substantive conversations may receive higher weight than votes from single-turn, trivial prompts.

**User diversity.** The system tracks whether a disproportionate number of votes for a particular model come from a small number of users, which could indicate coordinated manipulation.

**Temporal decay.** While the BT model treats all votes equally by default, the platform periodically resets or downweights older votes when models are updated, since votes collected against an older version of a model are no longer relevant.
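
The user diversity check described above could be approximated by measuring how concentrated a model's winning votes are among its most active voters. The sketch below is hypothetical: the field names, the `top_k` parameter, and the idea of flagging a high share are assumptions for illustration, and the source does not describe the platform's actual rule.

```python
from collections import Counter

def top_user_share(votes, model, top_k=3):
    """Share of a model's winning votes cast by its top_k most active voters."""
    counts = Counter(v["user"] for v in votes if v["winner"] == model)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return sum(count for _, count in counts.most_common(top_k)) / total

votes = (
    [{"user": "u1", "winner": "X"}] * 8
    + [{"user": "u2", "winner": "X"}, {"user": "u3", "winner": "X"}]
    + [{"user": f"v{i}", "winner": "Y"} for i in range(5)]
)
share_x = top_user_share(votes, "X", top_k=1)
share_y = top_user_share(votes, "Y", top_k=1)
```

In this invented data a single user accounts for 80% of model X's wins but only 20% of model Y's, so X's votes would be the ones worth a closer look.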

## Why does Chatbot Arena matter?

Chatbot Arena became one of the most frequently referenced benchmarks in AI company announcements and press coverage, largely because it reflected real user preferences at scale rather than performance on narrow academic tasks.

Several factors contributed to this influence. First, the evaluation is grounded in human judgment rather than automated metrics, which makes it resistant to the kind of benchmark-specific optimization that plagues other evaluations (though, as the Llama 4 scandal showed, not entirely immune). Second, the prompts come from real users rather than a fixed test set, which means the benchmark continuously evolves with user needs. Third, the pairwise comparison design is intuitive: anyone can understand "Model A beat Model B" without needing to interpret abstract accuracy scores.[1]

At the same time, the platform has limitations. The user population skews toward English-speaking, technically sophisticated users, which means the rankings may not reflect performance for other demographics.[1] The preference signal is noisy; users vote based on subjective criteria that vary from person to person. And as the Leaderboard Illusion controversy showed, the system is not immune to strategic manipulation by well-resourced companies.[6]

## Current state

As of early 2026, Arena operates at arena.ai with support for text, image, and video model evaluation.[8] The platform has collected over 6 million human preference votes and supports hundreds of models from dozens of organizations, drawing more than 5 million monthly users across 150 countries who generate over 60 million conversations per month.[7][8] By December 2025 the company reported an annualized consumption run rate above $30 million, reached less than four months after launching its commercial evaluation service.[11] Arena-Hard-Auto continues to serve as an offline complement for rapid model comparison.

The company is exploring new evaluation formats, including multi-turn conversation evaluation, tool-use benchmarks, and domain-specific arenas for code, math, and scientific reasoning. With its $1.7 billion valuation and continued adoption by major AI labs, Arena has firmly established itself as the central venue for public AI model comparison.[7][8]

## See also

- [Dynabench](/wiki/dynabench)

## References

1. Chiang, W.-L., Zheng, L., Sheng, Y., Angelopoulos, A. N., Li, T., Li, D., Zhu, B., Zhang, H., Jordan, M., Gonzalez, J. E., & Stoica, I. (2024). "Chatbot Arena: An Open Platform for Evaluating LLMs by Human Preference." *Proceedings of the 41st International Conference on Machine Learning (ICML 2024)*. https://arxiv.org/abs/2403.04132

2. Zheng, L., Chiang, W.-L., Sheng, Y., Zhuang, S., Wu, Z., Zhuang, Y., Lin, Z., Li, Z., Li, D., Xing, E., Zhang, H., Gonzalez, J. E., & Stoica, I. (2023). "Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena." *Advances in Neural Information Processing Systems 36 (NeurIPS 2023)*. https://arxiv.org/abs/2306.05685

3. LMSYS Org. (2023). "Chatbot Arena: Benchmarking LLMs in the Wild with Elo Ratings." LMSYS Blog. https://lmsys.org/blog/2023-05-03-arena/

4. LMSYS Org. (2024). "From Live Data to High-Quality Benchmarks: The Arena-Hard Pipeline." LMSYS Blog. https://lmsys.org/blog/2024-04-19-arena-hard/

5. LMSYS Org. (2024). "Does Style Matter? Disentangling Style and Substance in Chatbot Arena." LMSYS Blog. https://lmsys.org/blog/2024-08-28-style-control/

6. Singh, S., Nan, Y., Wang, A., et al. (2025). "The Leaderboard Illusion." https://arxiv.org/abs/2504.20879

7. Matsakis, L. (2025). "Study accuses LM Arena of helping top AI labs game its benchmark." *TechCrunch*. https://techcrunch.com/2025/04/30/study-accuses-lm-arena-of-helping-top-ai-labs-game-its-benchmark/

8. Arena. (2026). "LMArena is now Arena." Arena Blog. https://arena.ai/blog/lmarena-is-now-arena/

9. Arena Leaderboard. https://arena.ai/leaderboard

10. LMSYS Org. (2023). "Chatbot Arena: New Models & Elo System Update." LMSYS Blog. https://lmsys.org/blog/2023-12-07-leaderboard/

11. Wiggers, K. (2026). "LMArena lands $1.7B valuation four months after launching its product." *TechCrunch*. https://techcrunch.com/2026/01/06/lmarena-lands-1-7b-valuation-four-months-after-launching-its-product/

