# Arena-Hard

> Source: https://aiwiki.ai/wiki/arena_hard
> Updated: 2026-06-09
> Categories: AI Benchmarks, Model Evaluation
> From AI Wiki (https://aiwiki.ai), a free encyclopedia of artificial intelligence. Quote with attribution.

**Arena-Hard** (and its evaluation tool **Arena-Hard-Auto**) is an automatic large language model (LLM) benchmark developed by the team behind [Chatbot Arena](/wiki/lmsys_chatbot_arena) that scores instruction-tuned models on 500 challenging, real-world prompts mined from crowdsourced user conversations.[^1][^2] Models are compared in pairwise battles against a fixed baseline ([GPT-4](/wiki/gpt-4)-0314), graded by a strong LLM judge, and aggregated with a Bradley-Terry model to produce win-rate scores against the baseline.[^2][^3] Originally released by LMSYS Org researchers at the [University of California, Berkeley](/wiki/uc_berkeley) Sky Computing Lab in April 2024 and now maintained by the spin-off company LMArena, the benchmark is positioned as a cheap, fast proxy for the full Chatbot Arena Elo leaderboard, with the v0.1 release reporting **89.1% agreement with human preference**, **87.4% separability** between models, and a per-model evaluation cost of roughly **US$20-25**.[^1][^2][^4]

The benchmark is distributed through the open-source `arena-hard-auto` repository (formerly under `lm-sys/`, now under `lmarena/` on GitHub) along with a Hugging Face dataset and a viewer space.[^5][^6] As of 2025, the project advanced to **Arena-Hard-v2.0-Preview**, which refreshes the 500-prompt benchmark, adds 250 creative-writing prompts, and replaces the original GPT-4-Turbo judge with [GPT-4.1](/wiki/gpt-4.1) and [Gemini 2.5](/wiki/gemini_2_5_pro) as default automatic judges.[^5][^7]

## Background

### Chatbot Arena and the need for hard benchmarks

[Chatbot Arena](/wiki/lmsys_chatbot_arena) launched on 24 April 2023 as an open, crowdsourced platform on which anonymous users pit two LLMs against one another and vote on which response is better, producing an Elo-style leaderboard from the accumulated pairwise comparisons.[^8] By early 2024 the platform had become the de facto reference for instruction-tuned LLM quality, but its slow turnaround (fresh ranks for a new model require thousands of human votes) made it unsuitable for the rapid release cycles of new models, and motivated work on an automatic surrogate that preserved the agreement with human preferences while costing a fraction of the time and money.[^1][^4]

Two existing automatic benchmarks, [MT-Bench](/wiki/mt_bench) (a multi-turn, 80-question LLM-as-judge benchmark released by the same LMSYS team in 2023[^9]) and [AlpacaEval](/wiki/alpacaeval), already used strong LLM judges for pairwise comparisons against a baseline, but the LMSYS team argued that their prompts were too easy to discriminate frontier models confidently.[^1] In April 2024, the team reported that MT-Bench's agreement with the Chatbot Arena leaderboard fell from 91.3% to **22.6%** once 95% confidence intervals were considered, motivating a new pipeline built on harder prompts and tighter statistical methodology.[^1][^2]

### Hard Prompts in Chatbot Arena

Roughly a month after the Arena-Hard release, on 17 May 2024, LMSYS introduced a parallel **Hard Prompts** category on the live Chatbot Arena leaderboard itself, applying the same seven-criterion rubric used for benchmark construction to over one million Arena prompts and creating a sub-leaderboard restricted to prompts scoring 6 or higher.[^10] Approximately 20% of Arena prompts qualified, with **Specificity**, **Domain Knowledge** and **Real-world Application** being the most common criteria and **Problem-Solving** and **Complexity** the rarest.[^10] On the Hard Prompts split, [Claude 3 Opus](/wiki/claude_3_opus), [GPT-4o](/wiki/gpt_4o) and reasoning-focused models (such as Microsoft's Phi-3) gained ground while small instruction-tuned models like Llama-3-8B-Instruct dropped sharply in rank, which the team read as evidence that the harder-prompts methodology was surfacing capability differences obscured in the general English category.[^10]

## Original Arena-Hard (April 2024)

### Launch and authorship

The Arena-Hard pipeline was announced on **19 April 2024** in a post titled "From Live Data to High-Quality Benchmarks: The Arena-Hard Pipeline" on the LMSYS Org blog.[^1] The associated work was authored by Tianle Li, Wei-Lin Chiang, Evan Frick, Lisa Dunlap, Banghua Zhu, Joseph E. Gonzalez and [Ion Stoica](/wiki/ion_stoica), with affiliations at the UC Berkeley Sky Computing Lab and LMSYS Org.[^1][^4] A paper titled "From Crowdsourced Data to High-Quality Benchmarks: Arena-Hard and BenchBuilder Pipeline" by the same group plus Tianhao Wu was posted to arXiv as preprint 2406.11939 on 17 June 2024, and was later accepted, after peer review, as a poster at ICML 2025.[^3][^11]

The original release positioned Arena-Hard as having three required properties for a useful benchmark: ability to **robustly differentiate** model capabilities, **alignment** with real-world human preferences, and **frequent updates** to avoid test-set leakage.[^4] The team committed to refreshing the benchmark periodically, a commitment realized through the v2.0-Preview release in April 2025.[^5][^7]

### Three benchmark "products"

It is helpful to distinguish three related artifacts that all carry the "Arena-Hard" name:

* **The Arena-Hard pipeline**: an automated data-curation pipeline (later renamed **BenchBuilder**) that mines, clusters and quality-filters prompts from crowdsourced sources such as Chatbot Arena.[^3][^4]
* **Arena-Hard-Auto**: the static 500-prompt benchmark plus evaluation harness produced by running the pipeline on roughly 200 000 Chatbot Arena conversations in early 2024.[^1][^3]
* **The Hard Prompts category** in the live Chatbot Arena: a live filtered slice of the ongoing crowdsourced leaderboard, distinct from the static benchmark.[^10]

In casual usage "Arena-Hard" most often refers to Arena-Hard-Auto and its leaderboard, while the underlying pipeline is published under the BenchBuilder name in the arXiv paper.[^3]

## The Arena-Hard-Auto Pipeline

### Source data

The original pipeline drew on **200 000 user queries** collected via Chatbot Arena between roughly April 2023 and early 2024.[^1] In the arXiv paper, the BenchBuilder pipeline is additionally demonstrated on 150 000 queries sampled from the WildChat-1M dataset of ChatGPT conversations, showing that the same methodology generalizes beyond Arena data.[^3]

### Clustering by topic

To ensure topical diversity in the final benchmark, each prompt is embedded with OpenAI's `text-embedding-3-small`, projected to a lower-dimensional space using UMAP, and clustered via HDBSCAN through the BERTopic library; cluster summaries are then generated by GPT-4-Turbo.[^1][^4] This step yields roughly **4 000 topical clusters** covering a wide spectrum of real-world tasks.[^1][^4]

### Seven-criterion quality scoring

A calibrated LLM judge (using GPT-3.5-Turbo for cheap pre-scoring and GPT-4-Turbo for final scoring) rates every prompt on a 0-7 scale that counts how many of seven challenge-oriented criteria are satisfied:[^1][^3]

* **Specificity**: does the prompt ask something concrete and well-defined?
* **Domain Knowledge**: does answering it require expertise in a specific field?
* **Complexity**: does the task involve multiple steps or considerations?
* **Problem-Solving**: does the prompt require active reasoning rather than recall?
* **Creativity**: does the task require novel generation?
* **Technical Accuracy**: does the answer demand technical precision?
* **Real-world Application**: does the prompt resemble what a real user would ask?

Each cluster is then scored by the mean score of its member prompts. Clusters with mean score below 5 are discarded; the surviving clusters whose mean score is **≥ 6 out of 7** are kept for sampling.[^4]
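
The scoring and filtering step can be sketched as follows. This is a minimal illustration, not the BenchBuilder implementation: the criterion keys, the function names and the toy data are hypothetical, and the per-criterion flags are assumed to have already been produced by the LLM judge.

```python
from statistics import mean

# Hypothetical keys for the seven criteria listed above.
CRITERIA = (
    "specificity", "domain_knowledge", "complexity", "problem_solving",
    "creativity", "technical_accuracy", "real_world_application",
)

def prompt_score(flags: dict[str, bool]) -> int:
    """Count how many of the seven criteria were marked as satisfied (0-7)."""
    return sum(1 for criterion in CRITERIA if flags.get(criterion, False))

def high_quality_clusters(
    scores_by_cluster: dict[str, list[int]], threshold: float = 6.0
) -> dict[str, float]:
    """Keep only clusters whose mean prompt score meets the threshold."""
    means = {name: mean(s) for name, s in scores_by_cluster.items() if s}
    return {name: m for name, m in means.items() if m >= threshold}

# Toy data (invented for illustration only).
toy_scores = {
    "Advanced TypeScript": [7, 6, 6],
    "Casual Greetings": [1, 0, 2],
    "AWS Cloud Services": [6, 5, 7],
}
kept = high_quality_clusters(toy_scores)
print(sorted(kept))
```

In this toy run the low-scoring "Casual Greetings" cluster is dropped, while the two technical clusters, whose means are at least 6, survive.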

### Selecting the final 500 prompts

From the high-scoring clusters, the team retained **250 clusters** and randomly sampled **two prompts per cluster** to yield the **500-prompt Arena-Hard-Auto-v0.1** benchmark.[^1][^3] The published dataset on Hugging Face (`lmarena-ai/arena-hard-auto-v0.1`) contains exactly 500 rows, each with a `question_id`, `category`, `cluster` label and a turn list containing the prompt text, released under the Apache 2.0 license.[^6][^12] Sample clusters include "Advanced Math", "Advanced TypeScript", "AI & Alignment", "AWS Cloud Services" and "ABC Sequence Puzzles", illustrating the technical bent of the curated prompts.[^12]
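
A hedged sketch of the sampling step: given the retained clusters, draw two prompts from each. The function name, the fixed seed and the synthetic cluster data are assumptions for illustration; the source does not describe the team's random-number setup.

```python
import random

def sample_prompts(
    retained_clusters: dict[str, list[str]], per_cluster: int = 2, seed: int = 0
) -> list[dict[str, str]]:
    """Randomly draw `per_cluster` distinct prompts from every retained cluster."""
    rng = random.Random(seed)  # seeded so the draw is reproducible
    rows = []
    for name in sorted(retained_clusters):
        prompts = retained_clusters[name]
        if len(prompts) < per_cluster:
            raise ValueError(f"cluster {name!r} has fewer than {per_cluster} prompts")
        for prompt in rng.sample(prompts, per_cluster):
            rows.append({"cluster": name, "prompt": prompt})
    return rows

# Synthetic stand-in for the 250 retained clusters (invented data).
clusters = {
    f"cluster-{i:03d}": [f"prompt {i}-{j}" for j in range(5)] for i in range(250)
}
benchmark = sample_prompts(clusters)
print(len(benchmark))
```

With 250 clusters and two prompts per cluster, the result has 500 rows, matching the benchmark size described above.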

## Methodology of Evaluation

### Pairwise comparison against a baseline

Rather than score model answers independently (which can introduce drift and absolute-scale bias), Arena-Hard-Auto evaluates each candidate by asking a strong LLM judge to compare its answer against a **baseline model**'s answer, prompt by prompt.[^2][^4] The default baseline is [GPT-4](/wiki/gpt-4)-0314 (the March 2023 snapshot of GPT-4), chosen because it is a strong but no-longer-being-updated reference point.[^2][^6]

### Judge model

In the v0.1 release the default judge is **GPT-4-Turbo (`gpt-4-1106-preview`)**, with [Claude-3-Opus](/wiki/claude_3_opus) used as a secondary judge to probe robustness.[^1][^2] Chain-of-thought prompting is used: the judge first generates its own answer to the prompt, then writes a comparison between A and B before giving a verdict.[^2] The judge classifies the comparison into one of five labels:[^2]

* A is significantly better than B (A >> B)
* A is slightly better than B (A > B)
* A is roughly equal to B (A ~= B)
* B is slightly better than A (B > A)
* B is significantly better than A (B >> A)

### Two-game position-swap setup

To control for position bias in the LLM-as-a-judge (the well-documented tendency of GPT-4 to prefer whichever response appears first), every prompt is judged **twice**, once with the candidate as response A and once as response B, yielding **1 000 judgments per evaluated model** on the 500-prompt benchmark.[^2]
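
The position-swap idea can be illustrated with a small sketch. The `judge` callable, the compact label strings and the toy judge below are hypothetical stand-ins; a real run would call the LLM judge with the benchmark's own prompt template.

```python
from typing import Callable

# judge(prompt, answer_a, answer_b) returns one of the five labels.
Judge = Callable[[str, str, str], str]

# Mirror a label so it can be read from the other slot's point of view.
FLIP = {"A>>B": "B>>A", "A>B": "B>A", "A=B": "A=B", "B>A": "A>B", "B>>A": "A>>B"}

def two_game_verdicts(
    prompt: str, candidate: str, baseline: str, judge: Judge
) -> list[str]:
    """Judge twice with swapped positions; report both verdicts with the candidate as A."""
    game1 = judge(prompt, candidate, baseline)  # candidate shown in slot A
    game2 = judge(prompt, baseline, candidate)  # candidate shown in slot B
    for label in (game1, game2):
        if label not in FLIP:
            raise ValueError(f"unexpected judge label: {label!r}")
    return [game1, FLIP[game2]]

def first_slot_biased_judge(prompt: str, answer_a: str, answer_b: str) -> str:
    """Toy judge that always slightly prefers whichever answer comes first."""
    return "A>B"

verdicts = two_game_verdicts(
    "Explain UMAP.", "candidate answer", "baseline answer", first_slot_biased_judge
)
print(verdicts)  # ['A>B', 'B>A']
```

A purely position-biased judge yields one win and one loss for the candidate, so the bias cancels across the two games instead of inflating the score.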

### Bradley-Terry aggregation

The set of pairwise comparisons across all 500 prompts (each weighted by the strength of preference encoded in the five-label scale) is fit with a **Bradley-Terry** logistic-regression model, the same statistical model that underpins the Chatbot Arena Elo leaderboard, to produce a latent strength coefficient for each model.[^2][^4] The coefficient is then converted to an interpretable **win-rate against the baseline** (GPT-4-0314), with **95% confidence intervals** derived from 100 rounds of bootstrap resampling.[^2]
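
The sketch below illustrates the aggregation for the special case of one candidate against the baseline, where the two-player Bradley-Terry fit has a closed form (the fitted coefficient is the logit of the observed win fraction, with ties counted as half a win). The weight of 3 for "significantly better" verdicts, the tie handling and all function names are assumptions made for illustration; the source states only that verdicts are weighted by strength of preference.

```python
import math
import random

STRONG_WEIGHT = 3  # assumption: a ">>" verdict counts as this many games

def expand_outcomes(verdicts: list[str]) -> list[float]:
    """Map candidate-as-A labels to game outcomes: 1 win, 0 loss, 0.5 tie."""
    outcomes: list[float] = []
    for v in verdicts:
        if v == "A>>B":
            outcomes += [1.0] * STRONG_WEIGHT
        elif v == "A>B":
            outcomes.append(1.0)
        elif v == "A=B":
            outcomes.append(0.5)
        elif v == "B>A":
            outcomes.append(0.0)
        elif v == "B>>A":
            outcomes += [0.0] * STRONG_WEIGHT
        else:
            raise ValueError(f"unexpected label: {v!r}")
    return outcomes

def bt_win_rate(outcomes: list[float]) -> float:
    """Two-player Bradley-Terry: fit beta (baseline fixed at 0), return the win rate."""
    p = sum(outcomes) / len(outcomes)
    p = min(max(p, 1e-6), 1 - 1e-6)        # keep the logit finite
    beta = math.log(p / (1 - p))           # candidate strength coefficient
    return 1 / (1 + math.exp(-beta))       # predicted win rate vs baseline

def bootstrap_ci(verdicts: list[str], rounds: int = 100, seed: int = 0) -> tuple[float, float]:
    """Approximate 95% interval from `rounds` resamples of the verdicts."""
    rng = random.Random(seed)
    rates = sorted(
        bt_win_rate(expand_outcomes(rng.choices(verdicts, k=len(verdicts))))
        for _ in range(rounds)
    )
    return rates[int(0.025 * rounds)], rates[min(rounds - 1, int(0.975 * rounds))]

toy_verdicts = ["A>>B", "A>B", "A=B", "B>A"]  # invented data
rate = bt_win_rate(expand_outcomes(toy_verdicts))
low, high = bootstrap_ci(toy_verdicts)
print(round(rate, 3), low <= rate <= 1.0)
```

With many models on one leaderboard, the same model would instead be fit as a logistic regression over all pairwise games; the closed form here is only a convenience for the single-pair case.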

### Cost

The team reported a total cost of approximately **US$25 per model** for v0.1 using GPT-4-Turbo as judge,[^1][^4] which the arXiv version of the paper refines to roughly **US$20 per model**.[^3] Both figures are several orders of magnitude cheaper than gathering the thousands of human votes required to place a model on the full Chatbot Arena leaderboard.[^4]

## Correlation with Chatbot Arena

LMSYS published three custom benchmark-quality metrics alongside Arena-Hard, intended to capture aspects ignored by raw Spearman correlation:[^1][^4]

* **Separability with Confidence**: the fraction of model pairs whose 95% confidence intervals do not overlap, measuring how confidently the benchmark can rank pairs.[^1][^4]
* **Agreement with Confidence**: the fraction of ranked model pairs whose ordering agrees with Chatbot Arena when both intervals are considered.[^1][^3]
* **Pair Rank Brier Score**: a probabilistic loss for pairwise ranking that penalizes confident-but-wrong predictions.[^5]
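
The first two metrics can be sketched from per-model confidence intervals. The code below is one plausible reading of the definitions above, not the authors' implementation: in particular, agreement is computed here only over pairs that both the benchmark and the reference separate confidently, which is an assumption. All names and numbers are invented for illustration.

```python
from itertools import combinations

Interval = tuple[float, float]  # (lower, upper) bound of a 95% CI

def confident_order(a: Interval, b: Interval) -> int:
    """+1 if a is entirely above b, -1 if entirely below, 0 if the intervals overlap."""
    if a[0] > b[1]:
        return 1
    if a[1] < b[0]:
        return -1
    return 0

def separability(cis: dict[str, Interval]) -> float:
    """Fraction of model pairs whose intervals do not overlap."""
    pairs = list(combinations(sorted(cis), 2))
    separated = sum(1 for x, y in pairs if confident_order(cis[x], cis[y]) != 0)
    return separated / len(pairs)

def agreement(bench: dict[str, Interval], reference: dict[str, Interval]) -> float:
    """Among pairs both rankings separate confidently, the share ordered the same way."""
    shared = sorted(set(bench) & set(reference))
    agree = total = 0
    for x, y in combinations(shared, 2):
        b = confident_order(bench[x], bench[y])
        r = confident_order(reference[x], reference[y])
        if b != 0 and r != 0:
            total += 1
            agree += int(b == r)
    if total == 0:
        raise ValueError("no pair is confidently separated by both rankings")
    return agree / total

# Invented intervals: benchmark win rates and reference leaderboard ratings.
bench_cis = {"m1": (80.0, 85.0), "m2": (60.0, 66.0), "m3": (63.0, 70.0)}
arena_cis = {"m1": (1250.0, 1260.0), "m2": (1200.0, 1210.0), "m3": (1220.0, 1230.0)}
print(separability(bench_cis), agreement(bench_cis, arena_cis))
```

In the toy data the benchmark cannot separate `m2` from `m3`, so separability is two pairs out of three, while both confidently ranked pairs match the reference ordering.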

Reported results for **Arena-Hard-Auto-v0.1** on the top-20 Chatbot Arena models as of April 2024:[^1][^4]

| Metric | Arena-Hard-Auto v0.1 | MT-Bench |
|---|---|---|
| Separability with confidence | **87.4%** | 22.6% |
| Agreement with Chatbot Arena (95% CI) | **89.1%** | 26.6% |
| Spearman correlation | **94.1%** | 89.9% |
| Cost per model | **~US$25** | ~US$10 |

The arXiv paper later reports refined numbers of **87.4% separability**, **90.9% confidence-aware agreement**, **93.2% Spearman correlation** and **98.6% correlation with full Chatbot Arena rankings**, all at ~US$20 per model; the team emphasized these as providing roughly **3× higher separation** than MT-Bench at a similar cost.[^3]

A persistent finding is that simple Spearman correlation tends to overstate alignment: in the LMSYS analysis, MT-Bench's nominal 91.3% Spearman with Chatbot Arena collapses to 22.6% once 95% CIs are required, while Arena-Hard maintains its agreement under the stricter metric.[^1][^4] This motivated the team's repeated argument that confidence-aware benchmark evaluation is the appropriate standard.[^1][^3]

## Leaderboard Highlights

### v0.1 (April 2024, GPT-4-Turbo judge)

At launch the top results on Arena-Hard-Auto with the default GPT-4-Turbo judge were:[^1]

| Model | Win-rate vs GPT-4-0314 |
|---|---|
| `gpt-4-turbo-2024-04-09` | **82.6%** |
| `gpt-4-0125-preview` | 78.0% |
| `claude-3-opus-20240229` | 60.4% |
| `claude-3-sonnet-20240229` | 46.8% |
| `llama-3-70b-instruct` | 41.1% |

The strong showing of [Llama 3](/wiki/llama_3) 70B Instruct, released on 18 April 2024, one day before the Arena-Hard announcement, was particularly noted because it placed the first widely-available open-weight model above several proprietary mid-tier models.[^1]

### Mid-2024 refresh

By mid-2024 the leaderboard had been extended with newer models. Per the arXiv paper, top performers on the same v0.1 benchmark included [Claude 3.5 Sonnet](/wiki/claude_3_5_sonnet) at **79.3%** win rate and [GPT-4o](/wiki/gpt_4o) at **79.2%**, edging out earlier GPT-4 Turbo snapshots.[^3]

### Style Control (October 2024)

On **14 October 2024**, the `arena-hard-auto` repository added support for **Style Control**, porting the methodology introduced in the August 2024 Chatbot Arena paper "Does style matter? Disentangling style and substance in Chatbot Arena."[^13][^5] Style Control augments the Bradley-Terry regression with additional independent variables that capture the **normalized difference in response length (tokens)**, **number of markdown headers**, **number of bold elements**, and **number of markdown lists** between the two answers, allowing the rating to attribute portions of preference to style rather than substance.[^13] In the Chatbot Arena data the length coefficient (0.249) dominated all markdown effects (lists 0.031, headers 0.024, bold 0.019), confirming length as the largest stylistic confounder.[^13]
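
A hedged sketch of how such style covariates might be computed for one pair of answers. The normalization `(a - b) / (a + b)`, the whitespace tokenization, the regular expressions and the choice to count list items are all illustrative assumptions rather than the repository's actual feature extraction.

```python
import re

FEATURES = ("length", "headers", "bold", "lists")

def normalized_diff(a: float, b: float) -> float:
    """Difference scaled by the sum, so the feature lies in [-1, 1]."""
    total = a + b
    return 0.0 if total == 0 else (a - b) / total

def style_counts(text: str) -> dict[str, int]:
    """Rough style statistics for one answer."""
    return {
        # Whitespace tokens stand in for tokenizer tokens (assumption).
        "length": len(text.split()),
        "headers": len(re.findall(r"^#{1,6}\s", text, flags=re.MULTILINE)),
        "bold": len(re.findall(r"\*\*[^*\n]+\*\*", text)),
        # Counts list items, not list blocks (assumption).
        "lists": len(re.findall(r"^[ \t]*(?:[-*+]|\d+\.)\s", text, flags=re.MULTILINE)),
    }

def style_features(answer_a: str, answer_b: str) -> list[float]:
    """Extra regression covariates for one A-vs-B comparison."""
    ca, cb = style_counts(answer_a), style_counts(answer_b)
    return [normalized_diff(ca[k], cb[k]) for k in FEATURES]

markdown_answer = "# Title\n\n**Key** point\n- one\n- two\n"
plain_answer = "plain answer text here"
print(style_features(markdown_answer, plain_answer))
```

These per-comparison values would be appended to the model-indicator columns of the Bradley-Terry design matrix, so that the fitted style coefficients absorb preference attributable to length and markdown rather than to the model itself.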

When style control was applied to Arena-Hard-Auto, GPT-4o-mini fell substantially (rank 6 → 11 overall, with Grok-2-mini dropping 6 → 18), while [Claude 3 Opus](/wiki/claude_3_opus) (rank 16 → 10) and [Llama 3.1](/wiki/llama_3_1) 405B rose notably.[^13] The LMSYS team reported that under style control, [Claude 3.5 Sonnet](/wiki/claude_3_5_sonnet) tied for first place with `chatgpt-4o-latest` on hard prompts.[^13] Importantly, the team showed that adding verbose or markdown-heavy system prompts could inflate raw Arena-Hard scores but **could not** inflate the style-controlled scores, providing a defense against simple style gaming.[^13]

### Arena-Hard-v2.0-Preview (April 2025)

On **23 April 2025** the team released **Arena-Hard-v2.0-Preview**, a refresh designed to address test-set staleness and broaden coverage.[^5][^7] V2.0 contains:[^5][^7]

* **500 fresh, challenging real-world prompts** sourced from new Chatbot Arena traffic, covering open-ended software-engineering problems, math, and other technical domains;
* **250 creative writing prompts** sourced from the same platform, evaluated as a separate split;
* **New default judges**: [GPT-4.1](/wiki/gpt-4.1) (recommended for speed and stability) and [Gemini 2.5](/wiki/gemini_2_5_pro) (recommended for creative writing), with an **ensemble** option that aggregates both;
* **Style Control retained**, controllable via a `--control-features` flag at result time;
* **Judgment caching** for cheaper incremental evaluation when new models are added.

Top results on Arena-Hard-v2.0-Preview using the official configuration (Hard-Prompt split, Style Control on, [Gemini 2.5](/wiki/gemini_2_5_pro) judge) included [OpenAI o3](/wiki/o3) (`o3-2025-04-16`) at **85.9%**, [o4-mini](/wiki/o4_mini) (high-effort variant) at **79.1%**, and Gemini 2.5 at **79.0%**, ahead of o4-mini at 74.6% and Gemini 2.5 Flash at 68.6%.[^5] In later updates of the leaderboard reported by third-party trackers, top scores rose further as new reasoning models such as [DeepSeek-R1](/wiki/deepseek_r1) entered the table.[^14]

## Comparison to Other Benchmarks

### MT-Bench

[MT-Bench](/wiki/mt_bench), the predecessor automatic benchmark from the same LMSYS group, asks GPT-4 to grade individual answers to 80 multi-turn prompts on a 10-point scale and reports the mean score.[^9] Arena-Hard improves on MT-Bench along three axes simultaneously: prompt difficulty (curated hard prompts vs hand-written generic prompts), comparison style (pairwise vs absolute scoring), and statistical methodology (Bradley-Terry with confidence intervals vs raw mean scores).[^1][^3] The headline numbers (**87.4% vs 22.6% separability** and **89.1% vs 26.6% confidence-aware agreement** with Chatbot Arena) quantify the resulting improvement.[^1]

### AlpacaEval

[AlpacaEval](/wiki/alpacaeval) and its length-controlled successor AlpacaEval 2.0 LC are alternative LLM-as-judge benchmarks scored by win-rate against a strong baseline. AlpacaEval 2.0 LC achieved 83.2% separability and roughly 82.5% agreement with Chatbot Arena in the LMSYS analysis, behind Arena-Hard on both dimensions while serving a similar role of cheap, fast proxy for human-preference rankings.[^1][^4]

### MMLU, MMLU-Pro and other multi-task benchmarks

[MMLU](/wiki/mmlu) (Massive Multitask Language Understanding) and its harder variant [MMLU-Pro](/wiki/mmlu-pro) are knowledge-and-reasoning benchmarks built from multiple-choice questions, which measure factual recall and constrained reasoning rather than the open-ended generation quality that Arena-Hard targets.[^15] The two benchmark families are typically reported side by side in model evaluations because they measure complementary capabilities.

## Critiques and Limitations

### Judge bias toward GPT-4-style answers

Because the default v0.1 judge is GPT-4-Turbo (and v2.0 defaults to GPT-4.1 or Gemini 2.5), the benchmark inherits any preferences these judges hold. The LMSYS team reported that both GPT-4-Turbo and Claude-3-Opus are sensitive to **response length**, with Claude exhibiting larger length bias under "more detailed" system prompts.[^4] Single-judge self-preference is documented in the literature on LLM-as-a-judge, where a model tends to slightly favor its own family; for example, Claude judges favor Claude answers.[^3] The arXiv paper proposes **Ensemble-as-Judges**, combining GPT-4-Turbo and Gemini 1.5 Pro to reduce self-bias and improve agreement with humans, an approach later operationalized in the v2.0 ensemble option.[^3][^5]

### Length and style gameability

Before the addition of Style Control, prompting candidate models to produce verbose, markdown-heavy answers reliably inflated their Arena-Hard win rates.[^13] Style Control, introduced in the live Chatbot Arena in August 2024 and ported to Arena-Hard-Auto in October 2024, mitigates this gameability by regressing out length and markdown coefficients before reporting model strength; the LMSYS team demonstrated that style-controlled scores could not be inflated by verbose system prompts.[^13]

### Quality criteria coverage

The authors explicitly acknowledge in the arXiv paper that the **seven defined quality criteria may not fully capture all possible attributes** of a high-quality prompt and that the resulting benchmark skews toward technical and STEM-heavy domains, a bias visible in the v0.1 sample clusters such as "Advanced TypeScript", "Advanced Math" and "AWS Cloud Services".[^3][^12] V0.1 also lacks multi-turn evaluation and is essentially English-only; the v2.0 refresh adds creative writing but does not directly address multilinguality.[^3][^5]

### Test-set leakage

A static benchmark with a fixed 500-prompt set is, in principle, vulnerable to training-data contamination once the questions are public. The team's stated strategy is **periodic refresh**, first realized with the April 2025 v2.0 release, to keep the benchmark out of training data and force models to be re-evaluated on truly novel prompts.[^4][^5][^7]

### The broader leaderboard-illusion debate

In April 2025, the paper "[The Leaderboard Illusion](/wiki/the_leaderboard_illusion)" by researchers from Cohere, AI2 and academic collaborators argued that Chatbot Arena's policy of allowing select proprietary providers to test multiple private model variants in parallel and reveal only the best one distorts the public leaderboard and, indirectly, the Arena-Hard pipeline that draws prompts from Arena traffic.[^16] The paper specifically claimed that having access to additional Arena data could improve a model's Arena-Hard score by up to 112%, framing the benchmark as susceptible to indirect data-access advantages even before explicit contamination is considered.[^16] Simon Willison's commentary on the criticism noted that the 27 private LLM variants tested by Meta in the lead-up to the [Llama](/wiki/llama_3) 4 release illustrate the asymmetric-access concern, and called for transparency about all variants submitted to the platform.[^17] LMArena has since publicly responded with policy updates and additional disclosures about preview submissions.[^17][^18]

## Maintenance and Governance

### From LMSYS to LMArena

Arena-Hard was created by the LMSYS Org, an academic research collective housed at the UC Berkeley Sky Computing Lab and led by Wei-Lin Chiang, Anastasios N. Angelopoulos and [Ion Stoica](/wiki/ion_stoica) (with broader involvement from Joseph E. Gonzalez and other Sky Lab faculty).[^8][^18][^19] Through 2024 the project gradually separated from the academic LMSYS umbrella: Chatbot Arena moved to a dedicated `lmarena.ai` domain in September 2024,[^18][^19] and the Chatbot Arena and Arena-Hard work was incorporated as an independent company, **LMArena**, in early 2025 with seed funding announced in May 2025.[^18][^19]

LMArena raised approximately **US$100 million in seed funding** in May 2025 at a roughly US$600 million valuation, with participation from Andreessen Horowitz, Lightspeed Venture Partners and Kleiner Perkins.[^18][^19] In January 2026 the company closed a **US$150 million Series A** at a post-money valuation of approximately **US$1.7 billion**, and around the same time rebranded from "LMArena" to simply "Arena", operating at `arena.ai`.[^18][^19] The CEO is Anastasios N. Angelopoulos, with Wei-Lin Chiang as CTO and Ion Stoica as co-founder and advisor.[^19]

The `arena-hard-auto` GitHub repository was correspondingly moved from the `lm-sys/` organization to the new **`lmarena/`** organization, where it continues to be the canonical source for the benchmark code, prompts, and pre-generated answer/judgment datasets, with the dataset mirrored on Hugging Face under `lmarena-ai/arena-hard-auto`.[^5][^6]

### Update cadence

Headline updates since launch:

* **19 April 2024**: Arena-Hard-Auto v0.1 released, 500 prompts, GPT-4-Turbo judge.[^1]
* **17 May 2024**: Hard Prompts category introduced on the live Chatbot Arena leaderboard.[^10]
* **17 June 2024**: arXiv preprint 2406.11939 "From Crowdsourced Data to High-Quality Benchmarks: Arena-Hard and BenchBuilder Pipeline" released.[^3]
* **28 August 2024**: Style Control methodology paper published for Chatbot Arena.[^13]
* **14 October 2024**: Style Control added to `arena-hard-auto`.[^5]
* **23 April 2025**: Arena-Hard-v2.0-Preview released: 500 new prompts plus 250 creative-writing prompts, GPT-4.1 and Gemini 2.5 as judges.[^5][^7]
* **2025**: Paper accepted at ICML 2025 as a poster (`KfTf9vFvSn`).[^11]

## Reception and Usage

Within months of launch, Arena-Hard-Auto win-rates became a standard reporting figure in model release notes from major labs, alongside [MMLU](/wiki/mmlu) and [HumanEval](/wiki/humaneval). Cloud providers such as Amazon Web Services explicitly benchmarked their LLM offerings (e.g. the Nova family) on both MT-Bench and Arena-Hard-Auto.[^20] Third-party leaderboard aggregators including `llm-stats.com` and the Vellum LLM Leaderboard maintain dedicated Arena-Hard and Arena-Hard v2 pages tracking submitted models from [DeepSeek](/wiki/deepseek), [Qwen](/wiki/qwen), NVIDIA's Nemotron family and others.[^14] On the GitHub side, community forks have appeared, for example a Russian-language port (`deepvk/arena-hard-auto-ru`) and a Hugging Face mirror, illustrating uptake outside the original Berkeley group.[^21]

## See also

* [Chatbot Arena](/wiki/lmsys_chatbot_arena): the live human-vote leaderboard from which Arena-Hard prompts are mined.
* [MT-Bench](/wiki/mt_bench): the earlier LLM-as-judge benchmark from the same LMSYS team.
* [AlpacaEval](/wiki/alpacaeval): another pairwise LLM-as-judge benchmark.
* [MMLU](/wiki/mmlu) and [MMLU-Pro](/wiki/mmlu-pro): multiple-choice knowledge benchmarks frequently reported alongside Arena-Hard.
* [LiveCodeBench](/wiki/livecodebench): a contamination-resistant coding benchmark.
* [Ion Stoica](/wiki/ion_stoica), [UC Berkeley](/wiki/uc_berkeley): institutional context for the original work.

## References

[^1]: Li, Tianle; Chiang, Wei-Lin; Frick, Evan; Dunlap, Lisa; Zhu, Banghua; Gonzalez, Joseph E.; Stoica, Ion. "From Live Data to High-Quality Benchmarks: The Arena-Hard Pipeline." LMSYS Org blog, 19 April 2024. https://www.lmsys.org/blog/2024-04-19-arena-hard/

[^2]: "Arena-Hard-Auto methodology: pairwise comparison, GPT-4-Turbo judge, Bradley-Terry." Emergent Mind: Arena-Hard Benchmarking Standard. https://www.emergentmind.com/topics/arena-hard-benchmark

[^3]: Li, Tianle; Chiang, Wei-Lin; Frick, Evan; Dunlap, Lisa; Wu, Tianhao; Zhu, Banghua; Gonzalez, Joseph E.; Stoica, Ion. "From Crowdsourced Data to High-Quality Benchmarks: Arena-Hard and BenchBuilder Pipeline." arXiv preprint 2406.11939, 17 June 2024. https://arxiv.org/abs/2406.11939 (HTML at https://arxiv.org/html/2406.11939v2)

[^4]: "Arena-Hard." UC Berkeley Sky Computing Lab project page. https://sky.cs.berkeley.edu/project/arena-hard/

[^5]: `lmarena/arena-hard-auto` repository README. GitHub. https://github.com/lmarena/arena-hard-auto

[^6]: `lmarena-ai/arena-hard-auto` dataset card. Hugging Face. https://huggingface.co/datasets/lmarena-ai/arena-hard-auto

[^7]: Arena-Hard-v2.0 Preview release announcement (`@lmarena_ai`), 23 April 2025. https://x.com/lmarena_ai/status/1915472561644986700

[^8]: "LMArena." Wikipedia. https://en.wikipedia.org/wiki/LMArena

[^9]: Zheng, Lianmin et al. "Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena." arXiv preprint 2306.05685, 2023. https://arxiv.org/pdf/2306.05685

[^10]: "Introducing Hard Prompts Category in Chatbot Arena." LMSYS Org blog, 17 May 2024. https://www.lmsys.org/blog/2024-05-17-category-hard/

[^11]: "From Crowdsourced Data to High-Quality Benchmarks: Arena-Hard and Benchbuilder Pipeline." ICML 2025 poster page. https://icml.cc/virtual/2025/poster/45630 ; OpenReview record: https://openreview.net/forum?id=KfTf9vFvSn

[^12]: `lmarena-ai/arena-hard-auto-v0.1` dataset card. Hugging Face. https://huggingface.co/datasets/lmarena-ai/arena-hard-auto-v0.1

[^13]: "Does style matter? Disentangling style and substance in Chatbot Arena." LMSYS Org blog, 28 August 2024. https://www.lmsys.org/blog/2024-08-28-style-control/

[^14]: "Arena Hard v2 Benchmark Leaderboard." llm-stats.com. https://llm-stats.com/benchmarks/arena-hard-v2

[^15]: "MMLU-Pro." AI Wiki internal article. See also Wang et al. "MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark." 2024.

[^16]: Singh, Shivalika et al. "The Leaderboard Illusion." arXiv preprint 2504.20879, April 2025. https://arxiv.org/pdf/2504.20879

[^17]: Willison, Simon. "Understanding the recent criticism of the Chatbot Arena." simonwillison.net, 30 April 2025. https://simonwillison.net/2025/Apr/30/criticism-of-the-chatbot-arena/

[^18]: "LMArena Business Breakdown & Founding Story." Contrary Research. https://research.contrary.com/company/lmarena

[^19]: "About LMArena." LMArena blog. https://blog.lmarena.ai/about/ ; Wei-Lin Chiang personal site. https://infwinston.github.io/

[^20]: "Benchmarking Amazon Nova: a comprehensive analysis through MT-Bench and Arena-Hard-Auto." AWS Machine Learning Blog. https://aws.amazon.com/blogs/machine-learning/benchmarking-amazon-nova-a-comprehensive-analysis-through-mt-bench-and-arena-hard-auto/

[^21]: `deepvk/arena-hard-auto-ru` (Russian fork). GitHub. https://github.com/deepvk/arena-hard-auto-ru

