Arena-Hard

Updated Jun 28, 2026

Arena-Hard (and its evaluation tool Arena-Hard-Auto) is an automatic large language model (LLM) benchmark developed by the team behind Chatbot Arena that scores instruction-tuned models on 500 challenging, real-world prompts mined from crowdsourced user conversations.^[1]^[2] Models are compared in pairwise battles against a fixed baseline (GPT-4-0314), graded by a strong LLM judge, and aggregated with a Bradley-Terry model to produce win-rate scores against the baseline.^[2]^[3] Originally released by LMSYS Org researchers at the University of California, Berkeley Sky Computing Lab in April 2024 and now maintained by the spin-off company LMArena, the benchmark is positioned as a cheap, fast proxy for the full Chatbot Arena Elo leaderboard, with the v0.1 release reporting 89.1% agreement with human preference, 87.4% separability between models, and a per-model evaluation cost of roughly US$20-25.^[1]^[2]^[4]

The project's own one-line description is that "Arena-Hard-Auto is an automatic evaluation tool for instruction-tuned LLMs," and the maintainers state that it "has the highest correlation and separability to LMArena (Chatbot Arena) among popular open-ended LLM benchmarks."^[5] The benchmark is distributed through the open-source arena-hard-auto repository (formerly under lm-sys/, now under lmarena/ on GitHub) along with a Hugging Face dataset and a viewer space.^[5]^[6] As of 2025, the project advanced to Arena-Hard-v2.0-Preview, which refreshes the 500-prompt benchmark, adds 250 creative-writing prompts, and replaces the original GPT-4-Turbo judge with GPT-4.1 and Gemini 2.5 as default automatic judges.^[5]^[7]

What problem does Arena-Hard solve?

Chatbot Arena and the need for hard benchmarks

Chatbot Arena launched on 24 April 2023 as an open, crowdsourced platform on which anonymous users pit two LLMs against one another and vote on which response is better, producing an Elo-style leaderboard from the accumulated pairwise comparisons.^[8] By early 2024 the platform had become the de facto reference for instruction-tuned LLM quality, but its slow turnaround (fresh ranks for a new model require thousands of human votes) made it unsuitable for the rapid release cycles of new models, and motivated work on an automatic surrogate that preserved the agreement with human preferences while costing a fraction of the time and money.^[1]^[4]

Two existing automatic benchmarks, MT-Bench (a multi-turn, 80-question LLM-as-judge benchmark released by the same LMSYS team in 2023^[9]) and AlpacaEval, already relied on strong LLM judges, but the LMSYS team argued that their prompts were too easy to discriminate frontier models confidently.^[1] In April 2024, the team reported that although MT-Bench's Spearman correlation with the Chatbot Arena leaderboard was 91.3%, only 22.6% of model pairs remained separable once 95% confidence intervals were considered, motivating a new pipeline built on harder prompts and tighter statistical methodology.^[1]^[2]

Hard Prompts in Chatbot Arena

Roughly a month after the Arena-Hard release, on 17 May 2024 LMSYS introduced a parallel Hard Prompts category on the live Chatbot Arena leaderboard itself, applying the same seven-criterion rubric used for benchmark construction to over one million Arena prompts and creating a sub-leaderboard restricted to prompts scoring 6 or higher.^[10] Approximately 20% of arena prompts qualified, with Specificity, Domain Knowledge and Real-world Application being the most common criteria and Problem-Solving and Complexity the rarest.^[10] On the Hard Prompts split, Claude 3 Opus, GPT-4o and reasoning-focused models (such as Microsoft's Phi-3) gained ground while small instruction-tuned models like Llama-3-8B-Instruct dropped sharply in rank, evidence that the harder-prompts methodology was indeed surfacing capability differences obscured in the general English category.^[10]

When was Arena-Hard released and who made it?

Launch and authorship

The Arena-Hard pipeline was announced on 19 April 2024 in a post titled "From Live Data to High-Quality Benchmarks: The Arena-Hard Pipeline" on the LMSYS Org blog.^[1] The post describes the project succinctly as "a data pipeline to build high-quality benchmarks from live data in Chatbot Arena, which is a crowd-sourced platform for LLM evals."^[1] The associated work was authored by Tianle Li, Wei-Lin Chiang, Evan Frick, Lisa Dunlap, Banghua Zhu, Joseph E. Gonzalez and Ion Stoica, with affiliations at the UC Berkeley Sky Computing Lab and LMSYS Org.^[1]^[4] A full paper titled "From Crowdsourced Data to High-Quality Benchmarks: Arena-Hard and BenchBuilder Pipeline" by the same group plus Tianhao Wu was posted to arXiv as preprint 2406.11939 on 17 June 2024, and was later peer-reviewed and accepted as a poster at ICML 2025.^[3]^[11]

The original release positioned Arena-Hard as having three required properties for a useful benchmark: ability to robustly differentiate model capabilities, alignment with real-world human preferences, and frequent updates to avoid test-set leakage.^[4] The team committed to refreshing the benchmark periodically, a commitment realized through the v2.0-Preview release in April 2025.^[5]^[7]

Three benchmark "products"

It is helpful to distinguish three related artifacts that all carry the "Arena-Hard" name:

The Arena-Hard pipeline: an automated data-curation pipeline (later renamed BenchBuilder) that mines, clusters and quality-filters prompts from crowdsourced sources such as Chatbot Arena.^[3]^[4]
Arena-Hard-Auto: the static 500-prompt benchmark plus evaluation harness produced by running the pipeline on roughly 200 000 Chatbot Arena conversations in early 2024.^[1]^[3]
The Hard Prompts category in the live Chatbot Arena: a live filtered slice of the ongoing crowdsourced leaderboard, distinct from the static benchmark.^[10]

In casual usage "Arena-Hard" most often refers to Arena-Hard-Auto and its leaderboard, while the underlying pipeline is published under the BenchBuilder name in the arXiv paper.^[3]

How are the 500 Arena-Hard prompts built?

Source data

The original pipeline drew on 200 000 user queries collected via Chatbot Arena between roughly April 2023 and early 2024.^[1] In the arXiv paper, the BenchBuilder pipeline is additionally demonstrated on 150 000 queries sampled from the WildChat-1M dataset of ChatGPT conversations, showing that the same methodology generalizes beyond Arena data.^[3]

Clustering by topic

To ensure topical diversity in the final benchmark, each prompt is embedded with OpenAI's text-embedding-3-small, projected to a lower-dimensional space using UMAP, and clustered via HDBSCAN through the BERTopic library; cluster summaries are then generated by GPT-4-Turbo.^[1]^[4] This step yields roughly 4 000 topical clusters covering a wide spectrum of real-world tasks.^[1]^[4]

Seven-criterion quality scoring

A calibrated LLM judge (using GPT-3.5-Turbo for cheap pre-scoring and GPT-4-Turbo for final scoring) rates every prompt on a 0-7 scale that counts how many of seven challenge-oriented criteria are satisfied:^[1]^[3]

Specificity: does the prompt ask something concrete and well-defined?
Domain Knowledge: does answering it require expertise in a specific field?
Complexity: does the task involve multiple steps or considerations?
Problem-Solving: does the prompt require active reasoning rather than recall?
Creativity: does the task require novel generation?
Technical Accuracy: does the answer demand technical precision?
Real-world Application: does the prompt resemble what a real user would ask?

Each cluster is then scored by the mean score of its member prompts. Only clusters whose mean score is at least 6 out of 7 are kept for sampling; lower-scoring clusters are discarded.^[4]
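The scoring and filtering step can be sketched as follows. This is a minimal illustration rather than the BenchBuilder code: the identifier spellings in `CRITERIA`, the helpers `prompt_score` and `keep_clusters`, and the input shape are all hypothetical, and the judge's per-criterion verdicts are assumed to be available already as sets of criterion names.

```python
from statistics import mean

# The seven criteria listed above; the identifier spellings are illustrative.
CRITERIA = (
    "specificity", "domain_knowledge", "complexity", "problem_solving",
    "creativity", "technical_accuracy", "real_world_application",
)

def prompt_score(satisfied):
    """Score a prompt 0-7 by counting how many known criteria were marked satisfied."""
    return len(set(satisfied) & set(CRITERIA))

def keep_clusters(clusters, threshold=6.0):
    """Return {cluster_name: mean_score} for clusters whose mean score >= threshold.

    `clusters` maps a cluster name to a list of per-prompt criterion sets.
    """
    means = {name: mean(prompt_score(p) for p in prompts)
             for name, prompts in clusters.items() if prompts}
    return {name: m for name, m in means.items() if m >= threshold}

# Toy data: one technical cluster (scores 7 and 6) and one trivial cluster (1 and 0).
example = {
    "Advanced TypeScript": [set(CRITERIA), set(CRITERIA) - {"creativity"}],
    "Greetings": [{"specificity"}, set()],
}
kept = keep_clusters(example)
```

With the toy data only the first cluster survives, since its mean of 6.5 clears the threshold of 6 while the second cluster averages 0.5.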

Selecting the final 500 prompts

From the high-scoring clusters, the team retained 250 clusters and randomly sampled two prompts per cluster to yield the 500-prompt Arena-Hard-Auto-v0.1 benchmark.^[1]^[3] The published dataset on Hugging Face (lmarena-ai/arena-hard-auto-v0.1) contains exactly 500 rows, each with a question_id, category, cluster label and a turn list containing the prompt text, released under the Apache 2.0 license.^[6]^[12] Sample clusters include "Advanced Math", "Advanced TypeScript", "AI & Alignment", "AWS Cloud Services" and "ABC Sequence Puzzles", illustrating the technical bent of the curated prompts.^[12]
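The final selection amounts to a two-stage sample. The sketch below assumes the 250 clusters are themselves drawn at random from the eligible ones, which the text does not state; the function name `sample_benchmark`, the row layout and the toy data are hypothetical and do not reproduce the published dataset schema.

```python
import random

def sample_benchmark(cluster_prompts, n_clusters=250, per_cluster=2, seed=0):
    """Pick `n_clusters` clusters, then `per_cluster` prompts from each, without replacement."""
    rng = random.Random(seed)  # fixed seed so the selection is reproducible
    eligible = sorted(name for name, prompts in cluster_prompts.items()
                      if len(prompts) >= per_cluster)
    chosen = rng.sample(eligible, min(n_clusters, len(eligible)))
    rows = []
    for cluster in chosen:
        for prompt in rng.sample(cluster_prompts[cluster], per_cluster):
            rows.append({"cluster": cluster, "prompt": prompt})
    return rows

# Toy input: 300 hypothetical high-scoring clusters with 5 prompts each.
toy = {f"cluster-{i}": [f"prompt {i}.{j}" for j in range(5)] for i in range(300)}
rows = sample_benchmark(toy)
```

Sampling per cluster rather than from the pooled prompts keeps any single large topic from dominating the 500-prompt set.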

How does Arena-Hard work?

Pairwise comparison against a baseline

Rather than score model answers independently (which can introduce drift and absolute-scale bias), Arena-Hard-Auto evaluates each candidate by asking a strong LLM judge to compare its answer against a baseline model's answer, prompt by prompt.^[2]^[4] The default baseline is GPT-4-0314 (the March 2023 snapshot of GPT-4), chosen because it is a strong but no-longer-being-updated reference point.^[2]^[6]

Judge model

In the v0.1 release the default judge is GPT-4-Turbo (gpt-4-1106-preview), with Claude-3-Opus used as a secondary judge to probe robustness.^[1]^[2] Chain-of-thought prompting is used: the judge first generates its own answer to the prompt, then writes a comparison between A and B before giving a verdict.^[2] The judge classifies the comparison into one of five labels:^[2]

A is significantly better than B (A >> B)
A is slightly better than B (A > B)
A is roughly equal to B (A ~= B)
B is slightly better than A (B > A)
B is significantly better than A (B >> A)

Two-game position-swap setup

To control for position bias in the LLM-as-a-judge (the well-documented tendency of GPT-4 to prefer whichever response appears first), every prompt is judged twice, once with the candidate as response A and once as response B, yielding 1 000 judgments per evaluated model on the 500-prompt benchmark.^[2]
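A minimal sketch of how the two position-swapped verdicts for one prompt might be turned into battle records. The label spellings follow the five-label list above; the weight of 3 for a "significantly better" verdict, the even split of ties, and the names `to_battles` and `judge_prompt_twice` are illustrative assumptions, not the harness's actual code.

```python
# Five judge labels mapped to (winning slot, weight). The weight of 3 for a
# "significantly better" verdict is an illustrative assumption.
VERDICTS = {
    "A>>B": ("A", 3.0),
    "A>B": ("A", 1.0),
    "A~=B": (None, 1.0),
    "B>A": ("B", 1.0),
    "B>>A": ("B", 3.0),
}

def to_battles(verdict, candidate_slot, candidate="candidate", baseline="baseline"):
    """Convert one judge verdict into (winner, loser, weight) tuples.

    `candidate_slot` is "A" or "B": the position the candidate's answer occupied.
    """
    winner_slot, weight = VERDICTS[verdict.replace(" ", "")]
    if winner_slot is None:
        # Tie: split the weight evenly between the two models (an assumption).
        return [(candidate, baseline, weight / 2), (baseline, candidate, weight / 2)]
    if winner_slot == candidate_slot:
        return [(candidate, baseline, weight)]
    return [(baseline, candidate, weight)]

def judge_prompt_twice(verdict_candidate_first, verdict_candidate_second):
    """Combine the two games for one prompt: candidate shown as A, then as B."""
    return (to_battles(verdict_candidate_first, "A")
            + to_battles(verdict_candidate_second, "B"))

# Game 1: candidate is A and wins clearly. Game 2: candidate is B and loses narrowly.
battles = judge_prompt_twice("A >> B", "A > B")
```

In this example the judge preferred slot A both times, so the swap turns what would have looked like two candidate wins into one weighted win and one loss.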

Bradley-Terry aggregation

The set of pairwise comparisons across all 500 prompts (each weighted by the strength of preference encoded in the five-label scale) is fit with a Bradley-Terry logistic-regression model, the same statistical model that underpins the Chatbot Arena Elo leaderboard, to produce a latent strength coefficient for each model.^[2]^[4] The coefficient is then converted to an interpretable win-rate against the baseline (GPT-4-0314), with 95% confidence intervals derived from 100 rounds of bootstrap resampling.^[2]
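The aggregation can be illustrated with a small pure-Python sketch. It fits Bradley-Terry strengths with the classic minorization-maximization (MM) iteration, swapped in here for the logistic-regression fit the text describes, and uses a percentile bootstrap with 100 rounds. The function names, the `(winner, loser, weight)` record format and the toy battle counts are assumptions, and degenerate inputs (for example, several models that never win) are not guarded against.

```python
import random
from collections import defaultdict

def fit_bradley_terry(battles, iters=200):
    """Maximum-likelihood Bradley-Terry strengths via the MM iteration.

    `battles` is a list of (winner, loser, weight) tuples.
    """
    wins = defaultdict(float)   # weighted wins per model
    games = defaultdict(float)  # weighted games per unordered pair
    for winner, loser, weight in battles:
        wins[winner] += weight
        games[frozenset((winner, loser))] += weight
    models = {m for pair in games for m in pair}
    strength = {m: 1.0 for m in models}
    for _ in range(iters):
        updated = {}
        for i in models:
            denom = sum(n / (strength[i] + strength[j])
                        for pair, n in games.items() if i in pair
                        for j in pair - {i})
            updated[i] = wins[i] / denom
        total = sum(updated.values())
        strength = {m: s / total for m, s in updated.items()}
    return strength

def win_rate(battles, model, baseline):
    """Model-implied probability that `model` beats `baseline`."""
    s = fit_bradley_terry(battles)
    return s[model] / (s[model] + s[baseline])

def bootstrap_ci(battles, model, baseline, rounds=100, seed=0):
    """95% percentile interval from resampling the battles with replacement."""
    rng = random.Random(seed)
    rates = sorted(
        win_rate(rng.choices(battles, k=len(battles)), model, baseline)
        for _ in range(rounds)
    )
    return rates[int(0.025 * rounds)], rates[int(0.975 * rounds) - 1]

# Toy data: the candidate wins 60 of 100 unit-weight battles against the baseline.
toy = [("candidate", "baseline", 1.0)] * 60 + [("baseline", "candidate", 1.0)] * 40
point = win_rate(toy, "candidate", "baseline")
low, high = bootstrap_ci(toy, "candidate", "baseline")
```

With a single candidate and a single baseline the fitted win rate reduces to the weighted share of wins, so `point` is 0.6 here; the model only does extra work once several candidates share the same baseline.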

Cost

The team reported a total cost of approximately US$25 per model for v0.1 using GPT-4-Turbo as judge,^[1]^[4] which the arXiv version of the paper refines to roughly US$20 per model.^[3] Either figure is far below the cost, in time and money, of collecting the thousands of human votes needed to place a model on the full Chatbot Arena leaderboard.^[4]

How well does Arena-Hard correlate with human rankings?

LMSYS published three custom benchmark-quality metrics alongside Arena-Hard, intended to capture aspects ignored by raw Spearman correlation:^[1]^[4]

Separability with Confidence: the fraction of model pairs whose 95% confidence intervals do not overlap, measuring how confidently the benchmark can rank pairs.^[1]^[4]
Agreement with Confidence: the fraction of ranked model pairs whose ordering agrees with Chatbot Arena when both intervals are considered.^[1]^[3]
Pair Rank Brier Score: a probabilistic loss for pairwise ranking that penalizes confident-but-wrong predictions.^[5]
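Of the three metrics, Separability with Confidence is the simplest to compute directly from its definition above. The sketch below assumes each model comes with a `(lower, upper)` 95% confidence interval on its score; the function names and the toy intervals are hypothetical.

```python
from itertools import combinations

def overlaps(a, b):
    """True if closed intervals a=(lo, hi) and b=(lo, hi) share any point."""
    return a[0] <= b[1] and b[0] <= a[1]

def separability(intervals):
    """Fraction of model pairs whose confidence intervals do not overlap."""
    pairs = list(combinations(sorted(intervals), 2))
    separated = sum(not overlaps(intervals[x], intervals[y]) for x, y in pairs)
    return separated / len(pairs)

# Toy intervals: only models "a" and "b" are cleanly separated.
toy_intervals = {"a": (60.0, 70.0), "b": (40.0, 50.0), "c": (45.0, 65.0)}
score = separability(toy_intervals)
```

Here one of the three pairs is separable, so `score` is one third; a benchmark with wide intervals can rank models in a plausible order and still score poorly on this metric.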

Reported results for Arena-Hard-Auto-v0.1 on the top-20 Chatbot Arena models as of April 2024:^[1]^[4]

Metric	Arena-Hard-Auto v0.1	MT-Bench
Separability with confidence	87.4%	22.6%
Agreement with Chatbot Arena (95% CI)	89.1%	26.6%
Spearman correlation	94.1%	91.3%
Cost per model	~US$25	~US$10

The arXiv paper later reports refined numbers of 87.4% separability, 90.9% confidence-aware agreement, 93.2% Spearman correlation and 98.6% correlation with full Chatbot Arena rankings (with style control), all at ~US$20 per model.^[3] The authors summarize the headline result directly: "Arena-Hard-Auto achieves 98.6% correlation with human preference rankings, all at a cost of $20," and report that the pipeline "provides 3x higher separation of model performances compared to MT-Bench."^[3]

A persistent finding is that simple Spearman correlation tends to overstate alignment: in the LMSYS analysis, MT-Bench posts a nominal 91.3% Spearman correlation with Chatbot Arena, yet only 22.6% of its model pairs are separable once 95% CIs are required, while Arena-Hard remains above 87% on both of the stricter, confidence-aware metrics.^[1]^[4] This motivated the team's repeated argument that confidence-aware benchmark evaluation is the appropriate standard.^[1]^[3]

What do Arena-Hard leaderboards look like?

v0.1 (April 2024, GPT-4-Turbo judge)

At launch the top results on Arena-Hard-Auto with the default GPT-4-Turbo judge were:^[1]

Model	Win-rate vs GPT-4-0314
`gpt-4-turbo-2024-04-09`	82.6%
`gpt-4-0125-preview`	78.0%
`claude-3-opus-20240229`	60.4%
`claude-3-sonnet-20240229`	46.8%
`llama-3-70b-instruct`	41.1%

The strong showing of Llama 3 70B Instruct, released on 18 April 2024, one day before the Arena-Hard announcement, was particularly noted because it placed a widely available open-weight model close to proprietary mid-tier models on the leaderboard.^[1]

Mid-2024 refresh

By mid-2024 the leaderboard had been extended with newer models. Per the arXiv paper, top performers on the same v0.1 benchmark included Claude 3.5 Sonnet at 79.3% win rate and GPT-4o at 79.2%, edging out earlier GPT-4 Turbo snapshots.^[3]

Style Control (October 2024)

On 14 October 2024, the arena-hard-auto repository added support for Style Control, porting the methodology introduced in the August 2024 Chatbot Arena paper "Does style matter? Disentangling style and substance in Chatbot Arena."^[13]^[5] Style Control augments the Bradley-Terry regression with additional independent variables that capture the normalized difference in response length (tokens), number of markdown headers, number of bold elements, and number of markdown lists between the two answers, allowing the rating to attribute portions of preference to style rather than substance.^[13] In the Chatbot Arena data the length coefficient (0.249) dominated all markdown effects (lists 0.031, headers 0.024, bold 0.019), confirming length as the largest stylistic confounder.^[13]
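The style covariates can be sketched as below. This is an illustration under stated assumptions, not the repository's implementation: whitespace splitting stands in for a real tokenizer, the regular expressions are rough approximations of markdown headers, bold spans and list items, and the normalized difference is taken to be `(a - b) / (a + b)`. All function names are hypothetical.

```python
import re

def style_features(text):
    """Count the four style signals for one answer (rough approximations)."""
    return {
        "length": len(text.split()),  # whitespace tokens, standing in for a tokenizer
        "headers": len(re.findall(r"^#{1,6}\s", text, flags=re.MULTILINE)),
        "bold": len(re.findall(r"\*\*[^*\n]+\*\*", text)),
        "lists": len(re.findall(r"^[ \t]*(?:[-*+]|\d+\.)[ \t]", text, flags=re.MULTILINE)),
    }

def normalized_diff(a, b):
    """Difference scaled into [-1, 1]; defined as 0 when both counts are 0."""
    return (a - b) / (a + b) if (a + b) else 0.0

def style_covariates(answer_a, answer_b):
    """Per-feature normalized differences between two answers."""
    fa, fb = style_features(answer_a), style_features(answer_b)
    return {name: normalized_diff(fa[name], fb[name]) for name in fa}

answer_a = "# Title\n\n**Bold** point\n- one\n- two\n"
answer_b = "plain short reply"
covariates = style_covariates(answer_a, answer_b)
```

These values would enter the regression as extra independent variables alongside the model indicators, so that a preference explained by length or markdown is absorbed by the style coefficients rather than credited to the model.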

When style control was applied to Arena-Hard-Auto, GPT-4o-mini fell from rank 6 to 11 overall and Grok-2-mini from rank 6 to 18, while Claude 3 Opus rose from rank 16 to 10 and Llama 3.1 405B also rose notably.^[13] The LMSYS team reported that under style control, Claude 3.5 Sonnet tied for first place with chatgpt-4o-latest on hard prompts.^[13] Importantly, the team showed that adding verbose or markdown-heavy system prompts could inflate raw Arena-Hard scores but could not inflate the style-controlled scores, providing a defense against simple style gaming.^[13]

Arena-Hard-v2.0-Preview (April 2025)

On 23 April 2025 the team released Arena-Hard-v2.0-Preview, a refresh designed to address test-set staleness and broaden coverage.^[5]^[7] The maintainers framed its purpose plainly: "If you are curious to see how well your model might perform on LMArena before deploying, we recommend trying Arena-Hard-Auto's newest evaluation set."^[5] V2.0 contains:^[5]^[7]

500 fresh, challenging real-world prompts sourced from new Chatbot Arena traffic, covering open-ended software-engineering problems, math, and other technical domains;
250 creative writing prompts sourced from the same platform, evaluated as a separate split;
New default judges: GPT-4.1 (recommended for speed and stability) and Gemini 2.5 (recommended for creative writing), with an ensemble option that aggregates both;
Style Control retained, controllable via a --control-features flag at result time;
Judgment caching for cheaper incremental evaluation when new models are added.

Top results on Arena-Hard-v2.0-Preview using the official configuration (Hard-Prompt split, Style Control on, Gemini 2.5 judge) included OpenAI o3 (o3-2025-04-16) at 85.9%, o4-mini (high-effort variant) at 79.1%, and Gemini 2.5 at 79.0%, ahead of o4-mini at 74.6% and Gemini 2.5 Flash at 68.6%.^[5] In later updates of the leaderboard reported by third-party trackers, top scores rose further as new reasoning models such as DeepSeek-R1 entered the table.^[14]

How does Arena-Hard compare to other benchmarks?

MT-Bench

MT-Bench, the predecessor automatic benchmark from the same LMSYS group, asks GPT-4 to grade individual answers across 80 multi-turn prompts on a 10-point scale and reports the mean score.^[9] Arena-Hard improves on MT-Bench along three axes simultaneously: prompt difficulty (curated hard prompts vs hand-written generic prompts), comparison style (pairwise vs absolute scoring), and statistical methodology (Bradley-Terry with confidence intervals vs raw mean scores).^[1]^[3] The headline numbers (87.4% vs 22.6% separability and 89.1% vs 26.6% confidence-aware agreement with Chatbot Arena) quantify the resulting improvement.^[1]

AlpacaEval

AlpacaEval and its length-controlled successor AlpacaEval 2.0 LC are alternative LLM-as-judge benchmarks scored by win-rate against a strong baseline. AlpacaEval 2.0 LC achieved 83.2% separability and roughly 82.5% agreement with Chatbot Arena in the LMSYS analysis, behind Arena-Hard on both dimensions while serving a similar role of cheap, fast proxy for human-preference rankings.^[1]^[4]

MMLU, MMLU-Pro and other multi-task benchmarks

MMLU (Massive Multitask Language Understanding) and its harder variant MMLU-Pro are knowledge-and-reasoning benchmarks built from multiple-choice questions, which measure factual recall and constrained reasoning rather than the open-ended generation quality that Arena-Hard targets.^[15] The two benchmark families are typically reported side by side in model evaluations because they measure complementary capabilities.

What are the limitations and criticisms of Arena-Hard?

Judge bias toward GPT-4-style answers

Because the default v0.1 judge is GPT-4-Turbo (and v2.0 defaults to GPT-4.1 or Gemini 2.5), the benchmark inherits any preferences these judges hold. The LMSYS team reported that both GPT-4-Turbo and Claude-3-Opus are sensitive to response length, with Claude exhibiting larger length bias under "more detailed" system prompts.^[4] Single-judge self-preference is documented in the literature on LLM-as-a-judge, where a model tends to slightly favor its own family; for example, Claude judges favor Claude answers.^[3] The arXiv paper proposes Ensemble-as-Judges, combining GPT-4-Turbo and Gemini 1.5 Pro to reduce self-bias and improve agreement with humans, an approach later operationalized in the v2.0 ensemble option.^[3]^[5]

Length and style gameability

Before the addition of Style Control, prompting candidate models to produce verbose, markdown-heavy answers reliably inflated their Arena-Hard win rates.^[13] Style Control, introduced in the live Chatbot Arena in August 2024 and ported to Arena-Hard-Auto in October 2024, mitigates this gameability by regressing out length and markdown coefficients before reporting model strength; the LMSYS team demonstrated that style-controlled scores could not be inflated by verbose system prompts.^[13]

Quality criteria coverage

The authors explicitly acknowledge in the arXiv paper that the seven defined quality criteria may not fully capture all possible attributes of a high-quality prompt and that the resulting benchmark skews toward technical and STEM-heavy domains, a bias visible in the v0.1 sample clusters such as "Advanced TypeScript", "Advanced Math" and "AWS Cloud Services".^[3]^[12] V0.1 also lacks multi-turn evaluation and is essentially English-only; the v2.0 refresh adds creative writing but does not directly address multilinguality.^[3]^[5]

Test-set leakage

A static benchmark with a fixed 500-prompt set is, in principle, vulnerable to training-data contamination once the questions are public. The team's stated strategy is periodic refresh, first realized with the April 2025 v2.0 release, to keep the benchmark out of training data and force models to be re-evaluated on truly novel prompts.^[4]^[5]^[7]

The broader leaderboard-illusion debate

In April 2025, the paper "The Leaderboard Illusion" by researchers from Cohere, AI2 and academic collaborators argued that Chatbot Arena's policy of allowing select proprietary providers to test multiple private model variants in parallel and reveal only the best one distorts the public leaderboard and, indirectly, the Arena-Hard pipeline that draws prompts from Arena traffic.^[16] The paper specifically claimed that access to additional Arena data could improve a model's Arena-Hard score by up to 112% in relative terms, framing the benchmark as susceptible to indirect data-access advantages even before explicit contamination is considered.^[16] Simon Willison's commentary on the criticism noted that 27 private LLM variants tested by Meta in the lead-up to the Llama 4 release illustrate the asymmetric-access concern, and called for transparency about all variants submitted to the platform.^[17] LMArena has since publicly responded with policy updates and additional disclosures about preview submissions.^[17]^[18]

Who maintains Arena-Hard?

From LMSYS to LMArena

Arena-Hard was created by the LMSYS Org, an academic research collective housed at the UC Berkeley Sky Computing Lab and led by Wei-Lin Chiang, Anastasios N. Angelopoulos and Ion Stoica (with broader involvement from Joseph E. Gonzalez and other Sky Lab faculty).^[8]^[18]^[19] Through 2024 the project gradually separated from the academic LMSYS umbrella: Chatbot Arena moved to a dedicated lmarena.ai domain in September 2024,^[18]^[19] and the Chatbot Arena and Arena-Hard work was incorporated as an independent company, LMArena, in early 2025 with seed funding announced in May 2025.^[18]^[19]

LMArena raised approximately US$100 million in seed funding in May 2025 at a roughly US$600 million valuation, with participation from Andreessen Horowitz, Lightspeed Venture Partners and Kleiner Perkins.^[18]^[19] In January 2026 the company closed a US$150 million Series A at a post-money valuation of approximately US$1.7 billion, and around the same time rebranded from "LMArena" to simply "Arena", operating at arena.ai.^[18]^[19] The CEO is Anastasios N. Angelopoulos, with Wei-Lin Chiang as CTO and Ion Stoica as co-founder and advisor.^[19]

The arena-hard-auto GitHub repository was correspondingly moved from the lm-sys/ organization to the new lmarena/ organization, where it continues to be the canonical source for the benchmark code, prompts, and pre-generated answer/judgment datasets, with the dataset mirrored on Hugging Face under lmarena-ai/arena-hard-auto.^[5]^[6]

Update cadence

Headline updates since launch:

19 April 2024: Arena-Hard-Auto v0.1 released, 500 prompts, GPT-4-Turbo judge.^[1]
17 May 2024: Hard Prompts category introduced on the live Chatbot Arena leaderboard.^[10]
17 June 2024: arXiv preprint 2406.11939 "From Crowdsourced Data to High-Quality Benchmarks: Arena-Hard and BenchBuilder Pipeline" released.^[3]
28 August 2024: Style Control methodology paper published for Chatbot Arena.^[13]
14 October 2024: Style Control added to arena-hard-auto.^[5]
23 April 2025: Arena-Hard-v2.0-Preview released: 500 new prompts plus 250 creative-writing prompts, GPT-4.1 and Gemini 2.5 as judges.^[5]^[7]
2025: Paper accepted at ICML 2025 as a poster (OpenReview ID KfTf9vFvSn).^[11]

How widely is Arena-Hard used?

Within months of launch, Arena-Hard-Auto win-rates became a standard reporting figure in model release notes from major labs, alongside MMLU and HumanEval. Cloud providers such as Amazon Web Services explicitly benchmarked their LLM offerings (e.g. the Nova family) on both MT-Bench and Arena-Hard-Auto.^[20] Third-party leaderboard aggregators including llm-stats.com and the Vellum LLM Leaderboard maintain dedicated Arena-Hard and Arena-Hard v2 pages tracking submitted models from DeepSeek, Qwen, NVIDIA's Nemotron family and others.^[14] On the GitHub side, community forks have appeared, for example a Russian-language port (deepvk/arena-hard-auto-ru) and a Hugging Face mirror, illustrating uptake outside the original Berkeley group.^[21]

ELI5: Arena-Hard in plain terms

Imagine you want to know which AI chatbot is smartest, but asking thousands of people to vote takes weeks. Arena-Hard is a shortcut: it collects 500 genuinely hard questions that real people asked, has each chatbot answer them, and then asks a very capable "judge" AI to decide whether the new chatbot's answer is better or worse than a fixed reference chatbot. Doing this twice per question (swapping which answer is shown first) and tallying the results gives a "win rate" that, the makers report, lines up with the slow human voting about 89% of the time while costing only about US$20-25 per model.^[1]^[3] In short, it is a fast, cheap stand-in for the big human popularity contest.

References

1. Li, Tianle; Chiang, Wei-Lin; Frick, Evan; Dunlap, Lisa; Zhu, Banghua; Gonzalez, Joseph E.; Stoica, Ion. "From Live Data to High-Quality Benchmarks: The Arena-Hard Pipeline." LMSYS Org blog, 19 April 2024. https://www.lmsys.org/blog/2024-04-19-arena-hard/
2. "Arena-Hard-Auto methodology: pairwise comparison, GPT-4-Turbo judge, Bradley-Terry." Emergent Mind: Arena-Hard Benchmarking Standard. https://www.emergentmind.com/topics/arena-hard-benchmark
3. Li, Tianle; Chiang, Wei-Lin; Frick, Evan; Dunlap, Lisa; Wu, Tianhao; Zhu, Banghua; Gonzalez, Joseph E.; Stoica, Ion. "From Crowdsourced Data to High-Quality Benchmarks: Arena-Hard and BenchBuilder Pipeline." arXiv preprint 2406.11939, 17 June 2024. https://arxiv.org/abs/2406.11939 (HTML at https://arxiv.org/html/2406.11939v2)
4. "Arena-Hard." UC Berkeley Sky Computing Lab project page. https://sky.cs.berkeley.edu/project/arena-hard/
5. `lmarena/arena-hard-auto` repository README. GitHub. https://github.com/lmarena/arena-hard-auto
6. `lmarena-ai/arena-hard-auto` dataset card. Hugging Face. https://huggingface.co/datasets/lmarena-ai/arena-hard-auto
7. Arena-Hard-v2.0 Preview release announcement (`@lmarena_ai`), 23 April 2025. https://x.com/lmarena_ai/status/1915472561644986700
8. "LMArena." Wikipedia. https://en.wikipedia.org/wiki/LMArena
9. Zheng, Lianmin et al. "Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena." arXiv preprint 2306.05685, 2023. https://arxiv.org/pdf/2306.05685
10. "Introducing Hard Prompts Category in Chatbot Arena." LMSYS Org blog, 17 May 2024. https://www.lmsys.org/blog/2024-05-17-category-hard/
11. "From Crowdsourced Data to High-Quality Benchmarks: Arena-Hard and Benchbuilder Pipeline." ICML 2025 poster page. https://icml.cc/virtual/2025/poster/45630 ; OpenReview record: https://openreview.net/forum?id=KfTf9vFvSn
12. `lmarena-ai/arena-hard-auto-v0.1` dataset card. Hugging Face. https://huggingface.co/datasets/lmarena-ai/arena-hard-auto-v0.1
13. "Does style matter? Disentangling style and substance in Chatbot Arena." LMSYS Org blog, 28 August 2024. https://www.lmsys.org/blog/2024-08-28-style-control/
14. "Arena Hard v2 Benchmark Leaderboard." llm-stats.com. https://llm-stats.com/benchmarks/arena-hard-v2
15. Wang, Yubo et al. "MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark." arXiv preprint 2406.01574, 2024. https://arxiv.org/abs/2406.01574
16. Singh, Shivalika et al. "The Leaderboard Illusion." arXiv preprint 2504.20879, April 2025. https://arxiv.org/pdf/2504.20879
17. Willison, Simon. "Understanding the recent criticism of the Chatbot Arena." simonwillison.net, 30 April 2025. https://simonwillison.net/2025/Apr/30/criticism-of-the-chatbot-arena/
18. "LMArena Business Breakdown & Founding Story." Contrary Research. https://research.contrary.com/company/lmarena
19. "About LMArena." LMArena blog. https://blog.lmarena.ai/about/ ; Wei-Lin Chiang personal site. https://infwinston.github.io/
20. "Benchmarking Amazon Nova: a comprehensive analysis through MT-Bench and Arena-Hard-Auto." AWS Machine Learning Blog. https://aws.amazon.com/blogs/machine-learning/benchmarking-amazon-nova-a-comprehensive-analysis-through-mt-bench-and-arena-hard-auto/
21. `deepvk/arena-hard-auto-ru` (Russian fork). GitHub. https://github.com/deepvk/arena-hard-auto-ru
