Arena-Hard
Last reviewed
May 31, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v2 · 3,691 words
Arena-Hard (and its evaluation tool Arena-Hard-Auto) is an automatic large language model (LLM) benchmark developed by the team behind Chatbot Arena that scores instruction-tuned models on 500 challenging, real-world prompts mined from crowdsourced user conversations.[1][2] Models are compared in pairwise battles against a fixed baseline (GPT-4-0314), graded by a strong LLM judge, and aggregated with a Bradley–Terry model to produce win-rate scores against the baseline.[2][3] Originally released by LMSYS Org researchers at the University of California, Berkeley Sky Computing Lab in April 2024 and now maintained by the spin-off company LMArena, the benchmark is positioned as a cheap, fast proxy for the full Chatbot Arena Elo leaderboard, with the v0.1 release reporting 89.1% agreement with human preference, 87.4% separability between models, and a per-model evaluation cost of roughly US$20–25.[1][2][4]
The benchmark is distributed through the open-source arena-hard-auto repository (formerly under lm-sys/, now under lmarena/ on GitHub) along with a Hugging Face dataset and a viewer space.[5][6] In 2025 the project advanced to Arena-Hard-v2.0-Preview, which refreshes the 500-prompt benchmark, adds 250 creative-writing prompts, and replaces the original GPT-4-Turbo judge with GPT-4.1 and Gemini 2.5 as the default automatic judges.[5][7]
Chatbot Arena launched on 24 April 2023 as an open, crowdsourced platform on which anonymous users pit two LLMs against one another and vote on which response is better, producing an Elo-style leaderboard from the accumulated pairwise comparisons.[8] By early 2024 the platform had become the de facto reference for instruction-tuned LLM quality, but ranking a new model requires thousands of human votes, and this slow turnaround made it ill-suited to rapid model release cycles. That motivated work on an automatic surrogate that preserved agreement with human preferences at a fraction of the time and cost.[1][4]
Two existing automatic benchmarks, MT-Bench (a multi-turn, 80-question LLM-as-judge benchmark released by the same LMSYS team in 2023[9]) and AlpacaEval, already used strong LLM judges for pairwise comparisons against a baseline, but the LMSYS team argued that their prompts were too easy to discriminate frontier models confidently.[1] In April 2024, the team reported that MT-Bench's agreement with the Chatbot Arena leaderboard fell from 91.3% to 22.6% once 95% confidence intervals were considered, motivating a new pipeline built on harder prompts and tighter statistical methodology.[1][2]
Roughly a month after the Arena-Hard release, on 17 May 2024 LMSYS introduced a parallel Hard Prompts category on the live Chatbot Arena leaderboard itself, applying the same seven-criterion rubric used for benchmark construction to over one million Arena prompts and creating a sub-leaderboard restricted to prompts scoring 6 or higher.[10] Approximately 20% of arena prompts qualified, with Specificity, Domain Knowledge and Real-world Application being the most common criteria and Problem-Solving and Complexity the rarest.[10] On the Hard Prompts split, Claude 3 Opus, GPT-4o and reasoning-focused models (such as Microsoft's Phi-3) gained ground while small instruction-tuned models like Llama-3-8B-Instruct dropped sharply in rank, evidence that the harder-prompts methodology was indeed surfacing capability differences obscured in the general English category.[10]
The Arena-Hard pipeline was announced on 19 April 2024 in a post titled "From Live Data to High-Quality Benchmarks: The Arena-Hard Pipeline" on the LMSYS Org blog.[1] The associated work was authored by Tianle Li, Wei-Lin Chiang, Evan Frick, Lisa Dunlap, Banghua Zhu, Joseph E. Gonzalez and Ion Stoica, with affiliations at the UC Berkeley Sky Computing Lab and LMSYS Org.[1][4] The accompanying paper, "From Crowdsourced Data to High-Quality Benchmarks: Arena-Hard and BenchBuilder Pipeline", by the same group plus Tianhao Wu, was posted to arXiv as 2406.11939 on 17 June 2024 and was later accepted as a poster at ICML 2025.[3][11]
The original release identified three properties that a useful benchmark must have: the ability to robustly differentiate model capabilities, alignment with real-world human preferences, and frequent updates to avoid test-set leakage.[4] The team committed to refreshing the benchmark periodically, a commitment realized through the v2.0-Preview release in April 2025.[5][7]
It is helpful to distinguish three related artifacts that all carry the "Arena-Hard" name:
In casual usage "Arena-Hard" most often refers to Arena-Hard-Auto and its leaderboard, while the underlying pipeline is published under the BenchBuilder name in the arXiv paper.[3]
The original pipeline drew on 200 000 user queries collected via Chatbot Arena between roughly April 2023 and early 2024.[1] In the arXiv paper, the BenchBuilder pipeline is additionally demonstrated on 150 000 queries sampled from the WildChat-1M dataset of ChatGPT conversations, showing that the same methodology generalizes beyond Arena data.[3]
To ensure topical diversity in the final benchmark, each prompt is embedded with OpenAI's text-embedding-3-small, projected to a lower-dimensional space using UMAP, and clustered via HDBSCAN through the BERTopic library; cluster summaries are then generated by GPT-4-Turbo.[1][4] This step yields roughly 4 000 topical clusters covering a wide spectrum of real-world tasks.[1][4]
A calibrated LLM judge (using GPT-3.5-Turbo for cheap pre-scoring and GPT-4-Turbo for final scoring) rates every prompt on a 0–7 scale that counts how many of seven challenge-oriented criteria are satisfied:[1][3]
Each cluster is then scored by the mean score of its member prompts. Only clusters whose mean score is at least 6 out of 7 are kept for sampling; lower-scoring clusters are discarded.[4]
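The scoring and filtering step can be sketched in a few lines of Python. This is a minimal illustration rather than the BenchBuilder code: the data layout (a mapping from cluster name to per-prompt criterion flags), the function names and the toy clusters are hypothetical, and only the 0–7 counting rule and the mean-score cutoff of 6 come from the description above.

```python
from statistics import mean

NUM_CRITERIA = 7      # each prompt is checked against seven criteria
MIN_CLUSTER_MEAN = 6  # clusters averaging below this are dropped


def prompt_score(criteria_met):
    """Score a prompt as the number of criteria it satisfies (0-7)."""
    if len(criteria_met) != NUM_CRITERIA:
        raise ValueError("expected one flag per criterion")
    return sum(bool(flag) for flag in criteria_met)


def high_quality_clusters(clusters, threshold=MIN_CLUSTER_MEAN):
    """Return {cluster: mean score} for clusters whose mean meets the threshold."""
    means = {
        name: mean(prompt_score(flags) for flags in prompts)
        for name, prompts in clusters.items()
    }
    return {name: m for name, m in means.items() if m >= threshold}


# Hypothetical toy data: two prompts per cluster, seven flags per prompt.
toy_clusters = {
    "Advanced Math": [[1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 0]],
    "Small Talk": [[1, 0, 0, 0, 0, 0, 0], [0, 0, 1, 0, 0, 0, 0]],
}
kept = high_quality_clusters(toy_clusters)
print(kept)  # {'Advanced Math': 6.5}
```

In the real pipeline the flags would come from the LLM judge's per-criterion annotations rather than hand-written lists.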
From the high-scoring clusters, the team retained 250 clusters and randomly sampled two prompts per cluster to yield the 500-prompt Arena-Hard-Auto-v0.1 benchmark.[1][3] The published dataset on Hugging Face (lmarena-ai/arena-hard-auto-v0.1) contains exactly 500 rows, each with a question_id, category, cluster label and a turn list containing the prompt text, released under the Apache 2.0 license.[6][12] Sample clusters include "Advanced Math", "Advanced TypeScript", "AI & Alignment", "AWS Cloud Services" and "ABC Sequence Puzzles", illustrating the technical bent of the curated prompts.[12]
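The arithmetic behind the 500-prompt figure (250 clusters × 2 prompts) can be made concrete with a short sampling sketch. The function name, the fixed seed, the record layout and the choice to pick the 250 clusters uniformly at random are assumptions made for illustration; the source states only that 250 clusters were retained and two prompts were drawn at random from each.

```python
import random


def sample_benchmark(clusters, n_clusters=250, per_cluster=2, seed=0):
    """Pick n_clusters clusters, then per_cluster prompts from each."""
    rng = random.Random(seed)  # fixed seed so the draw is reproducible
    eligible = [name for name, prompts in clusters.items()
                if len(prompts) >= per_cluster]
    if len(eligible) < n_clusters:
        raise ValueError("not enough clusters with enough prompts")
    chosen = rng.sample(sorted(eligible), n_clusters)
    benchmark = []
    for name in chosen:
        for prompt in rng.sample(clusters[name], per_cluster):
            benchmark.append({"cluster": name, "prompt": prompt})
    return benchmark


# Hypothetical input: 300 high-scoring clusters with 5 prompts each.
toy = {f"cluster-{i}": [f"prompt {i}-{j}" for j in range(5)] for i in range(300)}
bench = sample_benchmark(toy)
print(len(bench))  # 250 clusters x 2 prompts = 500
```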
Rather than score model answers independently (which can introduce drift and absolute-scale bias), Arena-Hard-Auto evaluates each candidate by asking a strong LLM judge to compare its answer against a baseline model's answer, prompt by prompt.[2][4] The default baseline is GPT-4-0314 (the March 2023 snapshot of GPT-4), chosen because it is a strong reference point that is frozen and no longer updated.[2][6]
In the v0.1 release the default judge is GPT-4-Turbo (gpt-4-1106-preview), with Claude-3-Opus used as a secondary judge to probe robustness.[1][2] Chain-of-thought prompting is used: the judge first generates its own answer to the prompt, then writes a comparison between A and B before giving a verdict.[2] The judge classifies the comparison into one of five labels:[2]
To control for position bias in the LLM-as-a-judge (the well-documented tendency of GPT-4 to prefer whichever response appears first), every prompt is judged twice, once with the candidate as response A and once as response B, yielding 1 000 judgments per evaluated model on the 500-prompt benchmark.[2]
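The two-game protocol can be illustrated with a small sketch. Everything here is hypothetical scaffolding: the label strings (`A>>B` through `B>>A`), the `[[...]]` verdict wrapper, the `judge` callable and the stand-in judge are assumptions chosen for the example, not a description of the repository's actual prompt or parser.

```python
import re

# Assumed label vocabulary; the judge is taken to end its answer with e.g. [[A>B]].
VERDICT_RE = re.compile(r"\[\[(A>>B|A>B|A=B|B>A|B>>A)\]\]")
# Mirror a verdict when the candidate was shown as response B.
MIRROR = {"A>>B": "B>>A", "A>B": "B>A", "A=B": "A=B", "B>A": "A>B", "B>>A": "A>>B"}


def parse_verdict(judge_text):
    """Return the last verdict label in the judge output, or None."""
    matches = VERDICT_RE.findall(judge_text)
    return matches[-1] if matches else None


def judge_both_orders(prompt, candidate, baseline, judge):
    """Judge one prompt twice, swapping positions; verdicts are returned
    from the candidate's point of view (A = candidate, B = baseline)."""
    first = parse_verdict(judge(prompt, candidate, baseline))
    second = parse_verdict(judge(prompt, baseline, candidate))
    return [first, MIRROR.get(second)]


def first_position_judge(prompt, answer_a, answer_b):
    """Stand-in judge with pure position bias: it always prefers response A."""
    return "My own answer ... comparison ... final verdict: [[A>B]]"


games = judge_both_orders("q", "candidate answer", "baseline answer",
                          first_position_judge)
print(games)  # ['A>B', 'B>A']
```

With a judge that is purely position-biased, the two verdicts cancel once both are expressed from the candidate's side, which is the point of judging each prompt in both orders.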
The set of pairwise comparisons across all 500 prompts (each weighted by the strength of preference encoded in the five-label scale) is fit with a Bradley–Terry logistic-regression model, the same statistical model that underpins the Chatbot Arena Elo leaderboard, to produce a latent strength coefficient for each model.[2][4] The coefficient is then converted to an interpretable win-rate against the baseline (GPT-4-0314), with 95% confidence intervals derived from 100 rounds of bootstrap resampling.[2]
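A compact, standard-library sketch of the aggregation step follows. It makes several assumptions that should be read as illustrative: the verdict labels, the weight of 3 for strong preferences, and the treatment of ties as half a win are choices made for this example, and the fit uses the classic MM (Zermelo) iteration in place of the logistic-regression solver mentioned above. Both approaches target the Bradley–Terry likelihood, but the code is not the project's implementation.

```python
import random

# Assumed convention: a strong preference (">>") counts as several ordinary
# votes. The factor 3 is an illustrative choice, not taken from the article.
STRONG_WEIGHT = 3
OUTCOMES = {"A>>B": (1.0, STRONG_WEIGHT), "A>B": (1.0, 1), "A=B": (0.5, 1),
            "B>A": (0.0, 1), "B>>A": (0.0, STRONG_WEIGHT)}


def to_battles(model, baseline, verdicts):
    """Expand verdicts (A = model, B = baseline) into (a, b, score_a) battles."""
    battles = []
    for verdict in verdicts:
        score, weight = OUTCOMES[verdict]
        battles.extend([(model, baseline, score)] * weight)
    return battles


def fit_bradley_terry(battles, iters=50):
    """Fit Bradley-Terry strengths with the MM (Zermelo) iteration."""
    models = sorted({m for a, b, _ in battles for m in (a, b)})
    wins = {m: 0.0 for m in models}
    for a, b, score in battles:
        wins[a] += score          # a tie contributes half a win to each side
        wins[b] += 1.0 - score
    strength = {m: 1.0 for m in models}
    for _ in range(iters):
        denom = {m: 0.0 for m in models}
        for a, b, _ in battles:
            d = 1.0 / (strength[a] + strength[b])
            denom[a] += d
            denom[b] += d
        strength = {m: wins[m] / denom[m] for m in models}
        total = sum(strength.values())
        strength = {m: s / total for m, s in strength.items()}
    return strength


def win_rate(strength, model, baseline):
    """Bradley-Terry probability that model beats baseline."""
    return strength[model] / (strength[model] + strength[baseline])


def bootstrap_ci(battles, model, baseline, rounds=100, seed=0):
    """Percentile 95% interval from resampling battles with replacement."""
    rng = random.Random(seed)
    rates = sorted(
        win_rate(fit_bradley_terry(rng.choices(battles, k=len(battles))),
                 model, baseline)
        for _ in range(rounds))
    return rates[int(0.025 * rounds)], rates[int(0.975 * rounds) - 1]


verdicts = ["A>>B", "A>B", "A=B", "B>A"]  # hypothetical judge output
battles = to_battles("candidate", "gpt-4-0314", verdicts)
point = win_rate(fit_bradley_terry(battles), "candidate", "gpt-4-0314")
low, high = bootstrap_ci(battles, "candidate", "gpt-4-0314")
print(f"win rate {point:.1%} (95% CI {low:.1%} to {high:.1%})")
```

In the toy data the candidate collects 4.5 of 6 weighted votes, so the fitted win rate against the baseline is 75%. With only one candidate and one baseline the Bradley–Terry fit reduces to this weighted vote share; the model matters more when many candidates are fitted jointly.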
The team reported a total cost of approximately US$25 per model for v0.1 using GPT-4-Turbo as judge,[1][4] which the arXiv version of the paper refines to roughly US$20 per model.[3] Either figure is far cheaper than collecting the thousands of human votes needed to place a model on the full Chatbot Arena leaderboard.[4]
LMSYS published three custom benchmark-quality metrics alongside Arena-Hard, intended to capture aspects ignored by raw Spearman correlation:[1][4]
Reported results for Arena-Hard-Auto-v0.1 on the top-20 Chatbot Arena models as of April 2024:[1][4]
| Metric | Arena-Hard-Auto v0.1 | MT-Bench |
|---|---|---|
| Separability with confidence | 87.4% | 22.6% |
| Agreement with Chatbot Arena (95% CI) | 89.1% | 26.6% |
| Spearman correlation | 94.1% | 91.3% |
| Cost per model | ~US$25 | ~US$10 |
The arXiv paper later reports refined numbers of 87.4% separability, 90.9% confidence-aware agreement, 93.2% Spearman correlation and 98.6% correlation with full Chatbot Arena rankings, all at ~US$20 per model; the team emphasized these as providing roughly 3× higher separation than MT-Bench at a similar cost.[3]
A persistent finding is that simple Spearman correlation tends to overstate alignment: in the LMSYS analysis, MT-Bench's nominal 91.3% Spearman with Chatbot Arena collapses to 22.6% once 95% CIs are required, while Arena-Hard maintains its agreement under the stricter metric.[1][4] This motivated the team's repeated argument that confidence-aware benchmark evaluation is the appropriate standard.[1][3]
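The difference between rank correlation and confidence-aware metrics is easiest to see in code. The sketch below treats separability as the fraction of model pairs whose 95% confidence intervals do not overlap. The agreement function is one possible reading of "agreement with confidence intervals" (both benchmarks separate the pair and order it the same way) and may differ from the authors' exact definition. All names and numbers are hypothetical.

```python
from itertools import combinations


def separable(ci_a, ci_b):
    """True when two (low, high) confidence intervals do not overlap."""
    return ci_a[1] < ci_b[0] or ci_b[1] < ci_a[0]


def separability(intervals):
    """Fraction of model pairs whose intervals do not overlap."""
    pairs = list(combinations(sorted(intervals), 2))
    return sum(separable(intervals[a], intervals[b]) for a, b in pairs) / len(pairs)


def agreement_with_ci(bench, reference):
    """Fraction of model pairs that both benchmarks separate confidently
    and rank in the same order (an assumed reading of the metric)."""
    pairs = list(combinations(sorted(bench), 2))
    agree = 0
    for a, b in pairs:
        if separable(bench[a], bench[b]) and separable(reference[a], reference[b]):
            same_order = (bench[a][0] > bench[b][1]) == (reference[a][0] > reference[b][1])
            agree += same_order
    return agree / len(pairs)


# Hypothetical 95% intervals: benchmark win rates and reference ratings.
bench_ci = {"m1": (70, 76), "m2": (60, 66), "m3": (63, 69)}
arena_ci = {"m1": (1250, 1270), "m2": (1200, 1220), "m3": (1225, 1245)}

sep = separability(bench_ci)
agr = agreement_with_ci(bench_ci, arena_ci)
print(f"separability {sep:.1%}, agreement {agr:.1%}")
```

In this toy data the benchmark orders all three models the same way as the reference, so a rank correlation would be perfect, yet only two of the three pairs are separated with confidence. That gap is the one a confidence-aware metric is designed to expose.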
At launch the top results on Arena-Hard-Auto with the default GPT-4-Turbo judge were:[1]
| Model | Win-rate vs GPT-4-0314 |
|---|---|
| gpt-4-turbo-2024-04-09 | 82.6% |
| gpt-4-0125-preview | 78.0% |
| claude-3-opus-20240229 | 60.4% |
| claude-3-sonnet-20240229 | 46.8% |
| llama-3-70b-instruct | 41.1% |
The strong showing of Llama 3 70B Instruct, released on 18 April 2024, one day before the Arena-Hard announcement, was particularly noted because it placed the first widely-available open-weight model above several proprietary mid-tier models.[1]
By mid-2024 the leaderboard had been extended with newer models. Per the arXiv paper, top performers on the same v0.1 benchmark included Claude 3.5 Sonnet at 79.3% win rate and GPT-4o at 79.2%, edging out earlier GPT-4 Turbo snapshots.[3]
On 14 October 2024, the arena-hard-auto repository added support for Style Control, porting the methodology introduced in the August 2024 Chatbot Arena paper "Does style matter? Disentangling style and substance in Chatbot Arena."[13][5] Style Control augments the Bradley–Terry regression with additional independent variables that capture the normalized difference in response length (tokens), number of markdown headers, number of bold elements, and number of markdown lists between the two answers, allowing the rating to attribute portions of preference to style rather than substance.[13] In the Chatbot Arena data the length coefficient (0.249) dominated all markdown effects (lists 0.031, headers 0.024, bold 0.019), confirming length as the largest stylistic confounder.[13]
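The style covariates can be sketched as follows. This is an illustration under stated assumptions: the regular expressions are rough heuristics, whitespace-separated words stand in for tokens, and the normalization (a − b) / (a + b) is one common way to express a "normalized difference". None of these details is confirmed by the text above as the exact implementation.

```python
import re


def style_counts(answer):
    """Rough style statistics for one markdown answer (illustrative heuristics)."""
    return {
        "length": len(answer.split()),  # whitespace words as a stand-in for tokens
        "headers": len(re.findall(r"^#{1,6}[ \t]", answer, flags=re.MULTILINE)),
        "bold": len(re.findall(r"\*\*[^*\n]+\*\*", answer)),
        "lists": len(re.findall(r"^[ \t]*(?:[-*+]|\d+\.)[ \t]", answer,
                                flags=re.MULTILINE)),
    }


def style_features(answer_a, answer_b):
    """Normalized difference (a - b) / (a + b) per style element, in [-1, 1]."""
    a, b = style_counts(answer_a), style_counts(answer_b)
    return {
        key: (a[key] - b[key]) / (a[key] + b[key]) if a[key] + b[key] else 0.0
        for key in a
    }


verbose = "# Answer\n\n**Short version:** yes.\n\n- point one\n- point two\n"
terse = "Yes."
features = style_features(verbose, terse)
print(features)
```

Each feature would enter the regression as an extra column alongside the model indicators, so that the fitted model coefficients reflect preference after accounting for these style differences.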
When style control was applied to Arena-Hard-Auto, GPT-4o-mini fell substantially (rank 6 → 11 overall, with Grok-2-mini dropping 6 → 18), while Claude 3 Opus (rank 16 → 10) and Llama 3.1 405B rose notably.[13] The LMSYS team reported that under style control, Claude 3.5 Sonnet tied for first place with chatgpt-4o-latest on hard prompts.[13] Importantly, the team showed that adding verbose or markdown-heavy system prompts could inflate raw Arena-Hard scores but could not inflate the style-controlled scores, providing a defense against simple style gaming.[13]
On 23 April 2025 the team released Arena-Hard-v2.0-Preview, a refresh designed to address test-set staleness and broaden coverage.[5][7] V2.0 contains:[5][7]
--control-features flag at result time;
Top results on Arena-Hard-v2.0-Preview using the official configuration (Hard-Prompt split, Style Control on, Gemini 2.5 judge) included OpenAI o3 (o3-2025-04-16) at 85.9%, o4-mini (high-effort variant) at 79.1%, and Gemini 2.5 at 79.0%, ahead of o4-mini at 74.6% and Gemini 2.5 Flash at 68.6%.[5] In later updates of the leaderboard reported by third-party trackers, top scores rose further as new reasoning models such as DeepSeek-R1 entered the table.[14]
MT-Bench, the predecessor automatic benchmark from the same LMSYS group, asks GPT-4 to grade individual answers across 80 multi-turn prompts on a 10-point scale and reports the mean score.[9] Arena-Hard improves on MT-Bench along three axes simultaneously: prompt difficulty (curated hard prompts vs hand-written generic prompts), comparison style (pairwise vs absolute scoring), and statistical methodology (Bradley–Terry with confidence intervals vs raw mean scores).[1][3] The headline numbers (87.4% vs 22.6% separability and 89.1% vs 26.6% confidence-aware agreement with Chatbot Arena) quantify the resulting improvement.[1]
AlpacaEval and its length-controlled successor AlpacaEval 2.0 LC are alternative LLM-as-judge benchmarks scored by win-rate against a strong baseline. AlpacaEval 2.0 LC achieved 83.2% separability and roughly 82.5% agreement with Chatbot Arena in the LMSYS analysis, behind Arena-Hard on both dimensions while serving a similar role of cheap, fast proxy for human-preference rankings.[1][4]
MMLU (Massive Multitask Language Understanding) and its harder variant MMLU-Pro are knowledge-and-reasoning benchmarks built from multiple-choice questions, which measure factual recall and constrained reasoning rather than the open-ended generation quality that Arena-Hard targets.[15] The two benchmark families are typically reported side by side in model evaluations because they measure complementary capabilities.
Because the default v0.1 judge is GPT-4-Turbo (and v2.0 defaults to GPT-4.1 or Gemini 2.5), the benchmark inherits any preferences these judges hold. The LMSYS team reported that both GPT-4-Turbo and Claude-3-Opus are sensitive to response length, with Claude exhibiting larger length bias under "more detailed" system prompts.[4] Single-judge self-preference is documented in the literature on LLM-as-a-judge, where a model tends to slightly favor its own family; for example, Claude judges favor Claude answers.[3] The arXiv paper proposes Ensemble-as-Judges, combining GPT-4-Turbo and Gemini 1.5 Pro to reduce self-bias and improve agreement with humans, an approach later operationalized in the v2.0 ensemble option.[3][5]
Before the addition of Style Control, prompting candidate models to produce verbose, markdown-heavy answers reliably inflated their Arena-Hard win rates.[13] Style Control, introduced in the live Chatbot Arena in August 2024 and ported to Arena-Hard-Auto in October 2024, mitigates this gameability by regressing out length and markdown coefficients before reporting model strength; the LMSYS team demonstrated that style-controlled scores could not be inflated by verbose system prompts.[13]
The authors explicitly acknowledge in the arXiv paper that the seven defined quality criteria may not fully capture all possible attributes of a high-quality prompt and that the resulting benchmark skews toward technical and STEM-heavy domains, a bias visible in the v0.1 sample clusters such as "Advanced TypeScript", "Advanced Math" and "AWS Cloud Services".[3][12] V0.1 also lacks multi-turn evaluation and is essentially English-only; the v2.0 refresh adds creative writing but does not directly address multilinguality.[3][5]
A static benchmark with a fixed 500-prompt set is, in principle, vulnerable to training-data contamination once the questions are public. The team's stated strategy is periodic refresh, first realized with the April 2025 v2.0 release, to keep the benchmark out of training data and force models to be re-evaluated on truly novel prompts.[4][5][7]
In April 2025, the paper "The Leaderboard Illusion" by researchers from Cohere, AI2 and academic collaborators argued that Chatbot Arena's policy of allowing select proprietary providers to test multiple private model variants in parallel and selectively reveal only the best distorts the public leaderboard and, indirectly, the Arena-Hard pipeline that draws prompts from Arena traffic.[16] The paper specifically claimed that having access to additional Arena data could improve a model's Arena-Hard score by up to 112%, framing the benchmark as susceptible to indirect data-access advantages even before explicit contamination is considered.[16] Simon Willison's commentary on the criticism noted that 27 private LLM variants tested by Meta in the lead-up to the Llama 4 release illustrate the asymmetric-access concern, and called for transparency about all variants submitted to the platform.[17] LMArena has since publicly responded with policy updates and additional disclosures about preview submissions.[17][18]
Arena-Hard was created by the LMSYS Org, an academic research collective housed at the UC Berkeley Sky Computing Lab and led by Wei-Lin Chiang, Anastasios N. Angelopoulos and Ion Stoica (with broader involvement from Joseph E. Gonzalez and other Sky Lab faculty).[8][18][19] Through 2024 the project gradually separated from the academic LMSYS umbrella: Chatbot Arena moved to a dedicated lmarena.ai domain in September 2024,[18][19] and the Chatbot Arena and Arena-Hard work was incorporated as an independent company, LMArena, in early 2025 with seed funding announced in May 2025.[18][19]
LMArena raised approximately US$100 million in seed funding in May 2025 at a roughly US$600 million valuation, with participation from Andreessen Horowitz, Lightspeed Venture Partners and Kleiner Perkins.[18][19] In January 2026 the company closed a US$150 million Series A at a post-money valuation of approximately US$1.7 billion, and around the same time rebranded from "LMArena" to simply "Arena", operating at arena.ai.[18][19] The CEO is Anastasios N. Angelopoulos, with Wei-Lin Chiang as CTO and Ion Stoica as co-founder and advisor.[19]
The arena-hard-auto GitHub repository was correspondingly moved from the lm-sys/ organization to the new lmarena/ organization, where it continues to be the canonical source for the benchmark code, prompts, and pre-generated answer/judgment datasets, with the dataset mirrored on Hugging Face under lmarena-ai/arena-hard-auto.[5][6]
Headline updates since launch:
arena-hard-auto.[5]
KfTf9vFvSn).[11]
Within months of launch, Arena-Hard-Auto win-rates became a standard reporting figure in model release notes from major labs, alongside MMLU and HumanEval. Cloud providers such as Amazon Web Services explicitly benchmarked their LLM offerings (e.g. the Nova family) on both MT-Bench and Arena-Hard-Auto.[20] Third-party leaderboard aggregators including llm-stats.com and the Vellum LLM Leaderboard maintain dedicated Arena-Hard and Arena-Hard v2 pages tracking submitted models from DeepSeek, Qwen, NVIDIA's Nemotron family and others.[14] On the GitHub side, community forks have appeared, for example a Russian-language port (deepvk/arena-hard-auto-ru) and a Hugging Face mirror, illustrating uptake outside the original Berkeley group.[21]