WildBench


Updated Jul 16, 2026


WildBench is an automated evaluation framework for large language models (LLMs) introduced by the Allen Institute for AI (AI2) in 2024. It evaluates models on 1,024 challenging tasks curated from real-world user-chatbot conversations sourced from the WildChat corpus, scoring responses with two complementary metrics, WB-Score and WB-Reward, that use GPT-4-Turbo as an LLM-as-judge with task-specific checklists.^[1] The benchmark was designed to address a gap that emerged as static academic benchmarks saturated: it captures the diversity and difficulty of prompts that real users actually submit to chatbots, while preserving cost-efficiency, interpretability, and high correlation with human preference data from Chatbot Arena.^[1]

The official leaderboard is hosted on Hugging Face Spaces at huggingface.co/spaces/allenai/WildBench and is maintained by AI2.^[2] WildBench has been released under permissive licenses (Apache 2.0 for the evaluation framework code; CC BY 4.0 for the dataset)^[3]^[4] and has been adopted by AI2 itself as one of the evaluation signals for the Tülu 3 family of post-trained open-weight models.^[7]

Introduction

When WildBench was published in June 2024, the LLM evaluation landscape was dominated by two paradigms. The first was static, capability-oriented test sets such as MMLU, GSM8K, and HumanEval, which had been useful for tracking pretraining progress but had begun to saturate and were increasingly suspected of contamination through training-data overlap. The second was open-ended, preference-based evaluation. The leading example was the LMSYS Chatbot Arena, which collects pairwise human votes between anonymized model responses and converts them into Elo-style ratings.^[1] Arena scores are widely accepted as reflective of human preferences, but they require a constant flow of fresh human votes, are expensive and slow, and provide little interpretable information about why one model is preferred over another.

Several efforts attempted to approximate Arena rankings using LLM-as-judge methods, most notably MT-Bench, AlpacaEval (and its length-controlled variant AlpacaEval 2 LC), and Arena-Hard. These benchmarks rely on a small, curated set of prompts and use a strong model such as GPT-4 to judge responses. They are cheap, fast, and reproducible, but they have well-documented weaknesses: small prompt sets are easy to overfit, judge models exhibit systematic biases, and the prompts often differ in distribution from the queries that real users issue.^[1]

WildBench's central proposal is to address these problems by sampling the evaluation prompts from a corpus of real user conversations, then designing the judge protocol so that it is both cheaper than Arena and more reliable than prior LLM-judge benchmarks. The corpus is WildChat, a million-conversation dataset of timestamped ChatGPT API interactions released under an opt-in consent regime by Zhao et al.^[5] By filtering, deduplicating, and difficulty-rating tasks from WildChat, the WildBench team produced a fixed evaluation suite that preserves the natural distribution of user intents while concentrating on prompts that are hard enough to differentiate frontier models.^[1]

Background: limitations of static benchmarks

Over 2023 and 2024, the field came to recognize a number of structural problems with the dominant benchmarks used to compare instruction-tuned LLMs.^[1]

Saturation. Models had begun to score near ceiling on traditional capability tests. When a benchmark cannot distinguish among the strongest systems, additional model improvements either go unmeasured or are reflected only in noise. This is the saturation phenomenon that motivated LiveBench and Arena-Hard as alternatives, and it also motivated WildBench.^[1]

Contamination and overfitting. Because most popular benchmarks have been public for years, their prompts and answers may have leaked into training corpora. Even where direct leakage is absent, model developers may indirectly tune for specific tests through hyperparameter search, prompting strategies, or dataset curation. The result is an inflation of headline numbers that does not transfer to deployment.^[1]

Distribution mismatch. Benchmark prompts often diverge from what users actually ask. Academic prompts are short, well-formed, and unambiguous; real user prompts are long, messy, multi-turn, and contain typos, code fragments, and domain-specific jargon. The WildBench authors note that prior LLM-judge benchmarks were "either heavily weighted toward coding (Arena-Hard) or information-seeking tasks (AlpacaEval)," with task category proportions skewed relative to the WildChat distribution.^[1]

Judge biases. A growing literature has documented that LLM judges, particularly GPT-4, exhibit a length bias (preferring longer outputs), a position bias (preferring the response presented first), and a self-enhancement bias (preferring outputs that share their own model's style).^[1] These biases can compound one another when benchmark designers do not control for them.

Cost and slowness of human preference systems. Chatbot Arena addresses many of these problems by going directly to human votes, but at substantial cost. New models must wait days or weeks to accumulate enough votes to receive a stable rating, and the prompts contributed by Arena visitors are themselves a non-random sample of the user population.^[1]

WildBench was designed to inherit the realism of Arena prompts (by sampling from WildChat, which records the same kind of free-form chatbot use) while restoring the speed, low cost, and interpretability of an automated, fixed benchmark.

Methodology: from WildChat to a fixed benchmark

Sourcing prompts from WildChat

The WildChat dataset, introduced by Wenting Zhao, Xiang Ren, Jack Hessel, Claire Cardie, Yejin Choi, and Yuntian Deng at ICLR 2024, contains roughly one million timestamped conversations and over 2.5 million interaction turns, collected via a free chatbot service powered by the ChatGPT and GPT-4 APIs with affirmative user opt-in.^[5] Because it is multi-turn, multilingual, and demographically diverse, WildChat captures the actual distribution of user requests: brainstorming, role play, code debugging, language learning, creative writing, and more.^[5]

WildBench inherits this realism while distilling the raw corpus to a fixed evaluation set. The curation pipeline reported in the paper proceeds in several stages.^[1]

Filtering. Conversations are filtered to remove very short or very long turns. Specifically, the authors discard tasks shorter than 10 tokens or longer than 3,000 tokens, and they cap conversations at five turns to control evaluation cost.^[1] English-only data is retained, a deliberate scoping decision that the authors flag as a limitation.
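The length and turn limits described above can be sketched as a simple predicate. This is a minimal illustration under stated assumptions, not the authors' code: the `Conversation` container, the whitespace tokenizer, and the function names are hypothetical, and the sketch treats the five-turn cap as a filter (the source does not say whether longer conversations are dropped or truncated).

```python
from dataclasses import dataclass

MIN_TOKENS = 10     # tasks shorter than this are discarded
MAX_TOKENS = 3000   # tasks longer than this are discarded
MAX_TURNS = 5       # conversations are capped at five turns


@dataclass
class Conversation:
    # Hypothetical container: user turns in order; the last one is the task.
    user_turns: list[str]


def count_tokens(text: str) -> int:
    # Assumption: whitespace splitting stands in for a real tokenizer.
    return len(text.split())


def keep_conversation(conv: Conversation) -> bool:
    """Return True if the conversation passes the length and turn filters."""
    if not conv.user_turns or len(conv.user_turns) > MAX_TURNS:
        return False
    n_tokens = count_tokens(conv.user_turns[-1])
    return MIN_TOKENS <= n_tokens <= MAX_TOKENS
```

A production pipeline would presumably count tokens with the tokenizer of a specific model, so the exact boundary cases would differ from this sketch.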

Deduplication. Near-duplicate prompts are removed using a SentenceBERT embedding cosine-similarity threshold of 0.9, so that thematically similar tasks do not dominate the benchmark.^[1]
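One common way to apply such a threshold is a greedy pass that keeps a prompt only if it is not too similar to any prompt already kept. The sketch below assumes the embeddings have already been computed (for example by a SentenceBERT model) and are passed in as plain lists of floats; the greedy strategy and the choice to treat a similarity of exactly 0.9 as a duplicate are assumptions, since the source gives only the threshold.

```python
import math

SIM_THRESHOLD = 0.9  # cosine-similarity cutoff reported in the paper


def cosine(u: list[float], v: list[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def deduplicate(embeddings: list[list[float]],
                threshold: float = SIM_THRESHOLD) -> list[int]:
    """Greedy pass: return indices of items that are not near-duplicates
    of any previously kept item."""
    kept: list[int] = []
    for i, emb in enumerate(embeddings):
        if all(cosine(emb, embeddings[j]) < threshold for j in kept):
            kept.append(i)
    return kept
```

The greedy pass is quadratic in the number of kept items; at the scale of a million conversations an approximate nearest-neighbour index would likely be needed, which this sketch does not attempt.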

Difficulty filtering. Multiple strong LLMs are asked to estimate the difficulty of each candidate task and to identify the skills it requires. Tasks rated "very easy" or "easy" by all judging models are excluded, leaving a corpus enriched in genuinely challenging items that can differentiate frontier systems.^[1]
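The exclusion rule is unanimous: a task is dropped only when every judging model calls it easy. A minimal sketch of that rule follows; the label strings other than "very easy" and "easy", the input layout, and the function names are hypothetical.

```python
EASY_LABELS = {"very easy", "easy"}


def is_too_easy(ratings: list[str]) -> bool:
    """True only when every judging model rated the task 'very easy' or 'easy'."""
    return bool(ratings) and all(r.lower() in EASY_LABELS for r in ratings)


def filter_by_difficulty(tasks: dict[str, list[str]]) -> list[str]:
    """Keep task ids unless all judges agree that the task is easy.

    `tasks` maps a task id to the difficulty labels assigned by each judge.
    """
    return [tid for tid, ratings in tasks.items() if not is_too_easy(ratings)]
```

Under this rule a single dissenting judge is enough to retain a task, which biases the retained set toward keeping borderline items rather than discarding them.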

Leakage control. To prevent contamination of future model training data, the team coordinates with the WildChat maintainers so that prompts included in WildBench are not released as part of the public WildChat dataset.^[1]

Manual review. All finalized tasks undergo manual review, including a final pass to fix typos, redact personally identifiable information, and ensure that the prompt is self-contained and answerable.^[1]

The end product is a fixed set of 1,024 tasks. A separate "hard" split of 256 items is also released for ablation studies and harder-tier evaluation; the Hugging Face dataset card lists v1-legacy (1,024 rows), v2 (1,024 rows), and v2-hard (256 rows) as the canonical splits, totalling 2,304 rows in the published dataset.^[4]

Task categories

WildBench tasks are annotated with primary and secondary category tags, and the paper organizes them into 12 fine-grained classes: information seeking, reasoning, planning, editing, coding and debugging, math, role playing, data analysis, creative writing, advice seeking, brainstorming, and miscellaneous tasks.^[1] For aggregate analysis, these are consolidated into five major groups: Info Seeking, Reasoning and Planning, Coding and Debugging, Math and Data, and Creative Tasks. A central design claim of WildBench is that, because the distribution of these categories is inherited from WildChat, it is more balanced than competing benchmarks; for example, the paper observes that Arena-Hard prompts skew heavily toward coding (around 57 percent of the set) and AlpacaEval skews toward information-seeking tasks (above 50 percent), whereas WildBench retains a more uniform mix.^[1]

The judge protocol

WildBench uses an LLM-as-judge methodology with GPT-4-Turbo (the gpt-4-turbo-2024-04-09 family is named in the public framework code) as the principal evaluation model.^[3] The judge protocol has three notable design choices.

Task-specific checklists. Rather than asking the judge a single open-ended question (such as "which response is better?"), WildBench generates a checklist of five to ten yes/no questions tailored to each task. The checklists are generated by two strong models (GPT-4-Turbo and Claude 3 Opus) and combined; the judge model evaluates each candidate response against the checklist before producing an overall verdict.^[3] This is intended to make judgments more reliable and more interpretable, because the structured explanation can be inspected after the fact.

Multiple baselines for pairwise comparison. Rather than scoring against a single reference model, WildBench compares each model to three baselines spanning different performance tiers: GPT-4-Turbo-0409 (a strong proprietary system), Claude 3 Haiku (a fast mid-tier proprietary system), and Llama-2-70B-chat (a representative open-weight model).^[1]^[3] The reported WB-Reward (Mix) score is the average across these three comparisons. This reduces sensitivity to the choice of any single reference.

Length-controlled verdicts. Because LLM judges have a documented bias toward longer responses, WildBench introduces a length penalty: when the winning response is longer than the losing one by more than K characters, "slightly better" and "slightly worse" verdicts are converted to ties. The authors report that K = 500 characters maximizes the correlation with human Chatbot Arena rankings; users can also adjust K on the leaderboard webpage to inspect length-sensitivity, with K = infinity disabling the penalty.^[1]
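The tie-conversion rule can be written as a small function. This is a sketch of the rule as described in the text, not the framework's own code; the function name, the verdict strings as arguments, and the convention that verdicts are stated from the evaluated model's point of view are assumptions.

```python
def apply_length_penalty(verdict: str, len_model: int, len_baseline: int,
                         k: float = 500.0) -> str:
    """Convert a 'slightly' verdict to a tie when the winner is more than
    k characters longer than the loser.

    `verdict` is from the evaluated model's point of view; lengths are in
    characters. Passing k = float("inf") disables the penalty.
    """
    if verdict == "slightly better":
        winner_len, loser_len = len_model, len_baseline
    elif verdict == "slightly worse":
        winner_len, loser_len = len_baseline, len_model
    else:
        # "much better", "much worse", and "tie" are left unchanged.
        return verdict
    return "tie" if winner_len - loser_len > k else verdict
```

Only marginal wins are affected: a response judged "much better" keeps its verdict regardless of length, so the penalty targets the cases where extra length is most likely to have tipped the judge.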

Metrics: WB-Reward and WB-Score

WildBench defines two distinct metrics, intended for different purposes.^[1]

WB-Reward is a pairwise comparison metric. For each task, the judge compares the evaluated model's response to a baseline model's response and returns one of five outcomes: "much better," "slightly better," "tie," "slightly worse," or "much worse." These map to numerical rewards of +1, +0.5, 0, -0.5, and -1 respectively. The score is averaged over the 1,024 tasks, then averaged across the three baseline models (GPT-4-Turbo-0409, Claude 3 Haiku, and Llama-2-70B-chat) to produce WB-Reward (Mix).^[1] The length penalty (with K = 500 by default) is applied at the per-task level before aggregation. The framework code reports WB-Reward on a scaled range; the GitHub repository describes the comparative score range as -100 to +100.^[3]
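The aggregation can be sketched in a few lines. The verdict-to-reward mapping is taken from the text; the function names, the baseline keys in the usage example, and the multiplication by 100 to reach the -100 to +100 range are assumptions made for illustration.

```python
REWARD = {
    "much better": 1.0,
    "slightly better": 0.5,
    "tie": 0.0,
    "slightly worse": -0.5,
    "much worse": -1.0,
}


def wb_reward(verdicts: list[str]) -> float:
    """Mean reward over all tasks against a single baseline (range -1 to +1)."""
    return sum(REWARD[v] for v in verdicts) / len(verdicts)


def wb_reward_mix(per_baseline: dict[str, list[str]], scale: float = 100.0) -> float:
    """Average the per-baseline rewards, then scale.

    Assumption: the -100 to +100 range is obtained by multiplying by 100.
    """
    rewards = [wb_reward(v) for v in per_baseline.values()]
    return scale * sum(rewards) / len(rewards)


# Usage with made-up verdicts for two tasks per baseline (keys are illustrative).
example = {
    "gpt-4-turbo": ["slightly worse", "tie"],
    "claude-3-haiku": ["much better", "tie"],
    "llama-2-70b-chat": ["much better", "much better"],
}
mix = wb_reward_mix(example)
```

In a full run the verdicts would already have had the length penalty applied per task before being passed to these functions.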

WB-Score is a per-response quality score on a 1-10 scale assigned by the judge to each model output independently, without a paired reference. The paper anchors the rubric so that scores of 5 to 6 correspond to "fair but flawed" responses and scores of 9 to 10 correspond to "perfect" responses.^[1] To center the metric around acceptable performance, raw scores are rescaled with the transformation S' = (S - 5) × 2, yielding a metric on a roughly -8 to +10 range. WB-Score is faster and cheaper to compute than WB-Reward because it requires only one judge call per response rather than three pairwise calls, which makes it useful as a low-cost screening metric.^[1]^[3]
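As a worked check of the rescaling, the endpoints of the 1-10 rubric map to -8 and +10, and a raw score of 5 maps to 0. The helper below is a minimal sketch; the function names and the input validation are assumptions, not part of the framework.

```python
def rescale_wb_score(raw: float) -> float:
    """Apply S' = (S - 5) * 2 to a raw judge score on the 1-10 scale."""
    if not 1 <= raw <= 10:
        raise ValueError("raw score must lie between 1 and 10")
    return (raw - 5) * 2


def mean_rescaled_score(raw_scores: list[float]) -> float:
    """Average of rescaled scores over a set of responses."""
    return sum(rescale_wb_score(s) for s in raw_scores) / len(raw_scores)
```

Because the transformation is linear, averaging before or after rescaling gives the same result; the sketch rescales first only to mirror the order in the text.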

In addition to these aggregate metrics, the framework reports per-category WB-Reward and WB-Score values, exposing whether a model is strong at, for example, creative writing but weak at math and data analysis.
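A per-category breakdown amounts to grouping task-level scores by category before averaging. The sketch below assumes each task contributes a single (category, score) pair using its primary tag; how the framework treats secondary tags is not stated in the source, and the function name is hypothetical.

```python
from collections import defaultdict


def per_category_mean(records: list[tuple[str, float]]) -> dict[str, float]:
    """Average task-level scores within each category.

    `records` holds (category, score) pairs, one per evaluated task.
    """
    totals: dict[str, float] = defaultdict(float)
    counts: dict[str, int] = defaultdict(int)
    for category, score in records:
        totals[category] += score
        counts[category] += 1
    return {c: totals[c] / counts[c] for c in totals}
```

The same grouping works for either metric, since both WB-Reward and WB-Score are defined per task before aggregation.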

Initial results and leaderboard

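The paper, which accompanied the V2 leaderboard release in mid-2024, reported a snapshot of model performance that placed several proprietary frontier models at the top.^[1] The figures exceed the -8 to +10 range of the rescaled per-response score, which suggests that they are reported on a scale multiplied by 10. The top tier on WB-Score in that snapshot included: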

GPT-4o (gpt-4o-2024-05-13), with a WB-Score of 65.3
GPT-4-Turbo-0409, with 64.7
GPT-4-Turbo-0125, with 63.3
Claude 3 Opus, with 63.1
Llama 3 70B Instruct, with 60.4
Yi-1.5-34B-Chat, with 57.8
Gemini 1.5 Pro, with 55.7
Claude 3 Sonnet, with 55.5
Llama 3 8B Instruct (SimPO-tuned), with 53.9
Gemini 1.5 Flash, with 53.1

These figures are taken from the V2 results reported in the arXiv v2 manuscript. The live Hugging Face leaderboard is intended to be updated as new models are released; its interface allows users to filter by model family, length-penalty K, category, and baseline. AI2 has described WildBench as "a dynamic benchmark that is updated regularly to reflect new types of user interactions."^[2]^[3]

Because the leaderboard is dynamic, this article does not enumerate live ranks; readers who wish to inspect current standings should consult the official leaderboard.^[2]

Correlations with Chatbot Arena

The paper's central empirical claim is that WildBench tracks human preferences from Chatbot Arena more closely than other LLM-judge benchmarks. Using the Arena Elo from the "hard" Arena category as the human-vote reference, the authors report Pearson correlations of 0.98 for WB-Reward and 0.95 for WB-Score against the top-ranking models.^[1] These compare to 0.91 for Arena-Hard, 0.89 for length-controlled AlpacaEval 2, and 0.87 for the regular AlpacaEval 2 win rate.^[1]

These correlations should be read with care. They are computed over only the top-ranking models in the snapshot, where the rank order is comparatively stable; correlations across the full set of evaluated systems are typically lower for any benchmark. The 0.98 figure is also a Pearson rather than a Spearman correlation, so it measures linear agreement between score values rather than agreement in rank order, and it is sensitive to outliers. Nevertheless, the relative comparison, in which WildBench beats both Arena-Hard and AlpacaEval 2 on the same evaluation set, provides the principal evidence for WildBench's claim to track human preference signals well.^[1]^[6]
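The difference between the two coefficients can be illustrated with a small, entirely synthetic example in which one extreme point pulls the Pearson value close to 1 even though the rank order disagrees in places. The data and function names below are made up for illustration and are not WildBench or Arena results; the rank helper assumes there are no tied values.

```python
import math


def pearson(x: list[float], y: list[float]) -> float:
    """Pearson correlation: linear agreement between two score lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def ranks(values: list[float]) -> list[float]:
    """1-based ranks in ascending order (assumes no tied values)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    for rank, i in enumerate(order, start=1):
        result[i] = float(rank)
    return result


def spearman(x: list[float], y: list[float]) -> float:
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Synthetic scores: the last point is an outlier on both axes.
benchmark = [1.0, 2.0, 3.0, 4.0, 100.0]
reference = [2.0, 1.0, 4.0, 3.0, 50.0]
r_pearson = pearson(benchmark, reference)    # above 0.99, driven by the outlier
r_spearman = spearman(benchmark, reference)  # 0.8, reflecting two swapped pairs
```

Reporting both coefficients over the same model set is one way to check whether a high Pearson value reflects agreement in ranking or is dominated by a few widely separated systems.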

Comparison to Chatbot Arena, Arena-Hard, and MT-Bench

WildBench occupies a specific niche among LLM evaluation suites, defined by three design axes: prompt source, judging method, and metric structure.^[1]

Chatbot Arena. Chatbot Arena uses real-time human pairwise votes between anonymized models on user-submitted prompts. Its strengths are clear: judgments from real humans, a large sample size, and prompt diversity from real users. Its weaknesses are cost, latency, and lack of interpretability. WildBench borrows the realism axis (its prompts come from a similar source: real chatbot conversations) but replaces human voters with an LLM judge, allowing it to evaluate a new model for roughly $5 of API spend when run in OpenAI's batch mode, according to the GitHub README.^[3]

Arena-Hard. Arena-Hard, also from LMSYS, takes a different but related approach. It curates 500 prompts from Chatbot Arena conversations that are difficult for current models, then uses GPT-4 as a judge against a baseline model in pairwise comparisons. Arena-Hard is highly correlated with Arena Elo (the original Arena-Hard paper reports correlations above 0.9), but it is heavily skewed toward coding prompts and tends to favor verbose, long responses absent additional controls. WildBench differs in three ways: a larger task set (1,024 vs 500), a more balanced category distribution, and an explicit length-control mechanism (K-based tie conversion).^[1]

AlpacaEval. AlpacaEval (and the length-controlled AlpacaEval 2 LC) uses an LLM judge to score model responses against a fixed GPT-4 reference on 805 prompts, computing a "win rate." AlpacaEval 2 LC introduced a logistic-regression length adjustment to mitigate length bias. WildBench differs by drawing prompts from real user conversations rather than the Alpaca-style instruction set, by maintaining a much broader category distribution, and by reporting both reward (pairwise) and score (single-response) metrics.^[1]

MT-Bench. MT-Bench is a small set of 80 multi-turn prompts judged by GPT-4 on a 1-10 scale. It was an early and influential LLM-judge benchmark, but its small size and limited diversity make it relatively coarse, and it tends to saturate among modern frontier models. WildBench's 1,024 prompts and category-balanced distribution are intended to provide much finer discrimination among strong systems.^[1]

LiveBench and others. LiveBench takes a different approach to contamination by continuously refreshing its evaluation prompts, while RewardBench and IFEval target adjacent capabilities (preference modeling and instruction following respectively). WildBench is conceptually orthogonal to these efforts: it focuses on holistic conversational quality on realistic prompts rather than on capability-specific tasks.

Limitations and critiques

The WildBench authors and subsequent commentators have identified several limitations of the benchmark, some inherent to the LLM-as-judge paradigm and some specific to its design choices.^[1]^[6]

Reliance on GPT-4-Turbo as judge. Because GPT-4-Turbo provides the principal score, models that share stylistic or training-data characteristics with the GPT-4 series may receive systematically favorable scores. This is the well-known self-enhancement bias of LLM judges. The use of three different baseline models for pairwise comparisons mitigates but does not eliminate this risk, and the checklist-based protocol is itself generated in part by GPT-4-Turbo.^[1]

Residual length bias. The K = 500 length-penalty rule is a heuristic. It is effective for the response-length range observed during the paper's analysis, but it remains a crude correction; models that consistently produce slightly longer outputs may still gain an advantage on tasks where verbosity provides a small benefit to comprehension. Commentators have noted that no LLM-judge benchmark, including WildBench, has fully solved length bias.^[6]

English-only scope. WildBench V1 and V2 evaluate only English-language tasks, even though WildChat is multilingual. The authors describe this scoping as a deliberate first-version choice and explicitly flag multilingual extension as future work.^[1]

API and contamination risk. Although the curated tasks are not released in the public WildChat dataset, evaluating a model against WildBench requires submitting the WildBench prompts to that model's API, which exposes the prompts to API logging and therefore to potential future training data contamination. This is a structural risk for any fixed, cloud-evaluated benchmark and not unique to WildBench.^[1]

Multi-turn handling. WildBench supports multi-turn conversations up to five turns, but the prior turns in the conversation are generated by GPT-4 rather than by the model being evaluated. This means that the conversational context against which the model is judged may not match the conversational style of that model, which slightly disadvantages models with distinctive multi-turn behaviors.^[1]

Subjectivity of "good" responses. Even human raters disagree on which of two open-ended responses is better. The authors observe that "even asking humans to judge which of the given two model outputs is better can be subjective and ambiguous," and that automated judgments inherit this irreducible noise.^[1] Checklists are intended to make the decision more structured, but they cannot eliminate underlying disagreement.

Demographic skew. Because WildChat conversations are contributed by users who opted into the free-API service, the demographic distribution of prompts is not necessarily representative of all chatbot users globally. WildBench inherits this skew.^[5]

Snapshot rather than longitudinal. WildBench V2 is a fixed snapshot, so it does not by itself detect drift in user preferences or the emergence of new task types. AI2 has described the framework as "updated regularly," and the existence of V1 and V2 splits demonstrates a willingness to re-curate, but the cadence is event-driven rather than continuous.^[2]^[3]

Adoption and impact

Since its release in June 2024, WildBench has become a standard reference point in LLM evaluation reports, alongside Chatbot Arena, Arena-Hard, and AlpacaEval. Notable uptake includes:

AI2 internal use. WildBench is one of the benchmarks used in AI2's evaluation of its own Tülu 3 post-training recipes. The Tülu 3 release uses a multi-task evaluation scheme that includes WildBench as a signal for open-ended, real-user-style quality, complementing capability tests such as IFEval, math benchmarks, and safety evaluations.^[7]

Reproducibility. Because the evaluation framework is released under Apache 2.0 and supports OpenAI's batch API, a full evaluation can be reproduced for approximately $5 of GPT-4-Turbo API cost.^[3] This low cost has encouraged inclusion of WildBench results in model release notes from open-weight model groups, with results tracked on the dynamic Hugging Face leaderboard.^[2]

Dataset adoption. The 2,304-row WildBench Hugging Face dataset has accumulated thousands of monthly downloads, suggesting that researchers are using the curated prompts for purposes beyond the original leaderboard, including supervised fine-tuning data analysis and prompt-distribution studies.^[4]

Influence on benchmark design. WildBench's combination of real-user prompt sourcing, checklist-augmented LLM judging, and explicit length control has been cited as a methodological reference point in subsequent benchmark proposals that target conversational realism. Related work from the same AI2 group includes the multimodal counterpart WildVision (an analogous evaluation framework for vision-language models that draws prompts from real user-vision-model interactions).^[1]

Connection to the WildChat ecosystem. WildBench's existence depends on WildChat, and conversely WildChat's value has been amplified by WildBench as a curated, leakage-controlled evaluation set drawn from it. The Tülu 3 post-training pipeline at AI2 demonstrates this integration: WildChat-derived training prompts are paired with WildBench evaluation to close the loop between real-user data and real-user-style evaluation.^[7]

WildBench V2 currently represents AI2's recommended configuration for chatbot evaluation, and updates to the V2 split (and the harder V2-Hard subset) are released through the Hugging Face dataset page rather than as new paper revisions. As frontier models have continued to improve on the original V2 set, the V2-Hard subset has gained importance as the more discriminating evaluation surface.^[4]

Conclusion

WildBench occupies a useful position in the LLM evaluation ecosystem. It is cheaper and faster than Chatbot Arena but more realistic than MT-Bench or the original AlpacaEval; it is broader in category distribution than Arena-Hard and includes explicit length-bias controls; and it provides both pairwise (WB-Reward) and single-response (WB-Score) metrics with interpretable, checklist-based judge explanations. Its principal weaknesses (GPT-4-Turbo dependence, English-only scope, residual length bias, and the irreducible subjectivity of judging open-ended responses) are well-known and acknowledged by its authors, and they are largely shared with other automated benchmarks in the same family.^[1] Its primary contribution has been to demonstrate that automated, LLM-judged benchmarks can track human preference rankings closely (Pearson 0.98 against Arena Elo at the top of the leaderboard) when the prompt distribution is drawn from real-user data and the judging protocol is explicitly designed to control known judge biases.^[1]

For practitioners selecting an evaluation suite for an instruction-tuned model, WildBench is most useful as one signal among several: paired with capability tests (such as IFEval for instruction-following or LiveBench for contamination-resistant capability assessment), with human-vote systems where available, and with task-specific evaluations relevant to the deployment target. Used in this complementary fashion, WildBench provides a reproducible, low-cost, and reasonably faithful proxy for human conversational preference on the kinds of prompts that real users send.

References

1. Lin, B. Y., Deng, Y., Chandu, K., Brahman, F., Ravichander, A., Pyatkin, V., Dziri, N., Le Bras, R., & Choi, Y. (2024). "WildBench: Benchmarking LLMs with Challenging Tasks from Real Users in the Wild." arXiv:2406.04770. https://arxiv.org/abs/2406.04770 (Accessed 2026-05-19)
2. Allen Institute for AI. "AI2 WildBench Leaderboard (V2)." Hugging Face Spaces. https://huggingface.co/spaces/allenai/WildBench (Accessed 2026-05-19)
3. Allen Institute for AI. "WildBench: Benchmarking LLMs with Challenging Tasks from Real Users." GitHub repository. https://github.com/allenai/WildBench (Accessed 2026-05-19)
4. Allen Institute for AI. "allenai/WildBench Dataset." Hugging Face Datasets. https://huggingface.co/datasets/allenai/WildBench (Accessed 2026-05-19)
5. Zhao, W., Ren, X., Hessel, J., Cardie, C., Choi, Y., & Deng, Y. (2024). "WildChat: 1M ChatGPT Interaction Logs in the Wild." In Proceedings of the Twelfth International Conference on Learning Representations (ICLR). arXiv:2405.01470. https://arxiv.org/abs/2405.01470 (Accessed 2026-05-19)
6. Lin, B. Y., et al. (2024). "WildBench" HTML version (v2). arXiv. https://arxiv.org/html/2406.04770v2 (Accessed 2026-05-19)
7. Lambert, N., et al. (2024). "Tülu 3: Pushing Frontiers in Open Language Model Post-Training." arXiv:2411.15124. https://arxiv.org/abs/2411.15124 (Accessed 2026-05-19)
