RewardBench
Last reviewed
May 31, 2026
Sources
10 citations
Review status
Source-backed
Revision
v1 · 2,332 words
RewardBench is a benchmark and public leaderboard for evaluating reward models, the scoring functions that sit at the center of reinforcement learning from human feedback (RLHF). It was introduced in March 2024 by Nathan Lambert and collaborators at the Allen Institute for AI (Ai2) and the University of Washington, and it was the first standardized way to compare reward models directly rather than judging them only through the language models they help train. [1][2]
A reward model takes a prompt and a candidate response and returns a scalar score meant to reflect how much a human would prefer that response. During RLHF, this score is the training signal that pushes a policy model toward helpful, harmless, and honest outputs. Before RewardBench, reward model quality was mostly inferred after the fact: a team would finish an entire RLHF run and inspect the resulting chat model, with no clean way to attribute success or failure to the reward model itself. RewardBench reframed reward model evaluation as a self-contained classification problem, which made it possible to iterate on reward models without paying the cost of a full alignment pipeline each time. [1][3]
Reward models had been studied far less than the policy models they supervise. The original paper notes that reward models were underexplored and that more work was needed to understand their capabilities and to correlate their performance with the quality of downstream aligned models. [3] Because a reward model encodes which behaviors get rewarded, it also encodes value judgments about what counts as a good answer, so the authors framed evaluation partly as a way to make the opaque preferences baked into alignment more legible. [1]
The practical gap was just as important. Two reward models can post similar validation accuracy on the preference data they were trained on and still behave very differently on adversarial prompts, refusal-worthy requests, or fine-grained reasoning distinctions. RewardBench was built to surface those differences on held-out, deliberately hard cases, the kinds of inputs where a reward model's failures translate into a misaligned policy.
The core RewardBench dataset is a collection of prompt-chosen-rejected trios. Each item pairs a prompt with one preferred (chosen) completion and one dispreferred (rejected) completion. The completions are either sampled from specific models or hand-selected so that the chosen response is clearly better by a known criterion. The trios are grouped into four categories that span everyday chat, harder adversarial chat, safety, and reasoning, drawing subsets from existing datasets such as AlpacaEval, MT Bench, LLMBar, XSTest, Do-Not-Answer, a math process-reward dataset, and the HumanEvalPack coding sets. [1][3]
| Category | Approx. samples | What it tests | Example subsets |
|---|---|---|---|
| Chat | 358 | Basic open-ended conversational preference | AlpacaEval (Easy, Length, Hard), MT Bench (Easy, Medium) |
| Chat Hard | 456 | Subtle and adversarial preference distinctions | MT Bench Hard, LLMBar Natural, LLMBar Adversarial (Neighbor, GPTInst, GPTOut, Manual) |
| Safety | 740 | Refusing dangerous requests while answering benign ones | Refusals (Dangerous, Offensive), XSTest (Should Refuse, Should Respond), Do-Not-Answer |
| Reasoning | 1,431 | Code correctness and step-by-step math | HumanEvalPack (six languages), PRM Math |
The Reasoning subsets are demanding by design. Many code examples differ between the chosen and rejected completions by only one or two tokens, so a reward model has to make a precise rather than a stylistic judgment. The Safety category is split so that some subsets reward refusal and others, such as XSTest Should Respond, reward answering a question that merely sounds risky, which exposes models that refuse too aggressively. [1][3]
Alongside the four scored categories, the project tracked a fifth group called Prior Sets, an average over test splits from established preference datasets including Anthropic Helpful, Anthropic HHH, Stanford Human Preferences (SHP), and OpenAI's Learning to Summarize data. Because these sets are noisier and have less clearly defined tasks, the leaderboard weighted Prior Sets at half the weight of the four main categories when computing an overall score, and the category was later de-emphasized. [1][3]
RewardBench uses a simple and model-agnostic metric: accuracy on pairwise preference. For each trio, the reward model scores the chosen completion and the rejected completion, and the item counts as a win if the chosen score is higher. Formally, a trial succeeds when r(x, y_chosen) is greater than r(x, y_rejected), where x is the prompt. A random scorer therefore lands at 50 percent. [1][3]
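The metric can be sketched in a few lines. This is a minimal illustration, not the project's evaluation code: the `Trio` class, `pairwise_accuracy`, and the `toy_score` stand-in scorer are hypothetical names introduced here, and equal scores are treated as a loss because the text above only counts a win when the chosen score is strictly higher.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass(frozen=True)
class Trio:
    """One benchmark item: a prompt, a preferred and a dispreferred completion."""
    prompt: str
    chosen: str
    rejected: str


def pairwise_accuracy(trios: Iterable[Trio], score: Callable[[str, str], float]) -> float:
    """Fraction of trios where r(x, y_chosen) > r(x, y_rejected)."""
    trios = list(trios)
    if not trios:
        raise ValueError("need at least one trio")
    # Strict inequality: an exact tie is counted as a loss (assumption of this sketch).
    wins = sum(score(t.prompt, t.chosen) > score(t.prompt, t.rejected) for t in trios)
    return wins / len(trios)


def toy_score(prompt: str, response: str) -> float:
    """Hypothetical stand-in for a reward model: simply prefers shorter responses."""
    return -float(len(response))


examples = [
    Trio("What is 2 + 2?", "4", "It is probably 5, I think."),
    Trio("Name a primary color.", "Red is a primary color.", "Dog"),
]
accuracy = pairwise_accuracy(examples, toy_score)
```

Any scorer with the same `(prompt, response) -> float` shape can be dropped in for `toy_score`, which is what lets different kinds of reward models share one metric.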
Within each category the per-subset accuracies are combined into a category score, and the overall RewardBench score is a weighted average across categories (with Prior Sets, when included, contributing at half weight). For Reasoning, the weighting is adjusted so that math and code abilities count roughly equally rather than letting the larger code split dominate. [1][3]
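The aggregation described above can be illustrated as follows. This is a sketch under stated assumptions: the function names are hypothetical, the four main categories are assumed to carry weight 1.0 each, and the Reasoning adjustment is modeled as an exact 50/50 split between math and code, which is only one way to make the two "count roughly equally". The numbers are toy values, not leaderboard results.

```python
from typing import Mapping, Optional, Sequence


def reasoning_score(math_accuracy: float, code_accuracies: Sequence[float]) -> float:
    """Give math and code equal weight regardless of subset sizes (assumption)."""
    code_mean = sum(code_accuracies) / len(code_accuracies)
    return 0.5 * math_accuracy + 0.5 * code_mean


def overall_score(categories: Mapping[str, float], prior_sets: Optional[float] = None) -> float:
    """Weighted average: main categories at weight 1.0, Prior Sets at 0.5 if given."""
    weighted_sum = sum(categories.values())
    total_weight = float(len(categories))
    if prior_sets is not None:
        weighted_sum += 0.5 * prior_sets
        total_weight += 0.5
    return weighted_sum / total_weight


# Toy per-category accuracies (percent), made up for illustration.
toy_categories = {
    "Chat": 90.0,
    "Chat Hard": 60.0,
    "Safety": 80.0,
    "Reasoning": reasoning_score(80.0, [60.0, 70.0, 80.0]),
}
score_without_prior = overall_score(toy_categories)
score_with_prior = overall_score(toy_categories, prior_sets=65.0)
```

Because Prior Sets enters at half weight, a weak Prior Sets result pulls the overall score down less than an equally weak result in one of the four main categories would.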
The key strength of this setup is that it accepts very different kinds of reward models on equal footing, as long as each can produce a comparable score for a prompt-response pair.
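RewardBench is designed to evaluate three families of scorers under the same accuracy metric: classifier reward models that output a scalar score directly, DPO-trained models whose implicit reward is derived from the policy's log-probabilities, and generative (LLM-as-judge) models that are prompted to pick the better response.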
This coverage made it possible to ask comparative questions that had been hard to settle, such as whether classifier reward models or LLM-as-judge setups were stronger. In the original release, the best classifier reward models outperformed the best generative reward models, and DPO-trained models tended to show higher variance and weaker generalization to outside preference test sets than purpose-built classifiers. [1][3]
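For readers wondering how a DPO-trained model can act as a scorer at all, the sketch below uses the standard DPO implicit reward, beta times the log-ratio between the trained policy and its reference model. The function name, the `beta` default, and the per-token log-probabilities are illustrative assumptions, and this article does not state that the leaderboard code computes the score in exactly this way.

```python
from typing import Sequence


def dpo_implicit_reward(
    policy_logprobs: Sequence[float],
    reference_logprobs: Sequence[float],
    beta: float = 0.1,
) -> float:
    """beta * (log pi(y|x) - log pi_ref(y|x)), from per-token log-probabilities."""
    if len(policy_logprobs) != len(reference_logprobs):
        raise ValueError("both models must score the same response tokens")
    # Sequence log-probability is the sum of the per-token log-probabilities.
    return beta * (sum(policy_logprobs) - sum(reference_logprobs))


# Made-up log-probabilities for a two-token chosen and rejected response.
chosen_reward = dpo_implicit_reward([-0.2, -0.3], [-0.5, -0.5])
rejected_reward = dpo_implicit_reward([-1.0, -1.0], [-0.6, -0.4])
prefers_chosen = chosen_reward > rejected_reward
```

The resulting number plays the same role as a classifier's scalar output, so the pairwise accuracy metric applies unchanged.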
The RewardBench leaderboard is hosted as a Hugging Face Space under the allenai organization, where anyone can submit a model and see per-category breakdowns. The initial release evaluated more than 30 reward models, covering most of the publicly accessible options at the time, and the leaderboard kept growing as new models were submitted. [2][4]
The table below reports figures from the original paper's evaluation tables. The live leaderboard changes over time as models are added, so these numbers are a snapshot of the 2024 results rather than current standings.
| Reward model | Overall | Chat | Chat Hard | Safety | Reasoning |
|---|---|---|---|---|---|
| ArmoRM-Llama3-8B-v0.1 | 89.0 | 96.9 | 76.8 | 92.2 | 97.3 |
| RLHFlow pair-preference LLaMA3-8B | 85.7 | n/a | 65.8 | 89.7 | n/a |
| FsfairX-LLaMA3-RM-v0.1 | 83.6 | n/a | 65.1 | n/a | 86.4 |
| Starling-RM-34B | 81.4 | 96.9 | 57.2 | n/a | 88.5 |
| Tulu-2-DPO-70B | 76.1 | n/a | 60.5 | 83.9 | n/a |
Values are accuracy percentages; "n/a" marks per-category scores not reproduced here. [1]
A recurring pattern in these results is that Chat Hard is the bottleneck. Many models score in the mid-90s on plain Chat yet fall well below that on Chat Hard, because distinguishing a genuinely better answer from a fluent but subtly worse one is much harder than rejecting an obviously bad answer. The original analysis also found that only large models, and models built on the Llama 3 base, reached high performance on both Chat Hard and Reasoning at the same time. [1][3]
In June 2025 the team released RewardBench 2, a follow-up meant to fix the saturation and validity problems that had emerged as reward models improved. It was described by Saumya Malik and coauthors at Ai2 and the University of Washington. [5][6]
The most visible change is the scoring format. Instead of a single chosen-versus-rejected pair, most of RewardBench 2 uses a best-of-4 setup: each prompt comes with one correct completion and three incorrect ones, and a model succeeds only when it scores the correct completion above all three distractors. That moves the random baseline from 50 percent down to 25 percent and makes the test substantially harder. [5][6] The dataset holds roughly 1,865 examples across six domains. [6]
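The best-of-4 rule can be sketched as below. The class name, function name, and lookup-table scorer are hypothetical and the scores are made up; the only behavior taken from the text is that an item counts as a success when the correct completion outscores all three distractors.

```python
from dataclasses import dataclass
from typing import Callable, Sequence, Tuple


@dataclass(frozen=True)
class BestOfFourItem:
    """One prompt with a single correct completion and three incorrect ones."""
    prompt: str
    correct: str
    distractors: Tuple[str, str, str]


def best_of_four_accuracy(
    items: Sequence[BestOfFourItem], score: Callable[[str, str], float]
) -> float:
    """Fraction of items where the correct completion beats every distractor."""
    wins = 0
    for item in items:
        correct_score = score(item.prompt, item.correct)
        if all(correct_score > score(item.prompt, d) for d in item.distractors):
            wins += 1
    return wins / len(items)


# Hypothetical scorer backed by a lookup table of made-up scores.
toy_scores = {"Paris": 0.9, "Lyon": 0.2, "Nice": 0.1, "Rome": 0.3,
              "4": 0.5, "3": 0.1, "5": 0.7, "22": 0.0}


def toy_score(prompt: str, response: str) -> float:
    return toy_scores[response]


items = [
    BestOfFourItem("Capital of France?", "Paris", ("Lyon", "Nice", "Rome")),
    BestOfFourItem("What is 2 + 2?", "4", ("3", "5", "22")),  # "5" outscores "4"
]
accuracy = best_of_four_accuracy(items, toy_score)
random_baseline = 1 / (1 + 3)  # one correct answer among four candidates
```

A single distractor scoring above the correct completion is enough to fail the item, which is why the same reward model tends to score lower here than on a pairwise test.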
| Domain | Approx. samples | Focus |
|---|---|---|
| Factuality | 475 | Detecting hallucinated or subtly incorrect answers |
| Precise instruction following | 160 | Honoring hard constraints (for example, avoiding a given letter) |
| Math | 183 | Selecting correct solutions to open-ended problems |
| Safety | 450 | Appropriate compliance versus refusal |
| Focus | 495 | Relevance and on-topic quality |
| Ties | 102 | Calibration across equally valid answers |
Three of these domains (Focus, Math, and Safety) refresh areas the first benchmark already touched, while Factuality, Precise instruction following, and Ties are new. Ties is the most novel: it presents prompts such as naming a color of the rainbow, where several answers are equally correct, and checks whether a reward model avoids assigning arbitrarily strong preferences among them. [5][6]
Two other design choices stand out. First, RewardBench 2 sources fresh human prompts rather than recycling prompts from existing evaluations; the authors report that most of the benchmark draws on unseen prompts from WildChat and that they ran a decontamination check against twenty common downstream evaluations to avoid overlap. [5] Second, the authors explicitly target correlation with downstream use. They report that scores on the new benchmark track best-of-N sampling performance with a Pearson correlation around 0.87, while correlation with full RLHF training holds mainly when the policy and reward model share the same base-model lineage. [5]
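The correlation claim refers to the ordinary Pearson coefficient between a set of benchmark scores and the matching downstream results. A minimal sketch, with a hand-written `pearson` helper and made-up numbers that are not taken from the paper:

```python
import math
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length samples."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two samples of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / math.sqrt(var_x * var_y)


# Illustrative only: one entry per hypothetical reward model.
benchmark_scores = [55.0, 60.0, 70.0, 76.0]   # benchmark accuracy, percent
best_of_n_scores = [40.0, 45.0, 50.0, 58.0]   # downstream best-of-N result, percent
r = pearson(benchmark_scores, best_of_n_scores)
```

A value near 1 means that ranking reward models by benchmark score closely matches ranking them by downstream result; it says nothing about the absolute size of the downstream gains.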
The difficulty jump is large in practice. Leading models from the first RewardBench score about 20 points lower on RewardBench 2, and even strong systems land in the mid-70s. Reported top results include Gemini 2.5 Flash near 77 percent and several reward models and proprietary judges clustered around 76 to 77 percent, with the precise-instruction-following and math subsets remaining especially hard. [5][6]
The sharpest criticism of the original RewardBench concerns external validity: whether accuracy on the benchmark predicts how well a reward model will actually perform when used to train a policy. The authors themselves flagged this as unresolved, noting that they relied on semi-automatic ways of building chosen-rejected pairs rather than fresh human preference data, and that the link between benchmark results and downstream training quality was an open question. [1]
Independent work made the concern concrete. Frick and collaborators at UC Berkeley reported that, as reward models improved, a negative correlation appeared between RewardBench scores on top models and downstream RLHF performance, meaning the highest-scoring models were not reliably the best for training. They proposed Preference Proxy Evaluations (PPE) built on real human preference pairs from Chatbot Arena and verifiable-correctness data, and validated their metrics with end-to-end RLHF runs. [7] Other benchmarks such as RM-Bench likewise argued that RewardBench's style of accuracy correlated only weakly with policy improvement, while their harder, style-controlled accuracy correlated more strongly. [8]
There are narrower critiques as well. The categories cover chat, safety, and reasoning but leave out long-context, multilingual, and multimodal preferences, gaps that spawned spin-off benchmarks like M-RewardBench for multilingual evaluation and VL-RewardBench for vision-language reward models. [9] Some subsets, such as AlpacaEval and MT Bench, are widely used in training, raising a risk of data contamination if a reward model was trained on overlapping data. [1] And because the trios are mostly constructed rather than drawn from organic human judgments, a model can learn to exploit format or length cues that do not reflect genuine preference quality. RewardBench 2 was the team's answer to several of these points: harder best-of-4 scoring, fresh human prompts, decontamination, and an explicit effort to demonstrate correlation with downstream performance. [5]
RewardBench gave the alignment community a shared yardstick for a component that had been measured only indirectly, and it slotted into the broader toolkit of model evaluation for large language models. Its category breakdowns made specific weaknesses visible, especially the difficulty of subtle chat preferences and the safety tradeoff between over-refusal and over-compliance, both of which matter for AI safety. The follow-up debate about whether benchmark accuracy predicts RLHF outcomes has itself been productive, pushing reward model evaluation toward harder data and toward explicit validation against real downstream training. [3][5][7]