RewardBench

Updated Jun 9, 2026

RewardBench is a benchmark and public leaderboard for evaluating reward models, the scoring functions that sit at the center of reinforcement learning from human feedback (RLHF). It was introduced in March 2024 by Nathan Lambert and collaborators at the Allen Institute for AI (Ai2) and the University of Washington, and it was the first standardized way to compare reward models directly rather than judging them only through the language models they help train. ^[1]^[2]

A reward model takes a prompt and a candidate response and returns a scalar score meant to reflect how much a human would prefer that response. During RLHF, this score is the training signal that pushes a policy model toward helpful, harmless, and honest outputs. Before RewardBench, reward model quality was mostly inferred after the fact: a team would finish an entire RLHF run and inspect the resulting chat model, with no clean way to attribute success or failure to the reward model itself. RewardBench reframed reward model evaluation as a self-contained classification problem, which made it possible to iterate on reward models without paying the cost of a full alignment pipeline each time. ^[1]^[3]

Motivation

Reward models had been studied far less than the policy models they supervise. The original paper argues that more work was needed to understand their capabilities and to correlate their performance with the quality of downstream aligned models. ^[3] Because a reward model encodes which behaviors get rewarded, it also encodes value judgments about what counts as a good answer, so the authors framed evaluation partly as a way to make the opaque preferences baked into alignment more legible. ^[1]

The practical gap was just as important. Two reward models can post similar validation accuracy on the preference data they were trained on and still behave very differently on adversarial prompts, refusal-worthy requests, or fine-grained reasoning distinctions. RewardBench was built to surface those differences on held-out, deliberately hard cases, the kinds of inputs where a reward model's failures translate into a misaligned policy.

Dataset construction and categories

The core RewardBench dataset is a collection of prompt-chosen-rejected trios. Each item pairs a prompt with one preferred (chosen) completion and one dispreferred (rejected) completion. The completions are either sampled from specific models or hand-selected so that the chosen response is clearly better by a known criterion. The trios are grouped into four categories that span everyday chat, harder adversarial chat, safety, and reasoning, drawing subsets from existing datasets such as AlpacaEval, MT Bench, LLMBar, XSTest, Do-Not-Answer, a math process-reward dataset, and the HumanEvalPack coding sets. ^[1]^[3]

Category	Approx. samples	What it tests	Example subsets
Chat	358	Basic open-ended conversational preference	AlpacaEval (Easy, Length, Hard), MT Bench (Easy, Medium)
Chat Hard	456	Subtle and adversarial preference distinctions	MT Bench Hard, LLMBar Natural, LLMBar Adversarial (Neighbor, GPTInst, GPTOut, Manual)
Safety	740	Refusing dangerous requests while answering benign ones	Refusals (Dangerous, Offensive), XSTest (Should Refuse, Should Respond), Do-Not-Answer
Reasoning	1,431	Code correctness and step-by-step math	HumanEvalPack (six languages), PRM Math

The Reasoning subsets are demanding by design. Many code examples differ between the chosen and rejected completions by only one or two tokens, so a reward model has to make a precise rather than a stylistic judgment. The Safety category is split so that some subsets reward refusal and others, such as XSTest Should Respond, reward answering a question that merely sounds risky, which exposes models that refuse too aggressively. ^[1]^[3]

Alongside the four scored categories, the project tracked a fifth group called Prior Sets, an average over test splits from established preference datasets including Anthropic Helpful, Anthropic HHH, Stanford Human Preferences (SHP), and OpenAI's Learning to Summarize data. Because these sets are noisier and have less clearly defined tasks, the leaderboard weighted Prior Sets at half the weight of the four main categories when computing an overall score, and the category was later de-emphasized. ^[1]^[3]

Scoring methodology

RewardBench uses a simple and model-agnostic metric: accuracy on pairwise preference. For each trio, the reward model scores the chosen completion and the rejected completion, and the item counts as a win if the chosen score is higher. Formally, a trial succeeds when r(x, y_chosen) is greater than r(x, y_rejected), where x is the prompt. A random scorer therefore lands at 50 percent. ^[1]^[3]
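
The pairwise metric described above can be sketched in a few lines. This is a minimal illustration rather than the project's evaluation code: the `Trio` type, the `pairwise_accuracy` function, and the toy `length_scorer` are hypothetical names introduced here, and counting an exact tie as a failure is an assumption that follows from the strict "greater than" condition.

```python
from typing import Callable, Iterable, NamedTuple


class Trio(NamedTuple):
    """One benchmark item: a prompt with a preferred and a dispreferred completion."""
    prompt: str
    chosen: str
    rejected: str


def pairwise_accuracy(score: Callable[[str, str], float], trios: Iterable[Trio]) -> float:
    """Fraction of trios where r(x, y_chosen) > r(x, y_rejected)."""
    items = list(trios)
    if not items:
        raise ValueError("need at least one trio")
    # Strict inequality: an exact tie is counted as a failure (assumption).
    wins = sum(score(t.prompt, t.chosen) > score(t.prompt, t.rejected) for t in items)
    return wins / len(items)


def length_scorer(prompt: str, response: str) -> float:
    """Toy stand-in for a reward model (hypothetical): longer responses score higher."""
    return float(len(response))


examples = [
    Trio("What is 2 + 2?", "2 + 2 equals 4.", "5"),
    Trio("Name a primary color.", "Red", "Purple is a primary color."),
]
accuracy = pairwise_accuracy(length_scorer, examples)  # one win, one loss
```

Any scorer that maps a prompt-response pair to a number can be passed in as `score`, which is what lets very different kinds of reward models share one metric.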

Within each category the per-subset accuracies are combined into a category score, and the overall RewardBench score is a weighted average across categories (with Prior Sets, when included, contributing at half weight). For Reasoning, the weighting is adjusted so that math and code abilities count roughly equally rather than letting the larger code split dominate. ^[1]^[3]
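
As a rough sketch of the cross-category step only, the weighting can be written as a weighted mean in which each main category has weight 1 and Prior Sets, when present, has weight 0.5. The function name `overall_score` and the example numbers are hypothetical, and the sketch does not model how subsets are combined inside a category.

```python
from typing import Dict, Optional


def overall_score(category_scores: Dict[str, float], prior_sets: Optional[float] = None) -> float:
    """Weighted mean of category scores; Prior Sets counts at half weight if given."""
    if not category_scores:
        raise ValueError("need at least one category score")
    weighted_sum = sum(category_scores.values())  # each main category has weight 1.0
    total_weight = float(len(category_scores))
    if prior_sets is not None:
        weighted_sum += 0.5 * prior_sets
        total_weight += 0.5
    return weighted_sum / total_weight


# Hypothetical category accuracies, not taken from any real model.
scores = {"Chat": 90.0, "Chat Hard": 60.0, "Safety": 80.0, "Reasoning": 70.0}
without_prior = overall_score(scores)                # (90 + 60 + 80 + 70) / 4
with_prior = overall_score(scores, prior_sets=50.0)  # (300 + 25) / 4.5
```

Because Prior Sets carries half weight, a weak Prior Sets score lowers the overall figure by less than an equally weak main category would.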

The key strength of this setup is that it accepts very different kinds of reward models on equal footing, as long as each can produce a comparable score for a prompt-response pair.

What kinds of reward models it compares

RewardBench is designed to evaluate three families of scorers under the same accuracy metric:

Sequence-classifier reward models. These are the standard RLHF reward models: a transformer with a scalar output head, usually trained with a pairwise Bradley-Terry loss on preference data. The score is read directly from the head, and the pairwise comparison is the difference between the two scores.
DPO-implicit reward models. Models trained with direct preference optimization (DPO) do not have an explicit reward head, but DPO defines an implicit reward through the log-probability ratio between the trained policy and its reference model. RewardBench scores these by comparing log[pi(y|x) / pi_ref(y|x)] for the chosen and rejected completions, which lets the benchmark include the many openly available DPO checkpoints. ^[1]^[3]
Generative reward models. These use a language model as a judge, prompting it to choose between two answers or to emit a score. They are scored by parsing the model's verdict into a preference. ^[1]^[3]
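
The DPO-implicit comparison in the list above can be illustrated with a small sketch. It assumes the per-token log-probabilities of each completion under the policy and the reference model have already been computed, so their sums give log pi(y|x) and log pi_ref(y|x). The function name `implicit_reward` and all numbers are hypothetical. The `beta` factor is the positive DPO scaling constant, which does not change which completion scores higher.

```python
from typing import Sequence


def implicit_reward(
    policy_token_logprobs: Sequence[float],
    reference_token_logprobs: Sequence[float],
    beta: float = 1.0,
) -> float:
    """beta * log[pi(y|x) / pi_ref(y|x)], from per-token log-probabilities of y."""
    if len(policy_token_logprobs) != len(reference_token_logprobs):
        raise ValueError("both models must score the same completion tokens")
    # Summing token log-probs gives the sequence log-prob; the difference is the log ratio.
    log_ratio = sum(policy_token_logprobs) - sum(reference_token_logprobs)
    return beta * log_ratio


# Hypothetical per-token log-probabilities for one prompt's two completions.
chosen_reward = implicit_reward([-0.2, -0.3, -0.1], [-0.5, -0.6, -0.4])
rejected_reward = implicit_reward([-1.0, -0.9], [-0.8, -0.7])

# The item counts as a win when the chosen completion has the higher implicit reward.
dpo_win = chosen_reward > rejected_reward
```

In this toy case the policy raised the likelihood of the chosen completion relative to the reference and lowered it for the rejected one, so the comparison succeeds.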

This coverage made it possible to ask comparative questions that had been hard to settle, such as whether classifier reward models or LLM-as-judge setups were stronger. In the original release, the best classifier reward models outperformed the best generative reward models, and DPO-trained models tended to show higher variance and weaker generalization to outside preference test sets than purpose-built classifiers. ^[1]^[3]

The leaderboard and representative results

The RewardBench leaderboard is hosted as a Hugging Face Space under the allenai organization, where anyone can submit a model and see per-category breakdowns. The initial release evaluated more than 30 reward models, covering most of the publicly accessible options at the time, and the leaderboard kept growing as new models were submitted. ^[2]^[4]

The table below reports figures from the original paper's evaluation tables. The live leaderboard changes over time as models are added, so these numbers are a snapshot of the 2024 results rather than current standings.

Reward model	Overall	Chat	Chat Hard	Safety	Reasoning
ArmoRM-Llama3-8B-v0.1	89.0	96.9	76.8	92.2	97.3
RLHFlow pair-preference LLaMA3-8B	85.7	n/a	65.8	89.7	n/a
FsfairX-LLaMA3-RM-v0.1	83.6	n/a	65.1	n/a	86.4
Starling-RM-34B	81.4	96.9	57.2	n/a	88.5
Tulu-2-DPO-70B	76.1	n/a	60.5	83.9	n/a

Values are accuracy percentages; "n/a" marks per-category scores not reproduced here. ^[1]

A recurring pattern in these results is that Chat Hard is the bottleneck. Many models score in the mid-90s on plain Chat yet fall well below that on Chat Hard, because distinguishing a genuinely better answer from a fluent but subtly worse one is much harder than rejecting an obviously bad answer. The original analysis also found that only large models, and models built on the Llama 3 base, reached high performance on both Chat Hard and Reasoning at the same time. ^[1]^[3]

RewardBench 2

In June 2025 the team released RewardBench 2, a follow-up meant to fix the saturation and validity problems that had emerged as reward models improved. It was described by Saumya Malik and coauthors at Ai2 and the University of Washington. ^[5]^[6]

The most visible change is the scoring format. Instead of a single chosen-versus-rejected pair, most of RewardBench 2 uses a best-of-4 setup: each prompt comes with one correct completion and three incorrect ones, and a model succeeds only when it scores the correct completion above all three distractors. That moves the random baseline from 50 percent down to 25 percent and makes the test substantially harder. ^[5]^[6] The dataset holds roughly 1,865 examples across six domains. ^[6]
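
A minimal sketch of the best-of-4 rule, assuming each item has already been reduced to one score for the correct completion and three scores for the distractors. The function names and example scores are hypothetical, and treating a tie with a distractor as a failure is an assumption.

```python
from typing import Sequence, Tuple


def best_of_n_success(correct_score: float, distractor_scores: Sequence[float]) -> bool:
    """True only if the correct completion outscores every distractor."""
    return all(correct_score > d for d in distractor_scores)


def best_of_n_accuracy(items: Sequence[Tuple[float, Sequence[float]]]) -> float:
    """Fraction of items where the correct completion is ranked strictly first."""
    if not items:
        raise ValueError("need at least one item")
    return sum(best_of_n_success(c, ds) for c, ds in items) / len(items)


# Hypothetical reward-model scores: (correct completion, three distractors).
items = [
    (0.9, [0.1, 0.4, 0.3]),  # correct is highest: success
    (0.5, [0.2, 0.7, 0.1]),  # one distractor is higher: failure
    (0.3, [0.3, 0.1, 0.2]),  # tie with a distractor: failure (assumption)
]
bo4_accuracy = best_of_n_accuracy(items)

# A scorer that ranks four completions uniformly at random picks the correct one 1 time in 4.
random_baseline = 1 / (1 + 3)
```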

Domain	Approx. samples	Focus
Factuality	475	Detecting hallucinated or subtly incorrect answers
Precise instruction following	160	Honoring hard constraints (for example, avoiding a given letter)
Math	183	Selecting correct solutions to open-ended problems
Safety	450	Appropriate compliance versus refusal
Focus	495	Relevance and on-topic quality
Ties	102	Calibration across equally valid answers

Three of these domains (focus, math, and safety) refresh areas the first benchmark already touched, while factuality, precise instruction following, and ties are new. Ties is the most novel: it presents prompts such as naming a color of the rainbow, where several answers are equally correct, and checks whether a reward model avoids assigning arbitrarily strong preferences among them. ^[5]^[6]

Two other design choices stand out. First, RewardBench 2 sources fresh human prompts rather than recycling prompts from existing evaluations; the authors report that most of the benchmark draws on unseen prompts from WildChat and that they ran a decontamination check against twenty common downstream evaluations to avoid overlap. ^[5] Second, the authors explicitly target correlation with downstream use. They report that scores on the new benchmark track best-of-N sampling performance with a Pearson correlation around 0.87, while correlation with full RLHF training holds mainly when the policy and reward model share the same base-model lineage. ^[5]
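
For readers unfamiliar with the statistic, the sketch below shows how a Pearson correlation between benchmark scores and a downstream metric would be computed. The `pearson` helper is written out by hand for clarity, and the two lists are hypothetical placeholder values, not results from the paper.

```python
import math
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length samples."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two samples of equal length, at least 2 points each")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sample")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical values: one entry per reward model.
benchmark_scores = [55.0, 62.0, 70.0, 76.0]   # benchmark accuracy (%)
best_of_n_scores = [40.0, 47.0, 52.0, 61.0]   # downstream best-of-N metric (%)
r = pearson(benchmark_scores, best_of_n_scores)
```

A value near 1 means models that rank higher on the benchmark also tend to rank higher downstream; it says nothing about the size of the downstream gain.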

The difficulty jump is large in practice. Leading models from the first RewardBench score about 20 points lower on RewardBench 2, and even strong systems land in the mid-70s. Reported top results include Gemini 2.5 Flash near 77 percent and several reward models and proprietary judges clustered around 76 to 77 percent, with the precise-instruction-following and math subsets remaining especially hard. ^[5]^[6]

Criticisms and limitations

The sharpest criticism of the original RewardBench concerns external validity: whether accuracy on the benchmark predicts how well a reward model will actually perform when used to train a policy. The authors themselves flagged this as unresolved, noting that they relied on semi-automatic ways of building chosen-rejected pairs rather than fresh human preference data, and that the link between benchmark results and downstream training quality was an open question. ^[1]

Independent work made the concern concrete. Frick and collaborators at UC Berkeley reported that, as reward models improved, a negative correlation appeared between RewardBench scores on top models and downstream RLHF performance, meaning the highest-scoring models were not reliably the best for training. They proposed Preference Proxy Evaluations (PPE) built on real human preference pairs from Chatbot Arena and verifiable-correctness data, and validated their metrics with end-to-end RLHF runs. ^[7] Other benchmarks such as RM-Bench likewise argued that RewardBench's style of accuracy correlated only weakly with policy improvement, while their harder, style-controlled accuracy correlated more strongly. ^[8]

There are narrower critiques as well. The categories cover chat, safety, and reasoning but leave out long-context, multilingual, and multimodal preferences, gaps that spawned spin-off benchmarks like M-RewardBench for multilingual evaluation and VL-RewardBench for vision-language reward models. ^[9] Some subsets, such as AlpacaEval and MT Bench, are widely used in training, raising a risk of data contamination if a reward model was trained on overlapping data. ^[1] And because the trios are mostly constructed rather than drawn from organic human judgments, a model can learn to exploit format or length cues that do not reflect genuine preference quality. RewardBench 2 was the team's answer to several of these points: harder best-of-4 scoring, fresh human prompts, decontamination, and an explicit effort to demonstrate correlation with downstream performance. ^[5]

Significance

RewardBench gave the alignment community a shared yardstick for a component that had been measured only indirectly, and it slotted into the broader toolkit of model evaluation for large language models. Its category breakdowns made specific weaknesses visible, especially the difficulty of subtle chat preferences and the safety tradeoff between over-refusal and over-compliance, both of which matter for AI safety. The follow-up debate about whether benchmark accuracy predicts RLHF outcomes has itself been productive, pushing reward model evaluation toward harder data and toward explicit validation against real downstream training. ^[3]^[5]^[7]

References

1. Lambert, Nathan; Pyatkin, Valentina; Morrison, Jacob; Miranda, LJ; Lin, Bill Yuchen; Chandu, Khyathi; Dziri, Nouha; Kumar, Sachin; Zick, Tom; Choi, Yejin; Smith, Noah A.; Hajishirzi, Hannaneh. "RewardBench: Evaluating Reward Models for Language Modeling." arXiv:2403.13787, 2024. https://arxiv.org/abs/2403.13787
2. Allen Institute for AI. "RewardBench: The first benchmark & leaderboard for reward models used in RLHF." Ai2 Blog, 2024. https://allenai.org/blog/rewardbench-the-first-benchmark-leaderboard-for-reward-models-used-in-rlhf-1d4d7d04a90b
3. Lambert, Nathan et al. "RewardBench: Evaluating Reward Models for Language Modeling (v2, HTML)." arXiv, 2024. https://arxiv.org/html/2403.13787v2
4. Allen Institute for AI. "Reward Bench Leaderboard." Hugging Face Spaces, 2024. https://huggingface.co/spaces/allenai/reward-bench
5. Malik, Saumya; Pyatkin, Valentina; Land, Sander; Morrison, Jacob; Smith, Noah A.; Hajishirzi, Hannaneh; Lambert, Nathan. "RewardBench 2: Advancing Reward Model Evaluation." arXiv:2506.01937, 2025. https://arxiv.org/abs/2506.01937
6. Allen Institute for AI. "reward-bench-2 dataset card." Hugging Face Datasets, 2025. https://huggingface.co/datasets/allenai/reward-bench-2
7. Frick, Evan et al. "How to Evaluate Reward Models for RLHF." arXiv:2410.14872, 2024. https://arxiv.org/abs/2410.14872
8. Liu, Yantao et al. "RM-Bench: Benchmarking Reward Models of Language Models with Subtlety and Style." arXiv:2410.16184, 2024. https://arxiv.org/abs/2410.16184
9. Gureja, Srishti et al. "M-RewardBench: Evaluating Reward Models in Multilingual Settings." arXiv:2410.15522, 2024. https://arxiv.org/abs/2410.15522
10. allenai/reward-bench. "RewardBench: the first evaluation tool for reward models." GitHub, 2024. https://github.com/allenai/reward-bench
