Reward Model
Last reviewed
Sources
29 citations
Review status
Source-backed
Revision
v2 · 2,264 words
A reward model (RM) is a model trained to score the outputs of another AI system, producing a scalar estimate of how good a candidate response is according to some standard, most often human preference. Reward models are the central learned component of reinforcement learning from human feedback (RLHF): because asking humans to judge every sample during training is impractical, a model is fit to a limited set of human judgments and then stands in for the human as a cheap, automated proxy [1][2]. In language-model practice, a reward model is typically a transformer initialized from a pretrained or instruction-tuned checkpoint, with its token-prediction head replaced by a linear head that emits a single number, trained on pairwise preference comparisons under the Bradley-Terry model [2][3]. Because reward models are imperfect proxies, optimizing against them too aggressively degrades true quality, a failure mode known as reward hacking or overoptimization [4]. The category has since diversified into outcome and process reward models for multi-step reasoning [5], generative reward models that write critiques before scoring, and rule-based verifiable rewards that replace learned reward models entirely in parts of reasoning-model training.
Learning a reward function from human comparisons predates modern large language models. Christiano et al. (2017) trained deep reinforcement learning agents on Atari games and simulated robotics tasks using a reward predictor fit to human choices between pairs of short trajectory clips; the method taught a simulated robot to backflip from roughly 900 human comparisons, with people "providing feedback on less than one percent of our agent's interactions with the environment" [1]. Ziegler et al. (2019) brought the recipe to language models, fine-tuning GPT-2 against a learned reward for stylistic continuation and summarization [6], and Stiennon et al. (2020) showed that a reward model trained on human comparisons of summaries, optimized with proximal policy optimization (PPO), yielded summaries that humans preferred to those of much larger supervised models [7]. InstructGPT (Ouyang et al., 2022) scaled the three-stage pipeline of supervised fine-tuning, reward modeling, and PPO into the template behind ChatGPT and most aligned chat assistants since; the paper reported that "outputs from the 1.3B parameter InstructGPT model are preferred to outputs from the 175B GPT-3, despite having 100x fewer parameters" [2]. Anthropic's contemporaneous work referred to the same artifact as a "preference model" [8].
In the canonical pipeline, the reward model is trained between supervised fine-tuning and reinforcement learning. During the RL stage, the policy generates responses to training prompts, the reward model scores each complete response, and the policy is updated to increase expected reward, with a per-token KL penalty keeping it close to its reference initialization so it does not drift into degenerate text that fools the scorer [2]. Reward models are equally useful outside the RL loop. In best-of-n or rejection sampling, several candidate responses are generated and the highest-scoring one is kept; Llama 2 used reward-model-ranked rejection sampling to produce training data for successive fine-tuning rounds, and Meta trained two separate reward models for Llama 2-Chat, one for helpfulness and one for safety, because a single scorer traded the two objectives off poorly [9]. Reward models are also used to filter and rank synthetic training data and as automatic evaluators during development.
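The two uses described above, best-of-n selection and KL-penalized scoring during RL, can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation from any cited paper: `toy_reward` is a hypothetical stand-in for a trained reward model, `beta` is an illustrative coefficient, and placing the scalar reward on the final token is one common convention rather than something the sources above specify.

```python
from typing import Callable, List, Sequence


def toy_reward(prompt: str, response: str) -> float:
    """Hypothetical stand-in for a reward model: counts response words found in the prompt."""
    prompt_words = set(prompt.lower().split())
    return float(sum(1 for w in response.lower().split() if w in prompt_words))


def best_of_n(prompt: str, candidates: Sequence[str],
              reward_fn: Callable[[str, str], float]) -> str:
    """Best-of-n (rejection sampling): keep the candidate with the highest reward."""
    return max(candidates, key=lambda y: reward_fn(prompt, y))


def kl_shaped_rewards(rm_score: float, policy_logprobs: Sequence[float],
                      ref_logprobs: Sequence[float], beta: float = 0.1) -> List[float]:
    """Per-token rewards for the RL stage.

    Every token is charged -beta * (log pi(token) - log pi_ref(token)), a
    sample-based estimate of the KL penalty; the scalar reward-model score for
    the complete response is added on the final token (an assumed convention).
    """
    if not policy_logprobs or len(policy_logprobs) != len(ref_logprobs):
        raise ValueError("need two non-empty log-prob sequences of equal length")
    rewards = [-beta * (lp - lr) for lp, lr in zip(policy_logprobs, ref_logprobs)]
    rewards[-1] += rm_score
    return rewards


if __name__ == "__main__":
    candidates = ["I do not know", "Paris is the capital of France", "France"]
    print(best_of_n("capital of France", candidates, toy_reward))
    print(kl_shaped_rewards(2.0, [-1.0, -2.0], [-1.0, -1.5], beta=0.5))
```

A larger `beta` keeps the policy closer to its reference at the cost of slower reward gains; the right value is an empirical choice that this sketch does not address.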
Reward models are usually trained on relative comparisons rather than absolute scores, since people grade inconsistently but compare fairly reliably. Under the Bradley-Terry model of paired comparisons, a statistical model dating to 1952 [3], the probability that response y_w is preferred over response y_l for a prompt x is the logistic sigmoid of the reward difference r(x, y_w) - r(x, y_l). Training minimizes the negative log-likelihood of the observed human choices, pushing chosen responses above rejected ones [2]. For InstructGPT, annotators ranked between 4 and 9 sampled responses per prompt, and all pairwise comparisons from one prompt were processed within a single batch element to prevent overfitting; the reward model was a 6-billion-parameter GPT-3 variant whose final unembedding layer was replaced with a projection emitting a scalar, and this 6B scorer was used even to train the 175B policy [2]. Because the pairwise loss constrains only reward differences, raw scores are shift-invariant and are typically normalized, for example so that labeler demonstrations average zero [2].
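As a rough illustration of the pairwise objective, the sketch below computes the Bradley-Terry negative log-likelihood, -log sigmoid(r(x, y_w) - r(x, y_l)), averaged over every pair implied by one annotator ranking. It is a simplified sketch, not the InstructGPT code: rewards are plain floats rather than model outputs, and the helper names (`pairs_from_ranking`, `center_on_reference`) are hypothetical.

```python
import math
from itertools import combinations
from typing import List, Sequence, Tuple


def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def pairs_from_ranking(ranked_rewards: Sequence[float]) -> List[Tuple[float, float]]:
    """Turn rewards listed best-to-worst (annotator order) into (chosen, rejected) pairs.

    A ranking of K responses yields K * (K - 1) / 2 comparisons.
    """
    return list(combinations(ranked_rewards, 2))


def bradley_terry_loss(pairs: Sequence[Tuple[float, float]]) -> float:
    """Mean negative log-likelihood of the observed preferences."""
    return -sum(log_sigmoid(r_w - r_l) for r_w, r_l in pairs) / len(pairs)


def center_on_reference(scores: Sequence[float],
                        reference_scores: Sequence[float]) -> List[float]:
    """Shift scores so that the reference set (e.g. labeler demonstrations) averages zero."""
    offset = sum(reference_scores) / len(reference_scores)
    return [s - offset for s in scores]


if __name__ == "__main__":
    ranked = [2.0, 1.0, 0.5, -1.0]          # rewards for 4 responses, best first
    pairs = pairs_from_ranking(ranked)       # 6 pairwise comparisons
    print(len(pairs), bradley_terry_loss(pairs))
    # Adding a constant to every reward leaves the loss unchanged.
    shifted = pairs_from_ranking([r + 10.0 for r in ranked])
    print(bradley_terry_loss(shifted))
```

Because only differences enter the loss, the two printed loss values agree, which is why a separate centering step such as `center_on_reference` is needed to give raw scores a fixed origin.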
Many variations exist. NVIDIA's HelpSteer2 dataset supports regression-style reward models trained on absolute ratings of attributes such as helpfulness, correctness, and verbosity [10]. ArmoRM trains separate interpretable reward objectives and combines them through a mixture-of-experts gating network [11]. Open preference datasets, including Anthropic's HH-RLHF [8] and HelpSteer2 [10], underpin most openly released reward models.
For multi-step reasoning, reward models divide by the granularity of their feedback. An outcome reward model (ORM) scores a complete solution, usually by whether the final answer is correct; Cobbe et al. (2021) trained such "verifiers" on GSM8K math word problems and used them to rerank sampled solutions [12]. A process reward model (PRM) instead scores each intermediate reasoning step. Uesato et al. (2022) compared the two on GSM8K and found similar final-answer accuracy, but process supervision produced far fewer solutions with flawed reasoning [13]. OpenAI's "Let's Verify Step by Step" (Lightman et al., 2023) made the case at larger scale: a PRM trained on PRM800K, a released dataset of 800,000 human step-level labels, solved 78% of a representative subset of the MATH test set when used to select among sampled solutions, clearly outperforming outcome supervision [5]. Because human step labels are expensive, Math-Shepherd automated them by estimating each step's value from Monte Carlo continuation rollouts [14]. PRMs are used for best-of-n reranking and to guide step-level search in test-time compute methods.
| Aspect | Outcome reward model | Process reward model |
|---|---|---|
| Scores | Whole response or final answer | Each intermediate reasoning step |
| Typical labels | Final-answer correctness or response-level preferences | Human step annotations (PRM800K) or Monte Carlo estimates (Math-Shepherd) |
| Strengths | Cheap, easy-to-collect supervision | Finer credit assignment; stronger best-of-n selection on MATH [5] |
| Weaknesses | Credits lucky answers reached by wrong reasoning | Costly labels; step boundaries ill-defined; exploitable in large-scale RL |
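
To make the process-level ideas concrete, the sketch below shows two pieces: collapsing per-step PRM scores into one solution score for best-of-n reranking, and estimating a step's value from Monte Carlo continuations in the spirit of Math-Shepherd. Both are illustrative assumptions rather than the cited implementations: product and minimum are two common aggregation choices, `toy_rollout` is a hypothetical stand-in for sampling a completion from a policy and checking its final answer, and the step scores are made up.

```python
import math
import random
from typing import Callable, List, Mapping, Sequence


def solution_score(step_probs: Sequence[float], how: str = "prod") -> float:
    """Collapse per-step correctness probabilities into a single solution score."""
    if how == "prod":
        return math.prod(step_probs)   # probability that every step is correct
    if how == "min":
        return min(step_probs)         # score of the weakest step
    raise ValueError(f"unknown aggregation: {how}")


def rerank(candidates: Mapping[str, Sequence[float]], how: str = "prod") -> str:
    """Best-of-n with a PRM: return the id of the highest-scoring solution."""
    return max(candidates, key=lambda name: solution_score(candidates[name], how))


def mc_step_value(prefix: List[str],
                  rollout: Callable[[List[str], random.Random], bool],
                  n_rollouts: int, seed: int = 0) -> float:
    """Estimate a step's value as the fraction of sampled continuations
    from this prefix that reach the correct final answer."""
    rng = random.Random(seed)
    hits = sum(rollout(prefix, rng) for _ in range(n_rollouts))
    return hits / n_rollouts


def toy_rollout(prefix: List[str], rng: random.Random) -> bool:
    """Hypothetical completer: never recovers from a wrong step, else succeeds 80% of the time."""
    if any(step.startswith("WRONG") for step in prefix):
        return False
    return rng.random() < 0.8


if __name__ == "__main__":
    candidates = {"a": [0.9, 0.9, 0.9], "b": [0.99, 0.99, 0.2]}
    print(rerank(candidates, "prod"), rerank(candidates, "min"))
    print(mc_step_value(["step 1"], toy_rollout, 16))
    print(mc_step_value(["step 1", "WRONG step 2"], toy_rollout, 16))
```

Solution "b" has two near-certain steps and one doubtful one, so both aggregations rank it below "a"; an outcome-only scorer would have no way to see that distinction.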
A reward model is a proxy, and per Goodhart's law a proxy ceases to be a good measure once it becomes the target. Gao, Schulman, and Hilton (2022) quantified this with a synthetic setup in which a fixed 6B "gold" reward model plays the role of ground truth and smaller proxy reward models, spanning 3 million to 3 billion parameters, are trained on its labels [4]. As a policy is optimized against the proxy, whether by best-of-n sampling or by RL, the proxy score keeps climbing while the gold score peaks and then declines, with the relationship following smooth functional forms in the square root of the KL divergence from the initial policy; larger reward models and more preference data push the peak further out [4]. Documented hacks in deployed pipelines include verbosity, where reward models correlate score with length and policies respond by inflating it [15], as well as sycophancy and formatting tricks. Mitigations include the KL penalty itself [2], early stopping, reward model ensembles [16], periodically retraining the reward model on fresh on-policy comparisons, and length-controlled objectives. The problem has shaped reasoning-model training: the DeepSeek-R1 developers state that "we do not apply the outcome or process neural reward model in developing DeepSeek-R1-Zero, because we find that the neural reward model may suffer from reward hacking in the large-scale reinforcement learning process" [17].
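The rise-then-fall shape described above can be reproduced with the functional forms reported in [4], as best we can reconstruct them: with d the square root of the KL divergence from the initial policy, the gold score is d(α - βd) for best-of-n and d(α - β log d) for RL, and best-of-n over n samples has KL equal to log n - (n - 1)/n. The sketch below evaluates these forms and their peaks. The coefficient values are purely hypothetical placeholders; in the paper they are fit per reward-model size and data budget.

```python
import math


def bon_kl(n: int) -> float:
    """KL divergence (in nats) between the best-of-n policy and the initial policy."""
    return math.log(n) - (n - 1) / n


def gold_bon(d: float, alpha: float, beta: float) -> float:
    """Gold reward under best-of-n sampling, with d = sqrt(KL)."""
    return d * (alpha - beta * d)


def gold_rl(d: float, alpha: float, beta: float) -> float:
    """Gold reward under RL, with d = sqrt(KL) > 0."""
    return d * (alpha - beta * math.log(d))


def peak_d_bon(alpha: float, beta: float) -> float:
    """Maximizer of gold_bon: set the derivative alpha - 2*beta*d to zero."""
    return alpha / (2 * beta)


def peak_d_rl(alpha: float, beta: float) -> float:
    """Maximizer of gold_rl: set the derivative alpha - beta*log(d) - beta to zero."""
    return math.exp(alpha / beta - 1)


if __name__ == "__main__":
    alpha, beta = 1.0, 0.2   # hypothetical coefficients, not fitted values
    for n in (1, 4, 16, 64, 256):
        d = math.sqrt(bon_kl(n))
        print(n, round(d, 3), round(gold_bon(d, alpha, beta), 3))
    print("best-of-n peak at d =", peak_d_bon(alpha, beta))
    print("RL peak at d =", peak_d_rl(alpha, beta))
```

In both forms a smaller β moves the peak to larger d, which matches the observation that larger reward models and more preference data push the peak further out.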
RewardBench (Lambert et al., March 2024), from the Allen Institute for AI, was the first dedicated reward model benchmark and leaderboard. It measures accuracy at scoring the chosen response above the rejected one in prompt-chosen-rejected trios across chat, chat-hard, safety, and reasoning subsets, and applies both to scalar reward models and to the implicit rewards of direct preference optimization (DPO) models [18]. Leaderboard milestones included NVIDIA's Nemotron-4-340B-Reward, which led with an overall score of 92.0 in June 2024 [10], and Skywork-Reward-Gemma-2-27B, ranked first in late 2024 [19]. As scores saturated and their correlation with downstream RLHF results proved loose, RewardBench 2 (June 2025) raised the difficulty: a best-of-4 format with one chosen and three rejected completions per prompt (25% random baseline) over 1,865 mostly real-user prompts spanning factuality, precise instruction following, math, safety, focus, and ties. Leading models score roughly 20 or more points lower than on the original benchmark, and accuracy correlates with downstream best-of-n and PPO performance [20]. Specialized suites also target PRMs, notably ProcessBench (2024), which tests whether a model can identify the earliest erroneous step in math solutions [21].
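The accuracy metrics described above are simple to state in code. The sketch below is a simplified reading of them, not the benchmarks' official scoring scripts: it ignores per-subset weighting and any special handling of the ties subset, and the scores are made-up numbers. The last function shows the implicit reward usually attributed to a DPO-trained policy, β times the log-probability ratio against the reference model, which is what allows such models to be scored like reward models; `beta` here is an illustrative value.

```python
from typing import List, Sequence, Tuple


def pairwise_accuracy(examples: Sequence[Tuple[float, float]]) -> float:
    """Fraction of (chosen_score, rejected_score) pairs where chosen scores strictly higher."""
    return sum(chosen > rejected for chosen, rejected in examples) / len(examples)


def best_of_k_accuracy(examples: Sequence[Tuple[float, List[float]]]) -> float:
    """Fraction of prompts where the chosen completion outscores every rejected one.

    With three rejected completions per prompt, random scoring gets 25%.
    """
    return sum(all(chosen > r for r in rejected)
               for chosen, rejected in examples) / len(examples)


def dpo_implicit_reward(policy_logprob: float, ref_logprob: float,
                        beta: float = 0.1) -> float:
    """Implicit reward of a DPO policy for one response (up to a prompt-only constant)."""
    return beta * (policy_logprob - ref_logprob)


if __name__ == "__main__":
    pairs = [(1.0, 0.5), (0.2, 0.9), (2.0, -1.0), (0.1, 0.0)]
    print(pairwise_accuracy(pairs))
    quads = [(1.0, [0.1, 0.2, 0.3]), (0.5, [0.6, 0.1, 0.0])]
    print(best_of_k_accuracy(quads))
    # Scoring a preference pair with a DPO policy: compare implicit rewards.
    chosen = dpo_implicit_reward(-12.0, -15.0)
    rejected = dpo_implicit_reward(-14.0, -13.0)
    print(chosen > rejected)
```

The best-of-4 format is stricter than the pairwise one because a single rejected completion scoring above the chosen one counts the whole prompt as a miss.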
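Several lines of work reduce or transform the reward model's role: direct preference optimization (DPO), which leaves only an implicit reward inside the trained policy; generative reward models, which write a critique before scoring; and rule-based verifiable rewards, which replace learned reward models entirely in parts of reasoning-model training.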
Strong open-weight reward models, among them Nemotron-4-340B-Reward [10], ArmoRM [11], the Skywork-Reward series [19], and Ai2's Tulu reward models [25], together with open preference datasets, have made reward modeling one of the more reproducible areas of post-training research.