Reward Model

Updated Jun 23, 2026

A reward model (RM) is a model trained to score the outputs of another AI system, producing a scalar estimate of how good a candidate response is according to some standard, most often human preference. Reward models are the central learned component of reinforcement learning from human feedback (RLHF): because asking humans to judge every sample during training is impractical, a model is fit to a limited set of human judgments and then stands in for the human as a cheap, automated proxy ^[1]^[2]. In language-model practice, a reward model is typically a transformer initialized from a pretrained or instruction-tuned checkpoint, with its token-prediction head replaced by a linear head that emits a single number, trained on pairwise preference comparisons under the Bradley-Terry model ^[2]^[3]. Because reward models are imperfect proxies, optimizing against them too aggressively degrades true quality, a failure mode known as reward hacking or overoptimization ^[4]. The category has since diversified into outcome and process reward models for multi-step reasoning ^[5], generative reward models that write critiques before scoring, and rule-based verifiable rewards that replace learned reward models entirely in parts of reasoning-model training.

Where did reward models come from?

Learning a reward function from human comparisons predates modern large language models. Christiano et al. (2017) trained deep reinforcement learning agents on Atari games and simulated robotics tasks using a reward predictor fit to human choices between pairs of short trajectory clips; the method taught a simulated robot to backflip from roughly 900 human comparisons, with people "providing feedback on less than one percent of our agent's interactions with the environment" ^[1]. Ziegler et al. (2019) brought the recipe to language models, fine-tuning GPT-2 against a learned reward for stylistic continuation and summarization ^[6], and Stiennon et al. (2020) showed that a reward model trained on human comparisons of summaries, optimized with proximal policy optimization (PPO), yielded summaries that humans preferred to those of much larger supervised models ^[7]. InstructGPT (Ouyang et al., 2022) scaled the three-stage pipeline of supervised fine-tuning, reward modeling, and PPO into the template behind ChatGPT and most aligned chat assistants since; the paper reported that "outputs from the 1.3B parameter InstructGPT model are preferred to outputs from the 175B GPT-3, despite having 100x fewer parameters" ^[2]. Anthropic's contemporaneous work referred to the same artifact as a "preference model" ^[8].

What role does the reward model play in RLHF?

In the canonical pipeline, the reward model is trained between supervised fine-tuning and reinforcement learning. During the RL stage, the policy generates responses to training prompts, the reward model scores each complete response, and the policy is updated to increase expected reward, with a per-token KL penalty keeping it close to its reference initialization so it does not drift into degenerate text that fools the scorer ^[2]. Reward models are equally useful outside the RL loop. In best-of-n or rejection sampling, several candidate responses are generated and the highest-scoring one is kept; Llama 2 used reward-model-ranked rejection sampling to produce training data for successive fine-tuning rounds, and Meta trained two separate reward models for Llama 2-Chat, one for helpfulness and one for safety, because a single scorer traded the two objectives off poorly ^[9]. Reward models are also used to filter and rank synthetic training data and as automatic evaluators during development.
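The two uses described above, KL-penalized reward in the RL stage and best-of-n selection, can be sketched in a few lines of Python. This is a minimal illustration, not the implementation of any cited system: the function names, the default `beta`, and the convention of adding the reward-model score on the final token are assumptions made for the example.

```python
def kl_shaped_rewards(rm_score, policy_logprobs, ref_logprobs, beta=0.1):
    """Per-token rewards for one sampled response.

    Every token is penalized by beta * (log pi(token) - log pi_ref(token)),
    a per-token estimate of the KL term. The scalar reward-model score is
    added on the last token (an assumed, commonly used convention).
    """
    if len(policy_logprobs) != len(ref_logprobs):
        raise ValueError("log-prob sequences must have equal length")
    rewards = [-beta * (lp - lr) for lp, lr in zip(policy_logprobs, ref_logprobs)]
    if rewards:
        rewards[-1] += rm_score
    return rewards


def best_of_n(prompt, generate, score, n=4):
    """Sample n candidates and keep the one the reward model scores highest.

    `generate(prompt)` and `score(prompt, response)` are hypothetical
    callables standing in for the policy and the reward model.
    """
    candidates = [generate(prompt) for _ in range(n)]
    return max(candidates, key=lambda response: score(prompt, response))
```

In the first function a larger `beta` keeps the policy closer to the reference model at the cost of less reward gained; in the second, the reward model is used only at selection time and the policy itself is not updated.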

How is a reward model trained?

Reward models are usually trained on relative comparisons rather than absolute scores, since people grade inconsistently but compare fairly reliably. Under the Bradley-Terry model of paired comparisons, a statistical model dating to 1952 ^[3], the probability that response y_w is preferred over response y_l for a prompt x is the logistic sigmoid of the reward difference r(x, y_w) - r(x, y_l). Training minimizes the negative log-likelihood of the observed human choices, pushing the scores of chosen responses above those of rejected ones ^[2]. For InstructGPT, annotators ranked between 4 and 9 sampled responses per prompt, and all pairwise comparisons from one prompt were processed within a single batch element to prevent overfitting; the reward model was a 6-billion-parameter GPT-3 variant whose final unembedding layer was replaced with a projection emitting a scalar, and this 6B scorer was used even to train the 175B policy ^[2]. Because the pairwise loss constrains only reward differences, scores are determined only up to an additive constant and are typically normalized, for example with a bias term chosen so that labeler demonstrations average zero ^[2].
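The pairwise objective described above can be written out directly. The following is a minimal sketch in plain Python operating on already-computed scalar rewards; the function names are illustrative, and a real training loop would compute the same quantity on tensors with automatic differentiation.

```python
import math
from itertools import combinations


def preference_probability(r_chosen, r_rejected):
    """Bradley-Terry probability that the chosen response is preferred:
    sigmoid(r_chosen - r_rejected)."""
    return 1.0 / (1.0 + math.exp(-(r_chosen - r_rejected)))


def pairwise_loss(r_chosen, r_rejected):
    """Negative log-likelihood of one comparison, -log sigmoid(diff),
    written in a numerically stable softplus form."""
    diff = r_chosen - r_rejected
    return max(-diff, 0.0) + math.log1p(math.exp(-abs(diff)))


def ranking_loss(rewards_best_to_worst):
    """Mean pairwise loss over all pairs from one ranked list of K >= 2
    responses, ordered from most to least preferred."""
    pairs = list(combinations(rewards_best_to_worst, 2))
    return sum(pairwise_loss(w, l) for w, l in pairs) / len(pairs)
```

Since only differences enter the loss, `ranking_loss([2.0, 1.0, 0.0])` and `ranking_loss([12.0, 11.0, 10.0])` are equal, which is why trained scores carry no absolute meaning until they are normalized.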

Many variations exist. NVIDIA's HelpSteer2 dataset supports regression-style reward models trained on absolute ratings of attributes such as helpfulness, correctness, and verbosity ^[10]. ArmoRM trains separate interpretable reward objectives and combines them through a mixture-of-experts gating network ^[11]. Open preference datasets, including Anthropic's HH-RLHF ^[8] and HelpSteer2 ^[10], underpin most openly released reward models.
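The general idea of combining several attribute scores through a gating function can be sketched as follows. This illustrates the concept only and is not ArmoRM's actual architecture: the attribute names, the softmax gate, and the fixed gating logits are assumptions, and in a trained system the logits would be produced by a network conditioned on the prompt.

```python
import math


def softmax(logits):
    """Convert logits to non-negative weights that sum to 1."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def combine_objectives(attribute_scores, gating_logits):
    """Weighted sum of per-attribute rewards.

    attribute_scores: dict mapping attribute name -> score for one response.
    gating_logits:    dict mapping the same names -> gating logit
                      (hypothetical values standing in for a gating network).
    """
    names = sorted(attribute_scores)
    weights = softmax([gating_logits[name] for name in names])
    return sum(w * attribute_scores[name] for w, name in zip(weights, names))
```

Keeping the per-attribute scores visible before they are combined is what makes such a reward model easier to inspect than a single opaque scalar.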

How do process reward models differ from outcome reward models?

For multi-step reasoning, reward models divide by the granularity of their feedback. An outcome reward model (ORM) scores a complete solution, usually by whether the final answer is correct; Cobbe et al. (2021) trained such "verifiers" on GSM8K math word problems and used them to rerank sampled solutions ^[12]. A process reward model (PRM) instead scores each intermediate reasoning step. Uesato et al. (2022) compared the two on GSM8K and found similar final-answer accuracy, but process supervision produced far fewer solutions with flawed reasoning ^[13]. OpenAI's "Let's Verify Step by Step" (Lightman et al., 2023) made the case at larger scale: a PRM trained on PRM800K, a released dataset of 800,000 human step-level labels, solved 78% of a representative subset of the MATH test set when used to select among sampled solutions, clearly outperforming outcome supervision ^[5]. Because human step labels are expensive, Math-Shepherd automated them by estimating each step's value from Monte Carlo continuation rollouts ^[14]. PRMs are used for best-of-n reranking and to guide step-level search in test-time compute methods.

Aspect	Outcome reward model	Process reward model
Scores	Whole response or final answer	Each intermediate reasoning step
Typical labels	Final-answer correctness or response-level preferences	Human step annotations (PRM800K) or Monte Carlo estimates (Math-Shepherd)
Strengths	Cheap, easy-to-collect supervision	Finer credit assignment; stronger best-of-n selection on MATH ^[5]
Weaknesses	Credits lucky answers reached by wrong reasoning	Costly labels; step boundaries ill-defined; exploitable in large-scale RL
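The Monte Carlo labeling idea mentioned above can be sketched as follows: the value of a step is estimated as the fraction of sampled continuations from that step that reach a correct final answer. This is a simplified sketch under stated assumptions, not Math-Shepherd's exact procedure; `rollout` and `is_correct` are hypothetical callables standing in for a completer model and an answer checker.

```python
import random


def estimate_step_values(steps, rollout, is_correct, num_rollouts=8, rng=None):
    """Estimate a value for every reasoning step of one solution.

    For each prefix steps[:i + 1], sample `num_rollouts` continuations with
    `rollout(prefix, rng)` (which returns a final answer) and record the
    fraction for which `is_correct(answer)` is true.
    """
    rng = rng or random.Random(0)
    values = []
    for i in range(len(steps)):
        prefix = steps[: i + 1]
        hits = sum(1 for _ in range(num_rollouts) if is_correct(rollout(prefix, rng)))
        values.append(hits / num_rollouts)
    return values
```

A sharp drop in the estimated value between two consecutive steps suggests that the later step introduced an error, which gives a step-level label without human annotation.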

What is reward hacking and overoptimization?

A reward model is a proxy, and per Goodhart's law a proxy ceases to be a good measure once it becomes the target. Gao, Schulman, and Hilton (2022) quantified this with a synthetic setup in which a fixed 6B "gold" reward model plays the role of ground truth and smaller proxy reward models, spanning 3 million to 3 billion parameters, are trained on its labels ^[4]. As a policy is optimized against the proxy, whether by best-of-n sampling or by RL, the proxy score keeps climbing while the gold score peaks and then declines, with the relationship following smooth functional forms in the square root of the KL divergence from the initial policy; larger reward models and more preference data push the peak further out ^[4]. Documented hacks in deployed pipelines include verbosity, where reward models assign higher scores to longer responses and policies respond by lengthening their outputs ^[15], as well as sycophancy and formatting tricks. Mitigations include the KL penalty itself ^[2], early stopping, reward model ensembles ^[16], periodically retraining the reward model on fresh on-policy comparisons, and length-controlled objectives. The problem has shaped reasoning-model training: the DeepSeek-R1 developers state that "we do not apply the outcome or process neural reward model in developing DeepSeek-R1-Zero, because we find that the neural reward model may suffer from reward hacking in the large-scale reinforcement learning process" ^[17].
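The peak-then-decline shape can be made concrete with the functional forms usually attributed to the overoptimization study ^[4], stated here as an assumption: with d the square root of the KL divergence from the initial policy, the gold score is modeled as d(α - βd) for best-of-n and d(α - β log d) for RL. The coefficients α and β are fitted per experiment; any values passed in when trying the sketch are hypothetical.

```python
import math


def bon_kl(n):
    """Analytic KL commonly used for best-of-n sampling: log n - (n - 1) / n."""
    return math.log(n) - (n - 1) / n


def gold_score_bon(d, alpha, beta):
    """Assumed best-of-n form: R(d) = d * (alpha - beta * d)."""
    return d * (alpha - beta * d)


def gold_score_rl(d, alpha, beta):
    """Assumed RL form: R(d) = d * (alpha - beta * log d), with R(0) = 0."""
    if d <= 0:
        return 0.0
    return d * (alpha - beta * math.log(d))


def peak_distance_bon(alpha, beta):
    """Distance d at which the best-of-n gold score is maximal."""
    return alpha / (2 * beta)


def peak_distance_rl(alpha, beta):
    """Distance d at which the RL gold score is maximal."""
    return math.exp(alpha / beta - 1)
```

Under these forms, a smaller β moves the peak to a larger d, which matches the observation that larger reward models and more preference data tolerate more optimization before the gold score turns down.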

How are reward models evaluated?

RewardBench (Lambert et al., March 2024), from the Allen Institute for AI, was the first dedicated reward model benchmark and leaderboard. It measures accuracy at scoring the chosen response above the rejected one in prompt-chosen-rejected trios across chat, chat-hard, safety, and reasoning subsets, and applies both to scalar reward models and to the implicit rewards of direct preference optimization (DPO) models ^[18]. Leaderboard milestones included NVIDIA's Nemotron-4-340B-Reward, which led with an overall score of 92.0 in June 2024 ^[10], and Skywork-Reward-Gemma-2-27B, ranked first in late 2024 ^[19]. As scores saturated and their correlation with downstream RLHF results proved loose, RewardBench 2 (June 2025) raised the difficulty: a best-of-4 format with one chosen and three rejected completions per prompt (25% random baseline) over 1,865 mostly real-user prompts spanning factuality, precise instruction following, math, safety, focus, and ties. Leading models score roughly 20 or more points lower than on the original benchmark, and accuracy correlates with downstream best-of-n and PPO performance ^[20]. Specialized suites also target PRMs, notably ProcessBench (2024), which tests whether a model can identify the earliest erroneous step in math solutions ^[21].
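The two accuracy metrics described above can be sketched as follows. This is an illustration of the scoring rule rather than either benchmark's released evaluation code: `score` is a hypothetical callable wrapping a reward model, and any special handling a benchmark applies to particular subsets (such as ties) is not modeled.

```python
def pairwise_accuracy(examples, score):
    """Fraction of (prompt, chosen, rejected) trios in which the chosen
    response receives a strictly higher score than the rejected one."""
    correct = sum(1 for p, c, r in examples if score(p, c) > score(p, r))
    return correct / len(examples)


def best_of_k_accuracy(examples, score):
    """Fraction of (prompt, chosen, [rejected, ...]) items in which the
    chosen response outscores every rejected response."""
    correct = sum(
        1 for p, c, rejected in examples
        if all(score(p, c) > score(p, r) for r in rejected)
    )
    return correct / len(examples)
```

With one chosen and three rejected completions, a scorer that ranks at random picks the chosen response one time in four, which is the 25% baseline quoted above; in the pairwise format the corresponding baseline is 50%.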

What alternatives are replacing learned reward models?

Several lines of work reduce or transform the reward model's role:

Implicit reward models. DPO (Rafailov et al., 2023) showed that the RLHF objective can be optimized in closed form directly on preference pairs, with the language model itself defining an implicit reward (a scaled log-probability ratio against a reference model), removing the separate reward model and the RL loop entirely ^[22].
AI feedback. Constitutional AI (Anthropic, 2022) trains the preference model on AI-generated comparisons guided by a written constitution rather than on human labels ^[23], and RLAIF experiments at Google showed AI-labeled preferences can match human-labeled RLHF on summarization and dialogue tasks ^[24].
Verifiable rewards. Reinforcement learning with verifiable rewards (RLVR), named in Ai2's Tulu 3 ^[25], replaces the learned scorer with programmatic checks such as answer matching, unit tests, and format checks wherever ground truth is checkable. DeepSeek-R1 was trained with rule-based accuracy and format rewards in this style ^[17]. RLVR sidesteps the hacking of learned reward models but covers only verifiable domains, so learned reward models remain necessary for open-ended tasks.
Generative reward models. Rather than emitting a bare scalar, generative reward models produce chain-of-thought critiques and then read out a judgment, merging reward modeling with the LLM-as-a-judge paradigm. GenRM framed verification as next-token prediction over a correctness token, enabling chain-of-thought verification and majority voting ^[26]. DeepSeek and Tsinghua University's DeepSeek-GRM (2025) trains a 27-billion-parameter pointwise generative reward model with Self-Principled Critique Tuning, which generates evaluation principles and critiques and scales judgment quality at inference time by sampling many critiques and aggregating them with a meta reward model; with 32 samples it surpassed much larger models, including Nemotron-4-340B-Reward and GPT-4o, on reward modeling benchmarks ^[27]. RM-R1 recasts reward modeling itself as a reasoning task, trained with distilled rationales followed by reinforcement learning ^[28], and Meta's J1 likewise uses RL to train judges that think before scoring ^[29].
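Two of the alternatives above reduce to short formulas and are sketched here. The first part computes DPO's implicit reward as a scaled log-probability ratio and the resulting pairwise loss from sequence log-probabilities; the second is a rule-based verifiable reward using answer matching. The `<answer>` tag format, the function names, and the default `beta` are assumptions made for the example, not details taken from the cited papers.

```python
import math
import re


def implicit_reward(policy_logprob, ref_logprob, beta=0.1):
    """DPO implicit reward: beta * (log pi(y|x) - log pi_ref(y|x))."""
    return beta * (policy_logprob - ref_logprob)


def dpo_loss(policy_chosen, ref_chosen, policy_rejected, ref_rejected, beta=0.1):
    """-log sigmoid of the implicit-reward margin between the chosen and
    rejected responses, in a numerically stable softplus form."""
    margin = (implicit_reward(policy_chosen, ref_chosen, beta)
              - implicit_reward(policy_rejected, ref_rejected, beta))
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


# Hypothetical output format: the final answer is wrapped in <answer> tags.
ANSWER_PATTERN = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)


def verifiable_reward(response, reference_answer):
    """Rule-based reward: 1.0 if a well-formed answer matches the reference
    exactly (after stripping whitespace), otherwise 0.0."""
    match = ANSWER_PATTERN.search(response)
    if match is None:
        return 0.0  # format check failed
    return 1.0 if match.group(1).strip() == reference_answer.strip() else 0.0
```

The DPO loss has the same Bradley-Terry form as reward-model training, with the log-probability ratio taking the place of a learned scalar head. The verifiable reward has no learned parameters to exploit, but it applies only where a reference answer exists.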

Strong open-weight reward models, among them Nemotron-4-340B-Reward ^[10], ArmoRM ^[11], the Skywork-Reward series ^[19], and Ai2's Tulu reward models ^[25], together with open preference datasets, have made reward modeling one of the more reproducible areas of post-training research.

References

1. Christiano, P., et al. (2017). "Deep Reinforcement Learning from Human Preferences". arXiv:1706.03741.
2. Ouyang, L., et al. (2022). "Training Language Models to Follow Instructions with Human Feedback". arXiv:2203.02155.
3. Bradley, R. A., and Terry, M. E. (1952). "Rank Analysis of Incomplete Block Designs: I. The Method of Paired Comparisons". Biometrika 39(3/4), 324-345.
4. Gao, L., Schulman, J., and Hilton, J. (2022). "Scaling Laws for Reward Model Overoptimization". arXiv:2210.10760.
5. Lightman, H., et al. (2023). "Let's Verify Step by Step". arXiv:2305.20050.
6. Ziegler, D. M., et al. (2019). "Fine-Tuning Language Models from Human Preferences". arXiv:1909.08593.
7. Stiennon, N., et al. (2020). "Learning to Summarize from Human Feedback". arXiv:2009.01325.
8. Bai, Y., et al. (2022). "Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback". arXiv:2204.05862.
9. Touvron, H., et al. (2023). "Llama 2: Open Foundation and Fine-Tuned Chat Models". arXiv:2307.09288.
10. Wang, Z., et al. (2024). "HelpSteer2: Open-source Dataset for Training Top-performing Reward Models". arXiv:2406.08673.
11. Wang, H., et al. (2024). "Interpretable Preferences via Multi-Objective Reward Modeling and Mixture-of-Experts". arXiv:2406.12845.
12. Cobbe, K., et al. (2021). "Training Verifiers to Solve Math Word Problems". arXiv:2110.14168.
13. Uesato, J., et al. (2022). "Solving Math Word Problems with Process- and Outcome-Based Feedback". arXiv:2211.14275.
14. Wang, P., et al. (2023). "Math-Shepherd: Verify and Reinforce LLMs Step-by-Step without Human Annotations". arXiv:2312.08935.
15. Singhal, P., et al. (2023). "A Long Way to Go: Investigating Length Correlations in RLHF". arXiv:2310.03716.
16. Coste, T., et al. (2023). "Reward Model Ensembles Help Mitigate Overoptimization". arXiv:2310.02743.
17. DeepSeek-AI (2025). "DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning". arXiv:2501.12948.
18. Lambert, N., et al. (2024). "RewardBench: Evaluating Reward Models for Language Modeling". arXiv:2403.13787.
19. Liu, C. Y., et al. (2024). "Skywork-Reward: Bag of Tricks for Reward Modeling in LLMs". arXiv:2410.18451.
20. Malik, S., et al. (2025). "RewardBench 2: Advancing Reward Model Evaluation". arXiv:2506.01937.
21. Zheng, C., et al. (2024). "ProcessBench: Identifying Process Errors in Mathematical Reasoning". arXiv:2412.06559.
22. Rafailov, R., et al. (2023). "Direct Preference Optimization: Your Language Model is Secretly a Reward Model". arXiv:2305.18290.
23. Bai, Y., et al. (2022). "Constitutional AI: Harmlessness from AI Feedback". arXiv:2212.08073.
24. Lee, H., et al. (2023). "RLAIF vs. RLHF: Scaling Reinforcement Learning from Human Feedback with AI Feedback". arXiv:2309.00267.
25. Lambert, N., et al. (2024). "Tulu 3: Pushing Frontiers in Open Language Model Post-Training". arXiv:2411.15124.
26. Zhang, L., et al. (2024). "Generative Verifiers: Reward Modeling as Next-Token Prediction". arXiv:2408.15240.
27. Liu, Z., et al. (2025). "Inference-Time Scaling for Generalist Reward Modeling". arXiv:2504.02495.
28. Chen, X., et al. (2025). "RM-R1: Reward Modeling as Reasoning". arXiv:2505.02387.
29. Whitehouse, C., et al. (2025). "J1: Incentivizing Thinking in LLM-as-a-Judge via Reinforcement Learning". arXiv:2505.10320.
