Process reward model (PRM)
Last reviewed
May 31, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v3 · 4,468 words
A process reward model (PRM), also called a process-supervised reward model or step-level verifier, is a learned scoring model that evaluates the correctness or quality of each intermediate step in a large language model's chain of thought, rather than judging only the final answer.[1][2] PRMs are contrasted with outcome reward models (ORMs), also known as outcome-supervised reward models, which assign a single scalar score after the last step of a reasoning trace based on whether the final answer is correct.[1][3] By providing dense, step-by-step feedback, PRMs offer a finer-grained training and verification signal that has proven especially useful for long, multi-step reasoning tasks such as mathematics, scientific problem solving, and code generation.[2]
The modern PRM paradigm was popularized by the May 2023 paper Let's Verify Step by Step from OpenAI, which released the PRM800K dataset of roughly 800,000 step-level human correctness labels and showed that a PRM-reranked solver reached 78.2% accuracy on a representative subset of the MATH test set, outperforming both an outcome-supervised verifier (72.4%) and majority voting (69.6%).[2][4] The technique built on a 2022 study by Jonathan Uesato and collaborators at Google DeepMind, who carried out the first systematic comparison of process- and outcome-based feedback on GSM8K.[1] PRMs have since become a core ingredient of test-time search procedures such as best-of-N sampling and tree search, and have been associated with reasoning systems including OpenAI o1, OpenAI o3, Google's Gemini Thinking variants, and open replications.[5][6] At the same time, large-scale deployments such as DeepSeek-R1 have explicitly avoided neural PRMs because of concerns over reward hacking and pipeline complexity, instead using rule-based verifiable rewards in the style of RLVR.[7] PRMs sit at the center of one of the most active debates in mid-2020s reasoning research: when is dense process supervision worth the engineering cost, and when do sparse verifiable rewards suffice?
A reward model in the context of language models is a function that takes a prompt and a model response and returns a scalar score reflecting how good the response is. Reward models are central to reinforcement learning from human feedback (RLHF), where they replace expensive human ratings during policy optimization with PPO or related algorithms, and to inference-time procedures such as best-of-N ranking.[8] Reward models also underlie RLAIF (reinforcement learning from AI feedback), where AI-labeled preferences substitute for human ones.
The first large-scale demonstration that a learned verifier could push small models past much larger ones came from Karl Cobbe and colleagues at OpenAI, who in 2021 released the GSM8K grade-school math benchmark together with an outcome verifier: a model that scored candidate solutions by whether the final numerical answer was correct, used to select the best of many samples at test time.[3] This established the outcome verifier paradigm and showed that even a small verifier could match or exceed a much larger fine-tuned generator by exploiting test-time compute.
For multi-step reasoning, however, outcome-only signals have a well-known weakness: a long chain of thought may contain a single critical error that derails the rest of the solution, yet the outcome label only reports the final result. An ORM cannot localize the failure, which makes it weak at credit assignment and prone to rewarding lucky-but-flawed chains. The central motivation for PRMs is to provide a dense signal that tells the model where its reasoning went wrong, not just whether it reached the right destination. Reward models for reasoning are categorized along two axes: granularity of supervision (per-solution vs per-step) and source of labels (humans, heuristics, or Monte Carlo rollouts). PRM800K-style human labeling sits at one extreme of cost and precision.
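The credit-assignment contrast can be illustrated with a minimal Python sketch. The function names and the example labels here are hypothetical, chosen only for illustration: an outcome label sees a single bit for the whole solution, while per-step labels let a verifier point at the first flawed step.

```python
from typing import List, Optional


def outcome_label(final_answer: str, reference: str) -> int:
    """ORM-style label: 1 if the final answer matches the reference, else 0."""
    return int(final_answer.strip() == reference.strip())


def first_incorrect_step(step_labels: List[int]) -> Optional[int]:
    """PRM-style localization: index of the first step labeled 0 (incorrect).

    Returns None when every step is labeled 1 (correct).
    """
    for index, label in enumerate(step_labels):
        if label == 0:
            return index
    return None


# Hypothetical trace: step 2 is invalid, yet the final answer happens to match.
step_labels = [1, 1, 0, 1]
print(outcome_label("42", "42"))          # 1: the outcome signal rewards the lucky chain
print(first_incorrect_step(step_labels))  # 2: the process signal localizes the flaw
```

In this toy case the outcome label alone would reward a lucky-but-flawed chain, which is the failure mode described above.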
The two reward-modeling paradigms differ in label structure, training cost, supervision density, and failure modes. The table below summarizes the central contrasts described across the literature.[1][2][7]
| Property | Outcome reward model (ORM) | Process reward model (PRM) |
|---|---|---|
| Label granularity | One label per full solution, usually correct or incorrect final answer | One label per reasoning step, judging local validity or progress |
| Signal density | Sparse, only at the end | Dense, at every step |
| Annotation cost | Low; free when an oracle answer key exists | High when labeled by humans, moderate when labeled by Monte Carlo rollouts |
| Credit assignment | Cannot localize errors within a long solution | Can identify the first incorrect step |
| Reward hacking risk | Lower, because rewards can be grounded in verifiable outcomes | Higher, because a neural model judges intermediate language and can be gamed |
| Typical use | Best-of-N reranking, RLHF on summary tasks, rule-based RLVR | Step-level verifiers for tree search, dense RL signals for reasoning |
| Generalization | Works for any task with a clear final-answer check | Stronger for long, multi-step domains such as math, science, and code |
| Sample efficiency in RL | Lower; sparse signal needs many rollouts | Higher in principle; dense signal accelerates credit assignment |
Critically, ORM and PRM are not mutually exclusive. In Let's Verify Step by Step, the same fully-generated solutions are scored by both verifier types and compared head-to-head; downstream systems often combine a rule-based outcome check with a neural PRM signal to anchor learning and reduce hackability.[2]
The phrase "process-supervised reward model" and the systematic comparison between process and outcome supervision were introduced by Jonathan Uesato, Nate Kushman, Ramana Kumar, Francis Song, Noah Siegel, Lisa Wang, Antonia Creswell, Geoffrey Irving, and Irina Higgins of Google DeepMind in their November 2022 arXiv preprint Solving math word problems with process- and outcome-based feedback.[1]
The Uesato et al. paper conducted what the authors called "the first comprehensive comparison between process- and outcome-based approaches trained on a natural language task," using the GSM8K grade-school math benchmark.[1] All models were based on a 70-billion-parameter Chinchilla-style language model from Hoffmann et al. (2022). The authors compared outcome-supervised and process-supervised variants of both fine-tuning and reward modeling.
The headline results were nuanced. Both process- and outcome-based feedback reached comparable final-answer error rates (improving the previous best from 16.8% to 12.7% on GSM8K), suggesting outcome supervision can match process supervision on the final metric while using cheaper labels.[1] But the picture changed when the authors looked at reasoning quality: among solutions with correct final answers, the rate of trace errors (answers that arrived at the right number through invalid reasoning) fell from 14.0% to 3.4% only when process-based supervision (or a learned reward model emulating it) was used. Reward-model reranking dropped trace error from 11.4% to under 5% in their best configuration.[1]
The paper's main conclusion was that outcome supervision is sufficient for answer accuracy, but process supervision (or a reward model trained to imitate it) is needed for the reasoning steps themselves to be correct. This established the alignment-flavored motivation for PRMs (generating chains of thought that humans can trust), which would later be central to Let's Verify Step by Step.
The May 2023 paper Let's Verify Step by Step, by Hunter Lightman, Vineet Kosaraju, Yura Burda, Harri Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe of OpenAI, brought process supervision into the modern era of large language models by combining it with GPT-4-scale generators and a large human-labeled step dataset.[2] The paper appeared as arXiv:2305.20050 and was accepted to ICLR 2024.
The authors fine-tuned a generator from the base GPT-4 model, pre-trained on next-token prediction without RLHF, to produce step-by-step solutions to problems from the MATH benchmark of competition mathematics. They then trained two verifiers from the same base model: an outcome-supervised reward model (ORM) trained on labels indicating whether each candidate solution's final answer matched the reference, and a process-supervised reward model (PRM) trained on per-step human labels. Both verifiers were used at test time to rerank many sampled candidate solutions, a form of best-of-N sampling.
To train the PRM, the OpenAI team collected PRM800K, a public dataset of step-level human feedback labels on GPT-4-generated solutions to MATH problems.[2][4] The headline statistic is roughly 800,000 step-level labels; the released archive contains approximately 1,085,590 step labels across 101,599 solution samples to 12,000 MATH problems, with a phase-1 / phase-2 structure (phase 2 is the curated 800K-label release used to train the best PRM).[4]
Each step was annotated with one of three labels: positive (correct), neutral (ambiguous), or negative (incorrect). For PRM training, the three-way labels are typically collapsed to a binary target. The dataset is hosted on GitHub at openai/prm800k and remains the primary public resource for human-labeled PRM research.[4]
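A minimal sketch of the label-collapsing step follows. It assumes the three ratings are encoded as +1, 0, and -1, and it exposes the treatment of neutral steps as a parameter; both the encoding and the function name `to_binary_targets` are assumptions made for illustration, not a description of any particular training script.

```python
from typing import List


def to_binary_targets(ratings: List[int], neutral_as_positive: bool = True) -> List[int]:
    """Collapse three-way step ratings into binary training targets.

    Assumed encoding: +1 = positive, 0 = neutral, -1 = negative.
    How neutral steps are treated is a design choice, so it is a parameter.
    """
    mapping = {1: 1, 0: 1 if neutral_as_positive else 0, -1: 0}
    return [mapping[rating] for rating in ratings]


ratings = [1, 0, 1, -1]
print(to_binary_targets(ratings))                             # [1, 1, 1, 0]
print(to_binary_targets(ratings, neutral_as_positive=False))  # [1, 0, 1, 0]
```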
The authors also used active learning to choose which solutions to label next, focusing annotator effort on "convincing wrong-answer" solutions: those that the current PRM rated highly but that reached an incorrect final answer. They reported that active learning was roughly 2.6× more data-efficient than uniform labeling, materially reducing human-labeling cost.[2]
Under matched data and matched compute budgets, the PRM-based reranker (used in a best-of-N setup, with N = 1860 for the headline figure) reached 78.2% accuracy on a representative subset of the MATH test set, beating both the ORM-based reranker (72.4%) and majority voting (69.6%).[2] Increasing N traced out the test-time compute scaling curve that became the empirical core of the paper.
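The two selection rules compared above can be sketched in a few lines of Python. The sketch assumes each candidate is a pair of a final answer and a list of per-step correctness probabilities, and it aggregates step scores by taking their product, which is one common choice (the minimum is another). The candidate data and function names are hypothetical.

```python
import math
from collections import Counter
from typing import List, Tuple

Candidate = Tuple[str, List[float]]  # (final answer, per-step correctness probabilities)


def solution_score(step_probs: List[float]) -> float:
    """Score a solution as the product of its per-step correctness probabilities."""
    return math.prod(step_probs)


def best_of_n(candidates: List[Candidate]) -> str:
    """Return the final answer of the candidate with the highest solution score."""
    answer, _ = max(candidates, key=lambda cand: solution_score(cand[1]))
    return answer


def majority_vote(candidates: List[Candidate]) -> str:
    """Return the most frequent final answer, ignoring verifier scores."""
    return Counter(answer for answer, _ in candidates).most_common(1)[0][0]


# Hypothetical samples: two weak chains agree on "7", one strong chain says "12".
candidates = [
    ("7", [0.9, 0.4, 0.8]),
    ("7", [0.6, 0.5, 0.7]),
    ("12", [0.95, 0.9, 0.9]),
]
print(best_of_n(candidates))      # 12
print(majority_vote(candidates))  # 7
```

The toy data is built so the two rules disagree: voting follows the more frequent answer, while the verifier-based rule follows the chain whose individual steps look most reliable.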
The paper also argued for an alignment dimension: process supervision is more aligned with how humans evaluate reasoning, because it rewards getting each step right rather than only the final number. This argument became one of the most cited rationales for putting process supervision at the heart of subsequent reasoning systems.
A typical PRM is a transformer with the same architecture as the base generator, fine-tuned on a sequence-classification objective. Inputs are formatted as a problem followed by a partial solution ending at a candidate step, and the model produces a scalar score reflecting the probability that the step is correct or that the partial solution will lead to a correct answer.
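One way to picture the input format is to expand a solution into one prefix per step, each of which would then be passed to the scoring model. The sketch below shows only that expansion; the newline separator and the function name `prm_inputs` are assumptions, and real systems may use special tokens or chat templates instead.

```python
from typing import List


def prm_inputs(problem: str, steps: List[str], sep: str = "\n") -> List[str]:
    """Build one PRM input per step.

    Input i contains the problem followed by steps 0..i, so the model
    scores step i in the context of everything that precedes it.
    """
    return [sep.join([problem] + steps[: i + 1]) for i in range(len(steps))]


problem = "What is 3 * (2 + 4)?"
steps = ["2 + 4 = 6", "3 * 6 = 18", "The answer is 18."]
for text in prm_inputs(problem, steps):
    print(repr(text))
```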
The PRM800K-style pipeline relies on trained human annotators reading model-generated step-by-step solutions and assigning one of {good, neutral, bad} to each step. To control quality, OpenAI used multi-stage onboarding, periodic relabeling, and instructions distinguishing "is this step in a valid solution?" from "is this step correct?".[2] Human labeling produces high-quality data but is slow and expensive, which has motivated automated alternatives.
A wave of follow-up research has investigated whether process labels can be generated automatically by sampling additional rollouts from a model and using the empirical probability that a step leads to a correct final answer as a proxy for its correctness.[9][10] Concretely, given a partial solution ending at step t, a completer LLM samples many completions; the fraction of those completions reaching the correct final answer is taken as the "value" of step t.
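The rollout-based estimate described above can be sketched as follows. The `Completer` type, the toy completer, and the rollout count are all hypothetical stand-ins: in practice the completer would be a sampled language model, and the correctness check would be a task-specific answer matcher rather than string equality.

```python
import random
from typing import Callable, List

# Hypothetical interface: (problem, partial steps) -> final answer of one sampled completion.
Completer = Callable[[str, List[str]], str]


def mc_step_value(
    problem: str,
    partial_steps: List[str],
    completer: Completer,
    reference: str,
    num_rollouts: int = 8,
) -> float:
    """Fraction of sampled completions from this prefix that reach the reference answer."""
    hits = sum(
        completer(problem, partial_steps) == reference for _ in range(num_rollouts)
    )
    return hits / num_rollouts


def make_toy_completer(seed: int = 0) -> Completer:
    """Toy stand-in for an LLM: flawed prefixes never recover, clean ones usually succeed."""
    rng = random.Random(seed)

    def complete(problem: str, partial_steps: List[str]) -> str:
        if any("flawed" in step for step in partial_steps):
            return "wrong"
        return "4" if rng.random() < 0.75 else "wrong"

    return complete


completer = make_toy_completer()
print(mc_step_value("2 + 2 = ?", ["add the numbers"], completer, "4"))
print(mc_step_value("2 + 2 = ?", ["flawed: subtract"], completer, "4"))  # 0.0
```

The resulting value can be used directly as a soft label or thresholded into a binary one; which is preferable is a design choice that the sketch leaves open.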
Math-Shepherd (Wang et al., December 2023, arXiv:2312.08935) was the first widely cited automated PRM along these lines.[9] The authors generated automatic step-level supervision for roughly 170,000 solutions on GSM8K and 270,000 on MATH, then used the resulting PRM both as a verifier and as a dense reward in step-by-step PPO. Math-Shepherd lifted Mistral-7B from 77.9% to 84.1% on GSM8K and from 28.6% to 33.0% on MATH, with further gains to 89.1% and 43.5% when the PRM was also used for inference-time verification.[9]
OmegaPRM (Luo et al., June 2024, Google DeepMind, arXiv:2406.06592) extended automated labeling with a divide-and-conquer Monte Carlo Tree Search algorithm that uses binary search to efficiently locate the first incorrect step.[10] The authors collected over 1.5 million process-supervision annotations and used them to train a PRM that, combined with weighted self-consistency, raised the instruction-tuned Gemini Pro model from 51% to 69.4% on MATH500 and from 86.4% to 93.6% on GSM8K, and lifted Gemma2-27B from 42.3% to 58.2% on MATH500, all without human labels.[10]
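The binary-search idea can be sketched independently of tree search. The sketch assumes a hypothetical predicate `prefix_ok` that reports whether a solution prefix can still lead to a correct answer (in practice this would be estimated with rollouts), and it assumes the predicate is monotone: the empty prefix passes, and once a prefix fails, every longer prefix fails too. It is a simplified illustration, not the published algorithm.

```python
from typing import Callable, List, Optional


def locate_first_error(
    steps: List[str], prefix_ok: Callable[[List[str]], bool]
) -> Optional[int]:
    """Binary-search the index of the first step that breaks the solution.

    Returns None if the full solution passes `prefix_ok`.
    Uses O(log n) predicate calls instead of one call per step.
    """
    if not steps or prefix_ok(steps):
        return None
    lo, hi = 0, len(steps)  # invariant: prefix of length lo passes, length hi fails
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if prefix_ok(steps[:mid]):
            lo = mid
        else:
            hi = mid
    return hi - 1  # the step whose addition first makes the prefix fail


# Hypothetical oracle: the step at index 3 is the first incorrect one.
steps = ["s0", "s1", "s2", "s3", "s4"]
print(locate_first_error(steps, lambda prefix: len(prefix) <= 3))  # 3
```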
A more recent direction reframes the PRM as a generator rather than a classifier. Generative reward models and LLM-as-judge systems use a language model to produce a chain-of-thought critique of a step, then extract a verdict from the critique.[11] The 2025 paper Process Reward Models That Think introduces ThinkPRM, a long-CoT verifier fine-tuned on orders of magnitude fewer process labels than discriminative PRMs require. Using only about 1% of the process labels in PRM800K, it is reported to outperform LLM-as-judge baselines under a comparable token budget.[12] These approaches blur the line between a "reward model" and a "reasoning model" and connect PRM research to the broader test-time compute literature.
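The "extract a verdict from the critique" step can be as simple as parsing a fixed marker from the generated text. The sketch below assumes a hypothetical output convention in which the critique contains a line such as "Verdict: correct" or "Verdict: incorrect"; actual generative verifiers define their own formats.

```python
import re
from typing import Optional

# Hypothetical convention: the critique states "Verdict: correct" or "Verdict: incorrect".
_VERDICT = re.compile(r"verdict\s*:\s*(correct|incorrect)", re.IGNORECASE)


def extract_verdict(critique: str) -> Optional[bool]:
    """Return True/False for the last stated verdict, or None if none is found."""
    matches = _VERDICT.findall(critique)
    if not matches:
        return None
    return matches[-1].lower() == "correct"


critique = (
    "The step claims 3 * 6 = 16. Recomputing gives 18, so the arithmetic is off.\n"
    "Verdict: incorrect"
)
print(extract_verdict(critique))  # False
```

Taking the last match rather than the first is a deliberate choice here, so that a critique which revises its judgment mid-way is scored by its final conclusion.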
Three design choices have proven important:
PRMs are used in several distinct ways at inference, often in combination:
OpenAI o1, released in preview in September 2024 and as a full model in December 2024, was the first widely deployed reasoning model to scale long internal chains of thought at inference time. OpenAI has not published the full training recipe, but the o1 announcement and system card emphasize that the model was trained with large-scale reinforcement learning that rewards productive chains of thought, with performance improving both with more train-time RL compute and with more test-time thinking compute.[5][6]
Many third-party analyses cite Let's Verify Step by Step and PRM800K as the canonical methodological precursors of o1, in part because Hunter Lightman and several PRM-paper co-authors later worked on the reasoning teams that produced o1. The precise extent to which o1 uses a neural PRM in its training loop remains undisclosed; OpenAI has emphasized large-scale RL with verifiable rewards on math and code, which can be implemented without a PRM, but the public record is consistent with PRMs playing a role in candidate selection, search, or curriculum construction.[5] OpenAI o3, announced in December 2024, extends the same paradigm. The connection between PRMs and o1 is clearest at the concept level: the PRM literature established that adding inference compute plus a quality verifier delivers reliable accuracy gains, the trade-off that o1 exposes to end users as "thinking time."
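DeepSeek-R1, released in January 2025, is notable for explicitly avoiding neural PRMs and ORMs and instead training with a rule-based reward function. The DeepSeek-R1 technical report (Guo et al., arXiv:2501.12948) discusses PRMs in detail and gives three reasons: fine-grained steps are hard to define explicitly in general reasoning; judging whether an intermediate step is correct is difficult, since automated annotation can be unreliable and manual annotation does not scale; and a model-based PRM invites reward hacking, while retraining it consumes additional resources and complicates the training pipeline.[7]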
Instead, DeepSeek-R1-Zero, the variant trained purely with RL from a base model without supervised fine-tuning, used a rule-based reward with two components: an accuracy reward that checks final answers against ground truth (e.g., comparing numerical answers, executing generated code), and a format reward that requires the model to wrap reasoning in <think>...</think> tags. The full DeepSeek-R1 system added a cold-start SFT stage, language-consistency rewards, and a final RLHF stage with a more general reward model, but the core RL signal for reasoning remained rule-based throughout. This produced strong math and code performance, competitive with o1 on several benchmarks, while sidestepping neural PRMs entirely.[7]
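A rule-based reward of this two-component shape can be sketched without any learned model. The code below is an illustration under stated assumptions, not the reward used in the report: it assumes the answer is whatever follows the closing `</think>` tag, compares answers by exact string match, and combines the two components with a hypothetical `format_weight`.

```python
import re

# Response must be "<think>reasoning</think>" followed by a non-empty answer.
_THINK = re.compile(r"^<think>(.+?)</think>\s*(.+)$", re.DOTALL)


def format_reward(response: str) -> float:
    """1.0 if the reasoning is wrapped in <think>...</think> before the answer."""
    return 1.0 if _THINK.match(response.strip()) else 0.0


def accuracy_reward(response: str, reference: str) -> float:
    """1.0 if the answer text exactly matches the reference (assumed matching rule)."""
    match = _THINK.match(response.strip())
    answer = match.group(2).strip() if match else response.strip()
    return 1.0 if answer == reference.strip() else 0.0


def rule_based_reward(response: str, reference: str, format_weight: float = 0.5) -> float:
    """Sum of accuracy and weighted format rewards; the weight is hypothetical."""
    return accuracy_reward(response, reference) + format_weight * format_reward(response)


print(rule_based_reward("<think>2 + 2 = 4</think> 4", "4"))  # 1.5
print(rule_based_reward("4", "4"))                           # 1.0
print(rule_based_reward("<think>guess</think> 5", "4"))      # 0.5
```

Because both components are deterministic checks, there is no learned scorer for the policy to exploit, which is the property the report emphasizes; a production version would replace exact matching with numeric comparison or code execution as appropriate.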
DeepSeek-R1's report is widely cited as evidence that, for tasks with cheap verifiable answers, sparse outcome rewards can be sufficient and safer than neural PRMs, and as a key data point in the rise of RLVR as an alternative paradigm.
Despite their attractive properties, PRMs face several well-documented limitations.
Several alternative paradigms compete with or complement neural PRMs.
The PRM literature expanded rapidly during 2024 and 2025. Notable directions include:
Research on PRMs sits at the intersection of reinforcement learning, test-time compute scaling, and AI alignment. The shift toward process supervision is widely seen as a key driver of the 2024-2026 wave of reasoning models, even as practitioners debate when neural PRMs are worth the engineering and reward-hacking costs relative to simpler rule-based rewards.
The table below summarizes widely used datasets for training and evaluating PRMs.
| Dataset | Year | Source of step labels | Approximate scale | Domain |
|---|---|---|---|---|
| PRM800K | 2023 | Human annotators (OpenAI) | ~800K step labels, ~75K solutions, 12K MATH problems | Competition math |
| Math-Shepherd | 2023 | Automated Monte Carlo rollouts | ~170K solutions (GSM8K) + ~270K (MATH) | Grade-school and competition math |
| OmegaPRM | 2024 | Automated divide-and-conquer MCTS | ~1.5M step annotations | Mathematical reasoning |
| Math-PSA / OpenR | 2024 | Automated rollouts and curated mixtures | Several hundred thousand step labels | Math and general reasoning |
| PRMBench | 2025 | Curated with controlled error perturbations | 6,216 problems, 83,456 step labels | Benchmark for evaluating PRMs |
PRM800K remains the de facto reference dataset for human-labeled process supervision; Math-Shepherd and OmegaPRM are the most cited automated-labeling pipelines.
The table below lists openly released PRMs and reasoning-focused systems that use process supervision in some form.
| Model or family | Developer | Year | Notes |
|---|---|---|---|
| GPT-4 process verifier | OpenAI | 2023 | Trained on PRM800K, in Let's Verify Step by Step |
| Math-Shepherd | DeepSeek and PKU authors | 2023 | First widely cited automated PRM |
| OmegaPRM | Google DeepMind | 2024 | Automated MCTS-based labeling, Gemini Pro and Gemma2 |
| Qwen2.5-Math-PRM-7B / 72B | Alibaba Qwen team | 2024 | Public PRMs on PRM800K and automated data |
| Skywork-o1 Open PRMs | Kunlun Skywork | 2024 | 1.5B and 7B Qwen-based PRMs |
| RLHFlow Llama3.1-8B-PRM | RLHFlow community | 2024 | Trained on DeepSeek-distilled trajectories |
| OpenAI o1 / o3 | OpenAI | 2024 | RL reasoning models; PRM use undisclosed |
| DeepSeek-R1 | DeepSeek | 2025 | Avoids neural PRMs; rule-based rewards |
| ThinkPRM | Academic | 2025 | Generative long-CoT PRM, label-efficient |