Process reward model (PRM)
Last reviewed
May 16, 2026
Sources
12 citations
Review status
Source-backed
Revision
v1 · 3,429 words
A process reward model (PRM), also called a process-supervised reward model or step-level verifier, is a learned scoring model that evaluates the correctness or quality of each intermediate step in a large language model's chain of thought, rather than judging only the final answer. PRMs are contrasted with outcome reward models (ORMs), which assign a single scalar score after the last step of a reasoning trace based on whether the final answer is correct. By providing dense, step-by-step feedback, PRMs offer a finer-grained training and verification signal that has proven especially useful for long, multi-step reasoning tasks such as mathematics, scientific problem solving, and code generation.
The modern PRM paradigm was popularized by the May 2023 paper "Let's Verify Step by Step" from researchers at OpenAI, which released the PRM800K dataset of 800,000 step-level human correctness labels. PRMs have since become a core ingredient of test-time search procedures, such as best-of-N sampling and tree search, and have been widely associated with the development of reasoning-focused systems including OpenAI o1, o3, Google's Gemini Thinking variants, and various open replication efforts. At the same time, large-scale public deployments such as DeepSeek-R1 have explicitly avoided neural PRMs because of concerns over reward hacking, illustrating that PRMs remain a topic of active research and debate.
A reward model in the context of language models is a function that takes a prompt and a model response and returns a scalar score reflecting how good the response is. Reward models are central to reinforcement learning from human feedback (RLHF), where they replace expensive human ratings during policy optimization, and to test-time procedures such as best-of-N ranking, where multiple candidate outputs are sampled and the highest-scoring one is selected.
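As a minimal illustration of best-of-N ranking, the sketch below assumes a hypothetical sampler `generate` and scorer `reward_model`; it is not any particular library's API.

```python
from typing import Callable, List

def best_of_n(prompt: str,
              generate: Callable[[str], str],
              reward_model: Callable[[str, str], float],
              n: int = 16) -> str:
    """Sample n candidate responses and keep the one the reward model rates highest."""
    candidates: List[str] = [generate(prompt) for _ in range(n)]
    return max(candidates, key=lambda response: reward_model(prompt, response))
```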
Reward models for multi-step reasoning are typically categorized along two axes: the granularity of the supervision signal (a single outcome-level score versus per-step process-level scores) and how the labels are produced (human annotation versus automated rollout-based estimation).
The central motivation for PRMs is that long chains of reasoning often contain a single critical error that derails the rest of the solution. An ORM only knows that the final answer is wrong, not where the error occurred, which makes it weak at credit assignment. A PRM can localize the failure and give a stronger learning or selection signal.
The two reward-modeling paradigms differ in label structure, training cost, and use cases. The table below summarizes the main contrasts as described in the literature, including the survey by Zhou et al. (2025) and the original OpenAI study.
| Property | Outcome reward model (ORM) | Process reward model (PRM) |
|---|---|---|
| Label granularity | One label per full solution, usually correct or incorrect final answer | One label per reasoning step, judging local validity or progress |
| Signal density | Sparse, only at the end | Dense, at every step |
| Annotation cost | Low, often free for math when an oracle answer key exists | High when labeled by humans, moderate to high when labeled by Monte Carlo rollouts |
| Credit assignment | Cannot localize errors within a long solution | Can identify the first incorrect step |
| Reward hacking risk | Lower, because rewards are tied to verifiable outcomes when an answer key exists | Higher, because a neural model judges intermediate language and can be gamed |
| Typical use | Outcome verifiers for best-of-N, RLHF on summary tasks | Step-level verifiers for tree search, dense RL signals for reasoning |
| Generalization | Works for any task with a clear final-answer check | Stronger for long, multi-step domains such as math, science, and code |
OpenAI's "Let's Verify Step by Step" reported that under a fixed compute budget, a PRM trained on the PRM800K data outperformed an ORM trained on comparable amounts of outcome data when both were used to rerank candidate solutions on a representative subset of the MATH test set. The PRM-selected solutions solved 78 percent of problems in that subset, a substantial gain over the ORM baseline and over majority voting.
Process supervision was discussed in earlier work on math word problem solvers, including the 2022 DeepMind paper "Solving math word problems with process- and outcome-based feedback" (Uesato et al., arXiv 2211.14275), but the May 2023 paper "Let's Verify Step by Step" by Hunter Lightman, Vineet Kosaraju, Yura Burda, Harri Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe brought the technique into the modern era of large language models by combining it with GPT-4 scale generators and a large human-labeled dataset. The paper was first released as arXiv preprint 2305.20050 and was later accepted to ICLR 2024.
The authors fine-tuned a generator from the base GPT-4 model (which had been pre-trained on next-token prediction without RLHF) to produce step-by-step solutions to problems from the MATH dataset of competition mathematics. They then trained two kinds of verifiers from the same base model: an outcome-supervised reward model trained on whether the final answer matched the reference, and a process-supervised reward model trained on per-step human labels. Both verifiers were used at test time to rerank many sampled candidate solutions, in a form of best-of-N sampling.
To train the PRM, OpenAI's team collected PRM800K, a public dataset of step-level human feedback labels on solutions to MATH problems. As released, it contains 800,000 step-level labels across roughly 75,000 solutions to 12,000 MATH problems, with an underlying pool of about 1,085,590 labels across 101,599 solution samples. Each step was annotated by human contractors with one of three labels: positive (correct), negative (incorrect), or neutral (ambiguous). The dataset is hosted on GitHub at github.com/openai/prm800k and remains the primary public resource for human-labeled PRM research.
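The sketch below shows a simplified step-labeled record and one way to binarize the three-way labels. The field names are illustrative assumptions, not the released PRM800K schema, and the paper notes that neutral labels can be treated as either positive or negative at training time.

```python
# Illustrative step-labeled record (field names are assumptions, not the
# exact PRM800K schema). Ratings follow the three-way scheme:
# 1 = positive (correct), 0 = neutral (ambiguous), -1 = negative (incorrect).
example = {
    "problem": "Compute 3 + 4 * 2.",
    "steps": [
        {"text": "By order of operations, evaluate 4 * 2 = 8 first.", "rating": 1},
        {"text": "Then 3 + 8 = 12.", "rating": -1},  # first incorrect step
    ],
}

# One common binarization choice: treat neutral steps as correct.
targets = [1 if step["rating"] >= 0 else 0 for step in example["steps"]]
```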
The authors also used active learning to choose which solutions to label next, focusing effort on solutions where the current PRM was most uncertain, and reported that active learning roughly doubled the data efficiency of process supervision.
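A hedged sketch of that selection heuristic, which surfaces convincing wrong-answer solutions for annotation; `prm_score` and `final_answer` are hypothetical helpers.

```python
def select_for_labeling(problem, candidates, prm_score, final_answer,
                        reference_answer, k=4):
    """Pick the wrong-answer solutions the current PRM finds most convincing."""
    wrong = [c for c in candidates if final_answer(c) != reference_answer]
    # Highly scored but incorrect solutions are the most informative to label.
    wrong.sort(key=lambda c: prm_score(problem, c), reverse=True)
    return wrong[:k]
```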
Process supervision substantially outperformed outcome supervision on MATH under matched data and matched compute budgets. The PRM-based reranker reached approximately 78 percent accuracy on a representative test subset, beating both the ORM and majority-vote baselines. The paper also discussed alignment implications, arguing that process supervision produces chains of thought more aligned with human reasoning. It became one of the most influential references for reasoning-focused systems built after 2023, and is frequently cited as a precursor to OpenAI o1.
A major limitation of PRM800K is that step-level human labeling is expensive and slow. A wave of follow-up research has investigated whether process labels can be generated automatically by sampling additional rollouts from a model, using the empirical probability that a step leads to a correct final answer as a proxy for its correctness. Two prominent examples are Math-Shepherd and OmegaPRM.
Math-Shepherd, introduced in the December 2023 arXiv paper "Math-Shepherd: Verify and Reinforce LLMs Step-by-step without Human Annotations" by Peiyi Wang and colleagues (arXiv 2312.08935), was the first widely cited automated PRM. The authors define the quality of an intermediate step by its potential to lead to a correct final answer: a completer LLM samples full solutions from each intermediate step, and the fraction of completions that reach the correct answer is used to infer step-level labels.
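A minimal sketch of this rollout-based labeling, assuming hypothetical helpers `complete_from` (the completer LLM) and `extract_answer`:

```python
def estimate_step_value(problem, steps_so_far, reference_answer,
                        complete_from, extract_answer, n_rollouts=8):
    """Soft label for a step: fraction of completions that reach the right answer."""
    prefix = problem + "\n" + "\n".join(steps_so_far)
    hits = sum(
        extract_answer(complete_from(prefix)) == reference_answer
        for _ in range(n_rollouts)
    )
    # Math-Shepherd also describes a hard variant: label 1 if any rollout succeeds.
    return hits / n_rollouts
```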
Using this approach, the team generated automatic step-level supervision for roughly 170,000 solutions to GSM8K and 270,000 solutions to MATH. They then used the resulting PRM both as a verifier and as a dense reward signal in step-by-step proximal policy optimization. Math-Shepherd improved Mistral-7B from 77.9 percent to 84.1 percent on GSM8K and from 28.6 percent to 33.0 percent on MATH, with further gains to 89.1 percent and 43.5 percent when the PRM was also used for inference-time verification.
OmegaPRM, introduced in the June 2024 Google DeepMind paper "Improve Mathematical Reasoning in Language Models by Automated Process Supervision" by Liangchen Luo and collaborators (arXiv 2406.06592), extended the automated-labeling idea with a divide-and-conquer Monte Carlo Tree Search algorithm. Rather than uniformly sampling continuations from every step, OmegaPRM uses binary search to efficiently locate the first incorrect step in a reasoning chain, balancing positive and negative examples and reducing the number of rollouts needed.
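The binary-search idea can be sketched as follows, assuming a hypothetical `rollout_success_rate` helper and the approximation, shared by these labeling schemes, that rollouts rarely recover once a prefix contains an error.

```python
def first_error_step(problem, steps, rollout_success_rate):
    """Locate the first incorrect step in a wrong solution via binary search."""
    lo, hi = 0, len(steps) - 1  # invariant: a prefix of length lo can still succeed
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if rollout_success_rate(problem, steps[:mid]) > 0:
            lo = mid       # steps[:mid] can still reach the reference answer
        else:
            hi = mid - 1   # the error occurs at or before step mid
    return lo + 1          # 1-indexed index of the first incorrect step
```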
The authors reported collecting over 1.5 million process-supervision annotations and using them to train a PRM that, combined with weighted self-consistency, raised the instruction-tuned Gemini Pro model from 51 percent to 69.4 percent on MATH500 and from 86.4 percent to 93.6 percent on GSM8K, and lifted Gemma2-27B from 42.3 percent to 58.2 percent on MATH500 and from 74.0 percent to 92.2 percent on GSM8K, all without human labels.
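Weighted self-consistency, mentioned above, can be sketched as follows, again with hypothetical helpers `prm_score` and `final_answer`:

```python
from collections import defaultdict

def weighted_self_consistency(problem, solutions, prm_score, final_answer):
    """Each sampled solution votes for its answer, weighted by its PRM score."""
    votes = defaultdict(float)
    for solution in solutions:
        votes[final_answer(solution)] += prm_score(problem, solution)
    return max(votes, key=votes.get)  # answer with the largest weighted vote
```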
Related automated-labeling schemes have appeared since 2024, including Math-PSA from the OpenR project, RLHFlow's PRM trained on DeepSeek-distilled data, and various Process Advantage Verifier (PAV) approaches that score the marginal benefit of each step rather than its absolute correctness.
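As a rough sketch of the PAV idea, a step's reward can be taken as the change in an estimated success probability, here via a hypothetical `value_estimate` helper such as a Monte Carlo rollout success rate.

```python
def step_advantages(problem, steps, value_estimate):
    """Score each step by the marginal progress it makes, not absolute correctness."""
    values = [value_estimate(problem, steps[:i]) for i in range(len(steps) + 1)]
    # Advantage of step i: value with the step minus value without it.
    return [values[i + 1] - values[i] for i in range(len(steps))]
```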
The table below summarizes the most widely used datasets for training and evaluating PRMs. Numbers are taken from each dataset's release notes and accompanying papers.
| Dataset | Year | Source of step labels | Approximate scale | Domain |
|---|---|---|---|---|
| PRM800K | 2023 | Human annotators (OpenAI) | ~800,000 step labels over ~75,000 solutions to ~12,000 MATH problems | Competition mathematics |
| Math-Shepherd | 2023 | Automated, Monte Carlo rollouts from a completer LLM | ~170,000 solutions on GSM8K and ~270,000 on MATH with step-level labels | Grade-school and competition math |
| OmegaPRM data | 2024 | Automated, divide-and-conquer MCTS | ~1.5 million step-level annotations | Mathematical reasoning |
| Math-PSA / OpenR PRM data | 2024 | Automated rollouts and curated mixtures | Several hundred thousand step labels | Math and general reasoning |
| PRMBench | 2025 | Curated with controlled error perturbations | 6,216 problems, 83,456 step labels across multiple error types | Math benchmark for evaluating PRMs |
PRM800K remains the de facto reference dataset for human-labeled process supervision, while Math-Shepherd and OmegaPRM are the most cited automated-labeling pipelines.
A typical PRM is a transformer that shares the architecture of the base generator and is fine-tuned on a sequence-classification objective. Inputs are formatted as a problem followed by a partial solution ending at a candidate step, and the model produces a scalar score reflecting the probability that the step is correct or that the partial solution will lead to a correct answer.
Three design choices have proven especially important: how per-step scores are aggregated into a solution-level score (for example, the product of step probabilities in the original OpenAI study, or the minimum step score in later work), whether step labels are hard correctness judgments or soft rollout-derived value estimates, and how step boundaries are delimited, commonly by newlines.
PRMs are typically used as verifiers for best-of-N reranking, as search heuristics for beam search or MCTS, or as dense per-step rewards during policy optimization with algorithms such as PPO.
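A minimal sketch of solution scoring with a PRM follows; `step_probs` is a hypothetical helper returning one correctness probability per step. The original OpenAI study scores a solution as the product of its step probabilities, and the minimum step score is a common alternative in later work.

```python
import math

def solution_score(problem, steps, step_probs, agg="product"):
    """Aggregate per-step correctness probabilities into one solution-level score."""
    probs = step_probs(problem, steps)
    if agg == "product":
        return math.prod(probs)  # probability that every step is correct
    if agg == "min":
        return min(probs)        # a solution is only as strong as its weakest step
    raise ValueError(f"unknown aggregator: {agg}")
```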
Process reward models are widely regarded as a key ingredient in the wave of test-time compute reasoning systems that emerged in 2024 and 2025, although the precise role differs across labs.
OpenAI o1, released in preview form in September 2024 and as a full model in December 2024, was the first widely deployed model to use long internal chains of thought scaled at inference time. OpenAI has not published the full training recipe, but the o1 system card and accompanying blog posts emphasize that the model is trained with large-scale reinforcement learning that rewards productive chains of thought, and many third-party analyses cite "Let's Verify Step by Step" and PRM800K as the canonical precursors. OpenAI o3, announced in December 2024, extends the same paradigm.
Google DeepMind has published more openly on automated process supervision through OmegaPRM, reporting the improvements above on Gemini Pro and Gemma 2 27B. These results suggest that the Gemini Thinking variants released in late 2024 draw on related techniques, although Google has not disclosed full pipeline details.
DeepSeek-R1, released in January 2025, is notable for explicitly avoiding neural PRMs. The DeepSeek-R1 technical report (Guo et al., 2025) states that the authors did not apply neural outcome or process reward models when developing DeepSeek-R1-Zero, because they found that neural reward models can suffer from reward hacking under large-scale RL and complicate the training pipeline. Instead, DeepSeek-R1's reward has two components: a rule-based accuracy reward that checks final answers against ground truth, and a rule-based formatting reward. This choice is widely cited as evidence that, for tasks with cheap, verifiable answers, sparse outcome rewards can be sufficient and safer than neural PRMs.
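A sketch of such a rule-based reward with the two components the report describes follows; the tag pattern, answer extraction, and reward values are illustrative assumptions, not the report's exact implementation.

```python
import re

def rule_based_reward(response: str, reference_answer: str) -> float:
    """Rule-based reward: format check plus final-answer accuracy check."""
    reward = 0.0
    # Format reward: reasoning should appear inside <think>...</think> tags.
    if re.search(r"<think>.*?</think>", response, flags=re.DOTALL):
        reward += 0.5  # illustrative value
    # Accuracy reward: compare an extracted boxed answer against the reference.
    match = re.search(r"\\boxed\{([^}]*)\}", response)
    if match and match.group(1).strip() == reference_answer.strip():
        reward += 1.0  # illustrative value
    return reward
```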
The table below lists a sample of openly released PRMs and reasoning-focused systems that use process supervision in some form.
| Model or family | Developer | Year | Notes on PRM use |
|---|---|---|---|
| GPT-4 process verifier | OpenAI | 2023 | Trained on PRM800K, described in "Let's Verify Step by Step" |
| Math-Shepherd | DeepSeek and Peking University authors | 2023 | First widely cited automated PRM, used to train Mistral-7B and DeepSeek-Math |
| OmegaPRM | Google DeepMind | 2024 | Automated MCTS-based labeling, applied to Gemini Pro and Gemma2 |
| Qwen2.5-Math-PRM-7B and PRM-72B | Alibaba (Qwen team) | 2024 | Public PRMs trained on a mix of PRM800K and automated data |
| Skywork-o1 Open PRMs | Kunlun Skywork | 2024 | 1.5B and 7B Qwen-based PRMs released openly, focused on math and code reasoning |
| RLHFlow Llama3.1-8B-PRM-Deepseek-Data | RLHFlow community | 2024 | Open PRM trained on DeepSeek-distilled trajectories |
| DeepSeek-R1 | DeepSeek | 2025 | Explicitly avoids neural PRMs; uses rule-based rewards |
PRMs are deployed in several distinct ways across modern reasoning systems. The table below organizes the main use cases.
| Use case | What the PRM does | Typical setting |
|---|---|---|
| Best-of-N reranking | Scores fully generated solutions and selects the highest-scoring one | Math, science, code at test time |
| Weighted self-consistency | Combines PRM scores with majority voting across sampled answers | Math reasoning with diverse final answers |
| Beam search over steps | Expands the most promising partial solutions step by step | Test-time compute heavy reasoning |
| Monte Carlo Tree Search | Acts as a value or prior network for an MCTS-style tree | Long-horizon math, theorem proving |
| Dense RL signal | Provides per-step advantages during PPO or similar RL | Fine-tuning reasoning policies |
| Curriculum and data selection | Filters or ranks training trajectories before fine-tuning | Building reasoning-focused base models |
| Error localization for analysis | Highlights the first incorrect step in a long solution | Interpretability, debugging, and red-teaming |
Many recent systems combine several of these uses. For example, an automated PRM may first filter training trajectories, then provide a dense RL signal during policy optimization, and finally serve as a verifier at inference time.
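As one concrete example of the search-based uses above, a PRM-guided beam search over steps might look like the following sketch, where `propose_steps`, `prm_step_score`, and `is_finished` are hypothetical helpers.

```python
def prm_beam_search(problem, propose_steps, prm_step_score, is_finished,
                    beam_width=4, expansions=4, max_steps=20):
    """Expand the most promising partial solutions step by step under PRM guidance."""
    beams = [[]]  # each beam is the list of steps taken so far
    for _ in range(max_steps):
        candidates = []
        for steps in beams:
            if is_finished(steps):
                candidates.append(steps)  # carry finished solutions forward
                continue
            for step in propose_steps(problem, steps, n=expansions):
                candidates.append(steps + [step])
        # Keep the partial solutions the PRM currently rates highest.
        candidates.sort(key=lambda s: prm_step_score(problem, s), reverse=True)
        beams = candidates[:beam_width]
        if all(is_finished(s) for s in beams):
            break
    return beams[0]  # highest-scoring completed solution
```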
Despite their attractive properties, PRMs face several well-documented limitations, including the high cost and slow pace of step-level human annotation, noise in rollout-based automated labels, susceptibility to reward hacking when a neural model judges intermediate language, and weaker generalization outside domains such as mathematics where step correctness is well defined.
The PRM literature expanded rapidly during 2024 and 2025. Major directions include automated labeling pipelines that replace human annotation with Monte Carlo rollouts, dedicated evaluation benchmarks such as PRMBench, open releases of trained PRMs from groups such as Qwen, Skywork, and RLHFlow, and alternatives such as Process Advantage Verifiers that score each step's marginal progress rather than its absolute correctness.
Research on PRMs sits at the intersection of reinforcement learning, test-time compute scaling, and AI alignment. The shift toward process supervision is widely seen as a key driver of the 2024 to 2026 wave of reasoning models, even as practitioners continue to debate when neural PRMs are worth the engineering and reward-hacking costs relative to simpler rule-based rewards.