Process reward model (PRM)

A process reward model (PRM), also called a process-supervised reward model or step-level verifier, is a learned scoring model that evaluates the correctness or quality of each intermediate step in a large language model's chain of thought, rather than judging only the final answer.^[1]^[2] PRMs are contrasted with outcome reward models (ORMs), also known as outcome-supervised reward models, which assign a single scalar score after the last step of a reasoning trace based on whether the final answer is correct.^[1]^[3] By providing dense, step-by-step feedback, PRMs offer a finer-grained training and verification signal that has proven especially useful for long, multi-step reasoning tasks such as mathematics, scientific problem solving, and code generation.^[2]

The modern PRM paradigm was popularized by the May 2023 paper Let's Verify Step by Step from OpenAI, which released the PRM800K dataset of roughly 800,000 step-level human correctness labels and showed that a PRM-reranked solver reached 78.2% accuracy on a representative subset of the MATH test set, outperforming both an outcome-supervised verifier (72.4%) and majority voting (69.6%).^[2]^[4] The technique built on a 2022 study by Jonathan Uesato and collaborators at DeepMind, who carried out the first systematic comparison of process- and outcome-based feedback on GSM8K.^[1] PRMs have since become a core ingredient of test-time search procedures such as best-of-N sampling and tree search, and have been associated with reasoning systems including OpenAI o1, OpenAI o3, Google's Gemini Thinking variants, and open replications.^[5]^[6] At the same time, large-scale deployments such as DeepSeek-R1 have explicitly avoided neural PRMs because of concerns over reward hacking and pipeline complexity, instead using rule-based verifiable rewards in the style of RLVR.^[7] PRMs sit at the center of one of the most active debates in mid-2020s reasoning research: when is dense process supervision worth the engineering cost, and when do sparse verifiable rewards suffice?

Background

A reward model in the context of language models is a function that takes a prompt and a model response and returns a scalar score reflecting how good the response is. Reward models are central to reinforcement learning from human feedback (RLHF), where they replace expensive human ratings during policy optimization with PPO or related algorithms, and to inference-time procedures such as best-of-N ranking.^[8] Reward models also underlie RLAIF (reinforcement learning from AI feedback), where AI-labeled preferences substitute for human ones.

The first large-scale demonstration that a learned verifier could push small models past much larger ones came from Karl Cobbe and colleagues at OpenAI, who in 2021 released the GSM8K grade-school math benchmark together with an outcome verifier: a model trained to predict whether a candidate solution is correct, with training labels derived from whether the final numerical answer matched the reference, and used to select the best of many samples at test time.^[3] This established the outcome verifier paradigm and showed that a smaller generator paired with a verifier could match or exceed a much larger fine-tuned generator by exploiting test-time compute.

For multi-step reasoning, however, outcome-only signals have a well-known weakness: a long chain of thought may contain a single critical error that derails the rest of the solution, yet the outcome label only reports the final result. An ORM cannot localize the failure, which makes it weak at credit assignment and prone to rewarding lucky-but-flawed chains. The central motivation for PRMs is to provide a dense signal that tells the model where its reasoning went wrong, not just whether it reached the right destination. Reward models for reasoning are categorized along two axes: granularity of supervision (per-solution vs per-step) and source of labels (humans, heuristics, or Monte Carlo rollouts). PRM800K-style human labeling sits at one extreme of cost and precision.

ORM versus PRM

The two reward-modeling paradigms differ in label structure, training cost, supervision density, and failure modes. The table below summarizes the central contrasts described across the literature.^[1]^[2]^[7]

Property	Outcome reward model (ORM)	Process reward model (PRM)
Label granularity	One label per full solution, usually correct or incorrect final answer	One label per reasoning step, judging local validity or progress
Signal density	Sparse, only at the end	Dense, at every step
Annotation cost	Low; free when an oracle answer key exists	High when labeled by humans, moderate when labeled by Monte Carlo rollouts
Credit assignment	Cannot localize errors within a long solution	Can identify the first incorrect step
Reward hacking risk	Lower, because rewards can be grounded in verifiable outcomes	Higher, because a neural model judges intermediate language and can be gamed
Typical use	Best-of-N reranking, RLHF on summary tasks, rule-based RLVR	Step-level verifiers for tree search, dense RL signals for reasoning
Generalization	Works for any task with a clear final-answer check	Stronger for long, multi-step domains such as math, science, and code
Sample efficiency in RL	Lower; sparse signal needs many rollouts	Higher in principle; dense signal accelerates credit assignment

Critically, ORM and PRM are not mutually exclusive. In Let's Verify Step by Step, the same fully-generated solutions are scored by both verifier types and compared head-to-head; downstream systems often combine a rule-based outcome check with a neural PRM signal to anchor learning and reduce hackability.^[2]
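The difference in label structure can be made concrete with a small sketch. The class names, field names, and the toy problem below are illustrative assumptions, not a format used by any of the cited papers; the point is only that the same "lucky-but-flawed" solution gets a single positive outcome label but step labels that expose where it went wrong.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class OutcomeExample:
    """ORM-style training example: one label for the whole solution."""
    problem: str
    solution_steps: List[str]
    final_answer_correct: bool


@dataclass
class ProcessExample:
    """PRM-style training example: one label per reasoning step."""
    problem: str
    solution_steps: List[str]
    step_labels: List[bool]  # same length as solution_steps

    def first_error(self) -> int:
        """Index of the first step labeled incorrect, or -1 if there is none."""
        for i, ok in enumerate(self.step_labels):
            if not ok:
                return i
        return -1


# Toy chain in which two arithmetic slips cancel, so the final answer is right.
steps = [
    "12 * 4 = 12 * 2 + 12 * 2",
    "12 * 2 = 26",            # wrong: 12 * 2 is 24
    "26 + 26 = 48",           # wrong: 26 + 26 is 52
    "So the answer is 48.",
]
orm_view = OutcomeExample("What is 12 * 4?", steps, final_answer_correct=True)
prm_view = ProcessExample("What is 12 * 4?", steps,
                          step_labels=[True, False, False, True])
```

Here `orm_view` carries no information about the flawed middle steps, while `prm_view.first_error()` points at the second step.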

Uesato et al. (2022): origin of process supervision

The phrase "process-supervised reward model" and the systematic comparison between process and outcome supervision were introduced by Jonathan Uesato, Nate Kushman, Ramana Kumar, Francis Song, Noah Siegel, Lisa Wang, Antonia Creswell, Geoffrey Irving, and Irina Higgins of DeepMind in their November 2022 arXiv preprint Solving math word problems with process- and outcome-based feedback.^[1]

The Uesato et al. paper conducted what the authors called "the first comprehensive comparison between process- and outcome-based approaches trained on a natural language task," using the GSM8K grade-school math benchmark.^[1] All models were based on a 70-billion-parameter Chinchilla-style language model from Hoffmann et al. (2022). The authors compared outcome-supervised and process-supervised variants of both fine-tuning and reward modeling.

The headline results were nuanced. Both process- and outcome-based feedback reached comparable final-answer error rates (improving the previous best from 16.8% to 12.7% on GSM8K), suggesting outcome supervision can match process supervision on the final metric while using cheaper labels.^[1] But the picture changed when the authors looked at reasoning quality: among solutions with correct final answers, the rate of trace errors (answers that arrived at the right number through invalid reasoning) fell from 14.0% to 3.4% only when process-based supervision (or a learned reward model emulating it) was used. Reward-model reranking dropped trace error from 11.4% to under 5% in their best configuration.^[1]

The paper's main conclusion was that outcome supervision is sufficient for answer accuracy, but process supervision (or a reward model trained to imitate it) is necessary for correct reasoning steps. This established the alignment-flavored motivation for PRMs (generating chains of thought that humans can trust), which would later be central to Let's Verify Step by Step.

Let's Verify Step by Step (Lightman et al., 2023)

The May 2023 paper Let's Verify Step by Step, by Hunter Lightman, Vineet Kosaraju, Yura Burda, Harri Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe of OpenAI, brought process supervision into the modern era of large language models by combining it with GPT-4-scale generators and a large human-labeled step dataset.^[2] The paper appeared as arXiv:2305.20050 and was accepted to ICLR 2024.

Setup

The authors fine-tuned a generator from the base GPT-4 model, pre-trained on next-token prediction without RLHF, to produce step-by-step solutions to problems from the MATH benchmark of competition mathematics. They then trained two verifiers from the same base model: an outcome-supervised reward model (ORM) trained on labels indicating whether each candidate solution's final answer matched the reference, and a process-supervised reward model (PRM) trained on per-step human labels. Both verifiers were used at test time to rerank many sampled candidate solutions, a form of best-of-N sampling.

PRM800K

To train the PRM, the OpenAI team collected PRM800K, a public dataset of step-level human feedback labels on GPT-4-generated solutions to MATH problems.^[2]^[4] The paper reports roughly 800,000 step-level labels across about 75,000 solutions to 12,000 MATH problems.^[2] According to the repository, the released archive contains approximately 1,085,590 step labels across 101,599 solution samples, with a phase-1 / phase-2 structure (phase 2 is the curated 800K-label release used to train the best PRM).^[4]

Each step was annotated with one of three labels: positive (correct), neutral (ambiguous), or negative (incorrect). For PRM training, the three-way labels are typically collapsed to a binary target. The dataset is hosted on GitHub at openai/prm800k and remains the primary public resource for human-labeled PRM research.^[4]

The authors also used active learning to choose which solutions to label next, focusing annotator effort on "convincing wrong-answer" solutions: those that the current PRM rated highly but whose final answer was incorrect. They reported that active learning was roughly 2.6× more data-efficient than uniform labeling, materially reducing human-labeling cost.^[2]

Empirical results on MATH

In the paper's large-scale experiments, the PRM-based reranker (best-of-N with N = 1860) reached 78.2% accuracy on a representative subset of the MATH test set, beating both the ORM-based reranker (72.4%) and majority voting (69.6%).^[2] Varying N traced out the test-time compute scaling curve that became the empirical core of the paper.

The paper also argued for an alignment dimension: process supervision is more aligned with how humans evaluate reasoning, because it rewards getting each step right rather than only the final number. This argument became one of the most cited rationales for putting process supervision at the heart of subsequent reasoning systems.

How PRMs are trained

A typical PRM is a transformer with the same architecture as the base generator, fine-tuned on a sequence-classification objective. Inputs are formatted as a problem followed by a partial solution ending at a candidate step, and the model produces a scalar score reflecting the probability that the step is correct or that the partial solution will lead to a correct answer.
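As a rough illustration of this input format, the sketch below builds one training example per step, where each input is the problem plus the partial solution ending at that step, and computes a binary cross-entropy loss against step labels. The newline-joined text layout, the function names, and the choice of plain binary cross-entropy are assumptions made for the example; real PRMs often score all steps in a single forward pass at dedicated step positions.

```python
import math
from typing import List, Tuple


def build_prefix_examples(problem: str, steps: List[str],
                          labels: List[int]) -> List[Tuple[str, int]]:
    """Return one (input_text, label) pair per step; each input ends at that step."""
    if len(steps) != len(labels):
        raise ValueError("need exactly one label per step")
    examples = []
    for t in range(len(steps)):
        # Problem followed by the partial solution up to and including step t.
        prefix = problem + "\n" + "\n".join(steps[: t + 1])
        examples.append((prefix, labels[t]))
    return examples


def binary_cross_entropy(probs: List[float], labels: List[int]) -> float:
    """Mean BCE between predicted step-correctness probabilities and 0/1 labels."""
    eps = 1e-12  # clamp to avoid log(0)
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(labels)


examples = build_prefix_examples("Q: 2 + 3 * 4?", ["3 * 4 = 12", "2 + 12 = 15"], [1, 0])
```

In a real training loop the probabilities passed to `binary_cross_entropy` would come from the model's scalar head; here they are left to the caller.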

Human labeling

The PRM800K-style pipeline relies on trained human annotators reading model-generated step-by-step solutions and assigning one of {good, neutral, bad} to each step. To control quality, OpenAI used multi-stage onboarding, periodic relabeling, and instructions distinguishing "is this step in a valid solution?" from "is this step correct?".^[2] Human labeling produces high-quality data but is slow and expensive, which has motivated automated alternatives.

Automated labeling via Monte Carlo rollouts

A wave of follow-up research has investigated whether process labels can be generated automatically by sampling additional rollouts from a model and using the empirical probability that a step leads to a correct final answer as a proxy for its correctness.^[9]^[10] Concretely, given a partial solution ending at step t, a completer LLM samples many completions; the fraction of those completions reaching the correct final answer is taken as the "value" of step t.
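The rollout-based estimate described above can be sketched as follows. The `Completer` signature and the toy completer are hypothetical stand-ins for sampling from an LLM, and the rollout count is arbitrary; this is a minimal sketch of the idea, not the pipeline of any specific paper.

```python
import random
from typing import Callable, List

# Hypothetical completer signature: (problem, prefix_steps) -> final answer string.
Completer = Callable[[str, List[str]], str]


def mc_step_values(problem: str, steps: List[str], reference_answer: str,
                   completer: Completer, num_rollouts: int = 16) -> List[float]:
    """For each step t, estimate the fraction of completions of steps[:t]
    that reach the reference answer."""
    values = []
    for t in range(1, len(steps) + 1):
        prefix = steps[:t]
        hits = sum(
            completer(problem, prefix) == reference_answer
            for _ in range(num_rollouts)
        )
        values.append(hits / num_rollouts)
    return values


def make_toy_completer(seed: int = 0) -> Completer:
    """Toy stand-in for an LLM: fails whenever the prefix contains a marked
    error, otherwise succeeds with probability 0.75."""
    rng = random.Random(seed)

    def complete(problem: str, prefix: List[str]) -> str:
        if any("WRONG" in s for s in prefix):
            return "52"  # in this toy, an earlier error is never repaired
        return "48" if rng.random() < 0.75 else "52"

    return complete


steps = ["12 * 4 = 12 * 2 + 12 * 2", "12 * 2 = 26  (WRONG)", "26 + 26 = 52"]
values = mc_step_values("What is 12 * 4?", steps, "48", make_toy_completer())
```

The resulting `values` can be used directly as soft targets or thresholded into hard labels; either way, they measure how often the completer recovers from a prefix, which is why the estimate depends on the completer's own ability.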

Math-Shepherd (Wang et al., December 2023, arXiv:2312.08935) was the first widely cited automated PRM along these lines.^[9] The authors generated automatic step-level supervision for roughly 170,000 solutions on GSM8K and 270,000 on MATH, then used the resulting PRM both as a verifier and as a dense reward in step-by-step PPO. Math-Shepherd lifted Mistral-7B from 77.9% to 84.1% on GSM8K and from 28.6% to 33.0% on MATH, with further gains to 89.1% and 43.5% when the PRM was also used for inference-time verification.^[9]

OmegaPRM (Luo et al., June 2024, Google DeepMind, arXiv:2406.06592) extended automated labeling with a divide-and-conquer Monte Carlo Tree Search algorithm that uses binary search to efficiently locate the first incorrect step.^[10] The authors collected over 1.5 million process-supervision annotations and used them to train a PRM that, combined with weighted self-consistency, raised the instruction-tuned Gemini Pro model from 51% to 69.4% on MATH500 and from 86.4% to 93.6% on GSM8K, and lifted Gemma2-27B from 42.3% to 58.2% on MATH500, all without human labels.^[10]
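The binary-search idea can be illustrated in isolation. The sketch below assumes a hypothetical predicate `prefix_is_recoverable` (for example, "at least one Monte Carlo rollout from this prefix reaches the correct answer") and assumes the predicate is monotone: once a prefix contains the first error, every longer prefix is also unrecoverable. It is not OmegaPRM's full algorithm, which embeds this search in a tree with rollout reuse.

```python
from typing import Callable, List


def first_error_index(steps: List[str],
                      prefix_is_recoverable: Callable[[List[str]], bool]) -> int:
    """Return the 0-based index of the first step whose inclusion makes the
    prefix unrecoverable, or -1 if the full solution is recoverable.

    Uses O(log n) predicate calls instead of checking every prefix."""
    if prefix_is_recoverable(steps):
        return -1
    # Invariant: steps[:lo] is recoverable, steps[:hi] is not.
    # The empty prefix (lo = 0) is assumed recoverable.
    lo, hi = 0, len(steps)
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if prefix_is_recoverable(steps[:mid]):
            lo = mid
        else:
            hi = mid
    return hi - 1


# Toy predicate: a prefix is recoverable as long as it contains no "bad" step.
calls = []


def toy_predicate(prefix: List[str]) -> bool:
    calls.append(len(prefix))
    return "bad" not in prefix


located = first_error_index(["ok", "ok", "bad", "ok", "ok", "ok", "ok", "ok"],
                            toy_predicate)
```

With eight steps, the toy run evaluates the predicate on four prefixes rather than eight, which is the saving the divide-and-conquer approach is after when each evaluation costs many rollouts.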

Generative PRMs and PRMs that think

A more recent direction reframes the PRM as a generator rather than a classifier. Generative reward models and LLM-as-judge systems use a language model to produce a chain-of-thought critique of a step, then extract a verdict from the critique.^[11] The 2025 paper Process Reward Models That Think introduces ThinkPRM, a long-CoT verifier fine-tuned on orders of magnitude fewer process labels than discriminative PRMs require. The authors report that it outperforms LLM-as-judge and discriminative baselines while using only about 1% of the process labels in PRM800K.^[12] These approaches blur the line between a "reward model" and a "reasoning model" and connect PRM research to the broader test-time compute literature.

Design choices

Three design choices have proven important:

Step segmentation. Solutions are split into steps using newlines, numbered enumerations, or a dedicated step token. Coarse segmentation reduces label cost but loses precision.
Label conversion. PRM800K's three-way labels are typically collapsed to binary. Qwen2.5-Math-PRM-7B, for instance, treats 1 and 0 as positive and -1 as negative.
Scoring objective. Some PRMs predict per-step correctness independently. Others predict the value of the partial solution as a Q-value, the probability that the partial solution will eventually reach a correct answer. Process Advantage Verifiers (PAVs) go further still, predicting the advantage of each step relative to a base policy rather than its absolute correctness.^[13]
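The three choices above can be sketched in a few lines. Everything here is illustrative: newline splitting is only one segmentation option, the `neutral_is_positive` flag is a hypothetical name for the label-collapsing rule, and the advantage computation assumes that per-prefix value estimates (for example, from Monte Carlo rollouts) are already available.

```python
from typing import List


def split_steps(solution: str) -> List[str]:
    """Newline-based segmentation: one non-empty line per step."""
    return [line.strip() for line in solution.splitlines() if line.strip()]


def to_binary(labels: List[int], neutral_is_positive: bool = True) -> List[int]:
    """Collapse three-way labels {-1, 0, 1} to binary targets {0, 1}."""
    out = []
    for y in labels:
        if y not in (-1, 0, 1):
            raise ValueError(f"unexpected label: {y}")
        positive = y == 1 or (y == 0 and neutral_is_positive)
        out.append(1 if positive else 0)
    return out


def step_advantages(prefix_values: List[float], root_value: float) -> List[float]:
    """Advantage-style score: change in estimated success probability caused
    by each step (value after the step minus value before it)."""
    advantages = []
    previous = root_value
    for value in prefix_values:
        advantages.append(value - previous)
        previous = value
    return advantages


parsed = split_steps("3 * 4 = 12\n\n  2 + 12 = 14  \nAnswer: 14")
```

A step that leaves the success probability unchanged gets an advantage of zero even when its absolute value is high, which is the distinction between absolute correctness and progress that the scoring-objective item describes.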

Usage at inference

PRMs are used in several distinct ways at inference, often in combination:

Best-of-N sampling. The generator produces N candidate solutions; the PRM scores each by aggregating per-step "correct" probabilities (commonly via the product or minimum, both of which penalize any clearly-wrong step); the highest-scoring candidate is returned.^[2] This was the headline application in Let's Verify Step by Step and remains the most common test-time use case.
Weighted self-consistency. PRM scores are combined with majority voting across sampled final answers: candidates are grouped by final answer, each candidate's vote is weighted by its PRM score, and the answer with the largest total weight is returned.^[10] The combination tends to outperform either ingredient alone.
Tree search and MCTS. For long-horizon problems, the PRM acts as a value or prior network for Monte Carlo Tree Search: candidate continuations are scored, promising branches are expanded, and low-scoring branches are pruned. This is the natural extension of PRM use from rerankers to search procedures and is implemented in open frameworks such as OpenR.
Self-correction loops. A PRM can flag the location of the first low-scoring step; the generator is then asked to revise from that point. This turns the verifier into an editor that drives iterative refinement.
Dense reward in RL. In policy optimization, a PRM can provide a per-step reward signal for PPO or GRPO, turning the sparse outcome signal into a dense advantage estimate. Math-Shepherd and follow-ups demonstrated that this improves both sample efficiency and final accuracy on math benchmarks.^[9]
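The reranking-style uses in the list above can be sketched with plain Python. The candidate representation (a final answer plus a list of per-step "correct" probabilities), the function names, and the 0.5 threshold are assumptions for illustration; the per-step probabilities would come from a PRM in practice. Weighted self-consistency is implemented here as summing PRM scores per distinct final answer, which is one common reading of the technique.

```python
import math
from collections import Counter, defaultdict
from typing import Dict, List, Tuple

# (final_answer, per-step correctness probabilities from a PRM)
Candidate = Tuple[str, List[float]]


def aggregate(step_probs: List[float], how: str = "prod") -> float:
    """Collapse per-step probabilities into one solution-level score."""
    if how == "prod":
        return math.prod(step_probs)
    if how == "min":
        return min(step_probs)
    raise ValueError(f"unknown aggregation: {how}")


def best_of_n(cands: List[Candidate], how: str = "prod") -> str:
    """Return the final answer of the single highest-scoring candidate."""
    return max(cands, key=lambda c: aggregate(c[1], how))[0]


def majority_vote(cands: List[Candidate]) -> str:
    """Baseline: most frequent final answer, ignoring PRM scores."""
    return Counter(answer for answer, _ in cands).most_common(1)[0][0]


def weighted_self_consistency(cands: List[Candidate], how: str = "prod") -> str:
    """Sum PRM scores per distinct final answer and return the heaviest one."""
    totals: Dict[str, float] = defaultdict(float)
    for answer, probs in cands:
        totals[answer] += aggregate(probs, how)
    return max(totals, key=lambda a: totals[a])


def first_low_step(step_probs: List[float], threshold: float = 0.5) -> int:
    """Index of the first step scored below the threshold, or -1 if none."""
    for i, p in enumerate(step_probs):
        if p < threshold:
            return i
    return -1


cands = [
    ("41", [0.9, 0.2, 0.9]),
    ("41", [0.8, 0.3, 0.9]),
    ("41", [0.6, 0.5, 0.5]),
    ("42", [0.95, 0.9, 0.95]),
]
```

In this toy set, majority voting picks "41" because it appears three times, while both PRM-based selectors pick "42" because every "41" chain contains a low-scoring step. `first_low_step` shows how the same scores can locate the point from which a self-correction loop would ask the generator to revise.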

PRMs in OpenAI o1 and test-time compute scaling

OpenAI o1, released in preview in September 2024 and as a full model in December 2024, was the first widely deployed reasoning model to scale long internal chains of thought at inference time. OpenAI has not published the full training recipe, but the o1 announcement and system card emphasize that the model was trained with large-scale reinforcement learning that rewards productive chains of thought, with performance improving both with more train-time RL compute and with more test-time thinking compute.^[5]^[6]

Many third-party analyses cite Let's Verify Step by Step and PRM800K as the canonical methodological precursors of o1, in part because Hunter Lightman and several PRM-paper co-authors later worked on the reasoning teams that produced o1. The precise extent to which o1 uses a neural PRM in its training loop remains undisclosed; OpenAI has emphasized large-scale RL with verifiable rewards on math and code, which can be implemented without a PRM, but the public record is consistent with PRMs playing a role in candidate selection, search, or curriculum construction.^[5] OpenAI o3, announced in December 2024, extends the same paradigm. The connection between PRMs and o1 is clearest at the concept level: the PRM literature established that adding inference compute plus a quality verifier delivers reliable accuracy gains, the trade-off that o1 exposes to end users as "thinking time."

DeepSeek R1: why DeepSeek did not use PRMs

DeepSeek-R1, released in January 2025, is notable for explicitly avoiding neural PRMs and ORMs and instead training with a rule-based reward function. The DeepSeek-R1 technical report (Guo et al., arXiv:2501.12948) discusses PRMs in detail and gives three reasons:^[7]

Step definition is hard in general reasoning. Defining a fine-grained reasoning step is challenging in non-mathematical domains.
Step correctness is hard to judge. Automated annotation is noisy, and manual annotation does not scale.
Reward hacking under large-scale RL. Introducing a neural PRM into a long RL loop "inevitably leads to reward hacking": the policy produces confident-sounding intermediate steps that the PRM scores highly without progressing toward correct answers. Retraining the reward model is expensive and complicates the pipeline.

Instead, DeepSeek-R1-Zero, the variant trained purely with RL from a base model without supervised fine-tuning, used a rule-based reward with two components: an accuracy reward that checks final answers against ground truth (e.g., comparing numerical answers, executing generated code), and a format reward that requires the model to wrap reasoning in <think>...</think> tags. The full DeepSeek-R1 system added a cold-start SFT stage, language-consistency rewards, and a final RLHF stage with a more general reward model, but the core RL signal for reasoning remained rule-based throughout. This produced strong math and code performance, competitive with o1 on several benchmarks, while sidestepping neural PRMs entirely.^[7]
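A rule-based reward of this shape can be sketched as follows. The exact matching rules, the way the answer is extracted (everything after the closing tag), and the equal weighting of the two components are assumptions for illustration; the report does not specify them at this level of detail.

```python
import re

# Format rule assumed here: the response starts with a non-empty
# <think>...</think> block, optionally followed by the answer text.
FORMAT_RE = re.compile(r"\s*<think>.+?</think>.*", re.DOTALL)


def format_reward(response: str) -> float:
    """1.0 if the reasoning is wrapped in <think>...</think> tags, else 0.0."""
    return 1.0 if FORMAT_RE.fullmatch(response) else 0.0


def extract_answer(response: str) -> str:
    """Take whatever follows the last closing tag as the final answer."""
    return response.split("</think>")[-1].strip()


def accuracy_reward(response: str, reference: str) -> float:
    """1.0 if the extracted answer string-matches the reference, else 0.0."""
    return 1.0 if extract_answer(response) == reference.strip() else 0.0


def rule_based_reward(response: str, reference: str) -> float:
    """Sum of the two components (equal weighting is an assumption)."""
    return accuracy_reward(response, reference) + format_reward(response)


good = "<think>2 + 2 = 4</think> 4"
wrong_answer = "<think>2 + 2 = 5</think> 5"
no_tags = "4"
```

Because both components are computed by fixed rules, there is no learned scorer for the policy to exploit, though the reward is only as good as the answer-matching rule; for code tasks the string match would be replaced by executing tests.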

DeepSeek-R1's report is widely cited as evidence that, for tasks with cheap verifiable answers, sparse outcome rewards can be sufficient and safer than neural PRMs, and as a key data point in the rise of RLVR as an alternative paradigm.

Limitations

Despite their attractive properties, PRMs face several well-documented limitations.

Annotation cost. Step-level human labels are dramatically more expensive than outcome labels; even automated rollouts cost more than collecting outcome labels for the same problems.
Reward hacking. When a neural PRM is plugged into a large-scale RL loop, the policy can learn to produce confident-sounding but vacuous intermediate steps that the PRM scores highly without progressing toward a correct answer. DeepSeek-R1's report cited this as a primary reason for avoiding neural PRMs altogether.^[7]
Step definition. It is often unclear what a "step" should be. In math, an algebraic manipulation is a natural step; in code or scientific reasoning the boundaries are fuzzy.
Generalization (out-of-distribution). PRMs trained on competition math often transfer poorly to other reasoning tasks, especially open-ended ones without easily verifiable answers.
Calibration. PRMs trained on Monte Carlo-derived labels can be miscalibrated; value estimates depend on the completer LLM's distribution of completions.
Evaluation difficulty. A PRM that scores well on held-out PRM800K labels can still fail to detect subtle reasoning errors. The 2025 PRMBench benchmark (Song et al., arXiv:2501.03124) evaluated 15 models, spanning open-source PRMs and closed-source language models prompted as critics, and reported significant weaknesses on subtle error detection.^[14]
Meta-reasoning attacks. Because PRMs read natural-language steps, a strong policy can construct chains of thought that "talk past" the verifier, using meta-commentary or hedging to inflate per-step scores without committing to verifiable claims.

Alternatives

Several alternative paradigms compete with or complement neural PRMs.

Rule-based RLVR. RLVR (reinforcement learning with verifiable rewards) sidesteps neural reward modeling entirely: the reward is computed by a rule (an equation checker, a code unit test, a string match) rather than by a learned scorer. This is the approach taken by DeepSeek-R1-Zero and has become standard for math and code reasoning.^[7] RLVR is robust to reward hacking but limited to tasks with mechanizable verifiers.
Generative reward models and LLM-as-judge. Generative reward models prompt an LLM to verbalize a verdict and extract a score from the response.^[11]^[12] These models are more transparent than discriminative PRMs and can scale verification compute at test time.
GRPO and process-free RL. GRPO (Group Relative Policy Optimization), introduced by DeepSeek in DeepSeekMath and adopted by DeepSeek-R1, normalizes rewards within a group of sampled responses, removing the need for a value-function critic.^[7] Combined with rule-based outcome rewards, GRPO provides a process-free RL signal that has scaled well on math and code without any neural PRM.
MCTS and value networks. Monte Carlo Tree Search approaches treat the PRM as one component of a search algorithm and complement it with explicit value networks and rollout policies. AlphaMath and ReST-MCTS are illustrative.
Hybrid pipelines. Many production systems filter training data with automated process rewards, do SFT on the high-quality subset, and then run RL with rule-based outcome rewards. This decouples PRM use (for data) from PRM use (in the RL loop), capturing dense-signal benefits while avoiding the worst reward-hacking pathologies.
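The group normalization in the GRPO item above can be written in a few lines. This is a minimal sketch of the advantage computation only, not of the full policy-gradient objective; the use of the population standard deviation and the small `eps` term are assumptions made to keep the example well defined.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Normalize each reward by the mean and standard deviation of its group
    (all responses sampled for the same prompt)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps keeps the division defined when every response gets the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled responses to one prompt, scored by a rule-based outcome check.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Correct responses receive a positive advantage and incorrect ones a negative advantage relative to the group, so no separate value network is needed; when all responses in a group score the same, every advantage is zero and the group contributes no learning signal.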

Recent variants and research directions

The PRM literature expanded rapidly during 2024 and 2025. Notable directions include:

Process Advantage Verifiers (PAVs). Setlur et al. (October 2024, arXiv:2410.08146) predict the marginal advantage of each step relative to a base policy and reported PAVs to be more than 8% more accurate and 1.5-5× more compute-efficient than ORMs for test-time search, with a 5-6× gain in sample efficiency when used as dense rewards in online RL.^[13]
Domain expansion. Beyond math, recent PRMs cover code generation, tool use (ToolPRMBench), scientific reasoning, and agentic decision-making.
Benchmarks. Beyond PRMBench, suites such as Socratic-PRMBench and the broader RewardBench / JudgeBench ecosystem standardize PRM evaluation.^[14]
Long-context and multimodal PRMs. As reasoning traces grow into hundreds of thousands of tokens, new PRMs handle long contexts and multimodal inputs such as visual math.
Generative process rewards. ThinkPRM and related approaches produce chain-of-thought critiques, blurring the line between a PRM and a reasoning model.^[12]
Surveys. A Survey of Process Reward Models (Zhou et al., arXiv:2510.08049) consolidates the field.^[15]

Research on PRMs sits at the intersection of reinforcement learning, test-time compute scaling, and AI alignment. The shift toward process supervision is widely seen as a key driver of the 2024-2026 wave of reasoning models, even as practitioners debate when neural PRMs are worth the engineering and reward-hacking costs relative to simpler rule-based rewards.

Datasets

The table below summarizes widely used datasets for training and evaluating PRMs.

Dataset	Year	Source of step labels	Approximate scale	Domain
PRM800K	2023	Human annotators (OpenAI)	~800K step labels, ~75K solutions, 12K MATH problems	Competition math
Math-Shepherd	2023	Automated Monte Carlo rollouts	~170K solutions (GSM8K) + ~270K (MATH)	Grade-school and competition math
OmegaPRM	2024	Automated divide-and-conquer MCTS	~1.5M step annotations	Mathematical reasoning
Math-PSA / OpenR	2024	Automated rollouts and curated mixtures	Several hundred thousand step labels	Math and general reasoning
PRMBench	2025	Curated with controlled error perturbations	6,216 problems, 83,456 step labels	Benchmark for evaluating PRMs

PRM800K remains the de facto reference dataset for human-labeled process supervision; Math-Shepherd and OmegaPRM are the most cited automated-labeling pipelines.

Open PRMs and reasoning systems

The table below lists openly released PRMs and reasoning-focused systems that use process supervision in some form.

Model or family	Developer	Year	Notes
GPT-4 process verifier	OpenAI	2023	Trained on PRM800K, in Let's Verify Step by Step
Math-Shepherd	DeepSeek and PKU authors	2023	First widely cited automated PRM
OmegaPRM	Google DeepMind	2024	Automated MCTS-based labeling, Gemini Pro and Gemma2
Qwen2.5-Math-PRM-7B / 72B	Alibaba Qwen team	2024	Public PRMs on PRM800K and automated data
Skywork-o1 Open PRMs	Kunlun Skywork	2024	1.5B and 7B Qwen-based PRMs
RLHFlow Llama3.1-8B-PRM	RLHFlow community	2024	Trained on DeepSeek-distilled trajectories
OpenAI o1 / o3	OpenAI	2024	RL reasoning models; PRM use undisclosed
DeepSeek-R1	DeepSeek	2025	Avoids neural PRMs; rule-based rewards
ThinkPRM	Academic	2025	Generative long-CoT PRM, label-efficient

References

Uesato, J., Kushman, N., Kumar, R., Song, F., Siegel, N., Wang, L., Creswell, A., Irving, G., Higgins, I. (2022). "Solving math word problems with process- and outcome-based feedback." arXiv:2211.14275. https://arxiv.org/abs/2211.14275
Lightman, H., Kosaraju, V., Burda, Y., Edwards, H., Baker, B., Lee, T., Leike, J., Schulman, J., Sutskever, I., Cobbe, K. (2023). "Let's Verify Step by Step." arXiv:2305.20050; ICLR 2024. https://arxiv.org/abs/2305.20050
Cobbe, K., Kosaraju, V., Bavarian, M., Chen, M., Jun, H., Kaiser, L., Plappert, M., Tworek, J., Hilton, J., Nakano, R., Hesse, C., Schulman, J. (2021). "Training Verifiers to Solve Math Word Problems." arXiv:2110.14168. https://arxiv.org/abs/2110.14168
OpenAI. "PRM800K: 800,000 step-level correctness labels on LLM solutions to MATH problems." GitHub repository. https://github.com/openai/prm800k
OpenAI. "Learning to reason with LLMs" (o1 announcement and analysis), September 2024. https://openai.com/index/learning-to-reason-with-llms/
OpenAI. "OpenAI o1 System Card," December 2024. https://cdn.openai.com/o1-system-card-20241205.pdf
DeepSeek-AI, Guo, D., Yang, D., Zhang, H., et al. (2025). "DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning." arXiv:2501.12948. https://arxiv.org/abs/2501.12948
Ouyang, L., Wu, J., Jiang, X., et al. (2022). "Training language models to follow instructions with human feedback" (InstructGPT). arXiv:2203.02155. https://arxiv.org/abs/2203.02155
Wang, P., Li, L., Shao, Z., Xu, R. X., Dai, D., Li, Y., Chen, D., Wu, Y., Sui, Z. (2023). "Math-Shepherd: Verify and Reinforce LLMs Step-by-step without Human Annotations." arXiv:2312.08935; ACL 2024. https://arxiv.org/abs/2312.08935
Luo, L., Liu, Y., Liu, R., Phatale, S., Lara, H., Li, Y., Shu, L., Zhu, Y., Meng, L., Sun, J., Rastogi, A. (2024). "Improve Mathematical Reasoning in Language Models by Automated Process Supervision." arXiv:2406.06592. https://arxiv.org/abs/2406.06592
Mahan, D., Van Phung, D., Rafailov, R., Blagden, C., Lile, N., Castricato, L., Fränken, J.-P., Finn, C., Albalak, A. (2024). "Generative Reward Models." arXiv:2410.12832. https://arxiv.org/abs/2410.12832
Khalifa, M., et al. (2025). "Process Reward Models That Think (ThinkPRM)." arXiv:2504.16828. https://arxiv.org/abs/2504.16828
Setlur, A., Nagpal, C., Fisch, A., Geng, X., Eisenstein, J., Agarwal, R., Agarwal, A., Berant, J., Kumar, A. (2024). "Rewarding Progress: Scaling Automated Process Verifiers for LLM Reasoning." arXiv:2410.08146. https://arxiv.org/abs/2410.08146
Song, M., Su, Z., Qu, X., Zhou, J., Cheng, Y. (2025). "PRMBench: A Fine-grained and Challenging Benchmark for Process-Level Reward Models." arXiv:2501.03124; ACL 2025. https://arxiv.org/abs/2501.03124
Zhou, Y., et al. (2025). "A Survey of Process Reward Models: From Outcome Signals to Process Supervisions for Large Language Models." arXiv:2510.08049. https://arxiv.org/abs/2510.08049
Qwen Team. "Towards Effective Process Supervision in Mathematical Reasoning." Qwen Blog, 2024. https://qwenlm.github.io/blog/qwen2.5-math-prm/
Skywork. "Skywork-o1 Open PRM model cards." Hugging Face, 2024. https://huggingface.co/Skywork/Skywork-o1-Open-PRM-Qwen-2.5-7B
