# Process reward model (PRM)

> Source: https://aiwiki.ai/wiki/process_reward_model
> Updated: 2026-06-23
> Categories: AI Safety, Machine Learning, Model Evaluation, Reinforcement Learning
> License: CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/)
> From AI Wiki (https://aiwiki.ai), the free encyclopedia of artificial intelligence. Reuse freely with attribution to "AI Wiki (aiwiki.ai)".

A **process reward model** (**PRM**), also called a **process-supervised reward model** or **step-level verifier**, is a learned scoring model that evaluates the correctness or quality of each intermediate step in a [large language model](/wiki/large_language_model)'s [chain of thought](/wiki/chain_of_thought), rather than judging only the final answer.[^1][^2] PRMs are contrasted with **outcome reward models** (**ORMs**), also known as outcome-supervised reward models, which assign a single scalar score after the last step of a reasoning trace based on whether the final answer is correct.[^1][^3] By providing dense, step-by-step feedback, PRMs offer a finer-grained training and verification signal that has proven especially useful for long, multi-step reasoning tasks such as mathematics, scientific problem solving, and code generation.[^2]

The modern PRM paradigm was popularized by the May 2023 paper *Let's Verify Step by Step* from [OpenAI](/wiki/openai), which released the **PRM800K** dataset of roughly 800,000 step-level human correctness labels and showed that a PRM-reranked solver reached 78.2% accuracy on a representative subset of the MATH test set, outperforming both an outcome-supervised verifier (72.4%) and majority voting (69.6%).[^2][^4] The paper's headline finding, stated in its abstract, is that "process supervision significantly outperforms outcome supervision for training models to solve problems from the challenging MATH dataset."[^2] The technique built on a 2022 paper by Jonathan Uesato and collaborators at [Google DeepMind](/wiki/google_deepmind), who carried out the first systematic comparison of process- and outcome-based feedback on GSM8K.[^1] PRMs have since become a core ingredient of test-time search procedures such as [best-of-N sampling](/wiki/best_of_n) and tree search, and have been associated with reasoning systems including [OpenAI o1](/wiki/o1), OpenAI o3, Google's Gemini Thinking variants, and open replications.[^5][^6] At the same time, large-scale deployments such as [DeepSeek-R1](/wiki/deepseek_r1) have explicitly avoided neural PRMs because of concerns over [reward hacking](/wiki/reward_hacking) and pipeline complexity, instead using rule-based verifiable rewards in the style of [RLVR](/wiki/rlvr).[^7] PRMs sit at the center of one of the most active debates in mid-2020s reasoning research: when is dense process supervision worth the engineering cost, and when do sparse verifiable rewards suffice?

## Background

A [reward model](/wiki/reward_model) in the context of language models is a function that takes a prompt and a model response and returns a scalar score reflecting how good the response is. Reward models are central to [reinforcement learning from human feedback](/wiki/rlhf) (RLHF), where they replace expensive human ratings during policy optimization with [PPO](/wiki/ppo) or related algorithms, and to inference-time procedures such as best-of-N ranking.[^8] Reward models also underlie [RLAIF](/wiki/rlaif) (reinforcement learning from AI feedback), where AI-labeled preferences substitute for human ones.

The first large-scale demonstration that a learned verifier could push small models past much larger ones came from Karl Cobbe and colleagues at OpenAI, who in 2021 released the GSM8K grade-school math benchmark together with an *outcome verifier*: a model that scored candidate solutions by whether the final numerical answer was correct, used to select the best of many samples at test time.[^3] This established the outcome verifier paradigm and showed that even a small verifier could match or exceed a much larger fine-tuned generator by exploiting [test-time compute](/wiki/test_time_compute).

For multi-step reasoning, however, outcome-only signals have a well-known weakness: a long chain of thought may contain a single critical error that derails the rest of the solution, yet the outcome label only reports the final result. An ORM cannot localize the failure, which makes it weak at credit assignment and prone to rewarding lucky-but-flawed chains. The central motivation for PRMs is to provide a *dense* signal that tells the model where its reasoning went wrong, not just whether it reached the right destination. Reward models for reasoning are categorized along two axes: granularity of supervision (per-solution vs per-step) and source of labels (humans, heuristics, or Monte Carlo rollouts). PRM800K-style human labeling sits at one extreme of cost and precision.

## How does a PRM differ from an outcome reward model?

The two reward-modeling paradigms differ in label structure, training cost, supervision density, and failure modes. The table below summarizes the central contrasts described across the literature.[^1][^2][^7]

| Property | Outcome reward model (ORM) | Process reward model (PRM) |
| --- | --- | --- |
| Label granularity | One label per full solution, usually correct or incorrect final answer | One label per reasoning step, judging local validity or progress |
| Signal density | Sparse, only at the end | Dense, at every step |
| Annotation cost | Low; free when an oracle answer key exists | High when labeled by humans, moderate when labeled by Monte Carlo rollouts |
| Credit assignment | Cannot localize errors within a long solution | Can identify the first incorrect step |
| Reward hacking risk | Lower, because rewards can be grounded in verifiable outcomes | Higher, because a neural model judges intermediate language and can be gamed |
| Typical use | Best-of-N reranking, RLHF on summary tasks, rule-based RLVR | Step-level verifiers for tree search, dense RL signals for reasoning |
| Generalization | Works for any task with a clear final-answer check | Stronger for long, multi-step domains such as math, science, and code |
| Sample efficiency in RL | Lower; sparse signal needs many rollouts | Higher in principle; dense signal accelerates credit assignment |

Critically, ORM and PRM are not mutually exclusive. In *Let's Verify Step by Step*, the same fully-generated solutions are scored by both verifier types and compared head-to-head; downstream systems often combine a rule-based outcome check with a neural PRM signal to anchor learning and reduce hackability.[^2]

## Who introduced process supervision? Uesato et al. (2022)

The phrase "process-supervised reward model" and the systematic comparison between process and outcome supervision were introduced by Jonathan Uesato, Nate Kushman, Ramana Kumar, Francis Song, Noah Siegel, Lisa Wang, Antonia Creswell, [Geoffrey Irving](/wiki/geoffrey_irving), and Irina Higgins of Google DeepMind in their November 2022 arXiv preprint *Solving math word problems with process- and outcome-based feedback*.[^1]

The Uesato et al. paper conducted what the authors called "the first comprehensive comparison between process- and outcome-based approaches trained on a natural language task," using the [GSM8K](/wiki/gsm8k) grade-school math benchmark.[^1] All models were based on a 70-billion-parameter Chinchilla-style language model from Hoffmann et al. (2022). The authors compared outcome-supervised and process-supervised variants of both fine-tuning and reward modeling.

The headline results were nuanced. Both process- and outcome-based feedback reached comparable *final-answer* error rates (improving the previous best from 16.8% to 12.7% on GSM8K), suggesting outcome supervision can match process supervision on the final metric while using cheaper labels.[^1] But the picture changed when the authors looked at *reasoning quality*: among solutions with correct final answers, the rate of *trace errors* (answers that arrived at the right number through invalid reasoning) fell from 14.0% to 3.4% only when process-based supervision (or a learned reward model emulating it) was used. Reward-model reranking dropped trace error from 11.4% to under 5% in their best configuration.[^1]

The paper's main conclusion was that outcome supervision is sufficient for *answer accuracy*, but process supervision (or a reward model trained to imitate it) is necessary for *correct reasoning steps*. This established the alignment-flavored motivation for PRMs, namely generating chains of thought that humans can trust, which would later be central to *Let's Verify Step by Step*.

## What did Let's Verify Step by Step show? (Lightman et al., 2023)

The May 2023 paper *Let's Verify Step by Step*, by Hunter Lightman, Vineet Kosaraju, Yura Burda, Harri Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe of OpenAI, brought process supervision into the modern era of [large language models](/wiki/large_language_model) by combining it with [GPT-4](/wiki/gpt-4)-scale generators and a large human-labeled step dataset.[^2] The paper appeared as arXiv:2305.20050 and was accepted to ICLR 2024. Its abstract states the result plainly: "we conduct our own investigation, finding that process supervision significantly outperforms outcome supervision for training models to solve problems from the challenging MATH dataset."[^2]

### Setup

The authors fine-tuned a generator from the base [GPT-4](/wiki/gpt-4) model, pre-trained on next-token prediction without [RLHF](/wiki/rlhf), to produce step-by-step solutions to problems from the [MATH](/wiki/math_dataset) benchmark of competition mathematics. They then trained two verifiers from the same base model: an outcome-supervised reward model (ORM) trained on labels indicating whether each candidate solution's final answer matched the reference, and a process-supervised reward model (PRM) trained on per-step human labels. Both verifiers were used at test time to rerank many sampled candidate solutions, a form of [best-of-N sampling](/wiki/best_of_n).

### What is PRM800K?

To train the PRM, the OpenAI team collected **PRM800K**, a public dataset of step-level human feedback labels on GPT-4-generated solutions to MATH problems, described in the paper's abstract as "the complete dataset of 800,000 step-level human feedback labels used to train our best reward model."[^2][^4] The headline statistic is roughly 800,000 step-level labels; the released archive contains approximately 1,085,590 step labels across 101,599 solution samples to 12,000 MATH problems, with a phase-1 / phase-2 structure (phase 2 is the curated 800K-label release used to train the best PRM).[^4]

Each step was annotated with one of three labels: **positive** (correct), **neutral** (ambiguous), or **negative** (incorrect). For PRM training, the three-way labels are typically collapsed to a binary target. The dataset is hosted on GitHub at `openai/prm800k` and remains the primary public resource for human-labeled PRM research.[^4]

The authors also used **active learning** to choose which solutions to label next, surfacing "convincing wrong-answer" solutions: those that the current PRM rated highly but that reached an incorrect final answer. They reported that active learning was roughly 2.6x more data-efficient than uniform labeling, materially reducing human-labeling cost.[^2]

### Empirical results on MATH

In the paper's large-scale experiments, the PRM-based reranker, selecting the best of N = 1860 sampled solutions per problem, reached **78.2% accuracy** on a representative subset of the MATH test set, beating both the ORM-based reranker (**72.4%**) and majority voting (**69.6%**).[^2] Increasing N traced out the test-time compute scaling curve that became the empirical core of the paper, and the performance gap between PRM and ORM widened as the number of sampled solutions grew, indicating that the PRM is more effective at searching over many candidate solutions.[^2]

The paper also argued for an alignment dimension: process supervision is more aligned with how humans evaluate reasoning, because it rewards getting *each step* right rather than only the final number. This argument became one of the most cited rationales for putting process supervision at the heart of subsequent reasoning systems.

## How are PRMs trained?

A typical PRM is a transformer with the same architecture as the base generator, fine-tuned on a sequence-classification objective. Inputs are formatted as a problem followed by a partial solution ending at a candidate step, and the model produces a scalar score reflecting the probability that the step is correct or that the partial solution will lead to a correct answer.

### Human labeling

The PRM800K-style pipeline relies on trained human annotators reading model-generated step-by-step solutions and assigning one of {good, neutral, bad} to each step. To control quality, OpenAI used multi-stage onboarding, periodic relabeling, and instructions distinguishing "is this step in a valid solution?" from "is this step correct?".[^2] Human labeling produces high-quality data but is slow and expensive, which has motivated automated alternatives.

### Automated labeling via Monte Carlo rollouts

A wave of follow-up research has investigated whether process labels can be generated automatically by sampling additional rollouts from a model and using the empirical probability that a step leads to a correct final answer as a proxy for its correctness.[^9][^10] Concretely, given a partial solution ending at step *t*, a completer LLM samples many completions; the fraction of those completions reaching the correct final answer is taken as the "value" of step *t*.
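
The rollout-based estimate described above can be sketched in a few lines. This is a minimal illustration, not any paper's implementation: `mc_step_values`, `toy_complete`, and the `complete` / `is_correct` callables are hypothetical names, and the toy completer is deterministic so that the result is easy to check by hand. A real pipeline would sample from a completer LLM instead.

```python
from typing import Callable, List


def mc_step_values(
    problem: str,
    steps: List[str],
    complete: Callable[[str, List[str]], str],
    is_correct: Callable[[str], bool],
    num_rollouts: int = 8,
) -> List[float]:
    """For each prefix steps[:t], return the fraction of rollouts from that
    prefix whose final answer is judged correct."""
    values = []
    for t in range(1, len(steps) + 1):
        prefix = steps[:t]
        # Sample num_rollouts completions from the partial solution and
        # count how many end in a correct final answer.
        hits = sum(
            is_correct(complete(problem, prefix)) for _ in range(num_rollouts)
        )
        values.append(hits / num_rollouts)
    return values


def toy_complete(problem: str, prefix: List[str]) -> str:
    """Hypothetical stand-in for a completer LLM: once a prefix contains a
    bad step, every completion ends in a wrong answer."""
    return "0" if any("ERROR" in s for s in prefix) else "42"


demo_steps = ["set up equation", "simplify", "ERROR: sign flip", "solve"]
demo_values = mc_step_values(
    "toy problem", demo_steps, toy_complete, lambda ans: ans == "42", num_rollouts=4
)
```

With this toy completer the estimated values are 1.0 for the two steps before the bad step and 0.0 from the bad step onward. With a stochastic completer the values would be noisy fractions, which is one reason such labels are usually treated as proxies rather than ground truth.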

**Math-Shepherd** (Wang et al., December 2023, arXiv:2312.08935) was the first widely cited automated PRM along these lines.[^9] The authors generated automatic step-level supervision for roughly 170,000 solutions on GSM8K and 270,000 on MATH, then used the resulting PRM both as a verifier and as a dense reward in step-by-step [PPO](/wiki/ppo). Math-Shepherd lifted Mistral-7B from 77.9% to 84.1% on GSM8K and from 28.6% to 33.0% on MATH, with further gains to 89.1% and 43.5% when the PRM was also used for inference-time verification.[^9]

**OmegaPRM** (Luo et al., June 2024, Google DeepMind, arXiv:2406.06592) extended automated labeling with a divide-and-conquer [Monte Carlo Tree Search](/wiki/mcts) algorithm that uses binary search to efficiently locate the first incorrect step.[^10] The authors collected over 1.5 million process-supervision annotations and used them to train a PRM that, combined with weighted self-consistency, raised the instruction-tuned Gemini Pro model from 51% to 69.4% on MATH500 and from 86.4% to 93.6% on GSM8K, and lifted Gemma2-27B from 42.3% to 58.2% on MATH500, all without human labels.[^10]
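
The binary-search idea can be illustrated in isolation. The sketch below is a simplification and not the OmegaPRM algorithm itself, which embeds the search in a tree-search procedure. It assumes a hypothetical `prefix_value` function (for example a Monte Carlo estimate) that is positive for every prefix ending before the first incorrect step and zero for every prefix that includes it.

```python
from typing import Callable, List, Optional


def first_error_index(
    steps: List[str],
    prefix_value: Callable[[List[str]], float],
) -> Optional[int]:
    """Return the index of the first incorrect step, or None if the full
    solution still has positive estimated value.

    Assumes prefix_value is monotone: positive before the first error,
    zero for any prefix that contains it.
    """
    if not steps or prefix_value(steps) > 0:
        return None
    lo, hi = 0, len(steps) - 1  # invariant: the first error lies in [lo, hi]
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix_value(steps[: mid + 1]) > 0:
            lo = mid + 1  # steps[0..mid] are all fine; error is later
        else:
            hi = mid  # error is at mid or earlier
    return lo


def toy_value(prefix: List[str]) -> float:
    """Hypothetical value function: zero once a bad step is in the prefix."""
    return 0.0 if any(s.startswith("BAD") for s in prefix) else 1.0


demo_index = first_error_index(["s0", "s1", "BAD s2", "s3", "s4"], toy_value)
```

Under the monotonicity assumption, the search needs a number of value estimates that grows logarithmically with the number of steps, rather than one estimate per step.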

### Generative PRMs and PRMs that think

A more recent direction reframes the PRM as a *generator* rather than a classifier. **Generative reward models** and **LLM-as-judge** systems use a language model to produce a chain-of-thought critique of a step, then extract a verdict from the critique.[^11] The 2025 paper *Process Reward Models That Think* introduces **ThinkPRM**, a long-CoT verifier fine-tuned on orders of magnitude fewer process labels than discriminative PRMs require. Using only about 1% of the process labels in PRM800K, it is reported to outperform LLM-as-judge baselines under a comparable token budget.[^12] These approaches blur the line between a "reward model" and a "reasoning model" and connect PRM research to the broader [test-time compute](/wiki/test_time_compute) literature.

### Design choices

Three design choices have proven important:

- **Step segmentation.** Solutions are split into steps using newlines, numbered enumerations, or a dedicated step token. Coarse segmentation reduces label cost but loses precision.
- **Label conversion.** PRM800K's three-way labels are typically collapsed to binary. Qwen2.5-Math-PRM-7B, for instance, treats 1 and 0 as positive and -1 as negative.[^16]
- **Scoring objective.** Some PRMs predict per-step correctness independently. Others predict the *value* of the partial solution as a Q-value, the probability that the partial solution will eventually reach a correct answer. Process Advantage Verifiers (PAVs) go further still, predicting the *advantage* of each step relative to a base policy rather than its absolute correctness.[^13]
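
The first two design choices are simple enough to show directly. The sketch below assumes newline-based segmentation and the convention mentioned above in which labels 1 and 0 map to positive and -1 maps to negative; the function names and the `neutral_as_positive` switch are illustrative, not taken from any released codebase.

```python
from typing import List


def split_steps(solution: str) -> List[str]:
    """Segment a solution into steps on newlines, dropping blank lines."""
    return [line.strip() for line in solution.split("\n") if line.strip()]


def to_binary(label: int, neutral_as_positive: bool = True) -> int:
    """Collapse a three-way step label (1, 0, -1) to a binary target.

    With neutral_as_positive=True, 1 and 0 become positive (1) and -1
    becomes negative (0). Setting it to False treats neutral as negative.
    """
    if label not in (-1, 0, 1):
        raise ValueError(f"unexpected label: {label}")
    if label == 0:
        return 1 if neutral_as_positive else 0
    return 1 if label == 1 else 0


demo_split = split_steps("Let x = 3.\n\n  Then 2x = 6.  \nSo the answer is 6.")
demo_targets = [to_binary(label) for label in (1, 0, -1)]
```

How neutral labels are handled is a modeling decision: mapping them to positive makes the PRM more permissive about ambiguous steps, while mapping them to negative makes it stricter.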

## What are PRMs used for at inference?

PRMs are used in several distinct ways at inference, often in combination:

- **Best-of-N sampling.** The generator produces N candidate solutions; the PRM scores each by aggregating per-step "correct" probabilities (commonly via the *product* or *minimum*, both of which penalize any clearly-wrong step); the highest-scoring candidate is returned.[^2] This was the headline application in *Let's Verify Step by Step* and remains the most common test-time use case.
- **Weighted self-consistency.** PRM scores are combined with [majority voting](/wiki/self_consistency) across sampled final answers. Each sampled solution votes for its final answer with a weight given by its PRM score, and the answer with the largest total weight is returned.[^10] The combination tends to dominate either ingredient alone.
- **Tree search and MCTS.** For long-horizon problems, the PRM acts as a value or prior network for [Monte Carlo Tree Search](/wiki/mcts): candidate continuations are scored, promising branches are expanded, and low-scoring branches are pruned. This is the natural extension of PRM use from rerankers to search procedures and is implemented in open frameworks such as OpenR.
- **Self-correction loops.** A PRM can flag the location of the first low-scoring step; the generator is then asked to revise from that point. This turns the verifier into an editor that drives iterative refinement.
- **Dense reward in RL.** In policy optimization, a PRM can provide a per-step reward signal for [PPO](/wiki/ppo) or [GRPO](/wiki/grpo), turning the sparse outcome signal into a dense advantage estimate. Math-Shepherd and follow-ups demonstrated that this improves both sample efficiency and final accuracy on math benchmarks.[^9]
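
Several of these uses reduce to simple operations on per-step scores. The following sketch assumes each candidate is a pair of a final answer and a list of per-step "correct" probabilities already produced by some PRM; the function names, the toy scores, and the 0.5 threshold are illustrative assumptions.

```python
import math
from collections import defaultdict
from typing import Dict, List, Optional, Sequence, Tuple

Candidate = Tuple[str, List[float]]  # (final answer, per-step probabilities)


def aggregate(step_probs: Sequence[float], how: str = "product") -> float:
    """Collapse per-step probabilities into one solution-level score."""
    if how == "product":
        return math.prod(step_probs)
    if how == "min":
        return min(step_probs)
    raise ValueError(f"unknown aggregation: {how}")


def best_of_n(candidates: List[Candidate], how: str = "product") -> str:
    """Return the final answer of the single highest-scoring candidate."""
    return max(candidates, key=lambda c: aggregate(c[1], how))[0]


def weighted_self_consistency(candidates: List[Candidate], how: str = "product") -> str:
    """Sum solution scores per final answer and return the heaviest answer."""
    totals: Dict[str, float] = defaultdict(float)
    for answer, probs in candidates:
        totals[answer] += aggregate(probs, how)
    return max(totals, key=totals.get)


def first_low_step(step_probs: Sequence[float], threshold: float = 0.5) -> Optional[int]:
    """Index of the first step scored below threshold, or None if none is."""
    for i, p in enumerate(step_probs):
        if p < threshold:
            return i
    return None


demo_candidates: List[Candidate] = [
    ("7", [0.9, 0.9, 0.9]),
    ("5", [0.6, 0.7]),
    ("5", [0.7, 0.6]),
    ("5", [0.5, 0.8]),
]
demo_best = best_of_n(demo_candidates)
demo_vote = weighted_self_consistency(demo_candidates)
```

In this toy example best-of-N picks "7", because one candidate has the highest individual score, while weighted self-consistency picks "5", because three moderately scored candidates agree on it. The two procedures can therefore disagree on the same set of samples.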

## How do PRMs relate to OpenAI o1 and test-time compute scaling?

[OpenAI o1](/wiki/o1), released in preview in September 2024 and as a full model in December 2024, was the first widely deployed [reasoning model](/wiki/reasoning_model) to scale long internal chains of thought at inference time. OpenAI has not published the full training recipe, but the o1 announcement and system card emphasize that the model was trained with large-scale reinforcement learning that rewards productive chains of thought, with performance improving both with more train-time RL compute and with more test-time thinking compute.[^5][^6]

Many third-party analyses cite *Let's Verify Step by Step* and PRM800K as the canonical methodological precursors of o1, in part because Hunter Lightman and several PRM-paper co-authors later worked on the reasoning teams that produced o1. The precise extent to which o1 uses a neural PRM in its training loop remains undisclosed; OpenAI has emphasized large-scale RL with verifiable rewards on math and code, which can be implemented without a PRM, but the public record is consistent with PRMs playing a role in candidate selection, search, or curriculum construction.[^5] OpenAI o3, announced in December 2024, extends the same paradigm. The connection between PRMs and o1 is clearest at the *concept* level: the PRM literature established that adding inference compute plus a quality verifier delivers reliable accuracy gains, the trade-off that o1 exposes to end users as "thinking time."

## Why did DeepSeek-R1 not use a PRM?

[DeepSeek-R1](/wiki/deepseek_r1), released in January 2025, is notable for *explicitly* avoiding neural PRMs and ORMs and instead training with a rule-based reward function. The DeepSeek-R1 technical report (Guo et al., arXiv:2501.12948) states that "we do not apply the outcome or process neural reward model in developing DeepSeek-R1-Zero, because we find that the neural reward model may suffer from reward hacking in the large-scale reinforcement learning process, and retraining the reward model needs additional training resources and it complicates the whole training pipeline."[^7] The report discusses PRMs in detail and gives three reasons:[^7]

1. **Step definition is hard in general reasoning.** Defining a fine-grained reasoning step is challenging in non-mathematical domains.
2. **Step correctness is hard to judge.** Automated annotation is noisy, and manual annotation does not scale.
3. **Reward hacking under large-scale RL.** Once a model-based PRM is introduced, the report argues, "it inevitably leads to reward hacking": the policy produces confident-sounding intermediate steps that the PRM scores highly without progressing toward correct answers. Retraining the reward model is expensive and complicates the pipeline.

Instead, **DeepSeek-R1-Zero**, the variant trained purely with RL from a base model without supervised fine-tuning, used a rule-based reward with two components: an **accuracy reward** that checks final answers against ground truth (e.g., comparing numerical answers, executing generated code), and a **format reward** that requires the model to wrap reasoning in `<think>...</think>` tags. The full DeepSeek-R1 system added a cold-start SFT stage, language-consistency rewards, and a final RLHF stage with a more general reward model, but the core RL signal for reasoning remained rule-based throughout. This produced strong math and code performance, competitive with o1 on several benchmarks, while sidestepping neural PRMs entirely.[^7]
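
A rule-based reward of this shape can be written without any learned model. The sketch below is an illustration under stated assumptions, not DeepSeek's implementation: it assumes the answer is whatever text follows the closing `</think>` tag, uses exact string match as the accuracy check, and adds the two components with equal weight. All function names are hypothetical.

```python
import re

# Reasoning wrapped in <think>...</think>, followed by a non-empty answer.
THINK_RE = re.compile(r"^<think>(.+?)</think>\s*(.+)$", re.DOTALL)


def format_reward(response: str) -> float:
    """1.0 if the response wraps its reasoning in <think> tags, else 0.0."""
    return 1.0 if THINK_RE.match(response.strip()) else 0.0


def extract_answer(response: str) -> str:
    """Text after the think block, or the whole response if tags are absent."""
    match = THINK_RE.match(response.strip())
    return (match.group(2) if match else response).strip()


def accuracy_reward(response: str, reference: str) -> float:
    """1.0 if the extracted answer exactly matches the reference, else 0.0."""
    return 1.0 if extract_answer(response) == reference.strip() else 0.0


def rule_based_reward(response: str, reference: str) -> float:
    """Sum of accuracy and format rewards (equal weighting is an assumption)."""
    return accuracy_reward(response, reference) + format_reward(response)


demo_reward = rule_based_reward("<think>2 + 2 = 4</think> 4", "4")
```

Because both components are computed by fixed rules, there is no learned scorer for the policy to exploit, although the checks themselves can still be too strict or too lenient (for example, exact match rejects "4.0" when the reference is "4").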

DeepSeek-R1's report is widely cited as evidence that, for tasks with cheap verifiable answers, *sparse outcome rewards can be sufficient* and *safer* than neural PRMs, and as a key data point in the rise of [RLVR](/wiki/rlvr) as an alternative paradigm.

## Limitations

Despite their attractive properties, PRMs face several well-documented limitations.

- **Annotation cost.** Step-level human labels are dramatically more expensive than outcome labels; even automated rollouts cost more than collecting outcome labels for the same problems.
- **Reward hacking.** When a neural PRM is plugged into a large-scale RL loop, the policy can learn to produce confident-sounding but vacuous intermediate steps that the PRM scores highly without progressing toward a correct answer. DeepSeek-R1's report cited this as a primary reason for avoiding neural PRMs altogether.[^7]
- **Step definition.** It is often unclear what a "step" should be. In math, an algebraic manipulation is a natural step; in code or scientific reasoning the boundaries are fuzzy.
- **Generalization (out-of-distribution).** PRMs trained on competition math often transfer poorly to other reasoning tasks, especially open-ended ones without easily verifiable answers.
- **Calibration.** PRMs trained on Monte Carlo-derived labels can be miscalibrated; value estimates depend on the completer LLM's distribution of completions.
- **Evaluation difficulty.** A PRM that scores well on held-out PRM800K labels can still fail to detect subtle reasoning errors. The 2025 **PRMBench** benchmark (Song et al., arXiv:2501.03124) evaluated 25 open and closed PRMs and reported significant weaknesses on subtle error detection.[^14]
- **Meta-reasoning attacks.** Because PRMs read natural-language steps, a strong policy can construct chains of thought that "talk past" the verifier, using meta-commentary or hedging to inflate per-step scores without committing to verifiable claims.

## What are the alternatives to PRMs?

Several alternative paradigms compete with or complement neural PRMs.

- **Rule-based RLVR.** [RLVR](/wiki/rlvr) (reinforcement learning with verifiable rewards) sidesteps neural reward modeling entirely: the reward is computed by a rule (an equation checker, a code unit test, a string match) rather than by a learned scorer. This is the approach taken by DeepSeek-R1-Zero and has become standard for math and code reasoning.[^7] RLVR is more robust to reward hacking than learned reward models, but it is limited to tasks with mechanizable verifiers.
- **Generative reward models and LLM-as-judge.** Generative reward models prompt an LLM to *verbalize* a verdict and extract a score from the response.[^11][^12] These models are more transparent than discriminative PRMs and can scale verification compute at test time.
- **GRPO and process-free RL.** [GRPO](/wiki/grpo) (Group Relative Policy Optimization), introduced by DeepSeek in DeepSeekMath and adopted by DeepSeek-R1, normalizes rewards within a group of sampled responses, removing the need for a value-function critic.[^7] Combined with rule-based outcome rewards, GRPO provides a process-free RL signal that has scaled well on math and code without any neural PRM.
- **MCTS and value networks.** [Monte Carlo Tree Search](/wiki/mcts) approaches treat the PRM as one component of a search algorithm and complement it with explicit value networks and rollout policies. AlphaMath and ReST-MCTS are illustrative.
- **Hybrid pipelines.** Many production systems filter training data with automated process rewards, do SFT on the high-quality subset, and then run RL with rule-based outcome rewards. This decouples PRM use (for data) from PRM use (in the RL loop), capturing dense-signal benefits while avoiding the worst reward-hacking pathologies.
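
The group normalization mentioned for GRPO can be sketched as follows. This shows only the advantage computation, under the assumption that each response in a group receives one scalar reward and that advantages are the rewards standardized by the group mean and standard deviation; the use of the population standard deviation and the `eps` term are implementation assumptions.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Standardize rewards within one group of sampled responses.

    Each response's advantage is its reward minus the group mean, divided by
    the group standard deviation, so no learned value function is needed.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled responses to one prompt, scored by a rule-based outcome check.
demo_advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

For the toy rewards above, correct responses get an advantage close to +1 and incorrect ones close to -1. If every response in a group receives the same reward, all advantages are zero and that group contributes no learning signal.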

## Recent variants and research directions

The PRM literature expanded rapidly during 2024 and 2025. Notable directions include:

- **Process Advantage Verifiers (PAVs).** Setlur et al. (October 2024, arXiv:2410.08146) predict the marginal *advantage* of each step relative to a base policy and reported PAVs to be roughly 8-10 percentage points more accurate and 1.5-5x more compute-efficient than ORMs for test-time search, with around 6x better RL data efficiency.[^13]
- **Domain expansion.** Beyond math, recent PRMs cover code generation, tool use (ToolPRMBench), scientific reasoning, and agentic decision-making.
- **Benchmarks.** Beyond PRMBench, suites such as Socratic-PRMBench and the broader RewardBench / JudgeBench ecosystem standardize PRM evaluation.[^14]
- **Long-context and multimodal PRMs.** As reasoning traces grow into hundreds of thousands of tokens, new PRMs handle long contexts and multimodal inputs such as visual math.
- **Generative process rewards.** ThinkPRM and related approaches produce chain-of-thought critiques, blurring the line between a PRM and a reasoning model.[^12]
- **Surveys.** *A Survey of Process Reward Models* (Zhou et al., arXiv:2510.08049) consolidates the field.[^15]

Research on PRMs sits at the intersection of [reinforcement learning](/wiki/reinforcement_learning), test-time compute scaling, and [AI alignment](/wiki/ai_alignment). The shift toward process supervision is widely seen as a key driver of the 2024-2026 wave of reasoning models, even as practitioners debate when neural PRMs are worth the engineering and reward-hacking costs relative to simpler rule-based rewards.

## Datasets

The table below summarizes widely used datasets for training and evaluating PRMs.

| Dataset | Year | Source of step labels | Approximate scale | Domain |
| --- | --- | --- | --- | --- |
| PRM800K | 2023 | Human annotators (OpenAI) | ~800K step labels, ~75K solutions, 12K MATH problems | Competition math |
| Math-Shepherd | 2023 | Automated Monte Carlo rollouts | ~170K solutions (GSM8K) + ~270K (MATH) | Grade-school and competition math |
| OmegaPRM | 2024 | Automated divide-and-conquer MCTS | ~1.5M step annotations | Mathematical reasoning |
| Math-PSA / OpenR | 2024 | Automated rollouts and curated mixtures | Several hundred thousand step labels | Math and general reasoning |
| PRMBench | 2025 | Curated with controlled error perturbations | 6,216 problems, 83,456 step labels | Benchmark for evaluating PRMs |

PRM800K remains the de facto reference dataset for human-labeled process supervision; Math-Shepherd and OmegaPRM are the most cited automated-labeling pipelines.

## Open PRMs and reasoning systems

The table below lists openly released PRMs and reasoning-focused systems that use process supervision in some form.

| Model or family | Developer | Year | Notes |
| --- | --- | --- | --- |
| GPT-4 process verifier | OpenAI | 2023 | Trained on PRM800K, in *Let's Verify Step by Step* |
| Math-Shepherd | DeepSeek and PKU authors | 2023 | First widely cited automated PRM |
| OmegaPRM | Google DeepMind | 2024 | Automated MCTS-based labeling, Gemini Pro and Gemma2 |
| Qwen2.5-Math-PRM-7B / 72B | Alibaba Qwen team | 2024 | Public PRMs on PRM800K and automated data |
| Skywork-o1 Open PRMs | Kunlun Skywork | 2024 | 1.5B and 7B Qwen-based PRMs[^17] |
| RLHFlow Llama3.1-8B-PRM | RLHFlow community | 2024 | Trained on DeepSeek-distilled trajectories |
| OpenAI o1 / o3 | OpenAI | 2024 | RL reasoning models; PRM use undisclosed |
| DeepSeek-R1 | DeepSeek | 2025 | Avoids neural PRMs; rule-based rewards |
| ThinkPRM | Academic | 2025 | Generative long-CoT PRM, label-efficient |

## See also

- [Reward model](/wiki/reward_model)
- [Outcome reward model](/wiki/orm)
- [Reinforcement learning from human feedback](/wiki/rlhf)
- [Reinforcement learning from AI feedback](/wiki/rlaif)
- [Reinforcement learning with verifiable rewards](/wiki/rlvr)
- [Chain-of-thought prompting](/wiki/chain_of_thought)
- [Reasoning model](/wiki/reasoning_model)
- [OpenAI o1](/wiki/o1)
- [DeepSeek-R1](/wiki/deepseek_r1)
- [Best-of-N sampling](/wiki/best_of_n)
- [Monte Carlo Tree Search](/wiki/mcts)
- [MATH (dataset)](/wiki/math_dataset)
- [PPO](/wiki/ppo)
- [GRPO](/wiki/grpo)
- [Test-time compute](/wiki/test_time_compute)
- [Reward hacking](/wiki/reward_hacking)

## References

[^1]: Uesato, J., Kushman, N., Kumar, R., Song, F., Siegel, N., Wang, L., Creswell, A., Irving, G., Higgins, I. (2022). "Solving math word problems with process- and outcome-based feedback." arXiv:2211.14275. https://arxiv.org/abs/2211.14275

[^2]: Lightman, H., Kosaraju, V., Burda, Y., Edwards, H., Baker, B., Lee, T., Leike, J., Schulman, J., Sutskever, I., Cobbe, K. (2023). "Let's Verify Step by Step." arXiv:2305.20050; ICLR 2024. https://arxiv.org/abs/2305.20050

[^3]: Cobbe, K., Kosaraju, V., Bavarian, M., Chen, M., Jun, H., Kaiser, L., Plappert, M., Tworek, J., Hilton, J., Nakano, R., Hesse, C., Schulman, J. (2021). "Training Verifiers to Solve Math Word Problems." arXiv:2110.14168. https://arxiv.org/abs/2110.14168

[^4]: OpenAI. "PRM800K: 800,000 step-level correctness labels on LLM solutions to MATH problems." GitHub repository. https://github.com/openai/prm800k

[^5]: OpenAI. "Learning to reason with LLMs" (o1 announcement and analysis), September 2024. https://openai.com/index/learning-to-reason-with-llms/

[^6]: OpenAI. "OpenAI o1 System Card," December 2024. https://cdn.openai.com/o1-system-card-20241205.pdf

[^7]: DeepSeek-AI, Guo, D., Yang, D., Zhang, H., et al. (2025). "DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning." arXiv:2501.12948. https://arxiv.org/abs/2501.12948

[^8]: Ouyang, L., Wu, J., Jiang, X., et al. (2022). "Training language models to follow instructions with human feedback" (InstructGPT). arXiv:2203.02155. https://arxiv.org/abs/2203.02155

[^9]: Wang, P., Li, L., Shao, Z., Xu, R. X., Dai, D., Li, Y., Chen, D., Wu, Y., Sui, Z. (2023). "Math-Shepherd: Verify and Reinforce LLMs Step-by-step without Human Annotations." arXiv:2312.08935; ACL 2024. https://arxiv.org/abs/2312.08935

[^10]: Luo, L., Liu, Y., Liu, R., Phatale, S., Lara, H., Li, Y., Shu, L., Zhu, Y., Meng, L., Sun, J., Rastogi, A. (2024). "Improve Mathematical Reasoning in Language Models by Automated Process Supervision." arXiv:2406.06592. https://arxiv.org/abs/2406.06592

[^11]: Mahan, D., Van Phung, D., Rafailov, R., Blagden, C., Lile, N., Castricato, L., Fränken, J.-P., et al. (2024). "Generative Reward Models." arXiv:2410.12832. https://arxiv.org/abs/2410.12832

[^12]: Khalifa, M., et al. (2025). "Process Reward Models That Think (ThinkPRM)." arXiv:2504.16828. https://arxiv.org/abs/2504.16828

[^13]: Setlur, A., Nagpal, C., Fisch, A., Geng, X., Eisenstein, J., Agarwal, R., Agarwal, A., Berant, J., Kumar, A. (2024). "Rewarding Progress: Scaling Automated Process Verifiers for LLM Reasoning." arXiv:2410.08146. https://arxiv.org/abs/2410.08146

[^14]: Song, M., Su, Z., Qu, X., Zhou, J., Cheng, Y. (2025). "PRMBench: A Fine-grained and Challenging Benchmark for Process-Level Reward Models." arXiv:2501.03124; ACL 2025. https://arxiv.org/abs/2501.03124

[^15]: Zhou, Y., et al. (2025). "A Survey of Process Reward Models: From Outcome Signals to Process Supervisions for Large Language Models." arXiv:2510.08049. https://arxiv.org/abs/2510.08049

[^16]: Qwen Team. "Towards Effective Process Supervision in Mathematical Reasoning." Qwen Blog, 2024. https://qwenlm.github.io/blog/qwen2.5-math-prm/

[^17]: Skywork. "Skywork-o1 Open PRM model cards." Hugging Face, 2024. https://huggingface.co/Skywork/Skywork-o1-Open-PRM-Qwen-2.5-7B