ProcessBench
Last reviewed
May 31, 2026
Sources
10 citations
Review status
Source-backed
Revision
v1 · 2,404 words
ProcessBench is a benchmark for step-level verification of mathematical reasoning, built by the Qwen Team at Alibaba and released in December 2024. [1] It contains 3,400 competition-level math problems, each paired with a model-generated step-by-step solution that human experts annotated for the index of the earliest erroneous step. The task is simple to state: given a problem and a solution, find the first step that goes wrong, or conclude that every step is correct. ProcessBench has become the de facto evaluation for process reward models (PRMs) and for critic or verifier models, because it isolates the one skill those systems are supposed to have, which is judging individual reasoning steps rather than only final answers. [1]
Most math benchmarks score a model on whether its final answer matches the reference. That is easy to grade, but it hides a lot. A solution can reach the right number through a lucky cancellation of two mistakes, and a solution can be flawless for nine steps and then fumble the tenth. If you only check the last line, you cannot tell these cases apart, and you cannot supervise the reasoning itself.
The ProcessBench authors measured exactly how often this gap appears. On the harder subsets, a large share of solutions that produce a correct final answer still contain an erroneous intermediate step: 32.2% on OlympiadBench and 51.8% on Omni-MATH problems, against 3.5% on grade-school GSM8K and 18.8% on MATH. [1] The error rate climbs with difficulty, and it climbs for every solution generator they tested. That finding is one of the paper's sharper points: on hard problems, current large language models often write process errors even when they land on the right answer, which suggests a limit to reward schemes in reinforcement learning that grade only the final answer. [1]
Step-level verification is the alternative, and it underpins two things the field cares about. The first is process supervision, where a reward model scores each step so a policy can be trained or filtered on the quality of its reasoning rather than its luck. The second is test-time compute: when a model samples many candidate solutions, a verifier that can spot a broken step lets you rerank or prune those candidates, which is the engine behind best-of-N sampling and tree search. A benchmark that directly measures error localization tells you whether those verifiers actually work, instead of inferring their quality indirectly from downstream accuracy. [1]
ProcessBench draws its problems from the test sets of four public math datasets, chosen to span a wide difficulty range. GSM8K supplies grade-school word problems. [2] MATH supplies competition problems. [3] OlympiadBench [4] and Omni-MATH [5] supply olympiad-level problems, which are the hardest tier. GSM8K is the only subset below competition difficulty. [1]
The solutions are not human-written and not drawn from a single model. The team generated them with 12 distinct open-source generators from the Qwen and LLaMA families, spanning different sizes and post-training recipes, so that the benchmark would contain a diverse mix of solution styles rather than the quirks of one model. [1] Before annotation they standardized step granularity through a reformatting pass: they replaced existing line breaks with spaces and asked Qwen2.5-72B-Instruct to reinsert double line breaks at logically complete steps, then discarded any solution whose final answer changed after reformatting (under 0.5% of cases). [1]
Annotation was done by human experts with doctoral-level mathematical training, all of whom had to pass a proficiency exam and a tutorial first. Each solution started with three independent annotators. When the first three disagreed, the team added annotators until three of them agreed on the same answer; if no agreement emerged even with five annotators, the solution was discarded, which produced an overall discard rate of roughly 30%. [1] To keep the benchmark balanced, they used Qwen2.5-72B-Instruct to check final answers and then sampled solutions with correct and incorrect final answers in equal numbers, 200 of each for GSM8K and 500 of each for the other three subsets. Because some solutions with a correct final answer still contain a process error, the annotated error and correct sample counts are not equal. The table below shows the resulting composition.
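The annotator-agreement rule described above can be sketched in a few lines of Python. This is an illustrative reading of the protocol, not the team's tooling: the function name `consensus_label` and its parameters are hypothetical, and labels are assumed to be first-error indices with -1 meaning "no error".

```python
from collections import Counter
from typing import Iterable, Optional


def consensus_label(labels: Iterable[int],
                    start: int = 3,
                    agree: int = 3,
                    cap: int = 5) -> Optional[int]:
    """Return the agreed first-error index, or None if the solution is discarded.

    labels: annotator verdicts in the order the annotators were consulted.
    start:  number of annotators consulted before agreement is first checked.
    agree:  number of matching verdicts required.
    cap:    maximum number of annotators consulted before giving up.
    """
    counts = Counter()
    for n, label in enumerate(labels, start=1):
        if n > cap:
            break
        counts[label] += 1
        if n >= start:
            winner, votes = counts.most_common(1)[0]
            if votes >= agree:
                return winner
    return None  # no three annotators agreed: discard the solution


print(consensus_label([2, 2, 2]))          # all three initial annotators agree
print(consensus_label([2, 1, 2, 2]))       # a fourth annotator settles it
print(consensus_label([2, 1, 2, -1, 1]))   # five annotators, still no agreement
```

With five annotators and a requirement of three matching verdicts, at most one label can reach the bar, so the result does not depend on how ties are ordered.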
| Subset | Difficulty | Error samples | Correct samples | Total | Process-error rate among correct-answer solutions |
|---|---|---|---|---|---|
| GSM8K | Grade school | 207 | 193 | 400 | 3.5% |
| MATH | Competition | 594 | 406 | 1,000 | 18.8% |
| OlympiadBench | Olympiad | 661 | 339 | 1,000 | 32.2% |
| Omni-MATH | Olympiad | 759 | 241 | 1,000 | 51.8% |
| Total | Mixed | 2,221 | 1,179 | 3,400 | varies by subset |
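
The counts in the table are consistent with a simple reading of how they arise, sketched below in Python. The assumption, which is an inference from the figures rather than a statement quoted from the paper, is that each subset starts from equal pools of correct-answer and incorrect-answer solutions (200 for GSM8K, 500 elsewhere), that every incorrect-answer solution ends up as an error sample, and that the process-error rate moves the stated share of correct-answer solutions into the error column. The name `split_counts` is hypothetical.

```python
# subset -> (pool size per final-answer class, process-error rate among
#            correct-answer solutions, error samples, correct samples)
TABLE = {
    "GSM8K":         (200, 0.035, 207, 193),
    "MATH":          (500, 0.188, 594, 406),
    "OlympiadBench": (500, 0.322, 661, 339),
    "Omni-MATH":     (500, 0.518, 759, 241),
}


def split_counts(pool: int, process_error_rate: float) -> tuple:
    """Return (error_samples, correct_samples) under the assumption above."""
    moved = round(pool * process_error_rate)  # correct answer, flawed process
    return pool + moved, pool - moved


for name, (pool, rate, errors, correct) in TABLE.items():
    print(name, split_counts(pool, rate), (errors, correct))
```

For GSM8K, 3.5% of 200 is 7 solutions, which gives 207 error samples and 193 correct ones; the other three rows work out the same way.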
Formally, given a problem P and a solution split into steps S = {s_0, ..., s_(n-1)}, a model must output an index i. A value of i = -1 means every step is correct, and a value of i in {0, ..., n-1} means the earliest error occurs at step s_i. [1] Steps are indexed from zero, so a label of 2 points to the third step. ProcessBench deliberately targets only the first error: once a step rests on a wrong premise, later steps can be locally valid yet globally meaningless, which makes their individual correctness hard to define, so the benchmark stops at the earliest break. [1]
What counts as an error covers four kinds: mathematical mistakes (bad calculations, algebra, or formula use), logical mistakes (invalid deductions or unwarranted assumptions), conceptual mistakes (misapplying a definition), and completeness mistakes (missing a condition or justification the solution needs). Annotators were also free to mark a step wrong on their own expert judgment beyond those categories. [1]
The scoring is built to punish two failure modes at once. For each subset, the benchmark computes accuracy on the erroneous samples (did the model find the right first-error step?) and accuracy on the correct samples (did the model correctly say nothing is wrong?), then reports their harmonic mean as the F1 score. [1] The harmonic mean matters here. A model that flags an error in every solution would ace the erroneous samples and fail every correct one; a model that always says correct would do the reverse. Only a verifier that is both willing to find errors and able to recognize a clean solution scores well, which is why the paper reports F1 rather than raw accuracy. [1] For PRMs that emit a per-step scalar score, the scores are thresholded into binary correct/incorrect predictions and the earliest incorrect step is taken; the threshold is the one that maximizes F1 on the GSM8K subset. [1]
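A minimal Python sketch of the scoring described above. The function names, the convention that a score below the threshold marks a step as incorrect, and the candidate threshold grid are assumptions for illustration; this is not the paper's evaluation code. Scores are returned in the range 0 to 1 rather than as percentages.

```python
from typing import Sequence


def first_error_from_scores(step_scores: Sequence[float], threshold: float) -> int:
    """Turn per-step PRM scores into a first-error index (-1 if none)."""
    for i, score in enumerate(step_scores):
        if score < threshold:  # assumed convention: low score = incorrect step
            return i
    return -1


def subset_f1(predictions: Sequence[int], labels: Sequence[int]) -> float:
    """Harmonic mean of accuracy on erroneous and on correct samples."""
    pairs = list(zip(predictions, labels))
    on_errors = [p == y for p, y in pairs if y != -1]
    on_correct = [p == y for p, y in pairs if y == -1]
    acc_err = sum(on_errors) / len(on_errors) if on_errors else 0.0
    acc_cor = sum(on_correct) / len(on_correct) if on_correct else 0.0
    if acc_err + acc_cor == 0:
        return 0.0
    return 2 * acc_err * acc_cor / (acc_err + acc_cor)


def best_threshold(all_scores: Sequence[Sequence[float]],
                   labels: Sequence[int],
                   candidates: Sequence[float]) -> float:
    """Pick the candidate threshold with the highest F1 on a tuning subset."""
    def f1_at(t: float) -> float:
        preds = [first_error_from_scores(s, t) for s in all_scores]
        return subset_f1(preds, labels)
    return max(candidates, key=f1_at)


labels = [-1, 1, 0, -1]
print(subset_f1([0, 0, 0, 0], labels))      # always flags an error
print(subset_f1([-1, -1, -1, -1], labels))  # always says "correct"
print(subset_f1([-1, 1, 0, -1], labels))    # perfect verifier
```

Both degenerate strategies score zero because one of the two accuracies is zero, which is the behavior the harmonic mean is chosen for.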
ProcessBench compares two families. Process reward models are trained specifically to score intermediate steps. The paper evaluates several open-source PRMs: Math-Shepherd, two RLHFlow PRMs built on LLaMA-3.1, and two Skywork PRMs built on Qwen2.5-Math, plus one the authors trained themselves by fine-tuning Qwen2.5-Math-7B-Instruct on the PRM800K dataset. [1] Many of these PRMs derive their step labels from Monte Carlo estimation, which marks a step as good if continuations from it tend to reach the correct final answer.
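As a rough illustration of Monte Carlo step labeling, the Python sketch below takes, for each step, the outcomes of several sampled continuations (True if the continuation reached the correct final answer) and labels the step by its success rate. The function names and the 0.5 cutoff are assumptions made for this example; individual PRMs use their own rollout counts and labeling rules.

```python
from typing import List, Sequence


def mc_step_values(rollouts: Sequence[Sequence[bool]]) -> List[float]:
    """Fraction of continuations from each step that reach the correct answer."""
    return [sum(r) / len(r) if r else 0.0 for r in rollouts]


def mc_step_labels(rollouts: Sequence[Sequence[bool]],
                   cutoff: float = 0.5) -> List[bool]:
    """Label a step as good when its success rate exceeds the cutoff (assumed)."""
    return [value > cutoff for value in mc_step_values(rollouts)]


# Hypothetical outcomes: four continuations sampled after each of three steps.
rollouts = [
    [True, True, False, True],
    [True, False, False, False],
    [False, False, False, False],
]
print(mc_step_values(rollouts))
print(mc_step_labels(rollouts))
```

Labels of this kind depend on final-answer outcomes, so a flawed step whose continuations still happen to reach the right answer can be marked as good.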
Critic models are general large language models given a prompt that asks them to read a solution step by step and return the index of the first wrong paragraph as a final answer, in the same format used for ordinary math evaluation. [1] No special training is involved; the same task definition that PRMs are scored on lets any instruction-tuned model be repurposed as a critic. The paper prompts Qwen2, Qwen2.5, Qwen2.5-Math, Qwen2.5-Coder, and LLaMA-3 models, the QwQ-32B-Preview reasoning model, and the proprietary GPT-4o and o1-mini. Open-source critics are scored with majority voting over eight samples; o1-mini is run with a single sample because its API does not expose decoding parameters. [1] This setup is what lets ProcessBench function as a clean instrument for evaluating verification ability across very different system designs.
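Majority voting over sampled critic outputs can be sketched as follows in Python. The sketch assumes each sample has already been parsed into a first-error index, with None standing for an output that could not be parsed; dropping unparseable samples and breaking ties in favor of the index seen first are assumptions of this example, not details taken from the paper.

```python
from collections import Counter
from typing import Optional, Sequence


def majority_vote(indices: Sequence[Optional[int]]) -> Optional[int]:
    """Return the most frequent predicted index among parseable samples."""
    valid = [i for i in indices if i is not None]
    if not valid:
        return None
    # Counter.most_common orders equal counts by first occurrence.
    return Counter(valid).most_common(1)[0][0]


# Eight hypothetical samples from one critic on one solution.
samples = [2, 2, -1, 2, 3, 2, -1, 2]
print(majority_vote(samples))
```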
The headline result is that prompt-based critics from strong general models can rival or beat the existing PRMs, which was not the expected outcome for a task PRMs were designed for. The table below lists F1 on each of the four subsets and the average across them, taken from the paper's main results. [1]
| Model | Type | GSM8K | MATH | OlympiadBench | Omni-MATH | Average F1 |
|---|---|---|---|---|---|---|
| Math-Shepherd-PRM-7B | PRM | 47.9 | 29.5 | 24.8 | 23.8 | 31.5 |
| RLHFlow-PRM-Mistral-8B | PRM | 50.4 | 33.4 | 13.8 | 15.8 | 28.4 |
| RLHFlow-PRM-Deepseek-8B | PRM | 38.8 | 33.8 | 16.9 | 16.9 | 26.6 |
| Skywork-PRM-1.5B | PRM | 59.0 | 48.0 | 19.3 | 19.2 | 36.4 |
| Skywork-PRM-7B | PRM | 70.8 | 53.6 | 22.9 | 21.0 | 42.1 |
| Qwen2.5-Math-7B-PRM800K (authors' own) | PRM | 68.2 | 62.6 | 50.7 | 44.3 | 56.5 |
| Llama-3.3-70B-Instruct | Critic | 82.9 | 59.4 | 46.7 | 43.0 | 58.0 |
| Qwen2.5-72B-Instruct | Critic | 76.2 | 61.8 | 54.6 | 52.2 | 61.2 |
| QwQ-32B-Preview | Critic | 88.0 | 78.7 | 57.8 | 61.3 | 71.5 |
| GPT-4o-0806 | Critic | 79.2 | 63.6 | 51.4 | 53.5 | 61.9 |
| o1-mini | Critic | 93.2 | 88.9 | 87.2 | 82.4 | 87.9 |
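
The Average F1 column is consistent with an unweighted mean of the four subset scores, which the short Python check below recomputes from the table. That the average is unweighted is inferred from the numbers here; small gaps of up to 0.05 are expected because the table values are rounded to one decimal place. The names `ROWS`, `mean_f1`, and `max_rounding_gap` are illustrative.

```python
# model -> ([GSM8K, MATH, OlympiadBench, Omni-MATH], reported average F1)
ROWS = {
    "Math-Shepherd-PRM-7B":    ([47.9, 29.5, 24.8, 23.8], 31.5),
    "RLHFlow-PRM-Mistral-8B":  ([50.4, 33.4, 13.8, 15.8], 28.4),
    "RLHFlow-PRM-Deepseek-8B": ([38.8, 33.8, 16.9, 16.9], 26.6),
    "Skywork-PRM-1.5B":        ([59.0, 48.0, 19.3, 19.2], 36.4),
    "Skywork-PRM-7B":          ([70.8, 53.6, 22.9, 21.0], 42.1),
    "Qwen2.5-Math-7B-PRM800K": ([68.2, 62.6, 50.7, 44.3], 56.5),
    "Llama-3.3-70B-Instruct":  ([82.9, 59.4, 46.7, 43.0], 58.0),
    "Qwen2.5-72B-Instruct":    ([76.2, 61.8, 54.6, 52.2], 61.2),
    "QwQ-32B-Preview":         ([88.0, 78.7, 57.8, 61.3], 71.5),
    "GPT-4o-0806":             ([79.2, 63.6, 51.4, 53.5], 61.9),
    "o1-mini":                 ([93.2, 88.9, 87.2, 82.4], 87.9),
}


def mean_f1(scores):
    """Unweighted mean over the four subsets."""
    return sum(scores) / len(scores)


def max_rounding_gap(rows):
    """Largest difference between a recomputed mean and the reported average."""
    return max(abs(mean_f1(scores) - avg) for scores, avg in rows.values())


print(max_rounding_gap(ROWS) <= 0.051)
```

Because the mean is unweighted, the 400-sample GSM8K subset counts as much as each 1,000-sample subset.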
Several patterns stand out. Every model loses ground as the problems get harder, from GSM8K down to Omni-MATH, so generalization across difficulty is a shared weakness of both families. [1] The existing open-source PRMs trail the better critics even on the easy GSM8K and MATH subsets, and they collapse on the olympiad subsets, which led the authors to question the generalization and scalability of the Monte-Carlo-style data synthesis used to build them. [1] Their own PRM, fine-tuned on the human-annotated PRM800K data, reached 56.5 average F1 and beat the other PRMs by a wide margin, which points to the value of genuine human step labels over noisy synthetic ones. [1]
Among critics, the reasoning model QwQ-32B-Preview was the strongest open-source system at 71.5 average F1, ahead of the proprietary GPT-4o (61.9) though still well behind the reasoning-specialized o1-mini at 87.9. [1] The paper attributes the critic advantage to the extra thinking these models can do before committing to a verdict on each step, something a single-pass scalar PRM cannot do. [1]
ProcessBench is a close descendant of the process-supervision line that OpenAI opened with "Let's Verify Step by Step" in 2023, the paper that released PRM800K, a set of roughly 800,000 step-level human feedback labels on MATH solutions used to train a process reward model. [6] PRM800K is training data; ProcessBench is an evaluation. The two intersect directly in the experiment where the authors fine-tune a PRM on PRM800K and find it generalizes better than PRMs trained on synthetic labels, which is some of the cleaner evidence that human step annotation still pays off. [1] PRM800K's own test set is one of the prior datasets ProcessBench is positioned against, alongside CriticBench and MathCheck, on the axes of problem difficulty, solution diversity, and whether steps are annotated by humans. [1]
The benchmark also fed back into reasoning-model development at Qwen. The follow-up paper "The Lessons of Developing Process Reward Models in Mathematical Reasoning" used ProcessBench to evaluate a new generation of PRMs and introduced a consensus filtering mechanism that combines Monte Carlo estimation with an LLM-as-judge step to clean up noisy training labels. [7] The resulting Qwen2.5-Math-PRM-72B reached 78.3 F1 on ProcessBench, a large jump over the earlier open PRMs in the original paper and evidence that better data curation, not just more data, was the missing ingredient. [7] More broadly, the strong showing of QwQ-32B-Preview hinted that reasoning models trained to deliberate before answering make better verifiers, and ProcessBench has since served as a standard checkpoint for that claim.
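One way to picture consensus filtering is as an agreement check between two labelers, sketched below in Python. The text above only says that the mechanism combines Monte Carlo estimation with an LLM-as-judge step; the specific rule used here (keep a training sample only when both sources place the first error at the same step) and all names in the code are assumptions for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class LabeledSolution:
    solution_id: str
    mc_first_error: int     # first-error index from Monte Carlo estimation, -1 if none
    judge_first_error: int  # first-error index from the LLM judge, -1 if none


def consensus_filter(samples: List[LabeledSolution]) -> List[LabeledSolution]:
    """Keep only samples where both labeling methods agree (assumed rule)."""
    return [s for s in samples if s.mc_first_error == s.judge_first_error]


# Hypothetical training candidates.
candidates = [
    LabeledSolution("a", mc_first_error=2, judge_first_error=2),
    LabeledSolution("b", mc_first_error=-1, judge_first_error=3),
    LabeledSolution("c", mc_first_error=-1, judge_first_error=-1),
]
print([s.solution_id for s in consensus_filter(candidates)])
```

Sample "b" is the kind of case such a filter targets: the continuations reached the right answer, so Monte Carlo estimation saw no error, while the judge flagged a flawed step.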
The authors are candid about what the benchmark cannot guarantee. Even with the multi-annotator protocol, some error-location labels may be inaccurate, especially on the olympiad-level problems where the correct solution path is itself hard for experts to pin down. [1] The roughly 30% of solutions discarded during annotation were often the most difficult ones, so the surviving problem distribution may be biased toward cases that human annotators could actually adjudicate, which could understate how hard real-world verification gets. [1]
There are scope limits too. ProcessBench covers mathematics only, so it says nothing directly about verifying reasoning in code, science, or open-ended tasks, and later work has built separate step-level benchmarks for those domains. The focus on the earliest error is a deliberate simplification that sidesteps how to score later steps once the chain is broken. And because the metric rewards finding the first wrong step, it does not distinguish a verifier that understands why a step is wrong from one that flags the right index for the wrong reason, though the qualitative critiques from strong models suggest the better systems do tend to explain themselves. [1]