PaperBench
Last reviewed
Jun 2, 2026
Sources
3 citations
Review status
Source-backed
Revision
v1 · 1,549 words
PaperBench is a benchmark released by OpenAI in April 2025 that measures whether AI agents can replicate cutting-edge machine learning research papers from scratch. Each task asks an agent to read a recent paper, build a codebase that implements its method, run the experiments, and reproduce the paper's empirical results, with no access to the authors' original code. Submissions are scored against fine-grained rubrics that were co-developed with the original authors, and grading is performed by an automated large language model judge. The benchmark and its accompanying paper, "PaperBench: Evaluating AI's Ability to Replicate AI Research," were introduced to gauge the autonomous research-engineering capabilities of frontier models.[1][2]
PaperBench targets a harder problem than most coding or question-answering evaluations: reproducing an entire research result end to end. An agent receives a paper in PDF and Markdown form plus a short set of instructions, and must produce a repository that, when executed in a clean environment, regenerates the paper's headline findings. The task therefore combines reading comprehension, software engineering, and experiment execution into a single long-horizon assignment.[1]
The benchmark was built by an OpenAI team led by Giulio Starace and Tejal Patwardhan, and the work is closely tied to the company's efforts to track potentially dangerous autonomous capabilities under its Preparedness Framework. All code, rubrics, and tooling were open-sourced through OpenAI's preparedness repository on GitHub to support further study of AI engineering ability.[1][2]
A PaperBench attempt is judged on three things working together: that the agent wrote correct code implementing the paper's method, that the code actually runs, and that running it produces results matching the paper. This mirrors what a human researcher must do to reproduce a result, and it makes partial credit meaningful. An agent that writes plausible-looking code but never executes it, or executes code that fails to reproduce the reported numbers, scores poorly even if the implementation looks superficially complete.[1][3]
Crucially, agents are forbidden from using the authors' released code. The benchmark tests genuine replication from the paper's description rather than the ability to locate and rerun an existing implementation.[1]
PaperBench draws on 20 papers that were accepted as Spotlight or Oral presentations at ICML 2024, spanning topics such as deep learning, reinforcement learning, and probabilistic methods. These categories were chosen because Spotlight and Oral papers represent work the conference judged to be especially significant.[1][2]
For each paper the benchmark's creators built a hierarchical rubric that decomposes the replication into progressively smaller requirements, terminating in concrete, individually checkable leaf nodes. Across all 20 papers the rubrics contain 8,316 individually gradable tasks, giving the benchmark a far finer resolution than a simple pass/fail reproduction check.[1][2][3]
Each leaf node belongs to one of three types:
| Leaf node type | What it checks |
|---|---|
| Code Development | The submitted source code contains a correct implementation of a specific component of the paper.[1] |
| Execution | A specific part of the code was actually run, as evidenced in the execution logs.[1] |
| Result Match | The executed submission contains evidence of having reproduced a particular result reported in the paper.[1] |
To keep the rubrics realistic, OpenAI collaborated directly with the original authors of each paper, who helped define what a faithful replication of their work should contain. This co-development is a central design choice and is intended to make the grading criteria reflect how the researchers themselves would assess a reproduction.[1][2]
Manually checking thousands of rubric items per submission is impractical at scale, so the authors built an LLM-based judge, named SimpleJudge, that reads a submission and scores it against the rubric leaf by leaf. A replication score is then computed by aggregating the leaf-level judgments up the rubric tree, yielding a single percentage between 0 and 100 for each attempt.[1][3]
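The aggregation step can be illustrated with a minimal sketch. It assumes each leaf receives a binary verdict and each parent takes a weight-normalised average of its children; the `RubricNode` class, the `weight` field, and the example requirements are hypothetical and are not taken from the PaperBench codebase.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class RubricNode:
    """One node of a hypothetical rubric tree (illustrative names only)."""
    name: str
    weight: float = 1.0                # assumed relative importance among siblings
    passed: Optional[bool] = None      # judge verdict; only meaningful on leaves
    children: List["RubricNode"] = field(default_factory=list)


def replication_score(node: RubricNode) -> float:
    """Return a score in [0, 1].

    Leaves score 1.0 if the judge marked them as satisfied, else 0.0.
    Parents score the weight-normalised average of their children.
    """
    if not node.children:
        return 1.0 if node.passed else 0.0
    total_weight = sum(child.weight for child in node.children)
    if total_weight == 0:
        return 0.0
    weighted = sum(child.weight * replication_score(child) for child in node.children)
    return weighted / total_weight


# A toy rubric: the method subtree is assumed to count twice as much as the experiment.
rubric = RubricNode("paper", children=[
    RubricNode("method", weight=2.0, children=[
        RubricNode("loss implemented", passed=True),
        RubricNode("optimizer implemented", passed=True),
    ]),
    RubricNode("experiment 1", weight=1.0, children=[
        RubricNode("training script ran", passed=True),
        RubricNode("reported accuracy matches", passed=False),
    ]),
])

print(f"{100 * replication_score(rubric):.1f}%")  # prints 83.3%
```

In this toy tree the method subtree scores 1.0 and the experiment subtree 0.5, so the root score is (2 × 1.0 + 1 × 0.5) / 3, which is how partial credit propagates upward.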
Because an unreliable judge would undermine the whole benchmark, the authors also created a separate evaluation called JudgeEval to measure how well automated judges agree with human graders. JudgeEval uses partial replications of a subset of papers that human experts had already labeled, allowing the judge's verdicts to be scored for accuracy. The best configuration reported, SimpleJudge backed by the o3-mini reasoning model, reached an F1 score of 0.83 against the human ground truth.[1][3]
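As a rough illustration of how a judge's verdicts can be scored against human labels, the sketch below computes a binary F1 score with "requirement satisfied" as the positive class. The label lists are invented for the example, and the exact averaging scheme used by JudgeEval is not specified here, so this should be read as an assumption rather than the benchmark's own metric code.

```python
from typing import Sequence


def f1_score(human: Sequence[bool], judge: Sequence[bool]) -> float:
    """Binary F1 of judge verdicts against human ground-truth labels."""
    if len(human) != len(judge):
        raise ValueError("label sequences must have the same length")
    true_pos = sum(h and j for h, j in zip(human, judge))
    false_pos = sum((not h) and j for h, j in zip(human, judge))
    false_neg = sum(h and (not j) for h, j in zip(human, judge))
    if true_pos == 0:
        return 0.0  # no correct positive predictions, so F1 is defined as 0 here
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    return 2 * precision * recall / (precision + recall)


# Invented leaf-level labels: True means "requirement satisfied".
human_labels = [True, True, True, False, False, True]
judge_labels = [True, True, False, False, True, True]

print(f1_score(human_labels, judge_labels))  # prints 0.75
```

Here the judge has three true positives, one false positive, and one false negative, giving precision and recall of 0.75 each.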
Running the judge is computationally expensive. OpenAI estimated that grading a single full PaperBench submission with the o3-mini-based judge costs roughly 66 US dollars in API credits, and a complete benchmark run across all papers and seeds was reported to cost on the order of several thousand dollars.[1][3]
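A back-of-the-envelope calculation shows how the per-submission figure scales. The sketch assumes that one submission corresponds to one attempt at one paper and uses a hypothetical seed count of three; only the 66-dollar cost and the 20-paper count come from the text above.

```python
COST_PER_SUBMISSION_USD = 66  # reported cost of grading one submission
NUM_PAPERS = 20               # papers in the benchmark


def full_run_grading_cost(num_seeds: int,
                          cost_per_submission: int = COST_PER_SUBMISSION_USD,
                          num_papers: int = NUM_PAPERS) -> int:
    """Estimate grading cost in USD for one model across all papers and seeds."""
    return num_seeds * num_papers * cost_per_submission


# num_seeds=3 is an assumed value chosen for illustration.
print(f"${full_run_grading_cost(num_seeds=3):,}")  # prints $3,960
```

Under these assumptions the total lands in the low thousands of dollars per evaluated model, consistent with the order of magnitude quoted above.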
OpenAI evaluated several frontier models using open-source agent scaffolds. The headline figure is the average replication score across the 20 papers, expressed as a percentage. Models were given a single A10 GPU inside an Ubuntu Docker container and a 12-hour wall-clock limit per attempt.[1][3]
Under the simple "BasicAgent" scaffold, the strongest result came from Claude 3.5 Sonnet (New), which reached 21.0 percent. OpenAI's o1 scored lower under the same basic scaffold (13.2 percent) but improved to 24.4 percent when given a more persistent "IterativeAgent" scaffold that encouraged it to keep working rather than stopping early.[1][3]
| Model | Scaffold | Average replication score |
|---|---|---|
| Claude 3.5 Sonnet (New) | BasicAgent | 21.0% ± 0.8 [1][3] |
| OpenAI o1 (high) | IterativeAgent | 24.4% [1][3] |
| OpenAI o1 (high) | BasicAgent | 13.2% ± 0.3 [1][3] |
| OpenAI o3-mini (high) | IterativeAgent | 8.5% [1][3] |
| DeepSeek-R1 | BasicAgent | 6.0% ± 0.3 [1][3] |
| GPT-4o | BasicAgent | 4.1% ± 0.1 [1][3] |
| Gemini 2.0 Flash | BasicAgent | 3.2% ± 0.2 [1][3] |
| OpenAI o3-mini (high) | BasicAgent | 2.6% ± 0.2 [1][3] |
The authors also defined a lighter variant, PaperBench Code-Dev, which grades only the Code Development leaf nodes and skips the requirement to execute the code and match results. This variant is cheaper to grade, around 10 US dollars per paper, and produces much higher scores: o1 with the IterativeAgent scaffold reached 43.4 percent on Code-Dev, reflecting that writing plausible implementation code is easier than getting it to run and reproduce the original numbers.[1][3]
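The difference between the full score and the Code-Dev score can be sketched by filtering leaves by type before scoring. For brevity the example uses a flat, unweighted list of leaves instead of a rubric tree, which is a simplification; the `Leaf` class, `pass_rate` function, and sample requirements are hypothetical.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Iterable, Optional, Set


class LeafType(Enum):
    CODE_DEVELOPMENT = "Code Development"
    EXECUTION = "Execution"
    RESULT_MATCH = "Result Match"


@dataclass(frozen=True)
class Leaf:
    requirement: str
    leaf_type: LeafType
    passed: bool  # judge verdict for this requirement


def pass_rate(leaves: Iterable[Leaf], keep: Optional[Set[LeafType]] = None) -> float:
    """Fraction of passed leaves, optionally restricted to the given leaf types."""
    selected = [leaf for leaf in leaves if keep is None or leaf.leaf_type in keep]
    if not selected:
        return 0.0
    return sum(leaf.passed for leaf in selected) / len(selected)


# Invented verdicts for a single submission.
leaves = [
    Leaf("model architecture implemented", LeafType.CODE_DEVELOPMENT, True),
    Leaf("training loop implemented", LeafType.CODE_DEVELOPMENT, True),
    Leaf("ablation implemented", LeafType.CODE_DEVELOPMENT, False),
    Leaf("training script was run", LeafType.EXECUTION, True),
    Leaf("ablation script was run", LeafType.EXECUTION, False),
    Leaf("headline accuracy reproduced", LeafType.RESULT_MATCH, False),
]

full_score = pass_rate(leaves)                                  # all leaf types
code_dev_score = pass_rate(leaves, {LeafType.CODE_DEVELOPMENT})  # Code-Dev only
print(f"full: {full_score:.2f}, code-dev: {code_dev_score:.2f}")
```

In this toy case the Code-Dev score (two of three leaves) exceeds the full score (three of six), mirroring the pattern that implementation-only grading yields higher numbers.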
To establish a human reference point, OpenAI recruited 8 participants who were enrolled in or had completed a PhD in machine learning and had them attempt a subset of the papers under the same rules as the models. On a 3-paper subset, the best human attempt out of three tries reached 41.4 percent after 48 hours of effort, well above o1's 26.6 percent on the same subset.[1][3]
The time dynamics were notable. In the reported comparison, o1 made faster early progress than the humans during the first hour, but its score plateaued while the humans continued improving, overtaking the model after roughly a day and pulling further ahead by the 48-hour mark. The headline conclusion is that current models do not yet match skilled human researchers at full paper replication, even though they can move quickly at the start.[1][3]
PaperBench is part of a wave of evaluations aimed at the autonomous research-engineering frontier, alongside OpenAI's own MLE-bench and software-engineering benchmarks such as SWE-bench. What distinguishes it is the demand for full end-to-end reproduction of published results rather than isolated coding tasks or question-answering benchmarks such as GPQA. By tying grading to author-validated rubrics and validating the automated judge against human labels, the benchmark offers a more defensible measure of whether an agent has actually replicated a paper.[1][2]
The benchmark also functions as a capability tripwire. The ability to autonomously reproduce, and eventually extend, frontier research is one of the capabilities OpenAI tracks for safety reasons, since an AI system that could conduct research without human oversight would represent a significant shift. Open-sourcing the full harness lets outside groups rerun the evaluation on new models and monitor progress over time.[1][2]
Several constraints temper the benchmark's results. The automated judge, while validated, is imperfect: its F1 of 0.83 means some leaf-level judgments will be wrong. Grading is also costly, at roughly 66 US dollars per submission with the o3-mini-based judge, which discourages frequent re-evaluation.[1][3]
The hardware and time budget are deliberately modest. A single A10 GPU and a 12-hour limit cap the kinds of experiments an agent can realistically run, so papers requiring heavy compute are difficult to reproduce regardless of an agent's reasoning ability.[1][3]
Reported scores also depend on the agent scaffold, not only on the underlying model: o1 nearly doubled its score, from 13.2 to 24.4 percent, when switched from the BasicAgent to the IterativeAgent harness. This makes cross-model comparisons dependent on the surrounding tooling. Finally, the human baseline was collected on a small subset of papers with a limited number of participants, so it should be read as an indicative reference point rather than a definitive measure of expert performance.[1][3]