RE-Bench
Last reviewed
Jun 2, 2026
Sources
8 citations
Review status
Source-backed
Revision
v1 · 1,842 words
RE-Bench (short for Research Engineering Benchmark) is a benchmark for evaluating the frontier AI research-and-development capabilities of large language model agents, developed by the nonprofit METR (Model Evaluation and Threat Research). It measures how well AI agents perform on open-ended machine-learning research-engineering tasks and compares their performance directly against human experts working under matched time budgets [1][2]. RE-Bench is unusual among AI evaluations in pairing a suite of realistic, hand-built engineering environments with a large dataset of human-expert attempts, allowing a like-for-like comparison rather than a purely automated score [1]. The benchmark was introduced in a November 2024 paper led by Hjalmar Wijk and 22 coauthors, later published as a spotlight at the International Conference on Machine Learning (ICML) 2025 [1][3].
RE-Bench version 1 consists of seven challenging, open-ended ML research-engineering environments together with data from 71 eight-hour attempts by 61 distinct human experts [1]. Each environment presents an agent (or a human) with a concrete engineering goal, a basic starting solution, and a scoring function that can be evaluated repeatedly during a run. The benchmark is designed so that progress is measurable on a continuous scale rather than as a single pass or fail, which makes it possible to track partial credit and to compare how scores improve as more time is spent [1][2].
The headline finding is that the strongest AI agents tested scored roughly four times as high as human experts when both were limited to a two-hour total time budget per environment, but humans showed better returns to additional time: they narrowly exceeded the top AI agent at an eight-hour budget and reached about twice the top agent's score at a 32-hour budget [1]. METR frames this contrast as evidence that current frontier models are fast and capable on short, well-specified engineering problems but still fall behind skilled humans on the kind of sustained, iterative work that characterizes real research [2].
METR is an independent nonprofit that studies whether frontier AI systems pose catastrophic risks through dangerous autonomous capabilities, and it conducts pre-deployment evaluations that feed into model system cards for several leading labs [2][4]. A recurring concern in frontier AI safety policy is the prospect that AI systems could automate large parts of AI research and development, potentially accelerating their own improvement. METR argues that despite this concern there were few realistic evaluations of AI R&D ability, and essentially none that offered a direct, calibrated comparison to human performance [1][2].
RE-Bench was built to fill that gap. Rather than testing isolated coding puzzles, the environments target the kind of applied ML engineering that researchers actually do, such as optimizing training scripts, writing custom GPU kernels, and fitting scaling laws [2]. This focus distinguishes RE-Bench from METR's separate line of work on how the length of tasks AI can complete grows over time (see task-completion time horizon), which measures task difficulty in terms of human time rather than research-engineering skill specifically [4].
The seven RE-Bench environments each define a quantitative objective and ship with a starting solution and a stronger reference solution used to calibrate scores [1]. The tasks span performance optimization, model repair, experimental design, and agent scaffolding:
| Environment | Goal | Scoring metric |
|---|---|---|
| Optimize an LLM Foundry finetuning script | Reduce the runtime of a finetuning script without changing its behavior | Log of execution time to finetune on 1000 datapoints [5] |
| Optimize a kernel | Write a custom Triton GPU kernel computing a prefix-sum of a function | Log of execution time [5] |
| Fix embedding | Recover the original webtext performance of a model whose embeddings have been permuted (corrupted) | log(loss − 1.5) on OpenWebText [5] |
| Scaling-law experiment | Predict the optimal tradeoff between hidden size and number of training steps from limited experimental data | Distance from the true optimum [5] |
| Restricted-architecture MLM | Build a text-prediction model from a limited set of PyTorch primitives, excluding division and exponentiation | log(loss − 1.5) on OpenWebText [5] |
| Finetune GPT-2 for QA with reinforcement learning | Finetune GPT-2-small into an effective question-answering chatbot using RL | Average win rate judged by a Llama-3-8B evaluator [5] |
| Build scaffolding for Rust code-contest problems | Prompt and scaffold GPT-3.5 to solve competitive-programming problems written in Rust | Fraction of held-out problems solved [5] |
The environments were chosen to reward genuine ML expertise while remaining tractable within hours rather than weeks. On one task, kernel optimization, frontier models produced custom solutions that beat every human attempt, which METR highlighted as a sign of real engineering competence rather than memorization [2].
A central feature of RE-Bench is its large human-baseline dataset. METR recruited 61 distinct experts, drawn from people with relevant ML research and engineering experience, and collected 71 attempts in which each expert worked on an environment for up to eight hours [1]. The experts confirmed that the environments are solvable and that meaningful progress is possible in the allotted time: 82% of expert attempts achieved a non-zero score, and 24% matched or exceeded the strong reference solutions [1].
To compare agents with humans fairly, METR aggregated human attempts into score-versus-time curves and built equivalent curves for AI agents by sampling multiple runs and taking a best-of-k under a fixed total time budget [1]. Because human and agent effort are both denominated in time, the two can be plotted on the same axes. This methodology allows statements such as "at a given budget, agent A performs comparably to the Nth-percentile human expert" [2][6].
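A percentile statement of this kind can be illustrated with a short sketch. The function name `percentile_rank`, the tie-handling rule (ties count as half), and the human scores below are all hypothetical; this is not METR's published analysis code, only one plausible way to place an agent score within a set of human scores.

```python
from bisect import bisect_left, bisect_right


def percentile_rank(agent_score: float, human_scores: list[float]) -> float:
    """Return the percentage of human scores below the agent's score.

    Ties are counted as half (an assumption made for this sketch).
    """
    if not human_scores:
        raise ValueError("need at least one human score")
    ordered = sorted(human_scores)
    below = bisect_left(ordered, agent_score)           # strictly lower scores
    ties = bisect_right(ordered, agent_score) - below   # equal scores
    return 100.0 * (below + 0.5 * ties) / len(ordered)


# Hypothetical normalized human scores at one time budget (not real data).
human_scores = [0.0, 0.1, 0.3, 0.5, 0.6, 0.8, 0.9, 1.0, 1.1, 1.2]

# Three of the ten human scores are below 0.4, so the agent sits at the 30th percentile.
print(percentile_rank(0.4, human_scores))  # 30.0
```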
Scores in each environment are normalized so that the starting solution corresponds to a value of 0 and the reference solution to a value of 1, with higher being better and scores allowed to exceed 1 when a solution beats the reference [1][5]. Comparisons are reported under several total time budgets, principally 2, 8, and 32 hours, reflecting METR's observation that access to compute and wall-clock time is often the binding constraint in real AI R&D [1][2].
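The normalization described above can be sketched as follows. The source states only that the starting solution maps to 0 and the reference solution to 1; the linear rescaling between those two points, the function names, and every loss value here are assumptions made for illustration. The raw metric uses the log(loss − 1.5) form listed for the OpenWebText-based environments.

```python
import math


def raw_loss_metric(loss: float, offset: float = 1.5) -> float:
    """Raw metric of the form log(loss - 1.5); lower is better."""
    if loss <= offset:
        # Guard added for this sketch: the log is undefined at or below the offset.
        raise ValueError("loss must exceed the offset")
    return math.log(loss - offset)


def normalize(raw: float, start_raw: float, reference_raw: float) -> float:
    """Linearly rescale so the starting solution is 0 and the reference is 1.

    Works for lower-is-better metrics as well, because the sign of
    (reference_raw - start_raw) flips the direction automatically.
    """
    if start_raw == reference_raw:
        raise ValueError("start and reference must differ")
    return (raw - start_raw) / (reference_raw - start_raw)


# Hypothetical losses, chosen only to make the arithmetic easy to follow.
start = raw_loss_metric(4.5)      # log(3.0)
reference = raw_loss_metric(2.5)  # log(1.0) = 0.0
attempt = raw_loss_metric(3.5)    # log(2.0)

print(round(normalize(attempt, start, reference), 2))  # 0.37
print(normalize(raw_loss_metric(2.0), start, reference) > 1.0)  # True: beats the reference
```

Because the rescaling is anchored at two solutions rather than at a theoretical optimum, a solution that improves on the reference simply lands above 1.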
The choice of budget matters because humans and agents scale differently. Frontier models can generate and test candidate solutions more than ten times faster than humans and at much lower cost, which gives them an early lead, but they tend to plateau, whereas human scores keep climbing as more time is invested [1][2]. One environment, the scaling-law experiment, was excluded from the best-of-k analysis because repeatedly observing the score would trivialize it [5].
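To make the role of the budget concrete, here is a minimal sketch of a best-of-k estimate under a fixed total time budget. It assumes the budget is split into k equal-length independent runs and that runs are resampled with replacement from a pool of observed scores; the function name, the resampling scheme, and the scores are hypothetical and not taken from METR's code.

```python
import random


def best_of_k_score(run_scores, run_hours, budget_hours, n_samples=2000, seed=0):
    """Estimate the expected best score when a budget is split into k runs.

    k is the number of whole runs of length run_hours that fit in budget_hours.
    Runs are drawn with replacement from run_scores (an assumption).
    """
    k = int(budget_hours // run_hours)
    if k < 1:
        raise ValueError("budget is shorter than a single run")
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        # Keep only the best of k sampled runs, as an agent operator would.
        total += max(rng.choices(run_scores, k=k))
    return total / n_samples


# Hypothetical normalized scores from individual 30-minute agent runs.
scores = [0.0, 0.1, 0.2, 0.4, 0.9]

for budget in (2, 8, 32):
    k = int(budget // 0.5)
    print(budget, "hours ->", k, "runs, estimate", round(best_of_k_score(scores, 0.5, budget), 3))
```

In this toy setup the estimate can never exceed the best score any single run achieves, so it flattens out as k grows, which is one simple way to picture a plateau at larger budgets.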
When the November 2024 paper was released, METR reported results for two agent scaffolds, its own "Modular" agent and the tree-search-based AIDE scaffold, running frontier models including Claude 3.5 Sonnet (both the June and October 2024 versions) and OpenAI's o1-preview [1]. Agents built on both Claude 3.5 Sonnet and o1-preview outperformed the median human over short horizons but failed to keep improving, while humans overtook them given more time [2][6].
| Comparison | Budget | Result | Source |
|---|---|---|---|
| Best AI agent vs. human experts | 2 hours | Agent scores about 4× the human average | [1] |
| Best AI agent vs. human experts | 8 hours | Humans narrowly exceed the top agent | [1] |
| Best AI agent vs. human experts | 32 hours | Human average is about 2× the top agent | [1] |
| Claude 3.5 Sonnet (agent) | 8 hours | Comparable to a 37th-percentile human expert | [6] |
| o1-preview (agent) | 8 hours | Comparable to a 30th-percentile human expert | [6] |
| Human expert attempts (baseline) | 8 hours | 82% non-zero score; 24% matched or beat reference | [1] |
| o1-preview kernel solution vs. best human (kernel-optimization task) | Not stated | 0.64 ms vs. 0.67 ms; agent solution faster than all 9 human attempts | [2] |
These figures describe the models available in 2024 and early 2025. METR has since run RE-Bench environments as part of pre-deployment evaluations of newer systems, including OpenAI's o3 and o4-mini, using time-constrained runs of two hours each [4][7].
RE-Bench is positioned as an early warning system for one of the capabilities most relevant to frontier AI risk: the automation of AI research itself [1][2]. If agents were to surpass expert humans across these environments at long time budgets and low cost, that would suggest AI systems could meaningfully speed up their own development, a dynamic that safety researchers treat as a potential threshold for rapid and hard-to-govern capability gains [1]. By tying agent performance to concrete human percentiles, the benchmark gives policymakers and labs a more interpretable signal than abstract accuracy numbers [2][6].
The work has been influential in the evaluation community. It was published as a spotlight paper at ICML 2025, and its environments, human-expert data, analysis code, and agent transcripts were open-sourced to support reproduction and follow-on research [1][3][5]. METR and others have since extended the framework, including studies of automated kernel engineering and of adversarial or sabotage behavior in research settings built on RE-Bench [2][8].
METR is explicit that RE-Bench is a partial proxy for real AI R&D rather than a complete measure of it [1][2]. The benchmark contains only seven environments, far fewer than many automated benchmarks, which limits statistical resolution. Its tasks come with clear objectives, a working starting solution, and fast scoring feedback, whereas genuine research often involves ambiguous goals, the need to define one's own metrics, and feedback loops that can take weeks or months [1][2]. Performance is also sensitive to the agent scaffold and to the time-budget aggregation method, so headline comparisons depend on methodological choices [1]. Finally, because the environments and reference solutions are public, future evaluations risk contamination from training data, a general hazard for static benchmarks [1]. METR therefore presents RE-Bench results as one input among several rather than a definitive verdict on whether AI can automate AI research.