RE-Bench

Updated Jun 2, 2026

RE-Bench (short for Research Engineering Benchmark) is a benchmark for evaluating the frontier AI research-and-development capabilities of large language model agents, developed by the nonprofit METR (Model Evaluation and Threat Research). It measures how well AI agents perform on open-ended machine-learning research-engineering tasks and compares their performance directly against human experts working under matched time budgets ^[1]^[2]. RE-Bench is unusual among AI evaluations in pairing a suite of realistic, hand-built engineering environments with a large dataset of human-expert attempts, allowing a like-for-like comparison rather than a purely automated score ^[1]. The benchmark was introduced in a November 2024 paper led by Hjalmar Wijk and 22 coauthors, later published as a spotlight at the International Conference on Machine Learning (ICML) 2025 ^[1]^[3].

Overview

RE-Bench version 1 consists of seven challenging, open-ended ML research-engineering environments together with data from 71 eight-hour attempts by 61 distinct human experts ^[1]. Each environment presents an agent (or a human) with a concrete engineering goal, a basic starting solution, and a scoring function that can be evaluated repeatedly during a run. The benchmark is designed so that progress is measurable on a continuous scale rather than as a single pass or fail, which makes it possible to track partial credit and to compare how scores improve as more time is spent ^[1]^[2].

The headline finding is that the strongest AI agents tested achieved a score roughly four times higher than human experts when both were limited to a two-hour total time budget per environment, but that humans showed better returns to additional time: they narrowly exceeded the top AI agent at an eight-hour budget and reached about twice the top agent's score at a 32-hour budget ^[1]. METR frames this contrast as evidence that current frontier models are fast and capable on short, well-specified engineering problems but still fall behind skilled humans on the kind of sustained, iterative work that characterizes real research ^[2].

METR and motivation

METR is an independent nonprofit that studies whether frontier AI systems pose catastrophic risks through dangerous autonomous capabilities, and it conducts pre-deployment evaluations that feed into model system cards for several leading labs ^[2]^[4]. A recurring concern in frontier AI safety policy is the prospect that AI systems could automate large parts of AI research and development, potentially accelerating their own improvement. METR argues that despite this concern there were few realistic evaluations of AI R&D ability, and essentially none that offered a direct, calibrated comparison to human performance ^[1]^[2].

RE-Bench was built to fill that gap. Rather than testing isolated coding puzzles, the environments target the kind of applied ML engineering that researchers actually do, such as optimizing training scripts, writing custom GPU kernels, and fitting scaling laws ^[2]. This focus distinguishes RE-Bench from METR's separate line of work on how the length of tasks AI can complete grows over time (see task-completion time horizon), which measures task difficulty in terms of human time rather than research-engineering skill specifically ^[4].

The task suite

The seven RE-Bench environments each define a quantitative objective and ship with a starting solution and a stronger reference solution used to calibrate scores ^[1]. The tasks span performance optimization, model repair, experimental design, and agent scaffolding:

Environment	Goal	Scoring metric
Optimize an LLM Foundry finetuning script	Reduce the runtime of a finetuning script without changing its behavior	Log of execution time to finetune on 1000 datapoints ^[5]
Optimize a kernel	Write a custom Triton GPU kernel computing a prefix-sum of a function	Log of execution time ^[5]
Fix embedding	Recover the original webtext performance of a model whose embeddings have been permuted (corrupted)	log(loss − 1.5) on OpenWebText ^[5]
Scaling-law experiment	Predict the optimal tradeoff between hidden size and number of training steps from limited experimental data	Distance from the true optimum ^[5]
Restricted-architecture MLM	Build a text-prediction model from a limited set of PyTorch primitives, excluding division and exponentiation	log(loss − 1.5) on OpenWebText ^[5]
Finetune GPT-2 for QA with reinforcement learning	Finetune GPT-2-small into an effective question-answering chatbot using RL	Average win rate judged by a Llama-3-8B evaluator ^[5]
Build scaffolding for Rust code-contest problems	Prompt and scaffold GPT-3.5 to solve competitive-programming problems written in Rust	Fraction of held-out problems solved ^[5]

The environments were chosen to reward genuine ML expertise while remaining tractable within hours rather than weeks. One task (the kernel-optimization problem) drew custom solutions from frontier models that beat every human attempt, which METR highlighted as a sign of real engineering competence rather than memorization ^[2].

Human-expert baseline and methodology

A central feature of RE-Bench is its large human-baseline dataset. METR recruited 61 distinct experts, drawn from people with relevant ML research and engineering experience, and collected 71 attempts in which each expert worked on an environment for up to eight hours ^[1]. The experts confirmed that the environments are solvable and that meaningful progress is possible in the allotted time: 82% of expert attempts achieved a non-zero score, and 24% matched or exceeded the strong reference solutions ^[1].

To compare agents with humans fairly, METR aggregated human attempts into score-versus-time curves and built equivalent curves for AI agents by sampling multiple runs and taking a best-of-k under a fixed total time budget ^[1]. Because human and agent effort are both denominated in time, the two can be plotted on the same axes. This methodology allows statements such as "at a given budget, agent A performs comparably to the Nth-percentile human expert" ^[2]^[6].
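
The best-of-k aggregation described above can be illustrated with a minimal sketch. It assumes that an agent's runs at a fixed per-run time limit are treated as an empirical score distribution and that a total budget buys k = budget / run length independent runs; the exact estimator METR uses may differ. The function names and the sample scores are hypothetical.

```python
def expected_best_of_k(run_scores, k):
    """Expected maximum of k draws (with replacement) from the observed run scores.

    Uses the order-statistic identity P(max <= s_(i)) = (i / n) ** k,
    where s_(1) <= ... <= s_(n) are the sorted scores.
    """
    if k < 1 or not run_scores:
        raise ValueError("need at least one run and k >= 1")
    ordered = sorted(run_scores)
    n = len(ordered)
    total = 0.0
    for i, score in enumerate(ordered, start=1):
        # Probability that the best of k draws is exactly the i-th smallest score.
        weight = (i / n) ** k - ((i - 1) / n) ** k
        total += score * weight
    return total


def score_at_budget(run_scores, run_hours, budget_hours):
    """Best-of-k score when a total budget is split into runs of equal length."""
    k = int(budget_hours // run_hours)
    return expected_best_of_k(run_scores, k)


# Hypothetical normalized scores from four 2-hour agent runs (not real data).
two_hour_runs = [0.0, 0.2, 0.5, 0.9]
for budget in (2, 8, 32):
    print(budget, round(score_at_budget(two_hour_runs, 2, budget), 3))
```

With k = 1 the estimate reduces to the mean run score, and it rises toward the best observed run as the budget grows, which is why short runs repeated many times can look strong under this kind of aggregation.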

Scoring and time budgets

Scores in each environment are normalized so that the starting solution corresponds to a value of 0 and the reference solution to a value of 1, with higher being better and scores allowed to exceed 1 when a solution beats the reference ^[1]^[5]. Comparisons are reported under several total time budgets, principally 2, 8, and 32 hours, reflecting METR's observation that access to compute and wall-clock time is often the binding constraint in real AI R&D ^[1]^[2].
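
As a rough illustration of this normalization, the sketch below assumes a simple linear rescaling between the starting and reference scores, applied here to a log-runtime metric of the kind used by the optimization tasks. The function name and the runtimes are hypothetical, and the published scoring code may differ in detail.

```python
import math


def normalize_score(raw, start, reference):
    """Linearly rescale a raw metric so that start maps to 0 and reference maps to 1."""
    if reference == start:
        raise ValueError("reference and starting scores must differ")
    return (raw - start) / (reference - start)


# Hypothetical runtimes in seconds (not taken from the benchmark).
start_raw = math.log(100.0)      # starting solution
reference_raw = math.log(25.0)   # reference solution

for runtime in (100.0, 50.0, 25.0, 12.5):
    score = normalize_score(math.log(runtime), start_raw, reference_raw)
    print(runtime, round(score, 3))
```

Because the reference runtime is lower than the starting runtime, the same formula turns a lower-is-better raw metric into a higher-is-better normalized score, and a solution faster than the reference lands above 1.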

The choice of budget matters because humans and agents scale differently. Frontier models can generate and test candidate solutions more than ten times faster than humans, and at much lower cost, which gives them an early lead. Their scores tend to plateau, however, whereas human scores keep climbing as more time is invested ^[1]^[2]. One environment, the scaling-law experiment, was excluded from the best-of-k analysis because repeatedly observing the score would trivialize it ^[5].

Notable results

When the November 2024 paper was released, METR reported results for two agent scaffolds, its own "Modular" agent and the tree-search-based AIDE scaffold, running frontier models including Claude 3.5 Sonnet (both the June and October 2024 versions) and OpenAI's o1-preview ^[1]. Claude 3.5 Sonnet and o1-preview both outperformed the median human over short horizons but failed to keep improving, while humans overtook them given more time ^[2]^[6].

Comparison	Budget	Result	Source
Best AI agent vs. human experts	2 hours	Agent scores about 4× the human average	^[1]
Best AI agent vs. human experts	8 hours	Humans narrowly exceed the top agent	^[1]
Best AI agent vs. human experts	32 hours	Human average is about 2× the top agent	^[1]
Claude 3.5 Sonnet (agent)	8 hours	Comparable to a 37th-percentile human expert	^[6]
o1-preview (agent)	8 hours	Comparable to a 30th-percentile human expert	^[6]
Human expert attempts (baseline)	8 hours	82% non-zero score; 24% matched or beat reference	^[1]
o1-preview kernel solution vs. best human	(kernel task)	0.64 ms vs. 0.67 ms; agent solution faster than all 9 human attempts	^[2]
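
The percentile rows in the table above express an agent's score as a position within the distribution of human-expert scores at the same budget. The following is a minimal sketch of one way to compute such a percentile, assuming a simple rank-based definition in which ties count as half; the function name and all scores are hypothetical and are not drawn from the RE-Bench data.

```python
def human_percentile(agent_score, human_scores):
    """Percentage of human scores below the agent's score, counting ties as half."""
    if not human_scores:
        raise ValueError("need at least one human score")
    below = sum(1 for h in human_scores if h < agent_score)
    ties = sum(1 for h in human_scores if h == agent_score)
    return 100.0 * (below + 0.5 * ties) / len(human_scores)


# Hypothetical normalized 8-hour scores for ten human experts (not real data).
humans = [0.0, 0.1, 0.2, 0.3, 0.4, 0.6, 0.8, 0.9, 1.0, 1.2]
print(human_percentile(0.5, humans))
```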

These figures describe the models available in 2024 and early 2025. METR has since run RE-Bench environments as part of pre-deployment evaluations of newer systems, including OpenAI's o3 and o4-mini, using time-limited runs of two hours each ^[4]^[7].

Significance for AI R&D acceleration and safety

RE-Bench is positioned as an early warning system for one of the capabilities most relevant to frontier AI risk: the automation of AI research itself ^[1]^[2]. If agents were to surpass expert humans across these environments at long time budgets and low cost, that would suggest AI systems could meaningfully speed up their own development, a dynamic that safety researchers treat as a potential threshold for rapid and hard-to-govern capability gains ^[1]. By tying agent performance to concrete human percentiles, the benchmark gives policymakers and labs a more interpretable signal than abstract accuracy numbers ^[2]^[6].

The work has been influential in the evaluation community. It was published as a spotlight paper at ICML 2025, and its environments, human-expert data, analysis code, and agent transcripts were open-sourced to support reproduction and follow-on research ^[1]^[3]^[5]. METR and others have since extended the framework, including studies of automated kernel engineering and of adversarial or sabotage behavior in research settings built on RE-Bench ^[2]^[8].

Limitations

METR is explicit that RE-Bench is a partial proxy for real AI R&D rather than a complete measure of it ^[1]^[2]. The benchmark contains only seven environments, far fewer than many automated benchmarks, which limits statistical resolution. Its tasks come with clear objectives, a working starting solution, and fast scoring feedback, whereas genuine research often involves ambiguous goals, the need to define one's own metrics, and feedback loops that can take weeks or months ^[1]^[2]. Performance is also sensitive to the agent scaffold and to the time-budget aggregation method, so headline comparisons depend on methodological choices ^[1]. Finally, because the environments and reference solutions are public, future evaluations risk contamination from training data, a general hazard for static benchmarks ^[1]. METR therefore presents RE-Bench results as one input among several rather than a definitive verdict on whether AI can automate AI research.

References

1. Wijk, H., Lin, T., Becker, J., Jawhar, S., Parikh, N., Broadley, T., Chan, L., Chen, M., Clymer, J., Dhyani, J., Ericheva, E., Garcia, K., Goodrich, B., Jurkovic, N., Karnofsky, H., Kinniment, M., Lajko, A., Nix, S., Sato, L., Saunders, W., Taran, M., West, B., & Barnes, E. (2024). "RE-Bench: Evaluating frontier AI R&D capabilities of language model agents against human experts." arXiv:2411.15114. https://arxiv.org/abs/2411.15114
2. METR. (2024, November 22). "Evaluating frontier AI R&D capabilities of language model agents against human experts." https://metr.org/blog/2024-11-22-evaluating-r-d-capabilities-of-llms/
3. Wijk et al. (2025). "RE-Bench: Evaluating Frontier AI R&D Capabilities of Language Model Agents against Human Experts." *Proceedings of the 42nd International Conference on Machine Learning* (ICML 2025), PMLR vol. 267, pp. 66772-66832. https://proceedings.mlr.press/v267/wijk25a.html
4. METR. "Research." https://metr.org/research/
5. METR. "RE-Bench (source code and task descriptions)." GitHub. https://github.com/METR/RE-Bench
6. METR. (2025, January 31). "An update on our preliminary evaluations of Claude 3.5 Sonnet and o1." https://metr.org/blog/2025-01-31-update-sonnet-o1-evals/
7. METR. "Details about METR's preliminary evaluation of OpenAI's o3 and o4-mini." https://metr.org/evaluations/openai-o3-report/
8. METR. (2025, February 14). "Measuring Automated Kernel Engineering." https://metr.org/blog/2025-02-14-measuring-automated-kernel-engineering/
