ScienceAgentBench


Updated Jun 8, 2026


Overview

ScienceAgentBench is an AI benchmark for evaluating whether language model agents can perform real, data-driven scientific analysis by writing and executing code. Introduced in October 2024 by a group led by Ziru Chen at The Ohio State University, the benchmark consists of 102 tasks extracted from 44 peer-reviewed scientific publications spanning four disciplines: bioinformatics, computational chemistry, geographical information science, and psychology and cognitive neuroscience ^[1]^[2]. Each task requires an agent to produce a self-contained Python program that loads scientific data, performs a defined analysis, and saves a verifiable output, mirroring an individual stage of a genuine research workflow.

The paper, "ScienceAgentBench: Toward Rigorous Assessment of Language Agents for Data-Driven Scientific Discovery," was published as a preprint on arXiv on 7 October 2024 and subsequently accepted to the International Conference on Learning Representations (ICLR) 2025 ^[1]. The benchmark was designed as a deliberately grounded counterweight to sweeping "AI scientist" claims: rather than judging an agent on an end-to-end discovery pipeline, it isolates concrete, expert-validated coding tasks whose correctness can be checked objectively. In the authors' headline finding, the best-performing agent configurations at release solved only a minority of tasks: a Claude 3.5 Sonnet agent reached a 32.4 percent success rate without expert-provided knowledge, and OpenAI's o1-preview reached 42.2 percent at more than ten times the API cost. The authors take this as evidence that contemporary agents remain far from automating scientific discovery ^[1].

Motivation: rigor for AI-scientist claims

By 2024, a wave of work argued that LLM-based agents could accelerate or even automate parts of the scientific process, with some systems marketed as autonomous "AI scientists" capable of generating hypotheses, running experiments, and writing papers. The creators of ScienceAgentBench argued that such claims were difficult to assess because existing evaluations were often coarse, used artificial or toy datasets, or relied on subjective judgments of an agent's full research output ^[1]. Without rigorous, realistic measurement, it was hard to separate real capability from hype.

ScienceAgentBench addresses this gap by adopting an "essential tasks" philosophy. Instead of asking whether an agent can autonomously make a discovery end to end, the benchmark asks whether an agent can reliably perform the individual data-driven steps that real scientists actually code, such as loading domain-specific data formats, running a statistical model, training a predictive classifier, or producing a publication-quality visualization. Because every task is drawn from a published study and reviewed by domain experts, success on the benchmark reflects competence on work that practicing researchers consider meaningful ^[1]^[3]. The authors are explicit that strong performance here would be a necessary but not sufficient condition for trustworthy autonomous research, framing the benchmark as a floor of basic competence rather than a test of full automation.

What ScienceAgentBench contains

The benchmark comprises 102 tasks curated from 44 peer-reviewed publications across four scientific fields ^[1]^[2]. Each task is a self-contained unit of a real research workflow, paired with the input data, a natural-language task instruction, and, where relevant, a domain-knowledge document and a reference (gold) program drawn from the original study's released code.

The four disciplines bring deliberately heterogeneous data types and analysis goals:

Discipline	Representative data and tasks
Bioinformatics	Cell and microscopy images, biological sequences, classification and analysis pipelines
Computational chemistry	Molecular structures and properties, predictive modeling of chemical data
Geographical information science	Geospatial and raster map data, spatial analysis and mapping
Psychology and cognitive neuroscience	EEG and behavioral signals, statistical modeling and visualization

Across these domains, the tasks exercise a range of capabilities including data loading and preprocessing, statistical and machine-learning modeling, scientific computing, and data visualization. The unifying constraint is that every task's expected deliverable is a single Python source file that, when executed, produces the required output artifact (for example a saved figure, a results table, or model predictions) ^[1]. This uniform output format is what makes automated, objective scoring possible across otherwise disparate scientific problems.
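The "one program in, one artifact out" contract described above can be illustrated with a minimal sketch. This is not the benchmark's actual harness: the function name `runs_and_saves_output`, the `output/table.csv` path, and the timeout are all hypothetical choices made for illustration.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def runs_and_saves_output(program: Path, expected_output: Path, workdir: Path,
                          timeout_s: int = 600) -> bool:
    """Return True if `program` exits cleanly and leaves a non-empty file
    at `expected_output` (interpreted relative to `workdir`)."""
    try:
        result = subprocess.run(
            [sys.executable, str(program)],
            cwd=workdir,            # the program sees workdir as its current directory
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return False
    out = workdir / expected_output
    return result.returncode == 0 and out.is_file() and out.stat().st_size > 0


if __name__ == "__main__":
    # Toy stand-in for an agent-written program: it writes a small results table.
    with tempfile.TemporaryDirectory() as tmp:
        workdir = Path(tmp)
        program = workdir / "solution.py"
        program.write_text(
            "from pathlib import Path\n"
            "Path('output').mkdir(exist_ok=True)\n"
            "Path('output/table.csv').write_text('id,score\\n1,0.5\\n')\n"
        )
        print(runs_and_saves_output(program, Path("output/table.csv"), workdir))
```

A check of this kind only establishes that an artifact exists. Whether the artifact is scientifically correct has to be judged separately against task-specific criteria.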

To prevent benchmark leakage, the public Hugging Face dataset contains only an annotation sheet with the inputs needed to run an agent, while the gold programs, full datasets, and evaluation scripts are distributed separately under access control ^[2]. The benchmark code is released under the MIT License and the annotations under Creative Commons Attribution 4.0, with original upstream licenses preserved for tasks derived from specific open-source repositories ^[2].

Evaluation methodology

ScienceAgentBench was built with heavy human oversight. The 102 tasks went through multiple rounds of manual validation by annotators and were checked by nine subject-matter experts to ensure annotation quality and scientific plausibility, reducing the risk that the benchmark rewards superficially plausible but scientifically wrong solutions ^[1]^[3].

At evaluation time, an agent receives a task instruction (optionally with an expert-written knowledge snippet), generates a Python program, and the program is executed in a controlled environment. The authors report four complementary metrics ^[1]:

Valid Execution Rate (VER): the fraction of tasks for which the generated program runs without error and saves an output in the expected location and format.
Success Rate (SR): the fraction of tasks whose saved output meets task-specific success criteria (for example a required model-performance threshold or correct predictions). SR is the headline metric and is strictly harder than VER, since code can run cleanly yet still produce a wrong result.
CodeBERTScore (CBS): a similarity measure between the generated program and the reference program, computed with contextual code embeddings, capturing how close the agent's solution is to the human-written gold code.
API Cost: the average monetary cost (in US dollars) per task, reflecting the practical expense of running each agent configuration.
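As a rough illustration of how the four metrics above aggregate over a task set, the sketch below averages hypothetical per-task records. The `TaskResult` fields and the `summarize` function are assumptions made for this example, and the similarity score is taken as an already-computed number rather than derived with CodeBERTScore itself.

```python
from dataclasses import dataclass


@dataclass
class TaskResult:
    # Hypothetical per-task record; field names are illustrative only.
    valid_execution: bool   # program ran and saved output in the expected place
    success: bool           # output met the task-specific success criteria
    code_similarity: float  # precomputed similarity to the reference program
    cost_usd: float         # API spend attributed to this task


def summarize(results: list[TaskResult]) -> dict[str, float]:
    """Average each per-task quantity over all tasks."""
    if not results:
        raise ValueError("no results to summarize")
    n = len(results)
    return {
        "VER": sum(r.valid_execution for r in results) / n,
        "SR": sum(r.success for r in results) / n,
        "CBS": sum(r.code_similarity for r in results) / n,
        "cost_usd_per_task": sum(r.cost_usd for r in results) / n,
    }


if __name__ == "__main__":
    demo = [
        TaskResult(True, True, 0.9, 0.10),
        TaskResult(True, False, 0.8, 0.20),   # ran cleanly, wrong result
        TaskResult(False, False, 0.7, 0.30),  # crashed or saved nothing
        TaskResult(True, False, 0.6, 0.40),
    ]
    print(summarize(demo))
```

In the toy data, three of four programs execute validly but only one succeeds, which is the gap between VER and SR that the metric definitions describe.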

To counter training-data contamination and "shortcut" behavior, the authors applied two safeguards. First, for tasks with held-out test sets they randomly removed several data points so that an agent reusing a memorized public data loader would become misaligned with the success criteria. Second, for model-development tasks they replaced ground-truth test labels with dummy values (such as -1), so that an agent could not simply read and report the answers instead of building a working model ^[1]. Each configuration was given three attempts at every task.
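The two safeguards can be sketched on a toy table of rows. The helper names `drop_random_points` and `mask_labels`, the `label` key, and the fixed seed are hypothetical. The authors' actual data-modification scripts are not reproduced here.

```python
import random


def drop_random_points(rows: list[dict], n_drop: int, seed: int = 0) -> list[dict]:
    """Remove n_drop randomly chosen rows, preserving the order of the rest."""
    if not 0 <= n_drop <= len(rows):
        raise ValueError("n_drop must be between 0 and len(rows)")
    rng = random.Random(seed)  # seeded so the modified split is reproducible
    dropped = set(rng.sample(range(len(rows)), n_drop))
    return [row for i, row in enumerate(rows) if i not in dropped]


def mask_labels(rows: list[dict], label_key: str = "label", dummy: int = -1) -> list[dict]:
    """Return copies of the rows with the true label replaced by a dummy value."""
    return [{**row, label_key: dummy} for row in rows]


if __name__ == "__main__":
    test_set = [{"x": i, "label": i % 2} for i in range(10)]
    released = mask_labels(drop_random_points(test_set, n_drop=3))
    print(len(released), {row["label"] for row in released})
```

After both steps, a program that hard-codes the public row order no longer lines up with the evaluation data, and a program that reads the label column finds only dummy values.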

Results

The release evaluated five open-weight and proprietary LLMs (Llama 3.1 Instruct at 70B and 405B, Mistral Large 2, GPT-4o, and Claude 3.5 Sonnet), each paired with three agent frameworks: direct prompting, the OpenHands CodeAct agent, and a self-debug framework that lets the model iteratively fix its own code. OpenAI's o1-preview reasoning model was additionally tested with direct prompting and self-debug ^[1].

No configuration came close to solving the benchmark. The strongest non-reasoning result came from Claude 3.5 Sonnet with self-debug, which solved 32.4 percent of tasks without expert-provided knowledge and 34.3 percent with it. The o1-preview reasoning model reached the highest Success Rate overall at 42.2 percent, but at more than ten times the API cost of the other models, which raises questions about its practicality. For Claude 3.5 Sonnet, the self-debug agent was reported as roughly 17 times cheaper than the same model run under OpenHands CodeAct ^[1].

Agent configuration	Success Rate
Claude 3.5 Sonnet, self-debug (no expert knowledge)	32.4%
Claude 3.5 Sonnet, self-debug (with expert knowledge)	34.3%
OpenAI o1-preview (best, at >10x cost)	42.2%
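To make the percentages concrete, the short calculation below converts each reported Success Rate into an implied number of solved tasks out of 102. The task counts are inferred here by rounding and are not figures quoted from the source. `implied_task_count` is a hypothetical helper.

```python
TOTAL_TASKS = 102  # benchmark size stated in the article


def implied_task_count(success_rate_percent: float, total: int = TOTAL_TASKS) -> int:
    """Nearest whole number of tasks consistent with a reported percentage."""
    return round(success_rate_percent / 100 * total)


if __name__ == "__main__":
    reported = [
        ("Claude 3.5 Sonnet, self-debug (no expert knowledge)", 32.4),
        ("Claude 3.5 Sonnet, self-debug (with expert knowledge)", 34.3),
        ("OpenAI o1-preview", 42.2),
    ]
    for label, sr in reported:
        solved = implied_task_count(sr)
        print(f"{label}: about {solved} of {TOTAL_TASKS} tasks "
              f"({solved / TOTAL_TASKS:.1%})")
```

Under this rounding, the reported rates correspond to roughly 33, 35, and 43 solved tasks, so the gap between the best non-reasoning agent and o1-preview is on the order of ten tasks.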

The authors drew several qualitative lessons. Expert-provided domain knowledge helped modestly but did not transform performance. Agentic frameworks that allow execution feedback and self-correction outperformed plain direct prompting. And human analysis of failures showed that getting data loading and processing right was a major distinguishing factor between successful and failed programs, indicating that handling real scientific data formats, not just algorithmic reasoning, is a key bottleneck ^[1].

Significance

ScienceAgentBench provides a concrete, reproducible reality check on the claim that LLM agents can automate scientific discovery. By tying every task to a peer-reviewed publication and validating it with domain experts, it grounds evaluation in work that real scientists do, and by reducing each task to a verifiable executable output, it replaces subjective judgments of an "AI scientist's" research with objective pass/fail scoring ^[1]^[3].

The benchmark's central message has been widely cited in subsequent discussion of AI for science and agentic code generation: even the best agents evaluated at its release succeeded on only a minority of individual scientific coding tasks, far short of the reliability that autonomous research would require ^[1]. As a result, ScienceAgentBench has become a standard reference for tempering end-to-end automation claims and for measuring incremental progress, and it has been incorporated into broader agent-evaluation efforts that aggregate multiple benchmarks to assess data-science and scientific-discovery agents ^[4]. Its design choices (expert validation, unified executable outputs, explicit cost accounting, and contamination safeguards) have also influenced how later scientific-agent benchmarks are constructed.

References

1. Chen, Ziru; Chen, Shijie; Ning, Yuting; Zhang, Qianheng; Wang, Boshi; Yu, Botao; Li, Yifei; Liao, Zeyi; Wei, Cheng; Lu, Zitong; Dey, Vishal; Xue, Mingyi; Baker, Frazier N.; Burns, Benjamin; Adu-Ampratwum, Daniel; Huang, Xuhui; Ning, Xia; Gao, Song; Su, Yu; Sun, Huan. "ScienceAgentBench: Toward Rigorous Assessment of Language Agents for Data-Driven Scientific Discovery." arXiv:2410.05080, 7 October 2024 (revised 31 March 2025); accepted at ICLR 2025. https://arxiv.org/abs/2410.05080
2. OSU-NLP-Group. "ScienceAgentBench" (code, dataset, and documentation). GitHub. https://github.com/OSU-NLP-Group/ScienceAgentBench
3. "ScienceAgentBench: Toward Rigorous Assessment of Language Agents for Data-Driven Scientific Discovery." OpenReview (ICLR 2025). https://openreview.net/forum?id=6z4YKr0GK6
4. Kapoor, Sayash; et al. "Holistic Agent Leaderboard: The Missing Infrastructure for AI Agent Evaluation." arXiv:2510.11977, 2025. https://arxiv.org/abs/2510.11977
