ScienceAgentBench
Last reviewed
Jun 8, 2026
Sources
4 citations
Review status
Source-backed
Revision
v1 · 1,569 words
ScienceAgentBench is an AI benchmark for evaluating whether language model agents can perform real, data-driven scientific analysis by writing and executing code. Introduced in October 2024 by a group led by Ziru Chen at The Ohio State University, the benchmark consists of 102 tasks extracted from 44 peer-reviewed scientific publications spanning four disciplines: bioinformatics, computational chemistry, geographical information science, and psychology and cognitive neuroscience [1][2]. Each task requires an agent to produce a self-contained Python program that loads scientific data, performs a defined analysis, and saves a verifiable output, mirroring an individual stage of a genuine research workflow.
The paper, "ScienceAgentBench: Toward Rigorous Assessment of Language Agents for Data-Driven Scientific Discovery," was posted as a preprint on arXiv on 7 October 2024 and subsequently accepted to the International Conference on Learning Representations (ICLR) 2025 [1]. The benchmark was designed as a deliberately grounded counterweight to sweeping "AI scientist" claims: rather than judging an agent on an end-to-end discovery pipeline, it isolates concrete, expert-validated coding tasks whose correctness can be checked objectively. In the authors' headline finding, even the best-performing agent configurations at release solved only a minority of tasks: a Claude 3.5 Sonnet agent solved 32.4 percent without expert-provided knowledge, and OpenAI's o1-preview reached 42.2 percent at more than ten times the cost. The authors take this as evidence that contemporary agents remain far from automating scientific discovery [1].
By 2024, a wave of work argued that LLM-based agents could accelerate or even automate parts of the scientific process, with some systems marketed as autonomous "AI scientists" capable of generating hypotheses, running experiments, and writing papers. The creators of ScienceAgentBench argued that such claims were difficult to assess because existing evaluations were often coarse, used artificial or toy datasets, or relied on subjective judgments of an agent's full research output [1]. Without rigorous, realistic measurement, it was hard to separate real capability from hype.
ScienceAgentBench addresses this gap by adopting an "essential tasks" philosophy. Instead of asking whether an agent can autonomously make a discovery end to end, the benchmark asks whether an agent can reliably perform the individual data-driven steps that real scientists actually code, such as loading domain-specific data formats, running a statistical model, training a predictive classifier, or producing a publication-quality visualization. Because every task is drawn from a published study and reviewed by domain experts, success on the benchmark reflects competence on work that practicing researchers consider meaningful [1][3]. The authors are explicit that strong performance here would be a necessary but not sufficient condition for trustworthy autonomous research, framing the benchmark as a floor of basic competence rather than a test of full automation.
The benchmark comprises 102 tasks curated from 44 peer-reviewed publications across four scientific fields [1][2]. Each task is a self-contained unit of a real research workflow, paired with the input data, a natural-language task instruction, and, where relevant, a domain-knowledge document and a reference (gold) program drawn from the original study's released code.
The four disciplines bring deliberately heterogeneous data types and analysis goals:
| Discipline | Representative data and tasks |
|---|---|
| Bioinformatics | Cell and microscopy images, biological sequences, classification and analysis pipelines |
| Computational chemistry | Molecular structures and properties, predictive modeling of chemical data |
| Geographical information science | Geospatial and raster map data, spatial analysis and mapping |
| Psychology and cognitive neuroscience | EEG and behavioral signals, statistical modeling and visualization |
Across these domains, the tasks exercise a range of capabilities including data loading and preprocessing, statistical and machine-learning modeling, scientific computing, and data visualization. The unifying constraint is that every task's expected deliverable is a single Python source file that, when executed, produces the required output artifact (for example a saved figure, a results table, or model predictions) [1]. This uniform output format is what makes automated, objective scoring possible across otherwise disparate scientific problems.
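The shape of such a deliverable can be illustrated with a minimal sketch. It is not one of the benchmark's tasks: the column names (`group`, `value`), the file names, and the `run_task` helper are hypothetical, and only the standard library is used so the example stays self-contained. What it shows is the load, analyze, save pattern that a single-file submission follows.

```python
import csv
import statistics
import tempfile
from pathlib import Path


def run_task(input_csv: Path, output_csv: Path) -> None:
    """Load data, run one analysis step, and save a checkable artifact."""
    # 1. Load the input data (hypothetical columns: 'group', 'value').
    with input_csv.open(newline="") as f:
        rows = list(csv.DictReader(f))

    # 2. Analysis step: mean of 'value' within each group.
    by_group = {}
    for row in rows:
        by_group.setdefault(row["group"], []).append(float(row["value"]))
    means = {g: statistics.mean(v) for g, v in sorted(by_group.items())}

    # 3. Save the output artifact at the agreed location.
    output_csv.parent.mkdir(parents=True, exist_ok=True)
    with output_csv.open("w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["group", "mean_value"])
        for group, mean in means.items():
            writer.writerow([group, f"{mean:.3f}"])


if __name__ == "__main__":
    # Toy input written on the fly so the sketch runs anywhere.
    with tempfile.TemporaryDirectory() as tmp:
        src = Path(tmp) / "measurements.csv"
        src.write_text("group,value\ncontrol,1.0\ncontrol,3.0\ntreated,10.0\n")
        dst = Path(tmp) / "output" / "group_means.csv"
        run_task(src, dst)
        print(dst.read_text())
```

Because the program's only observable effect is the file it writes, a scorer needs to know nothing about how the analysis was coded, only where the artifact should appear and what it should contain.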
To prevent benchmark leakage, the public Hugging Face dataset contains only an annotation sheet with the inputs needed to run an agent, while the gold programs, full datasets, and evaluation scripts are distributed separately under access control [2]. The benchmark code is released under the MIT License and the annotations under Creative Commons Attribution 4.0, with original upstream licenses preserved for tasks derived from specific open-source repositories [2].
ScienceAgentBench was built with heavy human oversight. The 102 tasks went through multiple rounds of manual validation by annotators and were checked by nine subject-matter experts to ensure annotation quality and scientific plausibility, reducing the risk that the benchmark rewards superficially plausible but scientifically wrong solutions [1][3].
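At evaluation time, an agent receives a task instruction (optionally with an expert-written knowledge snippet) and generates a Python program, which is then executed in a controlled environment. The authors report four complementary metrics [1]: Valid Execution Rate, the share of programs that run without error and save their output under the expected file name; Success Rate, the share whose output meets the task-specific success criteria; CodeBERTScore, a similarity score between the generated program and the reference program; and API cost, the average expense of attempting a task.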
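The execute-then-check step described above can be sketched as follows. This is an illustrative harness under stated assumptions, not the benchmark's evaluation code: the `executes_validly` function, the timeout value, and the file names are all hypothetical, and a plain subprocess stands in for whatever isolated environment a real evaluation would use.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def executes_validly(program: Path, expected_output: Path, timeout_s: int = 60) -> bool:
    """Return True if the program exits cleanly and leaves the expected file."""
    try:
        result = subprocess.run(
            [sys.executable, str(program)],  # run with the current interpreter
            cwd=program.parent,              # relative paths resolve next to the program
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return False
    # Both conditions must hold: a clean exit code and a saved artifact.
    return result.returncode == 0 and expected_output.is_file()


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        workdir = Path(tmp)
        good = workdir / "good.py"
        good.write_text("open('result.txt', 'w').write('ok')\n")
        bad = workdir / "bad.py"
        bad.write_text("raise RuntimeError('crash')\n")
        print(executes_validly(good, workdir / "result.txt"))
        print(executes_validly(bad, workdir / "missing.txt"))
```

A check like this only establishes that a program ran and produced a file. Judging whether the file's contents are scientifically correct requires separate, task-specific criteria.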
To counter training-data contamination and "shortcut" behavior, the authors applied two safeguards. First, for tasks with held-out test sets they randomly removed several data points so that an agent reusing a memorized public data loader would become misaligned with the success criteria. Second, for model-development tasks they replaced ground-truth test labels with dummy values (such as -1), so that an agent could not simply read and report the answers instead of building a working model [1]. Each task was attempted up to three times per configuration.
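Both safeguards are simple data transformations, and a minimal sketch may make them concrete. The helper names (`drop_random_points`, `mask_labels`), the record layout, and the number of points removed are assumptions made for illustration; the source does not specify how the authors implemented either step.

```python
import random


def drop_random_points(rows, n_drop, seed=0):
    """Return a copy of rows with n_drop randomly chosen entries removed."""
    rng = random.Random(seed)  # seeded so the modified test set is reproducible
    drop = set(rng.sample(range(len(rows)), n_drop))
    return [row for i, row in enumerate(rows) if i not in drop]


def mask_labels(rows, label_key="label", dummy=-1):
    """Return copies of rows whose ground-truth label is replaced by a dummy."""
    return [{**row, label_key: dummy} for row in rows]


if __name__ == "__main__":
    # Toy held-out test set with hypothetical fields.
    test_set = [{"id": i, "feature": i * 0.5, "label": i % 2} for i in range(10)]

    reduced = drop_random_points(test_set, n_drop=3)
    masked = mask_labels(reduced)

    print(len(test_set), len(reduced))        # original size vs. size after removal
    print({row["label"] for row in masked})   # only the dummy value remains
```

After the first step, a program that hard-codes the public test set no longer lines up row for row with the evaluation data. After the second, the labels shipped to the agent carry no information, so a correct prediction file can only come from a model that was actually fitted.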
The release evaluated five open-weight and proprietary LLMs (Llama 3.1 Instruct at 70B and 405B, Mistral Large 2, GPT-4o, and Claude 3.5 Sonnet), each paired with three agent frameworks: direct prompting, the OpenHands CodeAct agent, and a self-debug framework that lets the model iteratively fix its own code. OpenAI's o1-preview reasoning model was additionally tested with direct prompting and self-debug [1].
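The experimental grid this implies can be enumerated in a few lines. The sketch below only counts model and framework pairings as described in the text; it treats the with and without expert-knowledge settings as outside the grid, which is a simplifying assumption, and the `build_grid` helper is hypothetical.

```python
from itertools import product

MODELS = [
    "Llama 3.1 Instruct 70B",
    "Llama 3.1 Instruct 405B",
    "Mistral Large 2",
    "GPT-4o",
    "Claude 3.5 Sonnet",
]
FRAMEWORKS = ["direct prompting", "OpenHands CodeAct", "self-debug"]
O1_FRAMEWORKS = ["direct prompting", "self-debug"]  # o1-preview was not run with CodeAct


def build_grid():
    """List every (model, framework) pairing named in the text."""
    grid = list(product(MODELS, FRAMEWORKS))
    grid += [("o1-preview", framework) for framework in O1_FRAMEWORKS]
    return grid


grid = build_grid()
print(len(grid))  # 5 models x 3 frameworks, plus 2 o1-preview runs
```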
No configuration came close to solving the benchmark. The strongest non-reasoning result came from Claude 3.5 Sonnet with self-debug, which solved 32.4 percent of tasks without expert-provided knowledge and 34.3 percent with it. The o1-preview reasoning model reached the highest Success Rate overall at 42.2 percent, but at more than ten times the API cost of the cheaper agents, raising clear questions about practicality. Among comparable setups, Claude 3.5 Sonnet's self-debug agent was reported as roughly 17 times cheaper than the OpenHands CodeAct configuration [1].
| Agent configuration | Success Rate |
|---|---|
| Claude 3.5 Sonnet, self-debug (no expert knowledge) | 32.4% |
| Claude 3.5 Sonnet, self-debug (with expert knowledge) | 34.3% |
| OpenAI o1-preview (best, at >10x cost) | 42.2% |
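With only 102 tasks, each percentage in the table corresponds to a whole number of solved tasks. The short calculation below recovers those counts, assuming each Success Rate is a plain fraction of the 102 tasks rounded to one decimal place. The counts are inferred from the percentages and are not figures quoted from the paper.

```python
TOTAL_TASKS = 102

# Success Rates as reported in the table above (percent).
reported = {
    "Claude 3.5 Sonnet, self-debug (no expert knowledge)": 32.4,
    "Claude 3.5 Sonnet, self-debug (with expert knowledge)": 34.3,
    "OpenAI o1-preview (best)": 42.2,
}


def implied_task_count(pct, total=TOTAL_TASKS):
    """Nearest whole number of tasks consistent with a reported percentage."""
    return round(pct / 100 * total)


for name, pct in reported.items():
    solved = implied_task_count(pct)
    print(f"{name}: {solved}/{TOTAL_TASKS} = {solved / TOTAL_TASKS:.1%}")
```

Under that assumption the three rows correspond to 33, 35, and 43 tasks, so the gain attributed to expert knowledge amounts to two additional tasks, and one task is worth roughly one percentage point.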
The authors drew several qualitative lessons. Expert-provided domain knowledge helped modestly but did not transform performance. Agentic frameworks that allow execution feedback and self-correction outperformed plain direct prompting. Finally, manual analysis of failures showed that getting data loading and processing right was a major factor separating successful from failed programs, which indicates that handling real scientific data formats, not just algorithmic reasoning, is a key bottleneck [1].
ScienceAgentBench provides a concrete, reproducible reality check on the claim that LLM agents can automate scientific discovery. By tying every task to a peer-reviewed publication and validating it with domain experts, it grounds evaluation in work that real scientists do, and by reducing each task to a verifiable executable output, it replaces subjective judgments of an "AI scientist's" research with objective pass/fail scoring [1][3].
The benchmark's central message has been widely cited in subsequent discussion of AI for science and agentic code generation: even the best agents available when it was released and published (2024 to 2025) succeeded on only a minority of individual scientific coding tasks, far short of the reliability that autonomous research would require [1]. As a result, ScienceAgentBench has become a standard reference for tempering end-to-end automation claims and for measuring incremental progress, and it has been incorporated into broader agent-evaluation efforts that aggregate multiple benchmarks to assess data-science and scientific-discovery agents [4]. Its design choices (expert validation, unified executable outputs, explicit cost accounting, and contamination safeguards) have also influenced how later scientific-agent benchmarks are constructed.