SciCode

Overview
Full name	SciCode: A Research Coding Benchmark Curated by Scientists
Abbreviation	SciCode
Description	A research coding benchmark of PhD level scientific problems decomposed into subproblems, with scientist annotated gold solutions and numerical test cases
Release date	July 2024 (arXiv); NeurIPS 2024 Datasets and Benchmarks Track
Authors	Minyang Tian, Luyu Gao, Shizhuo Dylan Zhang, et al. (30 authors)
Lead institutions	University of Illinois Urbana-Champaign, Carnegie Mellon University, Argonne National Laboratory
Technical Details
Type	Scientific computing, code generation, LLM evaluation
Modality	Text (Python code)
Task format	Multi step Python function implementation with executable test cases
Total main problems	80 (65 test, 15 development)
Total subproblems	338 (288 test, 50 development)
Disciplines covered	16 subfields across 5 domains
Domains	Mathematics, physics, chemistry, biology, materials science
Languages	Python
Evaluation	Numerical test cases, domain specific tests, pass@1
Performance
Random baseline	Approximately 0%
Best main problem score (no background)	7.7% (o1 preview, Sept 2024); 9.2% to 10.8% (o3 mini and o4 mini variants, 2025)
Best subproblem score (no background)	28.7% (o1 preview); higher for newer reasoning models
Best score with background	12.3% (Claude 3.5 Sonnet, main); 35.4% (Claude 3.5 Sonnet, subproblems)
Saturated	No
Resources
Website	scicode-bench.github.io
Paper	arXiv:2407.13168
GitHub	scicode-bench/SciCode
Leaderboard	HAL Princeton SciCode
License	Apache 2.0

SciCode is a research coding benchmark that asks large language models to write Python code for realistic, PhD level scientific problems drawn from working scientists' day to day workflows. It contains 80 main problems decomposed into 338 subproblems across 16 subfields in mathematics, physics, chemistry, biology, and materials science. SciCode was introduced by Minyang Tian and 29 collaborators in the paper "SciCode: A Research Coding Benchmark Curated by Scientists" (arXiv:2407.13168, July 2024) and accepted to the NeurIPS 2024 Datasets and Benchmarks Track.^[1]^[2]^[3]

The project is led from the University of Illinois Urbana-Champaign, Carnegie Mellon University, and Argonne National Laboratory, with contributions from MIT, Harvard, the University of Chicago, Stanford, and Princeton. Unlike exam style benchmarks such as HumanEval or MBPP, SciCode targets the kind of code that produces published results, including numerical methods, simulations, and quantitative modeling.^[1]^[4]

SciCode is notably difficult. In the paper's headline result, Claude 3.5 Sonnet, the strongest model evaluated at the time of submission, solved only 4.6% of main problems in the realistic setting without background notes. Even the latest reasoning models released through 2025 score in the high single digits to low double digits on the main problem metric.^[1]^[2]^[5]

Background and motivation

Research software is messy in ways that classroom problems are not. A scientist usually has to combine knowledge of an underlying physical theory, a numerical method that handles stiffness or stability, an implementation in NumPy or SciPy, and a way to validate the output against a known limiting case. Existing code generation benchmarks largely ignore this layered structure: HumanEval evaluates short self contained functions, SWE-bench targets software engineering bug fixes, and competitive programming sets focus on puzzles.^[1]

The SciCode authors wanted a benchmark that reflects how scientists actually use code. Many problems were sourced directly from scripts the contributors had written for their own published research, then rewritten so that the solution path is well defined and a hidden numerical test exists for every subproblem. Several problems are based on Nobel Prize related methods, including density functional theory, the Kohn-Sham equations, and Monte Carlo techniques.^[1]^[2]

Dataset composition

The full benchmark contains 80 main problems and 338 subproblems. The authors release a development split (15 main problems, 50 subproblems) for prompt engineering and a held out test split (65 main problems, 288 subproblems) used for the public leaderboard. The breakdown of main problems by subfield follows the table below.^[1]^[2]

Domain	Subfield	Main problems
Physics	Condensed matter physics	13
Physics	Optics	10
Physics	Quantum information and computing	6
Physics	Computational physics	5
Physics	Astrophysics	2
Physics	Particle physics	1
Mathematics	Numerical linear algebra	8
Mathematics	Computational mechanics	5
Mathematics	Computational finance	1
Chemistry	Quantum chemistry	5
Chemistry	Computational chemistry	3
Materials science	Semiconductor materials	7
Materials science	Molecular modeling	6
Biology	Ecology	6
Biology	Biochemistry	1
Biology	Genetics	1

Main problems are split into between 2 and roughly 15 subproblems, ordered so that earlier steps can be reused as helper functions in later ones. Subproblems are written as Python function signatures with a docstring that describes the scientific task, inputs, and expected outputs. Test cases live in a numerical results file (test_data.h5) and many of them check agreement with closed form analytical solutions, published results, or independent reference implementations.^[1]^[2]^[6]
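As a rough illustration of this layout, the sketch below shows what two chained subproblems and one numerical check might look like. The Gaussian beam task, the function names, and the test values are hypothetical and are not taken from the SciCode dataset; the real tests read their targets from test_data.h5 rather than computing them inline.

```python
import numpy as np


def gaussian_beam_radius(w0, z, wavelength):
    """Hypothetical subproblem 1: beam radius w(z) of a Gaussian beam.

    Inputs:
        w0: waist radius at z = 0 (float, meters)
        z: propagation distance(s) (float or ndarray, meters)
        wavelength: vacuum wavelength (float, meters)
    Output:
        w: beam radius at each z (float or ndarray, meters)
    """
    z_rayleigh = np.pi * w0**2 / wavelength  # Rayleigh range
    return w0 * np.sqrt(1.0 + (np.asarray(z, dtype=float) / z_rayleigh) ** 2)


def gaussian_beam_intensity(r, z, w0, wavelength, power):
    """Hypothetical subproblem 2: intensity I(r, z), reusing subproblem 1."""
    w = gaussian_beam_radius(w0, z, wavelength)
    r = np.asarray(r, dtype=float)
    return 2.0 * power / (np.pi * w**2) * np.exp(-2.0 * r**2 / w**2)


# Check against a closed form limit: at the Rayleigh range, w = sqrt(2) * w0.
w0, wavelength = 1.0e-3, 633.0e-9
z_rayleigh = np.pi * w0**2 / wavelength
assert np.allclose(gaussian_beam_radius(w0, z_rayleigh, wavelength), np.sqrt(2.0) * w0)
```

The second function only works if the first returns what its docstring promises, which is the kind of dependency the ordering of subproblems is meant to exercise.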

Problem structure

Every subproblem combines four kinds of difficulty: knowledge recall (retrieving the relevant scientific facts), mathematical reasoning (deriving or rearranging the right equations), algorithm design (picking and adapting a numerical method), and code synthesis (writing a runnable Python function). The model is also required to remain consistent across subproblems, since later steps typically import the solutions of earlier ones.^[1]

Required skill	What the model has to do	Example
Knowledge recall	Pull domain specific facts from memory	Form factor of a Bragg grating; lattice constant of silicon
Mathematical reasoning	Derive or rearrange formulas	Going from the time independent Schrödinger equation to a tridiagonal matrix system
Algorithm design	Choose stable, efficient numerical methods	Picking a symplectic integrator for orbital mechanics
Code synthesis	Translate the chosen approach into Python	Implementing the SCF loop in a quantum chemistry calculation
Cross step consistency	Reuse outputs from earlier subproblems	Plugging an earlier Hamiltonian builder into a later eigensolver

Reproducing methods like density functional theory, BCS superconductivity calculations, or Diels-Alder reaction modeling from scratch in Python requires both conceptual understanding and the bookkeeping discipline that real research demands.^[1]^[2]
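The mathematical reasoning example in the table, turning the time independent Schrödinger equation into a tridiagonal matrix system, can be sketched for the simplest case. The code below is an illustrative toy, not a SciCode problem: it assumes a particle in an infinite well, units with hbar = m = 1, and a second order central difference for the second derivative. The function name box_hamiltonian is made up for this example.

```python
import numpy as np


def box_hamiltonian(n_points, length=1.0):
    """Finite difference Hamiltonian for a particle in an infinite well.

    Uses units with hbar = m = 1, so H = -(1/2) d^2/dx^2 on (0, length)
    with psi = 0 at both walls. Replacing psi'' by
    (psi[i+1] - 2 * psi[i] + psi[i-1]) / h^2 gives a tridiagonal matrix.
    """
    h = length / (n_points + 1)                # spacing between interior grid points
    diag = np.full(n_points, 1.0 / h**2)       # from -(1/2) * (-2 / h^2)
    off = np.full(n_points - 1, -0.5 / h**2)   # from -(1/2) * (1 / h^2)
    return np.diag(diag) + np.diag(off, 1) + np.diag(off, -1)


# Compare the lowest levels with the analytical result E_n = (n * pi)^2 / 2.
n_levels = 3
energies = np.linalg.eigvalsh(box_hamiltonian(200))[:n_levels]
exact = np.array([(n * np.pi) ** 2 / 2.0 for n in range(1, n_levels + 1)])
assert np.allclose(energies, exact, rtol=1e-3)
```

A dense matrix is used here only to keep the sketch short; a tridiagonal or sparse eigensolver would be the more natural choice at larger grid sizes.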

Evaluation protocol

SciCode runs in two main settings. In the standard setting, the model receives only the function signature, a brief docstring, and any imports already produced for earlier subproblems. In the with background setting, the model also gets a scientist authored note that explains the relevant physics, equations, or algorithm. The two settings let researchers separate two skills: how much a model knows about a field versus how well it can implement an already explained method.^[1]

Setting	Inputs to model	What it measures
Standard (no background)	Function signature, docstring, prior subproblem solutions	Combined knowledge, reasoning, and implementation
With background	All of the above plus scientist annotated background	Implementation and instruction following, given correct knowledge
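The difference between the two settings comes down to what text is placed in front of the model. The sketch below is a hypothetical prompt builder written for this article; the dictionary keys, section markers, and function name are assumptions and do not reproduce the official SciCode prompt template.

```python
def build_prompt(step, previous_code, with_background=False):
    """Assemble the text shown to the model for one subproblem.

    `step` is a hypothetical dict with keys "signature", "docstring",
    and optionally "background"; `previous_code` is a list of source
    strings for earlier subproblems.
    """
    parts = []
    if previous_code:
        parts.append("# Solutions to earlier steps\n" + "\n\n".join(previous_code))
    if with_background and step.get("background"):
        # Only the with background setting includes the scientist's note.
        parts.append("# Background\n" + step["background"])
    parts.append(step["signature"] + '\n    """' + step["docstring"] + '"""')
    return "\n\n".join(parts)


step = {
    "signature": "def kinetic_energy(m, v):",
    "docstring": "Return the kinetic energy of a mass m moving at speed v.",
    "background": "For non-relativistic motion, E = (1/2) m v^2.",
}
standard = build_prompt(step, previous_code=[])
with_bg = build_prompt(step, previous_code=[], with_background=True)
```

In this sketch the standard prompt is a strict suffix of the with background prompt, so any score difference between the two can be attributed to the added note.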

Generated code is executed against the hidden numerical tests with np.allclose style comparisons. A subproblem counts as correct when all of its tests pass; a main problem counts as correct only when every subproblem in it passes, which is why main problem accuracy is far lower than subproblem accuracy. The original harness used a two step pipeline (gencode.py then test_generated_code.py); newer evaluations integrate with the Inspect AI framework maintained by the UK AI Safety Institute.^[2]^[6]
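The all or nothing aggregation described above can be sketched in a few lines. This is a minimal illustration, not the official harness: the function names and the results layout are hypothetical, and the tolerances shown are NumPy's defaults rather than values confirmed for SciCode.

```python
import numpy as np


def subproblem_passes(outputs, targets, rtol=1e-5, atol=1e-8):
    """True only if every test output matches its stored target."""
    if len(outputs) != len(targets):
        return False
    return all(np.allclose(o, t, rtol=rtol, atol=atol) for o, t in zip(outputs, targets))


def score(results):
    """results maps main problem id -> {subproblem id: passed?}."""
    sub_flags = [ok for subs in results.values() for ok in subs.values()]
    main_flags = [all(subs.values()) for subs in results.values()]
    return sum(sub_flags) / len(sub_flags), sum(main_flags) / len(main_flags)


results = {
    "A": {"A.1": True, "A.2": True, "A.3": False},  # one miss fails all of A
    "B": {"B.1": True, "B.2": True},
}
sub_acc, main_acc = score(results)
print(sub_acc, main_acc)  # 0.8 0.5
```

In the toy data four of five subproblems pass, but only one of the two main problems does, which mirrors the gap between the two metrics.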

Quality control involved three rounds of validation: an in domain scientist reviewing each problem and its tests, an out of domain scientist checking clarity, and a GPT-4 pass used to flag ambiguous prompts. Dependencies are kept to widely used libraries such as NumPy, SciPy, and SymPy.^[1]^[2]

Headline results

The original paper reported pass@1 scores for ten models. Claude 3.5 Sonnet led the standard setting with 4.6% main problem accuracy, followed by DeepSeek Coder v2 at 3.1%; GPT-4o, GPT-4 Turbo, Gemini 1.5 Pro, and Claude 3 Opus each scored 1.5%. Subproblem scores were much higher, since many subproblems are isolated helper functions that do not require all earlier code to be perfect.^[1]

Model	Subproblem accuracy (no background)	Main problem accuracy (no background)
Claude 3.5 Sonnet	26.0%	4.6%
GPT-4o	25.0%	1.5%
GPT-4 Turbo	22.9%	1.5%
Gemini 1.5 Pro	21.9%	1.5%
Claude 3 Opus	21.5%	1.5%
DeepSeek Coder v2	21.2%	3.1%
Claude 3 Sonnet	17.0%	1.5%
Qwen2 72B Instruct	17.0%	1.5%
Mixtral 8x22B Instruct	16.3%	0.0%
Llama 3 70B Instruct	14.6%	0.0%

Adding scientist annotated background boosted performance, especially for reasoning models. Claude 3.5 Sonnet jumped to 35.4% on subproblems and 12.3% on main problems. OpenAI's o1 mini gained the most from background knowledge and topped that chart at around 13.8% pass@1 on main problems in some reports, which suggests knowledge gaps, not pure reasoning, are a major bottleneck.^[1]^[7]

Updates and the 2025 leaderboard

The Holistic Agent Leaderboard (HAL), hosted by Princeton, maintains a SciCode leaderboard that tracks both raw accuracy and dollar cost per evaluation across newer models and agent scaffolds. As of mid 2025 the top entries include o4 mini and o3 variants, GPT-4.1, and Claude Opus 4.1 running under either zero shot or tool calling SciCode agents.^[5]

Agent	Model	Main problem accuracy	Estimated cost
SciCode zero shot agent	o4 mini Low	9.23%	About $1.74
SciCode tool calling agent	o3 Medium	9.23%	About $111
SciCode tool calling agent	Claude Opus 4.1	7.69%	About $625
SciCode tool calling agent	Claude Opus 4.1 High	6.92%	About $551
SciCode zero shot agent	GPT-4.1	6.15%	About $2.82
SciCode zero shot agent	o1 preview (Sept 2024)	7.7%	Reported in paper

The GitHub leaderboard also lists o3 mini variants between roughly 9% and 11% and reports DeepSeek R1 at 4.6% main problem accuracy, matching Claude 3.5 Sonnet from the original paper. None of these numbers approach the level at which a model could replace a scientist; SciCode remains far from saturated.^[5]^[6]

Common failure modes and example problems

Error analyses in the paper surface a recurring pattern: models often produce code that looks reasonable but uses a wrong sign convention, an outdated empirical formula, or an unstable algorithm. Cross subproblem consistency is another sticking point, where later functions expect shapes earlier ones do not return. Problems that combine two separate ideas, such as a finite difference scheme with a custom boundary condition, also push models past their reliability point.^[1]^[7]

Failure mode	Description	Example
Domain knowledge gap	Wrong constant, formula, or sign convention	Mixing CGS and SI units in an electrodynamics problem
Numerical instability	Choosing an unstable scheme for a stiff problem	Using forward Euler on a stiff chemical kinetics ODE
Cross step inconsistency	Mismatched input or output shapes between subproblems	Returning a list when a dense ndarray is expected
Unfinished implementation	Leaving a stub or placeholder	Returning zeros instead of computing the integral
Misreading the prompt	Solving a related but different problem	Computing a different transform than the one requested
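The numerical instability row can be made concrete with a toy problem. The sketch below is not drawn from SciCode: it uses the linear test equation y' = -lam * y as a stand-in for a fast decaying mode in a stiff kinetics system, and compares forward (explicit) Euler with backward (implicit) Euler at a step size that is too large for the explicit scheme.

```python
def forward_euler(lam, y0, h, n_steps):
    """Explicit Euler for y' = -lam * y: y_{n+1} = (1 - lam * h) * y_n."""
    y = y0
    for _ in range(n_steps):
        y = y + h * (-lam * y)
    return y


def backward_euler(lam, y0, h, n_steps):
    """Implicit Euler for y' = -lam * y: y_{n+1} = y_n / (1 + lam * h)."""
    y = y0
    for _ in range(n_steps):
        y = y / (1.0 + lam * h)
    return y


# lam * h = 10, well beyond the explicit stability limit lam * h < 2.
lam, h, n_steps = 1000.0, 0.01, 50
y_explicit = forward_euler(lam, 1.0, h, n_steps)   # magnitude grows by 9x per step
y_implicit = backward_euler(lam, 1.0, h, n_steps)  # shrinks by 11x per step
```

The true solution decays toward zero, so the explicit result is wrong by many orders of magnitude even though both functions look equally plausible on the page.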

The public dataset spans a wide range of real research tasks. A few representative examples are listed below.^[1]^[2]

Subfield	Example task	Required ideas
Quantum chemistry	Implementing a Hartree-Fock self consistent field loop	Slater determinants, Roothaan equations, eigensolvers
Condensed matter	Computing band structures via tight binding	Bloch theorem, Hamiltonian construction, diagonalization
Optics	Modeling guided modes in a slab waveguide	Maxwell equations, transfer matrices, root finding
Ecology	Simulating predator prey dynamics with stochastic perturbations	Lotka-Volterra equations, Monte Carlo methods
Computational physics	Performing radiation transfer through an atmosphere	Two stream approximation, integration schemes
Numerical linear algebra	Implementing iterative eigensolvers	Lanczos or Arnoldi iteration, Krylov subspaces

Related benchmarks

SciCode sits in a small but growing group of benchmarks for science focused code generation and reasoning, complementing better known coding evaluations.

Benchmark	Focus	How it differs from SciCode
HumanEval	Short Python function synthesis	No scientific domain knowledge required
MBPP	Basic Python programming problems	Aimed at entry level coders, not researchers
SWE-bench	Real GitHub issue fixing	Software engineering, not numerical science
MATH	Competition mathematics	No code execution, pure math reasoning
GPQA	Graduate level multiple choice science questions	No coding component
MLE-bench	Machine learning engineering	Kaggle style ML competitions
LAB-Bench	Biology lab tasks for AI agents	Wet lab and protocol focus
ResearchBench	Open ended research tasks	Less structured evaluation

SciCode's combination of executable tests, scientist authored gold solutions, and broad scientific scope makes it one of the few standardized ways to track progress on AI assistants for research computing.^[1]^[3]

Limitations

The paper and follow up commentary note several limitations. The benchmark is Python only, so it does not test Fortran, C++, or Julia. The dataset is fixed, which means scores can drift as public solutions seep into training corpora, though the held out test split helps. The 16 subfields are a sample of natural science computing, not a complete map; computational social science, neuroscience, and large scale climate simulation are out of scope. Evaluation also requires running scientific Python code with real dependencies, which adds engineering overhead.^[1]^[2]

Limitation	Why it matters
Python only	Underestimates gaps in Fortran or C++ scientific code
Static dataset	Public solutions can leak into training corpora
Limited subfield coverage	Excludes whole sectors of computational science
Numerical tolerances	Strict tests; small numerical bugs cause failures
Cost and compute	Long generations under reasoning models can be expensive

Reception and use

SciCode has become a standard reference in discussions of AI for science and the question of whether LLMs can do real research work. It is tracked by Artificial Analysis, the HAL leaderboard at Princeton, and Inspect AI evaluation suites, and frontier releases from Anthropic, OpenAI, Google DeepMind, and DeepSeek have included SciCode scores in technical reports or third party comparisons.^[5]^[8]

Where state of the art models now solve the majority of HumanEval problems, the best SciCode main problem scores remain at roughly 10% or below more than a year after release. That gap preserves a meaningful signal for future models to climb against, which is the job of a benchmark.^[1]^[5]

References

1. Tian, M., Gao, L., Zhang, S. D., et al. "SciCode: A Research Coding Benchmark Curated by Scientists." arXiv:2407.13168, July 2024. https://arxiv.org/abs/2407.13168
2. SciCode official website. https://scicode-bench.github.io/
3. NeurIPS 2024 Datasets and Benchmarks Track poster page for SciCode. https://neurips.cc/virtual/2024/poster/97822
4. NeurIPS 2024 proceedings PDF, "SciCode: A Research Coding Benchmark Curated by Scientists." https://proceedings.neurips.cc/paper_files/paper/2024/file/36850592258c8c41cecdaa3dea5ff7de-Paper-Datasets_and_Benchmarks_Track.pdf
5. HAL Princeton SciCode leaderboard. https://hal.cs.princeton.edu/scicode
6. SciCode GitHub repository. https://github.com/scicode-bench/SciCode
7. OpenReview discussion thread for SciCode. https://openreview.net/forum?id=ADLaALtdoG
8. Artificial Analysis SciCode evaluations page. https://artificialanalysis.ai/evaluations/scicode
