SciCode
Last reviewed
May 10, 2026
Sources
8 citations
Review status
Source-backed
Revision
v2 · 2,732 words
| SciCode | |
|---|---|
| Overview | |
| Full name | SciCode: A Research Coding Benchmark Curated by Scientists |
| Abbreviation | SciCode |
| Description | A research coding benchmark of PhD level scientific problems decomposed into subproblems, with scientist annotated gold solutions and numerical test cases |
| Release date | July 2024 (arXiv); NeurIPS 2024 Datasets and Benchmarks Track |
| Authors | Minyang Tian, Luyu Gao, Shizhuo Dylan Zhang, et al. (30 authors) |
| Lead institutions | University of Illinois Urbana-Champaign, Carnegie Mellon University, Argonne National Laboratory |
| Technical Details | |
| Type | Scientific computing, code generation, LLM evaluation |
| Modality | Text (Python code) |
| Task format | Multi step Python function implementation with executable test cases |
| Total main problems | 80 (65 test, 15 development) |
| Total subproblems | 338 (288 test, 50 development) |
| Disciplines covered | 16 subfields across 6 domains |
| Domains | Mathematics, physics, chemistry, biology, materials science, computational mechanics |
| Languages | Python |
| Evaluation | Numerical test cases, domain specific tests, pass@1 |
| Performance | |
| Random baseline | Approximately 0% |
| Best main problem score (no background) | 7.7% (o1 preview, Sept 2024); 9.2% to 10.8% (o3 mini and o4 mini variants, 2025) |
| Best subproblem score (no background) | 28.7% (o1 preview); higher for newer reasoning models |
| Best score with background | 12.3% (Claude 3.5 Sonnet, main); 35.4% (Claude 3.5 Sonnet, subproblems) |
| Saturated | No |
| Resources | |
| Website | scicode-bench.github.io |
| Paper | arXiv:2407.13168 |
| GitHub | scicode-bench/SciCode |
| Leaderboard | HAL Princeton SciCode |
| License | Apache 2.0 |
SciCode is a research coding benchmark that asks large language models to write Python code for realistic, PhD level scientific problems drawn from working scientists' day to day workflows. It contains 80 main problems decomposed into 338 subproblems across 16 subfields in mathematics, physics, chemistry, biology, and materials science. SciCode was introduced by Minyang Tian and 29 collaborators in the paper "SciCode: A Research Coding Benchmark Curated by Scientists" (arXiv:2407.13168, July 2024) and accepted to the NeurIPS 2024 Datasets and Benchmarks Track.[1][2][3]
The project is led from the University of Illinois Urbana-Champaign, Carnegie Mellon University, and Argonne National Laboratory, with contributions from MIT, Harvard, the University of Chicago, Stanford, and Princeton. Unlike exam style benchmarks such as HumanEval or MBPP, SciCode targets the kind of code that produces published results, including numerical methods, simulations, and quantitative modeling.[1][4]
SciCode is notably difficult. In the paper's headline result, Claude 3.5 Sonnet, the strongest model evaluated at the time of submission, solved only 4.6% of main problems in the realistic setting without background notes. Even the latest reasoning models released through 2025 score in the high single digits to low double digits on the main problem metric.[1][2][5]
Research software is messy in ways that classroom problems are not. A scientist usually has to combine knowledge of an underlying physical theory, a numerical method that handles stiffness or stability, an implementation in NumPy or SciPy, and a way to validate the output against a known limiting case. Existing code generation benchmarks largely ignore this layered structure: HumanEval evaluates short self contained functions, SWE-bench targets software engineering bug fixes, and competitive programming sets focus on puzzles.[1]
The SciCode authors wanted a benchmark that reflects how scientists actually use code. Many problems were sourced directly from scripts the contributors had written for their own published research, then rewritten so that the solution path is well defined and a hidden numerical test exists for every subproblem. Several problems are based on Nobel Prize related methods, including density functional theory, the Kohn-Sham equations, and Monte Carlo techniques.[1][2]
The full benchmark contains 80 main problems and 338 subproblems. The authors release a development split (15 main problems, 50 subproblems) for prompt engineering and a held out test split (65 main problems, 288 subproblems) used for the public leaderboard. The table below breaks down the main problems by subfield.[1][2]
| Domain | Subfield | Main problems |
|---|---|---|
| Physics | Condensed matter physics | 13 |
| Physics | Optics | 10 |
| Physics | Quantum information and computing | 6 |
| Physics | Computational physics | 5 |
| Physics | Astrophysics | 2 |
| Physics | Particle physics | 1 |
| Mathematics | Numerical linear algebra | 8 |
| Mathematics | Computational mechanics | 5 |
| Mathematics | Computational finance | 1 |
| Chemistry | Quantum chemistry | 5 |
| Chemistry | Computational chemistry | 3 |
| Materials science | Semiconductor materials | 7 |
| Materials science | Molecular modeling | 6 |
| Biology | Ecology | 6 |
| Biology | Biochemistry | 1 |
| Biology | Genetics | 1 |
Main problems are split into between 2 and roughly 15 subproblems, ordered so that earlier steps can be reused as helper functions in later ones. Subproblems are written as Python function signatures with a docstring that describes the scientific task, inputs, and expected outputs. Test cases live in a numerical results file (test_data.h5) and many of them check agreement with closed form analytical solutions, published results, or independent reference implementations.[1][2][6]
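As a rough illustration of the format described above, a subproblem might look something like the following. The function name, docstring, and test values are hypothetical and are not taken from the dataset; the actual benchmark stores its reference outputs in test_data.h5 instead of inline.

```python
import numpy as np


def harmonic_energy_levels(n_levels, omega, hbar=1.0):
    """Return the first n_levels energies of a 1D quantum harmonic oscillator.

    Inputs:
        n_levels : int, number of energy levels to return
        omega    : float, angular frequency of the oscillator
        hbar     : float, reduced Planck constant (1.0 means natural units)
    Output:
        energies : 1D np.ndarray of shape (n_levels,)
    """
    n = np.arange(n_levels)
    return hbar * omega * (n + 0.5)


# Hypothetical inline test against the closed form E_n = hbar * omega * (n + 1/2).
expected = np.array([0.5, 1.5, 2.5])
result = harmonic_energy_levels(3, omega=1.0)
passed = np.allclose(result, expected)
```

The point of the sketch is the shape of the task: a fixed signature, a docstring that specifies inputs and outputs, and a numerical check against a known answer.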
Every subproblem combines four kinds of difficulty: knowledge recall (retrieving the relevant scientific facts), mathematical reasoning (deriving or rearranging the right equations), algorithm design (picking and adapting a numerical method), and code synthesis (writing a runnable Python function). The model is also required to remain consistent across subproblems, since later steps typically import the solutions of earlier ones.[1]
| Required skill | What the model has to do | Example |
|---|---|---|
| Knowledge recall | Pull domain specific facts from memory | Form factor of a Bragg grating; lattice constant of silicon |
| Mathematical reasoning | Derive or rearrange formulas | Going from the time independent Schrödinger equation to a tridiagonal matrix system |
| Algorithm design | Choose stable, efficient numerical methods | Picking a symplectic integrator for orbital mechanics |
| Code synthesis | Translate the chosen approach into Python | Implementing the SCF loop in a quantum chemistry calculation |
| Cross step consistency | Reuse outputs from earlier subproblems | Plugging an earlier Hamiltonian builder into a later eigensolver |
Reproducing methods like density functional theory, BCS superconductivity calculations, or Diels-Alder reaction modeling from scratch in Python requires both conceptual understanding and the bookkeeping discipline that real research demands.[1][2]
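To make the skills table concrete, here is a minimal sketch of the Schrödinger-to-tridiagonal example together with cross step reuse. It is not a SciCode problem: the function names, the particle-in-a-box setup, and the units (hbar = m = 1) are assumptions chosen for illustration. A second-order finite difference turns -1/2 ψ'' + Vψ = Eψ into a symmetric tridiagonal eigenproblem, and the result is checked against the known limit E_n = n²π²/(2L²).

```python
import numpy as np
from scipy.linalg import eigh_tridiagonal


def build_hamiltonian(n_points, length, potential):
    """Step 1: discretize -1/2 d^2/dx^2 + V(x) on interior grid points."""
    # Interior points only; the wavefunction is fixed to zero at both walls.
    x = np.linspace(0.0, length, n_points + 2)[1:-1]
    dx = x[1] - x[0]
    diagonal = 1.0 / dx**2 + potential(x)
    off_diagonal = np.full(n_points - 1, -0.5 / dx**2)
    return diagonal, off_diagonal


def lowest_energies(diagonal, off_diagonal, k):
    """Step 2: feed step 1's output into a tridiagonal eigensolver."""
    return eigh_tridiagonal(
        diagonal, off_diagonal,
        eigvals_only=True, select="i", select_range=(0, k - 1),
    )


# Validate against the analytic particle-in-a-box spectrum (V = 0, L = 1).
diag, off = build_hamiltonian(500, 1.0, lambda x: np.zeros_like(x))
energies = lowest_energies(diag, off, k=3)
exact = np.array([1, 2, 3]) ** 2 * np.pi**2 / 2.0
agrees = np.allclose(energies, exact, rtol=1e-4)
```

If step 1 returned a dense matrix instead of the two diagonals, step 2 would fail even though each function looks plausible in isolation, which is the kind of consistency the benchmark requires.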
SciCode runs in two main settings. In the standard setting, the model receives only the function signature, a brief docstring, and any code already produced for earlier subproblems. In the with background setting, the model also gets a scientist authored note that explains the relevant physics, equations, or algorithm. The two settings let researchers separate two skills: how much a model knows about a field versus how well it can implement an already explained method.[1]
| Setting | Inputs to model | What it measures |
|---|---|---|
| Standard (no background) | Function signature, docstring, prior subproblem solutions | Combined knowledge, reasoning, and implementation |
| With background | All of the above plus scientist annotated background | Implementation and instruction following, given correct knowledge |
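The difference between the two settings can be sketched as a small prompt assembly helper. This is an assumption-laden illustration, not the benchmark's actual prompt template: `build_prompt`, its section labels, and the example strings are all hypothetical.

```python
def build_prompt(signature_and_docstring, prior_solutions, background=None):
    """Assemble a prompt for one subproblem (hypothetical layout).

    prior_solutions : list of code strings from earlier subproblems
    background      : optional scientist-written note; None means the
                      standard (no background) setting
    """
    parts = []
    if prior_solutions:
        parts.append("# Code from earlier steps\n" + "\n\n".join(prior_solutions))
    if background is not None:
        parts.append("# Background\n" + background)
    parts.append("# Implement the following function\n" + signature_and_docstring)
    return "\n\n".join(parts)


step = 'def eigenvalues(h):\n    """Return sorted eigenvalues of h."""'
earlier = ["def build_h(n):\n    return [[0.0] * n for _ in range(n)]"]

standard_prompt = build_prompt(step, earlier)
background_prompt = build_prompt(
    step, earlier, background="h is real and symmetric, so its eigenvalues are real."
)
```

Everything except the background note is identical across the two prompts, so any score difference can be attributed to the added domain knowledge.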
Generated code is executed against the hidden numerical tests with np.allclose style comparisons. A subproblem counts as correct when all of its tests pass; a main problem counts as correct only when every subproblem in it passes, which is why main problem accuracy is far lower than subproblem accuracy. The original harness used a two step pipeline (gencode.py then test_generated_code.py); newer evaluations integrate with the Inspect AI framework maintained by the UK AI Security Institute (formerly the AI Safety Institute).[2][6]
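The all-or-nothing aggregation described above can be sketched in a few lines. The helper names and the toy results are hypothetical, and the tolerances shown are simply NumPy's defaults, not necessarily the ones the harness uses.

```python
import numpy as np


def subproblem_passes(outputs, references, rtol=1e-5, atol=1e-8):
    """A subproblem passes only if every test output matches its reference."""
    if len(outputs) != len(references):
        return False
    return all(
        np.allclose(out, ref, rtol=rtol, atol=atol)
        for out, ref in zip(outputs, references)
    )


def score(results):
    """results maps a main problem id to a list of per-subproblem booleans."""
    flat = [ok for subs in results.values() for ok in subs]
    subproblem_accuracy = sum(flat) / len(flat)
    # A main problem counts only when all of its subproblems pass.
    main_accuracy = sum(all(subs) for subs in results.values()) / len(results)
    return subproblem_accuracy, main_accuracy


# Toy example: 8 of 10 subproblems pass, but only 1 of 3 main problems does.
toy_results = {
    "A": [True, True, True],
    "B": [True, False, True, True],
    "C": [True, True, False],
}
sub_acc, main_acc = score(toy_results)
```

In this toy case subproblem accuracy is 80% while main problem accuracy is about 33%, because a single failed step sinks the whole problem.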
Quality control involved three rounds of validation: an in domain scientist reviewing each problem and its tests, an out of domain scientist checking clarity, and a GPT-4 pass used to flag ambiguous prompts. Dependencies are kept to widely used libraries such as NumPy, SciPy, and SymPy.[1][2]
The original paper reported pass@1 scores for ten models. Claude 3.5 Sonnet led the standard setting with 4.6% main problem accuracy, with GPT-4o, GPT-4 Turbo, Gemini 1.5 Pro, and Claude 3 Opus clustered near 1.5%. Subproblem scores were much higher, since many subproblems are isolated helper functions that do not require all earlier code to be perfect.[1]
| Model | Subproblem accuracy (no background) | Main problem accuracy (no background) |
|---|---|---|
| Claude 3.5 Sonnet | 26.0% | 4.6% |
| GPT-4o | 25.0% | 1.5% |
| GPT-4 Turbo | 22.9% | 1.5% |
| Gemini 1.5 Pro | 21.9% | 1.5% |
| Claude 3 Opus | 21.5% | 1.5% |
| DeepSeek Coder v2 | 21.2% | 3.1% |
| Claude 3 Sonnet | 17.0% | 1.5% |
| Qwen2 72B Instruct | 17.0% | 1.5% |
| Mixtral 8x22B Instruct | 16.3% | 0.0% |
| Llama 3 70B Instruct | 14.6% | 0.0% |
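The main problem percentages in the table are coarse because the denominator is small. Assuming they are computed over the 65 main problems of the test split, each solved problem is worth about 1.5 points, and the reported figures map back to whole counts as in this quick check.

```python
TEST_MAIN_PROBLEMS = 65  # size of the test split stated earlier in the article

# Main problem accuracy (no background) as reported in the table, in percent.
reported = {
    "Claude 3.5 Sonnet": 4.6,
    "DeepSeek Coder v2": 3.1,
    "GPT-4o": 1.5,
    "Llama 3 70B Instruct": 0.0,
}

# Convert each percentage back to a whole number of solved problems.
solved = {
    model: round(pct / 100 * TEST_MAIN_PROBLEMS) for model, pct in reported.items()
}

# Recompute the percentage from the count to confirm the round trip.
recomputed = {
    model: round(100 * n / TEST_MAIN_PROBLEMS, 1) for model, n in solved.items()
}
```

Under that assumption, 4.6% corresponds to 3 solved problems, 3.1% to 2, and 1.5% to 1, so the gap between the top models in the original paper amounts to one or two problems.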
Adding scientist annotated background boosted performance across models. Claude 3.5 Sonnet jumped to 35.4% on subproblems and 12.3% on main problems. Some later reports put OpenAI's o1 mini at around 13.8% pass@1 on main problems with background, the largest gain from background knowledge among the models compared, which suggests knowledge gaps, not pure reasoning, are a major bottleneck.[1][7]
The maintained leaderboard at HAL (Holistic Agent Leaderboard) hosted by Princeton tracks both raw accuracy and dollar cost per evaluation across newer models and agent scaffolds. As of mid 2025 the top entries include o4 mini and o3 variants, GPT-4.1, and Claude Opus 4.1 running under either zero shot or tool calling SciCode agents.[5]
| Agent | Model | Main problem accuracy | Estimated cost |
|---|---|---|---|
| SciCode zero shot agent | o4 mini Low | 9.23% | About $1.74 |
| SciCode tool calling agent | o3 Medium | 9.23% | About $111 |
| SciCode tool calling agent | Claude Opus 4.1 | 7.69% | About $625 |
| SciCode tool calling agent | Claude Opus 4.1 High | 6.92% | About $551 |
| SciCode zero shot agent | GPT-4.1 | 6.15% | About $2.82 |
| SciCode zero shot agent | o1 preview (Sept 2024) | 7.7% | Reported in paper |
The GitHub leaderboard also lists o3 mini variants between roughly 9% and 11% and reports DeepSeek R1 at 4.6% main problem accuracy, matching Claude 3.5 Sonnet from the original paper. None of these numbers approach the level at which a model could replace a scientist; SciCode remains far from saturated.[5][6]
Error analyses in the paper surface a recurring pattern: models often produce code that looks reasonable but uses a wrong sign convention, an outdated empirical formula, or an unstable algorithm. Cross subproblem consistency is another sticking point, where later functions expect shapes earlier ones do not return. Problems that combine two separate ideas, such as a finite difference scheme with a custom boundary condition, also push models past their reliability point.[1][7]
| Failure mode | Description | Example |
|---|---|---|
| Domain knowledge gap | Wrong constant, formula, or sign convention | Mixing CGS and SI units in an electrodynamics problem |
| Numerical instability | Choosing an unstable scheme for a stiff problem | Using forward Euler on a stiff chemical kinetics ODE |
| Cross step inconsistency | Mismatched input or output shapes between subproblems | Returning a list when a dense ndarray is expected |
| Unfinished implementation | Leaving a stub or placeholder | Returning zeros instead of computing the integral |
| Misreading the prompt | Solving a related but different problem | Computing a different transform than the one requested |
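The numerical instability row can be reproduced in a few lines. This is an illustrative sketch, not a case from the paper: the test equation y' = -k·y, the rate k = 1000, and the step size are assumptions chosen to make the effect obvious, and backward Euler is used here only as one simple implicit alternative.

```python
import math

k = 1000.0      # fast decay rate, which makes the problem stiff
dt = 0.01       # step size far above the forward Euler stability limit 2 / k
n_steps = 50
y0 = 1.0

# Forward (explicit) Euler: y_{n+1} = (1 - k*dt) * y_n, here a factor of -9.
y_forward = y0
for _ in range(n_steps):
    y_forward = y_forward + dt * (-k * y_forward)

# Backward (implicit) Euler: y_{n+1} = y_n / (1 + k*dt), here a factor of 1/11.
y_backward = y0
for _ in range(n_steps):
    y_backward = y_backward / (1.0 + k * dt)

# Exact solution at the final time, which is effectively zero.
y_exact = y0 * math.exp(-k * dt * n_steps)
```

Both loops look equally reasonable on the page, but the explicit update grows without bound while the implicit one decays toward the true solution, so only a numerical test exposes the difference.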
The public dataset spans a wide range of real research tasks. A few representative examples are listed below.[1][2]
| Subfield | Example task | Required ideas |
|---|---|---|
| Quantum chemistry | Implementing a Hartree-Fock self consistent field loop | Slater determinants, Roothaan equations, eigensolvers |
| Condensed matter | Computing band structures via tight binding | Bloch theorem, Hamiltonian construction, diagonalization |
| Optics | Modeling guided modes in a slab waveguide | Maxwell equations, transfer matrices, root finding |
| Ecology | Simulating predator prey dynamics with stochastic perturbations | Lotka-Volterra equations, Monte Carlo methods |
| Computational physics | Performing radiation transfer through an atmosphere | Two stream approximation, integration schemes |
| Numerical linear algebra | Implementing iterative eigensolvers | Lanczos or Arnoldi iteration, Krylov subspaces |
SciCode sits in a small but growing group of benchmarks for science focused code generation and reasoning, complementing better known coding evaluations.
| Benchmark | Focus | How it differs from SciCode |
|---|---|---|
| HumanEval | Short Python function synthesis | No scientific domain knowledge required |
| MBPP | Basic Python programming problems | Aimed at entry level coders, not researchers |
| SWE-bench | Real GitHub issue fixing | Software engineering, not numerical science |
| MATH | Competition mathematics | No code execution, pure math reasoning |
| GPQA | Graduate level multiple choice science questions | No coding component |
| MLE-bench | Machine learning engineering | Kaggle style ML competitions |
| LAB-Bench | Biology lab tasks for AI agents | Wet lab and protocol focus |
| ResearchBench | Open ended research tasks | Less structured evaluation |
SciCode's combination of executable tests, scientist authored gold solutions, and broad scientific scope makes it one of the few standardized ways to track progress on AI assistants for research computing.[1][3]
The paper and follow up commentary note several limitations. The benchmark is Python only, so it does not test Fortran, C++, or Julia. The dataset is fixed, which means scores can drift as public solutions seep into training corpora, though the held out test split helps. The 16 subfields are a sample of natural science computing, not a complete map; computational social science, neuroscience, and large scale climate simulation are out of scope. Evaluation also requires running scientific Python code with real dependencies, which adds engineering overhead.[1][2]
| Limitation | Why it matters |
|---|---|
| Python only | Underestimates gaps in Fortran or C++ scientific code |
| Static dataset | Public solutions can leak into training corpora |
| Limited subfield coverage | Excludes whole sectors of computational science |
| Numerical tolerances | Strict tests; small numerical bugs cause failures |
| Cost and compute | Long generations under reasoning models can be expensive |
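The numerical tolerances row is easy to demonstrate. In this sketch, which assumes NumPy's default np.allclose tolerances instead of the benchmark's actual settings, a correct trapezoid rule on too coarse a grid fails a check that the same code passes on a finer grid. The integrand and grid sizes are hypothetical.

```python
import numpy as np


def trapezoid(y, h):
    """Composite trapezoid rule on a uniform grid with spacing h."""
    return h * (0.5 * y[0] + y[1:-1].sum() + 0.5 * y[-1])


def integrate_sine(n_points):
    """Integrate sin(x) over [0, pi]; the exact answer is 2."""
    x = np.linspace(0.0, np.pi, n_points)
    return trapezoid(np.sin(x), x[1] - x[0])


reference = 2.0
coarse = integrate_sine(11)        # correct method, grid too coarse
fine = integrate_sine(100_001)     # same method, adequate resolution

coarse_ok = np.allclose(coarse, reference)
fine_ok = np.allclose(fine, reference)
```

The coarse result is off by less than 1%, yet it fails the comparison, which is how a nearly right implementation ends up scored as wrong.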
SciCode has become a standard reference in discussions of AI for science and the question of whether LLMs can do real research work. It is tracked by Artificial Analysis, the HAL leaderboard at Princeton, and Inspect AI evaluation suites, and frontier releases from Anthropic, OpenAI, Google DeepMind, and DeepSeek have included SciCode scores in technical reports or third party comparisons.[5][8]
Where state of the art models now solve the majority of HumanEval problems, the best SciCode main problem accuracy is still only around 10% well after release. That gap preserves a meaningful signal for future models to climb against, which is the job of a benchmark.[1][5]