| SciCode | |
|---|---|
| Overview | |
| Full name | Scientific Code Benchmark |
| Abbreviation | SciCode |
| Description | A research coding benchmark curated by scientists for realistic scientific problem-solving |
| Release date | 2024-07 |
| Latest version | 1.0 |
| Benchmark updated | 2025-01 |
| Authors | Minyang Tian, Luyu Gao, et al. |
| Organization | Princeton University, Carnegie Mellon University |
| Technical Details | |
| Type | Scientific Computing, Code Generation |
| Modality | Text (Code) |
| Task format | Code synthesis for scientific problems |
| Number of tasks | 80 main problems (338 subproblems) |
| Total examples | 338 |
| Evaluation metric | Success rate, Correctness |
| Domains | Physics, Math, Biology, Chemistry, Materials Science, Ecology |
| Languages | Python (primarily) |
| Performance | |
| Human performance | Not reported |
| Baseline | <1% (random) |
| SOTA score | 7.7% |
| SOTA model | OpenAI o1-preview |
| SOTA date | 2024-10 |
| Saturated | No |
| Resources | |
| Website | Official website |
| Paper | Paper |
| GitHub | Repository |
| Dataset | Download |
| License | MIT |
SciCode is a challenging artificial intelligence benchmark designed to evaluate large language models' capabilities in generating code for solving realistic scientific research problems. Released in July 2024 through collaboration between Princeton University, Carnegie Mellon University, and scientists from 16 diverse natural science sub-fields, SciCode represents a paradigm shift from exam-like questions to real research problems that scientists encounter in their everyday workflow.
SciCode addresses a critical gap in evaluating AI systems' ability to assist with scientific research by presenting problems drawn from actual scripts that scientists use in their research, many of which have been used in published papers. Unlike traditional coding benchmarks that focus on algorithmic challenges or software engineering tasks, SciCode requires deep integration of domain knowledge, mathematical reasoning, and computational implementation.
The development of SciCode was motivated by key observations about the shortcomings of existing coding benchmarks; accordingly, it specifically targets models' ability to transform scientific concepts and mathematical formulations into working computational code.
SciCode covers 6 main scientific domains with 16 specialized subdomains:
| Main Domain | Subdomains | Approx. Subproblem Count | Key Topics |
|---|---|---|---|
| Physics | Computational Physics, Optics, Condensed Matter | ~60 | Simulations, quantum mechanics, statistical physics |
| Mathematics | Numerical Linear Algebra, PDEs, Optimization | ~50 | Matrix computations, differential equations, algorithms |
| Chemistry | Quantum Chemistry, Computational Chemistry | ~50 | Molecular dynamics, electronic structure |
| Biology | Ecology, Bioinformatics, Systems Biology | ~50 | Population dynamics, sequence analysis |
| Materials Science | Semiconductor Materials, Crystallography | ~40 | Band structure, material properties |
| Earth Science | Geophysics, Climatology | ~38 | Climate models, seismic analysis |
Each main problem in SciCode is decomposed into multiple subproblems:
| Component | Description | Count |
|---|---|---|
| Main Problems | Complete scientific challenges | 80 |
| Subproblems | Decomposed implementation steps | 338 |
| Test Cases | Scientist-annotated validations | ~1000 |
| Gold Solutions | Reference implementations | 338 |
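The decomposition above can be illustrated with a hypothetical main problem split into subproblems that compose into a full solution. The problem and all function names here are invented for this sketch and do not come from the dataset:

```python
# Hypothetical illustration of SciCode's main-problem/subproblem structure:
# each subproblem is an independently testable function, and the main
# problem composes them. The task (trapezoidal integration) is invented.

def subproblem_1_grid(n_points, x_min, x_max):
    """Subproblem 1: build a uniform 1-D grid of n_points values."""
    step = (x_max - x_min) / (n_points - 1)
    return [x_min + i * step for i in range(n_points)]

def subproblem_2_integrand(x):
    """Subproblem 2: evaluate the integrand f(x) = x**2."""
    return x * x

def main_problem_integral(n_points=101, x_min=0.0, x_max=1.0):
    """Main problem: integrate f over [x_min, x_max] via the trapezoidal rule."""
    grid = subproblem_1_grid(n_points, x_min, x_max)
    h = grid[1] - grid[0]
    ys = [subproblem_2_integrand(x) for x in grid]
    return h * (ys[0] / 2 + sum(ys[1:-1]) + ys[-1] / 2)
```

In the real benchmark, each subproblem has its own scientist-annotated test cases, so partial credit at the subproblem level can be measured even when the composed main problem fails.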
Solving a SciCode problem typically draws on several capabilities:
| Characteristic | Description | Example |
|---|---|---|
| Knowledge Recall | Retrieving domain-specific facts | Physical constants, equations |
| Mathematical Reasoning | Deriving and manipulating formulas | Solving differential equations |
| Algorithm Design | Creating computational approaches | Numerical integration methods |
| Code Synthesis | Implementing solutions in code | Python implementations |
| Validation | Verifying correctness | Comparing with analytical solutions |
SciCode offers two primary evaluation configurations:
| Setting | Description | Background Knowledge | Difficulty |
|---|---|---|---|
| Standard | No additional context provided | None | Highest |
| With Background | Scientist-annotated context included | Domain-specific hints | Moderate |
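The practical difference between the two settings is what context reaches the model. A minimal sketch of prompt construction, where the field names (`background`, `description`, `signature`) are assumptions rather than the benchmark's actual schema:

```python
# Sketch of how the two evaluation settings could differ in prompt
# construction. Field names are illustrative assumptions.

def build_prompt(problem, setting="standard"):
    parts = [problem["description"]]
    if setting == "with_background":
        # Scientist-annotated context is prepended only in this setting.
        parts.insert(0, problem["background"])
    parts.append("Implement the function below:\n" + problem["signature"])
    return "\n\n".join(parts)

problem = {
    "background": "Recall that the trapezoidal rule has O(h^2) error.",
    "description": "Numerically integrate f on [a, b].",
    "signature": "def integrate(f, a, b, n): ...",
}
```

The "standard" setting forces the model to recall the relevant science itself, which is why it is the harder configuration.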
| Metric | Description | Calculation |
|---|---|---|
| Overall Success Rate | Percentage of correctly solved problems | (Solved problems / Total) × 100% |
| Domain Success Rate | Performance per scientific field | (Domain solved / Domain total) × 100% |
| Subproblem Accuracy | Correctness at subproblem level | (Correct subproblems / 338) × 100% |
| Test Case Pass Rate | Percentage of passing test cases | (Passed tests / Total tests) × 100% |
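The main-problem and subproblem metrics in the table can be computed from per-subproblem pass/fail records. A minimal sketch, assuming results are stored as one boolean per subproblem keyed by main-problem ID:

```python
# Minimal sketch of the success-rate metrics above. A main problem counts
# as solved only if every one of its subproblems passes; subproblem
# accuracy is pooled across all main problems. Data layout is assumed.

def scicode_metrics(results):
    """results: {main_id: [bool, ...]} -- one bool per subproblem."""
    total_main = len(results)
    solved_main = sum(all(subs) for subs in results.values())
    total_sub = sum(len(subs) for subs in results.values())
    correct_sub = sum(sum(subs) for subs in results.values())
    return {
        "main_success_rate": 100.0 * solved_main / total_main,
        "subproblem_accuracy": 100.0 * correct_sub / total_sub,
    }
```

This all-subproblems-must-pass definition explains why main-problem success rates sit far below subproblem accuracy for the same model.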
SciCode employs rigorous validation:
1. **Test Case Execution**: Running generated code against scientist-created test cases
2. **Numerical Verification**: Checking numerical accuracy within specified tolerances
3. **Output Format Validation**: Ensuring correct data structures and formats
4. **Performance Checks**: Verifying computational efficiency where relevant
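Numerical verification amounts to comparing model outputs against gold values within a tolerance rather than requiring exact equality. A sketch using the standard library (the specific tolerance values are assumptions, not the benchmark's):

```python
import math

# Tolerance-based output comparison, as used in numerical verification.
# The rel_tol/abs_tol defaults here are illustrative assumptions.

def outputs_match(predicted, gold, rel_tol=1e-6, abs_tol=1e-9):
    """True if two sequences of floats agree element-wise within tolerance."""
    if len(predicted) != len(gold):
        return False
    return all(
        math.isclose(p, g, rel_tol=rel_tol, abs_tol=abs_tol)
        for p, g in zip(predicted, gold)
    )
```

Tolerance-based comparison is essential for scientific code, where legitimate implementations can differ in the last few floating-point digits.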
| Rank | Model | Success Rate (Standard) | Success Rate (w/ Background) | Organization |
|---|---|---|---|---|
| 1 | OpenAI o1-preview | 7.7% | 15.2% | OpenAI |
| 2 | OpenAI o1-mini | 5.8% | 11.3% | OpenAI |
| 3 | Claude 3.5 Sonnet | 4.6% | 9.8% | Anthropic |
| 4 | GPT-4o | 3.9% | 8.7% | OpenAI |
| 5 | DeepSeek-R1 | ~3.5% | ~8.2% | DeepSeek |
| 6 | DeepSeek-V3 | ~3.2% | ~7.8% | DeepSeek |
| 7 | GPT-4 Turbo | 2.8% | 6.5% | OpenAI |
| 8 | Claude 3 Opus | 2.5% | 5.9% | Anthropic |
| Domain | Best Model Performance | Average Performance | Difficulty Rating |
|---|---|---|---|
| Mathematics | 12% | 5% | High |
| Physics | 8% | 3% | Very High |
| Chemistry | 6% | 2% | Very High |
| Biology | 7% | 3% | High |
| Materials Science | 5% | 2% | Very High |
| Earth Science | 4% | 1.5% | Very High |
| Category | Example Problem | Required Knowledge |
|---|---|---|
| Quantum Mechanics | Solving Schrödinger equation numerically | Wave functions, numerical methods |
| Molecular Dynamics | Simulating protein folding | Force fields, integration algorithms |
| Climate Modeling | Implementing radiative transfer | Atmospheric physics, numerical schemes |
| Population Dynamics | Predator-prey models | Differential equations, ecology |
| Crystal Structures | Computing band structures | Solid state physics, linear algebra |
| Signal Processing | Implementing FFT variants | Mathematics, algorithms |
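To make the flavor of these tasks concrete, the "Population Dynamics" category can be illustrated with a basic Lotka-Volterra predator-prey model integrated with forward Euler. This is a generic textbook sketch, not a problem from the dataset, and the parameter values are arbitrary:

```python
# Illustrative predator-prey (Lotka-Volterra) simulation with forward
# Euler time stepping. Parameters are arbitrary demonstration values.

def lotka_volterra(prey0, pred0, alpha, beta, delta, gamma, dt, steps):
    x, y = prey0, pred0
    trajectory = [(x, y)]
    for _ in range(steps):
        dx = alpha * x - beta * x * y   # prey growth minus predation
        dy = delta * x * y - gamma * y  # predator growth minus mortality
        x, y = x + dt * dx, y + dt * dy
        trajectory.append((x, y))
    return trajectory

traj = lotka_volterra(10.0, 5.0, 1.1, 0.4, 0.1, 0.4, dt=0.01, steps=1000)
```

An actual SciCode problem would decompose such a task into subproblems (derivative evaluation, time stepping, trajectory analysis), each with its own test cases.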
SciCode includes problems based on Nobel Prize-winning scientific methods, highlighting the benchmark's connection to groundbreaking research. To set up the benchmark locally:
```bash
git clone https://github.com/scicode-bench/SciCode
cd SciCode
pip install -r requirements.txt
python download_data.py
```
```python
from scicode import SciCodeBench

benchmark = SciCodeBench()
results_standard = benchmark.evaluate(
    model='gpt-4', setting='standard'
)
results_background = benchmark.evaluate(
    model='gpt-4', setting='with_background'
)
```
```python
from inspect_ai import eval
from scicode.inspect import scicode_suite

results = eval(
    scicode_suite(), model="openai/gpt-4"
)
```

```python
from opencompass.benchmarks import SciCode

evaluator = SciCode()
score = evaluator.eval(model_output)
```
| Research Stage | SciCode Component | Skills Tested |
|---|---|---|
| Literature Review | Background understanding | Knowledge recall |
| Theory Development | Mathematical formulation | Reasoning |
| Method Design | Algorithm selection | Problem-solving |
| Implementation | Code writing | Programming |
| Validation | Testing and verification | Debugging |
SciCode problems reflect actual scientific computing tasks:
| Application | Purpose | Value |
|---|---|---|
| AI for Science | Evaluating scientific AI assistants | Progress tracking |
| Model Development | Identifying capability gaps | Targeted improvement |
| Education | Assessing teaching assistants | Educational tools |
| Collaboration | Human-AI scientific partnerships | Integration planning |
| Limitation | Description | Impact |
|---|---|---|
| Low Success Rates | Best models solve <8% | Limited practical utility |
| Python Focus | Primarily Python implementations | Language diversity |
| Static Dataset | Fixed problem set | Potential overfitting |
| Domain Coverage | Limited to 16 subfields | Scope constraints |
| Evaluation Cost | Computationally intensive | Resource requirements |
1. **Expanded Coverage**: More scientific domains and subfields
2. **Multi-language Support**: Beyond Python implementations
3. **Interactive Problems**: Multi-step research workflows
4. **Collaborative Tasks**: Team science scenarios
5. **Dynamic Updates**: Continuously adding new problems
6. **Human Baselines**: Expert scientist performance metrics
SciCode represents a crucial step toward AI systems capable of meaningful scientific assistance. Its extremely low success rates even for state-of-the-art models highlight the significant gap between current AI capabilities and the needs of scientific research. The benchmark's focus on real research problems provides: