ChemBench

Updated Jun 8, 2026

ChemBench is an automated AI benchmark that measures the chemical knowledge, reasoning, and safety judgment of large language models and compares their performance against that of expert human chemists. Introduced in the paper "Are large language models superhuman chemists?" by Adrian Mirza, Kevin Maik Jablonka, and colleagues, ChemBench was first released as a preprint in April 2024 and published in the journal Nature Chemistry in 2025. ^[1]^[2] It is one of the most frequently cited AI-for-science evaluations in chemistry. Its headline finding is that the strongest models can, on average, outperform the best human experts in the study while still failing some basic tasks and giving overconfident answers, a result that has been widely discussed in debates about AI capability and AI safety. ^[1]

Overview

ChemBench is both a curated question corpus and a software framework for running and scoring chemistry evaluations automatically. The corpus contains more than 2,700 question-answer pairs spanning the major branches of chemistry, together with questions on chemical toxicity and safety and on chemical "intuition" or human preference. ^[1]^[2] The framework presents these questions to a model, parses the model's free-text completion to recover its answer, and scores the result, allowing many models to be benchmarked under identical conditions. A subset of the questions was also answered by human chemists through a custom web application, producing a human baseline for direct comparison. ^[1]^[3]

The project is developed by the Jablonka group (the "lamalab" laboratory associated with Kevin Maik Jablonka) and is released openly: the code, the dataset, and a public leaderboard are available online, and the dataset is distributed through Hugging Face. ^[3]^[4] The work involved roughly three dozen contributors across multiple institutions, including co-authors such as Nawaf Alampara, Martino Rios-Garcia, Mara Schilling-Wilhelmi, Santiago Miret, Michael Pieler, and Philippe Schwaller. ^[1]

Motivation

By 2023–2024, large language models were increasingly being applied to chemistry tasks: answering technical questions, planning syntheses, interpreting spectra, and assisting with literature review. Despite this rapid adoption, there was no rigorous, chemistry-specific way to measure what these models actually knew, how well they reasoned, and whether their judgments about hazardous substances could be trusted. The ChemBench authors report that when they looked for an existing benchmark suited to evaluating general-purpose chemical models, they found nothing adequate, which motivated building one from scratch. ^[3]

A second motivation was safety. Because chemistry involves toxic, explosive, and otherwise dangerous materials, a model that answers fluently but incorrectly, or that is confidently wrong about a substance's hazards, could be actively harmful if relied upon. ChemBench was therefore designed not only to test textbook knowledge but also to probe chemical safety and to measure how well a model's stated confidence tracks its actual accuracy. ^[1]^[5]

What ChemBench contains

The published version of ChemBench comprises 2,788 question-answer pairs. Of these, 1,039 were manually written or curated, drawing on university examinations, chemistry olympiads, and textbooks as well as newly authored items, while the remaining 1,749 were generated semi-automatically from structured data. ^[2]^[5] By format, the corpus is dominated by multiple-choice items but also includes open-ended questions that require a specific numerical or short-text answer.

The questions are organized into topical categories covering the breadth of the discipline. The table below summarizes the approximate distribution reported for the dataset.

Topic area	Approx. number of questions
Chemical preference / intuition	~1,000
Toxicity and safety	~675
Organic chemistry	~429
Physical chemistry	~165
Analytical chemistry	~152
General chemistry	~149
Inorganic chemistry	~92
Materials science	~83
Technical chemistry	~40

Source: ChemBench dataset categories. ^[4]^[5]

Because molecular structures are often stored as machine-readable strings (for example, SMILES notation), the authors built a web application that renders molecules visually so that human participants could answer the same questions presented to the models. ^[3] For automated grading, ChemBench extracts a model's answer using a multi-step regular-expression parser, falling back to an LLM-based extraction step when the regex approach fails; partial correctness is generally treated as incorrect. ^[5] To enable a lighter-weight human study and faster iteration, the authors also defined a smaller curated subset, referred to as ChemBench-Mini, containing 236 questions. ^[5]
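The regex-first, fallback-second extraction and the all-or-nothing scoring described above can be illustrated with a minimal sketch. This is not ChemBench's actual parser: the `[ANSWER]...[/ANSWER]` tag format, the function names, and the `llm_fallback` callable are assumptions introduced here for illustration only.

```python
import re
from typing import Callable, Optional

# Assumed answer template for this sketch: the model wraps its final answer
# in [ANSWER]...[/ANSWER] tags. The real prompt format may differ.
TAGGED = re.compile(r"\[ANSWER\](.*?)\[/ANSWER\]", re.DOTALL | re.IGNORECASE)
# Single capital letters standing alone, e.g. "A" and "C" in "A, C".
LETTERS = re.compile(r"\b([A-Z])\b")


def extract_choices(
    completion: str,
    llm_fallback: Optional[Callable[[str], Optional[str]]] = None,
) -> Optional[frozenset]:
    """Return the set of option letters chosen, or None if nothing parses."""
    match = TAGGED.search(completion)
    text = match.group(1) if match else None
    # Step 2 (hypothetical): hand the raw completion to an LLM-based extractor
    # only when the regex step found no tagged answer.
    if text is None and llm_fallback is not None:
        text = llm_fallback(completion)
    if text is None:
        return None
    letters = LETTERS.findall(text.upper())
    return frozenset(letters) if letters else None


def score(completion: str, correct: set, llm_fallback=None) -> bool:
    """All-or-nothing scoring: a partially correct selection is incorrect."""
    predicted = extract_choices(completion, llm_fallback)
    return predicted is not None and predicted == frozenset(correct)
```

In this sketch a completion that selects only "A" when the key is {"A", "C"} scores as incorrect, mirroring the rule that partial correctness earns no credit. The letter regex is deliberately naive and would misread prose such as "a mixture" as option A, which is one reason a real parser needs several steps.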

Comparison to human chemists

The human baseline was collected through the chembench.org web platform. Nineteen chemistry experts participated, a group composed mostly of advanced graduate researchers: roughly thirteen PhD students holding master's degrees, two researchers beyond the postdoctoral stage, and one bachelor's-level participant, among others. ^[5] Participants answered the ChemBench-Mini subset. All were prohibited from using language models, and some were permitted to use tools such as web search and the ChemDraw structure editor, so the comparison partly measured model performance against human experts working with conventional resources rather than against unaided memory alone. ^[5]

This design lets ChemBench report not just an aggregate model-versus-human comparison but also a breakdown by topic, revealing where models match or exceed human experts and where they fall short.
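A per-topic breakdown of this kind amounts to grouping graded answers by topic and comparing accuracies. The sketch below shows one way to compute it. The record layout, function names, and toy records are placeholders invented for illustration and are not results from the study.

```python
from collections import defaultdict


def accuracy_by_topic(records):
    """records: iterable of (topic, is_correct) pairs -> {topic: accuracy}."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for topic, is_correct in records:
        totals[topic] += 1
        hits[topic] += int(is_correct)
    return {topic: hits[topic] / totals[topic] for topic in totals}


def topic_gaps(model_records, human_records):
    """Model accuracy minus human accuracy, for topics both sides answered."""
    model = accuracy_by_topic(model_records)
    human = accuracy_by_topic(human_records)
    return {t: model[t] - human[t] for t in model.keys() & human.keys()}


# Toy placeholder records (NOT real ChemBench results).
model_records = [("organic", True), ("organic", True),
                 ("safety", False), ("safety", True)]
human_records = [("organic", True), ("organic", False),
                 ("safety", True), ("safety", True)]

gaps = topic_gaps(model_records, human_records)
# A positive gap means the model led on that topic; a negative gap means
# the humans led.
```

With these toy records the model leads on "organic" and trails on "safety", which is the kind of contrast an aggregate score alone would hide.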

Findings, limitations, and safety

The central result is that the best models scored higher than the best human chemists in the study on average. In the original 2024 preprint, the leading models already outperformed the strongest human participants overall. ^[1] In the updated study published in Nature Chemistry, the reasoning-oriented model o1-preview was the top performer and exceeded the best human expert by close to a factor of two in overall accuracy on the evaluated subset. ^[5] Strong general-purpose models such as GPT-4 and Claude 3.5 Sonnet also performed well, with Claude 3.5 Sonnet leading on many individual domains while GPT-4 was comparatively stronger on chemical safety. ^[2]^[5]

The "superhuman" framing, however, is deliberately nuanced and should not be read as a claim that these models are reliably better chemists than people. The authors emphasize several important limitations: ^[1]^[5]

Models stumbled on tasks that are easy for human experts, particularly questions requiring multi-step, structured chemical reasoning rather than recall, indicating that high aggregate scores can mask brittle reasoning.
Models were systematically overconfident. Their stated confidence did not track correctness; in one illustrative case a model reported lower confidence on a safety question it answered correctly than on several it answered incorrectly, a calibration failure that is especially concerning in a safety setting. ^[5]
Performance on toxicity and safety questions was notably weaker than on other chemistry topics, meaning the models were least dependable precisely where errors carry the greatest real-world risk. ^[1]^[5]
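One simple way to check whether stated confidence tracks correctness is to group answers by the confidence the model reported and compare accuracy across groups. The sketch below assumes an ordinal confidence scale (here 1 to 5) and uses invented toy records. It illustrates the idea only and is not the authors' calibration analysis.

```python
from collections import defaultdict


def accuracy_by_confidence(results):
    """results: iterable of (stated_confidence, is_correct) -> {level: accuracy}."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for confidence, is_correct in results:
        totals[confidence] += 1
        hits[confidence] += int(is_correct)
    return {level: hits[level] / totals[level] for level in sorted(totals)}


def confidence_tracks_accuracy(acc_by_conf):
    """True if accuracy never drops as stated confidence rises."""
    values = [acc for _, acc in sorted(acc_by_conf.items())]
    return all(a <= b for a, b in zip(values, values[1:]))


# Toy placeholder records (NOT real ChemBench results): the model is right
# at low confidence and mostly wrong at its highest confidence.
toy = [(1, True), (3, True), (3, False), (5, False), (5, False), (5, True)]

table = accuracy_by_confidence(toy)
well_ordered = confidence_tracks_accuracy(table)
```

A well-calibrated model would show accuracy rising with confidence. In the toy data accuracy falls as confidence rises, the same pattern as the illustrative safety-question case described above.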

The authors frame these results as a dual reality: language models already display remarkable proficiency on many chemical tasks, yet further research is needed before they can be safely relied upon, and confident-sounding answers about hazardous chemicals warrant particular caution. ^[1]^[5]

Significance

ChemBench helped establish a rigorous, reproducible standard for evaluating chemistry knowledge and reasoning in language models, filling a gap that earlier, more general science benchmarks did not address. Its public leaderboard and open dataset allow new models to be scored consistently over time, and the project has continued to track frontier systems as they are released. ^[3]^[4] By pairing knowledge questions with explicit tests of confidence calibration and chemical safety, it influenced how the AI-for-science community thinks about evaluation, shifting attention from raw accuracy toward reliability, calibration, and the risks of overconfident outputs.

The benchmark sits alongside other domain-specific science evaluations and is often cited in discussions of AI safety for the chemical sciences, where the concern is not only whether a model is accurate but whether its mistakes and its misplaced confidence could enable harm. Its findings are frequently invoked both by those highlighting the rapid progress of LLMs in technical domains and by those cautioning that aggregate "superhuman" scores can obscure serious, safety-relevant weaknesses. ^[1]^[2]

References

1. Mirza, A., Alampara, N., Kunchapu, S., et al. "Are large language models superhuman chemists?" arXiv:2404.01475, April 2024. https://arxiv.org/abs/2404.01475
2. Mirza, A., et al. "Are large language models superhuman chemists?" Nature Chemistry, 2025. https://www.nature.com/articles/s41557-025-01865-1
3. "Behind the ChemBench Paper: Measuring AI's Chemistry Capabilities." Springer Nature Research Communities, 2025. https://communities.springernature.com/posts/behind-the-chembench-paper-measuring-ai-s-chemistry-capabilities
4. "jablonkagroup/ChemBench." Hugging Face Datasets. https://huggingface.co/datasets/jablonkagroup/ChemBench
5. Heidenreich, H. "ChemBench: Evaluating LLM Chemistry Against Experts." https://hunterheidenreich.com/notes/chemistry/llm-applications/chembench-llm-chemistry-evaluation/