ChemBench
Last reviewed: Jun 8, 2026
Sources: 5 citations
Review status: Source-backed
Revision: v1 · 1,352 words
ChemBench is an automated AI benchmark that measures the chemical knowledge, reasoning, and safety judgment of large language models and compares their performance against expert human chemists. It was introduced in the paper "Are large language models superhuman chemists?" by Adrian Mirza, Kevin Maik Jablonka, and colleagues, first released as a preprint in April 2024 and published in the journal Nature Chemistry in 2025. [1][2] It is one of the most frequently cited AI-for-science evaluations in chemistry. Its headline finding is that the strongest models can, on average, outperform the best human experts in the study while still failing some basic tasks and giving overconfident answers; this result has been widely discussed in debates about AI capability and AI safety. [1]
ChemBench is both a curated question corpus and a software framework for running and scoring chemistry evaluations automatically. The corpus contains more than 2,700 question-answer pairs spanning the major branches of chemistry, together with questions on chemical toxicity and safety and on chemical "intuition" or human preference. [1][2] The framework presents these questions to a model, parses the model's free-text completion to recover its answer, and scores the result, allowing many models to be benchmarked under identical conditions. A subset of the questions was also answered by human chemists through a custom web application, producing a human baseline for direct comparison. [1][3]
The project is developed by the Jablonka group (the "lamalab" laboratory associated with Kevin Maik Jablonka) and is released openly: the code, the dataset, and a public leaderboard are available online, and the dataset is distributed through Hugging Face. [3][4] The work involved roughly three dozen contributors across multiple institutions, including co-authors such as Nawaf Alampara, Martino Rios-Garcia, Mara Schilling-Wilhelmi, Santiago Miret, Michael Pieler, and Philippe Schwaller. [1]
By 2023 and 2024, large language models were increasingly being applied to chemistry tasks: answering technical questions, planning syntheses, interpreting spectra, and assisting with literature. Despite this rapid adoption, there was no rigorous, chemistry-specific way to measure what these models actually knew, how well they reasoned, and whether their judgments about hazardous substances could be trusted. The ChemBench authors report that when they looked for an existing benchmark suited to evaluating general-purpose chemical models, they found nothing adequate, which motivated building one from scratch. [3]
A second motivation was safety. Because chemistry involves toxic, explosive, and otherwise dangerous materials, a model that answers fluently but incorrectly, or that is confidently wrong about a substance's hazards, could be actively harmful if relied upon. ChemBench was therefore designed not only to test textbook knowledge but also to probe chemical safety and to measure how well a model's stated confidence tracks its actual accuracy. [1][5]
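One way to make "stated confidence tracks actual accuracy" concrete is a calibration score. The sketch below computes expected calibration error (ECE), a common general-purpose calibration metric. It is offered only as an illustration: the article does not say that ChemBench uses ECE, and the function name, the 0-to-1 confidence scale, and the bin count are assumptions made here.

```python
from typing import List, Sequence, Tuple


def expected_calibration_error(
    confidences: Sequence[float],
    correct: Sequence[bool],
    n_bins: int = 5,
) -> float:
    """Weighted mean gap between stated confidence and observed accuracy.

    Assumes confidences are probabilities in [0, 1]; this is an
    illustrative convention, not ChemBench's documented protocol.
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    n = len(confidences)
    if n == 0:
        raise ValueError("at least one answer is required")

    # Group answers into equal-width confidence bins.
    bins: List[List[Tuple[float, bool]]] = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in the last bin
        bins[idx].append((conf, ok))

    # Sum each bin's |mean confidence - accuracy|, weighted by bin size.
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / n) * abs(mean_conf - accuracy)
    return ece
```

A score of 0 means stated confidence matches accuracy in every bin. A model that claims 90% confidence but is right only a quarter of the time would score about 0.65, which is the kind of overconfidence that matters most for questions about hazardous substances.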
The published version of ChemBench comprises 2,788 question-answer pairs. Of these, 1,039 were manually written or curated from sources such as university examinations, chemistry olympiads, textbooks, and newly authored items, while 1,749 were generated semi-automatically from structured data. [2][5] By format, the corpus is dominated by multiple-choice items but also includes open-ended questions that require a specific numerical or short-text answer.
The questions are organized into topical categories covering the breadth of the discipline. The table below summarizes the approximate distribution reported for the dataset.
| Topic area | Approx. number of questions |
|---|---|
| Chemical preference / intuition | ~1,000 |
| Toxicity and safety | ~675 |
| Organic chemistry | ~429 |
| Physical chemistry | ~165 |
| Analytical chemistry | ~152 |
| General chemistry | ~149 |
| Inorganic chemistry | ~92 |
| Materials science | ~83 |
| Technical chemistry | ~40 |
Source: ChemBench dataset categories. [4][5]
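As a quick consistency check on the figures above, the following sketch adds up the two provenance groups and the approximate per-topic counts. The numbers are copied from this article; the variable names are chosen here for illustration. Because the table values are rounded, their sum should land near the reported total rather than exactly on it.

```python
# Counts as quoted in this article; the per-topic values are approximate.
MANUAL = 1_039
SEMI_AUTOMATIC = 1_749
REPORTED_TOTAL = 2_788

TOPIC_COUNTS = {
    "Chemical preference / intuition": 1_000,
    "Toxicity and safety": 675,
    "Organic chemistry": 429,
    "Physical chemistry": 165,
    "Analytical chemistry": 152,
    "General chemistry": 149,
    "Inorganic chemistry": 92,
    "Materials science": 83,
    "Technical chemistry": 40,
}

# The two provenance groups should reproduce the reported corpus size.
provenance_total = MANUAL + SEMI_AUTOMATIC

# The rounded table counts only approximate the reported corpus size.
table_total = sum(TOPIC_COUNTS.values())

# Share of each topic, relative to the table's own total.
shares = {topic: count / table_total for topic, count in TOPIC_COUNTS.items()}

for topic, share in shares.items():
    print(f"{topic:<34} {share:6.1%}")
print(f"provenance total: {provenance_total}, table total: {table_total}")
```

The provenance groups add up to 2,788 exactly, while the rounded table sums to 2,785. Preference/intuition and toxicity/safety items together account for about 60% of the table, so the corpus is weighted toward those two categories rather than toward the traditional subdisciplines.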
Because molecular structures are often stored as machine-readable strings (for example, SMILES notation), the authors built a web application that renders molecules visually so that human participants could answer the same questions presented to the models. [3] For automated grading, ChemBench extracts a model's answer using a multi-step regular-expression parser, falling back to an LLM-based extraction step when the regex approach fails; partial correctness is generally treated as incorrect. [5] To enable a lighter-weight human study and faster iteration, the authors also defined a smaller curated subset, referred to as ChemBench-Mini, containing 236 questions. [5]
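The regex-first, LLM-fallback grading described above can be sketched as follows. This is a minimal illustration for multiple-choice items only, not ChemBench's actual parser: the `[ANSWER]` tag convention, the patterns, and the names `extract_answer`, `normalize`, `score`, and `llm_fallback` are all assumptions introduced here.

```python
import re
from typing import Callable, Optional

# Hypothetical answer patterns, tried in order; real prompts may differ.
ANSWER_PATTERNS = [
    # Explicit tags, e.g. "[ANSWER]A, C[/ANSWER]".
    re.compile(r"\[ANSWER\]\s*(.*?)\s*\[/ANSWER\]", re.IGNORECASE | re.DOTALL),
    # Looser phrasing, e.g. "The answer is B." or "Answer: (B)".
    re.compile(r"(?i:answer)\s*(?:is|:)\s*\(?([A-Z](?:\s*,\s*[A-Z])*)\b"),
]


def normalize(raw: str) -> str:
    """Reduce a raw answer to sorted, comma-joined option letters."""
    tokens = re.split(r"[,\s]+", raw.strip().upper())
    letters = {t for t in tokens if len(t) == 1 and t.isalpha()}
    return ",".join(sorted(letters))


def extract_answer(
    completion: str,
    llm_fallback: Optional[Callable[[str], Optional[str]]] = None,
) -> Optional[str]:
    """Try each regex in turn; call the fallback only if all of them fail."""
    for pattern in ANSWER_PATTERNS:
        match = pattern.search(completion)
        if match:
            return normalize(match.group(1))
    if llm_fallback is not None:
        # The fallback stands in for a second model that reads the completion.
        raw = llm_fallback(completion)
        return normalize(raw) if raw else None
    return None


def score(predicted: Optional[str], target: str) -> bool:
    """All-or-nothing scoring: a partially correct selection is incorrect."""
    return predicted is not None and predicted == normalize(target)
```

For example, `extract_answer("The answer is B.")` returns `"B"`, and `score("A", "A,C")` returns `False` because selecting only one of two correct options counts as wrong. Keeping the fallback as an injected callable means the cheap, deterministic regex path handles most completions, and the extra model call is made only when needed.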
The human baseline was collected through the chembench.org web platform. Nineteen chemistry experts participated, a group composed mostly of advanced graduate researchers: roughly thirteen PhD students holding master's degrees, two researchers beyond the postdoctoral stage, and one bachelor's-level participant, among others. [5] Participants answered the ChemBench-Mini subset. Some volunteers were permitted to use tools such as web search and the ChemDraw structure editor, but all were prohibited from using language models, so the comparison measured model performance against human experts working with conventional resources rather than against unaided memory alone. [5]
This design lets ChemBench report not just an aggregate model-versus-human comparison but also a breakdown by topic, revealing where models match or exceed human experts and where they fall short.
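A per-topic breakdown of this kind amounts to grouping graded answers by topic and comparing accuracies. The sketch below shows one way to do that. The function names and the record format (a topic label paired with a correct/incorrect flag) are assumptions for illustration, and the sample records are invented rather than taken from ChemBench results.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

Record = Tuple[str, bool]  # (topic, answered correctly?)


def accuracy_by_topic(results: Iterable[Record]) -> Dict[str, float]:
    """Fraction of correct answers within each topic."""
    totals: Dict[str, int] = defaultdict(int)
    hits: Dict[str, int] = defaultdict(int)
    for topic, correct in results:
        totals[topic] += 1
        hits[topic] += int(correct)
    return {topic: hits[topic] / totals[topic] for topic in totals}


def topic_gaps(model: Dict[str, float], human: Dict[str, float]) -> Dict[str, float]:
    """Model accuracy minus human accuracy, for topics both sides answered."""
    shared = sorted(model.keys() & human.keys())
    return {topic: model[topic] - human[topic] for topic in shared}


# Invented toy records, for illustration only.
model_results = [
    ("organic", True), ("organic", True),
    ("safety", True), ("safety", False),
]
human_results = [
    ("organic", True), ("organic", False),
    ("safety", True), ("safety", True),
]

gaps = topic_gaps(accuracy_by_topic(model_results), accuracy_by_topic(human_results))
for topic, gap in gaps.items():
    print(f"{topic}: {gap:+.2f}")
```

A positive gap marks a topic where the model outscored the human baseline and a negative gap marks one where it fell short. An aggregate score averages these gaps away.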
The central result is that the best models scored higher than the best human chemists in the study on average. In the original 2024 preprint, the leading models already outperformed the strongest human participants overall. [1] In the updated study published in Nature Chemistry, the reasoning-oriented model o1-preview was the top performer and exceeded the best human expert by close to a factor of two in overall accuracy on the evaluated subset. [5] Strong general-purpose models such as GPT-4 and Claude 3.5 Sonnet also performed well, with Claude 3.5 Sonnet leading on many individual domains while GPT-4 was comparatively stronger on chemical safety. [2][5]
The "superhuman" framing, however, is deliberately nuanced and should not be read as a claim that these models are reliably better chemists than people. The authors emphasize important limitations, notably that the same models still fail some basic tasks and give overconfident answers. [1][5]
The authors frame these results as a dual reality: language models already display remarkable proficiency on many chemical tasks, yet further research is needed before they can be safely relied upon, and confident-sounding answers about hazardous chemicals warrant particular caution. [1][5]
ChemBench helped establish a rigorous, reproducible standard for evaluating chemistry knowledge and reasoning in language models, filling a gap that earlier, more general science benchmarks did not address. Its public leaderboard and open dataset allow new models to be scored consistently over time, and the project has continued to track frontier systems as they are released. [3][4] By pairing knowledge questions with explicit tests of confidence calibration and chemical safety, it influenced how the AI-for-science community thinks about evaluation, shifting attention from raw accuracy toward reliability, calibration, and the risks of overconfident outputs.
The benchmark sits alongside other domain-specific science evaluations and is often cited in discussions of AI safety for the chemical sciences, where the concern is not only whether a model is accurate but whether its mistakes and its misplaced confidence could enable harm. Its findings are frequently invoked both by those highlighting the rapid progress of LLMs in technical domains and by those cautioning that aggregate "superhuman" scores can obscure serious, safety-relevant weaknesses. [1][2]