LAB-Bench

Updated Jun 8, 2026

LAB-Bench (the Language Agent Biology Benchmark) is an AI benchmark of more than 2,400 multiple-choice questions built to measure how well language models and AI agents can perform practical biology research tasks. Released in 2024 by FutureHouse, it spans eight broad categories, including literature search and comprehension, scientific figure and table interpretation, database access, protocol troubleshooting, and DNA and protein sequence manipulation. The benchmark was introduced in the paper "LAB-Bench: Measuring Capabilities of Language Models for Biology Research" by Jon M. Laurent, Samuel G. Rodriques, and colleagues at FutureHouse, and it has become a common reference point for evaluating biomedical research agents.^[1]^[2]

Overview

LAB-Bench is designed to test capabilities that working biologists rely on day to day, rather than the textbook-style factual recall emphasized by earlier scientific question-answering datasets. Its tasks require retrieving and reasoning over the primary research literature, interpreting figures and tables from papers, querying biological databases, troubleshooting laboratory protocols, and carrying out the sequence reasoning and molecular cloning steps common to wet-lab work.^[1]

The full benchmark contains 2,457 questions organized into eight categories that further decompose into roughly 30 narrower subtasks.^[1] Each question is multiple choice and includes an explicit option to decline to answer, which lets the benchmark separate how often a model attempts a question from how often it is correct. The dataset is released under a Creative Commons license on Hugging Face, with about 80 percent of the questions made public and a roughly 20 percent private test subset withheld to detect training-data contamination.^[2]^[3]
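The multiple-choice format with a decline option can be sketched as follows. This is a minimal illustration, not the benchmark's own harness: the function name `build_prompt`, the wording of the decline option, and the sample question are all assumptions made for the example.

```python
import random
import string

# Assumed wording for the decline option; the article only says one exists.
REFUSE_OPTION = "Insufficient information to answer the question"


def build_prompt(question, correct, distractors, seed=0):
    """Return (prompt_text, correct_letter, refuse_letter) for one item."""
    options = [correct, *distractors, REFUSE_OPTION]
    random.Random(seed).shuffle(options)  # fixed seed keeps the order reproducible
    letters = string.ascii_uppercase[: len(options)]
    lines = [question] + [f"({l}) {o}" for l, o in zip(letters, options)]
    correct_letter = letters[options.index(correct)]
    refuse_letter = letters[options.index(REFUSE_OPTION)]
    return "\n".join(lines), correct_letter, refuse_letter


# Hypothetical item, invented for illustration (not taken from LAB-Bench).
prompt, correct_letter, refuse_letter = build_prompt(
    question="Which enzyme joins DNA fragments during ligation?",
    correct="DNA ligase",
    distractors=["DNA polymerase", "Helicase", "Primase"],
)
print(prompt)
```

Tracking the decline option's letter separately from the correct letter is what allows a scorer to tell an abstention apart from a wrong answer.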

Motivation: AI for biology research

FutureHouse positions LAB-Bench as a measurement tool for its broader AI-for-science mission of building autonomous research systems for biology.^[1] The authors argue that the bottleneck for AI assistance in biology is not knowledge of facts but the ability to execute the practical, multi-step reasoning that research requires: finding a result buried in a supplemental table, reading a complicated figure, navigating a sequence database, or diagnosing why a cloning experiment failed.

The benchmark is therefore framed as a step toward autonomous research agents. The authors note that an AI system able to score consistently well on the harder LAB-Bench tasks, especially literature search and molecular cloning, would already function as a useful assistant for researchers.^[1] By measuring these capabilities directly, LAB-Bench is intended to track progress toward systems that can take on parts of the scientific workflow without constant human supervision.

Structure: task categories

LAB-Bench groups its questions into eight categories. The table below lists each category, what it tests, and the number of questions in the full dataset, along with the share of questions that the benchmark's human expert annotators chose to answer (their "coverage") as a measure of difficulty.^[1]

Category	What it tests	Questions	Human coverage
SeqQA	DNA and protein sequence manipulation and molecular biology workflows (15 subtasks)	750	64%
DbQA	Retrieving information from common biological databases	650	35%
TableQA	Interpreting data in scientific tables, beyond simple lookup	305	82%
LitQA2	Retrieving findings from the primary literature, not just abstracts	248	100%
FigQA	Reasoning about scientific figures, often requiring multi-hop reasoning	226	100%
ProtocolQA	Troubleshooting modified laboratory protocols to identify needed fixes	135	100%
SuppQA	Finding information available only in supplemental materials	102	100%
CloningScenarios	Multi-step molecular cloning problems	41	100%
Total		2,457	69%

LitQA2 is a successor to an earlier LitQA literature task from FutureHouse and focuses on findings that require reasoning over the full text of papers rather than information available in abstracts.^[1] The CloningScenarios category is described as "human-hard": each item is a complex, multi-step molecular cloning problem expected to take a trained molecular biologist more than ten minutes, and in some cases hours, to answer completely.^[1]^[4] The low human coverage on DbQA (35 percent) reflects that many database-retrieval questions are difficult even for experts to answer confidently without tool access.^[1]
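The figures in the category table can be cross-checked with a few lines of arithmetic. The sketch below copies the per-category values from the table and assumes, as a guess about how the Total row was derived, that overall human coverage is a question-weighted mean of the category values.

```python
# (category, questions, human coverage in percent), copied from the table above.
CATEGORIES = [
    ("SeqQA", 750, 64),
    ("DbQA", 650, 35),
    ("TableQA", 305, 82),
    ("LitQA2", 248, 100),
    ("FigQA", 226, 100),
    ("ProtocolQA", 135, 100),
    ("SuppQA", 102, 100),
    ("CloningScenarios", 41, 100),
]

total_questions = sum(n for _, n, _ in CATEGORIES)

# Assumption: the total coverage is a question-weighted mean of the category values.
weighted_coverage = sum(n * pct for _, n, pct in CATEGORIES) / total_questions

print(total_questions)              # 2457
print(round(weighted_coverage, 1))  # 69.6
```

Under that assumption the weighted mean comes to about 69.6 percent, which agrees with the 69 percent in the Total row once rounding of the per-category percentages is taken into account.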

Evaluation and human comparison

LAB-Bench reports three metrics. Accuracy is the fraction of all questions answered correctly (correct divided by the total). Precision is the fraction of attempted questions answered correctly (correct divided by attempted), which rewards models that abstain rather than guess. Coverage is the share of questions a respondent chose to answer.^[1]

The abstention mechanism is central to the design. Models are given an explicit option to decline a question for lack of information, and human annotators could mark a question "unsure" while optionally providing a best guess; coverage and precision are computed from the answers respondents were confident enough to commit to.^[1] This precision-and-coverage framing is meant to reflect real research use, where a confidently wrong answer can be worse than no answer at all.
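The three metrics can be computed from a list of responses in which an abstention is recorded separately from an answer. The sketch below is illustrative only: the `Scores` and `score` names, the use of `None` for a declined question, and the convention of reporting precision as 0.0 when nothing was attempted are assumptions, not details taken from the paper.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class Scores:
    accuracy: float   # correct / total
    precision: float  # correct / attempted
    coverage: float   # attempted / total


def score(answers: Sequence[Optional[str]], key: Sequence[str]) -> Scores:
    """Score responses; None marks a question the respondent declined."""
    if len(answers) != len(key):
        raise ValueError("answers and key must have the same length")
    total = len(key)
    attempted = sum(a is not None for a in answers)
    correct = sum(a == k for a, k in zip(answers, key))  # None never matches a key letter
    return Scores(
        accuracy=correct / total if total else 0.0,
        precision=correct / attempted if attempted else 0.0,  # assumed convention
        coverage=attempted / total if total else 0.0,
    )


# Hypothetical responses to five questions: three correct, one wrong, one declined.
key = ["A", "C", "B", "D", "A"]
answers = ["A", "C", None, "D", "B"]
result = score(answers, key)
print(result)  # accuracy 0.6, precision 0.75, coverage 0.8
```

With these definitions, accuracy equals precision multiplied by coverage, so a respondent that declines uncertain questions gives up coverage but can raise precision.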

To establish a human baseline, FutureHouse had expert biology researchers answer questions across the categories, allowing direct comparison between frontier models and human experts on the same items.^[1] The models evaluated in the original paper included Claude 3.5 Sonnet, Claude 3 Haiku, GPT-4o, GPT-4 Turbo, Gemini 1.5 Pro, and Llama 3 70B, tested on the multiple-choice questions without external tools.^[1]

Results

In the original evaluation, human experts substantially outperformed the language models on most categories in both accuracy and precision.^[1] The clearest exception was Claude 3.5 Sonnet, which performed far better than the other models on figure and table interpretation. On TableQA it narrowly exceeded human precision (about 0.90 versus about 0.87) and roughly matched human accuracy, the only category in which a model surpassed the human baseline.^[1]

Other findings illustrate where models fell short:

On FigQA, every model except Claude 3.5 Sonnet performed near chance, reflecting the multi-hop visual reasoning the task demands; Claude 3.5 Sonnet reached roughly 0.54 precision against about 0.82 for humans.^[1]
On LitQA2, the models clustered together and scored well above random, exceeding 40 percent accuracy, but still trailed human experts, who reached roughly 0.76 precision compared with about 0.47 for Claude 3.5 Sonnet.^[1]
On the human-hard CloningScenarios, models scored poorly, with Claude 3.5 Sonnet and GPT-4o around 0.28 accuracy against a human baseline near 0.60.^[1]
On SeqQA, models reached roughly 40 to 50 percent precision and approached or occasionally matched humans on a few narrow subtasks, while humans retained a clear edge on tasks requiring long-sequence processing.^[1]

The LAB-Bench paper deliberately evaluated only standalone language models, without tools, and did not benchmark tool-augmented or agentic systems. The authors stated that they had not explored the capabilities of agents and left those comparisons "to the community at large."^[1] FutureHouse's literature-search agent PaperQA and its Aviary agent environments are part of the same research program, and the LitQA tasks trace to that line of work, but the original LAB-Bench results report standalone-model and human scores rather than agent scores.^[1]

Significance

LAB-Bench is one of the first benchmarks to target the procedural, tool-relevant skills of biology research rather than recall of biological facts, and its category structure has made it a standard yardstick for biomedical research agents.^[1] It has been incorporated into third-party evaluation suites, including the Inspect Evals collection maintained alongside the UK AI Safety Institute, which makes the dataset's tasks runnable as a reproducible evaluation.^[5]

The benchmark also fits into a fast-moving area. Subsequent work has reported that newer frontier models, sometimes paired with retrieval or tool use, can match or exceed human experts on portions of LAB-Bench and related biology evaluations, which has prompted discussion of both the scientific promise and the biosecurity implications of capable biology research agents.^[6] The private test subset and the precision-and-coverage scoring are intended to keep the benchmark meaningful as models improve, by guarding against contamination and by distinguishing genuine capability from confident guessing.^[2]^[1]

LAB-Bench sits alongside other scientific and biomedical evaluations and is most closely associated with FutureHouse's goal of automating parts of the scientific process. By providing a concrete, multi-category measure of research-relevant capability, it gives developers and researchers a way to judge how close AI agents are to serving as practical assistants in the laboratory.^[1]

References

1. Laurent, J. M., Janizek, J. D., Ruzo, M., Hinks, M. M., Hammerling, M. J., Narayanan, S., Ponnapati, M., White, A. D., and Rodriques, S. G. "LAB-Bench: Measuring Capabilities of Language Models for Biology Research." arXiv:2407.10362, July 2024. https://arxiv.org/abs/2407.10362
2. "futurehouse/lab-bench." Hugging Face Datasets. https://huggingface.co/datasets/futurehouse/lab-bench
3. "Future-House/LAB-Bench: Evaluation dataset for AI systems intended to benchmark capabilities foundational to scientific research in biology." GitHub. https://github.com/Future-House/LAB-Bench
4. "The New Biology Benchmark Dataset LAB-Bench Is Now Open Source! It Covers 8 Major Tasks and Contains Over 2.4K Multiple-Choice Questions." HyperAI. https://hyper.ai/en/news/33112
5. "LAB-Bench: Measuring Capabilities of Language Models for Biology Research." Inspect Evals, UK AI Safety Institute. https://ukgovernmentbeis.github.io/inspect_evals/evals/safeguards/lab_bench/
6. "LLMs Outperform Experts on Challenging Biology Benchmarks." arXiv:2505.06108, 2025. https://arxiv.org/abs/2505.06108
