LAB-Bench
Last reviewed
Jun 8, 2026
Sources
6 citations
Review status
Source-backed
Revision
v1 · 1,496 words
LAB-Bench (the Language Agent Biology Benchmark) is an AI benchmark of more than 2,400 multiple-choice questions built to measure how well language models and AI agents can perform practical biology research tasks. Released in 2024 by FutureHouse, it spans eight broad categories, including literature search and comprehension, scientific figure and table interpretation, database access, protocol troubleshooting, and DNA and protein sequence manipulation. The benchmark was introduced in the paper "LAB-Bench: Measuring Capabilities of Language Models for Biology Research" by Jon M. Laurent, Samuel G. Rodriques, and colleagues at FutureHouse, and it has become a common reference point for evaluating biomedical research agents.[1][2]
LAB-Bench is designed to test capabilities that working biologists rely on day to day, rather than the textbook-style factual recall emphasized by earlier scientific question-answering datasets. Its tasks require retrieving and reasoning over the primary research literature, interpreting figures and tables from papers, querying biological databases, troubleshooting laboratory protocols, and carrying out the sequence reasoning and molecular cloning steps common to wet-lab work.[1]
The full benchmark contains 2,457 questions organized into eight categories that further decompose into roughly 30 narrower subtasks.[1] Each question is multiple choice and includes an explicit option to decline to answer, which lets the benchmark separate how often a model attempts a question from how often it is correct. The dataset is released under a Creative Commons license on Hugging Face, with about 80 percent of the questions made public and a roughly 20 percent private test subset withheld to detect training-data contamination.[2][3]
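As a rough illustration of this question format, the Python sketch below renders one multiple-choice item with a decline option appended to the answer choices. Everything in it is an assumption made for illustration: the `build_prompt` helper, the wording of the decline option, and the sample question are hypothetical and are not taken from the dataset or from FutureHouse's evaluation code.

```python
import random
from string import ascii_uppercase

# Assumed wording for the decline option; the benchmark's exact text may differ.
DECLINE_OPTION = "Insufficient information to answer the question"


def build_prompt(question, correct, distractors, seed=0):
    """Return (prompt_text, correct_letter, decline_letter) for one item.

    Hypothetical helper: shuffles the correct answer, the distractors, and
    the decline option, then labels them (A), (B), (C), ...
    """
    options = [correct, *distractors, DECLINE_OPTION]
    random.Random(seed).shuffle(options)  # seeded so the order is reproducible
    letters = ascii_uppercase[: len(options)]
    lines = [question] + [f"({l}) {o}" for l, o in zip(letters, options)]
    correct_letter = letters[options.index(correct)]
    decline_letter = letters[options.index(DECLINE_OPTION)]
    return "\n".join(lines), correct_letter, decline_letter


# Made-up example item, not drawn from LAB-Bench.
prompt, correct_letter, decline_letter = build_prompt(
    "Which enzyme joins DNA fragments during ligation?",
    correct="DNA ligase",
    distractors=["DNA polymerase", "Helicase", "Primase"],
)
print(prompt)
```

Tracking the decline letter separately from the correct letter is what lets a grader tell an abstention apart from a wrong answer.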
FutureHouse positions LAB-Bench as a measurement tool for its broader AI-for-science mission of building autonomous research systems for biology.[1] The authors argue that the bottleneck for AI assistance in biology is not knowledge of facts but the ability to execute the practical, multi-step reasoning that research requires: finding a result buried in a supplemental table, reading a complicated figure, navigating a sequence database, or diagnosing why a cloning experiment failed.
The benchmark is therefore framed as a step toward autonomous research agents. The authors note that an AI system able to score consistently well on the harder LAB-Bench tasks, especially literature search and molecular cloning, would already function as a useful assistant for researchers.[1] By measuring these capabilities directly, LAB-Bench is intended to track progress toward systems that can take on parts of the scientific workflow without constant human supervision.
LAB-Bench groups its questions into eight categories. The table below lists each category, what it tests, and its number of questions in the full dataset. The final column gives human coverage, the share of questions that the benchmark's expert annotators chose to answer, which serves as a rough indicator of difficulty.[1]
| Category | What it tests | Questions | Human coverage |
|---|---|---|---|
| SeqQA | DNA and protein sequence manipulation and molecular biology workflows (15 subtasks) | 750 | 64% |
| DbQA | Retrieving information from common biological databases | 650 | 35% |
| TableQA | Interpreting data in scientific tables, beyond simple lookup | 305 | 82% |
| LitQA2 | Retrieving findings from the primary literature, not just abstracts | 248 | 100% |
| FigQA | Reasoning about scientific figures, often requiring multi-hop reasoning | 226 | 100% |
| ProtocolQA | Troubleshooting modified laboratory protocols to identify needed fixes | 135 | 100% |
| SuppQA | Finding information available only in supplemental materials | 102 | 100% |
| CloningScenarios | Multi-step molecular cloning problems | 41 | 100% |
| Total | | 2,457 | 69% |
LitQA2 is a successor to an earlier LitQA literature task from FutureHouse and focuses on findings that require reasoning over the full text of papers rather than information available in abstracts.[1] The CloningScenarios category is described as "human-hard": each item is a complex, multi-step molecular cloning problem expected to take a trained molecular biologist more than ten minutes, and in some cases hours, to answer completely.[1][4] The low human coverage on DbQA (35 percent) reflects that many database-retrieval questions are difficult even for experts to answer confidently without tool access.[1]
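The category sizes in the table are uneven, which matters when reading any score pooled across the whole benchmark. The Python sketch below uses only the question counts listed in the table to check the stated total and compute each category's share of the dataset. The variable names are illustrative.

```python
# Question counts per category, copied from the table above.
CATEGORY_COUNTS = {
    "SeqQA": 750,
    "DbQA": 650,
    "TableQA": 305,
    "LitQA2": 248,
    "FigQA": 226,
    "ProtocolQA": 135,
    "SuppQA": 102,
    "CloningScenarios": 41,
}

total = sum(CATEGORY_COUNTS.values())
print(total)  # 2457, matching the total row of the table

# Share of the full benchmark contributed by each category.
shares = {name: count / total for name, count in CATEGORY_COUNTS.items()}
for name, share in shares.items():
    print(f"{name:<17} {share:6.1%}")

# Combined share of the two largest categories.
largest_two = shares["SeqQA"] + shares["DbQA"]
```

By this arithmetic, SeqQA and DbQA together account for roughly 57 percent of the questions, while CloningScenarios contributes under 2 percent. A score pooled over all questions would therefore be dominated by the two largest categories, which is one reason to read results category by category.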
LAB-Bench reports three metrics. Accuracy is the fraction of all questions answered correctly (correct divided by the total). Precision is the fraction of attempted questions answered correctly (correct divided by attempted), which rewards models that abstain rather than guess. Coverage is the share of questions a respondent chose to answer.[1]
The abstention mechanism is central to the design. Models are given an explicit option to decline a question for lack of information, and human annotators could mark a question "unsure" while optionally providing a best guess; coverage and precision are computed from the answers respondents were confident enough to commit to.[1] This precision-and-coverage framing is meant to reflect real research use, where a confidently wrong answer can be worse than no answer at all.
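The three metrics follow directly from counts of total, attempted, and correct answers. The Python sketch below computes them from the definitions given above. The `Response` class and `score` function are hypothetical names chosen for illustration, not the authors' evaluation code, and a declined question is represented here as `chosen=None`.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class Response:
    """One graded answer; `chosen` is None when the respondent declined."""
    chosen: Optional[str]
    correct: str


def score(responses: Sequence[Response]) -> dict:
    """Compute accuracy, precision, and coverage as defined in the text."""
    total = len(responses)
    attempted = sum(1 for r in responses if r.chosen is not None)
    n_correct = sum(1 for r in responses if r.chosen == r.correct)
    return {
        "accuracy": n_correct / total if total else 0.0,        # correct / total
        "precision": n_correct / attempted if attempted else 0.0,  # correct / attempted
        "coverage": attempted / total if total else 0.0,        # attempted / total
    }


# Four made-up responses: two correct, one wrong, one declined.
responses = [
    Response("A", "A"),
    Response("B", "C"),
    Response(None, "D"),
    Response("D", "D"),
]
metrics = score(responses)
print(metrics)
```

In this toy case accuracy is 2/4, precision is 2/3, and coverage is 3/4. Under these definitions accuracy equals precision multiplied by coverage, so a respondent that abstains on questions it would have gotten wrong raises its precision without changing its accuracy.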
To establish a human baseline, FutureHouse had expert biology researchers answer questions across the categories, allowing direct comparison between frontier models and human experts on the same items.[1] The models evaluated in the original paper included Claude 3.5 Sonnet, Claude 3 Haiku, GPT-4o, GPT-4 Turbo, Gemini 1.5 Pro, and Llama 3 70B, tested on the multiple-choice questions without external tools.[1]
In the original evaluation, human experts substantially outperformed the language models on most categories in both accuracy and precision.[1] The clearest exception was Claude 3.5 Sonnet, which performed far better than the other models on figure and table interpretation. On TableQA it narrowly exceeded human precision and roughly matched human accuracy, the only place where a model surpassed the human baseline.[1] Reported TableQA precision was about 0.90 for Claude 3.5 Sonnet versus about 0.87 for humans.[1]
Other findings illustrate where models fell short:
Notably, the LAB-Bench paper deliberately evaluated only base language models and did not benchmark tool-augmented or agentic systems. The authors stated that they had not explored the capabilities of agents and left those comparisons "to the community at large."[1] FutureHouse's literature-search agent PaperQA and its Aviary agent environments are part of the same research program, and the LitQA tasks trace to that line of work, but the original LAB-Bench results report base-model and human scores rather than agent scores.[1]
LAB-Bench is one of the first benchmarks to target the procedural, tool-relevant skills of biology research rather than recall of biological facts, and its category structure has made it a standard yardstick for biomedical research agents.[1] It has been incorporated into third-party evaluation suites, including the Inspect Evals collection maintained alongside the UK AI Safety Institute, which makes the dataset's tasks runnable as a reproducible evaluation.[5]
The benchmark also fits into a fast-moving area. Subsequent work has reported that newer frontier models, sometimes paired with retrieval or tool use, can match or exceed human experts on portions of LAB-Bench and related biology evaluations, which has prompted discussion of both the scientific promise and the biosecurity implications of capable biology research agents.[6] The private test subset and the precision-and-coverage scoring are intended to keep the benchmark meaningful as models improve, by guarding against contamination and by distinguishing genuine capability from confident guessing.[2][1]
LAB-Bench sits alongside other scientific and biomedical evaluations and is most closely associated with FutureHouse's goal of automating parts of the scientific process. By providing a concrete, multi-category measure of research-relevant capability, it gives developers and researchers a way to judge how close AI agents are to serving as practical assistants in the laboratory.[1]