PubMedQA is a biomedical question answering dataset and benchmark designed to evaluate the ability of machine learning models to answer research questions using evidence from PubMed abstracts. Introduced by Qiao Jin, Bhuwan Dhingra, Zhengping Liu, William W. Cohen, and Xinghua Lu in 2019, PubMedQA was the first question answering dataset that specifically required reasoning over biomedical research texts, including quantitative experimental results. The dataset was presented at the 2019 Conference on Empirical Methods in Natural Language Processing (EMNLP-IJCNLP 2019) and has since become one of the most widely used benchmarks for evaluating large language models and natural language processing systems in the biomedical domain.
PubMedQA consists of three subsets totaling over 273,000 question-answer instances: 1,000 expert-annotated examples (PQA-L), 61,200 unlabeled examples (PQA-U), and 211,300 artificially generated examples (PQA-A). The core task requires models to read a research question and its associated PubMed abstract context, then classify the answer as "yes," "no," or "maybe." The benchmark is publicly available at pubmedqa.github.io and is distributed under the MIT license.
Before PubMedQA, existing question answering datasets in the biomedical domain focused primarily on factoid questions (asking for specific entities or facts) or list-type questions (asking for collections of items). Datasets like BioASQ had advanced the field of biomedical QA, but none specifically targeted the challenge of answering yes/no research questions that demand reasoning over scientific evidence.
The authors of PubMedQA observed that approximately 760,000 articles in PubMed use questions as their titles. Among those, roughly 120,000 have structured abstracts divided into labeled subsections such as "Background," "Methods," "Results," and "Conclusions." This structure creates a natural question-answer pair: the title poses a research question, the abstract body provides the evidence, and the conclusion provides the answer. This observation formed the foundation for constructing PubMedQA.
A central motivation behind PubMedQA was to create a benchmark that tests whether models can perform genuine reasoning over biomedical texts rather than simply matching keywords or relying on superficial patterns. Many biomedical research questions require understanding experimental designs, interpreting statistical comparisons between groups, and synthesizing multiple pieces of evidence from different parts of an abstract. For example, answering "Do preoperative statins reduce atrial fibrillation after coronary artery bypass grafting?" requires reading the study's results, understanding the comparison between statin-treated and control groups, and drawing a conclusion from the reported statistical findings.
PubMedQA draws its content from PubMed, the biomedical literature database maintained by the National Library of Medicine. The dataset specifically targets articles with question-format titles and structured abstracts. Each instance in PubMedQA contains four components: a question (the article's title), a context (the structured abstract with its conclusion removed), a long answer (the conclusion section), and, for the labeled subsets, a yes/no/maybe answer label.
The expert-annotated subset, PQA-L, contains 1,000 instances that were manually labeled, following a carefully designed annotation protocol, by qualified annotators (M.D. candidates) with biomedical domain expertise.
The final label distribution in PQA-L is:
| Answer Label | Percentage | Count (approx.) |
|---|---|---|
| Yes | 55.2% | 552 |
| No | 33.8% | 338 |
| Maybe | 11.0% | 110 |
The "yes" majority gives a majority-class baseline accuracy of 55.2%. The relatively small proportion of "maybe" answers reflects the fact that most published research reaches a definitive conclusion, while the "maybe" label captures cases where the evidence is mixed or the answer depends on specific conditions.
The PQA-L subset is split into 500 instances for training/development (evaluated using 10-fold cross-validation) and 500 instances reserved as the official test set for leaderboard evaluation.
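The 10-fold protocol over the 500 train/development instances can be sketched with plain index arithmetic (an illustrative sketch, not the official evaluation code):

```python
def ten_fold_splits(n=500, k=10):
    """Yield (train_indices, dev_indices) pairs for k-fold
    cross-validation over the PQA-L train/dev pool."""
    fold = n // k
    for i in range(k):
        dev = list(range(i * fold, (i + 1) * fold))
        train = [j for j in range(n) if j not in set(dev)]
        yield train, dev

# Ten folds of 50 held-out instances each, 450 for training per fold.
splits = list(ten_fold_splits())
```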
Per the annotation criteria specified in the paper, "yes" and "no" apply when the evidence in the abstract clearly supports or rejects the question's premise, while "maybe" applies when the evidence is mixed or the answer holds only under specific conditions.
The unlabeled subset contains 61,200 context-question pairs collected from PubMed articles that met the same structural criteria as PQA-L (question titles, structured abstracts). These instances include the question, context, and long answer, but lack expert-annotated yes/no/maybe labels.
PQA-U serves an important role in the multi-phase training pipeline described in the original paper. Models can leverage these unlabeled instances through semi-supervised learning techniques such as pseudo-labeling, where a model trained on PQA-L generates predicted labels for PQA-U instances and then trains on its own confident predictions.
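In sketch form, confidence-thresholded pseudo-labeling looks like the following (a generic illustration, not the paper's exact multi-phase pipeline; the threshold value is an assumption):

```python
LABELS = ("yes", "no", "maybe")

def pseudo_label(probabilities, threshold=0.9):
    """Keep only the unlabeled examples the model is confident about.

    `probabilities` maps example IDs to probability triples over
    (yes, no, maybe); returns {example_id: pseudo_label}.
    """
    selected = {}
    for example_id, probs in probabilities.items():
        best = max(range(len(LABELS)), key=lambda i: probs[i])
        if probs[best] >= threshold:
            selected[example_id] = LABELS[best]
    return selected

# Hypothetical model outputs for three PQA-U instances:
preds = {
    "12345": (0.95, 0.03, 0.02),  # confident "yes" -> kept
    "23456": (0.50, 0.40, 0.10),  # uncertain      -> discarded
    "34567": (0.05, 0.92, 0.03),  # confident "no" -> kept
}
kept = pseudo_label(preds)
```

The retained instances can then be mixed into supervised training alongside PQA-L.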
The artificially generated subset contains 211,300 instances created through an automated heuristic process. The construction starts from PubMed articles whose titles are statements rather than questions: each statement title is automatically converted into a question, and the answer is assigned heuristically as "yes" for affirmative statements and "no" for statements containing a negation (which is removed when forming the question).

For example, the statement title "Spontaneous electrocardiogram alterations predict ventricular fibrillation in Brugada syndrome" becomes the question "Do spontaneous electrocardiogram alterations predict ventricular fibrillation in Brugada syndrome?" with the answer "yes."
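A toy sketch of such a statement-to-question conversion (a heavy simplification; the actual PQA-A construction uses more careful hand-written rules, and the negation list here is illustrative):

```python
import re

# Toy statement-to-question heuristic (illustrative only).
NEGATIONS = re.compile(r"\b(does not|do not|did not|cannot|no)\b", re.IGNORECASE)

def statement_to_instance(title: str) -> tuple:
    """Convert a statement-style title to a (question, label) pair."""
    label = "no" if NEGATIONS.search(title) else "yes"
    # Strip the negation so the question is phrased affirmatively.
    stmt = NEGATIONS.sub("", title)
    stmt = re.sub(r"\s+", " ", stmt).strip().rstrip(".")
    # Naively front a "Do" auxiliary; real rules pick the correct verb form.
    question = f"Do {stmt[0].lower() + stmt[1:]}?"
    return question, label

q, label = statement_to_instance(
    "Spontaneous electrocardiogram alterations predict ventricular fibrillation"
)
```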
The label distribution in PQA-A is heavily skewed:
| Answer Label | Percentage |
|---|---|
| Yes | 92.8% |
| No | 7.2% |
| Maybe | 0.0% |
The absence of "maybe" labels and the strong skew toward "yes" are expected consequences of the heuristic approach. Most published research findings are positive (affirming the hypothesis), and the automated method cannot capture the nuance of inconclusive or conditional results.
PQA-A is split into 200,000 training instances and 11,300 validation instances.
The following table summarizes key statistics across all three subsets:
| Property | PQA-L | PQA-U | PQA-A |
|---|---|---|---|
| Number of instances | 1,000 | 61,200 | 211,300 |
| Average question length (words) | 14.4 | ~15 | 16.3 |
| Average context length (words) | 237 | ~238 | 239 |
| Average long answer length (words) | 41 | ~43 | 46 |
| Expert-annotated labels | Yes | No | No (heuristic) |
| Label classes | yes / no / maybe | N/A | yes / no |
Across all three subsets, PubMedQA totals approximately 273,500 instances.
Analysis of the questions in PubMedQA reveals several common research question patterns:
| Question Type | Percentage |
|---|---|
| Factor influence (does X affect Y?) | 36.5% |
| Therapy evaluation (is treatment X effective?) | 26.0% |
| Statement verification (is X true?) | 18.0% |
| Relational queries (is X associated with Y?) | 18.0% |
The dataset also characterizes the types of reasoning required:
| Reasoning Pattern | Percentage |
|---|---|
| Inter-group comparisons | 57.5% |
| Interpreting subgroup statistics | 16.5% |
| Single-group statistics analysis | 16.0% |
| Other | 10.0% |
The dominance of inter-group comparisons reflects the nature of biomedical research, where clinical studies typically compare treatment groups against control groups.
PubMedQA defines two distinct evaluation settings that test different aspects of model capability.
In the reasoning-required setting, models receive only the question and the context (the abstract without its conclusion). The model must reason over the evidence presented in the abstract body to determine whether the answer is yes, no, or maybe. This is the primary evaluation setting used for the official leaderboard, as it tests genuine comprehension and inference rather than simple pattern matching.
Human performance in this setting is 78.0% accuracy and 72.2% macro-F1, reflecting the inherent difficulty of drawing conclusions from incomplete scientific evidence without seeing the authors' own conclusions.
In the reasoning-free setting, models receive the question and the long answer (the abstract's conclusion) instead of the context. Since conclusions typically state the answer explicitly ("Our findings suggest that..." or "No significant difference was observed..."), this setting is considerably easier and primarily serves as a diagnostic tool.
Human performance in the reasoning-free setting reaches 90.4% accuracy and 84.2% macro-F1, confirming that the conclusion usually contains enough information to determine the answer directly.
The primary evaluation metrics for PubMedQA are accuracy and macro-F1, computed over the three answer classes.
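Both metrics are straightforward to compute; a minimal pure-Python sketch (equivalent to `sklearn.metrics.accuracy_score` and `f1_score(average="macro")`):

```python
LABELS = ("yes", "no", "maybe")

def accuracy(gold, pred):
    """Fraction of instances where the prediction matches the gold label."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)

def macro_f1(gold, pred):
    """Unweighted mean of per-class F1 over the three answer labels."""
    f1s = []
    for label in LABELS:
        tp = sum(g == label and p == label for g, p in zip(gold, pred))
        fp = sum(g != label and p == label for g, p in zip(gold, pred))
        fn = sum(g == label and p != label for g, p in zip(gold, pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * precision * recall / (precision + recall)
                   if precision + recall else 0.0)
    return sum(f1s) / len(f1s)

gold = ["yes", "yes", "no", "maybe"]
pred = ["yes", "no", "no", "maybe"]
```

Macro-F1 weights each class equally, so it penalizes models that ignore the rare "maybe" class even though that barely affects accuracy.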
The original PubMedQA paper evaluated several baseline models using a multi-phase fine-tuning pipeline.
The training pipeline consists of three phases designed to maximize the use of all three dataset subsets:
Phase I (Pre-training on PQA-A): The model is fine-tuned on the large PQA-A dataset using question and context as input. This provides the model with a broad foundation of biomedical reasoning patterns, though the labels are noisy.
Phase II (Bootstrapping with PQA-U): This phase uses a three-step bootstrapping process: (1) the model is fine-tuned on PQA-L in the reasoning-free setting, where the long answer is available as input; (2) that model generates pseudo-labels for PQA-U instances from their long answers; and (3) the confidently pseudo-labeled instances are retained as additional training data.
Final Phase (Fine-tuning on PQA-L): The model from Phase I is further fine-tuned on the PQA-L training set, now supplemented with bootstrapped PQA-U instances, in the reasoning-required setting.
The paper introduced an auxiliary training signal alongside the main question answering loss. The model was trained with two loss functions: the standard cross-entropy loss over the yes/no/maybe classification (L_QA), and an auxiliary bag-of-words loss (L_BoW) that trains the model to predict which words appear in the long answer.

The combined loss function was L = L_QA + β · L_BoW, where β was set to zero in the reasoning-free setting (since the long answer was already available as input).
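Schematically, the combined objective can be sketched as follows (a simplified NumPy illustration; the per-word sigmoid formulation of the bag-of-words term is one plausible choice, not necessarily the paper's exact one):

```python
import numpy as np

def qa_loss(class_logits, gold_class):
    """Cross-entropy over the 3-way yes/no/maybe prediction."""
    probs = np.exp(class_logits - class_logits.max())
    probs /= probs.sum()
    return -np.log(probs[gold_class])

def bow_loss(word_logits, gold_bow):
    """Multi-label cross-entropy over the vocabulary: predict which
    words occur in the long answer (gold_bow is a 0/1 vector)."""
    probs = 1.0 / (1.0 + np.exp(-word_logits))  # per-word sigmoid
    return -np.mean(gold_bow * np.log(probs) + (1 - gold_bow) * np.log(1 - probs))

def combined_loss(class_logits, gold_class, word_logits, gold_bow, beta=1.0):
    # beta = 0 in the reasoning-free setting, where the long answer is input.
    return qa_loss(class_logits, gold_class) + beta * bow_loss(word_logits, gold_bow)
```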
The following table presents results from the original paper on the PQA-L test set in the reasoning-required setting:
| Model | Accuracy | Macro-F1 |
|---|---|---|
| Majority baseline | 55.2% | N/A |
| Shallow Features (w/o A.S.) | 53.9% | 36.1% |
| Shallow Features (w/ A.S.) | 53.6% | 35.9% |
| BiLSTM (w/o A.S.) | 55.2% | 24.0% |
| BiLSTM (w/ A.S.) | 55.2% | 23.9% |
| ESIM w/ BioELMo (w/o A.S.) | 53.9% | 32.4% |
| ESIM w/ BioELMo (w/ A.S.) | 54.0% | 31.1% |
| BioBERT (w/o A.S.) | 57.0% | 28.5% |
| BioBERT (w/ A.S.) | 57.3% | 28.7% |
These are "Final Only" results (trained only on PQA-L without multi-phase training). With the full multi-phase pipeline, performance improved substantially:
| Model (Multi-Phase) | Accuracy | Macro-F1 |
|---|---|---|
| BiLSTM (w/o A.S.) | 59.8% | 41.9% |
| BiLSTM (w/ A.S.) | 58.9% | 41.1% |
| ESIM w/ BioELMo (w/o A.S.) | 62.1% | 45.8% |
| ESIM w/ BioELMo (w/ A.S.) | 63.7% | 47.9% |
| BioBERT (w/o A.S.) | 67.7% | 52.4% |
| BioBERT (w/ A.S.) | 68.1% | 52.7% |
| Human (single, reasoning-required) | 78.0% | 72.2% |
The best-performing baseline, BioBERT with additional supervision in the multi-phase setting, achieved 68.1% accuracy and 52.7% macro-F1. This remained nearly 10 percentage points below single human performance, highlighting the difficulty of biomedical reasoning for machine learning models at the time.
Since its release, PubMedQA has served as a standard benchmark in the evaluation of biomedical and medical language models. The official leaderboard, maintained by Qiao Jin at pubmedqa.github.io, tracks submissions in the reasoning-required setting. Over time, advances in pre-trained language models, domain-specific fine-tuning, and sophisticated prompting strategies have pushed performance well beyond the original baselines.
The following table summarizes key results on the PubMedQA leaderboard (reasoning-required setting, PQA-L test set of 500 instances):
| Model | Organization | Accuracy | Year | Notes |
|---|---|---|---|---|
| BioBERT (multi-phase, w/ A.S.) | Original paper | 68.1% | 2019 | First strong baseline |
| BioMedLM (2.7B) | Stanford CRFM | 74.4% | 2022 | Smaller domain-specific model |
| BioGPT (base) | Microsoft Research | 78.2% | 2022 | Generative pre-trained transformer for biomedical text |
| Flan-PaLM (540B) | Google | 79.0% | 2022 | Instruction-tuned PaLM at 540B parameters |
| Claude 3 | Anthropic | 79.7% | 2024 | General-purpose LLM |
| BioGPT-Large (1.5B) | Microsoft Research | 81.0% | 2022 | Scaled-up BioGPT |
| Palmyra-Med (40B) | Writer Inc. | 81.1% | 2023 | Domain-specific medical LLM |
| MEDITRON (70B) | EPFL | 81.6% | 2023 | Open-source medical LLM with chain-of-thought + self-consistency |
| Med-PaLM 2 | Google Research & DeepMind | 81.8% | 2023 | With self-consistency (11 samples) |
| GPT-4 (Medprompt) | Microsoft | 82.0% | 2023 | Advanced prompting strategy without fine-tuning |
As of the last leaderboard update in April 2024, GPT-4 with Medprompt from Microsoft holds the top position at 82.0% accuracy. This result surpasses single human performance (78.0%), though it is worth noting that the Medprompt strategy uses ensemble-style inference with multiple prompting phases at test time.
Several patterns emerge from the progression of PubMedQA scores:
Domain-specific pre-training pays off. Models like BioGPT, BioMedLM, and MEDITRON, which were pre-trained or fine-tuned on biomedical corpora, consistently outperform general-purpose models of similar size. BioGPT-Large achieved 81.0% with only 1.5 billion parameters, competitive with much larger general models.
Prompting strategies matter. Med-PaLM 2 improved from 75.0% with standard prompting to 81.8% with self-consistency prompting (sampling 11 responses and taking the majority vote). Similarly, GPT-4's Medprompt strategy boosted performance to 82.0% without any domain-specific fine-tuning, using chain-of-thought prompting with automatically selected in-context examples.
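Self-consistency reduces to a majority vote over independently sampled answers; a minimal sketch (the sampled answers below are hypothetical):

```python
from collections import Counter

def self_consistency(sampled_answers):
    """Majority vote over independently sampled model answers
    (the aggregation step of self-consistency prompting)."""
    return Counter(sampled_answers).most_common(1)[0][0]

# Hypothetical 11 sampled answers for one PubMedQA question:
samples = ["yes"] * 6 + ["maybe"] * 3 + ["no"] * 2
answer = self_consistency(samples)
```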
Scale alone is not sufficient. The base GPT-4 model in a zero-shot setting achieves roughly 75% accuracy on PubMedQA, below several smaller domain-specialized models. This suggests that biomedical reasoning requires either domain-specific training data or carefully designed inference strategies.
The human-machine gap has closed. While the original 2019 baselines lagged human performance by nearly 10 points, modern models have surpassed the single-human benchmark of 78.0%. However, the authors of Med-PaLM 2 noted that many remaining errors on PubMedQA may be attributable to label noise in the relatively small 500-instance test set, raising questions about the benchmark's ceiling.
PubMedQA occupies a specific niche within the landscape of medical and biomedical NLP benchmarks. The following table highlights how it compares with other commonly used datasets:
| Benchmark | Domain | Task Type | Size | Answer Format |
|---|---|---|---|---|
| PubMedQA | Biomedical research | Research QA | 273K total (1K labeled) | Yes / No / Maybe |
| MedQA (USMLE) | Clinical medicine | Exam QA | ~12,700 | 4-option multiple choice |
| BioASQ | Biomedical | Multi-type QA | Varies by challenge year | Factoid, list, yes/no, summary |
| MMLU (Medical subsets) | Medical knowledge | Exam QA | ~1,000+ per subset | 4-option multiple choice |
| MedMCQA | Medical entrance exams | Exam QA | ~194K | 4-option multiple choice |
| HealthSearchQA | Consumer health | Open-domain QA | ~3,375 | Free-text |
PubMedQA is distinctive in several ways. First, it tests reasoning over primary research literature rather than textbook knowledge. Second, its three-way answer format (yes/no/maybe) is more nuanced than binary classification, capturing the reality that scientific evidence is sometimes inconclusive. Third, the inclusion of the long answer (conclusion) enables both reasoning-required and reasoning-free evaluation paradigms.
PubMedQA is included in several composite evaluation suites. It appears in BLURB (the Biomedical Language Understanding and Reasoning Benchmark maintained by Microsoft Research), which aggregates multiple biomedical NLP tasks into a unified leaderboard. It is also part of the Open Medical-LLM Leaderboard on Hugging Face, which benchmarks language models across several medical question answering datasets.
PubMedQA's primary application is as an evaluation benchmark for biomedical NLP systems. Researchers developing new language models, fine-tuning strategies, or prompting techniques for the medical domain routinely report PubMedQA scores alongside results on MedQA, BioASQ, and other benchmarks. The dataset's inclusion in composite benchmarks like BLURB ensures that it remains central to biomedical NLP research.
While PubMedQA itself is a research benchmark, the capability it tests (answering biomedical research questions from abstracts) has direct relevance to clinical decision support systems. A model that performs well on PubMedQA has demonstrated the ability to extract and synthesize evidence from medical literature, a skill that could be applied in evidence-based medicine tools, systematic review assistants, and clinical knowledge retrieval systems.
The task format of PubMedQA closely mirrors what a researcher does when scanning PubMed abstracts to answer a specific research question. Organizations evaluating language models for automated literature review applications often look at PubMedQA scores (alongside BioASQ scores) as a proxy for how well a model might perform on real-world literature synthesis tasks.
Beyond evaluation, the PQA-A and PQA-U subsets serve as training resources. The 211,300 artificially labeled instances in PQA-A provide a large-scale, if noisy, signal for pre-training or fine-tuning biomedical QA models. The 61,200 unlabeled instances in PQA-U enable semi-supervised learning approaches that can improve model performance without additional human annotation.
The official PubMedQA test set contains only 500 instances. With top models now achieving accuracy above 80%, the difference between models often falls within the margin of statistical uncertainty. A single misclassified instance changes accuracy by 0.2 percentage points. The authors of Med-PaLM 2 noted that remaining errors may be partly due to label noise, suggesting that the benchmark's effective ceiling could be lower than 100%.
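The statistical uncertainty is easy to quantify; a sketch of the normal-approximation 95% confidence interval for an accuracy measured on n = 500 test items:

```python
import math

def accuracy_ci(acc, n, z=1.96):
    """Normal-approximation 95% confidence interval for an accuracy
    estimated from n independent test items."""
    se = math.sqrt(acc * (1 - acc) / n)
    return acc - z * se, acc + z * se

low, high = accuracy_ci(0.82, 500)
# The interval spans roughly +/- 3.4 percentage points, so score
# differences of a point or two between models are not meaningful.
```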
Despite the careful annotation process, the three-way classification (yes/no/maybe) involves subjective judgment, particularly for the "maybe" category. Different domain experts may disagree on whether a study with borderline statistical significance warrants a "yes" or "maybe" label. The authors estimated that annotation error "could be as low as 1%," but this estimate was based on internal consistency checks rather than external validation.
The artificial subset has a 92.8% "yes" label rate and contains no "maybe" instances. Models pre-trained heavily on PQA-A may develop a bias toward predicting "yes," which could hurt performance on the more balanced PQA-L test set. Researchers using PQA-A for training should account for this distributional mismatch.
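One common mitigation (a standard technique, not prescribed by the paper) is inverse-frequency class weighting when training on PQA-A; the counts below are approximate, derived from the stated 92.8%/7.2% split:

```python
def inverse_frequency_weights(label_counts):
    """Weight each class inversely to its frequency so the skewed
    'yes' majority in PQA-A does not dominate the training loss.
    Weights are normalized to average 1 across classes."""
    total = sum(label_counts.values())
    raw = {label: total / count for label, count in label_counts.items()}
    norm = sum(raw.values())
    return {label: w * len(raw) / norm for label, w in raw.items()}

# Approximate PQA-A counts: 92.8% "yes", 7.2% "no" of 211,300 instances.
weights = inverse_frequency_weights({"yes": 196086, "no": 15214})
# "no" instances receive roughly 13x the weight of "yes" instances.
```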
All PubMedQA questions are derived from PubMed article titles, which tend to follow specific rhetorical patterns common in biomedical research writing. The dataset may not fully represent the range of questions that a clinician or researcher might ask about a scientific paper. More natural, free-form questions about biomedical literature are not covered.
PubMedQA contains only English-language content, reflecting the dominance of English in international biomedical publishing. This limits its applicability as a benchmark for multilingual biomedical NLP systems.
PubMedQA is distributed in JSON format. The official data files include:
- `ori_pqal.json`: the labeled subset (PQA-L)
- `ori_pqau.json`: the unlabeled subset (PQA-U)
- `ori_pqaa.json`: the artificial subset (PQA-A)
- `test_ground_truth.json`: ground truth labels for the 500-instance test set

The dataset is available through multiple channels:

- The official repository, linked from pubmedqa.github.io
- The Hugging Face Hub dataset `qiaojin/PubMedQA`, with all three subsets in Parquet format (~300 MB total)

Each instance is keyed by its PubMed ID (PMID) and contains the following fields:
| Field | Description |
|---|---|
| pubid | PubMed article identifier (integer) |
| question | Research question in yes/no/maybe format (text) |
| context.contexts | Relevant abstract passage(s) with conclusion removed (text) |
| context.labels | Section labels from the structured abstract (text) |
| context.meshes | MeSH (Medical Subject Headings) descriptors (text) |
| long_answer | The conclusion section serving as the natural language answer (text) |
| final_decision | Answer label: "yes," "no," or "maybe" (text) |
PQA-L instances additionally include `reasoning_required_pred` and `reasoning_free_pred` fields that store the individual annotator predictions under each evaluation setting.
The official repository includes an `evaluation.py` script that accepts model predictions in JSON format (where the key is a PMID and the value is "yes," "no," or "maybe") and computes accuracy and macro-F1 against the test set ground truth.
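A PQA-L-style instance therefore has roughly the following shape (field values abbreviated and illustrative, not an actual dataset record; the question is the example quoted earlier in this article):

```python
# Illustrative PQA-L-style instance (abridged; not a real record).
instance = {
    "pubid": 0,  # placeholder PubMed ID
    "question": "Do preoperative statins reduce atrial fibrillation after "
                "coronary artery bypass grafting?",
    "context": {
        "contexts": ["(abstract passages with the conclusion removed)"],
        "labels": ["BACKGROUND", "METHODS", "RESULTS"],
        "meshes": ["Atrial Fibrillation", "Coronary Artery Bypass"],
    },
    "long_answer": "(the abstract's conclusion section)",
    "final_decision": "yes",  # illustrative label
}

# The reasoning-required task: predict final_decision from question + context.
inputs = (instance["question"], instance["context"]["contexts"])
gold = instance["final_decision"]
```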
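A prediction file in that format can be produced as follows (the PMIDs and labels shown are hypothetical, and the filename is arbitrary):

```python
import json

# Hypothetical model predictions, keyed by PMID as the evaluation
# script expects; values must be "yes", "no", or "maybe".
predictions = {
    "10000001": "yes",
    "10000002": "no",
    "10000003": "maybe",
}

with open("predictions.json", "w") as f:
    json.dump(predictions, f, indent=2)
```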
The dataset can be loaded with the Hugging Face Datasets library:
```python
from datasets import load_dataset

# Load the labeled subset
pqa_labeled = load_dataset("qiaojin/PubMedQA", "pqa_labeled", split="train")

# Load the artificial subset
pqa_artificial = load_dataset("qiaojin/PubMedQA", "pqa_artificial", split="train")

# Load the unlabeled subset
pqa_unlabeled = load_dataset("qiaojin/PubMedQA", "pqa_unlabeled", split="train")
```
As of early 2026, the dataset sees over 20,000 monthly downloads on Hugging Face and has been used to fine-tune roughly a hundred publicly listed models.
PubMedQA has had a significant influence on the development of biomedical NLP and medical AI. It was among the first benchmarks to demonstrate that biomedical question answering requires not just language understanding but genuine scientific reasoning. The dataset's design, which separates evidence from conclusions, created a clean framework for testing whether models can draw inferences from experimental results.
The benchmark helped catalyze the development of domain-specific biomedical language models. BioGPT, MEDITRON, Palmyra-Med, and other biomedical LLMs were all evaluated on PubMedQA as part of their release. The dataset's inclusion in BLURB and the Open Medical-LLM Leaderboard ensures its continued relevance in the field.
PubMedQA also contributed to the broader understanding of how large language models handle specialized scientific reasoning. The rapid progression from 68.1% accuracy (BioBERT, 2019) to 82.0% (GPT-4 Medprompt, 2023) tracks the broader arc of LLM capabilities, while the remaining gap to perfect performance highlights the continuing challenges of biomedical text comprehension.
The dataset's creators, particularly Qiao Jin, have continued to maintain the benchmark and update the leaderboard. Jin went on to contribute to other influential medical AI projects, including work at the National Institutes of Health on applying large language models to biomedical research tasks.