PubMedQA is a biomedical question answering dataset and benchmark designed to evaluate the ability of machine learning models to answer research questions using evidence from PubMed abstracts. Introduced by Qiao Jin, Bhuwan Dhingra, Zhengping Liu, William W. Cohen, and Xinghua Lu in 2019, PubMedQA was the first question answering dataset that specifically required reasoning over biomedical research texts, including quantitative experimental results. The dataset was presented at the 2019 Conference on Empirical Methods in Natural Language Processing (EMNLP-IJCNLP 2019) and has since become one of the most widely used benchmarks for evaluating large language models and natural language processing systems in the biomedical domain.
PubMedQA consists of three subsets totaling over 273,000 question-answer instances: 1,000 expert-annotated examples (PQA-L), 61,200 unlabeled examples (PQA-U), and 211,300 artificially generated examples (PQA-A). The core task requires models to read a research question and its associated PubMed abstract context, then classify the answer as "yes," "no," or "maybe." The benchmark is publicly available at pubmedqa.github.io and is distributed under the MIT license.
Before PubMedQA, existing question answering datasets in the biomedical domain focused primarily on factoid questions (asking for specific entities or facts) or list-type questions (asking for collections of items). Datasets like BioASQ had advanced the field of biomedical QA, but none specifically targeted the challenge of answering yes/no research questions that demand reasoning over scientific evidence.
The authors of PubMedQA observed that approximately 760,000 articles in PubMed use questions as their titles. Among those, roughly 120,000 have structured abstracts divided into labeled subsections such as "Background," "Methods," "Results," and "Conclusions." This structure creates a natural question-answer pair: the title poses a research question, the abstract body provides the evidence, and the conclusion provides the answer. This observation formed the foundation for constructing PubMedQA.
A central motivation behind PubMedQA was to create a benchmark that tests whether models can perform genuine reasoning over biomedical texts rather than simply matching keywords or relying on superficial patterns. Many biomedical research questions require understanding experimental designs, interpreting statistical comparisons between groups, and synthesizing multiple pieces of evidence from different parts of an abstract. For example, answering "Do preoperative statins reduce atrial fibrillation after coronary artery bypass grafting?" requires reading the study's results, understanding the comparison between statin-treated and control groups, and drawing a conclusion from the reported statistical findings.
PubMedQA draws its content from PubMed, the biomedical literature database maintained by the National Library of Medicine. The dataset specifically targets articles with question-format titles and structured abstracts. Each instance in PubMedQA contains four components: a question (the article's title), a context (the structured abstract with its conclusion removed), a long answer (the conclusion section), and, for the labeled subsets, a yes/no/maybe answer label.
The expert-annotated subset, PQA-L, contains 1,000 instances that were manually labeled, following a carefully designed annotation protocol, by qualified annotators (M.D. candidates) with biomedical domain expertise.
The final label distribution in PQA-L is:
| Answer Label | Percentage | Count (approx.) |
|---|---|---|
| Yes | 55.2% | 552 |
| No | 33.8% | 338 |
| Maybe | 11.0% | 110 |
The "yes" majority gives a majority-class baseline accuracy of 55.2%. The relatively small proportion of "maybe" answers reflects the fact that most published research reaches a definitive conclusion, while the "maybe" label captures cases where the evidence is mixed or the answer depends on specific conditions.
The PQA-L subset is split into 500 instances for training/development (evaluated using 10-fold cross-validation) and 500 instances reserved as the official test set for leaderboard evaluation.
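The 10-fold protocol over the 500 train/development instances can be sketched with plain index arithmetic (an illustrative sketch, not the official evaluation code):

```python
def ten_fold_splits(n=500, k=10):
    """Yield (train_indices, dev_indices) pairs for k-fold
    cross-validation over the PQA-L train/dev pool."""
    fold = n // k
    for i in range(k):
        dev = list(range(i * fold, (i + 1) * fold))
        train = [j for j in range(n) if j not in set(dev)]
        yield train, dev

# Ten folds of 50 held-out instances each, 450 for training per fold.
splits = list(ten_fold_splits())
```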
Per the annotation criteria specified in the paper, "yes" and "no" apply when the evidence in the abstract clearly supports or rejects the question's premise, while "maybe" applies when the evidence is mixed or the answer holds only under specific conditions.
The unlabeled subset contains 61,200 context-question pairs collected from PubMed articles that met the same structural criteria as PQA-L (question titles, structured abstracts). These instances include the question, context, and long answer, but lack expert-annotated yes/no/maybe labels.
PQA-U serves an important role in the multi-phase training pipeline described in the original paper. Models can leverage these unlabeled instances through semi-supervised learning techniques such as pseudo-labeling, where a model trained on PQA-L generates predicted labels for PQA-U instances and then trains on its own confident predictions.
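In sketch form, confidence-thresholded pseudo-labeling looks like the following (a generic illustration, not the paper's exact multi-phase pipeline; the threshold value is an assumption):

```python
LABELS = ("yes", "no", "maybe")

def pseudo_label(probabilities, threshold=0.9):
    """Keep only the unlabeled examples the model is confident about.

    `probabilities` maps example IDs to probability triples over
    (yes, no, maybe); returns {example_id: pseudo_label}.
    """
    selected = {}
    for example_id, probs in probabilities.items():
        best = max(range(len(LABELS)), key=lambda i: probs[i])
        if probs[best] >= threshold:
            selected[example_id] = LABELS[best]
    return selected

# Hypothetical model outputs for three PQA-U instances:
preds = {
    "12345": (0.95, 0.03, 0.02),  # confident "yes" -> kept
    "23456": (0.50, 0.40, 0.10),  # uncertain      -> discarded
    "34567": (0.05, 0.92, 0.03),  # confident "no" -> kept
}
kept = pseudo_label(preds)
```

The retained instances can then be mixed into supervised training alongside PQA-L.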
The artificially generated subset contains 211,300 instances created through an automated heuristic process. The construction starts from PubMed articles whose titles are statements rather than questions: each statement title is automatically converted into a question, and the answer is assigned heuristically as "yes" for affirmative statements and "no" for statements containing a negation (which is removed when forming the question).

For example, the statement title "Spontaneous electrocardiogram alterations predict ventricular fibrillation in Brugada syndrome" becomes the question "Do spontaneous electrocardiogram alterations predict ventricular fibrillation in Brugada syndrome?" with the answer "yes."
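A toy sketch of such a statement-to-question conversion (a heavy simplification; the actual PQA-A construction uses more careful hand-written rules, and the negation list here is illustrative):

```python
import re

# Toy statement-to-question heuristic (illustrative only).
NEGATIONS = re.compile(r"\b(does not|do not|did not|cannot|no)\b", re.IGNORECASE)

def statement_to_instance(title: str) -> tuple:
    """Convert a statement-style title to a (question, label) pair."""
    label = "no" if NEGATIONS.search(title) else "yes"
    # Strip the negation so the question is phrased affirmatively.
    stmt = NEGATIONS.sub("", title)
    stmt = re.sub(r"\s+", " ", stmt).strip().rstrip(".")
    # Naively front a "Do" auxiliary; real rules pick the correct verb form.
    question = f"Do {stmt[0].lower() + stmt[1:]}?"
    return question, label

q, label = statement_to_instance(
    "Spontaneous electrocardiogram alterations predict ventricular fibrillation"
)
```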
The label distribution in PQA-A is heavily skewed:
| Answer Label | Percentage |
|---|---|
| Yes | 92.8% |
| No | 7.2% |
| Maybe | 0.0% |
The absence of "maybe" labels and the strong skew toward "yes" are expected consequences of the heuristic approach. Most published research findings are positive (affirming the hypothesis), and the automated method cannot capture the nuance of inconclusive or conditional results.
PQA-A is split into 200,000 training instances and 11,300 validation instances.
The following table summarizes key statistics across all three subsets:
| Property | PQA-L | PQA-U | PQA-A |
|---|---|---|---|
| Number of instances | 1,000 | 61,200 | 211,300 |
| Average question length (words) | 14.4 | ~15 | 16.3 |
| Average context length (words) | 237 | ~238 | 239 |
| Average long answer length (words) | 41 | ~43 | 46 |
| Expert-annotated labels | Yes | No | No (heuristic) |
| Label classes | yes / no / maybe | N/A | yes / no |
Across all three subsets, PubMedQA totals approximately 273,500 instances.
Analysis of the questions in PubMedQA reveals several common research question patterns:
| Question Type | Percentage |
|---|---|
| Factor influence (does X affect Y?) | 36.5% |
| Therapy evaluation (is treatment X effective?) | 26.0% |
| Statement verification (is X true?) | 18.0% |
| Relational queries (is X associated with Y?) | 18.0% |
The dataset also characterizes the types of reasoning required:
| Reasoning Pattern | Percentage |
|---|---|
| Inter-group comparisons | 57.5% |
| Interpreting subgroup statistics | 16.5% |
| Single-group statistics analysis | 16.0% |
| Other | 10.0% |
The dominance of inter-group comparisons reflects the nature of biomedical research, where clinical studies typically compare treatment groups against control groups.
PubMedQA defines two distinct evaluation settings that test different aspects of model capability.
In the reasoning-required setting, models receive only the question and the context (the abstract without its conclusion). The model must reason over the evidence presented in the abstract body to determine whether the answer is yes, no, or maybe. This is the primary evaluation setting used for the official leaderboard, as it tests genuine comprehension and inference rather than simple pattern matching.
Human performance in this setting is 78.0% accuracy and 72.2% macro-F1, reflecting the inherent difficulty of drawing conclusions from incomplete scientific evidence without seeing the authors' own conclusions.
In the reasoning-free setting, models receive the question and the long answer (the abstract's conclusion) instead of the context. Since conclusions typically state the answer explicitly ("Our findings suggest that..." or "No significant difference was observed..."), this setting is considerably easier and primarily serves as a diagnostic tool.
Human performance in the reasoning-free setting reaches 90.4% accuracy and 84.2% macro-F1, confirming that the conclusion usually contains enough information to determine the answer directly.
The primary evaluation metrics for PubMedQA are accuracy and macro-F1, computed over the three answer classes.
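Both metrics are straightforward to compute; a minimal pure-Python sketch (equivalent to `sklearn.metrics.accuracy_score` and `f1_score(average="macro")`):

```python
LABELS = ("yes", "no", "maybe")

def accuracy(gold, pred):
    """Fraction of instances where the prediction matches the gold label."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)

def macro_f1(gold, pred):
    """Unweighted mean of per-class F1 over the three answer labels."""
    f1s = []
    for label in LABELS:
        tp = sum(g == label and p == label for g, p in zip(gold, pred))
        fp = sum(g != label and p == label for g, p in zip(gold, pred))
        fn = sum(g == label and p != label for g, p in zip(gold, pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * precision * recall / (precision + recall)
                   if precision + recall else 0.0)
    return sum(f1s) / len(f1s)

gold = ["yes", "yes", "no", "maybe"]
pred = ["yes", "no", "no", "maybe"]
```

Macro-F1 weights each class equally, so it penalizes models that ignore the rare "maybe" class even though that barely affects accuracy.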
The original PubMedQA paper evaluated several baseline models using a multi-phase fine-tuning pipeline.
The training pipeline consists of three phases designed to maximize the use of all three dataset subsets:
Phase I (Pre-training on PQA-A): The model is fine-tuned on the large PQA-A dataset using question and context as input. This provides the model with a broad foundation of biomedical reasoning patterns, though the labels are noisy.
Phase II (Bootstrapping with PQA-U): This phase uses a three-step bootstrapping process: (1) the model is fine-tuned on PQA-L in the reasoning-free setting, where the long answer is available as input; (2) that model generates pseudo-labels for PQA-U instances from their long answers; and (3) the confidently pseudo-labeled instances are retained as additional training data.
Final Phase (Fine-tuning on PQA-L): The model from Phase I is further fine-tuned on the PQA-L training set, now supplemented with bootstrapped PQA-U instances, in the reasoning-required setting.
The paper introduced an auxiliary training signal alongside the main question answering loss. The model was trained with two loss functions: the standard cross-entropy loss over the yes/no/maybe classification (L_QA), and an auxiliary bag-of-words loss (L_BoW) that trains the model to predict which words appear in the long answer.

The combined loss function was L = L_QA + β · L_BoW, where β was set to zero in the reasoning-free setting (since the long answer was already available as input).
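Schematically, the combined objective can be sketched as follows (a simplified NumPy illustration; the per-word sigmoid formulation of the bag-of-words term is one plausible choice, not necessarily the paper's exact one):

```python
import numpy as np

def qa_loss(class_logits, gold_class):
    """Cross-entropy over the 3-way yes/no/maybe prediction."""
    probs = np.exp(class_logits - class_logits.max())
    probs /= probs.sum()
    return -np.log(probs[gold_class])

def bow_loss(word_logits, gold_bow):
    """Multi-label cross-entropy over the vocabulary: predict which
    words occur in the long answer (gold_bow is a 0/1 vector)."""
    probs = 1.0 / (1.0 + np.exp(-word_logits))  # per-word sigmoid
    return -np.mean(gold_bow * np.log(probs) + (1 - gold_bow) * np.log(1 - probs))

def combined_loss(class_logits, gold_class, word_logits, gold_bow, beta=1.0):
    # beta = 0 in the reasoning-free setting, where the long answer is input.
    return qa_loss(class_logits, gold_class) + beta * bow_loss(word_logits, gold_bow)
```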
The following table presents results from the original paper on the PQA-L test set in the reasoning-required setting:
| Model | Accuracy | Macro-F1 |
|---|---|---|
| Majority baseline | 55.2% | N/A |
| Shallow Features (w/o A.S.) | 53.9% | 36.1% |
| Shallow Features (w/ A.S.) | 53.6% | 35.9% |
| BiLSTM (w/o A.S.) | 55.2% | 24.0% |
| BiLSTM (w/ A.S.) | 55.2% | 23.9% |
| ESIM w/ BioELMo (w/o A.S.) | 53.9% | 32.4% |
| ESIM w/ BioELMo (w/ A.S.) | 54.0% | 31.1% |
| BioBERT (w/o A.S.) | 57.0% | 28.5% |
| BioBERT (w/ A.S.) | 57.3% | 28.7% |
These are "Final Only" results (trained only on PQA-L without multi-phase training). With the full multi-phase pipeline, performance improved substantially:
| Model (Multi-Phase) | Accuracy | Macro-F1 |
|---|---|---|
| BiLSTM (w/o A.S.) | 59.8% | 41.9% |
| BiLSTM (w/ A.S.) | 58.9% | 41.1% |
| ESIM w/ BioELMo (w/o A.S.) | 62.1% | 45.8% |
| ESIM w/ BioELMo (w/ A.S.) | 63.7% | 47.9% |
| BioBERT (w/o A.S.) | 67.7% | 52.4% |
| BioBERT (w/ A.S.) | 68.1% | 52.7% |
| Human (single, reasoning-required) | 78.0% | 72.2% |
The best-performing baseline, BioBERT with additional supervision in the multi-phase setting, achieved 68.1% accuracy and 52.7% macro-F1. This remained nearly 10 percentage points below single human performance, highlighting the difficulty of biomedical reasoning for machine learning models at the time.
Since its release, PubMedQA has served as a standard benchmark in the evaluation of biomedical and medical language models. The official leaderboard, maintained by Qiao Jin at pubmedqa.github.io, tracks submissions in the reasoning-required setting. Over time, advances in pre-trained language models, domain-specific fine-tuning, and sophisticated prompting strategies have pushed performance well beyond the original baselines.
The following table summarizes key results on the PubMedQA leaderboard (reasoning-required setting, PQA-L test set of 500 instances):
| Model | Organization | Accuracy | Year | Notes |
|---|---|---|---|---|
| BioBERT (multi-phase, w/ A.S.) | Original paper | 68.1% | 2019 | First strong baseline |
| BioMedLM (2.7B) | Stanford CRFM | 74.4% | 2022 | Smaller domain-specific model |
| BioGPT (base) | Microsoft Research | 78.2% | 2022 | Generative pre-trained transformer for biomedical text |
| Flan-PaLM (540B) | Google | 79.0% | 2022 | Instruction-tuned PaLM at 540B parameters |
| Claude 3 | Anthropic | 79.7% | 2024 | General-purpose LLM |
| BioGPT-Large (1.5B) | Microsoft Research | 81.0% | 2022 | Scaled-up BioGPT |
| Palmyra-Med (40B) | Writer Inc. | 81.1% | 2023 | Domain-specific medical LLM |
| MEDITRON (70B) | EPFL | 81.6% | 2023 | Open-source medical LLM with chain-of-thought + self-consistency |
| Med-PaLM 2 | Google Research & DeepMind | 81.8% | 2023 | With self-consistency (11 samples) |
| GPT-4 (Medprompt) | Microsoft | 82.0% | 2023 | Advanced prompting strategy without fine-tuning |
As of the last leaderboard update in April 2024, GPT-4 with Medprompt from Microsoft holds the top position at 82.0% accuracy. This result surpasses single human performance (78.0%), though it is worth noting that the Medprompt strategy uses ensemble-style inference with multiple prompting phases at test time.
Several patterns emerge from the progression of PubMedQA scores:
Domain-specific pre-training pays off. Models like BioGPT, BioMedLM, and MEDITRON, which were pre-trained or fine-tuned on biomedical corpora, consistently outperform general-purpose models of similar size. BioGPT-Large achieved 81.0% with only 1.5 billion parameters, competitive with much larger general models.
Prompting strategies matter. Med-PaLM 2 improved from 75.0% with standard prompting to 81.8% with self-consistency prompting (sampling 11 responses and taking the majority vote). Similarly, GPT-4's Medprompt strategy boosted performance to 82.0% without any domain-specific fine-tuning, using chain-of-thought prompting with automatically selected in-context examples.
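Self-consistency reduces to a majority vote over independently sampled answers; a minimal sketch (the sampled answers below are hypothetical):

```python
from collections import Counter

def self_consistency(sampled_answers):
    """Majority vote over independently sampled model answers
    (the aggregation step of self-consistency prompting)."""
    return Counter(sampled_answers).most_common(1)[0][0]

# Hypothetical 11 sampled answers for one PubMedQA question:
samples = ["yes"] * 6 + ["maybe"] * 3 + ["no"] * 2
answer = self_consistency(samples)
```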
Scale alone is not sufficient. The base GPT-4 model in a zero-shot setting achieves roughly 75% accuracy on PubMedQA, below several smaller domain-specialized models. This suggests that biomedical reasoning requires either domain-specific training data or carefully designed inference strategies.
The human-machine gap has closed. While the original 2019 baselines lagged human performance by nearly 10 points, modern models have surpassed the single-human benchmark of 78.0%. However, the authors of Med-PaLM 2 noted that many remaining errors on PubMedQA may be attributable to label noise in the relatively small 500-instance test set, raising questions about the benchmark's ceiling.
PubMedQA occupies a specific niche within the landscape of medical and biomedical NLP benchmarks. The following table highlights how it compares with other commonly used datasets:
| Benchmark | Domain | Task Type | Size | Answer Format |
|---|---|---|---|---|
| PubMedQA | Biomedical research | Research QA | 273K total (1K labeled) | Yes / No / Maybe |
| MedQA (USMLE) | Clinical medicine | Exam QA | ~12,700 | 4-option multiple choice |
| BioASQ | Biomedical | Multi-type QA | Varies by challenge year | Factoid, list, yes/no, summary |
| MMLU (Medical subsets) | Medical knowledge | Exam QA | ~1,000+ per subset | 4-option multiple choice |
| MedMCQA | Medical entrance exams | Exam QA | ~194K | 4-option multiple choice |
| HealthSearchQA | Consumer health | Open-domain QA | ~3,375 | Free-text |
PubMedQA is distinctive in several ways. First, it tests reasoning over primary research literature rather than textbook knowledge. Second, its three-way answer format (yes/no/maybe) is more nuanced than binary classification, capturing the reality that scientific evidence is sometimes inconclusive. Third, the inclusion of the long answer (conclusion) enables both reasoning-required and reasoning-free evaluation paradigms.
PubMedQA is included in several composite evaluation suites. It appears in BLURB (the Biomedical Language Understanding and Reasoning Benchmark maintained by Microsoft Research), which aggregates multiple biomedical NLP tasks into a unified leaderboard. It is also part of the Open Medical-LLM Leaderboard on Hugging Face, which benchmarks language models across several medical question answering datasets.
PubMedQA's primary application is as an evaluation benchmark for biomedical NLP systems. Researchers developing new language models, fine-tuning strategies, or prompting techniques for the medical domain routinely report PubMedQA scores alongside results on MedQA, BioASQ, and other benchmarks. The dataset's inclusion in composite benchmarks like BLURB ensures that it remains central to biomedical NLP research.
While PubMedQA itself is a research benchmark, the capability it tests (answering biomedical research questions from abstracts) has direct relevance to clinical decision support systems. A model that performs well on PubMedQA has demonstrated the ability to extract and synthesize evidence from medical literature, a skill that could be applied in evidence-based medicine tools, systematic review assistants, and clinical knowledge retrieval systems.
The task format of PubMedQA closely mirrors what a researcher does when scanning PubMed abstracts to answer a specific research question. Organizations evaluating language models for automated literature review applications often look at PubMedQA scores (alongside BioASQ scores) as a proxy for how well a model might perform on real-world literature synthesis tasks.
Beyond evaluation, the PQA-A and PQA-U subsets serve as training resources. The 211,300 artificially labeled instances in PQA-A provide a large-scale, if noisy, signal for pre-training or fine-tuning biomedical QA models. The 61,200 unlabeled instances in PQA-U enable semi-supervised learning approaches that can improve model performance without additional human annotation.
The official PubMedQA test set contains only 500 instances. With top models now achieving accuracy above 80%, the difference between models often falls within the margin of statistical uncertainty. A single misclassified instance changes accuracy by 0.2 percentage points. The authors of Med-PaLM 2 noted that remaining errors may be partly due to label noise, suggesting that the benchmark's effective ceiling could be lower than 100%.
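The statistical uncertainty is easy to quantify; a sketch of the normal-approximation 95% confidence interval for an accuracy measured on n = 500 test items:

```python
import math

def accuracy_ci(acc, n, z=1.96):
    """Normal-approximation 95% confidence interval for an accuracy
    estimated from n independent test items."""
    se = math.sqrt(acc * (1 - acc) / n)
    return acc - z * se, acc + z * se

low, high = accuracy_ci(0.82, 500)
# The interval spans roughly +/- 3.4 percentage points, so score
# differences of a point or two between models are not meaningful.
```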
Despite the careful annotation process, the three-way classification (yes/no/maybe) involves subjective judgment, particularly for the "maybe" category. Different domain experts may disagree on whether a study with borderline statistical significance warrants a "yes" or "maybe" label. The authors estimated that annotation error "could be as low as 1%," but this estimate was based on internal consistency checks rather than external validation.
The artificial subset has a 92.8% "yes" label rate and contains no "maybe" instances. Models pre-trained heavily on PQA-A may develop a bias toward predicting "yes," which could hurt performance on the more balanced PQA-L test set. Researchers using PQA-A for training should account for this distributional mismatch.
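One common mitigation (a standard technique, not prescribed by the paper) is inverse-frequency class weighting when training on PQA-A; the counts below are approximate, derived from the stated 92.8%/7.2% split:

```python
def inverse_frequency_weights(label_counts):
    """Weight each class inversely to its frequency so the skewed
    'yes' majority in PQA-A does not dominate the training loss.
    Weights are normalized to average 1 across classes."""
    total = sum(label_counts.values())
    raw = {label: total / count for label, count in label_counts.items()}
    norm = sum(raw.values())
    return {label: w * len(raw) / norm for label, w in raw.items()}

# Approximate PQA-A counts: 92.8% "yes", 7.2% "no" of 211,300 instances.
weights = inverse_frequency_weights({"yes": 196086, "no": 15214})
# "no" instances receive roughly 13x the weight of "yes" instances.
```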
All PubMedQA questions are derived from PubMed article titles, which tend to follow specific rhetorical patterns common in biomedical research writing. The dataset may not fully represent the range of questions that a clinician or researcher might ask about a scientific paper. More natural, free-form questions about biomedical literature are not covered.
PubMedQA contains only English-language content, reflecting the dominance of English in international biomedical publishing. This limits its applicability as a benchmark for multilingual biomedical NLP systems.
PubMedQA is distributed in JSON format. The official data files include:
- `ori_pqal.json`: the labeled subset (PQA-L)
- `ori_pqau.json`: the unlabeled subset (PQA-U)
- `ori_pqaa.json`: the artificial subset (PQA-A)
- `test_ground_truth.json`: ground truth labels for the 500-instance test set

The dataset is available through multiple channels:

- The official repository, linked from pubmedqa.github.io
- The Hugging Face Hub dataset `qiaojin/PubMedQA`, with all three subsets in Parquet format (~300 MB total)

Each instance is keyed by its PubMed ID (PMID) and contains the following fields:
| Field | Description |
|---|---|
| pubid | PubMed article identifier (integer) |
| question | Research question in yes/no/maybe format (text) |
| context.contexts | Relevant abstract passage(s) with conclusion removed (text) |
| context.labels | Section labels from the structured abstract (text) |
| context.meshes | MeSH (Medical Subject Headings) descriptors (text) |
| long_answer | The conclusion section serving as the natural language answer (text) |
| final_decision | Answer label: "yes," "no," or "maybe" (text) |
PQA-L instances additionally include `reasoning_required_pred` and `reasoning_free_pred` fields that store the individual annotator predictions under each evaluation setting.
The official repository includes an `evaluation.py` script that accepts model predictions in JSON format (where the key is a PMID and the value is "yes," "no," or "maybe") and computes accuracy and macro-F1 against the test set ground truth.
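A PQA-L-style instance therefore has roughly the following shape (field values abbreviated and illustrative, not an actual dataset record; the question is the example quoted earlier in this article):

```python
# Illustrative PQA-L-style instance (abridged; not a real record).
instance = {
    "pubid": 0,  # placeholder PubMed ID
    "question": "Do preoperative statins reduce atrial fibrillation after "
                "coronary artery bypass grafting?",
    "context": {
        "contexts": ["(abstract passages with the conclusion removed)"],
        "labels": ["BACKGROUND", "METHODS", "RESULTS"],
        "meshes": ["Atrial Fibrillation", "Coronary Artery Bypass"],
    },
    "long_answer": "(the abstract's conclusion section)",
    "final_decision": "yes",  # illustrative label
}

# The reasoning-required task: predict final_decision from question + context.
inputs = (instance["question"], instance["context"]["contexts"])
gold = instance["final_decision"]
```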
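A prediction file in that format can be produced as follows (the PMIDs and labels shown are hypothetical, and the filename is arbitrary):

```python
import json

# Hypothetical model predictions, keyed by PMID as the evaluation
# script expects; values must be "yes", "no", or "maybe".
predictions = {
    "10000001": "yes",
    "10000002": "no",
    "10000003": "maybe",
}

with open("predictions.json", "w") as f:
    json.dump(predictions, f, indent=2)
```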
The dataset can be loaded with the Hugging Face Datasets library:
```python
from datasets import load_dataset

# Load the labeled subset
pqa_labeled = load_dataset("qiaojin/PubMedQA", "pqa_labeled", split="train")

# Load the artificial subset
pqa_artificial = load_dataset("qiaojin/PubMedQA", "pqa_artificial", split="train")

# Load the unlabeled subset
pqa_unlabeled = load_dataset("qiaojin/PubMedQA", "pqa_unlabeled", split="train")
```
As of early 2026, the dataset sees over 20,000 monthly downloads on Hugging Face and has been used to fine-tune roughly a hundred publicly listed models.
PubMedQA has had a significant influence on the development of biomedical NLP and medical AI. It was among the first benchmarks to demonstrate that biomedical question answering requires not just language understanding but genuine scientific reasoning. The dataset's design, which separates evidence from conclusions, created a clean framework for testing whether models can draw inferences from experimental results.
The benchmark helped catalyze the development of domain-specific biomedical language models. BioGPT, MEDITRON, Palmyra-Med, and other biomedical LLMs were all evaluated on PubMedQA as part of their release. The dataset's inclusion in BLURB and the Open Medical-LLM Leaderboard ensures its continued relevance in the field.
PubMedQA also contributed to the broader understanding of how large language models handle specialized scientific reasoning. The rapid progression from 68.1% accuracy (BioBERT, 2019) to 82.0% (GPT-4 Medprompt, 2023) tracks the broader arc of LLM capabilities, while the remaining gap to perfect performance highlights the continuing challenges of biomedical text comprehension.
The dataset's creators, particularly Qiao Jin, have continued to maintain the benchmark and update the leaderboard. Jin went on to contribute to other influential medical AI projects, including work at the National Institutes of Health on applying large language models to biomedical research tasks.