# PubMedQA

> Source: https://aiwiki.ai/wiki/pubmedqa
> Updated: 2026-06-09
> Categories: AI Benchmarks, Healthcare AI, Natural Language Processing
> From AI Wiki (https://aiwiki.ai), a free encyclopedia of artificial intelligence. Quote with attribution.

PubMedQA is a biomedical question answering dataset and benchmark designed to evaluate the ability of machine learning models to answer research questions using evidence from PubMed abstracts. Introduced by Qiao Jin, Bhuwan Dhingra, Zhengping Liu, William W. Cohen, and Xinghua Lu in 2019, PubMedQA was the first question answering dataset that specifically required reasoning over biomedical research texts, including quantitative experimental results.[1] The dataset was presented at the 2019 Conference on Empirical Methods in Natural Language Processing (EMNLP-IJCNLP 2019)[1] and has since become one of the most widely used benchmarks for evaluating [large language models](/wiki/large_language_model) and [natural language processing](/wiki/natural_language_processing) systems in the biomedical domain.

PubMedQA consists of three subsets totaling over 273,000 question-answer instances: 1,000 expert-annotated examples (PQA-L), 61,200 unlabeled examples (PQA-U), and 211,300 artificially generated examples (PQA-A).[1] The core task requires models to read a research question and its associated PubMed abstract context, then classify the answer as "yes," "no," or "maybe." The benchmark is publicly available at pubmedqa.github.io [10] and is distributed under the MIT license.[11]

## Background and Motivation

Before PubMedQA, existing [question answering](/wiki/question_answering) datasets in the biomedical domain focused primarily on factoid questions (asking for specific entities or facts) or list-type questions (asking for collections of items). Datasets like [BioASQ](/wiki/bioasq) had advanced the field of biomedical QA, but none specifically targeted the challenge of answering yes/no research questions that demand reasoning over scientific evidence.

The authors of PubMedQA observed that approximately 760,000 articles in PubMed use questions as their titles. Among those, roughly 120,000 have structured abstracts divided into labeled subsections such as "Background," "Methods," "Results," and "Conclusions."[1] This structure creates a natural question-answer pair: the title poses a research question, the abstract body provides the evidence, and the conclusion provides the answer. This observation formed the foundation for constructing PubMedQA.

A central motivation behind PubMedQA was to create a benchmark that tests whether models can perform genuine reasoning over biomedical texts rather than simply matching keywords or relying on superficial patterns. Many biomedical research questions require understanding experimental designs, interpreting statistical comparisons between groups, and synthesizing multiple pieces of evidence from different parts of an abstract. For example, answering "Do preoperative statins reduce atrial fibrillation after coronary artery bypass grafting?" requires reading the study's results, understanding the comparison between statin-treated and control groups, and drawing a conclusion from the reported statistical findings.

## Dataset Construction

### Source Material

PubMedQA draws its content from [PubMed](/wiki/pubmed), the biomedical literature database maintained by the National Library of Medicine. The dataset specifically targets articles with question-format titles and structured abstracts. Each instance in PubMedQA contains four components:

1. **Question**: Either an existing research article title or a question derived from one.
2. **Context**: The corresponding abstract with its conclusion section removed. This forces models to reason from the evidence rather than simply reading the answer.
3. **Long answer**: The conclusion section of the abstract, which serves as the natural language answer to the research question.
4. **Short answer**: A yes, no, or maybe label that summarizes whether the conclusion affirms, denies, or is inconclusive about the question.[1]

### PQA-Labeled (PQA-L)

The expert-annotated subset, PQA-L, contains 1,000 instances that were manually labeled by qualified M.D. candidates with biomedical domain expertise.[1] The annotation process followed a carefully designed protocol:

1. Instances were randomly sampled from a pool of candidate question-abstract pairs (pre-PQA-U).
2. Questions that could not be answered with yes, no, or maybe were removed. The authors found that approximately 50.2% of sampled question titles were not answerable in this format, including wh-questions ("What causes...") and questions requiring more complex answers.[1]
3. Two annotators independently labeled each remaining instance under different conditions:
   - **Annotator 1** (reasoning-free setting): Had access to the question, context, and long answer (the conclusion). This made labeling relatively straightforward because yes/no/maybe answers are typically stated explicitly in conclusions.
   - **Annotator 2** (reasoning-required setting): Had access only to the question and context (without the conclusion). This required genuine reasoning over the abstract's evidence.
4. When both annotators agreed, the label was accepted.
5. When labels disagreed, annotators discussed the instance to reach consensus.
6. Instances where no consensus could be reached were removed from the dataset.[1]

The final label distribution in PQA-L is:

| Answer Label | Percentage | Count (approx.) |
|---|---|---|
| Yes | 55.2% | 552 |
| No | 33.8% | 338 |
| Maybe | 11.0% | 110 |

The "yes" majority gives a majority-class baseline accuracy of 55.2%.[1] The relatively small proportion of "maybe" answers reflects the fact that most published research reaches a definitive conclusion, while the "maybe" label captures cases where the evidence is mixed or the answer depends on specific conditions.

The PQA-L subset is split into 500 instances for training/development (evaluated using 10-fold cross-validation) and 500 instances reserved as the official test set for leaderboard evaluation.[1]

### Annotation Guidelines

The annotation criteria specified in the paper define each label as follows:

- **Yes**: The experiments described in the abstract indicate that the statement in the question title is true. A statistically significant difference was observed between the groups being compared.
- **No**: No significant difference was found, or the evidence contradicts the claim in the question.
- **Maybe**: The paper discusses conditions under which the answer varies, or multiple interventions were tested with mixed results across them.[1]

### PQA-Unlabeled (PQA-U)

The unlabeled subset contains 61,200 context-question pairs collected from PubMed articles that met the same structural criteria as PQA-L (question titles, structured abstracts).[1] These instances include the question, context, and long answer, but lack expert-annotated yes/no/maybe labels.

PQA-U serves an important role in the multi-phase training pipeline described in the original paper. Models can leverage these unlabeled instances through semi-supervised learning techniques such as pseudo-labeling, where a model trained on PQA-L generates predicted labels for PQA-U instances and then trains on its own confident predictions.

### PQA-Artificial (PQA-A)

The artificially generated subset contains 211,300 instances created through an automated heuristic process. The construction method worked as follows:

1. PubMed articles with statement-format titles (rather than question titles) and structured abstracts were identified.
2. Using Stanford CoreNLP for part-of-speech (POS) tagging, the system located noun phrase structures followed by verb phrases (NP-VBP/VBZ patterns) in the title.
3. Statement titles were converted to yes/no questions by moving or adding copulas ("is," "are") or auxiliary verbs ("does," "do") to the front of the sentence.
4. The yes/no answer was determined automatically based on the negation status of the verb in the original title. If the original statement was affirmative, the answer was "yes"; if it contained a negation, the answer was "no."[1]

For example:
- "Spontaneous electrocardiogram alterations predict ventricular fibrillation..." becomes "Do spontaneous electrocardiogram alterations predict ventricular fibrillation...?" with the answer "yes."
- "Liver grafts from selected older donors do not have significantly more..." becomes "Do liver grafts from selected older donors have significantly more...?" with the answer "no."[1]
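
The conversion and labeling rule can be illustrated with a deliberately simplified sketch. It replaces the CoreNLP POS tagging with plain string matching, so it only handles titles that already contain an auxiliary verb or copula. The function name `statement_to_question` and the rule table are hypothetical and are not taken from the authors' code.

```python
# Toy version of the PQA-A heuristic: string matching instead of POS tagging.
# Each rule: (text to find in the title, word moved to the front, label).
# Negated forms come first so that "is not" is not matched by the " is " rule.
RULES = [
    (" do not ", "Do", "no"),
    (" does not ", "Does", "no"),
    (" is not ", "Is", "no"),
    (" are not ", "Are", "no"),
    (" is ", "Is", "yes"),
    (" are ", "Are", "yes"),
]


def statement_to_question(title):
    """Return (question, label), or None if no rule applies to the title."""
    text = title.strip().rstrip(".")
    for pattern, front_word, label in RULES:
        if pattern in text:
            subject, rest = text.split(pattern, 1)
            # Naive decapitalization; wrong for acronyms and proper nouns.
            subject = subject[0].lower() + subject[1:]
            return f"{front_word} {subject} {rest}?", label
    return None


print(statement_to_question(
    "Liver grafts from selected older donors do not have significantly more complications"
))
```

A title whose only verb is a main verb, such as "alterations predict ventricular fibrillation", returns `None` here, because picking the right auxiliary ("do" or "does") requires the POS tags that the original pipeline obtained from CoreNLP.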

The label distribution in PQA-A is heavily skewed:

| Answer Label | Percentage |
|---|---|
| Yes | 92.8% |
| No | 7.2% |
| Maybe | 0.0% |

The absence of "maybe" labels and the strong skew toward "yes" are expected consequences of the heuristic approach. Most published research findings are positive (affirming the hypothesis), and the automated method cannot capture the nuance of inconclusive or conditional results.

PQA-A is split into 200,000 training instances and 11,300 validation instances.[1]

## Dataset Statistics

The following table summarizes key statistics across all three subsets:[1]

| Property | PQA-L | PQA-U | PQA-A |
|---|---|---|---|
| Number of instances | 1,000 | 61,200 | 211,300 |
| Average question length (words) | 14.4 | ~15 | 16.3 |
| Average context length (words) | 237 | ~238 | 239 |
| Average long answer length (words) | 41 | ~43 | 46 |
| Expert-annotated labels | Yes | No | No (heuristic) |
| Label classes | yes / no / maybe | N/A | yes / no |
| Total size | ~273,500 instances | | |

### Question Types

Analysis of the questions in PubMedQA reveals several common research question patterns:[1]

| Question Type | Percentage |
|---|---|
| Factor influence (does X affect Y?) | 36.5% |
| Therapy evaluation (is treatment X effective?) | 26.0% |
| Statement verification (is X true?) | 18.0% |
| Relational queries (is X associated with Y?) | 18.0% |

### Reasoning Patterns

The dataset also characterizes the types of reasoning required:[1]

| Reasoning Pattern | Percentage |
|---|---|
| Inter-group comparisons | 57.5% |
| Interpreting subgroup statistics | 16.5% |
| Single-group statistics analysis | 16.0% |
| Other | 10.0% |

The dominance of inter-group comparisons reflects the nature of biomedical research, where clinical studies typically compare treatment groups against control groups.

## Evaluation Settings

PubMedQA defines two distinct evaluation settings that test different aspects of model capability.

### Reasoning-Required Setting

In the reasoning-required setting, models receive only the question and the context (the abstract without its conclusion). The model must reason over the evidence presented in the abstract body to determine whether the answer is yes, no, or maybe. This is the primary evaluation setting used for the official leaderboard, as it tests genuine comprehension and inference rather than simple pattern matching.

Human performance in this setting is 78.0% accuracy and 72.2% macro-F1, reflecting the inherent difficulty of drawing conclusions from incomplete scientific evidence without seeing the authors' own conclusions.[1]

### Reasoning-Free Setting

In the reasoning-free setting, models receive the question and the long answer (the abstract's conclusion) instead of the context. Since conclusions typically state the answer explicitly ("Our findings suggest that..." or "No significant difference was observed..."), this setting is considerably easier and primarily serves as a diagnostic tool.

Human performance in the reasoning-free setting reaches 90.4% accuracy and 84.2% macro-F1, confirming that the conclusion usually contains enough information to determine the answer directly.[1]

### Evaluation Metrics

The primary evaluation metrics for PubMedQA are:

- **Accuracy**: The percentage of correctly classified instances on the PQA-L test set (500 instances).
- **Macro-F1**: The unweighted average of per-class F1 scores across yes, no, and maybe. This metric penalizes models that perform poorly on minority classes (particularly "maybe").
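
Both metrics can be sketched in a few lines of plain Python. This is an illustrative implementation, not the official evaluation script. Scoring a class that never appears in either the gold labels or the predictions as 0 is an assumption made here. The example uses the full PQA-L label counts (552/338/110) with an always-"yes" predictor to show why macro-F1 exposes models that ignore the minority classes.

```python
LABELS = ("yes", "no", "maybe")


def accuracy(gold, pred):
    """Fraction of instances whose predicted label equals the gold label."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def macro_f1(gold, pred, labels=LABELS):
    """Unweighted mean of per-class F1 scores."""
    scores = []
    for label in labels:
        tp = sum(g == label and p == label for g, p in zip(gold, pred))
        fp = sum(g != label and p == label for g, p in zip(gold, pred))
        fn = sum(g == label and p != label for g, p in zip(gold, pred))
        denom = 2 * tp + fp + fn
        # Assumption: a class with no gold and no predicted instances scores 0.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# PQA-L label counts, scored against a predictor that always answers "yes".
gold = ["yes"] * 552 + ["no"] * 338 + ["maybe"] * 110
pred = ["yes"] * len(gold)
print(f"accuracy={accuracy(gold, pred):.3f}  macro-F1={macro_f1(gold, pred):.3f}")
```

The always-"yes" predictor reaches 55.2% accuracy but only about 0.24 macro-F1, because its F1 on "no" and "maybe" is zero.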

## Original Baseline Models (2019)

The original PubMedQA paper evaluated several baseline models using a multi-phase fine-tuning pipeline.

### Multi-Phase Training Pipeline

The training pipeline consists of three phases designed to maximize the use of all three dataset subsets:

**Phase I (Pre-training on PQA-A)**: The model is fine-tuned on the large PQA-A dataset using question and context as input. This provides the model with a broad foundation of biomedical reasoning patterns, though the labels are noisy.

**Phase II (Bootstrapping with PQA-U)**: This phase uses a three-step bootstrapping process:
1. Fine-tune on PQA-A in the reasoning-free setting (using question and long answer as input).
2. Fine-tune on PQA-L in the reasoning-free setting.
3. Use the resulting model to generate pseudo-labels for PQA-U, keeping only confident predictions while maintaining label proportions that match PQA-L.

**Final Phase (Fine-tuning on PQA-L)**: The model from Phase I is further fine-tuned on the PQA-L training set, now supplemented with bootstrapped PQA-U instances, in the reasoning-required setting.[1]
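
Step 3 of Phase II, keeping confident pseudo-labels while preserving the PQA-L label proportions, might look like the sketch below. The selection rule (a per-label quota filled by descending confidence) is one plausible reading of the description rather than the paper's exact procedure, and the function name, the instance ids, and the `budget` parameter are hypothetical.

```python
def select_pseudo_labels(probs, proportions, budget):
    """Pick confident pseudo-labels while matching target label proportions.

    probs: dict mapping instance id -> dict mapping label -> probability
    proportions: dict mapping label -> target share of the selected set
    budget: total number of pseudo-labeled instances to keep
    """
    selected = {}
    for label, share in proportions.items():
        quota = round(budget * share)
        # Instances whose most probable label is `label`, most confident first.
        ranked = sorted(
            (pid for pid, p in probs.items() if max(p, key=p.get) == label),
            key=lambda pid: probs[pid][label],
            reverse=True,
        )
        for pid in ranked[:quota]:
            selected[pid] = label
    return selected


# Hypothetical model outputs for three unlabeled instances.
example_probs = {
    "a": {"yes": 0.9, "no": 0.1},
    "b": {"yes": 0.6, "no": 0.4},
    "c": {"yes": 0.2, "no": 0.8},
}
print(select_pseudo_labels(example_probs, {"yes": 0.5, "no": 0.5}, budget=2))
```

With the PQA-L distribution, `proportions` would be `{"yes": 0.552, "no": 0.338, "maybe": 0.110}`.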

### Additional Supervision (A.S.)

The paper introduced an auxiliary training signal alongside the main question answering loss. The model was trained with two loss functions:

- **QA Loss**: Standard cross-entropy loss for the yes/no/maybe classification.
- **Bag-of-Words Loss**: A binary bag-of-words prediction task that trained the model to predict which words appear in the long answer, given only the question and context. This encouraged the model to develop internal representations of what the conclusion would say, even when the conclusion was not provided as input.

The combined loss function was: L = L_QA + beta * L_BoW, where beta was set to zero in the reasoning-free setting (since the long answer was already available as input).[1]
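
As a worked illustration of the combined objective for a single instance, the sketch below computes both terms from already-normalized probabilities. Averaging the bag-of-words term over the vocabulary is an assumption made here (the source does not say whether it is summed or averaged), and all function names are hypothetical.

```python
import math


def qa_loss(class_probs, gold_index):
    """Cross-entropy for the yes/no/maybe classification."""
    return -math.log(class_probs[gold_index])


def bow_loss(word_probs, targets):
    """Binary cross-entropy over the vocabulary, averaged (an assumption).

    word_probs[i] is the predicted probability that vocabulary word i occurs
    in the long answer; targets[i] is 1 if it does and 0 otherwise.
    """
    total = 0.0
    for p, t in zip(word_probs, targets):
        total -= t * math.log(p) + (1 - t) * math.log(1.0 - p)
    return total / len(targets)


def combined_loss(class_probs, gold_index, word_probs, targets, beta):
    """L = L_QA + beta * L_BoW; beta = 0 recovers the plain QA loss."""
    return qa_loss(class_probs, gold_index) + beta * bow_loss(word_probs, targets)


# Toy numbers: three classes and a two-word vocabulary.
print(combined_loss([0.7, 0.2, 0.1], 0, [0.9, 0.1], [1, 0], beta=1.0))
print(combined_loss([0.7, 0.2, 0.1], 0, [0.9, 0.1], [1, 0], beta=0.0))
```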

### Baseline Results

The following table presents results from the original paper on the PQA-L test set in the reasoning-required setting:

| Model | Accuracy | Macro-F1 |
|---|---|---|
| Majority baseline | 55.2% | N/A |
| Shallow Features (w/o A.S.) | 53.9% | 36.1% |
| Shallow Features (w/ A.S.) | 53.6% | 35.9% |
| BiLSTM (w/o A.S.) | 55.2% | 24.0% |
| BiLSTM (w/ A.S.) | 55.2% | 23.9% |
| ESIM w/ BioELMo (w/o A.S.) | 53.9% | 32.4% |
| ESIM w/ BioELMo (w/ A.S.) | 54.0% | 31.1% |
| [BioBERT](/wiki/biobert) (w/o A.S.) | 57.0% | 28.5% |
| [BioBERT](/wiki/biobert) (w/ A.S.) | 57.3% | 28.7% |

These are "Final Only" results (trained only on PQA-L without multi-phase training). With the full multi-phase pipeline, performance improved substantially:

| Model (Multi-Phase) | Accuracy | Macro-F1 |
|---|---|---|
| BiLSTM (w/o A.S.) | 59.8% | 41.9% |
| BiLSTM (w/ A.S.) | 58.9% | 41.1% |
| ESIM w/ BioELMo (w/o A.S.) | 62.1% | 45.8% |
| ESIM w/ BioELMo (w/ A.S.) | 63.7% | 47.9% |
| [BioBERT](/wiki/biobert) (w/o A.S.) | 67.7% | 52.4% |
| [BioBERT](/wiki/biobert) (w/ A.S.) | 68.1% | 52.7% |
| **Human (single, reasoning-required)** | **78.0%** | **72.2%** |

The best-performing baseline, [BioBERT](/wiki/biobert) with additional supervision in the multi-phase setting, achieved 68.1% accuracy and 52.7% macro-F1.[1][9] This remained nearly 10 percentage points below single human performance, highlighting the difficulty of biomedical reasoning for machine learning models at the time.

## Leaderboard and State-of-the-Art Results

Since its release, PubMedQA has served as a standard benchmark in the evaluation of biomedical and medical [language models](/wiki/language_model). The official leaderboard, maintained by Qiao Jin at pubmedqa.github.io, tracks submissions in the reasoning-required setting.[10] Over time, advances in [pre-trained language models](/wiki/pre-training), domain-specific fine-tuning, and sophisticated prompting strategies have pushed performance well beyond the original baselines.

### Notable Model Performances

The following table summarizes key results on the PubMedQA leaderboard (reasoning-required setting, PQA-L test set of 500 instances):

| Model | Organization | Accuracy | Year | Notes |
|---|---|---|---|---|
| BioBERT (multi-phase, w/ A.S.) | Original paper | 68.1% | 2019 | First strong baseline [1] |
| BioMedLM (2.7B) | Stanford CRFM | 74.4% | 2022 | Smaller domain-specific model |
| [BioGPT](/wiki/biogpt) (base) | [Microsoft](/wiki/microsoft) Research | 78.2% | 2022 | Generative pre-trained [transformer](/wiki/transformer) for biomedical text [5] |
| Flan-PaLM 540B | [Google](/wiki/google) | 79.0% | 2022 | Instruction-tuned PaLM at 540B parameters [3] |
| [Claude](/wiki/claude) 3 | [Anthropic](/wiki/anthropic) | 79.7% | 2024 | General-purpose LLM |
| [BioGPT](/wiki/biogpt)-Large (1.5B) | [Microsoft](/wiki/microsoft) Research | 81.0% | 2022 | Scaled-up BioGPT [5] |
| Palmyra-Med (40B) | Writer Inc. | 81.1% | 2023 | Domain-specific medical LLM [10] |
| MEDITRON (70B) | EPFL | 81.6% | 2023 | Open-source medical LLM with chain-of-thought + self-consistency [6] |
| [Med-PaLM](/wiki/med_palm) 2 | [Google](/wiki/google) Research & [DeepMind](/wiki/deepmind) | 81.8% | 2023 | With self-consistency (11 samples) [4] |
| [GPT-4](/wiki/gpt-4) (Medprompt) | [Microsoft](/wiki/microsoft) | 82.0% | 2023 | Advanced prompting strategy without fine-tuning [7] |

As of the last leaderboard update in April 2024, GPT-4 with Medprompt from Microsoft holds the top position at 82.0% accuracy.[7][10] This result surpasses single human performance (78.0%), although the Medprompt strategy relies on ensemble-style inference with multiple prompting phases at test time.

### Key Trends in Performance

Several patterns emerge from the progression of PubMedQA scores:

**Domain-specific pre-training pays off.** Models like [BioGPT](/wiki/biogpt), BioMedLM, and MEDITRON, which were pre-trained or fine-tuned on biomedical corpora, consistently outperform general-purpose models of similar size.[5][6] BioGPT-Large achieved 81.0% with only 1.5 billion parameters, competitive with much larger general models.[5]

**Prompting strategies matter.** [Med-PaLM](/wiki/med_palm) 2 improved from 75.0% with standard prompting to 81.8% with self-consistency prompting (sampling 11 responses and taking the majority vote).[4] Similarly, GPT-4's Medprompt strategy boosted performance to 82.0% without any domain-specific fine-tuning, using chain-of-thought prompting with automatically selected in-context examples.[7]

**Scale alone is not sufficient.** The base GPT-4 model in a zero-shot setting achieves roughly 75% accuracy on PubMedQA, below several smaller domain-specialized models.[7] This suggests that biomedical reasoning requires either domain-specific training data or carefully designed inference strategies.

**The human-machine gap has closed.** While the original 2019 baselines lagged human performance by nearly 10 points, modern models have surpassed the single-human benchmark of 78.0%. However, the authors of [Med-PaLM](/wiki/med_palm) 2 noted that many remaining errors on PubMedQA may be attributable to label noise in the relatively small 500-instance test set, raising questions about the benchmark's ceiling.[4]
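
The self-consistency procedure mentioned above (sample several answers, keep the majority) reduces to a vote count. In the sketch below, `sample_answer` is a hypothetical stand-in for one stochastic model call that returns "yes", "no", or "maybe"; the tie-breaking behavior comes from `Counter.most_common`, which keeps the label seen first among equally frequent ones, and is not a detail taken from the Med-PaLM 2 paper.

```python
from collections import Counter


def self_consistency(sample_answer, n_samples=11):
    """Call the sampler n_samples times and return (majority label, vote counts)."""
    votes = Counter(sample_answer() for _ in range(n_samples))
    answer, _count = votes.most_common(1)[0]
    return answer, votes


# Deterministic stub in place of a real model: 6 "yes", 4 "no", 1 "maybe".
stub = iter(["yes"] * 6 + ["no"] * 4 + ["maybe"])
answer, votes = self_consistency(lambda: next(stub))
print(answer, dict(votes))
```

An odd sample count such as 11 rules out ties in a two-way vote, but with three labels a tie is still possible (for example 5/5/1).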

### Frontier Models and Benchmark Saturation (2025-2026)

The official leaderboard has remained largely static since 2024: its most recent submission, Reka Core, was logged on April 18, 2024 at 74.6% accuracy, and GPT-4 with Medprompt remains the top entry at 82.0%.[10] Evaluation of newer frontier and open-weight models has instead shifted into the standardized evaluation harnesses reported in model technical papers, which generally do not use the heavy test-time ensembling behind the top leaderboard submissions.

In the [MedGemma](/wiki/medgemma) technical report (Google Research and Google DeepMind, July 2025), PubMedQA was scored under a uniform protocol.[13] Under that protocol [OpenAI o3](/wiki/o3) led at 80.0%, followed by [GPT-4o](/wiki/gpt-4o) at 78.4%, [DeepSeek-R1](/wiki/deepseek_r1) at 77.2%, and [Gemini](/wiki/gemini) 2.5 Flash and 2.5 Pro at 76.2% and 75.8% respectively. The open medical model MedGemma 27B reached 76.8% (with test-time scaling) and MedGemma 4B reached 73.4%, each improving on the corresponding general-purpose Gemma 3 model.[13]

| Model | PubMedQA accuracy | Category |
|---|---|---|
| OpenAI o3 | 80.0% | Proprietary |
| GPT-4o | 78.4% | Proprietary |
| DeepSeek-R1 | 77.2% | Open weight |
| MedGemma 27B (test-time scaling) | 76.8% | Open, medical |
| Gemini 2.5 Flash | 76.2% | Proprietary |
| Gemini 2.5 Pro | 75.8% | Proprietary |
| MedGemma 4B | 73.4% | Open, medical |
| Gemma 3 27B | 73.4% | Open |
| Gemma 3 4B | 68.4% | Open |

These standardized scores range from 68.4% to 80.0%, with most frontier models in the mid-to-high 70s: close to the single-annotator human score of 78.0% and several points below the 82.0% leaderboard peak set with specialized test-time strategies.[13] The pattern is consistent with a broader observation that PubMedQA has approached saturation: a 2025 study of biology and medicine benchmarks reported that PubMedQA, together with the MMLU and WMDP biology subsets, plateaus well below 100% accuracy, which the authors attributed to benchmark saturation and errors in the underlying answer labels rather than to model limitations.[15]

## Comparison with Related Benchmarks

PubMedQA occupies a specific niche within the landscape of medical and biomedical [NLP](/wiki/natural_language_processing) benchmarks. The following table highlights how it compares with other commonly used datasets:

| Benchmark | Domain | Task Type | Size | Answer Format |
|---|---|---|---|---|
| PubMedQA | Biomedical research | Research QA | 273K total (1K labeled) | Yes / No / Maybe |
| [MedQA](/wiki/medqa) (USMLE) | Clinical medicine | Exam QA | ~12,700 | 4-option multiple choice |
| [BioASQ](/wiki/bioasq) | Biomedical | Multi-type QA | Varies by challenge year | Factoid, list, yes/no, summary |
| [MMLU](/wiki/mmlu) (Medical subsets) | Medical knowledge | Exam QA | ~1,000+ per subset | 4-option multiple choice |
| MedMCQA | Medical entrance exams | Exam QA | ~194K | 4-option multiple choice |
| [HealthSearchQA](/wiki/healthsearchqa) | Consumer health | Open-domain QA | 3,173 | Free-text |

PubMedQA is distinctive in several ways. First, it tests reasoning over primary research literature rather than textbook knowledge. Second, its three-way answer format (yes/no/maybe) is more nuanced than binary classification, capturing the reality that scientific evidence is sometimes inconclusive. Third, the inclusion of the long answer (conclusion) enables both reasoning-required and reasoning-free evaluation paradigms.

PubMedQA is included in several composite evaluation suites. It appears in BLURB (the Biomedical Language Understanding and Reasoning Benchmark maintained by [Microsoft](/wiki/microsoft) Research), which aggregates multiple biomedical NLP tasks into a unified leaderboard.[8] It is also part of the Open Medical-LLM Leaderboard on [Hugging Face](/wiki/hugging_face), which benchmarks language models across several medical question answering datasets.[18]

PubMedQA has also been adapted for [retrieval-augmented generation](/wiki/retrieval_augmented_generation) evaluation. The MIRAGE benchmark (2024), co-developed by PubMedQA creator Qiao Jin, includes a PubMedQA* variant whose 500 test questions have their gold contexts removed, forcing systems to retrieve supporting evidence themselves; the accompanying MedRAG toolkit improved the accuracy of six language models by up to 18 percentage points over chain-of-thought prompting alone.[16]

## Applications and Use Cases

### Model Development and Evaluation

PubMedQA's primary application is as an evaluation benchmark for biomedical NLP systems. Researchers developing new language models, fine-tuning strategies, or prompting techniques for the medical domain routinely report PubMedQA scores alongside results on MedQA, BioASQ, and other benchmarks. The dataset's inclusion in composite benchmarks like BLURB ensures that it remains central to biomedical NLP research.

### Clinical Decision Support

While PubMedQA itself is a research benchmark, the capability it tests (answering biomedical research questions from abstracts) has direct relevance to clinical decision support systems. A model that performs well on PubMedQA has demonstrated the ability to extract and synthesize evidence from medical literature, a skill that could be applied in evidence-based medicine tools, systematic review assistants, and clinical knowledge retrieval systems.

### Literature Review Automation

The task format of PubMedQA closely mirrors what a researcher does when scanning PubMed abstracts to answer a specific research question. Organizations evaluating language models for automated literature review applications often look at PubMedQA scores (alongside BioASQ scores) as a proxy for how well a model might perform on real-world literature synthesis tasks.

### Training Resource

Beyond evaluation, the PQA-A and PQA-U subsets serve as training resources. The 211,300 artificially labeled instances in PQA-A provide a large-scale, if noisy, signal for pre-training or fine-tuning biomedical QA models. The 61,200 unlabeled instances in PQA-U enable semi-supervised learning approaches that can improve model performance without additional human annotation.

## Limitations and Criticisms

### Small Test Set

The official PubMedQA test set contains only 500 instances. With top models now achieving accuracy above 80%, the difference between models often falls within the margin of statistical uncertainty. A single misclassified instance changes accuracy by 0.2 percentage points. The authors of Med-PaLM 2 noted that remaining errors may be partly due to label noise, suggesting that the benchmark's effective ceiling could be lower than 100%.[4]
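
The size of that uncertainty can be estimated with a simple binomial calculation. The sketch below uses the normal-approximation (Wald) interval, which is a choice made for illustration and not a method taken from the PubMedQA paper; the function name is hypothetical.

```python
import math


def accuracy_interval(accuracy, n, z=1.96):
    """Normal-approximation (Wald) interval for an accuracy measured on n items.

    z=1.96 corresponds to a 95% interval under the normal approximation.
    """
    half_width = z * math.sqrt(accuracy * (1.0 - accuracy) / n)
    return accuracy - half_width, accuracy + half_width


low, high = accuracy_interval(0.82, 500)
print(f"82.0% on 500 items: [{low:.3f}, {high:.3f}]")
```

For 82.0% accuracy on 500 instances, the interval spans roughly 78.6% to 85.4%, about 3.4 percentage points on either side, which is wider than the spread among the leaderboard entries scoring between 81.0% and 82.0%.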

### Label Noise

Despite the careful annotation process, the three-way classification (yes/no/maybe) involves subjective judgment, particularly for the "maybe" category. Different domain experts may disagree on whether a study with borderline statistical significance warrants a "yes" or "maybe" label. The authors estimated that annotation error "could be as low as 1%," but this estimate was based on internal consistency checks rather than external validation.[1]

### Skewed PQA-A Labels

The artificial subset has a 92.8% "yes" label rate and contains no "maybe" instances.[1] Models pre-trained heavily on PQA-A may develop a bias toward predicting "yes," which could hurt performance on the more balanced PQA-L test set. Researchers using PQA-A for training should account for this distributional mismatch.

### Limited Question Diversity

All PubMedQA questions are derived from PubMed article titles, which tend to follow specific rhetorical patterns common in biomedical research writing. The dataset may not fully represent the range of questions that a clinician or researcher might ask about a scientific paper. More natural, free-form questions about biomedical literature are not covered.

### English Only

PubMedQA contains only English-language content, reflecting the dominance of English in international biomedical publishing. This limits its applicability as a benchmark for multilingual biomedical NLP systems.

### Benchmark Contamination

Because PubMedQA questions are drawn verbatim from indexed PubMed article titles, models and agents with live web access can sometimes retrieve the source article, including its conclusion, during inference. A 2026 study of deep research agents reported that one system retrieved leaked ground-truth answers on 65 of 100 sampled PubMedQA questions, compared with none on the licensing-exam-derived MedQA, which inflates measured accuracy and complicates fair comparison of retrieval-enabled systems.[14]

## Technical Details

### Data Format and Access

PubMedQA is distributed in JSON format. The official data files include:

- `ori_pqal.json`: The labeled subset (PQA-L)
- `ori_pqau.json`: The unlabeled subset (PQA-U)
- `ori_pqaa.json`: The artificial subset (PQA-A)
- `test_ground_truth.json`: Ground truth labels for the 500-instance test set[11]

The dataset is available through multiple channels:

- **Official website**: pubmedqa.github.io [10]
- **GitHub repository**: github.com/pubmedqa/pubmedqa (PQA-L included directly; PQA-U and PQA-A available via Google Drive) [11]
- **Hugging Face Datasets**: Available as `qiaojin/PubMedQA` with all three subsets in Parquet format (~300 MB total) [12]

### Instance Format

Each instance is keyed by its PubMed ID (PMID) and contains the following fields:

| Field | Description |
|---|---|
| pubid | PubMed article identifier (integer) |
| question | Research question in yes/no/maybe format (text) |
| context.contexts | Relevant abstract passage(s) with conclusion removed (text) |
| context.labels | Section labels from the structured abstract (text) |
| context.meshes | MeSH (Medical Subject Headings) descriptors (text) |
| long_answer | The conclusion section serving as the natural language answer (text) |
| final_decision | Answer label: "yes," "no," or "maybe" (text) |

PQA-L instances additionally include `reasoning_required_pred` and `reasoning_free_pred` fields that store the individual annotator predictions under each evaluation setting.[11]
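
To show how these fields map onto the two evaluation settings, the sketch below builds a model input from one record. It assumes the record is a nested dictionary in which `context["contexts"]` is a list of passage strings; the prompt template, the function name, and the placeholder record are hypothetical.

```python
def build_input(instance, setting):
    """Assemble the text a model sees under one of the two evaluation settings."""
    if setting == "reasoning_required":
        # Question plus the abstract body; the conclusion is withheld.
        evidence = " ".join(instance["context"]["contexts"])
    elif setting == "reasoning_free":
        # Question plus the conclusion (long answer) instead of the context.
        evidence = instance["long_answer"]
    else:
        raise ValueError(f"unknown setting: {setting!r}")
    return f"Question: {instance['question']}\nEvidence: {evidence}"


# Placeholder record with made-up text, shaped like the field table above.
record = {
    "pubid": 0,
    "question": "Does X affect Y?",
    "context": {"contexts": ["Background text.", "Results text."]},
    "long_answer": "Conclusion text.",
    "final_decision": "yes",
}
print(build_input(record, "reasoning_required"))
print(build_input(record, "reasoning_free"))
```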

### Evaluation Script

The official repository includes an `evaluation.py` script that accepts model predictions in JSON format (where the key is a PMID and the value is "yes," "no," or "maybe") and computes accuracy and macro-F1 against the test set ground truth.[11]
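
A predictions file in that shape can be produced with the standard `json` module. The helper below is a hypothetical convenience wrapper, not part of the official repository. It converts keys to strings (JSON object keys are always strings) and rejects labels outside the three allowed values before writing.

```python
import json

VALID_LABELS = {"yes", "no", "maybe"}


def write_predictions(predictions, path):
    """Write a {PMID: label} mapping to a JSON file after validating labels."""
    invalid = {pmid: label for pmid, label in predictions.items()
               if label not in VALID_LABELS}
    if invalid:
        raise ValueError(f"labels must be yes/no/maybe, got: {invalid}")
    with open(path, "w", encoding="utf-8") as handle:
        # PMIDs become string keys in the output file.
        json.dump({str(pmid): label for pmid, label in predictions.items()}, handle)
```

For example, `write_predictions({"12345678": "yes"}, "predictions.json")` writes a one-entry file; the PMID and file name here are placeholders.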

### Usage in Hugging Face

The dataset can be loaded with the Hugging Face Datasets library:

```python
from datasets import load_dataset

# Load the labeled subset
pqa_labeled = load_dataset("qiaojin/PubMedQA", "pqa_labeled", split="train")

# Load the artificial subset
pqa_artificial = load_dataset("qiaojin/PubMedQA", "pqa_artificial", split="train")

# Load the unlabeled subset
pqa_unlabeled = load_dataset("qiaojin/PubMedQA", "pqa_unlabeled", split="train")
```

In early 2026, the dataset recorded over 20,000 monthly downloads on Hugging Face and had been used to fine-tune at least 100 models.[12] By June 2026, the Hugging Face page reported roughly 27,000 monthly downloads, more than 110 derived models, and 19 associated Spaces.[12]

## Impact and Legacy

PubMedQA has had a significant influence on the development of biomedical [NLP](/wiki/natural_language_processing) and medical [AI](/wiki/artificial_intelligence). It was among the first benchmarks to demonstrate that biomedical question answering requires not just language understanding but genuine scientific reasoning. The dataset's design, which separates evidence from conclusions, created a clean framework for testing whether models can draw inferences from experimental results.

The benchmark helped catalyze the development of domain-specific biomedical language models. [BioGPT](/wiki/biogpt), MEDITRON, Palmyra-Med, and other biomedical LLMs were all evaluated on PubMedQA as part of their release.[5][6] The dataset's inclusion in BLURB and the Open Medical-LLM Leaderboard ensures its continued relevance in the field.

PubMedQA also contributed to the broader understanding of how [large language models](/wiki/large_language_model) handle specialized scientific reasoning. The progression from 68.1% accuracy (BioBERT, 2019) to 82.0% (GPT-4 Medprompt, 2023) tracks the broader arc of LLM capabilities, while the remaining gap to perfect performance reflects both the continuing challenges of biomedical text comprehension and the label noise in the 500-instance test set.[1][4][7]

The dataset's creators, particularly Qiao Jin, have continued to maintain the benchmark and update the leaderboard. Jin went on to contribute to other influential medical AI projects, including work at the National Institutes of Health on applying large language models to biomedical research tasks. At the National Center for Biotechnology Information within the National Library of Medicine, Jin co-developed the MedCPT model for biomedical information retrieval, the TrialGPT framework for matching patients to clinical trials (published in Nature Communications in 2024), and the MIRAGE benchmark for retrieval-augmented medical question answering, which reuses the PubMedQA test set with contexts removed.[16][17]

## See Also

- [BioASQ](/wiki/bioasq)
- [MedQA](/wiki/medqa)
- [BioBERT](/wiki/biobert)
- [BioGPT](/wiki/biogpt)
- [Med-PaLM](/wiki/med_palm)
- [MMLU](/wiki/mmlu)
- [Question Answering](/wiki/question_answering)
- [Biomedical NLP](/wiki/biomedical_nlp)
- [PubMed](/wiki/pubmed)

## References

1. Jin, Q., Dhingra, B., Liu, Z., Cohen, W. W., & Lu, X. (2019). "PubMedQA: A Dataset for Biomedical Research Question Answering." *Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)*. Association for Computational Linguistics. https://aclanthology.org/D19-1259/

2. Jin, Q., Dhingra, B., Liu, Z., Cohen, W. W., & Lu, X. (2019). "PubMedQA: A Dataset for Biomedical Research Question Answering." arXiv preprint arXiv:1909.06146. https://arxiv.org/abs/1909.06146

3. Singhal, K., Azizi, S., Tu, T., et al. (2023). "Large Language Models Encode Clinical Knowledge." *Nature*, 620, 172-180. https://www.nature.com/articles/s41586-023-06291-2

4. Singhal, K., Tu, T., Gottweis, J., et al. (2025). "Toward Expert-Level Medical Question Answering with Large Language Models." *Nature Medicine*, 31, 943-950. https://www.nature.com/articles/s41591-024-03423-7

5. Luo, R., Sun, L., Xia, Y., et al. (2022). "BioGPT: Generative Pre-trained Transformer for Biomedical Text Generation and Mining." *Briefings in Bioinformatics*, 23(6), bbac409. https://academic.oup.com/bib/article/23/6/bbac409/6713511

6. Chen, Z., Cano, A. H., Romanou, A., et al. (2023). "MEDITRON-70B: Scaling Medical Pretraining for Large Language Models." arXiv preprint arXiv:2311.16079. https://arxiv.org/abs/2311.16079

7. Nori, H., King, N., McKinney, S. M., Carignan, D., & Horvitz, E. (2023). "Can Generalist Foundation Models Outcompete Special-Purpose Tuning? Case Study in Medicine." arXiv preprint arXiv:2311.16452. https://arxiv.org/abs/2311.16452

8. Gu, Y., Tinn, R., Cheng, H., et al. (2021). "Domain-Specific Language Model Pretraining for Biomedical Natural Language Processing." *ACM Transactions on Computing for Healthcare*, 3(1), 1-23.

9. Lee, J., Yoon, W., Kim, S., et al. (2020). "BioBERT: A Pre-trained Biomedical Language Representation Model for Biomedical Text Mining." *Bioinformatics*, 36(4), 1234-1240.

10. PubMedQA Official Leaderboard. https://pubmedqa.github.io/

11. PubMedQA GitHub Repository. https://github.com/pubmedqa/pubmedqa

12. PubMedQA on Hugging Face Datasets. https://huggingface.co/datasets/qiaojin/PubMedQA

13. Sellergren, A., Golden, D., Yang, L., et al. (2025). "MedGemma Technical Report." arXiv preprint arXiv:2507.05201. https://arxiv.org/abs/2507.05201

14. Wang, Y., Zhang, X., Yao, K., et al. (2026). "Search-Time Contamination in Deep Research Agents: Measuring Performance Inflation in Public Benchmark Evaluation." arXiv preprint arXiv:2606.05241. https://arxiv.org/abs/2606.05241

15. Justen, L. (2025). "LLMs Outperform Experts on Challenging Biology Benchmarks." arXiv preprint arXiv:2505.06108. https://arxiv.org/abs/2505.06108

16. Xiong, G., Jin, Q., Lu, Z., & Zhang, A. (2024). "Benchmarking Retrieval-Augmented Generation for Medicine." arXiv preprint arXiv:2402.13178. https://arxiv.org/abs/2402.13178

17. Jin, Q., et al. (2024). "Matching Patients to Clinical Trials with Large Language Models." *Nature Communications*, 15, 9074. https://www.nature.com/articles/s41467-024-53081-z

18. Open Medical-LLM Leaderboard. Hugging Face. https://huggingface.co/blog/leaderboard-medicalllm

