# MedQA

> Source: https://aiwiki.ai/wiki/medqa
> Updated: 2026-06-27
> Categories: AI Benchmarks, Healthcare AI, Natural Language Processing
> From AI Wiki (https://aiwiki.ai), a free encyclopedia of artificial intelligence. Quote with attribution.

MedQA is a large-scale, open-domain medical question answering benchmark of multiple-choice questions taken from real medical licensing examinations, introduced by Di Jin and colleagues at MIT in 2020. The dataset contains 61,097 questions across three languages: 12,723 English questions drawn from the United States Medical Licensing Examination (USMLE), 34,251 Simplified Chinese questions from the Mainland China licensing exam, and 14,123 Traditional Chinese questions from the Taiwan exam.[1] The English (USMLE) subset is by far the most widely used and has become a de facto standard for measuring how well [large language models](/wiki/llm) handle real-world medical knowledge and clinical reasoning, with scores climbing from a 36.7% baseline in 2020 to over 96% for the strongest 2025 models.[1][9]

The benchmark is notable for being, in the authors' words, "the first free-form multiple-choice OpenQA dataset for solving medical problems," collected directly from professional medical board exams.[1] Each question presents a clinical vignette, a question stem, and four answer options (an original 5-option version was also released), mirroring the format physicians face when sitting for licensure. MedQA is the headline metric reported in landmark [medical AI](/wiki/ai_in_healthcare) papers including Google's [Med-PaLM](/wiki/med_palm) and [Med-PaLM 2](/wiki/med_palm_2), Microsoft's [GPT-4](/wiki/gpt-4) medical evaluation, and Med-Gemini.

## What is MedQA?

MedQA is a large-scale open-domain question answering dataset composed of multiple-choice questions drawn from medical licensing examinations. Created by Di Jin, Eileen Pan, Nassim Oufattole, Wei-Hung Weng, Hanyi Fang, and Peter Szolovits at MIT, MedQA was introduced in a 2020 preprint and formally published in the journal *Applied Sciences* in 2021.[1] It is widely regarded as one of the most important [benchmarks](/wiki/benchmark) for evaluating how well [large language models](/wiki/llm) can handle real-world medical knowledge and clinical reasoning.

The dataset is notable for being the first free-form multiple-choice open-domain question answering (OpenQA) dataset specifically designed for medical problem solving.[1] Unlike earlier medical [NLP](/wiki/nlp) benchmarks that focused on narrower tasks such as named entity recognition or relation extraction, MedQA tests a system's ability to read a clinical scenario, retrieve relevant medical knowledge, and select the correct answer from several options. This design closely mirrors the challenge that medical students and physicians face when taking board certification exams.

## Why was MedQA created?

Before MedQA, most question answering datasets in the biomedical domain were either closed-domain (where the answer could be found in a given passage) or limited to yes/no and short-answer formats. Datasets like [SQuAD](/wiki/squad) and BioASQ tested reading comprehension but did not require models to reason across broad medical knowledge the way a licensing exam does.

The authors of MedQA recognized that medical board examinations represent a natural, high-quality source of challenging questions. These exams are designed by medical education experts to test clinical competency across a wide range of disciplines, from anatomy and pharmacology to ethics and patient management. By collecting questions from three different national medical licensing systems, the MedQA team created a benchmark that is both multilingual and clinically rigorous.

The core research question behind MedQA is straightforward: can a machine answer the same questions that human doctors must answer to earn their medical licenses? As the authors framed it, the questions "require a deep understanding of the related medical concepts to answer," and the original paper noted that "even the current best method can only achieve 36.7% accuracy" on the English set, "suggesting a great gap between the AI system and human-level performance" at the time.[1] That figure has since risen dramatically thanks to advances in [large language models](/wiki/llm).

## How is MedQA structured?

### Overview

MedQA contains a total of 61,097 multiple-choice questions spanning three languages:[1]

| Language | Exam Source | Abbreviation | Total Questions |
|---|---|---|---|
| English | United States Medical Licensing Examination (USMLE) | USMLE | 12,723 |
| Simplified Chinese | Mainland China Medical Licensing Examination | MCMLE | 34,251 |
| Traditional Chinese | Taiwan Medical Licensing Examination | TWMLE | 14,123 |

Each question follows the standard format used in medical licensing exams: a clinical vignette (also called a case stem) presents a patient scenario with relevant history, symptoms, lab results, and other clinical details, followed by a question and a set of answer choices. The English (USMLE) subset comes in two variants: the original release provides 5 answer options per question, matching the format of actual USMLE exams, while a simplified 4-option version has become the more commonly used variant for benchmarking.[1][8]

### USMLE Subset

The USMLE subset of MedQA is the most widely used portion of the dataset in the research community. The USMLE is a three-step examination program used in the United States to assess clinical competency and grant medical licensure:

- **Step 1** tests foundational biomedical science concepts (anatomy, biochemistry, pathology, pharmacology, physiology, microbiology).
- **Step 2 CK (Clinical Knowledge)** tests clinical sciences and the ability to apply medical knowledge to patient care.
- **Step 3** tests whether physicians can apply medical knowledge and clinical understanding to unsupervised patient care.

The English questions in MedQA are sourced from both official USMLE practice materials and commercial exam-preparation question banks. Of the 12,723 questions, around 300 come from official USMLE tutorials, while the remainder are drawn from widely used exam-preparation websites that cover all three steps.[1]

The official train/dev/test split for the USMLE subset is:[1][8]

| Split | Number of Questions |
|---|---|
| Training | 10,178 |
| Development (Validation) | 1,272 |
| Test | 1,273 |

This roughly follows an 80/10/10 split ratio. The test set of 1,273 questions is the standard evaluation set used in virtually all published results on MedQA.
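The split ratio can be checked directly from the counts in the table. The short sketch below only does that arithmetic; the variable names are illustrative.

```python
# Official USMLE split sizes as reported in the table above.
splits = {"train": 10_178, "dev": 1_272, "test": 1_273}

total = sum(splits.values())  # size of the English (USMLE) subset

# Share of the English subset held by each split, as a percentage.
shares = {name: round(100 * count / total, 1) for name, count in splits.items()}

print(total)
print(shares)
```

The three splits sum to the 12,723 English questions, and each share rounds to the nominal 80/10/10 figure.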

### Chinese Subsets

The MCMLE (Mainland China) subset is the largest portion of MedQA, with 34,251 questions in Simplified Chinese drawn from China's national medical licensing examination. The TWMLE (Taiwan) subset contains 14,123 questions in Traditional Chinese from Taiwan's medical licensing examination.[1] Both follow approximately the same 80/10/10 train/dev/test split as the USMLE subset.

These multilingual subsets make MedQA one of the few medical QA benchmarks that supports cross-lingual evaluation, enabling researchers to study how well models transfer medical knowledge across languages.

## What is the MedQA textbook corpus?

In addition to the questions themselves, the MedQA authors collected and released a large-scale textbook corpus intended to serve as a knowledge source for question answering systems. The idea is that models can retrieve relevant passages from these textbooks to help answer the exam questions, similar to how medical students study from textbooks before taking their exams.

### English Textbook Collection

The English corpus consists of 18 widely used medical textbooks that are standard reading material for USMLE preparation.[1] The table below lists 15 of them, ordered by approximate record count:

| Textbook | Subject Area | Approximate Records |
|---|---|---|
| Harrison's Principles of Internal Medicine | Internal Medicine | 20,583 |
| Schwartz's Principles of Surgery | Surgery | 7,803 |
| Adams and Victor's Principles of Neurology | Neurology | 7,732 |
| Williams Obstetrics | Obstetrics | 5,392 |
| Alberts' Molecular Biology of the Cell | Cell Biology | 4,275 |
| Robbins Pathologic Basis of Disease | Pathology | 3,156 |
| Janeway's Immunobiology | Immunology | 2,996 |
| Ross Histology | Histology | 2,685 |
| Physiology (Levy) | Physiology | 2,627 |
| Nelson Textbook of Pediatrics | Pediatrics | 2,575 |
| Gray's Anatomy | Anatomy | 1,736 |
| Lippincott Illustrated Reviews: Biochemistry | Biochemistry | 1,193 |
| First Aid for the USMLE Step 2 CK | Exam Preparation | 800 |
| First Aid for the USMLE Step 1 | Exam Preparation | 489 |
| Pathoma (Husain) | Pathology | 280 |

These textbooks collectively form a comprehensive medical knowledge base spanning anatomy, biochemistry, cell biology, histology, immunology, internal medicine, neurology, obstetrics, pathology, pediatrics, pharmacology, physiology, and surgery.

### Chinese Textbook Collection

The MCMLE subset is accompanied by 33 Simplified Chinese medical textbooks that are officially designated study materials for the Chinese medical licensing examination.[1] Since Taiwanese medical students use many of the same textbooks as American students, the TWMLE subset shares its document collection with both the USMLE and MCMLE corpora.

## What does a MedQA question look like?

Each question in MedQA follows a standardized structure typical of medical licensing examinations. A typical USMLE-style question includes:

1. **Clinical vignette**: A paragraph describing a patient's demographics, presenting complaint, medical history, physical examination findings, and relevant laboratory or imaging results.
2. **Question stem**: A specific question asking for the most likely diagnosis, next best step in management, underlying mechanism, or expected finding.
3. **Answer choices**: Four or five options, only one of which is correct.

For example, a question might present a 45-year-old woman with fatigue, weight gain, cold intolerance, and elevated TSH levels, then ask which condition is the most likely diagnosis. The answer choices would include hypothyroidism (correct), hyperthyroidism, Cushing syndrome, and Addison disease.

This format requires more than simple factual recall. Many questions demand multi-step clinical reasoning: the model must interpret the clinical scenario, integrate multiple pieces of information, consider differential diagnoses, and select the best answer. This is what makes MedQA a particularly challenging benchmark for [AI](/wiki/ai) systems.
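The three-part structure described above maps naturally onto a small record type. The following is a minimal sketch, not the dataset's own schema: the class name `MedQAQuestion`, its field names, and the wording of the example item (built from the hypothyroidism illustration above, not taken from the dataset) are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MedQAQuestion:
    """One multiple-choice item: vignette plus stem, lettered options, answer key."""
    question: str            # clinical vignette followed by the question stem
    options: dict[str, str]  # option letter -> option text
    answer_key: str          # letter of the single correct option

    def render(self) -> str:
        """Format the item as plain text, one option per line."""
        lines = [self.question]
        lines += [f"{key}. {text}" for key, text in sorted(self.options.items())]
        return "\n".join(lines)


# Illustrative item paraphrasing the example in the text (hypothetical wording).
example = MedQAQuestion(
    question=("A 45-year-old woman presents with fatigue, weight gain, and cold "
              "intolerance. Her TSH is elevated. Which of the following is the "
              "most likely diagnosis?"),
    options={"A": "Hyperthyroidism", "B": "Hypothyroidism",
             "C": "Cushing syndrome", "D": "Addison disease"},
    answer_key="B",
)

print(example.render())
```

Keeping the answer key separate from the rendered text makes it easy to show a model the question without leaking the label.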

## Baseline Methods and Early Results

### Original Baselines (2020-2021)

In their original paper, Jin et al. evaluated MedQA using a two-stage pipeline approach common in open-domain question answering:[1]

1. **Document retrieval**: A retriever (either rule-based using TF-IDF or neural) selects relevant passages from the textbook corpus.
2. **Machine comprehension**: A reading comprehension model processes the retrieved passages along with the question to select an answer.

The best-performing system at the time of publication achieved only 36.7% accuracy on the English (USMLE) test set.[1] For comparison, random guessing on a 4-option multiple-choice test would yield 25% [accuracy](/wiki/accuracy). The Traditional Chinese and Simplified Chinese subsets saw best results of 42.0% and 70.1%, respectively.[1]

The authors identified document retrieval as the primary bottleneck. The retrieval component struggled with multi-hop reasoning, where answering a question requires combining information from multiple documents or passages. Medical questions often require this type of reasoning because a single textbook passage rarely contains all the information needed to answer a complex clinical question.

| Subset | Best Accuracy (2021) | Random Baseline |
|---|---|---|
| USMLE (English) | 36.7% | 25.0% |
| TWMLE (Traditional Chinese) | 42.0% | 25.0% |
| MCMLE (Simplified Chinese) | 70.1% | 25.0% |

The relatively higher performance on the Simplified Chinese subset may reflect the larger training set size (over 27,000 training questions) compared to the English subset (approximately 10,000 training questions).
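The two-stage retrieve-then-read pipeline described in this section can be sketched in a few lines. This is a toy illustration under stated assumptions, not the authors' implementation: the three passages are invented stand-ins for textbook text, the TF-IDF weighting is a common smoothed variant, and the "reader" is a simple token-overlap heuristic rather than a trained comprehension model.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_idf(passages):
    """Smoothed inverse document frequency for every term in the corpus."""
    doc_freq = Counter()
    for passage in passages:
        doc_freq.update(set(tokenize(passage)))
    n = len(passages)
    return {term: math.log((1 + n) / (1 + df)) + 1.0 for term, df in doc_freq.items()}


def tfidf(text, idf):
    """Sparse TF-IDF vector; terms unseen in the corpus are dropped."""
    counts = Counter(tokenize(text))
    return {term: count * idf[term] for term, count in counts.items() if term in idf}


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm = math.sqrt(sum(w * w for w in a.values())) * math.sqrt(sum(w * w for w in b.values()))
    return dot / norm if norm else 0.0


def retrieve(query, passages, k=1):
    """Stage 1: rank passages by TF-IDF cosine similarity to the query."""
    idf = build_idf(passages)
    q_vec = tfidf(query, idf)
    ranked = sorted(passages, key=lambda p: cosine(q_vec, tfidf(p, idf)), reverse=True)
    return ranked[:k]


def choose_option(options, evidence):
    """Stage 2 (toy reader): pick the option sharing the most tokens with the evidence."""
    evidence_tokens = set(tokenize(" ".join(evidence)))
    return max(sorted(options), key=lambda key: len(set(tokenize(options[key])) & evidence_tokens))


# Hypothetical mini-corpus standing in for textbook passages.
passages = [
    "Hypothyroidism causes fatigue, weight gain, and cold intolerance with an elevated TSH.",
    "Cushing syndrome results from excess cortisol and causes central obesity.",
    "Addison disease is adrenal insufficiency with hyperpigmentation and hypotension.",
]
vignette = "A 45-year-old woman has fatigue, weight gain, cold intolerance, and an elevated TSH."
options = {"A": "Hyperthyroidism", "B": "Hypothyroidism",
           "C": "Cushing syndrome", "D": "Addison disease"}

evidence = retrieve(vignette, passages, k=1)
prediction = choose_option(options, evidence)
print(evidence[0])
print(prediction)
```

The toy case succeeds because one passage contains everything needed. The multi-hop questions the authors describe are harder precisely because no single retrieved passage does.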

## How do AI models score on MedQA?

The emergence of [large language models](/wiki/llm) transformed performance on MedQA. Unlike the retrieval-and-comprehension pipeline used in the original baselines, LLMs encode vast medical knowledge within their parameters during pretraining, allowing them to answer questions without explicit document retrieval.

### Timeline of Key Results

The following table summarizes the performance of major models on the MedQA USMLE test set (4-option version unless noted otherwise):

| Model | Organization | Year | MedQA Accuracy | Notes |
|---|---|---|---|---|
| Best baseline (Jin et al.) | MIT | 2021 | 36.7% | Retrieval + comprehension pipeline[1] |
| PubMedGPT / BioMedLM (2.7B) | Stanford | 2022 | 50.3% | Domain-specific pretraining |
| ChatGPT ([GPT-3.5](/wiki/gpt-3), zero-shot) | [OpenAI](/wiki/openai) | 2023 | 50.8% | First widely tested LLM[6] |
| ChatGPT (GPT-3.5, few-shot) | [OpenAI](/wiki/openai) | 2023 | 60.2% | Met USMLE passing threshold[6] |
| Flan-[PaLM](/wiki/palm) | Google | 2023 | 67.6% | Surpassed prior SOTA by 17+ percentage points[3] |
| Med-PaLM | Google | 2023 | 67.2% | First AI to "pass" USMLE-style questions[3] |
| [GPT-4](/wiki/gpt-4) | [OpenAI](/wiki/openai) | 2023 | 86.1% | Exceeded passing score by 20+ points[2] |
| Med-PaLM 2 | Google | 2023 | 86.5% | Instruction-tuned medical model[4] |
| Med-Gemini | Google | 2024 | 91.1% | Uncertainty-guided search strategy[5] |
| [Claude](/wiki/claude) Opus 4.1 (Thinking) | [Anthropic](/wiki/anthropic) | 2025 | 93.6% | Strongest non-OpenAI model[9] |
| [Gemini](/wiki/gemini) 2.5 Pro | Google | 2025 | 93.1% | Advanced [reasoning](/wiki/reasoning) capabilities[9] |
| GPT-5 | [OpenAI](/wiki/openai) | 2025 | 96.3% | Near-ceiling performance[9] |
| o3 | [OpenAI](/wiki/openai) | 2025 | 96.1% | Reasoning-focused model[9] |
| o1 | [OpenAI](/wiki/openai) | 2025 | 96.5% | Highest reported score[9] |

### What is the USMLE passing threshold?

A score of approximately 60% is the commonly cited threshold for passing the USMLE (though the actual exam uses a scaled scoring system rather than a raw percentage). This passing threshold serves as a meaningful reference point for MedQA results.

ChatGPT (GPT-3.5) was notable for being among the first general-purpose LLMs to approach and reach this threshold, achieving 60.2% with few-shot prompting techniques.[6] This result, published in early 2023, generated significant interest in the medical AI community because it demonstrated that a general-purpose language model could perform at the level required of a medical student.

### GPT-4 and Med-PaLM 2: Breaking the 85% Barrier

The release of [GPT-4](/wiki/gpt-4) in March 2023 represented a major leap. In the paper "Capabilities of GPT-4 on Medical Challenge Problems" by Harsha Nori and colleagues at Microsoft Research, GPT-4 achieved 86.1% on the MedQA USMLE test set without any medical-specific [fine-tuning](/wiki/fine_tuning).[2] This was particularly striking because GPT-4 is a general-purpose model with no special medical training.

Around the same time, Google reported that [Med-PaLM 2](/wiki/med_palm_2), a version of [PaLM](/wiki/palm) 2 specifically fine-tuned for medical applications, achieved 86.5% on MedQA, marking the first time an AI system reached expert-level test-taker performance on the benchmark.[4] Med-PaLM 2 also demonstrated strong performance on physician evaluations, with its answers being preferred over physician-written answers by a panel of doctors on eight of nine evaluation axes in a study of over 1,000 consumer medical questions.[4]

### Med-Gemini: Crossing 90%

In 2024, Google introduced Med-Gemini, a medical adaptation of the [Gemini](/wiki/gemini) model family. Med-Gemini achieved 91.1% accuracy on MedQA through a novel uncertainty-guided search strategy.[5] The key innovation was training the model to recognize when it was uncertain about an answer and automatically retrieve supporting clinical literature before committing to a response. This was accomplished through two purpose-built training datasets: MedQA-R, which teaches step-by-step clinical reasoning, and MedQA-RS, which trains the model to search for evidence when confidence is low.[5]
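The control flow of "search only when uncertain" can be illustrated with a small sketch. It is not Med-Gemini's implementation: using Shannon entropy over sampled answers as the uncertainty signal, the 0.8-bit threshold, and the `fake_search` and `fake_resample` stubs are all assumptions chosen to keep the example self-contained.

```python
import math
from collections import Counter


def answer_entropy(votes):
    """Shannon entropy, in bits, of the sampled answer letters."""
    total = len(votes)
    return -sum((c / total) * math.log2(c / total) for c in Counter(votes).values())


def answer_with_optional_search(votes, search_fn, resample_fn, threshold=0.8):
    """Return the majority answer, searching first only if the votes disagree.

    `search_fn` and `resample_fn` are hypothetical callables standing in for a
    literature search and a second round of model sampling.
    """
    if answer_entropy(votes) > threshold:
        evidence = search_fn()
        votes = resample_fn(evidence)
    return Counter(votes).most_common(1)[0][0]


search_calls = []  # records how often the stub search was triggered


def fake_search():
    """Stub search: logs the call and returns placeholder evidence."""
    search_calls.append("searched")
    return ["retrieved passage (placeholder)"]


def fake_resample(evidence):
    """Stub second sampling round, pretending the evidence reduced disagreement."""
    return ["B", "B", "B", "A", "B"]


confident = answer_with_optional_search(["B"] * 5, fake_search, fake_resample)
calls_after_confident = len(search_calls)

uncertain = answer_with_optional_search(["A", "B", "C", "B", "A"], fake_search, fake_resample)
calls_after_uncertain = len(search_calls)

print(confident, calls_after_confident)
print(uncertain, calls_after_uncertain)
```

Unanimous samples skip the search step entirely, while split samples trigger one search before the final vote.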

### Approaching the Ceiling: 2025 Results

By 2025, the leading models had pushed MedQA accuracy above 95%. OpenAI's o1 reasoning model achieved the highest reported score of 96.5%, followed closely by GPT-5 at 96.3% and o3 at 96.1%.[9] These scores are well above average physician performance on USMLE-style questions and suggest that the MedQA benchmark may be approaching saturation for the most capable models.

## How is MedQA evaluated?

### Standard Evaluation Protocol

The standard evaluation protocol for MedQA uses the 1,273-question English (USMLE) test set. Models are presented with each question and its answer choices, and accuracy is computed as the percentage of questions answered correctly. Most published results use the 4-option version of the dataset, though some studies report results on the 5-option version as well.[8]
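The metric itself is plain exact-match accuracy over answer letters. A minimal sketch follows; the function name and the four toy predictions are illustrative only.

```python
def accuracy(predictions, gold):
    """Percentage of questions where the predicted letter matches the answer key."""
    if len(predictions) != len(gold):
        raise ValueError("predictions and gold must have the same length")
    if not gold:
        raise ValueError("cannot compute accuracy on an empty set")
    correct = sum(p == g for p, g in zip(predictions, gold))
    return 100.0 * correct / len(gold)


# Toy example with four questions; a real run would cover all 1,273 test items.
gold = ["B", "A", "D", "C"]
predicted = ["B", "A", "C", "C"]
score = accuracy(predicted, gold)
print(score)
```

Because each question counts equally, one question on the 1,273-item test set is worth about 0.08 percentage points.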

### Prompting Strategies

Different prompting strategies can significantly affect model performance. Common approaches include:

- **Zero-shot**: The model receives only the question and answer choices with no additional examples or instructions.
- **Few-shot**: The model is given several example questions with correct answers before the test question, helping it understand the expected format and reasoning pattern.
- **[Chain-of-thought](/wiki/chain_of_thought)**: The model is prompted to explain its reasoning step by step before selecting an answer, which often improves accuracy on complex clinical questions.
- **Self-consistency / Ensemble**: Multiple responses are generated and the most common answer is selected, reducing the impact of random variation.

These prompting techniques can produce substantial differences in measured performance. For instance, GPT-3.5 improved from roughly 50.8% (zero-shot) to 60.2% (few-shot with ensemble methods) on MedQA.[6]
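The strategies listed above differ mainly in how the prompt is assembled and how outputs are aggregated. The sketch below shows one plausible way to build zero-shot, few-shot, and chain-of-thought prompts and to apply a majority vote. The prompt wording, function names, and example items are assumptions, not a template from any of the cited papers.

```python
from collections import Counter


def format_item(question, options, answer=None):
    """Render one question; include the answer only for worked examples."""
    lines = [f"Question: {question}"]
    lines += [f"{key}. {text}" for key, text in sorted(options.items())]
    lines.append(f"Answer: {answer}" if answer is not None else "Answer:")
    return "\n".join(lines)


def build_prompt(test_item, examples=(), chain_of_thought=False):
    """Zero-shot when `examples` is empty, few-shot otherwise."""
    parts = [format_item(q, opts, ans) for q, opts, ans in examples]
    target = format_item(*test_item)
    if chain_of_thought:
        target += " Let's think step by step."
    parts.append(target)
    return "\n\n".join(parts)


def majority_vote(answers):
    """Self-consistency: keep the most frequent answer across sampled responses."""
    return Counter(answers).most_common(1)[0][0]


# Hypothetical items for illustration only.
worked_example = (
    "Deficiency of which vitamin causes scurvy?",
    {"A": "Vitamin A", "B": "Vitamin B12", "C": "Vitamin C", "D": "Vitamin D"},
    "C",
)
test_item = (
    "A patient has fatigue, weight gain, cold intolerance, and an elevated TSH. "
    "What is the most likely diagnosis?",
    {"A": "Hyperthyroidism", "B": "Hypothyroidism",
     "C": "Cushing syndrome", "D": "Addison disease"},
)

zero_shot = build_prompt(test_item)
few_shot = build_prompt(test_item, examples=[worked_example])
cot = build_prompt(test_item, chain_of_thought=True)
final_answer = majority_vote(["B", "A", "B", "C", "B"])

print(few_shot)
print(final_answer)
```

Reported scores are only comparable when the prompting setup is stated alongside them, because the same model can land several points apart depending on which of these variants is used.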

### Comparison with Other Medical Benchmarks

MedQA is often evaluated alongside other medical question answering benchmarks to provide a more comprehensive picture of model capabilities:

| Benchmark | Description | Question Count | Format |
|---|---|---|---|
| MedQA (USMLE) | US medical licensing exam questions | 1,273 (test) | Multiple choice (4-5 options) |
| [MMLU](/wiki/mmlu) (Medical subset) | Medical topics from Massive Multitask Language Understanding | ~1,089 | Multiple choice (4 options) |
| MedMCQA | Indian medical entrance exams (AIIMS and NEET PG) | 4,183 (validation set, commonly used for evaluation) | Multiple choice (4 options) |
| [PubMedQA](/wiki/pubmedqa) | Biomedical research question answering | 500 (test) | Yes/No/Maybe |

## Significance and Impact

### Advancing Medical AI Research

MedQA has played a central role in driving progress in [medical AI](/wiki/ai_in_healthcare). The benchmark has served as a primary evaluation tool in some of the most influential papers on LLMs in healthcare, including Google's Med-PaLM and Med-PaLM 2 papers, Microsoft's GPT-4 medical evaluation, and Google's Med-Gemini study.[2][3][4][5] Its adoption across major AI research labs has made it a de facto standard for measuring medical question answering capabilities.

### Demonstrating LLM Potential in Medicine

The trajectory of MedQA scores tells a compelling story about the pace of progress in AI. In roughly five years (2020 to 2025), performance went from 36.7%, less than 12 percentage points above the 25% random-guess rate for a 4-choice test, to over 96%.[1][9] This rapid improvement has fueled both excitement and caution about the potential role of AI in clinical practice.

### Influence on Medical Education Discussions

The fact that AI systems can now outperform the average medical student (and many physicians) on USMLE-style questions has prompted discussions about the future of medical education and assessment. Some educators have questioned whether the USMLE format adequately tests the skills that matter most in clinical practice, given that AI can now excel at it.

## What are the limitations of MedQA?

Despite its widespread use, MedQA has several important limitations that researchers and practitioners should keep in mind.

### Benchmark Saturation

With leading models scoring above 95% on MedQA, the benchmark has limited ability to differentiate between top-performing systems.[9] This saturation means that MedQA may no longer be the most informative benchmark for evaluating frontier medical AI models. Researchers have begun developing more challenging benchmarks such as MedXpertQA (introduced in 2025) to address this gap.
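A rough sampling-error calculation shows why scores this close are hard to tell apart on a 1,273-question test set. The sketch uses a normal-approximation 95% interval for a proportion, which is only approximate this close to the ceiling; the function name and the choice of interval are assumptions of this example.

```python
import math


def accuracy_interval(acc_percent, n_questions, z=1.96):
    """Approximate 95% confidence interval (in percentage points) for an accuracy."""
    p = acc_percent / 100.0
    standard_error = math.sqrt(p * (1.0 - p) / n_questions)
    half_width = 100.0 * z * standard_error
    return acc_percent - half_width, acc_percent + half_width


# Highest score in the results table, on the 1,273-question test set.
low, high = accuracy_interval(96.5, 1273)

# How many test questions separate a 96.5% score from a 96.1% score.
question_gap = round((96.5 - 96.1) / 100.0 * 1273)

print(round(low, 1), round(high, 1))
print(question_gap)
```

The interval spans about two percentage points, while the gap between the top three reported scores corresponds to only about five questions, so their ordering should be read with caution.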

### Multiple-Choice Format Limitations

The multiple-choice format constrains what MedQA can measure. Real clinical practice requires open-ended reasoning, information gathering through patient interviews, and decision-making under uncertainty. A model that excels at selecting the best answer from four choices may still struggle to generate a complete differential diagnosis or management plan from scratch. Scoring is also all-or-nothing: a model that narrows the options to two reasonable choices but picks the wrong one earns no more credit on that question than one that guesses wrong at random.

### Limited Assessment of Clinical Skills

Medical competency involves much more than answering knowledge-based questions. MedQA primarily tests the "knows" and "knows how" levels of Miller's Pyramid of clinical competence, neglecting the "shows how" and "does" levels that involve practical skills, communication, and hands-on patient care. Models that perform well on MedQA may not be capable of conducting a patient interview, performing a physical examination, or communicating a diagnosis empathetically.

### Knowledge-Practice Gap

Studies have shown a significant gap between AI performance on knowledge benchmarks like MedQA (where leading models achieve 84-96% accuracy) and performance on practice-based assessments that simulate real clinical encounters (where the same models often score only 45-69%). This knowledge-practice gap suggests that MedQA scores alone should not be used to assess a model's readiness for clinical deployment.

### Data Contamination Concerns

As MedQA has become a widely used benchmark, concerns about data contamination have grown. The questions in the dataset may appear in the training data of large language models, either directly or in paraphrased form, which could artificially inflate benchmark scores. Some researchers have called for dynamic benchmarking approaches with regularly refreshed question sets to mitigate this issue.

### Narrow Scope

MedQA focuses exclusively on the type of knowledge tested in medical licensing exams. It does not cover many aspects of medical AI that are equally important, such as medical image interpretation, clinical note generation, patient communication, or ethical decision-making in novel situations. Performance on MedQA should be considered alongside results from other benchmarks and clinical evaluations.

## Related Benchmarks and Successor Efforts

Several benchmarks have been developed to address the limitations of MedQA or to complement it:

- **MedMCQA**: A larger-scale medical multiple-choice dataset (approximately 194,000 questions) drawn from Indian medical entrance exams (AIIMS and NEET PG). It covers 21 medical subjects and provides a broader range of difficulty levels.
- **[PubMedQA](/wiki/pubmedqa)**: A biomedical question answering dataset where questions are derived from PubMed article titles and the task is to answer yes, no, or maybe based on the corresponding abstract.
- **[MMLU](/wiki/mmlu) (Medical Subset)**: The medical-related subsets of the Massive Multitask Language Understanding benchmark, covering topics such as clinical knowledge, medical genetics, anatomy, and professional medicine.
- **MedQA-CS**: An extension that evaluates clinical skills using an Objective Structured Clinical Examination (OSCE) framework, addressing the gap between knowledge assessment and practical clinical competence.
- **MedXpertQA**: A more challenging benchmark introduced in 2025 that includes expert-level medical reasoning questions designed to differentiate between the most capable models.
- **[HealthBench](/wiki/healthbench)**: OpenAI's benchmark for evaluating LLMs on realistic clinical queries, focusing on practical healthcare applications rather than exam-style questions.

## Technical Details

### Data Format

MedQA questions are stored in JSONL (JSON Lines) format, where each line represents a single question as a JSON object. The standard fields include the question text, a dictionary of answer options, the correct answer key, and optional metadata such as the relevant medical subject or USMLE step.

The dataset is organized into regional folders (US, Mainland, Taiwan), and each region includes a "qbank" file containing all questions as well as separate files for the official train, dev, and test splits.
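Reading a JSONL file takes only the standard library. In the sketch below, the key names `question`, `options`, and `answer_idx` are assumptions about the schema and should be checked against the actual files, and the two records are invented examples rather than dataset content. An in-memory buffer stands in for a file so the example runs without downloading anything.

```python
import io
import json

# Two hypothetical records, serialized one JSON object per line.
sample_records = [
    {"question": "A patient has fatigue, weight gain, cold intolerance, and an "
                 "elevated TSH. What is the most likely diagnosis?",
     "options": {"A": "Hyperthyroidism", "B": "Hypothyroidism",
                 "C": "Cushing syndrome", "D": "Addison disease"},
     "answer_idx": "B"},
    {"question": "Deficiency of which vitamin causes scurvy?",
     "options": {"A": "Vitamin A", "B": "Vitamin B12",
                 "C": "Vitamin C", "D": "Vitamin D"},
     "answer_idx": "C"},
]
SAMPLE_JSONL = "\n".join(json.dumps(record) for record in sample_records) + "\n"


def read_jsonl(handle):
    """Yield one parsed JSON object per non-empty line."""
    for line in handle:
        line = line.strip()
        if line:
            yield json.loads(line)


# With a real file, replace the StringIO with open(path, encoding="utf-8").
records = list(read_jsonl(io.StringIO(SAMPLE_JSONL)))

for record in records:
    correct_text = record["options"][record["answer_idx"]]
    print(record["answer_idx"], correct_text)
```

Parsing line by line keeps memory use flat, which matters more for the larger Chinese subsets than for the English one.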

### How can I access the MedQA dataset?

MedQA is publicly available through several channels:

- **GitHub**: The official repository (github.com/jind11/MedQA) contains the full dataset, textbook corpus, and baseline code.[7]
- **Hugging Face**: Multiple versions of the dataset are available on the Hugging Face Hub, including the popular "GBaker/MedQA-USMLE-4-options" dataset that provides the 4-option English variant in a ready-to-use format.[8]
- **Papers with Code**: The MedQA-USMLE page on Papers with Code tracks the latest benchmark results and links to relevant papers.

### License and Usage

The dataset is released for research purposes. Researchers have used MedQA for a wide variety of tasks beyond simple benchmarking, including studying model calibration, analyzing error patterns, developing [retrieval-augmented generation](/wiki/retrieval_augmented_generation) systems, and fine-tuning domain-specific medical models.

## ELI5: MedQA Explained Simply

Imagine the big test that doctors in the United States have to pass before they are allowed to treat patients. It is full of tricky questions like "A patient comes in feeling tired, cold, and gaining weight, with a certain blood test result. What is wrong with them?" MedQA is a giant collection of 12,723 of these English exam questions (plus tens of thousands more in Chinese), gathered so that scientists can quiz computer programs the same way they quiz future doctors. When MedQA first came out in 2020, the best computer could only get about 37 out of 100 right. By 2025, the smartest AI programs were scoring above 96 out of 100, which is better than most real medical students. That is why MedQA became one of the most famous ways to check how good AI is at medicine.

## Citation

The original MedQA paper should be cited as:

Jin, D., Pan, E., Oufattole, N., Weng, W.-H., Fang, H., & Szolovits, P. (2021). What Disease Does This Patient Have? A Large-scale Open Domain Question Answering Dataset from Medical Exams. *Applied Sciences*, 11(14), 6421.

The paper was first posted as a preprint on arXiv (arXiv:2009.13081) in September 2020 and subsequently published in the MDPI journal *Applied Sciences* in July 2021.[1]

## See Also

- [Benchmark](/wiki/benchmark)
- [Med-PaLM](/wiki/med_palm)
- [Med-PaLM 2](/wiki/med_palm_2)
- [AI in Healthcare](/wiki/ai_in_healthcare)
- [MMLU](/wiki/mmlu)
- [PubMedQA](/wiki/pubmedqa)
- [HealthBench](/wiki/healthbench)
- [GPT-4](/wiki/gpt-4)
- [PaLM](/wiki/palm)
- [Gemini](/wiki/gemini)
- [LLM](/wiki/llm)
- [NLP](/wiki/nlp)
- [Accuracy](/wiki/accuracy)
- [BERT](/wiki/bert)
- [Transformer](/wiki/transformer)

## References

1. Jin, D., Pan, E., Oufattole, N., Weng, W.-H., Fang, H., & Szolovits, P. (2021). What Disease Does This Patient Have? A Large-scale Open Domain Question Answering Dataset from Medical Exams. *Applied Sciences*, 11(14), 6421. https://doi.org/10.3390/app11146421
2. Nori, H., King, N., McKinney, S.M., Carignan, D., & Horvitz, E. (2023). Capabilities of GPT-4 on Medical Challenge Problems. arXiv:2303.13375.
3. Singhal, K., Azizi, S., Tu, T., et al. (2023). Large Language Models Encode Clinical Knowledge. *Nature*, 620, 172-180.
4. Singhal, K., Tu, T., Gottweis, J., et al. (2025). Toward Expert-Level Medical Question Answering with Large Language Models. *Nature Medicine*, 31, 943-950.
5. Saab, K., Tu, T., Weng, W.-H., et al. (2024). Capabilities of Gemini Models in Medicine. arXiv:2404.18416.
6. Kung, T.H., Cheatham, M., Medenilla, A., et al. (2023). Performance of ChatGPT on USMLE: Potential for AI-Assisted Medical Education Using Large Language Models. *PLOS Digital Health*, 2(2), e0000198.
7. MedQA GitHub Repository. https://github.com/jind11/MedQA
8. GBaker/MedQA-USMLE-4-options, Hugging Face Datasets. https://huggingface.co/datasets/GBaker/MedQA-USMLE-4-options
9. Vals.ai MedQA Benchmark Leaderboard. https://www.vals.ai/benchmarks/medqa
10. Nori, H., Lee, Y.T., Zhang, S., et al. (2023). Can Generalist Foundation Models Outcompete Special-Purpose Tuning? Case Study in Medicine. arXiv:2311.16452.
