MedQA is a large-scale open-domain question answering dataset composed of multiple-choice questions drawn from medical licensing examinations. Created by Di Jin, Eileen Pan, Nassim Oufattole, Wei-Hung Weng, Hanyi Fang, and Peter Szolovits at MIT, MedQA was introduced in a 2020 preprint and formally published in the journal Applied Sciences in 2021. It is widely regarded as one of the most important benchmarks for evaluating how well large language models can handle real-world medical knowledge and clinical reasoning.
The dataset is notable for being the first free-form multiple-choice open-domain question answering (OpenQA) dataset specifically designed for medical problem solving. Unlike earlier medical NLP benchmarks that focused on narrower tasks such as named entity recognition or relation extraction, MedQA tests a system's ability to read a clinical scenario, retrieve relevant medical knowledge, and select the correct answer from several options. This design closely mirrors the challenge that medical students and physicians face when taking board certification exams.
Before MedQA, most question answering datasets in the biomedical domain were either closed-domain (where the answer could be found in a given passage) or limited to yes/no and short-answer formats. Datasets like SQuAD and BioASQ tested reading comprehension but did not require models to reason across broad medical knowledge the way a licensing exam does.
The authors of MedQA recognized that medical board examinations represent a natural, high-quality source of challenging questions. These exams are designed by medical education experts to test clinical competency across a wide range of disciplines, from anatomy and pharmacology to ethics and patient management. By collecting questions from three different national medical licensing systems, the MedQA team created a benchmark that is both multilingual and clinically rigorous.
The core research question behind MedQA is straightforward: can a machine answer the same questions that human doctors must answer to earn their medical licenses? When the dataset was first released, the answer was a clear "not yet," with the best automated systems achieving only 36.7% accuracy on the English test set. That figure has since risen dramatically thanks to advances in large language models.
MedQA contains a total of 61,097 multiple-choice questions spanning three languages:
| Language | Exam Source | Abbreviation | Total Questions |
|---|---|---|---|
| English | United States Medical Licensing Examination (USMLE) | USMLE | 12,723 |
| Simplified Chinese | Mainland China Medical Licensing Examination | MCMLE | 34,251 |
| Traditional Chinese | Taiwan Medical Licensing Examination | TWMLE | 14,123 |
Each question follows the standard format used in medical licensing exams: a clinical vignette (also called a case stem) presents a patient scenario with relevant history, symptoms, lab results, and other clinical details, followed by a question and a set of answer choices. The original English (USMLE) release provides 5 answer options per question, matching the format of actual USMLE exams; a simplified 4-option version was also released and has become the more commonly used variant for benchmarking.
The USMLE subset of MedQA is the most widely used portion of the dataset in the research community. The USMLE is a three-step examination program used in the United States to assess clinical competency and grant medical licensure:
- Step 1 covers the foundational sciences (anatomy, physiology, biochemistry, pharmacology, microbiology, and related disciplines).
- Step 2 Clinical Knowledge (CK) covers the application of medical knowledge to patient care.
- Step 3 covers unsupervised clinical practice and patient management.
The English questions in MedQA are sourced from both official USMLE practice materials and commercial exam-preparation question banks. Of the 12,723 questions, around 300 come from official USMLE tutorials, while the remainder are drawn from widely used exam-preparation websites that cover all three steps.
The official train/dev/test split for the USMLE subset is:
| Split | Number of Questions |
|---|---|
| Training | 10,178 |
| Development (Validation) | 1,272 |
| Test | 1,273 |
This roughly follows an 80/10/10 split ratio. The test set of 1,273 questions is the standard evaluation set used in virtually all published results on MedQA.
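A quick sanity check of the split arithmetic, using the counts from the table above:

```python
# Official USMLE-subset split sizes from the MedQA paper.
splits = {"train": 10178, "dev": 1272, "test": 1273}
total = sum(splits.values())  # 12,723 questions in the English subset

# Each split as a fraction of the whole, confirming the ~80/10/10 ratio.
ratios = {name: round(n / total, 3) for name, n in splits.items()}
print(total)   # 12723
print(ratios)  # {'train': 0.8, 'dev': 0.1, 'test': 0.1}
```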
The MCMLE (Mainland China) subset is the largest portion of MedQA, with 34,251 questions in Simplified Chinese drawn from China's national medical licensing examination. The TWMLE (Taiwan) subset contains 14,123 questions in Traditional Chinese from Taiwan's medical licensing examination. Both follow approximately the same 80/10/10 train/dev/test split as the USMLE subset.
These multilingual subsets make MedQA one of the few medical QA benchmarks that supports cross-lingual evaluation, enabling researchers to study how well models transfer medical knowledge across languages.
In addition to the questions themselves, the MedQA authors collected and released a large-scale textbook corpus intended to serve as a knowledge source for question answering systems. The idea is that models can retrieve relevant passages from these textbooks to help answer the exam questions, similar to how medical students study from textbooks before taking their exams.
The English corpus consists of 18 widely used medical textbooks that are standard reading material for USMLE preparation. These textbooks cover the major medical disciplines; a selection is listed below with approximate record counts:
| Textbook | Subject Area | Approximate Records |
|---|---|---|
| Harrison's Principles of Internal Medicine | Internal Medicine | 20,583 |
| Schwartz's Principles of Surgery | Surgery | 7,803 |
| Adams and Victor's Principles of Neurology | Neurology | 7,732 |
| Williams Obstetrics | Obstetrics | 5,392 |
| Alberts' Molecular Biology of the Cell | Cell Biology | 4,275 |
| Robbins Pathologic Basis of Disease | Pathology | 3,156 |
| Janeway's Immunobiology | Immunology | 2,996 |
| Ross Histology | Histology | 2,685 |
| Physiology (Levy) | Physiology | 2,627 |
| Nelson Textbook of Pediatrics | Pediatrics | 2,575 |
| Gray's Anatomy | Anatomy | 1,736 |
| Lippincott Illustrated Reviews: Biochemistry | Biochemistry | 1,193 |
| First Aid for the USMLE Step 2 CK | Exam Preparation | 800 |
| First Aid for the USMLE Step 1 | Exam Preparation | 489 |
| Pathoma (Husain) | Pathology | 280 |
These textbooks collectively form a comprehensive medical knowledge base spanning anatomy, biochemistry, cell biology, histology, immunology, internal medicine, neurology, obstetrics, pathology, pediatrics, pharmacology, physiology, and surgery.
The MCMLE subset is accompanied by 33 Simplified Chinese medical textbooks that are officially designated study materials for the Chinese medical licensing examination. Since Taiwanese medical students use many of the same textbooks as American students, the TWMLE subset shares its document collection with both the USMLE and MCMLE corpora.
Each question in MedQA follows a standardized structure typical of medical licensing examinations. A typical USMLE-style question includes:
- a clinical vignette (case stem) describing the patient's history, symptoms, physical findings, and laboratory results;
- a lead-in question (e.g., "Which of the following is the most likely diagnosis?");
- four or five answer choices, exactly one of which is correct.
For example, a question might present a 45-year-old woman with fatigue, weight gain, cold intolerance, and elevated TSH levels, then ask which condition is the most likely diagnosis. The answer choices would include hypothyroidism (correct), hyperthyroidism, Cushing syndrome, and Addison disease.
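Rendered as data, the hypothyroidism example above might look like the following record. The field names (`question`, `options`, `answer_idx`) are modeled on the commonly distributed release and should be treated as an assumption, not the official schema:

```python
# Hypothetical MedQA-style record for the example question above.
# Field names are assumptions modeled on the common JSONL distribution.
record = {
    "question": (
        "A 45-year-old woman presents with fatigue, weight gain, cold "
        "intolerance, and an elevated TSH level. Which of the following "
        "is the most likely diagnosis?"
    ),
    "options": {
        "A": "Hypothyroidism",
        "B": "Hyperthyroidism",
        "C": "Cushing syndrome",
        "D": "Addison disease",
    },
    "answer_idx": "A",
}

# A model's task is to map question + options to a single option key.
assert record["options"][record["answer_idx"]] == "Hypothyroidism"
```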
This format requires more than simple factual recall. Many questions demand multi-step clinical reasoning: the model must interpret the clinical scenario, integrate multiple pieces of information, consider differential diagnoses, and select the best answer. This is what makes MedQA a particularly challenging benchmark for AI systems.
In their original paper, Jin et al. evaluated MedQA using a two-stage pipeline approach common in open-domain question answering:
1. Document retrieval: an information-retrieval component selects passages from the textbook corpus that are relevant to the question.
2. Reading comprehension: a reader model consumes the question, the answer options, and the retrieved passages, and selects the most likely answer.
The best-performing system at the time of publication achieved only 36.7% accuracy on the English (USMLE) test set. For comparison, random guessing on a 4-option multiple-choice test would yield 25% accuracy. The Traditional Chinese and Simplified Chinese subsets saw best results of 42.0% and 70.1%, respectively.
The authors identified document retrieval as the primary bottleneck. The retrieval component struggled with multi-hop reasoning, where answering a question requires combining information from multiple documents or passages. Medical questions often require this type of reasoning because a single textbook passage rarely contains all the information needed to answer a complex clinical question.
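The retrieve-then-read idea can be sketched with a toy term-overlap retriever. This is a stand-in for illustration only, not the IR systems and neural readers actually used in the paper:

```python
from collections import Counter

# Toy corpus standing in for the textbook collection.
corpus = [
    "Hypothyroidism presents with fatigue, weight gain, and cold intolerance.",
    "Hyperthyroidism causes weight loss, heat intolerance, and tachycardia.",
    "Cushing syndrome results from chronic glucocorticoid excess.",
]

def tokenize(text):
    return [w.strip(".,").lower() for w in text.split()]

def retrieve(question, docs, k=1):
    """Rank documents by simple term overlap with the question (stage 1)."""
    q = Counter(tokenize(question))
    scored = sorted(
        docs,
        key=lambda d: sum((q & Counter(tokenize(d))).values()),
        reverse=True,
    )
    return scored[:k]

def answer(question, options, docs):
    """Pick the option whose text best matches the retrieved evidence (stage 2)."""
    evidence = tokenize(" ".join(retrieve(question, docs)))
    return max(options, key=lambda o: sum(w in evidence for w in tokenize(options[o])))

question = "A patient has fatigue, weight gain, and cold intolerance."
options = {"A": "Hypothyroidism", "B": "Hyperthyroidism", "C": "Cushing syndrome"}
print(answer(question, options, docs=corpus))  # A
```

Real multi-hop questions defeat this kind of single-passage matching, which is exactly the bottleneck the authors describe.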
| Subset | Best Accuracy (2021) | Random Baseline |
|---|---|---|
| USMLE (English) | 36.7% | 25.0% |
| TWMLE (Traditional Chinese) | 42.0% | 25.0% |
| MCMLE (Simplified Chinese) | 70.1% | 25.0% |
The relatively higher performance on the Simplified Chinese subset may reflect the larger training set size (over 27,000 training questions) compared to the English subset (approximately 10,000 training questions).
The emergence of large language models transformed performance on MedQA. Unlike the retrieval-and-comprehension pipeline used in the original baselines, LLMs encode vast medical knowledge within their parameters during pretraining, allowing them to answer questions without explicit document retrieval.
The following table summarizes the performance of major models on the MedQA USMLE test set (4-option version unless noted otherwise):
| Model | Organization | Year | MedQA Accuracy | Notes |
|---|---|---|---|---|
| Best baseline (Jin et al.) | MIT | 2021 | 36.7% | Retrieval + comprehension pipeline |
| PubMedGPT / BioMedLM (2.7B) | Stanford | 2022 | 50.3% | Domain-specific pretraining |
| ChatGPT (GPT-3.5) | OpenAI | 2023 | 50.8% (zero-shot) | First widely tested LLM |
| ChatGPT (GPT-3.5, few-shot) | OpenAI | 2023 | 60.2% | Met USMLE passing threshold |
| Flan-PaLM | Google | 2023 | 67.6% | Surpassed prior SOTA by 17%+ |
| Med-PaLM | Google | 2023 | 67.2% | First AI to "pass" USMLE-style questions |
| GPT-4 | OpenAI | 2023 | 86.1% | Exceeded passing score by 20+ points |
| Med-PaLM 2 | Google | 2023 | 86.5% | Instruction-tuned medical model |
| Med-Gemini | Google | 2024 | 91.1% | Uncertainty-guided search strategy |
| Claude Opus 4.1 (Thinking) | Anthropic | 2025 | 93.6% | Strongest non-OpenAI model |
| Gemini 2.5 Pro | Google | 2025 | 93.1% | Advanced reasoning capabilities |
| GPT-5 | OpenAI | 2025 | 96.3% | Near-ceiling performance |
| o3 | OpenAI | 2025 | 96.1% | Reasoning-focused model |
| o1 | OpenAI | 2025 | 96.5% | Highest reported score |
A score of approximately 60% is the commonly cited threshold for passing the USMLE (though the actual exam uses a scaled scoring system rather than a raw percentage). This passing threshold serves as a meaningful reference point for MedQA results.
ChatGPT (GPT-3.5) was notable for being among the first general-purpose LLMs to approach and reach this threshold, achieving 60.2% with few-shot prompting techniques. This result, published in early 2023, generated significant interest in the medical AI community because it demonstrated that a general-purpose language model could perform at the level required of a medical student.
The release of GPT-4 in March 2023 represented a major leap. In the paper "Capabilities of GPT-4 on Medical Challenge Problems" by Harsha Nori and colleagues at Microsoft Research, GPT-4 achieved 86.1% on the MedQA USMLE test set without any medical-specific fine-tuning. This was particularly striking because GPT-4 is a general-purpose model with no special medical training.
Around the same time, Google reported that Med-PaLM 2, a version of PaLM 2 specifically fine-tuned for medical applications, achieved 86.5% on MedQA. Med-PaLM 2 also demonstrated strong performance on physician evaluations, with its answers being preferred over physician-written answers by a panel of doctors on eight of nine evaluation axes in a study of over 1,000 consumer medical questions.
In 2024, Google introduced Med-Gemini, a medical adaptation of the Gemini model family. Med-Gemini achieved 91.1% accuracy on MedQA through a novel uncertainty-guided search strategy. The key innovation was training the model to recognize when it was uncertain about an answer and automatically retrieve supporting clinical literature before committing to a response. This was accomplished through two purpose-built training datasets: MedQA-R, which teaches step-by-step clinical reasoning, and MedQA-RS, which trains the model to search for evidence when confidence is low.
By 2025, the leading models had pushed MedQA accuracy above 95%. OpenAI's o1 reasoning model achieved the highest reported score of 96.5%, followed closely by GPT-5 at 96.3% and o3 at 96.1%. These scores are well above average physician performance on USMLE-style questions and suggest that the MedQA benchmark may be approaching saturation for the most capable models.
The standard evaluation protocol for MedQA uses the 1,273-question English (USMLE) test set. Models are presented with each question and its answer choices, and accuracy is computed as the percentage of questions answered correctly. Most published results use the 4-option version of the dataset, though some studies report results on the 5-option version as well.
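The protocol reduces to exact-match accuracy over the test questions. A minimal scorer (field names are hypothetical) might look like:

```python
def accuracy(predictions, gold):
    """Fraction of questions where the predicted option key matches the gold key.

    predictions / gold: dicts mapping a question id to an option key ("A"-"E").
    """
    assert predictions.keys() == gold.keys()
    correct = sum(predictions[qid] == gold[qid] for qid in gold)
    return correct / len(gold)

# Toy example: 3 of 4 predictions correct.
gold = {"q1": "A", "q2": "C", "q3": "B", "q4": "D"}
preds = {"q1": "A", "q2": "C", "q3": "B", "q4": "A"}
print(accuracy(preds, gold))  # 0.75
```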
Different prompting strategies can significantly affect model performance. Common approaches include:
- Zero-shot prompting: the model receives only the question and answer choices.
- Few-shot prompting: several solved example questions are prepended to the prompt.
- Chain-of-thought prompting: the model is asked to reason step by step before committing to an answer.
- Self-consistency / ensembling: multiple sampled answers are aggregated, typically by majority vote.
These prompting techniques can produce substantial differences in measured performance. For instance, GPT-3.5 improved from roughly 50.8% (zero-shot) to 60.2% (few-shot with ensemble methods) on MedQA.
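Few-shot prompts are typically assembled by prepending worked exemplars to the target question. A minimal sketch, with exemplar wording that is illustrative rather than taken from any specific paper:

```python
def format_question(q, options, answer=None):
    """Render a question with lettered options; leave the answer blank if unsolved."""
    lines = [q] + [f"({key}) {text}" for key, text in sorted(options.items())]
    lines.append(f"Answer: {answer}" if answer else "Answer:")
    return "\n".join(lines)

def build_prompt(exemplars, target_q, target_options):
    """Concatenate solved exemplars, then the unanswered target question."""
    shots = [format_question(q, opts, ans) for q, opts, ans in exemplars]
    shots.append(format_question(target_q, target_options))
    return "\n\n".join(shots)

exemplars = [
    ("Which vitamin deficiency causes scurvy?",
     {"A": "Vitamin C", "B": "Vitamin D"}, "A"),
]
prompt = build_prompt(
    exemplars,
    "Which electrolyte disturbance is typical of Addison disease?",
    {"A": "Hypokalemia", "B": "Hyperkalemia"},
)
print(prompt.endswith("Answer:"))  # True
```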
MedQA is often evaluated alongside other medical question answering benchmarks to provide a more comprehensive picture of model capabilities:
| Benchmark | Description | Question Count | Format |
|---|---|---|---|
| MedQA (USMLE) | US medical licensing exam questions | 1,273 (test) | Multiple choice (4-5 options) |
| MMLU (Medical subset) | Medical topics from Massive Multitask Language Understanding | ~1,089 | Multiple choice (4 options) |
| MedMCQA | Indian medical entrance exam (AIIMS/PGI) | ~4,183 (test) | Multiple choice (4 options) |
| PubMedQA | Biomedical research question answering | 500 (test) | Yes/No/Maybe |
MedQA has played a central role in driving progress in medical AI. The benchmark has served as a primary evaluation tool in some of the most influential papers on LLMs in healthcare, including Google's Med-PaLM and Med-PaLM 2 papers, Microsoft's GPT-4 medical evaluation, and Google's Med-Gemini study. Its adoption across major AI research labs has made it a de facto standard for measuring medical question answering capabilities.
The trajectory of MedQA scores tells a compelling story about the pace of progress in AI. In just four years, performance went from 36.7% (only about 12 points above the 25% random baseline for a 4-choice test) to over 96%. This rapid improvement has fueled both excitement and caution about the potential role of AI in clinical practice.
The fact that AI systems can now outperform the average medical student (and many physicians) on USMLE-style questions has prompted discussions about the future of medical education and assessment. Some educators have questioned whether the USMLE format adequately tests the skills that matter most in clinical practice, given that AI can now excel at it.
Despite its widespread use, MedQA has several important limitations that researchers and practitioners should keep in mind.
With leading models scoring above 95% on MedQA, the benchmark has limited ability to differentiate between top-performing systems. This saturation means that MedQA may no longer be the most informative benchmark for evaluating frontier medical AI models. Researchers have begun developing more challenging benchmarks such as MedXpertQA (introduced in 2025) to address this gap.
The multiple-choice format constrains what MedQA can measure. Real clinical practice requires open-ended reasoning, information gathering through patient interviews, and decision-making under uncertainty. A model that excels at selecting the best answer from four choices may still struggle with generating a complete differential diagnosis or management plan from scratch. The multiple-choice format also means that partial credit is not possible; a model that narrows the options to two reasonable choices but selects the wrong one receives the same score as one that guesses randomly.
Medical competency involves much more than answering knowledge-based questions. MedQA primarily tests the "knows" and "knows how" levels of Miller's Pyramid of clinical competence, neglecting the "shows how" and "does" levels that involve practical skills, communication, and hands-on patient care. Models that perform well on MedQA may not be capable of conducting a patient interview, performing a physical examination, or communicating a diagnosis empathetically.
Studies have shown a significant gap between AI performance on knowledge benchmarks like MedQA (where leading models achieve 84-96% accuracy) and performance on practice-based assessments that simulate real clinical encounters (where the same models often score only 45-69%). This knowledge-practice gap suggests that MedQA scores alone should not be used to assess a model's readiness for clinical deployment.
As MedQA has become a widely used benchmark, concerns about data contamination have grown. The questions in the dataset may appear in the training data of large language models, either directly or in paraphrased form, which could artificially inflate benchmark scores. Some researchers have called for dynamic benchmarking approaches with regularly refreshed question sets to mitigate this issue.
MedQA focuses exclusively on the type of knowledge tested in medical licensing exams. It does not cover many aspects of medical AI that are equally important, such as medical image interpretation, clinical note generation, patient communication, or ethical decision-making in novel situations. Performance on MedQA should be considered alongside results from other benchmarks and clinical evaluations.
Several benchmarks have been developed to address the limitations of MedQA or to complement it, including MedMCQA and PubMedQA (different question sources and answer formats), the medical subsets of MMLU (broader topic coverage), and MedXpertQA (introduced in 2025 as a harder benchmark for frontier models).
MedQA questions are stored in JSONL (JSON Lines) format, where each line represents a single question as a JSON object. The standard fields include the question text, a dictionary of answer options, the correct answer key, and optional metadata such as the relevant medical subject or USMLE step.
The dataset is organized into regional folders (US, Mainland, Taiwan), and each region includes a "qbank" file containing all questions as well as separate files for the official train, dev, and test splits.
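Parsing the JSONL format is straightforward: one JSON object per line. The field names below (`question`, `options`, `answer_idx`) follow the commonly distributed schema and should be verified against the actual release:

```python
import json

# A sample JSONL line in the assumed MedQA schema.
line = (
    '{"question": "Which hormone is elevated in primary hypothyroidism?", '
    '"options": {"A": "TSH", "B": "T3", "C": "T4", "D": "Cortisol"}, '
    '"answer_idx": "A"}'
)

def parse_questions(lines):
    """Parse an iterable of JSONL lines into question records, skipping blanks."""
    return [json.loads(l) for l in lines if l.strip()]

records = parse_questions([line])
print(records[0]["answer_idx"])  # A
```

In practice the same function would be applied to each line of a qbank or split file.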
MedQA is publicly available through several channels, including the authors' GitHub repository and mirrors on dataset hubs such as Hugging Face.
The dataset is released for research purposes. Researchers have used MedQA for a wide variety of tasks beyond simple benchmarking, including studying model calibration, analyzing error patterns, developing retrieval-augmented generation systems, and fine-tuning domain-specific medical models.
The original MedQA paper should be cited as:
Jin, D., Pan, E., Oufattole, N., Weng, W.-H., Fang, H., & Szolovits, P. (2021). What Disease Does This Patient Have? A Large-scale Open Domain Question Answering Dataset from Medical Exams. Applied Sciences, 11(14), 6421.
The paper was first posted as a preprint on arXiv (arXiv:2009.13081) in September 2020 and subsequently published in the MDPI journal Applied Sciences in July 2021.