Med-PaLM
Last reviewed: Jun 3, 2026
Sources: 5 citations
Review status: Source-backed
Revision: v1 · 1,240 words
Med-PaLM is a large language model from Google Research, built in collaboration with DeepMind, that was tuned to answer medical questions. It was first described in a preprint released in late December 2022 and detailed in a peer-reviewed paper, "Large language models encode clinical knowledge," published in the journal Nature in 2023.[1][2] The work is best known for an accompanying benchmark, MultiMedQA, and for being part of the first study in which a large language model surpassed the rough passing mark on questions in the style of the United States Medical Licensing Examination (USMLE).[3] Med-PaLM was a research artifact rather than a clinical product, and its authors were explicit that it remained inferior to physicians and was not ready for real-world use.[1]
By late 2022, general-purpose large language models had shown that broad capabilities could emerge from scale, but medicine posed a harder test. Clinical questions demand factual accuracy, careful reasoning, an awareness of scientific consensus, and sensitivity to potential harm, and existing automated metrics captured none of these well. Google Research and DeepMind set out to measure how much clinical knowledge a model could encode and to probe whether that knowledge could be expressed safely in long-form answers.[1]
The team started from PaLM, a 540-billion-parameter model introduced by Google earlier in 2022, and from its instruction-tuned variant, Flan-PaLM. Med-PaLM grew out of further work to align Flan-PaLM specifically to the medical domain. The first public description came as an arXiv preprint submitted on December 26, 2022, with a team of roughly 30 authors led by Karan Singhal, Shekoofeh Azizi, and Tao Tu.[2] The peer-reviewed version appeared in Nature (volume 620, pages 172 to 180), published online on July 12, 2023.[1]
A central contribution of the work was MultiMedQA, a benchmark assembled to evaluate medical question answering across professional exams, research literature, and everyday consumer queries. It combined six existing open datasets with one new dataset created by the authors.[1] The components were:
| Dataset | Source / domain | Format |
|---|---|---|
| MedQA | USMLE-style exam questions | Multiple choice |
| MedMCQA | Indian medical entrance exams (AIIMS, NEET-PG) | Multiple choice |
| PubMedQA | Biomedical research abstracts | Yes / no / maybe |
| MMLU clinical topics | Knowledge across clinical subjects | Multiple choice |
| LiveQA | Consumer health questions | Long-form answer |
| MedicationQA | Consumer drug questions | Long-form answer |
| HealthSearchQA | Commonly searched consumer questions | Long-form answer |
HealthSearchQA was the dataset the team introduced and released, comprising 3,173 consumer health questions curated from commonly searched queries about medical conditions and symptoms.[1][4] Alongside the datasets, the authors proposed a human evaluation framework in which clinicians and lay raters scored long-form answers along axes such as factual accuracy, comprehension, reasoning, completeness, agreement with scientific consensus, and the likelihood and extent of possible harm.[1]
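As a rough illustration of how per-axis human ratings might be summarized, the sketch below computes the share of answers that raters marked favorably on each axis. It simplifies each judgment to a yes/no value, which is an assumption rather than the paper's actual rating form, and the axis keys and toy ratings are invented for the example.

```python
from collections import defaultdict
from typing import Dict, List

# One rating maps an axis name to a favorable (True) or unfavorable (False)
# judgment. Axis keys and data are hypothetical, not taken from the study.
Rating = Dict[str, bool]


def axis_pass_rates(ratings: List[Rating]) -> Dict[str, float]:
    """Return, for each axis, the fraction of ratings marked favorable."""
    totals: Dict[str, int] = defaultdict(int)
    passes: Dict[str, int] = defaultdict(int)
    for rating in ratings:
        for axis, favorable in rating.items():
            totals[axis] += 1
            passes[axis] += int(favorable)
    return {axis: passes[axis] / totals[axis] for axis in totals}


toy_ratings: List[Rating] = [
    {"consensus": True, "no_harm": True},
    {"consensus": True, "no_harm": True},
    {"consensus": True, "no_harm": False},
    {"consensus": False, "no_harm": False},
]

rates = axis_pass_rates(toy_ratings)
print(rates["consensus"])  # 0.75
print(rates["no_harm"])    # 0.5
```

Reporting one rate per axis, rather than a single blended score, keeps a strong result on one dimension from hiding a weak result on another.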
Med-PaLM was not trained from scratch. It began with Flan-PaLM, the instruction-tuned version of PaLM, which had already learned to follow natural-language instructions across many tasks. On the multiple-choice portions of MultiMedQA, the researchers reported strong results from Flan-PaLM using a mix of prompting strategies, including few-shot, chain-of-thought, and self-consistency prompting.[1]
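Self-consistency prompting can be pictured as a majority vote over several sampled answers. The sketch below is a minimal illustration, not the authors' implementation: `sample_answer` is a hypothetical stand-in for a stochastic model call that runs a chain-of-thought prompt and returns only the final option letter, and `make_canned_sampler` is a fake sampler used so the example runs without a model.

```python
from collections import Counter
from typing import Callable, List


def self_consistent_answer(
    sample_answer: Callable[[str], str],
    question: str,
    num_samples: int = 5,
) -> str:
    """Sample several answers and return the most frequent one."""
    votes: List[str] = [sample_answer(question) for _ in range(num_samples)]
    # most_common(1) returns [(answer, count)]; on a tie, the answer that
    # was seen first wins.
    return Counter(votes).most_common(1)[0][0]


def make_canned_sampler(answers: List[str]) -> Callable[[str], str]:
    """Build a deterministic fake sampler that replays canned answers."""
    remaining = iter(answers)
    return lambda _question: next(remaining)


if __name__ == "__main__":
    sampler = make_canned_sampler(["B", "C", "B", "B", "A"])
    print(self_consistent_answer(sampler, "example question", num_samples=5))  # B
```

The vote is taken over final answers only, so reasoning paths that differ in wording but agree on the option reinforce one another.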
To improve the quality and safety of long-form answers, the team developed a technique they called instruction prompt tuning. It is a parameter-efficient method that aligns a model to a new domain using only a small number of examples. Rather than updating the model's billions of weights, instruction prompt tuning learns a "soft prompt," a short set of trainable vectors that is prepended to the input. That learned prompt is followed by the usual hard prompt of human-written instructions and examples. The soft prompt was trained on a small set of clinician-curated exemplars, which were written by a panel of physicians to demonstrate appropriate, helpful, and safe responses.[1] Applying this technique to Flan-PaLM produced Med-PaLM. Because only the soft prompt is learned, the approach is lightweight and avoids the cost of fully fine-tuning a 540-billion-parameter model.[1]
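The ordering described above (learned soft prompt first, then the hard prompt, then the question) can be sketched with toy embeddings. This is an illustrative sketch only: the embedding width, the soft prompt length, the token lists, and all function names are assumptions made for the example, and no training step is shown.

```python
import random
from typing import Dict, List

Vector = List[float]

EMBED_DIM = 4        # toy width; production models use far wider embeddings
SOFT_PROMPT_LEN = 3  # hypothetical number of trainable prompt vectors


def random_vector(rng: random.Random) -> Vector:
    """Draw one toy embedding vector."""
    return [rng.uniform(-1.0, 1.0) for _ in range(EMBED_DIM)]


def embed(tokens: List[str], table: Dict[str, Vector]) -> List[Vector]:
    """Look up the frozen embedding for each token."""
    return [table[token] for token in tokens]


def build_model_input(
    soft_prompt: List[Vector],
    hard_prompt: List[str],
    question: List[str],
    table: Dict[str, Vector],
) -> List[Vector]:
    """Order the sequence as: soft prompt, then hard prompt, then question."""
    return soft_prompt + embed(hard_prompt, table) + embed(question, table)


def trainable_parameter_count(soft_prompt: List[Vector]) -> int:
    """Only the soft prompt vectors would receive gradient updates."""
    return sum(len(vector) for vector in soft_prompt)


rng = random.Random(0)
hard_prompt = ["answer", "safely", ":"]
question = ["what", "causes", "anemia", "?"]
frozen_table = {token: random_vector(rng) for token in hard_prompt + question}
soft_prompt = [random_vector(rng) for _ in range(SOFT_PROMPT_LEN)]

model_input = build_model_input(soft_prompt, hard_prompt, question, frozen_table)
print(len(model_input))                        # 10 = 3 soft + 3 hard + 4 question
print(trainable_parameter_count(soft_prompt))  # 12 = 3 vectors * 4 dimensions
```

In this toy setup the trainable parameters scale with the soft prompt length times the embedding width, independent of the size of the frozen model, which is the source of the method's low cost.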
On the multiple-choice benchmarks, Flan-PaLM 540B set a new state of the art on every multiple-choice dataset in MultiMedQA. Its headline figure was 67.6% accuracy on MedQA (USMLE-style questions), which surpassed the prior best result by more than 17 percentage points.[1][2] Google described this as the first time an AI system had reached a "passing" level, defined as above roughly 60%, on questions of this style.[3] The authors also reported that comprehension, knowledge recall, and reasoning improved with model scale, comparing variants at 8 billion, 62 billion, and 540 billion parameters.[1]
The more striking finding came from the human evaluation of long-form answers, where instruction prompt tuning made a large difference. When clinicians judged whether answers aligned with the scientific consensus, Med-PaLM's answers were rated as consistent 92.6% of the time, compared with only 61.9% for Flan-PaLM. Physician-written answers scored 92.9% by the same measure, placing Med-PaLM close to the clinician baseline on that axis.[1][5]
| Metric (long-form answers) | Flan-PaLM | Med-PaLM | Clinicians |
|---|---|---|---|
| Agreement with scientific consensus | 61.9% | 92.6% | 92.9% |
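To make the size of these differences concrete, the short calculation below expresses them in percentage points, using only the three figures from the table. The helper name `gap_in_points` is introduced here for illustration.

```python
from typing import Dict

# Share of long-form answers rated as aligned with scientific consensus,
# copied from the table above (values in percent).
consensus_agreement: Dict[str, float] = {
    "Flan-PaLM": 61.9,
    "Med-PaLM": 92.6,
    "Clinicians": 92.9,
}


def gap_in_points(higher: str, lower: str, scores: Dict[str, float]) -> float:
    """Return the difference between two scores in percentage points."""
    # Rounding to one decimal place hides floating-point noise.
    return round(scores[higher] - scores[lower], 1)


print(gap_in_points("Med-PaLM", "Flan-PaLM", consensus_agreement))   # 30.7
print(gap_in_points("Clinicians", "Med-PaLM", consensus_agreement))  # 0.3
```

On this axis, instruction prompt tuning accounts for a gain of about 30.7 points over Flan-PaLM and leaves a gap of about 0.3 points to the clinician baseline.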
These numbers illustrated the paper's main argument: scaling and instruction tuning give a model substantial clinical knowledge, but careful alignment is needed before that knowledge can be expressed in answers that are reliably safe and accurate.[1]
The authors were direct about the gaps that remained. Across the human evaluation framework, Med-PaLM's long-form answers still trailed clinician-written ones on several dimensions, and the study described it as performing "encouragingly" while remaining "inferior to clinicians."[1] Raters identified answers that contained incorrect or incomplete content and, in some cases, responses with the potential to cause harm.[1]
Independent commentators echoed these cautions. In expert reaction collected by the UK's Science Media Centre, researchers noted that accuracy still sat below the level of human specialists, that hallucinations were likely to persist because of the statistical nature of language models, and that benchmarks reflect bias present in training data.[5] Others pointed out a more fundamental limit, observing that practicing medicine involves diagnosis, examination, and treatment over time, not only answering exam-style questions, and that real patients do not select an optimal prompting strategy. Google framed Med-PaLM as a research model that needed considerable further work before it could be used in clinical settings.[3][5]
Google built on this work with Med-PaLM 2, announced in March 2023. The successor model reported markedly higher accuracy on MedQA-style questions, reaching about 85%, which Google characterized as an 18% improvement over the original Med-PaLM and as "expert" examinee level.[3] The research line later fed into Google's broader effort to develop medically capable models within the Gemini family. Med-PaLM and its benchmark, MultiMedQA, remain widely cited reference points for evaluating medical question answering by large language models.[1][3]