Med-PaLM

Updated Jun 27, 2026

Med-PaLM is a large language model from Google Research, built in collaboration with DeepMind, that is specialized for answering medical questions. Introduced through the 2023 Nature paper "Large language models encode clinical knowledge," Med-PaLM was part of the first study in which an AI system scored above the approximate 60% passing mark on questions in the style of the United States Medical Licensing Examination (USMLE), reaching 67.6% accuracy on the MedQA benchmark.^[1]^[2]^[3] The work was first described in an arXiv preprint in December 2022. It introduced the MultiMedQA evaluation benchmark, and the model was built by adapting Google's PaLM model to medicine using a technique called instruction prompt tuning.^[1]^[2] Med-PaLM was a research artifact rather than a clinical product, and its authors were explicit that it remained inferior to physicians and was not ready for real-world use.^[1]

What is Med-PaLM?

Med-PaLM is a medically tuned variant of Google's PaLM family of large language models, designed to encode and recall clinical knowledge and to produce safe, accurate long-form answers to medical questions. By late 2022, general-purpose large language models had shown that broad capabilities could emerge from scale, but medicine posed a harder test. Clinical questions demand factual accuracy, careful reasoning, an awareness of scientific consensus, and sensitivity to potential harm, and existing automated metrics captured none of these well. Google Research and DeepMind set out to measure how much clinical knowledge a model could encode and to probe whether that knowledge could be expressed safely in long-form answers.^[1]

The team's base model was PaLM, a 540-billion-parameter model introduced by Google earlier in 2022, and its instruction-tuned variant, Flan-PaLM. Med-PaLM grew out of further work to align Flan-PaLM specifically to the medical domain. The first public description came as an arXiv preprint submitted on December 26, 2022, with a team of roughly 30 authors led by Karan Singhal, Shekoofeh Azizi, and Tao Tu.^[2] The peer-reviewed version appeared in Nature (volume 620, pages 172 to 180), published online on July 12, 2023.^[1]

What is the MultiMedQA benchmark?

A central contribution of the work was MultiMedQA, a benchmark assembled to evaluate medical question answering across professional exams, research literature, and everyday consumer queries. It combined six existing open datasets with one new dataset created by the authors.^[1] The components were:

Dataset	Source / domain	Format
MedQA	USMLE-style exam questions	Multiple choice
MedMCQA	Indian medical entrance exams (AIIMS, NEET-PG)	Multiple choice
PubMedQA	Biomedical research abstracts	Yes / no / maybe
MMLU clinical topics	Knowledge across clinical subjects	Multiple choice
LiveQA	Consumer health questions	Long-form answer
MedicationQA	Consumer drug questions	Long-form answer
HealthSearchQA	Commonly searched consumer questions	Long-form answer

HealthSearchQA was the dataset the team introduced and released, comprising 3,173 consumer health questions curated from commonly searched queries about medical conditions and symptoms.^[1]^[4] Alongside the datasets, the authors proposed a human evaluation framework in which clinicians and lay raters scored long-form answers along axes such as factual accuracy, comprehension, reasoning, completeness, agreement with scientific consensus, and the likelihood and extent of possible harm.^[1]

How does Med-PaLM work (Flan-PaLM and instruction prompt tuning)?

Med-PaLM was not trained from scratch. It began with Flan-PaLM, the instruction-tuned version of PaLM, which had already learned to follow natural-language instructions across many tasks. On the multiple-choice portions of MultiMedQA, the researchers reported strong results from Flan-PaLM using a mix of prompting strategies, including few-shot, chain-of-thought, and self-consistency prompting.^[1]
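Self-consistency prompting, one of the strategies named above, can be illustrated with a minimal sketch: the model is sampled several times with chain-of-thought prompting, and the most frequent final answer is kept. The function name and the sample answers below are hypothetical and are not taken from the paper; the sketch assumes the final answer letter has already been extracted from each sampled completion.

```python
from collections import Counter


def self_consistency_vote(sampled_answers):
    """Return the most common final answer and its share of the samples."""
    if not sampled_answers:
        raise ValueError("need at least one sampled answer")
    counts = Counter(sampled_answers)
    # most_common(1) returns the single (answer, count) pair with the highest count.
    answer, votes = counts.most_common(1)[0]
    return answer, votes / len(sampled_answers)


# Hypothetical final answers extracted from five sampled reasoning paths.
samples = ["B", "B", "C", "B", "A"]
answer, share = self_consistency_vote(samples)
print(answer, share)
```

The vote share gives a rough sense of how strongly the sampled reasoning paths agree, although the sketch does not use it for anything beyond reporting.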

To improve the quality and safety of long-form answers, the team developed a technique they called instruction prompt tuning, which the paper describes as "a parameter-efficient approach for aligning LLMs to new domains using a few exemplars."^[2] Rather than updating the model's billions of weights, instruction prompt tuning learns a "soft prompt," a short sequence of trainable vectors that is prepended to the input. That learned prompt is followed by the usual hard prompt of human-written instructions and examples. The soft prompt was trained on a small set of clinician-curated exemplars, written by a panel of physicians to demonstrate appropriate, helpful, and safe responses.^[1] Applying this technique to Flan-PaLM produced Med-PaLM. Because only the soft prompt is learned, the approach is lightweight and avoids the cost of fully fine-tuning a 540-billion-parameter model.^[1]
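The idea of a soft prompt can be sketched in a few lines. The example below is a toy illustration only, not the authors' implementation: the embedding width, prompt length, token ids, and function names are all assumptions, and no training step is shown. It demonstrates just the input construction, in which a small block of trainable vectors is placed in front of the frozen embeddings of the hard prompt and question.

```python
import random

EMBED_DIM = 4        # toy embedding width (assumption)
SOFT_PROMPT_LEN = 3  # number of trainable soft-prompt vectors (assumption)


def embed_tokens(token_ids, embedding_table):
    """Look up frozen embeddings for the hard prompt and question tokens."""
    return [embedding_table[t] for t in token_ids]


def build_model_input(soft_prompt, token_ids, embedding_table):
    """Prepend the soft-prompt vectors to the frozen token embeddings."""
    return soft_prompt + embed_tokens(token_ids, embedding_table)


rng = random.Random(0)
# Frozen parameters: a stand-in for the base model's embedding table.
embedding_table = [[rng.uniform(-1, 1) for _ in range(EMBED_DIM)] for _ in range(10)]
# Trainable parameters: in prompt tuning, only these vectors would be updated.
soft_prompt = [[0.0] * EMBED_DIM for _ in range(SOFT_PROMPT_LEN)]

hard_prompt_and_question = [2, 5, 7, 1]  # hypothetical token ids
model_input = build_model_input(soft_prompt, hard_prompt_and_question, embedding_table)

# Sequence length is soft-prompt length plus token count; width is unchanged.
print(len(model_input), len(model_input[0]))
# Number of trainable values in this toy setup.
print(SOFT_PROMPT_LEN * EMBED_DIM)
```

In this toy setup only twelve values would be trainable, which mirrors, at a very small scale, why the approach is cheaper than updating every weight of the base model.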

How well does Med-PaLM do on medical exams?

On the multiple-choice benchmarks, Flan-PaLM 540B set a new state of the art on every multiple-choice dataset in MultiMedQA. The paper reports that "Flan-PaLM achieves state-of-the-art accuracy on every MultiMedQA multiple-choice dataset (MedQA, MedMCQA, PubMedQA, MMLU clinical topics), including 67.6% accuracy on MedQA (US Medical License Exam questions), surpassing prior state-of-the-art by over 17%."^[2] Google described this as the first time an AI system had reached a "passing" level, defined as above roughly 60%, on questions of this style.^[3] The authors also reported that comprehension, knowledge recall, and reasoning improved with model scale, comparing variants at 8 billion, 62 billion, and 540 billion parameters.^[1]
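Multiple-choice accuracy of the kind quoted above is simply the fraction of questions answered correctly. As a minimal sketch, assuming predictions and the answer key are lists of option letters, the score and its comparison with the approximate 60% passing mark can be computed as follows. The data are invented for illustration and are not drawn from MedQA.

```python
PASSING_MARK = 0.60  # approximate passing level cited for USMLE-style questions


def accuracy(predictions, answer_key):
    """Fraction of questions where the predicted option matches the key."""
    if not answer_key or len(predictions) != len(answer_key):
        raise ValueError("predictions and answer_key must be non-empty and equal in length")
    correct = sum(p == a for p, a in zip(predictions, answer_key))
    return correct / len(answer_key)


# Hypothetical predictions and answer key for five questions.
predictions = ["A", "C", "B", "D", "A"]
answer_key = ["A", "C", "B", "D", "B"]

score = accuracy(predictions, answer_key)
print(score, score > PASSING_MARK)
```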

The more striking finding came from the human evaluation of long-form answers, where instruction prompt tuning made a large difference. When clinicians judged whether answers aligned with the scientific consensus, Med-PaLM's answers were rated as consistent 92.6% of the time, compared with only 61.9% for Flan-PaLM. Physician-written answers scored 92.9% by the same measure, placing Med-PaLM close to the clinician baseline on that axis.^[1]^[5]

Metric	Flan-PaLM	Med-PaLM	Clinicians
Agreement with scientific consensus (long-form answers)	61.9%	92.6%	92.9%
MedQA accuracy (USMLE-style, multiple choice)	67.6%	n/a (score reported for Flan-PaLM)	n/a (passing mark approx. 60%)

These numbers illustrated the paper's main argument: scaling and instruction tuning give a model substantial clinical knowledge, but careful alignment is needed before that knowledge can be expressed in answers that are reliably safe and accurate.^[1]

What are the limitations of Med-PaLM?

The authors were direct about the gaps that remained. The paper concludes that "the resulting model, Med-PaLM, performs encouragingly, but remains inferior to clinicians."^[2] Across the human evaluation framework, Med-PaLM's long-form answers still trailed clinician-written ones on several dimensions, and raters identified answers that contained incorrect or incomplete content and, in some cases, responses with the potential to cause harm.^[1]

Independent commentators echoed these cautions. In expert reaction collected by the UK's Science Media Centre, researchers noted that accuracy still sat below the level of human specialists, that hallucinations were likely to persist because of the statistical nature of language models, and that benchmarks reflect bias present in training data.^[5] Others pointed out a more fundamental limit, observing that practicing medicine involves diagnosis, examination, and treatment over time, not only answering exam-style questions, and that real patients do not select an optimal prompting strategy. Google framed Med-PaLM as a research model that needed considerable further work before it could be used in clinical settings.^[3]^[5]

What is Med-PaLM 2?

Google built on this work with Med-PaLM 2, announced in March 2023 and described in the paper "Towards Expert-Level Medical Question Answering with Large Language Models" (arXiv, May 16, 2023; published in Nature Medicine in 2025).^[3]^[6] Med-PaLM 2 combined a stronger base model, medical-domain fine-tuning, and new reasoning strategies such as ensemble refinement. Its headline result was a large jump on exam-style questions: according to the paper, "Med-PaLM 2 scored up to 86.5% on the MedQA dataset, improving upon Med-PaLM by over 19% and setting a new state-of-the-art."^[6] Google characterized this as "expert" examinee level, the first time an AI system reached that bar on USMLE-style questions.^[3]

Med-PaLM 2 also improved on the qualitative side. In a pairwise ranking study using 1,066 consumer medical questions, a panel of physicians preferred Med-PaLM 2's long-form answers to answers written by other physicians on eight of nine evaluation axes, including factual accuracy and relevance.^[6] The model was also judged less likely to produce harmful content than comparison systems in the study.^[6]
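A pairwise ranking study of this kind reduces to counting, for each evaluation axis, how often raters preferred one answer over the other. The sketch below shows one way such a tally could be computed. The axis names, labels, and ratings are hypothetical and do not reproduce the study's data or its statistical analysis.

```python
from collections import Counter, defaultdict


def preference_rates(ratings):
    """Map each axis to the share of ratings given to each preference label.

    ratings is an iterable of (axis, preferred) pairs, where preferred is
    a label such as "model", "physician", or "tie".
    """
    counts = defaultdict(Counter)
    for axis, preferred in ratings:
        counts[axis][preferred] += 1
    return {
        axis: {label: n / sum(c.values()) for label, n in c.items()}
        for axis, c in counts.items()
    }


# Hypothetical ratings on a single axis.
ratings = [
    ("factual accuracy", "model"),
    ("factual accuracy", "physician"),
    ("factual accuracy", "model"),
    ("factual accuracy", "model"),
]
print(preference_rates(ratings))
```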

Model	Year	MedQA (USMLE-style)	Note
Flan-PaLM (Med-PaLM study)	2022 to 2023	67.6%	First to pass the approx. 60% mark
Med-PaLM 2	2023	up to 86.5%	"Expert" examinee level, +19% over Med-PaLM

What is Med-PaLM M (multimodal Med-PaLM)?

Med-PaLM Multimodal (Med-PaLM M) extended the line beyond text. Described in the July 2023 paper "Towards Generalist Biomedical AI," Med-PaLM M is a single large multimodal model that, with one shared set of weights, can flexibly encode and interpret biomedical data spanning clinical language, medical imaging, and genomics.^[7] To evaluate it, the authors curated MultiMedBench, a benchmark of 14 diverse biomedical tasks including medical question answering, chest X-ray and mammography interpretation, radiology report generation and summarization, and genomic variant calling.^[7]

The paper reports that Med-PaLM M "reaches performance competitive with or exceeding the state of the art on all MultiMedBench tasks, often surpassing specialist models by a wide margin."^[7] In a side-by-side clinician review of 246 retrospective chest X-rays, radiologists preferred Med-PaLM M's generated reports over those written by human radiologists in up to 40.50% of cases, which the authors presented as early evidence of potential clinical utility.^[7] Med-PaLM M was framed as a proof of concept for a generalist biomedical AI system rather than a deployable product.^[7]

Legacy

The research line later fed into Google's broader effort to develop medically capable models, including the MedLM and MedGemma offerings. Med-PaLM and its benchmark, MultiMedQA, remain widely cited reference points for evaluating medical question answering by large language models, and the instruction prompt tuning method introduced alongside Med-PaLM influenced later work on parameter-efficient domain alignment.^[1]^[3] For broader context on how these systems are deployed and regulated, see AI in healthcare.

ELI5

Med-PaLM is a computer program from Google that learned a lot about medicine by reading huge amounts of text. It can take the same kind of multiple-choice tests that doctors take to get licensed, and it was the first program of its kind to "pass" that test by getting more than 60% of the answers right. A newer version, Med-PaLM 2, did even better and scored as high as 86.5%, which is close to what a top human test-taker gets. There is even a version called Med-PaLM M that can look at things like chest X-rays, not just words. But the people who built it are careful to say it is a research tool, not a real doctor, and it can still make mistakes, so it should not be used to treat patients on its own.

References

[1] Singhal, K., Azizi, S., Tu, T., et al. "Large language models encode clinical knowledge." *Nature* 620, 172 to 180 (2023). https://www.nature.com/articles/s41586-023-06291-2
[2] Singhal, K., et al. "Large Language Models Encode Clinical Knowledge." arXiv preprint, submitted December 26, 2022. https://arxiv.org/abs/2212.13138
[3] Google, "The Check Up: our latest health AI developments," The Keyword (Google blog), March 14, 2023. https://blog.google/technology/health/ai-llm-medpalm-research-thecheckup/
[4] "HealthSearchQA dataset." Hugging Face. https://huggingface.co/datasets/katielink/healthsearchqa
[5] Science Media Centre, "Expert reaction to study presenting Med-PaLM, a large language model (LLM) for answering medical questions," July 12, 2023. https://www.sciencemediacentre.org/expert-reaction-to-study-presenting-med-palm-a-large-language-model-llm-for-answering-medical-questions-and-a-benchmark-for-assessing-how-well-llms-can-answer-medical-questions/
[6] Singhal, K., Tu, T., Gottweis, J., et al. "Toward expert-level medical question answering with large language models." *Nature Medicine* 31, 943 to 950 (2025); arXiv preprint "Towards Expert-Level Medical Question Answering with Large Language Models," submitted May 16, 2023. https://arxiv.org/abs/2305.09617
[7] Tu, T., Azizi, S., Driess, D., et al. "Towards Generalist Biomedical AI." arXiv preprint, submitted July 26, 2023. https://arxiv.org/abs/2307.14334
