Med-PaLM 2
Last reviewed
Jun 3, 2026
Sources
8 citations
Review status
Source-backed
Revision
v1 · 1,393 words
Med-PaLM 2 is a medical large language model developed by Google Research and Google DeepMind, built on the PaLM 2 foundation model and tuned to answer questions about medicine and health. It was the first AI system reported to reach an "expert" test-taker level on questions styled after the United States Medical Licensing Examination (USMLE), and it served as the research foundation for Google's later commercial healthcare offering, MedLM. Google positioned it as a research and limited-access system rather than a clinical product, and the team repeatedly cautioned that it was not ready for use in patient care.
Med-PaLM 2 is the second generation of Google's Med-PaLM line. The original Med-PaLM, described in late 2022, adapted Flan-PaLM, an instruction-tuned version of Google's first PaLM model, to the medical domain through instruction prompt tuning and a curated set of expert demonstrations. It was the first model reported to exceed the passing mark on a benchmark of USMLE-style questions, reaching roughly 67% accuracy on the MedQA dataset, above the commonly cited passing threshold of about 60% [1][2]. Med-PaLM also introduced an evaluation rubric in which clinicians rated long-form answers along axes such as scientific consensus, reasoning, and potential for harm. That groundwork carried directly into the second generation. For the lineage and the first-generation methods, see Med-PaLM.
Med-PaLM 2 replaces the underlying model with PaLM 2, the general-purpose language model Google unveiled at its I/O developer conference in May 2023. On top of that stronger base, the Med-PaLM team applied medical domain finetuning and new prompting strategies. The most notable of these was "ensemble refinement," in which the model generates several reasoning paths for a question and then conditions on those candidate answers to produce a refined final response. The published work also describes a chain-of-retrieval approach for grounding answers in relevant context [1][3]. The combination of a better base model, targeted finetuning, and improved inference-time reasoning is what the authors credit for the jump in performance over the first Med-PaLM.
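The two-stage idea behind ensemble refinement can be sketched in a few lines. This is a minimal illustration under stated assumptions, not Google's implementation: the `generate` callable, the prompt wording, the `Answer:` output format, the `toy_generate` stub, and the final plurality vote over repeated refinements are all choices made for this sketch.

```python
from collections import Counter
from typing import Callable, List

# Hypothetical interface: any function that maps a prompt to generated text.
Generate = Callable[[str], str]


def extract_answer(text: str) -> str:
    """Return the text after the last 'Answer:' marker (an assumed output format)."""
    return text.rsplit("Answer:", 1)[-1].strip()


def ensemble_refinement(question: str, generate: Generate,
                        num_paths: int = 3, num_refinements: int = 3) -> str:
    """Sample reasoning paths, then condition on them to produce a refined answer."""
    # Stage 1: sample several independent reasoning paths for the question.
    first_prompt = f"Question: {question}\nReason step by step, then end with 'Answer: <choice>'."
    paths: List[str] = [generate(first_prompt) for _ in range(num_paths)]

    # Stage 2: show the model its own candidate paths and ask for a refined response.
    candidates = "\n\n".join(f"Candidate {i + 1}:\n{p}" for i, p in enumerate(paths))
    refine_prompt = (
        f"Question: {question}\n\n{candidates}\n\n"
        "Considering the candidates above, give a refined explanation "
        "and end with 'Answer: <choice>'."
    )
    refined = [extract_answer(generate(refine_prompt)) for _ in range(num_refinements)]

    # Aggregate the refined answers by plurality vote (an assumption of this sketch).
    return Counter(refined).most_common(1)[0][0]


def toy_generate(prompt: str) -> str:
    """Stand-in for a real model call; returns canned text so the sketch runs offline."""
    if "Candidate 1" in prompt:
        return "The candidates mostly agree. Answer: B"
    return "Step-by-step reasoning goes here. Answer: B"
```

Calling `ensemble_refinement("Which option is correct?", toy_generate)` exercises both stages with the offline stub. With a real model, stage 1 would typically use a nonzero sampling temperature so that the reasoning paths differ from one another.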
Med-PaLM 2 scored up to 86.5% on MedQA, the dataset of USMLE-style multiple-choice questions. That was an improvement of more than 19 percentage points over the original Med-PaLM and set a new state of the art at the time [1][3]. Google framed the result as the first time a large language model performed at an expert test-taker level on the benchmark. The model also performed at or near the state of the art on other medical question-answering datasets, and it was reported as the first AI system to reach a passing score on MedMCQA, a set drawn from Indian AIIMS and NEET medical entrance examinations, where it scored 72.3% [4].
| Benchmark | Med-PaLM 2 | Notes |
|---|---|---|
| MedQA (USMLE-style) | up to 86.5% | More than 19 points above Med-PaLM [1][3] |
| MedMCQA (AIIMS/NEET) | 72.3% | Reported as first passing score [4] |
| MMLU clinical topics, PubMedQA | At or near state of the art | Specific figures vary by configuration [1] |
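
The MedQA gap in the table is a difference in percentage points, which is not the same as a relative improvement. The short calculation below uses the article's own figures (86.5%, and "roughly 67%" for the original Med-PaLM), so its outputs are approximate. The function names are illustrative.

```python
def point_gain(old_pct: float, new_pct: float) -> float:
    """Absolute gain in percentage points."""
    return new_pct - old_pct


def relative_gain(old_pct: float, new_pct: float) -> float:
    """Gain expressed as a percentage of the old accuracy."""
    return (new_pct - old_pct) / old_pct * 100


def error_reduction(old_pct: float, new_pct: float) -> float:
    """Share of the old error rate (100 - accuracy) that was eliminated, in percent."""
    return (new_pct - old_pct) / (100 - old_pct) * 100


MED_PALM_MEDQA = 67.0    # "roughly 67%" per the article, so treat as approximate
MED_PALM_2_MEDQA = 86.5  # "up to 86.5%" per the article

print(f"points gained:   {point_gain(MED_PALM_MEDQA, MED_PALM_2_MEDQA):.1f}")
print(f"relative gain:   {relative_gain(MED_PALM_MEDQA, MED_PALM_2_MEDQA):.1f}%")
print(f"error reduction: {error_reduction(MED_PALM_MEDQA, MED_PALM_2_MEDQA):.1f}%")
```

With these rounded inputs the gain is about 19.5 points, which corresponds to removing more than half of the earlier model's errors on the benchmark.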
Multiple-choice accuracy was only part of the assessment. In a pairwise study of 1,066 consumer medical questions, a panel of physicians compared Med-PaLM 2's long-form answers against answers written by other physicians. The model's responses were preferred on eight of nine axes related to clinical utility, a result the authors reported as statistically significant [1][3]. In the peer-reviewed version of the work, a pilot using real-world questions found that specialists preferred Med-PaLM 2 answers to those from generalist physicians about 65% of the time, while both specialist and generalist raters judged the model's answers to be as safe as physician answers [3]. The team also built adversarial question sets designed to surface weaknesses, and Med-PaLM 2 showed marked gains over its predecessor on those harder probes.
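A pairwise comparison of this kind reduces, for each rating axis, to counting which answer the raters preferred. The sketch below shows one way such votes might be summarized. The vote labels, the function name, and the exact two-sided sign test are assumptions made for illustration; the article does not say which statistical test the authors used.

```python
from collections import Counter
from math import comb
from typing import Iterable, Tuple


def preference_summary(votes: Iterable[str]) -> Tuple[float, float]:
    """Return (share of non-tie votes favoring the model, two-sided sign-test p-value).

    Each vote is one of "model", "physician", or "tie" (hypothetical labels).
    """
    counts = Counter(votes)
    model, physician = counts["model"], counts["physician"]
    n = model + physician  # ties are dropped, as in a standard sign test
    if n == 0:
        return 0.5, 1.0
    k = max(model, physician)
    # P(X >= k) for X ~ Binomial(n, 0.5), doubled for a two-sided test and capped at 1.
    tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return model / n, min(1.0, 2 * tail)


# Made-up votes for a single axis, only to show the call; not data from the study.
toy_votes = ["model"] * 8 + ["physician"] * 2 + ["tie"] * 3
share, p_value = preference_summary(toy_votes)
print(f"model preferred in {share:.0%} of non-tie votes, p = {p_value:.3f}")
```

Running the same summary once per axis, over all rated questions, would yield the kind of axis-by-axis comparison the study reports.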
The research was first posted as a preprint, "Towards Expert-Level Medical Question Answering with Large Language Models," on 16 May 2023, with Karan Singhal as lead author and roughly thirty collaborators from Google [1]. A peer-reviewed version, "Toward expert-level medical question answering with large language models," appeared in Nature Medicine, published online in January 2025 and in print in the March 2025 issue [3].
Google first discussed Med-PaLM 2 publicly at The Check Up, its annual health event, on 14 March 2023, where company leaders demonstrated answers to questions such as warning signs of pneumonia and showed that the model could match or exceed clinician-written answers in some cases [5]. On 13 to 14 April 2023, Google Cloud announced that it would open limited access to Med-PaLM 2 for a small group of customers to explore use cases and give feedback, while stressing a focus on safety, equity, and evaluation of unfair bias [2].
In July 2023, the Wall Street Journal reported that Google had been testing Med-PaLM 2 with hospital customers since around April, including the Mayo Clinic, and other coverage noted HCA Healthcare among the systems experimenting with Google's language-model technology [6][7]. Greg Corrado, a senior Google research director who worked on the project, was widely quoted as saying he did not feel the technology was yet at a place where he would want it in his own family's healthcare journey, even as he described its potential to expand the areas of healthcare where AI can help [6]. Google productized the underlying model in December 2023 as MedLM, a family of healthcare foundation models built on Med-PaLM 2 and offered to allowlisted Google Cloud customers through Vertex AI, initially in two sizes for different tasks [8].
A separate research effort, Med-PaLM M, extended the line into multimodal inputs such as chest X-rays and mammograms; it is a distinct model from the text-focused Med-PaLM 2 rather than a direct upgrade of it [4].
Google consistently described Med-PaLM 2 as a research system and not a finished clinical tool. At The Check Up, the company acknowledged that significant gaps remained between benchmark performance and real-world medical use and that the model did not meet its internal bar for a clinical product [5]. Strong exam scores do not guarantee safe behavior on novel or ambiguous cases, and the evaluation specifically included adversarial questions constructed to expose failure modes. Like other large language models, Med-PaLM 2 can produce fluent but incorrect statements, and the team highlighted ongoing concerns about factual accuracy, reasoning errors, and potential bias or harm. Access was deliberately restricted to vetted partners under feedback agreements rather than offered as an open or consumer service, reflecting the regulatory and patient-safety stakes of medical advice.
Med-PaLM 2 sat at the leading edge of Google's medical AI work in 2023, but the field moved quickly. The commercial MedLM line was slated to incorporate models based on Gemini, Google's next-generation multimodal family, as those became available. Google's open medical models then shifted to the Gemma architecture: MedGemma, announced in 2025, is a collection of Gemma-based variants for medical text and image understanding, released as open weights for developers to build on. Later MedGemma releases reported MedQA scores in the high 80s for the larger text variant, illustrating how the expert-level performance Med-PaLM 2 first demonstrated has since been delivered in smaller, more deployable, and openly available models. Med-PaLM 2 remains significant as the system that crossed the expert-level threshold on USMLE-style questions and bridged Google's medical research from PaLM-era models toward this newer generation.