MedHELM
Last reviewed
Jun 8, 2026
Sources
4 citations
Review status
Source-backed
Revision
v1 · 1,418 words
MedHELM (Holistic Evaluation of Large Language Models for Medical Tasks) is a benchmark and evaluation framework that measures how well large language models perform on realistic clinical work. It extends HELM, the Holistic Evaluation of Language Models framework created by the Stanford Center for Research on Foundation Models (CRFM), into the medical domain. Rather than scoring models only on medical-licensing-exam questions, MedHELM organizes evaluation around a clinician-validated taxonomy of tasks that clinicians actually do, and it grades open-ended outputs with an ensemble of model judges that was calibrated against ratings from practicing clinicians [1][2].
The framework was introduced in the 2025 paper "MedHELM: Holistic Evaluation of Large Language Models for Medical Tasks," first posted to arXiv on May 26, 2025 (revised June 2, 2025) and subsequently published in Nature Medicine in 2025 [1][3]. As of mid-2026, MedHELM maintains a public leaderboard that is continually updated with new frontier models, and a runnable codebase that hospitals can use to evaluate models on their own patient data without that data leaving their network [2][4]. The benchmark covers 5 broad categories, 22 subcategories, and 121 distinct clinical tasks, served by 35 underlying benchmarks [1][2].
The central problem MedHELM addresses is that most medical-LLM evaluation had relied on multiple-choice questions drawn from medical licensing exams, such as MedQA (built from questions in the style of the United States Medical Licensing Examination, or USMLE) and similar datasets. The paper's authors observe that "while large language models achieve near-perfect scores on medical licensing exams, these evaluations inadequately reflect the complexity and diversity of real-world clinical practice" [1]. Exam questions are clean, self-contained, and have a single correct answer, whereas real clinical work involves messy electronic health record data, open-ended writing, ambiguity, communication with patients, and administrative tasks that no multiple-choice item captures.
A consequence highlighted by the work is that strong exam performance does not reliably predict how a model handles realistic clinical tasks. Headline numbers on benchmarks like MedQA had saturated near ceiling, leaving little ability to distinguish models or to tell whether a model is actually fit for deployment in a hospital workflow. MedHELM was designed to fill that gap by grounding evaluation in tasks that clinicians validated as representative of their day-to-day work, and by spanning the full breadth of clinical activity rather than diagnosis-style reasoning alone. This positions MedHELM within the broader effort to make the evaluation of AI in healthcare reflect clinical utility instead of test-taking ability [1][2].
The defining feature of MedHELM is its clinician-validated taxonomy of clinical tasks. The taxonomy was developed in collaboration with 29 clinicians spanning 14 medical specialties, who reviewed and validated the structure; the authors report a 96.7 percent agreement rate when clinicians assigned subcategories to their parent categories, and a mean comprehensiveness rating of 4.21 out of 5 [1]. The taxonomy is organized into five top-level categories, each subdividing into subcategories and then into specific tasks [1][2].
| Category | Representative scope |
|---|---|
| Clinical Decision Support | Diagnostic decisions, treatment planning, risk prediction, knowledge support |
| Clinical Note Generation | Visit notes, procedure notes, diagnostic reports, care plans |
| Patient Communication and Education | Patient education materials, care instructions, patient messaging, accessibility |
| Medical Research Assistance | Literature research, clinical data analysis, documentation, quality assurance, trial enrollment |
| Administration and Workflow | Scheduling, financial and billing tasks, workflow organization, care coordination |
In total these five categories contain 22 subcategories and 121 individual tasks, making MedHELM substantially broader than prior medical benchmarks that focused on a narrow slice of clinical reasoning [1][2]. The taxonomy is intended to be extensible, so that new tasks and benchmarks can be slotted into the existing structure over time.
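As an illustration of the category, subcategory, and task hierarchy described above, the following Python sketch shows one way such an extensible three-level structure could be represented. It is not taken from the MedHELM codebase: the class names, the `add_task` and `counts` helpers, and the task names are hypothetical, and the scope labels from the table are used here only as stand-ins for subcategory names.

```python
from dataclasses import dataclass, field


@dataclass
class Subcategory:
    name: str
    tasks: list[str] = field(default_factory=list)


@dataclass
class Category:
    name: str
    subcategories: dict[str, Subcategory] = field(default_factory=dict)


class Taxonomy:
    """Three-level category -> subcategory -> task tree (illustrative only)."""

    def __init__(self) -> None:
        self.categories: dict[str, Category] = {}

    def add_task(self, category: str, subcategory: str, task: str) -> None:
        # Create missing levels on demand so new tasks slot into the tree.
        cat = self.categories.setdefault(category, Category(category))
        sub = cat.subcategories.setdefault(subcategory, Subcategory(subcategory))
        if task not in sub.tasks:
            sub.tasks.append(task)

    def counts(self) -> tuple[int, int, int]:
        # Returns (number of categories, subcategories, tasks).
        subs = [s for c in self.categories.values() for s in c.subcategories.values()]
        return len(self.categories), len(subs), sum(len(s.tasks) for s in subs)


taxonomy = Taxonomy()
# Category and scope labels come from the table; task names are hypothetical.
taxonomy.add_task("Clinical Decision Support", "Risk prediction", "example-task-a")
taxonomy.add_task("Clinical Decision Support", "Treatment planning", "example-task-b")
taxonomy.add_task("Administration and Workflow", "Scheduling", "example-task-c")
print(taxonomy.counts())
```

Populated in full, a structure of this kind would report the 5 categories, 22 subcategories, and 121 tasks cited for MedHELM; the three example tasks here exercise only a small corner of it.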
To populate the taxonomy, MedHELM assembles a suite of 35 benchmarks: 17 that already existed and 18 that were newly formulated for the project (the new set comprising roughly 5 reformulations of prior data and 13 entirely new benchmarks) [1]. The benchmarks vary in how openly they can be accessed, reflecting the reality that much clinically meaningful data is sensitive: the paper describes a mix of public, gated (credential-required, such as EHRSHOT), and private datasets, the last of which include real clinical data held inside Stanford Health Care [1][4]. Public examples used in the framework include PubMedQA for biomedical question answering [4]. The benchmarks split into closed-ended ones (around 22) that have well-defined answers and open-ended ones (around 13) that require generated free text [1].
Scoring is matched to the task type. Closed-ended tasks use objective metrics such as exact-match accuracy and micro-averaged F1 for multi-label classification [1][2]. For open-ended generation, MedHELM uses an "LLM-jury": an ensemble of three model judges (in the original paper, GPT-4o, Claude 3.7 Sonnet, and Llama 3.3 70B) that rate each response on 1-to-5 Likert scales for accuracy, completeness, and clarity [1]. To validate this automated approach, the authors collected clinician ratings on subsets of benchmarks (for example ACI-Bench and MEDIQA-QA). The LLM-jury reached an intraclass correlation (ICC) of 0.47 with clinician ratings, which slightly exceeded the average clinician-to-clinician agreement of 0.43 and outperformed traditional text-overlap metrics such as ROUGE-L (0.36) and BERTScore (0.44) [1]. Models are compared using pairwise win rates and macro-averaged scores, and the framework also tracks estimated computational cost so that performance can be weighed against price [1].
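The scoring rules described above can be sketched in a few lines of Python. This is a minimal illustration and not the MedHELM implementation: the function names, the trimming and lowercasing applied before exact matching, and the choice to combine jury ratings with a plain mean over criteria and judges are all assumptions, and the example ratings are invented.

```python
from statistics import mean


def exact_match_accuracy(predictions: list[str], references: list[str]) -> float:
    """Fraction of predictions equal to the reference (assumed normalization:
    surrounding whitespace trimmed, case ignored)."""
    pairs = list(zip(predictions, references))
    return sum(p.strip().lower() == r.strip().lower() for p, r in pairs) / len(pairs)


def micro_f1(predicted: list[set[str]], gold: list[set[str]]) -> float:
    """Micro-averaged F1 for multi-label outputs: pool true positives, false
    positives, and false negatives over all instances, then compute F1 once."""
    tp = sum(len(p & g) for p, g in zip(predicted, gold))
    fp = sum(len(p - g) for p, g in zip(predicted, gold))
    fn = sum(len(g - p) for p, g in zip(predicted, gold))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def jury_score(ratings: dict[str, dict[str, int]]) -> float:
    """Average the 1-to-5 ratings over criteria for each judge, then over judges.
    The plain mean is an assumption, not a documented MedHELM rule."""
    per_judge = [mean(criteria.values()) for criteria in ratings.values()]
    return mean(per_judge)


# Invented ratings for a single response; the judge labels follow the article.
example_ratings = {
    "GPT-4o": {"accuracy": 4, "completeness": 3, "clarity": 5},
    "Claude 3.7 Sonnet": {"accuracy": 5, "completeness": 4, "clarity": 4},
    "Llama 3.3 70B": {"accuracy": 4, "completeness": 4, "clarity": 4},
}

print(exact_match_accuracy(["Yes", "no ", "maybe"], ["yes", "no", "no"]))
print(micro_f1([{"a", "b"}, {"c"}], [{"a"}, {"c", "d"}]))
print(jury_score(example_ratings))
```

Micro-averaging pools counts across instances before computing F1, so instances with many labels weigh more than they would under a per-instance average.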
The original paper evaluated nine frontier LLMs: DeepSeek R1, OpenAI's o3-mini, Claude 3.7 Sonnet, Claude 3.5 Sonnet, GPT-4o, Gemini 2.0 Flash, GPT-4o mini, Llama 3.3 70B Instruct, and Gemini 1.5 Pro [1]. Advanced reasoning models led the rankings: DeepSeek R1 won about 66 percent of head-to-head comparisons (macro-average 0.75) and o3-mini won about 64 percent while posting the highest macro-average of 0.77, driven by strong clinical-decision-support results [1]. A notable cost finding is that Claude 3.5 Sonnet "achieved comparable results at 40% lower estimated computational cost," reaching a similar win rate to the reasoning models without their reasoning-token overhead [1].
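The fact that one model can have the best win rate while another has the best macro-average follows from how the two summaries are built. The Python sketch below illustrates this under stated assumptions: the scores are hypothetical and not figures from the paper, the model and benchmark names are placeholders, and the win-rate definition (share of benchmark-by-opponent comparisons won, with ties counted as half) is one common convention rather than a confirmed description of the MedHELM procedure.

```python
from statistics import mean

# Hypothetical per-benchmark scores on a 0-to-1 scale; not data from the paper.
scores = {
    "model_a": {"bench_1": 0.80, "bench_2": 0.80, "bench_3": 0.50},
    "model_b": {"bench_1": 0.78, "bench_2": 0.78, "bench_3": 0.90},
    "model_c": {"bench_1": 0.60, "bench_2": 0.60, "bench_3": 0.40},
}


def macro_average(model: str) -> float:
    """Unweighted mean of a model's per-benchmark scores."""
    return mean(scores[model].values())


def win_rate(model: str) -> float:
    """Share of (benchmark, opponent) comparisons the model wins; ties count half."""
    outcomes = []
    for bench, own in scores[model].items():
        for other, other_scores in scores.items():
            if other == model:
                continue
            theirs = other_scores[bench]
            outcomes.append(1.0 if own > theirs else 0.5 if own == theirs else 0.0)
    return mean(outcomes)


for name in scores:
    print(f"{name}: win rate {win_rate(name):.2f}, macro-average {macro_average(name):.2f}")
```

In this toy data, `model_a` wins more comparisons because it edges out `model_b` on two benchmarks, while `model_b` has the higher macro-average because of its large margin on the third. Win rate rewards winning often; macro-average rewards winning by a lot.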
Performance was uneven across the realistic clinical spectrum, which is one of the paper's main messages. Models scored highest on Clinical Note Generation and on Patient Communication and Education, performed moderately on Medical Research Assistance and Clinical Decision Support, and scored lowest on Administration and Workflow [1]. This pattern reinforces the conclusion that strong aggregate or exam scores do not translate uniformly to every clinical task, and that a single headline number can hide weaknesses in operationally important areas. Because MedHELM is maintained as a living leaderboard, its rankings evolve as newer models are added; by 2026 the public leaderboard had been extended with later frontier systems beyond the nine in the original paper [2].
MedHELM is a direct medical specialization of HELM, inheriting HELM's philosophy of holistic, transparent, and reproducible evaluation across many scenarios rather than a single score, and it is distributed through the same open HELM codebase maintained by Stanford CRFM [2][4]. Its distinguishing contributions are the clinician-grounded taxonomy, the inclusion of private real-world clinical datasets alongside public ones, and the validated LLM-jury method for grading open-ended medical text [1]. The work was a large multi-institutional effort led by Nigam H. Shah of Stanford, with Suhana Bedi as first author. It involved dozens of co-authors and collaboration among CRFM, Stanford Health Care's Technology and Digital Solutions group, Microsoft Health and Life Sciences, and Stanford clinical faculty across multiple departments [1][2].
In the landscape of clinical-AI evaluation, MedHELM is positioned as a successor to exam-style benchmarks like MedQA: where those measured medical knowledge recall, MedHELM measures performance on the tasks clinicians perform, and it explicitly demonstrates that the two do not always agree. Its design also enables hospitals to run the same standardized evaluation on their own infrastructure against local inference servers, so that institutions can assess candidate models on their own patient data privately [2][4]. As of 2026, stewardship of MedHELM has moved toward an independent community model, broadening participation beyond its original creators while the framework remains openly licensed [2].