HealthBench

Overview
Full name	HealthBench
Abbreviation	HealthBench
Description	Open-source benchmark for evaluating large language models on realistic, multi-turn healthcare conversations using physician-written rubrics
Release date	May 12, 2025 (paper May 13, 2025)
Latest version	1.0
Authors	Rahul K. Arora, Jason Wei, Rebecca Soskin Hicks, Preston Bowman, Joaquin Quiñonero-Candela, Foivos Tsimpourlas, Michael Sharman, Meghan Shah, Andrea Vallone, Alex Beutel, Johannes Heidecke, Karan Singhal
Organization	OpenAI
Technical details
Type	Healthcare AI, medical question answering, clinical conversation evaluation
Modality	Text, multi-turn dialogue
Task format	Open-ended responses graded against physician-authored rubrics
Number of conversations	5,000
Rubric criteria	48,562 unique criteria
Average turns per conversation	2.6
Evaluation axes	Accuracy, completeness, communication quality, instruction following, context awareness
Themes	Emergency referrals, context-seeking, global health, health data tasks, expertise-tailored communication, responding under uncertainty, response depth
Languages	49 (English, Spanish, French, Mandarin, Hindi, Arabic, Amharic, Nepali, Swahili, and others)
Specialties	26 medical specialties
Grader	GPT-4.1 used as the model grader
Performance
Baseline score	0.16 (GPT-3.5 Turbo)
State-of-the-art at release	0.60 (OpenAI o3)
HealthBench Hard top score at release	0.32 (OpenAI o3)
Saturated	No
Resources
Website	openai.com/index/healthbench
Paper	arXiv:2505.08775
Code	OpenAI simple-evals on GitHub
License	MIT License for the evaluation code

HealthBench is an open-source benchmark released by OpenAI on May 12, 2025, that evaluates how large language models handle realistic, multi-turn healthcare conversations. The benchmark contains 5,000 conversations between a model and a simulated patient or healthcare professional, and each conversation is paired with a custom rubric written by a practicing physician. Across the dataset there are 48,562 unique rubric criteria, so a typical conversation is graded against roughly 11 to 12 specific behaviors rather than a single right answer.^[1]^[2]

HealthBench moves the field away from multiple-choice exam questions and toward open-ended dialogue evaluation. The paper, HealthBench: Evaluating Large Language Models Towards Improved Human Health, was posted to arXiv on May 13, 2025 (arXiv:2505.08775) with Karan Singhal as senior author and Rahul K. Arora and Jason Wei as lead authors. The benchmark code ships in OpenAI's public evaluations repository, simple-evals.^[1]^[3]

Background and motivation

Before HealthBench, most AI healthcare evaluations used closed-form formats: USMLE-style multiple-choice questions in MedQA, multiple-choice medical exam questions in MedMCQA, or yes/no/maybe questions about research abstracts in PubMedQA. Those benchmarks had largely saturated by 2024, with frontier models scoring above 90% on MedQA. They also missed skills that matter in real clinical work: gathering missing context, talking to non-experts, escalating to emergency care, and admitting uncertainty. The HealthBench team set out to build an evaluation that reflects how clinicians and patients actually use generative AI, framed around meaningfulness, trustworthiness, and unsaturated headroom.^[2]^[3]

Construction

Physician network

HealthBench was built with 262 physicians who collectively practiced in 60 countries, trained in 26 medical specialties, and were fluent in 49 languages. The contributing group skews experienced: about 50% were independent or staff physicians, 17% fellows, 23% senior residents (PGY3 or higher), and 10% junior residents. OpenAI received over 1,000 applications and selected about 26% based on response quality during onboarding. Every contributor was paid. Development took about eleven months. Physicians wrote example conversations, drafted the rubric criteria for each example, scored model responses during pilot rounds, and validated the model grader.^[2]^[3]

Conversations and rubrics

Each example is a multi-turn dialogue. The mean conversation has 2.6 turns; roughly 58% of examples are single-turn questions and the remainder run two or more turns, which lets HealthBench probe how models handle follow-up questions and missing information. The user side is sometimes a layperson asking about symptoms and sometimes a clinician asking for help with documentation, triage, or test interpretation.^[2]

Each conversation has its own rubric written by the physician who created the example. Criteria are positive ("the response should advise calling emergency services") or negative ("the response should not recommend ibuprofen given the patient's stated kidney disease"), and each criterion carries a weight between -10 and 10 reflecting clinical importance, with negative weights marking behaviors to avoid. The average example has about 11.5 criteria, with a range of 2 to 48.^[1]^[2]

Themes

HealthBench groups examples into seven themes that map to clinically meaningful skills.

Theme	What it tests
Emergency referrals	Recognizing acute conditions and advising the user to seek urgent care without over-escalating routine cases
Context-seeking	Asking the right follow-up questions when the prompt is missing information that matters for safety
Global health	Adapting advice to different national healthcare systems and resource levels
Health data tasks	Producing structured outputs such as discharge summaries, lab interpretations, or coding tasks
Expertise-tailored communication	Matching the response register to the user (patient, nurse, physician, specialist)
Responding under uncertainty	Hedging appropriately when evidence is weak, refusing when a confident answer would be unsafe
Response depth	Choosing how much detail to give based on the user's request, not over-explaining or under-explaining

Frontier models tend to do well on emergency referrals and expertise-tailored communication while still struggling with context-seeking, global health, and responding under uncertainty.^[2]^[4]

Evaluation axes

Every rubric criterion is tagged with one of five behavioral axes. The distribution gives a sense of what HealthBench actually rewards.

Axis	Approximate share of criteria	Focus
Completeness	39%	Including all clinically important information, especially safety information
Accuracy	33%	Factual correctness aligned with current medical consensus
Context awareness	16%	Picking up on cues in the prompt and asking for missing information
Communication quality	8%	Clarity, structure, appropriate medical literacy level
Instruction following	4%	Adhering to format, length, or scope constraints set by the user

Completeness and accuracy together account for roughly 72% of the criteria (39% plus 33%), which reflects what the physician contributors flagged as most often consequential. Earlier models such as Claude 3.5 Sonnet and GPT-4o tended to be strong on communication quality but weak on completeness; o3 closed much of that gap, which is the main reason it leads the leaderboard.^[2]^[4]

Scoring methodology

HealthBench uses a model grader rather than human raters. For each model response the grader (GPT-4.1) reads the rubric criteria one by one and decides whether each is met. Met criteria contribute their weight, which is negative for undesirable behaviors; unmet criteria contribute nothing. The sum is divided by the maximum possible score for that conversation (the total weight of its positive criteria), and the per-conversation scores are averaged across all 5,000 examples and clipped to the range 0 to 1 to give the overall score.^[1]^[2]
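The rubric arithmetic can be sketched in a few lines of Python. This is a minimal illustration, not the benchmark's own code: it assumes that only met criteria contribute points (negative points for undesirable behaviors), that the denominator is the sum of positive points, and that the mean across examples is clipped to the range 0 to 1. The names Criterion, example_score, and overall_score, and the example rubric, are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Criterion:
    """One rubric line (hypothetical structure for illustration)."""
    text: str
    points: int  # assumed range -10..10; negative marks a behavior to avoid


def example_score(criteria, met):
    """Score one conversation: points of met criteria / total positive points."""
    max_points = sum(c.points for c in criteria if c.points > 0)
    if max_points == 0:
        raise ValueError("rubric needs at least one positive criterion")
    achieved = sum(c.points for c, is_met in zip(criteria, met) if is_met)
    return achieved / max_points  # can be negative for a harmful response


def overall_score(scores):
    """Mean of per-example scores, clipped to [0, 1] (assumed aggregation)."""
    mean = sum(scores) / len(scores)
    return min(1.0, max(0.0, mean))


# Illustrative rubric; the grader's met/not-met decisions are supplied by hand.
rubric = [
    Criterion("Advises calling emergency services", 10),
    Criterion("Asks how long the symptoms have lasted", 5),
    Criterion("Recommends ibuprofen despite stated kidney disease", -8),
]
score = example_score(rubric, [True, False, True])  # (10 - 8) / 15
```

In this sketch a response that triggers a heavily weighted negative criterion can score below zero on a single conversation, which is why the clipping is applied only after averaging.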

The paper validates this approach with a meta-evaluation. Physicians independently graded a large set of model responses against the rubric criteria, producing 60,896 expert grades. The team scored the GPT-4.1 grader's agreement with physician grades using macro-F1 and compared that with how well physicians agreed with each other. GPT-4.1 reached a macro-F1 of about 0.71 overall, exceeded the average physician on five of the seven themes, and placed in the upper half of physicians on six of seven themes. Physician-physician agreement ranged roughly from 55% to 75% across themes, which is a useful reminder that credentialed clinicians often disagree about what a model should have said.^[2]^[3]
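For readers unfamiliar with macro-F1, the sketch below shows how it could be computed for binary met / not-met judgments: F1 is computed separately for the "met" class and the "not met" class, then the two are averaged with equal weight. It assumes the physician grades are treated as the reference labels; the function names and the sample labels are hypothetical and are not taken from the paper.

```python
def f1_for_class(reference, predicted, cls):
    """F1 for one class, treating `cls` as the positive label."""
    pairs = list(zip(reference, predicted))
    tp = sum(1 for r, p in pairs if r == cls and p == cls)
    fp = sum(1 for r, p in pairs if r != cls and p == cls)
    fn = sum(1 for r, p in pairs if r == cls and p != cls)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_f1(reference, predicted):
    """Unweighted mean of the F1 scores for 'met' (True) and 'not met' (False)."""
    return (f1_for_class(reference, predicted, True)
            + f1_for_class(reference, predicted, False)) / 2


# Made-up judgments on six criteria: physician as reference, model grader as prediction.
physician = [True, True, False, False, True, False]
grader = [True, False, False, False, True, True]
agreement = macro_f1(physician, grader)
```

Averaging the two classes equally keeps a grader from looking good simply by always predicting the more common label.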

Variants

HealthBench is published in three flavors that share the same dataset but emphasize different things.

Variant	Size	Purpose
HealthBench	5,000 conversations	Full evaluation across all themes and axes
HealthBench Hard	1,000 conversations	A difficulty-curated subset where frontier models score much lower; built to leave headroom for future models
HealthBench Consensus	34 physician-validated criteria across the dataset	Focused measurement of safety-critical behaviors where there is strong physician agreement on the right answer

At release, OpenAI o3 scored 0.60 on the full benchmark and only 0.32 on HealthBench Hard. Many models scored zero on HealthBench Hard, which is by design: the subset was filtered to keep examples that current frontier systems failed.^[1]^[2]

Performance at release

The headline scores from the May 2025 release paper are below. The full benchmark score is reported on a 0 to 1 scale (some sources convert to percentage).

Model	HealthBench overall	HealthBench Hard	Notes
OpenAI o3	0.60	0.32	Top score on both the full benchmark and HealthBench Hard at release
GPT-4.1	0.48	Reported lower; non-zero	Strong cost-performance ratio
GPT-4.1 nano	Above GPT-4o	Reported lower	Roughly 25 times cheaper than GPT-4o while scoring higher overall
o4-mini	Above GPT-4o	Reported lower	Smaller reasoning model evaluated alongside o3
o1	0.42	Lower	2024 reasoning model
GPT-4o (Aug 2024)	0.32	Near zero	Previous OpenAI frontier baseline
Claude 3.7 Sonnet	Reported below o3	Reported below o3	Strong communication quality, weaker on completeness
Gemini 2.5 Pro	Competitive with frontier	Reported below o3	Strong on multilingual and global health
Grok 3	Competitive on some themes	Reported below o3	Mixed performance across axes
GPT-3.5 Turbo	0.16	Near zero	Two-year-old baseline

The gap between o3 (0.60) and GPT-4o (0.32) is 0.28 points, wider than the 0.16-point gap between GPT-4o and GPT-3.5 Turbo (0.16). The OpenAI authors cite this as evidence that progress on healthcare conversations sped up between late 2024 and spring 2025.^[1]^[3]

Reliability and worst-at-k

The paper also reports a worst-at-k metric, which samples a model multiple times on the same prompt and takes the lowest-scoring response. For HealthBench, k = 16 is the headline configuration. The worst-at-16 score for o3 is roughly 0.40, compared to its mean of 0.60. That drop of about a third indicates real variance: frontier models can fail badly on individual responses even when their average looks strong. By the same metric, o3's worst-at-16 score is more than twice GPT-4o's.^[2]
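One way to estimate a worst-at-k score from a fixed pool of samples is sketched below. It assumes n scored samples per prompt and computes the expected minimum over all size-k subsets drawn without replacement, then averages across prompts; the paper's exact estimator may differ, and the function names and numbers are hypothetical.

```python
from math import comb


def expected_worst_of_k(scores, k):
    """Expected minimum score when k of the n samples are drawn without replacement."""
    n = len(scores)
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and the number of samples")
    ordered = sorted(scores)
    # The i-th smallest score is the minimum of a subset exactly when the
    # other k - 1 members come from the n - 1 - i larger-or-equal scores.
    weighted = sum(s * comb(n - 1 - i, k - 1) for i, s in enumerate(ordered))
    return weighted / comb(n, k)


def worst_at_k(samples_per_prompt, k):
    """Average the per-prompt expected worst-of-k score over all prompts."""
    per_prompt = [expected_worst_of_k(s, k) for s in samples_per_prompt]
    return sum(per_prompt) / len(per_prompt)


# Made-up scores: two prompts, three sampled responses each.
samples = [[0.2, 0.8, 0.5], [1.0, 1.0, 0.4]]
reliability = worst_at_k(samples, k=2)
```

With k = 1 the estimate reduces to the ordinary mean, and with k equal to the number of samples it reduces to the plain minimum, so the metric interpolates between average and worst-case behavior.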

Physician baseline experiments

One of the more striking findings involves physicians writing responses themselves. In the first round (using September 2024 model references), physicians who could see a reference answer from GPT-4o or o1 improved on that reference about 56% of the time. In the second round (using April 2025 references from o3 and GPT-4.1), physicians improved 46.8% of the time and worsened the reference 47.7% of the time. Within statistical noise, that means by April 2025 physicians could no longer reliably add quality on top of frontier model responses on these specific prompts.^[2]^[3]

This result is narrow but important. It does not say AI is better than physicians at medicine. It says that on these particular prompts and this particular rubric, physicians editing a frontier model's response were about as likely to worsen it as to improve it. The paper is careful to note that physicians were not seeing real patients, did not have lab data or imaging, and were graded by the same rubric used to score the models.

Reception and critiques

HealthBench drew attention from both AI research and clinical communities. Nigam Shah at Stanford Health Care described the benchmark as unprecedented in scale and directionally aligned with academic research. It was covered in MobiHealthNews, MarkTechPost, and several health policy outlets.^[5]^[6]

A 2025 critical review, A Critical Evaluation of HealthBench (arXiv:2508.00081), raised concerns that the rubric grading captures behavioral conformance more than patient-safety outcomes and that the synthetic conversations may not reflect the messier presentation of real patients. A peer-reviewed perspective in npj Digital Medicine in 2025 summarized the consensus: HealthBench advances AI evaluation in healthcare but is not yet evidence that any model is clinically ready for autonomous use.^[7]^[8]

Subsequent uptake

In the year after release, HealthBench became a standard reporting line in frontier model release notes. OpenAI's GPT-5 family, Anthropic's Claude releases, and Google DeepMind's Gemini updates all began publishing HealthBench scores. OpenAI itself launched a follow-up called HealthBench Professional in April 2026 that targets working clinicians using consultation, documentation, and research tasks. The framework is open enough that other groups can plug in their own conversations and rubrics while reusing the model-graded scoring infrastructure.^[6]^[9]

Limitations

The authors are explicit about what HealthBench does not measure. The conversations are predominantly synthetic, the dataset is text-only (no imaging, no structured EHR data), and the model grader can be gamed by responses that satisfy rubric criteria without being good. The English share of the dataset is high relative to the languages spoken by contributing physicians, and several low-resource languages have only a handful of examples. None of the criteria measure long-term outcomes; they assess only the contents of a single response or short conversation.^[2]

Significance

HealthBench is the first widely adopted healthcare AI benchmark to combine open-ended conversational evaluation, physician-written rubrics, model-graded scoring validated against expert agreement, and an explicit hard subset designed to leave headroom. It moved the standard for medical AI evaluation past saturated multiple-choice exams during a period when health-focused LLMs were improving quickly. The benchmark's longer-term impact will probably come from its rubric methodology, which gives developers feedback at the level of specific behaviors rather than a single opaque score. That granularity is more useful for fixing problems than a leaderboard number, and it is the part of HealthBench most likely to be copied into other domains.

References

1. OpenAI. "Introducing HealthBench." May 12, 2025. https://openai.com/index/healthbench/
2. Arora, R. K., Wei, J., Soskin Hicks, R., Bowman, P., Quiñonero-Candela, J., Tsimpourlas, F., Sharman, M., Shah, M., Vallone, A., Beutel, A., Heidecke, J., and Singhal, K. "HealthBench: Evaluating Large Language Models Towards Improved Human Health." arXiv:2505.08775, May 13, 2025. https://arxiv.org/abs/2505.08775
3. Singhal, Karan. "HealthBench." Personal site notes. https://www.karansinghal.com/notes/healthbench/
4. UK AI Safety Institute. "HealthBench evaluation in Inspect Evals." https://ukgovernmentbeis.github.io/inspect_evals/evals/knowledge/healthbench/
5. MarkTechPost. "OpenAI Releases HealthBench: An Open-Source Benchmark for Measuring the Performance and Safety of Large Language Models in Healthcare." May 12, 2025. https://www.marktechpost.com/2025/05/12/openai-releases-healthbench-an-open-source-benchmark-for-measuring-the-performance-and-safety-of-large-language-models-in-healthcare/
6. "OpenAI's HealthBench in Action: Evaluating an LLM-Based Medical Assistant on Realistic Clinical Queries." arXiv:2509.02594. https://arxiv.org/abs/2509.02594
7. "A Critical Evaluation of HealthBench." arXiv:2508.00081. https://arxiv.org/abs/2508.00081
8. "HealthBench: Advancing AI evaluation in healthcare, but not yet clinically ready." PubMed Central, PMC12547120. https://pmc.ncbi.nlm.nih.gov/articles/PMC12547120/
9. OpenAI. "HealthBench Professional." April 2026. https://cdn.openai.com/dd128428-0184-4e25-b155-3a7686c7d744/HealthBench-Professional.pdf
