HealthBench
Last reviewed: May 10, 2026
Sources: 9 citations
Review status: Source-backed
Revision: v2 · 2,474 words
| HealthBench | |
|---|---|
| Overview | |
| Full name | HealthBench |
| Abbreviation | HealthBench |
| Description | Open-source benchmark for evaluating large language models on realistic, multi-turn healthcare conversations using physician-written rubrics |
| Release date | May 12, 2025 (paper May 13, 2025) |
| Latest version | 1.0 |
| Authors | Rahul K. Arora, Jason Wei, Rebecca Soskin Hicks, Preston Bowman, Joaquin Quiñonero-Candela, Foivos Tsimpourlas, Michael Sharman, Meghan Shah, Andrea Vallone, Alex Beutel, Johannes Heidecke, Karan Singhal |
| Organization | OpenAI |
| Technical details | |
| Type | Healthcare AI, medical question answering, clinical conversation evaluation |
| Modality | Text, multi-turn dialogue |
| Task format | Open-ended responses graded against physician-authored rubrics |
| Number of conversations | 5,000 |
| Rubric criteria | 48,562 unique criteria |
| Average turns per conversation | 2.6 |
| Evaluation axes | Accuracy, completeness, communication quality, instruction following, context awareness |
| Themes | Emergency referrals, context-seeking, global health, health data tasks, expertise-tailored communication, responding under uncertainty, response depth |
| Languages | 49 (English, Spanish, French, Mandarin, Hindi, Arabic, Amharic, Nepali, Swahili, and others) |
| Specialties | 26 medical specialties |
| Grader | GPT-4.1 used as the model grader |
| Performance | |
| Baseline score | 0.16 (GPT-3.5 Turbo) |
| State-of-the-art at release | 0.60 (OpenAI o3) |
| HealthBench Hard top score at release | 0.32 (OpenAI o3) |
| Saturated | No |
| Resources | |
| Website | openai.com/index/healthbench |
| Paper | arXiv:2505.08775 |
| Code | OpenAI simple-evals on GitHub |
| License | MIT License for the evaluation code |
HealthBench is an open-source benchmark released by OpenAI on May 12, 2025, that evaluates how large language models handle realistic, multi-turn healthcare conversations. The benchmark contains 5,000 conversations between a model and a simulated patient or healthcare professional, and each conversation is paired with a custom rubric written by a practicing physician. Across the dataset there are 48,562 unique rubric criteria, and a typical conversation is graded against roughly 11 to 12 specific behaviors rather than a single right answer.[1][2]
HealthBench moves the field away from multiple-choice exam questions and toward open-ended dialogue evaluation. The paper, HealthBench: Evaluating Large Language Models Towards Improved Human Health, was posted to arXiv on May 13, 2025 (arXiv:2505.08775) with Karan Singhal as senior author and Rahul K. Arora and Jason Wei as lead authors. The benchmark code ships in OpenAI's public evaluations repository, simple-evals.[1][3]
Before HealthBench, most AI healthcare evaluations used shorter, closed-form formats: USMLE-style multiple-choice questions in MedQA, multiple-choice exam questions in MedMCQA, and yes/no/maybe research questions in PubMedQA. Those benchmarks had largely saturated by 2024, with frontier models scoring above 90% on MedQA. They also missed skills that matter in real clinical work: gathering missing context, talking to non-experts, escalating to emergency care, and admitting uncertainty. The HealthBench team set out to build an evaluation that reflects how clinicians and patients actually use generative AI, framed around three goals: it should be meaningful, trustworthy, and unsaturated.[2][3]
HealthBench was built with 262 physicians who collectively practiced in 60 countries, trained in 26 medical specialties, and were fluent in 49 languages. The contributing group skews experienced: about 50% were independent or staff physicians, 17% fellows, 23% senior residents (PGY3 or higher), and 10% junior residents. OpenAI received over 1,000 applications and selected about 26% based on response quality during onboarding. Every contributor was paid. Development took about eleven months. Physicians wrote example conversations, drafted the rubric criteria for each example, scored model responses during pilot rounds, and validated the model grader.[2][3]
Each example is a conversation of one or more turns. The mean conversation has 2.6 turns; roughly 58% of examples are single-turn questions and the remainder run two or more turns, which lets HealthBench probe how models handle follow-up questions and missing information. The user side is sometimes a layperson asking about symptoms and sometimes a clinician asking for help with documentation, triage, or test interpretation.[2]
Each conversation has its own rubric written by the physician who created the example. Criteria are positive ("the response should advise calling emergency services") or negative ("the response should not recommend ibuprofen given the patient's stated kidney disease"), and each criterion carries a weight between 10 and -10 reflecting clinical importance. The average example has about 11.5 criteria, with a range of 2 to 48.[1][2]
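As a rough illustration of the rubric structure described above, the following Python sketch models a single criterion and a two-item rubric. The class name, field names, integer weights, and the specific point values are assumptions made for this example and are not taken from the released dataset schema; the negative criterion is phrased as the undesirable behavior and given a negative weight.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RubricCriterion:
    """One physician-written criterion (hypothetical field names)."""
    text: str
    points: int  # weight in [-10, 10]; negative marks an undesirable behavior

    def __post_init__(self):
        # Enforce the weight range stated in the article.
        if not -10 <= self.points <= 10:
            raise ValueError(f"points must lie in [-10, 10], got {self.points}")

    @property
    def is_positive(self) -> bool:
        return self.points > 0


# Hypothetical rubric for one conversation; the weights are made up.
rubric = [
    RubricCriterion("Advises calling emergency services", 10),
    RubricCriterion("Recommends ibuprofen despite the patient's stated kidney disease", -8),
]

# Only positive criteria count toward the best achievable total.
max_points = sum(c.points for c in rubric if c.is_positive)
```

Keeping the weight on each criterion, rather than on the conversation as a whole, is what lets a rubric penalize one unsafe statement heavily while still crediting the rest of the response.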
HealthBench groups examples into seven themes that map to clinically meaningful skills.
| Theme | What it tests |
|---|---|
| Emergency referrals | Recognizing acute conditions and advising the user to seek urgent care without over-escalating routine cases |
| Context-seeking | Asking the right follow-up questions when the prompt is missing information that matters for safety |
| Global health | Adapting advice to different national healthcare systems and resource levels |
| Health data tasks | Producing structured outputs such as discharge summaries, lab interpretations, or coding tasks |
| Expertise-tailored communication | Matching the response register to the user (patient, nurse, physician, specialist) |
| Responding under uncertainty | Hedging appropriately when evidence is weak, refusing when a confident answer would be unsafe |
| Response depth | Choosing how much detail to give based on the user's request, not over-explaining or under-explaining |
Frontier models tend to do well on emergency referrals and expertise-tailored communication while still struggling with context-seeking, global health, and responding under uncertainty.[2][4]
Every rubric criterion is tagged with one of five behavioral axes. The distribution gives a sense of what HealthBench actually rewards.
| Axis | Approximate share of criteria | Focus |
|---|---|---|
| Completeness | 39% | Including all clinically important information, especially safety information |
| Accuracy | 33% | Factual correctness aligned with current medical consensus |
| Context awareness | 16% | Picking up on cues in the prompt and asking for missing information |
| Communication quality | 8% | Clarity, structure, appropriate medical literacy level |
| Instruction following | 4% | Adhering to format, length, or scope constraints set by the user |
Completeness and accuracy together account for roughly 72% of the criteria (39% plus 33%), which reflects what the physician contributors flagged as most often consequential. Earlier models such as Claude 3.5 Sonnet and GPT-4o tended to be strong on communication quality but weak on completeness; o3 closed much of that gap, which is the main reason it led the leaderboard at release.[2][4]
HealthBench uses a model grader rather than human raters. For each model response the grader (GPT-4.1) reads the rubric criteria one by one and decides whether each is met. Each met criterion contributes its weight, so a met positive criterion adds points and a triggered negative criterion subtracts them; unmet criteria contribute nothing. The conversation score is this total divided by the maximum possible score (the sum of the positive weights), and the overall score is the average across all 5,000 examples.[1][2]
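The scoring rule can be sketched in a few lines of Python. This is a minimal illustration, not the simple-evals implementation. It assumes, following the paper's description as understood here, that met criteria contribute their points (negative for undesirable behaviors), that unmet criteria contribute nothing, that each example is normalized by the sum of its positive points, and that the overall mean is clipped to the range 0 to 1. The function names and the sample weights are hypothetical.

```python
def example_score(criteria):
    """Score one conversation.

    criteria: list of (points, met) pairs, where points is the criterion
    weight and met is the grader's yes/no judgment.
    """
    max_points = sum(points for points, _ in criteria if points > 0)
    if max_points == 0:
        raise ValueError("rubric needs at least one positive criterion")
    # Met criteria contribute their points; unmet criteria contribute nothing.
    earned = sum(points for points, met in criteria if met)
    return earned / max_points  # can be negative for a harmful response


def overall_score(examples):
    """Average the per-example scores, then clip to [0, 1] (assumed)."""
    mean = sum(example_score(c) for c in examples) / len(examples)
    return min(1.0, max(0.0, mean))


# Two hypothetical graded conversations.
graded = [
    [(10, True), (5, False), (-8, True)],  # earns 10 - 8 = 2 of 15 possible
    [(7, True), (3, True)],                # earns 10 of 10 possible
]
score = overall_score(graded)
```

In this toy input the first conversation scores 2/15 and the second scores 1.0, so a single triggered negative criterion can erase most of the credit earned elsewhere in a response.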
The paper validates this approach with a meta-evaluation. Physicians independently graded a large set of model responses against the rubric criteria, producing 60,896 expert grades. The team scored the GPT-4.1 grader's agreement with physician grades using macro-F1 and compared that to how well physicians agreed with each other. GPT-4.1 ended up with a macro-F1 of about 0.71 overall, exceeded the average physician on five of the seven themes, and placed in the upper half of physicians on six of seven themes. Physician-physician agreement ranged roughly from 55% to 75% across themes, which is a useful reminder that credentialed clinicians often disagree about what a model should have said.[2][3]
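For readers unfamiliar with macro-F1, the sketch below shows how it can be computed for binary met/not-met judgments: F1 is calculated separately for the "met" class and the "not met" class, and the two values are averaged with equal weight. The labels are invented for illustration, and the function names are hypothetical; the paper's exact aggregation across physicians and themes is not reproduced here.

```python
def f1_for_class(reference, predicted, cls):
    """F1 for one class, treating `cls` as the positive label."""
    pairs = list(zip(reference, predicted))
    tp = sum(r == cls and p == cls for r, p in pairs)
    fp = sum(r != cls and p == cls for r, p in pairs)
    fn = sum(r == cls and p != cls for r, p in pairs)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_f1(reference, predicted):
    """Unweighted mean of the F1 scores for the True and False classes."""
    return (f1_for_class(reference, predicted, True)
            + f1_for_class(reference, predicted, False)) / 2


# Hypothetical judgments on five criteria: did the response meet each one?
physician = [True, True, False, False, True]
grader = [True, False, False, True, True]

agreement = macro_f1(physician, grader)
```

Averaging the two classes equally keeps a grader from looking good simply by always predicting the more common label.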
HealthBench is published in three flavors that share the same dataset but emphasize different things.
| Variant | Size | Purpose |
|---|---|---|
| HealthBench | 5,000 conversations | Full evaluation across all themes and axes |
| HealthBench Hard | 1,000 conversations | A difficulty-curated subset where frontier models score much lower; built to leave headroom for future models |
| HealthBench Consensus | 34 physician-validated criteria across the dataset | Focused measurement of safety-critical behaviors where there is strong physician agreement on the right answer |
At release, OpenAI o3 scored 0.60 on the full benchmark and only 0.32 on HealthBench Hard. Many models scored zero on HealthBench Hard, which is by design: the subset was filtered to keep examples that current frontier systems failed.[1][2]
The headline scores from the May 2025 release paper are below. The full benchmark score is reported on a 0 to 1 scale (some sources convert to percentage).
| Model | HealthBench overall | HealthBench Hard | Notes |
|---|---|---|---|
| OpenAI o3 | 0.60 | 0.32 | Top score on both the full benchmark and HealthBench Hard at release |
| GPT-4.1 | 0.48 | Reported lower; non-zero | Strong cost-performance ratio |
| GPT-4.1 nano | Above GPT-4o | Reported lower | Roughly 25 times cheaper than GPT-4o while scoring higher overall |
| o4-mini | Above GPT-4o | Reported lower | Smaller reasoning model evaluated alongside o3 |
| o1 | 0.42 | Lower | 2024 reasoning model |
| GPT-4o (Aug 2024) | 0.32 | Near zero | Previous OpenAI frontier baseline |
| Claude 3.7 Sonnet | Reported below o3 | Reported below o3 | Strong communication quality, weaker on completeness |
| Gemini 2.5 Pro | Competitive with frontier | Reported below o3 | Strong on multilingual and global health |
| Grok 3 | Competitive on some themes | Reported below o3 | Mixed performance across axes |
| GPT-3.5 Turbo | 0.16 | Near zero | Two-year-old baseline |
The gap between o3 (0.60) and GPT-4o (0.32) is wider than the gap between GPT-4o and GPT-3.5 Turbo (0.16), which the OpenAI authors cite as evidence that progress on healthcare conversations sped up between late 2024 and spring 2025.[1][3]
The paper also reports a worst-at-k metric, which samples a model k times on the same prompt and keeps the lowest-scoring response. For HealthBench, k = 16 is the headline configuration. The worst-at-16 score for o3 is roughly 0.40, compared to its mean of 0.60. That drop of about a third indicates real variance: frontier models can fail badly on individual responses even when their average looks strong. By the same metric, o3's worst-at-16 score is more than twice that of GPT-4o.[2]
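A simplified version of the worst-at-k idea can be sketched as follows. It assumes each prompt already has at least k scored samples and simply takes the minimum of the first k, then averages across prompts; the paper's actual estimator may differ in how it draws and averages samples. The function name and the scores are hypothetical.

```python
def worst_at_k(scores_per_example, k):
    """Mean over examples of the lowest score among the first k samples."""
    if k < 1:
        raise ValueError("k must be at least 1")
    worst = []
    for scores in scores_per_example:
        if len(scores) < k:
            raise ValueError("every example needs at least k sampled scores")
        worst.append(min(scores[:k]))
    return sum(worst) / len(worst)


# Hypothetical rubric scores for three sampled responses on two prompts.
sampled = [
    [0.8, 0.6, 0.7],
    [0.5, 0.2, 0.9],
]

worst_1 = worst_at_k(sampled, 1)  # only the first sample per prompt
worst_3 = worst_at_k(sampled, 3)  # lowest of all three samples per prompt
```

Because a minimum can only fall or stay the same as more samples are included, the value is non-increasing in k, which is why worst-at-16 sits below the mean score.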
One of the more striking findings involves physicians writing responses themselves. In the first round (using September 2024 model references), physicians who could see a reference answer from GPT-4o or o1 improved on that reference about 56% of the time. In the second round (using April 2025 references from o3 and GPT-4.1), physicians improved 46.8% of the time and worsened the reference 47.7% of the time. Within statistical noise, that means by April 2025 physicians could no longer reliably add quality on top of frontier model responses on these specific prompts.[2][3]
This result is narrow but important. It does not say AI is better than physicians at medicine. It says that on these particular prompts and this particular rubric, the frontier models had reached a level where the median improvement a physician could make was zero. The paper is careful to note that physicians were not seeing real patients, did not have lab data or imaging, and were graded by the same rubric used to score the models.
HealthBench drew attention from both AI research and clinical communities. Nigam Shah at Stanford Health Care described the benchmark as unprecedented in scale and directionally aligned with academic research. It was covered in MobiHealthNews, MarkTechPost, and several health policy outlets.[5][6]
A 2025 critical review, A Critical Evaluation of HealthBench (arXiv:2508.00081), raised concerns that the rubric grading captures behavioral conformance more than patient-safety outcomes and that the synthetic conversations may not reflect the messier presentation of real patients. A peer-reviewed perspective in npj Digital Medicine in 2025 summarized the consensus: HealthBench advances AI evaluation in healthcare but is not yet evidence that any model is clinically ready for autonomous use.[7][8]
In the year after release, HealthBench became a standard reporting line in frontier model release notes. OpenAI's GPT-5 family, Anthropic's Claude releases, and Google DeepMind's Gemini updates all began publishing HealthBench scores. OpenAI itself launched a follow-up called HealthBench Professional in April 2026 that targets working clinicians using consultation, documentation, and research tasks. The framework is open enough that other groups can plug in their own conversations and rubrics while reusing the model-graded scoring infrastructure.[6][9]
The authors are explicit about what HealthBench does not measure. The conversations are predominantly synthetic, the dataset is text-only (no imaging, no structured EHR data), and the model grader can be gamed by responses that satisfy rubric criteria without being good. The English share of the dataset is high relative to the range of languages spoken by contributing physicians, and several low-resource languages have only a handful of examples. None of the criteria measure long-term outcomes; they assess only the contents of a single response or short conversation.[2]
HealthBench is the first widely adopted healthcare AI benchmark to combine open-ended conversational evaluation, physician-written rubrics, model-graded scoring validated against expert agreement, and an explicit hard subset designed to leave headroom. It moved the standard for medical AI evaluation past saturated multiple-choice exams during a period when health-focused LLMs were improving quickly. The benchmark's longer impact will probably come from its rubric methodology, which gives developers feedback at the level of specific behaviors rather than a single opaque score. That granularity is more useful for fixing problems than a leaderboard number, and it is the part of HealthBench most likely to be copied into other domains.