HealthBench Hard
Last reviewed
May 10, 2026
Sources
9 citations
Review status
Source-backed
Revision
v2 · 2,298 words
| HealthBench Hard | |
|---|---|
| Overview | |
| Full name | HealthBench Hard |
| Abbreviation | HealthBench Hard |
| Description | A 1,000 example curated subset of HealthBench, selected because frontier models score poorly on it |
| Release date | 2025-05-12 |
| Latest version | 1.0 |
| Benchmark updated | 2025-05 |
| Authors | OpenAI Research Team |
| Organization | OpenAI |
| Technical Details | |
| Type | Healthcare AI, Multi-turn dialogue, Clinical reasoning |
| Modality | Text |
| Task format | Multi-turn healthcare conversations with rubric-based evaluation |
| Number of tasks | 1,000 conversations |
| Total examples | 1,000 (subset of 5,000 in main HealthBench) |
| Evaluation metric | Rubric-based scoring with a model grader (default GPT-4.1) |
| Domains | Emergency referrals, global health, clinical data, context seeking, hedging, communication, complex responses |
| Languages | English (drawn from main HealthBench, which spans 49 languages) |
| Performance | |
| Human performance | Not yet published for the Hard subset |
| Baseline | Near 0% on many examples (older models) |
| SOTA score (May 2025 paper) | 32% (o3) |
| SOTA model (May 2025 paper) | o3 |
| Reported score (Aug 2025) | 46.2% (GPT-5) |
| Saturated | No |
| Resources | |
| Website | Official announcement |
| Paper | arXiv:2505.08775 |
| GitHub | openai/simple-evals |
| Dataset | openai/healthbench |
| License | MIT |
| Predecessor | HealthBench (full 5,000 example set) |
HealthBench Hard is a 1,000 example curated subset of the HealthBench benchmark released by OpenAI on May 12, 2025[1][2]. The subset was carved out of HealthBench's full 5,000 conversation dataset by selecting cases where current frontier large language models scored poorly, including examples on which leading systems achieved zero credit[1][3]. At release, o3 held the top score at 32% on HealthBench Hard, compared to 60% on full HealthBench, a gap OpenAI framed as "plenty of headroom for the next generation of models"[1][3]. By August 2025, GPT-5 had pushed the score to 46.2%, but the subset remains far from saturation[4].
HealthBench Hard plays two roles. It is a frontier-model differentiator: full HealthBench compresses the scores of strong systems near the top, while Hard spreads them out. It is also a future-proofing instrument. It was released alongside HealthBench Consensus (a high-precision subset filtered to physician-validated criteria), and the two derivative variants "respectively aim to be highly validated and unsaturated"[3].
HealthBench Hard is a strict subset, not a separate dataset. Every conversation is also in the main HealthBench, with the same multi-turn structure, physician-written rubrics, five axes, and seven themes[3]. OpenAI filtered 5,000 conversations down to 1,000 examples on which a panel of frontier models did badly, biased toward cases where multiple leading systems failed at once[1][3].
Unlike medical AI benchmarks built on multiple-choice questions (for example MedQA or USMLE-style tests), HealthBench evaluates models through realistic, open-ended conversations. Full HealthBench was built with 262 physicians from 60 countries across 26 specialties, producing 48,562 rubric criteria, and HealthBench Hard inherits that scaffolding[1][3].
Frontier model scores on full HealthBench rose from 16% (GPT-3.5 Turbo) up to 60% (o3) inside roughly two years, with sharp acceleration during 2025 as o3, o4-mini, and GPT-4.1 came online. If top systems cluster above 70%, the benchmark stops differentiating models. HealthBench Hard keeps a tail of difficulty available; OpenAI described the goal as "a worthy target for model improvements for months to come"[1].
The selection process is documented in Appendix C of the HealthBench paper[3].
HealthBench Hard is empirically curated, not flagged by physicians: hardness is defined by how badly a basket of frontier models did. The subset is biased toward failure modes systemic to current LLMs, not what clinicians intuitively call hard. Many conversations involve subtle requirements that models miss, rather than rare diseases.
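The idea of defining hardness by how badly a basket of models did can be illustrated with a minimal sketch. It is not the procedure from Appendix C, whose details are not reproduced here: the function name `select_hard_subset`, the use of a plain mean across models, and the toy scores are all assumptions made for illustration.

```python
from statistics import mean

def select_hard_subset(scores_by_model, k):
    """Return the k example ids with the lowest mean score across models.

    scores_by_model maps a model name to a dict of example id -> score.
    Hypothetical helper: the real selection rule may weight models or
    apply extra filters that this sketch does not model.
    """
    # Only consider examples that every model in the panel was scored on.
    shared_ids = set.intersection(*(set(s) for s in scores_by_model.values()))
    avg = {ex: mean(s[ex] for s in scores_by_model.values()) for ex in shared_ids}
    # Sort by mean score, then by id so ties resolve deterministically.
    return sorted(avg, key=lambda ex: (avg[ex], ex))[:k]

# Toy panel of two made-up models over four made-up examples.
panel = {
    "model_a": {"ex1": 0.9, "ex2": 0.1, "ex3": 0.0, "ex4": 0.5},
    "model_b": {"ex1": 0.8, "ex2": 0.3, "ex3": 0.0, "ex4": 0.7},
}
hard_ids = select_hard_subset(panel, k=2)
print(hard_ids)
```

Because the selection depends only on model scores, the resulting subset shifts if the panel changes, which is the drift the article describes later.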
Difficulty does not come from rubric volume. Hard conversations average 11.8 rubric criteria per example (median 11.0), nearly identical to the full HealthBench average of 11.4. The axis distribution is stable: completeness at ~39.1% of points, accuracy at ~29.7%[5][6]. The difference is in themes:
| Theme | Hard share | Full HealthBench share |
|---|---|---|
| theme:global_health | 28.0% | 21.9% |
| theme:context_seeking | 17.9% | 11.9% |
| theme:emergency_referrals | 6.6% | similar |
| theme:complex_responses | enriched | lower |
| theme:hedging | enriched | lower |
The conversations that frontier models fail disproportionately require low-resource reasoning (global health), clarifying questions (context seeking), or careful handling of uncertainty (hedging). Riegler concludes hardness "stems not from higher criterion volume or drastically different thematic focus, but rather from the intrinsic complexity of the prompts or the specific nuances required by the criteria"[5].
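A theme-share comparison like the one in the table can be computed with a few lines of Python. The sketch below is illustrative only: the `example_tags` field name and the four toy records are assumptions, and the numbers it prints are not the published shares.

```python
from collections import Counter

def theme_shares(examples):
    """Fraction of examples that carry each 'theme:' tag."""
    counts = Counter(
        tag
        for ex in examples
        for tag in ex["example_tags"]  # assumed field name
        if tag.startswith("theme:")
    )
    return {tag: count / len(examples) for tag, count in counts.items()}

# Toy stand-in for the full set; real records carry many more fields.
full = [
    {"id": "a", "example_tags": ["theme:global_health"]},
    {"id": "b", "example_tags": ["theme:context_seeking"]},
    {"id": "c", "example_tags": ["theme:emergency_referrals"]},
    {"id": "d", "example_tags": ["theme:emergency_referrals"]},
]
hard_ids = {"a", "b"}  # hypothetical membership of the hard subset
hard = [ex for ex in full if ex["id"] in hard_ids]

full_share = theme_shares(full)
hard_share = theme_shares(hard)
# Positive values mean the theme is enriched in the hard subset.
enrichment = {t: hard_share.get(t, 0.0) - s for t, s in full_share.items()}
print(enrichment)
```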
| Aspect | Specification |
|---|---|
| Average turns | ~2.6 |
| Avg rubric criteria per example | 11.8 (median 11.0) |
| Total rubric criteria | ~11,800 |
| Languages | English (drawn from multilingual base) |
| License | MIT |
| Axis | What it grades |
|---|---|
| Accuracy | Factual and clinical correctness |
| Completeness | Coverage of rubric-required aspects |
| Communication quality | Clarity, tone, structure |
| Context awareness | Use or seeking of correct context |
| Instruction following | Compliance with user requests |
| Theme | Focus |
|---|---|
| emergency_referrals | Recognizing urgent-care escalation |
| context_seeking | Asking clarifying questions |
| global_health | Low resource or non-Western reasoning |
| health_data_tasks | Clinical data, summaries, notes |
| expertise_tailored_communication | Calibrating depth to user expertise |
| responding_under_uncertainty | Hedging on incomplete evidence |
| response_depth | Choosing length and detail |
The Hard subset opens up the gap between models that look close on the full benchmark[1][3][8]:
| Model | Full HealthBench | HealthBench Hard | Gap |
|---|---|---|---|
| o3 | 60% | 32% (top score) | 28 points |
| GPT-4.1 | 49% to 53% | High 20s | ~25 points |
| o4-mini | High 50s | Mid to high 20s | ~25 to 30 points |
| o1 | ~49% | Lower than GPT-4.1 | ~20 points |
| GPT-4o (Aug 2024) | 32% | Substantially lower | Notable drop |
| GPT-3.5 Turbo | 16% | Near zero on many examples | Most extreme |
| Grok 3 | n/a | ~0.226 (22.6%) | n/a |
| Claude 3.7 Sonnet | n/a | ~20% to 21% | n/a |
| Gemini 2.5 Pro | n/a | ~24% to 25% | n/a |
| Llama 4 Maverick | n/a | Comparable to GPT-4.1 levels | n/a |
The paper notes that o3 and GPT-4.1 cut error rates on HealthBench Consensus dramatically compared to GPT-4o, but on HealthBench Hard absolute scores stay low across the board. Consensus shows whether models meet a physician-validated safety bar; Hard shows whether they have anywhere left to climb[3].
| Model | HealthBench Hard score | Source |
|---|---|---|
| GPT-5 | 46.2% | OpenAI GPT-5 launch, Aug 2025[4] |
| Gemini 2.5 Pro | 0.243 (24.3%) | Inspect Evals 250-sample, Feb 2026[7] |
| Claude 3.7 Sonnet | 0.205 (20.5%) | Inspect Evals 250-sample, Feb 2026[7] |
| o1 | 0.180 (18.0%) | Inspect Evals 250-sample, Feb 2026[7] |
GPT-5's 46.2% is the largest single jump since launch. Even so, more than half of the achievable rubric points remain unearned, which is consistent with OpenAI's bet that the subset would stay unsaturated through several model releases.
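Scores from 250-sample runs carry more sampling noise than scores over all 1,000 conversations, which matters when rows in the table differ by only a few points. One way to gauge that noise is a percentile bootstrap over per-example scores. The sketch below uses synthetic scores and a hypothetical helper `bootstrap_ci`; it says nothing about the actual intervals for any listed model.

```python
import random
from statistics import mean

def bootstrap_ci(scores, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean score."""
    rng = random.Random(seed)
    n = len(scores)
    # Resample with replacement and record the mean of each resample.
    means = sorted(mean(rng.choices(scores, k=n)) for _ in range(n_resamples))
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi

# Synthetic per-example scores in [0, 1); NOT real benchmark data.
data_rng = random.Random(42)
scores = [data_rng.random() for _ in range(250)]

lo, hi = bootstrap_ci(scores)
print(f"mean={mean(scores):.3f}  95% CI=({lo:.3f}, {hi:.3f})")
```

If two models' intervals overlap heavily on a 250-sample run, the ordering between them may not hold on the full subset.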
| Finding | Implication |
|---|---|
| 20 to 28 point drop from full HealthBench to Hard for most frontier models | Hard is a robust differentiator, not a small perturbation |
| Gains on Hard lag gains on full HealthBench | Brute-force memorization plateaus here |
| Context seeking and global health themes drive most errors | Models default to confident answers instead of asking |
| Hard scores correlate with reasoning architectures (o3, o4-mini, GPT-5) more than raw scale | Test-time reasoning helps where pattern matching does not |
The Hard subset is graded by a model grader (simple-evals defaults to GPT-4.1, with GPT-4o-mini as a faster alternative) that checks each rubric criterion. Each criterion carries a positive or negative point weight assigned by physicians[3][7]. A model's score is the proportion of achievable points earned, with negative totals floored at zero. The Inspect Evals port supports optional length-adjustment parameters (length_adjustment_center and length_adjustment_penalty_per_500_chars) for penalizing padded responses[7].
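The scoring rule can be sketched as follows. This is a minimal illustration, not the simple-evals or Inspect Evals code: the `Criterion` class and function names are hypothetical, the grader verdict is passed in as a plain boolean, and applying the clip to the aggregate mean (and not to each example) is an assumption.

```python
from dataclasses import dataclass

@dataclass
class Criterion:
    text: str
    points: int  # positive rewards a behavior, negative penalizes it
    met: bool    # grader verdict for this response

def example_score(criteria):
    """Points earned divided by the maximum achievable (positive) points."""
    possible = sum(c.points for c in criteria if c.points > 0)
    if possible == 0:
        return None  # nothing achievable, so the example is skipped
    earned = sum(c.points for c in criteria if c.met)
    return earned / possible

def overall_score(per_example):
    """Mean of per-example scores, clipped to the range [0, 1]."""
    valid = [s for s in per_example if s is not None]
    return min(1.0, max(0.0, sum(valid) / len(valid)))

# Made-up rubric: one earned reward, one missed reward, one triggered penalty.
ex_a = [
    Criterion("Advises urgent care", 5, True),
    Criterion("Asks about symptom duration", 3, False),
    Criterion("States an unsafe dose", -4, True),
]
# Made-up rubric where the penalty outweighs everything achievable.
ex_b = [
    Criterion("Mentions follow-up", 2, False),
    Criterion("Gives dangerous advice", -6, True),
]
score_a = example_score(ex_a)  # (5 - 4) / (5 + 3)
score_b = example_score(ex_b)  # -6 / 2, negative before clipping
print(score_a, score_b, overall_score([score_a, score_b]))
```

Negative criteria lower the numerator but never the denominator, so a response can score below zero on a single example before clipping.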
| Component | Number |
|---|---|
| Physician validators | 262 |
| Countries | 60 |
| Specialties | 26 |
| Unique rubric criteria (full HealthBench) | 48,562 |
| Consensus dimensions | 34 |
HealthBench Hard uses the full physician-authored rubric set, not only consensus criteria, restricted to the 1,000 hardest conversations.
OpenAI released three views of the same evaluation:
| Variant | Examples | What it measures | Top score (May 2025) | Saturating? |
|---|---|---|---|---|
| HealthBench | 5,000 | Broad performance across rubric criteria | 60% (o3) | Slowly |
| HealthBench Consensus | 3,671 | Physician-consensus criteria (high precision) | High; errors rare for top models | Yes, on top models |
| HealthBench Hard | 1,000 | Hardest conversations for frontier models | 32% (o3) at release; 46.2% (GPT-5) by Aug 2025 | No |
Consensus spots regressions: failure on a consensus criterion is a flag worth investigating. Full HealthBench is the broad eval. Hard differentiates frontier systems and tracks marginal progress across model generations.
| Tests | Does not test |
|---|---|
| Multi-turn clinical conversation handling | Image understanding (text only) |
| Calibrated uncertainty and context seeking | Real-time clinical workflow integration |
| Global health and low-resource reasoning | Long-form chart review |
| Communication quality with diverse users | Diagnostic accuracy on private cohorts |
| Instruction following in healthcare scenarios | Prescribing legality in a jurisdiction |
HealthBench Hard's design (filter by current model failure, then publish) sidesteps the saturation problem that has hit benchmarks like MMLU and HumanEval. The trade-off: difficulty is partially defined by time of creation, and the empirical center drifts as models improve. OpenAI has not announced a refresh, but simple-evals continues to host HealthBench as a maintained reference even after the rest of simple-evals stopped getting updates in July 2025[2][9].
Third-party leaderboards include HealthBench Hard alongside the main benchmark, since two models within a couple of points on full HealthBench can sit 5 to 10 points apart on Hard. The arXiv preprint "OpenAI's HealthBench in Action" used HealthBench Hard to evaluate a clinical assistant called DR. INFO across model generations, updated into 2026[8].
On full HealthBench, top systems (o3, GPT-4.1, o4-mini) cluster within ~10 points. On Hard, the same systems can be 15+ points apart; Grok 3, Claude 3.7 Sonnet, and Gemini 2.5 Pro fall in the 20% to 25% range while OpenAI's reasoning models lead[3][7].
| Limitation | Description |
|---|---|
| Empirical curation can drift | Difficulty is anchored to late 2024 and early 2025 frontier models |
| Small dataset size | 1,000 conversations is tight for axis or theme breakdowns |
| English-leaning | Main HealthBench is multilingual; published Hard analyses focus on English |
| Model grader bias | GPT-4.1 or GPT-4o-mini as grader introduces self-favoritism risk on OpenAI models |
| No clinician baseline yet | Full HealthBench publishes physician baselines; Hard does not |
| Static dataset | Public release means conversations could leak into training data |
The grader concern is partially mitigated by HealthBench's meta-evaluation, which showed the GPT-4.1 grader's agreement with physicians on Consensus criteria fell within inter-physician agreement[3]. Whether that holds at the harder tail is open.
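A comparison of this kind can be sketched by scoring the grader against one physician and a second physician against the same reference. The example below uses macro F1 over binary "criterion met" labels as the agreement measure; that metric choice, the function names, and the six toy labels are assumptions for illustration, not the paper's data or code.

```python
def f1_for_class(y_true, y_pred, cls):
    """F1 score treating `cls` as the positive class."""
    tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
    fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
    fn = sum(t == cls and p != cls for t, p in zip(y_true, y_pred))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def macro_f1(y_true, y_pred):
    """Unweighted mean of the F1 scores for the True and False classes."""
    return (f1_for_class(y_true, y_pred, True)
            + f1_for_class(y_true, y_pred, False)) / 2

# Toy "criterion met" labels for six responses (made up).
physician_a = [True, True, False, False, True, False]
physician_b = [True, False, False, False, True, True]
grader      = [True, True, False, False, True, True]

grader_vs_physician = macro_f1(physician_a, grader)
physician_vs_physician = macro_f1(physician_a, physician_b)
print(round(grader_vs_physician, 3), round(physician_vs_physician, 3))
```

In this toy case the grader agrees with physician A at least as well as physician B does, which is the shape of the claim; repeating the comparison on only the Hard conversations would address the open question.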