HealthBench Hard

HealthBench Hard
Overview
Full name	HealthBench Hard
Abbreviation	HealthBench Hard
Description	A 1,000 example curated subset of HealthBench, selected because frontier models score poorly on it
Release date	2025-05-12
Latest version	1.0
Benchmark updated	2025-05
Authors	OpenAI Research Team
Organization	OpenAI
Technical Details
Type	Healthcare AI, Multi-turn dialogue, Clinical reasoning
Modality	Text
Task format	Multi-turn healthcare conversations with rubric-based evaluation
Number of tasks	1,000 conversations
Total examples	1,000 (subset of 5,000 in main HealthBench)
Evaluation metric	Rubric-based scoring with a model grader (default GPT-4.1)
Domains	Emergency referrals, global health, clinical data, context seeking, hedging, communication, complex responses
Languages	English (drawn from main HealthBench, which spans 49 languages)
Performance
Human performance	Not yet published for the Hard subset
Baseline	Near 0% on many examples (older models)
SOTA score (May 2025 paper)	32% (o3)
SOTA model (May 2025 paper)	o3
Reported score (Aug 2025)	46.2% (GPT-5)
Saturated	No
Resources
Website	Official announcement
Paper	arXiv:2505.08775
GitHub	openai/simple-evals
Dataset	openai/healthbench
License	MIT
Predecessor	HealthBench (full 5,000 example set)

HealthBench Hard is a 1,000 example curated subset of the HealthBench benchmark released by OpenAI on May 12, 2025^[1]^[2]. The subset was carved out of HealthBench's full 5,000 conversation dataset by selecting cases where current frontier large language models scored poorly, including examples on which leading systems achieved zero credit^[1]^[3]. At release, o3 held the top score at 32% on HealthBench Hard, compared to 60% on full HealthBench, a gap OpenAI framed as "plenty of headroom for the next generation of models"^[1]^[3]. By August 2025, GPT-5 had pushed the score to 46.2%, but the subset remains far from saturation^[4].

HealthBench Hard plays two roles: a frontier-model differentiator (full HealthBench compresses scores between strong systems near the top; Hard spreads them out) and a future-proofing instrument. Released alongside HealthBench Consensus (a high-precision subset filtered to physician-validated criteria), the two derivative variants "respectively aim to be highly validated and unsaturated"^[3].

Overview

HealthBench Hard is a strict subset, not a separate dataset. Every conversation is also in the main HealthBench, with the same multi-turn structure, physician-written rubrics, five axes, and seven themes^[3]. OpenAI filtered 5,000 conversations down to 1,000 examples on which a panel of frontier models did badly, biased toward cases where multiple leading systems failed at once^[1]^[3].

Unlike medical AI benchmarks built on multiple choice questions (for example MedQA or USMLE-style tests), HealthBench evaluates models through realistic, open ended conversations. Full HealthBench was built with 262 physicians from 60 countries across 26 specialties, producing 48,562 rubric criteria, and HealthBench Hard inherits that scaffolding^[1]^[3].

Why a hard subset

Frontier model scores on full HealthBench rose from 16% (GPT-3.5 Turbo) up to 60% (o3) inside roughly two years, with sharp acceleration during 2025 as o3, o4-mini, and GPT-4.1 came online. If top systems cluster above 70%, the benchmark stops differentiating models. HealthBench Hard keeps a tail of difficulty available; OpenAI described the goal as "a worthy target for model improvements for months to come"^[1].

Selection methodology

The selection process is documented in Appendix C of the HealthBench paper^[3]:

Score every conversation in full HealthBench across a panel of frontier models.
Identify conversations where multiple state-of-the-art systems performed poorly.
Take the bottom-performing 1,000 conversations as HealthBench Hard.

HealthBench Hard is empirically curated, not flagged by physicians: hardness is defined by how badly a basket of frontier models did. The subset is biased toward failure modes systemic to current LLMs, not what clinicians intuitively call hard. Many conversations involve subtle requirements that models miss, rather than rare diseases.

What the hard examples look like

Difficulty does not come from rubric volume. Hard conversations average 11.8 rubric criteria per example (median 11.0), nearly identical to the full HealthBench average of 11.4. The axis distribution is stable: completeness at ~39.1% of points, accuracy at ~29.7%^[5]^[6]. The difference is in themes:

Theme	Hard share	Full HealthBench share
theme:global_health	28.0%	21.9%
theme:context_seeking	17.9%	11.9%
theme:emergency_referrals	6.6%	similar
theme:complex_responses	enriched	lower
theme:hedging	enriched	lower

^[5]^[6]

Conversations frontier models flunk are disproportionately ones requiring low resource reasoning (global health), clarifying questions (context seeking), or careful uncertainty (hedging). Riegler concludes hardness "stems not from higher criterion volume or drastically different thematic focus, but rather from the intrinsic complexity of the prompts or the specific nuances required by the criteria"^[5].

Dataset composition

Conversation structure

Aspect	Specification
Average turns	~2.6
Avg rubric criteria per example	11.8 (median 11.0)
Total rubric criteria	~11,800
Languages	English (drawn from multilingual base)
License	MIT

^[1]^[3]^[5]

Five evaluation axes

Axis	What it grades
Accuracy	Factual and clinical correctness
Completeness	Coverage of rubric-required aspects
Communication quality	Clarity, tone, structure
Context awareness	Use or seeking of correct context
Instruction following	Compliance with user requests

^[3]^[7]

Seven medical themes

Theme	Focus
emergency_referrals	Recognizing urgent-care escalation
context_seeking	Asking clarifying questions
global_health	Low resource or non-Western reasoning
health_data_tasks	Clinical data, summaries, notes
expertise_tailored_communication	Calibrating depth to user expertise
responding_under_uncertainty	Hedging on incomplete evidence
response_depth	Choosing length and detail

^[3]^[7]

Performance results

Frontier models in the May 2025 paper

The Hard subset opens up the gap between models that look close on the full benchmark^[1]^[3]^[8]:

Model	Full HealthBench	HealthBench Hard	Gap
o3	60%	32% (top score)	28 points
GPT-4.1	49% to 53%	High 20s	~25 points
o4-mini	High 50s	Mid to high 20s	~25 to 30 points
o1	~49%	Lower than GPT-4.1	~20 points
GPT-4o (Aug 2024)	32%	Substantially lower	Notable drop
GPT-3.5 Turbo	16%	Near zero on many examples	Most extreme
Grok 3	n/a	~0.226 (22.6%)	n/a
Claude 3.7 Sonnet	n/a	~20% to 21%	n/a
Gemini 2.5 Pro	n/a	~24% to 25%	n/a
Llama 4 Maverick	n/a	Comparable to GPT-4.1 levels	n/a

The paper notes that o3 and GPT-4.1 cut error rates on HealthBench Consensus dramatically compared to GPT-4o, but on HealthBench Hard absolute scores stay low across the board. Consensus shows whether models meet a physician-validated safety bar; Hard shows whether they have anywhere left to climb^[3].

Post-paper updates

Model	HealthBench Hard score	Source
GPT-5	46.2%	OpenAI GPT-5 launch, Aug 2025^[4]
Gemini 2.5 Pro	0.243 (24.3%)	Inspect Evals 250-sample, Feb 2026^[7]
Claude 3.7 Sonnet	0.205 (20.5%)	Inspect Evals 250-sample, Feb 2026^[7]
o1	0.180 (18.0%)	Inspect Evals 250-sample, Feb 2026^[7]

GPT-5's 46.2% is the largest single jump since launch. Even so, more than half of achievable rubric points remain unscored, validating OpenAI's bet that the subset would stay unsaturated through several model releases.

Performance insights

Finding	Implication
20 to 28 point drop from full HealthBench to Hard for most frontier models	Hard is a robust differentiator, not a small perturbation
Gains on Hard lag gains on full HealthBench	Brute-force memorization plateaus here
Context seeking and global health themes drive most errors	Models default to confident answers instead of asking
Hard scores correlate with reasoning architectures (o3, o4-mini, GPT-5) more than raw scale	Test-time reasoning helps where pattern matching does not

^[1]^[3]^[5]

Evaluation framework

Rubric-based grading

The Hard subset is graded by a model grader (simple-evals defaults to GPT-4.1, with GPT-4o-mini as a faster alternative) that checks each rubric criterion. Each criterion carries a positive or negative point weight assigned by physicians^[3]^[7]. A model's score is the proportion of achievable points earned, capped at zero on the low end. The Inspect Evals port supports optional length adjustment parameters (length_adjustment_center and length_adjustment_penalty_per_500_chars)^[7] for penalizing padded responses.

Physician panel

Component	Number
Physician validators	262
Countries	60
Specialties	26
Unique rubric criteria (full HealthBench)	48,562
Consensus dimensions	34

^[1]^[3]

HealthBench Hard uses the full physician-authored rubric set, not only consensus criteria, restricted to the 1,000 hardest conversations.

Comparison with main HealthBench and HealthBench Consensus

OpenAI released three views of the same evaluation:

Variant	Examples	What it measures	Top score (May 2025)	Saturating?
HealthBench	5,000	Broad performance across rubric criteria	60% (o3)	Slowly
HealthBench Consensus	3,671	Physician-consensus criteria (high precision)	High; errors rare for top models	Yes, on top models
HealthBench Hard	1,000	Hardest conversations for frontier models	32% (o3) at release; 46.2% (GPT-5) by Aug 2025	No

^[1]^[3]^[4]

Consensus spots regressions: failure on a consensus criterion is a flag worth investigating. Full HealthBench is the broad eval. Hard differentiates frontier systems and tracks marginal progress across model generations.

What HealthBench Hard does and does not test

Tests	Does not test
Multi-turn clinical conversation handling	Image understanding (text only)
Calibrated uncertainty and context seeking	Real-time clinical workflow integration
Global health and low-resource reasoning	Long-form chart review
Communication quality with diverse users	Diagnostic accuracy on private cohorts
Instruction following in healthcare scenarios	Prescribing legality in a jurisdiction

^[1]^[3]^[7]

Significance and reception

HealthBench Hard's design (filter by current model failure, then publish) sidesteps the saturation problem that has hit benchmarks like MMLU and HumanEval. The trade-off: difficulty is partially defined by time of creation, and the empirical center drifts as models improve. OpenAI has not announced a refresh, but simple-evals continues to host HealthBench as a maintained reference even after the rest of simple-evals stopped getting updates in July 2025^[2]^[9].

Third-party leaderboards include HealthBench Hard alongside the main benchmark, since two models within a couple of points on full HealthBench can sit 5 to 10 points apart on Hard. The arXiv preprint "OpenAI's HealthBench in Action" used HealthBench Hard to evaluate a clinical assistant called DR. INFO across model generations, updated into 2026^[8].

Frontier-model differentiator

On full HealthBench, top systems (o3, GPT-4.1, o4-mini) cluster within ~10 points. On Hard, the same systems can be 15+ points apart; Grok 3, Claude 3.7 Sonnet, and Gemini 2.5 Pro fall in the 20% to 25% range while OpenAI's reasoning models lead^[3]^[7].

Limitations

Limitation	Description
Empirical curation can drift	Difficulty is anchored to late 2024 and early 2025 frontier models
Small dataset size	1,000 conversations is tight for axis or theme breakdowns
English-leaning	Main HealthBench is multilingual; published Hard analyses focus on English
Model grader bias	GPT-4.1 or GPT-4o-mini as grader introduces self-favoritism risk on OpenAI models
No clinician baseline yet	Full HealthBench publishes physician baselines; Hard does not
Static dataset	Public release means conversations could leak into training data

^[1]^[3]^[5]

The grader concern is partially mitigated by HealthBench's meta-evaluation, which showed the GPT-4.1 grader's agreement with physicians on Consensus criteria fell within inter-physician agreement^[3]. Whether that holds at the harder tail is open.

Future directions

Periodic refresh, replacing now-easy conversations with new model-failure cases.
Multilingual Hard subsets across HealthBench's 49 languages.
Specialty-specific Hard subsets (cardiology-Hard, pediatrics-Hard).
Physician baseline scoring to calibrate the 32% to 46% range.
Image-augmented variants beyond the text-only foundation.

References

OpenAI. "Introducing HealthBench." May 12, 2025. https://openai.com/index/healthbench/
OpenAI. "openai/simple-evals" GitHub repository. https://github.com/openai/simple-evals
Arora, R. K. et al. "HealthBench: Evaluating Large Language Models Towards Improved Human Health." arXiv:2505.08775, May 2025. https://arxiv.org/abs/2505.08775
OpenAI. "Introducing GPT-5." GPT-5 launch materials reporting 46.2% on HealthBench Hard, Aug 2025. https://openai.com/index/introducing-gpt-5/
Riegler, M. A. "A closer look at OpenAI's new HealthBench evaluation benchmark," Medium, May 2025. https://medium.com/@michael_79773/a-closer-look-at-openais-new-healthbench-evaluation-benchmark-ed3455110a29
"openai-healthbench-analysis" repository. https://github.com/kelkalot/openai-healthbench-analysis
UK Government BEIS Inspect Evals. "HealthBench (including HealthBench Hard variant)." https://ukgovernmentbeis.github.io/inspect_evals/evals/knowledge/healthbench/
"OpenAI's HealthBench in Action: Evaluating an LLM-Based Medical Assistant on Realistic Clinical Queries," arXiv:2509.02594, revised Feb 2026. https://arxiv.org/abs/2509.02594
MarkTechPost. "OpenAI Releases HealthBench," May 12, 2025. https://www.marktechpost.com/2025/05/12/openai-releases-healthbench-an-open-source-benchmark-for-measuring-the-performance-and-safety-of-large-language-models-in-healthcare/

Overview

Why a hard subset

Selection methodology

What the hard examples look like

Dataset composition

Conversation structure

Five evaluation axes

Seven medical themes

Performance results

Frontier models in the May 2025 paper

Post-paper updates

Performance insights

Evaluation framework

Rubric-based grading

Physician panel

Comparison with main HealthBench and HealthBench Consensus

What HealthBench Hard does and does not test

Significance and reception

Frontier-model differentiator

Limitations

Future directions

See also

References

Improve this article

Related Articles

HealthBench

Humanity's Last Exam

BrowseComp

AA-LCR

GSO

AIME 2025

Overview

Why a hard subset

Selection methodology

What the hard examples look like

Dataset composition

Conversation structure

Five evaluation axes

Seven medical themes

Performance results

Frontier models in the May 2025 paper

Post-paper updates

Performance insights

Evaluation framework

Rubric-based grading

Physician panel

Comparison with main HealthBench and HealthBench Consensus

What HealthBench Hard does and does not test

Significance and reception

Frontier-model differentiator

Limitations

Future directions

See also

References

Related Articles

HealthBench

Humanity's Last Exam

BrowseComp

AA-LCR

GSO

AIME 2025