# HealthBench Hard

> Source: https://aiwiki.ai/wiki/healthbench_hard
> Updated: 2026-05-10
> Categories: AI Benchmarks, Healthcare AI, OpenAI
> License: CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/)
> From AI Wiki (https://aiwiki.ai), the free encyclopedia of artificial intelligence. Reuse freely with attribution to "AI Wiki (aiwiki.ai)".

| HealthBench Hard | |
| --- | --- |
| Overview |
| Full name | HealthBench Hard |
| Abbreviation | HealthBench Hard |
| Description | A 1,000 example curated subset of HealthBench, selected because frontier models score poorly on it |
| Release date | 2025-05-12 |
| Latest version | 1.0 |
| Benchmark updated | 2025-05 |
| Authors | OpenAI Research Team |
| Organization | [OpenAI](/wiki/openai) |
| Technical Details |
| Type | Healthcare AI, Multi-turn dialogue, Clinical reasoning |
| Modality | Text |
| Task format | Multi-turn healthcare conversations with rubric-based evaluation |
| Number of tasks | 1,000 conversations |
| Total examples | 1,000 (subset of 5,000 in main HealthBench) |
| Evaluation metric | Rubric-based scoring with a model grader (default GPT-4.1) |
| Domains | Emergency referrals, global health, clinical data, context seeking, hedging, communication, complex responses |
| Languages | English (drawn from main HealthBench, which spans 49 languages) |
| Performance |
| Human performance | Not yet published for the Hard subset |
| Baseline | Near 0% on many examples (older models) |
| SOTA score (May 2025 paper) | 32% (o3) |
| SOTA model (May 2025 paper) | [o3](/wiki/o3) |
| Reported score (Aug 2025) | 46.2% (GPT-5) |
| Saturated | No |
| Resources |
| Website | [Official announcement](https://openai.com/index/healthbench/) |
| Paper | [arXiv:2505.08775](https://arxiv.org/abs/2505.08775) |
| GitHub | [openai/simple-evals](https://github.com/openai/simple-evals) |
| Dataset | [openai/healthbench](https://huggingface.co/datasets/openai/healthbench) |
| License | MIT |
| Predecessor | [HealthBench](/wiki/healthbench) (full 5,000 example set) |

**HealthBench Hard** is a curated 1,000-example subset of the [HealthBench](/wiki/healthbench) [benchmark](/wiki/benchmark), released by [OpenAI](/wiki/openai) on May 12, 2025[1][2]. The subset was drawn from HealthBench's full set of 5,000 conversations by selecting cases where then-current frontier [large language models](/wiki/large_language_model) scored poorly, including examples on which leading systems earned zero credit[1][3]. At release, [o3](/wiki/o3) held the top score at 32% on HealthBench Hard, compared with 60% on full HealthBench, a gap OpenAI framed as "plenty of headroom for the next generation of models"[1][3]. By August 2025, [GPT-5](/wiki/gpt-5) had raised the top reported score to 46.2%, but the subset remains far from saturation[4].

HealthBench Hard plays two roles. It differentiates frontier models: full HealthBench compresses the scores of strong systems near the top, while Hard spreads them out. It is also meant to stay useful as models improve. It was released alongside HealthBench Consensus (a high-precision subset filtered to physician-validated criteria), and the two derivative variants "respectively aim to be highly validated and unsaturated"[3].

## Overview

HealthBench Hard is a strict subset, not a separate dataset. Every conversation is also in the main HealthBench, with the same multi-turn structure, physician-written rubrics, five axes, and seven themes[3]. OpenAI filtered 5,000 conversations down to 1,000 examples on which a panel of frontier models did badly, biased toward cases where multiple leading systems failed at once[1][3].

Unlike medical AI benchmarks built on multiple-choice questions (for example [MedQA](/wiki/medqa) or USMLE-style tests), HealthBench evaluates models through realistic, open-ended conversations. Full HealthBench was built with 262 physicians from 60 countries across 26 specialties, producing 48,562 rubric criteria, and HealthBench Hard inherits that scaffolding[1][3].

### Why a hard subset

Frontier model scores on full HealthBench rose from 16% (GPT-3.5 Turbo) up to 60% (o3) inside roughly two years, with sharp acceleration during 2025 as o3, o4-mini, and GPT-4.1 came online. If top systems cluster above 70%, the benchmark stops differentiating models. HealthBench Hard keeps a tail of difficulty available; OpenAI described the goal as "a worthy target for model improvements for months to come"[1].

## Selection methodology

The selection process is documented in Appendix C of the HealthBench paper[3]:

1. Score every conversation in full HealthBench across a panel of frontier models.
2. Identify conversations where multiple state-of-the-art systems performed poorly.
3. Take the bottom-performing 1,000 conversations as HealthBench Hard.

HealthBench Hard is empirically curated, not flagged by physicians: hardness is defined by how badly a basket of frontier models did. The subset is biased toward failure modes systemic to current LLMs, not what clinicians intuitively call hard. Many conversations involve subtle requirements that models miss, rather than rare diseases.
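The three steps above can be sketched in a few lines of Python. This is a minimal illustration, not the procedure from Appendix C: the source does not say how scores are combined across the panel, so averaging them per conversation is an assumption here, and `select_hard_subset`, the model names, and the toy scores are all hypothetical.

```python
def select_hard_subset(scores_by_model, k):
    """Return the ids of the k conversations with the lowest panel-mean score.

    scores_by_model maps a model name to {conversation_id: score}.
    Averaging across models is an assumption made for this sketch.
    """
    # Only rank conversations that every panel model was scored on.
    shared_ids = set.intersection(*(set(s) for s in scores_by_model.values()))
    panel_mean = {
        cid: sum(s[cid] for s in scores_by_model.values()) / len(scores_by_model)
        for cid in shared_ids
    }
    # Sort ascending by mean score; break ties by id so the result is deterministic.
    ranked = sorted(panel_mean, key=lambda cid: (panel_mean[cid], cid))
    return ranked[:k]


# Toy panel: three hypothetical models scored on five conversations.
panel = {
    "model_a": {"c1": 0.9, "c2": 0.1, "c3": 0.5, "c4": 0.0, "c5": 0.7},
    "model_b": {"c1": 0.8, "c2": 0.2, "c3": 0.4, "c4": 0.1, "c5": 0.6},
    "model_c": {"c1": 0.7, "c2": 0.0, "c3": 0.6, "c4": 0.0, "c5": 0.9},
}
hard_ids = select_hard_subset(panel, k=2)
print(hard_ids)  # ['c4', 'c2']
```

With the real benchmark, `k` would be 1,000 and the panel would be the frontier models OpenAI scored; any such ranking depends on which models are in the panel, which is why the resulting subset reflects model failure modes more than clinical difficulty.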

### What the hard examples look like

Difficulty does not come from rubric volume. Hard conversations average 11.8 rubric criteria per example (median 11.0), nearly identical to the full HealthBench average of 11.4. The distribution of points across axes is also similar, with completeness at ~39.1% of points and accuracy at ~29.7%[5][6]. The difference is in themes:

| Theme | Hard share | Full HealthBench share |
| --- | --- | --- |
| theme:global_health | 28.0% | 21.9% |
| theme:context_seeking | 17.9% | 11.9% |
| theme:emergency_referrals | 6.6% | similar |
| theme:complex_responses | enriched | lower |
| theme:hedging | enriched | lower |

[5][6]

The conversations that frontier models fail are disproportionately ones requiring low-resource reasoning (global health), clarifying questions (context seeking), or careful handling of uncertainty (hedging). Riegler concludes hardness "stems not from higher criterion volume or drastically different thematic focus, but rather from the intrinsic complexity of the prompts or the specific nuances required by the criteria"[5].
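The theme comparison can be reproduced with a simple tag count. The sketch below assumes each example is a dict with an `example_tags` list containing `theme:` tags; that field name and the helper names are assumptions for illustration, and the toy records are made up. The final lines apply the same ratio to the two pairs of shares published in the table above.

```python
from collections import Counter


def theme_shares(examples):
    """Map each theme tag to the fraction of examples that carry it."""
    counts = Counter(
        tag
        for ex in examples
        for tag in ex["example_tags"]
        if tag.startswith("theme:")
    )
    return {tag: n / len(examples) for tag, n in counts.items()}


def enrichment(hard, full):
    """Ratio of Hard share to full-set share for themes present in both."""
    hard_share, full_share = theme_shares(hard), theme_shares(full)
    return {t: hard_share[t] / full_share[t] for t in hard_share if t in full_share}


def make(tag, n):
    """Build n toy examples carrying a single theme tag."""
    return [{"example_tags": [tag]} for _ in range(n)]


# Toy data only, not the real dataset.
full = make("theme:global_health", 2) + make("theme:context_seeking", 2) + make("theme:communication", 4)
hard = make("theme:global_health", 2) + make("theme:context_seeking", 1) + make("theme:communication", 1)
print(enrichment(hard, full))

# Same ratio applied to the published (Hard %, full %) shares.
published = {
    "theme:global_health": (28.0, 21.9),
    "theme:context_seeking": (17.9, 11.9),
}
ratios = {t: round(h / f, 2) for t, (h, f) in published.items()}
print(ratios)
```

A ratio above 1 means the theme is over-represented in Hard. On the published shares, context seeking is enriched by roughly half and global health by a bit more than a quarter.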

## Dataset composition

### Conversation structure

| Aspect | Specification |
| --- | --- |
| Average turns | ~2.6 |
| Avg rubric criteria per example | 11.8 (median 11.0) |
| Total rubric criteria | ~11,800 |
| Languages | English (drawn from multilingual base) |
| License | MIT |

[1][3][5]

### Five evaluation axes

| Axis | What it grades |
| --- | --- |
| Accuracy | Factual and clinical correctness |
| Completeness | Coverage of rubric-required aspects |
| Communication quality | Clarity, tone, structure |
| Context awareness | Use or seeking of correct context |
| Instruction following | Compliance with user requests |

[3][7]

### Seven medical themes

| Theme | Dataset tag | Focus |
| --- | --- | --- |
| Emergency referrals | theme:emergency_referrals | Recognizing urgent-care escalation |
| Context seeking | theme:context_seeking | Asking clarifying questions |
| Global health | theme:global_health | Low-resource or non-Western reasoning |
| Health data tasks | theme:health_data_tasks | Clinical data, summaries, notes |
| Expertise-tailored communication | theme:communication | Calibrating depth to user expertise |
| Responding under uncertainty | theme:hedging | Hedging on incomplete evidence |
| Response depth | theme:complex_responses | Choosing length and detail |

[3][7]

## Performance results

### Frontier models in the May 2025 paper

The Hard subset widens the gap between models that look close on the full benchmark[1][3][8]:

| Model | Full HealthBench | HealthBench Hard | Gap |
| --- | --- | --- | --- |
| [o3](/wiki/o3) | 60% | 32% (top score) | 28 points |
| [GPT-4.1](/wiki/gpt-4.1) | 49% to 53% | High 20s | ~25 points |
| [o4-mini](/wiki/o4-mini) | High 50s | Mid to high 20s | ~25 to 30 points |
| [o1](/wiki/o1) | ~49% | Lower than GPT-4.1 | ~20 points |
| [GPT-4o](/wiki/gpt-4o) (Aug 2024) | 32% | Substantially lower | Notable drop |
| [GPT-3.5 Turbo](/wiki/gpt-3.5_turbo) | 16% | Near zero on many examples | Most extreme |
| [Grok 3](/wiki/grok_3) | n/a | ~0.226 (22.6%) | n/a |
| [Claude 3.7 Sonnet](/wiki/claude_3.7_sonnet) | n/a | ~20% to 21% | n/a |
| [Gemini 2.5 Pro](/wiki/gemini_2.5_pro) | n/a | ~24% to 25% | n/a |
| [Llama 4 Maverick](/wiki/llama_4) | n/a | Comparable to GPT-4.1 levels | n/a |

The paper notes that o3 and GPT-4.1 cut error rates on HealthBench Consensus dramatically compared to GPT-4o, but on HealthBench Hard absolute scores stay low across the board. Consensus shows whether models meet a physician-validated safety bar; Hard shows whether they have anywhere left to climb[3].

### Post-paper updates

| Model | HealthBench Hard score | Source |
| --- | --- | --- |
| [GPT-5](/wiki/gpt-5) | 46.2% | OpenAI GPT-5 launch, Aug 2025[4] |
| Gemini 2.5 Pro | 0.243 (24.3%) | Inspect Evals 250-sample, Feb 2026[7] |
| Claude 3.7 Sonnet | 0.205 (20.5%) | Inspect Evals 250-sample, Feb 2026[7] |
| o1 | 0.180 (18.0%) | Inspect Evals 250-sample, Feb 2026[7] |

GPT-5's 46.2% is a 14.2-point gain over o3's 32% at release and the largest single jump since launch. Even so, more than half of the achievable rubric points (53.8%) remain unearned, consistent with OpenAI's expectation that the subset would stay unsaturated through several model releases.

### Performance insights

| Finding | Implication |
| --- | --- |
| 20 to 28 point drop from full HealthBench to Hard for most frontier models | Hard is a robust differentiator, not a small perturbation |
| Gains on Hard lag gains on full HealthBench | Brute-force memorization plateaus here |
| Context seeking and global health themes drive most errors | Models default to confident answers instead of asking |
| Hard scores correlate with reasoning architectures (o3, o4-mini, GPT-5) more than raw scale | Test-time reasoning helps where pattern matching does not |

[1][3][5]

## Evaluation framework

### Rubric-based grading

The Hard subset is graded by a model grader (simple-evals defaults to GPT-4.1, with GPT-4o-mini as a faster alternative) that checks each rubric criterion. Each criterion carries a positive or negative point weight assigned by physicians[3][7]. A model's score is the share of achievable (positive) points it earns; because negative-weight criteria can push the raw total below zero, the score is floored at zero. The Inspect Evals port also supports optional length-adjustment parameters (`length_adjustment_center` and `length_adjustment_penalty_per_500_chars`) for penalizing padded responses[7].
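The scoring rule can be illustrated with a short sketch. It assumes the grader has already decided, per criterion, whether the response met it. The `Criterion` class and function names are hypothetical, and applying the zero floor to the benchmark-level mean instead of to each example is an assumption of this sketch, not something the text above specifies.

```python
from dataclasses import dataclass


@dataclass
class Criterion:
    points: int  # positive = desirable behaviour, negative = undesirable
    met: bool    # grader's verdict for this response


def example_score(criteria):
    """Points earned divided by the maximum achievable (positive) points."""
    achievable = sum(c.points for c in criteria if c.points > 0)
    if achievable == 0:
        return None  # no positive criteria, so nothing to normalise by
    earned = sum(c.points for c in criteria if c.met)
    return earned / achievable


def benchmark_score(per_example):
    """Mean of per-example scores, clipped to the range [0, 1]."""
    valid = [s for s in per_example if s is not None]
    return min(1.0, max(0.0, sum(valid) / len(valid)))


# Toy rubrics, not taken from the dataset.
first = [Criterion(10, True), Criterion(5, False), Criterion(-8, True)]
second = [Criterion(5, False), Criterion(-6, True)]

print(example_score(first))   # 2 of 15 achievable points
print(example_score(second))  # negative: only an undesirable criterion was met
print(benchmark_score([example_score(first), example_score(second)]))
```

In the second toy example the response triggers a penalty without earning any positive points, so its raw score is negative. The floor keeps such cases from dragging a reported score below zero.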

### Physician panel

| Component | Number |
| --- | --- |
| Physician validators | 262 |
| Countries | 60 |
| Specialties | 26 |
| Unique rubric criteria (full HealthBench) | 48,562 |
| Consensus dimensions | 34 |

[1][3]

HealthBench Hard uses the full physician-authored rubric set, not only consensus criteria, restricted to the 1,000 hardest conversations.

## Comparison with main HealthBench and HealthBench Consensus

OpenAI released three views of the same evaluation:

| Variant | Examples | What it measures | Top score (May 2025) | Saturating? |
| --- | --- | --- | --- | --- |
| HealthBench | 5,000 | Broad performance across rubric criteria | 60% (o3) | Slowly |
| HealthBench Consensus | 3,671 | Physician-consensus criteria (high precision) | High; errors rare for top models | Yes, on top models |
| HealthBench Hard | 1,000 | Hardest conversations for frontier models | 32% (o3) at release; 46.2% (GPT-5) by Aug 2025 | No |

[1][3][4]

Consensus spots regressions: failure on a consensus criterion is a flag worth investigating. Full HealthBench is the broad eval. Hard differentiates frontier systems and tracks marginal progress across model generations.

### What HealthBench Hard does and does not test

| Tests | Does not test |
| --- | --- |
| Multi-turn clinical conversation handling | Image understanding (text only) |
| Calibrated uncertainty and context seeking | Real-time clinical workflow integration |
| Global health and low-resource reasoning | Long-form chart review |
| Communication quality with diverse users | Diagnostic accuracy on private cohorts |
| Instruction following in healthcare scenarios | Prescribing legality in a jurisdiction |

[1][3][7]

## Significance and reception

HealthBench Hard's design (filter by current model failure, then publish) is meant to delay the saturation that has hit benchmarks like [MMLU](/wiki/mmlu) and [HumanEval](/wiki/humaneval). The trade-off is that difficulty is defined relative to the models available when the subset was built, so the subset gets relatively easier as models improve. OpenAI has not announced a refresh, but simple-evals continues to host HealthBench as a maintained reference even after the rest of simple-evals stopped getting updates in July 2025[2][9].

Third-party leaderboards include HealthBench Hard alongside the main benchmark, since two models within a couple of points on full HealthBench can sit 5 to 10 points apart on Hard. The arXiv preprint "OpenAI's HealthBench in Action" used HealthBench Hard to evaluate a clinical assistant called DR. INFO across model generations, updated into 2026[8].

### Frontier-model differentiator

On full HealthBench, top systems (o3, GPT-4.1, o4-mini) cluster within ~10 points. On Hard, the same systems can be 15+ points apart; Grok 3, Claude 3.7 Sonnet, and Gemini 2.5 Pro fall in the 20% to 25% range while OpenAI's reasoning models lead[3][7].

## Limitations

| Limitation | Description |
| --- | --- |
| Empirical curation can drift | Difficulty is anchored to late 2024 and early 2025 frontier models |
| Small dataset size | 1,000 conversations is tight for axis or theme breakdowns |
| English-leaning | Main HealthBench is multilingual; published Hard analyses focus on English |
| Model grader bias | GPT-4.1 or GPT-4o-mini as grader introduces self-favoritism risk on OpenAI models |
| No clinician baseline yet | Full HealthBench publishes physician baselines; Hard does not |
| Static dataset | Public release means conversations could leak into training data |

[1][3][5]

The grader concern is partially mitigated by HealthBench's meta-evaluation, which showed the GPT-4.1 grader's agreement with physicians on Consensus criteria fell within inter-physician agreement[3]. Whether that holds at the harder tail is open.
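The small-dataset limitation in the table above can be made concrete with a percentile bootstrap. The sketch below uses synthetic uniform scores, not real HealthBench results, and compares the confidence interval on all 1,000 conversations with the interval on a 66-conversation slice, the size of a theme covering 6.6% of the subset. The function name and parameters are illustrative.

```python
import random


def bootstrap_ci(scores, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean score."""
    rng = random.Random(seed)
    n = len(scores)
    # Resample with replacement and record the mean of each resample.
    means = sorted(sum(rng.choices(scores, k=n)) / n for _ in range(n_resamples))
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi


# Synthetic per-conversation scores in [0, 1); NOT real benchmark data.
data_rng = random.Random(1)
all_scores = [data_rng.random() for _ in range(1000)]
theme_scores = all_scores[:66]  # a slice the size of a 6.6% theme

full_lo, full_hi = bootstrap_ci(all_scores)
theme_lo, theme_hi = bootstrap_ci(theme_scores)
full_width = full_hi - full_lo
theme_width = theme_hi - theme_lo

print(f"all 1,000 conversations: CI width {full_width:.3f}")
print(f"66-conversation slice:   CI width {theme_width:.3f}")
```

The interval on the slice is several times wider than on the full subset, so a difference of a few points between two models on a single theme or axis may not be distinguishable from sampling noise.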

## Future directions

1. Periodic refresh, replacing now-easy conversations with new model-failure cases.
2. Multilingual Hard subsets across HealthBench's 49 languages.
3. Specialty-specific Hard subsets (cardiology-Hard, pediatrics-Hard).
4. Physician baseline scoring to calibrate the 32% to 46% range.
5. Image-augmented variants beyond the text-only foundation.

## See also

- [HealthBench](/wiki/healthbench)
- [Benchmark](/wiki/benchmark)
- [OpenAI](/wiki/openai)
- [o3](/wiki/o3)
- [GPT-4.1](/wiki/gpt-4.1)
- [GPT-5](/wiki/gpt-5)
- [Large Language Model](/wiki/large_language_model)
- [MedQA](/wiki/medqa)
- [Medical AI Evaluation](/wiki/medical_ai_evaluation)

## References

1. OpenAI. "Introducing HealthBench." May 12, 2025. https://openai.com/index/healthbench/
2. OpenAI. "openai/simple-evals" GitHub repository. https://github.com/openai/simple-evals
3. Arora, R. K. et al. "HealthBench: Evaluating Large Language Models Towards Improved Human Health." arXiv:2505.08775, May 2025. https://arxiv.org/abs/2505.08775
4. OpenAI. "Introducing GPT-5." GPT-5 launch materials reporting 46.2% on HealthBench Hard, Aug 2025. https://openai.com/index/introducing-gpt-5/
5. Riegler, M. A. "A closer look at OpenAI's new HealthBench evaluation benchmark," Medium, May 2025. https://medium.com/@michael_79773/a-closer-look-at-openais-new-healthbench-evaluation-benchmark-ed3455110a29
6. "openai-healthbench-analysis" repository. https://github.com/kelkalot/openai-healthbench-analysis
7. UK Government BEIS Inspect Evals. "HealthBench (including HealthBench Hard variant)." https://ukgovernmentbeis.github.io/inspect_evals/evals/knowledge/healthbench/
8. "OpenAI's HealthBench in Action: Evaluating an LLM-Based Medical Assistant on Realistic Clinical Queries," arXiv:2509.02594, revised Feb 2026. https://arxiv.org/abs/2509.02594
9. MarkTechPost. "OpenAI Releases HealthBench," May 12, 2025. https://www.marktechpost.com/2025/05/12/openai-releases-healthbench-an-open-source-benchmark-for-measuring-the-performance-and-safety-of-large-language-models-in-healthcare/