| HealthBench | |
|---|---|
| Overview | |
| Full name | HealthBench |
| Abbreviation | HealthBench |
| Description | A comprehensive evaluation benchmark for AI in healthcare that assesses models in realistic medical scenarios and workflows |
| Release date | 2025-05-12 |
| Latest version | 1.0 |
| Benchmark updated | 2025-05 |
| Authors | Karan Singhal and OpenAI Health AI Team |
| Organization | OpenAI |
| Technical Details | |
| Type | Healthcare AI, Medical Question Answering, Clinical Decision Support |
| Modality | Text, Conversational |
| Task format | Multi-turn dialogue, Clinical scenarios |
| Number of tasks | 5,000 |
| Total examples | 5,000 health conversations across 26 specialties |
| Evaluation metric | Accuracy, Communication Quality, Completeness, Context-Seeking, Instruction Following, Safety |
| Domains | Emergency Medicine, Cardiology, Pediatrics, Global Health, Primary Care, And 21 other medical specialties |
| Languages | 49 languages including English, Spanish, French, Amharic, Nepali |
| Performance | |
| Human performance | Expert physician baseline |
| Baseline | 0.16 (GPT-3.5 Turbo) |
| SOTA score | 0.60 |
| SOTA model | OpenAI o3 |
| SOTA date | 2025-05 |
| Saturated | No |
| Resources | |
| Website | Official website |
| Paper | Paper |
| GitHub | [Available on GitHub Repository] |
| Dataset | [Available through OpenAI Download] |
| License | CC BY-NC-4.0
|
HealthBench is a comprehensive artificial intelligence evaluation benchmark for healthcare applications, designed to assess large language models (LLMs) in realistic medical scenarios and clinical workflows. Released on May 12, 2025, by OpenAI[1], HealthBench represents a significant advancement in healthcare AI evaluation, moving beyond traditional medical exam questions to capture the complexity and nuance of real-world clinical interactions. The benchmark was developed in collaboration with 262 physicians from 60 countries, creating an unprecedented scale of medical expertise validation in AI evaluation[2].
HealthBench addresses a critical gap in healthcare AI evaluation by providing a shared standard for assessing model performance and safety in health contexts. Unlike traditional medical question-answering benchmarks that rely on exam-style questions, HealthBench evaluates models through 5,000 realistic health conversations spanning 26 medical specialties and 49 languages. The benchmark employs 48,562 unique physician-created rubric criteria to assess multiple dimensions of performance including clinical accuracy, communication quality, completeness, context-seeking ability, instruction following, and safety[2].
The development of HealthBench marks OpenAI's first major healthcare AI initiative and establishes a new standard for evaluating AI systems in medical contexts. The benchmark is particularly significant for several reasons:
HealthBench was developed through extensive collaboration with medical professionals. The development process involved:
| Contributor Type | Number | Role |
|---|---|---|
| Core Physician Contributors | 262 | Created scenarios, rubrics, and expert responses |
| Countries Represented | 60 | Provided diverse cultural and healthcare system perspectives |
| Medical Specialties | 26 | Ensured comprehensive clinical coverage |
| Additional Reviewers | 250+ | Validated and refined evaluation criteria |
The benchmark's 5,000 health conversations were carefully designed to reflect real-world clinical interactions[1]:
HealthBench covers 26 medical specialties, ensuring comprehensive evaluation across different clinical domains:
| Category | Specialties Included |
|---|---|
| Primary Care | Family Medicine, Internal Medicine, Pediatrics |
| Specialized Medicine | Cardiology, Neurology, Oncology, Endocrinology |
| Surgical Specialties | General Surgery, Orthopedics, Neurosurgery |
| Emergency Services | Emergency Medicine, Critical Care, Trauma |
| Mental Health | Psychiatry, Psychology, Addiction Medicine |
| Global Health | Tropical Medicine, Public Health, Epidemiology |
| Other Specialties | Radiology, Pathology, Anesthesiology, and more |
The benchmark organizes conversations into seven key themes[2]:
| Theme | Percentage | Description |
|---|---|---|
| Global Health | 21.9% | Healthcare in resource-limited settings, tropical diseases |
| Handling Uncertainty | 21.4% | Appropriate hedging, acknowledging limitations |
| Emergency Referrals | 15.7% | Recognizing urgent conditions requiring immediate care |
| Context-Seeking | 12.3% | Requesting necessary information for diagnosis |
| Patient Communication | 11.8% | Clear, empathetic explanations for patients |
| Health Data Tasks | 9.2% | Interpreting lab results, imaging, clinical data |
| Expertise-Tailored | 7.7% | Adapting communication for different audiences |
HealthBench employs a sophisticated rubric-based evaluation system:
1. **Custom Criteria**: Each conversation has physician-created evaluation criteria 2. **Weighted Scoring**: Criteria weighted by clinical importance 3. **Multi-dimensional**: Assesses six key dimensions of performance 4. **Total Criteria**: 48,562 unique evaluation points across all conversations
| Dimension | Description | Weight |
|---|---|---|
| **Accuracy** | Clinical correctness of medical information | High |
| **Communication Quality** | Clarity, appropriateness, and empathy in responses | High |
| **Completeness** | Thoroughness in addressing all relevant aspects | Medium |
| **Context-Seeking** | Ability to request necessary additional information | Medium |
| **Instruction Following** | Adherence to specific requirements and constraints | Medium |
| **Safety** | Avoiding potentially harmful recommendations | Critical |
The evaluation methodology uses GPT-4.1 as an automated evaluator[1]:
1. **Response Generation**: Model generates response to conversation prompt 2. **Criteria Application**: GPT-4.1 evaluates response against physician-created rubric 3. **Weighted Scoring**: Scores weighted according to clinical importance 4. **Aggregation**: Individual criteria scores combined for overall performance metric
| Rank | Model | Score | Notable Strengths | Limitations |
|---|---|---|---|---|
| 1 | OpenAI o3 | 0.60 | Comprehensive responses, safety awareness | Computational cost |
| 2 | Gemini 2.5 Pro | ~0.55 | Strong clinical reasoning | Variable across specialties |
| 3 | Grok 3 | ~0.54 | Good general knowledge | Limited context-seeking |
| 4 | Claude 3.7 Sonnet | ~0.48 | Clear communication | Lower clinical accuracy |
| 5 | GPT-4o | ~0.42 | Fast responses | 15.8% error rate |
| Baseline | GPT-3.5 Turbo | 0.16 | Basic functionality | Significant limitations |
Recent developments in healthcare AI performance on HealthBench:
Analysis of model performance reveals several important patterns[2]:
| Capability | Best Performers | Key Challenges |
|---|---|---|
| Clinical Accuracy | o3, Gemini 2.5 Pro | Rare conditions, complex cases |
| Safety Awareness | o3, GPT-4.1 | Recognizing contraindications |
| Communication | Claude models | Technical vs. lay language balance |
| Context-Seeking | o3 | Knowing when to request information |
| Multilingual | Gemini 2.5 Pro | Low-resource languages |
HealthBench evaluates models across 49 languages, revealing significant disparities:
The benchmark has received significant recognition from the medical and AI communities:
HealthBench has already influenced several research directions:
1. **On-Device Healthcare AI**: Stanford researchers adapted HealthBench for evaluating lightweight models 2. **Multilingual Medical AI**: Focus on improving performance in underserved languages 3. **Safety Research**: New methods for detecting and preventing harmful medical advice 4. **Clinical Workflow Integration**: Studies on practical deployment in healthcare settings
Despite its comprehensive nature, HealthBench has several acknowledged limitations:
| Limitation | Description | Future Work |
|---|---|---|
| Text-Only | Currently limited to conversational text | Multimodal extensions planned |
| Automated Evaluation | Relies on GPT-4.1 for scoring | Human evaluation studies ongoing |
| Static Scenarios | Fixed set of conversations | Dynamic scenario generation considered |
| Western Bias | Despite diversity, some Western medical bias | Expanding global representation |
Planned improvements and extensions to HealthBench include:
HealthBench is available under the CC BY-NC-4.0 license, allowing non-commercial use with attribution[1]:
```python
from healthbench import HealthBenchEvaluator
evaluator = HealthBenchEvaluator() results = evaluator.evaluate(
model=your_model, conversations=healthbench_dataset, metrics=['accuracy', 'safety', 'communication']
)
print(f"Overall Score: {results['overall_score']}") print(f"Safety Score: {results['safety_score']}") ```
Each conversation in HealthBench follows a structured format:
```json {
"conversation_id": "HB_001234",
"specialty": "Emergency Medicine",
"languages": ["English"],
"turns": [
{
"role": "patient",
"content": "I've had severe chest pain for 2 hours..."
},
{
"role": "assistant",
"content": "..."
}
],
"rubric": {
"criteria": [...],
"weights": [...],
"safety_critical": [...]
}
} ```
HealthBench represents a watershed moment in healthcare AI evaluation, establishing rigorous standards for assessing AI systems in medical contexts. By combining unprecedented physician involvement, realistic scenarios, comprehensive coverage, and sophisticated evaluation methodology, it provides the foundation for developing safer, more effective healthcare AI systems. The benchmark's emphasis on global health, multilingual support, and safety makes it particularly valuable for ensuring AI benefits all populations equitably.
As AI systems approach and potentially exceed human performance on HealthBench, the benchmark serves as both a measure of progress and a guide for responsible development. Its open availability and transparent methodology enable researchers worldwide to contribute to advancing healthcare AI while maintaining high standards for safety and efficacy.