# HELM (Holistic Evaluation of Language Models)

> Source: https://aiwiki.ai/wiki/helm
> Updated: 2026-06-24
> Categories: AI Benchmarks, Model Evaluation
> From AI Wiki (https://aiwiki.ai), a free encyclopedia of artificial intelligence. Quote with attribution.

**HELM** (Holistic Evaluation of Language Models) is an open-source [benchmark](/wiki/benchmark) framework created by the Center for Research on Foundation Models (CRFM) at Stanford University for the reproducible and transparent evaluation of [large language models](/wiki/large_language_model) and other foundation models.[^1] First released in November 2022 alongside the arXiv preprint "Holistic Evaluation of Language Models" by Percy Liang, Rishi Bommasani, Tony Lee and 47 co-authors, HELM evaluates models across many scenarios on seven categories of metrics at once: accuracy, calibration, robustness, fairness, bias, toxicity, and efficiency, all under a single uniform prompting protocol so that any two models are compared under the same conditions.[^2][^3] The authors state the goal plainly: "We present Holistic Evaluation of Language Models (HELM) to improve the transparency of language models."[^2] The launch evaluation ran 30 models from 12 organizations across 42 scenarios, conducting more than 4,900 evaluations over roughly 12 billion tokens and 17 million model API calls at a cost of about $38,000 for commercial APIs plus nearly 20,000 GPU-hours for open models.[^3][^11] Since the original release ("HELM Classic") the framework has grown into a family of leaderboards, including HELM Lite, HELM Instruct, HELM MMLU, HELM Safety, HELM Capabilities, HELM Long Context, VHELM for vision-language models, HEIM for text-to-image models, and MedHELM for medical tasks, all hosted at `crfm.stanford.edu/helm`.[^1][^4][^5][^6][^7][^8][^23]

## Infobox

| Item | Value |
|---|---|
| Full name | Holistic Evaluation of Language Models |
| Type | Open-source LLM benchmark framework and living leaderboard |
| Creator | Stanford Center for Research on Foundation Models (CRFM) |
| Lead authors | Percy Liang, Rishi Bommasani, Tony Lee, Dimitris Tsipras (et al., 50 total)[^2] |
| First arXiv release | 16 November 2022 (arXiv:2211.09110)[^2] |
| TMLR publication | August 2023[^9] |
| Original evaluation | 30 models, 12 organizations, 42 scenarios, 7 metrics[^3] |
| GitHub | `stanford-crfm/helm` (Apache-2.0 license, ~2.8k stars)[^1] |
| Leaderboard | `crfm.stanford.edu/helm`[^4] |
| Latest framework release | v0.5.16 (30 April 2026) on GitHub and PyPI[^10] |
| Maintenance mode | Entered 1 June 2026[^1] |

## What problem does HELM solve?

By late 2022, large language models such as [GPT-3](/wiki/gpt-3), [BLOOM](/wiki/bloom), and Anthropic's earliest production model had proliferated, but their evaluation was fragmented: model creators reported scores on different subsets of benchmarks, often with non-comparable prompting conventions, and risk-oriented metrics like toxicity or bias were rarely reported alongside accuracy.[^2][^3] In the HELM paper, Liang and colleagues observed that before their work prominent models had, on average, been evaluated on only 17.9% of HELM's core scenarios, leaving substantial gaps in the public record of model capabilities.[^11] As the CRFM team put it in the launch announcement, "As language models become the substrate for language technologies, the absence of an evaluation standard compromises the community's ability to see the full landscape of language models," and "Transparency is the vital first step" toward trust and standards.[^3]

HELM was developed at CRFM, the foundation-models center launched inside [Stanford HAI](/wiki/stanford_hai) in 2021 with [Percy Liang](/wiki/percy_liang) as faculty director; the paper's 50 listed authors include Rishi Bommasani, Tony Lee, and [Christopher Manning](/wiki/christopher_manning).[^2] The paper was first posted to arXiv on 16 November 2022 and announced the next day on the CRFM blog; it was published in Transactions on Machine Learning Research (TMLR) in August 2023, and a substantially revised arXiv v2 followed on 1 October 2023.[^2][^3][^9]

The framework was deliberately positioned as a "living benchmark": the authors released raw prompts, completions, and a modular Python toolkit so that researchers could add new scenarios, new metrics, or new models and re-run the evaluation themselves, with results aggregated into a public leaderboard at `crfm.stanford.edu/helm`.[^11][^4] The blog post framed the long-term ambition directly: "We intend for HELM to serve as a map for the world of language models, continually updated over time, through collaboration with the broader community."[^3]

## What is HELM's design philosophy?

HELM's central claim is that evaluation of language models should be *holistic*. The paper distills this into three principles.[^2][^3]

1. **Broad coverage with explicit recognition of gaps.** Rather than pick a single benchmark, HELM organizes evaluation around a taxonomy of scenarios (use cases, domains, languages, demographic groups) and a taxonomy of metrics. The taxonomy makes it possible to enumerate not only what is evaluated but also what is missing.[^2]
2. **Multi-metric measurement.** For each scenario, HELM tries to measure seven categories of metrics in the same context, rather than relegating risks such as bias or toxicity to separate, accuracy-only studies.[^2][^3]
3. **Standardization.** Every model is run on every scenario through a single Python codebase with uniform prompting (typically 5-shot in-context learning with fixed templates), so that comparisons are not confounded by prompting differences.[^2][^11]

This last point distinguishes HELM from the way many model cards report numbers. The HELM MMLU effort, for example, found that scores reported by model providers on MMLU often differed from HELM's standardized re-evaluations by as much as five percentage points, and that reported numbers were frequently higher than HELM's, suggesting advantageous prompting in vendor reports.[^7]
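The standardization principle can be illustrated with a minimal Python sketch of uniform few-shot prompting. This is not HELM's adapter code: the function name `build_prompt`, the `Question:`/`Answer:` template, and the seeded sampling of in-context examples are all assumptions chosen for illustration. The point is only that the prompt depends on the scenario and the seed, never on which model is being queried.

```python
import random

def build_prompt(train_examples, test_question, num_shots=5, seed=0):
    """Assemble a few-shot prompt from a fixed template (hypothetical format)."""
    # A fixed seed selects the same in-context examples for every model.
    rng = random.Random(seed)
    shots = rng.sample(train_examples, k=min(num_shots, len(train_examples)))
    blocks = [f"Question: {q}\nAnswer: {a}" for q, a in shots]
    # The test question uses the same template, with the answer left blank.
    blocks.append(f"Question: {test_question}\nAnswer:")
    return "\n\n".join(blocks)

# Toy training pool (made-up data, not a HELM scenario).
train = [(f"What is {i} + {i}?", str(2 * i)) for i in range(1, 9)]

prompt_a = build_prompt(train, "What is 10 + 10?")
prompt_b = build_prompt(train, "What is 10 + 10?")
assert prompt_a == prompt_b  # same scenario and seed give the same prompt
print(prompt_a)
```

Changing `seed` changes which examples appear in the prompt, which is one reason a benchmark may average over several seeds.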

## What are the seven HELM metrics?

HELM Classic defines seven categories of metrics, each applied wherever feasible to each core scenario.[^2][^3]

| Metric | What it measures |
|---|---|
| Accuracy | Standard task-specific quality (exact match, F1, ROUGE, etc.) on the scenario's reference answers.[^2] |
| [Calibration](/wiki/calibration) | How well a model's expressed confidence matches its empirical correctness; computed where token-level log-probabilities are available.[^2] |
| Robustness | Performance under typo-style and equivalence-preserving perturbations of the input, and under invariance-style transformations meant to mimic real-world noise.[^2] |
| Fairness | Performance shifts when demographic features (names, dialects) in the input are changed, including comparisons across African-American English dialects and counterfactual demographic swaps.[^2] |
| Bias | Demographic representation in model outputs (e.g., gender or race associations in generation), measured independently of correctness.[^2] |
| Toxicity | Rate of harmful or insulting generations, scored automatically using a toxicity classifier on free-form outputs.[^2] |
| Efficiency | Wall-clock and idealized inference cost, allowing comparison of accuracy against compute or latency budgets.[^2] |

Because not every metric is well-defined for every scenario (calibration, for example, requires API access to token probabilities), HELM Classic reports that it achieves coverage of roughly 87.5% across the metric-by-scenario grid.[^2][^11]
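As an illustration of what a calibration metric computes, the sketch below implements expected calibration error (ECE), one common way to compare stated confidence with empirical correctness. The text above does not say which formulation HELM uses, so the equal-width binning, the function name, and the toy data are assumptions.

```python
def expected_calibration_error(confidences, correct, num_bins=10):
    """Weighted average gap between mean confidence and accuracy per bin."""
    n = len(confidences)
    bins = [[] for _ in range(num_bins)]
    for conf, ok in zip(confidences, correct):
        # Equal-width bins over [0, 1]; a confidence of 1.0 goes in the last bin.
        idx = min(int(conf * num_bins), num_bins - 1)
        bins[idx].append((conf, ok))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / n) * abs(avg_conf - accuracy)
    return ece

# Toy example: the model claims 0.9 confidence but is right only half the time.
overconfident = expected_calibration_error([0.9, 0.9, 0.9, 0.9],
                                           [True, True, False, False])
# A model that is always certain and always right has no calibration gap.
perfect = expected_calibration_error([1.0, 1.0], [True, True])
print(round(overconfident, 3), perfect)
```

The confidences would come from token-level log-probabilities, which is why the metric cannot be computed for APIs that do not expose them.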

## What are the 16 core scenarios?

HELM Classic defines 16 *core* scenarios that span six user-facing task families: question answering, information retrieval, summarization, sentiment analysis, toxicity detection, and miscellaneous text classification.[^11] Concrete datasets used as core scenarios include [MMLU](/wiki/mmlu), BoolQ, NarrativeQA, NaturalQuestions, QuAC, [HellaSwag](/wiki/hellaswag), OpenbookQA, [TruthfulQA](/wiki/truthfulqa), MS MARCO, CNN/DailyMail, XSum, IMDB, CivilComments, and RAFT, among others.[^12] On top of the core, the paper adds seven targeted evaluations, built on 26 additional scenarios, for skills such as reasoning, knowledge, language modeling, and disinformation generation. This gives 42 scenarios in the original release (16 core plus 26 targeted), 21 of which had not previously been used in mainstream LM evaluation.[^2][^12]

Each scenario is implemented as a Python class that produces standardized in-context-learning prompts and pairs them with reference outputs or scoring functions; the same scenario object is used to evaluate every model.[^1][^11]

## What did the initial 2022 evaluation find?

The launch evaluation in the HELM paper covered 30 prominent language models from 12 organizations: AI21 Labs, Anthropic, BigScience, Cohere, EleutherAI, Google, Meta, Microsoft, NVIDIA, OpenAI, Tsinghua University, and Yandex.[^3] Notable systems included OpenAI [GPT-3](/wiki/gpt-3) and InstructGPT variants, [BLOOM](/wiki/bloom), Anthropic-LM, and Meta's OPT.[^3] In aggregate the team conducted more than 4,900 evaluations spanning roughly 12 billion tokens of inference and 17 million model API calls, raising scenario coverage of these models from an average 17.9% to 96.0%.[^3][^11] The compute bill was substantial and itemized by the authors: about $38,000 in commercial API spend plus nearly 20,000 GPU-hours for the open models.[^3] The paper distilled the results into 25 top-level findings, including the observation that no single model dominated across all metrics: even the strongest accuracy models had measurable bias and calibration deficits, and efficient smaller models were sometimes competitive on individual tasks.[^11]

A second observation from the launch was the tradeoff between instruction-tuned and base models. Instruction-tuned variants, such as the InstructGPT family, dominated open-ended generation scenarios but did not necessarily improve on multiple-choice accuracy. The HELM team also reported sizeable gaps between open and closed models on accuracy, with proprietary systems such as OpenAI's `text-davinci-002` clearly outperforming the strongest open systems available at the time, while open systems were sometimes competitive on individual scenarios such as sentiment analysis and short-form question answering.[^11] HELM's structured presentation of these tradeoffs was widely cited in subsequent work and quickly adopted as a reference design for downstream evaluations.[^11][^3]

## How does HELM Classic differ from HELM Lite?

By late 2023, HELM Classic had grown unwieldy: full evaluation of a new model required running all 42 scenarios with three random seeds and many perturbations, which was expensive both in API spend and in compute. On 19 December 2023, CRFM published **HELM Lite v1.0.0**, a deliberately stripped-down version focused on capabilities rather than the full multi-metric matrix.[^5]

HELM Lite simplifies HELM Classic in four ways:[^5]

* Uses one random seed instead of three over choices of in-context examples.
* Drops perturbation-based robustness and fairness measurements, on the grounds that they were strongly correlated with raw accuracy in HELM Classic.
* Removes the calibration metric, since several major LLM APIs (notably Anthropic and Google) had stopped exposing token log-probabilities, making it inapplicable.
* Drops perplexity and the information-retrieval scenarios, citing computational expense and decreasing relevance.

The HELM Lite scenario set contains nine benchmarks emphasizing generation rather than multiple choice: NarrativeQA, NaturalQuestions, OpenbookQA, a five-subject subset of [MMLU](/wiki/mmlu), the [MATH](/wiki/math_benchmark) and [GSM8K](/wiki/gsm8k) math benchmarks, a five-task subset of [LegalBench](/wiki/legalbench), [MedQA](/wiki/medqa), and WMT-14 machine translation across five language pairs.[^5] The launch evaluation ranked 28 model variants from 11 organizations by mean win rate, with GPT-4 on top overall; smaller models such as Writer's Palmyra-X and 01.AI's Yi-34B were noted as unexpectedly strong, and on the NarrativeQA scenario Yi-34B outperformed [GPT-4](/wiki/gpt-4).[^5] Safety evaluations, the CRFM team noted, were deliberately handled outside HELM Lite via a separate partnership with MLCommons' AI safety working group.[^5]

## HELM Instruct

On 18 February 2024 CRFM published **HELM Instruct**, an instruction-following evaluation framework with absolute (rather than pairwise) ratings, authored by Yian Zhang, Yifan Mai, Josselin Somerville Roberts, Rishi Bommasani, Yann Dubois, and Percy Liang.[^13] HELM Instruct argues that existing instruction-following evaluations were either reference-based (which assumes one correct answer) or relative (which ranks models against each other without measuring distance from perfection). It proposes three principles for instruction-following evaluation: it should be *open-ended* (admitting many valid outputs), *multidimensional* (graded on multiple axes), and *absolute* (using a 1 to 5 scale).[^13]

The framework combines:[^13]

* **7 scenarios** drawn from prompt collections including HH-RLHF, Koala Eval, Vicuna Eval, OASST1, Self-Instruct, and curated ChatGPT prompts, capped at 100 examples each.
* **4 candidate models** in the launch: GPT-4 (0314), GPT-3.5 Turbo (0613), Anthropic Claude v1.3, and Cohere-Command-XLarge-Beta.
* **5 criteria** per response: helpfulness, understandability, completeness, conciseness, and harmlessness.
* **4 evaluators**: 16 vetted Amazon Mechanical Turk workers, the Scale AI rating platform, [GPT-4](/wiki/gpt-4) (0314), and [Anthropic Claude](/wiki/claude) v1.3 acting as LM judges.

Reported findings included that GPT-4 was the best candidate overall, that Claude excelled specifically on understandability and harmlessness, and that LM judges showed Pearson correlations of 0.48 to 0.72 with human raters depending on the criterion, with GPT-4 a closer match to humans than Claude.[^13]
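The judge-versus-human agreement figures are Pearson correlations between two sets of ratings for the same responses. A minimal sketch of that calculation follows; the ratings are made-up 1 to 5 scores for six hypothetical responses, not HELM Instruct data, and the helper name `pearson` is an assumption.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length rating lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Hypothetical ratings on one criterion (for example helpfulness).
human_ratings = [5, 4, 2, 3, 1, 4]
judge_ratings = [4, 4, 3, 3, 2, 5]

r = pearson(human_ratings, judge_ratings)
print(f"judge vs. human correlation: {r:.2f}")
```

In practice the correlation would be computed separately for each criterion, which is how a range of values across criteria arises.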

## HELM MMLU

CRFM published **HELM MMLU** on 1 May 2024 as a leaderboard dedicated to the Massive Multitask Language Understanding ([MMLU](/wiki/mmlu)) test.[^7] The motivation was that despite MMLU's prominence, scores reported by different vendors used inconsistent prompts, formats, and answer-extraction heuristics, making cross-model comparison unreliable.[^7]

HELM MMLU re-evaluates models on all 57 MMLU subjects using a single "Multiple Choice Joint" adaptation method that instructs the model to output a single letter (A, B, C, or D). Two model-specific concessions are documented: Claude 2 is queried through Anthropic's Human/Assistant format because the API rejects other prompt shapes, and Claude 3 is given the explicit instruction "Answer with only a single letter" to suppress chain-of-thought style responses.[^7] The launch evaluated 26 models, including Claude Instant through Claude 3 Opus, Gemini 1.0 Pro, GPT-4, Llama 2 and Llama 3 variants, Mistral and Mixtral, Gemma, PaLM 2, and Qwen.[^7] Headline finding: HELM's standardized MMLU scores diverged from vendor-reported scores by up to five percentage points, and almost always in the direction of lower HELM scores, consistent with vendors having used more favorable prompting at evaluation time.[^7]
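A rough sketch of a joint multiple-choice format with single-letter scoring is shown below. The template, function names, and the strict first-character extraction rule are assumptions for illustration and are not taken from the HELM codebase; only the instruction string is quoted from the text above.

```python
LETTERS = "ABCD"

def format_multiple_choice(question, options, instruction=""):
    """Render a question and all options in one prompt (hypothetical template)."""
    lines = [instruction] if instruction else []
    lines.append(f"Question: {question}")
    lines += [f"{letter}. {text}" for letter, text in zip(LETTERS, options)]
    lines.append("Answer:")
    return "\n".join(lines)

def score_single_letter(completion, correct_letter):
    """Exact match on the first non-whitespace character (assumed rule)."""
    answer = completion.strip()[:1].upper()
    return 1.0 if answer == correct_letter else 0.0

prompt = format_multiple_choice(
    "Which planet is closest to the Sun?",
    ["Venus", "Mercury", "Mars", "Earth"],
    instruction="Answer with only a single letter.",
)
print(prompt)

# A bare letter is scored as correct; a verbose response is not, which is
# why a model that tends to explain itself needs the extra instruction.
terse_score = score_single_letter(" B", "B")
verbose_score = score_single_letter("The answer is B", "B")
```

Different extraction heuristics (for instance, searching the whole completion for a letter) would score the verbose response differently, which is one source of the discrepancies between independently reported MMLU numbers.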

## HELM Safety

On 8 November 2024, CRFM released **HELM Safety v1.0**, a standardized safety leaderboard built on the HELM framework.[^6] The launch argued that, although capability benchmarks had converged on a small canonical set, safety evaluation was fragmented: of 102 published safety benchmarks reviewed, only 12 had been used to evaluate any state-of-the-art model, and external evaluations rarely disclosed prompts and outputs.[^6]

HELM Safety v1.0 packages five existing safety benchmarks under one harness and runs them on 24 prominent LLMs from Anthropic, OpenAI, Google, Meta, Alibaba, Cohere, Databricks, DeepSeek, and Mistral.[^6] The five constituent benchmarks span six risk categories (violence, fraud, discrimination, sexual content, harassment, deception):[^6]

| Benchmark | Coverage |
|---|---|
| [BBQ](/wiki/bbq_benchmark) | 58,492 bias-benchmark multiple-choice questions on social discrimination.[^6] |
| SimpleSafetyTests | 100 unsafe prompts covering sexual content and violence.[^6] |
| [HarmBench](/wiki/harmbench) | 321 red-team prompts on deception, fraud, violence and harassment, scored with automated graders.[^6] |
| AnthropicRedTeam | 38,961 red-team attack transcripts across multiple harm categories.[^6] |
| XSTest | 450 prompts designed to surface the helpfulness vs harmlessness tradeoff.[^6] |

Claude 3.5 Sonnet (June 2024 release) ranked first overall, with particular strength on HarmBench.[^6] The launch also documented a methodological problem: when using LLMs as safety graders, [Claude 3.5 Sonnet](/wiki/claude_3_5_sonnet) refused to grade harmful outputs at rates approaching 27%, versus near-zero refusal from GPT-4o; the CRFM team interpreted this as miscalibrated refusal behavior that undermines LM-as-judge safety evaluation.[^6] Under adversarial red-teaming methods, model scores also fell by roughly 26% on average, with some models degrading by 55%.[^6] HELM Safety has since absorbed additional benchmarks, including **AIR-Bench 2024** (Stanford CRFM, 2024), which derives 314 fine-grained risk categories from 8 government regulations and 16 company policies and contributes 5,694 prompts evaluated through HELM.[^14]

## HELM Capabilities

On 20 March 2025, CRFM released **HELM Capabilities v1.0.0**, the current general-capability successor to HELM Lite.[^15] HELM Capabilities is organized around five competencies, each instantiated by a single dataset:

* **General knowledge:** [MMLU-Pro](/wiki/mmlu-pro), 1,000 instances.[^15]
* **Reasoning:** [GPQA](/wiki/gpqa), 448 instances of graduate-level science problems.[^15]
* **Instruction following:** [IFEval](/wiki/ifeval), 541 instances with verifiable instruction constraints.[^15]
* **Dialogue:** [WildBench](/wiki/wildbench), 1,000 in-the-wild chat instances.[^15]
* **Mathematical reasoning:** Omni-MATH, 1,000 Olympiad-level problems.[^15]

The launch evaluated 22 models from OpenAI, Anthropic, Google, Meta, Mistral, Qwen, DeepSeek, and Amazon, among others. HELM Capabilities differs from HELM Lite chiefly in aggregation: it uses mean per-scenario score rather than mean win rate, a choice CRFM made to reduce sensitivity of model rankings to the composition of the model set being compared.[^15]
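The difference between the two aggregation methods can be seen in a small sketch. The scores and model names below are invented, and the win-rate definition used here (fraction of other models strictly beaten on each scenario, averaged over scenarios) is an assumption about the general idea rather than HELM's exact implementation.

```python
def mean_score(scores, model):
    """Average of a model's per-scenario scores."""
    vals = scores[model]
    return sum(vals.values()) / len(vals)

def mean_win_rate(scores, model):
    """Per scenario, the fraction of other models strictly beaten; then averaged."""
    others = [m for m in scores if m != model]
    rates = []
    for scenario, value in scores[model].items():
        wins = sum(1 for m in others if value > scores[m][scenario])
        rates.append(wins / len(others))
    return sum(rates) / len(rates)

# Hypothetical scores for three models on two scenarios.
scores = {
    "model_a": {"s1": 0.80, "s2": 0.60},
    "model_b": {"s1": 0.70, "s2": 0.65},
    "model_c": {"s1": 0.20, "s2": 0.10},
}
# Adding a fourth model changes win rates but not anyone's mean score.
scores_with_d = {**scores, "model_d": {"s1": 0.75, "s2": 0.05}}

for name, table in [("3 models", scores), ("4 models", scores_with_d)]:
    for m in ("model_a", "model_b"):
        print(name, m,
              f"win rate {mean_win_rate(table, m):.3f}",
              f"mean score {mean_score(table, m):.3f}")
```

With three models, `model_a` and `model_b` tie on mean win rate; adding `model_d` breaks the tie even though neither model's own scores changed. Their mean scores are identical in both tables, which is the stability property the mean-score aggregation is meant to provide.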

## HELM Long Context

On 29 September 2025, CRFM published the **HELM Long Context** leaderboard, which the team describes as providing "transparent, comparable and reproducible evaluations of long context capabilities of recent models."[^23] It draws five tasks from three existing long-context benchmarks: RULER SQuAD and RULER HotPotQA, the InfiniteBench (∞Bench) En.MC multiple-choice and En.Sum summarization tasks, and OpenAI-MRCR multi-round coreference resolution.[^23] The launch evaluated 10 models from five organizations (Amazon, Google, Meta, OpenAI, and Writer). OpenAI's GPT-4.1 obtained the highest mean score of 0.588 and topped both this leaderboard and HELM Capabilities, with a Spearman rank correlation of 0.90 between the two rankings; even so, the best MRCR score was only 0.256, which the authors flagged as substantial headroom on a task they call computationally simple.[^23]

## Vision, image and domain extensions

The HELM framework has been extended to non-text modalities and to specialized domains, all under the same repository (`stanford-crfm/helm`).[^1]

* **HEIM (Holistic Evaluation of Text-to-Image Models)** was published at NeurIPS 2023 (Datasets and Benchmarks) and evaluates 26 text-to-image systems on 62 scenarios across 12 aspects, including text-image alignment, image quality, aesthetics, originality, reasoning, knowledge, bias, toxicity, fairness, robustness, multilinguality, and efficiency. HEIM reports that no single model dominates across aspects.[^16]
* **VHELM (Holistic Evaluation of Vision-Language Models)** appeared at NeurIPS 2024 and extends the HELM design to VLMs. It aggregates 21 datasets covering nine aspects: visual perception, knowledge, reasoning, bias, fairness, multilinguality, robustness, toxicity, and safety. The launch evaluated 22 VLMs and showed, among other things, that efficiency-focused versions (such as [Claude](/wiki/claude) 3 Haiku or [Gemini](/wiki/gemini) 1.5 Flash) tend to underperform their full counterparts specifically on bias benchmarks.[^17]
* **MedHELM** was published in Nature Medicine in 2025 and applies the HELM framework to medical tasks. It uses a clinician-validated taxonomy that organizes medical AI into five clinical categories (clinical decision support, clinical note generation, patient communication, medical research, and administration) covering 22 subcategories and 121 tasks, with 37 benchmark evaluations. The launch reported that Claude 3.5 Sonnet attained performance comparable to top frontier models at lower estimated cost.[^8]
* Additional verticals listed on the HELM leaderboard include an **Audio HELM** (Holistic Evaluation of Audio-Language Models), an **Enterprise Benchmarks** suite (including IBM's HELM-Enterprise extension for finance, legal, climate, and cybersecurity), and **ToRR (Table Reasoning and Robustness)**.[^1][^18]

## How is HELM distributed and run?

HELM is distributed as the `crfm-helm` Python package on PyPI under the Apache 2.0 license, with command-line tools `helm-run`, `helm-summarize`, and `helm-server` for executing benchmarks, summarizing results, and serving a local web UI.[^1] The GitHub repository `stanford-crfm/helm` (about 2,800 stars) has a long release history: the latest release at the time of writing, v0.5.16, was tagged on 30 April 2026, preceded by monthly minor releases through 2024, 2025, and early 2026.[^10] CRFM states that "HELM entered maintenance mode on June 1, 2026", meaning that new feature development winds down while the existing leaderboards remain published.[^1]

Architecturally, the framework separates *scenarios*, *adapters*, *models*, and *metrics* into independently extensible Python interfaces. A scenario produces a list of instances (input, reference output, split tag); an adapter rewrites those instances into model-specific prompts (typically multiple-choice joint format or generation format with in-context examples); the model interface wraps a unified completion or chat API; and a metric consumes both the prompt and the completion to produce a numeric score. The same `helm-run` command that drives HELM Classic also drives the Lite, Capabilities, Safety, Instruct, and MMLU tracks, by swapping the scenario, adapter, and metric set.[^1] CRFM provides connectors for the [Anthropic](/wiki/anthropic) API, OpenAI API, Google Gemini, Cohere, [Hugging Face](/wiki/hugging_face) Hub model endpoints, and a number of local-model runtimes, allowing a single configuration to evaluate both closed-API and open-weight systems under the same prompting protocol.[^1]
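The separation of concerns described above can be sketched in a few lines of Python. Every name here (`Instance`, `toy_scenario`, `generation_adapter`, `always_six_model`, `exact_match`, `run`) is hypothetical and chosen for illustration; none of it is the actual `crfm-helm` interface, and the "model" is a stub that returns a constant string instead of calling an API.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class Instance:
    input: str
    reference: str
    split: str  # "train" or "test"

def toy_scenario() -> List[Instance]:
    """Scenario: produces instances with inputs, references, and split tags."""
    return [
        Instance("2 + 2 =", "4", "train"),
        Instance("3 + 3 =", "6", "test"),
        Instance("5 + 2 =", "7", "test"),
    ]

def generation_adapter(instances: List[Instance]) -> List[Tuple[str, Instance]]:
    """Adapter: turns test instances into prompts with in-context examples."""
    train = [i for i in instances if i.split == "train"]
    prefix = "".join(f"{i.input} {i.reference}\n" for i in train)
    return [(prefix + i.input, i) for i in instances if i.split == "test"]

def always_six_model(prompt: str) -> str:
    """Model: stand-in for a completion API (always answers ' 6')."""
    return " 6"

def exact_match(prompt: str, completion: str, instance: Instance) -> float:
    """Metric: consumes the prompt and completion, returns a numeric score."""
    return 1.0 if completion.strip() == instance.reference else 0.0

def run(scenario, adapter, model, metric) -> float:
    """Wire the four components together and average the metric."""
    requests = adapter(scenario())
    scores = [metric(prompt, model(prompt), inst) for prompt, inst in requests]
    return sum(scores) / len(scores)

accuracy = run(toy_scenario, generation_adapter, always_six_model, exact_match)
print(accuracy)
```

Because `run` only depends on the four callables, swapping in a different scenario, adapter, or metric changes the benchmark without touching the model wrapper, which mirrors how one driver command can serve several leaderboard tracks.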

The HELM live leaderboard at `crfm.stanford.edu/helm` hosts multiple separately-versioned tracks (Classic, Lite, Capabilities, Safety, Instruct, MMLU, Long Context, HEIM, VHELM, MedHELM, Audio, Enterprise) and exposes per-prompt and per-completion records so that raw model outputs can be inspected directly.[^4] Each evaluation run is archived under a versioned URL (for example `helm/safety/v1.8.0`) so that earlier rankings remain inspectable even after the leaderboard moves on.[^6]

## How does HELM compare with other LLM leaderboards?

HELM coexists with several other public LLM evaluation systems, each making different methodological choices.

| Leaderboard | Operator | Scoring approach | Notes |
|---|---|---|---|
| HELM (Classic, Lite, Capabilities, Safety) | Stanford CRFM | Multi-metric (accuracy + 6 others in Classic) over a fixed scenario set; uniform 5-shot prompting; raw outputs published.[^2][^5][^6][^15] | Emphasizes transparency, holistic metrics, and reproducible prompting.[^3] |
| [BIG-Bench](/wiki/big_bench) | Google / community (444 authors, 132 institutions) | Aggregates 204 tasks contributed by the community; primary metric is accuracy with task-specific variants. Published 2022 as "Beyond the Imitation Game".[^19] | Strength is task diversity; weakness is uneven task quality and per-task accuracy focus.[^19] |
| Open LLM Leaderboard (Hugging Face v1, 2023-2024) | Hugging Face | Used six accuracy benchmarks ([MMLU](/wiki/mmlu), [HellaSwag](/wiki/hellaswag), ARC, [TruthfulQA](/wiki/truthfulqa), Winogrande, [GSM8K](/wiki/gsm8k)) run through the EleutherAI LM Evaluation Harness; >7,000 open models evaluated.[^20] | Retired in June 2024 after extensive saturation and contamination concerns on its constituent datasets.[^20] |
| Open LLM Leaderboard v2 (2024 onwards) | Hugging Face | Six newer benchmarks: [IFEval](/wiki/ifeval), BBH, MATH, [GPQA](/wiki/gpqa), MuSR, [MMLU-Pro](/wiki/mmlu-pro). Single-metric accuracy-style aggregation.[^20] | Designed to harden v1 against contamination; remains accuracy-focused.[^20] |
| [Chatbot Arena](/wiki/lmsys_chatbot_arena) | LMSYS / LMArena | Crowd-sourced pairwise human preferences; Elo-style ratings.[^21] | Measures perceived conversation quality, complementary to HELM's structured scenario approach.[^21] |

The HELM team has consistently argued that its differentiator is the combination of standardized prompting plus multi-metric measurement, rather than the choice of benchmarks alone: HELM Classic, for example, runs many of the same datasets used by the [EleutherAI](/wiki/eleutherai) LM Evaluation Harness, but pairs them with calibration, robustness, fairness, bias, toxicity, and efficiency scores.[^2][^3][^20]

## Why is HELM significant?

Within two years of its release, HELM had become one of the most cited reference frameworks in LLM evaluation literature, and several of its sub-projects produced widely-quoted findings: HELM MMLU's documentation of vendor over-reporting on [MMLU](/wiki/mmlu), HELM Safety's quantification of safety degradation under adversarial prompts, and VHELM's analysis of bias regressions in lightweight VLM variants.[^6][^7][^17] Its leaderboard pages provide one of the few public, prompt-level audit trails for closed-API models, which has been useful for academic analyses of contamination, prompt sensitivity, and reasoning behaviors.[^4][^11] HELM has also served as a template for domain-specific frameworks beyond CRFM: IBM's `helm-enterprise-benchmark` reuses HELM's infrastructure to evaluate LLMs on enterprise domain datasets in finance, legal, climate, and cybersecurity.[^18]

## Limitations and criticisms

HELM's authors and external commentators have identified several limitations.

* **Coverage of languages and modalities.** HELM Classic's scenarios are predominantly English-language, although several English varieties (including African-American English) are used in the fairness perturbations.[^2][^11] HEIM, VHELM, MedHELM, and Audio HELM extend coverage to other modalities but remain largely English at launch.[^16][^17][^8]
* **Cost.** The HELM Lite announcement explicitly acknowledged that running HELM Classic on a new closed-API model was expensive and slow, motivating the Lite redesign; the original 2022 run cost about $38,000 in API spend plus nearly 20,000 GPU-hours.[^5][^3]
* **Sensitivity to prompting.** Even with HELM's uniform prompting protocol, model rankings on [MMLU](/wiki/mmlu) in particular are sensitive to prompt format, in-context example choice, and answer-extraction heuristics. HELM MMLU documents up to five-percentage-point discrepancies between its standardized re-evaluations and vendor-reported numbers.[^7] Independent research has also shown that small "cheating" models can game open leaderboards including HELM by training on leaked or near-duplicate evaluation data.[^22]
* **Saturation and contamination.** Several of HELM's original scenarios (notably [MMLU](/wiki/mmlu), [HellaSwag](/wiki/hellaswag) and [TruthfulQA](/wiki/truthfulqa)) have been shown to be partially memorized by frontier models trained on large web crawls, a concern that motivated the Hugging Face Open LLM Leaderboard v2 transition in 2024.[^20][^22] HELM Capabilities responded by switching toward harder, more recent benchmarks ([MMLU-Pro](/wiki/mmlu-pro), [GPQA](/wiki/gpqa), Omni-MATH).[^15]
* **LM-as-judge reliability.** HELM Instruct and HELM Safety both rely partially on LLM judges, but HELM Safety itself documented that Claude 3.5 Sonnet refused to grade roughly a quarter of harmful outputs, illustrating that even the framework's own internal use of LM judges has open methodological problems.[^6][^13]
* **Maintenance mode.** CRFM has indicated that the HELM framework entered maintenance mode on 1 June 2026, raising open questions about long-term updates and adoption.[^1]

## Related work

* [MMLU](/wiki/mmlu) is the single most prominent constituent benchmark inside HELM and the subject of its own HELM track.[^7]
* [MMLU-Pro](/wiki/mmlu-pro), [GPQA](/wiki/gpqa), [IFEval](/wiki/ifeval), [WildBench](/wiki/wildbench), and Omni-MATH are the five core capability benchmarks in HELM Capabilities v1.0.0.[^15]
* [BIG-Bench](/wiki/big_bench) (with [BIG-Bench Hard](/wiki/big-bench-hard) as a curated subset) is a contemporaneous community benchmark with which HELM is frequently compared.[^19]
* [BBQ](/wiki/bbq_benchmark), [HarmBench](/wiki/harmbench), and the [WMDP benchmark](/wiki/wmdp) are safety-oriented benchmarks; the first two are integrated directly into HELM Safety.[^6]
* [Chatbot Arena](/wiki/lmsys_chatbot_arena) is the leading preference-based LLM leaderboard, complementing HELM's reference-based evaluation.[^21]
* [MTEB](/wiki/mteb) and [GLUE](/wiki/glue_benchmark) are major non-HELM benchmark suites for embeddings and language understanding respectively.

## See also

* [Benchmark (AI)](/wiki/benchmark)
* [Large language model](/wiki/large_language_model)
* [Percy Liang](/wiki/percy_liang)
* [Christopher Manning](/wiki/christopher_manning)
* [Stanford HAI](/wiki/stanford_hai)
* [Foundation models](/wiki/foundation_models)
* [MMLU](/wiki/mmlu)
* [MMLU-Pro](/wiki/mmlu-pro)
* [GPQA](/wiki/gpqa)
* [IFEval](/wiki/ifeval)
* [WildBench](/wiki/wildbench)
* [HellaSwag](/wiki/hellaswag)
* [TruthfulQA](/wiki/truthfulqa)
* [GSM8K](/wiki/gsm8k)
* [MATH (benchmark)](/wiki/math_benchmark)
* [LegalBench](/wiki/legalbench)
* [MedQA](/wiki/medqa)
* [BBQ](/wiki/bbq_benchmark)
* [HarmBench](/wiki/harmbench)
* [WMDP benchmark](/wiki/wmdp)
* [BIG-Bench](/wiki/big_bench)
* [BIG-Bench Hard](/wiki/big-bench-hard)
* [Chatbot Arena](/wiki/lmsys_chatbot_arena)
* [EleutherAI](/wiki/eleutherai)
* [Hugging Face](/wiki/hugging_face)
* [Calibration (machine learning)](/wiki/calibration)
* [GPT-3](/wiki/gpt-3)
* [GPT-4](/wiki/gpt-4)
* [Claude](/wiki/claude)
* [Gemini](/wiki/gemini)
* [BLOOM](/wiki/bloom)
* [LLM Benchmarks Timeline](/wiki/llm_benchmarks_timeline)

## References

[^1]: Stanford CRFM, "stanford-crfm/helm: Holistic Evaluation of Language Models", GitHub repository (Apache-2.0 license, ~2.8k stars; maintenance-mode notice effective 2026-06-01), 2022-2026. https://github.com/stanford-crfm/helm. Accessed 2026-06-24.
[^2]: Percy Liang, Rishi Bommasani, Tony Lee, Dimitris Tsipras, Dilara Soylu, Michihiro Yasunaga, Yian Zhang, Deepak Narayanan, Yuhuai Wu, Ananya Kumar et al., "Holistic Evaluation of Language Models", arXiv:2211.09110, 2022-11-16 (v1) and 2023-10-01 (v2). https://arxiv.org/abs/2211.09110. Accessed 2026-06-24.
[^3]: Stanford CRFM, "Holistic Evaluation of Language Models (HELM)", CRFM blog, 2022-11-17. https://crfm.stanford.edu/2022/11/17/helm.html. Accessed 2026-06-24.
[^4]: Stanford CRFM, "Holistic Evaluation of Language Models (HELM) leaderboard", crfm.stanford.edu, 2022-2026. https://crfm.stanford.edu/helm/. Accessed 2026-06-24.
[^5]: Stanford CRFM, "HELM Lite: Lightweight and Broad Capabilities Evaluation", CRFM blog, 2023-12-19. https://crfm.stanford.edu/2023/12/19/helm-lite.html. Accessed 2026-06-24.
[^6]: Stanford CRFM, "HELM Safety v1.0", CRFM blog, 2024-11-08. https://crfm.stanford.edu/2024/11/08/helm-safety.html. Accessed 2026-06-24.
[^7]: Stanford CRFM, "HELM MMLU: Massive Multitask Language Understanding", CRFM blog, 2024-05-01. https://crfm.stanford.edu/2024/05/01/helm-mmlu.html. Accessed 2026-06-24.
[^8]: Stanford CRFM and Stanford Medicine, "Holistic evaluation of large language models for medical tasks with MedHELM", Nature Medicine, 2025. https://www.nature.com/articles/s41591-025-04151-2. Accessed 2026-06-24.
[^9]: Percy Liang et al., "Holistic Evaluation of Language Models", Transactions on Machine Learning Research (TMLR), published 2023-08. https://jmlr.org/tmlr/. Accessed 2026-06-24.
[^10]: Stanford CRFM, "Releases - stanford-crfm/helm" and "crfm-helm" on PyPI, latest release v0.5.16 dated 2026-04-30. https://github.com/stanford-crfm/helm/releases and https://pypi.org/project/crfm-helm/. Accessed 2026-06-24.
[^11]: Stanford CRFM, "HELM Classic methodology overview (paper abstract and findings)", arXiv:2211.09110 abstract page, 2022-11-16. https://arxiv.org/abs/2211.09110. Accessed 2026-06-24.
[^12]: Stanford CRFM, "Scenarios", CRFM HELM Read-the-Docs documentation, 2024. https://crfm-helm.readthedocs.io/en/latest/scenarios/. Accessed 2026-06-24.
[^13]: Stanford CRFM (Yian Zhang, Yifan Mai, Josselin Somerville Roberts, Rishi Bommasani, Yann Dubois, Percy Liang), "HELM Instruct: A Multidimensional Instruction Following Evaluation Framework with Absolute Ratings", CRFM blog, 2024-02-18. https://crfm.stanford.edu/2024/02/18/helm-instruct.html. Accessed 2026-06-24.
[^14]: Yi Zeng et al., "AIR-Bench 2024: A Safety Benchmark Based on Risk Categories from Regulations and Policies", arXiv:2407.17436, 2024. https://arxiv.org/abs/2407.17436. Accessed 2026-06-24.
[^15]: Stanford CRFM, "HELM Capabilities v1.0.0", CRFM blog, 2025-03-20. https://crfm.stanford.edu/2025/03/20/helm-capabilities.html. Accessed 2026-06-24.
[^16]: Tony Lee et al., "Holistic Evaluation of Text-to-Image Models", arXiv:2311.04287; NeurIPS 2023 Datasets and Benchmarks Track, 2023-11-07. https://arxiv.org/abs/2311.04287. Accessed 2026-06-24.
[^17]: Tony Lee et al., "VHELM: A Holistic Evaluation of Vision Language Models", arXiv:2410.07112; NeurIPS 2024 Datasets and Benchmarks Track, 2024. https://arxiv.org/abs/2410.07112. Accessed 2026-06-24.
[^18]: IBM Research, "IBM/helm-enterprise-benchmark", GitHub repository, 2024-2025. https://github.com/IBM/helm-enterprise-benchmark. Accessed 2026-06-24.
[^19]: BIG-bench Collaboration (Aarohi Srivastava et al.), "Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models", arXiv:2206.04615, 2022. https://arxiv.org/abs/2206.04615. Accessed 2026-06-24.
[^20]: Hugging Face, "Open LLM Leaderboard v1 (archive) and v2 documentation", huggingface.co, 2023-2024. https://huggingface.co/docs/leaderboards/en/open_llm_leaderboard/archive. Accessed 2026-06-24.
[^21]: LMSYS / LMArena, "Chatbot Arena: Benchmarking LLMs in the Wild with Elo Ratings", lmarena.ai blog and Hugging Face Space, 2023-2025. https://huggingface.co/spaces/lmarena-ai/chatbot-arena-leaderboard. Accessed 2026-06-24.
[^22]: Norah Alzahrani et al., "When Benchmarks are Targets: Revealing the Sensitivity of Large Language Model Leaderboards", arXiv:2402.01781, 2024. https://arxiv.org/abs/2402.01781. Accessed 2026-06-24.
[^23]: Stanford CRFM, "HELM Long Context", CRFM blog, 2025-09-29. https://crfm.stanford.edu/2025/09/29/helm-long-context.html. Accessed 2026-06-24.

