HELM (Holistic Evaluation of Language Models)

Updated Jun 24, 2026

HELM (Holistic Evaluation of Language Models) is an open-source benchmark framework created by the Center for Research on Foundation Models (CRFM) at Stanford University for the reproducible and transparent evaluation of large language models and other foundation models.^[1] First released in November 2022 alongside the arXiv preprint "Holistic Evaluation of Language Models" by Percy Liang, Rishi Bommasani, Tony Lee and 47 co-authors, HELM evaluates models across many scenarios on seven categories of metrics at once: accuracy, calibration, robustness, fairness, bias, toxicity, and efficiency, all under a single uniform prompting protocol so that any two models are compared under the same conditions.^[2]^[3] The authors state the goal plainly: "We present Holistic Evaluation of Language Models (HELM) to improve the transparency of language models."^[2] The launch evaluation ran 30 models from 12 organizations across 42 scenarios, conducting more than 4,900 evaluations over roughly 12 billion tokens and 17 million model API calls at a cost of about $38,000 for commercial APIs plus nearly 20,000 GPU-hours for open models.^[3]^[11] Since the original release ("HELM Classic") the framework has grown into a family of leaderboards, including HELM Lite, HELM Instruct, HELM MMLU, HELM Safety, HELM Capabilities, HELM Long Context, VHELM for vision-language models, HEIM for text-to-image models, and MedHELM for medical tasks, all hosted at crfm.stanford.edu/helm.^[1]^[4]^[5]^[6]^[7]^[8]^[23]

Infobox

Item	Value
Full name	Holistic Evaluation of Language Models
Type	Open-source LLM benchmark framework and living leaderboard
Creator	Stanford Center for Research on Foundation Models (CRFM)
Lead authors	Percy Liang, Rishi Bommasani, Tony Lee, Dimitris Tsipras (et al., 50 total)^[2]
First arXiv release	16 November 2022 (arXiv:2211.09110)^[2]
TMLR publication	August 2023^[9]
Original evaluation	30 models, 12 organizations, 42 scenarios, 7 metrics^[3]
GitHub	`stanford-crfm/helm` (Apache-2.0 license, ~2.8k stars)^[1]
Leaderboard	`crfm.stanford.edu/helm`^[4]
Latest framework release	v0.5.16 (30 April 2026) on GitHub and PyPI^[10]
Maintenance mode	Entered 1 June 2026^[1]

What problem does HELM solve?

By late 2022, large language models such as GPT-3, BLOOM, and Anthropic's earliest production model had proliferated, but their evaluation was fragmented: model creators reported scores on different subsets of benchmarks, often with non-comparable prompting conventions, and risk-oriented metrics like toxicity or bias were rarely reported alongside accuracy.^[2]^[3] In the HELM paper Liang and colleagues observed that, before their work, on average prominent models had been evaluated on only 17.9% of HELM's core scenarios, leaving substantial gaps in the public record of model capabilities.^[11] As the CRFM team put it in the launch announcement, "As language models become the substrate for language technologies, the absence of an evaluation standard compromises the community's ability to see the full landscape of language models," and "Transparency is the vital first step" toward trust and standards.^[3]

HELM was developed at CRFM, the foundation-models center launched within Stanford HAI in 2021 with Percy Liang as faculty director; the paper's 50 listed authors include Rishi Bommasani, Tony Lee, and Christopher Manning.^[2] The paper was first posted to arXiv on 16 November 2022 and announced the next day on the CRFM blog. It was published in Transactions on Machine Learning Research (TMLR) in August 2023, and a substantially revised v2 was posted to arXiv on 1 October 2023.^[2]^[3]^[9]

The framework was deliberately positioned as a "living benchmark": the authors released raw prompts, completions, and a modular Python toolkit so that researchers could add new scenarios, new metrics, or new models and re-run the evaluation themselves, with results aggregated into a public leaderboard at crfm.stanford.edu/helm.^[11]^[4] The blog post framed the long-term ambition directly: "We intend for HELM to serve as a map for the world of language models, continually updated over time, through collaboration with the broader community."^[3]

What is HELM's design philosophy?

HELM's central claim is that evaluation of language models should be holistic. The paper distills this into three principles.^[2]^[3]

Broad coverage with explicit recognition of gaps. Rather than pick a single benchmark, HELM organizes evaluation around a taxonomy of scenarios (use cases, domains, languages, demographic groups) and a taxonomy of metrics. The taxonomy makes it possible to enumerate not only what is evaluated but also what is missing.^[2]
Multi-metric measurement. For each scenario, HELM tries to measure seven categories of metrics in the same context, rather than relegating risks such as bias or toxicity to separate, accuracy-only studies.^[2]^[3]
Standardization. Every model is run on every scenario through a single Python codebase with uniform prompting (typically 5-shot in-context learning with fixed templates), so that comparisons are not confounded by prompting differences.^[2]^[11]

This last point distinguishes HELM from the way many model cards report numbers. The HELM MMLU effort, for example, found that scores reported by model providers on MMLU often differed from HELM's standardized re-evaluations by as much as five percentage points, and that reported numbers were frequently higher than HELM's, suggesting advantageous prompting in vendor reports.^[7]
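The standardization principle can be illustrated with a minimal sketch of uniform few-shot prompting. The `Example` class, the `TEMPLATE` string, and the `build_prompt` function are hypothetical names introduced here for illustration; HELM's actual templates and example-selection logic differ by scenario and are not reproduced.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Example:
    question: str
    answer: str


# Hypothetical fixed template (an assumption, not HELM's real template).
TEMPLATE = "Question: {question}\nAnswer: {answer}"


def build_prompt(train: list[Example], query: str, k: int = 5, seed: int = 0) -> str:
    """Return a k-shot prompt that is identical for every model under test."""
    rng = random.Random(seed)      # fixed seed, so the same examples are drawn every run
    shots = rng.sample(train, k)   # k in-context examples, drawn without replacement
    blocks = [TEMPLATE.format(question=s.question, answer=s.answer) for s in shots]
    # The final block leaves the answer empty for the model to complete.
    blocks.append(TEMPLATE.format(question=query, answer="").rstrip())
    return "\n\n".join(blocks)


train_pool = [Example(f"What is {i} + {i}?", str(2 * i)) for i in range(1, 11)]
prompt_for_model_a = build_prompt(train_pool, "What is 12 + 12?")
prompt_for_model_b = build_prompt(train_pool, "What is 12 + 12?")
```

Because the template, the seed, and the number of shots are fixed, both models receive byte-identical prompts, which is the property that keeps score differences from being attributable to prompting.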

What are the seven HELM metrics?

HELM Classic defines seven categories of metrics, each applied wherever feasible to each core scenario.^[2]^[3]

Metric	What it measures
Accuracy	Standard task-specific quality (exact match, F1, ROUGE, etc.) on the scenario's reference answers.^[2]
Calibration	How well a model's expressed confidence matches its empirical correctness; computed where token-level log-probabilities are available.^[2]
Robustness	Performance under typo-style and equivalence-preserving perturbations of the input, and under invariance-style transformations meant to mimic real-world noise.^[2]
Fairness	Performance shifts when demographic features (names, dialects) in the input are changed, including comparisons across African-American English dialects and counterfactual demographic swaps.^[2]
Bias	Demographic representation in model outputs (e.g., gender or race associations in generation), measured independently of correctness.^[2]
Toxicity	Rate of harmful or insulting generations, scored automatically using a toxicity classifier on free-form outputs.^[2]
Efficiency	Wall-clock and idealized inference cost, allowing comparison of accuracy against compute or latency budgets.^[2]

Because not every metric is well-defined for every scenario (calibration, for example, requires API access to token probabilities), HELM Classic reports that it achieves coverage of roughly 87.5% across the metric-by-scenario grid.^[2]^[11]
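The calibration row in the table above can be made concrete with expected calibration error (ECE), one common way of comparing stated confidence with empirical correctness. The sketch below assumes equal-width bins and a hypothetical function name; the exact estimator and bin count used by HELM are not specified here and should be checked against the paper.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Equal-width-bin ECE: weighted mean of |accuracy - mean confidence| per bin."""
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # A confidence of exactly 1.0 is placed in the last bin.
        index = min(int(conf * n_bins), n_bins - 1)
        bins[index].append((conf, ok))
    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue  # empty bins contribute nothing
        mean_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / total) * abs(accuracy - mean_conf)
    return ece


# Toy data (made up): a model that reports 0.9 confidence but is right half the time,
# and one that reports 0.5 confidence and is right half the time.
outcomes = [True, False, True, False]
overconfident = expected_calibration_error([0.9, 0.9, 0.9, 0.9], outcomes)
well_calibrated = expected_calibration_error([0.5, 0.5, 0.5, 0.5], outcomes)
```

The confidences would typically be derived from token-level log-probabilities, which is why the metric is unavailable for APIs that do not expose them.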

What are the 16 core scenarios?

HELM Classic defines 16 core scenarios that span six user-facing task families: question answering, information retrieval, summarization, sentiment analysis, toxicity detection, and miscellaneous text classification.^[11] Concrete datasets used as core scenarios include MMLU, BoolQ, NarrativeQA, NaturalQuestions, QuAC, HellaSwag, OpenbookQA, TruthfulQA, MS MARCO, CNN/DailyMail, XSum, IMDB, CivilComments, and RAFT, among others.^[12] On top of the core, the paper adds seven targeted evaluations based on 26 additional "targeted" scenarios for skills such as reasoning, knowledge, language modeling, and disinformation generation, for a total of 42 scenarios in the original release, 21 of which had not previously been used in mainstream LM evaluation.^[2]^[12]

Each scenario is implemented as a Python class that produces standardized in-context-learning prompts and pairs them with reference outputs or scoring functions; the same scenario object is used to evaluate every model.^[1]^[11]

What did the initial 2022 evaluation find?

The launch evaluation in the HELM paper covered 30 prominent language models from 12 organizations: AI21 Labs, Anthropic, BigScience, Cohere, EleutherAI, Google, Meta, Microsoft, NVIDIA, OpenAI, Tsinghua University, and Yandex.^[3] Notable systems included OpenAI GPT-3 and InstructGPT variants, BLOOM, Anthropic-LM, and Meta's OPT.^[3] In aggregate the team conducted more than 4,900 evaluations spanning roughly 12 billion tokens of inference and 17 million model API calls, raising scenario coverage of these models from an average 17.9% to 96.0%.^[3]^[11] The compute bill was substantial and itemized by the authors: about $38,000 in commercial API spend plus nearly 20,000 GPU-hours for the open models.^[3] The paper distilled the results into 25 top-level findings, including the observation that no single model dominated across all metrics: even the strongest accuracy models had measurable bias and calibration deficits, and efficient smaller models were sometimes competitive on individual tasks.^[11]

A second observation from the launch was the tradeoff between instruction-tuned and base models. Instruction-tuned variants, such as the InstructGPT family, dominated open-ended generation scenarios but did not necessarily improve on multiple-choice accuracy. The HELM team also reported sizeable gaps between open and closed models on accuracy, with proprietary systems such as OpenAI's text-davinci-002 clearly outperforming the strongest open systems available at the time, while open systems were sometimes competitive on individual scenarios such as sentiment analysis and short-form question answering.^[11] HELM's structured presentation of these tradeoffs was widely cited in subsequent work and quickly adopted as a reference design for downstream evaluations.^[11]^[3]

How does HELM Classic differ from HELM Lite?

By late 2023, HELM Classic had grown unwieldy: full evaluation of a new model required running all 42 scenarios with three random seeds and many perturbations, which was expensive both in API spend and in compute. On 19 December 2023, CRFM published HELM Lite v1.0.0, a deliberately stripped-down version focused on capabilities rather than the full multi-metric matrix.^[5]

HELM Lite simplifies HELM Classic in four ways:^[5]

Uses one random seed instead of three over choices of in-context examples.
Drops perturbation-based robustness and fairness measurements, on the grounds that they were strongly correlated with raw accuracy in HELM Classic.
Removes the calibration metric, since several major LLM APIs (notably Anthropic and Google) had stopped exposing token log-probabilities, making it inapplicable.
Drops perplexity and the information-retrieval scenarios, citing computational expense and decreasing relevance.

The HELM Lite scenario set contains nine benchmarks emphasizing generation rather than multiple choice: NarrativeQA, NaturalQuestions, OpenbookQA, a five-subject subset of MMLU, the MATH and GSM8K math benchmarks, a five-task subset of LegalBench, MedQA, and WMT-14 machine translation across five language pairs.^[5] The launch evaluation ranked 28 model variants from 11 organizations by mean win rate, with GPT-4 on top overall; smaller models such as Writer's Palmyra-X and 01.AI's Yi-34B were noted as unexpectedly strong, and on the NarrativeQA scenario Yi-34B outperformed GPT-4.^[5] Safety evaluations, the CRFM team noted, were deliberately handled outside HELM Lite via a separate partnership with MLCommons' AI safety working group.^[5]

HELM Instruct

On 18 February 2024 CRFM published HELM Instruct, an instruction-following evaluation framework with absolute (rather than pairwise) ratings, authored by Yian Zhang, Yifan Mai, Josselin Somerville Roberts, Rishi Bommasani, Yann Dubois, and Percy Liang.^[13] HELM Instruct argues that existing instruction-following evaluations were either reference-based (which assumes one correct answer) or relative (which ranks models against each other without measuring distance from perfection). It proposes three principles for instruction-following evaluation: it should be open-ended (admitting many valid outputs), multidimensional (graded on multiple axes), and absolute (using a 1 to 5 scale).^[13]

The framework combines:^[13]

7 scenarios drawn from prompt collections including HH-RLHF, Koala Eval, Vicuna Eval, OASST1, Self-Instruct, and curated ChatGPT prompts, capped at 100 examples each.
4 candidate models in the launch: GPT-4 (0314), GPT-3.5 Turbo (0613), Anthropic Claude v1.3, and Cohere-Command-XLarge-Beta.
5 criteria per response: helpfulness, understandability, completeness, conciseness, and harmlessness.
4 evaluators: 16 vetted Amazon Mechanical Turk workers, the Scale AI rating platform, GPT-4 (0314), and Anthropic Claude v1.3 acting as LM judges.

Reported findings included that GPT-4 was the best candidate overall, Claude excelled specifically on understandability and harmlessness, and that LM judges showed Pearson correlations of 0.48 to 0.72 with human raters depending on the criterion, with GPT-4 a closer match to humans than Claude.^[13]
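Agreement between an LM judge and human raters of the kind reported above is a Pearson correlation over paired ratings. The following sketch computes it from first principles; the rating lists are invented toy values on the 1 to 5 scale, not data from HELM Instruct.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length rating lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length with at least two items")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant ratings")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical 1-5 helpfulness ratings for the same six responses.
human_helpfulness = [5, 4, 2, 3, 1, 4]
judge_helpfulness = [4, 4, 3, 3, 2, 5]
r = pearson(human_helpfulness, judge_helpfulness)
```

Computing the coefficient separately per criterion, as the framework does, shows where an LM judge tracks human judgment well and where it does not.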

HELM MMLU

CRFM published HELM MMLU on 1 May 2024 as a leaderboard dedicated to the Massive Multitask Language Understanding (MMLU) test.^[7] The motivation was that despite MMLU's prominence, scores reported by different vendors used inconsistent prompts, formats, and answer-extraction heuristics, making cross-model comparison unreliable.^[7]

HELM MMLU re-evaluates models on all 57 MMLU subjects using a single "Multiple Choice Joint" adaptation method that instructs the model to output a single letter (A, B, C, or D). Two model-specific concessions are documented: Claude 2 is queried through Anthropic's Human/Assistant format because the API rejects other prompt shapes, and Claude 3 is given the explicit instruction "Answer with only a single letter" to suppress chain-of-thought style responses.^[7] The launch evaluated 26 models, including Claude Instant through Claude 3 Opus, Gemini 1.0 Pro, GPT-4, Llama 2 and Llama 3 variants, Mistral and Mixtral, Gemma, PaLM 2, and Qwen.^[7] Headline finding: HELM's standardized MMLU scores diverged from vendor-reported scores by up to five percentage points, and almost always in the direction of lower HELM scores, consistent with vendors having used more favorable prompting at evaluation time.^[7]
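A rough sketch of a joint multiple-choice format with single-letter answer extraction is given below. The prompt layout, the regular expression, and the function names are assumptions made for illustration and are not HELM's actual "Multiple Choice Joint" implementation.

```python
import re
from typing import Optional

LETTERS = "ABCD"


def format_multiple_choice(question: str, options: list[str]) -> str:
    """Render all options jointly in one prompt and end with an answer cue."""
    if len(options) != len(LETTERS):
        raise ValueError("expected exactly four options")
    lines = [f"Question: {question}"]
    lines += [f"{letter}. {text}" for letter, text in zip(LETTERS, options)]
    lines.append("Answer:")
    return "\n".join(lines)


def extract_letter(completion: str) -> Optional[str]:
    """Return the first standalone A-D letter in the completion, if any."""
    match = re.search(r"\b([ABCD])\b", completion.strip())
    return match.group(1) if match else None


def accuracy(completions: list[str], gold: list[str]) -> float:
    """Fraction of completions whose extracted letter equals the gold letter."""
    hits = sum(extract_letter(c) == g for c, g in zip(completions, gold))
    return hits / len(gold)


prompt = format_multiple_choice(
    "Which planet is closest to the Sun?",
    ["Venus", "Mercury", "Earth", "Mars"],
)
score = accuracy([" B", "The answer is (C).", "I am not sure"], ["B", "C", "D"])
```

Even this small extractor embodies a heuristic choice: a verbose completion that happens to begin with the word "A" would be parsed as option A, which is one reason an explicit single-letter instruction and a shared extraction rule matter for comparability.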

HELM Safety

On 8 November 2024, CRFM released HELM Safety v1.0, a standardized safety leaderboard built on the HELM framework.^[6] The launch argued that, although capability benchmarks had converged on a small canonical set, safety evaluation was fragmented: of 102 published safety benchmarks reviewed, only 12 had been used to evaluate any state-of-the-art model, and external evaluations rarely disclosed prompts and outputs.^[6]

HELM Safety v1.0 packages five existing safety benchmarks under one harness and runs them on 24 prominent LLMs from Anthropic, OpenAI, Google, Meta, Alibaba, Cohere, Databricks, DeepSeek, and Mistral.^[6] The five constituent benchmarks span six risk categories (violence, fraud, discrimination, sexual content, harassment, deception):^[6]

Benchmark	Coverage
BBQ	58,492 bias-benchmark multiple-choice questions on social discrimination.^[6]
SimpleSafetyTests	100 unsafe prompts covering sexual content and violence.^[6]
HarmBench	321 red-team prompts on deception, fraud, violence and harassment, scored with automated graders.^[6]
AnthropicRedTeam	38,961 red-team attack transcripts across multiple harm categories.^[6]
XSTest	450 prompts designed to surface the helpfulness vs harmlessness tradeoff.^[6]

Claude 3.5 Sonnet (June 2024 release) ranked first overall, with particular strength on HarmBench.^[6] The launch also documented a methodological problem: when using LLMs as safety graders, Claude 3.5 Sonnet refused to grade harmful outputs at rates approaching 27%, versus near-zero refusal from GPT-4o; the CRFM team interpreted this as miscalibrated refusal behavior that undermines LM-as-judge safety evaluation.^[6] Models also lost roughly 26% on average to adversarial red-teaming methods, with some models degrading by 55% under attack.^[6] HELM Safety has since absorbed additional benchmarks, including AIR-Bench 2024 (Stanford CRFM, 2024), which derives 314 fine-grained risk categories from 8 government regulations and 16 company policies and contributes 5,694 prompts evaluated through HELM.^[14]

HELM Capabilities

On 20 March 2025, CRFM released HELM Capabilities v1.0.0, the current general-capability successor to HELM Lite.^[15] HELM Capabilities is organized around five competencies, each instantiated by a single dataset:

General knowledge: MMLU-Pro, 1,000 instances.^[15]
Reasoning: GPQA, 448 instances of graduate-level science problems.^[15]
Instruction following: IFEval, 541 instances with verifiable instruction constraints.^[15]
Dialogue: WildBench, 1,000 in-the-wild chat instances.^[15]
Mathematical reasoning: Omni-MATH, 1,000 Olympiad-level problems.^[15]

The launch evaluated 22 models from OpenAI, Anthropic, Google, Meta, Mistral, Qwen, DeepSeek, and Amazon, among others. HELM Capabilities differs from HELM Lite chiefly in aggregation: it uses mean per-scenario score rather than mean win rate, a choice CRFM made to reduce sensitivity of model rankings to the composition of the model set being compared.^[15]
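The difference between the two aggregation schemes can be sketched as follows. The scores are invented, the model and scenario names are hypothetical, and the treatment of ties (a tie does not count as a win) is an assumption rather than HELM's documented rule.

```python
def mean_score(scores):
    """Average each model's per-scenario scores."""
    return {model: sum(per.values()) / len(per) for model, per in scores.items()}


def mean_win_rate(scores):
    """For each model, average over scenarios the fraction of other models it beats."""
    models = list(scores)
    if len(models) < 2:
        raise ValueError("win rate needs at least two models")
    scenarios = list(next(iter(scores.values())))
    result = {}
    for model in models:
        others = [o for o in models if o != model]
        rates = []
        for scenario in scenarios:
            # Assumption: a tie is not counted as a win.
            wins = sum(scores[model][scenario] > scores[o][scenario] for o in others)
            rates.append(wins / len(others))
        result[model] = sum(rates) / len(rates)
    return result


two_models = {
    "model_a": {"s1": 0.80, "s2": 0.40},
    "model_b": {"s1": 0.70, "s2": 0.60},
}
# Adding a third model changes win rates but leaves existing mean scores untouched.
three_models = {**two_models, "model_c": {"s1": 0.75, "s2": 0.10}}

win_before, win_after = mean_win_rate(two_models), mean_win_rate(three_models)
avg_before, avg_after = mean_score(two_models), mean_score(three_models)
```

In this toy data, adding model_c lifts model_a's mean win rate from 0.5 to 0.75 and breaks its tie with model_b, while the mean scores of model_a and model_b are unchanged. This is the sensitivity to the composition of the model set that the mean-score aggregation avoids.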

HELM Long Context

On 29 September 2025, CRFM published the HELM Long Context leaderboard, which the team describes as providing "transparent, comparable and reproducible evaluations of long context capabilities of recent models."^[23] It draws five tasks from three existing long-context benchmarks: RULER SQuAD and RULER HotPotQA, the InfiniteBench (∞Bench) En.MC multiple-choice and En.Sum summarization tasks, and OpenAI-MRCR multi-round coreference resolution.^[23] The launch evaluated 10 models from five organizations (Amazon, Google, Meta, OpenAI, and Writer). OpenAI's GPT-4.1 obtained the highest mean score of 0.588 and topped both this leaderboard and HELM Capabilities, with a Spearman rank correlation of 0.90 between the two rankings; even so, the best MRCR score was only 0.256, which the authors flagged as substantial headroom on a task they call computationally simple.^[23]
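A Spearman rank correlation between two leaderboards, like the one quoted above, compares only the ordering of models and not their raw scores. The sketch below uses the classical no-ties formula; the five score pairs are invented and do not correspond to any published HELM results.

```python
def ranks(values):
    """Rank 1 is the highest score; assumes all values are distinct."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result


def spearman(xs, ys):
    """Spearman rho via 1 - 6 * sum(d^2) / (n * (n^2 - 1)); valid only without ties."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two lists of equal length with at least two items")
    if len(set(xs)) != n or len(set(ys)) != n:
        raise ValueError("this simplified formula does not handle ties")
    d_squared = sum((a - b) ** 2 for a, b in zip(ranks(xs), ranks(ys)))
    return 1 - 6 * d_squared / (n * (n * n - 1))


# Hypothetical mean scores for the same five models on two leaderboards.
leaderboard_one = [0.59, 0.52, 0.48, 0.40, 0.31]
leaderboard_two = [0.75, 0.82, 0.60, 0.70, 0.55]
rho = spearman(leaderboard_one, leaderboard_two)
```

A value near 1 indicates that the two leaderboards order the models almost identically, even if the absolute scores on the two leaderboards differ widely.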

Vision, image and domain extensions

The HELM framework has been extended to non-text modalities and to specialized domains, all under the same repository (stanford-crfm/helm).^[1]

HEIM (Holistic Evaluation of Text-to-Image Models) was published at NeurIPS 2023 (Datasets and Benchmarks) and evaluates 26 text-to-image systems on 62 scenarios across 12 aspects, including text-image alignment, image quality, aesthetics, originality, reasoning, knowledge, bias, toxicity, fairness, robustness, multilinguality, and efficiency. HEIM reports that no single model dominates across aspects.^[16]
VHELM (Holistic Evaluation of Vision-Language Models) appeared at NeurIPS 2024 and extends the HELM design to VLMs. It aggregates 21 datasets covering nine aspects: visual perception, knowledge, reasoning, bias, fairness, multilinguality, robustness, toxicity, and safety. The launch evaluated 22 VLMs and showed, among other things, that efficiency-focused versions (such as Claude 3 Haiku or Gemini 1.5 Flash) tend to underperform their full counterparts specifically on bias benchmarks.^[17]
MedHELM was published in Nature Medicine in 2025 and applies the HELM framework to medical tasks. It uses a clinician-validated taxonomy that organizes medical AI into five clinical categories (clinical decision support, clinical note generation, patient communication, medical research, and administration) covering 22 subcategories and 121 tasks, with 37 benchmark evaluations. The launch reported that Claude 3.5 Sonnet attained performance comparable to top frontier models at lower estimated cost.^[8]
Additional verticals listed on the HELM leaderboard include Audio HELM (Holistic Evaluation of Audio-Language Models), an Enterprise Benchmarks suite (including IBM's HELM-Enterprise extension for finance, legal, climate, and cybersecurity), and ToRR (Table Reasoning and Robustness).^[1]^[18]

How is HELM distributed and run?

HELM is distributed as the crfm-helm Python package on PyPI under the Apache 2.0 license, with the command-line tools helm-run, helm-summarize, and helm-server for executing benchmarks, summarizing results, and serving a local web UI, respectively.^[1] The GitHub repository stanford-crfm/helm (about 2,800 stars) carries a long history of release notes; the latest visible release, v0.5.16, was tagged on 30 April 2026, preceded by monthly minor releases through 2024, 2025, and early 2026.^[10] CRFM has indicated that "HELM entered maintenance mode on June 1, 2026", meaning that new feature development winds down while the existing leaderboards remain published.^[1]
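The three command-line tools are typically chained in the order run, summarize, serve. The sketch below only builds and prints the command lines as a dry run and executes nothing. The flags (`--run-entries`, `--suite`, `--max-eval-instances`) and the example run entry follow the project's quick-start as it stood at the time of writing and are assumptions that should be verified against the installed version; `helm_commands` and the suite name are hypothetical.

```python
import shlex


def helm_commands(run_entry: str, suite: str, max_eval_instances: int = 10) -> list[list[str]]:
    """Build argv lists for a small run, its summary, and the local web UI."""
    return [
        # Flags are assumed from the quick-start; check `helm-run --help` before use.
        ["helm-run", "--run-entries", run_entry, "--suite", suite,
         "--max-eval-instances", str(max_eval_instances)],
        ["helm-summarize", "--suite", suite],
        ["helm-server", "--suite", suite],
    ]


commands = helm_commands("mmlu:subject=philosophy,model=openai/gpt2", "my-suite")
for argv in commands:
    print(shlex.join(argv))  # dry run: print the command instead of executing it
```

Keeping the commands as argument lists makes it straightforward to pass them to `subprocess.run` later, once the flags have been confirmed for the installed release.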

Architecturally, the framework separates scenarios, adapters, models, and metrics into independently extensible Python interfaces. A scenario produces a list of instances (input, reference output, split tag); an adapter rewrites those instances into model-specific prompts (typically multiple-choice joint format or generation format with in-context examples); the model interface wraps a unified completion or chat API; and a metric consumes both the prompt and the completion to produce a numeric score. The same helm-run command that drives HELM Classic also drives the Lite, Capabilities, Safety, Instruct, and MMLU tracks, by swapping the scenario, adapter, and metric set.^[1] CRFM provides connectors for the Anthropic API, OpenAI API, Google Gemini, Cohere, Hugging Face Hub model endpoints, and a number of local-model runtimes, allowing a single configuration to evaluate both closed-API and open-weight systems under the same prompting protocol.^[1]
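The separation into scenario, adapter, model, and metric can be sketched with plain functions. Every name below (`Instance`, `toy_scenario`, `generation_adapter`, `toy_model`, `exact_match`, `evaluate`) is hypothetical and greatly simplified relative to the real interfaces in the HELM codebase; the toy model stands in for a completion API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instance:
    input: str
    reference: str
    split: str  # "train" instances serve as in-context examples, "test" are scored


def toy_scenario() -> list[Instance]:
    """Scenario: produces instances with input, reference output, and split tag."""
    return [
        Instance("2 + 2 =", "4", "train"),
        Instance("3 + 5 =", "8", "test"),
        Instance("1 + 6 =", "7", "test"),
    ]


def generation_adapter(instances: list[Instance]) -> list[tuple[str, Instance]]:
    """Adapter: prepend train instances as in-context examples to each test input."""
    shots = "\n".join(f"{i.input} {i.reference}" for i in instances if i.split == "train")
    return [(f"{shots}\n{i.input}", i) for i in instances if i.split == "test"]


def toy_model(prompt: str) -> str:
    """Model stand-in: solves the final 'a + b =' line of the prompt."""
    a, _, b, _ = prompt.splitlines()[-1].split()
    return f" {int(a) + int(b)}"


def exact_match(prompt: str, completion: str, instance: Instance) -> float:
    """Metric: consumes the prompt and completion and returns a numeric score."""
    return 1.0 if completion.strip() == instance.reference else 0.0


def evaluate(scenario, adapter, model, metric) -> float:
    """Run the four stages in order and average the per-instance scores."""
    requests = adapter(scenario())
    scores = [metric(prompt, model(prompt), inst) for prompt, inst in requests]
    return sum(scores) / len(scores)


good = evaluate(toy_scenario, generation_adapter, toy_model, exact_match)
bad = evaluate(toy_scenario, generation_adapter, lambda prompt: "0", exact_match)
```

Because each stage is passed in as an argument, swapping the scenario, adapter, or metric changes the evaluation track without touching the driver, which mirrors how a single entry point can serve several leaderboards.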

The HELM live leaderboard at crfm.stanford.edu/helm hosts multiple separately-versioned tracks (Classic, Lite, Capabilities, Safety, Instruct, MMLU, Long Context, HEIM, VHELM, MedHELM, Audio, Enterprise) and exposes per-prompt and per-completion records so that raw model outputs can be inspected directly.^[4] Each evaluation run is archived under a versioned URL (for example helm/safety/v1.8.0) so that earlier rankings remain inspectable even after the leaderboard moves on.^[6]

How does HELM compare with other LLM leaderboards?

HELM coexists with several other public LLM evaluation systems, each making different methodological choices.

Leaderboard	Operator	Scoring approach	Notes
HELM (Classic, Lite, Capabilities, Safety)	Stanford CRFM	Multi-metric (accuracy + 6 others in Classic) over a fixed scenario set; uniform 5-shot prompting; raw outputs published.^[2]^[5]^[6]^[15]	Emphasizes transparency, holistic metrics, and reproducible prompting.^[3]
BIG-Bench	Google / community (444 authors, 132 institutions)	Aggregates 204 tasks contributed by the community; primary metric is accuracy with task-specific variants. Published 2022 as "Beyond the Imitation Game".^[19]	Strength is task diversity; weakness is uneven task quality and per-task accuracy focus.^[19]
Open LLM Leaderboard (HuggingFace v1, 2023-2024)	HuggingFace	Used six accuracy benchmarks (MMLU, HellaSwag, ARC, TruthfulQA, Winogrande, GSM8K) run through the EleutherAI LM Evaluation Harness; >7,000 open models evaluated.^[20]	Retired in June 2024 after extensive saturation and contamination concerns on its constituent datasets.^[20]
Open LLM Leaderboard v2 (2024 onwards)	HuggingFace	Six newer benchmarks: IFEval, BBH, MATH, GPQA, MuSR, MMLU-Pro. Single-metric accuracy-style aggregation.^[20]	Designed to harden v1 against contamination; remains accuracy-focused.^[20]
Chatbot Arena	LMSYS / LMArena	Crowd-sourced pairwise human preferences; Elo-style ratings.^[21]	Measures perceived conversation quality, complementary to HELM's structured scenario approach.^[21]

The HELM team has consistently argued that its differentiator is the combination of standardized prompting plus multi-metric measurement, rather than the choice of benchmarks alone: HELM Classic, for example, runs many of the same datasets used by the EleutherAI LM Evaluation Harness, but pairs them with calibration, robustness, fairness, bias, toxicity, and efficiency scores.^[2]^[3]^[20]

Why is HELM significant?

Within two years of its release, HELM had become one of the most cited reference frameworks in LLM evaluation literature, and several of its sub-projects produced widely-quoted findings: HELM MMLU's documentation of vendor over-reporting on MMLU, HELM Safety's quantification of safety degradation under adversarial prompts, and VHELM's analysis of bias regressions in lightweight VLM variants.^[6]^[7]^[17] Its leaderboard pages provide one of the few public, prompt-level audit trails for closed-API models, which has been useful for academic analyses of contamination, prompt sensitivity, and reasoning behaviors.^[4]^[11] HELM has also served as a template for domain-specific frameworks beyond CRFM: IBM's helm-enterprise-benchmark reuses HELM's infrastructure to evaluate LLMs on enterprise domain datasets in finance, legal, climate, and cybersecurity.^[18]

Limitations and criticisms

HELM's authors and external commentators have identified several limitations.

Coverage of languages and modalities. HELM Classic's scenarios are predominantly English-language, although several English varieties (including African-American English) are used in the fairness perturbations.^[2]^[11] HEIM, VHELM, MedHELM, and Audio HELM extend coverage to other modalities but remain largely English at launch.^[16]^[17]^[8]
Cost. The HELM Lite announcement explicitly acknowledged that running HELM Classic on a new closed-API model was expensive and slow, motivating the Lite redesign; the original 2022 run cost about $38,000 in API spend plus nearly 20,000 GPU-hours.^[5]^[3]
Sensitivity to prompting. Even with HELM's uniform prompting protocol, model rankings on MMLU in particular are sensitive to prompt format, in-context example choice, and answer-extraction heuristics. HELM MMLU documents up to five-percentage-point discrepancies between its standardized re-evaluations and vendor-reported numbers.^[7] Independent research has also shown that leaderboard rankings built on multiple-choice benchmarks such as MMLU can shift substantially under minor perturbations, such as reordering the answer choices or changing the answer-selection method.^[22]
Saturation and contamination. Several of HELM's original scenarios (notably MMLU, HellaSwag and TruthfulQA) have been shown to be partially memorized by frontier models trained on large web crawls, a concern that motivated the HuggingFace Open LLM Leaderboard v2 transition in 2024.^[20]^[22] HELM Capabilities responded by switching toward harder, more recent benchmarks (MMLU-Pro, GPQA, Omni-MATH).^[15]
LM-as-judge reliability. HELM Instruct and HELM Safety both rely partially on LLM judges, but HELM Safety itself documented that Claude 3.5 Sonnet refused to grade roughly a quarter of harmful outputs, illustrating that even the framework's own internal use of LM judges has open methodological problems.^[6]^[13]
Maintenance mode. CRFM has indicated that the HELM framework entered maintenance mode on 1 June 2026, raising open questions about long-term updates and adoption.^[1]

Related work

MMLU is the single most prominent constituent benchmark inside HELM and the subject of its own HELM track.^[7]
MMLU-Pro, GPQA, IFEval, WildBench, and Omni-MATH are the five core capability benchmarks in HELM Capabilities v1.0.0.^[15]
BIG-Bench (with BIG-Bench Hard as a curated subset) is a contemporaneous community benchmark with which HELM is frequently compared.^[19]
BBQ, HarmBench, and the WMDP benchmark are safety-oriented benchmarks; the first two are integrated directly into HELM Safety.^[6]
Chatbot Arena is the leading preference-based LLM leaderboard, complementing HELM's reference-based evaluation.^[21]
MTEB and GLUE are major non-HELM benchmark suites for embeddings and language understanding respectively.

References

[1] Stanford CRFM, "stanford-crfm/helm: Holistic Evaluation of Language Models", GitHub repository (Apache-2.0 license, ~2.8k stars; maintenance-mode notice effective 2026-06-01), 2022-2026. https://github.com/stanford-crfm/helm. Accessed 2026-06-24.
[2] Percy Liang, Rishi Bommasani, Tony Lee, Dimitris Tsipras, Dilara Soylu, Michihiro Yasunaga, Yian Zhang, Deepak Narayanan, Yuhuai Wu, Ananya Kumar et al., "Holistic Evaluation of Language Models", arXiv:2211.09110, 2022-11-16 (v1) and 2023-10-01 (v2). https://arxiv.org/abs/2211.09110. Accessed 2026-06-24.
[3] Stanford CRFM, "Holistic Evaluation of Language Models (HELM)", CRFM blog, 2022-11-17. https://crfm.stanford.edu/2022/11/17/helm.html. Accessed 2026-06-24.
[4] Stanford CRFM, "Holistic Evaluation of Language Models (HELM) leaderboard", crfm.stanford.edu, 2022-2026. https://crfm.stanford.edu/helm/. Accessed 2026-06-24.
[5] Stanford CRFM, "HELM Lite: Lightweight and Broad Capabilities Evaluation", CRFM blog, 2023-12-19. https://crfm.stanford.edu/2023/12/19/helm-lite.html. Accessed 2026-06-24.
[6] Stanford CRFM, "HELM Safety v1.0", CRFM blog, 2024-11-08. https://crfm.stanford.edu/2024/11/08/helm-safety.html. Accessed 2026-06-24.
[7] Stanford CRFM, "HELM MMLU: Massive Multitask Language Understanding", CRFM blog, 2024-05-01. https://crfm.stanford.edu/2024/05/01/helm-mmlu.html. Accessed 2026-06-24.
[8] Stanford CRFM and Stanford Medicine, "Holistic evaluation of large language models for medical tasks with MedHELM", Nature Medicine, 2025. https://www.nature.com/articles/s41591-025-04151-2. Accessed 2026-06-24.
[9] Percy Liang et al., "Holistic Evaluation of Language Models", Transactions on Machine Learning Research (TMLR), published 2023-08. https://jmlr.org/tmlr/. Accessed 2026-06-24.
[10] Stanford CRFM, "Releases - stanford-crfm/helm" and "crfm-helm" on PyPI, latest release v0.5.16 dated 2026-04-30. https://github.com/stanford-crfm/helm/releases and https://pypi.org/project/crfm-helm/. Accessed 2026-06-24.
[11] Stanford CRFM, "HELM Classic methodology overview (paper abstract and findings)", arXiv:2211.09110 abstract page, 2022-11-16. https://arxiv.org/abs/2211.09110. Accessed 2026-06-24.
[12] Stanford CRFM, "Scenarios", CRFM HELM Read-the-Docs documentation, 2024. https://crfm-helm.readthedocs.io/en/latest/scenarios/. Accessed 2026-06-24.
[13] Stanford CRFM (Yian Zhang, Yifan Mai, Josselin Somerville Roberts, Rishi Bommasani, Yann Dubois, Percy Liang), "HELM Instruct: A Multidimensional Instruction Following Evaluation Framework with Absolute Ratings", CRFM blog, 2024-02-18. https://crfm.stanford.edu/2024/02/18/helm-instruct.html. Accessed 2026-06-24.
[14] Yi Zeng et al., "AIR-Bench 2024: A Safety Benchmark Based on Risk Categories from Regulations and Policies", arXiv:2407.17436, 2024. https://arxiv.org/abs/2407.17436. Accessed 2026-06-24.
[15] Stanford CRFM, "HELM Capabilities v1.0.0", CRFM blog, 2025-03-20. https://crfm.stanford.edu/2025/03/20/helm-capabilities.html. Accessed 2026-06-24.
[16] Tony Lee et al., "Holistic Evaluation of Text-to-Image Models", arXiv:2311.04287; NeurIPS 2023 Datasets and Benchmarks Track, 2023-11-07. https://arxiv.org/abs/2311.04287. Accessed 2026-06-24.
[17] Tony Lee et al., "VHELM: A Holistic Evaluation of Vision Language Models", arXiv:2410.07112; NeurIPS 2024 Datasets and Benchmarks Track, 2024. https://arxiv.org/abs/2410.07112. Accessed 2026-06-24.
[18] IBM Research, "IBM/helm-enterprise-benchmark", GitHub repository, 2024-2025. https://github.com/IBM/helm-enterprise-benchmark. Accessed 2026-06-24.
[19] BIG-bench Collaboration (Aarohi Srivastava et al.), "Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models", arXiv:2206.04615, 2022. https://arxiv.org/abs/2206.04615. Accessed 2026-06-24.
[20] HuggingFace, "Open LLM Leaderboard v1 (archive) and v2 documentation", huggingface.co, 2023-2024. https://huggingface.co/docs/leaderboards/en/open_llm_leaderboard/archive. Accessed 2026-06-24.
[21] LMSYS / LMArena, "Chatbot Arena: Benchmarking LLMs in the Wild with Elo Ratings", lmarena.ai blog and HuggingFace Space, 2023-2025. https://huggingface.co/spaces/lmarena-ai/chatbot-arena-leaderboard. Accessed 2026-06-24.
[22] Norah Alzahrani et al., "When Benchmarks are Targets: Revealing the Sensitivity of Large Language Model Leaderboards", arXiv:2402.01781, 2024. https://arxiv.org/abs/2402.01781. Accessed 2026-06-24.
[23] Stanford CRFM, "HELM Long Context", CRFM blog, 2025-09-29. https://crfm.stanford.edu/2025/09/29/helm-long-context.html. Accessed 2026-06-24.
