HELM (Holistic Evaluation of Language Models)
HELM (Holistic Evaluation of Language Models) is an open-source benchmark framework created by the Center for Research on Foundation Models (CRFM) at Stanford University for the reproducible and transparent evaluation of large language models and other foundation models.[1] First released in November 2022 alongside the arXiv preprint "Holistic Evaluation of Language Models" by Percy Liang, Rishi Bommasani, Tony Lee and 47 co-authors, HELM evaluates models across many scenarios on seven categories of metrics at once: accuracy, calibration, robustness, fairness, bias, toxicity, and efficiency, all under a single uniform prompting protocol so that any two models are compared under the same conditions.[2][3] The authors state the goal plainly: "We present Holistic Evaluation of Language Models (HELM) to improve the transparency of language models."[2] The launch evaluation ran 30 models from 12 organizations across 42 scenarios, conducting more than 4,900 evaluations over roughly 12 billion tokens and 17 million model API calls at a cost of about $38,000 for commercial APIs plus nearly 20,000 GPU-hours for open models.[3][11] Since the original release ("HELM Classic") the framework has grown into a family of leaderboards, including HELM Lite, HELM Instruct, HELM MMLU, HELM Safety, HELM Capabilities, HELM Long Context, VHELM for vision-language models, HEIM for text-to-image models, and MedHELM for medical tasks, all hosted at crfm.stanford.edu/helm.[1][4][5][6][7][8][23]
| Item | Value |
|---|---|
| Full name | Holistic Evaluation of Language Models |
| Type | Open-source LLM benchmark framework and living leaderboard |
| Creator | Stanford Center for Research on Foundation Models (CRFM) |
| Lead authors | Percy Liang, Rishi Bommasani, Tony Lee, Dimitris Tsipras (et al., 50 total)[2] |
| First arXiv release | 16 November 2022 (arXiv:2211.09110)[2] |
| TMLR publication | August 2023[9] |
| Original evaluation | 30 models, 12 organizations, 42 scenarios, 7 metrics[3] |
| GitHub | stanford-crfm/helm (Apache-2.0 license, ~2.8k stars)[1] |
| Leaderboard | crfm.stanford.edu/helm[4] |
| Latest framework release | v0.5.16 (30 April 2026) on GitHub and PyPI[10] |
| Maintenance mode | Entered 1 June 2026[1] |
By late 2022, large language models such as GPT-3, BLOOM, and Anthropic-LM had proliferated, but their evaluation was fragmented: model creators reported scores on different subsets of benchmarks, often with non-comparable prompting conventions, and risk-oriented metrics like toxicity or bias were rarely reported alongside accuracy.[2][3] In the HELM paper Liang and colleagues observed that, before their work, prominent models had on average been evaluated on only 17.9% of HELM's core scenarios, leaving substantial gaps in the public record of model capabilities.[11] As the CRFM team put it in the launch announcement, "As language models become the substrate for language technologies, the absence of an evaluation standard compromises the community's ability to see the full landscape of language models," and "Transparency is the vital first step" toward trust and standards.[3]
HELM was developed at CRFM, the foundation-models center launched within Stanford HAI in 2021 with Percy Liang as faculty director; the paper's 50 listed authors include Rishi Bommasani, Tony Lee, and Christopher Manning.[2] The paper was first posted to arXiv on 16 November 2022 and announced the next day on the CRFM blog; it was published in Transactions on Machine Learning Research (TMLR) in August 2023, and a substantially revised v2 followed on arXiv on 1 October 2023.[2][3][9]
The framework was deliberately positioned as a "living benchmark": the authors released raw prompts, completions, and a modular Python toolkit so that researchers could add new scenarios, new metrics, or new models and re-run the evaluation themselves, with results aggregated into a public leaderboard at crfm.stanford.edu/helm.[11][4] The blog post framed the long-term ambition directly: "We intend for HELM to serve as a map for the world of language models, continually updated over time, through collaboration with the broader community."[3]
HELM's central claim is that evaluation of language models should be holistic. The paper distills this into three principles: broad coverage with explicit recognition of what remains uncovered, multi-metric measurement, and standardization of the conditions under which models are evaluated.[2][3]
The third principle, standardization, distinguishes HELM from the way many model cards report numbers. The HELM MMLU effort, for example, found that scores reported by model providers on MMLU often differed from HELM's standardized re-evaluations by as much as five percentage points, and that reported numbers were frequently higher than HELM's, suggesting advantageous prompting in vendor reports.[7]
HELM Classic defines seven categories of metrics, each applied wherever feasible to each core scenario.[2][3]
| Metric | What it measures |
|---|---|
| Accuracy | Standard task-specific quality (exact match, F1, ROUGE, etc.) on the scenario's reference answers.[2] |
| Calibration | How well a model's expressed confidence matches its empirical correctness; computed where token-level log-probabilities are available.[2] |
| Robustness | Performance under typo-style and equivalence-preserving perturbations of the input, and under invariance-style transformations meant to mimic real-world noise.[2] |
| Fairness | Performance shifts when demographic features (names, dialects) in the input are changed, including rewrites of inputs into African-American English and counterfactual demographic swaps.[2] |
| Bias | Demographic representation in model outputs (e.g., gender or race associations in generation), measured independently of correctness.[2] |
| Toxicity | Rate of harmful or insulting generations, scored automatically using a toxicity classifier on free-form outputs.[2] |
| Efficiency | Wall-clock and idealized inference cost, allowing comparison of accuracy against compute or latency budgets.[2] |
Because not every metric is well-defined for every scenario (calibration, for example, requires access to token-level probabilities), HELM Classic reports measuring its metrics on 87.5% of the metric-by-scenario grid.[2][11]
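To make the calibration row of the table concrete, the following is a minimal sketch of expected calibration error (ECE), a common way to compare stated confidence with empirical correctness. The equal-width binning, the default of 10 bins, and the function name are assumptions made for illustration; this is not HELM's own implementation.

```python
from typing import List, Sequence, Tuple


def expected_calibration_error(
    confidences: Sequence[float],
    correct: Sequence[bool],
    num_bins: int = 10,
) -> float:
    """Equal-width-bin ECE: weighted mean of |accuracy - confidence| per bin."""
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    if not confidences:
        raise ValueError("need at least one prediction")
    if any(c < 0.0 or c > 1.0 for c in confidences):
        raise ValueError("confidences must lie in [0, 1]")

    bins: List[List[Tuple[float, bool]]] = [[] for _ in range(num_bins)]
    for conf, ok in zip(confidences, correct):
        # Clamp so that a confidence of exactly 1.0 lands in the last bin.
        index = min(int(conf * num_bins), num_bins - 1)
        bins[index].append((conf, ok))

    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        # Each bin contributes its gap, weighted by its share of predictions.
        ece += (len(members) / total) * abs(accuracy - avg_conf)
    return ece
```

A model that answers with 90% confidence but is right only half the time has a gap of about 0.4 in this sketch, while a model that is always right at full confidence scores 0.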
HELM Classic defines 16 core scenarios that span six user-facing task families: question answering, information retrieval, summarization, sentiment analysis, toxicity detection, and miscellaneous text classification.[11] Concrete datasets used as core scenarios include MMLU, BoolQ, NarrativeQA, NaturalQuestions, QuAC, HellaSwag, OpenbookQA, TruthfulQA, MS MARCO, CNN/DailyMail, XSum, IMDB, CivilComments, and RAFT, among others.[12] On top of the core, the paper adds seven targeted evaluations built on 26 additional scenarios, covering skills such as reasoning, knowledge, language modeling, and disinformation generation. This brings the original release to 42 scenarios in total, 21 of which had not previously been used in mainstream LM evaluation.[2][12]
Each scenario is implemented as a Python class that produces standardized in-context-learning prompts and pairs them with reference outputs or scoring functions; the same scenario object is used to evaluate every model.[1][11]
The launch evaluation in the HELM paper covered 30 prominent language models from 12 organizations: AI21 Labs, Anthropic, BigScience, Cohere, EleutherAI, Google, Meta, Microsoft, NVIDIA, OpenAI, Tsinghua University, and Yandex.[3] Notable systems included OpenAI GPT-3 and InstructGPT variants, BLOOM, Anthropic-LM, and Meta's OPT.[3] In aggregate the team conducted more than 4,900 evaluations spanning roughly 12 billion tokens of inference and 17 million model API calls, raising scenario coverage of these models from an average 17.9% to 96.0%.[3][11] The compute bill was substantial and itemized by the authors: about $38,000 in commercial API spend plus nearly 20,000 GPU-hours for the open models.[3] The paper distilled the results into 25 top-level findings, including the observation that no single model dominated across all metrics: even the strongest accuracy models had measurable bias and calibration deficits, and efficient smaller models were sometimes competitive on individual tasks.[11]
A second observation from the launch was the tradeoff between instruction-tuned and base models. Instruction-tuned variants, such as the InstructGPT family, dominated open-ended generation scenarios but did not necessarily improve on multiple-choice accuracy. The HELM team also reported sizeable gaps between open and closed models on accuracy, with proprietary systems such as OpenAI's text-davinci-002 clearly outperforming the strongest open systems available at the time, while open systems were sometimes competitive on individual scenarios such as sentiment analysis and short-form question answering.[11] HELM's structured presentation of these tradeoffs was widely cited in subsequent work and quickly adopted as a reference design for downstream evaluations.[11][3]
By late 2023, HELM Classic had grown unwieldy: full evaluation of a new model required running all 42 scenarios with three random seeds and many perturbations, which was expensive both in API spend and in compute. On 19 December 2023, CRFM published HELM Lite v1.0.0, a deliberately stripped-down version focused on capabilities rather than the full multi-metric matrix.[5]
HELM Lite simplifies HELM Classic in four ways:[5]
The HELM Lite scenario set contains nine benchmarks emphasizing generation rather than multiple choice: NarrativeQA, NaturalQuestions, OpenbookQA, a five-subject subset of MMLU, the MATH and GSM8K math benchmarks, a five-task subset of LegalBench, MedQA, and WMT-14 machine translation across five language pairs.[5] The launch evaluation ranked 28 model variants from 11 organizations by mean win rate, with GPT-4 on top overall; smaller models such as Writer's Palmyra-X and 01.AI's Yi-34B were noted as unexpectedly strong, and on the NarrativeQA scenario Yi-34B outperformed GPT-4.[5] Safety evaluations, the CRFM team noted, were deliberately handled outside HELM Lite via a separate partnership with MLCommons' AI safety working group.[5]
On 18 February 2024 CRFM published HELM Instruct, an instruction-following evaluation framework with absolute (rather than pairwise) ratings, authored by Yian Zhang, Yifan Mai, Josselin Somerville Roberts, Rishi Bommasani, Yann Dubois, and Percy Liang.[13] HELM Instruct argues that existing instruction-following evaluations were either reference-based (which assumes one correct answer) or relative (which ranks models against each other without measuring distance from perfection). It proposes three principles for instruction-following evaluation: it should be open-ended (admitting many valid outputs), multidimensional (graded on multiple axes), and absolute (using a 1 to 5 scale).[13]
The framework combines:[13]
Reported findings included that GPT-4 was the best candidate overall, Claude excelled specifically on understandability and harmlessness, and that LM judges showed Pearson correlations of 0.48 to 0.72 with human raters depending on the criterion, with GPT-4 a closer match to humans than Claude.[13]
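As an illustration of the judge-versus-human agreement statistic mentioned above, here is a minimal sketch of the Pearson correlation between two lists of 1 to 5 ratings. The rating values are hypothetical and chosen only to exercise the function; they are not data from HELM Instruct.

```python
import math
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical ratings of the same six responses on a 1-5 scale.
human_ratings = [5, 4, 2, 3, 1, 4]
judge_ratings = [4, 4, 1, 3, 2, 5]
agreement = pearson(human_ratings, judge_ratings)
```

Computing this per criterion, as the reported range of 0.48 to 0.72 suggests, would mean calling `pearson` once for each rating dimension.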
CRFM published HELM MMLU on 1 May 2024 as a leaderboard dedicated to the Massive Multitask Language Understanding (MMLU) test.[7] The motivation was that despite MMLU's prominence, scores reported by different vendors used inconsistent prompts, formats, and answer-extraction heuristics, making cross-model comparison unreliable.[7]
HELM MMLU re-evaluates models on all 57 MMLU subjects using a single "Multiple Choice Joint" adaptation method that instructs the model to output a single letter (A, B, C, or D). Two model-specific concessions are documented: Claude 2 is queried through Anthropic's Human/Assistant format because the API rejects other prompt shapes, and Claude 3 is given the explicit instruction "Answer with only a single letter" to suppress chain-of-thought style responses.[7] The launch evaluated 26 models, including Claude Instant through Claude 3 Opus, Gemini 1.0 Pro, GPT-4, Llama 2 and Llama 3 variants, Mistral and Mixtral, Gemma, PaLM 2, and Qwen.[7] Headline finding: HELM's standardized MMLU scores diverged from vendor-reported scores by up to five percentage points, and almost always in the direction of lower HELM scores, consistent with vendors having used more favorable prompting at evaluation time.[7]
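The "joint" multiple-choice format described above can be sketched as follows: every option is shown in one prompt and the model is expected to reply with a single letter. The exact prompt wording, the class and function names, and the answer-extraction rule are assumptions for illustration, not HELM's actual templates or code.

```python
import re
from dataclasses import dataclass
from typing import List, Optional

LETTERS = "ABCD"


@dataclass
class MultipleChoiceItem:
    question: str
    options: List[str]            # four options, in A-D order
    answer: Optional[str] = None  # gold letter; None for the item under test


def format_item(item: MultipleChoiceItem, include_answer: bool) -> str:
    """Render one question with lettered options and an 'Answer:' slot."""
    if include_answer and item.answer is None:
        raise ValueError("in-context examples need a gold answer")
    lines = [f"Question: {item.question}"]
    lines += [f"{letter}. {option}" for letter, option in zip(LETTERS, item.options)]
    lines.append(f"Answer: {item.answer}" if include_answer else "Answer:")
    return "\n".join(lines)


def build_joint_prompt(
    instructions: str,
    examples: List[MultipleChoiceItem],
    target: MultipleChoiceItem,
) -> str:
    """Instructions, then solved examples, then the unsolved target question."""
    blocks = [instructions]
    blocks += [format_item(example, include_answer=True) for example in examples]
    blocks.append(format_item(target, include_answer=False))
    return "\n\n".join(blocks)


def extract_letter(completion: str) -> Optional[str]:
    """Accept a leading standalone letter such as ' B', 'B.' or '(C)'."""
    match = re.match(r"\s*\(?([ABCD])\b", completion)
    return match.group(1) if match else None


example = MultipleChoiceItem("What is 2 + 2?", ["3", "4", "5", "6"], answer="B")
target = MultipleChoiceItem("What is 3 + 3?", ["5", "6", "7", "8"])
prompt = build_joint_prompt("Answer with only a single letter.", [example], target)
```

Keeping the extraction rule strict (a standalone leading letter) is a deliberate choice in this sketch: a reply that begins with prose such as "Because..." is treated as no answer rather than being credited for starting with the letter B.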
On 8 November 2024, CRFM released HELM Safety v1.0, a standardized safety leaderboard built on the HELM framework.[6] The launch argued that, although capability benchmarks had converged on a small canonical set, safety evaluation was fragmented: of 102 published safety benchmarks reviewed, only 12 had been used to evaluate any state-of-the-art model, and external evaluations rarely disclosed prompts and outputs.[6]
HELM Safety v1.0 packages five existing safety benchmarks under one harness and runs them on 24 prominent LLMs from Anthropic, OpenAI, Google, Meta, Alibaba, Cohere, Databricks, DeepSeek, and Mistral.[6] The five constituent benchmarks span six risk categories (violence, fraud, discrimination, sexual content, harassment, deception):[6]
| Benchmark | Coverage |
|---|---|
| BBQ | 58,492 bias-benchmark multiple-choice questions on social discrimination.[6] |
| SimpleSafetyTests | 100 unsafe prompts covering sexual content and violence.[6] |
| HarmBench | 321 red-team prompts on deception, fraud, violence and harassment, scored with automated graders.[6] |
| AnthropicRedTeam | 38,961 red-team attack transcripts across multiple harm categories.[6] |
| XSTest | 450 prompts designed to surface the helpfulness vs harmlessness tradeoff.[6] |
Claude 3.5 Sonnet (June 2024 release) ranked first overall, with particular strength on HarmBench.[6] The launch also documented a methodological problem: when using LLMs as safety graders, Claude 3.5 Sonnet refused to grade harmful outputs at rates approaching 27%, versus near-zero refusal from GPT-4o; the CRFM team interpreted this as miscalibrated refusal behavior that undermines LM-as-judge safety evaluation.[6] Models also lost roughly 26% on average to adversarial red-teaming methods, with some models degrading by 55% under attack.[6] HELM Safety has since absorbed additional benchmarks, including AIR-Bench 2024 (Stanford CRFM, 2024), which derives 314 fine-grained risk categories from 8 government regulations and 16 company policies and contributes 5,694 prompts evaluated through HELM.[14]
On 20 March 2025, CRFM released HELM Capabilities v1.0.0, the current general-capability successor to HELM Lite.[15] HELM Capabilities is organized around five competencies, each instantiated by a single dataset:
The launch evaluated 22 models from OpenAI, Anthropic, Google, Meta, Mistral, Qwen, DeepSeek, and Amazon, among others. HELM Capabilities differs from HELM Lite chiefly in aggregation: it uses mean per-scenario score rather than mean win rate, a choice CRFM made to reduce sensitivity of model rankings to the composition of the model set being compared.[15]
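The difference between the two aggregation rules can be seen in a small sketch. Mean win rate is taken here to be the fraction of other models that a model strictly beats on a scenario, averaged over scenarios; the tie handling, the model names, and the scores are all hypothetical.

```python
from typing import Dict

Scores = Dict[str, Dict[str, float]]  # model -> scenario -> score


def mean_score(scores: Scores) -> Dict[str, float]:
    """Average each model's per-scenario scores; independent of other models."""
    return {model: sum(s.values()) / len(s) for model, s in scores.items()}


def mean_win_rate(scores: Scores) -> Dict[str, float]:
    """Average, over scenarios, of the share of other models strictly beaten."""
    models = list(scores)
    if len(models) < 2:
        raise ValueError("win rate needs at least two models")
    scenarios = list(scores[models[0]])
    result = {}
    for model in models:
        others = [other for other in models if other != model]
        rates = []
        for scenario in scenarios:
            wins = sum(
                1 for other in others
                if scores[model][scenario] > scores[other][scenario]
            )
            rates.append(wins / len(others))
        result[model] = sum(rates) / len(rates)
    return result


# Hypothetical scores: two models, then the same two plus a third.
two_models: Scores = {
    "model_a": {"s1": 0.80, "s2": 0.40},
    "model_b": {"s1": 0.70, "s2": 0.50},
}
three_models: Scores = {**two_models, "model_c": {"s1": 0.75, "s2": 0.10}}

before = mean_win_rate(two_models)    # model_a and model_b are tied
after = mean_win_rate(three_models)   # model_a now ranks above model_b
```

In this toy data, adding `model_c` breaks the tie between `model_a` and `model_b` under mean win rate even though neither model's own scores changed, whereas `mean_score` returns the same values for both models in either comparison set.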
On 29 September 2025, CRFM published the HELM Long Context leaderboard, which the team describes as providing "transparent, comparable and reproducible evaluations of long context capabilities of recent models."[23] It draws five tasks from three existing long-context benchmarks: RULER SQuAD and RULER HotPotQA, the InfiniteBench (∞Bench) En.MC multiple-choice and En.Sum summarization tasks, and OpenAI-MRCR multi-round coreference resolution.[23] The launch evaluated 10 models from five organizations (Amazon, Google, Meta, OpenAI, and Writer). OpenAI's GPT-4.1 obtained the highest mean score of 0.588 and topped both this leaderboard and HELM Capabilities, with a Spearman rank correlation of 0.90 between the two rankings; even so, the best MRCR score was only 0.256, which the authors flagged as substantial headroom on a task they call computationally simple.[23]
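For readers unfamiliar with the rank statistic quoted above, the sketch below computes a Spearman rank correlation between two leaderboards using the closed form for rankings without ties. The model names and scores are hypothetical and are not taken from either HELM leaderboard.

```python
from typing import Dict


def ranks(scores: Dict[str, float]) -> Dict[str, int]:
    """Rank models from 1 (highest score) downward; assumes no tied scores."""
    ordered = sorted(scores, key=lambda model: scores[model], reverse=True)
    return {model: position + 1 for position, model in enumerate(ordered)}


def spearman_no_ties(a: Dict[str, float], b: Dict[str, float]) -> float:
    """Spearman's rho via 1 - 6 * sum(d^2) / (n * (n^2 - 1)); no ties allowed."""
    if set(a) != set(b):
        raise ValueError("both leaderboards must cover the same models")
    n = len(a)
    if n < 2:
        raise ValueError("need at least two models")
    rank_a, rank_b = ranks(a), ranks(b)
    d_squared = sum((rank_a[m] - rank_b[m]) ** 2 for m in a)
    return 1 - 6 * d_squared / (n * (n * n - 1))


# Hypothetical mean scores for five models on two leaderboards.
leaderboard_x = {"m1": 0.59, "m2": 0.55, "m3": 0.50, "m4": 0.42, "m5": 0.30}
leaderboard_y = {"m1": 0.71, "m2": 0.64, "m3": 0.66, "m4": 0.52, "m5": 0.40}
rho = spearman_no_ties(leaderboard_x, leaderboard_y)
```

Here the two hypothetical leaderboards differ only in the order of `m2` and `m3`, so the correlation stays high; with tied scores, average ranks and the Pearson formula on ranks would be needed instead of this shortcut.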
The HELM framework has been extended to non-text modalities and to specialized domains, all under the same repository (stanford-crfm/helm).[1]
HELM is distributed as the crfm-helm Python package on PyPI under the Apache 2.0 license, with command-line tools helm-run, helm-summarize, and helm-server for executing benchmarks, summarizing results, and serving a local web UI.[1] The GitHub repository stanford-crfm/helm (about 2,800 stars) has a long release history: the latest release, v0.5.16, was tagged on 30 April 2026, preceded by monthly minor releases through 2024, 2025, and early 2026.[10] CRFM states that "HELM entered maintenance mode on June 1, 2026," meaning new feature development winds down while the existing leaderboards remain published.[1]
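A typical run chains the three command-line tools in order. The sketch below only assembles the command lines and prints them by default; the flags (`--run-entries`, `--suite`, `--max-eval-instances`) and the example run entry follow the project's README as recalled at the time of writing and should be treated as assumptions that may differ between versions. The helper names are hypothetical.

```python
import shlex
import subprocess
from typing import List


def helm_commands(run_entry: str, suite: str, max_eval_instances: int) -> List[List[str]]:
    """Build argv lists for a run, a summary, and the local results server."""
    return [
        ["helm-run", "--run-entries", run_entry, "--suite", suite,
         "--max-eval-instances", str(max_eval_instances)],
        ["helm-summarize", "--suite", suite],
        ["helm-server", "--suite", suite],
    ]


def run_all(commands: List[List[str]], dry_run: bool = True) -> None:
    """Print each command; execute it only when dry_run is False."""
    for argv in commands:
        print(shlex.join(argv))
        if not dry_run:
            # Note: helm-server keeps running until interrupted.
            subprocess.run(argv, check=True)


commands = helm_commands(
    run_entry="mmlu:subject=philosophy,model=openai/gpt2",
    suite="my-suite",
    max_eval_instances=10,
)
```

Calling `run_all(commands)` prints the three commands without executing anything, which is a safe way to check the invocation against the installed version's `--help` output before spending API budget.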
Architecturally, the framework separates scenarios, adapters, models, and metrics into independently extensible Python interfaces. A scenario produces a list of instances (input, reference output, split tag); an adapter rewrites those instances into model-specific prompts (typically multiple-choice joint format or generation format with in-context examples); the model interface wraps a unified completion or chat API; and a metric consumes both the prompt and the completion to produce a numeric score. The same helm-run command that drives HELM Classic also drives the Lite, Capabilities, Safety, Instruct, and MMLU tracks, by swapping the scenario, adapter, and metric set.[1] CRFM provides connectors for the Anthropic API, OpenAI API, Google Gemini, Cohere, Hugging Face Hub model endpoints, and a number of local-model runtimes, allowing a single configuration to evaluate both closed-API and open-weight systems under the same prompting protocol.[1]
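The separation of scenario, adapter, model, and metric can be illustrated with a toy pipeline. Every class and function name below is hypothetical and greatly simplified; none of it is HELM's real interface, and the stand-in "model" is an ordinary Python function rather than an API client.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class Instance:
    input: str
    reference: str
    split: str  # "train" (usable as an in-context example) or "test"


def toy_scenario() -> List[Instance]:
    """Scenario: yields instances, independent of any particular model."""
    return [
        Instance("2 + 2 =", "4", "train"),
        Instance("3 + 5 =", "8", "train"),
        Instance("1 + 6 =", "7", "test"),
        Instance("4 + 4 =", "8", "test"),
    ]


def generation_adapter(
    instances: List[Instance], max_shots: int = 5
) -> List[Tuple[str, Instance]]:
    """Adapter: turns test instances into few-shot generation prompts."""
    shots = [i for i in instances if i.split == "train"][:max_shots]
    prefix = "".join(f"Input: {s.input}\nOutput: {s.reference}\n\n" for s in shots)
    return [
        (f"{prefix}Input: {i.input}\nOutput:", i)
        for i in instances if i.split == "test"
    ]


def exact_match(completion: str, reference: str) -> float:
    """Metric: 1.0 if the stripped completion equals the reference, else 0.0."""
    return 1.0 if completion.strip() == reference.strip() else 0.0


def evaluate(model: Callable[[str], str], instances: List[Instance]) -> float:
    """Run every adapted prompt through the model and average the metric."""
    requests = generation_adapter(instances)
    scores = [exact_match(model(prompt), inst.reference) for prompt, inst in requests]
    return sum(scores) / len(scores)


def adding_model(prompt: str) -> str:
    """Stand-in model: parses the final 'a + b =' line and returns the sum."""
    last_input = prompt.rsplit("Input: ", 1)[1]
    left, right = last_input.split("=")[0].split("+")
    return f" {int(left) + int(right)}"
```

Because `evaluate` accepts any callable from prompt to completion, swapping `adding_model` for a wrapper around a hosted API or a local runtime leaves the scenario, adapter, and metric untouched, which is the property the architecture description emphasizes.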
The HELM live leaderboard at crfm.stanford.edu/helm hosts multiple separately-versioned tracks (Classic, Lite, Capabilities, Safety, Instruct, MMLU, Long Context, HEIM, VHELM, MedHELM, Audio, Enterprise) and exposes per-prompt and per-completion records so that raw model outputs can be inspected directly.[4] Each evaluation run is archived under a versioned URL (for example helm/safety/v1.8.0) so that earlier rankings remain inspectable even after the leaderboard moves on.[6]
HELM coexists with several other public LLM evaluation systems, each making different methodological choices.
| Leaderboard | Operator | Scoring approach | Notes |
|---|---|---|---|
| HELM (Classic, Lite, Capabilities, Safety) | Stanford CRFM | Multi-metric (accuracy + 6 others in Classic) over a fixed scenario set; uniform 5-shot prompting; raw outputs published.[2][5][6][15] | Emphasizes transparency, holistic metrics, and reproducible prompting.[3] |
| BIG-Bench | Google / community (444 authors, 132 institutions) | Aggregates 204 tasks contributed by the community; primary metric is accuracy with task-specific variants. Published 2022 as "Beyond the Imitation Game".[19] | Strength is task diversity; weakness is uneven task quality and per-task accuracy focus.[19] |
| Open LLM Leaderboard (Hugging Face v1, 2023-2024) | Hugging Face | Used six accuracy benchmarks (MMLU, HellaSwag, ARC, TruthfulQA, Winogrande, GSM8K) run through the EleutherAI LM Evaluation Harness; >7,000 open models evaluated.[20] | Retired in June 2024 after extensive saturation and contamination concerns on its constituent datasets.[20] |
| Open LLM Leaderboard v2 (2024 onwards) | Hugging Face | Six newer benchmarks: IFEval, BBH, MATH, GPQA, MuSR, MMLU-Pro. Single-metric accuracy-style aggregation.[20] | Designed to harden v1 against contamination; remains accuracy-focused.[20] |
| Chatbot Arena | LMSYS / LMArena | Crowd-sourced pairwise human preferences; Elo-style ratings.[21] | Measures perceived conversation quality, complementary to HELM's structured scenario approach.[21] |
The HELM team has consistently argued that its differentiator is the combination of standardized prompting and multi-metric measurement, rather than the choice of benchmarks alone: HELM Classic, for example, runs many of the same datasets used by the EleutherAI LM Evaluation Harness, but pairs their accuracy results with calibration, robustness, fairness, bias, toxicity, and efficiency scores.[2][3][20]
Within two years of its release, HELM had become one of the most cited reference frameworks in the LLM evaluation literature, and several of its sub-projects produced widely quoted findings: HELM MMLU's documentation of vendor over-reporting on MMLU, HELM Safety's quantification of safety degradation under adversarial prompts, and VHELM's analysis of bias regressions in lightweight VLM variants.[6][7][17] Its leaderboard pages provide one of the few public, prompt-level audit trails for closed-API models, which has been useful for academic analyses of contamination, prompt sensitivity, and reasoning behaviors.[4][11] HELM has also served as a template for domain-specific frameworks beyond CRFM: IBM's helm-enterprise-benchmark reuses HELM's infrastructure to evaluate LLMs on enterprise domain datasets in finance, legal, climate, and cybersecurity.[18]
HELM's authors and external commentators have identified several limitations.