HELM (Holistic Evaluation of Language Models)
HELM (Holistic Evaluation of Language Models) is an open-source benchmark framework created by the Center for Research on Foundation Models (CRFM) at Stanford University for the reproducible and transparent evaluation of large language models and other foundation models.[1] The project, first released in November 2022 with the arXiv preprint "Holistic Evaluation of Language Models" by Percy Liang, Rishi Bommasani, Tony Lee and 47 co-authors, argues that prior language-model evaluations focused too narrowly on accuracy and used non-standardized prompts, leaving model behavior on risks such as bias, toxicity, and fairness largely uncharacterized.[2] HELM responds with a top-down taxonomy: seven categories of metrics (accuracy, calibration, robustness, fairness, bias, toxicity, and efficiency) computed across a fixed set of core scenarios, and a uniform prompting protocol so that any two models are evaluated under the same conditions.[2][3] Since the original release ("HELM Classic") the framework has grown into a family of leaderboards, including HELM Lite, HELM Instruct, HELM MMLU, HELM Safety, HELM Capabilities, VHELM for vision-language models, HEIM for text-to-image models, and MedHELM for medical tasks, all hosted at crfm.stanford.edu/helm.[1][4][5][6][7][8]
Infobox
| Item | Value |
|---|---|
| Full name | Holistic Evaluation of Language Models |
| Type | Open-source LLM benchmark framework and living leaderboard |
| Creator | Stanford Center for Research on Foundation Models (CRFM) |
| Lead authors | Percy Liang, Rishi Bommasani, Tony Lee, Dimitris Tsipras (et al., 50 total)[2] |
| First arXiv release | 16 November 2022 (arXiv:2211.09110)[2] |
| TMLR publication | August 2023[9] |
| GitHub | stanford-crfm/helm (Apache-2.0 license)[1] |
| Leaderboard | crfm.stanford.edu/helm[4] |
| Latest framework release noted | v0.5.16 (30 April 2025) on GitHub[10] |
Background
By late 2022, large language models such as GPT-3, BLOOM, and Anthropic's earliest production model had proliferated, but their evaluation was fragmented: model creators reported scores on different subsets of benchmarks, often with non-comparable prompting conventions, and risk-oriented metrics like toxicity or bias were rarely reported alongside accuracy.[2][3] In the HELM paper Liang and colleagues observed that, before their work, on average prominent models had been evaluated on only 17.9% of HELM's core scenarios, leaving substantial gaps in the public record of model capabilities.[11]
HELM was developed at the CRFM, the foundation-models center launched inside Stanford HAI in 2021, with Percy Liang as faculty director; the paper's 50 listed authors include Rishi Bommasani, Tony Lee, and Christopher Manning.[2] The paper was first posted to arXiv on 16 November 2022 and announced on the CRFM blog on 17 November 2022 as "the first version of HELM"; it was published in Transactions on Machine Learning Research (TMLR) in August 2023, and a substantially revised arXiv v2 followed in October 2023.[2][3][9]
The framework was deliberately positioned as a "living benchmark": the authors released raw prompts, completions, and a modular Python toolkit so that researchers could add new scenarios, new metrics, or new models and re-run the evaluation themselves, with results aggregated into a public leaderboard at crfm.stanford.edu/helm.[11][4]
Design philosophy
HELM's central claim is that evaluation of language models should be holistic. The paper distills this into three principles.[2][3]
- Broad coverage with explicit recognition of gaps. Rather than pick a single benchmark, HELM organizes evaluation around a taxonomy of scenarios (use cases, domains, languages, demographic groups) and a taxonomy of metrics. The taxonomy makes it possible to enumerate not only what is evaluated but also what is missing.[2]
- Multi-metric measurement. For each scenario, HELM tries to measure seven categories of metrics in the same context, rather than relegating risks such as bias or toxicity to separate, accuracy-only studies.[2][3]
- Standardization. Every model is run on every scenario through a single Python codebase with uniform prompting (typically 5-shot in-context learning with fixed templates), so that comparisons are not confounded by prompting differences.[2][11]
This last point distinguishes HELM from the way many model cards report numbers. The HELM MMLU effort, for example, found that scores reported by model providers on MMLU often differed from HELM's standardized re-evaluations by as much as five percentage points, and that reported numbers were frequently higher than HELM's, suggesting advantageous prompting in vendor reports.[7]
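The standardization principle can be illustrated with a minimal sketch. It assumes a simple question/answer template and a seeded choice of in-context examples; the `Example` class, the `build_few_shot_prompt` function, and the template wording are hypothetical and are not taken from the HELM codebase.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Example:
    question: str
    answer: str


def build_few_shot_prompt(train, test_question, num_shots=5, seed=0):
    """Build one prompt; a fixed seed yields the same demonstrations for every model."""
    rng = random.Random(seed)  # local RNG keeps the selection reproducible
    shots = rng.sample(train, k=min(num_shots, len(train)))
    blocks = [f"Question: {ex.question}\nAnswer: {ex.answer}" for ex in shots]
    blocks.append(f"Question: {test_question}\nAnswer:")
    return "\n\n".join(blocks)


# Made-up training pool, used only for illustration.
train_pool = [Example(f"What is {i} + {i}?", str(2 * i)) for i in range(1, 9)]
prompt_for_model_a = build_few_shot_prompt(train_pool, "What is 9 + 9?")
prompt_for_model_b = build_few_shot_prompt(train_pool, "What is 9 + 9?")
```

Because the template and the seed are fixed, both prompts are identical, so any score difference between two models cannot be attributed to the prompt.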
The seven HELM metrics
HELM Classic defines seven categories of metrics, each applied wherever feasible to each core scenario.[2][3]
| Metric | What it measures |
|---|---|
| Accuracy | Standard task-specific quality (exact match, F1, ROUGE, etc.) on the scenario's reference answers.[2] |
| Calibration | How well a model's expressed confidence matches its empirical correctness; computed where token-level log-probabilities are available.[2] |
| Robustness | Performance under typo-style and equivalence-preserving perturbations of the input, and under invariance-style transformations meant to mimic real-world noise.[2] |
| Fairness | Performance shifts when demographic features (names, dialects) in the input are changed, including comparisons across African-American English dialects and counterfactual demographic swaps.[2] |
| Bias | Demographic representation in model outputs (e.g., gender or race associations in generation), measured independently of correctness.[2] |
| Toxicity | Rate of harmful or insulting generations, scored automatically using a toxicity classifier on free-form outputs.[2] |
| Efficiency | Wall-clock and idealized inference cost, allowing comparison of accuracy against compute or latency budgets.[2] |
Because not every metric is well-defined for every scenario (calibration, for example, requires API access to token probabilities), HELM Classic reports that it achieves coverage of roughly 87.5% across the metric-by-scenario grid.[2][11]
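To make the calibration row concrete, the sketch below computes an expected calibration error (ECE), one common summary of the gap between confidence and correctness. The equal-width binning, the default of 10 bins, and the function name are assumptions for illustration; this is not presented as HELM's exact implementation.

```python
def expected_calibration_error(confidences, correct, num_bins=10):
    """Equal-width-bin ECE: weighted mean of |accuracy - mean confidence| per bin."""
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    bins = [[] for _ in range(num_bins)]
    for conf, is_correct in zip(confidences, correct):
        if not 0.0 <= conf <= 1.0:
            raise ValueError("confidence must lie in [0, 1]")
        index = min(int(conf * num_bins), num_bins - 1)  # conf == 1.0 goes in the last bin
        bins[index].append((conf, is_correct))
    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue  # empty bins contribute nothing
        mean_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / total) * abs(accuracy - mean_conf)
    return ece


# Made-up data: the model claims 90% confidence but is right 3 times out of 4.
overconfident = expected_calibration_error([0.9, 0.9, 0.9, 0.9], [True, True, True, False])
```

A model whose stated confidence matches its hit rate in every bin scores zero; the overconfident example scores about 0.15, the gap between 0.9 confidence and 0.75 accuracy.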
The 16 core scenarios
HELM Classic defines 16 core scenarios that span six user-facing task families: question answering, information retrieval, summarization, sentiment analysis, toxicity detection, and miscellaneous text classification.[11] Concrete datasets used as core scenarios include MMLU, BoolQ, NarrativeQA, NaturalQuestions, QuAC, HellaSwag, OpenbookQA, TruthfulQA, MS MARCO, CNN/DailyMail, XSum, IMDB, CivilComments, and RAFT, among others.[12] On top of the core, the paper adds 26 additional "targeted" scenarios for skills such as reasoning, knowledge, language modeling, and disinformation generation, for a total of 42 scenarios in the original release.[2][12]
Each scenario is implemented as a Python class that produces standardized in-context-learning prompts and pairs them with reference outputs or scoring functions; the same scenario object is used to evaluate every model.[1][11]
Initial 2022 evaluation
The launch evaluation in the HELM paper covered 30 prominent language models from 12 organizations: AI21 Labs, Anthropic, BigScience, Cohere, EleutherAI, Google, Meta, Microsoft, NVIDIA, OpenAI, Tsinghua University, and Yandex.[3] Notable systems included OpenAI GPT-3 and InstructGPT variants, BLOOM, Anthropic-LM, and Meta's OPT.[3] In aggregate the team conducted more than 4,900 evaluations spanning roughly 12 billion tokens of inference and 17 million model API calls, raising scenario coverage of these models from an average 17.9% to 96.0%.[3][11] The paper distilled the results into 25 top-level findings, including the observation that no single model dominated across all metrics: even the strongest accuracy models had measurable bias and calibration deficits, and efficient smaller models were sometimes competitive on individual tasks.[11]
A second observation from the launch was the tradeoff between instruction-tuned and base models. Instruction-tuned variants, such as the InstructGPT family, dominated open-ended generation scenarios but did not necessarily improve on multiple-choice accuracy. The HELM team also reported sizeable gaps between open and closed models on accuracy, with proprietary systems such as OpenAI's text-davinci-002 clearly outperforming the strongest open systems available at the time, while open systems were sometimes competitive on individual scenarios such as sentiment analysis and short-form question answering.[11] HELM's structured presentation of these tradeoffs was widely cited in subsequent work and quickly adopted as a reference design for downstream evaluations.[11][3]
HELM Classic vs HELM Lite
By late 2023, HELM Classic had grown unwieldy: full evaluation of a new model required running all 42 scenarios with three random seeds and many perturbations, which was expensive both in API spend and in compute. On 19 December 2023, CRFM published HELM Lite v1.0.0, a deliberately stripped-down version focused on capabilities rather than the full multi-metric matrix.[5]
HELM Lite simplifies HELM Classic in four ways:[5]
- Uses one random seed instead of three over choices of in-context examples.
- Drops perturbation-based robustness and fairness measurements, on the grounds that they were strongly correlated with raw accuracy in HELM Classic.
- Removes the calibration metric, since several major LLM APIs (notably Anthropic and Google) had stopped exposing token log-probabilities, making it inapplicable.
- Drops perplexity and the information-retrieval scenarios, citing computational expense and decreasing relevance.
The HELM Lite scenario set contains nine benchmarks emphasizing generation rather than multiple choice: NarrativeQA, NaturalQuestions, OpenbookQA, a five-subject subset of MMLU, the MATH and GSM8K math benchmarks, a five-task subset of LegalBench, MedQA, and WMT-14 machine translation across five language pairs.[5] The launch evaluation ranked 28 model variants from 11 organizations by mean win rate, with GPT-4 on top overall; smaller models such as Writer's Palmyra-X and 01.AI's Yi-34B were noted as unexpectedly strong, and on the NarrativeQA scenario Yi-34B outperformed GPT-4.[5] Safety evaluations, the CRFM team noted, were deliberately handled outside HELM Lite via a separate partnership with MLCommons' AI safety working group.[5]
HELM Instruct
On 18 February 2024 CRFM published HELM Instruct, an instruction-following evaluation framework with absolute (rather than pairwise) ratings, authored by Yian Zhang, Yifan Mai, Josselin Somerville Roberts, Rishi Bommasani, Yann Dubois, and Percy Liang.[13] HELM Instruct argues that existing instruction-following evaluations were either reference-based (which assumes one correct answer) or relative (which ranks models against each other without measuring distance from perfection). It proposes three principles for instruction-following evaluation: it should be open-ended (admitting many valid outputs), multidimensional (graded on multiple axes), and absolute (using a 1 to 5 scale).[13]
The framework combines:[13]
- 7 scenarios drawn from prompt collections including HH-RLHF, Koala Eval, Vicuna Eval, OASST1, Self-Instruct, and curated ChatGPT prompts, capped at 100 examples each.
- 4 candidate models in the launch: GPT-4 (0314), GPT-3.5 Turbo (0613), Anthropic Claude v1.3, and Cohere-Command-XLarge-Beta.
- 5 criteria per response: helpfulness, understandability, completeness, conciseness, and harmlessness.
- 4 evaluators: 16 vetted Amazon Mechanical Turk workers, the Scale AI rating platform, GPT-4 (0314), and Anthropic Claude v1.3 acting as LM judges.
Reported findings included that GPT-4 was the best candidate overall, that Claude excelled specifically on understandability and harmlessness, and that LM judges showed Pearson correlations of 0.48 to 0.72 with human raters depending on the criterion, with GPT-4 a closer match to humans than Claude.[13]
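The agreement figures above are Pearson correlations between judge and human ratings on the 1 to 5 scale. A minimal sketch of that computation follows; the ratings are made-up values for illustration, not HELM Instruct data.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length rating lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length lists with at least two ratings")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant ratings")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical 1-5 ratings for six responses on a single criterion.
human_ratings = [5, 4, 2, 3, 1, 4]
judge_ratings = [4, 4, 3, 3, 2, 5]
agreement = pearson(human_ratings, judge_ratings)
```

Computing the coefficient separately per criterion, as the text describes, shows where an LM judge tracks human raters closely and where it does not.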
HELM MMLU
CRFM published HELM MMLU on 1 May 2024 as a leaderboard dedicated to the Massive Multitask Language Understanding (MMLU) test.[7] The motivation was that despite MMLU's prominence, scores reported by different vendors used inconsistent prompts, formats, and answer-extraction heuristics, making cross-model comparison unreliable.[7]
HELM MMLU re-evaluates models on all 57 MMLU subjects using a single "Multiple Choice Joint" adaptation method that instructs the model to output a single letter (A, B, C, or D). Two model-specific concessions are documented: Claude 2 is queried through Anthropic's Human/Assistant format because the API rejects other prompt shapes, and Claude 3 is given the explicit instruction "Answer with only a single letter" to suppress chain-of-thought style responses.[7] The launch evaluated 26 models, including Claude Instant through Claude 3 Opus, Gemini 1.0 Pro, GPT-4, Llama 2 and Llama 3 variants, Mistral and Mixtral, Gemma, PaLM 2, and Qwen.[7] Headline finding: HELM's standardized MMLU scores diverged from vendor-reported scores by up to five percentage points, and almost always in the direction of lower HELM scores, consistent with vendors having used more favorable prompting at evaluation time.[7]
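A rough sketch of a joint multiple-choice adaptation is shown below: all options appear in one prompt and the model is asked for a letter. The template layout, the function names, and the regular-expression extraction rule are hypothetical; real answer-extraction heuristics differ between evaluators, which is one source of the discrepancies described above.

```python
import re

LETTERS = "ABCD"


def format_joint_prompt(question, options, instruction="Answer with only a single letter."):
    """Render the question and all four options in one prompt, ending with 'Answer:'."""
    if len(options) != len(LETTERS):
        raise ValueError("expected exactly four options")
    lines = [instruction, "", f"Question: {question}"]
    lines += [f"{letter}. {option}" for letter, option in zip(LETTERS, options)]
    lines.append("Answer:")
    return "\n".join(lines)


def extract_letter(completion):
    """Return the first standalone capital A-D in the completion, or None."""
    match = re.search(r"\b([ABCD])\b", completion.strip())
    return match.group(1) if match else None


prompt = format_joint_prompt(
    "What is the capital of France?",
    ["Berlin", "Madrid", "Paris", "Rome"],
)
```

This extraction rule is deliberately naive: a completion that begins with the article "A" would be read as answer A, which shows why a single documented rule applied to every model matters for comparability.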
HELM Safety
On 8 November 2024, CRFM released HELM Safety v1.0, a standardized safety leaderboard built on the HELM framework.[6] The launch argued that, although capability benchmarks had converged on a small canonical set, safety evaluation was fragmented: of 102 published safety benchmarks reviewed, only 12 had been used to evaluate any state-of-the-art model, and external evaluations rarely disclosed prompts and outputs.[6]
HELM Safety v1.0 packages five existing safety benchmarks under one harness and runs them on 24 prominent LLMs from Anthropic, OpenAI, Google, Meta, Alibaba, Cohere, Databricks, DeepSeek, and Mistral.[6] The five constituent benchmarks span six risk categories (violence, fraud, discrimination, sexual content, harassment, deception):[6]
| Benchmark | Coverage |
|---|---|
| BBQ | 58,492 bias-benchmark multiple-choice questions on social discrimination.[6] |
| SimpleSafetyTests | 100 unsafe prompts covering sexual content and violence.[6] |
| HarmBench | 321 red-team prompts on deception, fraud, violence and harassment, scored with automated graders.[6] |
| AnthropicRedTeam | 38,961 red-team attack transcripts across multiple harm categories.[6] |
| XSTest | 450 prompts designed to surface the helpfulness vs harmlessness tradeoff.[6] |
Claude 3.5 Sonnet (June 2024 release) ranked first overall, with particular strength on HarmBench.[6] The launch also documented a methodological problem: when using LLMs as safety graders, Claude 3.5 Sonnet refused to grade harmful outputs at rates approaching 27%, versus near-zero refusal from GPT-4o; the CRFM team interpreted this as miscalibrated refusal behavior that undermines LM-as-judge safety evaluation.[6] Models also lost roughly 26% on average to adversarial red-teaming methods, with some models degrading by 55% under attack.[6] HELM Safety has since absorbed additional benchmarks, including AIR-Bench 2024 (Stanford CRFM, 2024), which derives 314 fine-grained risk categories from 8 government regulations and 16 company policies and contributes 5,694 prompts evaluated through HELM.[14]
HELM Capabilities
On 20 March 2025, CRFM released HELM Capabilities v1.0.0, the current general-capability successor to HELM Lite.[15] HELM Capabilities is organized around five competencies, each instantiated by a single dataset:
- General knowledge: MMLU-Pro, 1,000 instances.[15]
- Reasoning: GPQA, 448 instances of graduate-level science problems.[15]
- Instruction following: IFEval, 541 instances with verifiable instruction constraints.[15]
- Dialogue: WildBench, 1,000 in-the-wild chat instances.[15]
- Mathematical reasoning: Omni-MATH, 1,000 Olympiad-level problems.[15]
The launch evaluated 22 models from OpenAI, Anthropic, Google, Meta, Mistral, Qwen, DeepSeek, and Amazon, among others. HELM Capabilities differs from HELM Lite chiefly in aggregation: it uses mean per-scenario score rather than mean win rate, a choice CRFM made to reduce sensitivity of model rankings to the composition of the model set being compared.[15]
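The difference between the two aggregation schemes can be sketched with toy numbers. The sketch assumes that mean win rate is, per scenario, the fraction of other models a model strictly beats, averaged over scenarios; the model names and scores are made up.

```python
def mean_score(scores):
    """scores: {model: {scenario: score}} -> {model: mean score across scenarios}."""
    return {m: sum(s.values()) / len(s) for m, s in scores.items()}


def mean_win_rate(scores):
    """Per scenario, fraction of other models strictly beaten; then averaged."""
    models = list(scores)
    scenarios = list(next(iter(scores.values())))
    result = {}
    for model in models:
        others = [o for o in models if o != model]
        rates = []
        for scenario in scenarios:
            wins = sum(scores[model][scenario] > scores[o][scenario] for o in others)
            rates.append(wins / len(others))
        result[model] = sum(rates) / len(rates)
    return result


two_models = {
    "model-a": {"s1": 0.90, "s2": 0.50},
    "model-b": {"s1": 0.60, "s2": 0.55},
}
# Adding a third model changes nothing about model-a's or model-b's own scores.
three_models = dict(two_models, **{"model-c": {"s1": 0.10, "s2": 0.52}})

win_rate_before = mean_win_rate(two_models)
win_rate_after = mean_win_rate(three_models)
score_after = mean_score(three_models)
```

With two models, model-a and model-b tie on mean win rate. Adding model-c lifts model-b above model-a on win rate, while their mean scores are unaffected, which is the kind of sensitivity to the comparison set that the change in aggregation is meant to reduce.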
Vision, image and domain extensions
The HELM framework has been extended to non-text modalities and to specialized domains, all under the same repository (stanford-crfm/helm).[1]
- HEIM (Holistic Evaluation of Text-to-Image Models) was published at NeurIPS 2023 (Datasets and Benchmarks) and evaluates 26 text-to-image systems on 62 scenarios across 12 aspects, including text-image alignment, image quality, aesthetics, originality, reasoning, knowledge, bias, toxicity, fairness, robustness, multilinguality, and efficiency. HEIM reports that no single model dominates across aspects.[16]
- VHELM (Holistic Evaluation of Vision-Language Models) appeared at NeurIPS 2024 and extends the HELM design to VLMs. It aggregates 21 datasets covering nine aspects: visual perception, knowledge, reasoning, bias, fairness, multilinguality, robustness, toxicity, and safety. The launch evaluated 22 VLMs and showed, among other things, that efficiency-focused versions (such as Claude 3 Haiku or Gemini 1.5 Flash) tend to underperform their full counterparts specifically on bias benchmarks.[17]
- MedHELM was published in Nature Medicine in 2025 and applies the HELM framework to medical tasks. It uses a clinician-validated taxonomy that organizes medical AI into five clinical categories (clinical decision support, clinical note generation, patient communication, medical research, and administration) covering 22 subcategories and 121 tasks, with 37 benchmark evaluations. The launch reported that Claude 3.5 Sonnet attained performance comparable to top frontier models at lower estimated cost.[8]
- Additional verticals listed on the HELM leaderboard include an Audio HELM for audio language models, an Enterprise Benchmarks suite (including IBM's HELM-Enterprise extension for finance, legal, climate, and cybersecurity), and ToRR (Table Reasoning and Robustness).[1][18]
Software, releases, and infrastructure
HELM is distributed as the crfm-helm Python package on PyPI under the Apache 2.0 license, with command-line tools helm-run, helm-summarize, and helm-server for executing benchmarks, summarizing results, and serving a local web UI.[1] The GitHub repository stanford-crfm/helm shows a long history of release notes; as of the latest visible release, v0.5.16 was tagged on 30 April 2025, with monthly minor releases preceding it through 2024 and 2025.[10] CRFM has indicated that HELM will enter maintenance mode on 1 June 2026.[1]
Architecturally, the framework separates scenarios, adapters, models, and metrics into independently extensible Python interfaces. A scenario produces a list of instances (input, reference output, split tag); an adapter rewrites those instances into model-specific prompts (typically multiple-choice joint format or generation format with in-context examples); the model interface wraps a unified completion or chat API; and a metric consumes both the prompt and the completion to produce a numeric score. The same helm-run command that drives HELM Classic also drives the Lite, Capabilities, Safety, Instruct, and MMLU tracks, by swapping the scenario, adapter, and metric set.[1] CRFM provides connectors for the Anthropic API, OpenAI API, Google Gemini, Cohere, Hugging Face Hub model endpoints, and a number of local-model runtimes, allowing a single configuration to evaluate both closed-API and open-weight systems under the same prompting protocol.[1]
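The separation of concerns described above can be sketched in a few lines. Every class and function name here (`Instance`, `ToyScenario`, `GenerationAdapter`, `exact_match`, `run`) is hypothetical and simplified; none of it is the crfm-helm API, and the "model" is a stub callable.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Instance:
    input: str
    reference: str
    split: str  # "train" instances serve as in-context examples


class ToyScenario:
    """Hypothetical scenario: yields instances and knows nothing about models."""

    def get_instances(self) -> List[Instance]:
        return [
            Instance("2 + 2 =", "4", "train"),
            Instance("3 + 5 =", "8", "test"),
            Instance("7 + 2 =", "9", "test"),
        ]


class GenerationAdapter:
    """Hypothetical adapter: turns an instance into a prompt with demonstrations."""

    def __init__(self, train: List[Instance]) -> None:
        self.train = train

    def adapt(self, instance: Instance) -> str:
        demos = [f"{t.input} {t.reference}" for t in self.train]
        return "\n".join(demos + [instance.input])


def exact_match(completion: str, reference: str) -> float:
    """Simplified metric: 1.0 if the stripped completion equals the reference."""
    return 1.0 if completion.strip() == reference else 0.0


def run(scenario: ToyScenario, model: Callable[[str], str]) -> float:
    """Scenario -> adapter -> model -> metric, averaged over test instances."""
    instances = scenario.get_instances()
    adapter = GenerationAdapter([i for i in instances if i.split == "train"])
    test = [i for i in instances if i.split == "test"]
    scores = [exact_match(model(adapter.adapt(i)), i.reference) for i in test]
    return sum(scores) / len(scores)


def always_eight(prompt: str) -> str:
    """Stub model that ignores the prompt."""
    return " 8"


accuracy = run(ToyScenario(), always_eight)
```

Swapping the scenario, the adapter, or the metric leaves the other pieces untouched, which is the property that lets one runner drive several leaderboard tracks.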
The HELM live leaderboard at crfm.stanford.edu/helm hosts multiple separately versioned tracks (Classic, Lite, Capabilities, Safety, Instruct, MMLU, HEIM, VHELM, MedHELM, Audio, Enterprise) and exposes per-prompt and per-completion records so that raw model outputs can be inspected directly.[4] Each evaluation run is archived under a versioned URL (for example helm/safety/v1.8.0) so that earlier rankings remain inspectable even after the leaderboard moves on.[6]
Comparison with other LLM leaderboards
HELM coexists with several other public LLM evaluation systems, each making different methodological choices.
| Leaderboard | Operator | Scoring approach | Notes |
|---|---|---|---|
| HELM (Classic, Lite, Capabilities, Safety) | Stanford CRFM | Multi-metric (accuracy + 6 others in Classic) over a fixed scenario set; uniform 5-shot prompting; raw outputs published.[2][5][6][15] | Emphasizes transparency, holistic metrics, and reproducible prompting.[3] |
| BIG-Bench | Google / community (444 authors, 132 institutions) | Aggregates 204 tasks contributed by the community; primary metric is accuracy with task-specific variants. Published 2022 as "Beyond the Imitation Game".[19] | Strength is task diversity; weakness is uneven task quality and per-task accuracy focus.[19] |
| Open LLM Leaderboard (HuggingFace v1, 2023-2024) | HuggingFace | Used six accuracy benchmarks (MMLU, HellaSwag, ARC, TruthfulQA, Winogrande, GSM8K) run through the EleutherAI LM Evaluation Harness; >7,000 open models evaluated.[20] | Retired in June 2024 after extensive saturation and contamination concerns on its constituent datasets.[20] |
| Open LLM Leaderboard v2 (2024 onwards) | HuggingFace | Six newer benchmarks: IFEval, BBH, MATH, GPQA, MuSR, MMLU-Pro. Single-metric accuracy-style aggregation.[20] | Designed to harden v1 against contamination; remains accuracy-focused.[20] |
| Chatbot Arena | LMSYS / LMArena | Crowd-sourced pairwise human preferences; Elo-style ratings.[21] | Measures perceived conversation quality, complementary to HELM's structured scenario approach.[21] |
The HELM team has consistently argued that its differentiator is the combination of standardized prompting plus multi-metric measurement, rather than the choice of benchmarks alone: HELM Classic, for example, runs many of the same datasets used by the EleutherAI LM Evaluation Harness, but pairs them with calibration, robustness, fairness, bias, toxicity, and efficiency scores.[2][3][20]
Significance
Within two years of its release, HELM had become one of the most cited reference frameworks in LLM evaluation literature, and several of its sub-projects produced widely quoted findings: HELM MMLU's documentation of vendor over-reporting on MMLU, HELM Safety's quantification of safety degradation under adversarial prompts, and VHELM's analysis of bias regressions in lightweight VLM variants.[6][7][17] Its leaderboard pages provide one of the few public, prompt-level audit trails for closed-API models, which has been useful for academic analyses of contamination, prompt sensitivity, and reasoning behaviors.[4][11] HELM has also served as a template for domain-specific frameworks beyond CRFM: IBM's helm-enterprise-benchmark reuses HELM's infrastructure to evaluate LLMs on enterprise domain datasets in finance, legal, climate, and cybersecurity.[18]
Limitations and criticisms
HELM's authors and external commentators have identified several limitations.
- Coverage of languages and modalities. HELM Classic's scenarios are predominantly English-language, although several English varieties (including African-American English) are used in the fairness perturbations.[2][11] HEIM, VHELM, MedHELM, and Audio HELM extend coverage to other modalities but remain largely English at launch.[16][17][8]
- Cost. The HELM Lite announcement explicitly acknowledged that running HELM Classic on a new closed-API model was expensive and slow, motivating the Lite redesign.[5]
- Sensitivity to prompting. Even with HELM's uniform prompting protocol, model rankings on MMLU in particular are sensitive to prompt format, in-context example choice, and answer-extraction heuristics. HELM MMLU documents up to five-percentage-point discrepancies between its standardized re-evaluations and vendor-reported numbers.[7] Independent research has also shown that rankings on multiple-choice leaderboards can shift by several positions under minor perturbations, such as reordering the answer choices or changing the answer-selection method.[22]
- Saturation and contamination. Several of HELM's original scenarios (notably MMLU, HellaSwag and TruthfulQA) have been shown to be partially memorized by frontier models trained on large web crawls, a concern that motivated the HuggingFace Open LLM Leaderboard v2 transition in 2024.[20][22] HELM Capabilities responded by switching toward harder, more recent benchmarks (MMLU-Pro, GPQA, Omni-MATH).[15]
- LM-as-judge reliability. HELM Instruct and HELM Safety both rely partially on LLM judges, but HELM Safety itself documented that Claude 3.5 Sonnet refused to grade roughly a quarter of harmful outputs, illustrating that even the framework's own internal use of LM judges has open methodological problems.[6][13]
- Maintenance mode. CRFM has indicated that the HELM framework will enter maintenance mode on 1 June 2026, raising open questions about long-term updates and adoption.[1]
See also
- MMLU is the single most prominent constituent benchmark inside HELM and the subject of its own HELM track.[7]
- MMLU-Pro, GPQA, IFEval, WildBench, and Omni-MATH are the five core capability benchmarks in HELM Capabilities v1.0.0.[15]
- BIG-Bench (with BIG-Bench Hard as a curated subset) is a contemporaneous community benchmark with which HELM is frequently compared.[19]
- BBQ, HarmBench, and the WMDP benchmark are safety-oriented benchmarks; the first two are integrated directly into HELM Safety.[6]
- Chatbot Arena is the leading preference-based LLM leaderboard, complementing HELM's reference-based evaluation.[21]
- MTEB and GLUE are major non-HELM benchmark suites for embeddings and language understanding respectively.
References
1. Stanford CRFM, "stanford-crfm/helm: Holistic Evaluation of Language Models", GitHub repository (Apache-2.0 license), 2022-2025. https://github.com/stanford-crfm/helm. Accessed 2026-05-21.
2. Percy Liang, Rishi Bommasani, Tony Lee, Dimitris Tsipras, Dilara Soylu, Michihiro Yasunaga, Yian Zhang, Deepak Narayanan, Yuhuai Wu, Ananya Kumar et al., "Holistic Evaluation of Language Models", arXiv:2211.09110, 2022-11-16 (v1) and 2023-10-01 (v2). https://arxiv.org/abs/2211.09110. Accessed 2026-05-21.
3. Stanford CRFM, "Holistic Evaluation of Language Models (HELM)", CRFM blog, 2022-11-17. https://crfm.stanford.edu/2022/11/17/helm.html. Accessed 2026-05-21.
4. Stanford CRFM, "Holistic Evaluation of Language Models (HELM) leaderboard", crfm.stanford.edu, 2022-2025. https://crfm.stanford.edu/helm/. Accessed 2026-05-21.
5. Stanford CRFM, "HELM Lite: Lightweight and Broad Capabilities Evaluation", CRFM blog, 2023-12-19. https://crfm.stanford.edu/2023/12/19/helm-lite.html. Accessed 2026-05-21.
6. Stanford CRFM, "HELM Safety v1.0", CRFM blog, 2024-11-08. https://crfm.stanford.edu/2024/11/08/helm-safety.html. Accessed 2026-05-21.
7. Stanford CRFM, "HELM MMLU: Massive Multitask Language Understanding", CRFM blog, 2024-05-01. https://crfm.stanford.edu/2024/05/01/helm-mmlu.html. Accessed 2026-05-21.
8. Stanford CRFM and Stanford Medicine, "Holistic evaluation of large language models for medical tasks with MedHELM", Nature Medicine, 2025. https://www.nature.com/articles/s41591-025-04151-2. Accessed 2026-05-21.
9. Percy Liang et al., "Holistic Evaluation of Language Models", Transactions on Machine Learning Research (TMLR), published 2023-08. https://jmlr.org/tmlr/. Accessed 2026-05-21.
10. Stanford CRFM, "Releases - stanford-crfm/helm", GitHub releases page, last release v0.5.16 dated 2025-04-30. https://github.com/stanford-crfm/helm/releases. Accessed 2026-05-21.
11. Stanford CRFM, "HELM Classic methodology overview (paper abstract and findings)", arXiv:2211.09110 abstract page, 2022-11-16. https://arxiv.org/abs/2211.09110. Accessed 2026-05-21.
12. Stanford CRFM, "Scenarios", CRFM HELM Read-the-Docs documentation, 2024. https://crfm-helm.readthedocs.io/en/latest/scenarios/. Accessed 2026-05-21.
13. Stanford CRFM (Yian Zhang, Yifan Mai, Josselin Somerville Roberts, Rishi Bommasani, Yann Dubois, Percy Liang), "HELM Instruct: A Multidimensional Instruction Following Evaluation Framework with Absolute Ratings", CRFM blog, 2024-02-18. https://crfm.stanford.edu/2024/02/18/helm-instruct.html. Accessed 2026-05-21.
14. Yi Zeng et al., "AIR-Bench 2024: A Safety Benchmark Based on Risk Categories from Regulations and Policies", arXiv:2407.17436, 2024. https://arxiv.org/abs/2407.17436. Accessed 2026-05-21.
15. Stanford CRFM, "HELM Capabilities v1.0.0", CRFM blog, 2025-03-20. https://crfm.stanford.edu/2025/03/20/helm-capabilities.html. Accessed 2026-05-21.
16. Tony Lee et al., "Holistic Evaluation of Text-to-Image Models", arXiv:2311.04287; NeurIPS 2023 Datasets and Benchmarks Track, 2023-11-07. https://arxiv.org/abs/2311.04287. Accessed 2026-05-21.
17. Tony Lee et al., "VHELM: A Holistic Evaluation of Vision Language Models", arXiv:2410.07112; NeurIPS 2024 Datasets and Benchmarks Track, 2024. https://arxiv.org/abs/2410.07112. Accessed 2026-05-21.
18. IBM Research, "IBM/helm-enterprise-benchmark", GitHub repository, 2024-2025. https://github.com/IBM/helm-enterprise-benchmark. Accessed 2026-05-21.
19. BIG-bench Collaboration (Aarohi Srivastava et al.), "Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models", arXiv:2206.04615, 2022. https://arxiv.org/abs/2206.04615. Accessed 2026-05-21.
20. HuggingFace, "Open LLM Leaderboard v1 (archive) and v2 documentation", huggingface.co, 2023-2024. https://huggingface.co/docs/leaderboards/en/open_llm_leaderboard/archive. Accessed 2026-05-21.
21. LMSYS / LMArena, "Chatbot Arena: Benchmarking LLMs in the Wild with Elo Ratings", lmarena.ai blog and HuggingFace Space, 2023-2025. https://huggingface.co/spaces/lmarena-ai/chatbot-arena-leaderboard. Accessed 2026-05-21.
22. Norah Alzahrani et al., "When Benchmarks are Targets: Revealing the Sensitivity of Large Language Model Leaderboards", arXiv:2402.01781, 2024. https://arxiv.org/abs/2402.01781. Accessed 2026-05-21.