HELMET

Updated Jun 2, 2026

HELMET (How to Evaluate Long-context Models Effectively and Thoroughly) is a benchmark for evaluating long-context language models introduced by researchers at Princeton University and Intel Labs in 2024. It assembles seven application-centered task categories, ranging from retrieval-augmented generation to long-document question answering, summarization, and many-shot in-context learning, and evaluates models at controllable input lengths of up to 128,000 tokens. HELMET was designed to replace the ad hoc mix of synthetic probes and short-context datasets that developers had been using to judge long-context ability, and it was published at the Thirteenth International Conference on Learning Representations (ICLR 2025).^[1]^[2]

Overview

As model context windows expanded from a few thousand tokens to hundreds of thousands, the community lacked a consistent way to measure whether the extra length translated into useful capability. Developers frequently relied on the Needle in a Haystack (NIAH) test or an arbitrary subset of older datasets, neither of which gave a reliable picture of real downstream performance. HELMET addresses this by combining diverse, realistic applications with consistent length control and reliable metrics in a single suite.^[1]^[3]

The benchmark was created by Howard Yen, Tianyu Gao, Minmin Hou, Ke Ding, Daniel Fleischer, Peter Izsak, Moshe Wasserblat, and Danqi Chen, a collaboration between the Princeton Language and Intelligence group and Intel Labs. Its code, data, and a public leaderboard are released through the Princeton NLP GitHub organization.^[1]^[2] Using HELMET, the authors carried out what they describe as the most thorough and controlled comparison of long-context models on diverse applications to date, covering 59 large language models spanning open-weight and proprietary systems.^[3]^[4]

Motivation and problems with prior evaluations

The HELMET authors argue that existing long-context evaluations suffered from four recurring weaknesses, which together produced noisy and sometimes misleading signals.^[1]^[3]

Problem	Description
Insufficient coverage of downstream tasks	Many suites focused on a narrow domain or a single task type, so they failed to capture the breadth of real long-context applications.
Inadequate lengths	Older question-answering datasets such as QASPER and QuALITY are often capped below 32K tokens, too short to stress frontier models with 128K or larger windows.
Unreliable metrics	N-gram matching scores such as ROUGE correlate poorly with human judgment and often fail to separate strong models from weak ones.
Incompatibility with base models	Several benchmarks assume instruction-tuned models, leaving them unusable for evaluating base models during pretraining and early development.

A central empirical claim that motivates the design is that synthetic tasks like NIAH are not good predictors of downstream performance. Most modern models achieve near-perfect NIAH scores even when they struggle badly on tasks that require reasoning over the full context, so a high synthetic score can mask substantial weaknesses.^[3]^[4]

Application categories

HELMET groups its tasks into seven categories, each drawn from a recognizable application of long context. The categories use a mix of automatic and model-based metrics, and several reuse or extend datasets from prior work such as RULER, ALCE, and InfiniteBench.^[3]^[5]

Category	Representative datasets	Metric
Synthetic recall	JSON KV, RULER MK Needle, RULER MK UUID, RULER MV	Substring exact match (SubEM)
Retrieval-augmented generation (RAG)	Natural Questions, TriviaQA, PopQA, HotpotQA	SubEM
Passage re-ranking	MS MARCO	NDCG@10
Generation with citations	ALCE ASQA, ALCE QAMPARI	Recall, citation quality
Long-document QA	NarrativeQA, InfiniteBench QA, InfiniteBench MC	Model-based, ROUGE F1, accuracy
Summarization	InfiniteBench Sum, Multi-LexSum	Model-based
Many-shot in-context learning	TREC Coarse, TREC Fine, NLU, BANKING77, CLINC150	Accuracy

Synthetic recall keeps a controlled probe of pure retrieval ability, while the remaining six categories target practical uses: grounding answers in retrieved passages, ordering candidate documents, producing answers with verifiable citations, answering questions about a single long document, condensing long inputs, and learning a classification task from many in-context learning demonstrations.^[3]^[5]
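The substring exact match metric used for the recall and RAG categories can be illustrated with a minimal sketch. The normalization below (lowercasing, stripping punctuation and English articles) is an assumption in the style of common QA scoring scripts, not necessarily the exact normalization HELMET applies, and the function names are hypothetical.

```python
import re
import string


def normalize(text: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace.

    Assumed normalization for illustration; a given benchmark may differ.
    """
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def substring_exact_match(prediction: str, gold_answers: list[str]) -> bool:
    """True if any non-empty normalized gold answer occurs in the prediction."""
    pred = normalize(prediction)
    for gold in gold_answers:
        target = normalize(gold)
        # Skip answers that normalize to nothing, which would match trivially.
        if target and target in pred:
            return True
    return False
```

Unlike strict exact match, this check credits a verbose answer as long as the gold string appears somewhere in it, which is why it suits models that are not tuned to produce short answers.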

Controllable lengths and methodology

A defining feature of HELMET is that every dataset is built to run at a fixed set of input lengths rather than at whatever length a source corpus happens to provide. In the published experiments the authors evaluate at L in {8K, 16K, 32K, 64K, 128K} tokens, measured with the Llama-2 tokenizer, and the design extends naturally to longer contexts as models grow.^[3]^[5]

Length is controlled differently depending on the task, so that the change in input size is the only variable while the underlying question stays fixed:^[3]

For RAG, citation, and re-ranking, the suite varies the number of retrieved or candidate passages packed into the prompt.
For many-shot ICL, it varies the number of demonstrations.
For long-document QA and summarization, it varies or truncates the length of the input document.

This construction lets HELMET report a performance curve across lengths for each model and category, exposing where and how a model degrades as the context grows. The authors also refine the prompts and few-shot demonstrations across tasks so that base models, not just instruction-tuned ones, can be evaluated robustly through few-shot prompting.^[1]^[3]
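The passage-based length control described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the benchmark's implementation: `count_tokens` is a hypothetical whitespace-based stand-in for a real tokenizer, the prompt template is invented, and the gold passage is always placed first for simplicity.

```python
def count_tokens(text: str) -> int:
    """Hypothetical stand-in tokenizer: counts whitespace-separated words."""
    return len(text.split())


def build_prompt(passages: list[str], question: str) -> str:
    """Invented prompt template: passages first, then the fixed question."""
    context = "\n\n".join(passages)
    return f"{context}\n\nQuestion: {question}\nAnswer:"


def pack_passages(question: str, gold_passage: str,
                  distractors: list[str], budget: int) -> list[str]:
    """Keep the question and gold passage fixed; add distractors up to budget.

    The gold passage is always kept, even if it alone exceeds the budget.
    """
    kept = [gold_passage]
    used = count_tokens(build_prompt(kept, question))
    for passage in distractors:
        cost = count_tokens(passage)
        if used + cost > budget:
            break  # the next passage would exceed the target length
        kept.append(passage)
        used += cost
    return kept
```

Calling `pack_passages` with the same question at several budgets (for example 8K, 16K, and so on) yields prompts that differ only in how many distractor passages surround the gold passage, which is the property that makes a per-length performance curve interpretable. With a real subword tokenizer, token counts are not exactly additive across concatenated strings, so a production version would re-count the assembled prompt.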

Model-based evaluation

To overcome the unreliability of n-gram metrics, HELMET introduces model-based evaluation for the open-ended categories. For long-document QA and summarization, the authors prompt GPT-4o as a judge to score model outputs against reference answers, a scheme they report shows better distinguishability between models and across input lengths than ROUGE.^[1]^[3]^[5] Closed-form categories continue to use deterministic metrics: substring exact match for recall and RAG, NDCG@10 for re-ranking, and accuracy for in-context learning, while the citation task combines answer recall with a citation-quality score adapted from the ALCE framework.^[3]^[5]
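For the re-ranking category, NDCG@10 compares the order a model produces against the best possible order of the same candidates. The sketch below uses the linear-gain form of DCG and assumes the input list contains a relevance label for every candidate the model ranked; both are assumptions for illustration, and the function names are hypothetical.

```python
import math


def dcg_at_k(relevances: list[float], k: int) -> float:
    """Discounted cumulative gain over the top-k items (linear gain)."""
    return sum(rel / math.log2(rank + 2)
               for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(ranked_relevances: list[float], k: int = 10) -> float:
    """NDCG@k for relevance labels listed in the model's ranked order.

    The ideal ordering is derived from the same labels sorted descending.
    Returns 0.0 when no candidate is relevant.
    """
    ideal_dcg = dcg_at_k(sorted(ranked_relevances, reverse=True), k)
    if ideal_dcg == 0:
        return 0.0
    return dcg_at_k(ranked_relevances, k) / ideal_dcg
```

A perfectly ordered list scores 1.0, and the logarithmic discount means that misplacing a relevant passage near the top costs more than the same mistake further down the list.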

Key findings

The large-scale study of 59 long-context models surfaced several findings that the authors present as guidance for both model developers and evaluators.^[3]^[4]

Distinct category behavior. The seven categories correlate weakly with one another, so no single task captures long-context ability; in particular, many-shot ICL has among the lowest correlation with the other categories, indicating it draws on somewhat different capabilities.^[3]^[4]
Synthetic tasks mislead. Near-perfect NIAH and synthetic recall scores do not guarantee strong downstream performance, reinforcing the case against relying on NIAH alone.^[3]^[4]
Open versus closed gap. Open-weight models lag noticeably behind the strongest proprietary models on tasks that require full-context reasoning, such as generation with citations, and the gap tends to widen on the harder categories.^[3]^[4]
Length-dependent degradation. Performance drops with length are category-specific, and even leading models such as GPT-4o and Gemini show large declines on demanding tasks like passage re-ranking at long inputs.^[3]^[4]
No clear winner. No single model dominates every category, which is the central argument for multi-axis evaluation.^[3]^[4]

As a practical recommendation, the authors suggest that RAG tasks are a good lightweight proxy for fast model development, because they are inexpensive to run yet predict broader downstream performance better than synthetic probes.^[1]^[3]
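Claims about how well one category predicts another rest on comparing how two tasks rank the same set of models. One common way to quantify this is a rank correlation such as Spearman's; the dependency-free sketch below is an illustration of that idea, and the helper names are hypothetical. In practice a library routine such as `scipy.stats.spearmanr` would normally be used.

```python
import math


def ranks(values: list[float]) -> list[float]:
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j across a run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(xs: list[float], ys: list[float]) -> float:
    """Spearman rank correlation: Pearson correlation of the two rank lists."""
    rx, ry = ranks(xs), ranks(ys)
    n = len(rx)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        return math.nan  # undefined when one list is constant
    return cov / math.sqrt(var_x * var_y)
```

Here `xs` and `ys` would hold the scores of the same models, in the same order, on two different categories. A value near 1 means the two categories rank models almost identically, while a value near 0 means that doing well on one says little about the other.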

Significance and adoption

HELMET has been adopted as a reference benchmark for reporting long-context performance, and its leaderboard tracks frontier systems from major developers. It is frequently cited alongside synthetic stress tests such as RULER and recall benchmarks such as MRCR, with HELMET occupying the application-oriented end of the spectrum.^[2]^[5] The Princeton group later paired it with LongProc, a companion benchmark aimed at long-form generation and procedural tasks, framing the two together as a way to evaluate both long input and long output.^[6]

By standardizing lengths, broadening task coverage, and substituting model-based judging for brittle n-gram metrics, HELMET made cross-model long-context comparisons more reproducible and shifted attention from synthetic retrieval toward realistic applications. Its release with code, data, and an open leaderboard has supported reuse across subsequent model reports and follow-up research.^[1]^[2]

Limitations

HELMET inherits some constraints from its design choices. The model-based evaluation depends on GPT-4o as a judge, which adds cost and ties results to a proprietary system whose behavior can change over time; judge models can also carry their own biases. The published experiments cap input length at 128K tokens, so conclusions about substantially longer contexts require extending the suite. Because length is controlled by adding passages or demonstrations, or by varying or truncating document text, results reflect that specific construction of long inputs rather than every possible long-context workload. The authors position HELMET as a broad and extensible foundation rather than an exhaustive measure of every long-context capability.^[1]^[3]

References

1. Yen, Howard; Gao, Tianyu; Hou, Minmin; Ding, Ke; Fleischer, Daniel; Izsak, Peter; Wasserblat, Moshe; Chen, Danqi. "HELMET: How to Evaluate Long-Context Language Models Effectively and Thoroughly." arXiv:2410.02694. https://arxiv.org/abs/2410.02694
2. "HELMET: The HELMET Benchmark." princeton-nlp on GitHub. https://github.com/princeton-nlp/HELMET
3. Yen, Howard et al. "HELMET: How to Evaluate Long-Context Language Models Effectively and Thoroughly" (full text, v2). arXiv. https://arxiv.org/html/2410.02694v2
4. "HELMET" project page. Princeton NLP. https://princeton-nlp.github.io/HELMET/
5. Yen, Howard; Gao, Tianyu. "Introducing HELMET: Holistically Evaluating Long-context Language Models." Hugging Face blog. https://huggingface.co/blog/helmet
6. "From Long Input to Long Output: Holistic Long-Context Evaluation with HELMET and LongProc." Princeton Language and Intelligence. https://pli.princeton.edu/blog/2025/long-input-long-output-holistic-long-context-evaluation-helmet-and-longproc
