HELMET
Last reviewed
Jun 2, 2026
Sources
6 citations
Review status
Source-backed
Revision
v1 · 1,495 words
HELMET (How to Evaluate Long-context Models Effectively and Thoroughly) is a benchmark for evaluating long-context language models introduced by researchers at Princeton University and Intel Labs in 2024. It assembles seven application-centered task categories, ranging from retrieval-augmented generation to long-document question answering, summarization, and many-shot in-context learning, and evaluates models at controllable input lengths of up to 128,000 tokens. HELMET was designed to replace the ad hoc mix of synthetic probes and short-context datasets that developers had been using to judge long-context ability, and it was published at the Thirteenth International Conference on Learning Representations (ICLR 2025).[1][2]
As model context windows expanded from a few thousand tokens to hundreds of thousands, the community lacked a consistent way to measure whether the extra length translated into useful capability. Developers frequently relied on the Needle in a Haystack (NIAH) test or an arbitrary subset of older datasets, neither of which gave a reliable picture of real downstream performance. HELMET addresses this by combining diverse, realistic applications with consistent length control and reliable metrics in a single suite.[1][3]
The benchmark was created by Howard Yen, Tianyu Gao, Minmin Hou, Ke Ding, Daniel Fleischer, Peter Izsak, Moshe Wasserblat, and Danqi Chen in a collaboration between the Princeton Language and Intelligence group and Intel Labs. Its code, data, and a public leaderboard are released through the Princeton NLP GitHub organization.[1][2] Using HELMET, the authors carried out what they describe as the most thorough and controlled comparison of long-context models on diverse applications to date, covering 59 large language models spanning open-weight and proprietary systems.[3][4]
The HELMET authors argue that existing long-context evaluations suffered from four recurring weaknesses, which together produced noisy and sometimes misleading signals.[1][3]
| Problem | Description |
|---|---|
| Insufficient coverage of downstream tasks | Many suites focused on a narrow domain or a single task type, so they failed to capture the breadth of real long-context applications. |
| Inadequate lengths | Older question-answering datasets such as QASPER and QuALITY are often capped below 32K tokens, too short to stress frontier models with 128K or larger windows. |
| Unreliable metrics | N-gram matching scores such as ROUGE correlate poorly with human judgment and often fail to separate strong models from weak ones. |
| Incompatibility with base models | Several benchmarks assume instruction-tuned models, leaving them unusable for evaluating base models during pretraining and early development. |
A central empirical claim that motivates the design is that synthetic tasks like NIAH are not good predictors of downstream performance. Most modern models achieve near-perfect NIAH scores even when they struggle badly on tasks that require reasoning over the full context, so a high synthetic score can mask substantial weaknesses.[3][4]
HELMET groups its tasks into seven categories, each drawn from a recognizable application of long context. The categories use a mix of automatic and model-based metrics, and several reuse or extend datasets from prior work such as RULER, ALCE, and InfiniteBench.[3][5]
| Category | Representative datasets | Metric |
|---|---|---|
| Synthetic recall | JSON KV, RULER MK Needle, RULER MK UUID, RULER MV | Substring exact match (SubEM) |
| Retrieval-augmented generation (RAG) | Natural Questions, TriviaQA, PopQA, HotpotQA | SubEM |
| Passage re-ranking | MS MARCO | NDCG@10 |
| Generation with citations | ALCE ASQA, ALCE QAMPARI | Recall, citation quality |
| Long-document QA | NarrativeQA, InfiniteBench QA, InfiniteBench MC | Model-based, ROUGE F1, accuracy |
| Summarization | InfiniteBench Sum, Multi-LexSum | Model-based |
| Many-shot in-context learning | TREC Coarse, TREC Fine, NLU, BANKING77, CLINC150 | Accuracy |
Synthetic recall keeps a controlled probe of pure retrieval ability, while the remaining six categories target practical uses: grounding answers in retrieved passages, ordering candidate documents, producing answers with verifiable citations, answering questions about a single long document, condensing long inputs, and learning a classification task from many in-context demonstrations.[3][5]
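To make the passage re-ranking metric concrete, the following is a minimal sketch of NDCG@10 in Python. It is not taken from the HELMET codebase. The function names are hypothetical, the gain is linear in the relevance label (some implementations use an exponential gain instead), and the ideal ordering is computed from the same list of labels, which assumes the ranked list contains every judged passage.

```python
import math
from typing import Sequence


def dcg_at_k(relevances: Sequence[float], k: int) -> float:
    """Discounted cumulative gain over the top-k items, with linear gain."""
    # rank is 0-based, so the discount for the first item is log2(2) = 1.
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(relevances: Sequence[float], k: int = 10) -> float:
    """NDCG@k for relevance labels listed in the order the model ranked them."""
    # Assumption: the ideal ranking is the same labels sorted in descending order.
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    if ideal == 0:
        return 0.0  # no relevant passage at all, so the score is defined as 0
    return dcg_at_k(relevances, k) / ideal
```

A ranking that already lists passages in descending relevance scores 1.0, and moving a relevant passage lower in the list reduces the score through the logarithmic discount.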
A defining feature of HELMET is that every dataset is built to run at a fixed set of input lengths rather than at whatever length a source corpus happens to provide. In the published experiments the authors evaluate at L in {8K, 16K, 32K, 64K, 128K} tokens, measured with the Llama-2 tokenizer, and the design extends naturally to longer contexts as models grow.[3][5]
Length is controlled differently depending on the task, for example by adding more retrieved passages, more in-context demonstrations, or more document text, so that input size is the only variable while the underlying question stays fixed.[3]
This construction lets HELMET report a performance curve across lengths for each model and category, exposing where and how a model degrades as the context grows. The authors also refine the prompts and few-shot demonstrations across tasks so that base models, not just instruction-tuned ones, can be evaluated robustly through few-shot prompting.[1][3]
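The length-control idea can be illustrated with a short Python sketch for a passage-based task. This is an assumption-laden illustration rather than the authors' implementation: `build_context`, `whitespace_tokens`, and the `gold_position` parameter are hypothetical names, and a whitespace counter stands in for a real tokenizer such as the Llama-2 tokenizer mentioned above.

```python
from typing import Callable, List


def whitespace_tokens(text: str) -> int:
    """Stand-in token counter; a real run would use a model tokenizer."""
    return len(text.split())


def build_context(
    gold_passage: str,
    distractors: List[str],
    budget_tokens: int,
    count_tokens: Callable[[str], int] = whitespace_tokens,
    gold_position: int = 0,
) -> List[str]:
    """Pack distractor passages around one gold passage up to a token budget."""
    used = count_tokens(gold_passage)
    if used > budget_tokens:
        raise ValueError("gold passage alone exceeds the token budget")
    chosen: List[str] = []
    for passage in distractors:
        cost = count_tokens(passage)
        if used + cost > budget_tokens:
            break  # stop at the first passage that would overflow the budget
        chosen.append(passage)
        used += cost
    # Place the gold passage at the requested slot, clamped to the list bounds.
    slot = min(max(gold_position, 0), len(chosen))
    return chosen[:slot] + [gold_passage] + chosen[slot:]
```

Calling the function with the same question and gold passage but different values of `budget_tokens` yields inputs that differ only in how much surrounding material they contain, which is the property the length-controlled design relies on.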
To overcome the unreliability of n-gram metrics, HELMET introduces model-based evaluation for the open-ended categories. For long-document QA and summarization, the authors prompt GPT-4o as a judge to score model outputs against reference answers, a scheme that they report separates models, and the same model at different input lengths, more clearly than ROUGE does.[1][3][5] Closed-form categories continue to use deterministic metrics: substring exact match for recall and RAG, NDCG@10 for re-ranking, and accuracy for in-context learning, while the citation task combines answer recall with a citation-quality score adapted from the ALCE framework.[3][5]
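Substring exact match is simple enough to sketch directly. The Python below is an illustration under stated assumptions, not the benchmark's own scoring code: the function names are hypothetical, and the normalization (lowercasing, stripping punctuation and English articles) follows a common question-answering convention that the source does not specify.

```python
import re
import string
from typing import Iterable


def normalize(text: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def substring_exact_match(prediction: str, answers: Iterable[str]) -> bool:
    """True if any normalized reference answer occurs inside the prediction."""
    pred = normalize(prediction)
    # Answers that normalize to the empty string are skipped, since an empty
    # string is a substring of everything and would always count as a match.
    return any(norm in pred for norm in map(normalize, answers) if norm)
```

Unlike strict exact match, this check accepts a correct answer embedded in a longer sentence, which suits models that answer in full sentences rather than bare spans.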
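The large-scale study of 59 long-context models surfaced several findings that the authors present as guidance for both model developers and evaluators, including the observation that near-perfect scores on synthetic tasks do not guarantee strong performance on tasks that require reasoning over the full context.[3][4]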
As a practical recommendation, the authors suggest that RAG tasks are a good lightweight proxy for fast model development, because they are inexpensive to run yet predict broader downstream performance better than synthetic probes.[1][3]
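The text does not state here how the predictive value of a proxy category is measured. One common way to quantify it, assumed for this sketch, is the Spearman rank correlation between per-model scores on the proxy category and on a target category. The function names below are hypothetical and the code is a plain-Python illustration, not the authors' analysis.

```python
import math
from typing import List, Sequence


def ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks; tied values share the mean of the positions they span."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for idx in order[i : j + 1]:
            result[idx] = mean_rank
        i = j + 1
    return result


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


def spearman(proxy_scores: Sequence[float], target_scores: Sequence[float]) -> float:
    """Spearman correlation: Pearson correlation computed on the ranks."""
    if len(proxy_scores) != len(target_scores) or len(proxy_scores) < 2:
        raise ValueError("need two sequences of equal length, at least 2 items")
    return pearson(ranks(proxy_scores), ranks(target_scores))
```

Each position in the two input sequences corresponds to one model. A value near 1 means the proxy category orders models almost the same way as the target category, which is the sense in which a cheap category can stand in for a more expensive one.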
HELMET has been adopted as a reference benchmark for reporting long-context performance, and its leaderboard tracks frontier systems from major developers. It is frequently cited alongside synthetic stress tests such as RULER and recall benchmarks such as MRCR, with HELMET occupying the application-oriented end of the spectrum.[2][5] The Princeton group later paired it with LongProc, a companion benchmark aimed at long-form generation and procedural tasks, framing the two together as a way to evaluate both long input and long output.[6]
By standardizing lengths, broadening task coverage, and substituting model-based judging for brittle n-gram metrics, HELMET made cross-model long-context comparisons more reproducible and shifted attention from synthetic retrieval toward realistic applications. Its release with code, data, and an open leaderboard has supported reuse across subsequent model reports and follow-up research.[1][2]
HELMET inherits some constraints from its design choices. The model-based evaluation depends on GPT-4o as a judge, which introduces cost and a dependence on a proprietary system whose behavior can change over time, and judge models can carry their own biases. The published experiments cap input length at 128K tokens, so conclusions about substantially longer contexts require extending the suite. Because length is controlled by padding prompts with additional passages, demonstrations, or document text, results reflect that specific construction of long inputs rather than every possible long-context workload. The authors position HELMET as a broad and extensible foundation rather than an exhaustive measure of every long-context capability.[1][3]