**RULER** is a synthetic benchmark developed by NVIDIA for evaluating the long-context capabilities of large language models (LLMs). Published as a conference paper at COLM 2024, RULER goes well beyond the popular Needle in a Haystack (NIAH) test by introducing 13 tasks across four distinct categories: retrieval, multi-hop tracing, aggregation, and question answering. The benchmark was designed to answer a deceptively simple question: when a model claims to support a certain context window, how much of that window does it actually use effectively?
The original evaluation covered 17 long-context LLMs with claimed context sizes ranging from 32K to 1 million tokens. The results were revealing. Despite nearly perfect scores on the basic NIAH retrieval test, almost all models showed significant performance degradation as context length increased on RULER's more demanding tasks. Only four models maintained what the authors defined as "satisfactory performance" at 32K tokens, and none fully lived up to their advertised context window at the longest lengths tested.
The rapid expansion of context windows in LLMs has been one of the most visible trends in the field since 2023. Models like GPT-4 (128K tokens), Claude (200K tokens), Yi-34B (200K tokens), and LWM (1 million tokens) have each pushed the boundaries of how much text a model can process in a single pass. Evaluating whether these models truly leverage their full context windows, however, has remained a challenge.
The most widely adopted evaluation for long-context models has been the Needle in a Haystack (NIAH) test, originally popularized by Greg Kamradt in November 2023. In this test, a short piece of information (the "needle") is inserted at a random position within a long passage of distractor text (the "haystack"), and the model is asked to retrieve it. While useful as a basic sanity check, the NIAH test has several limitations that RULER was specifically designed to address.
First, the vanilla NIAH test only measures a single, superficial capability: retrieving one piece of information from a long context. Real-world use of long-context models involves far more complex behaviors, including tracking entities across a document, aggregating information from multiple locations, and answering questions that require reasoning over large bodies of text. Second, many models had already achieved near-perfect scores on the NIAH test, making it a saturated benchmark incapable of distinguishing between models of different quality. Third, the test offers limited configurability: the type of needle, the complexity of the haystack, and the number of items to retrieve are all fixed in the standard setup.
RULER was created to address all three of these shortcomings by providing a flexible, configurable benchmark that tests a broader range of long-context capabilities with synthetic tasks whose ground-truth answers are always known.
RULER follows several key design principles that distinguish it from both the vanilla NIAH test and other long-context benchmarks.
Synthetic task generation. All tasks in RULER are synthetically generated, meaning the ground-truth answers are always deterministic and verifiable. This eliminates the ambiguity that can arise in natural language benchmarks where multiple answers might be acceptable. It also allows the benchmark to scale to arbitrary context lengths without requiring hand-curated long documents.
Flexible configuration. Each task in RULER supports configurable parameters that control difficulty. Researchers can adjust the number of needles, the type of distractors, the number of hops in multi-hop tasks, and other variables. This flexibility allows RULER to serve as both a standard benchmark and a diagnostic tool for probing specific model weaknesses.
Minimal reliance on parametric knowledge. Because the tasks use synthetic data (random words, numbers, UUIDs, and variable names), models cannot rely on knowledge stored in their parameters to answer correctly. They must actually process and reason over the provided context. This is a deliberate contrast with some natural-language benchmarks where a well-trained model might answer correctly based on its pre-training knowledge alone.
Scalable evaluation. RULER generates 500 test examples per task at each context length (4K, 8K, 16K, 32K, 64K, and 128K tokens), producing a statistically robust evaluation at every scale.
RULER organizes its 13 tasks into four categories, each testing a fundamentally different aspect of long-context processing.
The retrieval category extends the standard NIAH test into multiple variants that probe different retrieval challenges. In all variants, "needles" are key-value pairs (e.g., "The special magic number for {key} is: {value}") embedded at various positions within distractor text.
| Task | Abbreviation | Description | What It Tests |
|---|---|---|---|
| Single NIAH | S-NIAH | Retrieve a single needle from the haystack. Keys and values can be words, 7-digit numbers, or 32-character UUIDs. Haystacks consist of either repeated noise sentences or Paul Graham essays. | Basic single-item retrieval across different data types |
| Multi-keys NIAH | MK-NIAH | Multiple needles with different keys are inserted, but only one specific needle must be retrieved. The extra needles act as hard distractors. | Retrieval accuracy in the presence of similar distractors |
| Multi-values NIAH | MV-NIAH | Multiple needles share the same key but have different values. The model must retrieve all associated values. | Complete multi-item retrieval for a single query |
| Multi-queries NIAH | MQ-NIAH | Multiple needles with distinct keys are inserted, and the model must retrieve the values for all of them. | Parallel retrieval of multiple independent targets |
The retrieval tasks also vary by haystack format and key-value type. The original paper groups these variations into the following configurations:
| Configuration | Haystack Type | Key/Value Format |
|---|---|---|
| Passkey Retrieval | Repeated noise sentences | Word-number pairs |
| Vanilla NIAH | Paul Graham essays | Word-number pairs |
| Line Retrieval | Distractors filling the full context | Line identification |
| Key-Value Retrieval | UUID key-value pairs as distractors | UUID strings |
Together, these retrieval tasks account for eight of RULER's 13 tasks (the four structural variants, instantiated in several configurations), covering the full spectrum from simple single-needle extraction to complex multi-target retrieval under heavy distraction.
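The exact generation code lives in the open-source RULER repository, but the construction of a Single NIAH instance can be sketched in a few lines. The noise sentence, needle template, and function name below are illustrative assumptions, not RULER's actual implementation:

```python
import random

NOISE_SENTENCE = "The grass is green. The sky is blue. The sun is yellow."
NEEDLE_TEMPLATE = "The special magic number for {key} is: {value}."

def make_single_niah(key, value, n_noise, seed=0):
    """Build one S-NIAH instance: one needle hidden among noise sentences."""
    rng = random.Random(seed)
    sentences = [NOISE_SENTENCE] * n_noise
    # Insert the needle at a random depth within the haystack.
    sentences.insert(rng.randint(0, n_noise),
                     NEEDLE_TEMPLATE.format(key=key, value=value))
    context = " ".join(sentences)
    question = f"What is the special magic number for {key}?"
    return context, question, value

context, question, answer = make_single_niah("apple", "7481036", n_noise=50)
```

Because the needle is generated rather than mined from real text, the ground truth is known exactly, and the same recipe scales to any context length by increasing `n_noise`.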
This category contains a single task, Variable Tracking (VT), which serves as a minimal proxy for coreference chain resolution. In this task, a variable X1 is initialized with a value V. Then, a chain of variable assignment statements (X2 = X1, X3 = X2, and so on) is inserted at various positions throughout the input text. The model must trace through the chain to determine which variables ultimately hold the value V.
Variable Tracking tests a fundamentally different capability from retrieval. Rather than finding a specific piece of information, the model must follow a chain of references scattered across the full context. The task complexity can be increased by adding more hops (longer chains) or by inserting multiple independent chains that the model must distinguish from one another.
This task is particularly revealing because it mimics a pattern common in real documents: pronoun resolution, entity tracking, and following chains of logic or reference that span many pages of text.
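A Variable Tracking instance might be generated along the following lines. The `VAR` statement format and the filler text are illustrative assumptions, not RULER's exact templates:

```python
import random

def make_variable_tracking(chain_len, value, seed=0):
    """Build one VT instance: X1 = value, X2 = X1, ..., with noise between
    assignments; the gold answer is every variable in the chain."""
    rng = random.Random(seed)
    names = [f"X{i}" for i in range(1, chain_len + 1)]
    statements = [f"VAR {names[0]} = {value}"]
    statements += [f"VAR {names[i]} = VAR {names[i - 1]}"
                   for i in range(1, chain_len)]
    lines = []
    for stmt in statements:
        lines.append(stmt)
        # Stand-in for the long spans of distractor text used in the real task.
        lines.extend(["(irrelevant filler text)"] * rng.randint(1, 3))
    question = f"Which variables hold the value {value}?"
    return "\n".join(lines), question, set(names)
```

Increasing `chain_len` adds hops, and generating several chains with distinct values forces the model to keep independent reference chains separate.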
The aggregation category contains two tasks that serve as proxy measures for a model's ability to synthesize information distributed across the entire context, similar to what is required in summarization.
| Task | Abbreviation | Description | Distribution |
|---|---|---|---|
| Common Words Extraction | CWE | Identify the top 10 most frequently occurring words from a list where words are drawn from a discrete uniform distribution. The number of common words is fixed, while the number of uncommon words grows with context length. | Uniform |
| Frequent Words Extraction | FWE | Identify the 3 most frequently occurring words from a list where word frequencies follow a Zeta (Zipf) distribution. The top-ranked words from the distribution serve as noise that the model must filter out. | Zeta (Zipf) |
These aggregation tasks require models to process the entire input rather than locating a single piece of information. A model that only attends to a portion of the context will miss occurrences of the target words and produce incorrect frequency counts. In FWE, word frequencies follow a Zeta distribution: out of N sampled words, the expected count of the k-th ranked word is N · k^(−α) / ζ(α). This makes the task harder than CWE's uniform setup because the frequency differences between adjacent ranks are subtler.
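The FWE setup can be illustrated by sampling a word list with Zipf-distributed frequencies and extracting the top ranks. The vocabulary, `alpha` value, and sample size below are arbitrary choices for illustration, not RULER's defaults:

```python
import random
from collections import Counter

def make_fwe_wordlist(vocab, alpha, n_words, seed=0):
    """Sample words so that rank k is drawn with probability ~ k^(-alpha)."""
    rng = random.Random(seed)
    weights = [(k + 1) ** (-alpha) for k in range(len(vocab))]
    return rng.choices(vocab, weights=weights, k=n_words)

vocab = [f"word{i}" for i in range(50)]  # word0 is rank 1, word1 is rank 2, ...
words = make_fwe_wordlist(vocab, alpha=2.0, n_words=5000)
top3 = [w for w, _ in Counter(words).most_common(3)]
```

With a steep `alpha` the top ranks dominate clearly; shallower values shrink the count gaps between adjacent ranks, which is exactly what makes the task harder.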
The QA category adapts existing short-context question answering datasets to the long-context setting by padding the original passages with distracting paragraphs.
| Task | Abbreviation | Source Dataset | Description |
|---|---|---|---|
| Single-hop QA | SQA | SQuAD | A question-answer pair from SQuAD is placed within a long context filled with irrelevant paragraphs from other SQuAD articles. The model must locate the relevant paragraph and answer the question. |
| Multi-hop QA | MQA | HotpotQA | A multi-hop question from HotpotQA requires reasoning over two or more paragraphs. These paragraphs are embedded within a large number of distractors. |
The QA tasks bridge the gap between RULER's synthetic evaluations and realistic use cases. By using genuine QA datasets, these tasks measure whether a model can still answer factual questions when the relevant information is buried in thousands of tokens of irrelevant content.
RULER evaluates models at six standard context lengths: 4K, 8K, 16K, 32K, 64K, and 128K tokens. For certain experiments exploring extrapolation behavior, the authors also tested select models at 256K and beyond. Each task generates 500 independent test instances at each length, yielding a total of up to 39,000 evaluations per model in a full run (13 tasks times 6 lengths times 500 examples).
Performance on each task is measured using string matching against the known ground-truth answers. For tasks with multiple correct answers (such as MV-NIAH or CWE), the evaluation checks whether the model's output contains all required items. The per-task accuracy scores are then averaged to produce category-level and overall benchmark scores.
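A minimal recall-style string matcher in this spirit might look like the following (a sketch of the idea, not RULER's exact scoring code):

```python
def string_match_score(prediction, references):
    """Fraction of reference strings found verbatim in the model output;
    reduces to plain accuracy for single-answer tasks."""
    hits = sum(1 for ref in references if ref in prediction)
    return hits / len(references)
```

For a single-needle task the score is 1.0 if the answer appears anywhere in the output; for multi-value tasks, partial retrieval earns partial credit.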
The authors introduce two weighted averaging schemes to produce a single composite score per model:
wAvg(inc): A linearly increasing weight scheme that assigns greater importance to longer context lengths. This simulates scenarios where long-context performance matters most.
wAvg(dec): A linearly decreasing weight scheme that emphasizes shorter contexts. This reflects use cases where moderate context lengths are more common.
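The two schemes can be sketched as a single helper. This is one plausible reading of "linearly increasing/decreasing weights" (weights 1..n and n..1, normalized); the paper's exact weights may differ:

```python
def weighted_avg(scores, increasing=True):
    """wAvg over per-length scores (shortest context first) with weights
    1..n (increasing) or n..1 (decreasing), normalized by their sum."""
    n = len(scores)
    weights = list(range(1, n + 1)) if increasing else list(range(n, 0, -1))
    return sum(w * s for w, s in zip(weights, scores)) / sum(weights)

# Hypothetical per-length scores at 4K, 8K, 16K, 32K, 64K, 128K:
scores = [96.6, 93.4, 91.5, 90.2, 87.0, 81.2]
winc = weighted_avg(scores, increasing=True)   # emphasizes long contexts
wdec = weighted_avg(scores, increasing=False)  # emphasizes short contexts
```

For a model whose scores decay with length, wAvg(inc) is always the lower of the two numbers, since it up-weights the weaker long-context results.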
In practice, the top-performing models (GPT-4, Command-R, Yi-34B, Mixtral) ranked consistently at the top regardless of which weighting scheme was used, indicating robust performance across all lengths.
One of RULER's most impactful contributions is the concept of effective context size. This metric defines the longest context length at which a model's composite RULER score meets or exceeds a baseline threshold. The authors set this threshold at the performance of Llama 2-7B (chat) evaluated at 4K tokens, which corresponds to 85.6% accuracy.
The effective context size provides a single, interpretable number that captures how much of a model's claimed context window is actually functional. It revealed dramatic gaps between marketing claims and measured performance across the evaluated models.
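Under one reasonable reading of this definition, the metric can be computed as follows. The sketch assumes a model must stay above the threshold at every shorter length as well; treat it as an illustration, not the authors' exact procedure:

```python
THRESHOLD = 85.6  # Llama 2-7B (chat) evaluated at 4K, per the paper

def effective_context(scores_by_length, threshold=THRESHOLD):
    """Longest tested length whose composite RULER score meets the threshold,
    scanning lengths in increasing order and stopping at the first failure.
    Returns None if even the shortest length falls below the baseline."""
    effective = None
    for length in sorted(scores_by_length):
        if scores_by_length[length] < threshold:
            break
        effective = length
    return effective
```

For example, a model scoring above 85.6 up to 64K but 81.2 at 128K would be assigned an effective context of 64K.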
The original RULER paper evaluated 17 models spanning multiple architectures, sizes, and training approaches. These were divided into aligned (instruction-tuned) models used for the primary evaluation and base models used for supplementary architecture comparisons.
| Model | Parameters | Claimed Context | Effective Context | Avg. Score (4K-128K) | 4K Score | 128K Score | Degradation |
|---|---|---|---|---|---|---|---|
| GPT-4 | Undisclosed | 128K | 64K | 91.6 | 96.6 | 81.2 | 15.4 |
| Command-R | 35B | 128K | 32K | 88.3 | 93.8 | 76.0 | 17.8 |
| Yi-34B | 34B | 200K | 32K | 87.5 | 93.3 | 77.3 | 16.0 |
| Mixtral | 8x7B | 32K | 32K | 80.4 | 94.9 | 44.5 | 50.4 |
| ChatGLM | 6B | 128K | 4K | - | 85.6 | - | - |
| Mistral | 7B | 32K | 16K | 68.4 | 93.6 | 13.8 | 79.8 |
| LWM | 7B | 1M | <4K | - | 82.3 | 65.0 | 17.3 |
| Together | 7B | 32K | 4K | - | 88.2 | 0.0 | 88.2 |
| LongChat | 7B | 32K | <4K | - | 84.7 | 0.0 | 84.7 |
| LongAlpaca | 13B | 32K | <4K | - | 60.6 | 0.0 | 60.6 |
Several patterns emerge from these results. GPT-4 was the clear leader, achieving the highest scores at every context length and exhibiting the smallest performance degradation (15.4 points from 4K to 128K). Among open-source models, Command-R, Yi-34B, and Mixtral formed a strong second tier, all maintaining effective context sizes of 32K. Notably, all three of these models use a large base frequency in their Rotary Position Embedding (RoPE) implementation, which the authors identify as a contributing factor to their superior long-context performance. They are also larger in parameter count than the remaining models.
The most striking failures occurred in models claiming very large context windows. LWM, despite claiming a 1 million token context, could not even meet the baseline threshold at 4K tokens on RULER's full task suite. Together and LongChat both scored 0.0 at 64K and 128K tokens, indicating complete failure at those lengths. LongAlpaca scored the lowest overall, with performance below the baseline even at 4K.
The paper also evaluated several base (non-instruction-tuned) models and non-Transformer architectures:
| Model | Architecture | Key Finding |
|---|---|---|
| Mixtral-base | Transformer (MoE) | Performance patterns similar to aligned version |
| Mistral-base | Transformer | Base model showed similar degradation curve |
| Jamba-base | Hybrid Transformer-Mamba | Competitive with Transformer baselines |
| RWKV-v5 | RWKV (linear attention) | Substantially underperformed Llama 2-7B baseline at all lengths |
| Mamba-2.8B | Mamba (SSM) | Large gap below Transformer baseline, even at 4K |
The non-Transformer architectures (RWKV and Mamba) performed significantly worse than the Transformer-based Llama 2-7B baseline, lagging by large margins even at the shortest 4K context length. This finding raised questions about the long-context capabilities of alternative architectures available at the time of the study (early 2024), though later hybrid models like Jamba 1.5 would go on to demonstrate strong RULER performance.
The most important finding from RULER is the systematic discrepancy between claimed and effective context sizes. While every model in the evaluation claimed a context window of at least 32K tokens, only four of the 17 models could maintain satisfactory performance (above the 85.6% baseline) at that length. The gap was most extreme for models claiming very large windows:
| Model | Claimed Context | Effective Context | Gap |
|---|---|---|---|
| LWM | 1,000,000 | <4,000 | >996,000 tokens |
| LongAlpaca | 32,000 | <4,000 | >28,000 tokens |
| LongChat | 32,000 | <4,000 | >28,000 tokens |
| Yi-34B | 200,000 | 32,000 | 168,000 tokens |
| GPT-4 | 128,000 | 64,000 | 64,000 tokens |
| Command-R | 128,000 | 32,000 | 96,000 tokens |
RULER's multi-task design enabled the identification of specific failure modes that simpler benchmarks cannot detect.
Non-robustness to needle types. Models that performed well with word or number needles showed significant accuracy drops when the needles were 32-character UUIDs. This indicates that retrieval performance depends heavily on the format of the target information, not just the context length.
Failure to ignore distractors. In the Multi-keys NIAH task, adding distractor needles (needles with different keys) caused substantial performance drops. Yi-34B, for example, lost approximately 40 points at 256K tokens when distractor needles were introduced, despite maintaining strong performance without them.
Incomplete retrieval. When asked to retrieve multiple items (as in MV-NIAH), models exhibited a 15-point performance loss when retrieving 8 items compared to retrieving just 1. A common error pattern was producing duplicate values rather than retrieving all distinct targets.
Context copying in aggregation. In the Common Words Extraction task at 128K tokens, over 80% of the outputs from Yi-34B, LWM, and LongAlpaca consisted of verbatim copies of text from the context rather than actual word frequency analysis. This suggests these models default to a copying strategy when overwhelmed by long inputs.
Unreliable multi-hop tracing. On the Variable Tracking task, models frequently returned empty strings or variables from unrelated chains, indicating a fundamental inability to reliably follow reference chains across long contexts.
QA hallucination at scale. As context length increased, model performance on the QA tasks approached (and sometimes fell below) a no-context baseline, meaning models were essentially hallucinating answers rather than extracting them from the provided text.
Experiments with the Yi model family (6B, 9B, and 34B parameters, all trained on the same data with the same context length) showed that the 34B variant significantly outperformed the smaller models at every context length. This confirms that model capacity plays an important role in long-context performance, independent of the training context length.
Models from the LWM series were evaluated at different training context lengths (32K, 128K, 512K, and 1M). The 512K variant actually outperformed the 1M variant when tested at 256K tokens, suggesting that simply increasing the training context length does not guarantee better performance. The authors attributed this to insufficient adjustment of the RoPE base frequency when extending to 1M tokens.
RULER occupies a specific position in the broader landscape of long-context evaluation tools. The following comparison situates RULER relative to other prominent benchmarks.
| Benchmark | Type | Task Categories | Context Lengths | Key Difference from RULER |
|---|---|---|---|---|
| Needle in a Haystack | Synthetic | Retrieval only | Variable | Single task; RULER extends this into 8 retrieval variants |
| LongBench | Natural | Summarization, QA, coding, and others | Up to 128K | Uses natural documents; cannot control for parametric knowledge |
| ZeroScrolls | Natural | Summarization, QA | Up to 100K | Relies on natural long documents with subjective evaluation |
| InfiniteBench | Natural/Synthetic | Retrieval, QA, math, coding | 100K+ | Combines natural and synthetic tasks; less configurable |
| HELMET | Meta-benchmark | Multiple (includes RULER) | Variable | Aggregates multiple benchmarks including RULER for standardized comparison |
| RULER | Synthetic | Retrieval, tracing, aggregation, QA | 4K-128K | Fully synthetic, highly configurable, 13 tasks across 4 categories |
RULER's synthetic nature is both its greatest strength and its primary limitation. The synthetic design ensures reproducible, unambiguous evaluation with known ground-truth answers at any context length. However, the lack of natural language complexity means that RULER scores may not perfectly predict performance on real-world long-context tasks such as document summarization, legal analysis, or codebase understanding. The authors of RULER acknowledged this limitation, noting that RULER "is by no means comprehensive enough and it cannot replace the more preferred realistic tasks."
Since its release in April 2024, RULER has become one of the standard benchmarks in the long-context LLM evaluation ecosystem. Several developments illustrate its influence:
Stanford HELM Long Context. The HELM (Holistic Evaluation of Language Models) project at Stanford includes RULER as one of its core long-context benchmarks, using it alongside other evaluations to provide transparent and reproducible model comparisons.
Model development validation. AI21's Jamba 1.5 models (released in August 2024) were explicitly benchmarked on RULER, with the Jamba 1.5 Large and Mini both demonstrating strong performance. Jamba 1.5 is reported as one of the first models to maintain an effective context length of 256K tokens on RULER, far exceeding the effective lengths measured for the original 17 models.
NVIDIA's own model evaluation. NVIDIA's Nemotron 3 Super (120B) achieved a RULER score of 0.917, demonstrating the benchmark's continued relevance for evaluating newer, more capable models.
Research community adoption. RULER has been cited extensively in research on long-context architectures, position encoding methods, and context extension techniques. Its configurable nature makes it a convenient diagnostic tool for researchers developing new approaches to long-context modeling.
Following the original RULER benchmark, a successor called RULER V2 was developed to address remaining gaps in the evaluation framework. RULER V2 adopts a bottom-up approach, starting with basic retrieval and progressively increasing task difficulty toward complex reasoning. The design ensures that even models achieving perfect scores on simpler retrieval tasks encounter meaningful challenges at higher difficulty levels.
Key differences between RULER V1 and V2 include a focus on systematic difficulty scaling, where performance decreases as both context length and task complexity increase simultaneously. RULER V2 creates a more challenging evaluation space with clearer improvement targets, and its findings emphasize that models need to master fundamental retrieval abilities before they can be expected to handle complex long-context reasoning.
In March 2025, researchers published ONERULER ("One Ruler to Measure Them All"), a multilingual adaptation of the RULER benchmark. ONERULER extends the evaluation to 26 languages and includes seven synthetic tasks that test both retrieval and aggregation capabilities. It also introduces a novel variation of the NIAH task that accounts for the possibility of a nonexistent needle, testing whether models can correctly report that no relevant information exists in the context. ONERULER was published at COLM 2025.
While RULER represents a significant advance over the vanilla NIAH test, it has several acknowledged limitations.
Synthetic vs. realistic tasks. RULER's synthetic task design, while ensuring clean evaluation, does not capture the full complexity of real-world long-context use cases. Tasks like document summarization, cross-document reasoning, and long-form code understanding involve linguistic nuances, ambiguity, and compositional reasoning that synthetic benchmarks cannot fully model. Studies have found a lack of strong correlation between performance on RULER's synthetic tasks and performance on realistic long-context tasks.
English-only scope. The original RULER benchmark evaluates only English text, limiting its applicability to multilingual models and non-English use cases. The ONERULER extension addresses this gap, but the original benchmark remains English-only.
No generation quality assessment. RULER evaluates whether a model can locate, trace, or aggregate information, but it does not assess the quality of generated text in response to long contexts. Benchmarks like LongGenBench have shown that models performing well on RULER may still struggle with long-form text generation tasks.
Static difficulty. Although RULER's tasks are configurable, the standard evaluation uses fixed difficulty settings. A model that has been specifically optimized for RULER's default parameters could potentially achieve high scores without genuine long-context understanding.
Dated model evaluations. The original paper's results reflect the state of models as of early 2024. Models released after this date (including GPT-4o, Claude 3.5, Gemini 1.5 Pro, and Llama 3.1) have not been evaluated in the original paper, though some have been tested using the open-source RULER codebase by the community.