**RULER** is a synthetic benchmark developed by NVIDIA for evaluating the long-context capabilities of large language models (LLMs). Published as a conference paper at COLM 2024, RULER goes well beyond the popular Needle in a Haystack (NIAH) test by introducing 13 tasks across four distinct categories: retrieval, multi-hop tracing, aggregation, and question answering. The benchmark was designed to answer a deceptively simple question: when a model claims to support a certain context window, how much of that window does it actually use effectively?
The original evaluation covered 17 long-context LLMs with claimed context sizes ranging from 32K to 1 million tokens. The results were revealing. Despite nearly perfect scores on the basic NIAH retrieval test, almost all models showed significant performance degradation as context length increased on RULER's more demanding tasks. Only four models maintained what the authors defined as "satisfactory performance" at 32K tokens, and none fully lived up to their advertised context window at the longest lengths tested.
The rapid expansion of context windows in LLMs has been one of the most visible trends in the field since 2023. Models like GPT-4 (128K tokens), Claude (200K tokens), Yi-34B (200K tokens), and LWM (1 million tokens) have each pushed the boundaries of how much text a model can process in a single pass. Evaluating whether these models truly leverage their full context windows, however, has remained a challenge.
The most widely adopted evaluation for long-context models has been the Needle in a Haystack (NIAH) test, originally popularized by Greg Kamradt in November 2023. In this test, a short piece of information (the "needle") is inserted at a random position within a long passage of distractor text (the "haystack"), and the model is asked to retrieve it. While useful as a basic sanity check, the NIAH test has several limitations that RULER was specifically designed to address.
First, the vanilla NIAH test only measures a single, superficial capability: retrieving one piece of information from a long context. Real-world use of long-context models involves far more complex behaviors, including tracking entities across a document, aggregating information from multiple locations, and answering questions that require reasoning over large bodies of text. Second, many models had already achieved near-perfect scores on the NIAH test, making it a saturated benchmark incapable of distinguishing between models of different quality. Third, the test offers limited configurability: the type of needle, the complexity of the haystack, and the number of items to retrieve are all fixed in the standard setup.
RULER was created to address all three of these shortcomings by providing a flexible, configurable benchmark that tests a broader range of long-context capabilities with synthetic tasks whose ground-truth answers are always known.
RULER follows several key design principles that distinguish it from both the vanilla NIAH test and other long-context benchmarks.
Synthetic task generation. All tasks in RULER are synthetically generated, meaning the ground-truth answers are always deterministic and verifiable. This eliminates the ambiguity that can arise in natural language benchmarks where multiple answers might be acceptable. It also allows the benchmark to scale to arbitrary context lengths without requiring hand-curated long documents.
Flexible configuration. Each task in RULER supports configurable parameters that control difficulty. Researchers can adjust the number of needles, the type of distractors, the number of hops in multi-hop tasks, and other variables. This flexibility allows RULER to serve as both a standard benchmark and a diagnostic tool for probing specific model weaknesses.
Minimal reliance on parametric knowledge. Because the tasks use synthetic data (random words, numbers, UUIDs, and variable names), models cannot rely on knowledge stored in their parameters to answer correctly. They must actually process and reason over the provided context. This is a deliberate contrast with some natural-language benchmarks where a well-trained model might answer correctly based on its pre-training knowledge alone.
Scalable evaluation. RULER generates 500 test examples per task at each context length (4K, 8K, 16K, 32K, 64K, and 128K tokens), producing a statistically robust evaluation at every scale.
RULER organizes its 13 tasks into four categories, each testing a fundamentally different aspect of long-context processing.
The retrieval category extends the standard NIAH test into multiple variants that probe different retrieval challenges. In all variants, "needles" are key-value pairs (e.g., "The special magic number for {key} is: {value}") embedded at various positions within distractor text.
| Task | Abbreviation | Description | What It Tests |
|---|---|---|---|
| Single NIAH | S-NIAH | Retrieve a single needle from the haystack. Keys and values can be words, 7-digit numbers, or 32-character UUIDs. Haystacks consist of either repeated noise sentences or Paul Graham essays. | Basic single-item retrieval across different data types |
| Multi-keys NIAH | MK-NIAH | Multiple needles with different keys are inserted, but only one specific needle must be retrieved. The extra needles act as hard distractors. | Retrieval accuracy in the presence of similar distractors |
| Multi-values NIAH | MV-NIAH | Multiple needles share the same key but have different values. The model must retrieve all associated values. | Complete multi-item retrieval for a single query |
| Multi-queries NIAH | MQ-NIAH | Multiple needles with distinct keys are inserted, and the model must retrieve the values for all of them. | Parallel retrieval of multiple independent targets |
The retrieval tasks also vary by haystack format and key-value type. The original paper groups these variations into the following configurations:
| Configuration | Haystack Type | Key/Value Format |
|---|---|---|
| Passkey Retrieval | Repeated noise sentences | Word-number pairs |
| Vanilla NIAH | Paul Graham essays | Word-number pairs |
| Line Retrieval | Distractors filling the full context | Line identification |
| Key-Value Retrieval | UUID key-value pairs as distractors | UUID strings |
Together, these retrieval tasks account for eight of RULER's 13 tasks (the four structural variants, instantiated in several configurations), covering the full spectrum from simple single-needle extraction to complex multi-target retrieval under heavy distraction.
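The exact generation code lives in the open-source RULER repository, but the construction of a Single NIAH instance can be sketched in a few lines. The noise sentence, needle template, and function name below are illustrative assumptions, not RULER's actual implementation:

```python
import random

NOISE_SENTENCE = "The grass is green. The sky is blue. The sun is yellow."
NEEDLE_TEMPLATE = "The special magic number for {key} is: {value}."

def make_single_niah(key, value, n_noise, seed=0):
    """Build one S-NIAH instance: one needle hidden among noise sentences."""
    rng = random.Random(seed)
    sentences = [NOISE_SENTENCE] * n_noise
    # Insert the needle at a random depth within the haystack.
    sentences.insert(rng.randint(0, n_noise),
                     NEEDLE_TEMPLATE.format(key=key, value=value))
    context = " ".join(sentences)
    question = f"What is the special magic number for {key}?"
    return context, question, value

context, question, answer = make_single_niah("apple", "7481036", n_noise=50)
```

Because the needle is generated rather than mined from real text, the ground truth is known exactly, and the same recipe scales to any context length by increasing `n_noise`.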
This category contains a single task, Variable Tracking (VT), which serves as a minimal proxy for coreference chain resolution. In this task, a variable X1 is initialized with a value V. Then, a chain of variable assignment statements (X2 = X1, X3 = X2, and so on) is inserted at various positions throughout the input text. The model must trace through the chain to determine which variables ultimately hold the value V.
Variable Tracking tests a fundamentally different capability from retrieval. Rather than finding a specific piece of information, the model must follow a chain of references scattered across the full context. The task complexity can be increased by adding more hops (longer chains) or by inserting multiple independent chains that the model must distinguish from one another.
This task is particularly revealing because it mimics a pattern common in real documents: pronoun resolution, entity tracking, and following chains of logic or reference that span many pages of text.
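A Variable Tracking instance might be generated along the following lines. The `VAR` statement format and the filler text are illustrative assumptions, not RULER's exact templates:

```python
import random

def make_variable_tracking(chain_len, value, seed=0):
    """Build one VT instance: X1 = value, X2 = X1, ..., with noise between
    assignments; the gold answer is every variable in the chain."""
    rng = random.Random(seed)
    names = [f"X{i}" for i in range(1, chain_len + 1)]
    statements = [f"VAR {names[0]} = {value}"]
    statements += [f"VAR {names[i]} = VAR {names[i - 1]}"
                   for i in range(1, chain_len)]
    lines = []
    for stmt in statements:
        lines.append(stmt)
        # Stand-in for the long spans of distractor text used in the real task.
        lines.extend(["(irrelevant filler text)"] * rng.randint(1, 3))
    question = f"Which variables hold the value {value}?"
    return "\n".join(lines), question, set(names)
```

Increasing `chain_len` adds hops, and generating several chains with distinct values forces the model to keep independent reference chains separate.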
The aggregation category contains two tasks that serve as proxy measures for a model's ability to synthesize information distributed across the entire context, similar to what is required in summarization.
| Task | Abbreviation | Description | Distribution |
|---|---|---|---|
| Common Words Extraction | CWE | Identify the top 10 most frequently occurring words from a list where words are drawn from a discrete uniform distribution. The number of common words is fixed, while the number of uncommon words grows with context length. | Uniform |
| Frequent Words Extraction | FWE | Identify the 3 most frequently occurring words from a list where word frequencies follow a Zeta (Zipf) distribution. The top-ranked words from the distribution serve as noise that the model must filter out. | Zeta (Zipf) |
These aggregation tasks require models to process the entire input rather than locating a single piece of information. A model that only attends to a portion of the context will miss occurrences of the target words and produce incorrect frequency counts. In FWE, word frequencies follow a Zeta distribution: out of N sampled words, the expected count of the k-th ranked word is N · k^(−α) / ζ(α). This makes the task harder than CWE's uniform setup because the frequency differences between adjacent ranks are subtler.
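The FWE setup can be illustrated by sampling a word list with Zipf-distributed frequencies and extracting the top ranks. The vocabulary, `alpha` value, and sample size below are arbitrary choices for illustration, not RULER's defaults:

```python
import random
from collections import Counter

def make_fwe_wordlist(vocab, alpha, n_words, seed=0):
    """Sample words so that rank k is drawn with probability ~ k^(-alpha)."""
    rng = random.Random(seed)
    weights = [(k + 1) ** (-alpha) for k in range(len(vocab))]
    return rng.choices(vocab, weights=weights, k=n_words)

vocab = [f"word{i}" for i in range(50)]  # word0 is rank 1, word1 is rank 2, ...
words = make_fwe_wordlist(vocab, alpha=2.0, n_words=5000)
top3 = [w for w, _ in Counter(words).most_common(3)]
```

With a steep `alpha` the top ranks dominate clearly; shallower values shrink the count gaps between adjacent ranks, which is exactly what makes the task harder.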
The QA category adapts existing short-context question answering datasets to the long-context setting by padding the original passages with distracting paragraphs.
| Task | Abbreviation | Source Dataset | Description |
|---|---|---|---|
| Single-hop QA | SQA | SQuAD | A question-answer pair from SQuAD is placed within a long context filled with irrelevant paragraphs from other SQuAD articles. The model must locate the relevant paragraph and answer the question. |
| Multi-hop QA | MQA | HotpotQA | A multi-hop question from HotpotQA requires reasoning over two or more paragraphs. These paragraphs are embedded within a large number of distractors. |
The QA tasks bridge the gap between RULER's synthetic evaluations and realistic use cases. By using genuine QA datasets, these tasks measure whether a model can still answer factual questions when the relevant information is buried in thousands of tokens of irrelevant content.
RULER evaluates models at six standard context lengths: 4K, 8K, 16K, 32K, 64K, and 128K tokens. For certain experiments exploring extrapolation behavior, the authors also tested select models at 256K and beyond. Each task generates 500 independent test instances at each length, yielding a total of up to 39,000 evaluations per model in a full run (13 tasks times 6 lengths times 500 examples).
Performance on each task is measured using string matching against the known ground-truth answers. For tasks with multiple correct answers (such as MV-NIAH or CWE), the evaluation checks whether the model's output contains all required items. The per-task accuracy scores are then averaged to produce category-level and overall benchmark scores.
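A minimal recall-style string matcher in this spirit might look like the following (a sketch of the idea, not RULER's exact scoring code):

```python
def string_match_score(prediction, references):
    """Fraction of reference strings found verbatim in the model output;
    reduces to plain accuracy for single-answer tasks."""
    hits = sum(1 for ref in references if ref in prediction)
    return hits / len(references)
```

For a single-needle task the score is 1.0 if the answer appears anywhere in the output; for multi-value tasks, partial retrieval earns partial credit.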
The authors introduce two weighted averaging schemes to produce a single composite score per model:
wAvg(inc): A linearly increasing weight scheme that assigns greater importance to longer context lengths. This simulates scenarios where long-context performance matters most.
wAvg(dec): A linearly decreasing weight scheme that emphasizes shorter contexts. This reflects use cases where moderate context lengths are more common.
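The two schemes can be sketched as a single helper. This is one plausible reading of "linearly increasing/decreasing weights" (weights 1..n and n..1, normalized); the paper's exact weights may differ:

```python
def weighted_avg(scores, increasing=True):
    """wAvg over per-length scores (shortest context first) with weights
    1..n (increasing) or n..1 (decreasing), normalized by their sum."""
    n = len(scores)
    weights = list(range(1, n + 1)) if increasing else list(range(n, 0, -1))
    return sum(w * s for w, s in zip(weights, scores)) / sum(weights)

# Hypothetical per-length scores at 4K, 8K, 16K, 32K, 64K, 128K:
scores = [96.6, 93.4, 91.5, 90.2, 87.0, 81.2]
winc = weighted_avg(scores, increasing=True)   # emphasizes long contexts
wdec = weighted_avg(scores, increasing=False)  # emphasizes short contexts
```

For a model whose scores decay with length, wAvg(inc) is always the lower of the two numbers, since it up-weights the weaker long-context results.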
In practice, the top-performing models (GPT-4, Command-R, Yi-34B, Mixtral) ranked consistently at the top regardless of which weighting scheme was used, indicating robust performance across all lengths.
One of RULER's most impactful contributions is the concept of effective context size. This metric defines the longest context length at which a model's composite RULER score meets or exceeds a baseline threshold. The authors set this threshold at the performance of Llama 2-7B (chat) evaluated at 4K tokens, which corresponds to 85.6% accuracy.
The effective context size provides a single, interpretable number that captures how much of a model's claimed context window is actually functional. It revealed dramatic gaps between marketing claims and measured performance across the evaluated models.
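Under one reasonable reading of this definition, the metric can be computed as follows. The sketch assumes a model must stay above the threshold at every shorter length as well; treat it as an illustration, not the authors' exact procedure:

```python
THRESHOLD = 85.6  # Llama 2-7B (chat) evaluated at 4K, per the paper

def effective_context(scores_by_length, threshold=THRESHOLD):
    """Longest tested length whose composite RULER score meets the threshold,
    scanning lengths in increasing order and stopping at the first failure.
    Returns None if even the shortest length falls below the baseline."""
    effective = None
    for length in sorted(scores_by_length):
        if scores_by_length[length] < threshold:
            break
        effective = length
    return effective
```

For example, a model scoring above 85.6 up to 64K but 81.2 at 128K would be assigned an effective context of 64K.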
The original RULER paper evaluated 17 models spanning multiple architectures, sizes, and training approaches. These were divided into aligned (instruction-tuned) models used for the primary evaluation and base models used for supplementary architecture comparisons.
| Model | Parameters | Claimed Context | Effective Context | Avg. Score (4K-128K) | 4K Score | 128K Score | Degradation |
|---|---|---|---|---|---|---|---|
| GPT-4 | Undisclosed | 128K | 64K | 91.6 | 96.6 | 81.2 | 15.4 |
| Command-R | 35B | 128K | 32K | 88.3 | 93.8 | 76.0 | 17.8 |
| Yi-34B | 34B | 200K | 32K | 87.5 | 93.3 | 77.3 | 16.0 |
| Mixtral | 8x7B | 32K | 32K | 80.4 | 94.9 | 44.5 | 50.4 |
| ChatGLM | 6B | 128K | 4K | - | 85.6 | - | - |
| Mistral | 7B | 32K | 16K | 68.4 | 93.6 | 13.8 | 79.8 |
| LWM | 7B | 1M | <4K | - | 82.3 | 65.0 | 17.3 |
| Together | 7B | 32K | 4K | - | 88.2 | 0.0 | 88.2 |
| LongChat | 7B | 32K | <4K | - | 84.7 | 0.0 | 84.7 |
| LongAlpaca | 13B | 32K | <4K | - | 60.6 | 0.0 | 60.6 |
Several patterns emerge from these results. GPT-4 was the clear leader, achieving the highest scores at every context length and exhibiting the smallest performance degradation (15.4 points from 4K to 128K). Among open-source models, Command-R, Yi-34B, and Mixtral formed a strong second tier, all maintaining effective context sizes of 32K. Notably, all three of these models use a large base frequency in their Rotary Position Embedding (RoPE) implementation, which the authors identify as a contributing factor to their superior long-context performance. They are also larger in parameter count than the remaining models.
The most striking failures occurred in models claiming very large context windows. LWM, despite claiming a 1 million token context, could not even meet the baseline threshold at 4K tokens on RULER's full task suite. Together and LongChat both scored 0.0 at 64K and 128K tokens, indicating complete failure at those lengths. LongAlpaca scored the lowest overall, with performance below the baseline even at 4K.
The paper also evaluated several base (non-instruction-tuned) models and non-Transformer architectures:
| Model | Architecture | Key Finding |
|---|---|---|
| Mixtral-base | Transformer (MoE) | Performance patterns similar to aligned version |
| Mistral-base | Transformer | Base model showed similar degradation curve |
| Jamba-base | Hybrid Transformer-Mamba | Competitive with Transformer baselines |
| RWKV-v5 | RWKV (linear attention) | Substantially underperformed Llama 2-7B baseline at all lengths |
| Mamba-2.8B | Mamba (SSM) | Large gap below Transformer baseline, even at 4K |
The non-Transformer architectures (RWKV and Mamba) performed significantly worse than the Transformer-based Llama 2-7B baseline, lagging by large margins even at the shortest 4K context length. This finding raised questions about the long-context capabilities of alternative architectures available at the time of the study (early 2024), though later hybrid models like Jamba 1.5 would go on to demonstrate strong RULER performance.
The most important finding from RULER is the systematic discrepancy between claimed and effective context sizes. While every model in the evaluation claimed a context window of at least 32K tokens, only four of the 17 models could maintain satisfactory performance (above the 85.6% baseline) at that length. The gap was most extreme for models claiming very large windows:
| Model | Claimed Context | Effective Context | Gap |
|---|---|---|---|
| LWM | 1,000,000 | <4,000 | >996,000 tokens |
| LongAlpaca | 32,000 | <4,000 | >28,000 tokens |
| LongChat | 32,000 | <4,000 | >28,000 tokens |
| Yi-34B | 200,000 | 32,000 | 168,000 tokens |
| GPT-4 | 128,000 | 64,000 | 64,000 tokens |
| Command-R | 128,000 | 32,000 | 96,000 tokens |
RULER's multi-task design enabled the identification of specific failure modes that simpler benchmarks cannot detect.
Non-robustness to needle types. Models that performed well with word or number needles showed significant accuracy drops when the needles were 32-character UUIDs. This indicates that retrieval performance depends heavily on the format of the target information, not just the context length.
Failure to ignore distractors. In the Multi-keys NIAH task, adding distractor needles (needles with different keys) caused substantial performance drops. Yi-34B, for example, lost approximately 40 points at 256K tokens when distractor needles were introduced, despite maintaining strong performance without them.
Incomplete retrieval. When asked to retrieve multiple items (as in MV-NIAH), models exhibited a 15-point performance loss when retrieving 8 items compared to retrieving just 1. A common error pattern was producing duplicate values rather than retrieving all distinct targets.
Context copying in aggregation. In the Common Words Extraction task at 128K tokens, over 80% of the outputs from Yi-34B, LWM, and LongAlpaca consisted of verbatim copies of text from the context rather than actual word frequency analysis. This suggests these models default to a copying strategy when overwhelmed by long inputs.
Unreliable multi-hop tracing. On the Variable Tracking task, models frequently returned empty strings or variables from unrelated chains, indicating a fundamental inability to reliably follow reference chains across long contexts.
QA hallucination at scale. As context length increased, model performance on the QA tasks approached (and sometimes fell below) a no-context baseline, meaning models were essentially hallucinating answers rather than extracting them from the provided text.
Experiments with the Yi model family (6B, 9B, and 34B parameters, all trained on the same data with the same context length) showed that the 34B variant significantly outperformed the smaller models at every context length. This confirms that model capacity plays an important role in long-context performance, independent of the training context length.
Models from the LWM series were evaluated at different training context lengths (32K, 128K, 512K, and 1M). The 512K variant actually outperformed the 1M variant when tested at 256K tokens, suggesting that simply increasing the training context length does not guarantee better performance. The authors attributed this to insufficient adjustment of the RoPE base frequency when extending to 1M tokens.
RULER occupies a specific position in the broader landscape of long-context evaluation tools. The following comparison situates RULER relative to other prominent benchmarks.
| Benchmark | Type | Task Categories | Context Lengths | Key Difference from RULER |
|---|---|---|---|---|
| Needle in a Haystack | Synthetic | Retrieval only | Variable | Single task; RULER extends this into 8 retrieval variants |
| LongBench | Natural | Summarization, QA, coding, and others | Up to 128K | Uses natural documents; cannot control for parametric knowledge |
| ZeroScrolls | Natural | Summarization, QA | Up to 100K | Relies on natural long documents with subjective evaluation |
| InfiniteBench | Natural/Synthetic | Retrieval, QA, math, coding | 100K+ | Combines natural and synthetic tasks; less configurable |
| HELMET | Meta-benchmark | Multiple (includes RULER) | Variable | Aggregates multiple benchmarks including RULER for standardized comparison |
| RULER | Synthetic | Retrieval, tracing, aggregation, QA | 4K-128K | Fully synthetic, highly configurable, 13 tasks across 4 categories |
RULER's synthetic nature is both its greatest strength and its primary limitation. The synthetic design ensures reproducible, unambiguous evaluation with known ground-truth answers at any context length. However, the lack of natural language complexity means that RULER scores may not perfectly predict performance on real-world long-context tasks such as document summarization, legal analysis, or codebase understanding. The authors of RULER acknowledged this limitation, noting that RULER "is by no means comprehensive enough and it cannot replace the more preferred realistic tasks."
Since its release in April 2024, RULER has become one of the standard benchmarks in the long-context LLM evaluation ecosystem. Several developments illustrate its influence:
Stanford HELM Long Context. The HELM (Holistic Evaluation of Language Models) project at Stanford includes RULER as one of its core long-context benchmarks, using it alongside other evaluations to provide transparent and reproducible model comparisons.
Model development validation. AI21's Jamba 1.5 models (released in August 2024) were explicitly benchmarked on RULER, with the Jamba 1.5 Large and Mini both demonstrating strong performance. Jamba 1.5 is reported as one of the first models to maintain an effective context length of 256K tokens on RULER, far exceeding the effective lengths measured for the original 17 models.
NVIDIA's own model evaluation. NVIDIA's Nemotron 3 Super (120B) achieved a RULER score of 0.917, demonstrating the benchmark's continued relevance for evaluating newer, more capable models.
Research community adoption. RULER has been cited extensively in research on long-context architectures, position encoding methods, and context extension techniques. Its configurable nature makes it a convenient diagnostic tool for researchers developing new approaches to long-context modeling.
Following the original RULER benchmark, a successor called RULER V2 was developed to address remaining gaps in the evaluation framework. RULER V2 adopts a bottom-up approach, starting with basic retrieval and progressively increasing task difficulty toward complex reasoning. The design ensures that even models achieving perfect scores on simpler retrieval tasks encounter meaningful challenges at higher difficulty levels.
Key differences between RULER V1 and V2 include a focus on systematic difficulty scaling, where performance decreases as both context length and task complexity increase simultaneously. RULER V2 creates a more challenging evaluation space with clearer improvement targets, and its findings emphasize that models need to master fundamental retrieval abilities before they can be expected to handle complex long-context reasoning.
In March 2025, researchers published ONERULER ("One Ruler to Measure Them All"), a multilingual adaptation of the RULER benchmark. ONERULER extends the evaluation to 26 languages and includes seven synthetic tasks that test both retrieval and aggregation capabilities. It also introduces a novel variation of the NIAH task that accounts for the possibility of a nonexistent needle, testing whether models can correctly report that no relevant information exists in the context. ONERULER was published at COLM 2025.
While RULER represents a significant advance over the vanilla NIAH test, it has several acknowledged limitations.
Synthetic vs. realistic tasks. RULER's synthetic task design, while ensuring clean evaluation, does not capture the full complexity of real-world long-context use cases. Tasks like document summarization, cross-document reasoning, and long-form code understanding involve linguistic nuances, ambiguity, and compositional reasoning that synthetic benchmarks cannot fully model. Studies have found a lack of strong correlation between performance on RULER's synthetic tasks and performance on realistic long-context tasks.
English-only scope. The original RULER benchmark evaluates only English text, limiting its applicability to multilingual models and non-English use cases. The ONERULER extension addresses this gap, but the original benchmark remains English-only.
No generation quality assessment. RULER evaluates whether a model can locate, trace, or aggregate information, but it does not assess the quality of generated text in response to long contexts. Benchmarks like LongGenBench have shown that models performing well on RULER may still struggle with long-form text generation tasks.
Static difficulty. Although RULER's tasks are configurable, the standard evaluation uses fixed difficulty settings. A model that has been specifically optimized for RULER's default parameters could potentially achieve high scores without genuine long-context understanding.
Dated model evaluations. The original paper's results reflect the state of models as of early 2024. Models released after this date (including GPT-4o, Claude 3.5, Gemini 1.5 Pro, and Llama 3.1) have not been evaluated in the original paper, though some have been tested using the open-source RULER codebase by the community.