NoLiMa

Updated May 31, 2026

NoLiMa, short for "No Literal Matching," is a long-context benchmark for large language models that measures how well a model can find and use a single relevant fact buried in a long document when that fact shares almost no words with the question being asked. It was introduced by Ali Modarressi, Hanieh Deilamsalehy, Franck Dernoncourt, Trung Bui, Ryan A. Rossi, Seunghyun Yoon, and Hinrich Schütze in the paper "NoLiMa: Long-Context Evaluation Beyond Literal Matching," published at ICML 2025 (arXiv:2502.05167). ^[1]^[2] The benchmark is a direct response to a weakness in the popular needle-in-a-haystack test: when the planted fact and the query use the same keywords, a model can succeed by surface pattern matching rather than by reading and reasoning. NoLiMa removes that crutch, and the result is that frontier models advertised with 128K to 1M token windows degrade sharply long before they reach their claimed limits. ^[1]

Why the needle-in-a-haystack test is too easy

The needle-in-a-haystack (NIAH) test became the default way to advertise context window length. The recipe is simple: hide a sentence (the needle) somewhere inside a long stretch of unrelated text (the haystack), then ask the model a question whose answer is that sentence. Vendors often publish a grid of green cells showing near-perfect recall across hundreds of thousands of tokens, which suggests that the long context is fully usable.

The NoLiMa authors point out that most NIAH setups quietly hand the model an easy signal. The needle and the question tend to share literal words. If the needle says "the secret password is 7431" and the question asks "what is the secret password," the phrase "secret password" appears in both. A transformer's attention mechanism is very good at matching repeated tokens, so the model can locate the answer by keyword overlap without understanding the surrounding text. A high NIAH score in that case reflects lexical retrieval, not comprehension across a long context. ^[1]

NoLiMa is built to break this shortcut. Each question and its needle are designed to have minimal word overlap, so the model cannot lock onto a repeated keyword. To connect the two, it has to recognize a latent association: a relationship that is true in the world but never spelled out as a matching phrase in the text. ^[1]
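The difference between the two setups can be made concrete by counting shared content words. The snippet below is a minimal sketch, not the paper's procedure: the `literal_overlap` helper and the small stopword list are assumptions introduced here for illustration, and the sentences are the two examples quoted in this article.

```python
import re

# Small illustrative stopword list (an assumption, not taken from the paper).
STOPWORDS = {"a", "an", "the", "is", "to", "of", "in", "has", "been",
             "what", "which", "next"}

def content_words(text: str) -> set[str]:
    """Lowercase the text, keep alphanumeric tokens, and drop stopwords."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return {t for t in tokens if t not in STOPWORDS}

def literal_overlap(needle: str, question: str) -> set[str]:
    """Content words that appear in both the needle and the question."""
    return content_words(needle) & content_words(question)

# Classic NIAH pair: the question repeats the needle's key phrase.
niah = literal_overlap("The secret password is 7431.",
                       "What is the secret password?")

# NoLiMa-style pair: no content word is shared.
nolima = literal_overlap(
    "Actually, Yuki lives next to the Semper Opera House.",
    "Which character has been to Dresden?",
)

print(sorted(niah))    # ['password', 'secret']
print(sorted(nolima))  # []
```

With an empty overlap set, there is no repeated token for attention to match, so the link between question and needle has to come from the association itself.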

How the needles are built

The core trick is to phrase the planted fact and the question so they refer to the same thing through different vocabulary. The paper's running example uses a needle that reads, "Actually, Yuki lives next to the Semper Opera House," paired with the question "Which character has been to Dresden?" Answering it requires knowing that the Semper Opera House is in Dresden, a fact the passage never states. The model has to retrieve the needle and then make the geographic association on its own. ^[1]^[3]

The authors call each association step a latent hop. A one-hop item needs a single inference (Semper Opera House implies Dresden). A two-hop item chains associations together, for example linking a region to a city and then the city to a landmark, which raises the difficulty further. The needle set is organized so that the number of hops and the type of association can be analyzed separately, and the study reports that performance falls as the number of required hops grows. ^[1]^[3]
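One way to picture hop counting is as a walk over a small table of world knowledge. The sketch below is only an illustration under stated assumptions: the `LOCATED_IN` table and the `hops_between` helper are hypothetical, the first entry comes from the example above, and the second (Dresden lies in the German state of Saxony) is added here to show a two-hop chain.

```python
# Toy world-knowledge table: each entity maps to the place that contains it.
# The first entry is the article's example; the second is an added
# illustration (Dresden is the capital of the German state of Saxony).
LOCATED_IN = {
    "Semper Opera House": "Dresden",
    "Dresden": "Saxony",
}

def hops_between(mention, target, max_hops=3):
    """Return how many association steps link `mention` to `target`, or None."""
    current = mention
    for hops in range(1, max_hops + 1):
        current = LOCATED_IN.get(current)
        if current is None:
            return None  # the chain ended without reaching the target
        if current == target:
            return hops
    return None

print(hops_between("Semper Opera House", "Dresden"))  # 1
print(hops_between("Semper Opera House", "Saxony"))   # 2
print(hops_between("Semper Opera House", "Paris"))    # None
```

In the benchmark none of these links appear in the context, so the model must supply every step from its own knowledge after it has found the needle.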

The haystacks are assembled from snippets of ten open-licensed books, with passages randomly selected and concatenated into long irrelevant contexts. A single needle is inserted at a controlled position, and the same item is tested at many context lengths so that degradation can be tracked as the surrounding text grows. ^[1] The benchmark uses 58 question-needle pairs in total, which expand into roughly 7,540 individual tests at each context length once placements and variants are accounted for. ^[1] The public release also includes several needle variants, such as direct, multiple-choice, distractor-included, and chain-of-thought formats. ^[2]
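The construction described above can be sketched in a few lines. This is not the repository's code: the function names are hypothetical, the snippets are placeholder sentences rather than text from the actual books, and whitespace-separated words stand in for tokens.

```python
import random

def build_haystack(snippets, target_words, rng):
    """Concatenate randomly chosen snippets until the word budget is reached."""
    words = []
    while len(words) < target_words:
        words.extend(rng.choice(snippets).split())
    return words[:target_words]

def insert_needle(haystack_words, needle, depth):
    """Insert the needle at a fractional depth (0.0 = start, 1.0 = end)."""
    position = round(depth * len(haystack_words))
    return haystack_words[:position] + needle.split() + haystack_words[position:]

# Placeholder snippets standing in for passages from the open-licensed books.
snippets = [
    "The rain had stopped by the time the carriage reached the gate.",
    "She counted the coins twice before closing the drawer.",
    "Nobody in the village remembered when the bridge was built.",
]
needle = "Actually, Yuki lives next to the Semper Opera House."
rng = random.Random(0)  # fixed seed so the same haystack can be reused

haystack = build_haystack(snippets, target_words=200, rng=rng)
for depth in (0.0, 0.5, 1.0):
    context = insert_needle(haystack, needle, depth)
    # Print the depth, the word index where the needle starts, and the length.
    print(depth, context.index("Actually,"), len(context))
```

Reusing the same haystack while moving only the needle keeps position as the single variable within a given context length. Repeating the procedure with a larger `target_words` gives the length sweep.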

Evaluation setup and metrics

NoLiMa evaluates models that claim to support at least 128K tokens. Each model is tested at short lengths (a few hundred to about 1,000 tokens) to establish a base score, then at progressively longer contexts: 1K, 2K, 4K, 8K, 16K, 32K, and, for a few models, 64K and 128K. ^[1]^[2]

Two numbers summarize each model. The base score is accuracy in the short-context setting, where retrieval is not yet a bottleneck. The effective length is the longest context at which the model still keeps at least 85 percent of that base score. Effective length is the headline figure, because it captures the gap between a model's advertised window and the span over which it actually stays reliable on this associative task. ^[2]
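The effective-length rule is simple enough to state as code. The function below is a minimal sketch of the definition as worded in this article, not the benchmark's own scoring script; the name `effective_length` and its signature are assumptions. The sample numbers are the GPT-4o figures listed in this article's results table.

```python
def effective_length(base_score, scores_by_length, threshold=0.85):
    """Longest tested length whose score is at least threshold * base_score.

    Returns None if no tested length clears the bar. This sketch reads the
    definition literally and does not require shorter lengths to pass too.
    """
    cutoff = threshold * base_score
    passing = [length for length, score in scores_by_length.items()
               if score >= cutoff]
    return max(passing) if passing else None

# GPT-4o: base score 99.3, so the cutoff is 0.85 * 99.3 = 84.405.
gpt4o = {4_000: 95.7, 8_000: 89.2, 16_000: 81.6, 32_000: 69.7}
print(effective_length(99.3, gpt4o))  # 8000
```

The 8K score of 89.2 clears the cutoff of about 84.4 and the 16K score of 81.6 does not, which is why the model's effective length is reported as 8K.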

Headline results

The original paper evaluated thirteen models and reported a stark pattern. Models that look excellent in short contexts fall off quickly as the haystack grows. In the paper's main set, out of twelve models with full curves, ten dropped to half or less of their base score by 32K tokens. ^[1] GPT-4o, one of the stronger performers, slid from an almost perfect base score of 99.3 percent to 69.7 percent at 32K, and its effective length came out at just 8K tokens despite a claimed 128K window. ^[1] The authors attribute the decline to the attention mechanism struggling to surface the right span once the easy keyword cue is gone and the irrelevant text dominates. ^[1]

The project's public leaderboard has since grown to more than twenty models, including newer releases. The table below shows representative results, with base score, the effective length (longest context holding at least 85 percent of base), and accuracy at selected lengths. All figures are from the NoLiMa repository and paper. ^[1]^[2]

Model	Claimed length	Effective length	Base score	4K	8K	16K	32K
GPT-4.1	1M	16K	97.0	91.7	87.5	84.9	79.8
GPT-4o	128K	8K	99.3	95.7	89.2	81.6	69.7
Gemini 1.5 Pro	2M	2K	92.6	75.4	63.9	55.5	48.2
Llama 3.3 70B	128K	2K	97.3	81.5	72.1	59.5	42.7
Llama 3.1 405B	128K	2K	94.7	74.5	60.1	48.4	38.0
Claude 3.5 Sonnet	200K	4K	87.6	77.6	61.7	45.7	29.8
Mistral Large 2	128K	2K	87.9	73.3	51.5	32.6	18.7
GPT-4o mini	128K	<1K	84.9	44.1	32.6	20.6	13.7

Scores are percentages. Effective length is the longest tested context where the model keeps at least 85 percent of its base score. Source: NoLiMa paper and repository. ^[1]^[2]

A few patterns stand out. The effective lengths are short across the board, often a small fraction of the advertised window: 8K for GPT-4o against a claimed 128K, 2K for several large models, and under 1K for some smaller ones. The decline starts early and keeps going: most models in the table are already below 85 percent of their base score at 4K, and every one of them continues to lose accuracy through 32K. Larger parameter counts do not buy much protection; some of the biggest models in the set, such as Llama 3.1 405B, still collapse on long contexts. ^[1]^[2]
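As a rough check on the table above, the sketch below computes how much of its base score each listed model retains at 32K. The rows are copied from the table; the `retention` helper and the 0.5 cutoff for "half of base" are choices made here for illustration, and the rows cover only this article's representative subset, not the paper's full model set.

```python
# (model, base score, accuracy at 32K), copied from the table above.
ROWS = [
    ("GPT-4.1", 97.0, 79.8),
    ("GPT-4o", 99.3, 69.7),
    ("Gemini 1.5 Pro", 92.6, 48.2),
    ("Llama 3.3 70B", 97.3, 42.7),
    ("Llama 3.1 405B", 94.7, 38.0),
    ("Claude 3.5 Sonnet", 87.6, 29.8),
    ("Mistral Large 2", 87.9, 18.7),
    ("GPT-4o mini", 84.9, 13.7),
]

def retention(base, score):
    """Fraction of the base score that survives at a given length."""
    return score / base

for model, base, at_32k in ROWS:
    print(f"{model:18s} {retention(base, at_32k):.0%}")

below_half = [model for model, base, at_32k in ROWS
              if retention(base, at_32k) < 0.5]
print(len(below_half), "of", len(ROWS), "rows fall below half of base at 32K")
```

On these eight rows, five models retain less than half of their base score at 32K, and only GPT-4.1 stays above 80 percent.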

Reasoning models and NoLiMa-Hard

To probe the hardest cases, the authors built NoLiMa-Hard from the ten most difficult question-needle pairs and ran it against reasoning-focused models that use extended thinking. These models do better than standard chat models, but they are not immune. Reasoning systems that score near 100 percent at short lengths still fall well off their base by 32K, with several dropping below half. ^[1]^[2] Adding chain-of-thought prompting helps at moderate lengths but does not prevent the long-context decline. ^[1] The takeaway is that the difficulty is not purely a reasoning gap that more deliberate thinking can close; it also reflects how well the model can locate the relevant span in the first place when no literal cue points to it.

How NoLiMa differs from RULER, InfiniteBench, and LongBench

NoLiMa is not the first benchmark to argue that vanilla NIAH overstates long-context ability, and it is useful to see what it adds.

RULER generates synthetic tasks of configurable length and complexity, including multi-key retrieval, variable tracking, and aggregation, and it consistently shows that effective context lengths fall short of claimed ones. The NoLiMa authors note that RULER and similar suites still tend to include literal overlap between the relevant content and the query, even when distractors are added, so attention can lean on repeated patterns. NoLiMa's contribution is to strip that overlap out by design and force an associative hop. ^[1]

InfiniteBench and LongBench take a different approach. They are broad suites of long-context tasks such as long-document question answering, summarization, code completion, and retrieval over book-length inputs. They test many capabilities at once on realistic material, but because the inputs and tasks are heterogeneous, it is harder to isolate the specific failure NoLiMa targets. NoLiMa is deliberately narrow: a single, controlled probe of associative retrieval where length is the only thing changing across conditions, which makes the degradation curve easy to read. ^[1]

The table below sketches the contrast.

Benchmark	Task style	Literal overlap with query	Main signal
Standard NIAH	Plant and retrieve one fact	Usually high	Keyword retrieval
RULER	Synthetic, configurable retrieval and tracking	Often present	Effective length under load
InfiniteBench / LongBench	Broad realistic long-document tasks	Varies by task	General long-context utility
NoLiMa	Plant and retrieve via latent association	Minimal by design	Associative retrieval vs. length

What it implies for long-context claims

The practical message is that an advertised context window is a capacity figure, not a guarantee of reliable use across that span. A model rated for 128K or 1M tokens may stay dependable only up to a few thousand tokens on a task that requires connecting ideas without shared keywords. For systems that feed long documents to a model and expect it to reason over the whole input, such as retrieval pipelines and document assistants, the gap matters. It suggests that placing many candidate passages in a single prompt and trusting the model to find the relevant one can fail when the query is phrased differently from the source, which is common in real questions. ^[1] This is one reason careful chunking and ranking in retrieval-augmented generation systems often outperforms simply enlarging the context and dumping everything in.

Limitations

NoLiMa is a focused probe rather than a complete picture of long-context behavior. It measures one ability, associative single-fact retrieval, and a model that struggles on it may still do well on tasks with clearer lexical cues or on broad suites like LongBench. The needle set is modest in size, with 58 question-needle pairs, and the associations are drawn from world knowledge such as places and landmarks, so results can interact with what a model already knows rather than purely with its context handling. The haystacks are concatenated book snippets, which differ from the structured documents seen in many real applications. The reported figures also track a public leaderboard that grows as new models are added, so specific numbers can change over time even though the overall pattern of steep degradation has held. ^[1]^[2]

References

1. Modarressi, A., Deilamsalehy, H., Dernoncourt, F., Bui, T., Rossi, R. A., Yoon, S., & Schütze, H. (2025). NoLiMa: Long-Context Evaluation Beyond Literal Matching. arXiv:2502.05167. https://arxiv.org/abs/2502.05167
2. Adobe Research. NoLiMa official repository (code, data, and leaderboard). GitHub. https://github.com/adobe-research/NoLiMa
3. NoLiMa: Long-Context Evaluation Beyond Literal Matching (HTML version). arXiv. https://arxiv.org/html/2502.05167v1
4. NoLiMa: Long-Context Evaluation Beyond Literal Matching. OpenReview (ICML 2025). https://openreview.net/forum?id=0OshX1hiSa
5. ICML 2025 Poster: NoLiMa: Long-Context Evaluation Beyond Literal Matching. https://icml.cc/virtual/2025/poster/46685
6. NoLiMa: Long-Context Evaluation Beyond Literal Matching. Hugging Face Papers. https://huggingface.co/papers/2502.05167
7. NoLiMa: Long-Context Evaluation Beyond Literal Matching. Semantic Scholar. https://www.semanticscholar.org/paper/60d3856bcf01c382a7a1b41aa6d8c95665397779
8. The Decoder. AI language models struggle to connect the dots in long texts, study finds (2025). https://the-decoder.com/ai-language-models-struggle-to-connect-the-dots-in-long-texts-study-finds/
9. Portkey. NoLiMa: Long-Context Evaluation Beyond Literal Matching, summary. https://portkey.ai/blog/evaluating-long-context-llms/
10. NoLiMa Benchmark Leaderboard. LLM-Stats. https://llm-stats.com/benchmarks/nolima
11. Hsieh, C.-P., et al. (2024). RULER: What's the Real Context Size of Your Long-Context Language Models? arXiv:2404.06654. https://arxiv.org/abs/2404.06654
