NoLiMa
Last reviewed
May 31, 2026
Sources
11 citations
Review status
Source-backed
Revision
v1 · 2,084 words
NoLiMa, short for "No Literal Matching," is a long-context benchmark for large language models that measures how well a model can find and use a single relevant fact buried in a long document when that fact shares almost no words with the question being asked. It was introduced by Ali Modarressi, Hanieh Deilamsalehy, Franck Dernoncourt, Trung Bui, Ryan A. Rossi, Seunghyun Yoon, and Hinrich Schütze in the paper "NoLiMa: Long-Context Evaluation Beyond Literal Matching," published at ICML 2025 (arXiv:2502.05167). [1][2] The benchmark is a direct response to a weakness in the popular needle-in-a-haystack test: when the planted fact and the query use the same keywords, a model can succeed by surface pattern matching rather than by reading and reasoning. NoLiMa removes that crutch, and the result is that frontier models advertised with 128K to 1M token windows degrade sharply long before they reach their claimed limits. [1]
The needle-in-a-haystack (NIAH) test became the default way to advertise context window length. The recipe is simple: hide a sentence (the needle) somewhere inside a long stretch of unrelated text (the haystack), then ask the model a question whose answer is that sentence. Vendors often publish a grid of green cells showing near-perfect recall across hundreds of thousands of tokens, which suggests that the long context is fully usable.
The NoLiMa authors point out that most NIAH setups quietly hand the model an easy signal. The needle and the question tend to share literal words. If the needle says "the secret password is 7431" and the question asks "what is the secret password," the phrase "secret password" appears in both. A transformer's attention mechanism is very good at matching repeated tokens, so the model can locate the answer by keyword overlap without understanding the surrounding text. A high NIAH score in that case reflects lexical retrieval, not comprehension across a long context. [1]
NoLiMa is built to break this shortcut. Each question and its needle are designed to have minimal word overlap, so the model cannot lock onto a repeated keyword. To connect the two, it has to recognize a latent association: a relationship that is true in the world but never spelled out as a matching phrase in the text. [1]
The core trick is to phrase the planted fact and the question so they refer to the same thing through different vocabulary. The paper's running example uses a needle that reads, "Actually, Yuki lives next to the Semper Opera House," paired with the question "Which character has been to Dresden?" Answering it requires knowing that the Semper Opera House is in Dresden, a fact the passage never states. The model has to retrieve the needle and then make the geographic association on its own. [1][3]
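The contrast between a classic needle and a NoLiMa needle can be made concrete with a rough lexical-overlap check. The sketch below is illustrative only: the regex tokenizer, the small stop-word list, and the function names `content_words` and `shared_words` are assumptions made for this example and are not part of the NoLiMa release.

```python
import re

# Illustrative stop-word list (an assumption for this sketch, not from the paper).
STOP_WORDS = {"a", "an", "the", "is", "to", "of", "has", "been", "what", "which"}


def content_words(text: str) -> set[str]:
    """Lowercase the text and return its tokens, minus stop words."""
    return {w for w in re.findall(r"[a-z0-9]+", text.lower()) if w not in STOP_WORDS}


def shared_words(needle: str, question: str) -> set[str]:
    """Return the content words that appear in both the needle and the question."""
    return content_words(needle) & content_words(question)


# Classic NIAH pair: the key phrase appears on both sides.
niah = shared_words("The secret password is 7431", "What is the secret password?")

# NoLiMa pair from the paper's running example: no content word is shared.
nolima = shared_words(
    "Actually, Yuki lives next to the Semper Opera House.",
    "Which character has been to Dresden?",
)

print(sorted(niah))    # ['password', 'secret']
print(sorted(nolima))  # []
```

With the password pair, a model can find the needle by matching "secret password". With the Dresden pair, nothing in the question points at the needle lexically, so the link has to come from world knowledge.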
The authors call each association step a latent hop. A one-hop item needs a single inference (Semper Opera House implies Dresden). A two-hop item chains associations together, for example linking a region to a city and then the city to a landmark, which raises the difficulty further. The needle set is organized so that the number of hops and the type of association can be analyzed separately, and the study reports that performance falls as the number of required hops grows. [1][3]
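One way to picture hop counting is as a walk over a small association graph. The sketch below is a toy illustration: the `LOCATED_IN` mapping and the `count_hops` helper are hypothetical, and the Saxony link is added here only to show what a second hop looks like (Dresden is the capital of Saxony). A real model has to recall such links from its own knowledge, because the haystack never states them.

```python
from typing import Optional

# Hypothetical world-knowledge links for illustration; none of this text
# appears in the haystack, so the model must supply it on its own.
LOCATED_IN = {
    "Semper Opera House": "Dresden",
    "Dresden": "Saxony",
}


def count_hops(start: str, target: str, edges: dict[str, str], max_hops: int = 3) -> Optional[int]:
    """Count association steps needed to get from start to target, or None if unreachable."""
    current = start
    for hops in range(1, max_hops + 1):
        current = edges.get(current)
        if current is None:
            return None
        if current == target:
            return hops
    return None


print(count_hops("Semper Opera House", "Dresden", LOCATED_IN))  # 1
print(count_hops("Semper Opera House", "Saxony", LOCATED_IN))   # 2
```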
The haystacks are assembled from snippets of ten open-licensed books, with passages randomly selected and concatenated into long irrelevant contexts. A single needle is inserted at a controlled position, and the same item is tested at many context lengths so that degradation can be tracked as the surrounding text grows. [1] The benchmark uses 58 question-needle pairs in total, which expand into roughly 7,540 individual tests at each context length once placements and variants are accounted for. [1] The public release also includes several needle variants, such as direct, multiple-choice, distractor-included, and chain-of-thought formats. [2]
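The construction described above can be sketched in a few lines. This is a minimal approximation, not the benchmark's actual pipeline: it counts whitespace-separated words instead of model tokens, it inserts the needle at a word position, and the function name `build_haystack` and its parameters are assumptions made for this example.

```python
import random


def build_haystack(snippets, needle, target_words, depth, seed=0):
    """Concatenate shuffled snippets to target_words words and insert the needle.

    depth is a relative position: 0.0 puts the needle at the start, 1.0 at the end.
    Returns the full text and the word index where the needle begins.
    """
    if not 0.0 <= depth <= 1.0:
        raise ValueError("depth must be between 0.0 and 1.0")
    pool = [s for s in snippets if s.split()]  # ignore empty snippets
    if not pool:
        raise ValueError("at least one non-empty snippet is required")

    rng = random.Random(seed)  # fixed seed keeps the haystack reproducible
    rng.shuffle(pool)

    # Cycle through the shuffled snippets until the filler is long enough.
    words = []
    i = 0
    while len(words) < target_words:
        words.extend(pool[i % len(pool)].split())
        i += 1
    words = words[:target_words]

    position = round(depth * len(words))
    combined = words[:position] + needle.split() + words[position:]
    return " ".join(combined), position
```

Calling the function with the same snippets and needle but a larger `target_words` reproduces the central idea of the design: the item stays fixed while the surrounding irrelevant text grows.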
NoLiMa evaluates models that claim to support at least 128K tokens. Each model is first tested at short lengths (a few hundred to about 1,000 tokens) to establish a base score, and its accuracy is then reported across a ladder of context lengths: 1K, 2K, 4K, 8K, 16K, 32K, and, for a few models, 64K and 128K. [1][2]
Two numbers summarize each model. The base score is accuracy in the short-context setting, where retrieval is not yet a bottleneck. The effective length is the longest context at which the model still keeps at least 85 percent of that base score. Effective length is the headline figure, because it captures the gap between a model's advertised window and the span over which it actually stays reliable on this associative task. [2]
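The effective-length rule is simple enough to compute by hand, and a short sketch makes the threshold explicit. The function name `effective_length` and the dictionary layout are assumptions made for this example. The scores are the GPT-4o and GPT-4.1 figures from this article's results table, which lists only the 4K to 32K columns.

```python
def effective_length(base_score, scores_by_length, ratio=0.85):
    """Return the longest tested length whose score is at least ratio * base_score.

    scores_by_length maps context length (in thousands of tokens) to accuracy.
    Returns None if no tested length meets the threshold.
    """
    threshold = ratio * base_score
    passing = [length for length, score in scores_by_length.items() if score >= threshold]
    return max(passing) if passing else None


# GPT-4o: base 99.3, so the threshold is 0.85 * 99.3, about 84.4.
gpt_4o = {4: 95.7, 8: 89.2, 16: 81.6, 32: 69.7}
print(effective_length(99.3, gpt_4o))  # 8

# GPT-4.1: base 97.0, so the threshold is 0.85 * 97.0, about 82.5.
gpt_4_1 = {4: 91.7, 8: 87.5, 16: 84.9, 32: 79.8}
print(effective_length(97.0, gpt_4_1))  # 16
```

GPT-4o clears its threshold at 8K (89.2) but not at 16K (81.6), which gives the 8K effective length reported for it.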
The original paper evaluated thirteen models and reported a stark pattern. Models that look excellent in short contexts fall off quickly as the haystack grows. In the paper's main set, out of twelve models with full curves, ten dropped to half or less of their base score by 32K tokens. [1] GPT-4o, one of the stronger performers, slid from an almost perfect base score of 99.3 percent to 69.7 percent at 32K, and its effective length came out at just 8K tokens despite a claimed 128K window. [1] The authors attribute the decline to the attention mechanism struggling to surface the right span once the easy keyword cue is gone and the irrelevant text dominates. [1]
The project's public leaderboard has since grown to more than twenty models, including newer releases. The table below shows representative results, with base score, the effective length (longest context holding at least 85 percent of base), and accuracy at selected lengths. All figures are from the NoLiMa repository and paper. [1][2]
| Model | Claimed length | Effective length | Base score | 4K | 8K | 16K | 32K |
|---|---|---|---|---|---|---|---|
| GPT-4.1 | 1M | 16K | 97.0 | 91.7 | 87.5 | 84.9 | 79.8 |
| GPT-4o | 128K | 8K | 99.3 | 95.7 | 89.2 | 81.6 | 69.7 |
| Gemini 1.5 Pro | 2M | 2K | 92.6 | 75.4 | 63.9 | 55.5 | 48.2 |
| Llama 3.3 70B | 128K | 2K | 97.3 | 81.5 | 72.1 | 59.5 | 42.7 |
| Llama 3.1 405B | 128K | 2K | 94.7 | 74.5 | 60.1 | 48.4 | 38.0 |
| Claude 3.5 Sonnet | 200K | 4K | 87.6 | 77.6 | 61.7 | 45.7 | 29.8 |
| Mistral Large 2 | 128K | 2K | 87.9 | 73.3 | 51.5 | 32.6 | 18.7 |
| GPT-4o mini | 128K | <1K | 84.9 | 44.1 | 32.6 | 20.6 | 13.7 |
Scores are percentages. Effective length is the longest tested context where the model keeps at least 85 percent of its base score. Source: NoLiMa paper and repository. [1][2]
A few patterns stand out. The effective lengths are short across the board, often a small fraction of the advertised window: 8K for GPT-4o against a claimed 128K, 2K for several large models, and under 1K for some smaller ones. The decline is steep rather than gradual: six of the eight models in the table have already lost more than a quarter of their base score by 8K, and five retain less than half of it at 32K. Larger parameter counts do not buy much protection; some of the biggest models in the set, such as Llama 3.1 405B, still collapse on long contexts. [1][2]
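How much of its base score each model keeps at 32K can be read off the table with a quick calculation. The sketch below copies the base and 32K columns from the table above; the names `RESULTS` and `retained_fraction` are assumptions made for this example.

```python
# (base score, accuracy at 32K), copied from the results table in this article.
RESULTS = {
    "GPT-4.1": (97.0, 79.8),
    "GPT-4o": (99.3, 69.7),
    "Gemini 1.5 Pro": (92.6, 48.2),
    "Llama 3.3 70B": (97.3, 42.7),
    "Llama 3.1 405B": (94.7, 38.0),
    "Claude 3.5 Sonnet": (87.6, 29.8),
    "Mistral Large 2": (87.9, 18.7),
    "GPT-4o mini": (84.9, 13.7),
}


def retained_fraction(base, score):
    """Fraction of the short-context base score that survives at a given length."""
    return score / base


# Models that keep less than half of their base score at 32K.
below_half = [
    name for name, (base, at_32k) in RESULTS.items()
    if retained_fraction(base, at_32k) < 0.5
]

for name, (base, at_32k) in RESULTS.items():
    print(f"{name:18s} keeps {retained_fraction(base, at_32k):.0%} of base at 32K")

print(len(below_half))  # 5
```

Among the eight rows shown, only GPT-4.1, GPT-4o, and Gemini 1.5 Pro keep more than half of their base score at 32K, and Gemini 1.5 Pro only barely.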
To probe the hardest cases, the authors built NoLiMa-Hard from the ten most difficult question-needle pairs and ran it against reasoning-focused models that use extended thinking. These models do better than standard chat models, but they are not immune. Reasoning systems that score near 100 percent at short lengths still fall well off their base by 32K, with several dropping below half. [1][2] Adding chain-of-thought prompting helps at moderate lengths but does not prevent the long-context decline. [1] The takeaway is that the difficulty is not purely a reasoning gap that more deliberate thinking can close; it also reflects how well the model can locate the relevant span in the first place when no literal cue points to it.
NoLiMa is not the first benchmark to argue that vanilla NIAH overstates long-context ability, and it is useful to see what it adds.
RULER generates synthetic tasks of configurable length and complexity, including multi-key retrieval, variable tracking, and aggregation, and it consistently shows that effective context lengths fall short of claimed ones. The NoLiMa authors note that RULER and similar suites still tend to include literal overlap between the relevant content and the query, even when distractors are added, so attention can lean on repeated patterns. NoLiMa's contribution is to strip that overlap out by design and force an associative hop. [1]
InfiniteBench and LongBench take a different approach. They are broad suites of long-context tasks such as long-document question answering, summarization, code completion, and retrieval over book-length inputs. They test many capabilities at once on realistic material, but because the inputs and tasks are heterogeneous, it is harder to isolate the specific failure NoLiMa targets. NoLiMa is deliberately narrow: a single, controlled probe of associative retrieval where length is the only thing changing across conditions, which makes the degradation curve easy to read. [1]
The table below sketches the contrast.
| Benchmark | Task style | Literal overlap with query | Main signal |
|---|---|---|---|
| Standard NIAH | Plant and retrieve one fact | Usually high | Keyword retrieval |
| RULER | Synthetic, configurable retrieval and tracking | Often present | Effective length under load |
| InfiniteBench / LongBench | Broad realistic long-document tasks | Varies by task | General long-context utility |
| NoLiMa | Plant and retrieve via latent association | Minimal by design | Associative retrieval vs. length |
The practical message is that an advertised context window is a capacity figure, not a guarantee of reliable use across that span. A model rated for 128K or 1M tokens may stay dependable only up to a few thousand tokens on a task that requires connecting ideas without shared keywords. For systems that feed long documents to a model and expect it to reason over the whole input, such as retrieval pipelines and document assistants, the gap matters. It suggests that placing many candidate passages in a single prompt and trusting the model to find the relevant one can fail when the query is phrased differently from the source, which is common in real questions. [1] This is one reason careful chunking and ranking in retrieval-augmented generation systems often outperform simply enlarging the context and dumping everything in.
NoLiMa is a focused probe rather than a complete picture of long-context behavior. It measures one ability, associative single-fact retrieval, and a model that struggles on it may still do well on tasks with clearer lexical cues or on broad suites like LongBench. The needle set is modest in size, with 58 question-needle pairs, and the associations are drawn from world knowledge such as places and landmarks, so results can interact with what a model already knows rather than purely with its context handling. The haystacks are concatenated book snippets, which differ from the structured documents seen in many real applications. The reported figures also track a public leaderboard that grows as new models are added, so specific numbers can change over time even though the overall pattern of steep degradation has held. [1][2]