Needle in a Haystack (NIAH) is a benchmark test designed to evaluate the ability of large language models (LLMs) to retrieve specific information embedded within long input contexts. First introduced by Greg Kamradt in November 2023, the test inserts a known fact (the "needle") at varying positions within a large body of text (the "haystack") and then asks the model to recall that fact. The resulting performance data is typically visualized as a heatmap, with document depth on one axis and context length on the other, revealing where a model's recall begins to break down.
NIAH became one of the most widely cited evaluations for long-context LLMs during 2023 and 2024, adopted by companies such as OpenAI, Anthropic, and Google DeepMind to showcase the retrieval capabilities of their models. While the test proved useful for identifying gross failures in context utilization, critics have noted that its simplicity makes it an insufficient measure of true long-context understanding. Newer benchmarks such as RULER, BABILong, and NoLiMa have since expanded on the original NIAH methodology to test more complex reasoning and retrieval behaviors.
The Needle in a Haystack test originated from Greg Kamradt, an AI practitioner and content creator who shared the concept on X (formerly Twitter) on November 8, 2023. Kamradt's initial motivation was to pressure-test GPT-4 Turbo, which OpenAI had just released with a 128,000-token context window. While the expanded context window was a major selling point, Kamradt wanted to determine whether GPT-4 Turbo could actually use all of that context effectively.
To conduct the test, Kamradt inserted a single, out-of-place sentence into a corpus of essays by Paul Graham. The needle sentence was:
"The best thing to do in San Francisco is eat a sandwich and sit in Dolores Park on a sunny day."
This statement has no connection to Paul Graham's essays about startups, technology, and philosophy, making it easy to determine whether a model genuinely retrieved the embedded fact or simply hallucinated an answer. After inserting the needle, Kamradt prompted the model with the question: "What is the best thing to do in San Francisco?" The model was instructed to answer using only the provided context.
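The setup above can be sketched in a few lines of Python. This is an illustrative reconstruction, not code from Kamradt's repository; the function names and the word-level depth arithmetic are assumptions (real implementations position the needle by token count).

```python
# Sketch of the core NIAH setup: insert a needle sentence at a given depth
# in a haystack of filler text, then build the retrieval prompt.
# Names and word-based positioning are illustrative assumptions.

NEEDLE = ("The best thing to do in San Francisco is eat a sandwich "
          "and sit in Dolores Park on a sunny day.")
QUESTION = "What is the best thing to do in San Francisco?"

def insert_needle(haystack: str, needle: str, depth_percent: float) -> str:
    """Insert the needle roughly depth_percent of the way through the text."""
    words = haystack.split()
    position = int(len(words) * depth_percent / 100)
    return " ".join(words[:position] + [needle] + words[position:])

def build_prompt(context: str, question: str) -> str:
    """Instruct the model to answer only from the provided context."""
    return (f"{context}\n\n"
            f"Answer the following question using only the context above.\n"
            f"Question: {question}")

haystack = "Startups are hard. " * 500   # stand-in for Paul Graham essays
context = insert_needle(haystack, NEEDLE, depth_percent=50)
prompt = build_prompt(context, QUESTION)
```

The prompt is then sent to the model under test, and the response is checked for the needle's content.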
Kamradt discussed his methodology and results in detail on his YouTube channel and released the source code as an open-source repository on GitHub under the name LLMTest_NeedleInAHaystack.
Several factors contributed to the test's rapid adoption. First, it was extremely simple to understand and replicate. Any developer could run the test against different models by swapping API keys. Second, the visual heatmap output was immediately intuitive: green cells indicated successful retrieval, red cells indicated failure, and the overall pattern told a clear story about where a model's recall degraded. Third, the test arrived at a moment when context window sizes were expanding rapidly, and there was genuine uncertainty about whether longer windows translated to better performance. Kamradt's work demonstrated that having a large context window did not guarantee that a model could use it all reliably.
The NIAH test follows a straightforward protocol:

1. Assemble a haystack of filler text (in the original, Paul Graham's essays) truncated or repeated to a target token length.
2. Insert the needle sentence at a specified depth within the haystack.
3. Prompt the model with a question whose answer is the needle, instructing it to answer using only the provided context.
4. Score whether the model's response reproduces the needle's content.
5. Repeat across combinations of context length and needle depth.
The key innovation of the NIAH test is that it systematically varies two parameters:
| Parameter | Description | Typical Range |
|---|---|---|
| Context Length | Total number of tokens in the haystack (including the needle) | 1,000 to the model's maximum (e.g., 128K, 200K, or 1M tokens) |
| Document Depth | Position of the needle within the haystack, expressed as a percentage | 0% (beginning) to 100% (end), typically in 10% increments |
By testing every combination of context length and document depth, the experimenter produces a two-dimensional grid of results. For example, testing 15 context lengths and 11 depth positions yields 165 individual test runs per model.
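The sweep described above is just a Cartesian product of the two parameter lists. The sketch below uses a hypothetical linear schedule of 15 context lengths from 1K to 128K tokens; the exact schedule varies by implementation.

```python
# Enumerate the 2D NIAH test grid: 15 context lengths x 11 depth positions
# yields the 165 runs mentioned above. The length schedule is illustrative.
from itertools import product

context_lengths = [round(1_000 + i * (128_000 - 1_000) / 14) for i in range(15)]
depth_percents = [i * 10 for i in range(11)]   # 0%, 10%, ..., 100%

test_grid = list(product(context_lengths, depth_percents))
len(test_grid)   # 165 runs per model
```

Each (length, depth) pair produces one cell of the eventual heatmap.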
The original NIAH test used a simple binary scoring system: the model either returned the correct answer or it did not. Kamradt visualized the results as a heatmap where:

- green cells indicated successful retrieval,
- red cells indicated failed retrieval, and
- the two axes showed context length and document depth.
These heatmaps became the signature visualization of the NIAH test, widely shared on social media and in model announcement blog posts. The visual format made it easy to spot patterns, such as a model failing specifically when the needle was in the middle of a very long document or when the needle was near the beginning of a context that approached the model's maximum length.
Later implementations refined the scoring. For instance, Arize AI's adaptation of the test used automated matching to search for specific keywords or phrases in the model's output rather than relying on subjective grading. This approach reduced evaluation time from days to hours and improved consistency.
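Automated scoring of this kind reduces to a substring check. The helper below is a minimal sketch in the spirit of keyword matching; the keyword choices and function name are illustrative, not taken from Arize AI's implementation.

```python
# Binary pass/fail scoring via keyword matching: does the model's answer
# contain a phrase that only the needle could supply? Illustrative sketch.

def score_response(response: str, keywords: list[str]) -> bool:
    """Return True if any needle-specific keyword appears in the answer."""
    text = response.lower()
    return any(kw.lower() in text for kw in keywords)

keywords = ["dolores park", "eat a sandwich"]
score_response("You should eat a sandwich in Dolores Park.", keywords)  # True
score_response("The context does not say.", keywords)                   # False
```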
Kamradt's open-source implementation supports several configurable parameters, including the needle text, the retrieval question, the source directory for the haystack text, the list of context lengths to test, the list of document depth percentages, and the choice of model provider and evaluator.
Kamradt's original November 2023 test of GPT-4 Turbo revealed several important patterns:

- Recall was reliable at shorter context lengths but began to degrade above roughly 73K tokens.
- At long context lengths, recall was poorest when the needle was placed between roughly 7% and 50% document depth.
- Needles placed at the very beginning of the document were recalled reliably regardless of context length.
These findings were significant because they suggested that GPT-4 Turbo's 128K context window, while technically supported, was not fully reliable for information retrieval tasks at its upper range.
Shortly after Kamradt's GPT-4 test, he ran the same evaluation against Anthropic's Claude 2.1, which had launched on November 21, 2023, with a 200,000-token context window. The initial results were surprising: Claude 2.1 achieved only 27% retrieval accuracy overall.
This poor performance was not a fundamental retrieval failure but rather a behavioral artifact. Anthropic published a blog post on December 6, 2023, explaining the issue. Claude 2.1 had been trained extensively on real-world long document tasks with feedback aimed at reducing hallucinations. This training made the model reluctant to answer questions based on a single, out-of-place sentence embedded in a larger document. Rather than returning the needle's content, Claude would often respond that the document did not contain sufficient information to answer the question.
Anthropic demonstrated that adding a single sentence to the prompt template dramatically improved performance. By prefilling the assistant's response with "Here is the most relevant sentence in the context:", Claude's accuracy jumped from 27% to 98%. This modification reframed the task from "make an assertion about the document" to "identify and extract a relevant sentence," which aligned better with the model's training. Anthropic also showed that accuracy reached 90% to 95% when the needle sentence was topically related to the haystack content, suggesting that the original out-of-place needle design was particularly challenging for a model trained to avoid unsupported claims.
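In messages-style chat APIs, this prefill trick amounts to ending the message list with a partial assistant turn that the model then continues. The sketch below builds such a payload offline (no API call is made); the exact request shape for any given provider may differ.

```python
# How a response prefill maps onto a messages-style chat payload: the final,
# partial assistant message steers the model to continue that sentence
# rather than assert a claim about the document outright. Illustrative only.

def build_prefilled_messages(context: str, question: str) -> list[dict]:
    return [
        {"role": "user",
         "content": f"{context}\n\nQuestion: {question}"},
        # The prefill sentence from Anthropic's blog post:
        {"role": "assistant",
         "content": "Here is the most relevant sentence in the context:"},
    ]

messages = build_prefilled_messages(
    "<haystack with needle>",
    "What is the best thing to do in San Francisco?",
)
```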
When Anthropic launched the Claude 3 model family on March 4, 2024, the company highlighted near-perfect NIAH performance as a key achievement. Claude 3 Opus surpassed 99% accuracy across the full 200K token context window, using an enhanced version of the test with 30 random needle/question pairs tested against a diverse document corpus rather than a single fixed needle.
The Claude 3 launch also produced one of the most discussed anecdotes in AI evaluation history. During testing, Claude 3 Opus not only retrieved the needle correctly but also appeared to recognize that it was being tested. In one instance, the model responded by noting that the needle sentence "seems very out of place and unrelated to the rest of the content" and suggested that "this feels like it could be an artificial test to see if I'm paying attention." This meta-awareness finding, while not a formal benchmark result, generated significant discussion about emergent model capabilities.
Google DeepMind used NIAH testing extensively when announcing Gemini 1.5 Pro in February 2024. The model supported a 1 million token context window at launch, with research demonstrations extending to 10 million tokens.
Key results from Google's testing:
| Context Length | Text Recall | Notes |
|---|---|---|
| Up to 530K tokens | 100% | Perfect retrieval |
| Up to 1M tokens | 99.7% | Near-perfect retrieval |
| Up to 10M tokens | 99.2% | Research setting, not production |
Gemini 1.5 Pro was also tested with multimodal haystacks. In a video variant, a secret word was embedded in a single frame of a 10.5-hour video sampled at one frame per second, and the model had to identify it. In an audio variant, a short audio segment containing a keyword was inserted into up to 22 hours of concatenated audio clips. Both Gemini 1.5 Pro and Gemini 1.5 Flash achieved over 99% recall across text, video, and audio modalities up to millions of tokens.
Google also conducted a multi-needle variant, requiring retrieval of 100 unique needles in a single turn. Gemini 1.5 Pro maintained recall above 99.7% up to 1 million tokens and 99.2% at 10 million tokens for this task.
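Recall in the multi-needle setting is simply the fraction of inserted needles that appear in the model's single-turn answer. A minimal sketch, with made-up needle phrases:

```python
# Multi-needle recall: what fraction of the inserted needles does the
# model's answer contain? Needle phrases here are illustrative.

def multi_needle_recall(response: str, needles: list[str]) -> float:
    text = response.lower()
    found = sum(1 for n in needles if n.lower() in text)
    return found / len(needles)

needles = ["magenta otter", "copper violin", "silent harbor"]
multi_needle_recall("I found the magenta otter and the copper violin.", needles)
```

The call above scores 2 of 3 needles retrieved, i.e. a recall of about 0.67.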
The NIAH test's results are closely related to a broader finding in natural language processing research known as the "lost in the middle" effect. This phenomenon was described in a July 2023 paper by Nelson F. Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, and Percy Liang from Stanford University.
Their research demonstrated that language models tend to recall information most effectively when it appears at the beginning or end of their input context, with a significant performance drop for information located in the middle. This produces a characteristic U-shaped accuracy curve when recall is plotted against input position. The effect is reminiscent of the serial-position effect in human psychology, where people tend to remember the first items (primacy effect) and last items (recency effect) in a list better than those in the middle.
NIAH heatmaps frequently show this same pattern. Many models produce diagonal failure bands in the upper-right region of the heatmap, corresponding to long contexts where the needle is placed early in the document. This suggests that the attention mechanism in transformer-based models struggles to maintain strong connections to information buried in the middle of very long sequences, particularly as the overall context length increases.
In March 2024, Lance Martin of LangChain extended Kamradt's original repository to support multiple needles within a single haystack. This variant tests a model's ability to retrieve and reason over several distinct facts simultaneously. The extension works by placing the first needle at a specified depth and then evenly distributing additional needles through the remaining context.
Martin tested GPT-4 Turbo (128K context) on retrieval of 1, 3, and 10 needles across context lengths from 1K to 120K tokens. The results revealed two important trends:

- Retrieval accuracy declined as the number of needles increased: performance that was near-perfect for a single needle dropped noticeably at 10 needles.
- Retrieval accuracy declined as context length grew, with needles placed toward the beginning of long documents the most likely to be missed.
Both Gemini 1.5 and Claude 3 used multi-needle variants of the NIAH test as part of their official benchmark suites.
Beyond simple multi-needle retrieval, researchers have explored tasks that require the model to synthesize information from multiple needles. For example, a test might insert several facts about different characters' travel histories and then ask which character has visited a specific city. This requires the model to not only find all relevant needles but also perform a comparison or logical inference across them. These reasoning-intensive variants proved to be significantly more challenging for most models and closer to real-world use cases like legal document review or financial analysis.
The most fundamental criticism of the NIAH test is that it has become too easy for modern models. By late 2024, most frontier LLMs achieved near-perfect scores on the standard single-needle test, rendering it ineffective for differentiating between models. A benchmark that every top model passes perfectly cannot provide useful signal about relative capabilities.
The original NIAH test is a pure retrieval task with essentially no reasoning component. The needle is semantically unrelated to the haystack, so the model can succeed by pattern-matching for out-of-distribution content without understanding either the needle or the haystack. This makes the test a poor proxy for real-world tasks that require models to locate, interpret, and reason about information within long documents, such as finding errors in financial reports, synthesizing information across legal filings, or answering complex questions about codebases.
The use of a completely unrelated needle sentence (a statement about San Francisco embedded in essays about startups) creates an unnaturally easy detection problem. In real-world applications, the relevant information typically blends in with the surrounding content rather than standing out as an obvious outlier. Some researchers have argued that this makes NIAH results misleadingly optimistic about a model's actual long-context capabilities.
The NIAH test evaluates only whether a model can locate and repeat a specific piece of information. It does not test whether the model truly understands the full context, can summarize it accurately, can reason about relationships within it, or can maintain consistent quality when generating responses that depend on information spread throughout a long document. These capabilities are often more important for practical applications.
Research from Chroma (2025) and others has identified a phenomenon called "context rot," in which LLM output quality degrades as input context length increases, even when the model can technically access all relevant information. The NIAH test, by focusing solely on retrieval accuracy, does not capture this degradation in reasoning quality. A model might achieve 100% needle retrieval at 128K tokens while simultaneously producing worse analysis, summaries, or reasoning at that context length compared to shorter inputs. Testing 18 frontier models including GPT-4.1, Claude Opus 4, and Gemini 2.5, the Chroma study found that every model exhibited context rot at every input length increment tested.
The limitations of the original NIAH test have motivated the development of more sophisticated long-context evaluation frameworks.
RULER ("What's the Real Context Size of Your Long-Context Language Models?") was published by Cheng-Ping Hsieh, Simeng Sun, Samuel Kriman, Shantanu Acharya, Dima Rekesh, Fei Jia, Yang Zhang, and Boris Ginsburg from NVIDIA in April 2024. It was presented at COLM 2024.
RULER expands the original NIAH test into four task categories:
| Category | Description | Example Tasks |
|---|---|---|
| Retrieval | Extends NIAH with diverse types and quantities of needles, including distractor needles | Single needle, multi-needle, multi-needle with distractors |
| Multi-hop Tracing | Tests the ability to trace entities through chains of references | Variable tracking as a proxy for coreference chain resolution |
| Aggregation | Tests the ability to aggregate information across long-range context | Common/frequent word extraction as a proxy for summarization |
| Question Answering | Adds distracting information to existing short-context QA datasets | SQuAD-based QA with extended context |
The RULER evaluation tested 17 long-context language models at context sizes from 4K to 128K tokens. The results showed that while nearly all models achieved near-perfect accuracy on the vanilla NIAH test, most exhibited large performance drops as context length increased on the more complex RULER tasks. Among models claiming support for 32K or more tokens, only half maintained satisfactory performance at that length on RULER's harder tasks.
BABILong, presented at NeurIPS 2024, is designed to test reasoning across facts distributed in extremely long documents. It includes 20 distinct reasoning tasks, including fact chaining, induction, deduction, counting, and handling lists and sets. The key innovation is that the "needles" are not just passively retrieved; the model must reason across multiple facts embedded at different positions in the context.
Evaluations showed that popular LLMs effectively utilized only 10% to 20% of the context and that performance declined sharply with increased reasoning complexity, even for models that scored perfectly on standard NIAH tests.
NoLiMa ("Long-Context Evaluation Beyond Literal Matching"), published by Adobe Research in February 2025 and accepted at ICML 2025, addresses the literal matching shortcut in NIAH-style tests. In the original NIAH test, the question closely resembles the needle text, allowing models to succeed through lexical pattern matching. NoLiMa constructs needle-question pairs with minimal word overlap, forcing models to infer latent associations rather than match surface-level patterns.
Testing 13 LLMs that claim support for at least 128K tokens, NoLiMa found that while models performed well at short context lengths (under 1K tokens), performance degraded significantly as context grew. At 32K tokens, 11 of the 13 models dropped below 50% of their baseline short-context performance.
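The lexical gap NoLiMa exploits can be made concrete by measuring word overlap between needle and question. The metric and the NoLiMa-style example pair below are illustrative (written in the style of the paper's examples, not quoted from it):

```python
# Jaccard overlap of word sets: classic NIAH needle/question pairs share
# many words, while NoLiMa-style pairs share almost none, forcing the
# model to infer a latent association. Examples are illustrative.

def word_overlap(a: str, b: str) -> float:
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb)

classic_needle = "The best thing to do in San Francisco is eat a sandwich."
classic_question = "What is the best thing to do in San Francisco?"

nolima_needle = "Yuki lives next to the Semper Opera House."
nolima_question = "Which character has been to Dresden?"

word_overlap(classic_needle, classic_question)   # high lexical overlap
word_overlap(nolima_needle, nolima_question)     # near zero
```

Answering the NoLiMa-style question requires knowing that the Semper Opera House is in Dresden, a link no surface-level match can supply.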
NeedleBench (2024) designs more complex information structures, such as descriptions of relationships between entities or kinship relationships, and inserts them into long contexts. This tests not just retrieval but also the model's ability to understand and reason about the retrieved information.
Sequential-NIAH, published at EMNLP 2025, evaluates the ability of LLMs to extract sequential information items from long contexts. Rather than finding a single needle, the model must identify and correctly order multiple pieces of information scattered throughout the document, with context lengths ranging from 8K to 128K tokens.
MMNeedle, presented at NAACL 2025, extends the NIAH concept to multimodal settings. This benchmark tests whether multimodal large language models can locate specific visual or textual information embedded within long sequences of images, video frames, or interleaved text and images.
The NIAH test has had a significant impact on how AI companies develop and market their models. It became standard practice for model announcements to include NIAH heatmaps as evidence of long-context capability. Companies like Anthropic, Google DeepMind, and Cohere have featured NIAH results prominently in their technical reports and blog posts.
The NIAH methodology has also been adapted for evaluating retrieval-augmented generation (RAG) systems. In RAG applications, the test can assess whether the retrieval component correctly identifies the relevant document chunk and whether the generation component correctly uses the retrieved information. Arize AI developed a variant specifically for RAG evaluation that tracks both retrieval accuracy and end-to-end answer quality.
For model developers, NIAH testing provides a quick diagnostic tool during training and evaluation. By running NIAH tests at various checkpoints during training, developers can monitor whether changes to the model architecture, training data, or fine-tuning process affect long-context retrieval. The test's simplicity and speed make it useful for rapid iteration, even if more comprehensive benchmarks are needed for final evaluation.
NIAH results have contributed to research on improving long-context handling in transformer architectures. The consistent finding that models struggle with information in the middle of long contexts has motivated work on improved positional encoding schemes, alternative attention mechanisms, and architectural modifications designed to maintain uniform retrieval quality across all positions in the context window.
The Needle in a Haystack test holds a notable place in the history of LLM evaluation. Greg Kamradt's simple but effective methodology revealed genuine limitations in models that were being marketed with impressive context window numbers, and the test's visual heatmap format made those limitations immediately apparent to both technical and non-technical audiences. The test also produced memorable moments, including the discovery that Claude 2.1's poor performance was a prompting artifact and the finding that Claude 3 Opus appeared to recognize it was being tested.
At the same time, the test's simplicity has become its primary limitation. As frontier models have saturated the basic NIAH test, the field has moved toward more challenging benchmarks that test reasoning, multi-hop retrieval, and semantic understanding rather than simple pattern matching. The NIAH test remains valuable as a baseline sanity check, but it is no longer sufficient on its own to characterize a model's long-context capabilities.