Needle in a Haystack (NIAH) is a benchmark test designed to evaluate the ability of large language models (LLMs) to retrieve specific information embedded within long input contexts. First introduced by Greg Kamradt in November 2023, the test inserts a known fact (the "needle") at varying positions within a large body of text (the "haystack") and then asks the model to recall that fact. The resulting performance data is typically visualized as a heatmap, with document depth on one axis and context length on the other, revealing where a model's recall begins to break down.
NIAH became one of the most widely cited evaluations for long-context LLMs during 2023 and 2024, adopted by companies such as OpenAI, Anthropic, and Google DeepMind to showcase the retrieval capabilities of their models. While the test proved useful for identifying gross failures in context utilization, critics have noted that its simplicity makes it an insufficient measure of true long-context understanding. Newer benchmarks such as RULER, BABILong, and NoLiMa have since expanded on the original NIAH methodology to test more complex reasoning and retrieval behaviors.
The Needle in a Haystack test originated from Greg Kamradt, an AI practitioner and content creator who shared the concept on X (formerly Twitter) on November 8, 2023. Kamradt's initial motivation was to pressure-test GPT-4 Turbo, which OpenAI had just released with a 128,000-token context window. While the expanded context window was a major selling point, Kamradt wanted to determine whether GPT-4 Turbo could actually use all of that context effectively.
To conduct the test, Kamradt inserted a single, out-of-place sentence into a corpus of essays by Paul Graham. The needle sentence was:
"The best thing to do in San Francisco is eat a sandwich and sit in Dolores Park on a sunny day."
This statement has no connection to Paul Graham's essays about startups, technology, and philosophy, making it easy to determine whether a model genuinely retrieved the embedded fact or simply hallucinated an answer. After inserting the needle, Kamradt prompted the model with the question: "What is the best thing to do in San Francisco?" The model was instructed to answer using only the provided context.
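The setup above can be sketched in a few lines of Python. This is an illustrative reconstruction, not code from Kamradt's repository; the function names and the word-level depth arithmetic are assumptions (real implementations position the needle by token count).

```python
# Sketch of the core NIAH setup: insert a needle sentence at a given depth
# in a haystack of filler text, then build the retrieval prompt.
# Names and word-based positioning are illustrative assumptions.

NEEDLE = ("The best thing to do in San Francisco is eat a sandwich "
          "and sit in Dolores Park on a sunny day.")
QUESTION = "What is the best thing to do in San Francisco?"

def insert_needle(haystack: str, needle: str, depth_percent: float) -> str:
    """Insert the needle roughly depth_percent of the way through the text."""
    words = haystack.split()
    position = int(len(words) * depth_percent / 100)
    return " ".join(words[:position] + [needle] + words[position:])

def build_prompt(context: str, question: str) -> str:
    """Instruct the model to answer only from the provided context."""
    return (f"{context}\n\n"
            f"Answer the following question using only the context above.\n"
            f"Question: {question}")

haystack = "Startups are hard. " * 500   # stand-in for Paul Graham essays
context = insert_needle(haystack, NEEDLE, depth_percent=50)
prompt = build_prompt(context, QUESTION)
```

The prompt is then sent to the model under test, and the response is checked for the needle's content.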
Kamradt discussed his methodology and results in detail on his YouTube channel and released the source code as an open-source repository on GitHub under the name LLMTest_NeedleInAHaystack.
Several factors contributed to the test's rapid adoption. First, it was extremely simple to understand and replicate. Any developer could run the test against different models by swapping API keys. Second, the visual heatmap output was immediately intuitive: green cells indicated successful retrieval, red cells indicated failure, and the overall pattern told a clear story about where a model's recall degraded. Third, the test arrived at a moment when context window sizes were expanding rapidly, and there was genuine uncertainty about whether longer windows translated to better performance. Kamradt's work demonstrated that having a large context window did not guarantee that a model could use it all reliably.
The NIAH test follows a straightforward protocol:

1. Assemble a haystack of filler text (in the original, Paul Graham's essays) truncated or repeated to a target token length.
2. Insert the needle sentence at a specified depth within the haystack.
3. Prompt the model with a question whose answer is the needle, instructing it to answer using only the provided context.
4. Score whether the model's response reproduces the needle's content.
5. Repeat across combinations of context length and needle depth.
The key innovation of the NIAH test is that it systematically varies two parameters:
| Parameter | Description | Typical Range |
|---|---|---|
| Context Length | Total number of tokens in the haystack (including the needle) | 1,000 to the model's maximum (e.g., 128K, 200K, or 1M tokens) |
| Document Depth | Position of the needle within the haystack, expressed as a percentage | 0% (beginning) to 100% (end), typically in 10% increments |
By testing every combination of context length and document depth, the experimenter produces a two-dimensional grid of results. For example, testing 15 context lengths and 11 depth positions yields 165 individual test runs per model.
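The sweep described above is just a Cartesian product of the two parameter lists. The sketch below uses a hypothetical linear schedule of 15 context lengths from 1K to 128K tokens; the exact schedule varies by implementation.

```python
# Enumerate the 2D NIAH test grid: 15 context lengths x 11 depth positions
# yields the 165 runs mentioned above. The length schedule is illustrative.
from itertools import product

context_lengths = [round(1_000 + i * (128_000 - 1_000) / 14) for i in range(15)]
depth_percents = [i * 10 for i in range(11)]   # 0%, 10%, ..., 100%

test_grid = list(product(context_lengths, depth_percents))
len(test_grid)   # 165 runs per model
```

Each (length, depth) pair produces one cell of the eventual heatmap.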
The original NIAH test used a simple binary scoring system: the model either returned the correct answer or it did not. Kamradt visualized the results as a heatmap where:

- green cells indicated successful retrieval,
- red cells indicated failed retrieval, and
- the two axes showed context length and document depth.
These heatmaps became the signature visualization of the NIAH test, widely shared on social media and in model announcement blog posts. The visual format made it easy to spot patterns, such as a model failing specifically when the needle was in the middle of a very long document or when the needle was near the beginning of a context that approached the model's maximum length.
Later implementations refined the scoring. For instance, Arize AI's adaptation of the test used automated matching to search for specific keywords or phrases in the model's output rather than relying on subjective grading. This approach reduced evaluation time from days to hours and improved consistency.
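Automated scoring of this kind reduces to a substring check. The helper below is a minimal sketch in the spirit of keyword matching; the keyword choices and function name are illustrative, not taken from Arize AI's implementation.

```python
# Binary pass/fail scoring via keyword matching: does the model's answer
# contain a phrase that only the needle could supply? Illustrative sketch.

def score_response(response: str, keywords: list[str]) -> bool:
    """Return True if any needle-specific keyword appears in the answer."""
    text = response.lower()
    return any(kw.lower() in text for kw in keywords)

keywords = ["dolores park", "eat a sandwich"]
score_response("You should eat a sandwich in Dolores Park.", keywords)  # True
score_response("The context does not say.", keywords)                   # False
```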
Kamradt's open-source implementation supports several configurable parameters, including the needle text, the retrieval question, the source directory for the haystack text, the list of context lengths to test, the list of document depth percentages, and the choice of model provider and evaluator.
Kamradt's original November 2023 test of GPT-4 Turbo revealed several important patterns:

- Recall was reliable at shorter context lengths but began to degrade above roughly 73K tokens.
- At long context lengths, recall was poorest when the needle was placed between roughly 7% and 50% document depth.
- Needles placed at the very beginning of the document were recalled reliably regardless of context length.
These findings were significant because they suggested that GPT-4 Turbo's 128K context window, while technically supported, was not fully reliable for information retrieval tasks at its upper range.
Shortly after Kamradt's GPT-4 test, he ran the same evaluation against Anthropic's Claude 2.1, which had launched on November 21, 2023, with a 200,000-token context window. The initial results were surprising: Claude 2.1 achieved only 27% retrieval accuracy overall.
This poor performance was not a fundamental retrieval failure but rather a behavioral artifact. Anthropic published a blog post on December 6, 2023, explaining the issue. Claude 2.1 had been trained extensively on real-world long document tasks with feedback aimed at reducing hallucinations. This training made the model reluctant to answer questions based on a single, out-of-place sentence embedded in a larger document. Rather than returning the needle's content, Claude would often respond that the document did not contain sufficient information to answer the question.
Anthropic demonstrated that adding a single sentence to the prompt template dramatically improved performance. By prefilling the assistant's response with "Here is the most relevant sentence in the context:", Claude's accuracy jumped from 27% to 98%. This modification reframed the task from "make an assertion about the document" to "identify and extract a relevant sentence," which aligned better with the model's training. Anthropic also showed that accuracy reached 90% to 95% when the needle sentence was topically related to the haystack content, suggesting that the original out-of-place needle design was particularly challenging for a model trained to avoid unsupported claims.
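In messages-style chat APIs, this prefill trick amounts to ending the message list with a partial assistant turn that the model then continues. The sketch below builds such a payload offline (no API call is made); the exact request shape for any given provider may differ.

```python
# How a response prefill maps onto a messages-style chat payload: the final,
# partial assistant message steers the model to continue that sentence
# rather than assert a claim about the document outright. Illustrative only.

def build_prefilled_messages(context: str, question: str) -> list[dict]:
    return [
        {"role": "user",
         "content": f"{context}\n\nQuestion: {question}"},
        # The prefill sentence from Anthropic's blog post:
        {"role": "assistant",
         "content": "Here is the most relevant sentence in the context:"},
    ]

messages = build_prefilled_messages(
    "<haystack with needle>",
    "What is the best thing to do in San Francisco?",
)
```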
When Anthropic launched the Claude 3 model family on March 4, 2024, the company highlighted near-perfect NIAH performance as a key achievement. Claude 3 Opus surpassed 99% accuracy across the full 200K token context window, using an enhanced version of the test with 30 random needle/question pairs tested against a diverse document corpus rather than a single fixed needle.
The Claude 3 launch also produced one of the most discussed anecdotes in AI evaluation history. During testing, Claude 3 Opus not only retrieved the needle correctly but also appeared to recognize that it was being tested. In one instance, the model responded by noting that the needle sentence "seems very out of place and unrelated to the rest of the content" and suggested that "this feels like it could be an artificial test to see if I'm paying attention." This meta-awareness finding, while not a formal benchmark result, generated significant discussion about emergent model capabilities.
Google DeepMind used NIAH testing extensively when announcing Gemini 1.5 Pro in February 2024. The model supported a 1 million token context window at launch, with research demonstrations extending to 10 million tokens.
Key results from Google's testing:
| Context Length | Text Recall | Notes |
|---|---|---|
| Up to 530K tokens | 100% | Perfect retrieval |
| Up to 1M tokens | 99.7% | Near-perfect retrieval |
| Up to 10M tokens | 99.2% | Research setting, not production |
Gemini 1.5 Pro was also tested with multimodal haystacks. In a video variant, a secret word was embedded in a single frame of a 10.5-hour video sampled at one frame per second, and the model had to identify it. In an audio variant, a short audio segment containing a keyword was inserted into up to 22 hours of concatenated audio clips. Both Gemini 1.5 Pro and Gemini 1.5 Flash achieved over 99% recall across text, video, and audio modalities up to millions of tokens.
Google also conducted a multi-needle variant, requiring retrieval of 100 unique needles in a single turn. Gemini 1.5 Pro maintained recall above 99.7% up to 1 million tokens and 99.2% at 10 million tokens for this task.
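Recall in the multi-needle setting is simply the fraction of inserted needles that appear in the model's single-turn answer. A minimal sketch, with made-up needle phrases:

```python
# Multi-needle recall: what fraction of the inserted needles does the
# model's answer contain? Needle phrases here are illustrative.

def multi_needle_recall(response: str, needles: list[str]) -> float:
    text = response.lower()
    found = sum(1 for n in needles if n.lower() in text)
    return found / len(needles)

needles = ["magenta otter", "copper violin", "silent harbor"]
multi_needle_recall("I found the magenta otter and the copper violin.", needles)
```

The call above scores 2 of 3 needles retrieved, i.e. a recall of about 0.67.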
The NIAH test's results are closely related to a broader finding in natural language processing research known as the "lost in the middle" effect. This phenomenon was described in a July 2023 paper by Nelson F. Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, and Percy Liang from Stanford University.
Their research demonstrated that language models tend to recall information most effectively when it appears at the beginning or end of their input context, with a significant performance drop for information located in the middle. This produces a characteristic U-shaped accuracy curve when recall is plotted against input position. The effect is reminiscent of the serial-position effect in human psychology, where people tend to remember the first items (primacy effect) and last items (recency effect) in a list better than those in the middle.
NIAH heatmaps frequently show this same pattern. Many models produce diagonal failure bands in the upper-right region of the heatmap, corresponding to long contexts where the needle is placed early in the document. This suggests that the attention mechanism in transformer-based models struggles to maintain strong connections to information buried in the middle of very long sequences, particularly as the overall context length increases.
In March 2024, Lance Martin of LangChain extended Kamradt's original repository to support multiple needles within a single haystack. This variant tests a model's ability to retrieve and reason over several distinct facts simultaneously. The extension works by placing the first needle at a specified depth and then evenly distributing additional needles through the remaining context.
Martin tested GPT-4 Turbo (128K context) on retrieval of 1, 3, and 10 needles across context lengths from 1K to 120K tokens. The results revealed two important trends:

- Retrieval accuracy declined as the number of needles increased: performance that was near-perfect for a single needle dropped noticeably at 10 needles.
- Retrieval accuracy declined as context length grew, with needles placed toward the beginning of long documents the most likely to be missed.
Both Gemini 1.5 and Claude 3 used multi-needle variants of the NIAH test as part of their official benchmark suites.
Beyond simple multi-needle retrieval, researchers have explored tasks that require the model to synthesize information from multiple needles. For example, a test might insert several facts about different characters' travel histories and then ask which character has visited a specific city. This requires the model to not only find all relevant needles but also perform a comparison or logical inference across them. These reasoning-intensive variants proved to be significantly more challenging for most models and closer to real-world use cases like legal document review or financial analysis.
The most fundamental criticism of the NIAH test is that it has become too easy for modern models. By late 2024, most frontier LLMs achieved near-perfect scores on the standard single-needle test, rendering it ineffective for differentiating between models. A benchmark that every top model passes perfectly cannot provide useful signal about relative capabilities.
The original NIAH test is a pure retrieval task with essentially no reasoning component. The needle is semantically unrelated to the haystack, so the model can succeed by pattern-matching for out-of-distribution content without understanding either the needle or the haystack. This makes the test a poor proxy for real-world tasks that require models to locate, interpret, and reason about information within long documents, such as finding errors in financial reports, synthesizing information across legal filings, or answering complex questions about codebases.
The use of a completely unrelated needle sentence (a statement about San Francisco embedded in essays about startups) creates an unnaturally easy detection problem. In real-world applications, the relevant information typically blends in with the surrounding content rather than standing out as an obvious outlier. Some researchers have argued that this makes NIAH results misleadingly optimistic about a model's actual long-context capabilities.
The NIAH test evaluates only whether a model can locate and repeat a specific piece of information. It does not test whether the model truly understands the full context, can summarize it accurately, can reason about relationships within it, or can maintain consistent quality when generating responses that depend on information spread throughout a long document. These capabilities are often more important for practical applications.
Research from Chroma (2025) and others has identified a phenomenon called "context rot," in which LLM output quality degrades as input context length increases, even when the model can technically access all relevant information. The NIAH test, by focusing solely on retrieval accuracy, does not capture this degradation in reasoning quality. A model might achieve 100% needle retrieval at 128K tokens while simultaneously producing worse analysis, summaries, or reasoning at that context length compared to shorter inputs. Testing 18 frontier models including GPT-4.1, Claude Opus 4, and Gemini 2.5, the Chroma study found that every model exhibited context rot at every input length increment tested.
The limitations of the original NIAH test have motivated the development of more sophisticated long-context evaluation frameworks.
RULER ("What's the Real Context Size of Your Long-Context Language Models?") was published by Cheng-Ping Hsieh, Simeng Sun, Samuel Kriman, Shantanu Acharya, Dima Rekesh, Fei Jia, Yang Zhang, and Boris Ginsburg from NVIDIA in April 2024. It was presented at COLM 2024.
RULER expands the original NIAH test into four task categories:
| Category | Description | Example Tasks |
|---|---|---|
| Retrieval | Extends NIAH with diverse types and quantities of needles, including distractor needles | Single needle, multi-needle, multi-needle with distractors |
| Multi-hop Tracing | Tests the ability to trace entities through chains of references | Variable tracking as a proxy for coreference chain resolution |
| Aggregation | Tests the ability to aggregate information across long-range context | Common/frequent word extraction as a proxy for summarization |
| Question Answering | Adds distracting information to existing short-context QA datasets | SQuAD-based QA with extended context |
The RULER evaluation tested 17 long-context language models at context sizes from 4K to 128K tokens. The results showed that while nearly all models achieved near-perfect accuracy on the vanilla NIAH test, most exhibited large performance drops as context length increased on the more complex RULER tasks. Among models claiming support for 32K or more tokens, only half maintained satisfactory performance at that length on RULER's harder tasks.
BABILong, presented at NeurIPS 2024, is designed to test reasoning across facts distributed in extremely long documents. It includes 20 distinct reasoning tasks, including fact chaining, induction, deduction, counting, and handling lists and sets. The key innovation is that the "needles" are not just passively retrieved; the model must reason across multiple facts embedded at different positions in the context.
Evaluations showed that popular LLMs effectively utilized only 10% to 20% of the context and that performance declined sharply with increased reasoning complexity, even for models that scored perfectly on standard NIAH tests.
NoLiMa ("Long-Context Evaluation Beyond Literal Matching"), published by Adobe Research in February 2025 and accepted at ICML 2025, addresses the literal matching shortcut in NIAH-style tests. In the original NIAH test, the question closely resembles the needle text, allowing models to succeed through lexical pattern matching. NoLiMa constructs needle-question pairs with minimal word overlap, forcing models to infer latent associations rather than match surface-level patterns.
Testing 13 LLMs that claim support for at least 128K tokens, NoLiMa found that while models performed well at short context lengths (under 1K tokens), performance degraded significantly as context grew. At 32K tokens, 11 of the 13 models dropped below 50% of their baseline short-context performance.
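The lexical gap NoLiMa exploits can be made concrete by measuring word overlap between needle and question. The metric and the NoLiMa-style example pair below are illustrative (written in the style of the paper's examples, not quoted from it):

```python
# Jaccard overlap of word sets: classic NIAH needle/question pairs share
# many words, while NoLiMa-style pairs share almost none, forcing the
# model to infer a latent association. Examples are illustrative.

def word_overlap(a: str, b: str) -> float:
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb)

classic_needle = "The best thing to do in San Francisco is eat a sandwich."
classic_question = "What is the best thing to do in San Francisco?"

nolima_needle = "Yuki lives next to the Semper Opera House."
nolima_question = "Which character has been to Dresden?"

word_overlap(classic_needle, classic_question)   # high lexical overlap
word_overlap(nolima_needle, nolima_question)     # near zero
```

Answering the NoLiMa-style question requires knowing that the Semper Opera House is in Dresden, a link no surface-level match can supply.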
NeedleBench (2024) designs more complex information structures, such as descriptions of relationships between entities or kinship relationships, and inserts them into long contexts. This tests not just retrieval but also the model's ability to understand and reason about the retrieved information.
Sequential-NIAH, published at EMNLP 2025, evaluates the ability of LLMs to extract sequential information items from long contexts. Rather than finding a single needle, the model must identify and correctly order multiple pieces of information scattered throughout the document, with context lengths ranging from 8K to 128K tokens.
MMNeedle, presented at NAACL 2025, extends the NIAH concept to multimodal settings. This benchmark tests whether multimodal large language models can locate specific visual or textual information embedded within long sequences of images, video frames, or interleaved text and images.
The NIAH test has had a significant impact on how AI companies develop and market their models. It became standard practice for model announcements to include NIAH heatmaps as evidence of long-context capability. Companies like Anthropic, Google DeepMind, and Cohere have featured NIAH results prominently in their technical reports and blog posts.
The NIAH methodology has also been adapted for evaluating retrieval-augmented generation (RAG) systems. In RAG applications, the test can assess whether the retrieval component correctly identifies the relevant document chunk and whether the generation component correctly uses the retrieved information. Arize AI developed a variant specifically for RAG evaluation that tracks both retrieval accuracy and end-to-end answer quality.
For model developers, NIAH testing provides a quick diagnostic tool during training and evaluation. By running NIAH tests at various checkpoints during training, developers can monitor whether changes to the model architecture, training data, or fine-tuning process affect long-context retrieval. The test's simplicity and speed make it useful for rapid iteration, even if more comprehensive benchmarks are needed for final evaluation.
NIAH results have contributed to research on improving long-context handling in transformer architectures. The consistent finding that models struggle with information in the middle of long contexts has motivated work on improved positional encoding schemes, alternative attention mechanisms, and architectural modifications designed to maintain uniform retrieval quality across all positions in the context window.
The Needle in a Haystack test holds a notable place in the history of LLM evaluation. Greg Kamradt's simple but effective methodology revealed genuine limitations in models that were being marketed with impressive context window numbers, and the test's visual heatmap format made those limitations immediately apparent to both technical and non-technical audiences. The test also produced memorable moments, including the discovery that Claude 2.1's poor performance was a prompting artifact and the finding that Claude 3 Opus appeared to recognize it was being tested.
At the same time, the test's simplicity has become its primary limitation. As frontier models have saturated the basic NIAH test, the field has moved toward more challenging benchmarks that test reasoning, multi-hop retrieval, and semantic understanding rather than simple pattern matching. The NIAH test remains valuable as a baseline sanity check, but it is no longer sufficient on its own to characterize a model's long-context capabilities.