Contextual retrieval
Last reviewed
May 24, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v1 · 3,948 words
Contextual retrieval is a document indexing technique for retrieval-augmented generation systems, introduced by Anthropic on September 19, 2024 in a blog post titled "Introducing Contextual Retrieval".[^1] The method addresses a long-standing weakness in chunk-based retrieval: when a document is split into short passages for embedding, each chunk loses the surrounding context (which document it belongs to, what section it falls under, which entities it refers to) that gave it meaning. Contextual retrieval pre-processes each chunk by passing it to a small language model along with the full source document; the model writes a 50 to 100 token explanation situating the chunk inside the document, and that explanation is prepended to the chunk before embeddings are computed and before the chunk is indexed by BM25.[^1] Anthropic reported that the technique reduces top-20 retrieval failure rates by about 35% when applied to embeddings alone, 49% when combined with BM25 over the contextualized chunks, and 67% when a reranking model is added on top.[^1] The approach depends on prompt caching, which Anthropic had released a month earlier, to make the per-chunk context-generation step affordable at production scale.[^1][^2]
A standard retrieval-augmented generation (RAG) pipeline ingests a corpus by splitting each document into fixed or semantic chunks (typically a few hundred tokens each), embedding those chunks with a sentence or document encoder, and storing the resulting vectors in an approximate nearest-neighbor index.[^3] At query time, the user question is embedded into the same space, the closest chunks are retrieved, and they are passed to a generator model as context for answering. Many production deployments additionally run a sparse lexical index such as BM25 in parallel, then fuse the two ranked lists.[^1]
The architecture has a structural blind spot. Because each chunk is encoded in isolation, references that depend on surrounding text become ambiguous after splitting. A chunk that reads, "The company's revenue grew by 3% over the previous quarter," is meaningful only if the reader already knows which company and which quarter the sentence describes; in the index, the chunk loses that anchor, and a query like "What was ACME Corp's revenue growth in Q2 2023?" may not retrieve the chunk because the surface tokens "ACME" and "Q2 2023" never appeared in the embedded text.[^1] The same problem appears with pronouns, section-internal references ("the algorithm above"), and acronyms that were defined elsewhere in the document. Anthropic framed contextual retrieval as a remedy for this loss of document-level context at indexing time.[^1]
The technique sits inside a broader wave of work, published in 2023 and 2024, that tried to bridge query and document representations. Hypothetical document embeddings (HyDE), introduced by Gao and colleagues at ACL 2023, generate a fictitious answer for the query and embed that answer instead of the query.[^4] Late chunking, published by Jina AI on arXiv in September 2024, processes the entire long document through a long-context encoder, applies mean pooling only after chunk boundaries are marked, and produces chunk vectors that already contain document-wide context without invoking a separate generator.[^5] Contextual retrieval is the LLM-centric branch of this work: instead of changing the encoder or the query, it rewrites the chunk text using a language model before it ever reaches the embedder.[^1]
Anthropic credited the practicality of the approach to its own prompt caching feature, which had launched on August 14, 2024 for Claude 3.5 Sonnet and Claude 3 Haiku on the Anthropic API.[^2] Prompt caching lets a developer mark a large input prefix (in this case, the full document) as cacheable; subsequent calls that reuse the prefix pay roughly 10% of the normal input-token price for the cached portion, with a 25% surcharge on the initial cache write.[^2] Because contextual retrieval generates one context blurb per chunk while keeping the parent document constant across all chunks of that document, caching replaces a cost that grows with the product of document length and chunk count with a single cache write plus discounted cache reads, and Anthropic put the one-time cost at about $1.02 per million document tokens when generating contexts with Claude 3 Haiku.[^1]
The method operates entirely at indexing time. It takes a chunked corpus and produces a contextualized chunked corpus, which is then fed unchanged into the rest of an existing RAG pipeline. There are no model retraining steps and no changes to the retriever architecture.
For each chunk, an instruction-tuned language model is given the full parent document followed by the specific chunk, and is asked to produce a short paragraph that explains where the chunk sits inside the document and what context a reader would need to interpret it. Anthropic's published prompt template is:[^1][^6]
<document>
{{WHOLE_DOCUMENT}}
</document>
Here is the chunk we want to situate within the whole document
<chunk>
{{CHUNK_CONTENT}}
</chunk>
Please give a short succinct context to situate this chunk within
the overall document for the purposes of improving search retrieval
of the chunk. Answer only with the succinct context and nothing else.
The generator is asked for only the situating context, not a rewrite of the chunk. The result is typically a sentence or two of 50 to 100 tokens.[^1] Anthropic's running example transforms an isolated sentence about quarterly revenue growth into the contextualized form, "This chunk is from an SEC filing on ACME Corp's performance in Q2 2023; the previous quarter's revenue was $314 million. The company's revenue grew by 3% over the previous quarter."[^1] The contextualized text is what gets embedded and indexed; the original chunk text is preserved as the payload returned at retrieval time so that the generator at answer time sees the natural language users wrote, not the LLM-generated preamble.
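The contextualization step described above can be sketched as follows. This is a minimal illustration, not Anthropic's implementation: `build_prompt`, `contextualize`, the record keys, and the `generate` callable (a stand-in for any LLM call that maps a prompt string to a completion string) are hypothetical names chosen for the example.

```python
from typing import Callable

# Template text follows the prompt quoted above.
CONTEXT_PROMPT = (
    "<document>\n{document}\n</document>\n"
    "Here is the chunk we want to situate within the whole document\n"
    "<chunk>\n{chunk}\n</chunk>\n"
    "Please give a short succinct context to situate this chunk within "
    "the overall document for the purposes of improving search retrieval "
    "of the chunk. Answer only with the succinct context and nothing else."
)


def build_prompt(document: str, chunk: str) -> str:
    """Fill the template with the parent document and one chunk."""
    return CONTEXT_PROMPT.format(document=document, chunk=chunk)


def contextualize(document: str, chunk: str,
                  generate: Callable[[str], str]) -> dict:
    """Build one index record for a chunk.

    `generate` is a hypothetical stand-in for an LLM call; it is injected
    so the sketch stays independent of any particular client library.
    """
    context = generate(build_prompt(document, chunk)).strip()
    return {
        # Contextualized text: what the embedder and BM25 indexer see.
        "text_to_index": f"{context} {chunk}",
        # Original chunk: what is handed back at retrieval time.
        "payload": chunk,
    }
```

Keeping the payload separate from the indexed text mirrors the split described above: the generated preamble influences what is found, while the original chunk is what gets returned.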
Anthropic recommended Claude 3 Haiku, the smallest model in the Claude 3 family at the time, for the contextualization step.[^1][^7] Haiku was released on March 13, 2024 and priced at $0.25 per million input tokens and $1.25 per million output tokens, the cheapest option in the Claude 3 line.[^7] The blog post argued that the contextualization task does not require frontier reasoning; what matters is that the model can read a long document and write a short, factual summary that names the relevant entities and section. Haiku's 200,000 token context window covered most single documents in Anthropic's test corpora.[^1][^7]
Third-party reimplementations have substituted other models. The Together AI reference implementation uses an open-weights model in the Qwen family for context generation, and an asynchronous Python reimplementation by Jason Liu, author of the Instructor library, was published on September 26, 2024 using Haiku under the Instructor structured-output framework.[^8][^9] The DataCamp tutorial, written by NLP researcher Ryan Ong, swaps in GPT-4o through LangChain to demonstrate that the approach is not Anthropic-specific.[^10] What ties these reimplementations together is the indexing-time structure (full document plus chunk in, short context out), not any particular generator.
Without caching, generating a context for every chunk would require sending the full document on every call, and the cost would scale as the product of document length and chunk count. Caching the document collapses that to a roughly fixed per-document write cost plus a cheap per-chunk read. Anthropic published a worked example: for 800-token chunks, with the document loaded into the cache once, the one-time indexing cost works out to $1.02 per million document tokens.[^1] The cookbook reports that on its 9-codebase, 737-chunk demonstration corpus, 61.83% of all input tokens were served from cache, reducing the contextualization bill from about $9.20 to about $2.85.[^6]
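The scaling argument above can be made concrete with a rough cost model. The sketch below is an estimate under stated assumptions, not Anthropic's own calculation: the 8,000-token document, 50-token instruction overhead, and 100-token output are assumed values, the prices are the Haiku figures quoted earlier, and the cache multipliers are the approximate 25% surcharge and 10% read rate rather than exact list prices. The function name is hypothetical.

```python
def contextualization_cost_per_mtok(
    doc_tokens: int = 8_000,         # assumed document length
    chunk_tokens: int = 800,         # chunk size from the worked example
    instruction_tokens: int = 50,    # assumed prompt overhead per call
    output_tokens: int = 100,        # assumed context length per chunk
    input_price: float = 0.25,       # USD per million input tokens
    output_price: float = 1.25,      # USD per million output tokens
    cache_write_mult: float = 1.25,  # approximate surcharge on the cache write
    cache_read_mult: float = 0.10,   # approximate rate for cached reads
    use_cache: bool = True,
) -> float:
    """Estimated USD per million document tokens for one indexing pass."""
    n_chunks = -(-doc_tokens // chunk_tokens)  # ceiling division
    if use_cache:
        # The first call writes the document to the cache; the rest read it.
        doc_input = doc_tokens * (
            cache_write_mult + cache_read_mult * (n_chunks - 1)
        )
    else:
        # Every call resends the whole document at full price.
        doc_input = doc_tokens * n_chunks
    # Each call also sends the chunk itself plus the instructions, uncached.
    per_call_input = n_chunks * (chunk_tokens + instruction_tokens)
    per_doc_usd = (
        (doc_input + per_call_input) * input_price
        + n_chunks * output_tokens * output_price
    ) / 1e6
    return per_doc_usd / doc_tokens * 1e6


cached = contextualization_cost_per_mtok()
uncached = contextualization_cost_per_mtok(use_cache=False)
print(f"cached: ${cached:.2f}/Mtok, uncached: ${uncached:.2f}/Mtok")
```

With these assumed defaults the cached estimate comes out a little under one dollar per million document tokens, in the same range as the quoted figure; the exact value shifts with document length, chunk size, and actual cache pricing.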
A practical wrinkle is that Anthropic's prompt cache has a five-minute lifetime by default.[^2] Several commenters on Hacker News pointed out that an indexing job needs to process all chunks for a given document within that window to capture the cache benefit, which steers implementations toward processing each document's chunks contiguously rather than interleaving documents.[^11] The cookbook achieves this by iterating documents in an outer loop and chunks in an inner loop with a ThreadPoolExecutor, so all chunks of a document hit the same cache prefix.[^6]
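The ordering constraint can be illustrated with a short sketch: documents in an outer loop, chunks of the current document fanned out to a thread pool. This is not the cookbook's code. The `situate` callable (document text and chunk in, context string out), the record layout, and the choice to process the first chunk before starting the pool so that later calls find the prefix already cached are all assumptions of this example.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable


def contextualize_corpus(
    documents: Iterable[dict],
    situate: Callable[[str, str], str],
    max_workers: int = 4,
) -> list[dict]:
    """Contextualize a corpus one document at a time.

    Each document is a dict with 'doc_id', 'content', and 'chunks'
    (a list of strings). `situate` is a hypothetical LLM wrapper.
    """
    records = []
    for doc in documents:  # outer loop: finish one document before the next
        chunks = doc["chunks"]
        if not chunks:
            continue
        content = doc["content"]
        # Warm the cache with the first chunk (a design choice of this sketch).
        contexts = [situate(content, chunks[0])]
        with ThreadPoolExecutor(max_workers=max_workers) as pool:
            # pool.map preserves input order, so contexts line up with chunks.
            contexts.extend(
                pool.map(lambda c: situate(content, c), chunks[1:])
            )
        for index, (chunk, context) in enumerate(zip(chunks, contexts)):
            records.append({
                "doc_id": doc["doc_id"],
                "chunk_index": index,
                "text_to_index": f"{context} {chunk}",
                "payload": chunk,
            })
    return records
```

Because the pool is created and drained inside the document loop, no call for the next document starts until every chunk of the current one has been processed, which keeps each document's calls close together in time.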
The full pipeline Anthropic recommends has four stages.[^1]
First, every chunk is contextualized as above, and the contextualized text becomes the input to both the embedding model and the BM25 indexer. Indexing the contextualized text rather than the raw chunk means that the lexical index also benefits: a query containing entity names that appeared only in the document title now matches contextualized chunks that mention those entities, even when the raw chunk did not.[^1]
Second, at query time the system runs the user query against both indices in parallel: a dense vector search returns the top 150 chunks by cosine similarity, and a BM25 search returns the top 150 by lexical score.[^6]
Third, the two ranked lists are merged using reciprocal rank fusion (RRF), with a weighting of roughly 0.8 toward the semantic list and 0.2 toward the lexical list in the cookbook's default configuration.[^6] RRF assigns each chunk a score of 1 divided by its rank in each list, multiplied by the list weight, and the merged list is re-sorted by the summed score.[^6]
Fourth, an optional reranking model takes the top 150 fused candidates and re-scores them by passing the query and each candidate chunk through a cross-encoder. The top 20 reranked chunks become the final retrieved set passed to the generator. Anthropic's blog described its primary tests with the Cohere reranker; Voyage AI's reranker was noted but not benchmarked.[^1]
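The fusion and reranking stages above can be sketched in a few lines. The fusion function follows the description in the text (weight divided by rank, summed across lists); the textbook form of reciprocal rank fusion instead divides by a constant plus the rank. The function names and the `score` callable, which stands in for a cross-encoder, are hypothetical.

```python
from typing import Callable, Sequence


def weighted_rrf(
    semantic_ids: Sequence[str],
    bm25_ids: Sequence[str],
    semantic_weight: float = 0.8,
    bm25_weight: float = 0.2,
) -> list[str]:
    """Fuse two ranked lists of chunk IDs; rank 1 is the best hit."""
    scores: dict[str, float] = {}
    for weight, ranked in ((semantic_weight, semantic_ids),
                           (bm25_weight, bm25_ids)):
        for rank, chunk_id in enumerate(ranked, start=1):
            # Each appearance contributes weight * (1 / rank).
            scores[chunk_id] = scores.get(chunk_id, 0.0) + weight / rank
    return sorted(scores, key=lambda cid: scores[cid], reverse=True)


def rerank(
    query: str,
    candidate_ids: Sequence[str],
    score: Callable[[str, str], float],
    top_k: int = 20,
) -> list[str]:
    """Re-score fused candidates and keep the best top_k.

    `score(query, chunk_id)` is a hypothetical stand-in for a
    cross-encoder relevance model.
    """
    ranked = sorted(candidate_ids,
                    key=lambda cid: score(query, cid),
                    reverse=True)
    return ranked[:top_k]


fused = weighted_rrf(["a", "b", "c"], ["c", "a", "d"])
print(fused)  # → ['a', 'c', 'b', 'd']
```

In the example, chunk `c` overtakes `b` because its first-place lexical hit adds to its third-place semantic score, which is the behaviour the weighting is meant to allow.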
The evaluation metric in the Anthropic post is "1 minus recall at 20", that is, the fraction of evaluation queries for which the gold chunk is not present in the top 20 retrieved chunks. The headline numbers, 35%, 49%, and 67%, are relative reductions in this failure rate, computed from absolute failure rates of 5.7% (baseline RAG with no contextualization), 3.7% (contextual embeddings only), 2.9% (contextual embeddings plus contextual BM25), and 1.9% (all three components plus reranking).[^1][^12]
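The relationship between the absolute failure rates and the headline percentages can be checked directly. The sketch below uses hypothetical function names; the rates are the ones quoted above.

```python
def failure_rate_at_k(retrieved, gold, k=20):
    """Fraction of queries whose gold chunk is missing from the top k.

    retrieved: one ranked list of chunk IDs per query.
    gold: the gold chunk ID for each query, in the same order.
    """
    misses = sum(1 for ranked, g in zip(retrieved, gold)
                 if g not in ranked[:k])
    return misses / len(gold)


def relative_reduction(baseline, improved):
    """Relative drop in failure rate compared with the baseline."""
    return (baseline - improved) / baseline


baseline = 5.7  # percent, baseline RAG
for label, rate in [
    ("contextual embeddings", 3.7),
    ("+ contextual BM25", 2.9),
    ("+ reranking", 1.9),
]:
    print(f"{label}: {relative_reduction(baseline, rate):.0%}")
# contextual embeddings: 35%
# + contextual BM25: 49%
# + reranking: 67%
```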
Anthropic ran the method on five evaluation corpora: a code retrieval set, a fiction set, a set of ArXiv papers, a set of science papers, and a Wikipedia subset.[^1] The blog post does not publish exact per-corpus dataset sizes, but the open-source cookbook publishes the code-retrieval subset as data/codebase_chunks.json (737 chunks across 9 codebases) with a 248-query evaluation set in data/evaluation_set.jsonl.[^6]
Aggregated across the five corpora and across the embedding models tested (Voyage AI's voyage-2, Google's text-embedding-004, OpenAI's text-embedding-3-large, and Cohere's embed-english-v3.0), the relative improvements were:[^1]
| Configuration | Top-20 failure rate | Relative reduction vs baseline |
|---|---|---|
| Embeddings only (baseline RAG) | 5.7% | 0% |
| Embeddings + BM25 (hybrid baseline) | 4.8% | 16% |
| Contextual embeddings only | 3.7% | 35% |
| Contextual embeddings + contextual BM25 | 2.9% | 49% |
| Contextual embeddings + contextual BM25 + reranking | 1.9% | 67% |
Anthropic reported that Gemini text-embedding-004 and Voyage AI's models produced the best absolute scores, but contextualization improved every embedding model tested.[^1] The codebase-only results from the cookbook are reported as pass-at-k rather than failure rate, and show baseline RAG at 87.15% pass-at-10, contextual embeddings at 92.34%, contextual embeddings plus BM25 at 92.31%, and the full pipeline with Cohere reranking at 95.26%.[^6]
The blog also reports several ablations relevant to deployment.[^1]
The number of retrieved chunks matters. Performance plateaus near k = 20; the post argues that retrieving more chunks generally helps until the generator's context window begins to dilute relevance.
The size of the LLM-generated context matters but with diminishing returns. Anthropic used the 50 to 100 token range and reported that longer contexts did not yield additional gains worth the cost.[^1]
Contextualization helps every embedding model. The post notes that even closed-source frontier embedders such as Voyage and Gemini, which already encode some discourse structure, gained measurable accuracy from contextualization, suggesting the gain comes from new lexical signal rather than from rescuing a weak encoder.[^1]
HyDE is the closest conceptual neighbor on the query side. Introduced by Luyu Gao, Xueguang Ma, Jimmy Lin, and Jamie Callan at ACL 2023, HyDE expands a short query by asking an instruction-tuned LLM to write a hypothetical passage that would answer the query, embedding that synthetic passage, and using the synthetic embedding as the search vector against the document index.[^4] The technique improves zero-shot dense retrieval by replacing query-to-document matching with document-to-document matching: the hypothetical passage resembles the indexed passages in length and style far more closely than a short query does.[^4]
Contextual retrieval is the symmetric move. Where HyDE expands the query at search time using an LLM, contextual retrieval expands the document chunks at indexing time using an LLM. The two are not mutually exclusive (a system can do both), and they have different cost profiles: HyDE pays its LLM cost once per query and so is sensitive to query throughput, while contextual retrieval pays its LLM cost once per chunk at indexing and pays nothing extra at query time. A Hacker News commenter on the launch thread reported that in their internal A/B test with RAGAS, HyDE decreased both answer quality and retrieval quality, while hybrid retrieval was a wash; the commenter contrasted this with contextual retrieval, which they expected to behave differently because it operates on documents rather than queries.[^11]
Late chunking, published by Michael Günther and colleagues from Jina AI and Weaviate on September 4, 2024 (arXiv 2409.04701), targets the same underlying problem (chunks losing document context) but with a very different mechanism.[^5] Instead of asking an LLM to write a contextual summary, late chunking runs the entire document through a long-context embedding model in a single forward pass; per-token hidden states are produced for the whole document, and chunk vectors are built by mean-pooling only the tokens that belong to each chunk. Because each token's representation already attends to the full document, the resulting chunk vector encodes document-level context implicitly, without any separate generation step.[^5]
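The pooling step that distinguishes late chunking can be sketched without any model. The example below assumes the per-token vectors have already been produced by one forward pass of a long-context encoder over the whole document; the function name and the plain-list representation are choices made for this illustration, and a real implementation would normally use an array library.

```python
def late_chunk_vectors(token_vectors, chunk_spans):
    """Mean-pool per-token vectors over each chunk's token span.

    token_vectors: one vector (list of floats) per document token, taken
        from a single encoder pass over the full document.
    chunk_spans: half-open (start, end) token index ranges, one per chunk.
    """
    pooled = []
    for start, end in chunk_spans:
        span = token_vectors[start:end]
        if not span:
            raise ValueError(f"empty chunk span: ({start}, {end})")
        dim = len(span[0])
        # Average each dimension over the tokens that belong to the chunk.
        pooled.append(
            [sum(vec[d] for vec in span) / len(span) for d in range(dim)]
        )
    return pooled


tokens = [[1.0, 0.0], [3.0, 2.0], [5.0, 4.0], [7.0, 6.0]]
print(late_chunk_vectors(tokens, [(0, 2), (2, 4)]))
# → [[2.0, 1.0], [6.0, 5.0]]
```

The document-level context enters through `token_vectors`, not through the pooling: each token vector was computed with attention over the whole document, so averaging a span still carries information from outside it.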
The Jina authors compared their approach to Anthropic's and noted that contextual retrieval and late chunking achieve comparable similarity scores on their test sets, with late chunking offering the practical advantages of not requiring an LLM call per chunk and not requiring a chunk-rewrite step at all.[^13] On the BeIR benchmark with three Jina encoders, late chunking yielded relative improvements of 3.63% with sentence-boundary chunking, 3.46% with fixed-size chunking, and 2.70% with semantic chunking.[^5] The trade-off is that late chunking requires a long-context embedder that can process the whole document in one pass, which limits the choice of encoder; contextual retrieval imposes no such constraint and works with any embedder, including short-context ones.[^13]
In July 2025, Voyage AI released voyage-context-3, a commercial embedding model that incorporates document-level context directly into chunk embeddings during inference; the release framed itself as a managed alternative to both contextual retrieval and late chunking, removing the need for a separate LLM contextualization step.[^14]
Query rewriting techniques, such as those used in Microsoft's GraphRAG pipeline or in agentic retrieval loops, modify the user's query before it reaches the retriever, typically by adding synonyms, expanding acronyms, or decomposing multi-hop questions into sub-queries.[^15] Like HyDE, these techniques operate on the query side and pay LLM cost per query. They are complementary to contextual retrieval rather than substitutive: a system can rewrite queries at search time and index contextualized chunks at indexing time.
A simpler approach discussed on the Hacker News launch thread is to format chunks with explicit section headers from the source document rather than generating new context with an LLM. A commenter described prepending the markdown heading path (for example, "# Fever ## Treatment ---") to each chunk before embedding, capturing some of the context-loss problem without any LLM cost.[^11] This works well when documents already have rich hierarchical structure (medical references, technical documentation) but fails on flat documents such as news articles, code without comments, or transcript-like text, which is precisely the territory where Anthropic argued contextualization helps most.[^1]
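The heading-path approach can be sketched as a small markdown splitter. This is an illustration of the idea, not the commenter's code: it splits on ATX headings (`#` through `######`), uses one chunk per heading section, and joins the path with the `---` separator from the example above. It does not handle headings inside fenced code blocks, and the function name is hypothetical.

```python
import re

HEADING = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")


def chunks_with_heading_path(markdown: str) -> list[str]:
    """Split on headings and prefix each section with its heading path."""
    path = []   # (level, title) pairs for the currently open headings
    body = []   # lines collected since the last heading
    out = []

    def flush():
        text = "\n".join(body).strip()
        if text:
            prefix = " ".join(f"{'#' * lvl} {title}" for lvl, title in path)
            out.append(f"{prefix} --- {text}" if prefix else text)
        body.clear()

    for line in markdown.splitlines():
        match = HEADING.match(line)
        if match:
            flush()
            level = len(match.group(1))
            # Close headings at the same or a deeper level.
            while path and path[-1][0] >= level:
                path.pop()
            path.append((level, match.group(2)))
        else:
            body.append(line)
    flush()
    return out


doc = "# Fever\nIntro text.\n## Treatment\nRest and fluids.\n## Causes\nInfection."
for chunk in chunks_with_heading_path(doc):
    print(chunk)
# # Fever --- Intro text.
# # Fever ## Treatment --- Rest and fluids.
# # Fever ## Causes --- Infection.
```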
Anthropic published the reference implementation as a Jupyter notebook in its cookbook repository on GitHub, originally at anthropics/anthropic-cookbook and later migrated to anthropics/claude-cookbooks, under skills/contextual-embeddings.[^6] The notebook defines a ContextualVectorDB class that wraps the Voyage AI client (using the voyage-2 embedding model) and the Anthropic client, applies the prompt template above with cache_control: ephemeral on the document text, and stores the contextualized text along with metadata in a pickle file. It also includes an ElasticsearchBM25 class wrapping a local Elasticsearch instance for the lexical index, and a retrieve_advanced function that performs the RRF fusion described above.[^6]
The cookbook reports the token accounting for its 737-chunk demonstration: 1,223,730 input tokens billed at full price, 2,267,069 input tokens served from cache at the 10% rate, and a 61.83% cache-read share of total input tokens.[^6]
Beyond the official notebook, the technique was rapidly absorbed into ecosystem tools.
On Amazon Web Services, the AWS Machine Learning Blog published a Bedrock Knowledge Bases implementation on June 5, 2025 that performs the contextualization step in a Lambda function during knowledge-base ingestion, using Claude 3 Haiku for context generation, fixed 300-token chunks with 20% overlap, and Amazon Titan for embedding; the team reported improvements over Bedrock's default chunking on context recall, context precision, and answer correctness using the RAGAS framework.[^16]
Together AI shipped an open-source variant in its documentation that swaps Anthropic-specific components for open-weights alternatives: Qwen 3.5-9B for context generation, multilingual-e5-large-instruct for embeddings, and Mxbai-Rerank-Large-V2 for reranking, with the same RRF fusion structure.[^8]
LangChain, LlamaIndex, and the Instructor framework all published tutorials within weeks of the launch. The Instructor implementation by Jason Liu, dated September 26, 2024, paralleled the Anthropic cookbook but used Python's asyncio to run the per-chunk Haiku calls concurrently, treating the contextualization output as a Pydantic SituatedContext object for type safety.[^9] Milvus, the open-source vector database, published a separate quickstart that wires contextual retrieval into a Milvus collection.[^17]
The Hacker News submission of the Anthropic post on September 20, 2024 (item 41598119, submitted by user loganfrederick) accumulated 309 points and 72 comments.[^11] Praise focused on the simplicity of the method (a prompt template plus an embedding step that any RAG developer can drop into an existing pipeline) and on the leverage that prompt caching provides for indexing-time use cases.[^11]
Critical commentary on the thread and in subsequent blog posts hit several recurring points.[^11][^18]
The five-minute cache lifetime constrains how indexing jobs must be ordered. Implementations that interleave documents during indexing or that span days lose the cache benefit and revert to the uncached cost. The cookbook's recommendation of a thread pool over a single document at a time addresses this but increases implementation complexity.[^6]
The method requires that the parent document fit in the context window of the contextualization model. For documents above 200,000 tokens (Haiku's window in 2024), the document must itself be split before contextualization, which reintroduces a version of the original problem.[^11]
Vendor lock-in was a concern. Several commenters noted that, while the technique itself is model-agnostic, the prompt-caching cost advantage was Anthropic-specific at the time of the launch; OpenAI, Google, and AWS later shipped comparable caching features for their own APIs.[^11][^18]
The technique adds a one-time indexing cost. For corpora that change frequently, the cost of reindexing must be weighed against the retrieval-quality gain. Anthropic's $1.02 per million tokens figure is small in absolute terms but non-trivial for, say, a 10-billion-token corpus that is reindexed weekly.[^1][^11]
The 35% to 67% headline numbers are reductions in retrieval failure, not improvements in end-to-end answer quality. Several follow-up writeups noted that downstream answer accuracy gains are smaller than the retrieval-failure reductions, because the generator can sometimes recover from missing context if other retrieved chunks compensate.[^18]
Despite the criticisms, the launch produced a measurable shift in RAG defaults. Within months of the post, AWS, Together AI, Milvus, LlamaIndex, LangChain, and Pinecone had each published contextual-retrieval tutorials or integrations, and the technique entered the standard RAG stack alongside hybrid search and reranking.[^16][^8][^17] By mid-2025, commercial embedding providers including Voyage AI began shipping models that performed contextualization in-encoder, positioning their products as a managed substitute for the Anthropic recipe.[^14]
The blog post and its third-party analyses identified several scope limits for the technique.[^1][^11][^18]
Contextual retrieval helps most when chunks contain ambiguous references that the parent document resolves. Documents that are already self-contained at the paragraph level (well-edited reference articles with explicit topic sentences, code with inline docstrings, structured database records) see smaller gains. The post is explicit that for knowledge bases under approximately 200,000 tokens (about 500 pages), users should consider loading the entire knowledge base into the generator's context window rather than retrieving from it at all, since prompt caching makes that strategy cheap enough to be the default for small corpora.[^1]
The method does not address the failure modes that come from poor chunking itself. If a chunk boundary cuts a sentence in half, or if a chunk groups unrelated content, no amount of contextual summary will repair the underlying fragmentation. Several commentators paired contextual retrieval with semantic or recursive chunking strategies for this reason.[^11][^18]
The technique inherits the factuality limits of the contextualization model. If Claude 3 Haiku writes a context summary that misattributes a chunk, the embedding picks up the wrong signal, and the chunk becomes harder to find rather than easier. Anthropic's prompt is deliberately constrained ("Answer only with the succinct context and nothing else") to keep summaries grounded, but the failure mode exists.[^1][^6]
The retrieval gains do not necessarily translate one-to-one into answer-quality gains, particularly when the generator is large enough to compensate for partially missing context by reasoning over what was retrieved. Some downstream pipelines may see most of the improvement in the worst 5% to 10% of queries (the ones where the gold chunk previously failed to be retrieved at all) rather than across the distribution.[^18]