Chunking (information retrieval)
Last reviewed
May 24, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v1 · 4,087 words
Chunking is the preprocessing step in document indexing and retrieval-augmented generation (RAG) systems in which a long document or corpus is split into smaller passages, called chunks, that are then embedded and stored in a vector database for retrieval. Chunking exists because embedding models and large language models impose finite input lengths: most Sentence-BERT variants accept up to 512 subword tokens, the default all-MiniLM-L6-v2 model truncates to 256 tokens, and even long-context encoders such as OpenAI's text-embedding-3-large cap input at 8,192 tokens.[^1][^2][^3] Chunking decisions, namely chunk size, overlap, and boundary policy, materially affect retrieval recall and downstream answer faithfulness; small chunks concentrate the relevant span but lose surrounding context, whereas large chunks preserve context but dilute the embedding signal across irrelevant material.[^4][^5] Common strategies range from fixed-size character or token splits, through recursive splitters that try paragraph, sentence, and word boundaries in turn, to embedding-driven semantic chunking and the more recent contextual and late chunking methods introduced by Anthropic, Jina AI, and Voyage AI in 2024 and 2025.[^6][^7][^8]
Dense retrieval pipelines map text into fixed-dimensional vectors so that semantic similarity can be approximated by cosine similarity or dot product between query and document vectors.[^4] Two engineering constraints make chunking unavoidable. First, the transformer encoders behind dense retrievers have a maximum sequence length, beyond which tokens are silently truncated. The all-MiniLM-L6-v2 model, one of the most widely used Sentence-BERT checkpoints, truncates inputs longer than 256 subword tokens by default and was trained on sequences capped at 128 tokens, so it does not generalise well to longer passages even if the underlying BERT configuration allows 512 positions.[^2] OpenAI's third-generation embedding models, text-embedding-3-small and text-embedding-3-large, both accept up to 8,192 tokens and produce vectors of up to 1,536 and 3,072 dimensions respectively, but inputs longer than 8,192 tokens must be either truncated or split.[^3][^9] Second, even when an embedding model has a long context window, the LLM that consumes the retrieved chunks at generation time has its own context window limit, and packing a few overly long chunks into the prompt spends the reranker's and generator's capacity on irrelevant text.[^4]
Beyond hard limits, chunking serves a quality function: a single embedding compresses an entire passage into one vector, so the longer the passage, the more semantic detail collapses. When a 1,000-token chunk discusses three different sub-topics, the resulting vector points toward the centroid of those topics and is unlikely to closely match any query about a single one of them. Smaller chunks therefore raise retrieval precision because the relevant span dominates the chunk's vector, but they also raise the risk that the answer-bearing information was placed in a neighbouring chunk and missed.[^5][^10] Pinecone's chunking guide frames the tradeoff explicitly: smaller chunks capture granular semantics, larger chunks retain broader context, and the right size depends on the document type, the embedding model's training regime, and the expected query complexity.[^4]
Fixed-size chunking, the simplest strategy, splits text every N characters or tokens, optionally with overlap between adjacent chunks. Greg Kamradt's "5 Levels of Text Splitting" tutorial, an influential reference among RAG practitioners, calls fixed-size character splitting "Level 1" and describes it as a useful starting point for understanding the chunking problem but not a method he recommends for production systems.[^7][^11] Amazon Bedrock Knowledge Bases exposes fixed-size chunking as a first-class option, allowing users to configure the maximum number of tokens per chunk and a percentage overlap between consecutive chunks.[^12] Bedrock's "default chunking" mode, by contrast, targets approximately 300 tokens per chunk while honoring sentence boundaries, so the resulting chunks are not strictly equal in length but never cut a sentence in half.[^12]
Fixed-size approaches have two well-known failure modes. First, they cut across sentences, tables, and code structures: a 512-token window starting mid-paragraph may begin in the middle of one sentence and end in the middle of another, leaving the embedding model to encode a fragment that lacks subject or predicate. Second, the size parameter is hard to choose blind. Pinecone's guide and many community guides converge on a default range of 200 to 512 tokens with 10 to 20 percent overlap, but NVIDIA's evaluation on FinanceBench with 1,024-token chunks found that 15 percent overlap outperformed 10 and 20 percent, and Chroma's evaluation framework shows that the OpenAI Assistants default of 800 tokens with 400-token overlap had below-average recall on a multi-corpus benchmark.[^4][^13][^14]
Chunk overlap is the practice of repeating a tail of one chunk at the head of the next, so that information located near a chunk boundary appears in two chunks rather than being split. The standard heuristic in the industry is 10 to 20 percent of chunk size; for a 1,000-character chunk, this means 100 to 200 characters of overlap.[^13] The purpose is to mitigate the worst-case scenario in which the answer to a query straddles a boundary and would otherwise appear at the end of one chunk and the start of the next, hurting the embedding similarity of both. Overlap is not a free lunch: it inflates the number of stored vectors and the cost of retrieval, and beyond roughly 30 percent the marginal gain in recall is small relative to the storage and latency penalty.[^13]
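The sliding-window behaviour described above can be illustrated with a minimal sketch. It assumes the text has already been tokenised into a list (whitespace splitting stands in for a real tokenizer here), and the function name `fixed_size_chunks` is chosen for illustration rather than taken from any library.

```python
def fixed_size_chunks(tokens, chunk_size, overlap):
    """Split a token list into windows of chunk_size that share `overlap` tokens."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reached the end of the input
    return chunks


# Whitespace splitting is an assumption standing in for a model tokenizer.
tokens = "the quick brown fox jumps over the lazy dog again".split()
chunks = fixed_size_chunks(tokens, chunk_size=4, overlap=1)
for chunk in chunks:
    print(" ".join(chunk))
```

With these toy settings the token "fox" closes the first chunk and opens the second, which is the boundary-straddling protection that overlap is meant to provide. The number of stored chunks grows as the step `chunk_size - overlap` shrinks, which is the storage cost noted above.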
The most widely adopted general-purpose chunker is LangChain's RecursiveCharacterTextSplitter, which Greg Kamradt has called "the swiss army knife of splitters" and his usual first choice when prototyping.[^7] The recursive splitter accepts a target chunk_size and an ordered list of separators, by default ["\n\n", "\n", " ", ""], representing paragraph breaks, line breaks, spaces, and empty strings.[^15][^6] It first tries to split the input on the strongest separator (paragraph break); if the resulting segments still exceed chunk_size, it recurses on the next separator within each oversized segment, and so on down to the character level. The intent is to keep paragraphs together where possible and only break into smaller units when the paragraph itself exceeds the target size.[^15]
LangChain's API exposes a chunk_overlap parameter that adds overlap between adjacent output chunks and a length_function that defaults to character count but can be swapped for a token counter to match the embedding model's tokenizer.[^6][^15] For languages such as Chinese, Japanese, and Thai, which lack whitespace word boundaries, the documentation recommends supplying custom separators including punctuation and zero-width spaces so that the splitter does not break mid-word.[^15] Internal tests at Chroma showed that with default parameters, RecursiveCharacterTextSplitter performed "relatively poorly" on synthetic query evaluation, which the authors attribute to the gap between default character-based size and tokenizer-aware sizing rather than to a flaw in the recursive idea itself.[^14]
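The recursive idea can be sketched in a few lines of plain Python. This is a simplified illustration, not LangChain's implementation: it measures size in characters, omits overlap, and the name `recursive_split` is hypothetical.

```python
def recursive_split(text, chunk_size, separators=("\n\n", "\n", " ", "")):
    """Split on the coarsest separator first, recursing only into oversized parts."""
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators or separators[0] == "":
        # Last resort: hard cut at the character level.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    sep, finer = separators[0], separators[1:]
    chunks, current = [], ""
    for part in text.split(sep):
        if not part:
            continue
        candidate = part if not current else current + sep + part
        if len(candidate) <= chunk_size:
            current = candidate  # keep merging neighbours while they fit
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(part) <= chunk_size:
            current = part
        else:
            # The part alone is too big: retry with the next, finer separator.
            chunks.extend(recursive_split(part, chunk_size, finer))
    if current:
        chunks.append(current)
    return chunks


text = (
    "Intro paragraph.\n\n"
    "A much longer second paragraph that will not fit in one chunk.\n\n"
    "End."
)
chunks = recursive_split(text, chunk_size=30)
for chunk in chunks:
    print(repr(chunk))
```

In this sketch the short paragraphs survive whole, and only the long middle paragraph is broken at word boundaries. Passing a token-counting function instead of `len` would be the analogue of swapping `length_function` in the library.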
LlamaIndex provides an analogous component called SentenceSplitter, with default chunk_size=1024 tokens and chunk_overlap=200 tokens, that tries to keep sentences and paragraphs together so that fewer hanging sentence fragments occur at chunk boundaries than with the older TokenTextSplitter.[^16]
A second family of strategies splits text on linguistic or structural boundaries rather than on raw character counts.
Sentence-based splitters use NLTK's sent_tokenize or spaCy's statistical sentence segmenter to first break a document into sentences, then group adjacent sentences until a target token budget is reached.[^17] NLTK uses a deterministic, rule-based tokenizer based on punctuation patterns; spaCy uses a trained statistical model that handles ambiguous cases such as abbreviations more robustly.[^17] LangChain wraps both in NLTKTextSplitter and SpacyTextSplitter adapters. Sentence chunking avoids mid-sentence splits, which is the main visual artefact of fixed-size methods, but it does not guarantee that the resulting chunks are semantically coherent: a single boundary placed between two related sentences can still split a coherent idea across chunks.
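A minimal sketch of the group-until-budget step follows. To stay dependency-free it assumes a naive regular-expression segmenter in place of NLTK or spaCy and counts whitespace-separated words in place of model tokens; both are stand-ins, and the function names are illustrative.

```python
import re


def split_sentences(text):
    """Naive segmenter: break after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def group_sentences(sentences, max_tokens):
    """Pack whole sentences into chunks without exceeding max_tokens words."""
    chunks, current, used = [], [], 0
    for sentence in sentences:
        n = len(sentence.split())
        if current and used + n > max_tokens:
            chunks.append(" ".join(current))
            current, used = [], 0
        current.append(sentence)  # an oversized sentence becomes its own chunk
        used += n
    if current:
        chunks.append(" ".join(current))
    return chunks


text = (
    "Chunking splits documents. Each chunk is embedded. "
    "Queries are embedded too. The nearest chunks are retrieved."
)
chunks = group_sentences(split_sentences(text), max_tokens=8)
for chunk in chunks:
    print(chunk)
```

The regular expression mis-splits abbreviations such as "e.g. this", which is the kind of ambiguous case a trained segmenter is meant to handle.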
When documents have explicit hierarchical structure, the splitter can use that structure as a cue. LangChain's MarkdownHeaderTextSplitter accepts a configurable list of header levels, by default ("#", "Header 1"), ("##", "Header 2"), ("###", "Header 3"), and produces one chunk per section, attaching the chain of enclosing headers to each chunk as metadata so that downstream retrieval can filter or rerank by section.[^18][^19] The splitter strips the header line from the chunk body by default, controlled by the strip_headers parameter, and overlap does not cross section boundaries.[^18] A common pipeline chains MarkdownHeaderTextSplitter with RecursiveCharacterTextSplitter so that very long sections are further subdivided while shorter sections remain whole.[^19] Analogous splitters exist for HTML, LaTeX, and PDF documents that exploit the corresponding structural markers.[^4]
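The header-chain metadata can be illustrated with a small standalone sketch. It is not the LangChain class: the function name `split_by_headers` and the output dictionary shape are assumptions, and it ignores edge cases such as `#` characters inside fenced code.

```python
import re

HEADER = re.compile(r"^(#{1,3})\s+(.*)$")


def split_by_headers(markdown):
    """Return one chunk per section, tagged with its chain of enclosing headers."""
    chunks, path, body = [], {}, []

    def flush():
        text = "\n".join(body).strip()
        if text:
            metadata = {f"Header {level}": path[level] for level in sorted(path)}
            chunks.append({"metadata": metadata, "content": text})
        body.clear()

    for line in markdown.splitlines():
        match = HEADER.match(line)
        if match is None:
            body.append(line)
            continue
        flush()  # a header closes the section collected so far
        level = len(match.group(1))
        # A new header replaces any header at the same or a deeper level.
        for stale in [k for k in path if k >= level]:
            del path[stale]
        path[level] = match.group(2).strip()
    flush()
    return chunks


doc = """# Guide
Intro text.
## Install
Run the installer.
## Usage
### CLI
Call the tool.
"""
chunks = split_by_headers(doc)
for chunk in chunks:
    print(chunk["metadata"], "->", chunk["content"])
```

Header lines are dropped from the chunk body here, mirroring the stripping behaviour described above; sections with no body text produce no chunk.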
Source code is poorly served by character-based chunking because functions, classes, and control structures rarely align with fixed token windows. The cAST method, published in June 2025 by researchers from Carnegie Mellon University and Augment Code, parses source files into an Abstract Syntax Tree (AST) with tree-sitter and then applies a recursive split-then-merge algorithm that greedily merges adjacent AST nodes while respecting a configured chunk size, recursively subdivides oversized nodes, and measures size in non-whitespace characters.[^20] cAST is designed for four properties: syntactic integrity (chunks correspond to whole syntactic units), high information density, language invariance across more than a hundred grammars supported by tree-sitter, and plug-and-play behaviour (concatenating chunks recovers the original file).[^20] In experiments across RepoEval, CrossCodeEval, and SWE-bench, cAST improved RepoEval Recall@5 by 4.3 points, SWE-bench Pass@1 by 2.67 points, and CrossCodeEval exact match by 4.3 points compared with line-based baselines, with the largest single gain (5.5 points) coming from StarCoder2 on RepoEval.[^20]
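The split-then-merge idea can be sketched loosely with Python's built-in `ast` module in place of tree-sitter. This is an assumption-laden illustration, not the cAST implementation: it handles only Python source, works on line ranges, and the function names are hypothetical.

```python
import ast


def dense_size(lines):
    """Count non-whitespace characters, the size measure described above."""
    return sum(len("".join(line.split())) for line in lines)


def split_points(nodes, lines, max_chars):
    """Split step: collect start lines, descending only into oversized nodes."""
    starts = []
    for node in nodes:
        starts.append(node.lineno)
        body = [c for c in getattr(node, "body", []) if isinstance(c, ast.stmt)]
        if body and dense_size(lines[node.lineno - 1:node.end_lineno]) > max_chars:
            starts.extend(split_points(body, lines, max_chars))
    return starts


def ast_chunks(source, max_chars):
    lines = source.splitlines(keepends=True)
    starts = sorted(set(split_points(ast.parse(source).body, lines, max_chars)))
    if not starts:
        return [source] if source else []
    starts[0] = 1  # leading comments or blank lines join the first segment
    bounds = starts + [len(lines) + 1]
    segments = [lines[a - 1:b - 1] for a, b in zip(bounds, bounds[1:])]
    # Merge step: greedily pack adjacent segments while they fit the budget.
    chunks, current = [], []
    for segment in segments:
        if current and dense_size(current + segment) > max_chars:
            chunks.append("".join(current))
            current = []
        current = current + segment
    if current:
        chunks.append("".join(current))
    return chunks


source = '''import math

def area(r):
    return math.pi * r * r

def perimeter(r):
    return 2 * math.pi * r

class Circle:
    def __init__(self, r):
        self.r = r

    def describe(self):
        return f"circle of radius {self.r}"
'''
chunks = ast_chunks(source, max_chars=60)
assert "".join(chunks) == source  # concatenating chunks recovers the file
```

Because every segment runs from one node's first line to the line before the next node, no text is dropped, which gives the concatenation property. A greedy merge can still attach a class header to the preceding function, so this sketch is weaker on syntactic integrity than the published method claims to be.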
Semantic chunking, introduced as Level 4 of Kamradt's framework, places chunk boundaries where the topic shifts rather than at a fixed token count.[^7] The basic algorithm first splits the document into sentences, embeds each sentence, then walks adjacent sentences computing the cosine similarity between their embeddings; when the similarity between two adjacent sentence groups drops below a configurable threshold, the algorithm declares a chunk boundary.[^21] LlamaIndex implements this idea as SemanticSplitterNodeParser with three primary hyperparameters: buffer_size (the number of surrounding sentences combined before embedding, default 1), breakpoint_percentile_threshold (the percentile of inter-sentence distances above which a boundary is placed, typically 80 to 95), and an embed_model pointer to the embedding provider used for the similarity computation.[^22] Amazon Bedrock Knowledge Bases exposes a similar set of parameters: maximum tokens per chunk, buffer size of surrounding sentences, and a breakpoint percentile threshold, with higher thresholds yielding fewer and larger chunks.[^12]
Semantic chunking trades simplicity for adaptability. A document that switches frequently between topics will produce many chunks, while a long discussion of a single topic will produce a few large ones. The cost is that an embedding model must be invoked once per sentence (or per small sentence group) during ingestion, adding latency and dollars per document. Bedrock's documentation notes explicitly that semantic chunking incurs additional foundation-model cost during ingestion compared with fixed or hierarchical chunking.[^12]
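The embedding walk can be sketched as follows. To keep the example self-contained it assumes a toy bag-of-words vector in place of a real embedding model, uses a nearest-rank percentile, and omits the sentence buffer; none of the names below come from LlamaIndex or Bedrock.

```python
import math
import re
from collections import Counter


def toy_embed(sentence):
    """Stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z]+", sentence.lower()))


def cosine_distance(a, b):
    dot = sum(a[word] * b[word] for word in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return 1.0 - (dot / (norm_a * norm_b) if norm_a and norm_b else 0.0)


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def semantic_chunks(sentences, breakpoint_percentile=80, embed=toy_embed):
    if len(sentences) < 2:
        return [" ".join(sentences)] if sentences else []
    vectors = [embed(s) for s in sentences]
    distances = [cosine_distance(a, b) for a, b in zip(vectors, vectors[1:])]
    threshold = percentile(distances, breakpoint_percentile)
    chunks, current = [], [sentences[0]]
    for sentence, distance in zip(sentences[1:], distances):
        if distance > threshold:  # topic shift: close the current chunk
            chunks.append(" ".join(current))
            current = []
        current.append(sentence)
    chunks.append(" ".join(current))
    return chunks


sentences = [
    "Cats are small domestic animals.",
    "Cats sleep for most of the day.",
    "Many cats like to chase mice.",
    "Interest rates rose sharply this year.",
    "Higher interest rates slow borrowing.",
    "Banks pass higher rates to borrowers.",
]
chunks = semantic_chunks(sentences, breakpoint_percentile=80)
for chunk in chunks:
    print(chunk)
```

In this toy input the only distance above the 80th percentile is the jump from cats to interest rates, so the walk yields two chunks. Raising the percentile leaves fewer distances above the threshold and therefore produces fewer, larger chunks. With a real model, each `embed` call is the per-sentence ingestion cost discussed above.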
Kamradt also describes a Level 5 "agentic" chunker that prompts an LLM to identify split points, treating chunking as a reasoning task rather than a similarity computation, and an "experimental" level intended for the regime where token costs approach zero.[^7] Chroma's open evaluation framework instantiates a similar idea as LLMChunker, which reached the highest recall (91.9 percent) among all chunkers tested with OpenAI's text-embedding-3-large at five retrieved chunks, although its precision (3.9 percent) was lower than that of the ClusterSemanticChunker baseline (8.0 percent precision, 91.3 percent recall) at the same 200-token target size.[^14]
In September 2024 Anthropic published "Introducing Contextual Retrieval," describing a method that prepends a short, document-aware context string to each chunk before embedding and BM25 indexing.[^23] The motivation is the observation that a chunk such as "The company's revenue grew by 3% over the previous quarter" loses the antecedents required to retrieve it: a query for revenue growth at a specific company cannot match this chunk on its own because the chunk does not name the company or the time period.[^23] The Anthropic method uses Claude (in the published cookbook, Claude 3 Haiku) to generate a one- or two-sentence context for each chunk given the full source document, then concatenates the context with the chunk before embedding and before constructing the BM25 inverted index.[^23]
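The prepend step can be sketched without committing to any particular LLM client. In the sketch below, `generate_context` is a hypothetical callable standing in for the model call, `stub_context` is a trivial placeholder for it, and the company name in the sample document is invented for illustration.

```python
def contextualize_chunks(document, chunks, generate_context):
    """Prepend a document-aware context string to each chunk before indexing."""
    contextualized = []
    for chunk in chunks:
        context = generate_context(document, chunk).strip()
        contextualized.append(f"{context}\n\n{chunk}" if context else chunk)
    return contextualized


def stub_context(document, chunk):
    # Hypothetical stand-in for an LLM call: reuse the document's first line.
    title = document.splitlines()[0]
    return f"This chunk is from: {title}."


document = (
    "ACME Corp quarterly report, Q2 2023\n"
    "The company's revenue grew by 3% over the previous quarter."
)
chunks = ["The company's revenue grew by 3% over the previous quarter."]
indexed_texts = contextualize_chunks(document, chunks, stub_context)
print(indexed_texts[0])
```

The same contextualised string would be passed both to the embedding model and to the BM25 indexer, so that a query naming the company or the quarter can match on either path.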
Anthropic reported retrieval-failure reductions on a benchmark of codebases, fiction, ArXiv papers, and other science papers: contextual embeddings alone cut failure rates from 5.7 percent to 3.7 percent (a 35 percent reduction), contextual embeddings combined with contextual BM25 cut them to 2.9 percent (a 49 percent reduction), and adding a reranker brought them to 1.9 percent (a 67 percent reduction). Their cost model assumes 800-token chunks within 8,000-token documents, a 50-token context-generation instruction, and roughly 100 tokens of generated context per chunk, yielding a one-time ingestion cost of $1.02 per million document tokens when prompt caching is used to amortize the document content across all chunks of the same document.[^23] AWS published an Amazon Bedrock Knowledge Bases integration that implements the technique, and Anthropic's public cookbook includes a reference implementation.[^24]
A parallel September 2024 paper from Jina AI, "Late Chunking: Contextual Chunk Embeddings Using Long-Context Embedding Models" by Michael Günther, Isabelle Mohr, Daniel James Williams, Bo Wang, and Han Xiao, proposes a different solution to the same context-loss problem.[^25] Rather than rewriting chunks before embedding, late chunking changes the order of operations: it first runs the entire long document through a long-context embedding model in a single pass, producing one token-level vector per token in the document, and only then segments the resulting vector sequence into chunks and applies mean pooling within each chunk. Because every token's contextual representation already attended to every other token during the forward pass, each resulting chunk vector encodes information from the surrounding document, not only from the chunk itself.[^7][^25]
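The pooling step is simple enough to show directly. The sketch below assumes the long-context encoder has already produced one vector per token (the two-dimensional vectors are made-up values) and that chunk boundaries are given as token index spans; the function name `late_chunk` is illustrative.

```python
def late_chunk(token_vectors, spans):
    """Mean-pool contextual token vectors inside each (start, end) token span."""
    pooled = []
    for start, end in spans:
        window = token_vectors[start:end]
        if not window:
            raise ValueError(f"empty span ({start}, {end})")
        dim = len(window[0])
        pooled.append([sum(vec[i] for vec in window) / len(window) for i in range(dim)])
    return pooled


# Made-up token vectors standing in for the encoder's token-level output.
token_vectors = [[1.0, 0.0], [3.0, 0.0], [0.0, 2.0], [0.0, 4.0], [0.0, 6.0]]
chunk_vectors = late_chunk(token_vectors, spans=[(0, 2), (2, 5)])
print(chunk_vectors)  # [[2.0, 0.0], [0.0, 4.0]]
```

The difference from naive chunking lies entirely in where the token vectors come from: here they are computed once over the whole document, whereas naive chunking would run the encoder separately on each span before pooling.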
Jina's published evaluation on BEIR datasets shows consistent gains over naive chunking on the same encoder: SciFact nDCG@10 rose from 64.20 to 66.10, TRECCOVID from 63.36 to 64.70, and NFCorpus from 23.46 to 29.98, with the authors observing that "the longer the document, the more effective the late chunking strategy becomes."[^26] The technique requires no extra training and works with any long-context embedding model that uses mean pooling, with Jina's own Jina Embeddings v3 and its predecessor v2-base-en (both with 8,192-token windows) as the reference targets.[^25][^26] Subsequent integration posts from Milvus, Elasticsearch, and other vector-search vendors describe how to wire late chunking into production retrieval stacks.
In July 2025 Voyage AI (by then a subsidiary of MongoDB) released voyage-context-3, an embedding model that internalises the contextualisation step rather than performing it as a preprocessing pass.[^8][^27] The model accepts a list of chunks from the same source document, runs them through a single forward pass with cross-chunk attention, and emits one embedding per chunk that has been conditioned on the rest of the document. Voyage describes these as "contextualised chunk embeddings" and presents them as a drop-in replacement for standard embedding APIs that requires no metadata augmentation by the user.[^27]
On Voyage's evaluation of 93 retrieval datasets spanning nine domains (web reviews, law, medicine, long documents, technical documentation, code, finance, conversations, and multilingual text), voyage-context-3 reportedly outperforms, measured by NDCG@10, text-embedding-3-large by 14.24 percent at chunk-level retrieval and 12.56 percent at document-level retrieval, Cohere embed v4 by 7.89 percent and 5.64 percent, Jina v3 late chunking by 23.66 percent and 6.76 percent, and Anthropic's contextual retrieval by 20.54 percent and 2.40 percent.[^8] The model supports 2,048-, 1,024-, 512-, and 256-dimensional outputs in 32-bit float, signed and unsigned 8-bit integer, and binary quantisations; Voyage reports that the binary 512-dimensional variant matches text-embedding-3-large in retrieval quality while using roughly 0.5 percent of the storage.[^8]
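The storage figure can be checked with back-of-the-envelope arithmetic. The sketch assumes the comparison baseline is text-embedding-3-large stored at its full 3,072 dimensions as 32-bit floats, which the source does not state explicitly, and it ignores index overhead.

```python
def vector_bytes(dimensions, bits_per_dimension):
    """Raw storage for one vector, ignoring index and metadata overhead."""
    return dimensions * bits_per_dimension / 8


baseline = vector_bytes(3072, 32)  # assumed float32 baseline: 12288.0 bytes
binary = vector_bytes(512, 1)      # 512-dimensional binary vector: 64.0 bytes
ratio = binary / baseline
print(f"{ratio:.2%}")  # 0.52%
```

Under that assumption a binary 512-dimensional vector takes 64 bytes against 12,288 bytes, about 0.52 percent, which is consistent with the "roughly 0.5 percent" figure.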
The chunk-size question has been studied empirically and the answers are corpus-specific, but a few generalisations recur across guides from Pinecone, NVIDIA, Chroma, and the LangChain community:
| Embedding model | Max input | Practical chunk size |
|---|---|---|
| all-MiniLM-L6-v2 (Sentence-BERT) | 256 tokens (default) | 128 to 256 tokens[^2] |
| BERT-base / SBERT 512-token variants | 512 tokens | 256 to 512 tokens[^1] |
| OpenAI text-embedding-3-large | 8,192 tokens | 256 to 1,024 tokens (typical), up to 8,192 for full-doc[^3][^9] |
| Jina embeddings v3 | 8,192 tokens | 512 to 2,048 tokens (or full-doc with late chunking)[^26] |
| Voyage voyage-context-3 | full-document | chunk size mostly irrelevant; sensitivity 2.06 percent across configurations[^8] |
For chunk overlap, 10 to 20 percent of chunk size is the broadly cited default; NVIDIA's FinanceBench experiment found 15 percent optimal at 1,024-token chunks, intermediate between the 10 and 20 percent endpoints.[^13] Above 30 percent overlap, returns diminish quickly while index size grows linearly.[^13]
Because chunking sits upstream of every other RAG component, evaluating chunking quality requires a measurable retrieval task, not just qualitative inspection. The standard metric is Recall@K, the fraction of test queries for which at least one relevant chunk appears in the top-K results returned by the retriever.[^28] BEIR, a widely used benchmark for dense retrievers, packages 18 datasets across nine retrieval tasks and reports nDCG@10, MAP, Precision@K, and Recall@K side by side; researchers comparing chunking strategies often report results on a subset of BEIR plus their own domain corpus.[^28] MTEB (the Massive Text Embedding Benchmark) plays a complementary role for embedding-model comparison.
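Recall@K as defined above takes only a few lines to compute. The sketch assumes retrieval results and relevance labels are held in dictionaries keyed by query ID; the chunk and query IDs are made up.

```python
def recall_at_k(results, relevant, k):
    """Fraction of queries with at least one relevant chunk in the top-k results."""
    if not results:
        return 0.0
    hits = sum(
        1
        for query_id, ranked in results.items()
        if set(ranked[:k]) & relevant.get(query_id, set())
    )
    return hits / len(results)


# Ranked chunk IDs per query, best match first (illustrative data).
results = {
    "q1": ["c3", "c1", "c7"],
    "q2": ["c2", "c9", "c4"],
    "q3": ["c5", "c6", "c8"],
}
relevant = {"q1": {"c1"}, "q2": {"c4"}, "q3": {"c0"}}

for k in (1, 2, 3):
    print(k, round(recall_at_k(results, relevant, k), 3))
```

On this data the metric rises from 0.0 at K=1 to about 0.333 at K=2 and 0.667 at K=3, and the third query never scores because its relevant chunk is not retrieved at all. Running the same queries against indexes built with different chunk sizes gives a directly comparable number per configuration.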
Chroma's chunking evaluation, published in 2024, generates synthetic queries with GPT-4 Turbo against five diverse corpora totalling 328,208 tokens, filters them by embedding similarity, and computes token-level precision, recall, and Intersection over Union (IoU) to capture both whether the relevant span was retrieved and how much extraneous content the retrieved chunks contain.[^14] The framework allows direct comparison of chunkers (recursive, semantic, LLM-driven) at controlled chunk sizes with the same embedding model. Chroma's headline finding is that default parameter choices in popular libraries are often far from optimal: the OpenAI Assistants default of 800 tokens with 400-token overlap underperformed both smaller and better-tuned alternatives.[^14] Pinecone's practitioner guidance is to run a small representative test set at chunk sizes of 200, 400, and 600 tokens, measure Recall@5, and pick the configuration that maximises it for the specific corpus.[^4]
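Token-level scoring can be sketched with set arithmetic over token positions. This is a simplified, set-based reading of the three metrics, so it does not reproduce how Chroma's framework treats tokens duplicated by overlapping chunks; the positions used are illustrative.

```python
def token_metrics(retrieved, relevant):
    """Precision, recall and IoU over sets of token positions."""
    retrieved, relevant = set(retrieved), set(relevant)
    overlap = len(retrieved & relevant)
    union = len(retrieved | relevant)
    return {
        "precision": overlap / len(retrieved) if retrieved else 0.0,
        "recall": overlap / len(relevant) if relevant else 0.0,
        "iou": overlap / union if union else 0.0,
    }


relevant = range(100, 120)   # 20 answer-bearing token positions
retrieved = range(90, 110)   # one retrieved 20-token chunk
scores = token_metrics(retrieved, relevant)
print(scores)
```

Here the chunk covers half of the relevant span and half of it is extraneous, so precision and recall are both 0.5 while IoU is one third. Retrieving a much larger chunk would push recall toward 1.0 but drag precision and IoU down, which is the dilution effect the token-level view is meant to expose.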
Hierarchical chunking, also called parent-child chunking, indexes two granularities at once: small "child" chunks that are precise enough for embedding similarity to discriminate, and larger "parent" chunks that contain the surrounding context.[^12] At retrieval time, the system embeds the query, finds the top-K child chunks by vector similarity, and then returns each child's parent as the actual context given to the LLM. The intent is to combine the precision of small chunks with the contextual completeness of larger ones. Amazon Bedrock Knowledge Bases supports hierarchical chunking with two levels, allowing the user to configure parent and child token sizes and an overlap, and warns that the returned number of results may be lower than requested because multiple matched children may map to the same parent.[^12]
LlamaIndex implements a similar idea as the "auto-merging retriever," which retrieves leaf chunks and merges them into their parent when enough siblings are returned, and as the "sentence-window retriever," which embeds individual sentences but returns each retrieved sentence together with a fixed window of surrounding sentences to the generator.[^16]
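The child-to-parent lookup can be illustrated with plain dictionaries. The IDs, texts, and function name below are hypothetical, and the ranked child hits are assumed to come from a vector search that is not shown.

```python
def parents_for_hits(child_hits, parent_of, parents):
    """Map ranked child-chunk hits to parent texts, keeping first-hit order."""
    seen, context = set(), []
    for child_id in child_hits:
        parent_id = parent_of[child_id]
        if parent_id not in seen:  # several children may share one parent
            seen.add(parent_id)
            context.append(parents[parent_id])
    return context


parents = {
    "p1": "Section 1: full text of the installation guide ...",
    "p2": "Section 2: full text of the troubleshooting guide ...",
}
parent_of = {"c1": "p1", "c2": "p1", "c3": "p2"}

child_hits = ["c2", "c1", "c3"]  # top-3 children from the vector search
context = parents_for_hits(child_hits, parent_of, parents)
print(len(child_hits), "children ->", len(context), "parents")
```

Three child hits collapse to two parents in this example, which is why a parent-child retriever can return fewer results than the number requested.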
LangChain groups its text splitters in the langchain_text_splitters package. The package includes CharacterTextSplitter (Level 1 in Kamradt's taxonomy), RecursiveCharacterTextSplitter (Level 2 and the recommended default for generic text), TokenTextSplitter and per-tokenizer variants for OpenAI and HuggingFace models, MarkdownHeaderTextSplitter and HTMLHeaderTextSplitter (Level 3 for structured documents), and integrations with NLTK and spaCy for sentence-level splitting.[^15][^18]
LlamaIndex's analogous abstractions are "node parsers," which produce Node objects (the LlamaIndex equivalent of a chunk). The library ships SentenceSplitter (default chunk_size=1024, chunk_overlap=200), TokenTextSplitter, SentenceWindowNodeParser, MarkdownNodeParser, and SemanticSplitterNodeParser (the embedding-walk implementation).[^16][^22]
Vector database vendors have largely converged on offering chunking inside their managed ingestion pipelines. Pinecone's "Chunking Strategies" guide recommends fixed-size chunking as the starting point for most applications and provides examples spanning content-aware, structure-aware, semantic, and contextual chunking with LLMs.[^4] Weaviate and Qdrant both publish chunking guides and integrations with LangChain and LlamaIndex. Anthropic's Claude Cookbook hosts a reference implementation of contextual retrieval that practitioners can adapt to their own corpora.[^24]
Despite a decade of dense-retrieval research, chunking remains a poorly automated step in most RAG pipelines. Three limitations recur across surveys and evaluations:
First, chunk size is corpus- and query-dependent, and there is no universal best size. Pinecone, NVIDIA, and Chroma all recommend empirical tuning rather than defaults, but doing so requires a labelled or synthetic evaluation set that many production teams do not have.[^4][^13][^14] AI21's 2024 work on multi-scale retrieval argues that the optimal chunk size depends on the query type and proposes indexing the same corpus at multiple granularities to dispatch each query to the appropriate scale.[^29]
Second, chunk boundaries lose context, which is the central problem that contextual retrieval, late chunking, and contextualised chunk embeddings all attempt to solve. Each solution carries its own cost: contextual retrieval requires an LLM pass per chunk at ingestion time (cheap with prompt caching but still nonzero); late chunking depends on long-context encoders that may not exist for every language or domain; contextualised chunk embeddings tie the user to a specific embedding provider.[^23][^25][^8]
Third, evaluation infrastructure is immature. Token-level metrics such as Chroma's IoU give a more faithful picture than document-level Recall@K, but they require synthetic-query generation pipelines that are themselves sensitive to the LLM prompt and the corpus. The same chunking method can win or lose by several percentage points depending on the evaluation dataset, the embedding model, and the chunk size considered, so single-number comparisons across providers should be treated with caution.[^14]