Semantic chunking
Last reviewed
May 24, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v1 · 3,872 words
Semantic chunking is a family of document-segmentation strategies for retrieval-augmented generation pipelines that places chunk boundaries based on similarity between sentence embeddings rather than fixed character or token counts. Each sentence (or short sentence group) is embedded with a model such as OpenAI's text-embedding-3 or Sentence-BERT, and the cosine distance between consecutive sentence embeddings is used to detect points where the topic of the document shifts. Chunk breaks are inserted at those points so that each resulting chunk groups sentences about a single subtopic.[^1][^2] The term entered mainstream RAG vocabulary through Greg Kamradt's "5 Levels of Text Splitting" tutorial published in late 2023 and early 2024 on his FullStackRetrieval site, which referred to embedding-based segmentation as Level 4 and provided a reference implementation later ported into LangChain's SemanticChunker and LlamaIndex's SemanticSplitterNodeParser.[^3][^4][^5] The conceptual lineage runs back to Marti Hearst's TextTiling algorithm (1997) and Freddy Choi's C99 algorithm (2000), both of which detected subtopic boundaries from lexical-cohesion signals using bag-of-words vectors instead of dense embeddings.[^6][^7] Independent benchmarks published in 2024 by Chroma and by researchers at Vectara reported that semantic chunking did not consistently outperform recursive character splitting on retrieval metrics and incurred roughly an order-of-magnitude higher ingestion cost, leading to ongoing debate about when the method is worth the additional embedding compute.[^8][^9]
Semantic chunking sits in a long line of work on linear text segmentation, the task of dividing a continuous document into contiguous subtopic units. The most cited early example is TextTiling, introduced by Marti Hearst of UC Berkeley in a paper titled "TextTiling: Segmenting Text into Multi-paragraph Subtopic Passages" published in Computational Linguistics volume 23, issue 1, pages 33 to 64 in March 1997.[^6] Hearst observed that "a set of words is in use during the course of a given subtopic discussion, and when that subtopic changes, a significant proportion of the vocabulary changes as well," and built a three-stage algorithm: tokenize the document into terms and sentence-sized pseudo-sentences, compute a cohesion score for each pseudo-sentence boundary by comparing two adjacent fixed-width windows of pseudo-sentences using a bag-of-words cosine similarity, then place segment boundaries at the lowest scoring positions after a smoothing step.[^6] The algorithm was evaluated against human-annotated boundaries on 12 magazine articles and produced segmentations that matched human judgment for most boundaries, establishing the basic template that semantic chunking would later inherit: convert text to vectors, compute similarity between adjacent windows, place breaks at valleys in the similarity curve.[^6]
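The window-comparison idea can be illustrated with a minimal sketch. It is not Hearst's implementation: it omits stop-word removal, smoothing, and depth scoring, and the function names, the window size of 2, and the toy sentences are all assumptions made for illustration.

```python
import math
import re
from collections import Counter


def bag_of_words(text: str) -> Counter:
    """Lower-cased term counts for one block of text."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def gap_scores(units: list[str], window: int = 2) -> list[float]:
    """Cohesion at each gap: similarity of the windows on either side of it."""
    scores = []
    for gap in range(1, len(units)):
        left = bag_of_words(" ".join(units[max(0, gap - window):gap]))
        right = bag_of_words(" ".join(units[gap:gap + window]))
        scores.append(cosine_similarity(left, right))
    return scores


def valleys(scores: list[float]) -> list[int]:
    """Indices of interior strict local minima in the cohesion curve."""
    return [
        i for i in range(1, len(scores) - 1)
        if scores[i] < scores[i - 1] and scores[i] < scores[i + 1]
    ]


# Toy input (hypothetical): two sentences about a cat, then two about markets.
units = [
    "The cat sat on the mat.",
    "The cat chased the mouse on the mat.",
    "Stock markets fell sharply today.",
    "Investors sold stock as markets fell.",
]
scores = gap_scores(units)
print(valleys(scores))  # [1]
```

Index i in the score list refers to the gap between units[i] and units[i+1], so the single valley at index 1 falls between the cat sentences and the market sentences, where the two windows share no vocabulary.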
Three years later Freddy Y. Y. Choi presented "Advances in Domain Independent Linear Text Segmentation" at the first meeting of the North American Chapter of the Association for Computational Linguistics (NAACL) in 2000, introducing the C99 algorithm.[^7] C99 retained the lexical-cohesion intuition from TextTiling but replaced the smoothed window comparison with two innovations: it ranked inter-sentence similarities by their relative position in the local context (a rank transform that made the algorithm robust to absolute similarity scale), and it located boundaries using a divisive hierarchical clustering step rather than the smoothing-and-valley-finding procedure.[^7] Choi reported that C99 was "twice as accurate and over seven times as fast as the state-of-the-art approach" on a synthetic benchmark created by concatenating sections drawn from the Brown Corpus, and that benchmark dataset became the standard evaluation set for linear text segmentation for the next two decades.[^7]
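The rank transform can be sketched as follows. This is a simplified illustration, not Choi's code: the function name, the small default mask radius, and the 2×2 example matrix are assumptions, and the divisive clustering step that follows in C99 is not shown.

```python
def rank_transform(sim: list[list[float]], radius: int = 1) -> list[list[float]]:
    """Replace each similarity with the fraction of neighbouring cells
    (inside a square mask of the given radius) that hold a lower value."""
    n = len(sim)
    ranked = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            lower = examined = 0
            for di in range(-radius, radius + 1):
                for dj in range(-radius, radius + 1):
                    r, c = i + di, j + dj
                    # Skip the cell itself and anything outside the matrix.
                    if (di or dj) and 0 <= r < n and 0 <= c < n:
                        examined += 1
                        lower += sim[r][c] < sim[i][j]
            ranked[i][j] = lower / examined if examined else 0.0
    return ranked


# Hypothetical 2x2 sentence-similarity matrix.
similarities = [[1.0, 0.2], [0.2, 1.0]]
ranks = rank_transform(similarities)
# Halving every similarity leaves the ranks unchanged.
halved = rank_transform([[v * 0.5 for v in row] for row in similarities])
```

Because only the ordering of values within the mask matters, any positive rescaling of the similarity matrix yields the same ranks, which is the robustness to absolute similarity scale described above.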
Through the 2000s and early 2010s a steady stream of follow-ups extended the same idea, replacing bag-of-words vectors with word embedding vectors, generative topic models such as Latent Dirichlet Allocation, or features derived from neural language models, and gradually layering supervised learning on top of the lexical-cohesion signal.[^10] By the late 2010s the field had largely shifted toward supervised neural segmentation models trained on Wikipedia-derived datasets such as Wiki-50 and Wiki-727k, and a line of work known as DeepTiling reused the TextTiling structure but replaced bag-of-words vectors with sentence embeddings from large language models.[^10] These academic methods predated and influenced what RAG practitioners eventually rediscovered as "semantic chunking" after the release of strong general-purpose text-embedding models in 2022 and 2023.
The modern revival came from a practitioner audience rather than from the segmentation research community. Greg Kamradt, a developer educator who runs the YouTube channel and content site FullStackRetrieval.com, published a video and accompanying Jupyter notebook titled "The 5 Levels Of Text Splitting For Retrieval" on his channel in late 2023, and the source notebook was committed to the public GitHub repository FullStackRetrieval-com/RetrievalTutorials under an MIT licence.[^3][^4][^11] The framework defined five tiers: Level 1 fixed-character splitting, Level 2 recursive character splitting (the default approach used by LangChain's RecursiveCharacterTextSplitter), Level 3 document-specific splitting (separate handling for Markdown, code, and PDFs), Level 4 semantic splitting based on embeddings, and Level 5 an experimental "agentic" splitter that uses a language model to choose chunk boundaries one at a time.[^3][^4] The notebook is the first widely circulated source to use the phrase "semantic chunking" for the embedding-based variant, and it is credited explicitly in the source code of both LangChain's SemanticChunker and LlamaIndex's SemanticSplitterNodeParser (see "Implementations" below).[^12][^13]
The canonical semantic-chunking algorithm as implemented in the LangChain and LlamaIndex reference parsers has six steps.[^1][^2][^12][^13]
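1. The text is split into sentences with a regular expression, by default (?<=[.?!])\s+, which matches whitespace following one of the three end-of-sentence punctuation marks.[^12]
2. Each sentence is combined with its neighbours according to buffer_size. With buffer_size=1 the system embeds the concatenation of the previous sentence, the current sentence, and the next sentence; the buffer is intended to smooth out short sentences that would otherwise produce noisy similarities.[^12][^13]
3. Each buffered group is embedded with a model such as text-embedding-3-small, Cohere embeddings, or an open-weight Sentence-BERT model.[^1][^2]
4. The cosine distance between each pair of consecutive group embeddings is computed.
5. A breakpoint threshold is derived from the document's own distances, by default a percentile.
6. A break is placed wherever a distance exceeds the threshold, and the sentences between consecutive breaks are joined into chunks.

Pseudocode for the core loop, mirroring LangChain's SemanticChunker._calculate_breakpoints method, is: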
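    # Split into sentences, then widen each one with `buf` neighbours per side.
    # max(0, ...) keeps the window start from going negative near the beginning.
    sentences = sentence_split(text)
    groups = [join(sentences[max(0, i-buf) : i+buf+1]) for i in range(len(sentences))]
    # One embedding per buffered group; distance between consecutive groups.
    vectors = embed(groups)
    distances = [1 - cosine(vectors[i], vectors[i+1]) for i in range(len(vectors)-1)]
    # Break after sentence i wherever the distance exceeds the per-document threshold.
    threshold = percentile(distances, breakpoint_amount)
    breakpoints = [i for i, d in enumerate(distances) if d > threshold]
    chunks = assemble_chunks(sentences, breakpoints)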
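The threshold is computed from the document's own distance distribution rather than set globally, so the splitter adapts to each document: distance spikes at sharp topic transitions stand well above the threshold and become breaks, while the small fluctuations within a single-topic passage mostly stay below it.[^12] How far this adaptivity goes depends on the threshold rule. An outlier-style rule (mean plus several standard deviations, or an interquartile fence) can leave a flat, single-topic document as one chunk, whereas a percentile rule by construction flags roughly the top 5 per cent of distances at the default setting of 95, so the number of breaks it produces grows with the number of sentences.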
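The loop can be made concrete with a runnable sketch that uses only the standard library. It is an illustration under stated assumptions rather than either library's code: semantic_chunks, percentile, and toy_embed are hypothetical names, and toy_embed is a bag-of-words stand-in for a real embedding model so that the example needs no API key.

```python
import math
import re
from typing import Callable, Sequence


def percentile(values: Sequence[float], q: float) -> float:
    """q-th percentile (0-100) using linear interpolation between ranks."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100
    lo, hi = math.floor(pos), math.ceil(pos)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm if norm else 1.0


def toy_embed(texts: list[str]) -> list[list[float]]:
    """Stand-in for a real model: word counts over a shared vocabulary."""
    tokenised = [re.findall(r"[a-z]+", t.lower()) for t in texts]
    vocab = sorted({word for tokens in tokenised for word in tokens})
    return [[float(tokens.count(word)) for word in vocab] for tokens in tokenised]


def semantic_chunks(
    text: str,
    embed: Callable[[list[str]], Sequence[Sequence[float]]],
    buffer_size: int = 1,
    breakpoint_percentile: float = 95.0,
) -> list[str]:
    sentences = [s for s in re.split(r"(?<=[.?!])\s+", text.strip()) if s]
    if len(sentences) < 2:
        return sentences  # no pair of sentences to compare
    groups = [
        " ".join(sentences[max(0, i - buffer_size): i + buffer_size + 1])
        for i in range(len(sentences))
    ]
    vectors = embed(groups)
    distances = [
        cosine_distance(vectors[i], vectors[i + 1]) for i in range(len(vectors) - 1)
    ]
    threshold = percentile(distances, breakpoint_percentile)
    chunks, start = [], 0
    for i, distance in enumerate(distances):
        if distance > threshold:  # break after sentence i
            chunks.append(" ".join(sentences[start:i + 1]))
            start = i + 1
    chunks.append(" ".join(sentences[start:]))
    return chunks


text = "Cats purr. Cats nap. Stocks fell. Stocks rose."
print(semantic_chunks(text, toy_embed, buffer_size=0))
# ['Cats purr. Cats nap.', 'Stocks fell. Stocks rose.']
```

The early return for inputs with fewer than two sentences avoids taking a percentile of an empty distance list. Passing buffer_size=0 keeps the toy example easy to follow; with real documents the buffered default is the more typical choice.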
LangChain's SemanticChunker (in the langchain_experimental.text_splitter module) supports four breakpoint_threshold_type values, with the default amounts stored in a BREAKPOINT_DEFAULTS dictionary in the source code.[^12]
| Threshold type | Default amount | Rule |
|---|---|---|
| percentile | 95 | A break occurs where the distance exceeds the 95th percentile of all consecutive-sentence distances in the document.[^12] |
| standard_deviation | 3 | A break occurs where the distance exceeds the mean distance plus three standard deviations.[^12] |
| interquartile | 1.5 | A break occurs where the distance exceeds the third quartile plus 1.5 times the interquartile range, a standard Tukey outlier rule.[^12] |
| gradient | 95 | The first derivative of the distance curve is computed, then a percentile threshold is applied to the gradient instead of the raw distances; useful for documents where topic shifts are gradual.[^15] |
The original LangChain release (committed in January 2024) shipped only the percentile rule. The other three types were added by community contributors: PR #16807 by Matt Haigh, merged on 26 February 2024, added the standard-deviation and interquartile types, motivated by Haigh's observation that distance distributions across documents tend to be positively skewed normal distributions, for which standard-deviation thresholds are more predictable than fixed percentiles.[^16] PR #22895 by contributor rrajp added the gradient option in mid-2024.[^15] A separate PR #18019 by Killian Mahé added a number_of_chunks parameter that lets the user fix the chunk count and back-solve for the percentile that produces exactly that many breakpoints.[^17]
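The four rules can be compared side by side with a small standard-library sketch. It follows the rules as worded in the table above and should not be read as LangChain's source: find_breakpoints, percentile, gradient, and DEFAULTS are hypothetical names, the population standard deviation is an assumption, and the distance list is made up.

```python
import math
import statistics

# Default amounts as listed in the table above.
DEFAULTS = {
    "percentile": 95,
    "standard_deviation": 3,
    "interquartile": 1.5,
    "gradient": 95,
}


def percentile(values, q):
    """q-th percentile (0-100) using linear interpolation between ranks."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100
    lo, hi = math.floor(pos), math.ceil(pos)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)


def gradient(values):
    """Central differences inside, one-sided differences at the two ends."""
    n = len(values)
    if n < 2:
        return [0.0] * n
    inner = [(values[i + 1] - values[i - 1]) / 2 for i in range(1, n - 1)]
    return [values[1] - values[0], *inner, values[-1] - values[-2]]


def find_breakpoints(distances, kind="percentile", amount=None):
    """Indices whose signal value exceeds the threshold for the chosen rule."""
    if not distances:
        return []
    amount = DEFAULTS[kind] if amount is None else amount
    signal = distances
    if kind == "percentile":
        threshold = percentile(distances, amount)
    elif kind == "standard_deviation":
        threshold = statistics.fmean(distances) + amount * statistics.pstdev(distances)
    elif kind == "interquartile":
        q1, q3 = percentile(distances, 25), percentile(distances, 75)
        threshold = q3 + amount * (q3 - q1)
    elif kind == "gradient":
        signal = gradient(distances)  # threshold the slope, not the raw distance
        threshold = percentile(signal, amount)
    else:
        raise ValueError(f"unknown threshold type: {kind}")
    return [i for i, value in enumerate(signal) if value > threshold]


# Hypothetical distance curve: flat, with one spike at index 3.
distances = [0.1, 0.1, 0.1, 0.9, 0.1, 0.1, 0.1, 0.1]
for kind in DEFAULTS:
    print(kind, find_breakpoints(distances, kind))
```

On this toy curve the percentile and interquartile rules flag the spike at index 3, the gradient rule flags index 2 (where the curve starts to rise), and the default three-standard-deviation rule flags nothing, because a single large spike in a short list inflates the standard deviation enough to lift the threshold above it.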
LlamaIndex's SemanticSplitterNodeParser, found in llama_index/core/node_parser/text/semantic_splitter.py, exposes a smaller surface area. Its parameters are buffer_size (default 1), breakpoint_percentile_threshold (default 95), and an embed_model argument that takes any LlamaIndex embedding object.[^13] The LlamaIndex docstring describes the splitter as one that "adaptively picks the breakpoint in-between sentences using embedding similarity" and attributes the idea explicitly to Greg Kamradt, with the upstream note that the regex used for sentence splitting "primarily works for English sentences."[^13]
LangChain's SemanticChunker lives in the langchain_experimental package, which is the LangChain organisation's holding area for features that have not yet stabilised into the core library.[^12] The class docstring opens with the line "Taken from Greg Kamradt's wonderful notebook" followed by "All credits to him," and links to the FullStackRetrieval tutorial repository.[^12] The constructor signature, condensed from the source, is:
    SemanticChunker(
        embeddings: Embeddings,
        buffer_size: int = 1,
        add_start_index: bool = False,
        breakpoint_threshold_type: BreakpointThresholdType = "percentile",
        breakpoint_threshold_amount: Optional[float] = None,
        number_of_chunks: Optional[int] = None,
        sentence_split_regex: str = r"(?<=[.?!])\s+",
        min_chunk_size: Optional[int] = None,
    )
The class exposes split_text, split_documents, and transform_documents methods and is interchangeable with the other text splitters that LangChain ships, so it can be dropped into existing RAG pipelines that previously used RecursiveCharacterTextSplitter.[^12] In November 2024 an issue (#35553) proposed promoting SemanticChunker out of the experimental package into a stable LangChain library on the grounds that the API had been stable for nearly a year and the implementation was widely used in production, but as of mid-2026 the class remained in langchain_experimental.[^18]
LlamaIndex's parser, written by the run-llama team and available since LlamaIndex 0.9, is structured as a MetadataAwareTextSplitter subclass that operates on Document objects and produces TextNode objects.[^13] The reference example from the LlamaIndex documentation reads:
    from llama_index.core.node_parser import SemanticSplitterNodeParser
    from llama_index.embeddings.openai import OpenAIEmbedding

    embed_model = OpenAIEmbedding()
    splitter = SemanticSplitterNodeParser(
        buffer_size=1,
        breakpoint_percentile_threshold=95,
        embed_model=embed_model,
    )

    # `documents` is a list of LlamaIndex Document objects loaded elsewhere.
    nodes = splitter.get_nodes_from_documents(documents)
The documentation notes that raising the threshold from 95 to 99 produces fewer, larger chunks (the splitter becomes more conservative about cutting), while lowering it to 80 produces many small, tightly-themed chunks.[^2] LlamaIndex's docs explicitly state that the parser "implements semantic chunking, a concept proposed by Greg Kamradt" and link to his tutorial.[^2]
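The effect of the threshold on chunk count can be checked with a few lines of standard-library Python. The sketch below is independent of LlamaIndex: count_breaks is a hypothetical helper, and the 200 evenly spaced distances are synthetic values chosen only to make the arithmetic easy to follow.

```python
import statistics


def count_breaks(distances: list[float], threshold_percentile: int) -> int:
    """Number of distances strictly above the given percentile (1-99)."""
    # With n=100 and method="inclusive", cut point k-1 is the k-th percentile.
    cut_points = statistics.quantiles(distances, n=100, method="inclusive")
    threshold = cut_points[threshold_percentile - 1]
    return sum(d > threshold for d in distances)


# Synthetic, evenly spaced distances: 0.005, 0.010, ..., 1.000.
distances = [i / 200 for i in range(1, 201)]
for p in (80, 95, 99):
    breaks = count_breaks(distances, p)
    print(f"percentile {p}: {breaks} breaks, {breaks + 1} chunks")
```

For these 200 distances the three settings give 40, 10, and 2 breaks respectively, which matches the direction described above: a higher percentile leaves fewer distances above the threshold and therefore fewer, larger chunks.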
Beyond the two main RAG frameworks, semantic chunking has been re-implemented in many smaller projects. Notable variants include KamradtSemanticChunker and KamradtModifiedChunker (a faithful port and an iterative-search modification, respectively, both evaluated in the 2024 Chroma chunking study), ClusterSemanticChunker (which trades the breakpoint search for a divisive clustering step similar in spirit to C99), and LLMSemanticChunker (which delegates the boundary decision to a large language model rather than to a threshold rule).[^8] The Hugging Face community has hosted variants such as sentence-transformers-semantic-chunker that use locally runnable Sentence-BERT models in place of API-based embeddings.[^19] Cloud providers including AWS Bedrock now offer semantic chunking as a built-in option for managed RAG ingestion pipelines.[^20]
The two baseline chunking strategies in modern RAG pipelines are fixed-size chunking, which cuts the document into uniform N-token windows, and recursive character splitting, popularised by LangChain's RecursiveCharacterTextSplitter, which tries an ordered list of separators (double newline, single newline, sentence, word, character) and falls back to the next separator only when the previous one produces a chunk above the target size.[^21][^22] Both methods are cheap (no embedding model call during chunking), deterministic, and produce chunks of predictable length, which makes them friendly to fixed-budget downstream prompting.[^21]
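The separator fallback can be sketched in a few lines. This is a simplified illustration of the idea, not LangChain's RecursiveCharacterTextSplitter: it measures length in characters, has no chunk overlap, and the function name and default separator list are assumptions.

```python
def recursive_split(
    text: str,
    max_len: int,
    separators: tuple[str, ...] = ("\n\n", "\n", ". ", " ", ""),
) -> list[str]:
    """Split on the coarsest separator first; recurse with finer ones
    only for pieces that are still longer than max_len."""
    if len(text) <= max_len:
        return [text] if text else []
    if not separators or separators[0] == "":
        # Last resort: hard cut every max_len characters.
        return [text[i:i + max_len] for i in range(0, len(text), max_len)]
    sep, finer = separators[0], separators[1:]
    chunks, current = [], ""
    for piece in text.split(sep):
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= max_len:
            current = candidate  # keep packing pieces into the current chunk
            continue
        if current:
            chunks.append(current)
        if len(piece) <= max_len:
            current = piece
        else:
            # This piece alone is too long: fall back to the next separator.
            chunks.extend(recursive_split(piece, max_len, finer))
            current = ""
    if current:
        chunks.append(current)
    return chunks


print(recursive_split("aaa bbb\n\nccc ddd eee", max_len=8))
# ['aaa bbb', 'ccc ddd', 'eee']
```

No embedding call appears anywhere in the function, and every returned chunk is at most max_len characters long, which is the predictability that the baselines offer.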
Semantic chunking trades that predictability for adaptivity. Chunk lengths vary by document because the splitter follows the document's own topic structure rather than imposing a uniform window. A document with three obvious sections will produce three chunks; a document with thirty subtle topic shifts will produce thirty.[^1][^12] In practice this can produce chunks that are too short (a single sentence) or too long (the entire document) when the chosen threshold does not match the document's distance distribution, which is why later versions of SemanticChunker added min_chunk_size and number_of_chunks parameters as guard rails.[^12][^17]
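One way to picture a fixed chunk count is to keep only the largest distances as breaks. The sketch below takes that shortcut and is not LangChain's number_of_chunks logic, which works by solving for a percentile; the function name and the sample distances are hypothetical.

```python
def breakpoints_for_chunk_count(distances: list[float], number_of_chunks: int) -> list[int]:
    """Pick the (number_of_chunks - 1) largest distances as break positions."""
    wanted = max(0, min(number_of_chunks - 1, len(distances)))
    by_distance = sorted(
        range(len(distances)), key=lambda i: distances[i], reverse=True
    )
    return sorted(by_distance[:wanted])  # restore document order


distances = [0.2, 0.9, 0.1, 0.7, 0.3]
print(breakpoints_for_chunk_count(distances, 3))  # [1, 3]
```

Asking for three chunks selects the two largest gaps, at indices 1 and 3. Requests for more chunks than there are gaps are clamped, so the result never contains more breaks than distances.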
A throughput benchmark reported on the firecrawl.dev blog measured the recursive character splitter at roughly 3.54 megabytes per second on a 100,000-article Wikipedia subset, compared with roughly 0.33 megabytes per second for a semantic chunker on the same hardware, an order-of-magnitude difference attributable to the per-sentence embedding calls.[^22]
A related family of methods uses clustering rather than sequential breakpoint detection: every sentence is embedded, the sentence vectors are passed to a clustering algorithm (k-means, agglomerative, or HDBSCAN), and each cluster becomes a chunk regardless of the sentences' positions in the document.[^23] Clustering-based chunking can group together sentences from different parts of a document that discuss the same topic, but it sacrifices the property that chunks are contiguous spans of the source, which complicates retrieval workflows that surface citations alongside answers.[^23] Hybrid methods such as ClusterSemanticChunker (Chroma) keep the contiguous-span property by clustering only within local windows.[^8]
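The loss of contiguity is easy to see in a toy example. The sketch below uses a very simple agglomerative scheme (single linkage cut at a similarity threshold, implemented with union-find) rather than k-means or HDBSCAN; the function names, the 0.5 threshold, and the two-dimensional vectors are all assumptions for illustration.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def cluster_sentences(vectors: list[list[float]], min_similarity: float = 0.5) -> list[list[int]]:
    """Group sentence indices whose vectors are linked by similarity >= threshold."""
    parent = list(range(len(vectors)))

    def find(i: int) -> int:
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(len(vectors)):
        for j in range(i + 1, len(vectors)):
            if cosine(vectors[i], vectors[j]) >= min_similarity:
                parent[find(j)] = find(i)  # merge the two clusters

    clusters: dict[int, list[int]] = {}
    for i in range(len(vectors)):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())


# Hypothetical embeddings: sentences 0 and 2 share a topic, sentence 1 does not.
vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.1]]
print(cluster_sentences(vectors))  # [[0, 2], [1]]
```

Sentences 0 and 2 end up in one chunk even though sentence 1 sits between them in the document, so the chunk no longer corresponds to a single span that could be cited or highlighted.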
Semantic chunking decides chunk boundaries before embedding; late chunking, introduced by the Jina AI research team in 2024, reverses the order by embedding the full document with a long-context embedding model and then deriving chunk-level vectors by mean-pooling over the token vectors that fall within each chunk's span.[^24] Anthropic's contextual retrieval method, published in September 2024, takes a third approach: chunks remain fixed-size, but each chunk is prefixed at indexing time with an automatically generated description of how the chunk relates to the rest of the document.[^25] These approaches solve different problems from semantic chunking and are largely complementary: semantic chunking decides where to cut, while late chunking and contextual retrieval decide how to represent the chunks once the cuts are fixed.
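The pooling step of late chunking reduces to averaging token vectors over each chunk's token span. The sketch below shows only that step, with made-up two-dimensional token vectors standing in for the output of a long-context model; the function name and the half-open (start, end) span convention are assumptions.

```python
def late_chunk_vectors(
    token_vectors: list[list[float]],
    spans: list[tuple[int, int]],
) -> list[list[float]]:
    """Mean-pool token vectors over each half-open [start, end) token span."""
    pooled = []
    for start, end in spans:
        window = token_vectors[start:end]
        if not window:
            raise ValueError(f"empty span: ({start}, {end})")
        dim = len(window[0])
        pooled.append([sum(vec[k] for vec in window) / len(window) for k in range(dim)])
    return pooled


# Hypothetical contextualised token vectors for a four-token document.
token_vectors = [[1.0, 0.0], [3.0, 0.0], [0.0, 2.0], [0.0, 4.0]]
print(late_chunk_vectors(token_vectors, [(0, 2), (2, 4)]))
# [[2.0, 0.0], [0.0, 3.0]]
```

The spans themselves can come from any chunker, including a semantic one, which is why the two techniques combine rather than compete.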
Two independent studies published in 2024 evaluated semantic chunking against simpler baselines on retrieval benchmarks and reached broadly compatible conclusions.
The Chroma research team (Brandon Smith and Anton Troynikov) published "Evaluating Chunking Strategies for Retrieval" in July 2024.[^8] The study introduced token-level precision, recall, and intersection-over-union (IoU) metrics that compare retrieved spans against ground-truth answer spans rather than against document-level labels, and evaluated several chunkers, including KamradtSemanticChunker, KamradtModifiedChunker, ClusterSemanticChunker, LLMSemanticChunker, and the standard LangChain RecursiveCharacterTextSplitter and TokenTextSplitter, on Chroma's own evaluation corpus.[^8] The headline finding was mixed: the authors' new ClusterSemanticChunker with a 200-token cap achieved the highest precision and IoU, the LLMSemanticChunker had the highest recall at 91.9 per cent, but the default-parameter RecursiveCharacterTextSplitter was competitive on most metrics and substantially cheaper to run.[^8] The authors concluded that the choice of chunking strategy can have meaningful effects on retrieval performance, but that the default parameters shipped by popular chunkers (including semantic chunkers) often underperformed simpler baselines when measured carefully.[^8]
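Token-level metrics of this kind can be illustrated with sets of token positions. The sketch below is a simplified reading of the idea and may differ from the study's exact definitions (for example in how tokens retrieved more than once are counted); the function names and the half-open span convention are assumptions.

```python
def tokens_in(spans: list[tuple[int, int]]) -> set[int]:
    """Set of token positions covered by half-open [start, end) spans."""
    return {pos for start, end in spans for pos in range(start, end)}


def token_metrics(
    retrieved_spans: list[tuple[int, int]],
    relevant_spans: list[tuple[int, int]],
) -> dict[str, float]:
    retrieved, relevant = tokens_in(retrieved_spans), tokens_in(relevant_spans)
    overlap = len(retrieved & relevant)
    union = len(retrieved | relevant)
    return {
        "precision": overlap / len(retrieved) if retrieved else 0.0,
        "recall": overlap / len(relevant) if relevant else 0.0,
        "iou": overlap / union if union else 0.0,
    }


# One retrieved chunk covering tokens 0-9; the answer occupies tokens 5-14.
metrics = token_metrics([(0, 10)], [(5, 15)])
print(metrics)
```

Here five of the ten retrieved tokens are relevant and five of the ten relevant tokens are retrieved, so precision and recall are both 0.5, while IoU is 5/15. Oversized chunks are penalised through precision and IoU even when recall is high, which is the property that document-level labels cannot capture.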
A second study by Renyi Qu, Ruixuan Tu, and Forrest Bao at Vectara and the University of Wisconsin–Madison, titled "Is Semantic Chunking Worth the Computational Cost?" (arXiv:2410.13070, October 2024), compared semantic chunking with fixed-size chunking across three downstream tasks: document retrieval, evidence retrieval, and answer generation.[^9] The authors used 10 document-retrieval datasets (six stitched from BEIR, four originals including HotpotQA and MS MARCO), 5 evidence-retrieval datasets from RAGBench, and the same RAGBench datasets for answer generation evaluated with GPT-4o-mini, across three embedding models spanning the MTEB leaderboard.[^9] Their conclusion was blunt: "the computational costs associated with semantic chunking are not justified by consistent performance gains."[^9] Semantic chunking outperformed fixed-size chunking on artificially stitched, high-diversity documents (where the topic shifts were synthetic and abrupt) but lost or tied on natural documents, evidence retrieval showed minimal differences with fixed-size winning on three of five datasets, and BERTScore differences in answer generation were under one percentage point.[^9] The authors found that the choice of embedding model had a substantially larger effect on retrieval quality than the choice of chunking strategy, and recommended that practitioners focus optimisation effort on embedding quality before investing in semantic chunkers.[^9]
These two studies complement rather than contradict each other: Chroma showed that careful design of a semantic chunker (with tuned parameters) can produce the best metrics on their corpus, while Vectara showed that out-of-the-box semantic chunking does not reliably beat simpler baselines on broad benchmark suites. Both reinforced the view that semantic chunking is an option to consider, not a default to adopt.
The empirical studies above and the published implementation source motivate several known limitations.
Embedding compute cost. Semantic chunking requires embedding every sentence (or buffered sentence group) in every document at ingestion time. For a corpus of millions of documents, this can translate into millions of embedding API calls before any retrieval ever happens, with associated dollar and latency costs.[^9][^22] Estimates published on RAG-focused blogs put the additional ingestion compute at roughly 5 to 10 times that of recursive character splitting, with the exact ratio depending on document length, embedding model, and batch size.[^26]
Threshold sensitivity. Because the threshold rule is computed from the document's own distance distribution, the right threshold for one document or domain may not be the right threshold for another. A percentile that produces clean chunks on news articles may over-segment dense technical documentation or under-segment chatty social-media transcripts. Both LangChain's and LlamaIndex's documentation note that the threshold "may have to be tuned" for the target corpus.[^2][^12] Domain-specific tuning is rarely automated and adds an extra hyperparameter to the RAG pipeline.
Short-document failure modes. When a document contains only one or two sentences the algorithm has no distances or only a single distance to threshold against, which leads to degenerate behaviour. The LangChain repository contains issues (#17106, #25869) reporting errors and crashes on single-sentence inputs, and on inputs where number_of_chunks equals the number of sentences.[^27][^28]
Loss of length predictability. Many downstream RAG components (rerankers, context-window planners, prompt budgeters) work best with chunks of a known maximum size. Semantic chunking produces variable-length chunks, sometimes much longer than the target embedding model's context window, which forces additional post-processing to enforce size bounds. SemanticChunker's min_chunk_size parameter guards only against undersized chunks, so an upper bound typically has to be applied as a separate splitting step.[^12][^17]
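A post-processing guard rail of this kind can be sketched as a merge pass followed by a split pass. The function below is a hypothetical helper, not part of LangChain or LlamaIndex; it measures size in characters for simplicity, whereas a production pipeline would more likely count tokens.

```python
def enforce_size_bounds(chunks: list[str], min_chars: int, max_chars: int) -> list[str]:
    """Merge undersized chunks into a neighbour, then cut oversized ones."""
    merged: list[str] = []
    for chunk in chunks:
        if merged and len(merged[-1]) < min_chars:
            merged[-1] = merged[-1] + " " + chunk  # previous chunk was too short
        else:
            merged.append(chunk)
    if len(merged) > 1 and len(merged[-1]) < min_chars:
        tail = merged.pop()  # fold a short trailing chunk backwards
        merged[-1] = merged[-1] + " " + tail

    bounded: list[str] = []
    for chunk in merged:
        while len(chunk) > max_chars:
            cut = chunk.rfind(" ", 0, max_chars + 1)  # last space that fits
            if cut <= 0:
                cut = max_chars  # no usable space: hard cut
            bounded.append(chunk[:cut].rstrip())
            chunk = chunk[cut:].lstrip()
        if chunk:
            bounded.append(chunk)
    return bounded


chunks = ["Hi.", "A somewhat longer chunk here.", "x" * 50]
result = enforce_size_bounds(chunks, min_chars=10, max_chars=20)
print(result)
```

The upper bound is strict, but the split pass can leave a remainder shorter than min_chars (the final ten-character piece in this example), so the two bounds are not guaranteed simultaneously. The cuts also fall wherever the size limit dictates rather than at topic shifts, which gives back part of what semantic chunking was meant to provide.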
Sentence-splitter fragility. Both reference implementations default to a simple regex sentence splitter that handles English declarative sentences but fails on inputs with abbreviations, decimal numbers, ellipses, non-English punctuation, or no punctuation at all (such as transcribed speech). LlamaIndex's docs note that the regex "primarily works for English sentences."[^13]
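The fragility is easy to reproduce with the default regular expression quoted earlier. The sketch below applies it with Python's re module to two made-up inputs; split_sentences is a hypothetical wrapper name.

```python
import re

# The default pattern: whitespace preceded by '.', '?' or '!'.
SENTENCE_SPLIT = re.compile(r"(?<=[.?!])\s+")


def split_sentences(text: str) -> list[str]:
    return SENTENCE_SPLIT.split(text)


# One sentence containing two abbreviations is cut into three pieces.
print(split_sentences("Dr. Smith arrived at 9 a.m. and left."))
# ['Dr.', 'Smith arrived at 9 a.m.', 'and left.']

# Unpunctuated transcript text is never split at all.
print(split_sentences("so we went there and then we came back"))
# ['so we went there and then we came back']
```

Over-splitting produces very short fragments such as "Dr." whose embeddings carry little topical signal, while under-splitting means the distance curve has too few points to threshold. Swapping in a trained sentence tokenizer is a common remedy, at the cost of an extra dependency.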
Single-topic and multi-topic mixed signals. When two adjacent sentences happen to share strong topical overlap with a third sentence further away (for example, a cross-reference back to an earlier example), the algorithm's local distance signal can miss the conceptual structure that a human reader would identify. Clustering-based and graph-based variants attempt to address this but introduce their own failure modes.[^23]
Semantic chunking is one of several embedding-aware preprocessing techniques in modern RAG. Closely related approaches include hierarchical chunking methods such as RAPTOR (which builds a tree of summaries), late chunking (which embeds the whole document and post-segments), contextual retrieval (which augments each chunk with a document-level context summary), and various reranking and query-expansion methods such as HyDE that operate on the retrieval side rather than the chunking side.[^24][^25] Vector stores such as Pinecone and Chroma provide ingestion APIs that accept arbitrary chunkers including semantic ones, leaving the chunking choice to the user.[^29]
The relationship with TextTiling, C99, and the broader linear-text-segmentation literature is one of conceptual continuity rather than direct lineage. Practitioners who arrived at semantic chunking from the RAG side largely rediscovered the older algorithms' intuitions (lexical-cohesion-driven boundary detection) with newer machinery (dense neural embeddings instead of bag-of-words), often without citing the earlier work explicitly. The Recent Trends in Linear Text Segmentation survey published in 2024 maps the connections in detail.[^10]