Prefix caching (automatic prefix caching)
Last reviewed: Jun 8, 2026
Sources: 9 citations
Review status: Source-backed
Revision: v1 · 2,148 words
Prefix caching is an inference optimization for large language model serving that stores and reuses the key-value (KV) cache computed for a shared prompt prefix, so that the prefill for that prefix is computed once and reused across many requests rather than recomputed every time. Because attention in a decoder-only transformer is causal, the KV state at a given position depends only on that token and the tokens before it; two requests that begin with exactly the same token sequence therefore produce exactly the same KV cache for that shared region, and the cached blocks can be substituted for the redundant computation without changing the output [1][2]. The technique is variously called automatic prefix caching, KV cache reuse, or context caching.
Prefix caching is widely deployed and is a standard feature of modern serving stacks. The two best-known open-source implementations are Automatic Prefix Caching (APC) in vLLM, built on the block-structured KV cache of PagedAttention, and RadixAttention in SGLang, which organizes cached KV blocks in a radix tree with least-recently-used (LRU) eviction [1][2][4][5]. The same idea underpins the commercial prompt caching and context caching offered by providers including OpenAI, Anthropic, Google, and DeepSeek, where it is often the serving-layer mechanism behind the feature that API users see [6][7][8].
In many real workloads a large fraction of the input tokens are identical from one request to the next. Common sources of shared prefixes include a long system prompt or set of instructions sent with every call, a block of few-shot examples reused across a batch of queries, a document or knowledge base passage that is the same for many questions in retrieval-augmented generation, and the accumulated history of a multi-turn conversation, where each new turn re-sends everything that came before [1][4].
Processing this shared context is not free. LLM inference has two phases: a compute-bound prefill that runs the full prompt through the model to populate the KV cache, and a memory-bound decode that emits output tokens one at a time. Prefill cost grows with prompt length and, for long shared prefixes, can dominate both latency and GPU cost, so recomputing the identical prefix on every request wastes compute and inflates time-to-first-token. Prefix caching amortizes the expensive prefill for the shared region over all requests that reuse it, computing only the new suffix afresh [1][2]. The reuse is exact because in a causal transformer the keys and values for the prefix are a deterministic function of the prefix tokens alone, so a cache hit reproduces the KV state that a full recomputation would have produced.
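The exactness argument can be checked on a toy model. The sketch below is an illustration only, not any engine's code: `embed` is a hypothetical stand-in for a real embedding table, and the "model" is a single causal attention head whose queries, keys, and values are the raw embeddings. It runs two prompts that share a four-token prefix and compares the outputs at the prefix positions.

```python
import math

def causal_attention(queries, keys, values):
    """Single-head causal attention over lists of equal-length float vectors."""
    outputs = []
    for i, q in enumerate(queries):
        # Position i attends only to positions 0..i (the causal mask).
        scores = [sum(a * b for a, b in zip(q, keys[j])) for j in range(i + 1)]
        peak = max(scores)
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        dim = len(values[0])
        outputs.append([
            sum(w * values[j][d] for j, w in enumerate(weights)) / total
            for d in range(dim)
        ])
    return outputs

def embed(token):
    """Hypothetical deterministic 'embedding': a fixed 3-vector per token id."""
    return [math.sin(token * k) for k in (1, 2, 3)]

def run(tokens):
    x = [embed(t) for t in tokens]
    # Toy model: queries, keys and values are all the raw embeddings.
    return causal_attention(x, x, x)

prefix = [11, 42, 7, 19]
out_a = run(prefix + [3, 5])
out_b = run(prefix + [99, 100, 101])

# Outputs at the shared prefix positions do not depend on the suffix.
prefix_matches = out_a[:len(prefix)] == out_b[:len(prefix)]
```

Because each position only ever reads earlier positions, the per-position results for the prefix come out the same in both runs, which is the property that lets a server substitute cached state for recomputation.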
The dominant approach splits the KV cache into fixed-size blocks of contiguous tokens and indexes those blocks by a content hash. In vLLM, every block is uniquely identified by a hash of the form hash(prefix tokens, tokens in this block), so the identifier captures both the block's own tokens and all the tokens that precede it. The engine keeps a global hash table mapping these hashes to physical KV blocks. When two requests share a prefix, their logical blocks resolve to the same hash and therefore point at the same physical block, so the memory is shared and the computation is skipped [2]. Folding the preceding tokens into the hash is what makes the match a true prefix match: a block is reusable only if every block before it is identical, which guarantees the cached KV is valid in its new context.
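A minimal sketch of such chained block hashing follows. It assumes SHA-256 and a simple string encoding of token ids; the function name `block_hashes` and the encoding are illustrative choices, not vLLM's actual implementation. Each block's hash folds in the hash of the block before it, which is one way to make the identifier depend on every preceding token.

```python
import hashlib

BLOCK_SIZE = 16  # vLLM's default block size, per the text

def block_hashes(token_ids, block_size=BLOCK_SIZE):
    """Return one chained hash per *full* block of the prompt."""
    hashes = []
    parent = b""  # hash of everything before the current block
    full = len(token_ids) - len(token_ids) % block_size
    for start in range(0, full, block_size):
        block = token_ids[start:start + block_size]
        payload = parent + b"|" + ",".join(map(str, block)).encode()
        parent = hashlib.sha256(payload).digest()
        hashes.append(parent.hex())
    return hashes

shared = list(range(32))                       # two full blocks of shared tokens
req_a = shared + list(range(100, 116))
req_b = shared + list(range(200, 216))
# Same tokens as req_a except for the very first one.
req_c = [999] + shared[1:] + list(range(100, 116))

ha, hb, hc = block_hashes(req_a), block_hashes(req_b), block_hashes(req_c)
```

Here `ha` and `hb` agree on their first two entries and differ on the third, while every entry of `hc` differs from `ha`, even for the second and third blocks whose own tokens are identical: one changed token at the front invalidates the whole chain.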
On a new request the server hashes the incoming prompt block by block, looks up each hash, and finds the longest run of leading blocks that are already resident. Those blocks are reused directly; computation begins at the first block that misses. Matching happens at block granularity, so reuse is aligned to the block size (16 tokens by default in vLLM, and 64-token chunks in DeepSeek's disk cache, for example), and a partial final block of shared tokens is not reused [2][6]. Crucially, prefix caching only shortens the prefill phase. It does not speed up decoding, so its benefit is largest when shared input is long relative to the generated output and shrinks toward zero when generation dominates or when requests share no common prefix [1].
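The lookup can be sketched as a walk over block keys that stops at the first miss. This is an illustrative sketch under stated assumptions: `chained_keys` and `plan_prefill` are hypothetical helper names, Python's built-in `hash` stands in for a real content hash, and the resident set is a plain `set` rather than an engine's block table.

```python
BLOCK = 16  # tokens per block (vLLM default, per the text)

def chained_keys(tokens, block=BLOCK):
    """Keys for full blocks only; each key folds in the previous key."""
    keys, parent = [], None
    for start in range(0, len(tokens) - len(tokens) % block, block):
        parent = hash((parent, tuple(tokens[start:start + block])))
        keys.append(parent)
    return keys

def plan_prefill(tokens, resident, block=BLOCK):
    """Return (reused_tokens, tokens_to_prefill) for one request."""
    reused_blocks = 0
    for key in chained_keys(tokens, block):
        if key not in resident:
            break  # first miss: everything from here on is recomputed
        reused_blocks += 1
    reused = reused_blocks * block
    return reused, len(tokens) - reused

resident = set()
system_prompt = list(range(1000, 1040))   # 40 tokens: 2 full blocks + 8 extra
first = system_prompt + [1, 2, 3]
resident.update(chained_keys(first))      # blocks left behind by request 1

second = system_prompt + [7, 8, 9, 10]
reused, to_prefill = plan_prefill(second, resident)   # (32, 12)
```

The two requests share 40 tokens, but only the 32 that fill whole blocks are reused; the trailing 8 shared tokens fall in a partial block and are prefilled again together with the new suffix.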
Block-level KV sharing was introduced by PagedAttention, the memory manager underneath vLLM, described in "Efficient Memory Management for Large Language Model Serving with PagedAttention" by Woosuk Kwon and colleagues at UC Berkeley and presented at SOSP 2023 [3]. Inspired by virtual memory and paging in operating systems, PagedAttention stores the KV cache in non-contiguous fixed-size blocks tracked by a per-sequence block table, which nearly eliminates fragmentation. It also lets blocks be shared across sequences and requests through reference counting and copy-on-write semantics, originally to support beam search and parallel sampling. Automatic prefix caching generalizes that same machinery to persist and reuse prefix blocks across independent requests [2][3].
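The sharing machinery described above can be illustrated with a toy block pool. This is a sketch under simplifying assumptions, not PagedAttention's code: `BlockPool`, `fork`, and `append_token` are hypothetical names, blocks hold token ids instead of KV tensors, and block-capacity checks are omitted.

```python
class BlockPool:
    """Toy physical block pool with reference counts."""
    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))
        self.refs = {}   # block id -> number of sequences using it
        self.data = {}   # block id -> contents (token ids in this toy)

    def alloc(self, tokens=()):
        block = self.free.pop()
        self.refs[block] = 1
        self.data[block] = list(tokens)
        return block

    def share(self, block):
        self.refs[block] += 1
        return block

    def release(self, block):
        self.refs[block] -= 1
        if self.refs[block] == 0:
            del self.refs[block], self.data[block]
            self.free.append(block)

def fork(pool, block_table):
    """New sequence whose block table points at the parent's physical blocks."""
    return [pool.share(b) for b in block_table]

def append_token(pool, block_table, token):
    """Append to the last block, copying it first if it is shared."""
    last = block_table[-1]
    if pool.refs[last] > 1:                 # copy-on-write
        copy = pool.alloc(pool.data[last])
        pool.release(last)
        block_table[-1] = last = copy
    pool.data[last].append(token)

pool = BlockPool(num_blocks=8)
parent = [pool.alloc([1, 2, 3, 4]), pool.alloc([5, 6])]
child = fork(pool, parent)
append_token(pool, child, 7)
```

After the append, both sequences still point at the same first block, while the child owns a private copy of the last block; the parent's copy is unchanged.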
vLLM exposes the feature through the engine flag enable_prefix_caching=True (set automatically in some recent versions) [1]. It caches the KV blocks of completed queries so a later query that shares a prefix reuses them instead of recomputing. Sharing is realized entirely through the global block hash table described above, with no separate tree to maintain. When memory runs low, vLLM does not free a block while its reference count is above zero; among blocks with reference count zero it evicts the LRU block, breaking ties in favor of the block at the end of the longest prefix. vLLM notes that for models with full attention this eviction policy is equivalent to the policy RadixAttention applies to the leaves of its prefix tree [2]. The feature was proposed in vLLM's "Automatic Prefix Caching" RFC and merged into the project in 2024 [9].
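The eviction order described above can be written as a single sort key. The sketch below is an illustration of that policy, not vLLM's code: `CachedBlock` and `pick_victim` are hypothetical names, and timestamps are plain floats.

```python
from dataclasses import dataclass

@dataclass
class CachedBlock:
    block_id: int
    ref_count: int     # active requests using this block
    last_used: float   # time of last access
    prefix_len: int    # tokens in the prefix that ends at this block

def pick_victim(blocks):
    """Return the block to evict, or None if every block is in use."""
    candidates = [b for b in blocks if b.ref_count == 0]
    if not candidates:
        return None
    # Oldest access first; among equally old blocks, the deepest prefix first.
    return min(candidates, key=lambda b: (b.last_used, -b.prefix_len))

blocks = [
    CachedBlock(0, ref_count=1, last_used=1.0, prefix_len=16),
    CachedBlock(1, ref_count=0, last_used=5.0, prefix_len=16),
    CachedBlock(2, ref_count=0, last_used=5.0, prefix_len=48),
    CachedBlock(3, ref_count=0, last_used=9.0, prefix_len=64),
]
victim = pick_victim(blocks)   # block 2
```

Block 0 is the oldest but is still referenced, so it is skipped. Blocks 1 and 2 tie on recency, and the tie goes to block 2 because it sits at the end of the longer prefix; evicting from the tail keeps the earlier blocks, which every longer match depends on.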
RadixAttention is the KV-reuse mechanism of SGLang, introduced in "SGLang: Efficient Execution of Structured Language Model Programs" by Lianmin Zheng, Liangsheng Yin, Ying Sheng and collaborators from UC Berkeley, Stanford and others (arXiv preprint first posted December 12, 2023; published at NeurIPS 2024), and announced on the LMSYS blog on January 17, 2024 [4][5]. Rather than discarding the KV cache when a request finishes, RadixAttention retains the KV cache for both prompts and generated text in a radix tree, a space-efficient trie whose edges can be labeled with multi-token sequences rather than single tokens. The tree maps token sequences to their KV tensors (stored in a paged layout on the GPU), while the tree itself lives on the CPU with low maintenance overhead. Incoming requests are matched against the tree for automatic prefix sharing, and an LRU policy recursively evicts leaf nodes when GPU memory is tight. SGLang pairs this with cache-aware scheduling, which orders requests to raise the cache hit rate. The radix-tree layout is well suited to branching workloads where many continuations share a prefix and then diverge. The SGLang paper reports up to 6.4 times higher throughput over prior systems, and the launch blog reported up to 5 times higher throughput against baselines such as vLLM, Guidance and TGI on models including Llama-7B and Mixtral-8x7B [4][5].
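A compact sketch of a radix tree with multi-token edges, prefix matching, and LRU leaf eviction is shown below. It is a teaching sketch under stated assumptions, not SGLang's implementation: the class and method names are hypothetical, nodes store only token labels (no KV tensors), a logical counter stands in for timestamps, and reference counts for in-flight requests are omitted.

```python
def common_len(a, b):
    n = 0
    while n < len(a) and n < len(b) and a[n] == b[n]:
        n += 1
    return n

class Node:
    def __init__(self, tokens=(), parent=None):
        self.tokens = tuple(tokens)   # edge label leading into this node
        self.parent = parent
        self.children = {}            # first token of edge -> child Node
        self.last_used = 0

class RadixCache:
    def __init__(self):
        self.root = Node()
        self.tick = 0                 # logical clock for LRU ordering

    def match_prefix(self, tokens):
        """Length of the longest cached prefix of `tokens`."""
        tokens, node, matched = tuple(tokens), self.root, 0
        self.tick += 1
        while matched < len(tokens):
            child = node.children.get(tokens[matched])
            if child is None:
                break
            n = common_len(child.tokens, tokens[matched:])
            matched += n
            child.last_used = self.tick
            if n < len(child.tokens):
                break                 # diverged partway along this edge
            node = child
        return matched

    def insert(self, tokens):
        tokens, node, pos = tuple(tokens), self.root, 0
        self.tick += 1
        while pos < len(tokens):
            child = node.children.get(tokens[pos])
            if child is None:
                leaf = Node(tokens[pos:], node)
                leaf.last_used = self.tick
                node.children[tokens[pos]] = leaf
                return
            n = common_len(child.tokens, tokens[pos:])
            if n < len(child.tokens):
                # Split the edge: `mid` keeps the shared part, child the rest.
                mid = Node(child.tokens[:n], node)
                node.children[tokens[pos]] = mid
                child.tokens, child.parent = child.tokens[n:], mid
                mid.children[child.tokens[0]] = child
                child = mid
            child.last_used = self.tick
            node, pos = child, pos + n

    def evict_lru_leaf(self):
        """Remove the least recently used leaf and return its edge label."""
        leaves, stack = [], [self.root]
        while stack:
            node = stack.pop()
            if node.children:
                stack.extend(node.children.values())
            elif node is not self.root:
                leaves.append(node)
        if not leaves:
            return None
        victim = min(leaves, key=lambda n: n.last_used)
        del victim.parent.children[victim.tokens[0]]
        return victim.tokens

cache = RadixCache()
cache.insert([1, 2, 3, 4, 5])               # first request
cache.insert([1, 2, 3, 9, 9])               # diverges after the shared [1, 2, 3]
hit = cache.match_prefix([1, 2, 3, 4, 7])   # 4 tokens matched
evicted = cache.evict_lru_leaf()            # (9, 9): the least recently used leaf
```

The second insert splits the original edge so that `[1, 2, 3]` becomes a shared interior node with two children. Eviction only ever removes leaves, so a shared prefix outlives the branches hanging off it; calling `evict_lru_leaf` repeatedly eventually turns the interior node into a leaf and makes it evictable too.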
Many hosted APIs apply prefix caching automatically and pass the savings on as a discount on cached input tokens. DeepSeek announced Context Caching on Disk on August 2, 2024: enabled by default with no code changes, it caches message prefixes on distributed disk storage in 64-token chunks and bills cache-hit input at roughly one tenth the price of cache-miss tokens, a discount the company put at up to 90 percent [6]. OpenAI applies prompt caching automatically to prompts of 1,024 tokens or more, matching in 128-token increments. Anthropic uses an explicit model in which the developer marks reusable spans with cache_control breakpoints. Google's Gemini API offers both explicit context caching, with a named cache object and a configurable time-to-live, and implicit caching applied automatically since May 2025 [8]. These provider features and the open-source serving optimization are two views of the same idea.
| System | Cache structure | Match granularity | Eviction | Sharing scope |
|---|---|---|---|---|
| vLLM APC | Global block hash table (on PagedAttention) | 16-token blocks (default) | Ref-count 0, then LRU, then longest prefix | Across requests on one engine |
| SGLang RadixAttention | Radix tree of KV blocks | Token sequences (paged) | LRU over leaf nodes + cache-aware scheduling | Across requests on one engine |
| DeepSeek context caching | On-disk prefix cache | 64-token chunks | Provider-managed (idle expiry) | Per account, automatic |
| OpenAI prompt caching | Automatic prefix cache | 128-token increments, 1,024-token floor | Provider-managed TTL | Per organization, automatic |
| Anthropic prompt caching | Explicit cache_control breakpoints | Developer-marked spans | Provider-managed TTL | Per organization, explicit |
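The effect of match granularity and a cached-token discount on a bill can be worked through with a small calculator. This is a sketch under stated assumptions: the function names are hypothetical, the price of 1.00 per million input tokens is a made-up figure, the 90 percent discount follows the approximate DeepSeek figure quoted above, and real providers may round or bill differently.

```python
def cached_tokens(shared_prefix, granularity, floor=0):
    """Tokens billable at the cached rate for a given shared prefix length."""
    if shared_prefix < floor:
        return 0
    return shared_prefix - shared_prefix % granularity

def input_cost(prompt_tokens, shared_prefix, price_per_mtok,
               cached_discount, granularity, floor=0):
    """Input cost of one request, in the currency of `price_per_mtok`."""
    hit = min(cached_tokens(shared_prefix, granularity, floor), prompt_tokens)
    miss = prompt_tokens - hit
    cached_price = price_per_mtok * (1 - cached_discount)
    return (miss * price_per_mtok + hit * cached_price) / 1_000_000

# 10,000-token prompt, 9,000 of which repeat an earlier request (64-token chunks).
with_cache = input_cost(10_000, 9_000, price_per_mtok=1.00,
                        cached_discount=0.9, granularity=64)
without_cache = input_cost(10_000, 0, price_per_mtok=1.00,
                           cached_discount=0.9, granularity=64)

# 128-token increments with a 1,024-token floor.
below_floor = cached_tokens(1_000, granularity=128, floor=1_024)   # 0
above_floor = cached_tokens(1_500, granularity=128, floor=1_024)   # 1408
```

In the first case 8,960 of the 9,000 shared tokens land in whole 64-token chunks, so the request costs roughly a fifth of the uncached price under these assumed numbers. In the second, a 1,000-token shared prefix earns nothing because it falls below the floor.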
Prefix caching is most valuable wherever a long context is reused. In multi-turn chat, each turn re-sends the conversation so far, and caching lets the server skip re-prefilling the entire history, paying only for the newest user message and reply [1][4]. For few-shot prompting, a fixed block of in-context examples (see in-context learning) is shared across many queries and prefilled once. In retrieval-augmented generation and document question answering, a single long passage can be cached and queried repeatedly, which the vLLM documentation highlights as a primary use case [1]. The technique is especially effective for branch-and-explore patterns such as tree-of-thought search, self-consistency sampling, and agentic AI loops, where many candidate continuations share a common prefix before diverging; RadixAttention was designed with exactly these structured, branching programs in mind [4]. Because the savings scale with prefix length and reuse frequency, production teams treat cache hit rate as a cost lever and keep shared, cacheable content at the front of the prompt.
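The multi-turn case can be made concrete with a rough token count. The sketch assumes that both the prompt and the generated reply stay resident between turns, that nothing is evicted, and that reuse is aligned to 16-token blocks; `prefill_tokens` and the message lengths are hypothetical.

```python
def prefill_tokens(system_len, turns, block=16):
    """Return (without_cache, with_cache) total prefill tokens."""
    history = system_len
    cached = 0            # tokens whose KV is resident from earlier turns
    no_cache = with_cache = 0
    for user_len, reply_len in turns:
        prompt_len = history + user_len
        no_cache += prompt_len
        reusable = min(cached, prompt_len) // block * block
        with_cache += prompt_len - reusable
        # After the turn, the KV of both prompt and reply is retained.
        cached = history = prompt_len + reply_len
    return no_cache, with_cache

# 800-token system prompt, then four turns of 50 user + 150 reply tokens.
turns = [(50, 150)] * 4
no_cache, with_cache = prefill_tokens(800, turns)   # (4600, 1016)
```

Without caching, the history is prefilled again on every turn and the total grows quadratically with conversation length. With caching, each turn after the first prefills little more than the new user message plus whatever falls in a partial block.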
The reuse is exact, not fuzzy. Only a true, identical prefix matches: changing a single earlier token, or even reordering content so that the shared text no longer sits at the very front of the prompt, breaks the chain and forces recomputation from the first differing block. Because matching is block-aligned, a shared region shorter than one block, or one that ends partway through a block, yields no hit for that fragment [2][6]. And because it accelerates only prefill, workloads dominated by long generations or by requests with little shared context see little gain [1].
Cached blocks consume KV memory that could otherwise serve active requests, so every implementation needs an eviction policy. vLLM and SGLang both fall back to LRU eviction of blocks (or leaf nodes) that are no longer referenced, which means a prefix is only cheap to reuse while it remains resident; a long gap between requests can evict it and reintroduce the full prefill cost [2][5]. Hit rates therefore depend heavily on traffic patterns and on scheduling that groups requests sharing a prefix.
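The dependence on request order can be seen in a small simulation. It is a deliberately coarse sketch: each prefix is treated as a single cache entry rather than a chain of blocks, `hit_rate` is a hypothetical helper, and the cache holds only two prefixes.

```python
from collections import OrderedDict

def hit_rate(requests, capacity):
    """Fraction of requests whose prefix is resident in an LRU cache."""
    cache, hits = OrderedDict(), 0
    for prefix in requests:
        if prefix in cache:
            hits += 1
            cache.move_to_end(prefix)          # mark as most recently used
        else:
            cache[prefix] = True
            if len(cache) > capacity:
                cache.popitem(last=False)      # evict least recently used
    return hits / len(requests)

interleaved = ["A", "B", "C"] * 4   # three prefixes arriving round-robin
grouped = sorted(interleaved)       # same requests, grouped by prefix

rate_interleaved = hit_rate(interleaved, capacity=2)   # 0.0
rate_grouped = hit_rate(grouped, capacity=2)           # 0.75
```

With room for two prefixes and three in rotation, round-robin arrival evicts each prefix just before it is needed again, so every request misses. Serving the same twelve requests grouped by prefix misses only on the first request of each group.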
The most discussed risk is isolation between users. When a cache is shared globally across users, the latency of a request reveals whether its prefix was already cached, and an attacker can exploit this. The paper "Auditing Prompt Caching in Language Model APIs" (Gu, Li and colleagues, 2025) used time-to-first-token measurements to detect prompt caching in 8 of 17 commercial API providers audited in September and October 2024, and found global, cross-user cache sharing in 7 of them, including OpenAI. Such sharing can leak information about other users' prompts through the timing side channel [7]. Subsequent work has proposed mitigations such as restricting cache sharing to a single user or organization, and selective KV-cache sharing schemes that preserve most of the speedup while closing the cross-tenant channel [7]. The practical takeaway is that prefix caching is safe to share freely within a trust boundary (one tenant, one application), but cross-user sharing trades a privacy risk for additional hit rate and should be scoped deliberately.
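One way to confine sharing to a trust boundary is to mix a scope identifier into every block key, so that identical prompts from different tenants never resolve to the same cached block. The sketch below illustrates that idea only; `block_key`, `prompt_keys`, and the `scope` strings are hypothetical, and it does not describe how any particular provider or engine implements isolation.

```python
import hashlib

def block_key(parent_key, block_tokens, scope=""):
    """Chained block key; `scope` partitions the cache by trust boundary."""
    payload = f"{scope}|{parent_key}|{','.join(map(str, block_tokens))}"
    return hashlib.sha256(payload.encode()).hexdigest()

def prompt_keys(tokens, scope="", block=16):
    """Keys for every full block of a prompt within one scope."""
    keys, parent = [], ""
    for start in range(0, len(tokens) - len(tokens) % block, block):
        parent = block_key(parent, tokens[start:start + block], scope)
        keys.append(parent)
    return keys

prompt = list(range(64))                        # the same prompt from everyone

global_a = prompt_keys(prompt)                  # unscoped: shared by all users
global_b = prompt_keys(prompt)
tenant_a = prompt_keys(prompt, scope="org-a")
tenant_a_again = prompt_keys(prompt, scope="org-a")
tenant_b = prompt_keys(prompt, scope="org-b")
```

Unscoped keys collide across callers, which is what makes cross-user hits (and the timing signal) possible. Scoped keys still match within `org-a`, so the tenant keeps its own hit rate, but share nothing with `org-b`.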