Prefix caching (automatic prefix caching)

Updated Jun 8, 2026

Overview

Prefix caching is an inference optimization for large language model serving that stores and reuses the key-value (KV) cache computed for a shared prompt prefix, so that the prefill for that prefix is computed once and reused across many requests rather than recomputed every time. Because attention in a decoder-only transformer is causal, the KV state at a given position depends only on that token and the tokens before it; two requests that begin with exactly the same token sequence therefore produce exactly the same KV cache for that shared region, and the cached blocks can be substituted for the redundant computation without changing the output ^[1]^[2]. The technique is variously called automatic prefix caching, KV cache reuse, or context caching.

Prefix caching is widely deployed and is a standard feature of modern serving stacks. The two best-known open-source implementations are Automatic Prefix Caching (APC) in vLLM, built on the block-structured KV cache of PagedAttention, and RadixAttention in SGLang, which organizes cached KV blocks in a radix tree with least-recently-used (LRU) eviction ^[1]^[2]^[4]^[5]. The same idea underpins commercial prompt caching and context caching from providers including OpenAI, Anthropic, Google, and DeepSeek, and it is often the serving-layer mechanism behind what API users experience as prompt caching ^[6]^[7]^[8].

Motivation

In many real workloads a large fraction of the input tokens are identical from one request to the next. Common sources of shared prefixes include a long system prompt or set of instructions sent with every call, a block of few-shot examples reused across a batch of queries, a document or knowledge base passage that is the same for many questions in retrieval-augmented generation, and the accumulated history of a multi-turn conversation, where each new turn re-sends everything that came before ^[1]^[4].

Processing this shared context is not free. LLM inference has two phases: a compute-bound prefill that runs the full prompt through the model to populate the KV cache, and a memory-bound decode that emits output tokens one at a time. Prefill cost grows with prompt length and, for long shared prefixes, can dominate both latency and GPU cost, so recomputing the identical prefix on every request wastes compute and inflates time-to-first-token. Prefix caching amortizes the expensive prefill for the shared region over all requests that reuse it, computing only the new suffix afresh ^[1]^[2]. The reuse is exact because in a causal transformer the keys and values for the prefix are a deterministic function of the prefix tokens alone, so a cache hit reproduces the KV state that a full recomputation would have produced.
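The amortization argument can be written as a rough cost model. This is a back-of-the-envelope sketch, not a measurement: it assumes prefill cost is proportional to the number of tokens processed (ignoring the attention term that grows faster than linearly with length) and that the cached prefix stays resident for all requests. The symbols are introduced here for illustration: N requests share a prefix of P tokens, and request i adds s_i new suffix tokens.

```latex
\begin{align*}
C_{\text{no cache}} &= \sum_{i=1}^{N} (P + s_i) = NP + \sum_{i=1}^{N} s_i \\
C_{\text{cache}}    &= P + \sum_{i=1}^{N} s_i \\
\text{fraction saved} &= \frac{C_{\text{no cache}} - C_{\text{cache}}}{C_{\text{no cache}}}
                       = \frac{(N-1)\,P}{NP + \sum_{i=1}^{N} s_i}
\end{align*}
```

For example, with N = 50 requests, a shared prefix of P = 2000 tokens, and s_i = 100 new tokens per request, the model gives 98000 / 105000, or about 93 percent of prefill tokens avoided. With N = 1 the saving is zero, which matches the intuition that caching helps only when a prefix is actually reused.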

How prefix caching works

The dominant approach splits the KV cache into fixed-size blocks of contiguous tokens and indexes those blocks by a content hash. In vLLM, every block is uniquely identified by a hash of the form hash(prefix tokens, tokens in this block), so the identifier captures both the block's own tokens and all the tokens that precede it. The engine keeps a global hash table mapping these hashes to physical KV blocks. When two requests share a prefix, their logical blocks resolve to the same hash and therefore point at the same physical block, so the memory is shared and the computation is skipped ^[2]. Folding the preceding tokens into the hash is what makes the match a true prefix match: a block is reusable only if every block before it is identical, which guarantees the cached KV is valid in its new context.
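The chained block identifier described above can be sketched in a few lines of Python. This is a minimal illustration, not vLLM's actual hashing code: the function name `block_hashes`, the use of SHA-256, and the string encoding of tokens are assumptions made for the example. Folding the parent block's hash into each block's hash has the same effect as hashing all preceding tokens.

```python
import hashlib
from typing import List

BLOCK_SIZE = 16  # the text gives 16 tokens as vLLM's default block size


def block_hashes(tokens: List[int], block_size: int = BLOCK_SIZE) -> List[str]:
    """Return one chained hash per full block of `tokens` (hypothetical helper).

    Each hash covers the block's own tokens plus the hash of the block
    before it, so it identifies the whole prefix ending at that block.
    A trailing partial block gets no hash and is therefore never shared.
    """
    hashes: List[str] = []
    parent = ""  # the first block has no predecessor
    full = len(tokens) - len(tokens) % block_size
    for start in range(0, full, block_size):
        block = tokens[start:start + block_size]
        payload = parent + "|" + ",".join(str(t) for t in block)
        parent = hashlib.sha256(payload.encode("utf-8")).hexdigest()
        hashes.append(parent)
    return hashes


# Two prompts that agree on their first 32 tokens share both full-block hashes.
a = block_hashes(list(range(40)))
b = block_hashes(list(range(32)) + [999] * 8)
print(a == b)
```

Changing the very first token changes every hash in the chain, even for later blocks whose own tokens are untouched, which is the property that makes a hash hit equivalent to a true prefix match.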

On a new request the server hashes the incoming prompt block by block, looks up each hash, and finds the longest run of leading blocks that are already resident. Those blocks are reused directly; computation begins at the first block that misses. Matching happens at block granularity, so reuse is aligned to the block size (16 tokens by default in vLLM, and 64-token chunks in DeepSeek's disk cache, for example), and a partial final block of shared tokens is not reused ^[2]^[6]. Crucially, prefix caching only shortens the prefill phase. It does not speed up decoding, so its benefit is largest when shared input is long relative to the generated output and shrinks toward zero when generation dominates or when requests share no common prefix ^[1].
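The lookup step can be illustrated with a toy cache. This is a sketch under simplifying assumptions: the class `PrefixCache` and its methods are hypothetical, the "hash" of a block is simply the tuple of all tokens up to the end of that block, and the stored value is a made-up physical block id rather than real KV tensors.

```python
from typing import Dict, List, Tuple

BLOCK_SIZE = 16


class PrefixCache:
    """Toy block-granular prefix cache (illustrative only)."""

    def __init__(self, block_size: int = BLOCK_SIZE) -> None:
        self.block_size = block_size
        # key: all tokens up to the end of a block -> physical block id
        self.table: Dict[Tuple[int, ...], int] = {}
        self._next_block_id = 0

    def _keys(self, tokens: List[int]) -> List[Tuple[int, ...]]:
        # One key per full block; a trailing partial block is ignored.
        full = len(tokens) - len(tokens) % self.block_size
        return [tuple(tokens[:end])
                for end in range(self.block_size, full + 1, self.block_size)]

    def match(self, tokens: List[int]) -> int:
        """Return how many leading tokens are covered by resident blocks."""
        hit = 0
        for key in self._keys(tokens):
            if key not in self.table:
                break  # computation would start at this block
            hit += self.block_size
        return hit

    def insert(self, tokens: List[int]) -> None:
        """Register every full block of a processed prompt."""
        for key in self._keys(tokens):
            if key not in self.table:
                self.table[key] = self._next_block_id
                self._next_block_id += 1


cache = PrefixCache()
system_prompt = list(range(100))               # 100 shared tokens
cache.insert(system_prompt + [1000, 1001])     # first request populates the cache
print(cache.match(system_prompt + [2000] * 30))
```

The second request shares 100 tokens with the first, but only 96 of them (six full 16-token blocks) are reported as reusable, which shows the block-alignment effect described above.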

Implementations

Foundations: PagedAttention

Block-level KV sharing was introduced by PagedAttention, the memory manager underneath vLLM, described in "Efficient Memory Management for Large Language Model Serving with PagedAttention" by Woosuk Kwon and colleagues at UC Berkeley and presented at SOSP 2023 ^[3]. Inspired by virtual memory and paging in operating systems, PagedAttention stores the KV cache in non-contiguous fixed-size blocks tracked by a per-sequence block table, which nearly eliminates fragmentation. It also lets blocks be shared across sequences and requests through reference counting and copy-on-write semantics, originally to support beam search and parallel sampling. Automatic prefix caching generalizes that same machinery to persist and reuse prefix blocks across independent requests ^[2]^[3].
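Reference counting with copy-on-write can be sketched as follows. This is a simplified model, not PagedAttention's implementation: `BlockPool`, `fork`, and the use of plain token lists in place of KV tensors are assumptions chosen to keep the example small.

```python
from typing import Dict, List


class BlockPool:
    """Toy pool of blocks with reference counts (illustrative only)."""

    def __init__(self) -> None:
        self.ref_count: Dict[int, int] = {}
        self.data: Dict[int, List[int]] = {}
        self._next_id = 0

    def allocate(self, tokens: List[int]) -> int:
        block_id = self._next_id
        self._next_id += 1
        self.ref_count[block_id] = 1
        self.data[block_id] = list(tokens)  # private copy of the contents
        return block_id

    def share(self, block_id: int) -> int:
        """Let another sequence point at the same physical block."""
        self.ref_count[block_id] += 1
        return block_id

    def write(self, block_id: int, token: int) -> int:
        """Append a token, copying the block first if it is shared.

        Returns the id of the block that was actually written.
        """
        if self.ref_count[block_id] > 1:
            self.ref_count[block_id] -= 1
            block_id = self.allocate(self.data[block_id])  # copy-on-write
        self.data[block_id].append(token)
        return block_id


def fork(pool: BlockPool, block_table: List[int]) -> List[int]:
    """Create a child sequence whose block table shares every parent block."""
    return [pool.share(b) for b in block_table]


pool = BlockPool()
parent = [pool.allocate([1, 2, 3, 4]), pool.allocate([5, 6])]
child = fork(pool, parent)
child[-1] = pool.write(child[-1], 7)  # the child diverges in the last block
print(parent, child)
```

Only the block that is written gets copied; the untouched first block stays shared between parent and child, which is what makes forking a sequence cheap.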

vLLM Automatic Prefix Caching

vLLM exposes the feature through the engine flag enable_prefix_caching=True (enabled by default in some recent versions) ^[1]. It caches the KV blocks of completed queries so that a later query sharing a prefix reuses them instead of recomputing. Sharing is realized entirely through the global block hash table described above, with no separate tree to maintain. When memory runs low, vLLM never frees a block whose reference count is above zero; among blocks with a reference count of zero it evicts the least recently used one, and if several are tied it evicts the block at the end of the longest prefix. vLLM notes that for models with full attention this eviction policy is equivalent to the policy RadixAttention applies to the leaves of its prefix tree ^[2]. The feature was proposed in vLLM's "Automatic Prefix Caching" RFC and merged into the project in 2024 ^[9].
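The eviction order described above can be expressed as a small selection function. This is a sketch of the stated policy, not vLLM's code: the `CachedBlock` record, its field names, and the logical timestamps are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class CachedBlock:
    block_hash: str
    ref_count: int   # number of running requests using the block
    last_used: int   # logical timestamp of the most recent access
    prefix_len: int  # tokens in the prefix that ends at this block


def pick_victim(blocks: Dict[str, CachedBlock]) -> Optional[str]:
    """Choose the block to evict, or None if every block is in use.

    Order of preference: only blocks with ref_count == 0 are eligible,
    then the least recently used, then the one ending the longest prefix.
    """
    candidates = [b for b in blocks.values() if b.ref_count == 0]
    if not candidates:
        return None
    victim = min(candidates, key=lambda b: (b.last_used, -b.prefix_len))
    return victim.block_hash


blocks = {
    "a": CachedBlock("a", ref_count=1, last_used=1, prefix_len=16),
    "b": CachedBlock("b", ref_count=0, last_used=5, prefix_len=16),
    "c": CachedBlock("c", ref_count=0, last_used=5, prefix_len=32),
    "d": CachedBlock("d", ref_count=0, last_used=9, prefix_len=48),
}
print(pick_victim(blocks))
```

Block "a" is the oldest but is still referenced, so it is skipped. Blocks "b" and "c" tie on recency, and "c" is chosen because it ends the longer prefix; evicting from the tail keeps the earlier blocks of a prefix, which more requests can reuse.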

SGLang RadixAttention

RadixAttention is the KV-reuse mechanism of SGLang. It was introduced in "SGLang: Efficient Execution of Structured Language Model Programs" by Lianmin Zheng, Liangsheng Yin, Ying Sheng and collaborators from UC Berkeley, Stanford and other institutions (arXiv preprint first posted December 12, 2023; published at NeurIPS 2024), and announced on the LMSYS blog on January 17, 2024 ^[4]^[5]. Rather than discarding the KV cache when a request finishes, RadixAttention retains the KV cache for both prompts and generated text in a radix tree, a space-efficient trie whose edges can be labeled with multi-token sequences rather than single tokens. The tree maps token sequences to their KV tensors, which are stored in a paged layout on the GPU, while the tree structure itself is kept on the CPU, where it is cheap to maintain. Incoming requests are matched against the tree for automatic prefix sharing, and an LRU policy recursively evicts leaf nodes when GPU memory is tight. SGLang pairs this with cache-aware scheduling, which orders requests to raise the cache hit rate. The radix-tree layout is well suited to branching workloads in which many continuations share a prefix and then diverge. The SGLang paper reports up to 6.4 times higher throughput than prior systems, and the launch blog reported up to 5 times higher throughput against baselines such as vLLM, Guidance and TGI on models including Llama-7B and Mixtral-8x7B ^[4]^[5].
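A radix tree with prefix matching and LRU leaf eviction can be sketched as below. This is a toy model of the data structure only, not SGLang's implementation: the `RadixCache` class, its method names, and the logical clock are assumptions, and nodes hold token tuples where a real system would reference KV tensors.

```python
import itertools
from typing import Dict, Optional, Sequence, Tuple


class Node:
    def __init__(self, edge: Tuple[int, ...] = (), parent: Optional["Node"] = None):
        self.edge = edge                        # tokens on the edge into this node
        self.parent = parent
        self.children: Dict[int, "Node"] = {}   # keyed by first token of child edge
        self.last_used = 0


class RadixCache:
    """Toy radix tree over token sequences (illustrative only)."""

    def __init__(self) -> None:
        self.root = Node()
        self._clock = itertools.count(1)        # logical time for LRU

    @staticmethod
    def _common(a: Sequence[int], b: Sequence[int]) -> int:
        n = 0
        while n < len(a) and n < len(b) and a[n] == b[n]:
            n += 1
        return n

    def match_prefix(self, tokens: Sequence[int]) -> int:
        """Return the number of leading tokens already present in the tree."""
        now = next(self._clock)
        node, matched = self.root, 0
        while matched < len(tokens):
            child = node.children.get(tokens[matched])
            if child is None:
                break
            n = self._common(child.edge, tokens[matched:])
            matched += n
            child.last_used = now
            if n < len(child.edge):
                break                           # diverged partway along the edge
            node = child
        return matched

    def insert(self, tokens: Sequence[int]) -> None:
        now = next(self._clock)
        tokens = tuple(tokens)
        node, i = self.root, 0
        while i < len(tokens):
            child = node.children.get(tokens[i])
            if child is None:
                leaf = Node(tokens[i:], node)
                leaf.last_used = now
                node.children[tokens[i]] = leaf
                return
            n = self._common(child.edge, tokens[i:])
            if n < len(child.edge):
                # Split the edge so the shared part becomes its own node.
                mid = Node(child.edge[:n], node)
                node.children[tokens[i]] = mid
                child.edge = child.edge[n:]
                child.parent = mid
                mid.children[child.edge[0]] = child
                child = mid
            child.last_used = now
            node, i = child, i + n

    def evict_lru_leaf(self) -> int:
        """Remove the least recently used leaf; return the tokens freed."""
        leaves, stack = [], [self.root]
        while stack:
            node = stack.pop()
            if node is not self.root and not node.children:
                leaves.append(node)
            stack.extend(node.children.values())
        if not leaves:
            return 0
        victim = min(leaves, key=lambda n: n.last_used)
        del victim.parent.children[victim.edge[0]]
        return len(victim.edge)


tree = RadixCache()
tree.insert([1, 2, 3, 4, 5])
tree.insert([1, 2, 3, 9, 9])         # splits the first edge after [1, 2, 3]
print(tree.match_prefix([1, 2, 3, 4, 7]))
```

Calling `evict_lru_leaf` repeatedly removes leaves first and then their parents once those become leaves, which mirrors the recursive leaf eviction described above while keeping shared interior prefixes alive as long as possible.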

Commercial and API deployments

Many hosted APIs apply prefix caching automatically and pass the savings on as a discount on cached input tokens. DeepSeek announced Context Caching on Disk on August 2, 2024: enabled by default with no code changes, it caches message prefixes on distributed disk storage in 64-token chunks and bills cache-hit input at roughly one tenth the price of cache-miss tokens, a discount the company put at up to 90 percent ^[6]. OpenAI applies prompt caching automatically to prompts of 1,024 tokens or more, matching in 128-token increments. Anthropic uses an explicit model in which the developer marks reusable spans with cache_control breakpoints. Google's Gemini API offers both explicit context caching, with a named cache object and a configurable time-to-live, and implicit caching applied automatically since May 2025 ^[8]. These provider features and the open-source serving optimization are two views of the same idea.
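The effect of match granularity and cache-hit discounts on a bill can be estimated with simple arithmetic. The sketch below is an approximation under stated assumptions, not any provider's billing logic: the function names are hypothetical, the minimum is applied to the shared prefix length as a simplification, and the price is an arbitrary unit per token.

```python
def cached_tokens(shared_prefix_len: int, increment: int, minimum: int = 0) -> int:
    """Estimate how many shared tokens can hit the cache.

    Matching is rounded down to a multiple of `increment`; prefixes
    shorter than `minimum` are assumed to produce no hit at all.
    """
    if shared_prefix_len < minimum:
        return 0
    return (shared_prefix_len // increment) * increment


def input_cost(prompt_len: int, hit_tokens: int,
               price_per_token: float, hit_price_ratio: float) -> float:
    """Input cost when cache-hit tokens are billed at a fraction of list price."""
    miss_tokens = prompt_len - hit_tokens
    return (miss_tokens * price_per_token
            + hit_tokens * price_per_token * hit_price_ratio)


# 64-token chunks with hits billed at one tenth of the miss price.
hits = cached_tokens(1000, increment=64)
print(hits, input_cost(1100, hits, price_per_token=1.0, hit_price_ratio=0.1))

# 128-token increments with a 1,024-token floor.
print(cached_tokens(1000, increment=128, minimum=1024))
print(cached_tokens(1500, increment=128, minimum=1024))
```

Under these assumptions a 1,000-token shared prefix yields 960 cached tokens with 64-token chunks but none with a 1,024-token floor, so the same prompt can see very different savings depending on the provider's granularity.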

System	Cache structure	Match granularity	Eviction	Sharing scope
vLLM APC	Global block hash table (on PagedAttention)	16-token blocks (default)	Ref-count 0, then LRU, then longest prefix	Across requests on one engine
SGLang RadixAttention	Radix tree of KV blocks	Token sequences (paged)	LRU over leaf nodes + cache-aware scheduling	Across requests on one engine
DeepSeek context caching	On-disk prefix cache	64-token chunks	Provider-managed (idle expiry)	Per account, automatic
OpenAI prompt caching	Automatic prefix cache	128-token increments, 1,024-token floor	Provider-managed TTL	Per organization, automatic
Anthropic prompt caching	Explicit cache_control breakpoints	Developer-marked spans	Provider-managed TTL	Per organization, explicit

Use cases

Prefix caching is most valuable wherever a long context is reused. In multi-turn chat, each turn re-sends the conversation so far, and caching lets the server skip re-prefilling the entire history, paying only for the newest user message and reply ^[1]^[4]. For few-shot prompting, a fixed block of in-context examples (see in-context learning) is shared across many queries and prefilled once. In retrieval-augmented generation and document question answering, a single long passage can be cached and queried repeatedly, which the vLLM documentation highlights as a primary use case ^[1]. The technique is especially effective for branch-and-explore patterns such as tree-of-thought search, self-consistency sampling, and agentic AI loops, where many candidate continuations share a common prefix before diverging; RadixAttention was designed with exactly these structured, branching programs in mind ^[4]. Because the savings scale with prefix length and reuse frequency, production teams treat cache hit rate as a cost lever and keep shared, cacheable content at the front of the prompt.
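The advice to keep cacheable content at the front of the prompt can be checked with a small experiment. This is an illustrative sketch only: `build_prompt` and `common_prefix_len` are hypothetical helpers, and whitespace-split words stand in for real tokens.

```python
from typing import List


def common_prefix_len(a: List[str], b: List[str]) -> int:
    """Number of leading tokens two prompts have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def build_prompt(system: List[str], document: List[str],
                 question: List[str], static_first: bool = True) -> List[str]:
    """Assemble a prompt with the reusable parts first or last."""
    if static_first:
        return system + document + question
    return question + system + document


system = "You are a helpful assistant .".split()
document = ["doc_token"] * 50               # stand-in for a long shared passage
q1 = "What is the refund policy ?".split()
q2 = "Who approves refunds ?".split()

good = common_prefix_len(build_prompt(system, document, q1),
                         build_prompt(system, document, q2))
bad = common_prefix_len(build_prompt(system, document, q1, static_first=False),
                        build_prompt(system, document, q2, static_first=False))
print(good, bad)
```

With the static parts first, the two prompts share the whole system prompt and document; with the question first, the prompts diverge at the first token and nothing after it can be reused, even though most of the content is identical.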

Limitations and considerations

The reuse is exact, not fuzzy. Only a true, identical prefix matches: changing a single earlier token, or even reordering content so that the shared text no longer sits at the very front of the prompt, breaks the chain and forces recomputation from the first differing block. Because matching is block-aligned, a shared region shorter than one block, or one that ends partway through a block, yields no hit for that fragment ^[2]^[6]. And because it accelerates only prefill, workloads dominated by long generations or by requests with little shared context see little gain ^[1].

Cached blocks consume KV memory that could otherwise serve active requests, so every implementation needs an eviction policy. vLLM and SGLang both fall back to LRU eviction of blocks (or leaf nodes) that are no longer referenced, which means a prefix is only cheap to reuse while it remains resident; a long gap between requests can evict it and reintroduce the full prefill cost ^[2]^[5]. Hit rates therefore depend heavily on traffic patterns and on scheduling that groups requests sharing a prefix.

The most discussed concern beyond performance is isolation between users. When a cache is shared globally across users, the latency of a request reveals whether its prefix was already cached, and an attacker can exploit this. The paper "Auditing Prompt Caching in Language Model APIs" (Gu, Li and colleagues, 2025) used time-to-first-token measurements to detect prompt caching in 8 of 17 commercial API providers audited in September and October 2024. It found global, cross-user cache sharing in 7 of them, including OpenAI; such sharing can leak information about other users' prompts through this timing side channel ^[7]. Proposed mitigations include restricting cache sharing to a single user or organization, and selective KV-cache sharing schemes that preserve most of the speedup while closing the cross-tenant channel ^[7]. The practical takeaway is that prefix caching is safe to share freely within a trust boundary (one tenant, one application), but cross-user sharing trades a privacy risk for additional hit rate and should be scoped deliberately.
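One way to confine sharing to a trust boundary is to fold a tenant identifier into the block hash, so that identical prompts from different tenants never resolve to the same cached block. The sketch below illustrates that idea only; it is not taken from any cited system, and the function name `scoped_block_hashes`, the `tenant_id` parameter, and the hashing scheme are assumptions.

```python
import hashlib
from typing import List


def scoped_block_hashes(tokens: List[int], tenant_id: str,
                        block_size: int = 16) -> List[str]:
    """Chained block hashes seeded with a tenant id (hypothetical scheme).

    Because the seed enters the first hash and every later hash chains
    from it, two tenants never produce a matching hash for the same text.
    """
    hashes: List[str] = []
    parent = "tenant:" + tenant_id
    full = len(tokens) - len(tokens) % block_size
    for start in range(0, full, block_size):
        block = tokens[start:start + block_size]
        payload = parent + "|" + ",".join(str(t) for t in block)
        parent = hashlib.sha256(payload.encode("utf-8")).hexdigest()
        hashes.append(parent)
    return hashes


prompt = list(range(64))
same_tenant = scoped_block_hashes(prompt, "acme") == scoped_block_hashes(prompt, "acme")
cross_tenant = set(scoped_block_hashes(prompt, "acme")) & set(scoped_block_hashes(prompt, "globex"))
print(same_tenant, len(cross_tenant))
```

Requests within one tenant still hit the cache, while a request from another tenant always misses and so learns nothing from its latency. The cost is a lower overall hit rate, since a prefix common to many tenants is computed and stored once per tenant.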

References

1. "Automatic Prefix Caching." vLLM Documentation, 2024. https://docs.vllm.ai/en/latest/features/automatic_prefix_caching.html
2. "Automatic Prefix Caching (design)." vLLM Documentation, 2024. https://docs.vllm.ai/en/v0.9.2/design/automatic_prefix_caching.html
3. Kwon, W., Li, Z., Zhuang, S., Sheng, Y., Zheng, L., Yu, C. H., Gonzalez, J. E., Zhang, H., and Stoica, I. "Efficient Memory Management for Large Language Model Serving with PagedAttention." SOSP 2023; arXiv:2309.06180. https://arxiv.org/abs/2309.06180
4. Zheng, L., Yin, L., Xie, Z., Sun, C., Huang, J., Yu, C. H., Cao, S., Kozyrakis, C., Stoica, I., Gonzalez, J. E., Barrett, C., and Sheng, Y. "SGLang: Efficient Execution of Structured Language Model Programs." NeurIPS 2024; arXiv:2312.07104. https://arxiv.org/abs/2312.07104
5. Zheng, L., et al. "Fast and Expressive LLM Inference with RadixAttention and SGLang." LMSYS Org Blog, January 17, 2024. https://www.lmsys.org/blog/2024-01-17-sglang/
6. "DeepSeek API introduces Context Caching on Disk, cutting prices by an order of magnitude." DeepSeek API Docs, August 2, 2024. https://api-docs.deepseek.com/news/news0802
7. Gu, C., Li, X. L., et al. "Auditing Prompt Caching in Language Model APIs." arXiv:2502.07776, February 2025. https://arxiv.org/abs/2502.07776
8. "Prompt Caching with OpenAI, Anthropic, and Google Models." PromptHub Blog, 2024. https://www.prompthub.us/blog/prompt-caching-with-openai-anthropic-and-google-models
9. "[RFC] Automatic Prefix Caching." vLLM GitHub Issue #2614, 2024. https://github.com/vllm-project/vllm/issues/2614
