Prompt caching is a technique used in large language model (LLM) inference that reuses previously computed key-value (KV) representations for repeated prompt prefixes, allowing providers and users to reduce both cost and latency for API calls that share common prompt content. When a prompt begins with the same tokens as a previous request (such as a system prompt, few-shot examples, or a shared document), the model can skip recomputing the attention keys and values for those cached tokens and begin computation only from the point where the prompts diverge. Major LLM API providers, including Anthropic, OpenAI, and Google, have all introduced prompt caching features, with discounts ranging from 50% to 90% on cached input tokens.
To understand prompt caching, it is helpful to first understand the KV cache, which is the underlying mechanism that makes it possible.
In a transformer-based language model, each layer computes attention over the input sequence by projecting tokens into query (Q), key (K), and value (V) representations. During autoregressive generation, the model produces tokens one at a time. Without caching, generating each new token would require recomputing the keys and values for every previous token in the sequence, resulting in O(n^2) total computation across a sequence of length n.
The KV cache avoids this redundancy by storing the key and value tensors computed during previous forward passes. When generating the next token, the model only needs to compute the query, key, and value for the new token, then attend over the cached keys and values from all previous tokens. This reduces per-token computation from O(n) to O(1) (excluding the attention operation itself) and is essential for making autoregressive generation practical at scale [1].
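The mechanics can be sketched in a few lines of NumPy. This is a toy single-head attention decode loop, not a real model: the projection matrices, dimensions, and random inputs are all illustrative. The point is that each step projects only the newest token, then attends over the accumulated cache.

```python
# Toy single-head attention decode loop illustrating the KV cache.
import numpy as np

rng = np.random.default_rng(0)
d = 8                                     # head dimension (illustrative)
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

K_cache, V_cache = [], []                 # grows by one row per generated token

def decode_step(x):
    """x: embedding of the newest token, shape (d,)."""
    q, k, v = x @ Wq, x @ Wk, x @ Wv      # only the NEW token is projected: O(1)
    K_cache.append(k)
    V_cache.append(v)
    K, V = np.stack(K_cache), np.stack(V_cache)
    attn = softmax(q @ K.T / np.sqrt(d))  # attend over all cached keys
    return attn @ V                       # context vector, shape (d,)

for _ in range(5):                        # five decode steps
    out = decode_step(rng.standard_normal(d))
```

Without the two append-and-stack lines, each step would have to re-project every earlier token through Wk and Wv, which is exactly the redundant work the cache eliminates.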
The KV cache is a per-request resource: each inference request builds up its own KV cache as it processes the prompt and generates output tokens. Prompt caching extends this concept across requests, sharing cached KV representations between different API calls that happen to share the same prompt prefix.
Prompt caching operates on a simple principle: if two requests share the same initial tokens (the same prefix), the KV representations for those shared tokens are identical and do not need to be recomputed.
The workflow proceeds as follows:

1. The first request containing a given prefix is processed normally, and the KV representations computed for that prefix are stored.
2. When a subsequent request arrives, its tokens are compared against cached prefixes to find the longest exact match.
3. The matching KV representations are reused, and the model computes attention only for the tokens after the point of divergence.

This means that the computational cost of processing the prompt is reduced in proportion to the fraction of tokens that are served from the cache. For applications where a large system prompt or document is repeated across many requests, the savings can be substantial.
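This cross-request reuse can be sketched as a toy cache keyed by token prefixes. The string returned here is a stand-in for real KV tensor state, and the dictionary is an illustration of the principle, not any provider's implementation.

```python
# Toy cross-request prefix cache: KV work for a shared prefix is done once,
# and later requests resume from the divergence point.
prefix_cache: dict = {}                  # token prefix -> cached "KV state"

def process(tokens):
    """Return (state, number of leading tokens served from the cache)."""
    state, hit = "", 0
    for n in range(1, len(tokens) + 1):
        key = tuple(tokens[:n])
        if key in prefix_cache:          # this exact prefix was computed before
            state, hit = prefix_cache[key], n
        else:                            # compute "KV" only for the new token
            state = state + f"kv({tokens[n - 1]})"
            prefix_cache[key] = state
    return state, hit

_, hit1 = process([1, 2, 3, 4])          # cold cache: nothing reused
_, hit2 = process([1, 2, 3, 9])          # warm: the 3-token prefix is reused
```

After the first request, the second request recomputes only the token where the prompts diverge, mirroring the exact-prefix-match rule described below.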
Prompt caching works only for exact prefix matches. The cached tokens must be identical, in the same order, at the beginning of the prompt. If any token in the cached prefix changes, the cache is invalidated from that point forward. This means that prompt caching is most effective when the shared content appears at the start of the prompt, with variable content appended at the end [2].
Most providers enforce a minimum prefix length for caching to be effective, and caches have a limited time-to-live (TTL) before they expire.
| Provider | Minimum Prefix Length | Cache Duration | Cache Type |
|---|---|---|---|
| Anthropic | 1,024 tokens (Claude Haiku 3.5: 2,048) | 5 minutes or 1 hour (configurable) | Explicit (manual) and automatic |
| OpenAI | 1,024 tokens | ~5-10 minutes | Automatic only |
| Google (Gemini) | 2,048 tokens | Configurable (explicit) or automatic (implicit) | Both explicit and implicit |
Anthropic announced prompt caching for the Claude API in August 2024, making it one of the first major providers to offer the feature. Anthropic's implementation gives developers explicit control over caching behavior [2].
Anthropic supports two approaches:
**Automatic caching.** Developers set a cache_control field at the top level of the API request, and the system automatically manages cache breakpoints as conversations grow. This is recommended as the starting point for most use cases.

**Explicit breakpoints.** Developers place cache_control markers directly on individual content blocks (such as the system prompt or specific messages) to define exactly what gets cached.

Cache writes incur a cost premium: writing to a 5-minute cache costs 1.25x the base input token price, while writing to a 1-hour cache costs 2x the base input token price. Cache reads (hits) cost only 0.1x (10%) of the base input token price. Because of this pricing structure, a cache pays for itself after just one read for the 5-minute duration, or after two reads for the 1-hour duration [3].
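The shape of a request with an explicit breakpoint can be illustrated as a plain dictionary (no network call is made). The model id, system prompt, and user message below are placeholders; the structure follows the cache_control mechanism described above.

```python
# Illustrative Messages API request body with an explicit cache_control
# breakpoint on the system prompt. Built as a plain dict; all content
# and the model id are placeholders.
LONG_SYSTEM_PROMPT = "You are a meticulous support agent. " * 200  # well over the 1,024-token minimum

payload = {
    "model": "claude-sonnet-4-6",        # placeholder model id
    "max_tokens": 1024,
    "system": [
        {
            "type": "text",
            "text": LONG_SYSTEM_PROMPT,
            # Everything up to this breakpoint is cached.
            "cache_control": {"type": "ephemeral"},
        }
    ],
    # Variable content goes after the cached prefix.
    "messages": [{"role": "user", "content": "Summarize the latest ticket."}],
}
```

The breakpoint marks the end of the static prefix; everything before it is eligible for caching, while the user message that follows varies per request.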
In February 2026, Anthropic introduced automatic prompt caching that further simplifies usage for agent workflows, addressing what was described as one of the biggest hidden costs in AI agent architectures [4].
| Model | Base Input | 5-min Cache Write | 1-hour Cache Write | Cache Read (Hit) | Output |
|---|---|---|---|---|---|
| Claude Opus 4.6 | $5.00/MTok | $6.25/MTok | $10.00/MTok | $0.50/MTok | $25.00/MTok |
| Claude Sonnet 4.6 | $3.00/MTok | $3.75/MTok | $6.00/MTok | $0.30/MTok | $15.00/MTok |
| Claude Haiku 4.5 | $1.00/MTok | $1.25/MTok | $2.00/MTok | $0.10/MTok | $5.00/MTok |
| Claude Haiku 3.5 | $0.80/MTok | $1.00/MTok | $1.60/MTok | $0.08/MTok | $4.00/MTok |
| Claude Haiku 3 | $0.25/MTok | $0.30/MTok | $0.50/MTok | $0.03/MTok | $1.25/MTok |
MTok = million tokens. Pricing as of March 2026 [3].
OpenAI introduced automatic prompt caching in October 2024 for GPT-4o, GPT-4o-mini, o1-preview, and o1-mini models. Unlike Anthropic's approach, OpenAI's caching is fully automatic: there is no explicit API parameter to enable or configure it. Any prompt longer than 1,024 tokens automatically benefits from caching if a matching prefix has been recently seen [5].
The cache operates at a granularity of 128-token increments after the initial 1,024-token minimum. The system caches the longest prefix of a prompt that matches a previously computed prefix. There is no additional cost for cache writes; cached input tokens are simply billed at a discounted rate [5].
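The granularity rule can be sketched as a small function. The rounding behavior is inferred from the documented 1,024-token minimum and 128-token increments, not from OpenAI source code.

```python
# Sketch of OpenAI's cached-prefix granularity: caching kicks in at 1,024
# tokens and grows in 128-token increments beyond that.
def cacheable_tokens(matched_prefix_len: int) -> int:
    """Tokens billed at the cached rate for a given matched prefix length."""
    if matched_prefix_len < 1024:
        return 0                              # below the minimum: no caching
    return (matched_prefix_len // 128) * 128  # round down to a 128-token boundary

cacheable_tokens(1023)   # 0 (under the minimum)
cacheable_tokens(1500)   # 1408 (eleven full 128-token blocks)
```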
OpenAI's cached token discount varies by model family:
| Model Family | Cached Token Discount | Input Price | Cached Input Price | Output Price |
|---|---|---|---|---|
| GPT-5 family | 90% off (pay 10%) | Varies by model | 10% of input price | Varies by model |
| GPT-4.1 family | 75% off (pay 25%) | Varies by model | 25% of input price | Varies by model |
| GPT-4o / o-series | 50% off (pay 50%) | $2.50/MTok (GPT-4o) | $1.25/MTok (GPT-4o) | $10.00/MTok (GPT-4o) |
| GPT-4o-mini | 50% off (pay 50%) | $0.15/MTok | $0.075/MTok | $0.60/MTok |
Pricing as of early 2026. Newer model families receive steeper discounts [5][6].
A key difference from Anthropic's implementation is that OpenAI charges no premium for cache writes. Cache misses are simply billed at the standard input rate, and cache hits receive the discount automatically. This makes OpenAI's approach zero-risk from a cost perspective: developers never pay more than they would without caching [5].
Google was an early mover in this area, introducing "context caching" for Gemini models in May 2024. Google offers both explicit and implicit caching [7].
Explicit caching. Developers create a named cache object via the API, specifying the content to cache and a TTL. The cached content can be a system instruction, a document, or any prefix content. Cached tokens are billed at the standard input rate for the initial write, plus a per-hour storage cost. Subsequent requests that reference the cache pay a reduced per-token rate for the cached portion [7].
Implicit caching. Rolled out in May 2025 for the Gemini API, implicit caching automatically passes cache cost savings to developers without requiring them to create explicit cache objects. There are no storage costs for implicit caching [8].
| Gemini Model | Input Price | Cached Input Price (Read) | Cache Discount | Storage Cost (Explicit Only) |
|---|---|---|---|---|
| Gemini 2.5 Pro | $1.00/MTok | $0.10/MTok | 90% off | $4.50/MTok/hour |
| Gemini 2.5 Flash | $0.30/MTok | $0.03/MTok | 90% off | $1.00/MTok/hour |
| Gemini 2.0 Flash | $0.10/MTok | $0.025/MTok | 75% off | $1.00/MTok/hour |
Pricing as of early 2026 via the Gemini Developer API. Vertex AI pricing may differ [7][9].
The minimum cacheable content for Gemini is 2,048 tokens, and the feature supports caching up to the model's full context window (over 1 million tokens for Gemini 2.5 Pro) [7].
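Because explicit caches carry an hourly storage cost, an explicit cache only pays off if per-read savings outrun storage. A back-of-envelope break-even using the Gemini 2.5 Pro figures from the table above (exact decimal arithmetic via Fraction to avoid float noise):

```python
# Break-even reads per hour for a Gemini 2.5 Pro explicit cache,
# using the per-MTok prices from the table above.
import math
from fractions import Fraction

input_price   = Fraction("1.00")   # $/MTok, standard input
cached_price  = Fraction("0.10")   # $/MTok, cache read
storage_price = Fraction("4.50")   # $/MTok/hour, explicit cache storage

savings_per_read = input_price - cached_price          # $0.90/MTok saved per read
reads_per_hour = math.ceil(storage_price / savings_per_read)  # 5 reads/hour
```

At fewer than five reads per hour, the storage cost exceeds the read savings and implicit caching (or no caching) is the better choice for this model's pricing.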
The following table summarizes the key differences between prompt caching implementations across providers:
| Feature | Anthropic (Claude) | OpenAI (GPT) | Google (Gemini) |
|---|---|---|---|
| Launch date | August 2024 | October 2024 | May 2024 (explicit), May 2025 (implicit) |
| Cache type | Explicit + automatic | Automatic only | Explicit + implicit |
| Developer control | High (breakpoints, TTL choice) | None (fully automatic) | Medium (explicit caches with TTL) |
| Minimum prefix | 1,024 tokens | 1,024 tokens | 2,048 tokens |
| Cache write cost | 1.25x-2x base input | No premium (standard input rate) | Standard input rate + storage (explicit) |
| Cache read discount | 90% off base input | 50-90% off (varies by model) | 75-90% off (varies by model) |
| Storage cost | None (included in write premium) | None | $1-4.50/MTok/hour (explicit only) |
| Cache duration | 5 minutes or 1 hour | ~5-10 minutes | Configurable TTL (explicit), automatic (implicit) |
| Break-even point | 1 read (5-min), 2 reads (1-hr) | Immediate (no write premium) | Depends on storage cost vs. read savings |
Prompt caching is most valuable in scenarios where a significant portion of the prompt remains constant across multiple API calls.
Most LLM applications include a system prompt that defines the model's role, behavior constraints, and output format. This system prompt is identical across every user interaction. Without caching, the model reprocesses these tokens on every request. With caching, the system prompt is processed once and reused, reducing both cost and latency for all subsequent requests [2].
For applications with large system prompts (thousands of tokens containing detailed instructions, personas, or tool definitions), the savings are proportionally larger. A 5,000-token system prompt that is repeated across 100 requests per minute would see its effective input cost reduced by roughly 90% for those tokens.
Applications that use few-shot learning include multiple input-output examples in the prompt to guide the model's behavior. These examples are typically static and repeated across all requests. Prompt caching allows these examples to be processed once and reused, which is particularly valuable when using many examples or when examples include long text passages [2].
When using an LLM to analyze a document (answering questions, extracting information, summarizing), the document content is included in the prompt. If multiple questions are asked about the same document across different API calls, prompt caching avoids reprocessing the document each time. This is directly relevant to retrieval-augmented generation (RAG) workflows where a retrieved context is queried multiple times [2].
In conversational applications, each new message in a conversation typically includes the full conversation history as context. As the conversation grows, the prompt becomes longer with each turn. Prompt caching helps because each new turn's prompt shares a long common prefix (the conversation history up to the previous turn) with the prior request. The model only needs to process the new user message and any new system content [2].
AI agent systems that use tool calling and multi-step reasoning often make many sequential API calls with similar or overlapping prompts. The system prompt, tool definitions, and accumulated context tend to remain stable across calls within a single agent run. Prompt caching can significantly reduce the cost of these repetitive computations, which is why Anthropic specifically highlighted agent workflows as a key use case for automatic prompt caching [4].
To maximize cache hit rates, developers should structure their prompts so that static content appears at the beginning and variable content at the end. A recommended ordering is:

1. System prompt and tool definitions (fully static)
2. Few-shot examples and reference documents (static per application or task)
3. Conversation history (grows over time, but each turn's history is a prefix of the next)
4. The current user message (variable)
This structure ensures that the longest possible prefix remains cacheable across requests. Placing variable content before static content (e.g., putting the user message before the system prompt) would prevent caching of the system prompt [2].
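A minimal sketch of cache-friendly prompt assembly, under the static-first ordering above. The roles and content here are placeholders; the point is only the ordering of the parts.

```python
# Cache-friendly prompt assembly: static content first, the variable
# user turn last, so the shared prefix is as long as possible.
FEW_SHOT = [("Example input clause", "Example analysis")]   # static examples

def build_messages(history, user_msg):
    msgs = []
    for q, a in FEW_SHOT:                # static: identical across requests
        msgs.append({"role": "user", "content": q})
        msgs.append({"role": "assistant", "content": a})
    msgs.extend(history)                 # grows, but is a prefix of the next turn
    msgs.append({"role": "user", "content": user_msg})  # variable: always last
    return msgs
```

Reversing this order (for instance, interpolating the user's question into the first message) would invalidate the cache from the first divergent token onward.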
For Anthropic's explicit caching, developers can "warm up" the cache by making an initial request that includes the content to be cached with appropriate cache_control markers. Subsequent requests will hit the warm cache. For applications with predictable traffic patterns, cache warm-up can be triggered shortly before expected peak usage [2].
All three major providers return cache hit information in their API responses. Anthropic includes cache_creation_input_tokens and cache_read_input_tokens fields. OpenAI includes cached_tokens in the usage object. Google returns information about cached token usage as well. Monitoring these fields allows developers to measure cache hit rates and optimize their prompt structures accordingly [2][5][7].
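A cache hit rate can be derived from these fields. The field names below follow Anthropic's usage object as described above; the sample values are made up for illustration.

```python
# Sketch: cache hit rate from Anthropic-style usage fields.
def cache_hit_rate(usage: dict) -> float:
    read    = usage.get("cache_read_input_tokens", 0)
    created = usage.get("cache_creation_input_tokens", 0)
    fresh   = usage.get("input_tokens", 0)   # uncached input tokens
    total = read + created + fresh
    return read / total if total else 0.0

usage = {
    "input_tokens": 200,
    "cache_creation_input_tokens": 0,
    "cache_read_input_tokens": 1800,
}
cache_hit_rate(usage)   # 0.9: 90% of input tokens were served from cache
```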
Prompt caching is not limited to proprietary APIs. Open-source LLM inference engines have implemented similar mechanisms, often called "prefix caching" or "automatic prefix caching" (APC).
vLLM, the widely used high-throughput LLM serving engine, implements Automatic Prefix Caching (APC) that caches KV representations at the block level. vLLM's PagedAttention system manages the KV cache in fixed-size blocks (typically 16 tokens), and APC stores these blocks indexed by their token content. When a new request arrives, vLLM checks if any prefix blocks match previously computed blocks and reuses them [10].
vLLM's APC requires exact token-level matches and works at block boundaries. The approach is deterministic: given the same token sequence, the same cache blocks will be matched. This works well for structured prompts with consistent prefixes but requires manual optimization for variable prompt structures [10].
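The block-level matching can be illustrated with a toy in the spirit of APC: the prompt is split into fixed 16-token blocks, each block is keyed by its content plus its prefix (so identical content at different positions cannot falsely match), and only an unbroken run of leading blocks can hit. This is a simplification, not vLLM's actual data layout.

```python
# Toy block-level prefix matching in the spirit of vLLM's APC.
BLOCK = 16
block_cache = set()   # keys: (hash of preceding key, block contents)

def match_and_fill(tokens):
    """Return how many leading tokens were served from the block cache."""
    matched, prev = 0, None
    for i in range(0, len(tokens) - len(tokens) % BLOCK, BLOCK):
        key = (prev, tuple(tokens[i:i + BLOCK]))  # block + its prefix context
        if matched == i and key in block_cache:
            matched = i + BLOCK                   # contiguous hit from the start
        block_cache.add(key)                      # cache every full block computed
        prev = hash(key)
    return matched
```

Note that partial blocks never hit: a 40-token prompt caches only its first two 16-token blocks, so a repeat of the same prompt reuses 32 tokens, not 40.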
SGLang takes a different approach to prefix caching using a radix tree (also called a radix attention tree). The radix tree structure allows SGLang to automatically discover and exploit shared prefixes at the token level, without requiring block-aligned matches. This makes SGLang's caching more flexible, particularly for multi-turn conversations where the shared prefix length varies unpredictably [11].
SGLang's radix tree approach has been shown to be particularly effective for multi-turn conversation workloads, where the conversation history grows incrementally and the shared prefix length changes with each turn [11].
| Feature | vLLM APC | SGLang Radix Attention |
|---|---|---|
| Matching granularity | Block-level (16 tokens) | Token-level |
| Data structure | Hash table on block contents | Radix tree |
| Best for | Batch inference, templated prompts | Multi-turn conversations, variable prefixes |
| Automatic discovery | Yes (within block boundaries) | Yes (any token boundary) |
| Memory management | PagedAttention blocks | Radix tree with LRU eviction |
LMCache is an open-source KV caching layer that works with both vLLM and SGLang. It extracts and stores KV caches generated by the inference engine and moves them out of GPU memory to CPU memory, disk, or even a distributed cache. This allows KV caches to be shared across multiple GPU workers or even across different machines, enabling prefix caching at the cluster level rather than just the single-GPU level [12].
The performance benefits of prompt caching fall into two categories: cost reduction and latency reduction.
The cost savings are straightforward to calculate from each provider's pricing. The following scenarios illustrate workloads where most input tokens are cacheable:
| Scenario | Standard Cost | With Caching | Savings |
|---|---|---|---|
| 10,000 tokens cached, 2,500 tokens new, 1,000 requests (Anthropic Claude Sonnet 4.6) | $37.50 (12.5M tokens at $3.00/MTok) | $7.50 (new tokens) + $3.00 (10M cached reads at $0.30/MTok) = $10.50 | ~72% on input costs |
| System prompt of 5,000 tokens, 100 requests (OpenAI GPT-4o) | $1.25 (500K tokens at $2.50/MTok) | ~$0.63 (first request full price, 99 at 50% off) | ~50% on input costs |
| Document analysis, 50K tokens, 10 queries (Google Gemini 2.5 Pro) | $0.50 (500K tokens at $1.00/MTok) | $0.095 (first query full price, 9 at 90% off) | ~81% on input costs |

Figures ignore one-time cache-write premiums (Anthropic) and storage costs (Google explicit caching), which are amortized across the requests shown.
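These per-scenario figures can be recomputed directly from the per-token prices (cache-write premiums are ignored, as in the table, so the Anthropic figure is slightly optimistic):

```python
# Recomputing the savings scenarios from per-MTok prices.

# Anthropic Claude Sonnet 4.6: 10k cached + 2.5k new tokens, 1,000 requests
standard = 12.5 * 3.00                       # 12.5M tokens at $3.00/MTok = $37.50
cached   = 2.5 * 3.00 + 10.0 * 0.30          # $7.50 new + $3.00 cache reads = $10.50
anthropic_savings = 1 - cached / standard    # ~0.72

# Google Gemini 2.5 Pro: 50K-token document, 10 queries
standard_g = 0.5 * 1.00                      # 500K tokens at $1.00/MTok = $0.50
cached_g   = 0.05 * 1.00 + 0.45 * 0.10       # first query full, 9 at $0.10/MTok
gemini_savings = 1 - cached_g / standard_g   # ~0.81
```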
Cached tokens do not need to go through the full forward pass of the model, which reduces the time-to-first-token (TTFT). Anthropic reports up to 85% latency reduction for long cached prompts. The actual latency improvement depends on the ratio of cached to uncached tokens and the model size. For prompts where the vast majority of tokens are cached, the TTFT approaches that of a very short prompt [2].
Prompt caching is one of several techniques for reducing the cost and latency of LLM inference, alongside approaches such as quantization, batching, speculative decoding, and model distillation. It is complementary to most of them: those techniques make each forward pass cheaper or faster, while caching eliminates redundant forward-pass work across requests.
As of early 2026, prompt caching has become a standard feature across all major LLM API providers and is widely adopted in production applications. Several trends characterize the current landscape:
Automatic caching as the default. Both OpenAI and Google have moved toward fully automatic caching that requires no developer configuration. Anthropic's introduction of automatic prompt caching in early 2026 follows this trend, while still maintaining explicit control options for advanced users [4].
Deeper discounts for newer models. Pricing trends show that newer model families receive steeper caching discounts. OpenAI's GPT-5 family offers 90% off cached tokens (up from 50% for GPT-4o), and Google's Gemini 2.5 models offer 90% off (up from 75% for Gemini 2.0). This suggests that providers view caching as an important competitive differentiator and a way to encourage higher-volume usage [6][9].
Longer cache durations. Anthropic's introduction of the 1-hour cache option (at a higher write cost) responds to demand from applications where requests are spaced further apart. The trend is toward more flexible cache management that accommodates different usage patterns.
Integration with agent frameworks. As AI agent architectures become more common, prompt caching has become essential for managing costs. Agent workflows that involve dozens of sequential LLM calls with overlapping context are among the biggest beneficiaries of caching. Framework developers have begun building prompt caching awareness directly into agent orchestration libraries [4].
Open-source convergence. vLLM, SGLang, and other open-source inference engines continue to improve their prefix caching implementations. LMCache and similar projects are extending caching across multi-GPU and multi-node deployments, bringing the benefits of prompt caching to self-hosted model serving [10][11][12].
The economics of prompt caching are clear: for any application that repeatedly sends similar prompts to an LLM API, caching offers substantial cost savings with minimal implementation effort. As LLM applications mature and prompt engineering practices become more sophisticated (with larger system prompts, more few-shot examples, and longer context windows), the value of prompt caching will only increase.