# Prompt Caching

> Source: https://aiwiki.ai/wiki/prompt_caching
> Updated: 2026-06-21
> Categories: Large Language Models, Machine Learning
> From AI Wiki (https://aiwiki.ai), a free encyclopedia of artificial intelligence. Quote with attribution.

Prompt caching is an [large language model](/wiki/large_language_model) (LLM) inference optimization that stores the computed key-value (KV) state of a repeated prompt prefix so it can be reused across API calls, cutting both cost and latency for the cached portion. The most-cited figures come from [Anthropic](/wiki/anthropic), which says prompt caching can reduce costs by up to 90% and latency by up to 85% for long prompts [1]. When a prompt begins with the same tokens as a previous request (such as a system prompt, few-shot examples, or a shared document), the model skips recomputing the [attention](/wiki/attention) keys and values for those cached tokens and begins work only from the point where the prompts diverge. Every major LLM API provider now offers the feature: Anthropic launched it in August 2024, [OpenAI](/wiki/openai) added automatic prompt caching in October 2024 at a 50% discount on cached input tokens, and [Google](/wiki/google_deepmind) introduced Gemini context caching, with cached-token discounts across providers ranging from 50% to 90% [1][2][3].

## Background: The KV Cache

To understand prompt caching, it is helpful to first understand the KV cache, which is the underlying mechanism that makes it possible.

In a [transformer](/wiki/transformer)-based language model, each layer computes attention over the input sequence by projecting tokens into query (Q), key (K), and value (V) representations. During autoregressive generation, the model produces tokens one at a time. Without caching, generating each new token would require recomputing the keys and values for every previous token in the sequence, resulting in O(n^2) total computation across a sequence of length n.

The KV cache avoids this redundancy by storing the key and value tensors computed during previous forward passes. When generating the next token, the model only needs to compute the query, key, and value for the new token, then attend over the cached keys and values from all previous tokens. This reduces per-token computation from O(n) to O(1) (excluding the attention operation itself) and is essential for making autoregressive generation practical at scale [4].

The KV cache is a per-request resource: each inference request builds up its own KV cache as it processes the prompt and generates output tokens. Prompt caching extends this concept across requests, sharing cached KV representations between different API calls that happen to share the same prompt prefix.

## How Does Prompt Caching Work?

Prompt caching operates on a simple principle: if two requests share the same initial tokens (the same prefix), the KV representations for those shared tokens are identical and do not need to be recomputed.

The workflow proceeds as follows:

1. **First request.** The model processes the full prompt and generates the KV cache for all input tokens. The provider stores (or "writes") this KV cache, associating it with the exact token sequence.
2. **Subsequent requests.** When a new request arrives whose prompt begins with the same tokens, the system detects the matching prefix and loads the stored KV cache. The model only needs to compute KV representations for the tokens that follow the cached prefix.
3. **Generation.** The model generates output tokens using the combined KV cache (cached prefix plus newly computed tokens) as normal.

This means that the computational cost of processing the prompt is reduced in proportion to the fraction of tokens that are cached. For applications where a large system prompt or document is repeated across many requests, the savings can be very significant.

### What Are the Requirements and Constraints?

Prompt caching works only for exact prefix matches. The cached tokens must be identical, in the same order, at the beginning of the prompt. If any token in the cached prefix changes, the cache is invalidated from that point forward. As Anthropic's documentation puts it, "the cache has a 5 minute time to live (TTL)" by default, and the cached content must appear at the start of the prompt with variable content appended at the end [2].

Most providers enforce a minimum prefix length for caching to be effective, and caches have a limited time-to-live (TTL) before they expire.

| Provider | Minimum Prefix Length | Cache Duration | Cache Type |
|---|---|---|---|
| Anthropic | 1,024 tokens (Claude Haiku 3.5: 2,048) | 5 minutes or 1 hour (configurable) | Explicit (manual) and automatic |
| OpenAI | 1,024 tokens | ~5-10 minutes | Automatic only |
| Google (Gemini) | 2,048 tokens | Configurable (explicit) or automatic (implicit) | Both explicit and implicit |

## Provider Implementations

### When Did Anthropic Launch Prompt Caching?

Anthropic announced prompt caching for the [Claude](/wiki/claude) API on August 14, 2024, making it one of the first major providers to offer the feature. It launched in public beta with support for Claude 3.5 Sonnet and Claude 3 Haiku, later expanding to Claude 3 Opus [1]. Anthropic's implementation gives developers explicit control over caching behavior [2].

At launch, Anthropic described the benefit directly: "With prompt caching, customers can provide Claude with more background knowledge and example outputs, all while reducing costs by up to 90% and latency by up to 85% for long prompts" [1].

Anthropic supports two approaches:

- **Automatic caching.** Developers add a single `cache_control` field at the top level of the API request, and the system automatically manages cache breakpoints as conversations grow. This is recommended as the starting point for most use cases.
- **Explicit cache breakpoints.** Developers place `cache_control` markers directly on individual content blocks (such as the system prompt or specific messages) to define exactly what gets cached. A maximum of 4 cache breakpoints is allowed per request.

Cache writes incur a cost premium: writing to a 5-minute cache costs 1.25x the base input token price, while writing to a 1-hour cache costs 2x the base input token price. Cache reads (hits) cost only 0.1x (10%) of the base input token price [2]. Because of this pricing structure, a 5-minute cache pays for itself once it is read twice (1.25x write + 0.1x read = 1.35x, versus 2x to send the same content uncached twice); a 1-hour cache needs at least three reads to break even (2x write + 0.2x reads = 2.2x, versus 3x uncached) [2].

In February 2026, Anthropic introduced automatic prompt caching that further simplifies usage for agent workflows, addressing what was described as one of the biggest hidden costs in AI agent architectures [5].

#### Anthropic Prompt Caching Pricing

| Model | Base Input | 5-min Cache Write | 1-hour Cache Write | Cache Read (Hit) | Output |
|---|---|---|---|---|---|
| Claude Opus 4.6 | $5.00/MTok | $6.25/MTok | $10.00/MTok | $0.50/MTok | $25.00/MTok |
| Claude Sonnet 4.6 | $3.00/MTok | $3.75/MTok | $6.00/MTok | $0.30/MTok | $15.00/MTok |
| Claude Haiku 4.5 | $1.00/MTok | $1.25/MTok | $2.00/MTok | $0.10/MTok | $5.00/MTok |
| Claude Haiku 3.5 | $0.80/MTok | $1.00/MTok | $1.60/MTok | $0.08/MTok | $4.00/MTok |
| Claude Haiku 3 | $0.25/MTok | $0.30/MTok | $0.50/MTok | $0.03/MTok | $1.25/MTok |

MTok = million tokens. Pricing as of March 2026 [3].

### When Did OpenAI Add Prompt Caching?

OpenAI introduced automatic prompt caching on October 1, 2024, for GPT-4o, GPT-4o-mini, o1-preview, and o1-mini models. Unlike Anthropic's approach, OpenAI's caching is fully automatic: there is no explicit API parameter to enable or configure it. As OpenAI stated, "API calls to supported models will automatically benefit from Prompt Caching on prompts longer than 1,024 tokens," with "a 50% discount" applied to cached input tokens and "no action required" [2].

The cache operates at a granularity of 128-token increments after the initial 1,024-token minimum. The system caches the longest prefix of a prompt that matches a previously computed prefix. There is no additional cost for cache writes; cached input tokens are simply billed at a discounted rate [2].

OpenAI's cached token discount varies by model family:

| Model Family | Cached Token Discount | Input Price | Cached Input Price | Output Price |
|---|---|---|---|---|
| GPT-5 family | 90% off (pay 10%) | Varies by model | 10% of input price | Varies by model |
| GPT-4.1 family | 75% off (pay 25%) | Varies by model | 25% of input price | Varies by model |
| GPT-4o / o-series | 50% off (pay 50%) | $2.50/MTok (GPT-4o) | $1.25/MTok (GPT-4o) | $10.00/MTok (GPT-4o) |
| GPT-4o-mini | 50% off (pay 50%) | $0.15/MTok | $0.075/MTok | $0.60/MTok |

Pricing as of early 2026. Newer model families receive steeper discounts [2][6].

A key difference from Anthropic's implementation is that OpenAI charges no premium for cache writes. Cache misses are simply billed at the standard input rate, and cache hits receive the discount automatically. This makes OpenAI's approach zero-risk from a cost perspective: developers never pay more than they would without caching [2]. OpenAI also reports that prompt caching can reduce time-to-first-token latency by up to 80% for long prompts [2].

### How Does Google Context Caching Work?

Google was an early pioneer of context caching, introducing the feature for [Gemini](/wiki/gemini) models in May 2024 under the name "context caching." Google offers both explicit and implicit caching [7].

**Explicit caching.** Developers create a named cache object via the API, specifying the content to cache and a TTL. The cached content can be a system instruction, a document, or any prefix content. Cached tokens are billed at the standard input rate for the initial write, plus a per-hour storage cost. Subsequent requests that reference the cache pay a reduced per-token rate for the cached portion [7].

**Implicit caching.** Rolled out on May 8, 2025, for the Gemini API, implicit caching automatically passes cache cost savings to developers without requiring them to create explicit cache objects. It is enabled by default for Gemini 2.5 and newer models, applies the same token discount when a request shares a common prefix with a prior request, and incurs no storage costs [8].

| Gemini Model | Input Price | Cached Input Price (Read) | Cache Discount | Storage Cost (Explicit Only) |
|---|---|---|---|---|
| Gemini 2.5 Pro | $1.00/MTok | $0.10/MTok | 90% off | $4.50/MTok/hour |
| Gemini 2.5 Flash | $0.30/MTok | $0.03/MTok | 90% off | $1.00/MTok/hour |
| Gemini 2.0 Flash | $0.10/MTok | $0.025/MTok | 75% off | $1.00/MTok/hour |

Pricing as of early 2026 via the Gemini Developer API. Vertex AI pricing may differ [7][9].

The minimum cacheable content for Gemini is 2,048 tokens for implicit caching on Gemini 2.5 Pro (1,024 tokens for 2.5 Flash), and the feature supports caching up to the model's full context window (over 1 million tokens for Gemini 2.5 Pro) [7][8].

## How Do Prompt Caching Implementations Differ Across Providers?

The following table summarizes the key differences between prompt caching implementations across providers:

| Feature | Anthropic (Claude) | OpenAI (GPT) | Google (Gemini) |
|---|---|---|---|
| Launch date | August 2024 | October 2024 | May 2024 (explicit), May 2025 (implicit) |
| Cache type | Explicit + automatic | Automatic only | Explicit + implicit |
| Developer control | High (breakpoints, TTL choice) | None (fully automatic) | Medium (explicit caches with TTL) |
| Minimum prefix | 1,024 tokens | 1,024 tokens | 2,048 tokens |
| Cache write cost | 1.25x-2x base input | No premium (standard input rate) | Standard input rate + storage (explicit) |
| Cache read discount | 90% off base input | 50-90% off (varies by model) | 75-90% off (varies by model) |
| Storage cost | None (included in write premium) | None | $1-4.50/MTok/hour (explicit only) |
| Cache duration | 5 minutes or 1 hour | ~5-10 minutes | Configurable TTL (explicit), automatic (implicit) |
| Break-even point | 2 reads (5-min), 3 reads (1-hr) | Immediate (no write premium) | Depends on storage cost vs. read savings |

## What Is Prompt Caching Used For?

Prompt caching is most valuable in scenarios where a significant portion of the prompt remains constant across multiple API calls.

### System Prompts

Most LLM applications include a system prompt that defines the model's role, behavior constraints, and output format. This system prompt is identical across every user interaction. Without caching, the model reprocesses these tokens on every request. With caching, the system prompt is processed once and reused, reducing both cost and latency for all subsequent requests [2].

For applications with large system prompts (thousands of tokens containing detailed instructions, personas, or tool definitions), the savings are proportionally larger. A 5,000-token system prompt that is repeated across 100 requests per minute would see its effective input cost reduced by roughly 90% for those tokens.

### Few-Shot Examples

Applications that use [few-shot learning](/wiki/few-shot_learning) include multiple input-output examples in the prompt to guide the model's behavior. These examples are typically static and repeated across all requests. Prompt caching allows these examples to be processed once and reused, which is particularly valuable when using many examples or when examples include long text passages [2].

### Document Analysis

When using an LLM to analyze a document (answering questions, extracting information, summarizing), the document content is included in the prompt. If multiple questions are asked about the same document across different API calls, prompt caching avoids reprocessing the document each time. This is directly relevant to [retrieval-augmented generation](/wiki/retrieval_augmented_generation) (RAG) workflows where a retrieved context is queried multiple times [2].

### Multi-Turn Conversations

In conversational applications, each new message in a conversation typically includes the full conversation history as context. As the conversation grows, the prompt becomes longer with each turn. Prompt caching helps because each new turn's prompt shares a long common prefix (the conversation history up to the previous turn) with the prior request. The model only needs to process the new user message and any new system content [2].

### Agentic Workflows

AI agent systems that use tool calling and multi-step reasoning often make many sequential API calls with similar or overlapping prompts. The system prompt, tool definitions, and accumulated context tend to remain stable across calls within a single agent run. Prompt caching can significantly reduce the cost of these repetitive computations, which is why Anthropic specifically highlighted agent workflows as a key use case for automatic prompt caching [5].

## Implementation Strategies

### How Should Prompts Be Structured for Caching?

To maximize cache hit rates, developers should structure their prompts so that static content appears at the beginning and variable content at the end. The render order on the Anthropic API is tools, then system, then messages, so the most stable content (frozen system prompt, deterministic tool list) belongs first [2]. A recommended ordering is:

1. [System prompt](/wiki/system_prompt) (static)
2. Few-shot examples (static)
3. Shared context or documents (semi-static)
4. Conversation history (growing)
5. Current user message (variable)

This structure ensures that the longest possible prefix remains cacheable across requests. Placing variable content before static content (e.g., putting the user message before the system prompt, or interpolating a timestamp into the system prompt) would prevent caching of everything that follows [2].

### Cache Warm-Up

For Anthropic's explicit caching, developers can "warm up" the cache by making an initial request that includes the content to be cached with appropriate cache_control markers. Subsequent requests will hit the warm cache. For applications with predictable traffic patterns, cache warm-up can be triggered shortly before expected peak usage [2].

### Monitoring Cache Performance

All three major providers return cache hit information in their API responses. Anthropic includes `cache_creation_input_tokens` and `cache_read_input_tokens` fields. OpenAI includes `cached_tokens` in the usage object. Google returns information about cached token usage as well. Monitoring these fields allows developers to measure cache hit rates and optimize their prompt structures accordingly. If `cache_read_input_tokens` stays at zero across repeated identical-prefix requests, a silent invalidator (a timestamp, a random ID, or non-deterministic JSON serialization in the prefix) is usually at work [2][5][7].

## Prefix Caching in Open-Source Inference Engines

Prompt caching is not limited to proprietary APIs. Open-source LLM inference engines have implemented similar mechanisms, often called "prefix caching" or "automatic prefix caching" (APC).

### vLLM

[vLLM](/wiki/vllm), the widely used high-throughput LLM serving engine, implements Automatic Prefix Caching (APC) that caches KV representations at the block level. vLLM's PagedAttention system manages the KV cache in fixed-size blocks (typically 16 tokens), and APC stores these blocks indexed by their token content. When a new request arrives, vLLM checks if any prefix blocks match previously computed blocks and reuses them [10].

vLLM's APC requires exact token-level matches and works at block boundaries. The approach is deterministic: given the same token sequence, the same cache blocks will be matched. This works well for structured prompts with consistent prefixes but requires manual optimization for variable prompt structures [10].

### SGLang

[SGLang](/wiki/sglang) takes a different approach to prefix caching using a radix tree (also called a radix attention tree). The radix tree structure allows SGLang to automatically discover and exploit shared prefixes at the token level, without requiring block-aligned matches. This makes SGLang's caching more flexible, particularly for multi-turn conversations where the shared prefix length varies unpredictably [11].

SGLang's radix tree approach has been shown to be particularly effective for multi-turn conversation workloads, where the conversation history grows incrementally and the shared prefix length changes with each turn [11].

| Feature | vLLM APC | SGLang Radix Attention |
|---|---|---|
| Matching granularity | Block-level (16 tokens) | Token-level |
| Data structure | Hash table on block contents | Radix tree |
| Best for | Batch inference, templated prompts | Multi-turn conversations, variable prefixes |
| Automatic discovery | Yes (within block boundaries) | Yes (any token boundary) |
| Memory management | PagedAttention blocks | Radix tree with LRU eviction |

### LMCache

LMCache is an open-source KV caching layer that works with both vLLM and SGLang. It extracts and stores KV caches generated by the inference engine and moves them out of GPU memory to CPU memory, disk, or even a distributed cache. This allows KV caches to be shared across multiple GPU workers or even across different machines, enabling prefix caching at the cluster level rather than just the single-GPU level [12].

## How Much Does Prompt Caching Save?

The performance benefits of prompt caching fall into two categories: cost reduction and latency reduction.

### Cost Reduction

The cost savings are straightforward to calculate based on the provider's pricing. For a workload where 80% of input tokens are cacheable:

| Scenario | Standard Cost | With Caching (90% discount) | Savings |
|---|---|---|---|
| 10,000 tokens cached, 2,500 tokens new (Anthropic Sonnet 4.6) | $37.50/MTok (all at $3/MTok input) | $3.00 + $0.30*10 = $6.00/MTok equivalent | ~84% on input costs |
| System prompt of 5,000 tokens, 100 requests (OpenAI GPT-4o) | $1.25 total (500K tokens at $2.50/MTok) | $0.625 total (50% cache discount) | 50% on cached portion |
| Document analysis, 50K tokens, 10 queries (Google Gemini 2.5 Pro) | $0.50 total (500K tokens at $1/MTok) | $0.095 total (first query full, 9 at 90% off) | ~81% on input costs |

### Latency Reduction

Cached tokens do not need to go through the full forward pass of the model, which reduces the time-to-first-token (TTFT). Anthropic reports up to 85% latency reduction for long cached prompts, and OpenAI reports up to 80% TTFT reduction [1][2]. The actual latency improvement depends on the ratio of cached to uncached tokens and the model size. For prompts where the vast majority of tokens are cached, the TTFT approaches that of a very short prompt [2].

## Relationship to Other Optimization Techniques

Prompt caching is one of several techniques for reducing the cost and latency of LLM inference. It is complementary to most other approaches:

- **[Quantization](/wiki/quantization).** Reduces model size and memory usage. Orthogonal to prompt caching; both can be applied simultaneously.
- **[Speculative decoding](/wiki/speculative_decoding).** Accelerates output token generation using a smaller draft model. Complements prompt caching, which primarily accelerates input processing.
- **Batch processing.** Groups multiple requests for efficiency. Anthropic and OpenAI both offer batch API discounts that stack with prompt caching discounts.
- **[Knowledge distillation](/wiki/knowledge_distillation).** Trains a smaller model to mimic a larger one. An alternative to prompt caching for reducing costs, though with potential quality trade-offs.
- **Model routing.** Directs simpler queries to smaller, cheaper models. Complementary to prompt caching, as both reduce costs through different mechanisms.

## Current State (2025-2026)

As of early 2026, prompt caching has become a standard feature across all major LLM API providers and is widely adopted in production applications. Several trends characterize the current landscape:

**Automatic caching as the default.** Both OpenAI and Google have moved toward fully automatic caching that requires no developer configuration. Anthropic's introduction of automatic prompt caching in early 2026 follows this trend, while still maintaining explicit control options for advanced users [5].

**Deeper discounts for newer models.** Pricing trends show that newer model families receive steeper caching discounts. OpenAI's [GPT-5](/wiki/gpt-5) family offers 90% off cached tokens (up from 50% for GPT-4o), and Google's Gemini 2.5 models offer 90% off (up from 75% for Gemini 2.0). This suggests that providers view caching as an important competitive differentiator and a way to encourage higher-volume usage [6][9].

**Longer cache durations.** Anthropic's introduction of the 1-hour cache option (at a higher write cost) responds to demand from applications where requests are spaced further apart. The trend is toward more flexible cache management that accommodates different usage patterns.

**Integration with agent frameworks.** As [AI agent](/wiki/ai_agent) architectures become more common, prompt caching has become essential for managing costs. [Agent](/wiki/agent) workflows that involve dozens of sequential LLM calls with overlapping context are among the biggest beneficiaries of caching. Framework developers have begun building prompt caching awareness directly into agent orchestration libraries [5].

**Open-source convergence.** vLLM, SGLang, and other open-source inference engines continue to improve their prefix caching implementations. LMCache and similar projects are extending caching across multi-GPU and multi-node deployments, bringing the benefits of prompt caching to self-hosted model serving [10][11][12].

The economics of prompt caching are clear: for any application that repeatedly sends similar prompts to an LLM API, caching offers substantial cost savings with minimal implementation effort. As LLM applications mature and prompt engineering practices become more sophisticated (with larger system prompts, more few-shot examples, and longer context windows), the value of prompt caching will only increase.

## References

1. Anthropic. (2024). "Prompt caching with Claude." https://www.anthropic.com/news/prompt-caching
2. OpenAI. (2024). "Prompt Caching in the API." https://openai.com/index/api-prompt-caching/
3. Anthropic. (2024). "Prompt Caching." Claude API Documentation. https://platform.claude.com/docs/en/build-with-claude/prompt-caching
4. Pope, R., Douglas, S., Chowdhery, A., et al. (2022). "Efficiently Scaling Transformer [Inference](/wiki/inference)." MLSys 2023. https://arxiv.org/abs/2211.05102
5. Njenga, J. (2026). "Anthropic Just Fixed the Biggest Hidden Cost in AI Agents (Automatic Prompt Caching)." Medium. https://medium.com/ai-software-engineer/anthropic-just-fixed-the-biggest-hidden-cost-in-ai-agents-using-automatic-prompt-caching-9d47c95903c5
6. OpenAI. (2026). "Pricing." https://openai.com/api/pricing/
7. Google. (2025). "Context Caching." Gemini API Documentation. https://ai.google.dev/gemini-api/docs/caching
8. Google Developers Blog. (2025). "Gemini 2.5 Models Now Support Implicit Caching." https://developers.googleblog.com/en/gemini-2-5-models-now-support-implicit-caching/
9. Google. (2025). "Gemini Developer API Pricing." https://ai.google.dev/gemini-api/docs/pricing
10. vLLM Documentation. "Automatic Prefix Caching." https://docs.vllm.ai/en/stable/design/prefix_caching/
11. Moon, D. (2025). "Prefix Caching: SGLang vs vLLM." Medium. https://medium.com/byte-sized-ai/prefix-caching-sglang-vs-vllm-token-level-radix-tree-vs-block-level-hashing-b99ece9977a1
12. Hu, Y., et al. (2025). "LMCache: An Efficient [KV Cache](/wiki/kv_cache) Layer for Enterprise-Scale LLM Inference." https://arxiv.org/abs/2510.09665

