Context caching
Last reviewed
May 24, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v1 ยท 4,471 words
Improve this article
Add missing citations, update stale details, or suggest a clearer explanation.
Last reviewed
May 24, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v1 ยท 4,471 words
Add missing citations, update stale details, or suggest a clearer explanation.
Context caching is a family of large-language-model API features that store parts of a request's input (system prompts, instructions, attached documents, or earlier conversation turns) on the provider's infrastructure so that subsequent requests reusing the same prefix are billed at a reduced rate and respond with lower latency. Google introduced context caching for the Gemini API on 27 June 2024, originally for Gemini 1.5 Pro and 1.5 Flash, and later extended it to the Gemini 2.x and 3.x families with both explicit and implicit modes.[1][2] Anthropic offers a closely related feature called prompt caching using a cache_control parameter with an "ephemeral" type, and OpenAI added automatic prompt caching for GPT-4o and the o-series in October 2024.[3][4] All three providers cache an exact byte-for-byte prefix of the input and bill cached tokens at roughly 10 to 50 percent of the standard input rate, but they differ in TTL behaviour, minimum token thresholds, and whether caching requires explicit opt-in. Context caching is distinct from the model-internal KV cache (the per-layer key and value tensors held in GPU memory during a single forward pass), although both ideas share the goal of avoiding recomputation over a fixed prefix; see kv cache for that lower-level mechanism.
The transformer architecture used by modern large language models computes a self-attention pattern over the entire input on every forward pass, which makes inference cost grow roughly linearly with prompt length for the prefill phase and quadratically with sequence length when memory bandwidth is included.[5] As production workloads shifted toward long shared system prompts, document-grounded question answering, multi-turn agentic tools, and retrieval-augmented generation, a large fraction of input tokens started to repeat verbatim across requests. A customer-support bot, for example, might send the same 8,000-token policy document with every user query; a coding assistant might attach the same project files to every turn of a long session.[6]
The internal optimisation that allows a single decode step to reuse already-computed key and value tensors is the kv cache, which is held in GPU memory and discarded when the request completes.[5] Context caching is the higher-level, customer-visible counterpart: the provider persists the computed prefix state across requests, charges a one-time write fee, and then serves the cached prefix at a deep discount on cache hits.[1][4] Google's product documentation explicitly frames the feature as a way to "reduce costs for tasks that use the same tokens across multiple prompts."[2]
The first commercial implementation arrived with Google's June 2024 launch on the Gemini API.[2] Anthropic followed with prompt caching in public beta on 14 August 2024, and OpenAI added automatic prompt caching to its API on 1 October 2024.[3][4] By 2025 all three major frontier-model vendors offered some form of context caching, with implementations converging on the same underlying idea but diverging in interface and pricing.
Every commercial context-caching implementation uses an exact-prefix cache key: the provider hashes the leading bytes of the serialised request (including system prompts, tool definitions, attached files, and prior messages) and looks for a previously computed activation state keyed on that hash.[4][7] Any byte change to the prefix, including reordering tools or rewording the system message, invalidates the cache. OpenAI's documentation states that "the system checks if the initial portion (prefix) of your prompt is stored in the cache" and that the cache "automatically routes" requests with matching prefixes to a server that holds the precomputed state.[7]
Because matching is purely lexical, developers must design prompts so that variable content (user questions, retrieved chunks that change per request) appears at the end, and stable content (instructions, examples, large attached documents) appears at the beginning. Google explicitly recommends this layout: "you should keep the content at the beginning of the request the same and add things like a user's question or extra context that may change with each request at the end."[8]
On a cache miss, the model performs a normal forward pass and the provider stores the resulting prefix activations for later reuse. The customer pays a write fee, which is either equal to the standard input rate (Google explicit, OpenAI) or a multiple of it (Anthropic, where 5-minute writes cost 1.25 times base input and 1-hour writes cost 2 times base input).[1][4][9] On a subsequent hit, the prefix tokens are billed at a fraction of the normal rate (10 percent at Google and Anthropic, 50 percent at OpenAI) and the model only needs to extend the cached state with the new tail of the request.[1][9][4]
Anthropic's documentation describes a "refresh" behaviour: "The cache is refreshed for no additional cost each time the cached content is used."[9] That is, each hit resets the TTL countdown, so a continuously active cache can survive far longer than its nominal lifetime. OpenAI's caches behave similarly in practice, although the provider does not publish an explicit refresh contract.[7]
Time to live (TTL) is the maximum interval the provider promises to keep a cache entry before evicting it. The three major implementations diverge here:
| Provider | Default TTL | Configurable | Maximum |
|---|---|---|---|
| Google Gemini (explicit) | 1 hour | Yes, set by the developer | Up to model context limit, billed by storage time[10] |
| Google Gemini (implicit, 2.5+) | Up to 24 hours, automatic | No | 24 hours[1] |
| OpenAI | 5 to 10 minutes of inactivity | No (until GPT-5.1) | 1 hour in-memory; 24 hours with prompt_cache_retention="24h" on GPT-5.1, GPT-5.1-codex, GPT-5, and GPT-4.1[7][11] |
| Anthropic ephemeral 5m | 5 minutes, refreshed on each hit | Yes, by selecting TTL | 5 minutes between hits[9] |
| Anthropic ephemeral 1h | 1 hour | Selected via "ttl": "1h" | 1 hour between hits[9][12] |
Google's explicit caches charge an hourly storage fee per million tokens stored: $1.00 per million tokens per hour for Gemini 2.5 Flash and 3.5 Flash, and $4.50 per million tokens per hour for Gemini 3.1 Pro Preview, prorated to the minute.[10] OpenAI and Anthropic do not charge a separate storage fee; their cost recovery is bundled into the write multiplier.[7][9]
All three implementations refuse to cache inputs below a minimum size, on the theory that the bookkeeping overhead of cache lookup exceeds the savings for short prompts. As of May 2026 the thresholds are:
Requests below the threshold are processed without caching and the provider returns zero cached tokens in the usage block, with no error raised.[9][7]
Implementations differ in how they expose the cache to developers:
caches.create() with a list of contents and a TTL, receives a name identifier, and then passes that name as cached_content on subsequent generate_content calls.[15] The cache object can be updated (TTL extension) or deleted explicitly.usage.prompt_tokens_details.cached_tokens field reports how many tokens hit the cache.[4][7]"cache_control": {"type": "ephemeral"} to any content block (system, tool, or message); the cache breakpoint covers all content up to and including that block. Up to four cache breakpoints may be set per request, allowing nested reuse (for example, a long system prompt cached with a one-hour TTL plus a growing conversation cached with a five-minute TTL).[9]Context caching is sometimes confused with the kv cache, which is a lower-level optimisation internal to a single inference request. The KV cache stores key and value tensors produced by each attention head at each transformer layer for each previously processed token, so that the autoregressive decode phase does not recompute attention over already-seen tokens.[5] It is allocated in GPU memory at the start of a request, grows linearly with sequence length, and is freed when the request returns.[5]
Context caching, by contrast, is a customer-facing, persistent, cross-request abstraction. It is implemented on top of (and shares engineering with) the KV cache: when a provider says it serves a cache hit, it is typically materialising or otherwise rapidly reconstructing the same KV tensors that the model would have computed during prefill, but doing so from a saved snapshot rather than recomputing them from input tokens.[16] The two notions answer different questions:
| Property | KV cache | Context caching |
|---|---|---|
| Visible to the developer | No (internal) | Yes (billing, usage fields) |
| Persists across requests | No | Yes |
| Cache key | Position in the current request | Hash of the input prefix |
| Lifetime | Single inference | Minutes to 24 hours |
| Cost reduction | Latency only | Latency and dollar cost |
| Storage location | GPU VRAM | Provider infrastructure[16] |
Recent research systems such as RAGCache and SGLang explore exposing portions of the KV cache directly to applications by serialising activation states for shared document chunks, blurring the boundary between the two concepts; in those systems, context caching is essentially a serialised KV cache shared across requests.[16]
Google's June 2024 launch post (authored by Logan Kilpatrick, Shrestha Basu Mallick, and Ronen Kofman) announced context caching alongside the opening of the 2-million-token context window for gemini 2 5 pro's predecessor gemini 1.5 Pro and the addition of code execution.[2] Logan Kilpatrick posted on X the same day that the price was "2x cheaper than we previously announced."[17] In May 2024 Google had previewed context caching for Vertex AI at Google Cloud Next; the Gemini API rollout followed in June.[14]
Implicit caching was added to the gemini 2 5 pro and 2.5 Flash family on 8 May 2025, with a 75 percent discount on cached tokens and no developer action required.[1] As of May 2026 implicit caching is on by default for gemini 3 pro and the 3.5 Flash line.[1] Vertex AI's documentation notes that "customers pay only 10 percent of standard input token cost for cached tokens" and that explicit caching pays 90 percent less on cached read tokens.[14] Storage is billed separately on Vertex AI but not on the consumer-facing ai.google.dev pricing page for older models.[10]
OpenAI announced Prompt Caching on 1 October 2024.[4] The launch post stated that prompts longer than 1,024 tokens would automatically receive a 50 percent discount on cached input tokens with "no action required" from the developer.[4] Initial supported models were gpt 4o, GPT-4o mini, o1-preview, and o1-mini.[4] OpenAI's pricing page later expanded the discount: the prompt-caching guide states that the feature "can reduce latency by up to 80 percent and input token costs by up to 90 percent" for newer GPT-5 family models.[7]
The 24-hour extended retention option was added with the gpt-5 family launch. OpenAI's documentation says: "extended prompt cache retention keeps cached prefixes active for longer, up to a maximum of 24 hours" on GPT-5.1, GPT-5.1-codex, GPT-5, and GPT-4.1, enabled by passing prompt_cache_retention="24h" on the Responses or Chat Completions API.[11] For GPT-5.5 and later, the default retention is 24 hours and the older in-memory option is no longer supported.[11]
Anthropic launched prompt caching in public beta on 14 August 2024 for Claude 3.5 Sonnet, Claude 3 Opus, and Claude 3 Haiku.[3] The launch post quoted "up to 90 percent" cost reduction and "up to 85 percent" latency reduction for long prompts, with 5-minute cache writes costing 1.25 times the base input rate and cache hits costing 0.1 times.[3] General availability followed on 17 December 2024.[9]
The 1-hour extended TTL was launched in beta in May 2025 and reached general availability in August 2025, available across the anthropic api, Amazon Bedrock, Google Cloud Vertex AI, and Microsoft Foundry.[12] On the company's official Anthropic account on X, the launch announcement stated: "In addition to the standard 5-minute prompt caching TTL, we now offer an extended 1-hour TTL. This reduces costs by up to 90 percent and reduces latency by up to 85 percent for long prompts, making extended agent workflows more practical."[12] Pricing for 1-hour writes is 2 times the base input rate; reads remain at 0.1 times.[12][9] claude models from the claude 3 5 sonnet generation onward all support prompt caching, with the minimum token threshold scaling by model tier.[9]
Major model-hosting platforms expose context caching through provider-compatible APIs. Amazon Bedrock supports prompt caching for claude and Amazon Nova models with an analogous cachePoint content block, charging the same proportional rates as the native APIs.[18] Microsoft Foundry (formerly Azure OpenAI) added prompt caching for GPT-4o and o-series models in late 2024 with the same 50 percent discount and 1,024-token minimum.[19] Google Vertex AI exposes the same context-caching API as the public Gemini API.[14] LangChain and the OpenAI Agents SDK both ship middleware that automates cache-breakpoint placement for Anthropic models.[20]
Pricing varies by provider and model. The table below lists rates as published on each vendor's pricing page in May 2026 for one representative model from each family.
| Model | Standard input ($/1M tokens) | Cached input ($/1M tokens) | Cache write surcharge | Storage |
|---|---|---|---|---|
| Gemini 3.5 Flash | $1.50 base | $0.15 | None (one-time write at base) | $1.00 / 1M tokens / hour[10] |
| Gemini 3.1 Pro Preview (<=200k) | $2.00 base | $0.20 | None | $4.50 / 1M tokens / hour[10] |
| Gemini 2.5 Flash | $0.30 base | $0.03 | None | $1.00 / 1M tokens / hour[10] |
| GPT-4o (current) | varies | 50 percent of input | None | None[7] |
| GPT-5.1 (24h retention) | varies | up to 90 percent off | None | None[7][11] |
| Claude Opus 4.7, 5m write | $15.00 (illustrative) | $1.50 (0.1x) | $18.75 (1.25x) | None[9] |
| Claude Opus 4.7, 1h write | $15.00 | $1.50 (0.1x) | $30.00 (2x) | None[9] |
Two pricing models are visible in the table. Google charges close to the standard input rate when a cache is written but recoups operating cost through hourly storage fees. Anthropic charges no storage but folds the cost into a higher write multiplier; the difference between the 5-minute and 1-hour writes is the key tradeoff for developers. OpenAI charges nothing for the write and discounts the read, relying on opportunistic prefix matching against recent traffic.[7][9][10]
For a workload that submits a 100,000-token shared prompt and then issues one short user query per second over 30 minutes, the dollar arithmetic favours all three caches by roughly an order of magnitude relative to uncached pricing. A worked example using Anthropic numbers from the Claude pricing page (illustrative figures for Opus 4.7) shows a 1-hour cache write costing 2 times base input on the first call, then 0.1 times base input on each of the following 1,799 calls, for a total cost equivalent to about 181.9 uncached calls, versus 1,800 uncached calls if no caching were used.[9]
In a chat session that grows turn by turn, the natural cache strategy is to place a breakpoint after the most recent assistant message. On Anthropic, this is the "automatic caching" mode: the SDK or middleware moves the breakpoint forward each turn so that the entire history up to (but not including) the new user message hits the cache.[9] On OpenAI, the same effect is automatic: each new turn shares its prefix with the prior turn and triggers the discount without code changes.[7] On Google explicit caching, developers must call caches.update() or recreate the cache to extend its content; implicit caching on Gemini 2.5+ handles this transparently.[15][1]
For retrieval augmented generation systems that attach the same corpus or a small set of documents to many queries, the largest savings come from caching the document payload. The recommended pattern is: place the documents at the top of the prompt, immediately after the system message; place the retrieved chunks (which vary per query) below them; place the user question at the very end. The cache breakpoint sits at the end of the static portion. PromptHub reported case studies of customer-support deployments dropping per-query token cost by 80 to 90 percent under this pattern.[21] A 2026 study evaluating prompt caching across long-horizon agentic tasks ran 500 agent sessions with 10,000-token system prompts and measured both API cost and time to first token, finding that consistent cache hits reduced both metrics by roughly an order of magnitude relative to uncached baselines.[22]
Agent loops issue many model calls within a single task, often with a stable system prompt, tool definitions, and previously seen environment observations. Anthropic's 1-hour TTL was launched with this case in mind; the announcement specifically cited "extended agent workflows."[12] OpenAI's 24-hour retention on GPT-5.1 plays the same role: a long-running coding agent that calls the model dozens of times per hour can keep a multi-turn project context warm without paying full input cost on every call.[11]
Reported cost reductions cluster around 50 to 90 percent on cached input tokens, depending on provider and TTL choice. OpenAI's documentation reports up to 80 percent latency reduction on cached prefixes,[7] Anthropic reports up to 85 percent latency reduction for the 1-hour cache,[12] and Google reports a 75 percent discount on cached tokens for the 2.5 implicit cache.[1]
Independent measurements are consistent with vendor claims for typical workloads. The PromptHub comparison study found that Anthropic's cache reads ran at 10 percent of normal token cost, OpenAI's at 50 percent, and Google's at 25 percent at the time of writing in late 2024, in line with the public pricing pages.[21] An evaluation by Don't Break the Cache (Berriaud et al., 2026) measured 500 agent sessions across multiple providers and found that median time to first token (TTFT) dropped from roughly 4.5 seconds to under 0.5 seconds when prompts repeatedly hit the cache.[22]
For RAG workloads specifically, the academic RAGCache system (Liu et al., 2024) demonstrated that caching the key and value tensors corresponding to a small set of high-frequency document chunks gave up to 4 times reduction in TTFT and 2.1 times higher throughput compared to a vanilla vLLM and FAISS baseline.[23] Although RAGCache exposes a different interface from commercial context caching, the underlying activation reuse is similar; commercial implementations of context caching adopt many of the same data-management ideas.
Because the cache key is the byte-exact prefix, even small differences (an updated timestamp in the system prompt, reordered tool definitions, whitespace changes) invalidate the cache.[7][9] Production systems must therefore freeze the leading portion of the prompt; introducing per-request variability above the cache breakpoint defeats caching.
Caches are isolated per organisation or per project. Anthropic states explicitly that "caches are isolated between organizations to ensure no sharing, even with identical prompts."[9] OpenAI uses prefix routing at the organisation level, which avoids cross-tenant data exposure but also means a customer cannot share a warm cache with another customer's workload.[7]
All three providers describe caching as best-effort. OpenAI's documentation warns that "in practice, caches are typically cleared after 5 to 10 minutes of inactivity and are always removed within one hour."[7] Google's implicit cache "caches are always deleted within 24 hours" and there is no guarantee of a hit even within that window.[14] Anthropic's caches are also best-effort; the docs warn that "Shorter prompts cannot be cached, even if marked with cache_control. Any requests to cache fewer than this number of tokens will be processed without caching, and no error is returned."[9]
Google's explicit caches charge an hourly storage fee that, for large caches kept for long periods, can exceed the savings from cache hits. The break-even analysis for Vertex AI shows that a 1-million-token cache held for an hour on Gemini 3.1 Pro Preview costs $4.50 in storage alone; this is recouped only after several thousand cached reads at $0.20 per million tokens.[10] Storage cost is the main reason Google's explicit caching has not displaced implicit caching as the default workflow for the 2.5 and 3.x lines.[1]
The OpenAI Cookbook notes that "overflow traffic" can reduce cache effectiveness: when request rates exceed approximately 15 per minute for an identical prefix-key combination, requests can be routed to servers without the warm cache, causing miss rates to rise.[7][4] High-traffic enterprise applications often pin requests to a single deployment or use sticky routing headers to mitigate this.
The proliferation of write multipliers, read discounts, storage fees, TTL choices, and minimum thresholds makes accurate cost modelling difficult. A 2025 ProjectDiscovery case study reported a 59 percent cost reduction after introducing Anthropic prompt caching but only after several iterations of breakpoint placement and TTL tuning.[24] A spring 2025 XDA Developers article reported user complaints about claude Code's effective cache lifetime shortening below expectations under heavy traffic.[25]
The terms "context caching" and "prompt caching" are largely interchangeable in industry usage. Google uses "context caching," Anthropic uses "prompt caching," and OpenAI uses "Prompt Caching" (capitalised in its launch post).[2][3][4] All three describe the same underlying mechanism: caching a serialised input prefix on the provider side so that subsequent requests reusing that prefix pay less. The aiwiki.ai article on prompt caching focuses on Anthropic's variant and its cache_control interface; this article uses the generic phrase "context caching" to cover the whole family.
Where the implementations differ:
cache_control marker.[4][1][9]prompt_cache_retention.[15][9][11]Two trends are visible in 2025 and 2026 announcements. First, all three vendors are moving toward longer TTLs by default: Google's implicit cache extends to 24 hours automatically, Anthropic added an explicit 1-hour tier, and OpenAI's GPT-5.5 line makes 24-hour retention the default.[1][12][11] Longer TTLs match the needs of agentic and long-context workloads where a single coding or research task can span hours.
Second, providers are exposing finer-grained cache control. Anthropic's four cache breakpoints permit composition of static and dynamic segments; future versions may allow arbitrary breakpoint trees.[9] Vertex AI's explicit caches can be updated with caches.update() to extend TTL without rewriting content.[15] On the research side, projects such as RAGCache and SGLang's RadixAttention explore caching arbitrary subtrees of activation state, not just prefixes, which could allow document-level caching of any quoted passage regardless of where it appears in a prompt.[23][16]