Context caching

25 min read

Updated Jul 23, 2026

Context caching is a large-language-model API feature that stores parts of a request's input (system prompts, instructions, attached documents, or earlier conversation turns) on the provider's infrastructure so that later requests reusing the same prefix are billed at a reduced rate, often as little as 10 percent of the standard input token price, and respond with lower latency. Google introduced context caching for the Gemini API on 27 June 2024, first for Gemini 1.5 Pro and 1.5 Flash, and later extended it to the Gemini 2.x and 3.x families with both explicit and implicit modes.^[1]^[2] Anthropic offers a closely related feature it calls prompt caching, set with a cache_control parameter of type "ephemeral", and OpenAI added automatic prompt caching for GPT-4o and the o-series on 1 October 2024.^[3]^[4] All three providers cache an exact byte-for-byte prefix of the input and bill cache hits at roughly 10 to 50 percent of the standard input rate, but they differ in TTL behaviour, minimum token thresholds, and whether caching requires explicit opt-in. Context caching is distinct from the model-internal KV cache (the per-layer key and value tensors held in GPU memory during a single forward pass), although both share the goal of avoiding recomputation over a fixed prefix; see kv cache for that lower-level mechanism.

Why do LLM providers offer context caching?

The transformer architecture used by modern large language models computes a self-attention pattern over the entire input on every forward pass, which makes inference cost grow roughly linearly with prompt length for the prefill phase and quadratically with sequence length when memory bandwidth is included.^[5] As production workloads shifted toward long shared system prompts, document-grounded question answering, multi-turn agentic tools, and retrieval-augmented generation, a large fraction of input tokens started to repeat verbatim across requests. A customer-support bot, for example, might send the same 8,000-token policy document with every user query; a coding assistant might attach the same project files to every turn of a long session.^[6]

The internal optimisation that allows a single decode step to reuse already-computed key and value tensors is the kv cache, which is held in GPU memory and discarded when the request completes.^[5] Context caching is the higher-level, customer-visible counterpart: the provider persists the computed prefix state across requests, charges a one-time write fee, and then serves the cached prefix at a deep discount on cache hits.^[1]^[4] Google's product documentation frames the feature as a way to "reduce costs for tasks that use the same tokens across multiple prompts."^[2]

The first commercial implementation arrived with Google's June 2024 launch on the Gemini API.^[2] Anthropic followed with prompt caching in public beta on 14 August 2024, and OpenAI added automatic prompt caching to its API on 1 October 2024.^[3]^[4] By 2025 all three major frontier-model vendors offered some form of context caching, with implementations converging on the same underlying idea but diverging in interface and pricing.

How does context caching work?

Cache key and prefix matching

Every commercial context-caching implementation uses an exact-prefix cache key: the provider hashes the leading bytes of the serialised request (including system prompts, tool definitions, attached files, and prior messages) and looks for a previously computed activation state keyed on that hash.^[4]^[7] Any byte change to the prefix, including reordering tools or rewording the system message, invalidates the cache. OpenAI's documentation states that "the system checks if the initial portion (prefix) of your prompt exists in the cache on the selected machine" and that a matching request reuses the precomputed state.^[7]

Because matching is purely lexical, developers must design prompts so that variable content (user questions, retrieved chunks that change per request) appears at the end, and stable content (instructions, examples, large attached documents) appears at the beginning. Google explicitly recommends this layout, advising developers to "try putting large and common contents at the beginning of your prompt" and to place content that changes with each request at the end.^[8]

Cache writes, hits, and refreshes

On a cache miss, the model performs a normal forward pass and the provider stores the resulting prefix activations for later reuse. The customer pays a write fee, which is either equal to the standard input rate (Google explicit, OpenAI) or a multiple of it (Anthropic, where 5-minute writes cost 1.25 times base input and 1-hour writes cost 2 times base input).^[1]^[4]^[9] On a subsequent hit, the prefix tokens are billed at a fraction of the normal rate (10 percent at Google and Anthropic, 50 percent at OpenAI on legacy models) and the model only needs to extend the cached state with the new tail of the request.^[1]^[9]^[4]

Anthropic's documentation describes a "refresh" behaviour: "The cache is refreshed for no additional cost each time the cached content is used."^[9] That is, each hit resets the TTL countdown, so a continuously active cache can survive far longer than its nominal lifetime. OpenAI's caches behave similarly in practice, although the provider does not publish an explicit refresh contract.^[7]

Time to live

Time to live (TTL) is the maximum interval the provider promises to keep a cache entry before evicting it. The three major implementations diverge here:

Provider	Default TTL	Configurable	Maximum
Google Gemini (explicit)	1 hour	Yes, set by the developer	Up to model context limit, billed by storage time^[10]
Google Gemini (implicit, 2.5+)	Up to 24 hours, automatic	No	24 hours^[1]
OpenAI	5 to 10 minutes of inactivity	No (until GPT-5.1)	1 hour in-memory; 24 hours with `prompt_cache_retention="24h"` on GPT-5.1, GPT-5.1-codex, GPT-5, and GPT-4.1^[7]^[11]
Anthropic ephemeral 5m	5 minutes, refreshed on each hit	Yes, by selecting TTL	5 minutes between hits^[9]
Anthropic ephemeral 1h	1 hour	Selected via `"ttl": "1h"`	1 hour between hits^[9]^[12]

Google's explicit caches charge an hourly storage fee per million tokens stored: $1.00 per million tokens per hour for Gemini 2.5 Flash and 3.5 Flash, and $4.50 per million tokens per hour for Gemini 3.1 Pro Preview, prorated to the minute.^[10] OpenAI and Anthropic do not charge a separate storage fee; their cost recovery is bundled into the write multiplier.^[7]^[9]

Minimum token threshold

All three implementations refuse to cache inputs below a minimum size, on the theory that the bookkeeping overhead of cache lookup exceeds the savings for short prompts. As of July 2026 the thresholds are:

Google Gemini: minimums vary by model and API surface. Implicit caching on Gemini 2.5 launched with a 1,024-token minimum for 2.5 Flash and 2,048 tokens for 2.5 Pro.^[1] The current Gemini API caching documentation lists a 2,048-token minimum for Gemini 2.5 Flash and 2.5 Pro and a 4,096-token minimum for Gemini 3.1 Pro Preview and 3.5 Flash.^[8]^[13] Vertex AI sets a minimum cache size of 2,048 tokens.^[14]
OpenAI: 1,024 tokens, with the cached prefix growing in 128-token increments beyond that floor.^[4]^[7]
Anthropic: 512 tokens for Claude Fable 5 and Mythos 5; 1,024 tokens for Opus 4.8, Sonnet 5, Sonnet 4.6, Sonnet 4.5, Opus 4.1, and Opus 4; 2,048 tokens for Opus 4.7 and Haiku 3.5; 4,096 tokens for Opus 4.6, Opus 4.5, and Haiku 4.5. On Amazon Bedrock the Fable 5 and Mythos 5 minimum is 1,024 tokens instead of 512.^[9]

Requests below the threshold are processed without caching and the provider returns zero cached tokens in the usage block, with no error raised. Anthropic states plainly: "Shorter prompts cannot be cached, even if marked with cache_control. Any requests to cache fewer than this number of tokens will be processed without caching, and no error is returned."^[9]^[7]

Cache writes versus reads in the API

Implementations differ in how they expose the cache to developers:

Google explicit caching treats the cache as a first-class object. The developer calls caches.create() with a list of contents and a TTL, receives a name identifier, and then passes that name as cached_content on subsequent generate_content calls.^[15] The cache object can be updated (TTL extension) or deleted explicitly.
Google implicit caching is automatic on Gemini 2.5 and newer: any request that shares a long prefix with a recent prior request from the same project receives the cache discount without further action.^[1] Implicit caching was launched on 8 May 2025.^[1]
OpenAI prompt caching is fully automatic. The developer makes no API change; eligible prompts receive the discount based on prefix matching against recent traffic, and the response's usage.prompt_tokens_details.cached_tokens field reports how many tokens hit the cache.^[4]^[7]
Anthropic prompt caching is explicit but lightweight. The developer adds "cache_control": {"type": "ephemeral"} to any content block (system, tool, or message); the cache breakpoint covers all content up to and including that block. Up to four cache breakpoints may be set per request, allowing nested reuse (for example, a long system prompt cached with a one-hour TTL plus a growing conversation cached with a five-minute TTL).^[9]

How is context caching different from the KV cache?

Context caching is sometimes confused with the kv cache, which is a lower-level optimisation internal to a single inference request. The KV cache stores key and value tensors produced by each attention head at each transformer layer for each previously processed token, so that the autoregressive decode phase does not recompute attention over already-seen tokens.^[5] It is allocated in GPU memory at the start of a request, grows linearly with sequence length, and is freed when the request returns.^[5]

Context caching, by contrast, is a customer-facing, persistent, cross-request abstraction. It is implemented on top of (and shares engineering with) the KV cache: when a provider says it serves a cache hit, it is typically materialising or otherwise rapidly reconstructing the same KV tensors that the model would have computed during prefill, but doing so from a saved snapshot rather than recomputing them from input tokens.^[16] OpenAI's own documentation confirms this equivalence for its extended cache, noting that "only the key/value tensors may be persisted in local storage; the original customer content, such as prompt text, is only retained in memory."^[7] The two notions answer different questions:

Property	KV cache	Context caching
Visible to the developer	No (internal)	Yes (billing, usage fields)
Persists across requests	No	Yes
Cache key	Position in the current request	Hash of the input prefix
Lifetime	Single inference	Minutes to 24 hours
Cost reduction	Latency only	Latency and dollar cost
Storage location	GPU VRAM	Provider infrastructure^[16]

Recent research systems such as RAGCache and SGLang explore exposing portions of the KV cache directly to applications by serialising activation states for shared document chunks, blurring the boundary between the two concepts; in those systems, context caching is essentially a serialised KV cache shared across requests.^[16]

Which providers support context caching?

Google Gemini

Google's June 2024 launch post (authored by Logan Kilpatrick, Shrestha Basu Mallick, and Ronen Kofman) announced context caching alongside the opening of the 2-million-token context window for gemini 2 5 pro's predecessor gemini 1.5 Pro and the addition of code execution.^[2] Logan Kilpatrick announced the feature on X in June 2024, posting "Context caching for the Gemini API is here."^[17] In May 2024 Google had previewed context caching for Vertex AI at Google Cloud Next; the Gemini API rollout followed in June.^[14]

Implicit caching was added to the gemini 2 5 pro and 2.5 Flash family on 8 May 2025, with a 75 percent discount on cached tokens and no developer action required.^[1] In October 2025 Google increased that discount to 90 percent; Logan Kilpatrick wrote, "we increased the implicit caching discount for Gemini 2.5 models to 90% (up from 75%)."^[26] As of July 2026 implicit caching is on by default for gemini 3 pro and the 3.5 Flash line.^[1] Vertex AI's documentation notes that customers "pay only 10 percent of standard input token cost for cached tokens" on cached read tokens.^[14] Storage is billed separately on Vertex AI but not on the consumer-facing ai.google.dev pricing page for older models.^[10]

OpenAI

OpenAI announced Prompt Caching on 1 October 2024.^[4] The launch post stated that prompts longer than 1,024 tokens would automatically receive a 50 percent discount on cached input tokens with "no action required" from the developer, cutting latency by up to 80 percent for long prompts.^[4] Initial supported models were gpt 4o, GPT-4o mini, o1-preview, and o1-mini.^[4] OpenAI's prompt-caching guide reports that the feature "can reduce latency by up to 80%" and cut input token cost by up to 90 percent on newer GPT-5 family models.^[7]

The 24-hour extended retention option was added with the gpt-5 family launch. OpenAI's documentation says "extended prompt cache retention keeps cached prefixes active for longer, up to a maximum of 24 hours" on GPT-5.1, GPT-5.1-codex, GPT-5, and GPT-4.1, enabled by passing prompt_cache_retention="24h" on the Responses or Chat Completions API.^[11] The extended cache works by offloading key and value tensors to GPU-local storage when in-memory capacity fills; OpenAI notes that "only the key/value tensors may be persisted in local storage; the original customer content, such as prompt text, is only retained in memory."^[7] On 29 May 2026, OpenAI made 24-hour retention the default for organizations without Zero Data Retention enabled across the GPT-5 series, including GPT-5.5, GPT-5.4, and GPT-5.2, replacing the older in-memory default.^[27]

Anthropic

Anthropic launched prompt caching in public beta on 14 August 2024 for Claude 3.5 Sonnet, Claude 3 Opus, and Claude 3 Haiku.^[3] The launch post quoted "up to 90%" cost reduction and "up to 85%" latency reduction for long prompts, with 5-minute cache writes costing 1.25 times the base input rate and cache hits costing 0.1 times.^[3] General availability followed on 17 December 2024.^[9]

The 1-hour extended TTL was launched in beta in May 2025 and reached general availability in August 2025, available across the anthropic api, Amazon Bedrock, Google Cloud Vertex AI, and Microsoft Foundry.^[12] On the company's official account on X, the launch announcement stated: "In addition to the standard 5-minute prompt caching TTL, we now offer an extended 1-hour TTL. This reduces costs by up to 90% and reduces latency by up to 85% for long prompts, making extended agent workflows more practical."^[12] Pricing for 1-hour writes is 2 times the base input rate; reads remain at 0.1 times.^[12]^[9] claude models from the claude 3 5 sonnet generation onward all support prompt caching, with the minimum token threshold scaling by model tier.^[9]

Third-party platforms

Major model-hosting platforms expose context caching through provider-compatible APIs. Amazon Bedrock supports prompt caching for claude and Amazon Nova models with an analogous cachePoint content block, charging the same proportional rates as the native APIs; cache reads receive a 90 percent discount, and cache writes are free for the Nova models.^[18] Amazon Bedrock added support for the 1-hour cache duration in January 2026, matching Anthropic's extended TTL.^[28] Microsoft Foundry (formerly Azure OpenAI) added prompt caching for GPT-4o and o-series models in late 2024 with the same 50 percent discount and 1,024-token minimum.^[19] Google vertex ai exposes the same context-caching API as the public Gemini API.^[14] LangChain and the OpenAI Agents SDK both ship middleware that automates cache-breakpoint placement for Anthropic models.^[20]

How much does context caching cost?

Pricing varies by provider and model. The table below lists rates as published on each vendor's pricing page as of July 2026 for representative models from each family.

Model	Standard input ($/1M tokens)	Cached input ($/1M tokens)	Cache write surcharge	Storage
Gemini 3.5 Flash	$1.50 base	$0.15	None (one-time write at base)	$1.00 / 1M tokens / hour^[10]
Gemini 3.1 Pro Preview (<=200k)	$2.00 base	$0.20	None	$4.50 / 1M tokens / hour^[10]
Gemini 2.5 Flash	$0.30 base	$0.03	None	$1.00 / 1M tokens / hour^[10]
GPT-4o (current)	varies	50 percent of input	None	None^[7]
GPT-5.1 (24h retention)	varies	up to 90 percent off	None	None^[7]^[11]
Claude Opus 4.8, 5m write	$5.00	$0.50 (0.1x)	$6.25 (1.25x)	None^[9]
Claude Opus 4.8, 1h write	$5.00	$0.50 (0.1x)	$10.00 (2x)	None^[9]
Claude Opus 4.7, 5m write	$15.00 (illustrative)	$1.50 (0.1x)	$18.75 (1.25x)	None^[9]
Claude Opus 4.7, 1h write	$15.00	$1.50 (0.1x)	$30.00 (2x)	None^[9]

Two pricing models are visible in the table. Google charges close to the standard input rate when a cache is written but recoups operating cost through hourly storage fees. Anthropic charges no storage but folds the cost into a higher write multiplier; the difference between the 5-minute and 1-hour writes is the key tradeoff for developers. OpenAI charges nothing for the write and discounts the read, relying on opportunistic prefix matching against recent traffic.^[7]^[9]^[10] Anthropic publishes the multipliers directly: for Claude Opus 4.8 the base input is $5.00 per million tokens, a 5-minute write is $6.25 (1.25x), a 1-hour write is $10.00 (2x), and cache hits and refreshes are $0.50 (0.1x).^[9]

For a workload that submits a 100,000-token shared prompt and then issues one short user query per second over 30 minutes, the dollar arithmetic favours all three caches by roughly an order of magnitude relative to uncached pricing. A worked example using Anthropic numbers (illustrative figures for Opus 4.7) shows a 1-hour cache write costing 2 times base input on the first call, then 0.1 times base input on each of the following 1,799 calls, for a total cost equivalent to about 181.9 uncached calls, versus 1,800 uncached calls if no caching were used.^[9]

How do developers apply context caching?

Multi-turn chat

In a chat session that grows turn by turn, the natural cache strategy is to place a breakpoint after the most recent assistant message. On Anthropic, this is the "automatic caching" mode: the SDK or middleware moves the breakpoint forward each turn so that the entire history up to (but not including) the new user message hits the cache.^[9] On OpenAI, the same effect is automatic: each new turn shares its prefix with the prior turn and triggers the discount without code changes.^[7] On Google explicit caching, developers must call caches.update() or recreate the cache to extend its content; implicit caching on Gemini 2.5+ handles this transparently.^[15]^[1]

Document grounding and RAG

For retrieval augmented generation systems that attach the same corpus or a small set of documents to many queries, the largest savings come from caching the document payload. The recommended pattern is: place the documents at the top of the prompt, immediately after the system message; place the retrieved chunks (which vary per query) below them; place the user question at the very end. The cache breakpoint sits at the end of the static portion. PromptHub reported case studies of customer-support deployments dropping per-query token cost by 80 to 90 percent under this pattern.^[21] A 2026 study evaluating prompt caching across long-horizon agentic tasks (Lumer et al.) ran over 500 agent sessions with 10,000-token system prompts on the DeepResearch Bench benchmark and reported that consistent cache hits cut API cost by 41 to 80 percent and improved time to first token by 13 to 31 percent across OpenAI, Anthropic, and Google.^[22]

Agentic tool use

Agent loops issue many model calls within a single task, often with a stable system prompt, tool definitions, and previously seen environment observations. Anthropic's 1-hour TTL was launched with this case in mind; the announcement specifically cited "extended agent workflows."^[12] OpenAI's 24-hour retention on GPT-5.1 plays the same role: a long-running coding agent that calls the model dozens of times per hour can keep a multi-turn project context warm without paying full input cost on every call.^[11]

How much does context caching reduce cost and latency?

Reported cost reductions cluster around 50 to 90 percent on cached input tokens, depending on provider and TTL choice. OpenAI's documentation reports up to 80 percent latency reduction on cached prefixes,^[7] Anthropic reports up to 85 percent latency reduction for the 1-hour cache,^[12] and Google reported a 75 percent discount on cached tokens for the 2.5 implicit cache at launch, raised to 90 percent in October 2025.^[1]^[26]

Independent measurements are consistent with vendor claims for typical workloads. The PromptHub comparison study found that Anthropic's cache reads ran at 10 percent of normal token cost, OpenAI's at 50 percent, and Google's at 25 percent at the time of writing in late 2024, in line with the public pricing pages.^[21] An evaluation by Don't Break the Cache (Lumer et al., 2026) measured over 500 agent sessions across OpenAI, Anthropic, and Google on the DeepResearch Bench benchmark and found that prompt caching cut API cost by 41 to 80 percent and improved time to first token by 13 to 31 percent, while cautioning that naive full-context caching can paradoxically increase latency unless dynamic content is kept out of the cached prefix.^[22]

For RAG workloads specifically, the academic RAGCache system (Jin et al., 2024) demonstrated that caching the key and value tensors corresponding to a small set of high-frequency document chunks gave up to 4 times reduction in TTFT and 2.1 times higher throughput compared to a vanilla vLLM and FAISS baseline.^[23] Although RAGCache exposes a different interface from commercial context caching, the underlying activation reuse is similar; commercial implementations of context caching adopt many of the same data-management ideas.

What are the limitations of context caching?

Exact-prefix dependency

Because the cache key is the byte-exact prefix, even small differences (an updated timestamp in the system prompt, reordered tool definitions, whitespace changes) invalidate the cache.^[7]^[9] Production systems must therefore freeze the leading portion of the prompt; introducing per-request variability above the cache breakpoint defeats caching.

Multi-tenant isolation

Caches are isolated per organisation or per project. Anthropic states that "Different organizations never share caches, even if they use identical prompts," and since February 2026 its caches on the Claude API, Claude Platform on AWS, and Microsoft Foundry are isolated per workspace within an organization, while Bedrock and Google Cloud remain organization-level.^[9] OpenAI uses prefix routing at the organisation level, which avoids cross-tenant data exposure but also means a customer cannot share a warm cache with another customer's workload.^[7]

Eviction and best-effort guarantees

All three providers describe caching as best-effort. OpenAI's documentation says cached prefixes "generally remain active for 5 to 10 minutes of inactivity, up to a maximum of one hour."^[7] Google's implicit caches are always deleted within 24 hours, and there is no guarantee of a hit even within that window.^[14] Anthropic's caches are also best-effort; the docs warn that "Shorter prompts cannot be cached, even if marked with cache_control. Any requests to cache fewer than this number of tokens will be processed without caching, and no error is returned."^[9]

Storage overhead and economics

Google's explicit caches charge an hourly storage fee that, for large caches kept for long periods, can exceed the savings from cache hits. The break-even analysis for Vertex AI shows that a 1-million-token cache held for an hour on Gemini 3.1 Pro Preview costs $4.50 in storage alone; this is recouped only after several thousand cached reads at $0.20 per million tokens.^[10] Storage cost is the main reason Google's explicit caching has not displaced implicit caching as the default workflow for the 2.5 and 3.x lines.^[1]

Cache stampede and contention

The OpenAI Cookbook notes that "overflow traffic" can reduce cache effectiveness: when request rates exceed approximately 15 per minute for an identical prefix-key combination, requests can be routed to servers without the warm cache, causing miss rates to rise.^[7]^[4] High-traffic enterprise applications often pin requests to a single deployment or use sticky routing headers to mitigate this.

Pricing complexity

The proliferation of write multipliers, read discounts, storage fees, TTL choices, and minimum thresholds makes accurate cost modelling difficult. A 2025 ProjectDiscovery case study reported a 59 percent cost reduction after introducing Anthropic prompt caching but only after several iterations of breakpoint placement and TTL tuning.^[24] A spring 2025 XDA Developers article reported user complaints about claude Code's effective cache lifetime shortening below expectations under heavy traffic.^[25]

Is context caching the same as prompt caching?

The terms "context caching" and "prompt caching" are largely interchangeable in industry usage. Google uses "context caching," Anthropic uses "prompt caching," and OpenAI uses "Prompt Caching" (capitalised in its launch post).^[2]^[3]^[4] All three describe the same underlying mechanism: caching a serialised input prefix on the provider side so that subsequent requests reusing that prefix pay less. The aiwiki.ai article on prompt caching focuses on Anthropic's variant and its cache_control interface; this article uses the generic phrase "context caching" to cover the whole family.

Where the implementations differ:

Granularity: Anthropic allows up to four cache breakpoints per request, enabling nested caches with different TTLs (for example a long-lived system prompt plus a shorter-lived conversation segment).^[9] Google's explicit caches and OpenAI's automatic cache support only a single prefix per request.
Opt-in versus automatic: OpenAI is fully automatic and exposes no caching API beyond a usage field. Google's implicit caching (2.5+) is automatic, but explicit caching requires the developer to create a cache object. Anthropic's caching requires an explicit cache_control marker.^[4]^[1]^[9]
TTL flexibility: Google offers the most TTL flexibility on explicit caches (any positive duration, billed per minute). Anthropic offers two discrete tiers (5 minutes and 1 hour). OpenAI offers none on legacy models and a 24-hour option on GPT-5.1+ via prompt_cache_retention.^[15]^[9]^[11]
Storage pricing model: Google charges storage; Anthropic charges a write surcharge; OpenAI charges neither (it folds cost into the smaller 50 percent discount on legacy models).^[10]^[9]^[7]

Where is context caching heading?

Two trends are visible in 2025 and 2026 announcements. First, all three vendors are moving toward longer TTLs by default: Google's implicit cache extends to 24 hours automatically, Anthropic added an explicit 1-hour tier, and OpenAI's GPT-5 series makes 24-hour retention the default for most organizations.^[1]^[12]^[11]^[27] Longer TTLs match the needs of agentic and long-context workloads where a single coding or research task can span hours.

Second, providers are exposing finer-grained cache control. Anthropic's four cache breakpoints permit composition of static and dynamic segments; future versions may allow arbitrary breakpoint trees.^[9] Vertex AI's explicit caches can be updated with caches.update() to extend TTL without rewriting content.^[15] On the research side, projects such as RAGCache and SGLang's RadixAttention explore caching arbitrary subtrees of activation state, not just prefixes, which could allow document-level caching of any quoted passage regardless of where it appears in a prompt.^[23]^[16]

References

^Google AI for Developers, "Gemini 2.5 models now support implicit caching", Google Developers Blog, 2025-05-08. developers.googleblog.com/...port-implicit-caching Accessed 2026-07-12.
^Logan Kilpatrick, Shrestha Basu Mallick, Ronen Kofman, "Gemini 1.5 Pro 2M context window, code execution capabilities, and Gemma 2 are available today", Google Developers Blog, 2024-06-27. developers.googleblog.com/...-and-google-ai-studio Accessed 2026-07-12.
^Anthropic, "Prompt caching with Claude", Anthropic news, 2024-08-14. anthropic.com/...prompt-caching. Accessed 2026-07-12.
^OpenAI, "Prompt Caching in the API", OpenAI blog, 2024-10-01. openai.com/...api-prompt-caching Accessed 2026-07-12.
^Pierre Lienhart, "LLM Inference Series: 3. KV caching explained", Medium, 2023-12-22. medium.com/...-3-kv-caching-unveiled-048152e461c8. Accessed 2026-07-12.
^Cobus Greyling, "OpenAI Prompt Caching", Medium, 2024-10-02. cobusgreyling.medium.com/...-caching-10c79f7cd1f1. Accessed 2026-07-12.
^OpenAI, "Prompt caching", OpenAI API documentation, 2026. developers.openai.com/...prompt-caching. Accessed 2026-07-12.
^Google AI for Developers, "Gemini API context caching", Google AI for Developers documentation, 2026. ai.google.dev/...caching. Accessed 2026-07-12.
^Anthropic, "Prompt caching", Claude Platform documentation, 2026. platform.claude.com/...prompt-caching. Accessed 2026-07-12.
^Google AI for Developers, "Gemini API pricing", Google AI for Developers, 2026. ai.google.dev/pricing. Accessed 2026-07-12.
^OpenAI, "Introducing GPT-5.1 for developers", OpenAI blog, 2025-11. openai.com/...gpt-5-1-for-developers Accessed 2026-07-12.
^Anthropic, "1-hour prompt caching TTL", Anthropic on X, 2025-05-22. x.com/...1925633128174899453. Accessed 2026-07-12.
^Google Cloud, "Context caching overview", Gemini Enterprise Agent Platform documentation, 2026. docs.cloud.google.com/...context-cache-overview. Accessed 2026-07-12.
^Google Cloud, "Vertex AI context caching", Google Cloud Blog, 2024-06. cloud.google.com/...vertex-ai-context-caching. Accessed 2026-07-12.
^Google AI for Developers, "Caching reference", Gemini API reference, 2025. ai.google.dev/...caching. Accessed 2026-07-12.
^Ashith Raghunath, "Understanding Attention, Context Caching and how it fits into Transformers: A Deep Dive", Medium, 2025-12. medium.com/...ansformers-a-deep-dive-ac91040abd6b. Accessed 2026-07-12.
^Logan Kilpatrick, "Context caching for the Gemini API is here", X, 2024-06-18. x.com/...1803096828595863608. Accessed 2026-07-12.
^Amazon Web Services, "Prompt caching for faster model inference", Amazon Bedrock User Guide, 2026. docs.aws.amazon.com/...prompt-caching.html. Accessed 2026-07-12.
^Microsoft, "Prompt caching with Azure OpenAI in Microsoft Foundry Models", Microsoft Learn, 2025. learn.microsoft.com/...prompt-caching. Accessed 2026-07-12.
^LangChain, "AnthropicPromptCachingMiddleware", LangChain Python reference, 2025. reference.langchain.com/...romptCachingMiddleware. Accessed 2026-07-12.
^PromptHub, "Prompt Caching with OpenAI, Anthropic, and Google Models", PromptHub Blog, 2024-10. prompthub.us/...penai-anthropic-and-google-models. Accessed 2026-07-12.
^Elias Lumer, Faheem Nizar, Akshaya Jangiti, Kevin Frank, Anmol Gulati, Mandar Phadate, Vamse Kumar Subbiah, "Don't Break the Cache: An Evaluation of Prompt Caching for Long-Horizon Agentic Tasks", arXiv preprint 2601.06007, 2026-01. arxiv.org/...2601.06007. Accessed 2026-07-12.
^Chao Jin, Zili Zhang, Xuanlin Jiang, Fangyue Liu, Xin Liu, Xuanzhe Liu, Xin Jin, "RAGCache: Efficient Knowledge Caching for Retrieval-Augmented Generation", arXiv preprint 2404.12457, 2024-04-25. arxiv.org/...2404.12457. Accessed 2026-07-12.
^ProjectDiscovery, "How We Cut LLM Costs by 59 percent With Prompt Caching", ProjectDiscovery Blog, 2025-09. projectdiscovery.io/...m-cost-with-prompt-caching. Accessed 2026-07-12.
^Adam Conway, "Anthropic quietly nerfed Claude Code's 1-hour cache, and your token budget is paying the price", XDA Developers, 2025. xda-developers.com/...code-hour-cache-token-budget Accessed 2026-07-12.
^Logan Kilpatrick, "We increased the implicit caching discount for Gemini 2.5 models to 90% (up from 75%)", X, 2025-10. x.com/...1983564882482970925. Accessed 2026-07-12.
^TheRouter.ai, "OpenAI Prompt Cache Retention Defaults to 24h: GPT-5.5 and API Routing Impact", TheRouter.ai, 2026. therouter.ai/...-24h-default-gpt5-operator-routing Accessed 2026-07-12.
^Amazon Web Services, "Amazon Bedrock now supports 1-hour duration for prompt caching", AWS What's New, 2026-01. aws.amazon.com/...one-hour-duration-prompt-caching Accessed 2026-07-12.

Improve this article

Add missing citations, update stale details, or suggest a clearer explanation. Every suggestion is reviewed for sourcing before it goes live.

3 revisions by 1 contributors · v4 · 5,002 words · full history

Fact-checks are independent of edits: a reviewer re-verifies the article against its sources and stamps the date. How we verify

Suggest edit

What links here

Context window Gemini 3 Flash Gemini 3 Pro Kimi K3 Prefix caching (automatic prefix caching)Prompt Caching

Why do LLM providers offer context caching?

How does context caching work?

Cache key and prefix matching

Cache writes, hits, and refreshes

Time to live

Minimum token threshold

Cache writes versus reads in the API

How is context caching different from the KV cache?

Which providers support context caching?

Google Gemini

OpenAI

Anthropic

Third-party platforms

How much does context caching cost?

How do developers apply context caching?

Multi-turn chat

Document grounding and RAG

Agentic tool use

How much does context caching reduce cost and latency?

What are the limitations of context caching?

Exact-prefix dependency

Multi-tenant isolation

Eviction and best-effort guarantees

Storage overhead and economics

Cache stampede and contention

Pricing complexity

Is context caching the same as prompt caching?

Where is context caching heading?

See also

References

Improve this article

Related Articles

Fireworks AI

NVIDIA Triton Inference Server

TensorFlow Serving

NVIDIA NIM

NVIDIA Dynamo

ExLlamaV2 (EXL2)

What links here

Related Articles

Fireworks AI

NVIDIA Triton Inference Server

TensorFlow Serving

NVIDIA NIM

NVIDIA Dynamo

ExLlamaV2 (EXL2)

What links here