Prompt Caching: Don't Pay Twice for the Same Math

The Bill That Broke The Camel

The customer support bot had a fifty thousand token system prompt. Policies, examples, escalation rules, brand voice. A small book of instructions. The thing worked beautifully. The bot was polite, correct, on-brand, never went off the rails.

Then the bill arrived.

Every conversation reloaded the whole policy document. Every. Single. Time. The team was paying for the model to read the same fifty thousand tokens fresh on each user query. A thousand conversations a day, fifty million tokens of system prompt, none of it new. The number at the bottom of the invoice did not match the engineering effort to get there.

This was not a hypothetical situation. It used to be the default story for anybody building serious products with large language models. There was no way around it. You sent a prompt, the model processed the prompt, you paid for the processing. Repeat tomorrow.

Then a feature shipped called prompt caching, and the math changed.

What The Model Is Actually Doing

To understand why prompt caching matters, you have to know what an inference call actually costs in terms of compute. The model has two phases when it processes a request. The first phase is called prefill. The model reads the entire prompt and computes intermediate values for every token in it. These intermediate values are called keys and values, often shortened to K and V. They are how the attention mechanism inside the transformer decides which earlier tokens matter for the next prediction.

The thing about attention is that it is quadratic in sequence length. If your prompt is twice as long, prefill is roughly four times as expensive. A fifty thousand token prompt has fifty thousand times fifty thousand interactions to compute, which works out to two and a half billion little arithmetic operations per attention head, per layer. Modern models have many heads and many layers. That is a lot of math.

Once prefill is done, the model has all the K and V values stored in GPU memory. Then it enters the second phase, decode, where it produces output tokens one at a time. Decode is much cheaper per token because each new token only attends back to existing K and V values, it does not recompute them.

Here is the thing the providers noticed. Those K and V values from prefill are deterministic. Given the same prompt prefix, the same model, and the same temperature setting, the K and V values will be identical every time. They are sitting in GPU memory anyway. Why throw them away after the request finishes? Why not keep them around for a few minutes? If another request comes in with the same prefix, skip prefill, start straight from decode. Save the GPU time. Pass some of the savings on to the user.

That is all prompt caching is. It is the receipt the GPU already paid. Now you can hand it in instead of paying again.

Who Shipped It First

Anthropic shipped first, in August of twenty twenty four. They called it prompt caching, and they made it explicit, which is to say you had to ask for it. You marked specific parts of your prompt with a flag called cache control, and the system would cache up to that breakpoint for five minutes. Cache hits cost ten percent of input price, a ninety percent discount. Cache writes cost twenty five percent more than normal input price, the surcharge for the work of saving the K and V state to a place where the system could find it later.

The explicit model felt like a strange choice at first. Most caching, in most parts of computing, is transparent. The CPU caches memory without telling you. The database caches query results without asking. Why would the AI API make you opt in?

The answer is that prompt caching is not actually free. There is a real cost to writing the cache. There is a real cost to keeping memory reserved. And the discount on the read side has to be large enough to make caching worthwhile, but only if you actually re-use the cache. If you sprinkled caching everywhere, you would often pay the write premium and never get the read discount back. Anthropic decided to make the developer choose. Their philosophy was, you know what you will re-use, you tell us, we will do the rest.

A few months later, OpenAI shipped their version, and they made the opposite choice.

The Two Schools

You do not have to do anything. That was OpenAI's pitch when they released prompt caching in October of twenty twenty four. Their version is automatic. Any prompt over a thousand and twenty four tokens, they hash the first part of it, route the request to a server that recently handled the same prefix, and check whether that server still has the K and V values in memory. If yes, you get cached pricing. If no, you pay full and your prompt now seeds the cache for the next request.

This is the OpenAI school of thought. Make it free. Make it default. Make it require zero code changes. The trade off is that you do not get to choose whether you hit the cache, the routing system does, and routing is imperfect. Researchers running experiments found that hitting OpenAI's cache, even when sending the same request twice in a row, lands a hit about half the time. The other half, the second request happened to get routed to a different server, and that server had not seen the prompt yet. You got full pricing on a request that you knew was identical.

Anthropic's school of thought is the opposite. Make it explicit. Make the developer carry some weight. In exchange, when you ask for a cache hit, you get one. The same researchers, running the same test against Anthropic with proper cache control breakpoints, get hits one hundred percent of the time on a cached prefix. The control is the product.

Then Google came along and said, why not both.

The Google Compromise

Google's Gemini API offers two flavors of caching. Implicit caching is on by default, exactly like OpenAI's automatic model. Send a prompt with a repeated prefix, Google's infrastructure will probably notice, and you will see a discount in your usage metadata. Probably. Implicit caching with Gemini two point five and later models offers up to a ninety percent discount on cache hits when they happen, which is a much steeper cut than OpenAI offers, but you do not get to predict when they happen.

For developers who want predictability, Google also offers explicit caching, similar in spirit to Anthropic's. You create a cache object through an API call, you reference it in subsequent requests, and you get a guaranteed ninety percent discount on every reference. The cost is that you pay storage. Google charges by the hour for the privilege of keeping your cached content available on their hardware. This is an honest representation of what is actually happening. Somebody's GPU memory is reserved for your cache, and that memory has an opportunity cost.

Google's explicit cache has a higher minimum size than the other providers. You need at least thirty two thousand tokens to even create one. The reasoning is that for smaller contexts, the management overhead is not worth it. Below the threshold, you should be relying on implicit caching anyway.

So now there are three schools. OpenAI says, do not worry about it, we will do our best. Anthropic says, tell us what to cache and we will deliver. Google says, take whichever you like, or both.

There is a fourth voice worth mentioning before we move on. DeepSeek built their entire pricing model around aggressive automatic caching. They cache on disk rather than just in memory, which means cache hits can persist for much longer than the others. Their cache hit pricing is dramatically cheaper than cache miss pricing, and they assume their users will structure prompts to benefit. For workloads that fit DeepSeek's caching model, the total cost can be lower than any of the three Western providers. The trade off is that you have less control and the model itself is different.

The Numbers, Carefully

Let us talk about money. The big question is always, how much does this actually save.

For Anthropic, the cache hit price is ten percent of the input token price. On Claude Opus four point seven, where input costs five dollars per million tokens, a cached read costs fifty cents per million. On Sonnet four point six, where input is three dollars, a cached read is thirty cents. The cache write costs one and a quarter times the input price for the five minute time to live, or two times the input price for the one hour time to live. So if you are going to re-use a cached prefix more than three or four times within the window, you have already paid for the write surcharge.

For OpenAI, the math depends on the model and shifts over time. On older models, cache hits cost half of input, a fifty percent discount. On the newer GPT family, cache hits cost a quarter of input, a seventy five percent discount. There is no write surcharge. The trade off, as we covered, is that you cannot predict when a hit happens.

For Google's Gemini two point five and later, both implicit and explicit caching offer a ninety percent discount on cache hits. Explicit caching adds storage costs, which are billed per million tokens per hour. If you are caching a hundred thousand tokens for an hour, that is a small charge, fractions of a cent. If you are caching ten million tokens for a day, the storage starts to matter.

There is one more wrinkle. You can stack caching with batch processing. Batch processing on the Anthropic API gives you a flat fifty percent discount, in exchange for accepting up to twenty four hours of latency. If you combine batching with caching, the discounts compose. A cached read inside a batch request costs five percent of standard input pricing. That is not a typo. Five percent. Half of ten percent. If your workload is asynchronous and shares a common prefix, you can pay one twentieth of full price.

What This Means For Your Architecture

The cost saving math is one thing. The architectural implications are another. Caching is going to shape how you write prompts, whether you think about it or not.

The first rule, true across all three providers, is to put your stable content at the start of your prompt and your variable content at the end. Caches match on prefixes. If you put a user's name at the very beginning of the system prompt, you have just made the entire rest of the prompt uncacheable. Every new user gets a fresh cache miss. If you put the user's name in the user message instead, where it belongs, the system prompt stays cacheable across users.

The second rule is that caching changes the economics of long context. There is a habit, especially among developers new to large language models, of including everything that might be relevant in every prompt. Why summarize, why filter, just dump it all in. Without caching, this is expensive. With caching, it can be very cheap, as long as the dumped content is the same across requests. Caching makes context engineering more forgiving for static content, and more punishing for dynamic content that varies request by request.

The third rule is that caching has a time to live, and the time to live is short. Five minutes is the default for Anthropic. OpenAI is similar, five to ten minutes, occasionally up to an hour during quiet periods. If you are building a workflow where the same context is queried repeatedly within a short window, like an interactive chat session, caching just works. If the context is queried once an hour, the cache will have evicted before the next request lands, and you will pay full price every time. Knowing your traffic pattern matters.

For high volume continuous workloads, both Anthropic and OpenAI offer extended cache windows. Anthropic's is one hour, at the cost of a higher write surcharge. OpenAI's is up to twenty four hours on some models. Google's explicit cache lets you set any time to live you want, but you pay storage by the hour.

A Quick Story About What Doesn't Cache

There is a class of mistakes that look like they should benefit from caching, but do not. Knowing them saves real money.

Tool definitions cache. If you have a list of twenty function definitions you send to the model on every call, those tokens cache. Do not reorder them between requests, do not reformat them, keep the list stable.

System prompts cache. Same rules. If you are A B testing two versions of a system prompt, you are cutting your cache in half. The second variant has a different prefix, the cached K and V values do not apply.

Long documents in the context cache. If you are doing question answering against the same hundred page report all day, that report should sit at the start of your prompt as cached content. New questions go at the end.

Conversation history half caches. The earliest turns of a conversation cache well because they do not change. As the conversation grows, you are adding turns at the end, which does not break the cache for the earlier turns. This is good. But if you summarize old turns and replace them with a summary, you have changed the prefix. You have also probably saved a lot of tokens. There is a real tension between cache friendly and context window friendly that gets interesting in long conversations.

Things that do not cache, or cache badly, include anything that varies by user, by request, by time, by anything dynamic placed early in the prompt. A timestamp at the top of every system message will completely defeat caching. Do not do that. If you need a timestamp, put it at the end.

The model does not understand what your content means. It only knows whether the bytes match. One different character early in the prompt and you are paying for prefill again from scratch. Caching is byte level prefix matching. The K and V values for token number five only apply if tokens one through five are exactly identical. Any change anywhere, any whitespace difference, any reordered word, the chain breaks and the cache has nothing to offer you.

The Honest Trade-Offs

Now for the part where I get to be honest about which approach is better, because they all have real failure modes.

OpenAI's automatic caching is the easiest to start with. You write your application, you do not think about caching, you get free discounts when the routing gods smile on you. The failure mode is that the discount is unpredictable. You cannot make a confident cost projection because you do not know your cache hit rate in advance. Production traffic at scale typically lands hit rates between forty and seventy percent on the right kind of workload. For some applications, that is fine. For applications where cost predictability matters more than peak savings, it is a problem.

Anthropic's explicit caching is the most work. You have to think about where your breakpoints go. You have a maximum of four breakpoints per request, which is plenty for most prompts but sometimes feels constraining in complex agentic workflows. You pay a small write premium. In exchange, you get one hundred percent cache hit rate on properly structured requests, and you can budget against it. For production systems where the same code path runs millions of times, this predictability matters more than the convenience of automation.

Google's hybrid is the most flexible. You can start with implicit and not think about it, and migrate to explicit when you want guarantees, and use them together. The hybrid is the right answer if you are not sure yet what your workload looks like. The downside is that explicit caching has higher minimum sizes than the other providers, and storage costs add a layer of complexity that the others do not have.

Where This Is All Going

A few things are clearly true about the future of prompt caching.

Cache durations are getting longer. OpenAI's extended caching now reaches twenty four hours. Anthropic offers a one hour tier. The next steps will be days. The reason is that the actual storage cost of K and V values, while real, is not infinite. As models get more efficient at long context handling, the relative cost of keeping K and V values around drops.

Cache hit rates are getting better. OpenAI's routing has improved measurably over the past year. The companies are tuning their systems to make cache hits more likely. The gap between explicit and implicit caching, in terms of effective hit rate, is narrowing.

Storage will eventually be priced separately from compute. Right now most providers bundle storage into the read price, with the exception of Google's explicit cache. This is convenient but it limits product design. If you want to cache something for a week, the providers should let you, and charge you accordingly. Expect this to become a separate dimension on the pricing page.

The most interesting development is at the edge of caching's definition. There are research papers from twenty twenty five showing how to do partial cache reuse, where two prompts share most but not all of their prefix, and the model can re-use the matching part without recomputing it. This is harder than exact prefix matching, but the payoff is huge for applications where prompts are nearly identical but not quite. None of the major providers have shipped this in production yet. When they do, the math will shift again.

What To Actually Do

If you take one practical thing away from this, it is the structure rule. Static content first, dynamic content last. This costs you nothing to do, it works across all providers, and it is the single change that has the biggest impact on your bill if you are not already doing it.

If you are on Anthropic, look at your system prompts and your tool definitions. Drop cache control breakpoints at the end of stable sections. The five minute window is fine for most chat applications. The one hour window is worth the write surcharge for agentic systems that re-use the same scaffolding across long task chains.

If you are on OpenAI, your job is mostly to structure your prompts to be cache friendly and check the cached tokens field in your usage response to see what you are actually getting. If you have predictable high volume workloads, look at Flex processing as a way to get more cache consistency.

If you are on Google's Gemini, start with implicit caching, which costs you nothing to enable, and migrate to explicit caching for the workloads where you have measured a meaningful gap between potential and actual savings.

If you are processing things in batch where you do not need real time responses, stack caching with batching. The combined discount on Anthropic gets you down to five percent of standard input pricing for cached content. That is the kind of math that turns previously expensive workloads into cheap ones.

The Charging Stop

The thing about prompt caching is that it sneaks up on you. The first time you build with an AI model, you do not think about it. You are focused on getting the prompt right and the output usable. Caching is something you find out about when the bill gets uncomfortable, or when somebody tells you about it on a podcast.

But once you have thought about it, you cannot unsee it. Every system prompt becomes a question of, should this be cached. Every long document in your context becomes a question of, where does the variable part start. Every dynamic element near the front of a prompt becomes a small voice in your head saying, that just defeated the cache.

The change is bigger than the percentage discount, which is real and worth chasing. The deeper change is that caching makes long context applications economically possible. Five years ago, sending a hundred thousand tokens of context to a model on every request would have been a strange thing to do, because it would have been impractical. With caching, it is just normal. The whole shape of what you can build with these systems shifts because the cost curve underneath them shifted.

And it shifted because three companies decided, more or less independently, that throwing away GPU memory at the end of every request was wasteful. They built three different versions of the same idea, and each version reflects how those three companies think about developer ergonomics. OpenAI says, we will figure it out for you. Anthropic says, you tell us what you need. Google says, take whichever feels right.

You are paying for math. Nobody likes to pay for math they already paid for once. Caching is the receipt that says, you do not have to.

Drive safe.