The Patience Discount: Why Waiting Cuts Your AI Bill in Half

The Name Was The Problem

A developer sat down to use Anthropic's Message Batches API for the first time. They read the docs. They saw the word batch, and they saw the words fifty percent discount. They thought, alright, I need to bundle. I need to find a set of requests that go together, group them up, send them all at once. They started looking through their codebase for clusters of related work. They tried to figure out which prompts could share a batch.

They were doing extra work for no reason. The word batch was misleading them.

It turns out the discount has almost nothing to do with bundling requests together. The discount is for accepting that the response might not come back for a while. You can put one request in a batch. You will still get the fifty percent off. The provider is not rewarding you for the volume. The provider is rewarding you for the willingness to wait.

This is one of those situations where the engineering name and the marketing name and the actual mechanism are three different things, and they all hide what is really going on. Let us pull them apart.

The Mechanism Underneath

There are two ways to run a large language model inference service. The first way, the one everyone started with, is synchronous. You send a request, a server is allocated to you immediately, the model runs, the response comes back. From the provider's point of view, they have to keep enough GPU capacity available to handle whatever might come in at any moment. They size for peak load. The cost of that excess capacity, sitting idle during quiet hours, is baked into the price you pay.

The second way is asynchronous. You send a request, the provider says, fine, we will get to it sometime in the next twenty four hours. They put it in a queue. When their GPUs are not busy serving the real time customers, they pull jobs from the queue and run them. From the provider's point of view, this work is nearly free. The GPUs are already running, the electricity bill is already paid, the data centers are already cooled. Filling otherwise idle cycles costs the provider very little.

So they share the savings with you. Both Anthropic and OpenAI converged on the same number for this. Fifty percent off, both input and output. You give up immediacy. You get half price compute.

The fundamental insight here is that the discount is not about volume. It is not about bundling. It is about timing. The async commitment is the product. Whatever the API endpoint happens to be called.

Anthropic's Version

Anthropic ships this as the Message Batches API. The shape is what you would expect. You POST a batch creation request to the API with a list of message requests inside it. Each one has its own custom identifier so you can match results back later. The batch starts processing. You poll for status every thirty seconds or so. When the batch is done, you fetch a results file in JSON Lines format and parse it.

Each individual request inside the batch is otherwise identical to a normal Messages API call. Model, max tokens, system prompt, messages, tools, all the same parameters. You can use Claude Opus four point seven, Sonnet four point six, Haiku, whatever you want. The discount applies uniformly. Fifty percent off input. Fifty percent off output.

The advertised processing window is twenty four hours. In practice, most batches finish in under an hour, and many in just a few minutes. There are no webhooks, no callback mechanism, you have to poll. That is the biggest operational friction, but a thirty second polling loop in a worker process handles it fine.

A batch can contain up to a hundred thousand requests. Anthropic bumped this number up from ten thousand earlier in twenty twenty six. For really large workloads, the practical advice is to split into chunks of two to five thousand. It makes retries easier when something goes wrong. It avoids the size limit on a single batch payload, which is two hundred fifty six megabytes.

And here is the thing that the developer at the start of this episode missed. A batch with a single request is a perfectly valid batch. You get the same fifty percent discount on one request that you would get on a hundred thousand. The accounting is on a per token basis. The discount comes from the patience you agreed to, not from the bundle size.

OpenAI's Four Tiers

OpenAI got more elaborate. They started with a Batch API in roughly the same shape as Anthropic's, fifty percent off, twenty four hour window, polling-based, async. That is still there. But then they added more tiers, and now you can pick where on the patience-versus-price curve you want to sit.

The tiers, top to bottom, are Priority, Standard, Flex, and Batch.

Standard is the default. You do not ask for anything special, you get pay-as-you-go pricing with best-effort latency. No service level agreement, but in practice it is fast for most traffic. This is what every developer hits when they first write a Python script with their API key.

Priority sits above Standard. You pay a premium, somewhere around twenty five to fifty percent more depending on the model, in exchange for tighter latency guarantees and an enterprise service level agreement. You set service tier equals priority in the request. This is meant for user-facing production traffic where slow responses cost you customers or break a real time experience. Voice agents, interactive chat, anything where the user is waiting and watching.

Flex sits below Standard. You set service tier equals flex, the request goes through the normal Responses or Chat Completions endpoint, you get a response back when one is available. The pricing is matched to Batch API rates, which means fifty percent off. The trade off is that your request might come back slowly, in seconds or minutes, and might sometimes return a four twenty nine error meaning the resource is unavailable right now. Flex is currently in beta and limited to certain models, including the reasoning models o three and o four mini. You retry with exponential backoff. If you really need it, you can fall back to standard processing.

The interesting thing about Flex is that it gives you most of the Batch API discount without the full async machinery. You still get a response on the same request. You just might wait longer. For many workloads this is a much nicer developer experience than spinning up a batch job, polling, parsing results files, and handling expired requests.

And then there is Batch, at the bottom of the price curve. Same model as Anthropic's. Submit a batch file, wait up to twenty four hours, fetch results.

OpenAI also offers Scale tier for committed throughput, where you reserve capacity for a minimum thirty day term. That sits parallel to the others rather than on the same axis. It is the right answer when your workload is high and steady, predictable enough to commit to.

The four tier system gives you a clean choice. Need it now? Priority. Normal traffic? Standard. Can wait a bit and tolerate some retries? Flex. Can wait until tomorrow? Batch.

A Wrinkle About Caching

One thing OpenAI did with Flex is meaningful. Flex requests run through the normal Responses API, which means they hit the normal caching infrastructure. Anthropic and OpenAI both let cache pricing stack with async pricing, but the cache hit rate inside an async system is not always great.

Inside the Batch API on OpenAI's side, you cannot use older models for cached inference at all. The cache machinery was not retrofitted to pre-GPT-5 models in batch mode. If you are running o three or o four mini and you want batch pricing plus caching, you need to be on Flex, not Batch. The OpenAI cookbook documents this clearly. In a head to head test on ten thousand identical requests, Flex with extended prompt caching enabled produced an eight and a half percent higher cache hit rate than the same workload through Batch. That translates to roughly twenty three percent lower input token cost.

For Anthropic, the story is simpler. Their Message Batches API supports prompt caching natively. The discounts stack multiplicatively. A cached read inside a batch request costs five percent of standard input pricing. Five percent. Not five percent off, five percent of. That is the math that turns expensive bulk workloads into nearly free ones.

Google's Version

Google offers batch prediction on Vertex AI. The mechanics are roughly similar to the others. You submit a job, the job processes asynchronously, you fetch results when it is done. Inputs can come from a BigQuery table or a Cloud Storage bucket, which is convenient if you are already in Google's data world. Outputs land in the same kinds of places.

The pricing is fifty percent off standard rates, in line with the other providers. Most Gemini models support it. And the storage-and-data-handling integration is the differentiator. If your input data already lives in BigQuery, you save the engineering work of streaming it into request bodies. The batch system reads it for you, runs the inferences, writes the results back to a destination table or bucket.

This shape suits some workloads very well. A classic example is enriching a database column. You have a million rows of customer feedback. You want to add a sentiment label, an extracted topic, a translated version. The batch job reads from BigQuery, runs each row through Gemini, writes back. No queue management code, no polling loop, no JSON Lines parsing. The integration handles it.

Vertex AI's batch system also supports the same context caching mechanism we covered in the last episode. Implicit caching is on by default. Explicit caching works through batch as well, with the same ninety percent discount on cached tokens.

DeepSeek's Different Angle

DeepSeek does not have a batch API in the same sense as the others. Instead, they have a time of day discount. From four thirty PM UTC to twelve thirty AM UTC, prices drop. Cache miss requests are roughly fifty percent off. Cache hits go down another seventy five percent. The discount is automatic and synchronous. You do not queue, you just send your request during the discount window and the bill is lower.

This is a different shape of the same idea. The provider is selling slack capacity. DeepSeek's slack happens to be at predictable times of day rather than across a twenty four hour scheduling window, and they expose it as a clock rather than a queue. The economics are the same. The implementation is different.

For workloads that can wait until off peak hours, DeepSeek can be dramatically cheaper than anyone else. The downside is that not all workloads can wait, and not all jobs can be scheduled to land in a specific window. But for nightly batch jobs in particular, where you are going to run things overnight anyway, aligning to DeepSeek's discount window is a no code change optimization.

The Operational Reality

Async work brings operational complexity that sync work does not. The patterns are well established, but you have to know them.

First, polling. No major provider supports webhooks for batch completion right now. You poll. The standard pattern is a worker that wakes up every thirty seconds, checks status, and either continues waiting or starts processing results. This is fine, but it means a long running worker, which means orchestration, which means something like Airflow or Temporal or a simple cron job that manages state. For a single batch a day, a hand rolled script works. For continuous batching, you want real infrastructure.

Second, error handling. Inside a batch, individual requests can fail. The batch object reports counts of succeeded, errored, expired, canceled. You have to read the results file and look at each result. The simple pattern is, fetch results, separate successes from failures, resubmit failures in a new batch with the same payload but maybe with extra logging.

Third, result expiration. Both Anthropic and OpenAI hold results for a window of about twenty nine days after the batch completes. The batch object itself stays around longer, but the results file becomes inaccessible. If you do not copy results to your own storage immediately, you might lose them. The standard pattern is to fetch results, write them to your own object store or database, then forget about the batch identifier.

Fourth, request expiration. If a batch does not complete within twenty four hours, individual requests can expire. You do not pay for expired requests, which is nice. You also do not get results, which is less nice. During high traffic periods, very large batches are more likely to have some expired requests. The mitigation is to split into smaller chunks. Two to five thousand requests per batch is the rule of thumb.

Fifth, idempotency. The custom identifier you assign each request inside a batch must be unique within that batch. If you have two requests with the same custom identifier, the batch creation fails. This sounds obvious but it is a common error when batch identifiers come from a database query that has duplicates.

None of this is hard. It is just operational hygiene that you do not have to think about with synchronous APIs.

When Async Is The Right Answer

The decision tree for using async pricing is simpler than the four tier list suggests.

If a human is waiting for the result, do not use async. The fifty percent saving is not worth a five minute response time when somebody is staring at a screen.

If a human will look at the result later, async is great. Nightly reports, weekly digests, scheduled summaries, anything where the receiving human is going to read it in the morning.

If the result feeds a downstream pipeline that runs on its own schedule, async is great. Data enrichment, classification, embedding generation, model evaluation, anything where the result becomes input to something that does not care when it arrives.

If you are prototyping or evaluating, async is great. You are going to run a hundred test cases, look at the outputs, tune the prompt, run them again. Each pass can be a batch. You can leave them overnight.

If the workload is small but you genuinely do not need it back immediately, async is also great. This is the case the developer at the start of this episode discovered. You do not need to be running ten thousand requests. One request through the batch endpoint gets the same discount.

There is a fuzzy zone in between, where you sort of need the response soon but not right now. That is where OpenAI's Flex tier lives. The Responses endpoint with service tier flex gives you the batch discount with the ergonomics of a synchronous call. It might be slower, it might return a four twenty nine, you might have to retry. But you do not have to spin up a whole batch processing system.

Anthropic does not currently have an equivalent Flex tier. They have Standard and Batch. If you want the discount, you go async. There is no middle ground yet. Whether that changes is up to Anthropic.

The Stacking Math

Let us spend a minute on the actual numbers because they matter for sizing your bill.

Take a workload running on Claude Sonnet four point six. Standard input is three dollars per million tokens. Standard output is fifteen dollars per million tokens. You have a workload that sends fifty thousand tokens of static context plus a thousand tokens of unique query, and receives a thousand tokens of output. You run this a hundred thousand times a month.

The naive cost works out to about fifteen thousand three hundred dollars of input plus fifteen hundred dollars of output. Roughly sixteen thousand eight hundred dollars a month.

Now with batching. Half off across the board. Eight thousand four hundred a month. Solid saving for accepting twenty four hour latency.

Now with batching plus caching, assuming the fifty thousand tokens of context are stable across requests and cache properly. The cached input is now five percent of standard input pricing, fifteen cents per million instead of three dollars. The math comes out to about seven hundred sixty five dollars for cached input, a hundred fifty dollars for the unique query tokens at the half off batch rate, and seven hundred fifty dollars for output. Total around sixteen hundred sixty five dollars a month.

You went from sixteen thousand eight hundred to sixteen hundred sixty five. About a ninety percent reduction. Same model, same workload, two configuration choices.

This is why the stacking matters. The numbers are real. The work to enable them is small. The trade off is that your workload has to tolerate the async window and your prompts have to be structured to actually hit cache, both of which we covered in the previous episode.

The Strategic Posture

If you are running anything at scale on AI APIs, your bill is structured around four levers. The model you choose, the prompt structure, the caching strategy, and the synchronicity tier. The first one is obvious and the last one is often forgotten.

Most teams start with everything on Standard or Priority. As volume grows, the bill grows linearly. At some point somebody notices the bill, looks at the workload, and finds that thirty or forty percent of it does not actually need real time responses. That portion gets moved to Batch or Flex. The bill drops by twenty to thirty percent overnight.

This is the most common optimization story in production AI deployments right now. The model is fine. The prompts are fine. The caching is in place. The remaining lever is, why is this evaluation pipeline running on Priority. Why is this nightly enrichment job running on Standard. Move them down a tier. Reclaim half their cost.

The teams that do this well treat the tier choice as a per workload decision rather than a per project decision. The user facing chat endpoint runs Priority. The internal admin tools run Standard. The data labeling pipeline runs Flex. The weekly report generator runs Batch. Each workload gets the tier that matches its real time requirements, not a default.

What is harder, and what does not get done as often, is the inverse decision. Some workloads are running on Batch that probably should be on Standard. Either because the user actually does want the result sooner, or because the operational complexity of running a batch pipeline is costing more in engineering time than the discount is saving. The reverse migration is rarer but it happens.

The Charging Stop

The pricing tier you choose is not a technical decision. It is a product decision in disguise. The question of whether your workload runs on Batch or Standard or Priority is really the question of how patient your user is willing to be, and what they will pay for impatience.

If your product can deliver in twenty four hours and that is fine for the use case, you should be running on Batch. The discount is half. The math is simple.

If your product needs to deliver in seconds, you are on Standard, and the question is whether you can put a real time service level agreement in front of paying customers, in which case Priority might earn back its premium.

If your product can deliver in a couple of minutes and the user will accept that, Flex is the right answer where it is available. You get the discount with the same API shape as Standard.

And underneath all of this, the same insight that the developer at the start missed. The word batch sounds like it is about bundles. It is not. It is about patience. The provider has more GPU capacity than they have real time demand for. They will share the savings with anybody willing to wait. The size of your bundle is a detail. Even a bundle of one qualifies.

The discount is for not being in a hurry. That is the whole thing.

Drive safe.