This is the practical companion to episode eleven of Actually, AI: inference.
You know the story now. Your prompt travels to a data center, gets shattered into tokens, flows through billions of operations on hardware that costs more than a house, and comes back one token at a time. You know about prefill and decode, about the key-value cache, about Tri Dao and Georgi Gerganov and the thousand-fold price crash. The question is: what does any of that mean for the way you use AI tomorrow morning?
Quite a lot, actually. Because inference is where the money goes. Not in some abstract corporate sense. In the sense that every prompt you send has a cost, that cost varies wildly depending on how you send it, and understanding the mechanics gives you real control over what you spend and what you get back.
If you use a free tier of any AI service, someone is paying for your inference. That someone is the company, burning through venture capital or subscription revenue to cover the GPU time your questions consume. If you pay twenty dollars a month for a subscription, that flat fee covers a certain amount of inference, and the company is betting that most users will not exceed it. If you use an API directly, you see the actual cost per token, and the numbers become very concrete very fast.
Here is how the pricing works, and why it looks the way it does. API providers charge separately for input tokens and output tokens. Input tokens are your prompt, everything you send to the model. Output tokens are the response, everything the model generates back. Output tokens always cost more. Sometimes two to four times more. Sometimes ten times more.
This price difference maps directly to the two phases from the main episode. Input tokens go through prefill, that big parallel computation that processes everything at once. It is efficient. The GPU is doing its best work, running large matrix multiplications in parallel, the kind of operation it was designed for. Output tokens go through decode, the serial one-at-a-time generation where the GPU is mostly waiting on memory. It is inefficient. Each output token requires a full forward pass through the model, reading billions of weights from memory, just to produce a single token.
So when you see a pricing page that says three dollars per million input tokens and fifteen dollars per million output tokens, you are looking at the prefill-decode split expressed in money. The input is cheap because parallel computation is cheap. The output is expensive because serial generation is expensive.
The practical consequence: a prompt where you paste in a long document and ask for a one-sentence summary is cheap. A prompt where you write one sentence and ask for a two-thousand-word essay is expensive. Same model, same quality, very different bill. If you are building anything on top of an AI API, this distinction matters more than almost any other design decision.
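To make that concrete, here is the arithmetic as a small sketch, using the hypothetical three-dollar and fifteen-dollar prices from above. The token counts are made up for illustration.

```python
# Hypothetical prices from the example above: $3 per million input
# tokens, $15 per million output tokens.
INPUT_PRICE = 3.00 / 1_000_000
OUTPUT_PRICE = 15.00 / 1_000_000

def request_cost(input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of one API call at the prices above."""
    return input_tokens * INPUT_PRICE + output_tokens * OUTPUT_PRICE

# A few pages of document in, one sentence out: mostly cheap prefill.
summary = request_cost(input_tokens=5_000, output_tokens=30)   # ~$0.015
# One sentence in, a long essay out: mostly expensive decode.
essay = request_cost(input_tokens=30, output_tokens=3_000)     # ~$0.045
```

Same model both times, and the essay still costs about three times as much, because nearly every token it bills is an expensive decode token.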
You have watched words appear one by one on screen, that typewriter effect we talked about in the main episode. Now you know that is not a design choice. The model is literally generating one token at a time. But here is the part that matters for how you work.
Streaming exists because the alternative is worse. Without streaming, you would press send and stare at a blank screen for ten, twenty, sometimes thirty seconds while the model generates the entire response, and then the full text would appear all at once. With streaming, you start reading the first sentence while the model is still generating the tenth. The total time is the same, but the experience is radically different.
This has a practical implication for your workflow. If you are waiting for the full response before you start reading, you are wasting time. The first paragraph of a long response is final. The model does not go back and revise earlier tokens based on what it generates later. Token forty-seven does not change because of what happens at token three hundred. Start reading while the words are still appearing. If the first paragraph is heading in the wrong direction, stop the generation and rephrase. You do not need to wait for the model to finish a response you are going to discard.
Some interfaces have a "stop generating" button. Use it. Every token the model generates after you have decided the response is wrong is wasted compute, and on an API, wasted money.
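The same logic is easy to see in code. This is a toy simulation, not a real client: `fake_stream` stands in for whatever streaming API you use (most SDKs expose something like a `stream=True` flag that yields chunks one at a time in just this way), and the token list is invented for the example.

```python
def fake_stream(tokens):
    """Stand-in for a streaming API response: yields one token at a
    time, in the order the model generates them."""
    for tok in tokens:
        yield tok

def read_stream(stream, stop_on=None):
    """Consume tokens as they arrive, bailing out early when the
    response is clearly wrong -- the 'stop generating' button in code."""
    collected = []
    for tok in stream:
        collected.append(tok)
        if stop_on is not None and stop_on in tok:
            break  # every token generated past this point is wasted compute
    return collected

# Stop reading (and paying) the moment the answer goes wrong.
tokens = ["The", " capital", " of", " Australia", " is", " Sydney", ",", " a", " city", "..."]
partial = read_stream(fake_stream(tokens), stop_on="Sydney")
```

With a real streaming client, breaking out of the loop and closing the connection is exactly what the stop button does: the provider stops generating, and on a per-token API, you stop paying.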
If you have used models like OpenAI's o1 or o3, or DeepSeek R1, you have used thinking models. They take longer, they cost more, and the reason maps directly to the inference mechanics you now understand.
A regular model receives your prompt, processes it through prefill, and starts generating visible tokens immediately. A thinking model does something different. After prefill, it generates a long chain of reasoning tokens, sometimes hundreds, sometimes thousands, that you never see. These invisible tokens are the model working through the problem step by step before committing to an answer. Only after the thinking chain is complete does the model begin generating the visible response.
Every one of those invisible thinking tokens costs exactly as much as a visible token. The GPU does not care whether a token will be shown to you. It runs the same forward pass, reads the same billions of weights, consumes the same electricity. A model that thinks for five hundred tokens before writing a two-hundred-token response costs you seven hundred tokens of output, not two hundred.
This is why thinking models can be dramatically more expensive for simple questions. If you ask a thinking model "what is the capital of France," it might generate three hundred reasoning tokens deliberating about the question before writing "Paris." You paid for three hundred and one output tokens to get one word. A regular model would have given you "Paris" as its first token.
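The billing arithmetic from that example, as a sketch. The fifteen-dollar output price is hypothetical, carried over from earlier; the token counts are the ones from the capital-of-France example.

```python
OUTPUT_PRICE = 15.00 / 1_000_000  # hypothetical: $15 per million output tokens

def thinking_model_cost(reasoning_tokens: int, visible_tokens: int) -> float:
    """Reasoning tokens bill at the output rate even though you never see them."""
    return (reasoning_tokens + visible_tokens) * OUTPUT_PRICE

capital_thinking = thinking_model_cost(300, 1)  # 301 tokens billed for "Paris"
capital_regular = thinking_model_cost(0, 1)     # 1 token billed for "Paris"
# Same one-word answer, roughly 300x the output cost.
```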
The practical rule: use thinking models for hard problems, regular models for easy ones. If the task involves math, logic, code debugging, or multi-step reasoning, the thinking tokens are worth their cost. The model genuinely performs better with that internal deliberation. If the task is summarization, translation, creative writing, or simple Q and A, a regular model gives you the same quality at a fraction of the price.
Some APIs let you set the "reasoning effort," a dial that controls how many thinking tokens the model is allowed to generate. Low effort for simple tasks. High effort for hard ones. If your API supports it, use it. You are directly trading money for quality on every request.
The main episode told you about Georgi Gerganov, the engineer who made it possible to run language models on a laptop. Here is what that means for you in practice.
Tools like Ollama, LM Studio, and llama.cpp let you download a model and run it on your own hardware with no cloud connection, no API key, no per-token charges. A seven billion parameter model runs comfortably on a modern laptop. A seventy billion parameter model needs a machine with sixty-four gigabytes of memory and even then it will be slow. The bigger the model, the more memory you need and the slower it runs.
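A rough rule of thumb for whether a model fits your machine: the weights alone take the parameter count times the bytes per weight. This sketch assumes 4-bit quantization, a common default for local tools, and it ignores the key-value cache and runtime overhead, so treat the result as a floor, not a budget.

```python
def weights_memory_gb(parameters_billion: float, bits_per_weight: int = 4) -> float:
    """Memory needed just to hold the weights at a given quantization level."""
    total_bytes = parameters_billion * 1e9 * bits_per_weight / 8
    return total_bytes / 1e9

small = weights_memory_gb(7)     # ~3.5 GB: fits comfortably on a modern laptop
large = weights_memory_gb(70)    # ~35 GB: needs that 64 GB machine
full = weights_memory_gb(70, bits_per_weight=16)  # ~140 GB: unquantized, forget it
```

This is why quantization, from the deep dive, is what makes local inference practical at all: the same seventy billion parameter model shrinks fourfold when you go from 16-bit to 4-bit weights.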
When does running locally make sense? Three scenarios. First, privacy. If you are processing confidential documents, medical records, legal contracts, proprietary code, running locally means none of that data leaves your machine. No API call, no cloud server, no third-party data policy to worry about.
Second, volume. If you need to process ten thousand documents, the API bill adds up fast. A local model is free per token after the initial hardware cost. If you have the hardware already, the breakeven point comes surprisingly fast. A few hundred thousand tokens of API output can cost more than the electricity your laptop uses running a local model for a week.
Third, experimentation. When you are prototyping, trying dozens of prompts, testing different approaches, the ability to iterate without watching a meter is liberating. You try things you would not try if each attempt cost money.
When does running locally not make sense? When you need the best possible quality. The frontier models from OpenAI, Anthropic, and Google are substantially better than anything you can run on consumer hardware. A local seven billion parameter model is not going to match Claude or GPT-4o on complex reasoning, nuanced writing, or code generation. You are trading quality for cost and privacy. That trade is sometimes excellent and sometimes terrible, depending on the task.
The speed difference is also significant. A cloud provider runs your prompt on clusters of H100 GPUs with hundreds of gigabytes of high-bandwidth memory. Your laptop has a fraction of that throughput. A response that takes three seconds from an API might take thirty seconds locally. For interactive use, that difference matters. For batch processing overnight, it does not matter at all.
Every decision about how to use AI sits inside a triangle with three corners: speed, quality, and cost. You get to pick two.
Fast and high quality means a frontier model through an API with no rate limiting. That is expensive. Fast and cheap means a small local model or a budget API tier. The quality drops. High quality and cheap means a powerful model on batch processing, where the provider runs your requests overnight when the GPUs would otherwise be idle. OpenAI's batch API charges half the normal price for results delivered within twenty-four hours. Anthropic offers similar batch pricing. You give up speed and get the same model at half cost.
This is not a metaphor. It is the direct consequence of GPU economics. A data center has fixed hardware costs whether the GPUs are busy or idle. Overnight, demand drops. The company can fill that idle capacity with batch requests at a discount and still make money. During peak hours, the same GPU time costs a premium because demand exceeds supply.
If your use case can tolerate a delay, batch processing is one of the most underused levers available. Translating a hundred documents, generating summaries of a thousand articles, processing a backlog of customer support tickets, none of these need real-time responses. Send them as a batch, pay half price, get the results in the morning.
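Batch requests are typically submitted as a file with one JSON request per line. The field names below follow OpenAI's documented batch format; other providers use different shapes, so check your provider's docs before copying this. The model name and prompts are placeholders.

```python
import json

def build_batch_file(prompts, model="gpt-4o-mini", path="batch_input.jsonl"):
    """Write one request per line in OpenAI-style batch JSONL format."""
    with open(path, "w") as f:
        for i, prompt in enumerate(prompts):
            request = {
                "custom_id": f"req-{i}",      # your key for matching results later
                "method": "POST",
                "url": "/v1/chat/completions",
                "body": {
                    "model": model,
                    "messages": [{"role": "user", "content": prompt}],
                },
            }
            f.write(json.dumps(request) + "\n")
    return path

# A hundred translation jobs, submitted once, collected in the morning.
path = build_batch_file([f"Translate document {i} to French." for i in range(100)])
```

You upload the file, create a batch job against it, and poll for results; each result carries the `custom_id` you assigned, so order does not matter.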
Open whatever AI tool you use most. Ask it a question you have asked before, something where you know what a good answer looks like. Now pay attention to the timing. How long before the first word appears? That gap is the time to first token, the prefill computation processing your prompt. How fast do the words flow after that? That is decode speed, the serial token generation.
Now try making your prompt longer. Paste in a few paragraphs of context before the same question. Watch the first-word delay increase while the word-by-word speed after that stays roughly the same. You are watching the prefill phase get heavier while the decode phase stays constant. The main episode explained why. Now you can see it happen.
If you have access to an API with usage tracking, send the same question to two different models, one large and one small. Compare the cost and the quality. That is the speed-quality-cost triangle in action. Finding the smallest model that is good enough for each task you do regularly is one of the highest-value optimizations available to anyone who uses AI seriously.
That covers the practical side of episode eleven. The main story showed you the journey from keyboard to data center. The deep dive took you inside the GPU, through speculative decoding and quantization and the economics of inference hardware. And now you know how to think about cost, speed, and quality as levers you can actually pull.
Episode twelve is about benchmarks. How we measure what AI can do, why every measurement gets gamed within months, and what that tells us about what intelligence even means.
That was the practical companion for episode eleven of Actually, AI.