This is the deep dive companion to episode eleven of Actually, AI: inference.
In the main episode, we described prefill and decode as two phases. That undersells how different they are. They are not two steps of the same process. They are two fundamentally different types of computation that happen to share the same hardware, and the entire history of inference optimization is the story of people realizing this and pulling them apart.
Prefill is compute-bound. Your entire prompt, hundreds or thousands of tokens, gets processed in one parallel operation. The GPU's arithmetic units are saturated. The math is dense. Large matrix multiplied by large matrix, the kind of work GPUs were designed for back when their only job was rendering video game triangles. The bottleneck is raw computational power. Throw more floating-point operations at the problem and it goes faster.
Decode is memory-bandwidth-bound. You are generating one token at a time, which means the operation is a matrix multiplied by a single vector. The GPU's arithmetic units are mostly idle, waiting. The bottleneck is not computation but how fast the hardware can fetch data, reading the model weights and the entire key-value cache from memory for every single token. On an H100, the high-bandwidth memory can deliver about three terabytes per second. That sounds fast until you realize that a seventy billion parameter model in sixteen-bit precision is one hundred and forty gigabytes of weights, and the key-value cache might be another forty gigabytes, and you need to read all of it for every token you generate.
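To see how hard that ceiling is, here is a back-of-the-envelope sketch using the figures just quoted. It assumes the idealized best case, where each decoded token streams the full weights and cache from memory exactly once; real systems batch requests to amortize this cost.

```python
# Rough decode-speed ceiling from memory bandwidth alone, using the
# numbers above: H100 HBM at ~3 TB/s, 140 GB of weights, 40 GB of
# KV cache. This ignores compute entirely -- bandwidth dominates.

HBM_GB_PER_S = 3000                 # ~3 TB/s
WEIGHTS_GB = 70 * 2                 # 70B params * 2 bytes each = 140 GB
KV_CACHE_GB = 40

bytes_read_gb = WEIGHTS_GB + KV_CACHE_GB    # streamed once per token
seconds_per_token = bytes_read_gb / HBM_GB_PER_S

print(f"{seconds_per_token * 1000:.0f} ms/token")       # 60 ms/token
print(f"{1 / seconds_per_token:.1f} tokens/s ceiling")  # 16.7 tokens/s
```

For a single unbatched request, that is the whole story: under seventeen tokens per second, no matter how many idle arithmetic units the GPU has.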
This asymmetry has a practical consequence that modern serving systems exploit. If prefill needs compute and decode needs memory bandwidth, why run them on the same hardware? A system called DistServe, published at the USENIX operating systems conference in twenty twenty-four, physically separates prefill and decode onto different pools of GPUs. The prefill pool gets GPUs optimized for raw computation. The decode pool gets GPUs optimized for memory throughput. The key-value cache transfers between pools over high-speed interconnects, adding less than one decode step of overhead. The result was seven times more requests served, or twelve times tighter latency guarantees, compared to the best existing system. By twenty twenty-five, the authors noted that almost every production-grade serving framework had adopted some form of this disaggregation.
Decode is slow because the model generates one token at a time. Each token requires a full forward pass through every layer. And most of those tokens are boring. If the model is writing "the United States of America," the tokens "States," "of," and "America" are nearly predetermined after "the United." The model is spending billions of operations to confirm what a much smaller model could have guessed.
In November twenty twenty-two, Yaniv Leviathan at Google had an idea. What if you used a small, fast model to draft a sequence of candidate tokens, and then used the big model to check them all at once? The checking step is a parallel operation, like prefill, meaning the big model can verify, say, five drafted tokens in roughly the same time it takes to generate one token from scratch. If the draft matches what the big model would have chosen, you just got five tokens for the price of one. If the draft diverges, you accept the tokens up to the point of disagreement and regenerate from there.
The remarkable property of this technique, called speculative decoding, is that it produces exactly the same output distribution as the big model running alone. Not approximately the same. Exactly. The mathematics of rejection sampling guarantees that the probability of generating any particular sequence is identical whether or not you used the drafting trick. You get a two to five times speedup with no quality loss whatsoever.
Some tokens are easier to generate than others. The small model can often predict what the bigger model would say.
Two months later, in February twenty twenty-three, a separate team at DeepMind independently published the same idea under the name "speculative sampling." The parallel discovery underlines how natural the insight is once you frame the problem correctly. Google now uses speculative decoding to power AI Overviews in search. Groq, the inference chip company, achieved sixteen hundred and sixty tokens per second on a seventy billion parameter model with speculative decoding, roughly six times faster than without it. The reason is that their hardware architecture, built entirely around fast on-chip memory, makes the verification step nearly instant.
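The accept-or-resample rule behind that lossless guarantee can be sketched in a few lines. This is a toy over a three-token vocabulary with made-up probabilities, where `target` stands in for the big model's next-token distribution and `draft` for the small model's; a real implementation works over full model logits.

```python
import random

# Toy sketch of speculative decoding's accept/resample rule.
# Both distributions here are invented for illustration.
target = {"States": 0.7, "Kingdom": 0.2, "Nations": 0.1}
draft  = {"States": 0.5, "Kingdom": 0.4, "Nations": 0.1}

def speculative_step(rng):
    tokens = list(draft)
    # 1. The draft model proposes a token from its distribution q.
    x = rng.choices(tokens, weights=[draft[t] for t in tokens])[0]
    # 2. The big model accepts it with probability min(1, p(x)/q(x)).
    if rng.random() < min(1.0, target[x] / draft[x]):
        return x
    # 3. On rejection, resample from the residual max(0, p - q).
    residual = [max(0.0, target[t] - draft[t]) for t in tokens]
    return rng.choices(tokens, weights=residual)[0]

rng = random.Random(0)
samples = [speculative_step(rng) for _ in range(20000)]
# The outputs follow `target` exactly in expectation, even though
# every proposal came from `draft` -- the lossless guarantee.
print(samples.count("States") / len(samples))   # close to 0.7
```

Working through the arithmetic for "States": it is accepted whenever drafted (probability 0.5, since p/q exceeds one), and every rejection resamples to it (the residual is concentrated there), adding the 0.2 rejection mass, for exactly the target 0.7.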
A single user generating tokens one at a time leaves most of the GPU idle. The hardware has enormous parallel capacity, and a single decode stream uses a fraction of it. The obvious solution is batching: serve multiple users simultaneously, packing their computations together so the GPU has more work to do per memory access.
The naive approach groups a batch of requests, processes them together, and waits until the longest one finishes before accepting new work. This is terrible. If one user asks for a haiku and another asks for a five-hundred-word essay, the haiku is done in seconds but the GPU slot sits idle until the essay finishes. In a study on A100 GPUs with a thirteen billion parameter model, naive batching utilized the hardware so poorly that modern alternatives achieved twenty-three times higher throughput.
The fix came from a paper called Orca, published at the USENIX operating systems conference in July twenty twenty-two by Gyeong-In Yu and colleagues at Seoul National University. Their innovation was iteration-level scheduling. Instead of grouping requests into fixed batches, the system re-evaluates after every single token generation step. A finished request is immediately replaced by a new one. The batch composition changes every iteration. This single change delivered an eightfold throughput improvement.
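The scheduling change is small enough to sketch directly. Batch size, request names, and token counts below are illustrative, not Orca's actual configuration; the point is that free slots refill on every iteration rather than per fixed batch.

```python
from collections import deque

# Iteration-level scheduling: after every decode step, finished
# requests leave the batch and queued requests take their slots.

MAX_BATCH = 2
queue = deque([("haiku", 3), ("essay", 8), ("summary", 4)])  # (name, tokens)
batch, done, step = [], [], 0

while queue or batch:
    # Refill free slots before every iteration, not per fixed batch.
    while queue and len(batch) < MAX_BATCH:
        name, remaining = queue.popleft()
        batch.append([name, remaining])
    step += 1
    for req in batch:
        req[1] -= 1                      # one token decoded per step
    done += [r[0] for r in batch if r[1] == 0]
    batch = [r for r in batch if r[1] > 0]

print(done, "in", step, "steps")   # ['haiku', 'summary', 'essay'] in 8 steps
```

With naive fixed batching, the same workload takes twelve steps, because the haiku's slot sits empty for five of the essay's eight steps before the summary can even start.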
Then Woosuk Kwon at UC Berkeley attacked the memory side. The key-value cache for each request was being allocated as a single contiguous block in GPU memory. If the system reserved space for a two-thousand-token response and the actual response was three hundred tokens, the remaining seventeen hundred token slots were wasted. Across many concurrent requests, sixty to eighty percent of allocated cache memory contained nothing useful.
Kwon's insight was that this looked exactly like the memory fragmentation problem that operating systems faced in the nineteen sixties. The solution in operating systems was paging: divide memory into small fixed-size blocks that do not need to be contiguous. Kwon applied the same idea to the key-value cache. PagedAttention stores cache blocks in whatever GPU memory happens to be available, using a lookup table to find them, just like a page table in an operating system. Memory waste dropped to under four percent. Combined with continuous batching, vLLM achieved twenty-four times higher throughput than the previous standard library.
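The bookkeeping looks just like an operating system's page table. Here is a minimal sketch; the block size and pool size are illustrative rather than vLLM's real values.

```python
# KV cache stored in fixed-size blocks found through a per-request
# block table, instead of one big contiguous reservation.

BLOCK_TOKENS = 16
free_blocks = list(range(64))    # pool of physical cache blocks
block_tables = {}                # request id -> list of physical blocks

def append_token(request_id, token_index):
    table = block_tables.setdefault(request_id, [])
    if token_index % BLOCK_TOKENS == 0:       # current block is full
        table.append(free_blocks.pop(0))      # grab any free block
    return table[-1], token_index % BLOCK_TOKENS   # (block, slot) to write

# Two requests grow their caches interleaved; their blocks need not
# be adjacent, and nothing is reserved beyond what is actually used.
for i in range(20):
    append_token("A", i)
for i in range(40):
    append_token("B", i)

print(block_tables)   # A holds 2 blocks, B holds 3, allocated on demand
```

Because allocation happens one small block at a time, the only waste is the unfilled tail of each request's final block, which is how the under-four-percent figure becomes possible.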
The KV cache memory problem looked exactly like the virtual memory fragmentation problem that operating systems solved decades ago with paging.
When the LMSYS chatbot arena deployed vLLM, they saw a thirty-fold throughput increase and were able to cut their GPU fleet in half while handling thirty thousand daily requests with peaks of sixty thousand.
A language model is, at its core, a collection of numbers. Billions of them. Each number is a weight, a single value that was adjusted during training. And here is a question that turns out to matter enormously: how precisely do you need to store those numbers?
During training, weights are typically stored in sixteen-bit floating point, meaning each number takes two bytes. A seventy billion parameter model in sixteen-bit precision occupies one hundred and forty gigabytes. That fills nearly two entire H100 GPUs just for the weights, before you allocate a single byte for the key-value cache or the computation itself.
Quantization is the practice of storing those numbers in fewer bits. Sixteen-bit to eight-bit cuts memory in half. Eight-bit to four-bit cuts it in half again. A seventy billion parameter model that needed one hundred and forty gigabytes in sixteen-bit precision fits in thirty-five gigabytes at four bits. That is one GPU instead of two. The question is: how much quality do you lose?
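The memory arithmetic in that paragraph is simple enough to write down. The only added assumption here is the eighty gigabytes of an H100.

```python
import math

# Bytes per weight at each precision, against an 80 GB H100.

def weight_gb(params_billions, bits):
    return params_billions * bits / 8   # 1e9 params * (bits/8) bytes = GB

for bits in (16, 8, 4):
    gb = weight_gb(70, bits)
    print(f"{bits}-bit: {gb:.0f} GB -> {math.ceil(gb / 80)} H100(s)")
```

This counts weights only; the key-value cache and activation workspace come on top, which is why the sixteen-bit model in practice needs two full GPUs rather than one and three-quarters.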
The answer, discovered empirically across dozens of studies, is surprisingly little. At eight-bit precision, the accuracy drop is typically less than one percent compared to the sixteen-bit original. At four-bit, the drop is around four to six percent on most benchmarks. The reason four-bit works at all is that most weights in a trained model cluster near zero. The few weights that carry disproportionate importance, the outliers, can be identified and protected. A technique called Activation-Aware Quantization, AWQ, does exactly this: it finds the one percent of weights that matter most based on how they interact with actual inputs, applies scaling factors to preserve their precision, and aggressively compresses everything else.
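A toy illustration of the outlier-protection idea: most weights snap to a four-bit grid while the rare large ones keep full precision. Real AWQ selects salient weights from activation statistics and rescales them; this sketch uses raw magnitude as a stand-in, and the weight values, scale, and threshold are all made up.

```python
# Mixed-precision sketch: quantize small weights to 4 bits,
# protect the outlier. All numbers here are invented.

def quantize_4bit(w, scale):
    q = max(-8, min(7, round(w / scale)))   # signed 4-bit range [-8, 7]
    return q * scale

weights = [0.024, -0.031, 0.05, 3.5]        # one huge outlier at the end
scale = 0.01

protected = {i for i, w in enumerate(weights) if abs(w) > 1.0}
recovered = [w if i in protected else quantize_4bit(w, scale)
             for i, w in enumerate(weights)]
# Small weights land on the nearest grid point (0.024 -> 0.02); the
# outlier survives exactly. Unprotected, 3.5 would clamp to the grid
# edge at 0.07 -- a fifty-fold error from a single weight.
print(recovered)
```

That clamping failure mode is why the one percent of outliers does disproportionate damage when compressed naively, and why protecting them recovers most of the lost accuracy.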
There is a size threshold. Models with seventy billion or more parameters handle four-bit quantization gracefully. Smaller models, seven billion and below, show more degradation. The intuition is that larger models have more redundancy, more weights that can absorb the precision loss without meaningful impact on output quality.
This is what made Georgi Gerganov's llama.cpp transformative. His GGUF format became the standard way to package quantized models for consumer hardware. A seventy billion parameter model that needs a data center rack in full precision can run on a gaming laptop at four bits. Not as fast, not with the same throughput, but functional. Gerganov did not invent quantization. He made it accessible.
Every time you ask a chatbot a question, somewhere a GPU worth more than a car is doing math on your behalf. An H100, the workhorse of most AI data centers, costs between twenty-five and forty thousand dollars. A full server with eight of them costs more than most houses. Cloud rental runs two to five dollars per GPU per hour, with prices that dropped sixty-four to seventy-five percent from their twenty twenty-three peaks as supply caught up with demand.
NVIDIA holds roughly ninety percent of the AI GPU market. Their data center revenue hit forty-seven and a half billion in fiscal twenty twenty-four, exceeded one hundred billion in twenty twenty-five, and is projected above one hundred and thirty billion for twenty twenty-six. At a technology conference in twenty twenty-six, Jensen Huang, NVIDIA's CEO, projected over one trillion dollars in chip sales through twenty twenty-seven.
The number of tokens that are being generated has really, really gone exponential, and so we need to inference at a much higher speed.
The token-level economics are revealing. Processing a typical GPT-4o query consumes roughly zero point three watt-hours of energy. That sounds trivial. But ChatGPT handles an estimated seven hundred million tokens per second across its infrastructure, running on what has been estimated at nearly twenty-nine thousand GPUs spread across Microsoft Azure data centers. Anthropic runs Claude on a combination of Google TPUs, Amazon Trainium chips, and additional cloud capacity, committing to over one million next-generation TPU chips coming online in twenty twenty-six. The electricity alone is staggering. Data center energy demand is projected to reach one thousand terawatt-hours globally by twenty twenty-six, with AI as a primary driver.
The price per token has been falling at roughly ten times per year. The sixty dollars per million tokens that GPT-3 charged in late twenty twenty-one became fifteen cents per million tokens for GPT-4o Mini in mid twenty twenty-four. That is a four-hundred-fold reduction in under three years. Six factors drove it: faster hardware, quantization, better serving software, more efficient model architectures, instruction tuning that makes smaller models match the quality of larger ones, and the competitive pressure of open-source models offering comparable performance for free.
Yet even with that crash, OpenAI is substantially unprofitable. Revenue of roughly three point seven billion dollars in twenty twenty-four against total spending well over four billion. Inference alone accounted for an estimated one point eight billion. Training is a one-time cost. Inference is the utility bill. And it scales with every new user, every longer conversation, every thinking model that generates invisible reasoning tokens before answering.
Mixture of Experts is an architectural trick that changes the inference math. In a standard dense model, every parameter activates for every token. All seventy billion weights participate in generating the word "the." Mixture of Experts replaces the large feed-forward layer inside each transformer block with multiple smaller "expert" sub-networks. A lightweight routing network looks at each incoming token and decides which two or three experts should handle it. The rest stay idle.
The result is a model with enormous total capacity but modest per-token cost. Mixtral eight by seven B has forty-six point seven billion total parameters but only activates twelve point nine billion per token. It runs at the speed of a thirteen billion parameter model while having the knowledge capacity of something much larger. DeepSeek V3 pushes this further, with six hundred and seventy-one billion total parameters, two hundred and fifty-six fine-grained experts, and only thirty-seven billion active per token. That is roughly five and a half percent of the total network firing for any given word.
The catch is memory. Even though most experts are idle for any given token, all of them must be loaded into GPU memory, or at least distributed across a cluster of GPUs, ready to activate on demand. Mixture of Experts saves compute but not memory. This is why MoE models still need large clusters to run, even though each individual token costs less to process than it would in a dense model of equivalent quality.
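The routing mechanism described above can be sketched as a toy. The gate here is a random matrix and all sizes are made up; in a real MoE layer the gate is a small learned linear projection inside each transformer block.

```python
import math, random

# Toy top-2 gating: score every expert, run only the two best.
random.seed(0)
NUM_EXPERTS, DIM, TOP_K = 8, 16, 2
gate = [[random.gauss(0, 1) for _ in range(NUM_EXPERTS)] for _ in range(DIM)]

def route(token_vec):
    # Score every expert for this token.
    scores = [sum(x * gate[d][e] for d, x in enumerate(token_vec))
              for e in range(NUM_EXPERTS)]
    # Keep only the TOP_K highest-scoring experts.
    top = sorted(range(NUM_EXPERTS), key=lambda e: -scores[e])[:TOP_K]
    # Softmax over the chosen experts gives their mixing weights.
    exps = [math.exp(scores[e]) for e in top]
    return [(e, x / sum(exps)) for e, x in zip(top, exps)]

token = [random.gauss(0, 1) for _ in range(DIM)]
chosen = route(token)
print(chosen)   # two (expert_id, weight) pairs; the other six stay idle
```

Note what the sketch makes visible: all eight expert networks must exist in memory for the gate to choose among them, even though six do no work for this token. That is the compute-versus-memory asymmetry in miniature.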
DeepSeek V3 is notable for another reason. The team reported training costs of five point six million dollars, using two thousand forty-eight H800 GPUs for fifty-five days, roughly eleven times more efficient than Meta's Llama 3 training run. That number has been widely cited, but it excludes the cost of the hardware itself, roughly fifty-one million dollars, plus research and development spending and failed experiments that bring the real total well above five hundred million. The headline number is not wrong, but it measures only the marginal cost of the successful run, not the full investment required to reach it.
Let us trace the journey of a single prompt. You type a question and press send. What happens next involves more systems talking to each other than most web applications ever touch, and all of it must complete before you see the first word.
Your text arrives as an HTTP request at a load balancer, which routes it to one of many model instances based on current load. Some modern load balancers are cache-aware, meaning they try to route your request to a GPU that already has a relevant key-value cache from your previous messages, avoiding the cost of rebuilding it from scratch.
A preprocessing service tokenizes your text using Byte Pair Encoding, retrieves your conversation history from a fast cache like Redis, and assembles the full context. This assembled token sequence enters the model for prefill, building the key-value cache for the entire prompt. Then decode begins, generating tokens one at a time. Each token triggers a callback that enqueues it into a Server-Sent Events controller, which pushes it to your browser as a tiny JSON message with the content type text/event-stream. Each word you see is a separate network event.
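Those network events are tiny. Here is a sketch of what one streamed token might look like on the wire; the Server-Sent Events frame shape is standard, but the JSON field name is invented for illustration, not any particular provider's schema.

```python
import json

# One Server-Sent Event per generated token. SSE frames are plain
# text: a "data:" line followed by a blank line. The payload field
# ("delta") is a made-up example.

def sse_event(token):
    return f"data: {json.dumps({'delta': token})}\n\n"

for tok in ["The", " United", " States"]:
    print(sse_event(tok), end="")
```

The blank line after each `data:` line is what delimits events, which is why the browser can render each token the instant its frame arrives.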
For models too large to fit on a single GPU, the layers are distributed across multiple GPUs using two strategies. Tensor parallelism splits individual layers across GPUs within a single server node, connected by NVLink at hundreds of gigabytes per second. Pipeline parallelism assigns chunks of layers to different nodes, data flowing like an assembly line. A four hundred and five billion parameter model in sixteen-bit precision needs roughly eight hundred gigabytes of memory for the weights alone, requiring at minimum sixteen H100s across two server nodes. In eight-bit precision, it barely squeezes onto a single eight-GPU node.
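The sizing arithmetic from that paragraph, made explicit. The eighty-gigabyte H100 capacity and eight-GPU node size are the assumptions; the calculation ignores KV cache and activation headroom, which is why "barely squeezes" is the right phrase for the eight-bit case.

```python
import math

H100_GB, GPUS_PER_NODE = 80, 8

def plan(params_billions, bits):
    weights_gb = params_billions * bits / 8   # weights only
    gpus = math.ceil(weights_gb / H100_GB)    # raw memory requirement
    nodes = math.ceil(gpus / GPUS_PER_NODE)   # servers come in units of 8
    return weights_gb, gpus, nodes

print(plan(405, 16))  # (810.0, 11, 2) -> two nodes, so 16 GPUs in practice
print(plan(405, 8))   # (405.0, 6, 1)  -> one 8-GPU node, barely
```

The jump from eleven GPUs of raw requirement to sixteen in practice is the node-granularity tax: once the model spills past one server, you pay for the whole second server.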
After the final token is generated, a stop token or a length limit, postprocessing detokenizes the output back into text, applies any safety filtering, and closes the stream.
Most inference hardware is an adaptation. GPUs were designed for graphics and repurposed for AI. They work remarkably well, but they carry the architectural baggage of their original purpose. Groq asked a different question: what would a chip designed exclusively for inference look like?
The Groq Language Processing Unit is built around SRAM, the same fast on-chip memory that Flash Attention tries to exploit within a GPU. But where a GPU has about twenty megabytes of SRAM and eighty gigabytes of slower main memory, the Groq approach distributes SRAM across the entire chip with no main memory at all. The compute schedule is determined at compile time, not at runtime. There are no caches to miss, no memory hierarchies to navigate, no dynamic scheduling overhead. The chip knows exactly what data will be where at every clock cycle.
The result is deterministic latency and extraordinary speed. On a seventy billion parameter model, Groq achieves two hundred and eighty to three hundred tokens per second as a baseline and sixteen hundred and sixty with speculative decoding. Time to first token is two to three tenths of a second, which feels instant to a human. The tradeoff is flexibility. The static scheduling means the chip cannot easily adapt to variable workloads, and the SRAM-only architecture limits model size per chip. But for pure inference throughput on known model architectures, nothing else is close.
The opposite end of the spectrum from cloud data centers is on-device inference, running language models directly on the phone in your pocket with no server connection at all.
Apple introduced its on-device foundation model at their developer conference in twenty twenty-four. A model with roughly three billion parameters, trained with aggressive two-bit quantization-aware training, meaning the model was trained from the start knowing it would be compressed to two bits per weight rather than being compressed after the fact. The architecture splits the model into two blocks that share a key-value cache between them, reducing cache memory by over a third. On an iPhone fifteen Pro, the model generates thirty tokens per second with a time to first token of about zero point six milliseconds per prompt token. It supports fifteen languages and runs entirely on the phone's Neural Engine, a dedicated chip for AI workloads built into every modern Apple processor.
Qualcomm's Hexagon processor, the AI accelerator in most Android flagship phones, delivers up to seventy-three trillion operations per second, with prefill speeds ten times faster than running on the phone's main processor. Google has its Edge TPU. Every major mobile chip maker now builds dedicated AI silicon into their processors. The phone in your pocket has more inference capability than a data center GPU from a decade ago.
But it was Gerganov's llama.cpp that turned this hardware capability into something people could actually use. By writing inference code in pure C and C++ with no dependencies, optimized for the specific instruction sets of Apple Silicon, Intel, and ARM processors, he gave every hobbyist and developer a way to run capable language models without a cloud account. The GGUF format he created provides quantized model weights at multiple precision levels, letting users choose their own tradeoff between quality and speed. Meta's ExecuTorch framework, which reached version one in October twenty twenty-five with a fifty-kilobyte base footprint and support for twelve hardware backends, represents the industry catching up to what one engineer from Bulgaria had already demonstrated was possible.
There is a new dimension to inference cost that did not exist before September twenty twenty-four. Models like OpenAI's o1 and o3 generate internal reasoning tokens before producing a visible answer. These thinking tokens do not appear in the response, but they consume the same GPU resources as any other generated token. A model might produce five hundred invisible reasoning tokens before writing a single word the user sees.
Chain-of-thought enables the model to perform far more floating-point operations for each token of the answer.
The performance improvement follows a logarithmic curve. The first thousand thinking tokens help enormously. Going from ten thousand to eleven thousand helps less. There is a diminishing return, but the initial payoff is substantial enough that OpenAI, DeepSeek, and others are betting heavily on inference-time compute as a complement to larger models. DeepSeek R1, released in January twenty twenty-five, proved the approach could work in an open-source model.
The economic implication is significant. Thinking models cost more per query not because the visible output is longer but because the invisible reasoning chain might be many times longer than the response itself. OpenAI exposes a reasoning effort parameter that lets users choose between low, medium, and high thinking budgets, effectively letting the user trade money for quality on each request. This is the new frontier of inference economics: not just how much it costs to generate a token, but how many invisible tokens the model should generate before it starts producing visible ones.
This episode's term: latency.
If you text a friend, latency is how long you wait before you see the typing indicator appear. Not the time for the full message. Just the first sign that something is happening. That is latency in inference too: the gap between pressing send and seeing the first token of the response.
The marketing version: "ultra-low latency AI responses."
In practice, latency is determined by three things that fight each other. Prompt length, because longer prompts mean more computation during prefill before the first token can even begin generating. Model size, because more parameters means more data to process for every layer. And congestion, because when many users hit the system simultaneously, requests queue up waiting for a GPU to become available. The first token is the expensive one, not because it is special, but because it cannot start until the entire prompt has been processed. After that, tokens flow at a relatively steady rate regardless of how long the original prompt was. When a service feels slow, it is almost always latency, the wait for that first word, not generation speed. The first word is doing all the heavy lifting.
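The decomposition above can be written as a toy model of time to first token. The prefill rate and queue wait below are illustrative numbers, not measurements of any real system.

```python
# TTFT = queueing (congestion) + prefill (prompt length / hardware
# speed). Generation speed after the first token is a separate axis.

def time_to_first_token(prompt_tokens, prefill_tok_per_s, queue_wait_s):
    prefill_s = prompt_tokens / prefill_tok_per_s  # grows with prompt length
    return queue_wait_s + prefill_s

# Same hardware, same congestion: a 40x longer prompt dominates TTFT.
print(f"{time_to_first_token(200, 5000, 0.05):.2f} s")   # 0.09 s
print(f"{time_to_first_token(8000, 5000, 0.05):.2f} s")  # 1.65 s
```

Model size hides inside the prefill rate here: a bigger model processes fewer prompt tokens per second, stretching the same wait.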
That was the deep dive for episode eleven.