This is episode eleven of Actually, AI.
You type a question. A few seconds later, words begin appearing on your screen, one by one, like someone typing back to you from the other side of a very fast internet connection. The whole thing feels instant. Effortless. Like a search engine that speaks in paragraphs. It is not. What happens between your enter key and the first word appearing is one of the most computationally violent events in all of consumer technology. Your text travels to a data center, gets broken into tokens, flows through billions of mathematical operations distributed across dozens of specialized processors, and generates a response one piece at a time. Those words streaming onto your screen are not being retrieved from somewhere. They are being constructed, live, each one chosen from a probability distribution over tens of thousands of candidates, and the choice of each word changes the probability of every word that follows.
The typewriter effect is not a design choice. It is not an animation meant to look thoughtful. You are watching the machine think. Each word appears because the model just finished deciding, at that exact moment, that this particular word is the most likely next piece of the sequence. The technical name for this process is inference, and it is the single most expensive, most engineered, and least understood stage of what AI companies actually do.
Most AI research focuses on training, on making models smarter. Papers about training runs attract thousands of citations, earn conference best-paper awards, make careers. But once you have a brilliant model, someone has to actually run it for a hundred million users simultaneously, and that problem turns out to be at least as hard and far less glamorous.
Tri Dao was a PhD student at Stanford when he noticed something that the rest of the field had overlooked. The standard attention mechanism, the core operation inside every modern language model, was spending most of its time not on computation but on moving data back and forth between two levels of memory inside the GPU. The math was fast. The memory shuttling was slow. In twenty twenty-two, Dao published FlashAttention, a technique that reorganized the computation so it could happen in the GPU's tiny but fast on-chip memory without needing to store enormous intermediate results in the slower main memory. The speedup was two to four times. The memory savings were five to twenty times. The idea was so effective that within two years, essentially every major open-source language model used it. Dao is now an assistant professor at Princeton and co-founder of Together AI. His paper has been cited over twenty-nine thousand times.
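To make the trick concrete, here is a minimal NumPy sketch of the tiling idea at the heart of FlashAttention. This is not Dao's fused GPU kernel, just the same math: walk through the keys and values one small block at a time, carrying running softmax statistics, so the full attention score matrix never has to exist anywhere.

```python
import numpy as np

def tiled_attention(Q, K, V, block=64):
    """Attention computed one block of keys/values at a time using the
    online-softmax trick, so the full seq_len x seq_len score matrix is
    never materialized. A didactic sketch of the FlashAttention idea."""
    n, d = Q.shape
    scale = 1.0 / np.sqrt(d)
    out = np.zeros_like(Q)
    row_max = np.full(n, -np.inf)  # running max of scores per query
    row_sum = np.zeros(n)          # running softmax denominator per query
    for start in range(0, K.shape[0], block):
        Kb, Vb = K[start:start + block], V[start:start + block]
        scores = (Q @ Kb.T) * scale              # only an n x block tile
        new_max = np.maximum(row_max, scores.max(axis=1))
        fix = np.exp(row_max - new_max)          # rescale earlier partial sums
        probs = np.exp(scores - new_max[:, None])
        out = out * fix[:, None] + probs @ Vb
        row_sum = row_sum * fix + probs.sum(axis=1)
        row_max = new_max
    return out / row_sum[:, None]

def naive_attention(Q, K, V):
    s = Q @ K.T / np.sqrt(Q.shape[1])
    p = np.exp(s - s.max(axis=1, keepdims=True))
    return (p / p.sum(axis=1, keepdims=True)) @ V

rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((256, 64)) for _ in range(3))
assert np.allclose(tiled_attention(Q, K, V), naive_attention(Q, K, V))
```

The real kernel runs this loop inside the GPU's on-chip SRAM, which is where the speedup comes from; the NumPy version only demonstrates that the blockwise answer is exact.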
Then there was Georgi Gerganov, a software engineer from Bulgaria who asked a question nobody at the big labs was asking: what if you did not need a data center at all? In March twenty twenty-three, two weeks after Meta released the weights for its Llama model, Gerganov published llama.cpp, a pure C and C++ implementation that could run a seven billion parameter language model on a MacBook. No GPU required. No cloud server. No special hardware. He had already built a similar tool for speech recognition, whisper.cpp, and understood that quantization, reducing the precision of the model's numbers from sixteen bits down to four, could shrink models enough to fit on consumer devices while preserving most of their quality. His GGUF file format became the standard for distributing quantized models. Tools like Ollama and LM Studio, the apps that let anyone run AI locally, are all built on the foundation Gerganov created. In February twenty twenty-six, he and his team joined Hugging Face.
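Quantization sounds exotic, but the core move is small. Here is a rough Python sketch of blockwise four-bit quantization in the spirit of llama.cpp's Q4 formats; the actual GGUF encoding differs in its details, so treat this as the idea rather than the file format.

```python
import numpy as np

def quantize_4bit(weights, block=32):
    """Blockwise 4-bit quantization: each block of 32 weights is stored as
    one shared fp16 scale plus 32 signed 4-bit integers in [-8, 7]."""
    w = weights.reshape(-1, block)
    scale = np.abs(w).max(axis=1, keepdims=True) / 7.0
    scale[scale == 0] = 1.0  # guard against all-zero blocks
    q = np.clip(np.round(w / scale), -8, 7).astype(np.int8)
    return q, scale.astype(np.float16)

def dequantize_4bit(q, scale):
    return (q * scale.astype(np.float32)).reshape(-1)

rng = np.random.default_rng(0)
w = rng.standard_normal(4096).astype(np.float32)
q, scale = quantize_4bit(w)
err = np.abs(w - dequantize_4bit(q, scale)).max()
bits = (q.size * 4 + scale.size * 16) / w.size  # 4-bit weights + fp16 scales
print(f"{bits:.1f} bits per weight (down from 16), max error {err:.3f}")
```

At sixteen bits, a seven billion parameter model needs about fourteen gigabytes for weights alone; at four and a half bits it is under four gigabytes, which is the difference between needing a data center GPU and fitting in a laptop's memory.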
And at UC Berkeley, a PhD student named Woosuk Kwon was staring at the memory waste problem. When a language model generates text, it stores intermediate calculations called the key-value cache, and these caches were wasting sixty to eighty percent of allocated GPU memory through fragmentation. Kwon realized the problem looked exactly like one that operating systems had solved decades earlier: virtual memory. He built PagedAttention and the vLLM serving framework, reducing memory waste to under four percent and enabling up to twenty-four times higher throughput than previous approaches. The system now serves millions of requests daily across the industry.
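The bookkeeping behind that insight fits in a few lines. Below is a toy Python sketch of the paging scheme, with illustrative names rather than vLLM's actual internals: the cache is carved into fixed-size blocks, and each conversation holds a page table mapping its token positions to whatever physical blocks happen to be free.

```python
BLOCK_SIZE = 16  # tokens stored per physical cache block

class PagedKVCache:
    def __init__(self, num_blocks):
        self.free_blocks = list(range(num_blocks))
        self.page_tables = {}  # sequence id -> list of physical block ids

    def append_token(self, seq_id, position):
        """Return the physical slot for a token, allocating on demand."""
        table = self.page_tables.setdefault(seq_id, [])
        if position % BLOCK_SIZE == 0:            # current block is full
            table.append(self.free_blocks.pop())  # grab any free block
        block = table[position // BLOCK_SIZE]
        return block * BLOCK_SIZE + position % BLOCK_SIZE

    def release(self, seq_id):
        """Conversation over: hand every block back to the free pool."""
        self.free_blocks.extend(self.page_tables.pop(seq_id))

cache = PagedKVCache(num_blocks=1024)
for pos in range(40):                  # a 40-token sequence occupies only
    cache.append_token("user-1", pos)  # ceil(40 / 16) = 3 blocks
print(len(cache.page_tables["user-1"]), "blocks allocated")  # -> 3
cache.release("user-1")
```

Because blocks are handed out only as a conversation actually grows, a short chat no longer reserves memory sized for the longest possible one, and that is where the fragmentation savings come from.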
Here is what actually happens inside the machine when your message arrives.
The process has two distinct phases, and understanding them explains nearly every quirk of the experience. The first phase is called prefill. Your entire prompt, every token of it, gets processed simultaneously in one massive parallel computation. All the attention calculations happen at once, building up the model's understanding of what you asked. This phase is heavy on raw computation but efficient because everything runs in parallel. The result is a complete internal representation of your input, stored in that key-value cache.
Then the second phase begins, and everything changes. This is the decode phase, token generation. The model produces one token. It looks at the probability distribution over its entire vocabulary, roughly a hundred thousand possible next pieces, picks one, adds it to the sequence, and feeds it back through the network to produce the next token. One at a time. Serially. There is no way around this, because each token depends on every token that came before it. The model cannot generate the fifth word until it has chosen the fourth.
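In code, the whole two-phase process is a surprisingly small loop. Here is a schematic Python version; the control flow is the real shape of inference, but the model is a toy placeholder, so the tokens it emits are meaningless.

```python
import numpy as np

def generate(model, prompt_tokens, max_new_tokens, temperature=1.0):
    # Phase 1, prefill: the whole prompt goes through in one parallel pass,
    # building the key-value cache and yielding logits for the next token.
    logits, kv_cache = model.forward(prompt_tokens, kv_cache=None)
    tokens, rng = list(prompt_tokens), np.random.default_rng(0)
    for _ in range(max_new_tokens):
        # Phase 2, decode: turn logits into a probability distribution over
        # the whole vocabulary and sample a single token from it.
        z = (logits - logits.max()) / temperature
        probs = np.exp(z) / np.exp(z).sum()
        next_token = int(rng.choice(len(probs), p=probs))
        tokens.append(next_token)
        # Feed only the new token back in; the cache remembers the rest.
        logits, kv_cache = model.forward([next_token], kv_cache=kv_cache)
    return tokens

class ToyModel:
    """Placeholder for a transformer forward pass: random logits, plus a
    'cache' that is just the list of tokens seen so far."""
    def forward(self, tokens, kv_cache=None):
        cache = (kv_cache or []) + list(tokens)
        return np.random.default_rng(len(cache)).standard_normal(100), cache

print(generate(ToyModel(), prompt_tokens=[1, 2, 3], max_new_tokens=5))
```

Every visible word is one trip around that loop, and the loop cannot be parallelized across time, because the fifth token's distribution depends on which fourth token was chosen.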
This is why your prompt length affects how long you wait before the first word appears. A longer prompt means a bigger prefill computation. But once that first word starts streaming, the speed of generation is roughly constant regardless of how long your original question was. The first word is the expensive one.
The key-value cache is what makes this bearable. Without it, generating the hundredth token would require reprocessing all ninety-nine previous tokens from scratch. With the cache, the model only computes the new token's interactions against the stored representations of everything before it. The cache turns what would be a quadratic explosion into a linear process. It is also why long conversations eat memory. A seventy billion parameter model with a hundred and twenty-eight thousand token context can consume forty gigabytes of GPU memory just for the cache, and across a batch of concurrent conversations the caches can take more memory than the model's own weights.
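That forty-gigabyte figure is easy to sanity-check. The architectural numbers below are assumptions matching a Llama-3-70B-style configuration with grouped-query attention, not measurements:

```python
# Back-of-envelope KV-cache size for one long conversation.
layers, kv_heads, head_dim = 80, 8, 128    # assumed 70B-class architecture
context_len, bytes_per_value = 128_000, 2  # 128k tokens, fp16 values
kv_tensors = 2                             # one key and one value per layer

cache_bytes = (layers * kv_heads * head_dim *
               kv_tensors * bytes_per_value * context_len)
print(f"{cache_bytes / 1e9:.0f} GB of cache for a single 128k-token chat")
# -> roughly 40 GB, before counting any other concurrent user
```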
Every inference request costs money. Real money. An NVIDIA H100 GPU, the workhorse of most AI data centers, costs between twenty-five and forty thousand dollars. A full server with eight of them runs between two hundred thousand and four hundred thousand dollars. Cloud rental runs two to five dollars per GPU per hour. And a large model does not fit on one GPU. Llama three point one at four hundred and five billion parameters needs a minimum of sixteen H100s, two full servers, just to load into memory.
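The sixteen-GPU minimum falls straight out of the arithmetic. A quick sketch, assuming sixteen-bit weights and eighty gigabytes of memory per H100:

```python
# Why Llama 3.1 405B cannot fit on one eight-GPU server.
params = 405e9
weight_gb = params * 2 / 1e9   # fp16, 2 bytes per parameter -> 810 GB
h100_gb = 80                   # memory per H100
server_gb = 8 * h100_gb        # 640 GB: one server is not enough
print(f"weights: {weight_gb:.0f} GB, one server: {server_gb} GB, "
      f"GPUs for weights alone: {weight_gb / h100_gb:.1f}")
# -> 810 GB of weights needs more than ten GPUs before any KV cache or
#    activations, so in practice two full servers: sixteen H100s.
```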
This is why AI companies charge by the token. When GPT-3 was available through the API in twenty twenty, the price was sixty dollars per million tokens. By the time GPT-4o Mini arrived in mid twenty twenty-four, it had dropped to fifteen cents per million tokens. A four-hundred-fold reduction in four years. That crash came from a convergence of improvements: better hardware, quantization squeezing models into fewer bits, continuous batching letting GPUs serve dozens of users simultaneously instead of one at a time, and smaller models achieving what only massive ones could before.
But here is the part that rarely makes the headlines. Training a frontier model is a one-time expense. GPT-4 reportedly cost between seventy-eight and one hundred and ninety-two million dollars to train. That is a staggering number. But OpenAI spent an estimated one point eight billion dollars on inference in twenty twenty-four alone, and that was just one year of running the model for users. Anthropic burns two point seven million dollars per day serving Claude. Over a model's operational lifetime, the cost of running it for people dwarfs the cost of creating it in the first place. Training builds the brain. Inference is the electricity bill that never stops.
This is also why "thinking" models like o1 cost more. When those models reason internally before answering, they are generating hundreds or thousands of invisible tokens, reasoning tokens that the user never sees but the GPU still has to produce one at a time. Every thinking token is another iteration of the decode loop, another forward pass through billions of parameters, another fraction of a cent on the bill.
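A quick back-of-envelope shows how fast that adds up. The prices and token counts here are illustrative assumptions, not any provider's actual numbers:

```python
# Cost of one answer from a reasoning model, counting the hidden tokens.
price_per_million_output = 60.0  # dollars; assumed reasoning-class rate
visible_tokens = 300             # the answer the user reads
reasoning_tokens = 5_000         # generated and billed, but never shown

cost = (visible_tokens + reasoning_tokens) * price_per_million_output / 1e6
hidden_share = reasoning_tokens / (visible_tokens + reasoning_tokens)
print(f"${cost:.2f} per answer, {hidden_share:.0%} of it spent "
      "on tokens you never see")
```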
Inference is where every concept in this series converges into a single moment. The tokens from episode one are what the model reads during prefill. The attention mechanism from episode four is the core operation inside each transformer layer, repeated dozens of times per token. The weights that attention operates on were shaped by training in episode three. The cost of running the model scales with the number of parameters from episode seven and the length of the context window from episode ten.
And there is a tension underneath all of it that the field is still grappling with. Researchers build bigger, smarter models. Inference engineers figure out how to actually run them for real people at real scale. The researchers make the models bigger. The engineers optimize harder. The users send more messages. It is an arms race between ambition and economics, and right now, economics is winning. The four-hundred-fold price crash was not a gift. It was a necessity. Without it, the models the researchers built would be too expensive for anyone to use.
That was episode eleven. The deep dive goes further into the full inference stack, from speculative decoding to GPU economics to the engineer who figured out how to run a language model on a phone. Find it right after this in your feed.