This is episode ten of Actually, AI.
You paste a twenty-page contract into your AI assistant and ask about a clause on page fourteen. The answer comes back confident, specific, and subtly wrong. It has mixed up a definition from page two with a condition from page fourteen, stitching them together into a sentence that sounds correct but is not. You think: the AI forgot what was on page fourteen. It lost track. Its memory failed.
None of that is what happened. The model did not forget, because it never remembered in the first place. Not the way you remember. When you read a long document, you build a mental model as you go, paragraph by paragraph, updating your understanding, occasionally flipping back to check something. The AI does none of that. It processes every token in the document simultaneously, in a single pass, with no concept of reading order and no mechanism for flipping back. It sees everything at once. And yet it still gets page fourteen wrong.
The reason is architectural, not metaphorical. The attention mechanism, which we covered in episode four, computes relationships between every pair of tokens in the input. But the way those relationships are weighted creates a pattern. Tokens near the beginning of the input and tokens near the end receive stronger attention. Tokens in the middle receive weaker attention. Not because the model is skimming. Because the mathematics of how position is encoded and how attention scores distribute create a systematic bias toward the edges. Your page-fourteen clause is not forgotten. It is underweighted. The model processed it. It just processed it with less focus than the paragraph at the top and the paragraph at the bottom.
The word "memory" suggests a filing cabinet where things are stored and retrieved. The context window is nothing like that. It is more like a spotlight that illuminates everything in the room simultaneously but shines brightest near the door where you entered and the window where you will leave.
In the summer of twenty twenty-three, a PhD student at Stanford named Nelson Liu set out to measure exactly how much of a long context models actually use. Liu had an undergraduate degree in computer science and linguistics from the University of Washington and had already published over twenty-five papers in top venues by the time he started this particular experiment. He was working with Percy Liang, one of the most respected names in natural language processing, in the Stanford NLP group.
The experiment was elegant in its simplicity. Take a question-answering task. Scatter twenty documents into the context, only one of which contains the answer. Then move that answer document around, placing it at position one, then position five, then position ten, then position fifteen, then position twenty. Ask the model the same question each time. See how accuracy changes depending on where the answer was hiding.
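The setup just described can be sketched in a few lines. This is a hypothetical reconstruction, not Liu's actual code: a helper that drops the answer document at a chosen position among distractors, so the same question can be asked with the answer at position one, ten, or twenty.

```python
# Hypothetical sketch of the "Lost in the Middle" setup: nineteen
# distractor documents plus one answer document, whose position varies.
def build_prompt(question, answer_doc, distractors, position):
    """Place answer_doc at `position` (1-indexed) among the distractors."""
    docs = list(distractors)
    docs.insert(position - 1, answer_doc)
    numbered = "\n\n".join(
        f"Document [{i}]: {d}" for i, d in enumerate(docs, start=1)
    )
    return f"{numbered}\n\nQuestion: {question}"

distractors = [f"Unrelated passage number {i}." for i in range(1, 20)]
answer = "The clause on page fourteen caps liability at one million dollars."
question = "What does the clause on page fourteen cap liability at?"

# Same question, same twenty documents; only the answer's position moves.
prompts = {pos: build_prompt(question, answer, distractors, pos)
           for pos in (1, 5, 10, 15, 20)}
```

Everything is held constant except where the answer sits, which isolates position as the only variable.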
What Liu and his co-authors found was a curve shaped like the letter U. When the answer sat at position one, the beginning, models got it right about seventy-five percent of the time. When it sat at position twenty, the end, accuracy was about seventy-two percent. But when the answer sat at position ten, the dead center, accuracy dropped to fifty-five percent. A gap of twenty percentage points between the best and worst positions. They called the paper "Lost in the Middle."
The most disturbing part was the universality. They tested GPT-3.5 Turbo, GPT-4, Claude, Llama 2, and several others. Every single model showed the same U-shaped curve. This was not a bug in one company's product. It was a property of the architecture itself. Every transformer-based model, which is to say every major language model in existence, has this blind spot in the middle.
The paper's own summary puts it plainly: models are most effective at using relevant information that occurs at the very beginning or the very end of the input context, and performance degrades significantly when they must access relevant information in the middle of long contexts.
Think about what that means for how you use these tools. When you paste a long document and ask a question, the model is not searching the document equally. It is reading the edges carefully and the middle with something like inattention. The practical advice that emerged from Liu's work is almost comically simple: put the important stuff first.
So what actually is a context window? Strip away the metaphors and it is a number. The maximum count of tokens the model can process in a single forward pass. GPT-1, released in twenty eighteen, had a window of five hundred and twelve tokens, roughly a page and a half of text. GPT-3, two years later, had two thousand forty-eight. Claude's first version in early twenty twenty-three had nine thousand. Then in May of that year, Anthropic jumped to one hundred thousand. By February twenty twenty-four, Google's Gemini had reached one million. As of early twenty twenty-six, some models claim ten million.
Every one of those jumps required genuine engineering work, not just bigger hardware. The reason is the quadratic wall. Attention computes relationships between every pair of tokens. Double the context length and you do not double the computation. You quadruple it. Go from four thousand tokens to one hundred twenty-eight thousand, a factor of thirty-two, and the attention cost increases by thirty-two squared, which is one thousand and twenty-four times more computation. That is not a scaling challenge you solve by buying more graphics cards. It is a mathematical wall that requires architectural breakthroughs to get around.
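The arithmetic in that paragraph is simple enough to check directly. Because attention compares every token with every other token, cost grows with the square of context length:

```python
# Attention compares every pair of tokens, so compute grows with the
# square of the context length.
def attention_cost_ratio(old_len, new_len):
    """How much more pairwise-attention work a longer context needs."""
    return (new_len / old_len) ** 2

# Doubling the context quadruples the work.
assert attention_cost_ratio(4_000, 8_000) == 4.0

# 4k -> 128k is a 32x length increase but a 1,024x compute increase.
assert attention_cost_ratio(4_000, 128_000) == 1024.0
```

That thousandfold blowup for a thirty-two-fold length increase is the quadratic wall in one line of arithmetic.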
The breakthroughs came from two directions. First, researchers found cleverer ways to tell the model where each token is positioned. The original transformer used a fixed mathematical formula involving sine and cosine waves. Modern models use a technique called rotary position embedding, invented by a researcher named Jianlin Su at a Shenzhen startup, which encodes position through rotation rather than addition. This turns out to be far more graceful at extending to longer sequences than the original approach. Su published the idea on his personal blog in Chinese in twenty twenty-one. Within two years, every major open-source language model on the planet was using it.
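The rotation idea can be shown in miniature. Real rotary embeddings rotate many two-dimensional slices of high-dimensional query and key vectors; this toy version uses a single 2-D pair to show the key property, which is that after rotating each vector by an angle proportional to its position, the dot product between a query and a key depends only on their relative offset, not their absolute positions:

```python
import math

def rotate(vec, pos, theta=1.0):
    """Rotate a 2-D vector by an angle proportional to its position."""
    x, y = vec
    a = pos * theta
    return (x * math.cos(a) - y * math.sin(a),
            x * math.sin(a) + y * math.cos(a))

def dot(u, v):
    return u[0] * v[0] + u[1] * v[1]

q, k = (1.0, 0.5), (0.3, 0.8)

# Score between a query at position 3 and a key at position 1...
s1 = dot(rotate(q, 3), rotate(k, 1))
# ...equals the score between positions 10 and 8: only the relative
# offset of 2 matters, not where in the sequence the pair sits.
s2 = dot(rotate(q, 10), rotate(k, 8))
assert abs(s1 - s2) < 1e-9
```

That relative-position property is what makes the scheme degrade gracefully as sequences grow, where the original additive encodings did not.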
Second, a Stanford PhD student named Tri Dao figured out that the attention computation was slow not because of the math but because of how the math interacted with the physical layout of memory on a graphics card. His algorithm, Flash Attention, does exactly the same computation as standard attention but reorganizes it so the data stays in the fast memory on the chip instead of bouncing back and forth to the slower main memory. The result was exact attention, not an approximation, running seven times faster and using a fraction of the memory. Suddenly, longer contexts were not just theoretically possible. They were practical.
Here is where the consequence of all this becomes personal. You are paying for context, whether you think about it or not. Every token in your conversation, your question, the system instructions, the entire conversation history, gets re-processed every time the model generates a response. A hundred-turn conversation at the full context limit can process over twelve million tokens across the session, even if you only typed a few thousand words yourself. Every one of those tokens costs money, and the price is not trivial. A full million-token conversation with a frontier model can cost five dollars or more in input tokens alone.
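The twelve-million-token figure falls out of simple accumulation. Under the simplifying assumption that each turn adds a fixed number of new tokens and the full history is re-sent every turn, the input tokens processed across a session grow roughly quadratically with the number of turns:

```python
# Rough cost model, assuming a fixed number of new tokens per turn
# and the entire conversation history re-processed on every turn.
def session_input_tokens(turns, tokens_per_turn):
    history = 0
    total = 0
    for _ in range(turns):
        history += tokens_per_turn   # new question, answer, instructions
        total += history             # the full history is re-sent
    return total

# A hundred turns at roughly 2,400 tokens of new content each:
total = session_input_tokens(100, 2_400)
print(total)  # 12,120,000 input tokens processed across the session
```

The per-turn figure here is an illustrative assumption, but the shape of the result is not: the total is dominated by re-processing, not by what you actually typed.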
This is why a technique called retrieval augmented generation, or RAG, often outperforms simply pasting everything into the context window. Instead of feeding the model your entire codebase, a retrieval system searches for the five or ten most relevant files and feeds only those. The model gets less total information but more relevant information, concentrated at the beginning of the context where attention is strongest. Research from twenty twenty-four found that for sixty percent of queries, RAG and long context produce identical answers, but RAG does it at a fraction of the cost. For the remaining forty percent, which approach wins depends on whether the task requires synthesis across the whole document or pinpoint retrieval of a specific fact.
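The retrieval step can be sketched with a toy scorer. Production systems rank documents by embedding similarity, not word overlap, but the shape is the same: score every document against the query, keep the top few, and put them at the front of the context where attention is strongest.

```python
# Toy retrieval step: rank documents by term overlap with the query
# and keep only the top k. Real RAG systems use embedding similarity,
# but the select-then-concentrate pattern is the same.
def retrieve(query, documents, k=3):
    q_terms = set(query.lower().split())
    scored = sorted(
        documents,
        key=lambda d: len(q_terms & set(d.lower().split())),
        reverse=True,
    )
    return scored[:k]

docs = [
    "Payment terms are net thirty days.",
    "The liability clause caps damages at one million dollars.",
    "This agreement is governed by Delaware law.",
    "Either party may terminate with sixty days notice.",
]
top = retrieve("what does the liability clause cap damages at", docs, k=2)
# top[0] is the liability clause: two relevant documents go to the
# model instead of the whole contract.
```

Less total context, more relevant context, and the relevant part lands at the position where the U-shaped curve says attention is strongest.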
And then there is the discovery that scrambled how everyone thought about context quality. In twenty twenty-five, a research team at Chroma tested eighteen frontier models and found something that should not have been true. Models performed better when the irrelevant context documents were shuffled into random, incoherent order than when those documents were logically structured. Coherent text was a worse haystack than random noise. The implication is striking. When you feed a model a well-organized document, its attention follows the narrative structure, getting pulled along by the logic and coherence of the prose. The model becomes a reader instead of a retrieval engine. Structure, the very quality that makes documents useful to humans, becomes a distraction for the machine.
The context window sits at the intersection of almost everything in this series. It is measured in the tokens we discussed in episode one. It is bounded by the quadratic cost of the attention mechanism from episode four. In the next episode, episode eleven, on inference, we will walk through the KV cache, the data structure that stores previous tokens during generation and becomes the single largest memory cost when contexts get long. And in episode twelve, on benchmarks, we will see how tests like needle in a haystack try to measure whether models actually use their full context window, and how those tests can be both revealing and misleading.
The honest picture is this. Context windows have grown a thousandfold in three years, from two thousand tokens to two million and beyond. The engineering is real and impressive. But the model's ability to use that context has not grown at the same rate. A million-token window does not mean a million tokens of equal attention. The edges still get priority. The middle still suffers. And the cost still scales. The race to extend context windows is far from won. The frontier is not just making the window bigger. It is making every token inside it count equally.
If you want to know the engineering underneath, how positional encodings actually work, why a Shenzhen researcher's blog post ended up inside every major AI model, and what happens when you hide a sentence about sandwiches inside two hundred thousand tokens of Paul Graham essays, the deep dive is waiting in your feed. That was episode ten.