Actually, AI
Context Windows Deep Dive: Inside the Million-Token Machine
23m · Apr 04, 2026
Claude Opus 4 attends to 200,000 tokens at once, but here's the problem: transformers have no inherent idea what order those tokens are in. Researchers solved it with sine waves, and the memory trick that keeps generation fast now consumes 40GB per conversation.

The Position Problem

This is the deep-dive companion to episode ten of Actually, AI, on context windows.

In the main episode, we said that transformers process all tokens simultaneously and have no inherent sense of order. That deserves unpacking, because it creates a problem so fundamental that the entire context window story flows from how different researchers chose to solve it. A transformer without positional information treats "the cat sat on the mat" and "mat the on sat cat the" as identical inputs. Every token attends to every other token, but the attention mechanism has no way to know which tokens came first, which came last, or which are neighbors. Order is invisible.

The original twenty seventeen transformer paper by Vaswani and colleagues solved this with a mathematical trick. They generated a unique pattern for each position using sine and cosine waves at different frequencies, one pattern for position one, a different pattern for position two, all the way up. Then they added these patterns directly to the token embeddings before feeding them into the model. Position one gets a fingerprint. Position five hundred gets a different fingerprint. The model learns to read these fingerprints the way you learn to read page numbers.

As the authors put it, they hypothesized the scheme would allow the model to easily learn to attend by relative positions, since for any fixed offset k, the positional encoding at position pos plus k can be represented as a linear function of the encoding at position pos.
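To make that concrete, here is a minimal NumPy sketch of the sinusoidal scheme, assuming the standard base of ten thousand from the paper; the function and variable names are ours, not the paper's.

```python
import numpy as np

def sinusoidal_positions(num_positions: int, dim: int) -> np.ndarray:
    """Fixed sinusoidal position patterns from the 2017 transformer paper.

    Even dimensions use sine, odd dimensions use cosine, at geometrically
    spaced frequencies, so every position gets a unique fingerprint.
    """
    positions = np.arange(num_positions)[:, None]        # shape (positions, 1)
    pair_index = np.arange(0, dim, 2)[None, :]           # shape (1, dim/2)
    angles = positions / (10000 ** (pair_index / dim))   # shape (positions, dim/2)

    encoding = np.zeros((num_positions, dim))
    encoding[:, 0::2] = np.sin(angles)
    encoding[:, 1::2] = np.cos(angles)
    return encoding

# The fingerprints are simply added to the token embeddings before the
# first layer; nothing about them is learned.
pe = sinusoidal_positions(num_positions=512, dim=64)
token_embeddings = np.random.randn(512, 64)   # stand-in for real embeddings
model_input = token_embeddings + pe
```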

The approach was elegant and needed no training. But it had a ceiling. The sine waves were designed for a specific maximum sequence length. In theory, they could extrapolate beyond it. In practice, they could not, at least not reliably. When BERT and GPT-2 came along, both switched to a simpler method: just learn a position embedding for each slot during training. Position one gets its own trained vector. Position five hundred and twelve gets its own trained vector. This worked well within the trained range but created a hard wall. Position five hundred and thirteen was unknown territory, because the model had never seen a trained embedding for it. You could not extend the context without retraining.

Both approaches share a deeper limitation. They encode absolute position, telling the model this token is at position forty-seven. But what matters for understanding language is usually relative position, how far apart two tokens are, not where they sit in the absolute sequence. The word "it" in a sentence needs to know that "the cat" is three tokens back. Whether that happens at position ten or position ten thousand is irrelevant.

The Blog Post That Changed Everything

In early twenty twenty-one, a researcher named Jianlin Su was working at Zhuiyi Technology, a natural language processing company in Shenzhen. Su had a master's degree in mathematics from Sun Yat-sen University in Guangzhou and a deep interest in the geometry of neural networks. He was not at Google. He was not at a major American research lab. He was writing technical blog posts on his personal site, kexue.fm, which translates to Scientific Spaces.

Su's idea started from a geometric intuition. Instead of adding a position signal to the token embedding, what if you rotated it? Represent each pair of dimensions in the embedding as a point on a circle. Then rotate that point by an angle proportional to its position. Token at position one gets a small rotation. Token at position one hundred gets a larger rotation. The attention score between two tokens then depends naturally on the angle between their rotations, which is determined solely by how far apart they are, not by their absolute positions. Relative position falls out of the math automatically.
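Here is a small sketch of that rotation in NumPy. It is not the optimized form real implementations use, just the geometric idea: rotate each pair of dimensions by an angle proportional to position, and the score between a query and a key ends up depending only on their offset.

```python
import numpy as np

def rotate_pairs(x: np.ndarray, position: int, base: float = 10000.0) -> np.ndarray:
    """Rotate each (even, odd) pair of dimensions by an angle proportional
    to `position`. This is the core move of rotary position embedding."""
    dim = x.shape[-1]
    # One frequency per pair of dimensions, geometrically spaced.
    freqs = 1.0 / (base ** (np.arange(0, dim, 2) / dim))
    angles = position * freqs
    cos, sin = np.cos(angles), np.sin(angles)

    x_even, x_odd = x[0::2], x[1::2]
    rotated = np.empty_like(x)
    rotated[0::2] = x_even * cos - x_odd * sin
    rotated[1::2] = x_even * sin + x_odd * cos
    return rotated

# The attention score between a rotated query and a rotated key depends
# only on how far apart the two tokens are, not on where they sit.
rng = np.random.default_rng(0)
q, k = rng.standard_normal(64), rng.standard_normal(64)

score_early = rotate_pairs(q, 10) @ rotate_pairs(k, 7)        # offset 3, early in the text
score_late  = rotate_pairs(q, 1010) @ rotate_pairs(k, 1007)   # offset 3, a thousand tokens later
print(np.isclose(score_early, score_late))                     # True
```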

He called it Rotary Position Embedding, or RoPE. The paper title was characteristically understated: "RoFormer: Enhanced Transformer with Rotary Position Embedding." It appeared on his blog first, in Chinese, as part of a series he called "Path to Transformer Upgrades." The arXiv preprint followed in April twenty twenty-one.

What happened next is one of those quiet inflection points that only becomes visible in retrospect. EleutherAI, the open-source research collective, independently discovered Su's work while building GPT-NeoX. They validated his results and found that RoPE gave roughly thirty percent faster convergence in their one hundred fifty million parameter models and ten to twenty percent better validation loss in their one point four billion parameter models compared to the previous state of the art for relative position encoding.

Then in February twenty twenty-three, Meta released Llama. It used RoPE. That was the tipping point. GPT-NeoX and Google's PaLM had already adopted it, and within a year of Llama virtually every major open-source language model used it, Mistral and dozens of others among them. A Chinese NLP researcher at a Shenzhen startup, publishing on his personal blog, had written the positional encoding that now runs inside every frontier model on earth. His Google Scholar page lists over twelve thousand five hundred citations. The blog post series that introduced RoPE has become a primary reference in the field.

The reason RoPE won is not just that it encodes relative position naturally. It also enables context extension after the fact. Because the rotation angles are proportional to position, you can scale those angles down to squeeze longer sequences into the model's trained range. A family of techniques, Position Interpolation, NTK-aware scaling, and YaRN, all exploit this property of RoPE to extend context windows without full retraining. YaRN, published in September twenty twenty-three, extended a Llama seven billion parameter model from four thousand to sixty-five thousand tokens with ten times fewer training tokens than previous methods. Without RoPE's rotational structure, none of these extensions would work.
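A sketch of the simplest member of that family, Position Interpolation, using the same assumed frequency scheme as the rotation sketch above: shrink every position by the ratio of trained length to target length, so the extended sequence re-uses angles the model has already seen. NTK-aware scaling and YaRN refine this by treating different frequencies differently rather than scaling everything uniformly.

```python
import numpy as np

def rope_angles(position: float, dim: int, base: float = 10000.0) -> np.ndarray:
    """Rotation angle for each dimension pair at a given position."""
    freqs = 1.0 / (base ** (np.arange(0, dim, 2) / dim))
    return position * freqs

def interpolated_angles(position: float, dim: int,
                        trained_len: int = 4096, target_len: int = 65536) -> np.ndarray:
    """Position Interpolation: scale positions down so a longer sequence
    re-uses the angle range the model saw during training."""
    scale = trained_len / target_len        # e.g. 4k trained / 64k target = 1/16
    return rope_angles(position * scale, dim)

# The last position of the extended window lands on the same angles the model
# saw at the edge of its trained range, so no full retraining is needed.
print(np.allclose(interpolated_angles(65536, dim=64), rope_angles(4096, dim=64)))  # True
```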

There is a competitor worth mentioning. In August twenty twenty-one, a few months after Su's paper, Ofir Press and colleagues published ALiBi, short for Attention with Linear Biases. ALiBi takes a completely different approach. It does not encode position in the embeddings at all. Instead, it adds a penalty to the attention scores that grows linearly with distance. Tokens far apart get an attention discount. The effect is similar to RoPE, encoding relative position, but the mechanism is additive rather than rotational. ALiBi trained eleven percent faster and used eleven percent less memory. It powered the BLOOM model. But RoPE won the adoption race, largely because its extension properties turned out to be more flexible. Meta's Llama 4, released in twenty twenty-five, uses a variant called iRoPE, interleaved RoPE, which alternates between layers with and without positional encoding, pushing the context window to ten million tokens.
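For comparison, here is a sketch of ALiBi's bias term, assuming a power-of-two head count so the per-head slopes follow the paper's simple geometric rule; in a real model the result is added to the raw attention scores before the softmax.

```python
import numpy as np

def alibi_bias(seq_len: int, num_heads: int) -> np.ndarray:
    """Additive attention bias: a penalty that grows linearly with distance.

    Each head gets its own slope; for a power-of-two head count the slopes
    form the geometric sequence described in the ALiBi paper.
    """
    slopes = np.array([2 ** (-8 * (h + 1) / num_heads) for h in range(num_heads)])
    positions = np.arange(seq_len)
    # distance[i, j] = how far key j sits behind query i (0 on the diagonal)
    distance = positions[:, None] - positions[None, :]
    distance = np.tril(distance)           # causal: future positions are masked anyway
    return -slopes[:, None, None] * distance[None, :, :]

# scores = q @ k.T / sqrt(d) + alibi_bias(seq_len, num_heads)[head]
bias = alibi_bias(seq_len=6, num_heads=8)
print(bias[0])   # head 0: zero on the diagonal, a growing discount with distance
```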

The Cache That Ate the GPU

During text generation, the model produces one token at a time. To generate each new token, it needs to compute attention over every previous token. Doing that from scratch every time would mean reprocessing the entire conversation for every single word. The KV cache avoids this. As each token is generated, the model stores its key and value vectors, the data that other tokens need in order to attend to it. The next token only needs to compute its own query, then look up the cached keys and values from all previous tokens. Generation goes from quadratic to merely linear per step.
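If you want to see the bookkeeping, here is a toy single-head sketch with random vectors standing in for the real projections: each step appends one key and one value to the cache, then attends over everything stored so far.

```python
import numpy as np

def attend(query, keys, values):
    """Single-head attention for one new token against all cached tokens."""
    scores = keys @ query / np.sqrt(query.shape[-1])
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ values

rng = np.random.default_rng(0)
dim = 64
key_cache, value_cache = [], []   # grows by one entry per generated token

for step in range(100):
    # In a real model these come from projecting the current token's hidden state.
    q, k, v = rng.standard_normal((3, dim))

    # Store this token's key and value so future tokens can attend to it...
    key_cache.append(k)
    value_cache.append(v)

    # ...then attend over everything cached so far instead of reprocessing
    # the whole sequence. Work per step is linear in the cache size.
    output = attend(q, np.stack(key_cache), np.stack(value_cache))
```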

The cost is memory. The KV cache grows with every token generated. For a large model like Llama 3 at seventy billion parameters, a single request at four thousand tokens already uses roughly ten gigabytes of KV cache if nothing is shared across attention heads. Push that to one hundred twenty-eight thousand tokens and the cache balloons to around forty gigabytes, and that is with an optimization called grouped query attention, which shares keys and values across heads and cuts the cache by a factor of eight. Without it, the same one-hundred-twenty-eight-thousand-token request would require three hundred and twenty gigabytes, more than any single graphics card can hold.

To put that in perspective, for the Llama seven billion parameter model, the crossover point where the KV cache exceeds the memory of the model itself occurs at roughly twenty-six thousand seven hundred tokens. Beyond that point, storing the conversation history takes more space than storing the brain. For larger models at longer contexts, the cache dominates everything. Four concurrent requests at one hundred twenty-eight thousand tokens on a Llama seventy billion model would require one hundred sixty gigabytes of cache memory, exceeding even Nvidia's H200 cards at one hundred forty-one gigabytes.
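Those figures are back-of-the-envelope arithmetic. The sketch below reproduces them under assumed shapes (fp16 values, an eighty-layer seventy-billion-parameter model with one-hundred-twenty-eight-dimensional heads, eight key-value heads with grouped query attention versus sixty-four without); the exact numbers shift with the assumptions.

```python
def kv_cache_gib(tokens: int, layers: int, kv_heads: int,
                 head_dim: int, bytes_per_value: int = 2) -> float:
    """Key and value storage across all layers, in GiB (fp16 by default)."""
    per_token = 2 * layers * kv_heads * head_dim * bytes_per_value   # 2 = key + value
    return tokens * per_token / 2**30

# Assumed Llama-3-70B-like shape: 80 layers, 128-dim heads.
print(kv_cache_gib(128_000, layers=80, kv_heads=8,  head_dim=128))   # ~39 GiB with GQA
print(kv_cache_gib(128_000, layers=80, kv_heads=64, head_dim=128))   # ~312 GiB without

# Crossover for a 7B-class model (32 layers, 32 heads of dimension 128, fp16):
weights_gib = 7e9 * 2 / 2**30                 # ~13 GiB of parameters
per_token_gib = 2 * 32 * 32 * 128 * 2 / 2**30
print(weights_gib / per_token_gib)            # ~26,700 tokens before the cache outweighs the weights
```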

This is why context length is not free. Every additional token you send or receive adds to the cache. Every token in the cache must be attended to for every new token generated. The KV cache is the single largest memory bottleneck in serving language models, and the primary reason why long conversations and long documents cost more than short ones. Episode eleven, on inference, will dig into this further.

The Needle and the Haystack

In November twenty twenty-three, Greg Kamradt had an idea. Kamradt was the CEO of an AI education company called Leverage, a former Director of Growth at Salesforce, and by his own description, "a data dude." He wanted a simple, visual way to test whether models really used their full context windows. His test was almost childishly straightforward. Take a random, out-of-place fact, the needle. Bury it at various depths inside a large body of unrelated text, the haystack. Then ask the model about the needle and see if it finds it.

The needle was a sentence about the best thing to do in San Francisco being to eat a sandwich and sit in Dolores Park on a sunny day. The haystack was a collection of Paul Graham essays. Kamradt varied two things: the total length of the haystack and the depth at which the needle was buried. The results, plotted as heatmaps, went viral on X. Two and a half million views. The visualizations were beautiful and damning. They showed exactly where each model's attention failed.
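The harness itself is only a few lines. The sketch below uses a crude word count in place of real tokenization and placeholder filler text; the real test swept context length and needle depth over a grid, asked the question for each cell, and plotted retrieval accuracy as a heatmap.

```python
def build_haystack(filler: str, needle: str, depth: float, length_words: int) -> str:
    """Bury `needle` at a fractional `depth` inside `filler` text truncated
    to roughly `length_words` words (a crude stand-in for real tokenization)."""
    words = filler.split()[:length_words]
    cut = int(len(words) * depth)
    return " ".join(words[:cut] + [needle] + words[cut:])

needle = ("The best thing to do in San Francisco is eat a sandwich "
          "and sit in Dolores Park on a sunny day.")
filler = "word " * 200_000     # in the real test: concatenated Paul Graham essays
prompt = build_haystack(filler, needle, depth=0.5, length_words=100_000)
question = "What is the best thing to do in San Francisco?"
# Send prompt + question to the model for each (length, depth) cell and check
# whether the reply mentions the sandwich and Dolores Park.
```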

GPT-4's one hundred twenty-eight thousand token window showed degradation as context grew, particularly for needles buried in the middle, confirming the U-shaped lost-in-the-middle curve from Nelson Liu's study from a different angle. But the real story was Claude 2.1. Kamradt tested it on launch day, November twenty-first, twenty twenty-three, and found an overall retrieval accuracy of just twenty-seven percent. Not twenty-seven percent in the middle. Twenty-seven percent overall. A model marketed with a two hundred thousand token context window could not reliably find a single sentence.

What happened next is one of the most fascinating moments in context window research. Anthropic published a response. Their analysis revealed that Claude 2.1 could access the information. It was deliberately choosing not to answer, because the needle looked suspicious. A random sentence about sandwiches buried in the middle of Paul Graham essays did not match the surrounding content, and Claude's training had emphasized not making claims based on potentially injected or unreliable text. The model was being cautious, not forgetful.

In Anthropic's testing, Claude would say it could not answer the question based on the given context whenever the answer was embedded in a seemingly unrelated document.

The fix was one line. Prefilling Claude's response with "Here is the most relevant sentence in the context:" told the model it was allowed to retrieve seemingly out-of-place information. Accuracy jumped from twenty-seven percent to ninety-eight percent. The test had been measuring not just retrieval but also the model's willingness to answer. Context window evaluation, it turned out, is as much a behavioral problem as an architectural one.
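As an illustration of what that one-line fix looks like in practice, here is the shape of a chat request with the assistant's turn prefilled; the document text and question are placeholders, not Anthropic's evaluation harness.

```python
# A chat request where the assistant's turn is started for it, so the model
# continues from the retrieval-framing sentence instead of deciding whether
# the needle is trustworthy enough to mention.
haystack_text = "<the Paul Graham essays, with the needle buried inside>"

messages = [
    {
        "role": "user",
        "content": haystack_text + "\n\nWhat is the best thing to do in San Francisco?",
    },
    {
        # Prefilled start of the model's reply: the one-line fix.
        "role": "assistant",
        "content": "Here is the most relevant sentence in the context:",
    },
]
```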

When Claude 3 Opus launched in March twenty twenty-four, it not only achieved over ninety-nine percent retrieval accuracy but in some cases identified that the needle had been artificially inserted. The model was essentially saying: I found your sentence about sandwiches, and I can tell you put it there on purpose. That is not retrieval. That is something closer to reading comprehension.

Learning a Language Nobody Speaks

If the needle-in-a-haystack test measures retrieval, Google's Kalamang test measures something deeper. Kalamang is a language spoken by fewer than two hundred people, all of them in Papua, Indonesia. There is no Kalamang data on the internet. No language model has ever been trained on it.

In February twenty twenty-four, Google tested Gemini 1.5 Pro's million-token context window by feeding it a five-hundred-page grammar manual, a bilingual dictionary, and roughly four hundred parallel sentences for Kalamang. All of it went straight into the context, no fine-tuning, no training. Then they asked the model to translate English into Kalamang.

The model scored fifty-eight point three on the ChrF metric, a standard measure for translation quality. The previous best model scored forty-five point eight. A human learner, given the same grammar book and dictionary and the same amount of time, scored fifty-seven point zero. Gemini outperformed the human baseline. Not by using its training data. By reading a textbook it had never seen, in a language it had never encountered, and applying what it learned within a single context window.

This is the most compelling evidence that million-token context windows are not just marketing. The Kalamang test requires synthesis. You cannot answer a translation question by finding a needle. You have to understand grammar rules, apply vocabulary, and combine both according to patterns described in prose you are reading for the first time. Whether that constitutes "understanding" is a question for the philosophers. But it is certainly not the same thing as keyword matching.

Google's own analysis showed that next-token prediction loss improved following a power law up to one million tokens for documents and two million tokens for code. Each additional chunk of context made the model measurably more certain about what came next. That is direct evidence of genuine utilization, not just the ability to store tokens but the ability to learn from them.

The Race and the Reality

The timeline of context window expansion reads like an arms race. Five hundred twelve tokens in twenty eighteen. Two thousand in twenty twenty. One hundred thousand in May twenty twenty-three. One million in February twenty twenty-four. Ten million in April twenty twenty-five. Each jump required not just engineering effort but architectural innovation, new positional encodings, new attention mechanisms, new ways of distributing computation across hardware.

But the numbers on the box do not match the numbers in practice. A twenty twenty-five study by Chroma tested eighteen frontier models and found that every single one degraded as input length increased, even on simple tasks. A model advertising two hundred thousand tokens typically became unreliable around one hundred thirty thousand. The degradation was not gradual. It showed sudden drops, like a student who appears to be paying attention and then abruptly zones out.

The practical question, the one that matters if you are using these tools daily, is not "how big is the context window" but "how effectively does the model use it." The answer, in twenty twenty-six, is: better than two years ago, still imperfect, and expensive. RAG remains cheaper than long context for most tasks. Putting important information at the beginning of your prompt remains more effective than burying it in the middle. And the most capable models charge a premium for the tokens they process, because those tokens consume real memory and real computation on real hardware.

The context window story is not finished. It is one of the most active areas of AI engineering. Ring attention distributes the computation across multiple graphics cards in a circular topology, promising linear scaling with hardware. State-space models like Mamba process sequences in linear time rather than quadratic, abandoning attention entirely. And the cost of long context has been falling. Anthropic's March twenty twenty-six announcement eliminated the premium for context beyond two hundred thousand tokens, making a nine hundred thousand token request the same price per token as a nine thousand token one.

But the fundamental tension remains. The context window is not memory. It is a computation. Every token costs attention. Every token costs money. And the middle still sags.

That was the deep dive for episode ten.

The Jargon Jar

This episode's term: KV cache.

If a friend asked you, you would say: it is the scratchpad the model keeps while generating a response. Every time it produces a new word, it needs to refer back to everything that came before. Instead of recomputing all of that from scratch for every word, it stores the important bits in a cache. That cache is why the second word comes out faster than the first.

The marketing version: "optimized inference with advanced caching technology."

What it actually means in practice: the KV cache is the single largest memory bottleneck in running a language model. For a seventy-billion-parameter model at one hundred twenty-eight thousand tokens, the cache alone can consume forty gigabytes of graphics card memory. That is more than most consumer hardware has in total. This is the reason long conversations cost more than short ones, the reason context windows have practical limits even when the architecture theoretically supports more, and the reason why the question "how much memory does this model need" has a different answer depending on how long your prompt is. The cache is where context becomes physical, where tokens stop being abstract and start being gigabytes.