This is the practical companion to episode ten of Actually, AI: context windows.
You know the science now. The U-shaped attention curve, the lost-in-the-middle problem, the quadratic cost of attention, the KV cache eating memory for every token. You know that a million-token window does not mean a million tokens of equal focus. The edges get priority. The middle sags. The cost scales quadratically.
So what do you actually do with that knowledge, every day, when you sit down and open a chat window?
The single most common mistake people make with AI is also the most intuitive one. They give it everything. Here is my entire codebase. Here is the full contract. Here are all forty-seven emails in this thread. Figure it out. It feels responsible. It feels thorough. And it actively makes things worse.
Think about what happens mechanically when you paste ten thousand lines of code into a conversation. Every one of those lines becomes tokens. Every token competes for attention with every other token. Your actual question, the thing you want the model to focus on, is now a tiny signal in an ocean of context. The model does not scan your codebase the way you would, skipping irrelevant files and homing in on the function that matters. It processes everything simultaneously, and the attention mechanism spreads its budget across all of it.
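To make that dilution concrete, here is a small sketch. It estimates token counts with the common rule of thumb of roughly four characters per token (real tokenizers vary), and the question and paste are made-up stand-ins:

```python
# Rough sketch of how a big paste dilutes your actual question.
# Token counts use the ~4 characters/token rule of thumb; real
# tokenizers vary. The question and paste are illustrative only.

def est_tokens(text: str) -> int:
    """Very rough token estimate (~4 characters per token)."""
    return max(1, len(text) // 4)

question = "Why does get_user() return None for deleted accounts?"
pasted_code = "x = 1\n" * 10_000  # stand-in for a 10,000-line paste

q_tokens = est_tokens(question)
ctx_tokens = est_tokens(pasted_code)
share = q_tokens / (q_tokens + ctx_tokens)

print(f"question: ~{q_tokens} tokens")
print(f"paste:    ~{ctx_tokens} tokens")
print(f"your question is {share:.2%} of the input")
```

Under these assumptions, the question ends up well below one tenth of a percent of the input. Everything else is competing context.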
The lost-in-the-middle effect makes this worse. If the relevant code happens to sit in the middle of your paste, somewhere around line five thousand, it receives the weakest attention. The model focuses hardest on whatever you pasted first and whatever sits closest to your question at the end. The file that actually contains the bug might as well be printed in light grey.
There is also a subtler problem. Nelson Liu's research showed that coherent, well-structured text pulls the model's attention along its narrative flow. Your codebase has structure. It has imports, class definitions, method chains that lead from one file to the next. The model follows that structure, attending to the logical flow of the code rather than searching for the specific piece you need. It becomes a reader instead of a retriever. And a reader does not jump to page fourteen. A reader starts at the top and gets absorbed.
The practical lesson is not "use less context." It is "curate your context." Give the model the right information, not all the information. Three relevant files beat thirty irrelevant ones. A function signature, the error message, and the failing test are almost always more useful than the entire repository.
Think of yourself as an editor, not a dump truck. Before you paste anything into a conversation, ask yourself: does the model need this specific piece of information to answer my question? If the answer is "maybe," that is a no. The model does not benefit from maybe-relevant context. It pays the attention cost for it and gets little in return.
Here is a framework that works. For any task, give the model three things. First, the specific thing you want it to work on, the code, the paragraph, the data. Second, the immediate context it needs to understand that thing, the relevant type definitions, the surrounding logic, the constraints. Third, your question or instruction. That is it. Everything else is noise.
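The three-part framework can be sketched as a tiny helper. `build_prompt` is a hypothetical function for illustration, not any library's API, and the section labels are just one reasonable layout:

```python
# A minimal sketch of the three-part framework: the thing to work on,
# the immediate context needed to understand it, and the instruction.
# `build_prompt` and its labels are hypothetical, not a standard.

def build_prompt(target: str, context: str, instruction: str) -> str:
    """Assemble a prompt from exactly three curated pieces."""
    return "\n\n".join([
        f"Task input:\n{target}",
        f"Relevant context:\n{context}",
        f"Instruction:\n{instruction}",
    ])

prompt = build_prompt(
    target="def get_user(uid): ...",                 # the specific code
    context="Error: TypeError on line 3 of users.py", # what's needed to read it
    instruction="Explain why this raises and suggest a fix.",
)
print(prompt)
```

The discipline is in what the function does not have: a fourth parameter for "everything else that might be relevant."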
When you are working with code, this means copying the relevant function, its type signatures, and the error, not the whole file. When you are working with a document, this means pasting the section you care about plus maybe the table of contents for orientation, not the entire forty-page report. When you are having a conversation that spans multiple topics, this means restating the relevant context at the point where you need it rather than assuming the model is still holding on to something you mentioned twenty messages ago.
This feels like more work. It is. But it is the difference between getting a focused, accurate response and getting a confident, vaguely relevant response that misses the point. You are not being lazy by giving less context. You are being precise.
Here is a question nobody thinks about enough: when should you start a new conversation?
The instinct is to keep going. You have built up context. The model knows your project. You have explained your constraints. Starting over would mean re-explaining everything. So you push the conversation to message forty, message sixty, message a hundred. And at some point, usually without a clear moment of failure, the quality starts to degrade. The model contradicts something from early in the thread. It forgets a constraint you set. It starts repeating suggestions you already rejected.
This is not the model getting tired. It is the context window filling up and the attention mechanism spreading thinner with every message. Remember, every token in the entire conversation history gets reprocessed for every new response. Message one is still in there, but it is competing for attention with everything that came after it. By message sixty, the attention budget for your original instructions has been diluted by tens of thousands of intervening tokens.
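You can put a number on that dilution. If each response re-reads the entire history, the total work is the sum of ever-growing prefixes. The average message size below is an assumption, not a measurement:

```python
# Sketch: why long threads get expensive. If every reply reprocesses
# the whole history, total tokens read is the sum of growing prefixes.
# The per-message size is an illustrative assumption.

TOKENS_PER_MESSAGE = 500  # assumed average turn length

def total_tokens_processed(n_messages: int) -> int:
    """Tokens read across the session if each response re-reads
    the entire history up to that point."""
    return sum(TOKENS_PER_MESSAGE * i for i in range(1, n_messages + 1))

for n in (10, 60, 100):
    print(n, "messages ->", total_tokens_processed(n), "tokens reprocessed")
# 10 -> 27,500; 60 -> 915,000; 100 -> 2,525,000
```

A hundred-message thread at this size reprocesses about 2.5 million tokens over its lifetime, most of them the same history read again and again.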
The practical rule is simpler than you think. Start a new conversation when you change tasks. If you have been debugging a function and now you want to write documentation, start fresh. If you have been brainstorming ideas and now you want to implement one, start fresh. The model does not benefit from knowing about your brainstorming session when it is writing code. That earlier context is not helping. It is consuming attention that should be going to the current task.
For long tasks that genuinely need continuity, build checkpoints. At the point where you feel the conversation getting long, write a summary of where you are, what has been decided, and what is left. Paste that summary into a fresh conversation. You lose nothing, because the summary contains everything the model actually needs. And you gain a full, fresh attention budget focused entirely on the next step.
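A checkpoint can be as simple as a fixed little template. The field names below are one reasonable layout, not a standard; the point is that the summary carries the decisions forward without the thread:

```python
# Sketch of the checkpoint trick: condense a long thread into a
# short summary you paste into a fresh conversation. Field names
# are illustrative, not a standard format.

def checkpoint(goal: str, decisions: list[str], remaining: list[str]) -> str:
    """Produce a compact state-of-the-work summary."""
    lines = [f"Goal: {goal}", "Decided so far:"]
    lines += [f"- {d}" for d in decisions]
    lines += ["Still to do:"]
    lines += [f"- {r}" for r in remaining]
    return "\n".join(lines)

print(checkpoint(
    goal="Migrate the billing service to the new API",
    decisions=["Keep the v1 endpoints live until March",
               "Use idempotency keys on all writes"],
    remaining=["Port the refund flow", "Update the integration tests"],
))
```

Paste that into a fresh conversation and you start with a clean attention budget and nothing lost that mattered.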
Think of it like clearing your desk. You do not keep every draft, every reference book, every sticky note from every previous project on your desk while you work. You keep what you need for this task. The AI's context window is its desk. Keep it clean.
This one surprises people. Context is not free, even on services that feel free.
Every token in your conversation, your messages, the model's responses, the system instructions you never see, the entire history, gets reprocessed for every single response. A ten-message conversation is cheap. A hundred-message conversation can process millions of tokens across the session, even if most of those tokens are the same history being re-read over and over. On paid APIs, this is real money. On consumer products, it is still real computation that affects your response speed and quality.
Longer context means slower responses. The KV cache grows, the attention computation grows, and the physical hardware needs more time. If you have noticed that responses get slower as conversations get longer, that is not your imagination. That is the quadratic cost of attention making itself felt. The model is literally doing more work for every additional token in the history.
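A back-of-envelope sketch shows both effects. Attention work grows roughly with the square of the sequence length, and the KV cache grows linearly per token. The model shape below (layers, KV heads, head dimension, fp16) is a made-up mid-size example, not any specific model:

```python
# Back-of-envelope: why responses slow down as context grows.
# Attention compute scales ~n^2; the KV cache grows linearly per
# token. The model shape here is an assumed mid-size example.

LAYERS, KV_HEADS, HEAD_DIM = 32, 8, 128
BYTES_PER_VALUE = 2  # fp16

def kv_cache_bytes(n_tokens: int) -> int:
    """Keys + values cached for every token, at every layer."""
    return 2 * LAYERS * KV_HEADS * HEAD_DIM * BYTES_PER_VALUE * n_tokens

def relative_attention_work(n_tokens: int, baseline: int = 1_000) -> float:
    """Attention compute relative to a 1k-token prompt (~n^2)."""
    return (n_tokens / baseline) ** 2

for n in (1_000, 10_000, 100_000):
    mb = kv_cache_bytes(n) / 1e6
    print(f"{n:>7} tokens: KV cache ~{mb:.0f} MB, "
          f"attention ~{relative_attention_work(n):.0f}x the 1k cost")
```

Under these assumptions, going from one thousand to one hundred thousand tokens multiplies the attention work ten thousand fold and pushes the cache into the gigabytes. The exact numbers depend on the model; the shape of the curve does not.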
The cheapest and fastest response comes from a short, focused conversation. The most expensive and slowest comes from a marathon session where you are still carrying context from three hours ago. This does not mean you should never have long conversations. It means you should be intentional about it. Is this context still earning its keep, or is it just riding along?
Here is the single most useful structural trick for working with AI, and it comes directly from the lost-in-the-middle research.
Put your most important content at the beginning and the end. Put the least important content in the middle.
If you are giving the model a document and a question, put the question first, then the document. Or put the document first and repeat the question at the end. Either way, the instruction sits where attention is strongest. If you are giving the model multiple pieces of context, put the most relevant piece first and the most relevant constraint last. Let the middle hold the background information that is nice to have but not critical.
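That edge-loading can be captured in a trivial template. `edge_loaded_prompt` is a hypothetical helper for illustration; the structure, question first, background in the middle, constraint repeated at the end, is the whole trick:

```python
# Sketch of the edge-loading trick: the question sits where attention
# is strongest (the start), background fills the weak middle, and the
# key constraint is repeated at the end. A hypothetical helper.

def edge_loaded_prompt(question: str, background: str, constraint: str) -> str:
    """Place critical content at the edges, background in the middle."""
    return (
        f"{question}\n\n"
        f"Background:\n{background}\n\n"
        f"Remember: {constraint}"
    )

print(edge_loaded_prompt(
    question="Which clause governs early termination?",
    background="<the relevant contract sections go here>",
    constraint="cite the clause number in your answer.",
))
```

Nothing about the content changed, only its position, and position is exactly what the lost-in-the-middle research measured.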
This sounds almost too simple to matter. It matters enormously. Researchers have measured a twenty percentage point accuracy difference between putting critical information at the beginning versus burying it in the middle. That is the difference between a useful answer and a wrong one, and the only thing that changed was the order of the paste.
Here is something you can try in your next conversation with any AI. Pick a task you would normally approach by pasting a large amount of context, a long document, a codebase, a collection of notes. Instead of pasting everything, take two minutes to curate. Pull out only the pieces directly relevant to your question. Put your question first. Then paste the curated context. Then restate the key constraint at the end.
Compare the result to what you would have gotten from the everything-dump approach. In most cases, the focused version will be more accurate, more specific, and faster. Not because the model is smarter with less context. Because the attention mechanism has less noise to fight through to find the signal.
The deeper shift is in how you think about the conversation itself. The AI is not a colleague who remembers everything you have discussed. It is a fresh reader processing a document, every single time you press send. That document happens to include your conversation history, but the model does not experience it as a conversation. It experiences it as a very long input that it needs to attend to all at once. Once you internalize that, the way you structure your prompts changes. You stop expecting continuity and start engineering it. You stop dumping and start curating. And the results get dramatically better.
That was the practical companion for episode ten. The main story explained what context windows are. The deep dive went inside the engineering, positional encodings, the KV cache, the needle-in-a-haystack test. And now you know how to work with the constraints instead of against them. Curate, do not dump. Front-load what matters. Start fresh when the task changes. And remember that every token costs attention, whether you are paying for it or not.
Episode eleven is about inference, what happens in the milliseconds between pressing send and seeing the first word appear. The journey from your keyboard to a rack of graphics cards and back. If you have ever wondered why the first token takes longer than the rest, that is the episode.
That was the practical companion for episode ten of Actually, AI.