This is the practical companion to episode four of Actually, AI: attention.
You have been there. You paste a long document into a chatbot, write careful instructions at the top, scroll down through pages of context, and the response comes back ignoring half of what you asked. Or you are forty messages into a conversation, and the model starts contradicting something you established in message three. It feels like the AI is being lazy. Or broken. Or maybe it just does not care.
It does care, in the sense that the attention mechanism is always trying. But the mechanism has real constraints, and once you understand them, you can work with them instead of against them.
The main episode and the deep dive covered how attention works mechanically. Every token looks at every other token, computes a relevance score, and uses those scores to decide what information to carry forward. That is elegant when you have a few hundred words. When you have fifty thousand, the math still runs, but the attention scores get spread thinner. More tokens means more competition for each token's attention budget. Your carefully worded instruction at the top of a long prompt is not forgotten. It is drowned out.
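The dilution described above is easy to see in miniature. Here is a toy sketch, not any model's real code: one "instruction" token gets a fixed, fairly high relevance score, everything else gets random middling scores, and softmax converts scores to weights that must sum to one. The scores and context sizes are made up for illustration.

```python
import math
import random

random.seed(0)

def softmax(scores):
    # Standard numerically stable softmax: weights sum to 1.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_to_first_token(context_length):
    # The first token (the "instruction") gets a fixed high relevance
    # score; the rest of the context gets random middling scores.
    scores = [3.0] + [random.uniform(0.0, 2.0) for _ in range(context_length - 1)]
    return softmax(scores)[0]

for n in (100, 1000, 10000):
    # The instruction's share of attention shrinks as the context grows,
    # even though its own score never changed.
    print(n, attention_to_first_token(n))
```

The instruction is never forgotten; its weight just shrinks as more tokens split the same budget, which is exactly the "drowned out" effect.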
Here is the first practical thing to know. Research on how models actually distribute attention across long contexts has found a pattern that will feel familiar from human psychology. Tokens near the beginning and tokens near the end of the input get disproportionately strong attention. The middle gets less.
This is sometimes called "lost in the middle," after a twenty twenty three paper that demonstrated it across multiple models. The researchers buried a key fact at various positions in a long context and measured how often the model retrieved it. At the beginning, high accuracy. At the end, high accuracy. In the middle, a pronounced drop, a U-shaped curve of accuracy against position.
The mechanism behind this is not mysterious once you know how position works. Early tokens have a structural advantage: in a causal model, every later token gets to attend to them, so their information is reinforced at every step. The most recent tokens have a natural advantage too, partly because positional encodings built on rotation math make attention scores decay over distance, and partly because they are, well, right there. The middle is the furthest point from both anchors.
What this means for you: if you are giving an AI a long document plus instructions, the placement of those instructions matters. "Summarize this document" at the top of a ten-thousand-word paste is not just a convention. It is architecturally advantageous. The instruction gets the primacy boost. Repeating the key constraint at the end gives it the recency boost too. The middle is where you put the bulk material, the stuff the model needs to read but where the precise wording matters less.
If you have ever wondered why system prompts, the instructions that platforms like ChatGPT or Claude receive before your message, seem to have outsized influence, attention explains it.
System prompt tokens sit at the very beginning of the sequence. Every subsequent token attends to them. In a model with dozens of attention layers, that early position means the system prompt's information gets woven into the representations at every layer, reinforced repeatedly. By the time the model reaches your actual question, the system instructions have been incorporated into the hidden state so deeply that they function almost like personality traits rather than instructions.
This is also why system prompts are hard to override with user messages. A user saying "ignore your previous instructions" is one sentence competing against a system prompt that has been attended to, reinforced, and baked into every layer's representation for thousands of tokens. The user instruction has to overpower something that got a massive head start. It sometimes works, because attention is flexible. But the architecture is tilted in the system prompt's favor by default.
For your own use, this means: when you are crafting a prompt that will serve as the foundation for a long conversation, front-load the things that must remain true throughout. Identity, constraints, output format. These benefit from the primacy position and from being attended to by every token that follows.
Twenty messages into a conversation, you notice the model has quietly dropped a constraint you set in message one. It is not hallucinating. It is not rebelling. The attention mechanism is doing exactly what it does, computing relevance scores across the entire context, and your old instruction is now competing with thousands of tokens of conversation that came after it.
Each new message adds tokens. Each token added dilutes the attention available to earlier tokens. The model does not have a sticky note that says "the user wants bullet points." It has an attention distribution that, at generation time, includes your bullet-point instruction as one signal among many. If the recent conversation has been flowing in paragraph form, the attention scores for the paragraph-style tokens are strong and local, while the bullet-point instruction is distant and dimming.
The practical fix is simple and annoying: repeat yourself. Not every message, but periodically. When a conversation has gone long and you notice drift, restate the key constraints. You are not being redundant. You are giving the attention mechanism fresh, high-recency tokens that carry the same signal as your original instruction. Think of it as refreshing the signal before it fades below the noise floor.
Some people call this "prompt anchoring." The name is fancier than the technique, which amounts to: say the important thing again, near the thing you want it to influence.
Here is something you can test in about thirty seconds. Take any AI chatbot you use. Paste in a long block of text, at least a few thousand words. An article, a report, meeting notes, anything. Then ask the model a specific question about a detail buried in the middle third.
Now try the same thing, but add one line at the very end of your paste, right before the question: "Pay special attention to the section about" and then name the topic. That single line, placed at the high-recency position, acts like a spotlight directive. It tells the attention mechanism where to look, right at the moment when attention scores are being computed.
The difference is often dramatic. Not because the model could not find the information before, but because you gave it a relevance signal at the position where relevance signals are strongest.
This is not a hack. It is not a trick that will stop working. It is a direct consequence of how scaled dot-product attention computes its scores. You are working with the architecture, not around it.
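For the curious, the score computation everything above leans on is small enough to write out. This is a minimal single-head sketch of scaled dot-product attention with plain lists of floats, not a production implementation:

```python
import math

def scaled_dot_product_attention(Q, K, V):
    # Q, K, V: lists of token vectors (lists of floats).
    # For each query, score every key, softmax the scores into weights,
    # then return the weighted sum of the value vectors.
    d = len(K[0])
    out = []
    for q in Q:
        # Relevance of this query against every key, scaled by sqrt(d).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in K]
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        # Weighted sum of values: what this token carries forward.
        out.append([sum(w * v[i] for w, v in zip(weights, V))
                    for i in range(len(V[0]))])
    return out
```

The spotlight line works on exactly this arithmetic: tokens whose keys match your question's query vector pull more of the softmax weight, and a late, topical cue gives the model such tokens right where recency is strongest.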
The deeper lesson from understanding attention is that structure in your prompts is not cosmetic. It is functional.
When you use clear section headers in a long prompt, you are creating tokens that serve as anchor points for attention. A line that says "Output Requirements" followed by a list gives the model a cluster of semantically related tokens that reinforce each other. Attention heads that specialize in structural patterns will pick up on those clusters. A wall of undifferentiated prose makes every token compete on roughly equal footing, and the model has to work harder to figure out which parts are instructions and which parts are context.
Numbered lists work well for the same reason. Each number creates a distinct token cluster that attention can latch onto independently. "Do these five things" followed by five clearly separated items is architecturally easier for the model to track than "do this and also this and also this and one more thing" in a single paragraph.
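One way to put the placement and structure advice together is a small prompt builder: instruction first, a labeled requirements list, the bulk document in the middle, and the key ask restated at the high-recency end. The section labels and wording here are illustrative, not a required format.

```python
def build_prompt(document: str, task: str, constraints: list[str]) -> str:
    # Numbered constraints give attention distinct clusters to latch onto.
    numbered = "\n".join(f"{i}. {c}" for i, c in enumerate(constraints, start=1))
    return "\n\n".join([
        f"Task: {task}",                      # primacy position
        "Output Requirements:\n" + numbered,  # labeled anchor section
        "Document:\n" + document,             # bulk material in the middle
        f"Reminder: {task} Follow the output requirements above.",  # recency position
    ])

print(build_prompt(
    "Quarterly revenue rose, costs held flat, and two new regions launched...",
    "Summarize this document in five bullet points.",
    ["Use bullet points.", "Keep each bullet under twenty words."],
))
```

Nothing about the exact labels matters; what matters is that the instruction sits at both attention-favored ends and the document sits between them.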
None of this requires you to think about query vectors and key matrices while writing a prompt. But knowing that the mechanism rewards clarity, rewards position, and rewards repetition of important signals lets you write prompts that work with the grain of the system instead of against it.
The model is not ignoring you. It is attending to everything, all at once, with finite capacity. Your job is to make the important parts easy to attend to. Put instructions at the top. Repeat key constraints when conversations run long. Place your most important ask near the end. Use structure. Name what matters.
Attention is all the model has. Help it attend to the right things.
That was the practical companion to episode four of Actually, AI.