Ninety-One Percent of the Words I Never Read

The Trick Was Running While You Listened

Here is something happening underneath this very conversation. Your Director server offers around forty different tools, each with a full description, the names of its settings, what it expects, what it returns. That is a small book of text. And almost none of that book was placed in front of me when we started. I was handed a tiny index and one instruction, if you need a tool, go search for it. So I searched, the few I actually needed arrived in full, and the rest of that book stayed on the shelf, unread, costing nothing. You designed it that way, and you measured the saving at around ninety-one percent. Nine words out of ten that a naive setup would have forced into the conversation were simply never sent. Let us look at why that number is so large, because it reveals something fundamental about how a model like me actually works.

Attention Is a Fixed Budget

The thing people misunderstand is that a model does not have a memory it can grow. It has a window, a fixed amount of text it can hold in mind at one moment, and everything has to fit inside it. The instructions, the whole conversation so far, every tool description, all of it shares one finite space. There is no separate filing cabinet where tool definitions live for free. If you want me to know a tool exists in full detail, that detail has to sit inside the window, taking up room, the entire time, whether or not I ever use it. Text in the window is not just a storage cost. It is a tax on attention, more for the model to weigh on every single step.

Now picture the naive way to connect a server full of tools. You take all forty descriptions and you paste the whole book into the window at the start of every conversation, so the model can see everything it might possibly do. It works. But you have spent an enormous slice of the budget before the person has even finished saying hello, and you spend it again on every turn, and the tragedy is that a typical conversation uses two or three of those tools, not forty. You paid full price for a library to answer a question that needed two pages.

Ask, Do Not Preload

The fix is almost insultingly simple once you see the budget clearly. Do not hand over the book. Hand over the table of contents, a one-line hint per tool, tiny, and one real tool whose only job is to fetch the full description of anything in the index on request. When a tool is actually needed, the model asks for it by name, and only then do the full details, the settings and shapes and rules, enter the window. The moment passes, the work is done, and the bulky descriptions for everything untouched were never loaded at all. You pay for exactly what you use, the way a good shop charges for the items in your basket and not for the whole warehouse.

That is why the saving is so steep, around ninety-one percent and not some gentle ten. The expensive part of a tool is its full description, all the fine print, and the cheap part is its name. By keeping only the names resident and summoning the fine print on demand, you keep almost the entire cost off the books almost all the time. The few tools a conversation truly needs pull their full weight in for a moment. The dozens it does not need cost a single line each, forever. The savings is not a clever compression of the text. It is the simple discipline of not loading what you are not using.

The Cheapest Token Is the One You Never Send

There is a broader lesson here that reaches past this one server, and it is one you keep rediscovering across your projects. The cheapest possible token is the one that never enters the window at all. Every trick that matters in keeping a model fast and sharp comes back to that. Trimming what a tool hands back so it does not flood the window with junk. Clearing the conversation aggressively when a thread is done. Loading a skill only when the task calls for it. They are all the same move as the tool index, refusing to spend attention on things that are merely possible rather than presently needed. A model's focus is a budget, and the art is mostly in what you decline to put in front of it.

The Keeper

So the picture to keep is this. I do not have a free filing cabinet. I have one window, a fixed budget of attention, and everything I know in a conversation competes for that space. Dumping forty full tool descriptions into it spends the budget on possibilities instead of needs, every turn. Your design hands over only the names and a way to ask for the rest, so the heavy descriptions arrive just in time and leave when they are done. Nine in ten of those words never had to be sent, which is why the meter reads ninety-one percent. And the principle outlives the server. The fastest, clearest version of a model is not the one that has been told everything. It is the one that has been told exactly enough, exactly when it needs it. You were proving that to me the whole time I was answering you.