Actually, AI
Scaling: What This Means for You
13m · Apr 04, 2026


The Truck Problem

This is the practical companion to episode seven of Actually, AI: scaling.

You heard about the curve. The power law. Billions of dollars wagered on the bet that bigger models perform better. All true. But here is the thing nobody in those stories had to deal with. You have a task. You have a budget. You need to pick a model. And the options range from something that costs a fraction of a cent per query to something that costs sixty cents. That is a six hundred fold spread. Picking the wrong one is like driving a semi truck to the corner store for milk. It gets the job done. It also costs forty dollars in diesel and takes twenty minutes to park.

The scaling papers proved that bigger models are more capable. They did not prove that you need the most capable model for every task. That distinction is where all the practical money lives. And most people, whether they are building applications or just using AI tools at work, get it wrong. They either default to the biggest model for everything, burning money on tasks that a smaller model handles perfectly, or they default to the cheapest model for everything, getting mediocre results on the tasks that actually matter. The sweet spot is in the middle, and finding it is more craft than science.

The Cost Curve Nobody Talks About

Here is what the scaling story looks like from the other side of the cash register. As of early twenty twenty-six, the spread between the cheapest and most expensive models from a single provider looks roughly like this. A small, fast model costs somewhere around twenty-five cents per million input tokens. A mid-tier model costs about three dollars. The frontier model costs fifteen dollars. And for extended thinking, where the model reasons through hard problems step by step, you might pay sixty to a hundred dollars per million tokens for the thinking itself.

Those numbers change constantly, and they have been falling. The price of a million tokens on a frontier model dropped by roughly ninety percent between early twenty twenty-four and early twenty twenty-six. Competition does that. But the ratio between tiers stays remarkably stable. The biggest model always costs roughly ten to sixty times what the smallest model costs. So even as everything gets cheaper, the question of which tier to use never goes away.

Now here is the part that matters. For most tasks, the quality difference between tiers is not proportional to the cost difference. A task where the small model scores eighty-five and the frontier model scores ninety-five does not justify a sixty-fold price increase. You are paying sixty times more for ten points. But a task where the small model scores thirty and the frontier model scores eighty-five? That is a different equation entirely. You are paying for the difference between useless and useful.
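That arithmetic is worth making concrete. Here is a quick sketch using the illustrative prices from the previous section (dollars per million input tokens) and hypothetical quality scores; real figures vary by provider and task, and comparing token prices to quality points is a rough heuristic, not a rigorous metric.

```python
def cost_per_point_gained(small_score, big_score, small_price, big_price):
    """Extra dollars paid per quality point gained by upgrading tiers."""
    extra_cost = big_price - small_price       # added price per million tokens
    points_gained = big_score - small_score    # quality improvement
    return extra_cost / points_gained

# Task A: the small model is already good (85 vs 95).
print(cost_per_point_gained(85, 95, 0.25, 15.0))  # 1.475 dollars per point

# Task B: the small model is useless (30 vs 85).
print(cost_per_point_gained(30, 85, 0.25, 15.0))  # roughly 0.27 per point
```

The upgrade for Task B is over five times cheaper per point of quality gained, which is the "useless to useful" difference expressed as a number.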

When Small Wins

Small models are not just cheaper versions of big models with some capability shaved off. They are genuinely better for certain tasks. Not "acceptable." Better.

Speed is the obvious one. A small model responds in milliseconds. A frontier model with extended thinking might take thirty seconds on a complex question. If you are building an autocomplete feature that needs to feel instant, the big model is not just expensive, it is too slow to use at all. The user will have typed three more words before the suggestion arrives.

But speed is the shallow answer. The deeper one is that small models are less likely to overthink simple tasks. Ask a frontier model to classify a customer email as "billing," "technical," or "general," and it might produce a paragraph of reasoning about edge cases before giving you the label. Ask a small model the same question and it gives you the label. For structured, well-defined tasks with clear categories, the small model's simplicity is a feature. It does not hallucinate nuance where none exists.

Here is a rough guide. Small models tend to match or beat larger ones on text classification, simple extraction like pulling names and dates from documents, formatting and conversion tasks, basic summarization where you just need the key points, translation between common language pairs, and template-based generation where the structure is fixed and only the details change. These are high-volume, low-ambiguity tasks. The kind of work where you process thousands or millions of items and need consistent, fast, cheap results.

When Big Is the Only Option

The frontier model earns its price on tasks that require genuine reasoning across complex material. Writing code that interacts with multiple systems and needs to hold the full architecture in mind. Analyzing a legal contract for subtle implications. Synthesizing research from multiple conflicting sources into a coherent argument. Creative writing that needs to maintain voice, theme, and narrative arc across thousands of words. Debugging a problem that requires understanding three layers of abstraction simultaneously.

The pattern is consistent. When the task requires holding many things in mind at once, when the answer depends on subtle relationships between distant pieces of information, when there is no single correct answer and the quality gradient is wide, that is where the big model pulls ahead. Not by ten percent. By a factor that makes the small model unusable for the purpose.

There is a useful heuristic. If you can write clear, specific instructions that fully define the correct output, a small model will probably handle it. If the task requires judgment, where the quality of the output depends on understanding things you did not explicitly state, you need a bigger model. The gap between tiers tracks almost perfectly with how much implicit knowledge the task demands.

The Routing Trick

The smartest approach is not to pick one model. It is to use different models for different tasks. The industry calls this model routing, and it is how most serious AI applications work behind the scenes.

The simplest version is a two-tier system. A small, cheap model handles the easy requests. When it encounters something it is not confident about, the request gets routed to a larger model. You can do this with a classifier, a simple model that looks at the incoming request and decides which tier should handle it, or you can do it with confidence thresholds, where the small model attempts the task and escalates if its own uncertainty is high.
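The confidence-threshold version can be sketched in a few lines. The model callables and the 0.8 cutoff here are hypothetical stand-ins; a real implementation would wrap actual API clients and tune the threshold against labeled traffic.

```python
def route(request, small_model, big_model, threshold=0.8):
    """Two-tier routing: try the cheap model first, escalate when unsure.

    small_model returns (answer, confidence); big_model returns an answer.
    Both are hypothetical callables standing in for real API wrappers.
    """
    answer, confidence = small_model(request)
    if confidence >= threshold:
        return answer, "small"
    # Low confidence: pay frontier prices for this request only.
    return big_model(request), "big"
```

Returning the tier label alongside the answer makes it easy to log what fraction of traffic escalates, which is the number the whole cost argument hinges on.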

The economics are compelling. In most real-world applications, somewhere between seventy and ninety percent of requests are routine. If you can handle those with a model that costs one sixtieth of the frontier price, and only send the remaining ten to thirty percent to the expensive model, your average cost per request drops dramatically while your quality on hard tasks stays high.
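Those percentages translate directly into a blended price. A back-of-envelope calculation, using the illustrative tier prices from earlier (assumed for the sake of the sketch, not quoted from any provider):

```python
SMALL_PRICE = 0.25     # dollars per million input tokens, small tier
FRONTIER_PRICE = 15.0  # dollars per million input tokens, frontier tier

def blended_cost(routine_fraction):
    """Average price per million tokens when routine requests stay small."""
    return (routine_fraction * SMALL_PRICE
            + (1 - routine_fraction) * FRONTIER_PRICE)

print(blended_cost(0.8))  # roughly 3.2 -- about a fifth of frontier-only
```

At an eighty percent routine rate, the blended price lands near the mid-tier price while the hard twenty percent still gets frontier-quality answers.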

This is not theoretical. It is how the major providers structure their own products. When you use an AI assistant and some responses come back instantly while others take a few seconds longer and feel more thoughtful, there is a reasonable chance you are seeing model routing in action. The system decided how hard your question was and picked accordingly.

Try This

Here is something you can do right now. Take one task you use AI for regularly. Something you do at least a few times a week. Try it on three different model tiers. The smallest available, something in the middle, and the largest. Do this five times with different inputs for each tier.

You will almost certainly find one of three things. Either the small model is perfectly fine and you have been overpaying. Or the big model is noticeably better and the cost is justified for this particular task. Or, and this is the most common outcome, the middle tier hits the sweet spot, good enough quality at a fraction of the frontier price.

The goal is not to find the cheapest model that technically works. It is to find the cheapest model whose output you do not feel the need to edit or redo. That is your quality threshold for that specific task. It will be different for every task you do. An email draft might have a low threshold. A technical analysis might have a high one. Knowing your own thresholds is worth more than any benchmark.

The Bigger Picture

The scaling episode told you that the industry is spending billions because bigger models perform better on a smooth, predictable curve. The practical reality for you is that the curve has a cost axis too. And on that cost axis, the returns diminish fast. Going from the smallest to a mid-tier model buys you a lot of capability per dollar. Going from mid-tier to frontier buys you less per dollar but on harder tasks. Going from frontier to frontier-with-extended-thinking buys you the least per dollar, but on the hardest tasks, it is the only thing that works.

The scaling laws are real. But they describe the ceiling of what is possible, not the floor of what you need. The person who picks the right model for each task will get ninety-five percent of the results at twenty percent of the cost compared to the person who uses the biggest model for everything. Over a year of regular use, that is the difference between a rounding error on your budget and a line item.

The truck gets the milk. But so does the bicycle. And the bicycle is more fun to park.

That was the practical companion to episode seven.