Actually, AI
Scaling: The Billion-Dollar Bet
S1 E7 · 9m · Apr 04, 2026
When Jared Kaplan plotted a physicist's equation across seven orders of magnitude, he discovered that spending a billion dollars on bigger AI models followed a curve so predictable it reshaped the entire industry—but nobody knows if it holds one more step.


The Curve That Should Not Exist

This is episode seven of Actually, AI.

In most of engineering, bigger eventually stops meaning better. A bridge twice as long is not twice as strong. An engine twice the size does not produce twice the power. Diminishing returns are the default. You scale something up, and at some point the gains taper off, then plateau, then vanish entirely. This is so reliable that engineers have a name for it. They call it the real world.

Neural networks do not behave this way. As you make a language model larger, give it more data, train it for longer, its performance improves along a smooth, predictable curve. Not a little. Across seven orders of magnitude, meaning a factor of ten million. The curve does not flatten. It does not wobble. It just keeps going, with the regularity of a physical law. And nobody fully understands why.

This is the fact that the entire AI industry is built on. Not a theory. Not a hope. An empirical observation that happens to be worth about four trillion dollars in market capitalization. Every model you have ever used, every chatbot that impressed you, every image generator that startled you, exists because someone looked at that curve and decided to bet that it would hold for one more order of magnitude. So far, the bet has paid off. The question that keeps the people making these bets awake at night is a simple one. Where does the curve stop?

Two Papers, Two Hundred Billion Dollars

The story of how a smooth curve redirected the global economy starts with a physicist who wandered into machine learning. Jared Kaplan did his undergraduate work at Yale in mathematics and physics, earned a doctorate in theoretical physics at Stanford, then spent years at Johns Hopkins as a professor working on string theory and cosmology. The kind of person who thinks in equations the way most people think in sentences. At some point in the late twenty-tens, Kaplan became interested in whether the same mathematical frameworks that describe physical phenomena might describe neural network behavior.

In January twenty twenty, Kaplan and nine co-authors at OpenAI published a paper called "Scaling Laws for Neural Language Models." The finding was striking. When you plotted model performance against the amount of compute used to train it, the result was not a messy scatter. It was a clean line on a logarithmic graph. A power law, the same type of relationship that governs earthquake magnitudes, city sizes, and word frequencies in human language. Each tenfold increase in parameters yielded roughly a sixteen percent decrease in loss, the measure of how wrong the model is. And this held across more than seven orders of magnitude.
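
In symbols, the relationship the paper fit looks roughly like this, where N is the parameter count, N_c is a fitted constant, and the exponent is the paper's reported value for parameter scaling:

    L(N) = (N_c / N)^0.076

Multiply N by ten and the loss gets multiplied by ten to the power of negative zero point zero seven six, which is about 0.84. That is where the sixteen percent figure comes from.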

The practical implication was enormous. If the curve was reliable, you could predict how good a model would be before you trained it. You could run a small experiment for a few thousand dollars and know what a hundred-million-dollar training run would produce. Dario Amodei, one of the paper's co-authors and later the founder of Anthropic, compared the precision to physics.
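
Here is that idea in miniature. A power law is a straight line on a log-log plot, so you can fit the line from a handful of cheap pilot runs and read off where a far more expensive run should land. This is a minimal sketch; the numbers are invented for illustration, not taken from any real training run:

    import numpy as np

    # Hypothetical (compute, loss) pairs from cheap pilot runs.
    compute = np.array([1e17, 1e18, 1e19, 1e20])  # training FLOPs
    loss = np.array([3.80, 3.19, 2.68, 2.25])     # cross-entropy loss

    # A power law L = a * C^(-b) is a straight line in log-log space,
    # so a least-squares fit on the logs recovers the exponent.
    slope, intercept = np.polyfit(np.log10(compute), np.log10(loss), 1)

    # Extrapolate four orders of magnitude past the largest pilot run.
    big_run = 1e24
    predicted = 10 ** (intercept + slope * np.log10(big_run))
    print(f"exponent: {-slope:.3f}, predicted loss at 1e24 FLOPs: {predicted:.2f}")

Four pilot runs spanning three orders of magnitude, and the fitted line extrapolates another four without blinking. Whether reality cooperates is the whole bet.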

The scaling metrics achieve several significant figures of accuracy. That is unusual outside physics.

Kaplan's paper came with a prescription. If you had a fixed budget for training, he said, spend most of it on making the model bigger. Scale parameters five and a half times for every tenfold increase in budget. Scale data only one point eight times. This recommendation directly shaped how OpenAI built GPT-3, with its one hundred seventy-five billion parameters.
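
A quick sanity check on those ratios, as a back-of-envelope sketch rather than anything from the paper itself: training compute grows roughly with parameters times tokens, so the two multipliers should multiply back out to the tenfold budget increase. They nearly do:

    # Kaplan-style allocation: per 10x compute, parameters x5.5, data x1.8.
    params_mult, data_mult = 5.5, 1.8
    print(params_mult * data_mult)  # 9.9, roughly the 10x budget increase

    # Applied forward from a GPT-3-scale starting point (illustrative only):
    params, tokens = 175e9, 300e9
    for step in (1, 2):  # two further tenfold budget increases
        params *= params_mult
        tokens *= data_mult
        print(f"after {step} decade(s) of budget: {params:.2e} params, {tokens:.2e} tokens")

Run it forward and the model balloons while the dataset barely grows. That lopsidedness is exactly what the next paper attacked.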

Then, two years later, a team at DeepMind proved the prescription was wrong.

The Chinchilla Correction

Jordan Hoffmann was a research scientist at DeepMind in London. He had a PhD in applied mathematics from Harvard and a quiet intensity about getting the numbers right. In March twenty twenty-two, Hoffmann and over twenty co-authors published a paper with the uninspiring title "Training Compute-Optimal Large Language Models." The industry called it Chinchilla, after the model they built to prove their point. The paper's central claim was simple and devastating. The entire industry was scaling incorrectly.

Kaplan had said to make models big and train them briefly. Hoffmann said to train models longer on more data. For every doubling of model size, double the training data too. The optimal ratio, Chinchilla found, was approximately twenty training tokens per model parameter. By that standard, GPT-3, with one hundred seventy-five billion parameters trained on three hundred billion tokens, was wildly undertrained. It should have seen three and a half trillion tokens. Or, looked at from the other direction, three hundred billion tokens only justified a model of about fifteen billion parameters, not one hundred seventy-five billion.
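
You can turn that rule into a two-line calculator. A minimal sketch, assuming the common approximation that training takes about six floating-point operations per parameter per token; the function name is mine, not the paper's:

    import math

    def chinchilla_optimal(compute_flops, tokens_per_param=20.0):
        # Chinchilla rule of thumb: D = 20 * N tokens, with compute C ~ 6 * N * D,
        # so C ~ 120 * N^2 and the optimal N is sqrt(C / 120).
        n = math.sqrt(compute_flops / (6.0 * tokens_per_param))
        return n, tokens_per_param * n

    # GPT-3's actual budget: 175B parameters on 300B tokens.
    c_gpt3 = 6.0 * 175e9 * 300e9
    n_opt, d_opt = chinchilla_optimal(c_gpt3)
    print(f"compute-optimal: {n_opt / 1e9:.0f}B params on {d_opt / 1e12:.1f}T tokens")

    # The two framings from above:
    print(f"{20 * 175e9 / 1e12:.1f}T tokens to match 175B params")    # 3.5T
    print(f"{300e9 / 20 / 1e9:.0f}B params justified by 300B tokens")  # 15B

By this arithmetic, GPT-3's compute would have been better spent on a model of roughly fifty billion parameters fed about a trillion tokens.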

To prove their point, Hoffmann's team did something elegant. They took the same compute budget DeepMind had used to train Gopher, their two hundred eighty billion parameter model, and split it differently. Seventy billion parameters, four times more data. The resulting model, Chinchilla, uniformly outperformed not just Gopher but also GPT-3, Jurassic-1, and the five hundred thirty billion parameter Megatron-Turing. A model one quarter the size, beating models four and seven times larger, because the data balance was right.

The industry pivoted almost overnight. Meta's Llama series is the clearest example of the Chinchilla lesson applied. Smaller models, vastly more data. The paper that everyone had been using as their roadmap, Kaplan's, turned out to have a systematic bias from the way it counted parameters and the small scale at which it was tested. A twenty twenty-four reconciliation paper confirmed that much of the disagreement came down to Kaplan counting non-embedding parameters while Chinchilla counted total parameters. A technical distinction that redirected hundreds of billions of dollars.
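
To see why the counting convention matters, here is a rough sketch built on two standard approximations: about twelve times layers times width squared for the transformer blocks, and vocabulary times width for the embedding matrix. The vocabulary size is illustrative:

    def param_counts(n_layer, d_model, n_vocab=50_000):
        # Rough counts: attention + MLP blocks vs. the token embedding matrix.
        non_embedding = 12 * n_layer * d_model ** 2
        embedding = n_vocab * d_model
        return non_embedding, non_embedding + embedding

    for n_layer, d_model in [(4, 256), (12, 768), (96, 12288)]:
        non_emb, total = param_counts(n_layer, d_model)
        share = 100 * (total - non_emb) / total
        print(f"{n_layer:>2} layers, width {d_model:>5}: embeddings are {share:.0f}% of the total")

At the tiny scales where scaling experiments are cheap, the embedding matrix is most of the model, so "non-embedding parameters" and "total parameters" tell very different stories. At GPT-3 scale the gap all but vanishes. Two careful teams, fitting different laws to the same phenomenon.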

The Shape of the Bet

Here is what makes the scaling story genuinely strange. The curve works. GPT-4's performance was predicted by scaling law experiments that used at most one ten-thousandth of the final compute budget. OpenAI knew what GPT-4 would score on coding benchmarks before training was even finished, based on tiny models trained for a fraction of the cost. That kind of predictive power is rare in any engineering discipline. It is almost unheard of in a field this young.

But here is the catch. The curve predicts aggregate performance. It tells you the overall loss will decrease by this much if you spend this much. What it does not tell you is which specific abilities will appear. Amodei described this as a paradoxical combination.

When does arithmetic come in? When do models learn to code? Sometimes it is very abrupt.

You can predict the average score. You cannot predict the individual questions the student will suddenly get right. The labs are betting billions on a curve that tells them their next model will be better, without telling them exactly how it will be better. It is like knowing that the next earthquake will release a predictable amount of energy, without knowing which buildings will fall.

And the bets are getting larger. A training run that cost tens of thousands of dollars in twenty eighteen now costs tens of millions. The next generation will cost hundreds of millions. Projections suggest billion-dollar training runs by twenty twenty-seven. The training cost of frontier models has been growing at two point four times per year since twenty sixteen. Every one of those dollars is placed on the assumption that the curve continues. If it does, each dollar produces a predictable improvement. If it does not, the economics collapse.
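
The compounding is easy to check. A minimal sketch; the twenty eighteen anchor is an illustrative figure inside the "tens of thousands" range above, not a documented cost:

    # Compounding the reported 2.4x-per-year growth in frontier training cost.
    cost = 50_000  # illustrative 2018 anchor, in dollars
    for year in range(2018, 2027):
        print(f"{year}: ~${cost:,.0f}")
        cost *= 2.4

Tens of thousands in twenty eighteen compounds to tens of millions by twenty twenty-six at that rate, right in line with the trajectory above.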

What If It Stops

The curve is still climbing. But the voices questioning whether it will continue are getting louder, and some of them belong to the people who drew the curve in the first place.

Ilya Sutskever was OpenAI's co-founder and chief scientist, one of the earliest and loudest advocates of the scaling hypothesis. "The models, they just want to learn," he used to say. In November twenty twenty-five, he declared the age of scaling over.

From twenty twenty to twenty twenty-five, it was the age of scaling. But now it is back to the age of research again, just with big computers.

He was not saying scale did not matter. He was saying that pure scale, just making the model bigger and giving it more data, was no longer enough. The industry had to get clever again. New architectures, new training methods, new ways of making models reason. Scaling, he argued, had sucked all the air out of the room, and now it was time to let other ideas breathe.

The numbers support the shift. Reports from inside OpenAI, Google, and Anthropic suggest that the most recent generation of models showed smaller improvements than expected from pure scaling alone. The industry has not stopped spending. It is spending more than ever. But the money is moving. Test-time compute, where the model thinks longer when answering hard questions, is one new axis. Reinforcement learning after pretraining is another. Synthetic data is a third. The bet has not failed. It has evolved. The question is no longer just "does scaling work" but "what exactly are we scaling."

The Honest Mystery

That is the scaling story as of right now. A smooth, beautiful, empirically validated curve that has held for seven orders of magnitude and redirected trillions of dollars. Two papers that disagreed on the recipe but agreed on the fundamental fact. A prediction so precise it rivals physics, for a phenomenon nobody can fully explain.

Rich Sutton, the reinforcement learning pioneer who won the twenty twenty-four Turing Award, wrote an essay in twenty nineteen called "The Bitter Lesson." The lesson, drawn from seventy years of AI research, is that simple methods plus more compute always eventually beat clever methods with less compute. The researchers who encode human knowledge into their systems feel personally satisfied, but they always lose in the long run to the researchers who just scale up the brute-force approach.

The biggest lesson that can be read from seventy years of AI research is that general methods that leverage computation are ultimately the most effective, and by a large margin.

Whether the lesson stays bitter or turns sweet depends entirely on what happens next with that curve. The deep dive for this episode goes further into the papers, the emergent abilities debate, the geopolitics of compute, and the data wall that every lab is staring at. Find it right after this in your feed.

That was episode seven.