This is the deep dive companion to episode seven of Actually, AI: scaling.
The main episode told you the story of the curve. A smooth power law, validated across seven orders of magnitude, worth trillions of dollars in bets. In this deep dive, we go into the papers themselves, the debates raging around them, the geopolitics of compute, the question of whether we are running out of data, and the philosophical argument underneath it all. If the main episode was the what, this is the why and the what if.
We will start where the money started. With a physicist who wandered into machine learning and noticed something that looked like a law of nature.
As we covered in the main episode, Jared Kaplan and nine co-authors at OpenAI published their scaling laws paper in January twenty twenty. The main episode told you what they found. The deep dive goes into how they found it, and why the details matter.
Kaplan trained models ranging from seven hundred sixty-eight thousand parameters to one and a half billion parameters, on datasets from twenty-two million to twenty-three billion tokens. Not frontier scale even by twenty twenty standards. But the curves were clean enough to extrapolate. When you plotted loss against compute on a log-log graph, the result was a straight line. In physics, a straight line on a log-log plot means a power law, the same mathematical relationship that governs earthquake magnitudes, city population rankings, and how often words appear in human language. Each tenfold increase in parameters yielded roughly a sixteen percent decrease in loss. And the relationship held across more than seven orders of magnitude, a factor of ten million.
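To make the shape of that curve concrete, here is a minimal sketch. The exponent is back-derived from the sixteen-percent figure quoted above, so it illustrates the relationship rather than reproducing Kaplan's actual fit or constants.

```python
import math

# Minimal sketch of what "a straight line on a log-log plot" means here.
# The exponent is back-derived from the sixteen-percent-per-tenfold figure
# in the text (10 ** -alpha = 0.84), not copied from the paper's coefficients.
ALPHA = -math.log10(0.84)   # ~0.076

def relative_loss(params: float, reference_params: float = 1e6) -> float:
    """Loss relative to a reference model, following loss ∝ params ** -ALPHA."""
    return (params / reference_params) ** -ALPHA

for n in [1e6, 1e7, 1e8, 1e9]:
    print(f"{n:.0e} params -> {relative_loss(n):.3f} of the reference loss")
# Each tenfold jump multiplies loss by ~0.84: 1.000, 0.840, 0.706, 0.593.
```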
Here is the detail that matters most for understanding what came next. Kaplan found that architectural details barely mattered. Network width, depth, the specific arrangement of layers. None of it moved the needle significantly. What mattered was the total parameter count. This was a radical simplification. It meant you did not need to be clever about architecture. You needed to be big. And it came with a recipe. If you had a fixed compute budget, Kaplan said, prioritize parameters over data. For every tenfold increase in budget, scale parameters five and a half times and data only one point eight times.
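For concreteness, here is a minimal sketch of that recipe, deriving the scaling exponents directly from the five-and-a-half and one-point-eight figures above rather than from any paper's fitted coefficients.

```python
import math

# Hedged sketch of the Kaplan-style allocation recipe described above.
# Exponents are derived from the recipe itself (5.5x parameters and 1.8x data
# per 10x compute), not taken from fitted scaling-law coefficients.
PARAM_EXPONENT = math.log10(5.5)   # ~0.74
DATA_EXPONENT = math.log10(1.8)    # ~0.26

def kaplan_allocation(budget_multiplier: float) -> tuple[float, float]:
    """How much to grow parameters and data for a given compute-budget multiplier."""
    return budget_multiplier ** PARAM_EXPONENT, budget_multiplier ** DATA_EXPONENT

p, d = kaplan_allocation(10.0)
print(f"10x compute -> {p:.1f}x parameters, {d:.1f}x data")    # 5.5x and 1.8x
p, d = kaplan_allocation(100.0)
print(f"100x compute -> {p:.1f}x parameters, {d:.1f}x data")   # ~30x and ~3.2x
```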
That recipe directly shaped GPT-3. One hundred seventy-five billion parameters, trained on three hundred billion tokens. An enormous model by the standards of twenty twenty, trained on what Kaplan's prescription said was roughly the right amount of data. It was a successful bet. GPT-3 could do things no previous model could. Few-shot learning, code generation, passable translation. The reaction inside OpenAI, later captured in Dwarkesh Patel and Gavin Leech's oral history "The Scaling Era," was reportedly something like: holy shit, we live in the scaling world.
But the recipe was wrong. Not wrong about the curve. Wrong about the balance.
You heard about Chinchilla in the main episode. Here is what Hoffmann actually did. He ran one of the most comprehensive training experiments ever attempted. Over four hundred language models, ranging from seventy million to over sixteen billion parameters, trained on five billion to five hundred billion tokens. The goal was simple. For a given compute budget, what is the optimal split between model size and training data?
The answer was devastating for anyone who had followed Kaplan's recipe. The optimal ratio was not five and a half to one in favor of parameters. It was one to one. For every doubling of model size, double the training data too. The magic number was approximately twenty training tokens per model parameter.
Apply that number to GPT-3, and the implications are stark. One hundred seventy-five billion parameters times twenty equals three and a half trillion tokens. GPT-3 was trained on three hundred billion. It was undertrained by more than a factor of ten. Or, flipped around, three hundred billion tokens only justified a model of about fifteen billion parameters. One twelfth of what OpenAI actually built.
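The arithmetic is simple enough to sketch directly. The twenty-to-one ratio is the rule of thumb from the text; real compute-optimal fits vary somewhat with budget.

```python
# Back-of-envelope Chinchilla arithmetic: roughly twenty training tokens per parameter.
TOKENS_PER_PARAM = 20

def optimal_tokens(n_params: float) -> float:
    """Tokens a model of this size 'wants' under the twenty-to-one rule of thumb."""
    return n_params * TOKENS_PER_PARAM

def optimal_params(n_tokens: float) -> float:
    """Model size a fixed token budget justifies under the same rule."""
    return n_tokens / TOKENS_PER_PARAM

print(f"{optimal_tokens(175e9) / 1e12:.1f} trillion tokens for 175B parameters")  # 3.5
print(f"{optimal_params(300e9) / 1e9:.0f}B parameters for 300 billion tokens")    # 15
```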
To prove the point, Hoffmann's team took the exact compute budget DeepMind had used to train Gopher, their two hundred eighty billion parameter flagship, and reallocated it. Seventy billion parameters. Four times more data. The resulting model, Chinchilla, beat not just Gopher but GPT-3, Jurassic-1, and the five hundred thirty billion parameter Megatron-Turing. A model one quarter the size, outperforming models four and seven times larger. On the MMLU benchmark, Chinchilla scored sixty-seven and a half percent, seven points above Gopher. The conclusion was blunt. Current large language models, they wrote, are significantly undertrained.
The industry pivoted. Meta's Llama series is the clearest example. Smaller models, trained on vastly more data, following the Chinchilla prescription almost to the letter. The paper that everyone had been treating as gospel, Kaplan's, turned out to have a systematic bias. A twenty twenty-four reconciliation paper traced the discrepancy to something almost embarrassingly specific. Kaplan's team counted non-embedding parameters. Hoffmann's counted total parameters. That accounting difference, combined with Kaplan's experiments being run at smaller scale, produced biased coefficients. A technical distinction in how you count things redirected hundreds of billions of dollars of infrastructure investment.
One of Hoffmann's co-authors, Arthur Mensch, later left DeepMind to co-found Mistral AI. Three of Kaplan's co-authors, Kaplan himself, Sam McCandlish, and Dario Amodei, co-founded Anthropic. The scaling papers did not just redirect money. They redistributed the people who would build the next generation of AI companies.
Now we enter what might be the most consequential open debate in AI research. The question is this. When you scale a model up, does it gradually get better at everything, or does it suddenly acquire entirely new capabilities at certain thresholds?
In June twenty twenty-two, Jason Wei at Google Brain published a paper cataloguing what he called emergent abilities. An ability was emergent, Wei defined, if it was not present in smaller models but appeared in larger ones. Not gradually improved. Appeared. Near-random performance up to a certain scale, then a sharp jump. Wei found over a hundred such abilities across multiple model families. Three-digit addition emerged around thirteen billion parameters. Five-digit addition required one hundred seventy-five billion. Chain-of-thought prompting, the technique where you ask a model to show its reasoning step by step, only worked at scale. Below roughly one hundred billion parameters, instruction-finetuning actually hurt performance.
There are abilities that are not present in smaller models but are present in larger models. These are emergent. You cannot predict them simply by extrapolating the performance of smaller models.
If emergence is real, it is the strongest possible argument for the scaling bet. It means you do not just get a better model when you scale up. You get a qualitatively different model. New capabilities that did not exist before. The implication for labs is obvious. You must keep scaling because you literally cannot know what abilities the next order of magnitude will unlock.
Then, in April twenty twenty-three, a PhD student at Stanford named Rylan Schaeffer sat in a lecture about emergent abilities and had a thought. What if the emergence was not in the models? What if it was in the metrics?
Schaeffer, along with Brando Miranda and their advisor Sanmi Koyejo, published a paper with the pointed title "Are Emergent Abilities of Large Language Models a Mirage?" The core argument was precise and testable. The sharp transitions Wei documented were artifacts of the evaluation metrics researchers chose. When you measure performance with a harsh, nonlinear metric, like exact-match accuracy where getting one digit wrong in a ten-digit number counts as a complete failure, you get a curve that looks flat and then shoots up. Switch to a linear metric, like measuring what fraction of digits are correct, and the sharp transition vanishes. The improvement was smooth and continuous all along. You just could not see it because of how you were keeping score.
The mirage of emergent abilities only exists because of the programmers' choice of metric. Once you investigate by changing the metrics, the mirage disappears.
Schaeffer tested this across twenty-nine different metrics. Twenty-five of them showed no emergence at all, just smooth, continuous improvement. He then demonstrated something even more striking. He took a vision model, where nobody had claimed emergence, and applied the same harsh metrics that NLP researchers used. Emergence appeared. He had manufactured it in a domain where it was never supposed to exist. The paper predicted this result for the InstructGPT and GPT-3 family, and the prediction held.
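Here is a hedged, purely synthetic illustration of the metric argument, not Schaeffer's code or data: a toy "model" whose per-digit accuracy improves smoothly as scale grows still looks emergent when scored with exact match.

```python
# Synthetic illustration of the metric argument (toy numbers, not the paper's data).
# Per-digit accuracy improves smoothly as a stand-in "scale" grows; exact-match
# accuracy on a ten-digit answer requires every digit to be right, so it sits
# near zero and then jumps, which reads as a sudden emergent ability.

NUM_DIGITS = 10

def per_digit_accuracy(scale: int) -> float:
    # Hypothetical smooth improvement with each order of magnitude of scale.
    return min(0.99, 0.30 + 0.10 * scale)

for scale in range(8):
    p = per_digit_accuracy(scale)
    exact_match = p ** NUM_DIGITS   # harsh metric: one wrong digit means total failure
    print(f"scale {scale}: per-digit {p:.2f}, exact-match {exact_match:.3f}")
```

The smooth curve and the apparent cliff are the same underlying improvement; only the scoring rule differs.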
The paper won the Outstanding Paper award at NeurIPS twenty twenty-three, the field's most prestigious venue. That is significant. It means the academic community found the argument rigorous enough to honor, not just publish.
But here is why the debate is not settled. Even if Schaeffer is right about the metrics, it does not mean scaling produces no surprises. A model that goes from getting two digits right out of ten to getting nine digits right out of ten has improved gradually, but the practical difference between two correct digits and nine correct digits is enormous. The transition from "useless for this task" to "useful for this task" may be smooth mathematically and still feel sudden to the person who needs the answer. Whether you call that emergence or a measurement artifact depends on what question you are asking. Are we asking about the mathematics of learning, or about the experience of using these systems? Both are valid questions. They have different answers.
Underneath the scaling papers, the Chinchilla correction, and the emergence debate, there is a deeper philosophical argument. It was articulated most clearly not by a machine learning researcher but by a reinforcement learning pioneer who spent his career watching the same pattern repeat across decades.
Rich Sutton earned his bachelor's degree in psychology at Stanford in nineteen seventy-eight, then moved to the University of Massachusetts Amherst where Andrew Barto supervised his PhD in computer science. Their textbook "Reinforcement Learning: An Introduction" has been cited over eighty-eight thousand times. Sutton spent years at GTE Laboratories, then AT&T's Shannon Laboratory, before settling at the University of Alberta in two thousand three. He later joined DeepMind's Alberta lab. He and Barto won the twenty twenty-four Turing Award for their foundational work on reinforcement learning.
In March twenty nineteen, Sutton published a short essay on his personal website. It was roughly one thousand words. It cited no papers. It contained no equations. It may be the most influential piece of writing in modern AI.
He called it "The Bitter Lesson." The lesson, drawn from what he described as seventy years of AI research, is that general methods which use more computation always eventually beat clever methods that try to encode human knowledge directly. Always. The bitter part is that researchers who pour their expertise into hand-crafted systems feel personally satisfied by their approach. It works in the short term. It produces elegant solutions. And in the long run, it always loses to the brute-force approach, which feels intellectually disappointing but scales.
The biggest lesson that can be read from seventy years of AI research is that general methods that leverage computation are ultimately the most effective, and by a large margin.
Sutton walked through the examples. Chess. The attempts to build in grandmaster knowledge, the opening books, the endgame tables, the positional evaluation heuristics that researchers spent years perfecting. Deep Blue beat Kasparov in nineteen ninety-seven using massive search with a relatively simple algorithm. The knowledge-based chess researchers, Sutton noted dryly, were not good losers. Speech recognition. The DARPA competitions in the nineteen seventies and eighties, where teams that modeled the physics of the vocal tract and the structure of language lost to teams running statistical methods on raw audio data. Go. AlphaGo relied less on expert game knowledge than previous generations, and AlphaGo Zero removed human expertise entirely, training only by self-play.
We want AI agents that can discover like we can, not which contain what we have discovered.
The essay became shorthand for an entire philosophy. If you believe the bitter lesson, then the scaling bet is not just an engineering decision. It is the natural conclusion of a seventy-year trend. Every dollar spent on making models bigger and giving them more data is a dollar aligned with history. Every dollar spent on clever architectural tricks is a dollar swimming against the current. The essay did not create the scaling bet, but it gave it intellectual legitimacy that pure empiricism could not.
The counter-argument, and there is one, is that Sutton's examples are more selective than they appear. Chess, Go, and speech recognition are domains with clear objectives and abundant training signal. Whether the same pattern holds for open-ended reasoning, common sense, or understanding causality is genuinely unknown. The bitter lesson may be true about a subset of intelligence and misleading about the rest.
If scale equals capability, then whoever controls compute controls the future of AI. This is not an abstraction. It is the operating principle behind some of the largest trade policy decisions of the twenty-first century.
On October seventh, twenty twenty-two, the Biden administration issued export controls through the Commerce Department's Bureau of Industry and Security. The target was China's access to advanced AI chips. The A100, NVIDIA's flagship training accelerator, was cut off. NVIDIA responded with the ingenuity of a company facing the loss of one of its largest markets. They designed the A800, an A100 with the chip-to-chip interconnect bandwidth reduced from six hundred gigabytes per second to four hundred. Below the control threshold. Legal to sell to China.
A year later, in October twenty twenty-three, the rules tightened. The new criteria were broader. Not just specific chips but any chip meeting certain thresholds for total performance, performance density, and datacenter design. NVIDIA had already shipped the H800, a China-specific version of the H100 with interconnect bandwidth reduced from nine hundred to roughly three hundred gigabytes per second. These chips, the ones designed to thread the needle between compliance and capability, are what DeepSeek trained its models on.
The compute governance thesis, articulated most thoroughly by Lennart Heim at the Centre for the Governance of AI, rests on a simple observation. AI chips are physical objects. They are manufactured by a small number of companies, using equipment from an even smaller number of suppliers, most critically ASML in the Netherlands. Unlike software, which can be copied and distributed freely, chips must be designed, fabricated, packaged, and shipped. Each step in that chain is a potential point of control.
NVIDIA's dominance makes this even starker. Depending on which estimate you use, they hold between eighty and ninety-two percent of the AI accelerator market. An H100 costs approximately three thousand three hundred twenty dollars to manufacture and sells for roughly twenty-eight thousand dollars, an eighty-eight percent gross margin. The company's data center revenue grew from fifteen billion dollars in twenty twenty-two to over one hundred billion in twenty twenty-four. In October twenty twenty-five, NVIDIA became the first company to reach a five-trillion-dollar market capitalization. The CUDA software ecosystem, which locks developers into NVIDIA hardware, is arguably as important as the chips themselves.
The Trump administration later lifted some restrictions, allowing conditional sales to resume. But the precedent was set. Compute is now a strategic resource, governed with the same seriousness as uranium enrichment or precision-guided munitions. The scaling curve turned silicon into a geopolitical weapon.
The Chinchilla paper told the industry to train on more data. A lot more data. The twenty-tokens-per-parameter rule means a model with one trillion parameters needs twenty trillion tokens. The problem is that the internet is not infinite.
Epoch AI, the research organization that tracks compute and data trends, estimates the total effective stock of quality, repetition-adjusted, human-generated public text available for AI training at approximately three hundred trillion tokens. That sounds like a lot. It is not. If models continue to be overtrained, meaning trained on more data than the Chinchilla-optimal amount so they are cheaper to run at inference, the stock could be fully utilized between twenty twenty-six and twenty thirty-two. If models are overtrained by a factor of five, as several current models are, it runs out by twenty twenty-seven.
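To see how quickly that stock gets consumed, here is a back-of-envelope sketch using the approximate figures quoted above; the model size is a hypothetical, not any specific system.

```python
# Back-of-envelope data-wall arithmetic, using the approximate figures from the text.
STOCK_TOKENS = 300e12      # ~300 trillion tokens of usable human-generated text
TOKENS_PER_PARAM = 20      # Chinchilla rule of thumb
OVERTRAIN_FACTOR = 5       # overtraining factor several current models use

model_params = 1e12        # hypothetical one-trillion-parameter model
chinchilla_tokens = model_params * TOKENS_PER_PARAM
overtrained_tokens = chinchilla_tokens * OVERTRAIN_FACTOR

print(f"Chinchilla-optimal: {chinchilla_tokens / 1e12:.0f} trillion tokens")    # 20
print(f"Overtrained 5x:     {overtrained_tokens / 1e12:.0f} trillion tokens")   # 100
print(f"Stock covers about {STOCK_TOKENS / overtrained_tokens:.0f} such runs")  # 3
```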
And the wall is getting closer from the other direction. By April twenty twenty-five, over seventy-four percent of newly created webpages contained AI-generated text. The internet is filling up with text produced by the very models that need fresh human text to improve. This is not a theoretical concern. An Oxford University paper published in Nature in July twenty twenty-four demonstrated what happens when AI models train on AI-generated data through successive generations. The models collapse. Their view of reality narrows. Rare events vanish first, then the distribution drifts toward bland, generic output. The researchers called it model collapse, and it follows a predictable degradation curve of its own.
The industry's response has been to buy what it cannot scrape. Reddit signed a deal with Google. News Corp signed one with OpenAI. High-quality human-generated text, once free for the taking, is becoming a commodity with a price tag. The alternative is synthetic data, where models generate their own training material. DeepSeek's reasoning improvements came partly from synthetic data generated during post-training. But the model collapse research suggests this only works when mixed carefully with real data. Train exclusively on synthetic output, and the models eat themselves.
A training run is not an abstraction. It is thousands of processors running at full load for weeks or months, consuming electricity, generating heat, and demanding cooling. The scaling curve has a physical substrate, and that substrate has limits.
In twenty twenty-four, global data centers consumed approximately four hundred fifteen terawatt-hours of electricity, roughly one and a half percent of the world's total. By twenty thirty, that number is projected to reach nine hundred forty-five terawatt-hours, about three percent. AI workloads currently account for five to fifteen percent of data center power consumption but are projected to reach thirty-five to fifty percent by the end of the decade.
In the United States specifically, data centers consumed one hundred eighty-three terawatt-hours in twenty twenty-four. The projection for twenty thirty is four hundred twenty-six terawatt-hours, a one hundred thirty-three percent increase. A Carnegie Mellon estimate suggests this could raise the average American electricity bill by eight percent, with northern Virginia, the densest cluster of data centers on earth, potentially seeing increases above twenty-five percent.
These numbers have physical consequences. In twenty twenty-four, a minor disturbance in Virginia's Fairfax County caused sixty data centers to switch to backup generators simultaneously. The sudden loss of fifteen hundred megawatts, roughly equivalent to the entire power demand of Boston, nearly triggered cascading failures across the grid. The AI industry is building demand faster than the grid can build supply. The scaling curve does not exist in a vacuum. It exists on a planet with finite copper, finite water for cooling, finite patience from the people who share its power grid.
The main episode mentioned that voices questioning the curve are getting louder. In this deep dive, we give them their full hearing.
Gary Marcus is a cognitive scientist at NYU who has been warning about the limits of scaling since before scaling was fashionable. His two thousand one book "The Algebraic Mind" described what we now call hallucination, the tendency of neural networks to produce confident nonsense, a full two decades before ChatGPT made it a household word. His argument has been consistent for over twenty years. Scale does not confer understanding. A model that has seen more text does not understand text. It handles text better, in the same way that a faster calculator handles arithmetic better without understanding mathematics.
There is no principled solution to hallucinations in systems that traffic only in the statistics of language without explicit representation of facts and explicit tools to reason over those facts.
Marcus advocates for hybrid AI, systems that combine neural networks with symbolic reasoning. Almost the entire elite of deep learning fought back against his criticisms. Sam Altman, Greg Brockman, Elon Musk, and Yann LeCun all publicly ridiculed him at various points. Then, in November twenty twenty-four, Marc Andreessen said current models were hitting a ceiling on capabilities. Marcus, with what must have been considerable satisfaction, wrote a blog post titled "Scale Is All You Need is dead."
Yann LeCun, who won the Turing Award and served as Meta's Chief AI Scientist, takes a different angle. He does not think scaling itself is the problem. He thinks the entire paradigm of large language models is a dead end.
LLMs perform well at the language level, but they do not understand the world. They lack common sense and causal relationships and are just a stack of a large number of statistical correlations.
LeCun's alternative is what he calls world models. AI that internally simulates the physical world and predicts how it changes over time, the way a baby learns about gravity by watching objects fall, not by reading about physics. In November twenty twenty-five, LeCun left Meta to found Advanced Machine Intelligence Labs, and by March twenty twenty-six, the company had raised over one billion dollars in seed funding at a three and a half billion dollar valuation. The market, it turns out, is willing to bet on alternatives to the scaling curve too.
And then there is the most surprising skeptic of all. Ilya Sutskever, OpenAI's co-founder and the person who reportedly told Dario Amodei "the models, they just want to learn," declared in November twenty twenty-five that the age of scaling was over.
Is the belief really, oh, it is so big, but if you had a hundred times more, everything would be so different? It would be different, for sure. But is the belief that if you just hundred-x the scale, everything would be transformed? I do not think that is true.
Sutskever's framework divides AI history into three eras. Twenty twelve to twenty twenty, the age of research. Twenty twenty to twenty twenty-five, the age of scaling. Twenty twenty-six onward, another age of research, just with bigger computers. The distinction matters. He is not saying compute does not help. He is saying that pure scale, the strategy of making models bigger and feeding them more data without fundamental architectural innovation, has reached its useful limit. What comes next requires ideas, not just transistors.
In the middle of this debate about whether scaling requires ever-larger budgets, a Chinese lab produced a result that confused everyone's narrative.
DeepSeek V3 was trained on two thousand forty-eight NVIDIA H800 chips for approximately two months. Total compute cost: roughly five point six million dollars. For context, the estimated training cost of GPT-4 was seventy-eight million dollars. Llama three point one at four hundred five billion parameters cost an estimated one hundred seventy million. Gemini Ultra, one hundred ninety-one million. DeepSeek achieved competitive performance at a fraction of the cost.
Headlines about DeepSeek R1 costing only two hundred ninety-four thousand dollars were misleading, confusing the reinforcement learning post-training phase with the full pre-training run. The actual total including R1 was closer to five point nine million. Still dramatically cheaper than Western competitors. Part of the explanation is architectural. DeepSeek used a Mixture of Experts design where not all parameters are active simultaneously. Part of it was engineering discipline. And part of it was necessity. The H800 chips they trained on, the China-specific versions with reduced interconnect bandwidth, forced them to find clever solutions to communication bottlenecks that Western labs never had to solve.
The DeepSeek result does not invalidate the scaling curve. It complicates the assumption that riding the curve requires unlimited capital. If clever engineering can achieve comparable results at one thirtieth the cost, the economics of the scaling bet change dramatically. The moat is not money. The moat might be ideas.
The most significant shift in how the industry thinks about scaling happened in twenty twenty-four, and it amounts to a simple reframing. What if instead of making the model bigger, you let it think longer?
Traditional scaling is about pre-training. You invest compute before the model sees any user query, baking capability into the weights. Test-time compute, also called inference-time scaling, invests compute at the moment of use. When the model encounters a hard problem, it reasons through it step by step, checking its own work, exploring multiple approaches, spending more time on harder questions and less on easy ones.
OpenAI's o1 model, released in September twenty twenty-four, was the first major product built on this principle. The model dynamically increases its reasoning time during inference, following what cognitive scientists call System Two thinking, slow, deliberate, and logical, as opposed to the fast, intuitive System One thinking of standard language models. Jason Wei, who had catalogued emergent abilities at Google Brain before joining OpenAI, was one of the co-creators.
The early results suggest that test-time compute may follow its own scaling laws. Spend more compute at inference, get better answers, on a predictable curve. But there is an important nuance. On easy and medium-difficulty questions, test-time compute can substitute for pre-training compute. You can get GPT-4-level answers from a smaller model that thinks longer. On genuinely hard questions that fall outside the base model's capabilities, pre-training still matters more. You cannot think your way to knowledge you never had.
DeepSeek R1 reinforced the pivot. Its reasoning improvements came from reinforcement learning during post-training, not from massive pre-training scale. The model learned to reason better by practicing reasoning, not by absorbing more text. This is a different kind of scaling. The compute is still being spent. The curve still matters. But the axis has shifted. The question is no longer just how big the model is. It is how hard the model thinks.
Dario Amodei, in a February twenty twenty-six podcast, acknowledged the shift explicitly. Anthropic's revenue had grown from near zero to nine or ten billion dollars in three years, riding the scaling curve. But he said something striking.
We are near the end of the exponential. The curve cannot continue forever since GDP is only so large.
The bet has not failed. It has evolved. The scaling hypothesis in its original form, that making models bigger on more data produces predictable improvement, remains empirically validated. What has changed is the industry's understanding of what "scaling" means. Pre-training scale is one axis. Test-time compute is another. Reinforcement learning is a third. Synthetic data is a fourth. The curve continues, but it is branching.
This episode's term: FLOP.
If you texted a friend, you would say: a FLOP, short for floating-point operation, is one math operation a computer performs. When AI people say a model used ten to the twenty-five FLOPs to train, they mean the GPUs collectively performed ten trillion trillion math operations to produce that model. It is the universal currency for measuring how much compute went into something.
How marketing uses it: they do not. FLOPs are too honest for marketing. You will see "massive compute" and "unprecedented scale" instead, which sound impressive but tell you nothing measurable.
What it actually means in practice: FLOPs are the one unit that lets you compare apples to apples across different hardware, different labs, and different eras. A training run on ten thousand H100 chips for three months and a training run on fifty thousand older chips for six months might use the same number of FLOPs. The scaling laws are written in FLOPs, not in dollars or hours or chip counts, because FLOPs measure the actual work done regardless of how you did it. When Epoch AI says training compute doubles every five months, they mean FLOPs. When Kaplan plotted his power law, the x-axis was FLOPs. It is the heartbeat of the scaling curve.
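For a sense of the magnitudes involved, a common rule of thumb, and it is only a rule of thumb, estimates training FLOPs as roughly six times parameters times tokens. A minimal sketch with a hypothetical model:

```python
# Rule-of-thumb training-FLOP estimate: roughly six FLOPs per parameter per token.
# This is a standard approximation, not an exact accounting for any real model.

def training_flops(n_params: float, n_tokens: float) -> float:
    return 6 * n_params * n_tokens

# Hypothetical example: a 70B-parameter model trained on 15 trillion tokens.
print(f"{training_flops(70e9, 15e12):.1e} FLOPs")   # ~6.3e+24, near ten to the twenty-five
```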
That was the deep dive for episode seven.