This is the deep dive companion to episode three of Actually, AI: training.
In the main episode, we described the training loop in four steps. Show an example, predict, measure the wrongness, adjust the knobs. That description is accurate. It is also like describing a cross-country road trip as "get in the car and drive west." The interesting part is everything that happens along the way.
This deep dive goes into that territory. How wrongness is actually measured. What the landscape of all possible knob settings looks like, and why navigating it is so strange. Why a network can memorize its training data perfectly and still be useless. The surprising discovery that most of a network's knobs might not matter at all. Where the training data comes from and who does the work. And the economics, because training a frontier model now costs more than most buildings.
We will start with wrongness, because that is where the entire process begins.
In the main episode, we called it the "loss." A single number that measures how far the model's prediction is from reality. But how do you turn "the model thought this was a dog and it was actually a cat" into a number?
For language models, the standard measurement is called cross-entropy loss. The intuition is this. The model does not predict a single word. It assigns a probability to every word in its vocabulary. For the sentence "the cat sat on the," the model might assign thirty percent probability to "mat," twenty percent to "floor," five percent to "couch," and tiny fractions to thousands of other words. Cross-entropy loss asks one question. How much probability did the model assign to the word that actually came next?
If the actual next word was "mat" and the model gave it thirty percent, the loss is moderate. If the model gave "mat" ninety-nine percent, the loss is tiny. If the model gave "mat" one tenth of one percent, the loss is enormous. The logarithmic scale is key here. Getting a confident answer wrong is punished far more harshly than getting an uncertain answer wrong. A model that says "I am ninety-nine percent sure this is a dog" when it is actually a cat gets hammered. A model that says "this could be a dog or a cat, roughly fifty-fifty" gets a gentler correction. This asymmetry drives the network to be calibrated, to be confident only when the patterns genuinely support confidence.
For the entire training set, the loss is averaged across all predictions. The total loss is a single number that captures how wrong the model is, on average, across everything it has seen. Training is the process of making that number smaller. Every adjustment to every knob is aimed at pushing that single number downward.
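To make this concrete, here is a minimal sketch in Python. The toy vocabulary and the probabilities are invented for illustration, but the arithmetic is exactly the cross-entropy calculation described above.

```python
import numpy as np

# Toy vocabulary and the model's predicted distribution for
# "the cat sat on the ..." -- probabilities invented for illustration.
vocab = ["mat", "floor", "couch", "rug"]
predicted = np.array([0.30, 0.20, 0.05, 0.45])

def cross_entropy(probs, true_index):
    # The loss is the negative log of the probability the model
    # assigned to the word that actually came next.
    return -np.log(probs[true_index])

print(cross_entropy(predicted, 0))  # "mat" at 30%: about 1.20, moderate
print(-np.log(0.99))                # confident and right: about 0.01, tiny
print(-np.log(0.001))               # confident and wrong: about 6.9, enormous

# Across a dataset, the reported loss is the average over all predictions.
print(np.mean([1.20, 0.01, 6.91]))
```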
This is worth sitting with. All of modern AI, every chatbot, every image generator, every translation service, is optimized against a number. Not truth. Not usefulness. Not beauty. A number that measures prediction error on training data. Everything else, the apparent understanding, the helpfulness, the creativity, emerges as a side effect of making that number smaller. Whether those side effects constitute real understanding or just a very convincing imitation is one of the deepest open questions in the field. We will come back to that.
Now imagine every knob in the network as a dimension. A network with one hundred billion parameters exists in a space with one hundred billion dimensions. Each point in that space represents one specific configuration of all the knobs. At each point, the loss function tells you how wrong the model is with those particular settings. The loss, plotted across all possible configurations, forms a landscape.
This landscape is called the loss landscape. In two thousand eighteen, a team led by Hao Li at the University of Maryland published a paper that actually visualized it, projecting the impossibly high-dimensional surface down to two or three dimensions you can look at. What they found was striking. The landscape is not a smooth bowl with one clear bottom. It is a mountain range. Peaks and valleys and ridges and plateaus, stretching in every direction. Some valleys are deep and narrow. Some are broad and shallow. Some are connected by tunnels of low loss that wind through the terrain in unexpected ways.
Gradient descent navigates this landscape in the dark. At each point, you can feel the slope under your feet, but you cannot see the horizon. You take a step downhill, feel the new slope, take another step. You will eventually reach a valley. But is it the deepest valley? You have no way of knowing. You might be in a shallow dip while a much deeper valley sits on the other side of a ridge you will never cross.
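In code, the walk looks something like the sketch below. The two-dimensional "landscape" and the step size are invented for illustration; real networks have billions of dimensions, and the slope comes from backpropagation rather than the finite differences used here.

```python
import numpy as np

rng = np.random.default_rng(0)

def loss(w):
    # A toy bumpy landscape: a broad bowl with ripples cut into it.
    return 0.1 * np.sum(w**2) + np.sin(3 * w[0]) * np.cos(3 * w[1])

def slope(w, eps=1e-5):
    # "Feel the slope under your feet" by nudging each knob slightly.
    g = np.zeros_like(w)
    for i in range(len(w)):
        bump = np.zeros_like(w)
        bump[i] = eps
        g[i] = (loss(w + bump) - loss(w - bump)) / (2 * eps)
    return g

w = rng.normal(size=2) * 3.0    # random starting point on the mountain
for _ in range(500):
    w = w - 0.05 * slope(w)     # one small step downhill
print(w, loss(w))               # a valley, not necessarily the deepest one
```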
This should be a devastating problem. And for decades, people assumed it was. If gradient descent can only find local minima, nearby valleys, then training a network would be hopelessly sensitive to where you start. Initialize the knobs to random values, start walking downhill from a random point on the mountain, and you would end up in whatever shallow dip happens to be closest. The result would be mediocre and inconsistent.
Here is the surprise. In practice, this does not happen. Large networks trained with gradient descent consistently find good solutions. Not the same solution, they end up in different valleys, but valleys that all perform well. Recent research has even found that different minima are often connected by paths of low loss, like underground tunnels between valleys. The landscape is less rugged than it looks.
One major reason this works is noise. Standard gradient descent would compute the slope using the entire training dataset, billions of examples, before taking a single step. That is absurdly expensive. Instead, practitioners use stochastic gradient descent. You grab a random handful of examples, a mini-batch, compute the slope based on just those, and take a step. The slope estimate is noisy. It points roughly downhill, but it wobbles. And that wobble turns out to be a feature, not a bug. The noise kicks the optimization out of shallow valleys and narrow ravines, allowing it to stumble into broader, deeper ones. A worse estimate of the direction leads to a better final destination. This is one of those places where the math is surprising and the intuition struggles to keep up.
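Here is a sketch of the difference on a toy regression problem. The data, batch size, and learning rate are all invented for illustration; the point is that each step's slope comes from a random handful of examples rather than the whole dataset.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy regression data: y = 2x + 1 plus noise.
X = rng.uniform(-1, 1, size=10_000)
y = 2 * X + 1 + 0.1 * rng.normal(size=X.shape)

w, b = 0.0, 0.0
lr, batch_size = 0.1, 32

for step in range(2_000):
    # Grab a random handful of examples instead of the whole dataset.
    idx = rng.integers(0, len(X), size=batch_size)
    xb, yb = X[idx], y[idx]
    err = (w * xb + b) - yb
    # The slope estimate from this mini-batch is noisy: it points
    # roughly downhill, but it wobbles from step to step.
    w -= lr * 2 * np.mean(err * xb)
    b -= lr * 2 * np.mean(err)

print(w, b)   # wobbles its way to roughly (2, 1)
```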
There is another finding that matters. Broad, flat valleys tend to generalize better than narrow, sharp ones. A sharp minimum means the knob settings are precisely tuned. Shift them slightly, as happens when you switch from training data to real-world data, and the loss spikes. A flat minimum means there is a wide range of nearby settings that all perform well. Small shifts do not matter much. Stochastic gradient descent, with its noisy steps, naturally favors flat minima. It spends more time in broad valleys because they are harder to accidentally leave.
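A toy illustration of why flatness matters. Both functions below have a minimum at loss zero, and the numbers are invented, but the asymmetry is the point: the same small shift in the knob is nearly free in the flat valley and expensive in the sharp one.

```python
# Two minima of a toy one-dimensional loss, both at loss zero.
sharp = lambda w: 50.0 * (w - 1.0) ** 2   # narrow ravine around w = 1
flat  = lambda w: 0.5 * (w + 2.0) ** 2    # broad valley around w = -2

# Shift the knob slightly, as happens when moving from training
# data to real-world data, and compare the damage.
shift = 0.3
print(sharp(1.0 + shift))   # 4.5: the loss spikes
print(flat(-2.0 + shift))   # 0.045: barely moves
```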
There is a trap in training that seems obvious once you hear it and is extraordinarily difficult to avoid in practice.
Show a child a hundred photographs of cats. Then show them a new photograph of a cat they have never seen. They recognize it. Now imagine a different kind of learner. Show it the same hundred photographs. It memorizes every pixel of every image. Ask it about a new photograph and it is lost, because it did not learn what a cat looks like. It learned what those specific hundred photographs look like.
This is overfitting. The network reduces its loss on the training data to nearly zero, not by discovering general patterns, but by memorizing specific examples. It performs brilliantly on data it has seen and terribly on data it has not. The opposite problem, underfitting, happens when the network is too simple to capture the real patterns. It performs poorly on everything, training data and new data alike.
The classical view, taught in machine learning courses for decades, is that this creates a clean U-shaped tradeoff. A very simple model underfits. As you add capacity, it gets better. At some sweet spot, it captures the real patterns without memorizing noise. Add more capacity beyond that and it starts overfitting. Performance on new data gets worse. The name for this balance is the bias-variance tradeoff. Bias is the error from being too simple. Variance is the error from being too sensitive to the specific training examples.
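You can watch the classical U-shape appear in a few lines of numpy. Everything here is invented for illustration: capacity is the degree of a polynomial being fit to twenty noisy points, and in this small-scale regime the classical story holds.

```python
import numpy as np

rng = np.random.default_rng(0)

# A simple underlying pattern plus noise.
true_fn = lambda x: np.sin(2 * np.pi * x)
x_train = rng.uniform(0, 1, 20)
y_train = true_fn(x_train) + 0.2 * rng.normal(size=20)
x_test = rng.uniform(0, 1, 200)
y_test = true_fn(x_test) + 0.2 * rng.normal(size=200)

for degree in [1, 4, 12]:           # capacity: degree of the polynomial
    coeffs = np.polyfit(x_train, y_train, degree)
    train_err = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_err = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print(degree, round(train_err, 3), round(test_err, 3))

# Degree 1 underfits: poor on both sets. Degree 4 sits near the sweet
# spot. Degree 12 overfits: tiny training error, worse test error.
```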
This picture is clean, intuitive, and wrong. Or rather, it is incomplete.
In twenty nineteen, a team led by Mikhail Belkin discovered something that upends the classical story. They called it double descent. As you make a model bigger, past the point where it can perfectly memorize the training data, performance on new data does get worse. The classical theory is right about that. But if you keep making the model bigger, past that dip, performance starts improving again. With far more capacity than it needs to memorize, the model has many ways to fit the training data perfectly, and training tends to settle on the simplest of them. It finds structure.
OpenAI confirmed this in the same year with a study showing three separate forms of double descent. Model-wise, where bigger models dip and recover. Epoch-wise, where training longer causes error to rise and then fall again. And the most counterintuitive version, sample-wise, where adding more training data can temporarily make performance worse before it gets better. The phenomenon appears across architectures, across datasets, across tasks. It is not an anomaly. It is the landscape.
This discovery matters because it explains why modern AI works at all. The models we use are vastly overparameterized. GPT-3 has one hundred seventy-five billion knobs trained on three hundred billion tokens. It could memorize its training data many times over. Classical theory says this should be a disaster. Double descent says the disaster happens and then resolves itself if you push through to the other side. The field discovered this empirically, by building models too large for the theory and observing that they worked anyway. The theory caught up later.
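Double descent can be reproduced at desk scale with a random-features setup similar to the experiments in Belkin's paper. The sketch below is a miniature with invented numbers: the model's "size" is the number of random Fourier features, fit with the minimum-norm solution that numpy's lstsq returns in the overparameterized regime.

```python
import numpy as np

rng = np.random.default_rng(0)

n_train = 40
x_train = rng.uniform(-1, 1, n_train)
y_train = np.sin(3 * x_train) + 0.1 * rng.normal(size=n_train)
x_test = rng.uniform(-1, 1, 500)
y_test = np.sin(3 * x_test)

def test_error(n_features, seed=1):
    # Random Fourier features: the model's "size" is how many we use.
    f_rng = np.random.default_rng(seed)
    w = f_rng.normal(size=n_features) * 3.0
    b = f_rng.uniform(0, 2 * np.pi, n_features)
    phi_train = np.cos(np.outer(x_train, w) + b)
    phi_test = np.cos(np.outer(x_test, w) + b)
    # lstsq returns the minimum-norm fit, which is what matters once
    # there are more features than training points.
    coef, *_ = np.linalg.lstsq(phi_train, y_train, rcond=None)
    return np.mean((phi_test @ coef - y_test) ** 2)

for n_features in [5, 20, 40, 80, 400, 2000]:
    print(n_features, round(test_error(n_features), 4))

# Test error typically worsens as the feature count approaches the
# number of training points (the interpolation threshold, 40 here),
# then improves again as the model grows well past it.
```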
In twenty eighteen, a researcher named Jonathan Frankle at MIT asked a question that sounds absurd. If you take a trained network and remove ninety percent of its connections, can the remaining ten percent still perform just as well?
The answer, against all intuition, was yes. Not just yes, but the small surviving network often performed better than the full one. And it trained faster. Frankle called this the lottery ticket hypothesis. The idea is that a large randomly initialized network contains many possible subnetworks. Most of them are useless. A few of them, the "winning tickets," happened to start with initial weights that make training particularly effective. The full training process, running gradient descent across all the knobs, effectively finds and strengthens these winning subnetworks while the rest of the parameters just go along for the ride.
Dense, randomly-initialized, feed-forward networks contain subnetworks, winning tickets, that when trained in isolation reach test accuracy comparable to the original network in a similar number of iterations.
The winning tickets were consistently less than ten to twenty percent of the original network's size. Down to that size, they learned faster and reached higher accuracy than the full network. You could prune over ninety percent of the parameters and lose nothing.
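Here is a miniature of the prune-and-rewind recipe, using a single-layer model so it fits in a few lines. This is not the paper's experiment, which used deep networks on image benchmarks; it is a sketch of the mechanics under invented data: train dense, keep the ten percent of weights with the largest magnitudes, rewind the survivors to their original random initial values, and retrain with everything else frozen at zero.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy classification data: only 10 of 200 input features matter.
n, d = 500, 200
X = rng.normal(size=(n, d))
true_w = np.zeros(d)
true_w[:10] = 3.0 * rng.normal(size=10)
y = (X @ true_w > 0).astype(float)

def train(w_init, mask, steps=300, lr=0.1):
    # Logistic regression by gradient descent; pruned weights are
    # frozen at zero by the mask.
    w = w_init * mask
    for _ in range(steps):
        p = 1 / (1 + np.exp(-(X @ w)))
        w -= lr * (X.T @ (p - y) / n) * mask
    return w

accuracy = lambda w: np.mean(((X @ w) > 0) == (y > 0.5))

w_init = 0.1 * rng.normal(size=d)          # remember the initialization
dense = train(w_init, np.ones(d))
print("dense network:", accuracy(dense))

# Prune 90 percent: keep the largest-magnitude trained weights,
# rewind the survivors to their ORIGINAL initial values, retrain.
mask = (np.abs(dense) >= np.quantile(np.abs(dense), 0.9)).astype(float)
ticket = train(w_init, mask)
print("winning ticket, 10% of weights:", accuracy(ticket))
```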
The implications are unsettling. If most of a network's parameters are unnecessary for the final solution, why are they there? The answer Frankle proposed is that they increase the odds of containing a good subnetwork. A bigger network has more lottery tickets. More chances that some subset of randomly initialized weights will land in a configuration that gradient descent can work with. You are not buying a better network. You are buying more chances at a good one.
This connects to the scaling story we cover in episode seven. Part of why larger models work better might not be that they have more capacity for patterns. It might be that they have more chances to contain a winning ticket. The relationship between scale and capability might be more about probability than about raw capacity. This is an active area of research with passionate disagreement on all sides.
The training algorithm needs data. An enormous amount of data. And where that data comes from is a story most people in AI would prefer to tell quickly.
The backbone of most large language model training is Common Crawl, a nonprofit that has been systematically downloading the internet since two thousand eight. Their crawler visits billions of web pages every month. A single monthly crawl, as of twenty twenty-six, captures roughly two billion web pages, about three hundred and forty-five terabytes of uncompressed text. They have been doing this for eighteen years. The total archive is measured in petabytes.
Common Crawl is free, open, and crude. The raw data is full of duplicates, spam, navigation menus, cookie consent banners, and text in hundreds of languages jumbled together. It is not something you feed directly to a model. It is the ore that needs refining.
The refining pipeline typically works like this. First, extract the plain text from the raw web pages. Then remove exact duplicates and near-duplicates using techniques like MinHash, which detects documents that are similar but not identical. Then filter for quality. This is where it gets interesting and uncomfortable. Quality filtering typically means training a separate model to distinguish "good" text, text that resembles Wikipedia articles and academic papers, from "bad" text, text that resembles informal social media posts, hate speech, or low-effort content. The model learns to prefer a certain kind of English, a certain register, a certain set of topics. The choices embedded in that filter shape what the final AI model considers normal language.
After filtering, only one to eleven pages out of every hundred survive. The vast majority of the internet is discarded as not good enough.
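MinHash itself is simple enough to sketch. The version below is a miniature with invented documents and a deliberately small signature; production pipelines use tuned libraries and locality-sensitive hashing on top, but the core trick, comparing documents by the minimum values of many salted hashes, is the same.

```python
import hashlib

def shingles(text, k=3):
    # Break a document into overlapping k-word chunks.
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}

def minhash_signature(doc_shingles, num_hashes=64):
    # For each of several salted hash functions, keep the minimum
    # hash value across all shingles. Similar documents tend to
    # share minimum values.
    return [
        min(int(hashlib.md5(f"{seed}:{s}".encode()).hexdigest(), 16)
            for s in doc_shingles)
        for seed in range(num_hashes)
    ]

def estimated_similarity(sig_a, sig_b):
    # The fraction of matching signature slots approximates the
    # Jaccard similarity of the two shingle sets.
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

a = "the quick brown fox jumps over the lazy dog near the river bank"
b = "the quick brown fox jumps over the lazy cat near the river bank"
sig_a = minhash_signature(shingles(a))
sig_b = minhash_signature(shingles(b))
print(estimated_similarity(sig_a, sig_b))  # around 0.6: near-duplicates
```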
Beyond Common Crawl, training datasets pull from curated sources. EleutherAI's dataset called The Pile, published in twenty twenty, combined twenty-two different sources into eight hundred eighty-six gigabytes of text. Wikipedia, PubMed medical papers, ArXiv preprints, GitHub code, Stack Exchange discussions, Project Gutenberg books, FreeLaw court decisions, United States patent filings, YouTube transcripts, and Ubuntu IRC chat logs, among others. Each source adds a different flavor of language to the mix.
One of the original components of The Pile was a collection called Books3, roughly a hundred gigabytes of copyrighted books scraped from a private file-sharing tracker called Bibliotik. This led to a class action lawsuit from authors whose work was included without permission. Books3 was removed from the dataset before twenty twenty-four, but the models trained on it still carry the patterns.
The human labor in this pipeline is easy to overlook. Fei-Fei Li's ImageNet, the dataset that enabled the AlexNet moment we covered in the main episode, took two and a half years to label. Forty-nine thousand workers from one hundred and sixty-seven countries, recruited through Amazon's Mechanical Turk platform, filtering and labeling over one hundred and sixty million candidate images. Each image was labeled three times for quality assurance. Workers were paid per task completed, sometimes just a few cents. Li was operating, in her own words, on the knife edge of the project's finances.
The ImageNet story deserves more room than the main episode could give it, because it is not just a dataset. It is a thesis about what matters.
Fei-Fei Li was born in Beijing in nineteen seventy-six and grew up in Chengdu, Sichuan. Her father was a physicist. Her mother was an engineer. When Li was twelve, her father immigrated to Parsippany, New Jersey. She and her mother followed when she was sixteen. Her parents were educated but did not speak English. Her father did camera repair. Her mother worked as a cashier. Li worked weekends at the family's dry cleaning business through high school.
Despite the near squalor of my mother's life in America, and the menial work that seemed to have claimed her every waking moment since our arrival, she remained steadfast that my passion for science was not to be ignored.
Li graduated number six in her class, earned a nearly full scholarship to Princeton, and while there borrowed money from friends and even her high school math teacher to buy a dry cleaning business for her parents. She finished a bachelor's degree in physics, then a PhD in electrical engineering from Caltech.
In two thousand six, as a newly hired professor at the University of Illinois, Li watched her colleagues focus exclusively on building better algorithms. Everyone was trying to be cleverer about how they processed small, carefully curated datasets. Li asked a different question. What if the bottleneck is not the algorithm? What if it is the data?
The paradigm shift of the ImageNet thinking is that while a lot of people are paying attention to models, let's pay attention to data. Data will redefine how we think about models.
People doubted her. If existing algorithms could not handle one type of object well, why would throwing thousands of categories at them help? A mentor told her she had taken the idea too far, that the trick was to grow with her field, not leap so far ahead of it.
The critical discovery came through a hallway conversation with a graduate student who showed her Amazon's Mechanical Turk, a platform where thousands of people around the world could be paid to do small tasks. Li realized she had found the workforce to label images at the scale she needed. She later said that the moment she saw the website, she knew ImageNet was going to happen. Without it, the project would have taken an estimated nineteen years of continuous human labor.
The ImageNet competition launched in twenty ten and was initially relegated to a poster session at a conference. Modest fanfare. Then in twenty twelve, AlexNet entered. Li herself later described neural networks at the time as seeming like a dusty artifact encased in glass and protected by velvet ropes. That dusty artifact, running on gaming hardware, trained on her dataset, obliterated the competition.
By twenty seventeen, twenty-nine out of thirty-eight competing teams achieved over ninety-five percent accuracy. The problem Li had created was effectively solved. Not because the algorithms got that much better, though they did. Because she was right. The data was the bottleneck.
The economics of training have followed an exponential curve that makes even AI researchers nervous.
The original transformer paper in twenty seventeen cost roughly nine hundred and thirty dollars to train. BERT, in twenty eighteen, cost between seven thousand and twelve thousand dollars. GPT-3, in twenty twenty, cost somewhere between five hundred thousand and four point six million dollars, depending on which estimate you trust and what you include. GPT-4, in twenty twenty-three, cost over one hundred million dollars. Sam Altman confirmed this publicly. Google's Gemini Ultra, also twenty twenty-three, was estimated at one hundred and ninety-one million. Meta's Llama three, the four hundred five billion parameter version from twenty twenty-four, cost roughly one hundred and seventy million.
Training costs have been growing at roughly two point four times per year since twenty sixteen. If that trend continues, the largest training runs will cost more than a billion dollars by twenty twenty-seven. Dario Amodei, the CEO of Anthropic, has said that frontier developers will likely spend close to a billion dollars on a single training run this year, and up to ten billion within two years.
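The arithmetic behind that projection is short enough to run. Compounding at two point four times per year from GPT-4's reported one hundred million dollars in twenty twenty-three:

```python
# Compounding the reported growth rate from GPT-4's roughly
# $100 million training run in 2023:
cost = 100e6
for year in range(2024, 2028):
    cost *= 2.4
    print(year, f"${cost / 1e9:.1f}B")
# 2024 $0.2B, 2025 $0.6B, 2026 $1.4B, 2027 $3.3B
```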
Where does the money go? The breakdown is not what most people expect. AI accelerator chips, GPUs and TPUs and custom silicon, account for forty-seven to sixty-four percent of the cost. Research and development staff take twenty-nine to forty-nine percent. Server components, the memory and networking and storage that surrounds the chips, add fifteen to twenty-two percent. And energy, the part everyone talks about, is only two to six percent.
The energy fraction will grow as training scales. But for now, the bottleneck is silicon, not watts. There are not enough high-end GPUs in the world to satisfy demand. Countries are treating chip supply chains as matters of national security. The training loop from the main episode, those four simple steps, is being run at a scale that shapes geopolitics.
There is a counterpoint worth noting. In twenty twenty-four, the Chinese company DeepSeek reported training its V3 model for just five point six million dollars in compute. Dramatically cheaper than comparable Western models. Either their efficiency gains are real, which would mean the cost curve is not as inevitable as it looks, or the headline number is missing substantial costs. Probably some of both. The honest answer is that nobody outside DeepSeek knows the full accounting.
There is one more pattern that defines how modern AI is actually built, and it cuts against the simple picture of "train the model and you are done."
Modern language models are trained at least twice. The first stage, pretraining, is the massive process we have been discussing. Billions of text examples, trillions of predictions, months of compute. The result is a base model. It is powerful but chaotic. Ask it a question and it might answer, or it might continue your question as if it were writing a Wikipedia article, or it might generate something incoherent. The base model is not trying to be helpful. It is trying to predict the next token.
The second stage is fine-tuning. Take the pretrained model and train it further on a much smaller, carefully curated dataset of desirable behavior. Questions paired with good answers. Instructions paired with helpful responses. The knobs get adjusted again, but gently, nudged toward a specific way of interacting rather than the general pattern completion of pretraining. This is why ChatGPT feels like an assistant even though the base model underneath it has no concept of assistance.
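Here is the shape of that pipeline reduced to a toy: a single-table character model "pretrained" on a large generic corpus with an aggressive learning rate, then nudged on a small curated set with a gentle one. Everything here is a stand-in invented for illustration; no lab's actual recipe looks like this, but the two-stage structure is the point.

```python
import numpy as np

rng = np.random.default_rng(0)

pretrain_text = "the cat ate the hot oat " * 200   # big generic corpus
finetune_text = "hello! hello! "                   # small curated behavior
vocab = sorted(set(pretrain_text + finetune_text))
idx = {c: i for i, c in enumerate(vocab)}

# The "model": a table of logits, one row per current character,
# predicting the next character.
logits = rng.normal(size=(len(vocab), len(vocab))) * 0.01

def train(logits, text, lr, passes):
    for _ in range(passes):
        for a, b in zip(text, text[1:]):
            p = np.exp(logits[idx[a]])
            p /= p.sum()                 # predicted distribution
            p[idx[b]] -= 1.0             # gradient of cross-entropy loss
            logits[idx[a]] -= lr * p     # adjust the knobs
    return logits

# Stage one, pretraining: lots of data, aggressive learning rate.
logits = train(logits, pretrain_text, lr=0.5, passes=2)

# Stage two, fine-tuning: tiny dataset, gentle learning rate,
# nudging behavior without erasing what pretraining learned.
logits = train(logits, finetune_text, lr=0.05, passes=2)
```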
Episode eight goes deep into the most important form of fine-tuning, reinforcement learning from human feedback. But the key point for this episode is that "training" is not one thing. It is a pipeline. Pretraining gives the model general capability. Fine-tuning gives it direction. And different stages of training pull against each other in ways that are not fully understood. A fine-tuned model can lose capabilities the base model had. A model trained too aggressively on safety data can become unhelpfully cautious. The knobs do not have labeled dials. You turn them and see what happens.
Andrej Karpathy, who led the AI team at Tesla and worked at OpenAI both before and after, has a blog post that captures this reality better than any paper.
Neural nets are nothing like off-the-shelf technology the second you deviate slightly. Everything could be correct syntactically, but the whole thing is not arranged properly, and it is really hard to tell. Networks silently work a bit worse when misconfigured. A fast and furious approach to training neural networks does not work and only leads to suffering.
The qualities that in his experience correlate most strongly to success in deep learning, Karpathy wrote, are patience and attention to detail.
Patience and attention to detail. Not the qualities most people associate with a field that moves at the speed of AI. But the training loop does not care about hype cycles or investor expectations. It cares about the loss going down. And making that happen reliably, at scale, without the whole thing silently degrading in ways you will not notice for weeks, is an engineering challenge that makes most software problems look simple.
We have described training as pattern matching. Statistical regularities. Wrongness minimization. But there is a question that refuses to stay polite. Is that all it is?
A twenty twenty-two survey of the natural language processing research community found a nearly perfect split. Fifty-one percent believed large language models could, to some meaningful degree, understand language. Forty-nine percent disagreed. These are the people who build these systems and study them professionally, and they cannot agree on whether the systems understand anything.
On one side, Ilya Sutskever, who helped build AlexNet and co-founded OpenAI, has argued that predicting the next token in a vast dataset forces the model to learn a compressed, high-fidelity model of the world. To predict what someone will say next, you need something that functions like understanding of what they mean. Sutskever has gone further.
It may be that today's large neural networks are slightly conscious.
On the other side, the linguist Emily Bender and the computer scientist Timnit Gebru coined the term "stochastic parrots" in a twenty twenty-one paper that argued these models are stitching together sequences of linguistic forms observed in training data according to probabilistic rules, but without any reference to meaning.
The understanding is all on our end. We are imagining a mind behind the text. A very key thing to keep in mind is that the output of these systems does not actually make sense. It is that we are making sense of the output.
That paper led to Gebru's firing from Google after the company asked her to retract it, a sequence of events that became its own controversy about corporate censorship and race in technology. The term "stochastic parrot" was named the twenty twenty-three AI-related word of the year.
The honest answer, the one this series is committed to giving, is that we do not know. The training loop produces something. That something can write coherent essays, solve novel math problems, generate code that works, and produce analogies that genuinely illuminate. Whether it does these things through understanding or through an extraordinarily sophisticated form of pattern matching depends on what you mean by understanding. And that is a question about philosophy, not engineering. The training loop does not have an opinion.
What we can say is this. The training process optimizes for a specific objective, predicting the next token, and the result exhibits behaviors that look like understanding without anyone designing for understanding. Whether the appearance is the reality or just a very convincing surface is a question the field will be arguing about for decades. This series will keep returning to it.
That was the deep dive for episode three.
This episode's term: epoch.
If a friend asked you, you would say: one complete pass through all the training data. The model sees every single example once. That is one epoch. Most models train for many epochs, seeing the same data over and over.
Marketing uses it to sound scientific, as in "trained for hundreds of epochs," implying thoroughness and rigor without anyone asking what that actually means.
What it means in practice is more interesting. Each pass through the data extracts slightly different patterns, the way rereading a book picks up details you missed the first time. But there is a ceiling. Too many epochs and the model starts memorizing the specific examples instead of learning the underlying patterns. That is overfitting, and it is the most common way training goes wrong. The right number of epochs is not "as many as possible." It is "enough to learn the patterns and not so many that you memorize the noise." Finding that number is more art than science, and getting it wrong is expensive when a single epoch costs millions of dollars in compute.
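In practice, the standard way to find that number is not to guess it but to watch a held-out validation set and stop when it stops improving. A minimal sketch, with the data, model, and patience window all invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: a simple pattern plus noise, split into train and validation.
x = rng.uniform(0, 1, 60)
y = np.sin(2 * np.pi * x) + 0.3 * rng.normal(size=60)
x_train, y_train = x[:40], y[:40]
x_val, y_val = x[40:], y[40:]

# An overparameterized polynomial model, fit one epoch at a time.
degree = 12
phi = lambda xs: np.vander(xs, degree + 1)
w = np.zeros(degree + 1)

best_val, best_epoch, patience = np.inf, 0, 0
for epoch in range(5000):
    # One epoch: one full pass through the training data.
    err = phi(x_train) @ w - y_train
    w -= 0.1 * phi(x_train).T @ err / len(x_train)
    val_loss = np.mean((phi(x_val) @ w - y_val) ** 2)
    if val_loss < best_val - 1e-6:
        best_val, best_epoch, patience = val_loss, epoch, 0
    else:
        patience += 1
        if patience >= 200:   # validation stopped improving: quit early
            break

print("best epoch:", best_epoch, "validation loss:", round(best_val, 4))
```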