This is episode three of Actually, AI.
Your chatbot learned nothing about you. It learned from Shakespeare, Wikipedia, Reddit, medical textbooks, forum arguments, love letters, code repositories, and billions of web pages, all before a cutoff date. Everything it knows was baked in and frozen. When you type a question, no knobs turn. No learning happens. You are talking to a snapshot.
This matters because it explains something you have probably noticed. Ask your chatbot about a company that rebranded last month and it will confidently describe the old name. Ask about a software library and it might document an API that changed six months ago. Ask about a world event from last week and it will generate something fluent, detailed, and wrong. The model is not broken. It is not lying. It is doing exactly what it was trained to do, predicting plausible next words, using patterns from text that predates whatever you are asking about.
The question, then, is how did all that knowledge get in there? How do you take a machine made of billions of numerical knobs, the kind we built in episode two, and turn it into something that can write a sonnet, debug Python code, explain the Treaty of Westphalia, and get confidently confused about last Tuesday? The answer is training. And the way it works is stranger than most people imagine, because it starts with a confession. The machine does not know. It guesses. And the entire engine of modern AI is built on measuring exactly how wrong those guesses are.
Strip away all the history and the drama, and the core of training is remarkably simple. Four steps, repeated billions of times.
Step one. Show the network an example. A sentence with the next word hidden. The input flows forward through the layers, each knob multiplying and transforming the signal, until the network produces a prediction at the output.
Step two. Compare that prediction to reality. The network said there was a seventy percent chance the next word was "dog" and a twenty percent chance it was "cat." The actual word was "cat." The prediction was wrong.
Step three. Measure exactly how wrong. This measurement is called the loss. It is a single number that captures the distance between what the network predicted and what the answer actually was. Higher loss means more wrong. The entire goal of training is to make this number smaller.
Step four. Adjust the knobs. A technique called backpropagation traces the error backward through every layer, calculating how much each individual knob contributed to the wrongness. Then every knob gets nudged, a tiny amount, in the direction that would have made the loss a little lower. This is gradient descent. The gradient is the slope; it points the way steepest uphill, so descent means stepping the opposite way, downhill. You cannot see the whole landscape. You can only feel the slope under your feet.
Then you go back to step one with a new example. The adjustments are tiny. One pass through one example barely changes anything. But do this billions of times, across millions of examples, and the knobs settle into configurations that produce useful predictions on data the network has never seen before.
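If it helps to see the loop written down, here is a minimal sketch of those four steps in code. It assumes PyTorch, and everything in it, the six-word vocabulary, the tiny model, the single training example, is invented for illustration. A real model has billions of knobs and trillions of examples, but the loop has the same shape.

```python
import torch
import torch.nn as nn

vocab = ["the", "cat", "sat", "on", "dog", "mat"]
word_to_id = {w: i for i, w in enumerate(vocab)}

# A deliberately tiny "network": an embedding plus one linear layer that
# scores every word in the vocabulary as a candidate for the next word.
model = nn.Sequential(
    nn.Embedding(len(vocab), 16),
    nn.Flatten(),
    nn.Linear(16, len(vocab)),
)
loss_fn = nn.CrossEntropyLoss()                           # step three: how wrong was the guess?
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)   # step four: the knob-nudger

example = torch.tensor([[word_to_id["the"]]])   # step one: an example with the next word hidden
target = torch.tensor([word_to_id["cat"]])      # the word that actually came next

for _ in range(100):
    prediction = model(example)          # step one: the signal flows forward through the layers
    loss = loss_fn(prediction, target)   # steps two and three: compare and measure the wrongness
    optimizer.zero_grad()
    loss.backward()                      # step four: backpropagation traces the error backward
    optimizer.step()                     # step four: gradient descent nudges every knob downhill
```

Each pass through the loop changes the knobs only slightly, but run it enough times and the model's guess for the word after "the" drifts toward "cat".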
That last part is the miracle and the mystery. The network was only ever trained to reduce its wrongness on examples it saw. Nobody told it to generalize. It just does. And nobody fully understands why.
The technique at the heart of step four, backpropagation, has a story worth telling. Because it almost never happened.
Geoffrey Hinton started graduate school in nineteen seventy two with one obsession. He wanted to figure out how a network of simple units could learn by adjusting the strength of their connections. His adviser kept pressuring him to stop. Neural networks were a dead end, everyone said. The perceptron had been mathematically buried three years earlier by Minsky and Papert. Serious researchers had moved on.
Hinton did not move on. He spent the next fourteen years cycling through four different institutions, because no one wanted to fund what he was doing. His reasoning was plain. The brain uses neurons. The brain learns. Therefore a network of artificial neurons must be capable of learning, if only someone could figure out the right training procedure. He once described his position as not really faith, just something completely obvious to him.
In nineteen eighty six, Hinton and two colleagues, David Rumelhart and Ronald Williams, published a paper in Nature that cracked the problem open. The technique was backpropagation. When the network makes a wrong prediction, trace the error backward through every layer and figure out how much each knob contributed to the mistake. Then nudge each knob in the direction that would have reduced that contribution. The paper showed that hidden layers, the ones between input and output that nobody could directly supervise, would spontaneously develop meaningful internal representations. The network would discover its own features.
As Hinton later put it: "The first thing that really excited me about backpropagation was its ability to do unsupervised learning. The second was its ability to train recurrent neural nets."
There is a wrinkle in that history. Hinton, Rumelhart, and Williams did not invent backpropagation. A Finnish student named Seppo Linnainmaa had described the core mathematical technique in his master's thesis in nineteen seventy, sixteen years earlier. He wrote the algorithm out in FORTRAN, on pages fifty eight through sixty of a thesis written in Finnish that almost nobody outside Finland ever read. Paul Werbos independently developed the idea for neural networks in his nineteen seventy four PhD thesis but could not get it published for years. The nineteen eighty six paper was the one that made the world pay attention.
By the early two thousands, neural networks were back in the wilderness. Support vector machines were elegant, mathematically respectable, and worked well on small datasets. Papers from Hinton and his allies, Yann LeCun and Yoshua Bengio, were being routinely rejected simply because the subject was neural networks. The three of them kept working anyway, calling themselves, half jokingly, the deep learning conspiracy.
The moment everything changed has a precise date. October two thousand twelve. Florence, Italy. A competition called the ImageNet Large Scale Visual Recognition Challenge had been running since twenty ten. Teams built systems to classify photographs into a thousand categories, and every year the best error rates hovered around twenty five to twenty six percent. In twenty twelve, a team of three from the University of Toronto submitted an entry that scored fifteen point three percent. They had not just won. They had beaten the next best entry by almost eleven percentage points.
The system was called AlexNet, named after its first author, Alex Krizhevsky. Krizhevsky was a PhD student working under Hinton. The third member was Ilya Sutskever, who would later co-found OpenAI. The network had sixty million parameters, five convolutional layers, and it had been trained on two NVIDIA GTX five eighty graphics cards, consumer gaming hardware, in Krizhevsky's bedroom at his parents' house. The training took about six days.
Hinton later summed up the division of labor: "Ilya thought we should do it, Alex made it work, and I got the Nobel Prize."
The key insight was not the architecture. It was the training. Krizhevsky took a deep neural network and trained it with backpropagation on a large dataset using graphics cards originally designed for rendering video game explosions. The GPUs made the training fast enough to be practical. The dataset, ImageNet, made it rich enough to generalize. And the training algorithm was the same one Hinton had been preaching since nineteen eighty six.
That was twenty twelve. Image classification. A long way from the chatbot you used this morning. So how did we get from a network that recognizes cat photographs to one that writes emails and explains quantum mechanics?
The bridge is a single idea. Instead of training a network to classify images, train it to predict the next word in a sentence. Show it the phrase "the cat sat on the" and ask it what comes next. It will be wrong. Measure how wrong. Nudge the knobs. Repeat. Billions of times. Across trillions of words scraped from the internet, from books, from every digitized text humans have ever produced.
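To make that concrete, here is a small sketch of how one sentence turns into a stack of next-word training examples. Splitting on spaces stands in for real tokenization, which actually works on subword pieces, but the idea is the same.

```python
# How one sentence becomes many next-word training examples.
# Splitting on spaces is a stand-in for real tokenization,
# which works on subword pieces rather than whole words.
sentence = "the cat sat on the mat"
tokens = sentence.split()

examples = []
for i in range(1, len(tokens)):
    context = tokens[:i]      # everything the model is allowed to see
    next_word = tokens[i]     # the word it has to guess
    examples.append((context, next_word))

for context, next_word in examples:
    print(" ".join(context), "->", next_word)
# the -> cat
# the cat -> sat
# ...
# the cat sat on the -> mat
```

One sentence, five examples. Scale that up to trillions of words and you have the training set.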
The training data for a modern language model is almost incomprehensibly large. The backbone is usually Common Crawl, a nonprofit that has been systematically downloading the internet since two thousand eight. A single monthly crawl captures roughly two billion web pages. That raw data gets filtered, deduplicated, and refined. Only one to eleven pages out of every hundred survive the quality filters. On top of that, curated sources get mixed in. Wikipedia, academic papers, GitHub code, Stack Exchange discussions, Project Gutenberg books, patent filings, court decisions. Each source adds a different flavor of language to the mix.
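The filtering itself is mundane engineering. Here is a toy sketch of its shape, with invented thresholds; real pipelines use far more elaborate quality scoring, fuzzy deduplication, and language detection.

```python
import hashlib

def looks_like_quality_text(page: str) -> bool:
    words = page.split()
    if len(words) < 50:                        # invented heuristic: too short to be worth keeping
        return False
    if page.count("{") > len(words) / 10:      # invented heuristic: probably markup or code debris
        return False
    return True

def clean(pages):
    seen = set()
    for page in pages:
        fingerprint = hashlib.sha256(page.encode("utf-8")).hexdigest()
        if fingerprint in seen:                # drop exact duplicates
            continue
        seen.add(fingerprint)
        if looks_like_quality_text(page):
            yield page                         # only a small fraction survives
```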
The result of this training, after months of compute and hundreds of millions of dollars, is a base model. It is powerful but chaotic. Ask it a question and it might answer, or it might continue your question as if writing a Wikipedia article, or it might generate something bizarre. The base model is not trying to be helpful. It is trying to predict the next token.
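One way to feel why the base model behaves this way: generation is nothing more than next-word prediction run in a loop. The sketch below uses a crude word-pair counter as a stand-in for the model, which is nothing like a real transformer, but the loop around it is the same, and it shows why the output is a continuation rather than an answer.

```python
from collections import Counter, defaultdict

# A made-up stand-in for a trained model: count which word tends to
# follow which in a tiny corpus, then always predict the most common one.
corpus = "the cat sat on the mat . the dog sat on the rug .".split()
following = defaultdict(Counter)
for current_word, next_word in zip(corpus, corpus[1:]):
    following[current_word][next_word] += 1

def predict_next_word(text):
    last_word = text.split()[-1]
    candidates = following.get(last_word)
    return candidates.most_common(1)[0][0] if candidates else "the"

text = "the cat sat"
for _ in range(5):
    text = text + " " + predict_next_word(text)
print(text)  # the model never "answers"; it only keeps extending the text
```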
To turn that chaotic base into something useful, there is a second stage. Fine-tuning. Take the pretrained model and train it further on a much smaller, carefully curated dataset of desirable behavior. Questions paired with good answers. Instructions paired with helpful responses. The knobs get adjusted again, but gently, nudged toward a specific way of interacting. This is why ChatGPT feels like an assistant even though the base model underneath it has no concept of assistance. Episode eight goes much deeper into this second stage.
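In code terms, fine-tuning is the same four-step loop, just pointed at curated question-and-answer pairs and run with a much gentler learning rate. The pairs and the numbers below are invented for illustration, not any particular lab's recipe.

```python
# Fine-tuning reuses the same four-step loop, with different data and a
# smaller learning rate. Everything here is illustrative.
instruction_pairs = [
    ("What is the capital of France?", "The capital of France is Paris."),
    ("Summarize this email in one sentence.", "The sender wants to move the meeting to Friday."),
]

# Each pair becomes one training document: the question followed by the
# kind of answer we want the model to learn to produce.
training_texts = [question + "\n" + answer for question, answer in instruction_pairs]

pretraining_learning_rate = 3e-4    # invented number, just to show the contrast
fine_tuning_learning_rate = 1e-5    # much smaller: nudge gently, don't overwrite what was learned
```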
When you type a question into a chatbot, you are sending your words through a machine whose knobs were set by this process. Billions of text sequences. Trillions of predictions. A loss function that rewarded the model for getting closer to whatever word actually came next in the training data. The model did not learn facts. It did not develop beliefs. It found patterns. Statistical regularities in human language so deep and so numerous that the output looks like understanding.
This is why AI can write fluently about topics it has never specifically been trained on. The patterns generalize. If you have seen enough sentences about physics and enough sentences about cooking, you can generate plausible text about the physics of cooking, even if that exact combination never appeared in the training data.
But this is also why AI confidently says wrong things. We will get into that deeply in episode five. The training process does not optimize for truth. It optimizes for predicting what comes next. A pattern that appeared frequently in the training data gets a strong signal, whether or not that pattern reflects reality. The model has no mechanism for checking its own claims against the world. It has knobs. The knobs were set by wrongness minimization. What came out the other end is something that mimics understanding well enough to be useful and poorly enough to be dangerous.
Training adjusts the knobs from episode two. The specific architecture being trained in the systems you use today is the transformer, which is episode four. RLHF, a different kind of training applied after the initial phase, is episode eight. And the question of what happens when you make this training process bigger, when you add more data, more knobs, more compute, is the scaling story in episode seven.
That was episode three. The deep dive goes further into the loss landscape, the lottery ticket hypothesis, the human labor behind training data, and the staggering economics of making these models. Find it right after this in your feed.