Actually, AI
Neural Networks: The Full Story
29m · Apr 03, 2026
Walter Pitts hid from bullies in a Detroit library in 1935, found Principia Mathematica, read all three volumes in three days, and spotted errors that impressed Bertrand Russell—launching the unlikely path to modern neural networks.

This is the deep dive companion to episode two of Actually, AI, on neural networks.

The main episode told you that neural networks are not brains. That they are machines made of simple knobs, and that nobody fully understands why they work. This deep dive goes to the sources. The people who invented this idea, the feuds that nearly destroyed it, the decades of exile, and the unlikely resurrection that changed the world.

The Teenager and the Neuroscientist

The story begins in nineteen thirty five, in a public library in Detroit. A twelve year old boy named Walter Pitts, chased by neighborhood bullies, ducked inside to hide. While waiting for them to leave, he found a copy of Principia Mathematica, the monumental work by Bertrand Russell and Alfred North Whitehead that attempted to ground all of mathematics in formal logic. Nearly two thousand pages across three volumes.

Pitts stayed in that library for three days. He read all three volumes cover to cover, and in the process identified several mistakes. He wrote to Russell in England, pointing out the errors. Russell wrote back, impressed, and invited the twelve year old to come study at Cambridge as a graduate student. Pitts did not go. He was a child from a working class family in Prohibition era Detroit. His father was a boilermaker who wanted him to leave school and work.

Three years later, at fifteen, Pitts learned that Russell would be visiting the University of Chicago. He ran away from home to meet him. He never returned to his family. From that point on, he refused to speak about them, sending only an anonymous Christmas present home each year for the rest of his life.

At Chicago, Pitts worked menial jobs while attending lectures he was not enrolled in. He met a medical student named Jerome Lettvin, who recognized something extraordinary in this homeless teenager and introduced him to Warren McCulloch, a forty two year old neuroscientist with a spectacular beard who wrote sonnets, designed buildings, and was obsessed with a question that had consumed him for years. How could the electrical activity of neurons, which seemed to operate as simple on off switches, give rise to the complexity of thought?

McCulloch had been struggling with a specific mathematical problem. Neurons form loops, feeding signals back to each other, and these loops seemed paradoxical when you tried to model them with formal logic. Pitts, the teenager with no academic credentials, solved it using modular arithmetic. McCulloch called it an idea wrenched out of time.

McCulloch invited the homeless eighteen year old to live with his family in Hinsdale, Illinois. The household was intellectually vibrant, filled with evening discussions about poetry, psychology, and politics. Late at night, McCulloch and Pitts would collaborate on what became one of the most influential papers in the history of computing.

A Logical Calculus

Their nineteen forty three paper, "A Logical Calculus of the Ideas Immanent in Nervous Activity," proposed something that sounds obvious now but was revolutionary then. Because neurons fire in an all or nothing manner, either they send a signal or they do not, neural events can be treated using propositional logic. McCulloch and Pitts built a mathematical model where each artificial neuron takes weighted inputs, sums them, and fires a one if the sum exceeds a threshold, or stays silent at zero. Inputs come in two types: excitatory ones that push toward firing and inhibitory ones that block it entirely.
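
To make the model concrete, here is a minimal sketch of a McCulloch and Pitts style threshold unit in Python. The particular weights, thresholds, and the AND and OR wirings are illustrative choices for this sketch, not anything taken from the paper's own notation.

```python
def mcp_neuron(inputs, weights, threshold, inhibitors=()):
    """McCulloch-Pitts style unit. Any active inhibitory input blocks
    firing entirely; otherwise the unit fires (1) when the weighted sum
    of its inputs reaches the threshold, and stays silent (0) below it."""
    if any(inhibitors):
        return 0
    total = sum(x * w for x, w in zip(inputs, weights))
    return 1 if total >= threshold else 0

# Illustrative wirings: AND fires only when both inputs fire,
# OR fires when at least one does.
def AND(a, b): return mcp_neuron([a, b], [1, 1], threshold=2)
def OR(a, b):  return mcp_neuron([a, b], [1, 1], threshold=1)

for a in (0, 1):
    for b in (0, 1):
        print(a, b, "AND:", AND(a, b), "OR:", OR(a, b))
```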

The key result: these simple binary units, connected together, could compute any logical function, and networks of them, given memory to draw on, matched the power of a Turing machine. Given the right arrangement, they could perform any calculation a digital computer could perform. McCulloch declared that for the first time in the history of science, we know how we know.

For the first time in the history of science, we know how we know.

That was an overstatement. They had built a mathematical model of how simple elements could compute, not an explanation of human cognition. But the paper landed with enormous force. John von Neumann, designing what would become the first stored program binary computer, explicitly cited McCulloch and Pitts in his famous First Draft of a Report on the EDVAC. Von Neumann wrote that these simplified neuron functions could be imitated by telegraph relays or by vacuum tubes. The paper that started as a model of the brain became a blueprint for the digital computer.

Walter Pitts was eighteen years old and had no formal academic credentials when this seminal paper was published.

A Life Wrenched Out of Time

Pitts went to MIT, where Norbert Wiener, the father of cybernetics, recognized his brilliance immediately. By the second blackboard, according to accounts, Wiener knew Pitts was exceptional. He promised Pitts a PhD in mathematics despite his lack of a high school diploma. Pitts wrote to McCulloch in December of nineteen forty three.

I now understand at once some seven eighths of what Wiener says, which I am told is something of an achievement.

In nineteen fifty two, McCulloch joined Pitts at MIT, heading a brain science project. They set up a laboratory in Building Twenty on Vassar Street and posted a sign on the door that read Experimental Epistemology. With Lettvin and another researcher named Patrick Wall, they seemed poised for revolutionary breakthroughs.

Then everything collapsed. Wiener's wife told her husband that McCulloch's boys had behaved inappropriately with their daughter during a visit to Chicago. Wiener, without investigating or asking for explanations, sent an angry telegram severing all ties. He never spoke to Pitts again and never explained why.

For Pitts, the professional loss was secondary. Wiener had been a father figure, a protector, the man who had believed in a homeless teenager when no one else would. The unexplained abandonment, as one account put it, defied logic itself, and logic was the only framework Pitts had for understanding the world.

Then came the frogs. Lettvin conducted experiments in the basement of Building Twenty on how frogs' eyes process visual information. The results, published in nineteen fifty nine as "What the Frog's Eye Tells the Frog's Brain," showed that the eye does not passively record images. It actively filters and analyzes visual information before sending it to the brain. This contradicted the foundational assumption of Pitts' life work, that the brain operates through pure mathematical logic.

It was apparent to him after we had done the frog's eye that even if logic played a part, it did not play the important or central part that one would have expected. It disappointed him.

Pitts began drinking heavily. He withdrew from colleagues. When offered his PhD, he refused to sign the paperwork. And then he did something that still haunts the history of science. He set fire to his dissertation, along with all of his notes and papers. Years of important work that the scientific community had been eagerly awaiting, reduced to ash.

We would go hunting for him night after night. Watching him destroy himself was a dreadful experience.

He remained nominally employed at MIT but rarely spoke to anyone. On April twenty first, nineteen sixty nine, from a hospital bed at Beth Israel Hospital, he wrote to McCulloch, who was in cardiac care nearby.

No doubt this is cybernetical. But it all makes me most abominably sad.

Walter Pitts died on May fourteenth, nineteen sixty nine, alone in a Cambridge boarding house, from bleeding esophageal varices associated with cirrhosis. He was forty six years old. McCulloch died four months later. As one writer put it, as if the existence of one without the other were simply illogical, a reverberating loop wrenched open.

The Perceptron Wars

The main episode introduced Frank Rosenblatt and Marvin Minsky. The deep dive tells the fuller story.

Their rivalry was not just intellectual. It was institutional and personal. Rosenblatt at Cornell championed connectionism, the idea that intelligence emerges from networks of simple units. Minsky at MIT championed symbolic AI, the idea that intelligence requires structured representations and explicit rules. Both had attended the Bronx High School of Science, Minsky in the class of forty four, Rosenblatt in the class of forty six.

At conferences throughout the nineteen sixties, the two debated publicly while their colleagues and students looked on in what many remember as great spectator sport. Rosenblatt called Minsky the loyal opposition. Behind the scenes, the tone was sharper. One colleague said Rosenblatt irritated a lot of people. Another said Minsky knocked the hell out of our perceptron business. Despite all of this, they reportedly remained friendly.

But Minsky was not just debating. He and Seymour Papert had been working toward dismantling neural networks since around nineteen sixty five, speaking at conferences and circulating preprints. By nineteen sixty nine, when their book Perceptrons was finally published, most researchers had already left connectionism, frustrated by the lack of progress. The book was the final nail, not the sole cause.

What Minsky and Papert actually proved was narrow but devastating. A single layer perceptron cannot compute the XOR function. XOR outputs true only when its two inputs disagree. You cannot draw a single straight line on a flat plane that separates the true cases from the false ones. AND works. OR works. XOR does not. The proof was rigorous and correct.

But then came the conjecture, and this is where the story gets complicated. It was generally understood at the time that multi layer networks could solve XOR. The problem was that nobody knew how to train them effectively. Minsky and Papert conjectured that extensions of the perceptron, for example based on additional layers of units and connections, would be subject to limitations similar to those suffered by one layer perceptrons. They admitted this was an intuitive judgment. That intuitive judgment proved wrong, and it may have been the most damaging sentence in the entire book.
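
To see the gap concretely, here is a small illustrative sketch: one threshold unit cannot reproduce XOR, but two layers of the same units, wired by hand with weights chosen for the example, can. This is only a demonstration of the idea, not a training procedure.

```python
def unit(inputs, weights, threshold):
    # A single threshold unit: fire when the weighted sum reaches the threshold
    return 1 if sum(x * w for x, w in zip(inputs, weights)) >= threshold else 0

def xor_two_layer(a, b):
    # Hidden layer computes OR and AND of the inputs; the output unit
    # fires for OR-but-not-AND, which is exactly XOR. No single unit
    # can draw the one straight line this would require.
    h_or = unit([a, b], [1, 1], threshold=1)
    h_and = unit([a, b], [1, 1], threshold=2)
    return unit([h_or, h_and], [1, -1], threshold=1)

for a in (0, 1):
    for b in (0, 1):
        print(a, b, "->", xor_two_layer(a, b))   # prints the XOR truth table
```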

No machine can learn to recognize something unless it possesses, at least potentially, some scheme for representing it.

ARPA explicitly defunded neural network research. Careers were derailed. The first AI winter had begun. And in nineteen eighty eight, when neural networks were already making a comeback, Minsky and Papert published an expanded edition of Perceptrons. They did not soften their position. In the prologue, they claimed that little of significance had changed since nineteen sixty nine. In the epilogue, they called multi layer neural networks a sterile extension. History was about to prove them spectacularly wrong.

Who Really Invented Backpropagation

If neural networks are machines made of knobs, then someone had to figure out how to turn the knobs in the right direction. This is the problem of backpropagation, the algorithm that makes neural networks trainable, and its history is one of the most contested credit disputes in all of computer science.

The timeline has at least seven plausible "inventors." In nineteen sixty, Henry Kelley published a precursor gradient method from control theory. In nineteen sixty one, A.E. Bryson developed gradient methods for multi stage processes. In nineteen sixty seven, the Japanese researcher Shun-ichi Amari applied stochastic gradient descent to deep multi layer perceptrons.

Then, in nineteen seventy, a Finnish graduate student named Seppo Linnainmaa submitted his master's thesis at the University of Helsinki. The title, translated from Finnish, was about representing cumulative rounding error as a Taylor expansion of local rounding errors. Buried in this deeply technical thesis was an algorithm, implemented in FORTRAN, that described reverse mode automatic differentiation. It is mathematically identical to what we now call backpropagation. Linnainmaa had priority but no paternity. The thesis was in Finnish and was never published in English in the context of neural networks. He remained obscure outside the automatic differentiation community until Juergen Schmidhuber began championing his contribution in the twenty tens.

Four years later, Paul Werbos at Harvard wrote a PhD thesis that explicitly connected the algorithm to neural networks. He called it reverse mode optimization for dynamic systems. Remarkably, Werbos developed his version of backpropagation as a mathematical translation of Freudian psychodynamics. He claimed to have developed the algorithm in nineteen seventy one but was frustrated at every turn whenever he tried publishing it, not managing to get it into wide circulation until around nineteen eighty two.

The irony is hard to miss. The algorithm's discoverer spent more energy fighting publication gatekeepers than advancing the field, while its re-discoverer received lasting recognition through superior presentation and timing.

That re-discoverer was David Rumelhart. In nineteen eighty six, Rumelhart, Geoffrey Hinton, and Ronald Williams published a paper in Nature called "Learning representations by back-propagating errors." This was not the first description of the algorithm. It was not even the second or third. But it was the one that made the field pay attention, because it showed what backpropagation could actually do: the hidden units inside the network learned to represent important features of the task on their own, without being told what to look for.

We not only discovered them, but realized what we could do with them. It is not the first inventor who gets credit, but the last re-inventor.
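
For a flavor of what that nineteen eighty six demonstration looked like, here is a minimal, illustrative backpropagation loop in Python with NumPy: a tiny two layer network learning XOR, its hidden units finding useful internal features on their own. The layer width, learning rate, and random seed are arbitrary choices for the sketch, not anything from the original paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# XOR inputs and targets
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

# Two layers of "knobs": weights and biases, randomly initialized
W1, b1 = rng.normal(size=(2, 4)), np.zeros(4)
W2, b2 = rng.normal(size=(4, 1)), np.zeros(1)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

lr = 2.0
for step in range(10000):
    # Forward pass: inputs -> hidden features -> output
    h = sigmoid(X @ W1 + b1)
    out = sigmoid(h @ W2 + b2)

    # Backward pass: send the error signal back through the layers
    d_out = (out - y) * out * (1 - out)
    d_h = (d_out @ W2.T) * h * (1 - h)

    # Nudge every knob a little against its gradient
    W2 -= lr * h.T @ d_out;  b2 -= lr * d_out.sum(axis=0)
    W1 -= lr * X.T @ d_h;    b1 -= lr * d_h.sum(axis=0)

print(np.round(out, 2))  # should end up close to [0, 1, 1, 0], depending on the seed
```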

Rumelhart's story has its own tragic dimension. He was one of the great minds of connectionism, co-author of the two volume Parallel Distributed Processing that became the bible of neural network cognitive science, cited over thirty thousand times. In the mid nineteen nineties, he was diagnosed with Pick's disease, a rare form of frontotemporal dementia. He stopped teaching in nineteen ninety eight as his cognition progressively declined. The man who had modeled how brains learn was struck by a disease that destroyed his own brain's ability to do so. He died in twenty eleven, in Chelsea, Michigan, at the age of sixty eight.

The Man Who Would Not Sit Down

Geoffrey Hinton believed in neural networks before almost anyone else, and he kept believing in them long after it was professionally sensible to do so.

His family tree reads like a catalog of British intellectual eccentrics. His great great grandfather was George Boole, the inventor of Boolean logic, the mathematical foundation of every digital computer ever built. His middle name, Everest, comes from George Everest, the British surveyor after whom the mountain is named. His first cousin several times removed was Jane Taylor, the poet who wrote Twinkle, Twinkle, Little Star. His first cousin once removed, Joan Hinton, was a nuclear physicist who worked under Enrico Fermi and later moved to China.

After studying experimental psychology at Cambridge, Hinton did something unexpected. He became a carpenter. Not fancy carpentry, as he later put it, but carpentry to make a living, with thoughts of his father hanging over him. His father, a respected entomologist and Royal Society member, used to tell young Geoffrey every morning to get in there pitching, and maybe when you are twice as old as me you will be half as good.

He knew a lot more about beetles than he knew about people.

Hinton eventually returned to academia for a PhD in artificial intelligence at Edinburgh. His supervisor, Christopher Longuet-Higgins, favored symbolic AI and was always trying to persuade Hinton that what he was doing was nonsense. Their interactions were shouting matches. Hinton kept promising that if he could not make neural networks work in six months, he would switch to symbolic AI. That six months never came.

Through the late nineteen seventies and eighties, Hinton moved between institutions, a perception of unfulfilled promises shrinking funding for neural network research wherever he went. Funding was hard to find in Britain, where he did postdoctoral work at the University of Sussex, so he moved to the United States, working at UC San Diego and then Carnegie Mellon. Along the way he worked with Rumelhart's PDP group and collaborated with Sejnowski and others, the small community of researchers who still believed neural networks were the right path.

In nineteen eighty seven, Hinton made another unexpected decision. He moved to Canada because he did not want to take money from the United States military, and most AI funding in America came from the Department of Defense. The Canadian Institute for Advanced Research offered him academic freedom and a decent salary. The University of Toronto became his base for the next thirty six years.

There is a physical detail about Hinton that matters for understanding his persistence. Back injuries that began in his teens became severe by his fifties. He has not sat down since two thousand and five.

I last sat down in two thousand five, and it was a mistake.

He travels by car lying across the back seat. For meals, he kneels on a foam cushion before the table. When asked about it, he says if you let it completely control your life, it does not give you any problems.

When CBS asked him, decades into his career, when he realized he had been right all along about neural networks, his answer was characteristically blunt.

I always thought I was right.

It took much longer than he expected. It took, like, fifty years before it worked well, he said. But in the end it did work well.

Why Depth Matters

Here is the question that haunted neural network research for decades. You can prove, mathematically, that a single layer of knobs can approximate any pattern. We will get to that theorem shortly. So why bother with depth? Why stack dozens of layers when one should suffice?

The answer is that a single layer can theoretically do anything, but in practice it needs an absurd number of knobs. Depth is a shortcut. Each layer builds on what the layer below it found. Early layers detect simple patterns, edges, curves, basic shapes. Middle layers combine those into structures. Later layers compose structures into the abstract representations the task demands. A face is not recognized by comparing pixels. It is recognized through a hierarchy: edges become features, features become eyes and noses, eyes and noses become a face. Each layer handles one level of that hierarchy, and the whole stack handles complexity that would drown a single layer.

The problem, for decades, was that deep networks were almost impossible to train. The algorithm for adjusting the knobs, backpropagation, works by sending error signals backward through the network. In a deep network using the activation functions available in the nineteen eighties, these signals shrank as they traveled backward. By the time they reached the early layers, they had essentially vanished. This was the vanishing gradient problem, and it was the reason most researchers gave up on deep networks.
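
A rough illustration of the arithmetic, under the simplifying assumption that each layer contributes at best the sigmoid's maximum slope of one quarter (real networks also multiply in the weights, which can make things better or worse):

```python
# The sigmoid's derivative s(z) * (1 - s(z)) peaks at 0.25, at z = 0.
# Stacking layers multiplies such factors together, so even in the
# best case the error signal shrinks geometrically with depth.
for depth in (1, 5, 10, 20):
    print(f"{depth:2d} layers -> gradient factor at most {0.25 ** depth:.2e}")
```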

The breakthrough came from an absurdly simple change. In twenty ten, Vinod Nair and Geoffrey Hinton proposed replacing the traditional activation function with something called a rectified linear unit, or ReLU. The old activation function, the sigmoid, squeezed all values into a range between zero and one. ReLU just says: if the input is negative, output zero. If the input is positive, pass it through unchanged. That is the entire function. It is barely even a function. But it solved the vanishing gradient problem because the gradient of ReLU is either zero or one. It does not shrink as it passes through layers.
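
In code, the whole function and its gradient amount to a couple of lines; this is a bare sketch rather than any particular library's implementation.

```python
def relu(x):
    # Pass positive inputs through unchanged, clamp negative ones to zero
    return x if x > 0 else 0.0

def relu_grad(x):
    # The slope is 1 for positive inputs and 0 for negative ones,
    # so it does not shrink as it is multiplied through many layers
    return 1.0 if x > 0 else 0.0
```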

This tiny design choice, replacing a smooth curve with an angular elbow, is considered one of the biggest breakthroughs in deep learning. It made genuinely deep networks, dozens or hundreds of layers, trainable for the first time.

ReLU has its own problem. Neurons can die. If a ReLU neuron's inputs are always negative, its gradient is always zero, and it never updates. It gets stuck at zero permanently, like a knob that rusted into position. In large networks, entire populations of neurons can die, effectively shrinking the model's capacity. Variants like Leaky ReLU, which allow a tiny positive gradient even for negative inputs, address this. But plain ReLU remains the default for most modern networks.
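
For completeness, a sketch of the leaky variant: negative inputs keep a small slope (the zero point zero one below is a common but arbitrary choice), so a neuron stuck in negative territory can still receive updates.

```python
def leaky_relu(x, slope=0.01):
    # Negative inputs are scaled down rather than zeroed out,
    # so the gradient never goes completely flat
    return x if x > 0 else slope * x
```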

The Theorem That Promises Everything and Delivers Nothing

In nineteen eighty nine, a mathematician named George Cybenko proved something remarkable. A neural network with a single hidden layer and a suitable activation function can approximate any continuous function to arbitrary precision, given enough neurons.
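
In symbols, the statement runs roughly as follows, where sigma is the activation function, f is the continuous function being approximated, and the alphas, w's, and b's are the network's parameters; this is a paraphrase of the standard modern form, not Cybenko's original notation.

F(x) = \sum_{i=1}^{N} \alpha_i \, \sigma(w_i^{\top} x + b_i), \qquad |F(x) - f(x)| < \varepsilon \ \text{for every } x \text{ in the domain}.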

This is the Universal Approximation Theorem, and it sounds like it should have ended all debate. Neural networks can compute anything. Why did anyone ever doubt them? The answer is in what the theorem does not say, and what it does not say is far more important than what it does.

The theorem says a network with the right weights exists. It does not say how many neurons you need. For some functions, you might need more neurons than atoms in the observable universe. It does not say anything about how to find the right weights. It is a pure existence proof, like knowing that somewhere in a library of every possible book there is one that contains the answer to your question, but having no way to find which shelf it is on.

This is the most dangerous kind of theorem. It gives you confidence that your approach is fundamentally sound while telling you nothing about whether your specific problem is tractable. Neural network advocates waved the Universal Approximation Theorem like a flag. Their critics correctly pointed out that a theorem about existence says nothing about engineering.

LeCun and the Numbers

While the theorists debated, a French researcher named Yann LeCun was solving a practical problem. LeCun had studied in Paris, done a postdoc with Hinton in Toronto, and in nineteen eighty eight joined AT and T Bell Laboratories in New Jersey. His insight was deceptively simple: not every neuron needs to be connected to every input.

In a standard neural network, every neuron in one layer connects to every neuron in the next. For images, this is wasteful and problematic. An image is two dimensional. The relationship between a pixel and its neighbor is fundamentally different from the relationship between that pixel and one on the opposite corner. LeCun proposed using small filters, grids of weights that slide across the image, detecting the same pattern at every position. This was the convolutional neural network, the CNN, and it was the architecture that would dominate image recognition for the next twenty five years.
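
To make "a grid of weights that slides across the image" concrete, here is a minimal two dimensional convolution in NumPy. The three by three vertical edge filter is an illustrative, hand-picked choice; in a real CNN the filter values are learned during training.

```python
import numpy as np

def convolve2d(image, kernel):
    """Slide a small filter across the image, taking a weighted sum at
    every position. The same weights are reused everywhere, which is
    what lets the network detect a pattern regardless of location."""
    kh, kw = kernel.shape
    ih, iw = image.shape
    out = np.zeros((ih - kh + 1, iw - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

# Illustrative vertical-edge detector
kernel = np.array([[1, 0, -1],
                   [1, 0, -1],
                   [1, 0, -1]], dtype=float)

image = np.zeros((6, 6))
image[:, 3:] = 1.0                # left half dark, right half bright
print(convolve2d(image, kernel))  # large responses where the filter straddles the edge
```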

LeCun's system, LeNet, learned to read handwritten digits. He deployed it for the United States Postal Service, where it read handwritten zip codes on envelopes. The error rate was one percent. By two thousand one, a version of this system was reading twenty million checks per day for banks across the United States, processing roughly ten percent of all checks in the country. It ran on DSP chips inside ATMs.

This is proof.

That quote actually came years later, in twenty twelve, when LeCun saw what a deeper version of his architecture had done on a much harder problem. But the foundation was laid at Bell Labs in the early nineties: a neural network doing real work, in production, at industrial scale, while most of the field still considered neural networks a dead end.

The Deep Learning Moment

Three things came together in the years between two thousand six and two thousand twelve to turn neural networks from a fringe interest into the dominant paradigm of artificial intelligence.

First, in two thousand six, Geoffrey Hinton published a paper on deep belief networks that showed deep networks could be trained layer by layer, using a greedy algorithm that sidestepped the vanishing gradient problem. This paper initiated what researchers now call the third wave of neural networks and made the term deep learning popular.

Second, also in two thousand six, Jensen Huang's Nvidia introduced CUDA, a platform that let general purpose software run on graphics processing units originally designed for video games. Wall Street reacted with dismay. By two thousand eight, Nvidia's stock had dropped seventy percent. But Hinton and his students quickly recognized that backpropagation, which involves massive amounts of parallel matrix multiplication, was exactly the kind of computation GPUs were designed for.

Third, a computer science professor named Fei-Fei Li realized something that seems obvious in hindsight but was radical at the time. The best algorithm in the world would not work well if the data it learned from did not reflect the real world. She set out to build ImageNet, a database of over fourteen million images labeled across twenty two thousand categories. Her colleagues were skeptical. Her mentor told her she had taken the idea way too far. A pivotal moment came when a graduate student introduced her to Amazon Mechanical Turk, an online marketplace for small tasks.

Literally that day I knew the ImageNet project was going to happen.

It took two and a half years to label the images. When Li published the ImageNet paper in two thousand nine with three point two million labeled images, it was met with little fanfare. A top computer vision conference only allowed it as a poster, not an oral presentation.

Then came Alex Krizhevsky. Born in Kyiv, Ukraine, he had emigrated to Canada and was studying under Hinton at the University of Toronto. In twenty twelve, Krizhevsky built a deep convolutional neural network and trained it on ImageNet using two Nvidia GTX five eighty graphics cards. Each card had three gigabytes of memory and cost about five hundred dollars. The network did not fit on a single GPU, so he split it across both. He trained it in his bedroom at his parents' house. Five to six days of computation, ninety passes through the data.

The result was AlexNet. It achieved a top five error rate of fifteen point three percent on the ImageNet challenge, beating the second place entry by more than ten percentage points. The previous years had seen only incremental improvements using traditional computer vision methods. AlexNet did not just win. It obliterated the competition, cutting the error rate by roughly forty percent relative to the runner-up.

To put this in context: in two thousand eight, an AI course at Princeton treated neural networks as a backwater, a field where progress had stalled after initial enthusiasm in the late eighties and early nineties.

Four years later, a network trained on two consumer grade graphics cards in a bedroom changed everything.

After AlexNet, Hinton, Krizhevsky, and Ilya Sutskever founded a tiny company called DNN Research. They set up an auction at a hotel in Lake Tahoe. Four companies bid: Google, Microsoft, Baidu, and DeepMind. As the price soared past twenty million dollars, only Baidu and Google remained. Close to midnight, with the price at forty four million dollars, Hinton suspended the bidding and went to get some sleep. Google won. Hinton joined Google Brain the next year. The three person company he had auctioned off contained a future co-founder and chief scientist of OpenAI.

The Jargon Jar

This episode's term: parameters.

When someone says a model has seventy billion parameters, they mean seventy billion knobs. Each parameter is a single number, typically a decimal with many digits, that got adjusted during training. It is the weight on one connection between two neurons, or a bias term, or a value in a filter. The number is specific and was found by the training process described in episode three.

The marketing version: more parameters means smarter. A seventy billion parameter model must be better than a seven billion parameter model, which must be better than a seven hundred million parameter model. This framing is convenient for companies selling bigger models.

What it actually means in practice: parameter count tells you the model's capacity, not its capability. A model with more parameters can potentially represent more complex patterns, but it also needs more training data to learn those patterns well, more compute to train, more memory to run, and more ways to go wrong. A well trained seven billion parameter model can outperform a poorly trained seventy billion one. The parameter count is like the number of faders on a mixing board. More faders gives you more control, but only if someone knows how to set them. Otherwise you just have a bigger board producing the same noise.
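
As a concrete illustration of what gets counted, here is a quick sketch that tots up the knobs in a small, hypothetical fully connected network. The layer widths are arbitrary; the same bookkeeping, scaled up across many more and much wider layers, is where figures like seventy billion come from.

```python
# Hypothetical layer widths: 784 inputs -> 256 hidden -> 64 hidden -> 10 outputs
layers = [784, 256, 64, 10]

total = 0
for n_in, n_out in zip(layers, layers[1:]):
    weights = n_in * n_out   # one knob per connection between adjacent layers
    biases = n_out           # one bias knob per neuron in the next layer
    total += weights + biases
    print(f"{n_in} -> {n_out}: {weights + biases:,} parameters")

print(f"total: {total:,} parameters")
```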

That was the deep dive for episode two.