This is episode twelve of Actually, AI, and the season one finale.
GPT-4 scores in the ninetieth percentile on the bar exam. You read that headline, and you think: this machine almost understands law. It can nearly do what a human lawyer does. Ninetieth percentile sounds definitive. Sounds like intelligence.
Except it was not ninetieth percentile. An MIT study in twenty twenty-four took the same data and recalculated. Against first-time test-takers, the actual performance was roughly the sixty-second percentile. Against attorneys who passed, forty-eighth percentile. On the essay section, the part closest to what a real lawyer actually does, GPT-4 scored around the forty-second percentile overall and the fifteenth percentile against passing attorneys. The original ninetieth percentile figure was inflated by comparing against repeat test-takers who had already failed the exam at least once. And the model was trained on text that almost certainly included bar exam preparation materials, study guides, practice questions, and explanations of the answers. It was not demonstrating legal reasoning. It was pattern-matching against material it had already absorbed during training.
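If you are reading the transcript rather than listening, here is a rough sketch of how researchers probe for that kind of contamination: check how many long word sequences from a test question also appear verbatim in the training data. The eight-word window and the toy corpus below are illustrative assumptions, not taken from any particular study.

```python
# A simplified contamination probe: how many of a test question's long
# word sequences (n-grams) also appear verbatim in the training corpus?
# The 8-word window and the toy corpus below are illustrative only.

def ngrams(text, n=8):
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def overlap_score(question, training_documents, n=8):
    """Fraction of the question's n-grams found somewhere in the training text."""
    question_grams = ngrams(question, n)
    if not question_grams:
        return 0.0
    training_grams = set()
    for doc in training_documents:
        training_grams |= ngrams(doc, n)
    return len(question_grams & training_grams) / len(question_grams)

# Toy example: a bar-exam-style question whose phrasing echoes a study guide.
corpus = ["a contract requires offer acceptance and consideration to be enforceable under common law"]
question = "under common law a contract requires offer acceptance and consideration to be enforceable"
print(overlap_score(question, corpus))  # 0.5 -- half its 8-grams already exist in the training text
```

High overlap does not prove the model memorized the answer, but it is exactly the kind of signal that makes a ninetieth-percentile headline worth a second look.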
This is the story of how we measure intelligence in machines. It is a story about tests, about the people who build them, and about a law from economics that keeps proving itself true in the most inconvenient ways. A benchmark is a mirror held up to the field. And right now, the mirror is cracking.
In nineteen ninety-two, a sixteen-year-old girl arrived in Parsippany, New Jersey, from Chengdu, China. Her parents were educated, engineers and scientists back home, but in America they did not speak the language. Her father found work repairing cameras. Her mother became a cashier. When Fei-Fei Li enrolled at Princeton to study physics, her family opened a dry-cleaning shop. Every weekday she was a student. Every weekend she took the bus back to Parsippany to answer phones, manage inspections, and handle billing. She was the only person in the family who spoke English.
By two thousand six, Li was an AI researcher, and she noticed something the rest of the field had not. Everyone was trying to build better algorithms. She realized the algorithms were not the problem. The data was.
Pre-ImageNet, people did not believe in data. Everyone was working on completely different paradigms in AI with a tiny bit of data.
She decided to build a dataset of every object in the visual world. Her colleagues thought she had lost her mind. A mentor told her she had taken the idea way too far. Nobody would fund it. But Li had heard about Amazon Mechanical Turk, and the moment she discovered it, she knew the project was possible. Over two years, forty-nine thousand workers from a hundred and sixty-seven countries filtered and labeled over a hundred and sixty million candidate images. The result was ImageNet: more than fourteen million annotated images, organized into over twenty-two thousand categories, available free to any researcher.
ImageNet became the first great AI benchmark. Not because the dataset was perfect, but because it gave the field a shared measuring stick. In twenty ten, Li launched a competition: take our images, build the best classifier. The first winners used traditional methods and got seventy-two percent accuracy. In twenty twelve, a team from Toronto, Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton, entered a deep neural network and scored eighty-five percent. More than ten percentage points ahead of the runner-up. Deep learning went from a fringe idea to the dominant paradigm overnight. By twenty seventeen, competitors were achieving ninety-eight percent accuracy, and the organizers retired the competition. The benchmark was solved. Or rather, the benchmark was exhausted. Computer vision still had enormous unsolved problems. But on this specific test, machines had surpassed humans, and the test had nothing left to teach.
A benchmark is a test with known correct answers. That sounds simple. Build a set of questions, give them to a model, score the results. But there is a problem baked into the design, a problem that a British economist named Charles Goodhart identified in nineteen seventy-five, decades before anyone imagined testing a language model.
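For transcript readers, the simple version of that recipe fits in a few lines. This is a minimal sketch of a benchmark harness, with the model call stubbed out as a hypothetical ask_model function; every published leaderboard number is, at heart, the output of a loop like this.

```python
# A minimal benchmark harness: known questions, known answers, one score.
# `ask_model` is a hypothetical stand-in for whatever model is being evaluated.

def ask_model(question: str) -> str:
    return "42"  # placeholder: a real harness would call an actual model here

def run_benchmark(items):
    """items is a list of (question, correct_answer) pairs; returns accuracy."""
    correct = 0
    for question, answer in items:
        prediction = ask_model(question)
        if prediction.strip().lower() == answer.strip().lower():
            correct += 1
    return correct / len(items)

benchmark = [
    ("What is six times seven?", "42"),
    ("What is the capital of France?", "Paris"),
]
print(f"accuracy: {run_benchmark(benchmark):.0%}")
```

The trouble Goodhart identified is not in that loop. It is in what happens once everyone starts optimizing against the same fixed list of questions.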
Goodhart was writing about monetary policy at the Bank of England. He observed that when the government started using a statistical measure as a target for policy decisions, the measure stopped working. The act of optimizing for the number changed the thing the number was supposed to represent. His original phrasing was dry, as you would expect from an economist: "Any observed statistical regularity will tend to collapse once pressure is placed upon it for control purposes." Two decades later, an anthropologist named Marilyn Strathern sharpened it into the version that stuck: when a measure becomes a target, it ceases to be a good measure.
This is the quiet engine underneath every AI benchmark story. Build a test. Publish it. Watch the field optimize for it. Watch the scores climb. Watch the test stop measuring what it was designed to measure. Build a new test. Repeat. The GLUE benchmark for natural language understanding launched in twenty eighteen. Within roughly one year, models surpassed human performance. The team built SuperGLUE, deliberately harder, in twenty nineteen. By December twenty twenty, Microsoft's DeBERTa had surpassed humans on that too. The old benchmarks MNIST and Switchboard took twenty-plus years to saturate. GLUE took one. The cycle is accelerating, and it is accelerating because the models are getting better at exactly the thing Goodhart warned about: optimizing for the measure rather than the thing being measured.
There is a French software engineer who thinks almost everyone is confused about this. Francois Chollet, the creator of the deep learning library Keras, published a sixty-four-page paper in twenty nineteen called "On the Measure of Intelligence." His argument is precise and uncomfortable: benchmarks measure skill, not intelligence. A model that scores ninety percent on a knowledge test is demonstrating that it has memorized an enormous amount of text. It is not demonstrating that it can reason, adapt, or learn new things from a handful of examples.
They are confusing skill and intelligence. You are increasing the skill of the system, but not its intelligence. Skill is not intelligence. Generality is not specificity scaled up.
Chollet built his own benchmark to prove the point. The Abstraction and Reasoning Corpus, ARC, is a set of eight hundred puzzle-like tasks. Each gives the solver only a handful of example input-output pairs, typically around three, and asks it to deduce the underlying rule. The puzzles require the kind of on-the-fly reasoning that humans find easy and language models find nearly impossible. In twenty twenty-four, GPT-4o scored about five percent on ARC. The average human scores well above sixty percent. When OpenAI released o3 in December twenty twenty-four, it hit seventy-six percent on a low-compute run and eighty-eight percent on a high-compute run that cost an estimated twenty to thirty thousand dollars per task. But even o3 still fails on some tasks that any child could solve. Chollet was unmoved.
Intelligence is what you use when you do not know what to do. You can never pre-train on everything you might see at test time because the world changes all the time. This is why we developed intelligence in the first place.
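To make that concrete for transcript readers: an ARC task is just a few demonstration grids and one test grid. The grids and the rule below, mirror each row, are invented for illustration and are far simpler than real ARC puzzles.

```python
# The shape of an ARC-style task: a few demonstration pairs plus a test input.
# These grids and the rule (mirror each row) are invented for illustration;
# real ARC tasks are harder, and the rule is never stated anywhere.

task = {
    "train": [
        {"input": [[1, 0, 0]], "output": [[0, 0, 1]]},
        {"input": [[0, 2, 0], [3, 0, 0]], "output": [[0, 2, 0], [0, 0, 3]]},
    ],
    "test": [{"input": [[5, 0, 0], [0, 6, 0]]}],
}

def solve(grid):
    # A person infers "mirror each row" from the two demonstrations; the benchmark
    # asks whether a model can infer rules like this from equally few examples.
    return [list(reversed(row)) for row in grid]

for pair in task["train"]:
    assert solve(pair["input"]) == pair["output"]

print(solve(task["test"][0]["input"]))  # [[0, 0, 5], [0, 6, 0]]
```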
Here is what twelve episodes have shown you. AI reads in fragments, not words. It processes those fragments through billions of numerical pathways shaped by training, which is itself just a loop of being wrong and adjusting. The attention mechanism lets it weigh everything against everything else, all at once. When the patterns do not match reality, it hallucinates with complete confidence, because it has no mechanism for truth, only for probability. Embeddings turn meaning into geometry. Scaling makes all of this bigger, and bigger works, and nobody fully knows why. Humans step in with reinforcement learning to sand down the rough edges. Diffusion models learn to reverse noise. Context windows define the boundaries of what the machine can hold in mind. And inference, the act of actually running the model, costs more than building it.
Benchmarks are the attempt to put a number on all of this. A percentage. A score. A leaderboard position. And Goodhart's Law says the number will always, eventually, stop meaning what we want it to mean. Not because the people building benchmarks are careless, but because any fixed test can be overfit. Any known question can be memorized. Any metric can be gamed.
That does not mean benchmarks are useless. ImageNet catalyzed the deep learning revolution. HumanEval showed us that models could write code. ARC keeps reminding us what models still cannot do. The measurements matter. But they are proxies, all of them, for something we do not yet have a way to test directly. We can measure what a model knows. We can measure what it produces. We can even measure what it gets wrong. What we cannot measure, not yet, is whether it understands.
That was episode twelve, and the end of season one of Actually, AI. If you want the full story of benchmark contamination, Goodhart's original paper, Chollet's philosophy of intelligence, and the arena where models fight for Elo ratings, the deep dive is right after this in your feed. It is the last episode of the season, and it goes deep.