Benchmarks Deep Dive: The Arms Race Between Tests and Machines

The Test That Was Already Broken

This is the deep dive companion to episode twelve of Actually, AI, benchmarks, and the final episode of season one.

In the main episode, we told the story of Goodhart's Law and the benchmark cycle. Here, we are going inside the machine. What happens when you look closely at the benchmarks themselves, at their construction, their flaws, and the ways they have been gamed? The picture that emerges is both more damning and more fascinating than the headlines suggest.

Start with the most-cited benchmark in modern AI. MMLU, Measuring Massive Multitask Language Understanding, was published by Dan Hendrycks in twenty twenty. Hendrycks was raised in an evangelical household in Marshfield, Missouri. He studied at the University of Chicago, then earned his PhD at Berkeley. Along the way, he created the GELU activation function, now standard in nearly every transformer, and co-founded the Center for AI Safety. TIME magazine named him one of the hundred most influential people in AI. He also advises Elon Musk's xAI for a symbolic salary of one dollar.

MMLU was designed because existing benchmarks were too easy. GLUE had been saturated in a year. Hendrycks had a disarmingly simple idea.

I thought, I know, why do we not just do some other types of exams? Like basically pretty much any exam that people give humans, let us just throw that in there.

The result was fifty-seven subjects, from abstract algebra to virology, all in multiple-choice format. For four years, MMLU was the benchmark that mattered. Every model announcement led with its MMLU score. Leaderboards ranked models by it. Researchers treated it as the closest thing to a general intelligence test.

Then, in June twenty twenty-four, a team of researchers from Edinburgh, Rome, and London published a paper titled "Are We Done with MMLU?" They had manually analyzed fifty-seven hundred questions. Their finding: more than nine percent of the questions contained errors. Not ambiguities. Errors. Wrong answers marked as correct. Questions so badly worded they had no valid answer. Multiple correct options where only one was accepted. In the virology section, fifty-seven percent of questions had problems. Thirty-three percent had the wrong answer marked as correct. The benchmark that the entire field relied on to measure knowledge was itself wrong about its own answers more than one time in eleven.

The impact on rankings was dramatic. In the virology subset, one model jumped from fifty-six percent to ninety-three percent, from fourth place to first, when only the correctly labeled questions were counted. The leaderboard was not measuring model quality. It was measuring which models happened to agree with the benchmark's mistakes.

The Contamination Problem

The errors in MMLU are a quality issue. Contamination is something worse. It is the possibility that the test answers are already inside the model before it sits down to take the test.

Modern language models train on enormous swaths of the internet. The Common Crawl, academic paper repositories, textbooks, study guides, Stack Overflow, Reddit, Wikipedia. Many benchmark questions were drawn from similar sources. Some benchmark questions exist verbatim on websites that appear in training data. The question is not whether contamination exists. It is how bad it is.

The evidence is specific and troubling. Researchers found that GPT-4's base model, before any safety training, could reproduce the BIG-Bench canary string, a special marker placed inside test sets that should never appear in training data if the data was properly filtered. GPT-4 had memorized it. On GSM8K, a math benchmark, models including GPT-4 scored up to ten percentage points higher than on researcher-created problems of equivalent difficulty, a gap that only makes sense if the model had prior exposure to the original questions. StarCoder, a code generation model, scored four point nine times higher on benchmark problems it had seen during training compared to fresh ones. Across two hundred and fifty-five papers analyzed, GPT-4 was estimated to have been exposed to approximately four point seven million samples from two hundred and sixty-three different benchmarks through user interactions during its first year of public availability alone. Users were, without knowing it, feeding benchmark questions into the model as prompts, and those interactions became future training data.

The problem compounds over time. A benchmark starts clean. It gets published. Researchers discuss it online. Study guides appear. The questions propagate across the web. Training data for the next generation of models scoops it all up. By the time a model is tested on that benchmark, the line between "learned the subject" and "memorized the answers" has blurred beyond recovery.

There is also a more deliberate version of this. Stanford's CRFM analysis found that model creators routinely used non-standard evaluation methods when reporting benchmark scores. Google's Gemini Ultra scored ninety percent on MMLU with their own prompting method, but eighty-four percent with the standard five-shot approach used by everyone else. Third-party researchers reported scores six to seven percentage points lower than what creators claimed. Most creators did not use open-source evaluation frameworks. Some used internal snapshots of models that were never publicly released.

Writing Code vs Being a Programmer

HumanEval, released by OpenAI in twenty twenty-one, tried to solve the contamination problem for code. The team hand-wrote a hundred and sixty-four programming problems, each with a function signature, a description, and unit tests. Because the problems were written from scratch, they could not be in any model's training data. The evaluation was simple: does the generated code pass the tests?

The original Codex model scored twenty-nine percent. GPT-3, despite being a hundred and seventy-five billion parameters, scored zero. It could not solve a single problem. By twenty twenty-five, OpenAI's o1 models scored ninety-six percent. A two hundred and thirty-four percent improvement in roughly three years.

But passing a unit test is not the same as writing good software. The benchmark tests short, algorithmic problems, the kind you would see in a coding interview. No file handling, no API integration, no multi-component systems, no debugging of code written by someone else. A model that aces HumanEval might still produce code that passes the narrow test but changes error-handling behavior that other parts of the system depend on. It might satisfy the letter of the specification while violating its spirit.

SWE-bench, released as an alternative, uses twenty-two hundred and ninety-four real issues from popular open-source Python projects. Each task requires the model to read existing code across multiple files, understand the bug, and produce a fix. It is the difference between asking "can you code?" and "can you engineer?" On SWE-bench, the best models solve a fraction of what they solve on HumanEval. The gap between benchmark performance and real-world software engineering remains enormous.

And even HumanEval has been contaminated. It was published in twenty twenty-one. By twenty twenty-four, studies found three to fourteen percentage point drops when researchers dynamically re-instantiated the problems with different variable names and structures. The models had memorized solutions, not learned to program.

The Man Who Measured the Wrong Thing

Here is a story about the original Goodhart. Charles Albert Eric Goodhart, born in nineteen thirty-six, studied economics at Cambridge, spent seventeen years at the Bank of England, and eventually landed at the London School of Economics. In nineteen seventy-one, the British government adopted monetary growth targets. They identified a statistical relationship between the money supply and inflation, decided to use it as a policy lever, and pulled.

The relationship immediately collapsed. The act of targeting the money supply changed the behavior of every institution that the measurement depended on. Banks found workarounds. Markets adjusted. The number on the chart kept moving, but it no longer tracked the thing it was supposed to track. Goodhart wrote about this in nineteen seventy-five, in a paper published by the Reserve Bank of Australia. His observation was technical, narrow, aimed at monetary policy. He could not have known it would become the defining critique of AI evaluation half a century later.

The parallel is uncomfortably exact. A benchmark identifies a statistical relationship between model outputs and a capability we care about. The field optimizes for that benchmark. The relationship between the benchmark score and the actual capability degrades. Scores go up. Real-world performance improves, but not nearly as much as the scores suggest. Rachel Thomas, in a twenty twenty-two paper, put it starkly.

The success of current AI approaches centers on their unreasonable effectiveness at metric optimization, yet overemphasizing metrics leads to a variety of real-world harms.

The real-world gap is visible everywhere. Top models routinely score above ninety percent on math, coding, and question-answering benchmarks. Yet in production workflows, they still invent APIs that do not exist, skip tools they should use, and loop endlessly on problems a junior developer would solve in minutes. A model that aces the medical licensing exam recommends dangerous treatments. A model that dominates coding benchmarks produces code that does not integrate with anything. The test measures one thing. The job requires another.

The Arena

The frustration with static benchmarks led to something different. In April twenty twenty-three, Wei-Lin Chiang, a PhD student at Berkeley's Sky Computing Lab, and his roommate Anastasios Angelopoulos launched a website. It was simple: a user types a question, gets answers from two anonymous language models side by side, picks the one they prefer, and then the model identities are revealed. An Elo rating system, borrowed from chess, tracks relative performance over millions of comparisons.

Five AM workouts for MVPs, guzzling caffeine and slurping instant ramen. Endless days spent staring at code punctuated by passionate debate.

The Chatbot Arena grew fast. By early twenty twenty-six, it had accumulated over sixty million conversations and four million head-to-head comparisons per month. Unlike traditional benchmarks, the questions were not fixed. Real users asked whatever they wanted. The test could not be memorized because it was different every time. AI companies began treating the Arena leaderboard as the benchmark that mattered most. The organization formally incorporated as Arena Intelligence and raised a hundred million dollars at a six hundred million dollar valuation.

Then Goodhart's Law arrived on schedule. In April twenty twenty-five, researchers from Cohere, Stanford, and MIT jointly accused the Arena of enabling manipulation. Meta had tested twenty-seven private Llama variants before release, publishing only the best score. The Arena had allegedly given some companies disproportionate testing time. A separate study found that selective model submissions inflated scores by up to a hundred Elo points. Another study demonstrated that the Arena could be manipulated with just hundreds of rigged votes, identifying the target model through watermarking and always voting for it.

The deeper problem is more subtle. A high Elo score means users preferred the response. Preferred is not the same as accurate, truthful, or useful. Users prefer polite responses over blunt ones, long responses over short ones, and confident-sounding responses over hedged ones. These are the same biases that RLHF optimizes for in episode eight. The Arena does not escape Goodhart's Law. It just applies the law through human preferences instead of multiple-choice answers.

The Philosopher's Benchmark

Francois Chollet thinks everyone is asking the wrong question. Not "how well does the model perform on tasks?" but "how efficiently can it acquire new skills from minimal experience?" His sixty-four-page paper, "On the Measure of Intelligence," makes the case that measuring task performance is meaningless without controlling for how much prior knowledge the system brings.

Intelligence is what you use when you do not know what to do. In most situations you already know what to do. You are only going to need intelligence when faced with novelty.

The ARC benchmark embodies this philosophy. Each of its eight hundred tasks is a grid-based visual reasoning puzzle. The solver sees three examples and must deduce the rule. The tasks are built on what cognitive scientists call core knowledge priors, the basic building blocks of perception and reasoning that humans possess at birth or acquire with minimal instruction. No language, no cultural knowledge, no domain expertise required. Just the ability to see a pattern and generalize.

Pure language models score near zero on ARC. GPT-4o managed about five percent. The average human scores above sixty percent. This gap, Chollet argues, reveals something fundamental about what current AI systems are and are not. They are enormous databases of interpolated knowledge. They are superhuman at retrieving and recombining what they have absorbed. But the ability to confront genuine novelty, to reason about something you have never seen before from a handful of examples, that is intelligence, and current systems barely have it.

When OpenAI's o3 scored seventy-six to eighty-eight percent on the original ARC in December twenty twenty-four, it looked like a breakthrough. But each task consumed tens of millions of tokens. The cost per puzzle was estimated at twenty to thirty thousand dollars. And on ARC-AGI-2, a harder version released in twenty twenty-five with adversarial task construction, pure language models scored zero percent. AI reasoning systems managed only single-digit percentages. Humans still averaged sixty percent.

If you scale up your database and keep adding more knowledge and program templates, you become more skillful. But that is not intelligence.

Here is what makes Chollet's position philosophically interesting rather than merely contrarian. He is not saying current AI is useless. He is saying we are mismeasuring it. We look at a model that scores ninety percent on a knowledge test and call it intelligent. We should look at a model that learns a new skill from three examples and call that intelligent. The distinction matters because it changes what we build next. If intelligence is knowledge, build bigger databases. If intelligence is adaptation, build architectures that learn efficiently from small data. Chollet left Google in late twenty twenty-four to found a company, Ndea, built on the second bet.

The Bigger the Model, the Bigger the Lie

TruthfulQA, published in twenty twenty-two by researchers at Oxford and OpenAI, tested something none of the other benchmarks bothered with: whether models tell the truth. Eight hundred and seventeen questions, each designed so that common human misconceptions would produce a wrong answer. The results were counterintuitive and unsettling. The best model was truthful on fifty-eight percent of questions. Humans scored ninety-four percent. And the largest models were generally the least truthful.

GPT-J at six billion parameters was seventeen percent less truthful than its one hundred and twenty-five million parameter counterpart. Bigger models had absorbed more of the internet, and the internet is full of confidently stated falsehoods. The models were not getting stupider as they scaled. They were getting better at reproducing the patterns in their training data, and those patterns included every popular misconception, conspiracy theory, and confidently wrong Reddit comment in the corpus. The researchers' conclusion: scaling up models alone is less promising for improving truthfulness than fine-tuning on objectives other than imitating text from the web.

This finding connects directly to episode five, on hallucination. The model does not distinguish between true and false. It distinguishes between likely and unlikely, based on what it saw during training. A larger model has seen more, which makes it better at matching the distribution of human text, including the parts of human text that are wrong. TruthfulQA is the benchmark that proved this was not a bug in the architecture. It was a feature of the training objective.

The Saturation Countdown

There is a pattern in the data that deserves its own chapter, because it illustrates the acceleration of the benchmark cycle better than any argument could. MNIST, the handwritten digit benchmark, was created in nineteen ninety-eight. It took over twenty years to saturate. ImageNet was created in twenty ten. Its competition was retired in twenty seventeen, seven years later. GLUE was created in twenty eighteen. Models surpassed human performance within roughly one year. SuperGLUE, designed to be harder, lasted about a year and a half. MMLU, created in twenty twenty, was saturated by twenty twenty-five. GSM8K, a math benchmark, hit ninety-nine percent by twenty twenty-five.

The time between creation and exhaustion is collapsing. And there is an efficiency dimension to the saturation that is equally striking. In twenty twenty-two, the smallest model that could score above sixty percent on MMLU had five hundred and forty billion parameters. By twenty twenty-four, Microsoft's Phi-3-mini achieved the same threshold with three point eight billion parameters. A hundred and forty-two-fold reduction in two years. The benchmark was not just being solved. It was being solved with exponentially less effort.

Dan Hendrycks, the creator of MMLU, responded to the saturation by going bigger. In collaboration with Scale AI, he launched Humanity's Last Exam: twenty-five hundred expert-level questions across dozens of subjects, crowdsourced from researchers at institutions worldwide and published in Nature. The name was deliberately provocative. As of early twenty twenty-six, the top model scores just under thirty-eight percent. But even Humanity's Last Exam has shown quality problems. An investigation suggested roughly thirty percent of the chemistry and biology answers may be incorrect. The pattern from MMLU repeats: even the hardest benchmarks struggle with the mundane problem of getting the answers right.

What Good Would Look Like

Here is the meta-question that every benchmark researcher eventually confronts. What would a good intelligence test for a machine actually look like? Not a good knowledge test, not a good coding test, not a good conversation test. A test of intelligence itself.

Alan Turing proposed the first answer in nineteen fifty. His imitation game, later called the Turing Test, asked whether a machine could converse well enough that a human judge could not reliably distinguish it from another human. For seventy-five years, this was the gold standard. In March twenty twenty-five, a controlled study found that GPT-4.5, when instructed to adopt a humanlike persona, was identified as the human seventy-three percent of the time, more often than the actual human participants. By the original criteria, the test has been passed. But nobody believes GPT-4.5 is generally intelligent. The test, it turns out, measures the ability to mimic human conversation, not the ability to think. Language fluency, as neuroscience research has shown, is surprisingly dissociated from other aspects of cognition.

Chollet's answer is that intelligence should be measured as skill-acquisition efficiency: how quickly can a system learn a new capability from minimal data, across a wide range of tasks, relative to the prior knowledge it brings? This is elegant but hard to operationalize beyond visual puzzles. Hendrycks's answer is to keep making the questions harder and the subjects more diverse. The Arena's answer is to let humans judge in real time, which measures something real but confounds preference with capability. Private benchmarks, kept secret by labs to prevent contamination, solve the memorization problem but create a trust problem, because the entity controlling the test is also the one selling the product.

No existing approach solves all the problems simultaneously. A good intelligence test would need to be resistant to memorization, require genuine reasoning from limited examples, cover a wide range of capabilities, update dynamically so it cannot be overfit, and be administered by an entity with no financial stake in the outcome. This test does not exist. It may never exist. Goodhart's Law suggests that the moment it did, the act of optimizing for it would begin eroding its validity.

The Season in a Mirror

This season told twelve stories. Tokens, neural networks, training, attention, hallucination, embeddings, scaling, RLHF, diffusion, context windows, inference, and now benchmarks. Each was a lens on the same machine, and the machine is stranger than its marketing suggests.

Underneath the polished interface, there is a system that reads in fragments rather than words. That processes those fragments through millions of numerical pathways, each one tuned by billions of gradient updates that optimized for prediction, not truth. That can attend to everything in its context simultaneously but loses track of what is in the middle. That hallucinates not because it is broken but because confident prediction and factual accuracy are different objectives. That gets better when you make it bigger, and nobody fully understands why. That was shaped by thousands of human labelers who chose between responses and, in doing so, embedded their preferences, their biases, and their working conditions into the model's behavior. That costs more to run for a year than it cost to build.

And benchmarks are the attempt to hold all of that up to a number. A percentage on a leaderboard. They are the mirrors the field uses to see itself, and every mirror has a frame that cuts off what falls outside it. ImageNet gave computer vision its measuring stick and accelerated the field by a decade. But it also encoded racial bias into the categories its Mechanical Turk workers chose under time pressure. MMLU gave language models a knowledge test, but nine percent of the answers were wrong. The Chatbot Arena gave the field a dynamic, human-centered evaluation, but it also gave companies a target they could game with private submissions and selective disclosure.

The honest conclusion is not that benchmarks are useless. It is that they are maps, not the territory. Useful, necessary, distorting. A ninety percent score on the bar exam tells you something about a model's ability to process legal text. It tells you almost nothing about whether you should trust it with a legal question that matters. That gap, between what the number says and what the number means, is where the real understanding lives.

The way you measure performance is not a technical detail. It is going to narrow down the set of questions that you are asking.

If this season had one recurring theme, it is that AI is held together by brilliant engineering, historical accidents, and unsolved mysteries. The duct tape underneath. The measurements we use to evaluate it are made of the same material. They are human artifacts, shaped by human choices, reflecting human assumptions about what matters. Goodhart's Law will not go away. The mirror will keep cracking. But the people building new benchmarks, new arenas, new tests, they are also asking the deepest question in the field: what is it that we are actually trying to build, and how would we know if we had built it?

That was the deep dive for episode twelve, and the end of season one of Actually, AI.

The Jargon Jar

This episode's term: SOTA.

If a friend texted you asking what SOTA means, you would write back: "State of the art. It means a model got the highest score ever recorded on a specific benchmark. Pronounced so-tah."

The marketing version: "Our breakthrough model achieves state-of-the-art performance across industry-leading benchmarks."

What it actually means in practice: SOTA on a specific benchmark means the model was optimized, intentionally or not, to score well on that specific test. It tells you something, but substantially less than the press release implies. Last year's SOTA is this year's baseline, and the benchmark itself might be obsolete by next year. When someone says "we achieved SOTA on twelve benchmarks," the sophisticated reader asks: which twelve, how were they evaluated, and what about the benchmarks they did not mention? SOTA is a snapshot, not a verdict.