Actually, AI
Benchmarks: What This Means for You
19m · Apr 04, 2026
Claude Opus 4 scored 90% on MMLU—but that number probably tells you nothing about whether it'll actually clean up your spreadsheet.


The Leaderboard Will Not Help You

This is the practical companion to episode twelve of Actually, AI, the benchmarks episode and the final one of season one.

You heard the full story. Goodhart's Law eating every test the field builds. MMLU with nine percent wrong answers. The Chatbot Arena accumulating millions of votes and then getting gamed. Chollet arguing that the entire field is measuring the wrong thing. It is a fascinating intellectual saga. But you have a spreadsheet to clean up by Thursday and three AI tools open in different browser tabs, and you need to know which one to use.

Here is the uncomfortable truth. The benchmarks, the leaderboards, the percentage scores that companies trumpet in their press releases, they will tell you almost nothing about whether a model is good at the thing you need it to do. MMLU measures multiple-choice exam performance across fifty-seven academic subjects. Unless your job is to take multiple-choice exams across fifty-seven academic subjects, that number is decoration. A model that scores ninety percent on MMLU and another that scores eighty-seven percent might perform identically on your actual work, or the eighty-seven percent model might be better, because your work involves nuance and context that a multiple-choice test cannot capture.

This is not a flaw in any particular benchmark. It is the nature of benchmarks. They measure performance on the benchmark. Your work is not the benchmark.

What Actually Predicts Whether a Model Works for You

There is exactly one reliable way to know if a model is good at your task. Try it on your task. That sounds obvious. It is also the thing almost nobody does systematically.

Most people pick a model one of three ways. They use whatever their company pays for. They use whatever got the best headline last week. Or they use whatever a friend recommended. None of these are terrible strategies. All of them leave money and quality on the table. The friend's recommendation is the best of the three, because at least a human who used the model is reporting on their experience. But their tasks are not your tasks, and model performance varies wildly across different kinds of work.

Here is what does predict real-world performance, roughly in order of reliability.

First, your own experience using it. Your gut feeling after spending a day with a model is real data. Not anecdotal noise, not confirmation bias, real data. You are running hundreds of implicit evaluations every time you read an output and think "that is good" or "that missed the point." Your brain is a benchmark, and unlike MMLU, it is calibrated to exactly the tasks you care about. The AI research community has a word for this: vibes-based evaluation. They tend to say it dismissively. They are wrong to dismiss it. For individual users picking a tool for their own work, vibes are the highest-signal evaluation available.

Second, the Chatbot Arena. Of all the public benchmarks, the Arena is the least bad for practical purposes, because it measures preference rather than test scores. Real people asked real questions and picked the answer they liked better. It has problems: the deep dive covered the gaming, the selection bias, the preference for verbose and confident responses. But a model that consistently wins head-to-head comparisons against another model, across millions of diverse questions from real users, is probably better at the kind of tasks real users bring to chatbots. That is a more useful signal than "scored three percent higher on abstract algebra."

Third, domain-specific benchmarks if they exist for your field. SWE-bench for software engineering. Legal benchmarks for legal work. Medical benchmarks for medical applications. These are closer to the actual work than general benchmarks, though they still suffer from all the problems the main episode described. A model that scores well on SWE-bench is more likely to write good code than a model that scores well on MMLU. That is not a guarantee. It is a better prior.

Fourth, and dead last, the headline number. MMLU, HumanEval, whatever score the press release leads with. Not useless. But worth about as much as knowing a job candidate's university GPA. It tells you something about general capability. It tells you almost nothing about whether they will be good at this specific job.

Build Your Own Benchmark

Here is the most useful thing you can do with everything you have learned this season. Build a personal benchmark. Not a formal one. Not a published one. Something that takes thirty minutes and saves you months of using the wrong tool.

Pick five tasks from your actual work. Not hypothetical tasks. Real ones you did last week or will do next week. A customer email you needed to draft. A document you needed to summarize. A piece of code you needed to write or debug. An analysis you needed to think through. A creative brief you needed to produce. Whatever your work actually involves. Five is enough. More is better, but five gets you eighty percent of the insight.

Now run each task on three different models. If you have access to multiple tools, use them. If you only have one tool but it offers different model tiers, use those. Give each model the same prompt for each task. Do not optimize the prompt for any particular model. Just give it the task the way you would naturally describe it.

Then compare the outputs. Not with a rubric. Not with a scoring system. Just read them side by side and ask yourself three questions for each task. Which output would I use as-is, without editing? Which output would I need to fix, and how much fixing? Which output missed the point entirely?

That is your benchmark. It is more predictive of your actual experience than any leaderboard on the internet. It is resistant to contamination because nobody trained on your specific tasks. It is resistant to Goodhart's Law because you are not optimizing for a number, you are checking whether the output is useful. And it takes half an hour.
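If you like working in code, the whole procedure fits in a short script. This is a minimal sketch, not a real harness: `ask` is a placeholder for whatever API or chat interface you actually use, and the task strings and model names are made up for illustration.

```python
import json

# Placeholder for your real model call (an API client, a chat window,
# whatever you actually use). Here it just echoes so the script runs.
def ask(model: str, prompt: str) -> str:
    return f"[{model}] response to: {prompt}"

# Five real tasks from your own work. These strings are stand-ins.
tasks = [
    "Draft a reply to the customer asking about the delayed shipment.",
    "Summarize the Q3 planning document in five bullet points.",
    "Debug this function that drops the last row of the CSV.",
    "List the pros and cons of moving reports to a weekly cadence.",
    "Write a creative brief for the spring launch email.",
]

models = ["model-a", "model-b", "model-c"]  # whatever you have access to

# Run every task on every model with the same, unoptimized prompt.
outputs = {task: {model: ask(model, task) for model in models}
           for task in tasks}

# Read the outputs side by side, then record one of three judgments
# per output: "use-as-is", "needs-fixing", or "missed-the-point".
judgments = {task: {model: None for model in models} for task in tasks}

# Keep the record: models change, and you will want to rerun this.
record = json.dumps({"outputs": outputs, "judgments": judgments}, indent=2)
# e.g. save `record` to personal_benchmark.json
```

The code is not the point; the record is. Five tasks, three models, three judgments each is fifteen data points about your actual work, which is fifteen more than any leaderboard gives you.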

Write down what you found. Not because you need formal documentation, but because models change. Providers update them, sometimes without announcement. The model that won your benchmark in March might lose it in June. Having a record of what worked lets you rerun the comparison when something feels off.

Try This Right Now

Here is the five-minute version.

Think of the last thing you used AI for where the result was not quite right. The email that needed heavy editing. The code that almost worked. The summary that missed the key point. Open a different model, one you do not normally use. Give it the same task with the same prompt.

Compare the two outputs. Was the second one better? Worse? Different in an interesting way? You just ran a one-task, two-model benchmark. You now know something about both models that no leaderboard could tell you. Scale that up to five tasks and three models, and you have a genuine evaluation framework.

The Model Selection Cheat Sheet

After twelve episodes of understanding how these systems work, here is a practical framework for picking the right model for any given task.

For quick, structured tasks like classification, extraction, formatting, and short-form generation, use the smallest model available. Speed matters more than depth. Episode seven covered this: the scaling curve has diminishing returns, and for simple tasks, you are paying for capability you do not need.

For professional work like drafting, analysis, research assistance, and code generation, use a mid-tier or frontier model. The quality difference on tasks requiring judgment, the kind where the right answer depends on things you did not explicitly spell out, is real and worth paying for.

For hard reasoning tasks, like debugging multi-system code, synthesizing conflicting sources, or writing something that needs to maintain coherence across thousands of words, use the best model you have access to, including extended thinking if it is available. These are the tasks where the frontier model earns its price.

For tasks where accuracy is critical, like legal research, medical information, or anything with real consequences, do not trust any model alone regardless of its benchmark scores. Use the model to draft. Verify against primary sources. Cross-check with a second model if the stakes justify it. Episode five covered why: the mechanism that generates fluent text and the mechanism that generates accurate text are not the same mechanism.

And for anything new, a task you have not tried with AI before, run it on two or three models before committing to one. Five minutes of comparison saves weeks of mediocre output.
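The cheat sheet above is really a lookup table, and you can write it down as one. A sketch, with made-up tier names standing in for whatever models you actually have:

```python
# Illustrative routing table for the cheat sheet above. Tier names and
# task categories are placeholders, not real products.
ROUTES = {
    "classification": "small",
    "extraction": "small",
    "formatting": "small",
    "drafting": "mid-or-frontier",
    "analysis": "mid-or-frontier",
    "code-generation": "mid-or-frontier",
    "hard-reasoning": "frontier-with-extended-thinking",
    "high-stakes": "frontier-plus-independent-verification",
}

def pick_model(task_type: str) -> str:
    # Anything not in the table is a new kind of task: compare two or
    # three models before committing to one.
    return ROUTES.get(task_type, "compare-multiple-models-first")
```

Writing your own version forces the useful question: which of your recurring tasks actually need the expensive tier, and which are you overpaying for?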

What You Know Now

Twelve episodes. Twelve mechanisms. Here is what you actually walked away with.

You know that AI reads in fragments, not words, and that the fragmentation itself shapes everything downstream. You know that a neural network is not a brain but a mountain of numbers, each one adjusted by the slow pressure of being wrong. You know that training is not teaching but pattern extraction at industrial scale. You know that attention, the mechanism that let one paper remake the entire field, is about letting every fragment see every other fragment simultaneously. You know that hallucination is not a bug but an inevitable consequence of a system optimized for plausible next tokens rather than truth. You know that meaning is geometry, that embeddings place concepts in a space where distance encodes relationship. You know that bigger models work better and nobody fully understands why, that the scaling curve bends but does not break. You know that human preferences were baked into these systems by thousands of labelers whose choices became the model's personality. You know that image generation is noise reversal, that context windows define the boundary of what the machine can hold in mind, that inference is the expensive part, and that benchmarks are maps that keep being mistaken for the territory.

That is a real understanding. Not complete. Not expert-level. But genuine. You can read a press release about a new model and know which claims are meaningful and which are marketing. You can hear someone say "ninety percent on the bar exam" and know why that number is less impressive than it sounds. You can watch a demo of a new AI product and have a mental model of what is happening underneath the interface. You are not at the mercy of the hype cycle anymore.

That is what this season was for. Not to make you an AI engineer. To make you an informed user in a world that profits from your confusion. The companies selling these tools have every incentive to make them seem like magic. The critics have every incentive to make them seem like fraud. The truth is more interesting than either version. These are extraordinary engineering achievements held together by duct tape, lucky accidents, and unsolved mysteries. They are useful in ways that were unimaginable five years ago and limited in ways that the marketing will never mention. Both things are true simultaneously, and now you can hold both.

The benchmarks will keep coming. New leaderboards, new scores, new claims of state-of-the-art performance. Goodhart's Law will keep eroding them. The models will keep getting better, and the measurements will keep struggling to capture what "better" means. But you do not need the leaderboard to tell you what works. You have five tasks from your own work, three models to compare, and the mechanical understanding to interpret what you see. That is worth more than any score on any test.

That was the practical companion to episode twelve, and the end of season one of Actually, AI. Thank you for listening.