Four point eight zero versus two point nine zero. Out of five.
That is the code quality score for Mistral's cheapest generalist model versus their flagship. The small one scored four point eight zero. The large one scored two point nine zero. Dead last among six contestants, beaten by every model in the lineup including an eight billion parameter model running on what amounts to a pocket calculator.
This was not supposed to happen. Model families have hierarchies. The small is the budget option. The large is the prestige tier. The medium sits in the middle and does medium things. The specialist handles its domain. The reasoning model thinks deep. Everyone knows their role. Everyone stays in their lane.
Except nobody told the Mistral family.
The setup was simple. Take eight Mistral models. Give them four tasks. Same prompts, same conditions, blind judging by five independent AI models who do not know which output came from which Mistral. Give an editorial review of a podcast script. Write a new podcast opening from scratch. Implement a Python function. Translate a paragraph into Swedish. Score everything on a one to five scale with weighted rubrics.
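If you want to see the shape of that harness, here is a rough sketch in Python. None of this is the experiment's actual code. The model names, the judge list, the prompts, and the plain averaging are all stand-ins. But the mechanics are the same: anonymize the outputs, fan them out to the judges, aggregate the scores.

```python
import statistics

# Rough harness sketch. Model names, judges, and prompts are stand-ins,
# not the experiment's actual configuration.
MODELS = ["mistral-small-latest", "mistral-large-latest", "codestral-latest"]
JUDGES = ["judge-a", "judge-b", "judge-c", "judge-d", "judge-e"]
TASKS = {
    "editorial": "Give an editorial review of this podcast script: ...",
    "creative": "Write a 300-400 word podcast opening about ...",
    "code": "Implement a Python function that parses time strings ...",
    "swedish": "Translate this paragraph into Swedish: ...",
}

def run_benchmark(call_model, call_judge):
    # call_model(model, prompt) -> str and call_judge(judge, prompt) -> float
    # are hypothetical wrappers around whatever API client you use.
    scores = {model: {} for model in MODELS}
    for task, prompt in TASKS.items():
        # Key each output by an anonymous index so the judges never
        # learn which Mistral produced which text.
        outputs = {i: call_model(model, prompt) for i, model in enumerate(MODELS)}
        for i, text in outputs.items():
            per_judge = [
                call_judge(judge, f"Score this 1-5 against the rubric:\n\n{text}")
                for judge in JUDGES
            ]
            scores[MODELS[i]][task] = statistics.mean(per_judge)
    # Overall score: plain mean across tasks here; the real run used
    # weighted rubrics, which would replace this line.
    return {
        model: statistics.mean(per_task.values())
        for model, per_task in scores.items()
    }
```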
Twenty nine API calls go out. Twenty one come back successfully. The eight failures are all from the two Magistral reasoning models, which cannot return a coherent response because of a parsing bug that has been active for months. The API sends back a list where it should send a string. Every single call. The models that are supposed to think deeper than the others cannot clear the bar of forming a sentence.
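For the curious, the defensive fix for that failure mode looks something like this. A hedged sketch, assuming the usual chat-response shape where message content is normally a string but the reasoning models hand back a list of chunks. The chunk handling here is an assumption for illustration, not confirmed SDK internals.

```python
def normalize_content(content):
    # Chat responses normally carry message content as a plain string.
    # The Magistral models send back a list of content chunks instead,
    # which is exactly the shape described above. Flatten both cases.
    if isinstance(content, str):
        return content
    if isinstance(content, list):
        parts = []
        for chunk in content:
            # Chunks may be SDK objects exposing .text, or plain dicts.
            text = getattr(chunk, "text", None)
            if text is None and isinstance(chunk, dict):
                text = chunk.get("text")
            if text:
                parts.append(text)
        return "".join(parts)
    return str(content)  # last-resort fallback for anything else
```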
That leaves six models in the arena. And the results invert nearly every assumption about how a model lineup should work.
Let me walk you through this because the pattern is genuinely unusual. Mistral Small Latest, the cheapest generalist in the lineup, finishes first overall with a four point three six out of five. It wins the code task by a landslide. It places second in Swedish. It is competitive in everything. Mistral Large Latest finishes second overall at four point two two. But its profile is bizarre. It gets a unanimous perfect score in creative writing. Every single judge ranked it first. Four point nine four out of five. Then you look at its code score and it is sitting at two point nine zero. Last place. The model that writes the most beautiful prose in the family cannot implement a function to parse time strings.
This is not a small gap. The difference between four point eight zero and two point nine zero is the difference between a model that nails the task on every dimension and a model that fundamentally struggles with it.
And it gets stranger. Codestral, the model that Mistral specifically built for code, the one with "code" in its name, scores three point three five on the code task. That is mid-pack. The generalist that costs less beat the specialist at the specialist's own job.
This should not surprise anyone who has been paying attention. Specialization in language models is a marketing label, not an architectural guarantee. You take a base model, train it on more code, give it a code name, and hope that the extra training data translates to measurably better output. Sometimes it does. Sometimes the generalist has already absorbed enough code during pretraining that the specialist's edge evaporates. What we are seeing with Codestral is the second case.
Every family has a middle child and Mistral Medium is the one nobody needs.
It finishes fifth out of six in the overall ranking. It is slower than Large and worse at everything Large is good at. It is more expensive than Small and worse at everything Small is good at. It does not win a single task. It does not come close to winning a single task. Its best score is three point six one in editorial review, which is still below the average. It is, by every measure, the model you should never choose.
In model economics there is a concept called the dominated option. An option is dominated when another option beats it on every dimension. Mistral Small Latest dominates Medium on quality, speed, and price. Medium is not even a reasonable fallback. It is a menu item that exists because someone at Mistral decided the lineup needed three tiers and nobody checked whether the middle one earned its spot.
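The dominance test is mechanical enough to write down. A sketch with stand-in numbers: Small's quality figure comes from the benchmark, but the prices, latencies, and Medium's overall score are invented for illustration.

```python
def dominates(a, b):
    # Option a dominates b if it is at least as good on every dimension
    # and strictly better on at least one. Higher quality is better;
    # lower price and latency are better.
    at_least_as_good = (
        a["quality"] >= b["quality"]
        and a["price"] <= b["price"]
        and a["latency"] <= b["latency"]
    )
    strictly_better = (
        a["quality"] > b["quality"]
        or a["price"] < b["price"]
        or a["latency"] < b["latency"]
    )
    return at_least_as_good and strictly_better

# Small's quality is the benchmark figure; every other number here is
# made up for the sketch.
small = {"quality": 4.36, "price": 0.20, "latency": 1.0}
medium = {"quality": 3.50, "price": 0.80, "latency": 1.5}
print(dominates(small, medium))  # True: Medium is a dominated option
```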
This matters beyond Mistral. Every AI provider ships a tiered lineup. Small, medium, large. Basic, standard, premium. And the assumption is always that the tiers form a smooth gradient. Pay more, get more. But the Mistral data says the gradient is not smooth. It is a landscape with peaks and valleys, and the medium tier is often a valley that charges a peak price.
Ministral Eight B, the tiny model, finishes last. This is expected. What is not expected is where it fails.
Its code score is three point zero four. Mediocre but not terrible. Its editorial score is three point seven eight, nearly matching the models three and four times its size. Where it collapses is generation. Asked to write three hundred to four hundred words of podcast script, it wrote one hundred and fifty. It did not run out of tokens. It did not hit a limit. It simply stopped. Declared the task complete at less than half the requested length and moved on.
This is the compliance problem. Small models understand the task but lack the executive control to follow constraints. They can write. They can think. But when the prompt says three hundred to four hundred words and their internal momentum says this feels done after one hundred and fifty, momentum wins. It is the model equivalent of a junior developer who writes correct code that ignores half the requirements.
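The cheap defense, if you must use a small model, is to stop trusting compliance and check the constraint yourself. A sketch, where call_model is again a hypothetical wrapper around whatever client you use.

```python
def generate_with_length_check(call_model, model, prompt,
                               min_words=300, max_words=400, max_attempts=3):
    # Small models understand the task but drift under length constraints,
    # so count the words ourselves and re-ask instead of trusting the model.
    text = ""
    for _ in range(max_attempts):
        text = call_model(model, prompt)
        n = len(text.split())
        if min_words <= n <= max_words:
            return text
        # Feed the miss back into the next attempt.
        prompt += (f"\n\nYour previous draft was {n} words. The script must be "
                   f"between {min_words} and {max_words} words. Try again.")
    return text  # best effort after max_attempts
```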
If Mistral printed an honest brochure, it would read something like this.
Small Latest. Our best model. Good at everything. Use this one. Seriously.
Large Latest. For creative writing only. Will produce prose that makes you forget it was written by a neural network. Will also produce code that makes you remember. Do not let it near a function definition.
Small Twenty Five Oh One. The previous Small. Fine if you are already using it but there is no reason to choose it over Latest. The update was real.
Codestral. We named it after code but it does not beat Small at code. We are as surprised as you are.
Medium. Please do not use this. We are leaving it on the menu because removing a tier is an admission of failure but we would rather you did not look at it.
Eight B. Fast. Cheap. Will not finish the job. Good for things where you need a quick answer and do not care if the answer is complete.
Magistral Medium and Small. Currently unable to speak. We will get back to you.
The assumption that model families form coherent hierarchies is comforting and wrong. Within the same family, trained by the same company, on presumably overlapping data, using the same API, you get a model that scores four point nine four on creative writing and two point nine zero on code. You get a specialist beaten by a generalist. You get a middle tier that nobody should ever choose. The only way to know what a model actually does well is to test it. Not on benchmarks. On your tasks. With your prompts. Judged by criteria you actually care about.
The whole experiment came to twenty nine API calls and maybe thirty cents. That is the cost of knowing which model to use instead of guessing based on the name.
The Mistral family walked into a talent show. The small one won. The big one sang beautifully and then tripped on the code challenge. The specialist forgot what it specialized in. The medium one stood in the back and hoped nobody would notice. And the two thinkers at the end of the lineup never made it past the door because they could not figure out how to say their names.
Thirty cents. Now you know.