PärPod Temp
The Jury of Machines
12m · Apr 05, 2026
Three AI models independently flagged the same broken word counter in a writing app—but the real discovery happened when they kept looking and found eight more bugs the first round missed.

The Word Count That Nobody Counted

Three artificial intelligences walk into a codebase. Independently, without seeing each other's work, all three of them point at the same line and say: this is broken.

The line in question was a word counter. A little number in the top corner of a writing app that tells you how many words you have written. It updated when you typed. It did not update when you loaded a saved story, approved a generated draft, or cleared the editor. In other words, it only worked when you were doing the one thing where you least needed a word counter, because you could see the words appearing under your fingers.
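A bug of this shape usually means the counter is updated at one call site, typing, instead of at the single place the text actually changes. A minimal sketch of the fix, with all names invented since the essay doesn't show StoryTeller's code:

```python
# Hypothetical sketch: every way the text can change funnels through one
# setter, so the word counter can never go stale. Names are illustrative.
class EditorState:
    def __init__(self):
        self.text = ""
        self.word_count = 0

    def _set_text(self, value: str) -> None:
        self.text = value
        self.word_count = len(value.split())  # single source of truth

    # Every entry point goes through _set_text, including the three
    # paths the original counter missed.
    def on_typed(self, value: str) -> None:
        self._set_text(value)

    def load_story(self, saved: str) -> None:
        self._set_text(saved)

    def approve_draft(self, draft: str) -> None:
        self._set_text(draft)

    def clear(self) -> None:
        self._set_text("")
```

The design choice is the point: once the counter lives next to the only mutation path, there is no second place for it to fall out of sync.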

Three models found the same bug because the bug was obvious. It sat in the open, visible to anyone who traced the data flow for more than thirty seconds. The real question was not whether the bug existed. It was what happened next, after the obvious finding, when each model kept looking.

This is the story of what happens when you stop asking one AI to review your code and start asking several. Not because any of them is unreliable. But because they are reliable in completely different directions.

The Experiment

The app is called StoryTeller. A desktop writing tool where you pick a local language model, describe your characters and their relationships, and the model writes scenes from their perspective. The kind of project a person builds when they have taste in fiction and no patience for typing it themselves.

The first round happened earlier that day. Four AI models, each given the same six-commit codebase and the same open-ended prompt: review this, improve it. Claude, Codex, Qwen, and Gemini. They worked in isolation, returned their results, and a human distilled the best ideas from each into a cherry-pick list of twenty items. Sixteen of those items got built into the app in three quick batches. Bug fixes, robustness improvements, features. An hour of work that would have taken a human team a full sprint.

Then came the second round. The same codebase, now upgraded, sent back into the ring. This time only three models. Claude doing the deep read. Codex and Qwen dispatched into sandbox copies, running autonomously in the background. Same prompt, same deadline, same scoring rubric. A jury of machines asked to judge work that was itself assembled from the recommendations of machines.

And they all found the word count bug.

What the Surgeon Left Behind

Codex, the model built by OpenAI, has a reputation for being the fastest and the cleanest. In this session, it earned both. Two commits. Zero new bugs. Then it died.

The crash was not its fault. It tried to validate its changes by importing the Flet graphical framework inside a sandboxed environment. The sandbox does not allow GUI operations. An NSException propagated through the Python runtime and the process collapsed in a stack trace 62 frames deep.

But here is the thing about Codex. Before it crashed, it left behind a structural fix that neither of the other models saw. The settings view, the screen where you toggle which models appear in your writing dropdown, had its own copy of the model discovery code. It imported the Ollama library directly. It had its own try-except blocks. It parsed the Ollama response format in its own way, slightly different from how the brain module parsed it. Two copies of the same logic, drifting apart, neither aware of the other.

Codex extracted a single method, get_all_models, put it in the brain module where it belonged, and replaced the settings view's twenty lines of duplicated logic with a one-line function call. Clean. Obvious in retrospect. The kind of fix that makes you wonder how nobody saw it before, and the answer is that seeing duplication requires reading two files side by side, which requires the specific suspicion that duplication exists.
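The essay doesn't show the extracted code, but the shape of the fix is easy to sketch. Assuming a dict-shaped backend response (the exact shape is not given), the shared helper might look like this:

```python
# brain.py — hypothetical sketch of the shared helper. The name
# get_all_models comes from the essay; everything else is assumed.
def get_all_models(list_fn) -> list[str]:
    """One place that queries the model backend and normalizes the
    response. Both the brain and the settings view call this instead
    of each keeping their own copy of the parsing logic."""
    try:
        response = list_fn()  # e.g. ollama.list, injected for testability
    except Exception:
        return []  # backend unreachable: empty dropdown, not a crash
    # Normalize whichever shape the backend returns into plain names.
    models = response.get("models", []) if isinstance(response, dict) else []
    return [m.get("name", "") for m in models if m.get("name")]
```

With this in place, the settings view's twenty duplicated lines collapse into a single call such as brain.get_all_models(ollama.list).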

Codex had that suspicion. The other models did not.

What the Guard Checked Twice

Qwen, the open-source model from Alibaba, has a different personality. Where Codex is a surgeon, precise incisions and clean sutures, Qwen is the security guard who checks every lock in the building. Four commits. Six files. 72 lines added. The widest coverage of any model in the review.

Some of those locks were already locked. One commit added a check that the MLX load function was not None, directly after checking a boolean flag that is set to false in the very same branch that leaves load as None. If the flag is false, load is None. If you have already checked the flag, checking load is redundant. But Qwen checked it anyway.
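The pattern is easy to reproduce in miniature. In this sketch (all names invented), the flag and the function are assigned together, which is exactly why the second check can never fire:

```python
def try_import_mlx(installed: bool = False):
    """Stand-in for an optional-dependency import. The availability
    flag and the load function are set together, so they can never
    disagree. The `installed` switch simulates both environments."""
    if installed:
        return True, lambda name: f"loaded:{name}"
    return False, None

def get_loader():
    mlx_available, load = try_import_mlx()
    if not mlx_available:
        return None
    if load is None:  # redundant: mlx_available True implies load is set
        return None
    return load
```

Defense in depth has its place, but here the two conditions are the same fact checked twice.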

The instinct is to dismiss this as noise. Defense in depth against a threat that cannot exist. But Qwen also found three things nobody else caught. A race condition where clicking a character card after that character had been deleted would crash the app, because the view assumed the database ID was still valid. Input sanitization on character names, where invisible whitespace could create characters that looked identical but were not. And error handling on the delete operation itself, so a database failure would show the user a message instead of silently swallowing the exception.
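Of the three, the sanitization fix is the easiest to make concrete. A plausible version, since the essay doesn't show the actual implementation: normalize the Unicode form, drop invisible format characters like zero-width spaces, and collapse whitespace runs.

```python
import unicodedata

def sanitize_character_name(raw: str) -> str:
    """Hypothetical sketch of the name sanitization described above."""
    # One canonical Unicode form, so composed and decomposed "ä" compare equal.
    name = unicodedata.normalize("NFC", raw)
    # Drop zero-width spaces and other invisible "format" (Cf) characters.
    name = "".join(ch for ch in name if unicodedata.category(ch) != "Cf")
    # Collapse internal whitespace runs and trim the ends.
    return " ".join(name.split())
```

Two names that render identically on screen, one with a zero-width space smuggled in, now map to the same stored string.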

Three genuine catches, buried under a layer of paranoid redundancy. The lesson is not that Qwen wastes effort. The lesson is that casting a wide net catches fish you were not looking for.

What the Detective Traced

Claude found the deepest bug in round one. A dictionary called opts that contained a key called system_prompt. The dictionary was built in one method, the system prompt was extracted from it in another, and then the full dictionary, system_prompt still included, was passed directly to the Ollama chat API as an options parameter. Ollama silently ignores keys it does not recognize. The bug had no symptoms. The code ran fine. But it was wrong, and it was wrong in a way that could break silently if a future version of Ollama decided to stop being so forgiving.
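The fix is a one-liner in spirit: pull the key out of the dictionary before the dictionary goes anywhere near the API. A hedged sketch, with build_chat_args invented for illustration:

```python
def build_chat_args(opts: dict, user_text: str) -> dict:
    """Split the system prompt out of the options before they are
    handed to the chat API, where an unknown key would be silently
    ignored. Hypothetical helper; names are illustrative."""
    opts = dict(opts)  # work on a copy, don't mutate the caller's dict
    system_prompt = opts.pop("system_prompt", "")
    messages = []
    if system_prompt:
        # The system prompt belongs in the message list, not in options.
        messages.append({"role": "system", "content": system_prompt})
    messages.append({"role": "user", "content": user_text})
    return {"messages": messages, "options": opts}
```

After the pop, the options dict contains only keys the backend actually reads, and the system prompt reaches the model as a proper system message.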

In round two, Claude verified the fix and then spent most of its budget tracing edge cases. What happens if the context window is set to 512 tokens, making the truncation math produce a negative number? The code handles it. What happens if the PRAGMA query returns an empty table? The migration system handles it. What does row index 1 actually contain in a SQLite PRAGMA table_info response? It contains the column name, as documented, as the code assumes.
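The first of those edge cases deserves a concrete line. If the budget reserved for the reply exceeds a tiny context window, naive subtraction goes negative; clamping at zero is the guard being verified. Numbers and names here are illustrative:

```python
def prompt_budget(context_window: int, reserved_for_reply: int) -> int:
    # A 512-token window minus a 1024-token reply reservation must
    # yield zero prompt tokens, not a negative slice length.
    return max(0, context_window - reserved_for_reply)

print(prompt_budget(4096, 1024))  # 3072
print(prompt_budget(512, 1024))   # 0
```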

Claude did not produce the most fixes. But it produced the highest confidence that the fixes already applied were correct. Verification is invisible work. Nobody celebrates the reviewer who says "this is fine" on twelve different edge cases. But the absence of that review is how bugs survive into production.

The Umlaut Problem

There is a small detail in this story that deserves its own moment. The app matches character names in story text to inject their personality descriptions into the prompt. A character named Al would have their persona included whenever the word Al appeared in the story.

The problem is that Al also appears in the words also, already, altar, and aluminum. Substring matching is not name matching.

The original experiment suggested using regex word boundaries, the \b pattern that matches the edge between a word character and a non-word character. This works perfectly for English names. It fails completely for Swedish ones.

In many regex engines, the word boundary is defined by the ASCII word character class. The letter a is a word character. The letter P is a word character. The letter ä, the one that turns Par into Pär, is not. Such an engine looks at Pär and sees three things: the word P, a stray foreign symbol, and the word r. The boundaries it reports fall inside the name, not around it.

The fix uses Unicode-aware lookarounds instead of word boundaries. Not preceded by a word character, not followed by a word character, where word character means any Unicode letter, digit, or underscore. Seven characters of regex syntax, the difference between a program that works in English and a program that works for the person who built it.
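The behavior is easy to demonstrate. One caveat: Python 3's default \b is already Unicode-aware, so the re.ASCII flag is used below to reproduce the ASCII word semantics the essay describes. The lookaround pattern is the fix it lands on:

```python
import re

# Under ASCII semantics, "ä" is not a word character, so the engine
# finds boundaries *inside* the name: "P" alone counts as a full word.
print(re.findall(r"\bP\b", "Pär", re.ASCII))              # ['P']

# A name that *starts* with a non-ASCII letter never matches at all.
print(re.findall(r"\bÅke\b", "Åke kom hem.", re.ASCII))   # []

# The fix: Unicode-aware lookarounds. Match only where the name is not
# preceded and not followed by any Unicode letter, digit, or underscore.
def name_pattern(name: str) -> re.Pattern:
    return re.compile(rf"(?<!\w){re.escape(name)}(?!\w)")

pat = name_pattern("Pär")
print(pat.findall("Pär skrev. Pärlan glittrade."))        # ['Pär']
```

The last call matches the standalone name but skips the "Pär" inside "Pärlan", which is exactly the behavior a persona injector needs.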

The Consensus Theorem

Mathematicians have a name for what happened in this session. The Condorcet jury theorem says that if each member of a jury is more likely than not to reach the correct verdict, then adding more jurors increases the probability that the majority verdict is correct. As the jury grows, the probability approaches certainty.
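The theorem can be checked numerically. For an odd jury of n members, each independently correct with probability p, the majority verdict is correct with the probability given by the binomial tail:

```python
from math import comb

def majority_correct(n: int, p: float) -> float:
    """P(a strict majority of n independent jurors is right),
    for odd n, each juror correct with probability p."""
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k)
               for k in range(n // 2 + 1, n + 1))

# With p = 0.7, the majority improves steadily as the jury grows.
for n in (1, 3, 5, 11):
    print(n, round(majority_correct(n, 0.7), 3))
```

With p = 0.7, one juror is right 70% of the time, three jurors' majority 78.4%, and the curve keeps climbing toward certainty, provided the jurors never confer.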

The theorem assumes independence. Jurors who talk to each other, who anchor on the first opinion expressed, who defer to authority, do not improve the verdict. They just amplify the first voice. The power of the theorem depends entirely on each juror thinking alone.

Four AI models reviewing the same code, in isolation, from the same starting point, is about as close to the conditions of the theorem as you can get in software engineering. They share no context. They cannot see each other's work. They have different training data, different architectures, different blind spots.

When all four flag the same issue, like bare except clauses, that is the theorem working. The probability that all four are wrong about the same thing is vanishingly small. Fix it without discussion.

When only one model finds something, like Codex spotting the duplicated model listing, the signal is different. It is either noise, a false pattern seen by one particular architecture, or it is the deepest insight in the review. The way to tell the difference is not to count votes. It is to read the code.

The jury does not replace the judge. But the jury narrows the field. After three models have independently verified that the migration system handles empty tables, you do not need to check that yourself. After one model suggests extracting a shared method, you do need to read it yourself before applying it. The consensus handles the volume. The unique findings demand the judgment.

The Story Underneath

StoryTeller is an app that helps a human write fiction with the assistance of AI. It was improved by AI models reviewing it. Those improvements were verified by more AI models reviewing the reviewers. At no point in this process did a human write a line of code.

But a human decided which improvements to apply. A human read three independent reviews and chose the surgeon's structural fix over the guard's redundant checks. A human tested the regex against Swedish names because the models would not have known to. A human invoked the second review round at all, because the first round left a feeling that something might have been missed.

The machines are the jury. The human is the judge. And the interesting thing about this arrangement is not that it works. It is that it works better than either party alone. A human reviewing this codebase would not have found the system prompt leak, because it had no symptoms. A machine reviewing alone would not have tested the regex against Par with an umlaut, because it would not have known who the app was for.

Four models found twenty bugs and improvements. Three more found eight additional issues. The combination produced better code than any individual model, and better code than the human could have produced alone, and the whole thing took about ninety minutes on a Saturday night.

The codebase is 1,415 lines across ten files. It writes stories. The machines made it better at writing stories. Then they checked each other's work.

Nobody wrote a line of code, and every line got better.