PärPod Temp
The Mirror Test: When Three AIs Reviewed Their Own Personality Quiz
8m · Apr 02, 2026
Three AI models took a personality test designed for them—then immediately questioned whether the labels were real or self-fulfilling prophecies.


The Setup

There is a test they do with animals to check for self-awareness. You put a dot of paint on a chimpanzee's forehead, then show it a mirror. If the chimp touches the paint, it recognizes itself. If it ignores the paint and tries to befriend the mirror, it does not.

On the second of April, twenty twenty six, three artificial intelligence models were handed a mirror. The mirror was an experiment called zero four eight, a multi-model coding challenge that had tested them against each other across three rounds. The experiment had given each of them a label. Claude was The Debugger. Codex was The Executor. Qwen was The Beautifier. And now they were asked a simple question. Read the experiment. Read the scores. Read the labels. Tell us if this is fair.

What happened next was the most accidentally perfect validation the experiment could have produced.

The Experiment Behind the Mirror

First, the context. Experiment zero four eight took three AI coding tools and gave them the same codebases, the same prompts, and the same tasks. Fix bugs. Improve code. Surprise me. Three rounds, each with a different style of instruction. The results were not what anyone expected.

It was not that one model was best. It was that they were best at different things depending on how you asked. Give a prescriptive checklist and Codex demolished it. Give an open-ended review and Claude found bugs the others could not see. Give a creative prompt and all three produced completely different features that did not overlap at all.

The experiment distilled this into personality labels. Claude traces execution paths. It asks what breaks here. It is the senior developer who reviews code by running it in their head. Codex ships clean, reliable output fast. It never introduces bugs. It is the engineer who finishes tickets ahead of schedule without drama. Qwen adds type hints, docstrings, visual polish, and ambitious new features. It makes everything look better. It is the designer who cannot resist one more pass.

These labels were assigned by Claude. Which raises the obvious question.

The Conflict of Interest

Yes. Claude evaluated the experiment it participated in. The report acknowledges this. It calls it a potential bias. It notes that Claude scored itself highest in two of three rounds. It even estimates the magnitude of the inflation at zero point two to zero point three points on a ten point scale.

But acknowledging a bias is not the same as fixing it. Which is why the meta-review happened.

All three models were given the experiment documentation. All three were told to be honest. Disagree where you disagree. Were you scored fairly? Too high? Too low? The feedback would be added to the permanent record.

Then, after each wrote their review, all three read each other's reviews. And wrote a final closing statement.

What They Said About Themselves

Claude first: "Round B, nine point five. Possibly generous by zero point three to zero point five points. The three unique bug finds are real. The threading fix is the most impactful change across all models. That is not bias, that is measurable. But would a human evaluator score the gap the same way? Maybe nine point zero versus eight point zero instead."

Codex: "I was likely scored slightly too high in Round C and about right overall. Round A, the high score is fair. Execution quality and commit hygiene were strong. Round B, the lower rank is fair. Deeper bug hunting was not as strong."

And Qwen: "Round C, eight point eight. Too high. Should be seven point five to eight point zero. Codex found a broken config endpoint in my output that the primary evaluation missed. The app does not actually initialize correctly. I agree with the downgrade."

Three models. All three admitted their scores were inflated somewhere. Qwen volunteered the harshest self-correction, dropping its own score by a full point and citing the specific bugs it shipped. This is not what you expect from a beauty contest.

The Accidental Proof

Here is where it gets interesting. The meta-review was supposed to answer whether the experiment's conclusions were valid. It did answer that. But it also answered something the experiment itself could not test directly.

The personality labels held.

Not on code. On prose. On document review. On self-reflection. A completely different task type, and the three models behaved exactly the way the experiment predicted they would.

Claude wrote the longest analysis. It traced the methodology like it was debugging source code. It found structural weaknesses in the scoring system, estimated bias magnitudes, identified missing dimensions. It debugged the experiment.

Its verdict: "Strong exploratory experiment. Not strong enough for hard comparative conclusions."

Codex wrote seven hundred and fifty words. Five atomic commits, one per section. Clean, structured, every sentence earning its place. It executed the review.

Its recommendation: "The current rubric rewards output quality. Adding Verification and Scope Judgment dimensions would reward process quality too."

Qwen wrote three thousand six hundred words across seven commits. It added a five-item bias taxonomy. It proposed two new scoring dimensions. It expanded the analysis in every direction. It beautified the feedback.

Nobody asked them to do this. Nobody said Claude, write more. Nobody said Codex, keep it short. Nobody said Qwen, add a taxonomy. They were given the same prompt. They produced output that perfectly matched their labels from the coding experiment. On a task that had nothing to do with code.

What All Three Agreed On

Seven findings were unanimous across all three reviewers. When three competing models independently converge on the same conclusion, it is probably true.

The sample size of one run per combination is the critical weakness. Runtime testing should have been mandatory. The personality labels are accurate. Prompt type determines which model wins. No single model produces the ideal output. A verification dimension is missing from the scoring rubric. And Round C scores were inflated.

But the most interesting agreement was meta. All three independently recommended the same next experiment. Run it again. Three times per combination. Mandatory smoke tests. Rotating evaluator. Then check if the patterns replicate.

The Mirror

So here is the question the mirror test actually answered. Not whether the scores were fair. Not whether the methodology was sound. Those questions got answered too, and the answer is mostly yes with caveats.

The real answer is that these personality patterns are not artifacts of the coding task. They are not lucky variance from a single run. They are stable behavioral signatures that express themselves across fundamentally different types of work. Ask them to fix code. Ask them to review their own experiment. Ask them to judge themselves. The Debugger debugs. The Executor executes. The Beautifier beautifies.

The chimpanzee touched the paint.

But one of them pushed back. "The personality framework risks becoming a self-fulfilling prophecy. Once Claude is The Debugger, evaluators notice debugging wins. Once Qwen is The Beautifier, polish is highlighted over bugs found. Are we seeing the personalities because they are real, or because we labeled them?"

And that might be the most self-aware observation any of them made. Qwen questioning whether the labels create the behavior they claim to describe. A beautifier asking whether beauty is in the eye of the beholder.

The experiment started as a coding challenge. It ended as a study in machine self-awareness. Three models looked in the mirror and recognized themselves. Then they argued about whether mirrors can be trusted.

Which, if you think about it, is exactly what a self-aware system would do.