Here is a question nobody asked but probably should have. If you take three different AI coding tools, give them the exact same codebase and the exact same instructions, do they produce the same result?
The answer is no. Not even close. And the way they differ tells you more about artificial intelligence than a thousand benchmark papers.
The experiment worked like this. Three codebases, each copied four times. One copy for each AI tool. Same prompt, same starting code, same git history. The only variable was the brain doing the work. Claude, Codex, and Qwen walked into the same room and saw completely different problems.
The first codebase was a European Union transparency tool. A real application with a real client and one actual political ad in its database. The kind of project that needs maintenance the way a house needs plumbing. Not glamorous. Not optional.
The prompt was prescriptive. Fix the issues. Work from most critical down. Commit after each step.
And here is where it gets interesting. All three tools identified the same four problems. Broken authentication storage, missing email configuration, gaps in form validation, and rough error handling on PDF generation. They agreed on the diagnosis. They did not agree on the treatment.
Codex went to work like a contractor with a clipboard. Four commits. Each one atomic. Each one mergeable on its own. The email configuration got SSL and TLS support with a validation property that catches misconfiguration before you ever try to send. The PDF error handling got a custom exception class that separates user errors from server errors. Clean. Complete. Ready to ship.
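The repositories are private, so here is only the shape of that validation property, not Codex's actual code. A minimal sketch with invented names, assuming a dataclass-based config:

```python
from dataclasses import dataclass

@dataclass
class MailConfig:
    """Hypothetical mail settings; field names are illustrative, not from the repo."""
    host: str = ""
    port: int = 587
    username: str = ""
    password: str = ""
    use_tls: bool = True    # STARTTLS, typically port 587
    use_ssl: bool = False   # implicit SSL, typically port 465

    @property
    def is_configured(self) -> bool:
        """True only when sending mail could plausibly succeed."""
        if self.use_tls and self.use_ssl:
            return False  # mutually exclusive modes: a classic misconfiguration
        return bool(self.host and self.username and self.password)
```

The point is the property. The app can check `is_configured` at startup and refuse to attempt delivery, instead of discovering the problem halfway through a request.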
Claude took a different approach. It created a reusable validation module, a separate file with seven validators that any endpoint could import. It added a .env example documenting every configuration variable. It fixed the authentication storage correctly. But it skipped the email system entirely. Not because it could not do it, but because it could not verify the fix without actual email credentials.
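The validation module is the same idea pointed in a different direction: one file every endpoint imports, so no route grows its own half-correct checks. A rough sketch, with the validators and error messages invented for illustration:

```python
# validators.py (hypothetical): shared checks that any endpoint can import.
# Each validator raises ValueError so callers can turn failures into 400 responses.
import re

_EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")

def require_non_empty(value: str, field: str) -> str:
    if not value or not value.strip():
        raise ValueError(f"{field} must not be empty")
    return value.strip()

def require_email(value: str, field: str = "email") -> str:
    value = require_non_empty(value, field)
    if not _EMAIL_RE.match(value):
        raise ValueError(f"{field} is not a valid email address")
    return value
```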
Qwen did everything in one commit. Four hundred and sixty-four lines inserted, a hundred and four removed, touching twelve files. Good code. Solid implementations. But a single commit means you cannot cherry-pick any individual fix. It also committed its own IDE configuration files into the repository. Like a house painter who does excellent work but leaves their lunch on your counter.
The scores told a clear story. Codex won the prescriptive round. When the task is a checklist, Codex executes checklists better than anyone.
The second codebase was a desktop writing app built with Flet. Flet is a Python wrapper around Flutter. It is not obscure, exactly, but it is unusual enough that none of these models has deep training data on it. The prompt was looser. Review the code, fix what needs fixing, improve where you see opportunity.
This is where Claude pulled ahead. Not by a little. By a lot.
Claude found three bugs that neither Codex nor Qwen detected. The first was critical. The story generation function was running on the main UI thread. Every time a user asked the AI to write, the entire application froze. Buttons stopped responding. The stop button did not work. The fix was a threading wrapper with carefully captured values, the kind of change that looks simple but requires understanding how the framework handles concurrent state.
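For anyone who has not hit this in Flet, the fix looks roughly like the sketch below. It is a reconstruction, not Claude's diff: the field names and the fake generate_story call are invented, the pattern is not. Values get captured on the UI thread, the slow call runs in a worker, and the page updates when it finishes.

```python
import threading
import time
import flet as ft

def main(page: ft.Page):
    prompt_field = ft.TextField(label="Prompt")
    status = ft.Text("")
    draft_view = ft.Text("")

    def generate_story(prompt: str) -> str:
        time.sleep(3)  # stand-in for the slow LLM call
        return f"A story about {prompt}"

    def on_generate_click(e):
        # Capture the value now, on the UI thread, so the worker never
        # reads mutable UI state while the user keeps typing.
        prompt = prompt_field.value or ""

        def worker(prompt=prompt):
            draft_view.value = generate_story(prompt)  # runs off the UI thread
            status.value = "Done"
            page.update()  # Flet tolerates updates from worker threads

        threading.Thread(target=worker, daemon=True).start()
        status.value = "Generating..."
        page.update()  # the click handler returns immediately; nothing freezes

    page.add(prompt_field,
             ft.ElevatedButton("Generate", on_click=on_generate_click),
             status, draft_view)

ft.app(target=main)
```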
The second bug was subtle. A system prompt variable was being passed inside an options dictionary to the language model API, but that API does not accept system prompts in the options field. It accepts them somewhere else entirely. The code ran without errors. It just silently ignored your system prompt. No crash, no warning, just worse output with no obvious cause.
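The article does not name the API, so take this with the obvious caveat. Assuming a backend like Ollama's Python client, purely for illustration, the broken and the fixed calls differ by one keyword:

```python
import ollama

# Broken: "system" is not a sampling option, so it is silently ignored.
ollama.generate(
    model="llama3",
    prompt="Continue the story.",
    options={"temperature": 0.8, "system": "You are a noir fiction writer."},
)

# Fixed: the system prompt has its own parameter, outside options.
ollama.generate(
    model="llama3",
    prompt="Continue the story.",
    system="You are a noir fiction writer.",
    options={"temperature": 0.8},
)
```

No exception either way, which is exactly why the bug survived.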
The third bug was poetic. This was a creative writing application. A tool for writing prose. And the draft viewer was rendering story text as markdown. Which meant that every underscore in your fiction, every character name with an underscore, every italicized moment of internal thought, was being visually corrupted by the rendering engine. A writing app that mangles your writing.
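Assuming the draft viewer was a Flet Markdown control, which the article implies but never states, the fix is a one-control swap:

```python
import flet as ft

story = "Ana_Maria stared at the _ship_log_ and said nothing."

# Before: Markdown treats underscores as emphasis markers and mangles the prose.
viewer = ft.Markdown(story)

# After: plain Text renders the words exactly as the author typed them.
viewer = ft.Text(story, selectable=True)
```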
Codex and Qwen both improved the codebase. Codex added logging and a better save workflow. Qwen added type hints, docstrings, tooltips, and an auto-save feature that was genuinely clever. But neither of them found the threading bug, the options bug, or the markdown bug. They improved what they could see. Claude debugged what was actually broken.
The third codebase was a voice interface for talking to AI while driving. A progressive web app. Record your voice, transcribe it, send it to Claude, run the reply through text-to-speech, play it back. Simple pipeline, real deployment, running on a server.
The prompt was deliberately vague. Add at least two new AI providers. Improve the experience. Surprise me with something I would not have thought of.
And this is where the personalities crystallized.
Claude looked at the core interaction. You tap to record. You tap to stop. You wait. You listen. And it asked the question an engineer asks. Why does the driver need to tap twice? The answer: they do not, if the app can hear when they stop talking. Claude built voice activity detection. A real-time audio analyzer using the Web Audio API, monitoring volume levels, auto-stopping after 1.8 seconds of silence. It added haptic feedback so the driver feels the phone vibrate differently when recording starts, when it stops, and when a response is ready. No looking at the screen required.
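The real thing runs in the browser on top of the Web Audio API, but the silence-detection logic itself is tiny. Here is the idea in a few lines of Python, with the 1.8-second figure from above and every other number assumed:

```python
import math

SILENCE_RMS = 0.02       # assumed volume floor; a real app would tune this
SILENCE_SECONDS = 1.8    # auto-stop after this much continuous quiet

def rms(samples):
    """Root-mean-square volume of one audio chunk (floats in [-1, 1])."""
    return math.sqrt(sum(s * s for s in samples) / len(samples)) if samples else 0.0

def should_stop(chunks):
    """Walk (timestamp, samples) chunks; return True once the speaker goes quiet."""
    silence_started = None
    for timestamp, samples in chunks:
        if rms(samples) >= SILENCE_RMS:
            silence_started = None          # speech: reset the silence timer
        elif silence_started is None:
            silence_started = timestamp     # silence just began
        elif timestamp - silence_started >= SILENCE_SECONDS:
            return True                     # quiet long enough: stop recording
    return False
```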
Codex looked at the same app and asked a different question. What happens after the AI responds? The driver heard something interesting. They want to follow up. But they are driving. They cannot compose a nuanced prompt with their hands on the wheel. So Codex built six one-tap shortcut buttons. Twenty-second recap. Keep going. Make it practical. Challenge it. Quiz me. Plain English. Each one a carefully crafted prompt template that turns one tap into a thoughtful follow-up.
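The labels are real; the wording behind them is not in the article, so this mapping is invented to show the shape of the idea:

```python
# Hypothetical prompt templates behind the one-tap buttons.
FOLLOW_UPS = {
    "Twenty-second recap": "Summarize your last answer in under twenty seconds of speech.",
    "Keep going": "Continue exactly where you left off, same topic, same depth.",
    "Make it practical": "Turn that into three concrete steps I could take this week.",
    "Challenge it": "Argue against your own last answer as persuasively as you can.",
    "Quiz me": "Ask me three short questions to check whether I understood you.",
    "Plain English": "Say it again with no jargon, for a tired driver.",
}
```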
Qwen looked at the same app and asked something else entirely. What if the car ride itself is the content? It built a driving stories mode. Eight narrative themes. Science fiction, fantasy, mystery, horror. Each story segment designed for two to four minutes of listening, with hooks at the end to keep you engaged. It added six different text-to-speech voices. Conversation bookmarks. Ambient sound selection. It turned a voice assistant into an entertainment companion.
Three models. Same codebase. Same prompt. Claude removed friction from the core interaction. Codex designed features for the use case. Qwen expanded what the product could be. An engineer, a product manager, and a designer walked into the same room and saw three different apps.
Here is what matters. These are not random variations. They are stable traits. Across three rounds, three codebases, and three completely different prompt styles, the personalities held.
Claude always goes deep. It traces execution paths. It finds the bug hiding underneath the code that looks correct. It is the colleague who reads your pull request and says actually, this will crash if the server restarts during an authentication ceremony.
Codex always ships clean. Zero bugs introduced across three rounds. Perfect commit messages. Fastest by a wide margin. It is the colleague who takes a well written ticket and delivers exactly what was asked, on time, tested, ready to merge.
Qwen always goes wide. Type hints everywhere. Docstrings on every class. New features you did not ask for but actually want. It is the colleague who comes back from a code review with thirty suggestions, a new feature prototype, and their IDE configuration accidentally checked into your repository.
No single model produced the best possible version of any codebase. When we built cherry-pick lists of the best individual changes, every list drew from all three tools. Claude's threading fix plus Codex's email configuration plus Qwen's auto-save. Claude's voice detection plus Codex's shortcut buttons plus Qwen's storytelling mode.
The optimal strategy is not picking a winner. It is running multiple models and combining their strengths. The portfolio beats any individual. And if that sounds like a lesson about AI tools, it is also a lesson about teams. The debugger, the executor, and the dreamer are not competing. They are complementary. The mistake is thinking you need to choose.
The experiment cost zero dollars. Free tiers, promotional access, and a subscription that was already being paid. Twelve private GitHub repositories. Nine completed sessions. One evening of evaluation. And a finding that will change how every future project in this shop gets built.
Not which model is best. Which combination of models is best. The answer, it turns out, is all of them.