What happens when a server receives a request asking for one row, with one of the parameters set to one hundred million?
If the code is careful, nothing happens. The server returns one row. The database does not even look at the hundred million.
If the code is not quite careful, the database does something else. It computes one hundred million rows of results. It sorts them. It builds whatever indexes it needs to build. And then it throws all hundred million away and returns the single row that was asked for. The whole operation can take seconds. It can lock the database. It can be triggered by a single request from anyone with a valid token. It is the kind of thing nobody ships on purpose.
A piece of code was about to ship that scan. It was about to go live on a production server, into a piece of internal tooling, on the very first run of a system specifically built to prevent this kind of thing.
The system caught it. On the first task. Of the first run.
This is the story.
A while back, in another session, a builder agent shipped a bug. Mock tests had passed. Real Postgres would have died. A type mismatch on a join, caught at the last possible moment by a different model reviewing the diff at somebody's tired suggestion.
That incident became the design hypothesis of a new skill. A multi agent build system, where every coding task is split into three roles. A builder agent writes the diff. A reviewer agent, always a different model, reads the diff cold and tries to break it. A fixer agent addresses what the reviewer found. The reviewer is the gate.
For two days I built this thing. A scaffolding script that creates one git worktree per task. A wrapper that dispatches Codex headless through its terminal. Markdown templates for the seven invariants and the per run report. System prompts for three different agent roles, each tuned to be neutral and technical and the opposite of a cheerleader. A permission hook that prevents the director session itself from writing source code, because the director's job is to dispatch, not to type.
Every piece tested in isolation. Every commit narrowly scoped. The first end to end run was today.
The first task, the foundation task, went to a Sonnet builder. The task envelope was specific. Bump these four endpoints from a maximum limit of twenty to two hundred. Add an offset parameter, constrained to be zero or greater. Wrap the responses in a canonical envelope shape. Add five stub HTML pages so the next five tasks could each fill in their own without colliding.
Sonnet did it. Three commits. Every commit narrowly named for the phase it covered. The unit tests passed. The endpoints answered with the new envelope shape. From the spec's perspective, the task was complete.
The spec said: add an offset parameter, constrained to be zero or greater. Sonnet added an offset parameter, constrained to be zero or greater.
The spec did not say: cap the upper end. Sonnet did not cap the upper end.
This is not Sonnet's fault. Sonnet did exactly what was asked.
It is what was not asked that mattered.
The diff went to the reviewer. A different model. Codex.
Codex did not have access to Sonnet's reasoning. Codex did not see the conversation that produced the spec. Codex saw only one thing, the diff itself, in the surrounding code, with a small amount of framing about what to look for. No spec text. No author intent. Just the change.
[serious] Codex did not check that the implementation matched the spec. That was not the job. The job was to ask whether the change made the system better or worse, regardless of what the spec said. So Codex looked at the new offset parameter, traced it down to the SQL, and asked a question.
The skills feed computes a SQL limit equal to limit plus one plus offset, with offset only constrained to be zero or greater. A request with offset equal to one hundred million turns into a huge database limit and can drive heavy sort and aggregation work before response truncation. Verdict, block.
The sister endpoint had the same shape. Two pieces, same bug class. The third finding was about the test suite, which was mocking the second page response and would have passed even if the server ignored offset entirely.
The reviewer's verdict was block. The diff was not allowed to ship.
Sonnet got a continuation envelope, fixed all three findings in one round, and the diff went back to Codex for a second review. Clean.
The cherry pick ran. The deploy ran. Eight smoke checks fired against the live server. One of those checks was specifically the new test, sending a request with an offset above the cap and asserting the server returns a four hundred and twenty two error. It did. The bug was closed in production before the bug had ever been to production.
The whole loop, envelope to deploy, took about fifty minutes.
The author of the spec is careful. The spec was reviewed by three different models before being locked. Sonnet implemented it faithfully. Tests passed.
None of that mattered.
The bug existed because the spec had not asked one specific question, and so the implementation had not answered it. The bug was not in Sonnet's interpretation. The bug was in the spec's silence.
Specs are hypotheses. Every spec assumes something it has not stated. Every careful author has, somewhere, a question they did not think to ask. The only way to find that question is to have someone read the resulting code without having read the spec, asking different questions, with no skin in the implementation game.
That is what a reviewer is. Not a second pair of eyes on the same plan. A first pair of eyes on the result of the plan.
A while back, the system without this gate shipped a production breaking bug. Today, the system with this gate caught a production breaking bug before it shipped. The insurance bought two days ago paid out on its first day of coverage.
Specifically, two days ago, while writing the prompts for the reviewer agent, the most disputed line of the system prompt was the requirement that the reviewer must produce findings, that the reviewer must not say looks good, that consensus is suspicious. That line felt aggressive at the time. Maybe overcorrected. Maybe the reviewer would over flag harmless changes.
Today the reviewer found two genuine production bugs, on its first run, in code that had passed its own tests, on a task built from a spec that had been reviewed by three other models. Two genuine bugs that would have caused real outages.
The aggressive prompt was right.
The system that just shipped its first real catch has its own bugs.
The parallel batch that came after the foundation task, five sub agents that should have worked in parallel, never got to run their reviewers. The dispatcher had a permission leak. Four of the five builders could not write to their assigned worktrees. They were isolated, but isolated against the wrong target. So those workers stayed inside their wrong sandboxes, refused to write to the wrong place, and one of them dumped two hundred and twenty lines of finished HTML into a delivery report as text. A builder, denied the write tool, performing as if dictating the work for somebody to type later.
Tomorrow's work is fixing that leak. The skill catches its own subjects but does not yet catch its own dispatcher.
Specs are hypotheses. Skills are also hypotheses. Both get tested when they meet the world.
The first one passed today.