On the evening of March fifteenth, twenty twenty six, a developer sat down to test a personal productivity tool he had been building for months. PärKit, a personal life OS: five interconnected tools for capturing tasks, tracking focus, managing time, running a collaboration space, and logging API costs. Everything ran on a single VPS in Paris. Everything talked to the same database. Everything had been working.
He opened the dashboard. Eight seconds. Not the kind of eight seconds where the page loads progressively, images filling in, text reflowing. The kind where the screen sits empty and your hand moves toward the browser's refresh button before you catch yourself. Eight point four seconds for a dashboard that should render in under one.
He tried the health endpoint. The simplest possible request. No database queries. No template rendering. Just a server confirming it was alive. One thousand four hundred and ninety milliseconds. Nearly a second and a half to say yes, I exist.
Every endpoint. Every tool. The same delay. Stats, one thousand four hundred and sixty milliseconds. Ideas, one thousand four hundred and seventy. Momentum, one thousand four hundred and seventy. The numbers were not random. They clustered around one thousand five hundred, like cars piling up behind an invisible obstruction on a highway. Something was costing exactly that much time on every single request, regardless of what the request actually did.
This is where a normal debugging story would begin. Server logs. Database profiling. Memory graphs. But this is not a normal debugging story. This is a forensic investigation. And the investigator, which is to say this session, did not start with the symptom. The investigator started with the evidence room.
PärKit does not exist in isolation. It was built across dozens of sessions, each one an AI conversation with its own context, its own decisions, its own blind spots. One thousand seven hundred conversations live in a searchable archive, indexed with SQLite FTS five, spanning months of collaborative development between a human who cannot code and the AI models that do all the implementation.
The archive is the evidence room. Every architectural decision, every design discussion, every tradeoff that seemed reasonable at the time is in there, timestamped and searchable. The spec documents are the witness statements. The git history is the physical evidence. And the postmortem, written by the session that discovered the crisis, is the coroner's preliminary report.
The investigator's first move was to search the archive for a single word.
Bcrypt. Fifteen conversations contained that word. Fifteen threads where someone, human or model, discussed the hashing algorithm that was now costing one and a half seconds per request. The investigator read them all. Not skimming. Reading. Following the chain of reasoning in each conversation, mapping who knew what, when they knew it, and what they chose to do about it.
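The search itself is one query against the FTS5 index. A minimal sketch, assuming an archive table named messages with a session column; the names are illustrative, not the archive's actual schema.

```python
import sqlite3

# Hypothetical evidence-room query: an FTS5 table called "messages"
# holding the 1,700 archived conversations, searchable by keyword.
db = sqlite3.connect("archive.db")
rows = db.execute(
    "SELECT DISTINCT session FROM messages WHERE messages MATCH ?",
    ("bcrypt",),
).fetchall()
print(len(rows), "conversations mention bcrypt")
```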
What emerged was not a single mistake. It was not a rogue commit or a careless line of code. It was five decisions, made across multiple sessions on the same day, each one reasonable, each one defensible, and together forming a chain that would quietly strangle the system for five days before anyone noticed.
February twenty eighth. A brainstorming session about Hubben, PärKit's collaboration tool. Two people, Pär and SJ, need to work in the same space with separate identities. The existing authentication is a single shared token. One key opens all doors. That works for a solo user, but when two people share a tool, you need to know who is who.
Hubben needs separate user identities. Pär has a solution with token plus cookie. Tokens from day one.
The investigator marked this as Decision One. It was the least interesting decision in the chain, and that is precisely why it matters. Nobody would argue against it. A multi-user tool needs multi-user authentication. The requirement was clean. The intent was sound. The seed of everything that followed was planted here, in the most unremarkable soil imaginable.
What the investigator noted, circling it in the margin of the evidence file, was what the conversation did not contain. No discussion of alternative authentication schemes. No mention of JSON Web Tokens, or HMAC based lookup tables, or direct token storage with indexed hashes. The conversation moved from "we need multi-user auth" to "tokens from day one" without stopping to ask "what kind of tokens, and how will we verify them?" The destination was chosen before the route was considered.
March tenth. The Hubben specification document. Twelve days after the brainstorm, the architecture had solidified. Line one hundred and seventy two of the spec contained the sentence that would define everything.
Each user gets their own token. Tokens stored in the hubben.users table, hashed with bcrypt. No password login. Token only, like PärKit.
The investigator paused here. Read the sentence three times. Then opened a separate file and wrote a question that would become the hinge of the entire investigation.
Why bcrypt?
Bcrypt is a password hashing algorithm. It was designed in nineteen ninety nine for exactly one purpose: making it expensive to guess passwords. A password is something a human chooses, which means it is low entropy. Humans pick dictionary words, birthdays, pet names. An attacker with a fast hash function can try billions of these per second. Bcrypt fights this by being intentionally slow. Each hash computation takes roughly three hundred milliseconds. That slowness is the entire point. It is not a side effect. It is the feature.
But the tokens in PärKit are not passwords. They are generated by Python's secrets.token_urlsafe function with forty eight bytes of randomness. Three hundred and eighty four bits of entropy. To brute force one of these tokens, you would need to try more combinations than there are atoms in the observable universe. The token does not need bcrypt's protection. It is already unguessable. Hashing it with bcrypt is like putting a deadbolt on a bank vault. The vault is already impenetrable. The deadbolt adds nothing except three hundred milliseconds of delay every time someone opens the door.
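The mismatch is easy to demonstrate. A minimal sketch using Python's secrets module and the bcrypt package; the exact timing depends on hardware and cost factor, but it lands in the hundreds of milliseconds.

```python
import secrets
import time

import bcrypt

# A PärKit-style token: 48 random bytes, URL-safe encoded.
# 48 bytes times 8 bits is 384 bits of entropy. Already unguessable.
token = secrets.token_urlsafe(48)

# The deadbolt on the vault: one bcrypt hash of that token.
start = time.perf_counter()
hashed = bcrypt.hashpw(token.encode(), bcrypt.gensalt(rounds=12))
elapsed_ms = (time.perf_counter() - start) * 1000
print(f"one bcrypt hash: {elapsed_ms:.0f} ms")
```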
The spec documented this decision without questioning it. Two lines later, it even documented the consequence.
Bcrypt is a salted one way hash. You cannot match tokens via SQL WHERE token_hash = $hash. The auth code must load all active users and compare each.
There it was. In the spec. Written down. The author knew that bcrypt would force an O of n lookup pattern, loading every user and comparing the token against each hash individually, because bcrypt's salt makes indexed lookups impossible. They wrote it down. They moved on.
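What that pattern looks like in practice, sketched with illustrative names rather than PärKit's actual code:

```python
import bcrypt

def authenticate(presented_token: str, users: list[dict]) -> dict | None:
    # Each stored hash carries its own salt, so there is no column to
    # index and no WHERE clause to write. The only option is to load
    # every active user and compare, one bcrypt check at a time.
    for user in users:
        if bcrypt.checkpw(presented_token.encode(), user["token_hash"]):
            return user
    return None  # reached only after paying roughly 300 ms per user
```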
Nobody asked: if we are going to iterate over every user and pay three hundred milliseconds per comparison, how long does that take with five users? The arithmetic was trivial. Five users times three hundred milliseconds equals one thousand five hundred milliseconds. One and a half seconds. Per request. But nobody did the arithmetic.
The investigator wrote two words in the margin: cost model. Then underlined them twice.
The spec went to peer review on the same day. March tenth. The reviewer was thorough. The review document identified multiple issues, categorized by severity. Issue number two was flagged as something that would cause bugs at runtime.
The auth model says tokens are bcrypt hashed. The actual pattern requires loading all users and comparing each. With three users this is negligible. The spec should document the pattern so the implementing session does not try a SQL WHERE clause.
The investigator read this passage six times. It is the most important paragraph in the entire chain of evidence. Not because of what it says, but because of the hairline fracture between what it identifies and what it concludes.
The reviewer correctly identified the O of n iteration pattern. Correctly noted that it requires loading all users. Correctly recommended documenting the pattern. And then assessed the risk as negligible. For three users. Which was true.
But the review was scoped to Hubben. A three user household tool. The reviewer did not ask what happens if this pattern is copied to other tools. Did not ask what happens if the user count grows. Did not write the sentence that would have changed everything.
This is acceptable for Hubben's three user scale. However, this pattern is O of n times three hundred milliseconds and will not scale. If this is ever generalized to other tools or user counts increase, it must be revisited.
That is what the review should have said. The risk was identified. The risk was not escalated. The difference between those two things is the difference between a near miss and a crash.
Later on March tenth. Phase seven of a development sprint called ShapeUp. The goal was clean and architecturally sound: all five PärKit tools should share a common authentication layer. No more each tool rolling its own auth. One module, one pattern, one source of truth. Good engineering. The kind of change that makes future development faster and more consistent.
The implementation took the auth pattern from Hubben, the bcrypt loop over all users, and lifted it into a shared module called parkit_common.auth. Migration zero four eight created a new table, parkit_users, with bcrypt hashed tokens. Every tool, Capture, Focus, Time, Hubben, Apilog, now ran every request through the same authentication function.
The investigator traced this decision through three separate conversations in the archive. Two hundred and thirty two messages on March third. Additional sessions on March seventh and tenth. Across all of them, the same reasoning appeared.
All tools share parkit_common. Consistency is good. Phase seven introduces parkit_users table with bcrypt auth. All tools inherit it.
Consistency is good. The investigator wrote this phrase on a separate card and pinned it to the wall of the evidence room. Because it is true. Consistency is good. A shared auth layer is good. The architecture was sound. The decision to generalize was correct. What was not correct was generalizing without re-evaluating the assumptions that came with the pattern.
A bcrypt loop over three users in one tool is a local decision with local consequences. A bcrypt loop over five users across five tools, running on every request from every client, is a systemic decision with systemic consequences. The numbers changed. The context changed. The assumptions were not re-examined.
Nobody asked: how many requests still use legacy tokens? The answer was ninety five percent. Nobody asked: if ninety five percent of requests fail the bcrypt loop and fall through to the fast path, what is the performance impact? The answer was one thousand five hundred milliseconds of pure waste on almost every request. Nobody asked these questions because the pattern had already been reviewed. It had already been accepted. It came with a stamp that said "negligible."
The final decision was about ordering. The new authentication function checked credentials in a specific sequence. First, it ran the bcrypt loop over all users in the parkit_users table. If that found a match, authentication succeeded. If it did not, it fell through to the legacy check, a simple HMAC comparison against the original shared token. Microseconds.
The reasoning was documented in the code's structure if not in explicit comments. The new authentication system, parkit underscore users, was the future. The legacy token was a temporary fallback. Check the future first. Design for where you are going, not where you have been.
Except the future had not arrived. Ninety five percent of requests carried the legacy token. The system was designed for a world where most users had migrated to the new auth. The reality was a world where almost nobody had. Every one of those legacy requests entered the authentication function, paid one thousand five hundred milliseconds to fail the bcrypt loop against every user in the database, and then succeeded instantly on the legacy check that should have been tried first.
The investigator drew a diagram. Two boxes. The first box was labeled "bcrypt loop" with a cost of one thousand five hundred milliseconds. The second was labeled "legacy check" with a cost of three microseconds. An arrow from the first box to the second, labeled "ninety five percent of traffic." The diagram looked absurd. Because it was absurd. But it was also the exact architecture that had been running in production for five days.
Check expensive before cheap. The classic antipattern. And it was done for the most reasonable of reasons, because someone was designing for tomorrow at the expense of today.
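Reduced to a sketch, with placeholder names for the shared token and the user rows, the ordering looked roughly like this:

```python
import hmac

import bcrypt

LEGACY_TOKEN = "original-shared-token"   # placeholder, not the real value
PARKIT_USERS: list[dict] = []            # rows with "name" and "token_hash"

def authenticate(presented: str) -> str | None:
    # The new path first: roughly 300 ms per row in parkit_users
    # before this loop gives up.
    for user in PARKIT_USERS:
        if bcrypt.checkpw(presented.encode(), user["token_hash"]):
            return user["name"]

    # The fallback that ninety five percent of traffic actually needed:
    # a constant-time comparison, microseconds.
    if hmac.compare_digest(presented, LEGACY_TOKEN):
        return "legacy"
    return None
```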
March tenth to March fifteenth. Five days. The system was slow for five days and nobody noticed.
The investigator found this the most unsettling part of the evidence. Not the architectural decisions. Not the missing cost model. The silence. Five days where every API request to every PärKit tool paid a one and a half second penalty, and the system absorbed it without complaint.
Why? Because PärKit is a personal tool, not a high traffic service. A few requests per minute, not thousands per second. The slowness was per request, not a spike. There was no monitoring dashboard turning red, no error rate climbing, no alert firing. The tool was slow the way a door with a sticky hinge is slow. You push a little harder. You wait a little longer. You attribute it to the network, to the server being busy, to the phase of the moon. You do not suspect that the authentication layer is burning one and a half seconds of CPU time on a cryptographic operation that provides zero security benefit.
One curl command would have caught it. One measurement. Thirty seconds of work. But nobody measured, because nobody expected a problem. The system had been reviewed. The code was clean. The architecture was consistent. Everything looked right.
The investigator added a fifth card to the evidence wall. It read: absence of measurement is not evidence of correctness.
Here is where the investigator's understanding shifted. The first reading of the evidence suggested a chain, five decisions in sequence, each making the next one worse. But the second reading revealed something more precise. It was not a chain. It was a compound.
In chemistry, a compound is not just ingredients mixed together. It is a new substance with properties none of the ingredients have alone. Sodium is a metal that explodes in water. Chlorine is a poisonous gas. Together they make table salt. The compound behaves nothing like its components.
These five decisions worked the same way. Multi-user auth is fine. Bcrypt is fine for passwords. Accepting a reviewed risk is fine. Generalizing a shared layer is fine. Checking the new path first is a reasonable optimization. None of these decisions, examined alone, would raise an alarm in any code review on any team anywhere. But combined, they produced a system where ninety five percent of requests paid a one and a half second penalty for a security feature that bought zero protection. The compound was toxic. The ingredients were not.
The investigator found one more detail that sharpened the picture. SJ, the second user of Hubben, had her token stored not in the shared parkit_users table but in Hubben's own user table. When she made a request, the system ran the bcrypt loop over parkit_users, failed to find her, then ran a second bcrypt loop over Hubben's users. Two loops. Two thousand four hundred milliseconds per request. The compounding had a compound.
The investigator turned to the model decision trace. Every one of the five decisions was made by the same model family, Claude Opus, across multiple sessions. The investigator was careful here. The question was not "did the model fail?" The question was "what did the model have, what did it lack, and what does that tell us?"
What Opus had, consistently across all sessions: excellent architectural documentation. Clear reasoning within each decision's scope. The spec explicitly described the O of n pattern. The review explicitly identified the risk. The code was well structured and easy to follow. In isolation, every piece of work was competent.
What Opus lacked, consistently across all sessions: cost modeling. At no point in any of the fifteen conversations did anyone, human or model, write down the equation. N users times three hundred milliseconds per bcrypt check equals total latency. The number one thousand five hundred was never computed until after the crisis. It was the most basic arithmetic imaginable, multiplication of two known quantities, and it never appeared.
The investigator also noted what the model did not distinguish. Token and password are different security objects with different threat models. Bcrypt exists to make password guessing expensive. Tokens do not need that protection because they are not guessable. SHA two fifty six, an unsalted hash that runs in microseconds and allows indexed database lookups, would have provided the same security for tokens with none of the performance cost. But across all fifteen conversations, the word "password" and the word "token" were used interchangeably. The distinction was never drawn. The wrong tool was selected because nobody defined the job precisely enough to realize it was the wrong tool.
The interim fix was almost insultingly simple. Two changes. First, reverse the order of authentication checks. Try the legacy token first. If it matches, return immediately. Only fall through to the bcrypt loop if the fast check fails. Second, cache negative results. When a token fails the bcrypt loop, remember that failure so the loop does not run again for the same token.
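In the same sketch form as before, with the same placeholder names, the two changes amount to this:

```python
import hmac

import bcrypt

LEGACY_TOKEN = "original-shared-token"   # placeholder, not the real value
PARKIT_USERS: list[dict] = []            # rows with "name" and "token_hash"
_known_misses: set[str] = set()          # negative cache for failed tokens

def authenticate(presented: str) -> str | None:
    # Change one: the cheap check runs first. Legacy requests return
    # here in microseconds instead of after the full bcrypt loop.
    if hmac.compare_digest(presented, LEGACY_TOKEN):
        return "legacy"

    # Change two: a token that has already failed the bcrypt loop is
    # remembered, so the loop never runs twice for the same token.
    if presented in _known_misses:
        return None

    for user in PARKIT_USERS:
        if bcrypt.checkpw(presented.encode(), user["token_hash"]):
            return user["name"]

    _known_misses.add(presented)
    return None
```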
The results were immediate.
The health endpoint dropped from one thousand four hundred and ninety milliseconds to three milliseconds. Four hundred and ninety seven times faster. The stats endpoint dropped from one thousand four hundred and sixty to twenty four. The ideas endpoint dropped from one thousand four hundred and seventy to forty two. The dashboard loaded in two hundred milliseconds instead of eight thousand four hundred.
The permanent fix, still pending at the time of this investigation, is to replace bcrypt entirely. Add a SHA two fifty six column to the users table with a database index. Hash each token with SHA two fifty six on storage. On authentication, hash the incoming token with SHA two fifty six, look it up with a single indexed query. O of one. Microseconds. The security is identical for high entropy tokens. The performance difference is not a percentage improvement. It is a category change.
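A minimal sketch of that design, assuming a SQLite connection and illustrative table and column names; PärKit's actual schema may differ.

```python
import hashlib
import sqlite3

def store_token(db: sqlite3.Connection, user_id: int, token: str) -> None:
    # A high-entropy token needs no salt and no slow hash. A plain
    # SHA-256 digest is deterministic, so the column can be indexed.
    digest = hashlib.sha256(token.encode()).hexdigest()
    db.execute("UPDATE users SET token_sha256 = ? WHERE id = ?",
               (digest, user_id))

def authenticate(db: sqlite3.Connection, token: str):
    digest = hashlib.sha256(token.encode()).hexdigest()
    # One indexed lookup instead of a bcrypt comparison per row.
    return db.execute("SELECT id, name FROM users WHERE token_sha256 = ?",
                      (digest,)).fetchone()
```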
The investigator came into this case expecting to find a bad decision. A moment where someone chose wrong, where the evidence was clear and the choice was careless. That is how most debugging stories go. Someone made a mistake. Here is the mistake. Do not make this mistake. Applause.
That is not what the evidence showed.
The evidence showed five decisions, each made with care, each documented, each reviewed, each defensible. The requirement was real. The hashing was security conscious. The review was thorough. The generalization was architecturally sound. The ordering was designed for the intended future state. At no point did anyone do something obviously wrong. At every point, someone did something locally right that was globally catastrophic.
The investigator's report identified four things that would have prevented the crisis. None of them are about making better decisions. All of them are about creating conditions where the compound cannot form.
First, explicit cost modeling. Before shipping any change that runs on every request, calculate the worst-case latency. Write the equation. Multiply the numbers. If the result is unacceptable, the design is unacceptable, regardless of how reasonable it looks on paper.
Second, post-deployment measurement. One curl command with a timing flag. Thirty seconds. Run it after every infrastructure change; a sketch of that check follows the fourth item below. The gap between "I think this is fast" and "I measured this and it is fast" is exactly the gap where five days of invisible degradation live.
Third, risk escalation. When a review identifies a risk and deems it acceptable for the current scope, the review must also state what happens if the scope changes. Acceptable here, dangerous if generalized is not a footnote. It is the most important sentence in the review.
Fourth, domain distinction. A token is not a password. A batch job is not a real-time request. A three user tool is not a five tool platform. When moving a pattern from one context to another, the first question is: what changed? Not what stayed the same. What changed.
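The second item on that list is the cheapest to automate. The report's one curl command, rendered here as a Python sketch with an illustrative URL and threshold:

```python
import time
import urllib.request

# Thirty seconds of work after any infrastructure change: time the
# health endpoint and complain loudly if it regresses.
start = time.perf_counter()
urllib.request.urlopen("https://parkit.example/api/health").read()
elapsed_ms = (time.perf_counter() - start) * 1000
print(f"health endpoint: {elapsed_ms:.0f} ms")
assert elapsed_ms < 100, "health endpoint is slower than the budget"
```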
The investigator filed the report on March sixteenth, twenty twenty six. Five decisions. Five reasonable choices. One catastrophic outcome. The fix took minutes. The investigation took hours. The lesson will take longer.
This was not a failure of the AI. It was a failure of the process. The process needed explicit cost modeling before design review. Performance testing after infrastructure changes. Architectural checklists for common pitfalls.
The investigator closed the last of the fifteen chat transcripts. Somewhere in a data center in Paris, PärKit was responding in three milliseconds to requests that had taken one and a half seconds twenty four hours earlier. The code was almost identical. Two lines moved. One cache condition added. The architecture was still bcrypt. The permanent fix was still pending. But the system was breathing again.
Accident investigation boards have a term for what happened here. They call it an organizational accident. Not a failure of any individual component, but a failure of the system to recognize that individually safe components can create collectively unsafe conditions. Each bolt was tightened. Each seal was tested. Each instrument was calibrated. And the thing still came apart in flight, because nobody modeled what happened when all the individually tested parts operated together under load.
Five decisions. Five reasonable engineers, or in this case, five reasonable instances of the same model family across five sessions. Five moments where the right local choice produced the wrong global outcome. Not because anyone was careless. Because the process did not require anyone to look at the whole picture before the whole picture existed.
The investigator turned off the desk lamp. The evidence room would stay. The archive would keep its fifteen conversations about bcrypt, timestamped and searchable, ready for the next session that needed to understand not just what went wrong, but how five things went right in exactly the wrong combination.