PärPod Temp
Two Thirds of Nothing: What Seven Versions of Self-Analysis Actually Found
10m · Mar 29, 2026
After four days analyzing 422 AI sessions and roughly five thousand dollars of tracked AI spending, the finding was blunt: 68 percent of the work produced nothing. Yet five versions of the report made it sound like pure momentum.

The Number

Here is the number that changes everything else about this story. After seven versions, more than fourteen agents, four days of work, and roughly five thousand dollars of AI spending put under analysis, the Orchestra project finally asked the question nobody had thought to ask.

Of the four hundred and twenty two AI sessions during the transition period, how many actually produced something?

One hundred and thirty three. Thirty one point five percent. Two thirds of everything led nowhere.

Sessions that produced usable output: one hundred thirty three. Sessions with zero output: two hundred and five. Dead end chains consuming over nineteen hundred messages: thirty two sessions. Monthly production rate at the worst: twenty two percent in December. Twenty three percent in November.
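
A minimal sketch of the arithmetic behind those figures, assuming the chatarkiv has been reduced to one record per session with a hand labelled outcome flag. The field names and sample rows here are hypothetical, not the project's actual schema.

    from collections import defaultdict

    # Hypothetical labelled sessions: (month, produced_usable_output)
    sessions = [
        ("2025-11", False), ("2025-11", True),
        ("2025-12", False), ("2026-01", True),
        # one entry per session; 422 in the real archive
    ]

    total = len(sessions)
    produced = sum(1 for _, ok in sessions if ok)
    print(f"overall production rate: {produced}/{total} = {produced / total:.1%}")

    # Per-month rate, the number that bottomed out in the low twenties
    by_month = defaultdict(lambda: [0, 0])   # month -> [produced, total]
    for month, ok in sessions:
        by_month[month][1] += 1
        by_month[month][0] += ok
    for month, (hit, n) in sorted(by_month.items()):
        print(f"{month}: {hit}/{n} = {hit / n:.0%}")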

For five versions, every report made it sound like everything worked. A person who used AI four point five times a month suddenly tripled their sessions and started building software. It sounded like a ramp. Like momentum. Like one insight leading to the next.

It was not. Two thirds of it was churning.

The Sunny Problem

The subject of the study — a newspaper editor in rural Sweden who became a software builder in one hundred and ten days — read the fifth version of his own analysis and said this:

So many things lead nowhere. When I read the previous reports it sounded so sunny. Like everything always happened.

Five versions. Three overnight swarms. One interactive research session. One interview round. And every single one inherited the same bias: the chatarkiv measures activity, not outcomes. A three hundred and ten message session reads as intense productive work. The failure audit reveals it was a website rebuild that produced nothing.

The agents selected for narrative interest. The RunPod serverless failure is interesting. The thirty Ollama comparison sessions are not. Both represent real time and energy. Only one made it into the story.

What Version Seven Found

Version seven was designed to be the anti-sunny version. Three investigation agents running overnight. One classifying every session by outcome. One cross-referencing AI sessions with three hundred and eighteen published newspaper articles. One correlating session intensity with a medication timeline that previous versions did not even know existed.

The article reality check is quietly devastating. The previous versions told a story where AI transformed newspaper production from the start. The data says otherwise.

Two thousand twenty three, first half: six point seven percent of articles involved AI. That is two articles out of sixty five. Seventeen of twenty issues had zero AI involvement. Two thousand twenty three, second half: eleven point one percent. The real transformation does not begin until late twenty twenty four. Even at the peak in twenty twenty five, more than half of articles were entirely human written.

The chatarkiv shows sessions about articles. It does not show which sessions produced articles that were actually published. The difference between those two numbers is the gap between the sunny version and the real one.

The One Clean Signal

The medication correlation produced exactly one finding that survived scrutiny. Not the September explosion — too many confounds. Not the January eruption — Claude Code adoption inflates everything. One clean signal.

When the off-the-books Elvanse experiment ended around October first, sessions dropped thirty five percent and message depth dropped fifty eight percent. This is the strongest evidence of a medication effect.

But even this comes with a caveat. The trajectory never reverses. Even in the unmedicated gap between October and November, activity stays four times above the pre-medication baseline. Something changed in September and it stayed changed. The medication helped. The tool changes helped more. With a sample size of one, no control group, and six confounding variables at every transition point, this is observational, not causal.
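
The before and after comparison behind that finding is plain percentage change over equal windows on either side of the cutoff. A sketch with placeholder aggregates, chosen only to illustrate the direction and rough size of the reported drops:

    CUTOFF = "2025-10-01"   # end of the off-the-books Elvanse experiment

    # Placeholder aggregates for equal windows before and after the cutoff;
    # the real figures come from the chatarkiv, not from this example.
    before = {"sessions": 100, "mean_messages": 50.0}
    after = {"sessions": 65, "mean_messages": 21.0}

    def pct_change(new, old):
        return (new - old) / old

    print(f"sessions:      {pct_change(after['sessions'], before['sessions']):+.0%}")
    print(f"message depth: {pct_change(after['mean_messages'], before['mean_messages']):+.0%}")
    # Prints -35% and -58% with these placeholder values.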

The honest assessment from the reference document:

Something changed in September twenty twenty five. The change was permanent. The October first dip is real. Tool adoption was the bigger multiplier.

Claude Code: One Hundred Percent

The most striking number in the entire failure audit is not the sixty eight point five percent failure rate. It is the platform breakdown.

ChatGPT production rate: thirty one percent. Claude.ai production rate: fourteen percent. Claude Code production rate: one hundred percent.

Fifteen sessions. Fifteen outputs. Every single Claude Code session during the transition period produced something that was used. Not because Claude Code is magic. Because by the time Claude Code entered on January fourth, the user had already done the thinking. The specs were written. The architecture was designed. The false starts were behind him. Claude Code was the executor. The two years of ChatGPT were the education.

The Eighty Eight Day Portfolio

Version seven also dug into the git history. Twenty repositories. Six hundred and eighty nine commits. All created within eighty eight days, starting January second, twenty twenty six.

Zero repositories with the subject's own code before January second. Twenty repositories and six hundred eighty nine commits by March twenty ninth. Three waves: ArebladetLive sprint in weeks one through three, two hundred and forty three commits. Utilities and infrastructure in weeks six through nine, ninety eight commits. ParKit explosion in weeks ten through thirteen, two hundred and seventy seven commits.
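
The commit counting itself needs nothing beyond git. A sketch assuming the twenty repositories sit under one local directory; the path is a placeholder, and the dates simply bracket the window described above:

    import subprocess
    from pathlib import Path

    REPOS_DIR = Path("~/code").expanduser()        # placeholder location
    SINCE, UNTIL = "2026-01-02", "2026-03-29"      # analysis window

    grand_total = 0
    for repo in sorted(p for p in REPOS_DIR.iterdir() if (p / ".git").is_dir()):
        result = subprocess.run(
            ["git", "-C", str(repo), "rev-list", "--count",
             f"--since={SINCE}", f"--until={UNTIL}", "HEAD"],
            capture_output=True, text=True, check=True,
        )
        count = int(result.stdout.strip())
        grand_total += count
        print(f"{repo.name}: {count} commits")
    print(f"total: {grand_total} commits in the window")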

But the git history also shows the pattern the narrative wants to hide. One repository had four commits in two hours and was never touched again. Another was built in six hours and declared failed. A third had six commits spread across three days out of thirty two. The ADHD pattern in code: build fast, lose interest, move on.

The full chatarkiv is a never-ending row of things that lead nowhere.

What It Cost

The financial story is its own narrative. Total tracked AI spending from April twenty twenty four to March twenty twenty six: approximately four thousand six hundred dollars plus about five hundred euros.

ThinkDiffusion: three thousand two hundred eighteen dollars. Sixty three percent of all spending. RunPod: one thousand twenty five dollars. Twenty percent. Anthropic: four hundred seventy three euros plus thirty one dollars. Ten percent.
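
The shares are simple fractions of the stated total, with the euro amounts converted at an assumed exchange rate. The rate below is a placeholder; the article does not give one.

    EUR_TO_USD = 1.08   # assumed rate, not from the article

    total_usd = 4600 + 500 * EUR_TO_USD   # "about $4,600 plus about 500 euros"
    spend_usd = {
        "ThinkDiffusion": 3218,
        "RunPod": 1025,
        "Anthropic": 473 * EUR_TO_USD + 31,
    }

    for name, usd in spend_usd.items():
        print(f"{name}: ${usd:,.0f}  ~{usd / total_usd:.0%} of total")
    # With this assumed rate the shares land near the reported
    # sixty three, twenty, and ten percent split.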

Image generation was sixty three percent of total spend. Not code. Not AI assistants. Images. The ThinkDiffusion peak — six hundred dollars in July twenty twenty five — directly motivated the Parception spec. What if I owned the compute instead of renting it? That question, born from a monthly bill, led to the software product design that changed everything. The spec was never built. The server was never purchased. But the thinking transferred.

And the punchline: Claude Max at the five times tier costs less per month than ThinkDiffusion did at its peak. The total AI spend went down while the output went from generated images to production infrastructure.

The Process Itself

Seven versions. Four days. Each one failed differently.

Version one: wrong metric. Shipped versus not shipped collapsed three different things into one broken boolean. The debate round was the one good idea.

Version two: wrong framing. Newspaper editor. He was in radio.

Version three: right analysis, zero files. Write permissions killed the output. Recovered from raw session logs.

Version four: the human in the room. Found the hospital, the gazebo, the summer job. The machine could not see any of it.

Version five: the pre-staged swarm. All files landed. Best narrative. Still sunny.

Version six: the interview. Agents ask, human corrects. Thirteen errors found in twenty five rounds. The most efficient version by far.

Version seven: the reference document. Failures included. Sunny bias corrected. Tables over prose.

The pattern: automated analysis produces clean narratives that feel true. Human correction produces messy realities that are true. The further you get from the human, the sunnier the picture. The closer you get, the more you see the two thirds of nothing, the dead end chains, the projects declared failed after six hours, and the newspaper editor who does not earn a salary from his newspaper.

The Last Finding

The reference document ends with a section called "What the Data Cannot See." It lists seven categories of invisible information: the A1111 period done without AI help, verbal conversations, emotional context, physical work, ThinkDiffusion sessions with no conversation logs, quick interactions without export, and standalone tools like MacWhisper.

The chatarkiv is not the story. It is the shadow the story casts on one wall.

Seven versions of looking at that shadow from different angles. Each one saw something the others missed. None of them saw the full shape.

The machine reads the text. The human reads the life. The truth is somewhere in the gap between.