Somewhere around midnight, a man pressed play on a file called emdash underscore inworld dot wav. The file contained a single paragraph about the transatlantic telegraph cable, deliberately stuffed with em-dashes. The kind of punctuation that, according to every rule in the system, would cause the text-to-speech engine to stumble, stutter, or produce something no listener should have to endure.
The audio played. It sounded fine. Perfectly fine. Natural, even.
He pressed play on the next file. Numbers written as digits instead of words. Nineteen fifty-eight rendered as the numeral form. According to the rules, the engine would mispronounce it. The engine did not mispronounce it. It said "nineteen fifty-eight" like a BBC newsreader who had been doing this for thirty years.
Backtick code formatting. Fine. Markdown bold and italic. Fine. A sixty-word sentence with nested clauses. Fine. URLs. Mostly fine. Everything the rules said would break, did not break.
Eleven files. Eleven tests. Zero failures.
The rules had been written for a different engine. An engine called Kokoro, which had been replaced months ago by a service called Inworld. Nobody had retested the rules after the swap. They had simply carried forward, like building codes written for wooden houses still being enforced in a neighborhood that switched to concrete years ago.
The TTS compliance scores from the previous experiment were grading models against rules that no longer applied. Every model penalized for using em-dashes was penalized for nothing. The entire tier ranking was built on a false floor.
This was Phase Three of what would become a seven-phase experiment. And it was only the first assumption to die.
Here is what the podcast production pipeline looked like at the start of the night. A research step gathered facts from the web. A generation step wrote the script, using Claude Sonnet. A review step checked the script for errors, using Mistral Large. A revision step polished the script, using Claude Opus, the most expensive model in the lineup, at roughly five times the cost of Sonnet and seventy-seven times the cost of DeepSeek.
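As a rough sketch, assuming a simple step-list configuration (the step names, model identifiers, and format here are illustrative, not the project's actual config):

```python
# Hypothetical sketch of the pipeline as it stood at the start of the night.
# The step names, model identifiers, and list-of-dicts format are assumptions,
# not the project's actual configuration.
PIPELINE_AT_START = [
    {"step": "research", "source": "web"},           # gather facts
    {"step": "generate", "model": "claude-sonnet"},  # write the script
    {"step": "review",   "model": "mistral-large"},  # check for errors
    {"step": "revise",   "model": "claude-opus"},    # premium budget spent here
]
```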
The logic was simple. Opus is the premium model. Revision is the step that makes the final product good. Therefore, spend the premium budget on revision. It felt right. It had never been tested.
The experiment had been designed to answer one question. Does Opus justify its cost as a reviser? Phase Two had already answered that: no. Sonnet revises just as well. DeepSeek revises just as well for less than a penny. The revision step was overpaying by a factor of five to seventy-seven, depending on which cheaper model you compared it to.
But that was just the beginning. The incoming note that started the session had three threads in it, and the second and third threads turned the revision question into something much larger. What about the research step? What about the prompts? What about TTS compliance? What about the models being used as judges to evaluate all of this?
The session turned into a full audit. Not of the code, but of the assumptions underneath the code. Every default. Every configuration choice. Every rule that had been inherited from an earlier version of the system and never questioned.
Before you can test anything, you need judges you can trust. The existing judge panel had five models evaluating outputs on a one-to-five scale. Two of those judges were broken. One depended on Azure credits that had expired in March. The other depended on an OpenAI key that might or might not be active.
But the real problem was subtler. One judge, a model running on Cerebras hardware, had been giving nine out of ten to everything. Every output. Every comparison. No discrimination whatsoever. As a judge, it was useless. As a yes-man, it was outstanding.
If your judge agrees with everyone, your judge is not a judge. You just have four judges and a cheerleader.
The panel got rebuilt. Cerebras was removed as a judge but kept as a test subject. Azure was dropped entirely. The new lineup was Claude Sonnet, DeepSeek, Mistral, Qwen, and GPT-4o. Five providers, five different perspectives, no single point of failure.
Then came the position bias. Across four consecutive comparisons in Phase Four and Phase Five, one judge consistently ranked Agent A above Agent B. Not because Agent A was better, but because Agent A was presented first. Every time. It was not a quality signal. It was a reading order preference.
The fix was two things at once. First, swap the biased model for a different version from the same provider. Second, and more importantly, randomize the presentation order for every single judge call. Shuffle the agents. Each judge sees the outputs in a different order. Position bias cannot consistently favor anyone if the positions keep changing.
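A sketch of what the per-call shuffle can look like. The judge names come from the rebuilt panel above, but the function, its signature, and the ask_judge callback are illustrative assumptions, not the project's actual code:

```python
import random

JUDGES = ["claude-sonnet", "deepseek", "mistral", "qwen", "gpt-4o"]

def judge_pair(output_a, output_b, ask_judge):
    """Collect one verdict per judge, reshuffling presentation order each call."""
    votes = {}
    for judge in JUDGES:
        pair = [("A", output_a), ("B", output_b)]
        random.shuffle(pair)  # each judge sees a fresh order, so position
                              # bias cannot consistently favor either agent
        first_label, first_text = pair[0]
        _, second_text = pair[1]
        # ask_judge is an assumed callback: returns 1 if the first-shown
        # output wins, 2 if the second-shown output wins.
        winner_position = ask_judge(judge, first_text, second_text)
        votes[judge] = first_label if winner_position == 1 else pair[1][0]
    return votes
```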
After the fix, no bias was detected. The panel worked. The experiment could trust its own measurements again.
There was a document sitting in the codebase called the Sound School prompt. It was inspired by the Transom Sound School archive, a collection of lessons about audio storytelling craft. Favor scenes over summary. Find the small true thing. Craft your endings. Do not use sound as literal illustration of the words on the page.
These are genuinely good principles. The kind of advice you would hear at a radio workshop and think, yes, that is exactly right.
So it got tested. Same topic. Same models. Production prompt against Sound School prompt. Three models, blind judging, the full treatment. And the production prompt won. Every time. Not by a lot, but consistently. Three for three.
The craft principles were good advice for humans. But for language models generating podcast scripts, the longer and more nuanced prompt seemed to dilute the core formatting instructions that the models needed most. The production prompt was blunt, direct, structural. The Sound School prompt was thoughtful, layered, philosophical. The models responded better to blunt.
This does not mean the craft principles are worthless. They might work better as revision guidance, where the model already has a draft to improve and the nuanced feedback can actually land. But as generation instructions, they lost to the simpler prompt. The takeaway is uncomfortable but useful. Sometimes the less sophisticated tool works better because it is less sophisticated.
Phase Six was the one that rearranged everything.
The question was simple. If Opus is too expensive as a reviser, what if you use it as the generator instead? What if the premium budget goes to writing the first draft, and the cheap model handles the polish?
Three pipelines. Same topic. Blind judged by the newly trustworthy five-judge panel with shuffled presentation order.
Pipeline A: Opus generates a raw script with no revision. Cost roughly twenty-seven cents.
Pipeline B: Sonnet generates a script, then Sonnet revises it. Cost roughly thirteen cents.
Pipeline C: Opus generates a script, then DeepSeek revises it for half a penny. Cost roughly twenty-seven cents.
Pipeline C won. But the real finding was that Pipeline A, the raw unrevised Opus draft, scored higher than Pipeline B, the Sonnet draft that had been through a full revision pass. The unpolished Opus output was better than the polished Sonnet output.
The budget was not just the wrong amount. It was in the wrong step. Opus should never have been the reviser. It should have been the writer. The entire pipeline was configured backwards.
The margin was not huge. Four point two four versus four point one one versus three point nine two. But the direction was unambiguous. Five judges, shuffled order, no detected bias. Opus writes better raw material than Sonnet can produce even after revision. The expensive model earns its cost at the beginning of the pipeline, not the end.
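In configuration terms, the correction is a role swap rather than a model swap. A hypothetical sketch of the winning ordering, in the same assumed format as the earlier config sketch:

```python
# Hypothetical sketch of Pipeline C, the winner: the premium model writes,
# the cheap model polishes. Costs are the rough per-run figures above.
PIPELINE_CORRECTED = [
    {"step": "generate", "model": "claude-opus"},  # roughly twenty-seven cents
    {"step": "revise",   "model": "deepseek"},     # roughly half a penny
]
```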
Phase Four had tested whether adding web research to the generation step improved the output. The results were mixed. On the USB-C topic that every model already knew cold, research barely helped the strong models and actually hurt one of the weaker ones by eating up context window space.
The conclusion was written up. Research is optional for strong models.
Then the human said one sentence.
The key we are missing here is that without research on many topics that are not in the model training data, it falls completely flat.
Of course. The test had used USB-C, a topic with thousands of training documents. On that topic, Sonnet does not need research because Sonnet already knows the history of USB-C. But real podcast episodes are not always about USB-C. They are about the Japanese pager culture of the nineteen nineties, or obscure mining history in northern Sweden, or a specific regulatory decision that happened last month.
On those topics, without research, every model hallucinates. Not because the models are bad, but because they have nothing to draw on. The research step is not a nice-to-have for niche content. It is the difference between a real episode and a confident-sounding fiction.
Phase Seven validated this. Two full pipeline runs on a niche topic about Japanese Pocket Bell pager culture. Both pipelines included research. Both produced factually grounded episodes about an obscure subject. The research cost eight to thirteen cents and was the single most important step in the entire chain.
The conclusion was rewritten. Research is mandatory.
Seven phases. Roughly two dollars and fifty cents. One night of systematic testing.
The Kokoro TTS rules were dead. Replaced with a relaxed set that matches what Inworld actually needs, which turns out to be very little. Expand acronyms. Avoid URLs. Everything else, the engine handles.
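As a sketch of what that relaxed rule set might look like as a preprocessing pass. The function name, the acronym list, and the URL placeholder are all assumptions, since only the two surviving rules are specified:

```python
import re

# Hypothetical acronym list; the real one would be project-specific.
ACRONYMS = {"USB": "U S B", "TTS": "text to speech"}

def prepare_for_tts(script: str) -> str:
    """Apply the only two rules Inworld still needs.
    Em-dashes, digits, and markdown pass through untouched."""
    for acronym, spoken in ACRONYMS.items():
        script = re.sub(rf"\b{acronym}\b", spoken, script)
    # Swap bare URLs for a speakable placeholder instead of reading them aloud.
    return re.sub(r"https?://\S+", "the linked page", script)
```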
The Opus-as-reviser default was dead. Replaced with DeepSeek at half a penny per revision, producing equal quality. Opus was promoted to generator, where it actually earns its cost.
The Sound School generation prompt was dead. Not because the principles were wrong, but because they worked better as human editorial advice than as model instructions.
The three-judge panel with Cerebras was dead. Replaced with a five-judge shuffled panel that actually discriminates between outputs.
The assumption that research was optional was dead. One human observation killed it faster than any experiment could have.
Every layer of the pipeline had been configured for something that no longer existed. A TTS engine that had been replaced. A cost structure that put the expensive model in the wrong step. A prompt inspired by principles that do not transfer to language models. A judge that agreed with everyone. And a test that used a topic too familiar to reveal the real dependency.
The pipeline that emerged from the other side costs fifteen cents for a basic episode and forty-three cents for a premium one. It produces quality scores above four out of five on niche topics. It runs in under four minutes. And every single configuration choice in it has been tested, not inherited.
That is what happens when you audit the defaults. Not the code. Not the architecture. The assumptions underneath. The things nobody questions because they were true once, and once was enough to become forever, until someone pressed play on eleven audio files at midnight and heard nothing wrong at all.