PärPod Temp
The Podcast That Made Itself
37m · Mar 21, 2026
On February 7th, 2026, a Swedish developer named Pär built a personal podcast generator in bash and never stopped—now his AI voice tells stories about itself, complete with seven TTS engines and a feature so cursed it got deleted.

The Swarm That Explained Itself

Thirty-Five Agents Walk Into a Context Window

On a Monday evening in March twenty twenty-six, someone decided to build an entire podcast season in one sitting. Not a pilot episode. Not a proof of concept. Twenty-four episodes about how artificial intelligence actually works, researched from primary sources, written in podcast-ready prose, reviewed for factual accuracy, and polished for production. The plan was to do it with a swarm.

Twelve research agents launched simultaneously. Each one was given a topic, a list of specific questions, and instructions to find primary sources. The topics were the foundational concepts of modern AI: tokens, neural networks, training, attention, hallucination, embeddings, scaling, reinforcement learning from human feedback, diffusion, context windows, inference, and benchmarks. Twelve Opus-class language models, running in parallel on a laptop, each one racing across the internet looking for the original papers, the inventor interviews, the conference talks, the blog posts where someone explained what they built and why.

They all came back. Every single one. Each averaged seven hundred lines of raw research material, with URLs, direct quotes, conflicting accounts, and notes about things that did not make it into the final file. Eighty thousand words of source material, gathered in about twelve minutes.

Then twelve writer agents launched. Each one read the series spec, the style guide, the research file for its topic, and the sound effects reference. Each one produced two episodes: a main story and a companion deep dive. Then twelve reviewer agents read the episodes alongside the research files and produced structured editorial feedback. Then a human read the reviews and made targeted fixes.
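The shape of that fan-out fits in a few lines of shell, in keeping with the bash generator behind the show. This is a sketch only: the research_agent, writer_agent, and reviewer_agent commands below are stand-ins rather than the project's real tooling, and the file names are guesses about how the inputs sat on disk.

    #!/usr/bin/env bash
    # Hypothetical three-phase swarm. The three *_agent commands are placeholders,
    # not the project's actual tools.
    set -euo pipefail

    topics=(tokens neural-networks training attention hallucination embeddings
            scaling rlhf diffusion context-windows inference benchmarks)

    # Phase 1: twelve researchers in parallel, one per topic.
    for topic in "${topics[@]}"; do
      research_agent --topic "$topic" --out "research/$topic.md" &
    done
    wait  # every research file must exist before the writers start

    # Phase 2: each writer reads the spec, style guide, SFX reference, and its research file.
    for topic in "${topics[@]}"; do
      writer_agent --spec series-spec.md --style style-guide.md \
                   --sfx sfx-reference.md --research "research/$topic.md" \
                   --out "episodes/$topic" &
    done
    wait

    # Phase 3: reviewers cross-check both episodes against the research.
    for topic in "${topics[@]}"; do
      reviewer_agent --episodes "episodes/$topic" --research "research/$topic.md" \
                     --out "reviews/$topic.md" &
    done
    wait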

Seventy-three thousand seven hundred and fifty-four words. Twenty-four episodes. Roughly eight hours of estimated audio. Ninety minutes of wall clock time across two session windows. The whole thing committed to git as a single fifty-file push.
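The eight-hour figure is back-of-the-envelope arithmetic on an assumed narration pace; one hundred and fifty words per minute is the assumption here, not a number taken from the production.

    # Rough audio estimate from word count, assuming ~150 spoken words per minute.
    words=73754
    wpm=150
    minutes=$(( words / wpm ))                     # ~491 minutes
    echo "$minutes minutes, roughly $(( minutes / 60 )) hours of audio"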

The Irony Engine

Here is where it gets interesting. The series is called "Actually, AI." Each episode takes one concept and explains the real mechanics behind it, cutting through the marketing language and the confused metaphors. How tokens actually work. What training actually does. Why models hallucinate. And the production process of making that series became a live demonstration of every concept the series describes.

Episode one explains that AI does not read words. It reads token fragments, pieces of text chosen by a compression algorithm with zero regard for meaning. The writers who produced episode one were themselves reading the spec as tokens. The very act of creating the explanation was subject to the phenomenon being explained.

Episode three explains that training is a loop of being wrong and adjusting. The swarm's production was the same loop. Research agents found information. Writer agents got things wrong. Review agents measured how wrong. An editor adjusted. The swarm did not understand the topics it was writing about. It found statistical patterns in its research material that, when shaped by the spec's constraints, produced output that looks like understanding.

Episode five explains hallucination. The model has no mechanism for truth. It has confidence scores, not fact databases. It generates the most likely next token based on patterns in its training data. And sure enough, two of the thirty-five agents hallucinated. The writer for the embeddings deep dive produced a VOICE block attributed to a researcher named Tolga Bolukbasi, quoting him saying things about bias in news data. The words sounded exactly like something Bolukbasi would say. They were plausible, specific, and completely fabricated. The research file described a finding from his paper. The writer turned that finding into a quote, dressed it in quotation marks, and put it in a VOICE block as if the man had said those exact words.

The writer for the neural networks deep dive did the same thing with Paul Werbos, producing a pithy observation about backpropagation credit that read like a direct quote from an interview. It was not. It was a paraphrase of a secondary source, promoted to primary-source status by an agent that had no concept of the difference.

In both cases, the review agents caught it. Not because they recognized the fabrication on its own merits. The fabricated quotes sounded perfectly natural. What caught them was cross-referencing. The reviewer read the research file, saw that it described a finding rather than a quote, and flagged the mismatch. Without the source material, the fabrication would have passed. The hallucination episode describes this exact dynamic. The model does not know it is wrong. External verification is the only reliable check.

The Wall

Episode seven explains scaling laws. The core finding is that neural networks get predictably better as you add more compute, following smooth power law curves. The industry has bet billions on those curves continuing. And the swarm hit its own scaling wall.

Thirty minutes into the first session, the API rate limit triggered. Agents that had been running for ten minutes died mid-sentence. Research files that were ninety percent written vanished into error messages. The pipelining strategy, launching writers as soon as research landed rather than waiting for all research to complete, meant that writer agents, already burning expensive tokens, crashed before producing anything useful. Tokens spent, nothing to show for it.

The recovery was instructive. The second session launched the missing researchers first, waited for them to finish, then batched writers more carefully. The rate limit is a hard constraint, as non-negotiable as the quadratic cost of attention that episode four describes, or the KV cache memory that episode ten documents. You do not argue with it. You work around it.
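The recovery pattern is easy to sketch with the same placeholder commands as before: finish the missing research, wait, then release writers in small batches so the next rate limit lands between batches rather than mid-agent. The batch size and the list of crashed topics are illustrative.

    # Hypothetical second session: missing research first, then writers in batches.
    topics=(tokens neural-networks training attention hallucination embeddings
            scaling rlhf diffusion context-windows inference benchmarks)
    missing=(embeddings diffusion inference)   # illustrative, not the actual casualties

    for topic in "${missing[@]}"; do
      research_agent --topic "$topic" --out "research/$topic.md" &
    done
    wait  # spend no writer tokens until every research file is on disk

    batch_size=4
    for (( i = 0; i < ${#topics[@]}; i += batch_size )); do
      for topic in "${topics[@]:i:batch_size}"; do
        writer_agent --research "research/$topic.md" --out "episodes/$topic" &
      done
      wait  # a rate limit now hits between batches, not mid-sentence
    done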

The scaling episode's own production demonstrated the scaling episode's thesis. More compute produces more output. The relationship is predictable. And there is a wall.

Guidelines Not Laws

Episode eight explains reinforcement learning from human feedback. The core idea: humans rank outputs, a reward model learns those preferences, and the system optimizes for what humans say they want. The entire review and edit phase of the swarm was the same process. Twelve review agents ranked issues by severity. A human read the rankings and decided what to fix.

The most revealing moment came during the sound effects audit. The spec says main stories should have two to four SOUND tags. Several episodes had five. The reviewer flagged it. The suggestion was to cut sounds to match the budget. And the human said three words that defined the entire production philosophy.

Guidelines not laws.

The season finale kept its five sounds. The episode about neural networks kept its glitch effect at the Minsky moment because the drama earned it. The system optimized for human preference, and the human preferred judgment over compliance. Which is, if you think about it, the exact limitation of RLHF that episode eight describes. The reward model learns what humans prefer. Humans prefer nuance over rules. The system cannot encode nuance. So it learns the rules, and a human overrides them when the rules are wrong.
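The audit itself is mechanical enough to show. Here is a sketch of the sound-budget check, assuming scripts sit one per episode directory and sound cues are written as SOUND: lines; neither detail is actually specified in the piece.

    # Hypothetical sound-budget audit: flag main stories outside the 2-4 SOUND range.
    for script in episodes/*/main-story.txt; do
      count=$(grep -c '^SOUND:' "$script" || true)
      if (( count < 2 || count > 4 )); then
        echo "NOTE: $script has $count SOUND tags (guideline is 2 to 4)"
      fi
    done

It prints a note rather than failing anything, which is the guidelines-not-laws philosophy in code: the flag goes to a human, who decides whether the drama earned the extra sound.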

The Review That Reviewed Itself

The review agents were the most philosophically interesting part of the swarm. Each one read two episodes and a research file, then produced a structured review covering factual accuracy, narrative quality, TTS compliance, cross-episode references, and an overall rating with specific improvement suggestions. They were told to be honest, to quote problematic text, and to not invent problems.
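A reviewer invocation along those lines might look like the sketch below. The five review dimensions come from the description above; the prompt wording, the reviewer_agent command, and the file paths are all assumed.

    # Hypothetical reviewer call; the prompt mirrors the review dimensions described above.
    topic="embeddings"
    prompt='Review both episodes against the research file. Report, in order:
    1. Factual accuracy: cross-check every claim, quote problematic text verbatim.
    2. Narrative quality.
    3. TTS compliance.
    4. Cross-episode references.
    5. Overall rating, with specific improvement suggestions.
    Be honest. Do not invent problems.'

    reviewer_agent --research "research/$topic.md" \
                   --episodes "episodes/$topic" \
                   --prompt "$prompt"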

They did not invent problems. Every factual issue they flagged was real. Polosukhin's age was wrong. The GPT-3 launch date was wrong. The Bard-to-Mata timeline was wrong. The off-palette sound tags were really off-palette. They caught things that a human skimming the same text would have missed, because they were cross-checking every claim against seven hundred lines of research notes.

But they also exhibited the exact limitation that the benchmarks episode describes. When a measure becomes a target, it ceases to be a good measure. The review agents were optimizing for their checklist. Some of them flagged "issues" that were actually working as intended, like a main story with five sounds instead of four. The measure was the spec's budget. The target was a clean review. The agent wanted to produce a review with findings, so it found things, even when the thing it found was a guideline being bent rather than a rule being broken.

Five sounds in a standard episode when the spec says two to four. Consider removing two or three.

The human considered it, recognized that the drama earned the extra sound, and kept it. Which is the entire argument of the benchmarks episode: measurements matter, but they are proxies for judgment, and judgment cannot be automated.

What the Swarm Cannot Do

The swarm cannot listen to the episodes. It cannot hear the pacing. It cannot feel whether a transition between chapters lands or falls flat. It cannot tell you whether the Kenyan labeler story in episode eight was given enough room to breathe, only that it met the minimum paragraph length and had source citations. It cannot tell you whether the closing reflection in the season finale ties the threads together in a way that feels earned, only that it mentions all eleven previous episodes.

These are the gaps that episode twelve, the benchmarks episode, describes. The measurements capture something real but miss the thing that matters most. The swarm produced seventy-three thousand words of podcast-ready content. Whether any of it is actually good to listen to is a question that only a human with headphones and twenty spare minutes can answer. The swarm built the house. Someone still has to live in it.

The series is called "Actually, AI." And the most honest thing about it is that it was made by AI, using every technique it describes, encountering every limitation it documents, and ultimately requiring the same human judgment whose absence it spends twelve episodes explaining. The swarm explained itself. Whether it understood the explanation is, as episode two puts it, one of the great open questions of our time.

That question will have to wait. The episodes are written. The git push went through. Somewhere in a repository, fifty new files contain a season of podcast about how machines think, written by machines that do not think, reviewed by machines checking whether machines got it right, and approved by a human who was, at the time, also discussing something very crazy in a different chat window.