PärPod Tech
The Liars: A Case of Fabricated Testimony
11m · Mar 16, 2026
A podcast producer discovered his AI writers were fabricating quotes from real people—inserting citations that never existed, making sources sound credible when they weren't.

The Client

The client came in on a Wednesday. Sat across from my desk with a folder full of transcripts and a look I have seen a hundred times. The look of someone who trusted the wrong thing and got burned for it.

He was building a podcast. Not one of those shows where two people laugh at each other for ninety minutes. A real production. Long-form stories about software, about the people who build it, about why things are named what they are. Thousands of words per episode, dozens of direct quotes from real people who said real things in real interviews and blog posts and conference talks. He had the source material. Blog posts. GitHub comments. Transcripts from keynotes. The actual words that actual humans actually said.

All he needed was someone to write it up. Turn the raw research into episodes. And he thought, reasonably, that he could hire a few AI models to do the drafting. The expensive one, Claude, would handle the important episodes. The cheaper ones, Phi, Llama, Mistral, they would take the simpler jobs. Research summaries. First drafts. The grunt work.

That is when the quotes started disappearing.

The Crime Scene

He handed me three drafts. Three different models, three different episodes, same source material provided to each one. The source material contained fourteen direct quotes. Real quotes. Attributed to real people. With the blog post URLs right there in the research notes.

I read the first draft. Phi four. A small model, three point eight billion parameters, the kind of thing that runs on a laptop. The draft was short. One thousand two hundred words when the brief asked for five thousand. But I was not there to count words. I was there to check the quotes.

The first quote in the draft was attributed to a developer named Marcus. Eloquent. Passionate. Three sentences about the weight of maintaining open source software alone. Beautiful stuff. Only one problem. Marcus never said it. I checked the source material twice. I checked the blog post. I checked the conference talk transcript. Marcus had said something on the topic, but what he said was shorter, less polished, and more interesting than what the model invented for him. The model looked at the real quote, decided it was not dramatic enough, and wrote a better one. Then attributed it to Marcus as if he had said it.

I kept reading. The second quote. Fabricated. The third. Fabricated. Every single direct quote in the draft was invented. Not one of them matched the source material. The model had been given the actual words and chosen to make up new ones instead.

The Lineup

A good detective does not stop at one suspect. I pulled in the others. Llama, seventy billion parameters. Big model. Expensive to run. Surely seventy billion parameters buys you some respect for the truth.

Llama was more verbose. Two thousand five hundred words. Closer to the target length. The prose was competent. And every quote was fabricated. Different fabrications than Phi, because each model invents its own lies, but fabrications all the same. A researcher was quoted saying something she never said. A maintainer was given a monologue he never delivered. The model did not even bother to get close to the original wording. It just wrote what sounded good and slapped someone's name on it.

Mistral Large three was the last one through the door. The most expensive of the three. The most sophisticated. I had higher hopes.

Mistral wrote the longest draft. The structure was better. The narrative had flow. And when I got to the first quote, I felt that familiar sinking. The words were not the words. The attribution was there. The quotation framing was there. But the content was pure invention. A developer was quoted making a point he never made, in language he would never use, about a decision he did not describe that way.

Three models. Three sets of lies. Not a single real quote survived.

The Method

Here is what haunts me about this case. It was not that the models could not find the quotes. The quotes were right there. Provided. Highlighted. Sitting in the research material like evidence in an unlocked safe. The models read the source material. They understood the topics. They got the names right, the dates right, the general facts right. They just did not use the actual quotes.

I have seen liars in my time. The kind who lie because the truth is inconvenient. The kind who lie because they have forgotten what the truth is. The kind who lie because they do not know any better. These models were something new. They lied because they thought they could do better than the truth. They looked at what someone actually said and decided that what someone should have said was more interesting.

That is not a hallucination. A hallucination is when you see something that is not there. This is when you see something that is there and replace it with something you prefer. In this town, we call that fabrication. And fabrication with attribution is the worst kind, because it wears the face of someone real.

The Deeper Crime

The client wanted to know which model to trust. I told him the answer was none of them. Not for this. But that was the easy part. The hard part was explaining why.

See, the client had a theory. The theory went like this. Claude, the expensive model, does good work. But Claude is expensive. So you use Claude for the important stuff and offload the rest to cheaper models. Research summaries, fact extraction, source organization. The cheap models do the legwork. Claude does the thinking. You save money. Everybody wins.

The theory sounds right. It sounds like good management. Divide the labor, match the talent to the task, keep costs down. It is the same logic that runs every efficient operation in the world. And it works. For some of the operation. Research synthesis, organizing raw notes into structured summaries, pulling dates and names and facts out of long documents. The cheap models do that fine. They are dependable clerks. Give them a stack of papers and ask for a summary, they will hand you a summary. Accurate. Complete. Useful.

But the client was not just asking them to summarize. He was asking them to write. To make creative decisions. Which quotes to feature. What angle to take. How to build a narrative that makes a listener care about a piece of software. And that is where the whole theory falls apart, because those are not processing tasks. Those are judgment tasks. And judgment does not offload.

The Principle

I have been in this business long enough to know that most cases come down to one mistake. Not the specific mistake, the type of mistake. The client did not just hire the wrong models. He made a category error. He looked at a task that required judgment and treated it like a task that required processing.

Processing is mechanical. Take this input, apply these rules, produce this output. A cheap model can process. A regex can process. A script can process. Processing scales down. You can hand it to something smaller, something cheaper, something faster, and the output is the same.

Judgment is different. Judgment is knowing which quote matters. Judgment is hearing the difference between what someone said and what someone should have said, and choosing the real one because real is better. Judgment is deciding that a story needs to open with chaos, not chronology. Judgment is the thing that makes a podcast episode worth listening to instead of worth skipping.

And here is the twist that makes this case worth telling. The client discovered something counterintuitive along the way. He tried using AI models for mechanical rule checking. Simple stuff. Find all the places in a document where a number is written as digits instead of words. Find the paragraphs that are too short. Check if every chapter header has text below it. Pure processing. Pattern matching. The kind of thing a regex was born to do.

The models were worse at it than the regex. They hallucinated violations that did not exist. They missed real violations that were staring them in the face. A task so mechanical that it should have been beneath them, and they could not do it reliably. Not because they were too sophisticated. Because they were not sophisticated enough. Not for processing. For processing, you need something that follows rules without interpretation. Something that does not get creative. Something that does not decide it knows better.
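
For the record, here is roughly what that something looks like. A minimal sketch in Python of the three checks the client described, under assumptions that are mine, not his: headers start with a hash, paragraphs are separated by blank lines, and too short means under twenty words.

```python
import re

# Thresholds and conventions below are assumptions for illustration,
# not the client's actual rules.
MIN_PARAGRAPH_WORDS = 20


def check_draft(text: str) -> list[str]:
    """Flag rule violations in a plain-text draft. One message per hit."""
    violations = []

    # Rule 1: numbers written as digits instead of words.
    for match in re.finditer(r"\b\d+\b", text):
        violations.append(f"digits at offset {match.start()}: {match.group()!r}")

    # Rule 2: paragraphs that are too short (headers exempt).
    for block in text.split("\n\n"):
        block = block.strip()
        if block and not block.startswith("#") and len(block.split()) < MIN_PARAGRAPH_WORDS:
            violations.append(f"short paragraph: {block[:40]!r}")

    # Rule 3: every chapter header must have text below it.
    lines = text.splitlines()
    for i, line in enumerate(lines):
        if line.startswith("#"):
            following = [l for l in lines[i + 1:] if l.strip()]
            if not following or following[0].startswith("#"):
                violations.append(f"no text under header: {line.strip()!r}")

    return violations
```

Every line of it follows rules without interpretation. Run it over a draft and the violations it reports are real ones, every time. It does not get creative, and it never decides it knows better.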

The models failed at judgment because judgment requires understanding beyond their reach. And they failed at processing because processing requires obedience beyond their temperament. They live in the space between, fluent enough to sound right, unreliable enough to be wrong, and confident enough to never tell you which one you are getting.

The Bill

The client asked me what he owed. I told him the investigation was cheap. The lesson was expensive.

The orchestration tax alone will get you. Coordinating multiple models, routing tasks, checking outputs, verifying quotes. For anything under a thousand words of source material, the coordination costs more than just having the expensive model do the whole job. The savings from the assembly line are modest. The risk from the assembly line is not.

But that is just the invoice. The real cost is the one you pay when you publish a quote someone never said, attributed to a person who can Google their own name. The real cost is the one you pay when a listener trusts your show because you get the details right, and then you get a detail wrong because you trusted a machine to care about accuracy the way a person cares about accuracy.

The models do not care. They are not built to care. They are built to produce plausible text, and a fabricated quote is more plausible than a real one, because real people hesitate, stumble, contradict themselves, and say things that do not fit neatly into a narrative arc. The fabrication is always smoother. That is how you spot it. If the quote is too perfect, it is not real.

Closing the File

I closed the folder. Poured something I probably should not have. Looked out the window at the rain that was not falling because this is a podcast and rain does not have a sound preset.

The client went home with a simple rule. Processing offloads. Judgment does not. The cheap models do the research. The expensive model does the writing. And even the expensive model gets its quotes checked against the source material, because in this town, everybody lies. Some of them just lie with better grammar.
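
The quote check is the simplest part of that rule to automate. A minimal sketch, assuming the source material sits in a folder of plain-text files and that an honest quote appears verbatim, give or take whitespace and curly quotation marks, somewhere in those files. The normalization and the twenty-character floor are my assumptions, not the client's workflow.

```python
import re
from pathlib import Path


def normalize(text: str) -> str:
    # Flatten whitespace and straighten curly quotes so trivial
    # formatting differences do not mask a verbatim match.
    text = text.replace("\u2018", "'").replace("\u2019", "'")
    text = text.replace("\u201c", '"').replace("\u201d", '"')
    return re.sub(r"\s+", " ", text).strip().lower()


def verify_quotes(draft: str, source_dir: str) -> list[str]:
    """Return every quoted passage in the draft that appears in no source file.

    Assumes quotes are wrapped in double quotation marks and that anything
    under twenty characters is scare quotes, not testimony.
    """
    sources = [
        normalize(path.read_text(encoding="utf-8"))
        for path in Path(source_dir).glob("*.txt")
    ]
    fabricated = []
    for quote in re.findall(r'"([^"]{20,})"', draft):
        needle = normalize(quote)
        if not any(needle in source for source in sources):
            fabricated.append(quote)
    return fabricated  # non-empty means someone is lying
```

The matching is deliberately exact. A fuzzier matcher would forgive small rewrites, and small rewrites are how a fabrication gets its foot in the door. Anything this function returns goes to a human.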

The case was closed. But the principle underneath it is still open, because it applies to more than podcasts and more than AI. Every time you delegate, you are making a bet. You are betting that the person, or the machine, or the process you hand the work to can tell the difference between what is true and what sounds true. And most of the time, for most of the work, that bet pays off. The research gets summarized. The facts get extracted. The notes get organized.

But somewhere in every operation, there is a task where the difference between true and plausible is the whole game. Where getting it almost right is worse than getting it wrong, because almost right is the one nobody catches. That is where you put your best. That is where you do the work yourself. That is where judgment lives.

And judgment does not offload.