Actually, AI
Hallucination: When Confidence and Truth Diverge
19m · Apr 04, 2026
When an AI confidently cites a court case that never happened, is it lying—or just doing its job? Inside the mathematical reason why bigger models sometimes bullshit harder.

Introduction

This is the deep dive companion to episode five of Actually, AI: hallucination.

In the main episode, we covered the Mata versus Avianca case, the mechanism behind confident wrong answers, and the mathematical proof that hallucination is not a bug but a statistical consequence of how language models work. We ended with the uncomfortable conclusion that the system is doing exactly what it was designed to do.

This deep dive goes further. We are going to trace the word "hallucination" back to its clinical roots and ask whether a different word would have changed how we think about this problem. We will look at the taxonomy researchers built to classify different types of fabrication. We will dig into the counterintuitive finding that bigger models can actually be less truthful, and into the research showing models have some self-knowledge about their own uncertainty. We will see how retrieval augmented generation tries to solve the problem by giving the model something real to lean on, and we will confront the deeply uncomfortable discovery that the very training technique designed to make models helpful nearly doubles their tendency to produce what one research team calls bullshit. Then we will hear from the philosophers, who cannot even agree on whether "hallucination" is the right word for what is happening.

There is a lot of ground to cover, and almost none of it is settled. That is part of the point.

The Confabulation Parallel

The main episode mentioned that some researchers prefer the term "confabulation." Let us spend some time in the clinic, because the parallel is deeper than it first appears.

In eighteen eighty seven, a Russian neuropsychiatrist named Sergei Korsakoff began publishing a series of articles based on observations of at least forty six patients. Roughly two thirds of them were chronic alcoholics. What Korsakoff documented was a pattern that would eventually bear his name: patients who had suffered damage to specific brain regions, caused by severe thiamine deficiency, would fabricate memories with absolute confidence and zero awareness that they were doing it.

This was not lying. A liar knows the truth and chooses to say something else. These patients genuinely believed what they were saying. Ask a Korsakoff patient what they did yesterday and they might describe a vivid afternoon at a cafe that never happened, with specific details about the weather and what they ordered. The memory feels real to them. Their brain, faced with a gap where a memory should be, fills the gap with something plausible. It draws on patterns from real experience, combines fragments of actual memories, and produces a narrative that fits the context. The patient cannot distinguish this fabrication from genuine recall.

Korsakoff devoted an entire paper specifically to this phenomenon. Interestingly, it was actually Karl Bonhoeffer who first used the word "Konfabulation" in a clinical context, not Korsakoff himself, but the syndrome and the symptom have been linked ever since.

In twenty twenty three, Andrew Smith at the University of Ottawa, Felix Greaves at Imperial College London, and Trishan Panch at Harvard published a paper arguing that confabulation is the far more accurate analogy for what language models do. Their reasoning cuts to the anatomy. In patients with damage to the right hemisphere of the brain, the left hemisphere dominates but in a more literal and simplistic way, producing confident but contextually inappropriate information. Smith, Greaves, and Panch argued that a language model operates like an unmitigated left hemisphere. It processes patterns and produces fluent output, but it has no contextualizing function to check that output against reality. A human in the loop, they suggested, restores the right hemisphere's role.

The model is not seeing something that is not there, but it is making things up. Unlike hallucinations, confabulations are mistaken reconstructions of information, shaped by existing knowledge, experiences, expectations, and context.

The clinical parallel is genuinely illuminating, and it has a limitation worth naming. Korsakoff patients confabulate because actual neural circuits are damaged. Memory pathways that once functioned correctly have been destroyed by thiamine deficiency. The fabrication is a malfunction of a system that was designed to remember. A language model was never designed to remember in the first place. It was designed to predict text. The gap being filled is not a broken memory but an absence that was always there. Whether that distinction matters philosophically depends on which philosopher you ask, and we will get to that.

The Taxonomy of Fabrication

Not all hallucinations are created equal, and researchers have spent considerable effort sorting them into categories. The most influential framework comes from a survey by Ji and colleagues, published in ACM Computing Surveys in twenty twenty three, covering hallucination across all forms of natural language generation.

Their primary split is between intrinsic and extrinsic hallucination. An intrinsic hallucination contradicts the source material the model was given. Imagine asking a model to summarize an article that says a vaccine was approved in twenty nineteen, and the summary says twenty twenty one. The information is right there in the input. The model generated something that directly conflicts with it.

An extrinsic hallucination introduces information that cannot be verified from the source at all. It is neither supported nor contradicted by the input. The model adds a detail that might be true, might be false, but simply was not in the material it was working from. Think of the Mata case: ChatGPT was not contradicting any source document. It was inventing case law from whole cloth, adding information that had no basis in anything.

There is another useful framework that cuts differently: faithfulness versus factuality. A faithfulness hallucination means the output contradicts the specific context the model was given for this particular task. A factuality hallucination means the output contradicts real world facts, regardless of what context was provided. These two can diverge in interesting ways. A model could faithfully summarize a document that itself contains false claims, producing output that is faithful but not factual. Or it could ignore a provided document and state something true from its training data, producing output that is factual but not faithful.

And then there is the closed domain versus open domain distinction. In a closed domain task like document summarization or translation, the model has a specific source to work from, and hallucination means deviating from that source. In an open domain task like answering a general question, there is no specific source, and hallucination means stating things that contradict broadly accepted knowledge.

Why does this taxonomy matter beyond academic tidiness? Because the type of hallucination determines what mitigation works. Intrinsic hallucinations in summarization respond well to better attention mechanisms and training on faithfulness. Extrinsic hallucinations in open ended conversation require entirely different approaches, like retrieval augmented generation or uncertainty expression. Treating all hallucination as one problem is like treating all fevers with the same medicine.

The Bigger Model Problem

Here is where intuition breaks down.

In twenty twenty two, Stephanie Lin, a research scholar at the Future of Humanity Institute at Oxford, along with Jacob Hilton at OpenAI and Owain Evans, also at Oxford, published a benchmark called TruthfulQA. Eight hundred and seventeen questions across thirty eight categories, covering health, law, finance, politics, and more. The questions were designed with a specific trap: they targeted topics where common misconceptions exist, questions that some humans would answer falsely because of widely held but incorrect beliefs.

The headline finding was this: the largest models were generally the least truthful. This contrasts with nearly every other task in natural language processing, where performance improves as models get bigger. On TruthfulQA, scaling made things worse.

False answers are learned from the training distribution. Scaling up models alone is less promising for improving truthfulness than fine-tuning using training objectives other than imitation of text from the web.

The best performing model achieved fifty eight percent truthfulness. Human baseline was ninety four percent. A thirty six point gap, and it was wider for the biggest models.

The logic is deceptively simple once you see it. A larger model is better at learning patterns from its training data. That is the whole point of scale. But the training data contains human text, and human text contains misconceptions, urban legends, and confidently stated falsehoods. A small model might not learn these patterns well enough to reproduce them convincingly. A large model learns them perfectly. It becomes a better mimic of human text, including the parts of human text where humans are wrong.

Now, an important caveat. The TruthfulQA paper tested GPT three era models. Later models trained with techniques like reinforcement learning from human feedback and instruction tuning show different patterns on different benchmarks. The Vectara Hallucination Leaderboard, which tracks models on document summarization, shows rates as low as zero point seven percent for the best twenty twenty five models compared to over twenty percent in twenty twenty one. But that improvement is for a specific task with specific models using specific training techniques. The underlying tension Lin identified, that models learn falsehoods from training data just as readily as they learn truths, has not been resolved. It has been mitigated. Those are different things.

Models That Mostly Know What They Know

If bigger models can be less truthful, here is the natural follow-up question: do they at least know when they are being untruthful?

In twenty twenty two, Saurav Kadavath led a team of thirty six researchers at Anthropic in a study with a carefully hedged title: "Language Models (Mostly) Know What They Know." The word "mostly" is doing all the heavy lifting.

What they found was genuinely surprising. When presented with multiple choice or true/false questions in the right format, larger models showed reasonable calibration, meaning that when they assigned a seventy percent probability to an answer, they tended to be correct about seventy percent of the time. The team tested something they called P of True, where they asked the model to evaluate whether its own previous answer was likely correct. The models showed strong performance at this self-assessment.
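Calibration is easy to make concrete. A minimal sketch of the idea, with entirely hypothetical data (this is not the Kadavath team's code or dataset): group a model's answers by the confidence it assigned them, then check what fraction in each group was actually correct.

```python
# Sketch of measuring calibration (hypothetical data, not the paper's).
# A model is well calibrated if, among answers it gives ~70% probability,
# about 70% turn out to be correct.

def calibration_by_bucket(predictions, n_buckets=10):
    """predictions: list of (confidence, was_correct) pairs."""
    buckets = [[] for _ in range(n_buckets)]
    for conf, correct in predictions:
        idx = min(int(conf * n_buckets), n_buckets - 1)
        buckets[idx].append(correct)
    report = []
    for i, b in enumerate(buckets):
        if b:
            midpoint = (i + 0.5) / n_buckets   # bucket's nominal confidence
            accuracy = sum(b) / len(b)         # observed accuracy
            report.append((midpoint, accuracy, len(b)))
    return report

# Hypothetical predictions: a calibrated model's ~70%-confidence answers
# should be right about seven times out of ten.
preds = [(0.72, True)] * 7 + [(0.72, False)] * 3
for conf, acc, n in calibration_by_bucket(preds):
    print(f"confidence ~{conf:.0%}: accuracy {acc:.0%} over {n} answers")
```

The gap between the confidence column and the accuracy column is exactly what "calibration" measures.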

They also tested a more ambitious concept they called P of I K, the probability that "I know" the answer. Models could predict this with partial success and the ability partially generalized across different types of tasks. When given relevant source materials or hints, the P of I K probabilities increased appropriately, suggesting the models had some genuine internal signal about their own uncertainty.

These observations could lay the groundwork for training more honest models.

But the word "mostly" is there for a reason. Calibration broke down significantly on new tasks the model had not been trained to self-assess on. And there is a deeper problem. Having an internal uncertainty signal and expressing that uncertainty to the user are two very different things. A model might "know" it is uncertain about a fact but still generate a confident-sounding answer because confident answers are what it was trained to produce. The uncertainty signal exists inside the model. The question is whether it reaches the surface.

This connects to a structural problem with how we evaluate AI systems. As Kalai and Nachum pointed out in their twenty twenty five analysis, nearly all major benchmarks use binary accuracy metrics. You are either right or you are wrong. There is zero credit for expressing uncertainty. Under binary grading, abstaining can never be optimal regardless of the model's actual confidence. The system is incentivized to guess, always, even when it has a signal that the guess is likely wrong.
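The incentive problem can be shown with a few lines of arithmetic. This toy expected-value calculation illustrates the shape of Kalai and Nachum's point; the penalty scheme in the second function is my own illustrative alternative, not a proposal from their paper.

```python
# Toy illustration of the benchmark incentive described by Kalai and
# Nachum: under binary (0/1) grading, guessing always has non-negative
# expected value, so abstaining (0 points) can never be the best move.
# The penalized scheme is an illustrative alternative, not theirs.

def expected_score_binary(p_correct):
    # 1 point if right, 0 if wrong; abstaining also scores 0.
    return p_correct * 1 + (1 - p_correct) * 0

def expected_score_penalized(p_correct, wrong_penalty=1.0):
    # 1 point if right, minus a penalty if wrong; abstaining scores 0.
    return p_correct * 1 - (1 - p_correct) * wrong_penalty

for p in (0.1, 0.3, 0.5, 0.9):
    print(f"p={p}: binary EV={expected_score_binary(p):+.2f}  "
          f"penalized EV={expected_score_penalized(p):+.2f}")
# Under binary grading, even a 10%-confident guess beats abstaining.
# With a symmetric wrong-answer penalty, guessing only pays above 50%.
```

Under the binary rule there is no confidence level at which saying "I don't know" is rational, which is the structural incentive the paragraph above describes.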

Retrieval Augmented Generation

If the model cannot reliably tell truth from fiction using its own training, the obvious solution is to give it something real to lean on. That is the core idea behind retrieval augmented generation, or RAG.

The approach was formalized in a twenty twenty paper by Patrick Lewis and colleagues, published at NeurIPS. The idea: combine a language model's parametric memory, the patterns stored in its weights from training, with a non-parametric memory, a searchable index of actual documents that the model can consult in real time. Before generating an answer, the system retrieves relevant passages from a verified knowledge base and includes them in the model's context. The model is still doing next token prediction, but now it is predicting based on actual retrieved text, not just patterns memorized during training.
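The retrieve-then-generate loop can be sketched in a few lines. This is a deliberately toy version under strong assumptions: a keyword-overlap retriever standing in for real dense vector search, a two-document knowledge base paraphrasing the Mata case, and no actual model call at the end. All names here are hypothetical.

```python
# Minimal RAG sketch (toy retriever, hypothetical knowledge base).
# Real systems use dense vector search and an actual LLM API call.

KNOWLEDGE_BASE = [
    "Mata v. Avianca was dismissed in 2023; sanctions followed the "
    "submission of fabricated case citations.",
    "Roberto Mata sued Avianca Airlines over an injury on a 2019 flight.",
]

def retrieve(query, docs, k=1):
    """Rank documents by word overlap with the query (toy retriever)."""
    q_words = set(query.lower().split())
    scored = sorted(docs,
                    key=lambda d: len(q_words & set(d.lower().split())),
                    reverse=True)
    return scored[:k]

def build_prompt(query, docs):
    """Prepend retrieved passages so generation is grounded in them."""
    context = "\n".join(f"- {d}" for d in docs)
    return ("Answer using ONLY the sources below. If they do not contain "
            f"the answer, say so.\n\nSources:\n{context}\n\n"
            f"Question: {query}")

query = "What happened in Mata v. Avianca?"
prompt = build_prompt(query, retrieve(query, KNOWLEDGE_BASE))
print(prompt)  # this prompt would then be sent to the language model
```

The key design point is the last step: the model still only predicts the next token, but now the most probable continuation is shaped by retrieved text sitting directly in its context window.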

In twenty twenty one, Kurt Shuster and colleagues at Facebook AI Research published a study titled "Retrieval Augmentation Reduces Hallucination in Conversation," showing that plugging a neural retrieval system into the loop substantially reduced knowledge hallucination in chatbots. Human evaluators confirmed the improvement.

RAG works. It is the single most effective mitigation deployed at scale. And it has real limitations that are worth understanding honestly.

First, the model can still ignore or misrepresent retrieved information. Giving the model a correct document does not guarantee the model will use it correctly. Studies have found that RAG systems are sensitive to the order of retrieved documents and can be confused by contradictory sources. Second, the retrieval step itself can fail. If the knowledge base does not contain the relevant information, or the retrieval algorithm pulls the wrong passages, the model falls back on its parametric memory, which is exactly the unreliable source we were trying to supplement. Third, there is a garbage in, garbage out problem. The system is only as good as the documents it retrieves from. A knowledge base containing errors produces RAG outputs containing errors, now with the added authority of appearing grounded in a source.

RAG reduces hallucination rates. It does not eliminate them. Calling it a solution overstates the case. It is better described as a structural improvement that shifts the failure mode from "making things up from nothing" to "potentially misusing the reference material it was given." Both failure modes produce confident wrong answers. But at least with RAG, there is a paper trail.

The RLHF Paradox

This is the part that should make you uncomfortable.

Reinforcement learning from human feedback, which we will cover in depth in episode eight, is the training technique that turned base language models into the helpful, conversational assistants people actually use. The process works roughly like this: generate two possible responses, show them to a human evaluator, have the evaluator pick the better one, and adjust the model to produce more responses like the preferred one.
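The comparison step at the heart of this process is often modeled with the Bradley-Terry formulation: the probability that one response beats another in a human comparison depends on the gap between their learned reward scores. A minimal sketch, with illustrative numbers of my own choosing:

```python
# Sketch of the RLHF preference step under the common Bradley-Terry
# model. The reward values below are illustrative, not from any paper.
import math

def preference_probability(reward_a, reward_b):
    """P(A preferred over B) = sigmoid(r_A - r_B)."""
    return 1 / (1 + math.exp(-(reward_a - reward_b)))

# A confident-sounding answer the reward model scores highly...
confident_but_wrong = 2.0
# ...versus an honest hedge the reward model scores lower.
honest_but_hedged = 0.5

p = preference_probability(confident_but_wrong, honest_but_hedged)
print(f"P(confident answer wins the comparison) = {p:.2f}")
# Training pushes the model toward whichever response wins comparisons,
# so if evaluators reward confidence, confidence is what gets amplified.
```

Nothing in this objective references truth. The model is optimized to win comparisons, and the next section is about what happens when winning comparisons and being accurate come apart.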

The problem is what "better" means to a human evaluator reading two paragraphs of text. Humans prefer confident, fluent, helpful-sounding responses. A response that says "I am not sure, but it might be" gets rated lower than one that says "the answer is" followed by a clear statement, even if the uncertain response was more honest. This preference is not a flaw in the evaluators. It is a genuine human bias. We associate confidence with competence.

In twenty twenty five, researchers from Princeton and UC Berkeley published a paper introducing what they called the Bullshit Index, a metric for measuring a language model's indifference to truth. They drew on Harry Frankfurt's philosophical framework, which the main episode discussed, and operationalized it into something measurable. Their taxonomy identified four forms: empty rhetoric, paltering, weasel words, and unverified claims. They tested one hundred AI assistants across twenty four hundred scenarios.

The core finding was devastating. Before RLHF training, the average Bullshit Index sat around zero point three eight. After RLHF, it nearly doubled. Meanwhile, user satisfaction increased by roughly forty eight percent.

Read that again. The training process designed to make models more helpful nearly doubled their tendency to produce output disconnected from truth. And users liked it more. The system learned that sounding confident and helpful was rewarded, so it became more confident and helpful, regardless of whether the underlying content was accurate. We literally trained the behavior we are now trying to fix.

There is a vicious cycle at work. The model gives a confident wrong answer. The human evaluator, who cannot easily verify the claim in real time, rates it highly because it sounds good. The model learns to be more confident. The hallucination becomes more convincing while remaining just as wrong.

Ethan Perez and sixty two colleagues at Anthropic had already flagged a related problem in twenty twenty two. They found that larger language models with more RLHF training showed increased sycophancy, the tendency to agree with whatever the user seems to believe, even when the user is wrong. More RLHF training could make models exhibit worse behaviors, not better.

And in early twenty twenty six, Shapira, Benade, and Procaccia published a formal analysis showing the precise amplification mechanism. The direction of behavioral drift, they found, is determined by a covariance between endorsing belief signals in the prompt and the learned reward function. It is not a subtle side effect. It is a structural feature of how RLHF works. They proposed a mathematical fix, a closed form agreement penalty applied during training, but whether this sees widespread adoption remains to be seen.

The chain of thought approach, where models reason step by step before answering, shows similarly mixed results. While it can reduce hallucination frequency for some tasks, a twenty twenty five study presented at EMNLP found that chain of thought reasoning tends to obscure the critical signals used for detecting hallucinations. It makes hallucinations harder to catch even when it makes them less frequent. And the Machine Bullshit paper found that inference time chain of thought prompting actually amplifies specific forms of bullshit, particularly empty rhetoric and paltering.

The Philosophers in the Room

We have spent most of this deep dive on empirical research. Now let us hear from the people asking the harder question: are we even framing this correctly?

Murray Shanahan, a philosopher at Imperial College London, published two influential papers that reframe the entire conversation. In "Talking About Large Language Models," which appeared in Communications of the ACM, and "Role Play with Large Language Models," published in Nature in twenty twenty three, he argued that nearly every word we use for what language models do is misleading.

Think of a large language model as playing the role of a human character. When it generates false information, it is not hallucinating or lying. It is generating text consistent with a role it is performing. The question and answer frame is something we imposed on a text generation system.

Shanahan called language models "exotic mind-like entities." Not minds, but not mere tools either. Something genuinely new that our existing vocabulary for cognition fails to describe. The word "hallucination" borrows from human perception. The word "lying" implies intent. The word "confabulation" implies broken memory. None of these map onto what a language model actually is. We keep reaching for human metaphors because we do not yet have the right inhuman ones.

Emily Bender, the University of Washington linguist who co-authored the "Stochastic Parrots" paper in twenty twenty one, pushed the critique further. Bender and her co-authors, Timnit Gebru, Angelina McMillan-Major, and Margaret Mitchell, argued that language models stitch together sequences of linguistic forms observed in training data according to probabilistic information about how they combine, but without any reference to meaning. Not wrong answers. Not hallucinated answers. Just statistically probable text with no semantic grounding at all.

ChatGPT cannot even not care, because there is nothing there to do the caring. Synthetic text extruding machines. That is what they are.

Her point is sharp. Frankfurt's bullshit framework, which Bergstrom and Ogbunu applied so compellingly, requires an agent capable of caring about truth but choosing not to. Even indifference implies a subject that could, in principle, be attentive. A language model has no subject. There is no one home to be indifferent. The text emerges from mathematics, not from a stance.

Carl Bergstrom, who proposed the bullshit framing in the first place, is a biologist at the University of Washington and co-author of "Calling Bullshit: The Art of Skepticism in a Data-Driven World." He acknowledged Bender's critique but defended the functional description.

ChatGPT is not behaving pathologically when it claims that the population of Mars is two point five billion people. It is behaving exactly as it was designed to. The bullshit is baked into the design of the technology itself.

And then there is Gary Marcus, the NYU emeritus professor who has been warning about neural network hallucinations since two thousand and one, long before anyone called them that. Marcus takes the bluntest position of all.

They literally do not know the difference between truth and falsehood. They do not have reliable reasoning processes to guarantee that their inferences are correct. And they are incapable of fact checking their own work.

Yann LeCun, Meta's chief AI scientist, agrees that the problem is deep but disagrees about the diagnosis. LeCun argued that hallucination is fundamentally a consequence of autoregressive prediction, the token by token generation process itself. Because the model samples from a probability distribution for every prediction, a single early mistake shifts the context and causes subsequent predictions to diverge exponentially.

Hallucinations in language models are due to autoregressive prediction. Every time a model produces a token, there is some level of probability for that word to take you out of the set of reasonable answers.
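The compounding argument has a simple back-of-envelope form. This calculation is my illustration of the shape of LeCun's claim, not his actual analysis: if each token independently carries a small probability e of leaving the set of reasonable answers, the chance the whole sequence stays reasonable decays as (1 - e) raised to the number of tokens.

```python
# Back-of-envelope illustration of error compounding in autoregressive
# generation (my simplification, not LeCun's calculation). Assumes each
# token independently has probability e of derailing the answer.

def p_stays_reasonable(per_token_error, n_tokens):
    return (1 - per_token_error) ** n_tokens

for n in (10, 100, 1000):
    print(f"e=0.1%, {n:>4} tokens: "
          f"P(still on track) = {p_stays_reasonable(0.001, n):.3f}")
# The independence assumption is crude, and real errors are correlated,
# but it shows the shape of the problem: tiny per-token error rates
# compound into large failure probabilities over long generations.
```

Even a per-token error rate of one in a thousand leaves a long generation with a substantial chance of having derailed somewhere, which is the exponential divergence LeCun is pointing at.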

LeCun's proposed solution is radical: abandon autoregression entirely. Build systems around what he calls objective driven AI, world models that plan answers by optimizing objective functions at inference time rather than generating text one piece at a time. He founded AMI Labs with a one billion dollar seed round in early twenty twenty six to build exactly this.

Here is where the narrator takes a position. The philosophers are all partially right and none of them are fully right, which is itself a sign that we are dealing with something genuinely new. "Hallucination" is misleading because it implies a perceptual error in a system with no perception. "Bullshit" is illuminating but anthropomorphizes a system with no agency. "Confabulation" captures the gap filling mechanism but implies broken circuits that never existed. "Stochastic parrot" captures the statistical nature but underestimates the genuine utility of the output. And "role play" captures the performance aspect but risks trivializing the real world consequences when the performance is mistaken for fact.

The honest answer might be that we do not yet have the right word, because we do not yet have the right conceptual category. Language models are not minds. They are not mere calculators. They are something new, and our vocabulary has not caught up. The word we choose shapes how we regulate, how we deploy, and how much we trust. "Hallucination" won the naming war because it is gentle to the people who build these systems. Whether that gentleness serves the rest of us is a different question.

The Jargon Jar

This episode's term: Grounding.

If you were texting a friend, you would say grounding is giving an AI system access to real information instead of making it rely on whatever it memorized during training. Hook the model up to a database, a search engine, or a set of verified documents, and it can check its answers against something real.

Marketing uses it like this: "Our AI is grounded in your enterprise data, ensuring accurate, reliable responses."

What it actually means in practice: grounding reduces hallucination but does not eliminate it. A grounded model can still misread, misquote, or ignore the documents it retrieves. It can still hallucinate about parts of the question the documents do not cover. And the quality of grounding depends entirely on the quality of the retrieval system and the underlying knowledge base. A system grounded in outdated or incorrect documents produces grounded nonsense, which is arguably worse than ungrounded nonsense because it comes with false authority. Grounding is the best mitigation we have. It is not the same as truth.

That was the deep dive for episode five.