Actually, AI
R L H F: The Personality Layer
S1 E8 · Apr 04, 2026
When Claude refuses to help with a fictional crime scene but cheerfully explains how to synthesize dangerous chemicals, the culprit is RLHF—the four-letter process that transforms raw language models into the assistants you use daily, complete with all their quirks and blind spots.


The Same Machine, Twice

This is episode eight of Actually, AI.

Here is a party trick you can try if you ever get access to a raw, unfinished language model. The kind of model that comes straight out of training, before anyone has polished it. Type a question. Something simple. "What is the capital of France?"

A finished model, the kind you use every day, will say Paris. Helpfully. Politely. Maybe with a little extra context about the Eiffel Tower.

The raw model will not answer your question. It might continue your text as though you are writing a quiz. "What is the capital of France? What is the capital of Germany? What is the capital of Spain?" Or it might generate a paragraph from a travel blog that happens to mention Paris, followed by a restaurant recommendation that does not exist. Or it might produce something offensive, because the internet it trained on is full of offensive things, and it has no concept of what it should or should not say.

Same architecture. Same training data. Same billions of parameters. The difference between the chaotic raw model and the helpful assistant you talk to every day is a process with four letters and an almost comically modest name. R L H F. Reinforcement Learning from Human Feedback. It sounds like a line item in a machine learning textbook. It is the reason AI assistants exist as products.

And the story of how it works, who built it, and what it costs is stranger and darker than most people realize.

The Backflip Paper

In the summer of twenty seventeen, a twenty seven year old researcher named Paul Christiano was finishing his PhD at Berkeley and working part time at OpenAI. Christiano was not trying to build a chatbot. He was trying to solve a problem that had nagged the reinforcement learning community for years. How do you tell a machine what you want it to do, when what you want is too complicated to write down as a mathematical formula?

Think about teaching a simulated robot to do a backflip. You could try to write a reward function, a set of rules that gives the robot points when it does the right thing. But what does "the right thing" look like, mathematically? You would need to specify the angle of launch, the rotation speed, the landing position, the stability at each point. Get one term wrong and the robot finds a loophole. It might spin wildly in place and score higher than an actual backflip, because your formula did not account for what a backflip is supposed to look like. It only accounted for what you remembered to measure.
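
If you want to see that loophole in miniature, here is a toy hand-written reward in Python. Every state field and weight is invented for illustration; the point is only that the formula rewards what you remembered to measure, not what you meant.

```python
# A hand-coded "backflip" reward of the kind described above.
# Every state field and weight here is invented for illustration.

def naive_backflip_reward(state: dict) -> float:
    """Score one moment of robot behavior with a hand-written formula."""
    return (
        2.0 * abs(state["rotation_speed"])       # reward fast rotation...
        + 1.0 * state["feet_on_ground"]          # ...and stable footing
        - 0.5 * abs(state["horizontal_drift"])   # ...without sliding away
    )

# A real backflip spends most of its time mid-air, feet off the ground.
real_flip = {"rotation_speed": 3.0, "feet_on_ground": 0.0, "horizontal_drift": 0.2}

# A robot spinning wildly in place keeps its feet planted and never drifts,
# so it out-scores the behavior the formula was supposed to encourage.
spin_in_place = {"rotation_speed": 6.0, "feet_on_ground": 1.0, "horizontal_drift": 0.0}

print(naive_backflip_reward(real_flip))      # 5.9
print(naive_backflip_reward(spin_in_place))  # 13.0
```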

Christiano's insight was simple. Humans cannot write down what a good backflip looks like. But they can watch two video clips and say "that one is better." You do not need a formula. You need a judge.

His team built a system where humans watched short clips of robot behavior and picked the better one. A separate model, called a reward model, learned to predict those preferences. Then the robot optimized its behavior against the reward model instead of against a hand coded formula. Nine hundred bits of human feedback. Less than one hour of human time. The robot learned to do backflips.
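
A minimal sketch of that judging setup, assuming the standard pairwise-preference formulation: the reward scores here would come from a small neural network in the real system, and the training signal simply makes the clip the human picked the more probable choice.

```python
import torch

# The chance a human prefers clip A over clip B, given the reward model's
# per-step scores for each clip (the standard pairwise formulation).
def prefer_a_probability(rewards_a: torch.Tensor, rewards_b: torch.Tensor) -> torch.Tensor:
    return torch.sigmoid(rewards_a.sum() - rewards_b.sum())

# Training signal for the reward model: make the clip the human actually
# picked come out as the more probable choice.
def preference_loss(rewards_a: torch.Tensor, rewards_b: torch.Tensor,
                    human_picked_a: bool) -> torch.Tensor:
    p_a = prefer_a_probability(rewards_a, rewards_b)
    return -torch.log(p_a if human_picked_a else 1 - p_a)
```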

The paper was published at NeurIPS twenty seventeen. Six co-authors. Among them: Jan Leike from DeepMind and Dario Amodei from OpenAI. Remember those three names. Christiano, Leike, Amodei. They invented this technique together. Within seven years, they would be at three different organizations, each convinced the others were not handling the technique's implications carefully enough.

Three Steps to a Chatbot

The backflip paper was about robots. The leap to language models happened five years later, in a paper called InstructGPT. The idea was the same. The scale was not.

Step one. Take a raw language model, the kind that babbles and continues your text like an autocomplete engine. Give it about fourteen thousand examples of good behavior. A human writes a question, then writes the answer the model should have given. Fine tune the model on these examples. Now it has a rough sense of what a helpful response looks like.
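
As a sketch of what that fine tuning optimizes, assuming a toy causal language model that maps token ids straight to next-token logits (real systems route this through a training framework, but the loss is the same idea):

```python
import torch
import torch.nn.functional as F

# Supervised fine-tuning in miniature. Each example is a prompt concatenated
# with the answer a human wrote; the model is graded on predicting the answer.

def sft_loss(model, prompt_ids: torch.Tensor, answer_ids: torch.Tensor) -> torch.Tensor:
    """Cross-entropy on the human-written answer, given the prompt as context."""
    input_ids = torch.cat([prompt_ids, answer_ids])      # one long sequence
    logits = model(input_ids.unsqueeze(0)).squeeze(0)    # (seq_len, vocab)

    # Predict each token from the ones before it; only score the answer span,
    # so the model is graded on writing the response, not on echoing the prompt.
    targets = input_ids[1:]
    preds = logits[:-1]
    answer_start = len(prompt_ids) - 1
    return F.cross_entropy(preds[answer_start:], targets[answer_start:])
```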

Step two. Show the fine tuned model a question and have it generate four to nine different answers. A human reads all of them and ranks them from best to worst. Not scored. Ranked. "This one is better than that one." Thousands of these rankings, covering tens of thousands of prompts. A separate model, the reward model, learns to predict the rankings. It learns what humans tend to prefer.
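
In code, one ranking fans out into many pairwise comparisons. A minimal sketch, where `reward_model` stands in for any network that scores a prompt and answer with a single number (the real one is a language model with a scoring head):

```python
import itertools
import torch
import torch.nn.functional as F

# One prompt, several candidate answers, one human ranking from best to worst.
# `reward_model` is a stand-in: it just needs to return a scalar tensor score.

def reward_model_loss(reward_model, prompt: str, ranked_answers: list) -> torch.Tensor:
    """Turn one human ranking into pairwise 'this beats that' training signal."""
    losses = []
    # Every pair drawn from the ranking is a comparison the earlier answer won.
    for winner, loser in itertools.combinations(ranked_answers, 2):
        score_winner = reward_model(prompt, winner)
        score_loser = reward_model(prompt, loser)
        # Push the winner's score above the loser's (a Bradley-Terry objective).
        losses.append(-F.logsigmoid(score_winner - score_loser))
    return torch.stack(losses).mean()
```

This is also why ranking is efficient: four answers yield six comparisons, nine answers yield thirty six, all from one pass of human attention.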

Step three. Use reinforcement learning to adjust the original model so that its outputs score highly on the reward model. Not too aggressively. There is a leash, a mathematical constraint that prevents the model from drifting too far from what it knew before. But within that range, the model learns to produce responses that the reward model predicts humans would like.
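
The leash has a simple shape. A sketch of the quantity being pushed up for one generated response, with illustrative names and an illustrative weight; production systems wrap this in PPO, but reward minus a KL penalty is the heart of it.

```python
import torch

# The score step three maximizes for one generated response. `reward` is the
# reward model's verdict; the KL term is the leash that keeps the tuned model
# close to the model it started from.

def rlhf_objective(reward: torch.Tensor,
                   logprobs_tuned: torch.Tensor,     # per-token log-probs, tuned model
                   logprobs_original: torch.Tensor,  # same tokens, frozen original model
                   kl_weight: float = 0.02) -> torch.Tensor:
    # How far the tuned model has drifted from the original on this response.
    kl_penalty = (logprobs_tuned - logprobs_original).sum()
    # High reward is good; drifting far from the original model costs points.
    return reward - kl_weight * kl_penalty
```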

That is RLHF. The entire process. A pattern completion engine goes in. An assistant comes out. And the critical thing to understand is what the model has learned. It has not learned what is true. It has not learned what is correct. It has learned what humans preferred. The gap between those things is where most of the interesting problems live.

The Personality Machine

The consequences of this process are everywhere, once you know where to look. RLHF is why ChatGPT opens with a greeting and structures its answers with headers and bullet points. No one programmed that behavior. Human raters preferred organized responses, so the reward model learned to score them highly, so the model learned to produce them.

It is also why the model sometimes refuses to answer perfectly harmless questions. The human raters penalized certain topics. They were instructed to be cautious about anything that could be harmful. The boundaries of "could be harmful" were drawn by a few dozen people making judgment calls in real time, and the model absorbed every one of those judgment calls as gospel.

And here is the connection to episode five. RLHF can make hallucination worse. The InstructGPT paper found that the RLHF trained model actually hallucinated more than the supervised fine tuned version alone. The reason is devastatingly logical. Human raters preferred confident, complete sounding answers. When the model hedged or said "I am not sure," it scored lower. So the reward model learned that confidence sounds good. And the model learned to sound confident even when the underlying patterns were thin.

It did not learn to be more accurate. It learned to sound more accurate. If you have ever noticed that your chatbot states wrong things with the same breezy confidence as right things, now you know why. It was trained to do that. Not on purpose. But the system optimized for what humans said they preferred, and humans, it turns out, prefer a confident wrong answer over an uncertain right one.

The labelers themselves agreed with each other about seventy three percent of the time. In roughly three out of ten comparisons, the raters disagreed about which response was better. The model was not learning human preferences. It was learning the preferences of a specific group of about forty people, most of them college educated, primarily from the United States and Southeast Asia, who agreed with each other roughly three quarters of the time. That is the "H" in R L H F.

And here is one final detail that puts the whole story in perspective. In May twenty twenty-four, Jan Leike, who co-authored the original twenty seventeen RLHF paper and had led OpenAI's Superalignment team, resigned publicly.

"Over the past years, safety culture and processes have taken a backseat to shiny products," he wrote.

The person who helped build the technique that made ChatGPT possible left because he believed the company was not using it carefully enough.

The Thread

Every concept in this series connects to the others, and RLHF sits at a crossroads. It is applied after the pretraining we covered in episode three. It is one of the reasons hallucination, from episode five, behaves the way it does. It is cheap relative to pretraining, which connects to the scaling economics of episode seven. And the question of how you measure whether RLHF worked, whether the model is actually better, feeds directly into the benchmarks story coming in episode twelve.

But there is something this episode has only gestured at. The human in the loop. We talked about forty labelers at OpenAI making quality judgments. We did not talk about the other humans. The ones who were paid a dollar and thirty two cents an hour to read descriptions of child sexual abuse, torture, and murder, so that the model could learn what not to say.

That story, and the story of the researchers who built RLHF and then left the companies that deployed it, is in the deep dive. It is harder listening, and it matters.

That was episode eight. The deep dive companion goes further into the human cost of the human in the loop. Find it right after this in your feed.