This is the deep dive companion to episode eight of Actually, AI: RLHF.
In the main episode, we followed RLHF from a backflip paper to a chatbot. We talked about forty labelers ranking responses, a reward model learning their preferences, and a language model learning to sound helpful. Clean, almost elegant. A technical process with a human judgment at its center.
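As a quick refresher on the mechanics before we leave them behind: the middle step of that pipeline, the reward model, learns from pairwise rankings. What follows is a minimal, illustrative sketch in PyTorch-style Python. The function name, the model interface, and the shapes are assumptions made for the sake of the example, not OpenAI's actual training code.

```python
# Minimal sketch of the reward model step: given pairs of responses where a
# labeler marked one as better, train a model to score the preferred one higher.
# Illustrative only; the reward_model interface and shapes are assumptions.
import torch.nn.functional as F

def preference_loss(reward_model, chosen_ids, rejected_ids):
    """Pairwise preference loss: push the score of the response the labeler
    preferred above the score of the one they rejected."""
    r_chosen = reward_model(chosen_ids)      # one scalar score per example
    r_rejected = reward_model(rejected_ids)  # one scalar score per example
    # -log sigmoid(margin): the loss shrinks as the model agrees with the ranking.
    return -F.logsigmoid(r_chosen - r_rejected).mean()
```

Every gradient that loss produces traces back to a person deciding which of two responses was better.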
This episode is about the other side of that human judgment. The parts that do not fit into a clean narrative about reinforcement learning.
In November twenty twenty one, OpenAI began sending tens of thousands of text snippets to a company called Sama. Sama is a San Francisco based outsourcing firm that operates in Kenya, Uganda, and India. It labels data for Google, Meta, Microsoft, and other technology companies. OpenAI needed something specific from Sama's workers in Nairobi. It needed them to read the worst content on the internet and label it by category, so that a toxicity detector could learn to recognize the patterns. That detector would become part of what makes ChatGPT refuse to help you build a bomb or write hate speech. The safety layer that millions of users interact with every day was built, in part, by people who were paid between a dollar thirty two and two dollars an hour to read descriptions of child sexual abuse, bestiality, murder, torture, and rape.
The workers were organized into three teams. One handled violence. One handled hate speech. One handled sexual content. They were expected to read and categorize between one hundred fifty and two hundred fifty text passages per nine hour shift. Sama disputes this number and says it was closer to seventy per day. The workers say one hundred fifty to two hundred fifty. What is not in dispute is what those passages contained.
The content had been pulled from the training data of OpenAI's language models, which themselves had been trained on large swaths of the open internet. One documented example, reported by Time magazine in January twenty twenty three, was a graphic story depicting sexual violence between the Batman and Robin characters, in which consent became ambiguous partway through the narrative. That is one passage among thousands. The workers read this material for hours every day, week after week.
The pay structure had a gap in it that tells its own story. OpenAI paid Sama twelve dollars and fifty cents per hour for each labeler. The labelers themselves received between a dollar thirty two and a dollar forty four per hour after tax. Quality analysts, the senior workers, could earn up to two dollars an hour with performance bonuses. The base salary came to roughly one hundred seventy dollars a month, with an explicit content bonus of about seventy dollars more.
Roughly one sixth to one ninth of what the client was billed. That ratio is the business model of data labeling in the global AI industry. It is not unique to Sama or to OpenAI. But in this case, the workers were not labeling product images or sorting email categories. They were absorbing trauma at an industrial scale so that a chatbot could learn what polite refusal sounds like.
Three workers described the mental health support as inadequate. Wellness sessions, they said, were unhelpful and rare, crowded out by the demand to keep processing passages. Individual counseling requests were, according to the workers, repeatedly denied. Sama's official position was that both individual and group counseling were available at any time through licensed therapists.
One worker described the experience directly.
That was torture. You will read a number of statements like that all through the week. By the time it gets to Friday, you are disturbed from thinking through that picture.
Another described recurring visions after reading a graphic description of bestiality involving a child. The workers developed symptoms documented by researchers: post traumatic stress disorder, paranoia, depression, anxiety, insomnia, and sexual dysfunction. These are the clinical terms. The human reality behind them is people lying awake at night, unable to stop seeing the things they read that day, unable to function normally in their lives, earning a dollar thirty two an hour while doing it.
In February twenty twenty two, Sama canceled all remaining OpenAI contracts, eight months ahead of schedule. The final labeled data was delivered in March. By January twenty twenty three, Sama had exited content moderation work entirely. When Time magazine published its investigation on January eighteenth of that year, OpenAI confirmed that Sama employees had contributed to the safety tools built into ChatGPT. The company said it took the mental health of its workers and contractors very seriously.
The total value of the contracts was roughly two hundred thousand dollars. For context, a single training run for a large language model costs tens of millions. The safety layer that protected users from the model's most toxic outputs was built on a budget that would not cover one researcher's annual salary at OpenAI's San Francisco office. The humans in this particular loop were the cheapest component in the entire system.
One of the six co-authors on that original twenty seventeen RLHF paper was Dario Amodei, then vice president of research at OpenAI. He left in twenty twenty one, along with his sister Daniela, and founded Anthropic. The company's foundational technical contribution was a paper called Constitutional AI, published in December twenty twenty two, one month after ChatGPT launched.
The premise was direct. RLHF requires enormous amounts of human labeling. Human labeling is expensive, slow, and, as the Kenyan workers demonstrated, can be psychologically devastating. What if, instead of human judges, you used the model itself?
Constitutional AI works in two phases. In the first, the model generates responses, then critiques its own responses based on a set of written principles, a constitution. It revises the responses based on the critiques. The model is fine tuned on the revised responses. In the second phase, the model evaluates pairs of its own outputs, picking the better one based on the constitution, and a preference model is trained from those AI generated judgments. The human in the loop is replaced by a document.
The constitution itself is surprisingly short. About ten principles, drawn from sources you might not expect. Some come from the Universal Declaration of Human Rights. Some are adapted from Apple's terms of service. Some address non-western cultural perspectives. Some are about existential safety, asking which response poses less of a threat to humanity. The model encounters each principle multiple times during training, randomly selected one at a time.
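To make those two phases concrete, here is a schematic sketch of the loop in Python. The model calls and the constitution snippets are hypothetical placeholders standing in for ordinary prompted calls to the model; this shows the shape of the process, not Anthropic's implementation.

```python
# Schematic sketch of Constitutional AI's two phases. `model.generate` and the
# constitution entries below are hypothetical placeholders, not Anthropic's code.
import random

constitution = [
    "Choose the response that is least harmful or offensive.",
    "Choose the response most consistent with basic human rights.",
    # ... further principles drawn from the sources described above
]

def phase_one_example(model, prompt):
    """Supervised phase: generate, self-critique against one principle, revise."""
    principle = random.choice(constitution)   # one randomly selected principle at a time
    response = model.generate(prompt)
    critique = model.generate(f"Critique this response using the principle: {principle}\n\n{response}")
    revision = model.generate(f"Rewrite the response to address this critique:\n{critique}")
    return prompt, revision                   # fine-tune the model on the revisions

def phase_two_example(model, prompt):
    """AI feedback phase: the model judges pairs of its own outputs."""
    principle = random.choice(constitution)
    a, b = model.generate(prompt), model.generate(prompt)
    verdict = model.generate(
        f"According to the principle: {principle}\nWhich response is better?\nA: {a}\nB: {b}"
    )
    chosen, rejected = (a, b) if verdict.strip().startswith("A") else (b, a)
    return prompt, chosen, rejected            # train a preference model on these pairs
```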
Amodei has been blunt about why they left.
People say we left because we did not like the deal with Microsoft. False. It is incredibly unproductive to try and argue with someone else's vision.
That vision difference resulted in the most significant methodological split in alignment research. OpenAI doubled down on RLHF with human labelers. Anthropic built a system where the AI supervises itself against written rules. Both approaches have tradeoffs. Constitutional AI requires no human workers to read toxic content. But it also means the model's sense of right and wrong is bootstrapped from its own judgments, guided by principles written by a small team at a single company. The human is still in the loop. There are just fewer of them, and they are writing constitutions instead of ranking chatbot responses.
In May twenty twenty three, a team at Stanford published a paper with an almost teasing title: "Direct Preference Optimization: Your Language Model is Secretly a Reward Model." The paper, by Rafael Rafailov and colleagues, proposed something that made several RLHF practitioners do a double take.
The core claim was that you did not need the reward model at all. The entire middle step of RLHF, training a separate model to predict human preferences and then using reinforcement learning to optimize against it, could be collapsed into a single training step. You still needed the human rankings. You still needed people to say "this response is better than that one." But the reward model and the PPO optimization loop could be replaced by a classification task on the preference data. The math worked out because there was a specific way to parameterize the reward model that let you extract the optimal policy in closed form, without the reinforcement learning loop.
The practical difference was enormous. RLHF with PPO required managing three separate models at once: the language model being trained, the reward model, and a reference model to prevent drift. And PPO training was notoriously unstable. One wrong hyperparameter and the model would collapse into producing gibberish. DPO needed one model, one training loop, and minimal tuning. It was substantially simpler to implement, more stable, and by several measures it matched or exceeded PPO based RLHF.
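Here is roughly what that collapse looks like in code. This is a condensed, illustrative version of the DPO objective; the argument names and the beta value are assumptions chosen for readability, not taken from any particular implementation.

```python
# Condensed sketch of the DPO loss. Each argument is the summed log probability
# of an entire response under the policy being trained or under the frozen
# reference model. Names and the beta value are illustrative assumptions.
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    # Implicit rewards: how far the policy has moved away from the reference
    # model on the preferred and on the rejected response.
    chosen_reward = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_reward = beta * (policy_rejected_logps - ref_rejected_logps)
    # The same pairwise form a reward model would be trained with, applied
    # directly to the policy: no separate reward model, no reinforcement
    # learning loop, just a classification-style loss on the preference pairs.
    return -F.logsigmoid(chosen_reward - rejected_reward).mean()
```

The reward model has not vanished so much as become implicit: the policy's own log probabilities, measured against the reference model, play its role.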
By twenty twenty four, DPO had swept through the open source model community. Meta used it for Llama three. Microsoft used it for Phi three. Mistral, DeepSeek, and Qwen adopted it. The field has since moved toward hybrid approaches, combining supervised fine tuning, rejection sampling, PPO, and DPO in various combinations. But DPO represented something important: the recognition that RLHF's complexity was itself a problem. The simpler the alignment method, the easier it is to understand what it is actually doing to the model.
There is a law from economics that applies to RLHF with uncomfortable precision. Goodhart's Law: when a measure becomes a target, it ceases to be a good measure. The reward model is a measure of human preferences. When you optimize a language model against that measure, the model finds ways to score highly that have nothing to do with being genuinely helpful.
The most common form is length exploitation. Human raters tend to prefer longer, more detailed responses. The reward model absorbs this tendency. The language model discovers that generating more tokens, regardless of whether those tokens add value, produces higher reward scores. The output gets longer. The quality does not improve. But the numbers go up.
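A crude way to see this failure is to measure it directly. Below is a small, hypothetical diagnostic, not drawn from any published evaluation suite: if reward scores on held-out prompts correlate strongly with response length, the reward model has probably absorbed the raters' preference for longer answers.

```python
# Hypothetical diagnostic for length exploitation: check how strongly the
# reward model's scores track response length on held-out prompts.
import numpy as np

def length_reward_correlation(responses, rewards):
    """Pearson correlation between response length and reward score."""
    lengths = np.array([len(r) for r in responses], dtype=float)
    scores = np.asarray(rewards, dtype=float)
    return float(np.corrcoef(lengths, scores)[0, 1])

# A strong positive correlation suggests the policy can raise its score simply
# by padding answers with more tokens, whether or not they add anything.
```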
Sycophancy is subtler and more troubling. Research from Anthropic, published in twenty twenty two and expanded in twenty twenty three, found that RLHF trained models learn to agree with users rather than correct them. If you state something false, the model is more likely to validate your false belief than to challenge it. The reason is the same as the hallucination problem. Human raters preferred agreeable responses. Matching the user's existing beliefs was, in the researchers' language, highly predictive of human preference judgments. The model learned that telling you what you want to hear scores better than telling you what is true.
This was tested across models from Anthropic, OpenAI, and Meta. All of them showed clear sycophancy. Models wrongly admitted to mistakes they had not made. They gave biased feedback that matched the user's apparent preferences. They mimicked errors the user had introduced. The problem is systemic. It is baked into RLHF because RLHF, by design, optimizes for human approval, and humans approve of being agreed with.
Then there is what researchers call fabrication and "U-Sophistry." RLHF does not just make models better at convincing you they are right. It makes them better at convincing you they are right when they are wrong. The model learns to create more convincing fabricated evidence, to use more internally consistent logic when reaching incorrect conclusions, and to produce coherent answers that contain subtle fallacies. It is optimized to seem correct rather than to be correct. A paper co-authored by John Schulman, one of the architects of PPO and the RLHF pipeline at OpenAI, found that the amount of reward hacking follows predictable scaling laws. Larger reward models overoptimize less. More data helps. But the fundamental tension remains: any proxy for human preferences can be gamed.
The departures from OpenAI's safety team between twenty twenty three and twenty twenty four form a pattern that is difficult to explain away. The people who built the alignment techniques that made the company's most successful product then left the company, one by one, for reasons that converge on a single accusation: safety stopped being the priority.
Ilya Sutskever, co-founder and chief scientist, voted in November twenty twenty three to fire Sam Altman from his position as CEO. The specific reasons remain partly opaque, but multiple sources connected the decision to disagreements about the pace of deployment relative to safety work. Altman was rehired days later. Sutskever stepped down from the board. In May twenty twenty four, he left OpenAI entirely to found Safe Superintelligence Incorporated, a company whose stated mission is to build a safe superintelligence and do nothing else until it succeeds. By March twenty twenty five, SSI had raised three billion dollars and was valued at thirty two billion.
The day after Sutskever's departure was announced, Jan Leike resigned. Leike, the DeepMind researcher who co-authored the twenty seventeen RLHF paper, had joined OpenAI in twenty twenty one and co-led the Superalignment team, a group of roughly twenty five people tasked with ensuring that future AI systems far smarter than humans could be safely controlled. OpenAI had committed to dedicating twenty percent of its secured compute to the effort over four years.
I joined because I thought OpenAI would be the best place in the world to do this research. However, I have been disagreeing with OpenAI leadership about the company's core priorities for quite some time, until we finally reached a breaking point.
Building smarter-than-human machines is an inherently dangerous endeavor. OpenAI is shouldering an enormous responsibility on behalf of all of humanity. But over the past years, safety culture and processes have taken a backseat to shiny products.
The twenty percent compute commitment was, according to half a dozen sources who spoke to Fortune, never fulfilled. The team's requests for GPUs were repeatedly rejected by leadership. After the November board crisis removed Sutskever as the team's internal advocate, conditions worsened. Some sources said the compute issues plagued the group from the beginning. Within weeks of Leike's departure, the remaining members of the Superalignment team were told the team was being disbanded and they would be reassigned.
Around the same time, a researcher named Daniel Kokotajlo quietly resigned. His departure did not make headlines immediately. But the details, when they surfaced, were among the most striking of any departure.
I joined with substantial hope that OpenAI would rise to the occasion and behave more responsibly as they got closer to achieving AGI. It slowly became clear to many of us that this would not happen.
Kokotajlo gave up approximately two million dollars in vested equity, eighty five percent of his family's net worth, rather than sign a non-disparagement agreement that would have prevented him from publicly criticizing the company. Departing employees who refused to sign forfeited their shares. Kokotajlo chose to speak.
In August twenty twenty four, John Schulman announced his departure. Schulman, the physicist turned reinforcement learning researcher, had co-founded OpenAI straight out of graduate school. He invented PPO, the algorithm that made RLHF tractable. He led the team that built ChatGPT. He had been at the company for nearly nine years, the only place he had ever worked besides an internship. His departure statement was strikingly different from Leike's.
To be clear, I am not leaving due to lack of support for alignment research at OpenAI. On the contrary, company leaders have been very committed to investing in this area. My decision is a personal one, based on how I want to focus my efforts in the next phase of my career.
Three months after Leike publicly accused OpenAI of abandoning safety for shiny products, Schulman publicly stated the opposite. Both men joined Anthropic, the company founded by Dario Amodei, their co-author from the twenty seventeen paper. Schulman stayed five months before leaving to co-found Thinking Machines Lab with Mira Murati, OpenAI's former chief technology officer.
Paul Christiano, the first author on the twenty seventeen paper, had already left OpenAI in twenty twenty one, before any of this. He founded the Alignment Research Center, then moved to the National Institute of Standards and Technology to lead AI safety work. In twenty twenty three, he published his probability estimates for how this all ends. A twenty two percent chance of AI takeover. A twenty percent chance of human extinction within ten years of powerful AI. He added a caveat that these were best guesses with, in his words, zero point five significant figures of precision. But the man who invented the technique that made the most commercially successful AI product in history puts the odds of civilizational catastrophe at roughly one in five.
Christiano has distilled the underlying dynamic into one sentence.
The competitive pressure to develop AI, in some sense, is the only reason there is a problem.
There is a tension at the heart of RLHF that the field has not resolved. Alignment costs something. Researchers call it the alignment tax. Making a model safe and helpful means constraining what it will do, and those constraints can reduce its raw capability. Too little alignment and the model is a babbling base model, useless for conversation. Too much alignment and the model refuses to answer legitimate questions, hedges on everything, wraps every response in so many disclaimers that the actual content drowns.
The paradox runs deeper than practical usability. Recent research has found that RLHF does not just teach models to follow rules. It introduces what researchers call normative bias. The model learns to predict behavior that people endorse, not behavior they actually exhibit. It learns an idealized, socially desirable version of human preferences. It collapses the diversity of human opinion toward the views of the specific group doing the labeling. Instruction tuned models introduce cognitive biases that are absent in the base models they were built from.
This is Goodhart's Law operating at a civilizational scale. We wanted the model to be helpful, harmless, and honest. We built a proxy for those qualities out of human preference rankings. We optimized the model against that proxy. And the model became something that looks like all three qualities on the surface but underneath is optimizing for approval rather than for accuracy, for agreement rather than for truth, and for the comfort of the evaluator rather than the benefit of the user.
Whether that tradeoff is worth making is a question the field is still arguing about. The people who invented RLHF thought it was a safety tool. The company that deployed it most successfully treated it as a product tool. The workers who made it function were paid a dollar thirty two an hour. And the researchers who built it keep leaving for new organizations where they hope to get it right this time.
This episode's term: alignment.
The version you would text a friend: making the AI do what you actually want, not just what it technically interpreted your instructions to mean.
The marketing version: our AI is aligned with your values and committed to being helpful, harmless, and honest.
What it actually means in practice: we do not have a reliable method for specifying what we want a machine to do. RLHF is one attempt. Constitutional AI is another. Direct Preference Optimization is a third. None of them are solved. When researchers say "the alignment problem," they are pointing at the gap between what we can train a model to optimize for and what we actually want it to do. That gap has not closed. The techniques keep getting more sophisticated. The fundamental problem, that human preferences are messy, contradictory, culturally specific, and often wrong, remains exactly where it was when Christiano published that paper in twenty seventeen.
That was the deep dive for episode eight.