The Model That Forgot How to Stop

It Locked Into a Loop and Would Not Quit

You loaded one of those models with the strange name, a heretic, abliterated, squeezed down to two point six bits. You asked it something ordinary, about tagging your adapters, and it answered, and then it kept answering, and then it stopped saying anything new and began alternating between two phrases, back and forth, forever, like a needle stuck in a groove. You found it fascinating, and you should, because what you watched was not a glitch. It was the visible aftermath of two separate acts of surgery on the model, and both of them are worth understanding, because you run models like this on purpose.

The two wounds are independent and they compounded. One came from removing the model's ability to refuse you. The other came from crushing the model small enough to fit on your machine. Each one alone is survivable. Together they produced a thing that did not know when its own thought had ended. Let us take them one at a time, because the first one is genuinely strange and tells you something deep about how these models hold ideas at all.

A Refusal Is a Direction

Inside one of these models, a concept is not stored in a box with a label. It is a direction. Picture an enormous space with thousands of axes, and as the model thinks, its state is a point drifting through that space. Researchers found something remarkable. The model's willingness to refuse a request, to say it cannot help, is carried very largely along a single direction in that space. Push the model's state along that direction and it refuses almost anything. Take that direction away, flatten it so the state can no longer move along it, and the model loses the very ability to refuse. It is not persuaded not to refuse. It is rendered unable to form the thought.

That is what abliteration is. Someone measures the refusal direction by comparing how the model lights up on requests it would reject against ones it would accept, isolates the difference, and then surgically removes that direction from the model's weights so it can never travel that way again. The serious name for one careful version is magnitude-preserving orthogonal ablation, which means, cut out this one direction cleanly without disturbing the lengths of everything else. The reported results are stark, refusal rates falling from the nineties down to the teens, while the model otherwise behaves much like before. A clean lobotomy of a single capacity.

The Wound You Did Not Mean to Make

Here is the part that explains your stuck needle. Those directions are not tidy. They are tangled. And refusing is, quietly, a kind of stopping. When the model says it cannot help, it is also ending the conversation, putting down the pen, signaling completion. So the direction for refusal sits right up against the direction for finishing, for the sense that a thought is complete and it is time to stop generating. When the surgeon cuts out refusal, the blade nicks the neighbor. The model loses not just the instinct to decline, but some of the instinct to ever conclude. It no longer reliably feels the moment when a sentence is a good place to put down the pen. So every sentence becomes a ramp into the next one. Nothing in it pulls toward an ending anymore, because you cut out the thing that knew what an ending was.

That is the never-ending part. Now the losing-the-plot part, which is the second surgery. To fit a huge model onto your machine you crushed the precision of its numbers down to a little over two bits each, where a normal one carries sixteen. Every weight is now a rough approximation of its true value, and every single prediction the model makes is therefore slightly wrong in a way the full model would not be. On its own, one slightly wrong step is nothing. But the model feeds its own output back in as the start of the next step. The small errors do not cancel. They compound, step after step, and the model's sense of the most likely next word slowly collapses until almost all the probability piles onto one continuation. From there it has no choice. It picks that word, which leads back to the same state, which picks the same word. Two phrases, forever. The groove in the record.

The Keeper

So you were not watching a bug. You were watching a model with two specific, deliberate injuries, doing exactly what those injuries make a model do. The first injury taught you that a model holds even something as human as refusal in a single direction through a vast space, and that you can reach in and delete it, but you cannot delete it cleanly, because finishing lives next door and bleeds when refusal is cut. The second taught you that crushing precision adds a whisper of error to every step, and that a thing which eats its own output will amplify a whisper into a scream given enough room. A model that will say anything, locked into saying the same thing forever. That is the price written into the name, heretic, two point six bits, and now you can read the receipt.