You have downloaded models with absurd labels. One hundred and twenty-two billion parameters, squeezed to two point three four bits each. Not two bits, not three. Two point three four. The first time you see a fractional bit depth it looks like a typo or a joke, because how could you possibly store something in a third of a bit. But that fraction is real, it is deliberate, and someone crafted it by hand for that exact model. Understanding how a fraction of a bit is even possible unlocks the whole reason your machine can run a mind that should never fit inside it, and the reason those same models sometimes go quietly insane.
Begin with what a weight is. A model is a vast pile of numbers, billions of them, each one a setting that shapes how signals flow. In full quality each of those numbers is stored in sixteen bits, which buys you fine gradations, a smooth range of possible values. That precision is expensive. Billions of sixteen-bit numbers is why a serious model wants far more memory than a laptop has. So you do the obvious brutal thing. You round. Instead of letting each weight be any of thousands of fine values, you force it to pick from a tiny menu.
Four bits means a menu of just sixteen possible values. Every weight in the model must round itself to the nearest of those sixteen rungs. Two bits means a menu of only four rungs. The model goes from a smooth dial to a coarse staircase. The file shrinks dramatically, because you no longer store a precise number for each weight, just which rung it landed on. That is quantization. You are not deleting weights. You are blurring them, snapping each one to the nearest allowed step and accepting that it is now slightly off from where it truly wanted to be.
The shock is that this works at all. You would expect rounding billions of numbers to ruin everything. It mostly does not, because these models are deeply redundant. No single weight carries the whole show, the important behavior is smeared across thousands of them, so when each one is nudged a hair off its true value, the errors are small and scattered and the model as a whole still finds its way. It is like a photo dropped to fewer colors. From across the room it looks identical. Lean in close and you start to see the banding, the places where the smooth gradient became visible steps.
Now the fraction makes sense. Not every part of the model is equally fragile. Some layers are load-bearing and hate being rounded, others barely notice. So instead of one menu size for the whole thing, you mix them. Keep the delicate layers at three or four bits, where the menu is generous, and crush the forgiving layers down to two, where it is tiny. Average the bits per weight across the whole model and you land on something like two point three four. The fraction is not a single weird precision. It is the bookkeeping average of a model rationed unevenly, generosity where it matters and brutal thrift everywhere else. Newer tooling lets you steer this by hand, protecting the layers you care about and trashing the rest.
There is a threshold, though, and you have walked past it. Down at a little over two bits, the staircase gets so coarse that the rounding error stops being a harmless blur and becomes genuine noise, big enough to change what the model would have said. And because a language model feeds its own output back into itself, that noise does not stay put. It compounds. A slightly wrong step leads to a slightly wronger next step, and far enough down that road the model loses the plot entirely. That is the same degeneration you watched in the heretic model, the part where it stopped making sense. The quantization did not break one answer. It poisoned the whole chain, one rounded weight at a time.
And yet, the strangest finding in this whole corner is that more crushing sometimes makes a model better. People reproduced a trick where rotating the numbers before squeezing them, at three bits, beat the full sixteen-bit version on certain tests. The rounding, applied cleverly, acted like a gentle discipline, smoothing out fragile over-fit behavior the way a slightly lossy copy can look cleaner than a noisy original. It is the kind of result that should make you suspicious, and it is real. Throwing information away, done with care, occasionally sharpens the thing that remains.
So carry the picture. A weight wants to be a smooth, finely tuned number, and quantization forces it onto a short staircase of allowed values, snapping each one to the nearest rung. Fewer bits, fewer rungs, smaller file, more blur. The model usually survives because its knowledge is spread thin and redundant, so scattered small errors wash out. The fraction in the name is just the average when you ration precision layer by layer, lavish on the fragile parts and stingy on the rest. And there is a cliff. Push past it and the blur becomes noise, the noise compounds through the model's own feedback, and a mind that fit on your Mac is a mind that no longer entirely knows what it is saying. Two point three four bits is the exact width of that bargain, written into the file name.