Speculative Decoding: Letting a Small Model Do the Guessing

The Other Way to Make a Model Faster

When you want a local model to run faster, your instinct is to shrink it. Crush its precision down to a few bits, make it smaller, accept a little damage. That is one road, and you know it well. But there is a second road that almost nobody outside the field has heard of, and it is the opposite philosophy. It keeps the big, smart, full-quality model exactly as it is, untouched, and still makes it spit out words two or three times faster, with no loss of quality at all. The output is provably identical to the slow model running alone. It sounds like a free lunch, and the trick that makes it work is one of the most elegant ideas in modern artificial intelligence. It is called speculative decoding, and the whole thing rests on a lovely asymmetry.

Writing Is Slow, Reading Is Fast

Here is the asymmetry. For one of these models, producing a word is slow, because it has to do an enormous amount of computation, then take the result and feed it back in to produce the next word, then feed that back in again, one at a time, a strict single-file line. Each word waits for the one before it. That serial chain is the reason generation feels like it trickles out. But there is a hidden second talent. The same model can read and check a whole stretch of words in roughly the same time it takes to produce one. Checking many words at once is cheap. Producing them one by one is expensive. Writing is slow, marking is fast.

Most of what a model produces is not actually hard. Huge stretches of any sentence are obvious filler, the, and, of, a comma here, the predictable end of a common phrase. A model spends its precious slow steps grinding out these gimmes with the same heavy machinery it uses for the genuinely hard, surprising words. That is the waste speculative decoding attacks. Why make the giant, expensive model labor over every easy little word, when the easy ones could be guessed by something far cheaper.

The Apprentice and the Master

So you bring in a second, tiny model alongside the big one. Think of it as an eager apprentice. The apprentice is fast and shallow, good enough to guess the obvious words but not to be trusted on the hard ones. You let the apprentice race ahead and scribble down its best guess for the next several words in one quick burst. Then you hand that whole batch to the master, the big model, and ask it to do the thing it is secretly fast at, checking. The master reads the apprentice's entire guessed run at once and marks each word, yes I agree, yes, yes, no, here it diverges.

Every word the master agrees with is kept, for free, because checking them all cost about as much as producing one. The moment the master disagrees, it throws away the rest of the apprentice's guess from that point and writes the correct word itself, then the apprentice races ahead again from there. On the easy stretches, the apprentice guesses four or five words right in a row and the master waves them all through in a single cheap check, so you got five words for the price of one. On the hard words, the apprentice is overruled and you fall back to normal speed. The output is exactly, bit for bit, what the master alone would have written, because the master signs off on every single word. You did not change the answer. You just stopped making the expensive model do the easy parts.

Why It Belongs Next to Quantization in Your Head

This deserves a slot right beside quantization in how you think about speeding up local models, because the two are completely different bargains and they stack. Quantization shrinks the model and pays a quality price, the blur you have studied. Speculative decoding keeps the model whole and pays no quality price at all, it just reorganizes the labor so the cheap apprentice handles the obvious and the master only does what the master is for. You can even do both at once, run a quantized big model and an even tinier draft model in front of it. The framework you run your local models with supports this, and most people never switch it on because they have never heard the idea. The speedup is real and it costs you nothing in accuracy, which is the rarest kind of win in this whole field.

The Keeper

So hold both roads in mind. To make a model faster you can make it smaller, crushing precision and accepting damage, the road you know. Or you can leave the smart model perfectly intact and attack a different waste entirely, the fact that it grinds out easy words and hard words with the same heavy effort, in a slow single-file line. Speculative decoding hands the easy guessing to a fast little apprentice, then uses the master's hidden talent for cheap bulk checking to approve whole runs of guesses at once, correcting only where the apprentice was wrong. Writing is slow, marking is fast, and most words are easy. The master signs every word, so the answer never changes. It is the closest thing to a free lunch the field has, hiding behind a switch you have probably never flipped.