LoRA: The Two Skinny Matrices Hiding Inside Your Sweeps

Rank Sixteen, and You Never Asked Why

Every overnight run you queue, you set a number called rank. Sixteen for a face, thirty-two for fine detail, sometimes up to sixty-four. You learned the recipe by feel: sixteen good, four too low. You tune it like a thermostat. But rank is not a vague dial of strength. It is a precise, almost stingy claim about how many independent ways a thing is allowed to change, and once you see what it actually counts, every odd rule you follow stops being folklore and turns into arithmetic. So let us open the box you have been setting a number on this whole time.

You Are Not Training the Model

Start with the thing people get backwards. When you train one of these adapters on photos of yourself, you are not retraining the image model. The image model, all twelve billion of its internal numbers, sits frozen. Untouched. You are not allowed to nudge it, and you would not want to. What you are training is a tiny attachment that clips onto the side and whispers a correction into the frozen model as it works. That attachment is the whole adapter. It is why the file you get out is a couple hundred megabytes instead of twenty-four gigabytes, and why you can stack one on top of another, or pop one off and clip on a different one. The base never moved.
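Here is a minimal sketch of the clip-on idea, assuming NumPy and toy sizes. An 8-by-8 grid stands in for one layer of the real model, and the names `base`, `adapter_a`, `adapter_b`, and `effective_weights` are made up for illustration. The corrections are written as full small grids to keep the sketch short. The last line assumes each number is stored in 2 bytes, which is what makes twelve billion numbers come out to twenty-four gigabytes.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for one frozen grid inside the base model (real ones are far larger).
base = rng.normal(size=(8, 8))
base_snapshot = base.copy()

# Two hypothetical adapters, each kept as its own small correction.
adapter_a = 0.01 * rng.normal(size=(8, 8))
adapter_b = 0.01 * rng.normal(size=(8, 8))

def effective_weights(base, adapters):
    """Return base plus every clipped-on correction, without touching base."""
    result = base.copy()
    for delta in adapters:
        result = result + delta
    return result

stacked = effective_weights(base, [adapter_a, adapter_b])  # two adapters stacked
only_b = effective_weights(base, [adapter_b])              # pop A off, keep B
bare = effective_weights(base, [])                         # no adapter at all

# Size arithmetic, assuming 2 bytes per stored number (16-bit precision).
base_gigabytes = 12e9 * 2 / 1e9
```

However many adapters you clip on or pop off, `base` still equals `base_snapshot`. The base never moved.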

Now, why does that even work? The honest way to teach the model your face would be to adjust a gigantic grid of numbers inside it, a grid with millions of entries in a single layer. That grid is what the model uses to transform one batch of internal signals into the next. Teaching it your face means computing a change to that whole grid. And here is the insight the whole technique rests on. That change, the correction you want to make, is not random scribble spread across all those millions of entries. In practice it turns out to be simple. It points in only a handful of meaningful directions. Every entry in the grid may shift a little, but the entries shift together, in a few coordinated patterns, so most of those millions of numbers never need to move independently.
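One way to see what "a handful of directions" means is a toy sketch, assuming NumPy and a made-up 64-by-64 grid built from just three directions. The sizes and variable names are illustrative, not taken from any real model. The correction touches every entry in the grid, yet counting its independent directions gives three.

```python
import numpy as np

rng = np.random.default_rng(0)
size, directions = 64, 3

# Build a correction as a sum of outer products, one per direction.
delta = np.zeros((size, size))
for _ in range(directions):
    u = rng.normal(size=(size, 1))
    v = rng.normal(size=(1, size))
    delta += u @ v

entries_changed = np.count_nonzero(delta)                  # grid entries that moved
singular_values = np.linalg.svd(delta, compute_uv=False)   # strength of each direction
rank = np.linalg.matrix_rank(delta)                        # independent directions
```

All 4,096 entries are nonzero, but only three singular values rise above rounding noise. The simplicity lives in the number of directions, not in how many entries change.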

The Bottleneck That Counts Directions

So instead of storing a full giant correction grid, the adapter stores it as two much skinnier grids multiplied together, one pair for each grid it corrects. Picture the big correction you would have made. Now imagine squeezing it through a narrow waist. Information goes in wide, gets pinched down to a thin channel, then fans back out wide on the other side. The width of that thin channel, the pinch in the middle, is the rank. Rank sixteen means the correction is forced to pass through a channel just sixteen wires across. Everything the adapter is allowed to learn about your face has to fit through those sixteen wires.
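The waist can be sketched in a few lines, assuming NumPy and a hypothetical layer width of 1,024. The names `down` and `up` are made up here for the two skinny grids, and the random values stand in for whatever training would actually produce. The sketch shows that pushing a signal through the two skinny grids gives the same result as the full correction grid, while storing far fewer numbers.

```python
import numpy as np

rng = np.random.default_rng(0)
width, rank = 1024, 16   # hypothetical layer width; rank sixteen as in the text

down = rng.normal(size=(rank, width))   # wide -> thin: squeeze into 16 wires
up = rng.normal(size=(width, rank))     # thin -> wide: fan back out

x = rng.normal(size=(width,))           # one batch of internal signals

# Pass the signal through the waist without ever building the full grid.
through_waist = up @ (down @ x)

# Same result as applying the full correction grid, built here only to compare.
full_correction = up @ down
direct = full_correction @ x

full_numbers = width * width            # entries in the full correction grid
skinny_numbers = 2 * width * rank       # entries in the two skinny grids
```

At this toy width the full grid holds 1,048,576 numbers and the two skinny grids hold 32,768 between them, a 32-fold saving, and the full correction they multiply out to can never have more than sixteen independent directions.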

That single picture explains your whole recipe. Rank is the number of independent directions of change the adapter can express. A low rank, four or eight, is a brutally narrow waist. For a small old image model that was plenty, because the corrections were genuinely simple. But the model you train on now is enormous, twelve billion numbers, and a face you really want to capture has more going on than four wires can carry. The signal arrives at the pinch, too much of it, and most gets crushed flat. That is what undertraining a face on rank four actually is. Not weak effort. A channel too narrow for the thing trying to pass through it. Raise the rank, widen the waist, more of the face survives the squeeze. Go too wide and the adapter has room to memorize noise and skin blemishes you never meant to teach it.
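The narrow-waist half of that story can be checked with a toy experiment, assuming NumPy. It builds a made-up "ideal" correction that genuinely needs sixteen directions, then asks how much is lost when it is forced through a waist of four, eight, or sixteen. This does not train anything. It uses a truncated singular value decomposition, which gives the best possible rank-limited copy, as a stand-in for the most a perfectly trained adapter of that rank could capture. The helper name `fraction_lost` is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
size, true_directions = 64, 16

# A made-up "ideal" correction that genuinely needs sixteen directions.
target = rng.normal(size=(size, true_directions)) @ rng.normal(size=(true_directions, size))

def fraction_lost(target, rank):
    """Relative error of the best possible rank-limited copy of target."""
    u, s, vt = np.linalg.svd(target, full_matrices=False)
    approx = (u[:, :rank] * s[:rank]) @ vt[:rank, :]   # keep the top `rank` directions
    return np.linalg.norm(target - approx) / np.linalg.norm(target)

lost = {rank: fraction_lost(target, rank) for rank in (4, 8, 16)}
```

The loss shrinks as the waist widens and drops to rounding error at sixteen, where the channel is finally as wide as the thing passing through it. The sketch says nothing about the too-wide failure, since memorizing noise only shows up with real training data.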

The Sister Knob

There is a second number you set right next to rank, called alpha, and it has confused people for years because it looks like it should do the same job. It does not. Rank decides how much the adapter can learn. Alpha decides how loudly what it learned gets played back into the frozen model. The two are tied together by a tiny fraction, alpha divided by rank, and that fraction quietly scales how hard your training pushes. This is why the wise old advice says that any recommended learning rate is meaningless unless someone also tells you their alpha and their rank. Change the rank while alpha stays put and you have silently changed the effective strength of every training step, even if the learning rate on the label never moved. People share a magic learning rate, forget to mention the other two, and wonder why it does nothing on their setup.
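The fraction is small enough to compute by hand, but a sketch makes the trap visible. It assumes the convention described above, where the adapter's output is multiplied by alpha divided by rank. Your trainer may apply the scaling differently, so treat the function name `adapter_scale` and the numbers as illustrative.

```python
def adapter_scale(alpha, rank):
    """Multiplier applied to the adapter's output: alpha divided by rank."""
    return alpha / rank

# Hold alpha fixed at 16 and sweep the rank.
fixed_alpha = 16
scales = {rank: adapter_scale(fixed_alpha, rank) for rank in (4, 16, 64)}
# {4: 4.0, 16: 1.0, 64: 0.25}

# Keeping alpha equal to rank holds the multiplier at 1.0 across ranks.
matched = {rank: adapter_scale(rank, rank) for rank in (4, 16, 64)}
```

With alpha fixed at 16, going from rank four to rank sixty-four cuts the multiplier sixteen-fold, from 4.0 to 0.25, without anyone touching the learning rate.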

The Keeper

So next time you name a run, see what you are really choosing. The model is frozen. You are training a small clip-on correction, and you store that correction as two skinny grids squeezed through a narrow waist. Rank is the width of the waist, the count of independent directions the adapter may move in. Too narrow and the face cannot fit through and comes out undertrained. Too wide and it pours through unfiltered, memorizing flaws. Alpha is the volume knob on top, and because strength is alpha over rank, you can never read a learning rate without the alpha and rank that came with it. Your whole feel for these sweeps, sixteen for a face, lower learning rate for delicate detail, was always describing the same simple machine. A giant grid you refuse to touch, and a thin channel you are quietly teaching to whisper through it.