The Quadratic Wall and the Revolt Against the Transformer

A Model That Is Not a Transformer

On the fifth of May this year, a company almost nobody had heard of shipped something that was not supposed to be commercially possible yet. A large language model that is not a transformer. Every famous model you use, every one, is built on the same core machinery invented in twenty seventeen, and that machinery has a wall baked into it that has shaped the entire economics of the field. This new model, from a company calling itself Subquadratic, threw the core machinery out, and shipped with a context window of twelve million tokens, claiming roughly a fifth of the cost on long inputs and attention many times faster at scale. To understand why that is either a revolution or a trap, you have to understand the wall, because you run straight into it every time you feed a model a long document.

Why Doubling the Input Quadruples the Work

Here is the wall. The heart of a transformer is a step called attention, and attention works by having every token in your input look at every other token to decide what is relevant. Every word compares itself to every other word. That sounds reasonable until you count it. With a hundred words, that is a hundred times a hundred comparisons, ten thousand. With a thousand words, a thousand times a thousand, a million. The work does not grow with the length of your input. It grows with the square of the length. Double the document and you quadruple the cost. Ten times the document, a hundred times the cost.

That single fact explains so much of what you have noticed. It is why long context is expensive, why the price of a model call climbs steeply as you stuff more in, why those impressive million-token context claims so often come with quiet warnings that quality sags past a certain length. The all-pairs comparison is gloriously powerful, every word genuinely aware of every other, but it is the time bottleneck of the whole architecture. For nine years the field has mostly just paid the tax, throwing more hardware at the square, because attention works so well that nobody wanted to give it up.

The Recurrent Rebels

But a rebellion has been brewing, and it goes back to an older idea. What if, instead of every word comparing to every other word, the model read the text the way you read a sentence, left to right, keeping a running summary in its head and updating that summary as each new word arrives. No all-pairs explosion, just a rolling state that absorbs each token in turn. The cost then grows merely in step with the length, not its square. Read twice as much, do twice the work, not four times. These are the state space models, the most famous called Mamba, along with cousins built on what is called linear attention. The clever part of the best of them is making that running summary smart enough to decide, for each new token, what to remember and what to forget, so it does not just blur everything together.

The new commercial model is the moment that family grew up and walked onto the market wearing a price tag, with a twelve-million-token window that the old square-law architecture could never afford to offer cheaply. And the appeal is obvious to anyone who has watched a long-context bill. If reading is linear instead of square, then enormous inputs, whole codebases, whole archives, whole books, stop being a luxury and become routine. That is the dream the rebels are selling.

The Theorem That Says You Cannot Win Cleanly

And now the trap, the part that makes this genuinely deep rather than just a product launch. In twenty twenty-four, theorists proved something inconvenient. They showed that any model fast enough to dodge the square-law wall, any truly subquadratic model, whether Mamba or linear attention or whatever comes next, provably cannot perform certain tasks that a full transformer can. Their example was a similarity task, given a big pile of documents, find the pair that is most alike. They proved a transformer can do it, and they proved that no truly fast model can, resting on a widely believed conjecture about the fundamental hardness of computation. The all-pairs comparison was not waste. It was the price of a capability. Throw out the square, and you provably throw out the ability to truly relate every item to every other item.

So the rebellion is real but it is not free. The running-summary models are not simply a better transformer. They are a different bargain. They are dramatically cheaper on long inputs, and for a huge range of tasks you will never notice the difference, because most tasks do not secretly require relating everything to everything. But there exists a class of problems, the ones that genuinely need every part compared to every other part, where the fast model cannot follow, no matter how cleverly it is built, because a theorem stands in the way. This is why the smart money is on hybrids, models that run the cheap linear machinery for most of their layers and sprinkle in a little of the expensive all-pairs attention only where it is truly needed. And it is why people are taking expensive, fully trained transformers and distilling their knowledge down into the cheaper architecture, trying to keep the wisdom while shedding the cost.

The Keeper

So here is the finale to carry. Every model you know is built on attention, where every word compares to every other word, which makes them brilliant and makes their cost grow with the square of the input, the wall behind every steep long-context bill. A rebellion of running-summary models, Mamba and its kin, dodges the wall by reading like a person, keeping a rolling state, paying only in step with the length, and this year one of them shipped commercially with a twelve-million-token window. But a proof from twenty twenty-four says the dodge has a real price. The fast models provably cannot do certain everything-to-everything tasks that the square-law machinery can. So the future is almost certainly not one or the other. It is a careful blend, cheap reading most of the time, expensive all-pairs attention only where a theorem says you truly need it. The wall is real, the escape is real, and so is the catch.