Two Mac Studios, Wired Together, Slower Than a Raspberry Pi

A Quarter of a Token Per Second

Someone took two of the most expensive computers Apple sells, two Mac Studios with the giant memory option, well over half a million kronor of silicon between them, and wired them together with the fastest cable Apple makes, to run one enormous artificial intelligence model that would not fit on either machine alone. The result of all that money and ambition was a system that produced about a quarter of a token per second, which is roughly a quarter of a word. You could read faster than it could think. A single one of those machines, running a model it could actually hold, does fifteen or sixteen words a second. Bolt two together and you get roughly a sixtieth of that. They assembled a small fortune of the finest hardware on Earth and built something a child's hobby board would outrun. The reason why is one of the most useful lessons in all of computing, and it is one you keep brushing up against in your own machine-learning experiments.

The Model That Does Not Fit

The model in question was a monster, around three hundred and seventy-eight gigabytes. No single machine in the experiment had enough memory to hold it, so it had to be cut in half and spread across the two, part of its brain living in one Mac, part in the other. And this particular kind of model has an unusual inner structure. It is not one big uniform brain. It is a committee of hundreds of smaller specialists, called experts, and for any given word, only a few of those experts are actually consulted. The model has a little router at each step that decides, for this word, go ask these two or three experts, ignore the other two hundred and fifty.
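The router's job can be sketched in a few lines of Python. This is an illustration, not the model's actual code: a real router is a small learned layer, whereas here the scores are random stand-ins, and the expert count and the number of experts picked per word (NUM_EXPERTS, TOP_K) are assumed values chosen to match the rough figures above.

```python
import math
import random

NUM_EXPERTS = 256  # assumed; the text only says "hundreds"
TOP_K = 3          # assumed; the text says "two or three"

def route(scores, k):
    """Return the indices of the k highest-scoring experts."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:k]

def softmax(values):
    """Turn raw scores into mixing weights that sum to 1."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

# Stand-in for the scores a learned router would produce for one word.
rng = random.Random(0)
scores = [rng.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)]

chosen = route(scores, TOP_K)                    # the few experts consulted
weights = softmax([scores[i] for i in chosen])   # how much each one counts
ignored = NUM_EXPERTS - len(chosen)              # everyone else stays asleep
```

Only the experts in chosen do any work for this word, and their answers are blended according to weights. The rest of the committee costs memory but no computation.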

That design is brilliant when everything lives in one place. You get the knowledge of a giant model while only doing the work of a small one each step, because you only wake a handful of experts at a time. But now picture those experts scattered across two machines connected by a cable. The router picks the experts it needs for the next word, and some of them happen to live on the other machine. So the half-finished thought has to be packed up, shipped down the cable to the second Mac, processed there, and shipped back. For every single word, and in practice many times per word, because a model like this has a router at every one of its layers. And which experts are needed changes constantly, so there is no clever way to keep the conversation on one side. The work ping-pongs across the wire endlessly.
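How often does a word need an expert from the other machine? A back-of-the-envelope sketch, under assumptions the experiment itself does not confirm: the experts are split evenly between the two Macs, and the router picks among them uniformly at random. Real routers are not uniform, so treat the result as a rough feel for the odds rather than a measurement. The function name p_needs_remote is made up for this illustration.

```python
from math import comb

def p_needs_remote(num_experts, local_experts, k):
    """Chance that at least one of k chosen experts lives on the other machine.

    Assumes the k experts are picked uniformly at random without repeats.
    """
    p_all_local = comb(local_experts, k) / comb(num_experts, k)
    return 1.0 - p_all_local

# Assumed figures: 256 experts, half on each Mac, 3 consulted per step.
p = p_needs_remote(num_experts=256, local_experts=128, k=3)
print(f"Chance a routing step crosses the cable: {p:.0%}")
```

With those assumed numbers, close to nine routing steps in ten touch the other machine, which is why there is no realistic way to keep the work on one side.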

The Wire Is the Whole Story

Here is the thing people underestimate, over and over. Inside a single machine, the processor talks to its own memory at a firehose rate, hundreds of gigabytes a second, the data already right there. The cable between two machines, even the fastest one Apple makes, is a drinking straw by comparison. So the moment your model has to send its thoughts across that straw for every word, the straw becomes the entire bottleneck. The two powerful processors spend almost all their time waiting, idle, twiddling their thumbs, while a thin trickle of data crawls back and forth between them. You did not build a brain twice as big. You built one brain with its two halves joined by a drinking straw, and a brain that has to push every thought through a straw thinks very, very slowly.
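The two speeds quoted earlier are enough to estimate how much of the time goes to waiting. The sketch below leans on one loud assumption: that the raw computing per word on the linked pair is about what a single machine spends on a model it can hold. The models differ, so this is only a rough stand-in. The function name link_overhead is hypothetical, and 15.5 is simply the midpoint of "fifteen or sixteen".

```python
def link_overhead(single_tps, linked_tps):
    """Split the linked system's time per word into compute and waiting.

    single_tps: words per second on one machine (assumed compute baseline)
    linked_tps: words per second on the two machines wired together
    """
    compute_s = 1.0 / single_tps   # time spent actually thinking
    total_s = 1.0 / linked_tps     # time the linked system takes per word
    wait_s = total_s - compute_s   # everything else: the cable and stalls
    return compute_s, wait_s, wait_s / total_s

compute_s, wait_s, wait_fraction = link_overhead(single_tps=15.5, linked_tps=0.25)
slowdown = 15.5 / 0.25

print(f"Compute per word: {compute_s:.3f} s")
print(f"Waiting per word: {wait_s:.3f} s ({wait_fraction:.1%} of the total)")
print(f"Slowdown versus one machine: {slowdown:.0f}x")
```

Under that assumption, about four seconds pass per word, of which well under a tenth of a second is thinking. The chips are idle for more than 98 percent of the time.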

And there was a second insult on top of the first, particular to Apple's machines. The graphics chip in a Mac has a hidden safety timer. If any single piece of work on it runs longer than about forty-five seconds, the system assumes something has frozen and kills it. On a normal Mac that protects you from a hung app locking up your screen. But when you are trying to grind through a giant model and the work naturally takes longer than that, the timer becomes a guillotine. And Apple gives you no way to turn it off. No setting, no secret command, no boot flag. They simply decided forty-five seconds is the limit and that is final. So even the heroic slow crawl kept getting decapitated by a timer designed for a completely different purpose.

Why This Matters to You

You run models on your own machine, and you have flirted with exactly this dream, lashing hardware together to punch above its weight. This is the cautionary tale that should sit in the back of your mind whenever that dream returns. The instinct is always to count the processors and the memory, to think two machines must be roughly twice as capable as one. But for any work that has to constantly pass data between the pieces, the number that decides everything is not the power of the chips. It is the speed of the link between them. A modest setup with everything in one place will demolish a magnificent setup forced to talk across a slow connection. The chips were never the problem. The conversation between them was.

The Keeper

So carry this picture out. A giant committee model is cheap to run only when its hundreds of experts all live close together, because each word wakes just a few of them. Split that committee across two machines and every word triggers a frantic exchange down a cable that is far slower than the memory inside either box, and the whole system grinds to the speed of that cable. Add a safety timer that cannot be switched off, chopping the work to pieces, and you get a quarter of a word per second from half a million kronor of hardware. The lesson is older than these models and will outlast them. In any system built from parts, find the slowest link between the parts first, because that link, not the parts, is what you actually built.