MLX: The Worker Thread That Talked to the Wrong Graphics Card

A Crash With No Pattern

The model generated text fine, until it didn't. On the main thread, perfect. Move the same generation onto a background worker thread, the kind you spin up so the web page stays responsive while the machine thinks, and on a newer version of the framework it would simply fall over. No bad input. No memory exhaustion. The exact same code, the exact same model, crashing only because of which thread happened to be running it. That is the kind of bug that makes you doubt your own eyes, and your code review surfaced it before a single real user hit it. Let us look at what was actually going on, because it is a window into how your Mac talks to its own graphics chip.

What a Stream Really Is

The framework here is MLX, Apple's array library for its own silicon, the thing you run your local models on. When you ask it to multiply two big arrays, it does not, by default, do the work on the main processor. It hands the work to the graphics chip, which is really just a processor with thousands of tiny hands instead of a few big ones. But you cannot just shout instructions at a graphics chip directly. You queue them. There is an orderly line, a stream, and you drop your commands into that line, and the chip works through them in order. Most of the fast things on your machine, the models, the image generation, the Whisper transcription, are really just commands being fed into one of these streams.

The library tries to be friendly about this. It keeps a default stream ready so you never have to think about the line at all. You write what looks like ordinary math, and behind your back it is quietly queuing work onto that default stream. Most of the time you are blissfully unaware the line even exists. And that convenience is exactly what set the trap, because the question nobody asks is: whose default stream, and on which thread?

The Thread Did Not Know Where to Send the Work

Here is the heart of it. The setup, the part that prepared the model and arranged which stream the work should go to, ran on one thread, the main one. Then the actual generation got handed off to a separate worker thread so the rest of the program could stay awake. But the arrangement of which stream to use did not automatically travel across that handoff. The worker thread woke up holding a model and an instruction to generate, but without the local sense of which line to drop its commands into. On older versions the library papered over this. On the newer one it stopped being so forgiving, and the worker thread, asked to submit work with no valid stream of its own, crashed.

The fix has an ugly name, a thread-local stream, and a simple shape. On the worker thread, before generating, you re-establish the stream locally, right there, on that thread. You tell it, explicitly, this is the line you send your work to. Thread-local means each thread keeps its own copy of that setting rather than sharing one. It is the same lesson people relearn constantly with anything attached to hardware. A connection to the graphics chip is not a free-floating fact the whole program shares. It has an owner, and when work jumps threads, the connection does not follow on its own. You have to re-introduce the new thread to the chip.
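To make the thread-local idea concrete, here is a minimal sketch in plain Python. It is an analogy, not the framework's API: set_stream, current_stream, run_in_worker, and the name "gpu-stream" are all hypothetical stand-ins built on the standard library's threading.local. It shows only one thing, that a setting made on the main thread is invisible to a worker until the worker makes it again.

```python
import threading

# Hypothetical stand-in for a framework's per-thread "current stream" setting.
# This is an analogy built on the standard library, not MLX's actual API.
_state = threading.local()


def set_stream(name):
    """Bind a stream for the calling thread only."""
    _state.stream = name


def current_stream():
    """Return the calling thread's stream, or None if it never bound one."""
    return getattr(_state, "stream", None)


def run_in_worker(fn):
    """Run fn on a fresh thread and return its result."""
    result = []
    worker = threading.Thread(target=lambda: result.append(fn()))
    worker.start()
    worker.join()
    return result[0]


def generate_without_rebinding():
    # The worker relies on setup that happened on another thread.
    return current_stream()


def generate_with_rebinding():
    # The fix: re-establish the stream on this thread before doing any work.
    set_stream("gpu-stream")
    return current_stream()


set_stream("gpu-stream")  # setup runs on the main thread

seen_unbound = run_in_worker(generate_without_rebinding)
seen_rebound = run_in_worker(generate_with_rebinding)

print(seen_unbound)  # None: the main thread's binding did not travel
print(seen_rebound)  # gpu-stream
```

In this toy version the unbound worker merely sees nothing. In the real framework, by the account above, submitting work in that state is what crashed.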

Slow Is Not the Same as Finished

The same review caught a second bug in the same neighborhood, and it is the more universal of the two. Tokens, the little pieces of generated text, come out of the model one at a time, and they go onto a shared queue. Another part of the program drains that queue and shows them to you. The drainer was polling, checking the queue on a fixed rhythm, every quarter of a second, and using a stretch of silence to decide the model must be done. If no token had shown up recently, it assumed the well was dry and stopped.

You can already feel the flaw. A model does not produce tokens like a metronome. Sometimes it stalls for a moment, thinking harder, and the gap between two tokens stretches past that quarter second. When that happened, the drainer took the silence as the end of the story, walked away, and quietly threw out every token that came afterward. The output was not wrong in a way that screamed. It was just truncated, missing its tail, and you would have to know what the full answer should have looked like to even notice. The problem was that a timeout was being used to answer a question it cannot answer. A pause and a finish look identical if all you measure is silence.
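The truncation is easy to reproduce in miniature. The sketch below is not the original drainer, whose code is not shown here. It assumes a standard-library queue.Queue, uses a blocking get with a quarter-second timeout as a stand-in for the polling loop, and invents the token strings and the 0.6 second stall. The names drain_until_quiet and stalling_producer are hypothetical.

```python
import queue
import threading
import time


def drain_until_quiet(q, quiet_seconds=0.25):
    """Flawed drainer: treats any gap longer than quiet_seconds as the end."""
    tokens = []
    while True:
        try:
            tokens.append(q.get(timeout=quiet_seconds))
        except queue.Empty:
            # Silence is taken to mean "finished", which is the bug.
            return tokens


def stalling_producer(q):
    """Emit three tokens, stall past the quiet window, then emit two more."""
    for token in ["The", " quick", " brown"]:
        q.put(token)
    time.sleep(0.6)  # a pause, not an ending
    for token in [" fox", " jumps"]:
        q.put(token)


q = queue.Queue()
producer = threading.Thread(target=stalling_producer, args=(q,))
producer.start()
shown = drain_until_quiet(q)  # gives up during the stall
producer.join()

# Whatever arrived after the drainer walked away is still sitting in the queue.
lost = []
while not q.empty():
    lost.append(q.get())

print(shown)  # ['The', ' quick', ' brown']
print(lost)   # [' fox', ' jumps']
```

Nothing raises and nothing logs an error. The only evidence is the tail left behind in the queue.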

The Keeper

So two ideas to carry, and they rhyme. The first is that talking to a graphics chip is not magic you can fling around freely. Work goes into an ordered line, that line belongs to a thread, and when your work jumps to a new thread you must give it the line again, explicitly. Convenience hid that ownership until a version bump stopped hiding it. The second is broader than any framework. Never use a stretch of quiet to conclude that something has ended. A slow producer and a stopped producer sound exactly the same to a listener who is only timing the gaps. If you want to know a stream is truly done, the producer has to say so, plainly, with a signal that means done and nothing else. Your storyteller now rebinds its line and waits for the real word. It stopped guessing from silence, and it stopped crashing on the wrong thread.
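A signal that means done and nothing else is usually called a sentinel. What follows is a minimal sketch of that pattern, assuming a standard-library queue.Queue. The names DONE, produce_tokens, and drain_until_done are hypothetical, and the tokens and the stall are invented for illustration. It is one common way to do this, not necessarily the way the reviewed code does it.

```python
import queue
import threading
import time

DONE = object()  # unique sentinel: it means "finished" and nothing else


def produce_tokens(q):
    """Emit tokens with a long stall in the middle, then announce the end."""
    try:
        for token in ["The", " quick", " brown"]:
            q.put(token)
        time.sleep(0.6)  # a stall that a quarter-second timeout would misread
        for token in [" fox", " jumps"]:
            q.put(token)
    finally:
        # Always announce the end, even if generation raised part-way through.
        q.put(DONE)


def drain_until_done(q):
    """Collect tokens until the producer explicitly says it is finished."""
    tokens = []
    while True:
        item = q.get()  # blocks for as long as the producer needs
        if item is DONE:
            return tokens
        tokens.append(item)


q = queue.Queue()
producer = threading.Thread(target=produce_tokens, args=(q,))
producer.start()
received = drain_until_done(q)
producer.join()

print("".join(received))  # The quick brown fox jumps
```

The sentinel is compared by identity, so no generated token can be mistaken for it. A timeout can still earn its place as a watchdog that reports a dead producer as an error, but it should never be what decides the output is complete.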