This is episode nine of Actually, AI.
You typed a prompt. You asked for a person holding a coffee cup in a cozy cafe, with a chalkboard menu behind them reading "daily specials." The image came back gorgeous. Warm light, beautiful skin texture, steam curling from the cup. But the hand wrapped around that cup has six fingers, one of them growing sideways from the wrist. The chalkboard says something that looks like English from across the room but dissolves into nonsense up close. And even though you asked for the person on the left side of the frame, they are dead center.
These are not bugs. They are not things the developers forgot to fix. They are direct, predictable consequences of how every image generator you have ever used actually works. Midjourney, DALL-E, Stable Diffusion, FLUX. All of them. And once you understand the mechanism, you will never look at a generated image the same way again. More importantly, you will know how to get better results.
Because image AI does not generate pictures. It denoises them. It starts with pure static, random noise with no structure at all, and removes that noise step by step until a picture appears. Your text prompt steers the denoising, nudging it toward cats or spacesuits or Mars at every step. But the model has no concept of what a cat is. No concept of anatomy. No concept of spelling. It has learned patterns of noise from millions of training images, and it subtracts its way toward something that looks right. The word "looks" is doing all the work in that sentence.
The mechanism has two phases, and neither involves any imagination.
In the first phase, the model learns by watching destruction. Take a real photograph. Add a tiny amount of random noise. The image is barely changed, just a few pixels slightly off. Show the model the noisy version and ask it one question: what noise was just added? The model predicts. You tell it how wrong it was. It adjusts slightly. Then you add more noise and ask again. After about a thousand steps of increasing noise, the original photograph is completely destroyed. Pure static. But at every step along the way, the model practiced predicting what noise was present. It learned, at every level of corruption, how noise sits on top of structure.
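If you want the mechanics on the page rather than in your ear, here is a minimal sketch of that first phase in Python. Everything in it is a stand-in: a toy network instead of a real diffusion model, random pixels instead of photographs. One practical detail the narration glosses over is that, in practice, each example is shown at a randomly chosen noise level rather than walked through all thousand levels in order.

```python
# Toy sketch of the training phase: corrupt an image, ask the model what
# noise was just added, measure how wrong it was, adjust slightly.
import torch
import torch.nn as nn

T = 1000                                       # number of noise levels
betas = torch.linspace(1e-4, 0.02, T)          # how much noise each step adds
alphas_bar = torch.cumprod(1.0 - betas, dim=0) # how much original signal survives after t steps

# Stand-in "photographs": 256 flattened 8x8 grayscale images of random pixels.
photos = torch.rand(256, 64)

# Stand-in noise predictor: sees a noisy image plus its noise level,
# returns a guess of the noise that was added.
model = nn.Sequential(nn.Linear(64 + 1, 128), nn.ReLU(), nn.Linear(128, 64))
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

for step in range(200):
    x0 = photos[torch.randint(0, len(photos), (32,))]   # clean images
    t = torch.randint(0, T, (32,))                       # random corruption level
    noise = torch.randn_like(x0)                         # the noise we are about to add
    a = alphas_bar[t].unsqueeze(1)
    noisy = a.sqrt() * x0 + (1 - a).sqrt() * noise       # the corrupted image
    # The one question: what noise was just added?
    pred = model(torch.cat([noisy, t.unsqueeze(1).float() / T], dim=1))
    loss = ((pred - noise) ** 2).mean()                  # how wrong was it?
    opt.zero_grad()
    loss.backward()
    opt.step()                                           # adjust slightly
```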
In the second phase, generation, the model works in reverse. You hand it a field of pure random static and ask the same question it practiced millions of times: what noise do you see? It predicts. You subtract that predicted noise. The result is slightly less random. Ask again. Predict again. Subtract again. Over dozens of iterations, coherent structure emerges from nothing. Edges form. Colors coalesce. Shapes become recognizable.
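And here is the second phase in the same toy setup: start from static and run the question in reverse. This uses the standard DDPM update rule; a real generator does the same thing with a far bigger model and usually far fewer, cleverer steps.

```python
# Toy sketch of generation: the trained question, run backwards from static.
with torch.no_grad():
    x = torch.randn(1, 64)                               # pure random static
    for t in reversed(range(T)):
        t_in = torch.full((1, 1), t / T)
        pred_noise = model(torch.cat([x, t_in], dim=1))  # "what noise do you see?"
        alpha = 1.0 - betas[t]
        # Subtract the predicted noise, rescaled for this step.
        x = (x - betas[t] / (1.0 - alphas_bar[t]).sqrt() * pred_noise) / alpha.sqrt()
        if t > 0:
            x = x + betas[t].sqrt() * torch.randn_like(x)  # keep a little randomness
# x is now the toy model's best attempt at an "image" pulled out of static.
```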
Your text prompt enters at every single step. The model does not predict noise generically. It predicts noise conditional on your description. "A cat wearing a spacesuit on Mars." At each step, the text pulls the denoising in a specific direction. Without it, the model would denoise toward whatever generic image its training data suggests. With it, cat-like structures emerge instead of dog-like ones. The spacesuit forms around the cat instead of a coat. Mars appears in the background instead of a kitchen.
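In code, that conditioning is nothing more exotic than one extra input. A sketch, with a made-up embedding standing in for what a real text encoder such as CLIP or T5 would produce:

```python
# Toy sketch of conditioning: the noise predictor also sees the prompt embedding.
import torch
import torch.nn as nn

text_dim = 16   # toy size; real text encoders produce much larger embeddings

cond_model = nn.Sequential(nn.Linear(64 + 1 + text_dim, 128), nn.ReLU(), nn.Linear(128, 64))

prompt = torch.randn(1, text_dim)   # stand-in for "a cat wearing a spacesuit on Mars"
noisy_image = torch.randn(1, 64)
t_in = torch.full((1, 1), 0.5)      # halfway through the denoising schedule

# The same question as before, now conditional: what noise do you see,
# given this description? Swap the embedding and the prediction shifts,
# which is the only lever the text has over the final image.
pred_noise = cond_model(torch.cat([noisy_image, t_in, prompt], dim=1))
```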
The idea came from physics. In twenty fifteen, a researcher applied mathematics from thermodynamics, the study of how order dissolves into disorder, to machine learning. If you could model the process of destroying structure in tiny controlled steps, you could train a neural network to reverse each step. Nobody noticed. The paper sat largely forgotten for five years while a flashier approach called generative adversarial networks dominated the field. It took until twenty twenty for other researchers to make diffusion practical, and until twenty twenty-two for it to reach millions of people. The deep dive tells that full story, from the physicist who started it all to the company that blew up trying to give it away for free.
Now for the part that changes how you use these tools.
The model learned to denoise by studying patterns in millions of photographs. Faces appear constantly in training data, usually centered and well-lit, in a limited range of poses. The statistical patterns are strong and consistent. But hands are a different story. They are small in most photographs. They appear at wildly different angles, in different configurations, partially hidden behind objects, gripping tools, folded together, gesturing. Each specific hand pose is individually rare.
And the model has no concept of "five fingers." It has no concept of anatomy at all. It has no internal model of a hand as a three-dimensional object with joints and bones and a fixed number of digits. It has statistical patterns that usually produce something hand-shaped, learned from millions of partial, inconsistent examples. When those patterns conflict, which happens often because the training data is so varied, the result is a hand that looks plausible from a distance and nightmarish up close. The model is not miscounting fingers. It does not know what counting is. It is averaging noise patterns, and the average of a thousand different hand configurations is a blob with roughly the right shape and an uncertain number of digits.
This has improved dramatically. Midjourney version seven, FLUX two, and the latest generation of models produce better hands than anything from two years ago. But the improvement comes from better training data and larger models with more capacity to learn fine-grained patterns, not from any sudden understanding of anatomy. The fundamental mechanism is unchanged. The hands are better because the noise prediction is more precise, not because the model learned what a hand is.
The same mechanism explains why text in your generated images comes out garbled. When you ask for a storefront with a sign reading "open," the model does not know what letters are. It does not know that "open" has four specific characters in a specific sequence. It knows roughly what a word painted on a sign looks like, because it has seen thousands of training images of signs. But it treats letters as visual patterns, not as symbols with rules about sequence and spelling.
Think about what the denoising process is actually doing at each step. It predicts noise across the entire image simultaneously. There is no mechanism that says "first render the O, then the P, then the E, then the N." The model removes noise from all parts of the image at once, guided by the general statistical pattern of "what text on a sign looks like." Sometimes the letters land close to correct. More often, you get something that looks like writing from three feet away and dissolves into alphabet soup when you zoom in.
Real progress has been made here. DALL-E three achieved legible short text by training with synthetic captions that explicitly described what text appeared in images. The FLUX two model family includes a built-in typography system trained specifically on text rendering. GPT Image, which replaced DALL-E three in late twenty twenty-five, handles text even better because a language model processes your words before the image generation begins, so it actually understands what "open" means as a sequence of characters. But longer text strings and complex layouts still trip up most models. The core challenge remains: denoising is a spatial process that operates on the whole image at once, and text is a sequential medium where every character matters.
You asked for the subject on the left. The model put them in the center. You asked for a bird's eye view. You got a forty-five-degree angle. You described a complex scene with six elements in specific spatial relationships. The model rendered three of them and arranged them however it pleased.
This is the third major consequence of the denoising mechanism. Your text prompt steers the direction of noise removal, but it steers through statistical association, not spatial instruction. The model learned that the word "portrait" is associated with centered compositions because most portraits in the training data are centered. It has no mechanism for parsing "on the left side" as a spatial coordinate. It treats spatial language the same way it treats any other word: as a statistical influence on noise prediction.
There is a parameter that controls how strongly the model follows your text versus its own learned instincts. It is called the guidance scale, and most tools default to around seven or eight. Higher values make the model follow your text more aggressively, but images start to look oversaturated and harsh. Lower values let the model wander, producing more natural-looking results that might ignore what you asked for. You are always trading control for quality, and the denoising mechanism is why.
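In most tools that guidance scale is implemented as classifier-free guidance: the model predicts the noise twice, once with your prompt and once with an empty one, and the scale decides how far past the prompted prediction to push. A sketch of the arithmetic, with toy numbers:

```python
import torch

def guided_noise(pred_with_prompt: torch.Tensor,
                 pred_without_prompt: torch.Tensor,
                 guidance_scale: float = 7.5) -> torch.Tensor:
    # Scale 0 ignores the prompt, scale 1 is the plain prompted prediction,
    # and higher values exaggerate the gap between the two predictions,
    # following the text more aggressively at the cost of harsher images.
    return pred_without_prompt + guidance_scale * (pred_with_prompt - pred_without_prompt)

uncond = torch.tensor([0.10, 0.20])     # prediction with an empty prompt
cond = torch.tensor([0.30, 0.00])       # prediction with your prompt
print(guided_noise(cond, uncond, 1.0))  # tensor([0.3000, 0.0000]) -- just the prompted prediction
print(guided_noise(cond, uncond, 7.5))  # tensor([1.6000, -1.3000]) -- pushed well past it
```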
A tool called ControlNet, published in twenty twenty-three, partially solves this by letting you provide structural inputs: edge maps, depth maps, human pose skeletons that constrain where things go. The practical companion covers ControlNet and every other technique for getting the images you actually want. But the base mechanism, text steering noise prediction, has no built-in concept of where things should be.
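If you want to try ControlNet yourself, here is a sketch using the Hugging Face diffusers library. The model identifiers and the edge-map file are placeholders, and the exact names on the Hub change over time, so treat this as the shape of the call rather than a recipe.

```python
import torch
from PIL import Image
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline

# A ControlNet trained on edge maps, paired with a base Stable Diffusion model.
controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-canny", torch_dtype=torch.float16)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", controlnet=controlnet,
    torch_dtype=torch.float16).to("cuda")

# The structural input: an edge map that constrains where things go.
edge_map = Image.open("cafe_edges.png")

result = pipe(
    "a person holding a coffee cup in a cozy cafe, on the left side of the frame",
    image=edge_map,
    guidance_scale=7.5,
    num_inference_steps=30,
).images[0]
result.save("cafe.png")
```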
Diffusion is not how language models work. In episode three, we talked about training as next-token prediction, a model learning to guess what word comes next, sequentially, left to right. A diffusion model generates an image by refining the entire picture simultaneously, at every step, from noise to clarity. The two approaches share almost nothing: not the architecture, not the training method, not the generation logic.
This matters because the phrase "artificial intelligence" makes it sound like one thing. It is not. The AI that writes your emails and the AI that generates your images are as different from each other as a submarine and an airplane. Both move through a medium. Both are engineered. Both are impressive. But the mechanisms are completely different, and the failures are completely different, and the workarounds are completely different. Understanding diffusion makes that visible. The next time someone says "AI can do this," you are now equipped to ask: which AI? Doing what? And failing how?
The deep dive goes further. The full history of who built this, from a physicist borrowing thermodynamics to the compression trick that made it run on your laptop to the open release that set off a legal firestorm. The practical companion covers what to actually do with everything you just learned: prompt strategies that work, why negative prompts are not superstition, how to fix the hands, ControlNet for composition, and which tool to pick for which job. Find them right after this in your feed.
That was episode nine.