Actually, AI
Diffusion Deep Dive: From Thermodynamics to FLUX
20m · Apr 04, 2026
Jascha Sohl-Dickstein, a physicist studying how ink disperses in water, realized a neural network could reverse thermodynamics itself—and accidentally invented the technology behind every AI image generator you use today.


The Physics Underneath

This is the deep dive companion to episode nine of Actually, AI, on diffusion.

In the main episode, we used the ink-in-water analogy to explain how diffusion models work. The analogy is good, but it conceals something interesting. Jascha Sohl-Dickstein did not borrow loosely from physics. He imported specific mathematics. The foundational paper, published in twenty fifteen, is called "Deep Unsupervised Learning using Nonequilibrium Thermodynamics," and that last word is doing real work.

Thermodynamics describes systems exchanging energy and matter. Equilibrium thermodynamics studies systems that have settled, where nothing is changing anymore. The uniform tint of fully diffused ink, the room-temperature coffee, the heat-dead universe. Nonequilibrium thermodynamics studies the process itself, the active exchange, the diffusion still happening. Sohl-Dickstein's insight was that the mathematical tools physicists use to model processes like heat flow and particle diffusion could be repurposed. If you could model the forward process of information being destroyed, the gradual addition of noise that turns structured data into randomness, then you could parameterize a neural network to learn the reverse.

The specific physics involves something called Langevin dynamics, a set of equations that describe how particles move when buffeted by random thermal fluctuations. In physics, Langevin dynamics models Brownian motion, the jittery path of a pollen grain in water. In diffusion models, the "particle" is the data point, the image, and the "thermal fluctuations" are the noise being added at each step. The forward diffusion process is a Markov chain where each state depends only on the previous state. At each of roughly one thousand timesteps, a small amount of Gaussian noise is added according to a predetermined schedule, until the original image is indistinguishable from pure random noise. The reverse process, learned by the neural network, approximates the time reversal of this stochastic differential equation.
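
The forward process described above has a convenient property worth seeing in code: because every step adds independent Gaussian noise, you can jump straight to any timestep in closed form. The sketch below is a minimal illustration of that standard identity; the schedule values and array shapes are illustrative, not taken from any particular implementation.

```python
import numpy as np

def forward_diffuse(x0, t, betas, rng):
    """Sample x_t directly from q(x_t | x_0).

    Because each step adds independent Gaussian noise, the whole
    Markov chain up to step t collapses into a single Gaussian:
        x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * eps
    where alpha_bar_t is the cumulative product of (1 - beta).
    """
    alpha_bar = np.cumprod(1.0 - betas)[t]
    eps = rng.standard_normal(x0.shape)
    return np.sqrt(alpha_bar) * x0 + np.sqrt(1.0 - alpha_bar) * eps

rng = np.random.default_rng(0)
betas = np.linspace(1e-4, 0.02, 1000)  # a linear schedule over ~1000 steps
x0 = rng.standard_normal((8, 8))       # stand-in for an "image"
x_T = forward_diffuse(x0, 999, betas, rng)
# By the final step, alpha_bar is vanishingly small: x_T is essentially pure noise.
```

The neural network is then trained to run this corruption in reverse, one small denoising step at a time.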

This is not a metaphor dressed up as science. It is genuine mathematical machinery from statistical physics, applied to a completely different domain. A physicist trained in thermodynamics saw a generative modeling problem and recognized the same mathematical structure he had studied in a different context entirely. Surya Ganguli, who directed the Stanford lab where this work happened, comes from applied physics. The collision of physics and machine learning produced something neither field would have found alone.

The Parallel Path

Here is where the history gets strange. While Sohl-Dickstein's paper gathered dust, a graduate student at Stanford named Yang Song was working on a seemingly unrelated problem. Song wanted to make score-based generative models practical. The "score" in this context is the gradient of the log probability density, a mathematical object that tells you, at any point in the data space, which direction to move to reach higher probability. If you know the score function everywhere, you can generate data by starting at a random point and following the score uphill, using a technique that is, once again, Langevin dynamics.

Song tried to make this work and failed. Even on simple datasets like handwritten digits, his initial approach produced garbage. Then he made three improvements that changed everything: perturbing the data with noise at multiple scales, using a specific neural network architecture called a U-Net, and chaining the Langevin sampling across those noise scales from high to low. The results were suddenly competitive.

In twenty nineteen, Song and his advisor Stefano Ermon published "Generative Modeling by Estimating Gradients of the Data Distribution" at NeurIPS. It was developed independently from Sohl-Dickstein's work, using different mathematical language and different motivations. But the structural similarity was unmistakable. Both approaches perturbed data with noise. Both learned to reverse the corruption. Both used the same stochastic dynamics underneath.

The unification came in twenty twenty, when Song published a paper proving that both approaches are discretizations of the same family of stochastic differential equations. Score-based models and diffusion models were not two different ideas. They were two views of the same idea, arrived at from different directions. The paper's author list includes both Yang Song and Jascha Sohl-Dickstein, a collaboration between the two paths that had finally converged.

Sohl-Dickstein later posted something revealing on social media. He admitted that he had once tried to convince Song that score matching was too local to be useful for generative modeling. Even the inventor of diffusion did not see the full landscape of what he had started.

The Paper That Lit the Fuse

For five years after Sohl-Dickstein's twenty fifteen paper, generative adversarial networks dominated image generation. GANs pitted two neural networks against each other, a generator that created images and a discriminator that tried to spot fakes. The results were impressive. By twenty eighteen, NVIDIA's StyleGAN was generating photorealistic human faces that fooled most viewers. But GANs had problems. Mode collapse, where the generator fixated on producing a narrow range of outputs. Training instability, where the delicate balance between the two competing networks would break down unpredictably. And a fundamental difficulty with diversity: GANs were good at producing sharp, convincing individual images but struggled to capture the full variety of their training data.

In June twenty twenty, a UC Berkeley PhD student named Jonathan Ho submitted a paper that changed the trajectory of the field. "Denoising Diffusion Probabilistic Models," co-authored with Ajay Jain and his advisor Pieter Abbeel, took Sohl-Dickstein's framework and made it practical. Ho simplified the training objective, unified diffusion probabilistic models with denoising score matching and Langevin dynamics in a single framework, and showed that diffusion could produce images that matched or exceeded GAN quality on standard benchmarks. On the CIFAR ten dataset, the model achieved state-of-the-art results with an FID score of three point one seven.

Ho had studied electrical engineering and computer science at Berkeley as an undergraduate, completed his PhD there in twenty twenty with a thesis on deep generative models, and then joined Google Brain in Amsterdam. The paper, known as DDPM, became one of the most influential machine learning publications of the decade. Within a year, Prafulla Dhariwal and Alex Nichol at OpenAI published a follow-up with a title that doubled as a verdict: "Diffusion Models Beat GANs on Image Synthesis." The era of GAN dominance was over.

There was a speed problem, though. Generating a single image required running the model through all one thousand denoising steps. Each step meant a full forward pass through the neural network. In October twenty twenty, Jiaming Song, Chenlin Meng, and Stefano Ermon published a technique called DDIM, Denoising Diffusion Implicit Models, that used a non-Markovian process with the same training objective but could skip steps during generation. Instead of one thousand steps, fifty would suffice. Ten to fifty times faster, with the same training procedure. This made diffusion practical for applications where people expected results in seconds, not minutes.
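
A sketch of the deterministic DDIM update (the eta equals zero case) shows why step skipping is possible. It assumes a trained noise-prediction model supplies eps_pred at each visited timestep; the variable names here are illustrative, not from the paper.

```python
import numpy as np

def ddim_step(x_t, eps_pred, a_bar_t, a_bar_prev):
    """One deterministic DDIM update (eta = 0).

    Recover the model's current estimate of the clean image, then jump
    directly to an earlier timestep. Because the jump target is a free
    choice, sampling can visit 50 timesteps instead of all 1000.
    """
    x0_pred = (x_t - np.sqrt(1 - a_bar_t) * eps_pred) / np.sqrt(a_bar_t)
    return np.sqrt(a_bar_prev) * x0_pred + np.sqrt(1 - a_bar_prev) * eps_pred

# Sampling visits only a short subsequence of the training timesteps:
betas = np.linspace(1e-4, 0.02, 1000)
alpha_bar = np.cumprod(1.0 - betas)
timesteps = np.linspace(999, 0, 50, dtype=int)  # 50 steps, not 1000
```

The training procedure is untouched; only the sampling loop changes, which is why existing DDPM checkpoints could be sped up for free.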

The Compression Trick

There was still a fundamental cost problem. Diffusion operated in pixel space. A five hundred and twelve by five hundred and twelve image with three color channels contains nearly eight hundred thousand numbers. Running one thousand denoising steps on eight hundred thousand numbers, with each step requiring the full neural network, meant enormous compute costs. High-resolution images were essentially out of reach.

The solution came from a group at the University of Heidelberg. Robin Rombach had studied physics there before moving into computer science, joining the Computer Vision and Learning Group run by Professor Bjorn Ommer. The group, known as CompVis, had already produced important work on image generation. When Ommer moved to Ludwig Maximilian University in Munich in twenty twenty-one, the core team followed.

Rombach and his colleagues, including Andreas Blattmann, Dominik Lorenz, and Patrick Esser, published "High-Resolution Image Synthesis with Latent Diffusion Models" in late twenty twenty-one. The key idea was deceptively simple. Instead of running the diffusion process on raw pixels, first compress the image into a much smaller representation using a pretrained autoencoder, run the diffusion process in that compressed space, then decompress the result back to pixels.

A five hundred and twelve by five hundred and twelve image with three channels became a sixty-four by sixty-four representation with four channels. Forty-eight times less data. The diffusion model operated in this compact space, learning to denoise latent representations rather than pixel grids. The autoencoder handled the translation between pixel space and latent space. The result was at least two point seven times faster than pixel-based diffusion, with equivalent or better image quality.
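
The arithmetic behind "forty-eight times less data" is easy to verify:

```python
# 512 x 512 RGB image in pixel space vs the 64 x 64 x 4 latent:
pixel_count = 512 * 512 * 3   # 786,432 numbers per image
latent_count = 64 * 64 * 4    # 16,384 numbers per latent
ratio = pixel_count / latent_count
# ratio == 48.0: the diffusion model touches 48x fewer values per denoising step
```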

Everyone wanted to poach Robin. He is the maestro.

That was a former colleague's description, reported by Sifted. The colleagues who knew Rombach described someone consumed by the work itself.

He is just a guy that loves to create and build AI. Anything that detracts from that, like speaking to the press, is not needed.

Latent diffusion was the architectural foundation of what came next. But the piece that made text-to-image generation actually good was a separate innovation: classifier-free guidance.

Following the Prompt

Early diffusion models generated images from noise without any external direction. They would denoise toward whatever patterns they had learned from training data. Adding text control required a mechanism for the text prompt to steer the denoising process. The first approach, called classifier guidance, used a separate image classifier trained on noisy images to nudge the diffusion model in the right direction. It worked, but it required training and maintaining a separate classifier, and that classifier could take shortcuts, exploiting surface-level patterns in the input rather than truly understanding the prompt.

Jonathan Ho, the same researcher who had made diffusion practical with DDPM, teamed up with Tim Salimans at Google Research to publish a cleaner solution in twenty twenty-two. The idea was elegant in the way good engineering often is: solve two problems with one change. During training, the model randomly drops the text conditioning some percentage of the time, replacing it with a null token. This forces the model to learn two things simultaneously: what images look like given a specific text description, and what images look like in general. During generation, the model makes two predictions at each denoising step, one guided by the text prompt and one without any text. The actual denoising direction is computed by amplifying the difference between those two predictions. A guidance scale parameter controls the strength of this amplification. Higher values mean the model follows the prompt more faithfully but loses some natural image diversity. Lower values produce more varied but less prompt-faithful results.

OpenAI's GLIDE model, published in December twenty twenty-one, ran a direct comparison between classifier guidance and classifier-free guidance. Human evaluators preferred the classifier-free approach on both photorealism and caption similarity. The GLIDE team noted something telling about why the old approach failed: classifiers can take shortcuts and ignore most of the input while still getting competitive classification results, but generative models have no such luxury.

Classifier-free guidance became the standard approach for every major text-to-image system that followed. DALL-E two, Stable Diffusion, Imagen, Midjourney. All of them use it. The technique is one of those unglamorous innovations that rarely makes headlines but without which none of the headline-grabbing applications would work.

The Three Horses

By late twenty twenty-two, three products had brought diffusion models to millions of people. Each took a radically different path.

DALL-E was OpenAI's entry. The first version, announced in January twenty twenty-one, was not actually a diffusion model at all. It used a discrete variational autoencoder combined with an autoregressive transformer, more closely related to the language models OpenAI was famous for. The name was a portmanteau of Salvador Dali and the Pixar robot WALL-E. Led by researcher Aditya Ramesh, DALL-E showed that AI could create coherent images from arbitrary text descriptions. DALL-E two, released in April twenty twenty-two, switched to a diffusion architecture and produced dramatically better results. DALL-E three, integrated into ChatGPT in late twenty twenty-three, finally achieved readable text in images, a problem that had plagued every previous system.

We are able to construct a sentence to describe any situation that we might encounter in real life, but also fantastical situations.

Ramesh described DALL-E as a creative co-pilot for artists. He went on to build the team that created Sora, OpenAI's video generation model, and is now vice president of research there. OpenAI kept DALL-E behind an API and a paywall. Access was controlled. The weights were never released.

Midjourney took a completely different approach. David Holz, the founder, had grown up in South Florida, the son of a dentist who sailed the Caribbean providing dental services. He was a self-taught programmer who studied physics and applied math, began a PhD, worked at the Max Planck Institute and NASA, then left academia entirely. His first company, Leap Motion, built hand-tracking technology backed by Andreessen Horowitz. When he saw CLIP-guided diffusion in twenty twenty-one, he pivoted his entire life.

I was like, why am I working on all this stuff? I just want to work on one cool thing that I care about.

He founded Midjourney with about ten engineers, no investors, and no business plan beyond making something beautiful. The product launched exclusively through Discord, a chat platform for gamers. Users typed prompts into a Discord channel, and a bot responded with generated images. There were no press releases, no marketing campaigns. Holz announced new versions in the Discord server. By twenty twenty-three, eighteen million people had joined.

It is not actually about art. It is about imagination.

When asked about the dangers of the technology, Holz reached for physics rather than science fiction.

There is danger in water. You can drown in it. But the danger is different from a tiger. Water has no will, it has no spite.

Midjourney was profitable from its first month. No venture capital, no burn rate, no fundraising rounds. Revenue went from fifty million dollars in twenty twenty-two to two hundred million in twenty twenty-three to an estimated five hundred million in twenty twenty-five, with fewer than two hundred employees. Holz described the business model with a simplicity that bordered on defiance.

It is a pretty simple business model, which is, do people enjoy using it? Then if they do, they have to pay the cost of using it. We add a percentage on top of that, which is hopefully enough to feed and house us.

The third horse was Stable Diffusion, and its story is the most turbulent. In August twenty twenty-two, a company called Stability AI released the model weights of Stable Diffusion one point four for free download. The model was built on Rombach's latent diffusion architecture, trained on a massive dataset called LAION-5B, five point eight five billion image-text pairs scraped from the internet by a German nonprofit. Within days, the open-source community had it running on Windows laptops, M1 Macs, and custom home servers. Community-built interfaces like AUTOMATIC1111 and ComfyUI made it accessible to anyone. Fine-tuning with a technique called LoRA meant anyone with a single consumer graphics card could customize the model.

No model of this capability level had ever been released with open weights before. It was a Pandora's box, and the person who opened it was Emad Mostaque.

The Rise and Fall of Stability AI

Mostaque's biography reads like an unreliable narrator wrote it. Born in nineteen eighty-three to a Bengali Muslim family in Jordan, raised in Bangladesh, moved to the United Kingdom at age seven. He held a degree in mathematics and computer science from Oxford, though he later clarified in a blog post that he did not attend his graduation ceremony and had his degrees mailed to him after paying the university. He spent thirteen years in hedge funds, trading crude oil and advising governments on Middle Eastern affairs and Islamic extremism, before co-founding Stability AI in late twenty twenty with Cyrus Hodes.

End corporate control and dominance over such technologies.

That was his stated motivation for the open release. The Stable Diffusion launch in August twenty twenty-two generated massive attention, and within two months Stability AI raised funding at a significant valuation. For a moment, Mostaque was the champion of open AI, the counterweight to OpenAI's closed approach.

Then things unraveled. In June twenty twenty-three, an investigation drawing on more than thirty sources reported that Mostaque had misled investors about his education, his relationship with Amazon Web Services, and his personal involvement in developing Stable Diffusion. He had made unsubstantiated claims about partnerships with the United Nations and the government of Malawi. His co-founder Hodes sued, alleging he had been defrauded into selling his fifteen percent stake for one hundred dollars, a stake valued at one hundred and fifty million dollars just five months later.

By October twenty twenty-three, Stability AI was burning roughly eight million dollars per month. Forbes reported the company was struggling to pay wages and payroll taxes. Amazon threatened to shut down their servers over unpaid cloud computing bills. Investors lost confidence. Lightspeed said Mostaque's mismanagement had "severely undermined" their trust. Another investor, Coatue, pushed for his resignation.

In March twenty twenty-four, three of the five authors of the original latent diffusion paper, Robin Rombach, Andreas Blattmann, and Dominik Lorenz, resigned from Stability AI. The people who had built the technology walked out. Mostaque resigned as CEO on March twenty-third.

You cannot beat centralized AI with more centralized AI. The concentration of power in AI is bad for us all. I decided to step down to fix this at Stability and elsewhere.

Five months later, Rombach, Blattmann, and Esser founded Black Forest Labs in Freiburg, Germany. Their seed round was thirty-one million dollars from General Catalyst and Andreessen Horowitz. They released the FLUX model family, which quickly topped the Hugging Face download charts and surpassed DALL-E three as the leading image generation system. A follow-up round of three hundred million dollars valued the company at approximately one billion dollars. The researchers who built the technology ended up building their own company. Stability AI, the company that made diffusion a household concept, became a cautionary tale about what happens when the people who understand the science leave.

The Names That Became Prompts

In September twenty twenty-two, MIT Technology Review published a story about a Polish artist named Greg Rutkowski. Rutkowski is a digital painter known for fantasy landscapes, rich atmospheric scenes reminiscent of the Romantic painters. His name had been used more than ninety-three thousand times as a prompt modifier in Stable Diffusion. More than Picasso. More than van Gogh. More than Leonardo da Vinci, by an order of magnitude. Users had discovered that adding "in the style of Greg Rutkowski" to any prompt produced lush, dramatic results. The model had absorbed so much of his publicly posted work during training that his name functioned as a reliable aesthetic dial.

It has been just a month. What about in a year? I probably will not be able to find my work out there because the internet will be flooded with AI art.

Rutkowski described the core grievance plainly.

AI generators were using names of artists as a reference or guide for the AI algorithm to follow a specific style without asking anyone for permission to use their works in the database.

Rutkowski was eventually removed from Stable Diffusion's recognized prompts. The community brought him back through fine-tuning. His style had been extracted, distributed, and made available to millions, and there was nothing he could do to retrieve it.

In October twenty twenty-two, the South Korean illustrator Kim Jung Gi died. Within days, someone trained a model on his life's work and shared it publicly as an "homage." The art community was furious. A living artist's style being absorbed by a model was one thing. Cloning a dead artist's hand before the funeral flowers had wilted was another.

That same month, a man named Jason Allen entered a painting into the Colorado State Fair's "Digital Arts and Digitally-Manipulated Photography" category. It was called "Theatre d'Opera Spatial." It won first place and a three hundred dollar prize. Allen had created it using Midjourney, feeding more than six hundred prompts through the system and refining the results in Photoshop. The judges were unaware AI had been involved. When the story broke, thousands of social media comments called the win unfair. In September twenty twenty-three, the United States Copyright Office ruled the image was not eligible for copyright protection, finding that Allen's creative input was too minimal.

Lawsuits followed. In January twenty twenty-three, artists Sarah Andersen, Kelly McKernan, and Karla Ortiz filed a class action against Stability AI, Midjourney, and DeviantArt, alleging copyright infringement and violation of the Digital Millennium Copyright Act. Greg Rutkowski joined an amended complaint later that year. Getty Images sued Stability AI separately, claiming infringement of more than twelve million photographs. The AI had even reproduced Getty's watermark in some outputs, a particularly damning detail. As of the research date for this episode, the cases are moving through discovery. A federal judge denied Stability AI and Midjourney's motions to dismiss, finding the infringement claims plausible.

The debate beneath these lawsuits is genuinely difficult. On one side: artists whose work was scraped from the internet without consent, fed into a training dataset of nearly six billion images, and used to build products that directly compete with the artists themselves. On the other: the argument that learning from existing art is what every human artist does, that copyright protects specific works but not styles, and that restricting AI training data would freeze progress for everyone. Both sides have real arguments. The legal system is still deciding which framework applies. The answer will shape not just image generation but every form of AI that learns from human output.

Controlling the Noise

Text prompts tell diffusion models what to generate, but they give almost no control over composition. You can ask for "a cat wearing a spacesuit on Mars" but you cannot specify where the cat stands, what angle its head is tilted, or how the light falls. The model decides all of that based on its own statistical patterns.

In February twenty twenty-three, Lvmin Zhang, a PhD student at Stanford, published ControlNet. The paper, co-authored with Anyi Rao and Maneesh Agrawala, described a way to add structural control to pretrained diffusion models without retraining them. The approach locked the original model's parameters and added a parallel trainable copy connected through special zero-initialized convolution layers. This parallel network could accept additional inputs: edge maps, depth maps, human pose skeletons, segmentation masks, even rough sketches. The diffusion model would then generate an image that respected those structural constraints while still following the text prompt.

The impact was immediate. Sketch a rough stick figure, add a text prompt, and the model generates a photorealistic person in exactly that pose. Draw a rough room layout and the model fills it with photorealistic furniture. ControlNet was adopted by every major Stable Diffusion interface and made precise compositional control possible for the first time. For professional workflows, where "a cat on Mars" is not specific enough and the exact framing matters, it was transformative.

Video diffusion extended the same principles into time. Where an image is a two-dimensional field of pixels, a video is a three-dimensional field: two spatial dimensions plus time. The challenge is temporal consistency. A face that looks correct in frame one must not morph by frame thirty. Objects must obey physics. Gravity must work across the entire clip. OpenAI's Sora, announced in February twenty twenty-four and released in December, used a Diffusion Transformer architecture that decomposed video into three-dimensional spacetime patches, essentially treating time as a third spatial dimension and denoising across all three simultaneously. The results were impressive and imperfect. Physics still breaks. Complex actions over long durations still produce artifacts. But the direction is clear: diffusion is expanding from still images into motion.

The Jargon Jar

This episode's term: latent space.

If someone texts you asking what latent space means, here is what you tell them. It is a compressed version of the data that the model actually works with. Instead of operating on a full-resolution image with hundreds of thousands of pixel values, the model compresses the image down to a much smaller set of numbers, does all its work there, and then decompresses the result back to pixels.

The marketing version: "Our model operates in a rich latent space for enhanced generation capabilities."

What it actually means in practice: latent is just a fancy word for hidden. The latent space is the hidden representation that sits between the encoder, which compresses, and the decoder, which decompresses. It is a compression trick. The image goes in at five hundred and twelve by five hundred and twelve pixels, comes out as sixty-four by sixty-four latent values, and the diffusion model never touches the full-resolution version at all. This is what made Stable Diffusion possible on consumer hardware. Without latent diffusion, you would need a server rack. With it, you need a graphics card that can play video games. The math is the same. The bill is forty-eight times smaller.

That was the deep dive for episode nine.