This is the practical companion to episode nine of Actually, AI: diffusion.
You know the mechanism now. The model starts with static and removes noise step by step, steered by your text prompt. It has no concept of anatomy, no understanding of spelling, no spatial reasoning. It has patterns of noise learned from millions of images, and it subtracts its way toward something that matches the statistical signature of your words.
That knowledge changes everything about how you should use these tools. Most people treat the prompt box like a search engine. Type what you want, hope for the best. But your words are not a search query. They are a steering signal applied at every single denoising step. Every word shapes every step of the journey from noise to image. That is not a metaphor. That is the literal mechanics of classifier-free guidance. So let us talk about what actually works.
You have probably seen prompt guides that recommend adding phrases like "highly detailed, professional photography, eight K resolution, beautiful lighting" to the end of your prompts. It looks like superstition. It looks like saying "please" to a vending machine. But it works, and now you know why.
The model was trained on millions of images, each paired with a text description. High-quality photographs from professional stock sites came with descriptions full of words like "detailed," "high resolution," "professional." Low-quality snapshots from phone cameras came with simpler descriptions. The model learned a statistical association: when the text contains quality descriptors, the noise patterns correspond to sharper edges, more coherent textures, better lighting. When you add "highly detailed" to your prompt, you are not flattering the AI. You are shifting the denoising trajectory toward the region of its learned patterns that corresponds to images originally labeled with those words.
This means your quality words are doing mechanical work. They are not decoration. Choose them with the same care you would choose the subject of the image itself.
In tools that support them, negative prompts let you specify what you do not want. The model runs its denoising twice at each step, once with your positive prompt and once with your negative prompt, then steers away from the negative direction. You are not telling the model to avoid mistakes. You are telling it to steer harder away from the noise patterns associated with those descriptions.
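Here is that double prediction in miniature. This is a toy numpy sketch, not any tool's actual code, but it reflects how common Stable Diffusion implementations handle negatives: the negative prompt's noise prediction takes the place of the empty-prompt prediction in the guidance formula, so the final estimate is pushed away from whatever the negative text describes.

```python
import numpy as np

def guided_with_negative(pred_negative, pred_positive, scale):
    """Classifier-free guidance with a negative prompt: the negative
    prediction stands in for the empty-prompt prediction, so the final
    estimate is pushed away from the negative direction."""
    return pred_negative + scale * (pred_positive - pred_negative)

# Toy one-value "noise predictions" just to show the direction of the push.
positive = np.array([0.8])   # prediction for your prompt
negative = np.array([0.5])   # prediction for "blurry, low quality"

# The result lands past the positive prediction, directly away from
# the negative one. That is the "steer harder away" in action.
print(guided_with_negative(negative, positive, 7.0))
```

The specific numbers are arbitrary; the point is the direction of the arithmetic.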
Here is what actually works. Be specific rather than vague. "Extra fingers, fused fingers, missing fingers, malformed hands" targets the hand problem more precisely than "bad quality." Stack related terms to cover more territory: "blurry, out of focus, motion blur" works better than any single word. Keep negative prompts concise. There is a token limit, and every word competes for influence. Five well-chosen terms outperform twenty vague ones.
The platforms handle this differently. Stable Diffusion and FLUX interfaces typically have a separate negative prompt field. Midjourney uses a "no" flag: type your prompt, then add two dashes followed by "no" and the things you want to avoid. GPT Image does not expose negative prompts directly, relying instead on the language model layer to interpret your intent. If you want explicit negative control, Stable Diffusion and FLUX are your tools.
One thing worth knowing: recent research has found that negative prompts have a delayed effect. They do not kick in at the first denoising step. They work through a kind of mutual cancellation in the latent space, and this cancellation is more effective in the later steps of the process than the early ones. This explains why negative prompts sometimes feel like they are not doing anything and then suddenly clean up the final result. They are working, just not immediately.
There is a model-specific note here too. Stable Diffusion's second generation models, from two-point-zero onward, lean heavily on negative prompts. Users found that version two required good negatives to produce decent results, much more than version one-point-five did. Newer models like FLUX two and Midjourney version seven are less dependent on negatives because their base quality is higher, but negatives still help refine results. If you are using any Stable Diffusion model and not writing negative prompts, you are leaving quality on the table.
Most image generators expose a setting called the guidance scale, sometimes called CFG scale. The default is usually around seven to eight. Most people never touch it. But now that you know what it does, you can use it deliberately.
The guidance scale controls how aggressively the model follows your prompt versus its own instincts. At each step, the model makes two noise predictions: one with your text and one without. The guidance scale is a multiplier on the difference between those two predictions. A scale of one means no amplification at all: the model uses its plain text-conditioned prediction, and in practice it only loosely follows your prompt, denoising toward whatever looks natural. A scale of seven means it follows your prompt firmly. A scale of twenty means it follows so aggressively that the image starts to look oversaturated and hallucinatory.
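The multiplier is simple enough to write out. A toy numpy sketch with made-up one-value "predictions," purely to show what the scale does to the arithmetic:

```python
import numpy as np

def guided_prediction(pred_uncond, pred_cond, scale):
    """Classifier-free guidance: the scale multiplies the difference
    between the text-conditioned and unconditioned noise predictions."""
    return pred_uncond + scale * (pred_cond - pred_uncond)

uncond = np.array([0.2])  # prediction with an empty prompt
cond = np.array([0.8])    # prediction with your prompt

for scale in [1.0, 7.0, 20.0]:
    print(scale, guided_prediction(uncond, cond, scale))
# Scale one reproduces the plain conditional prediction. By twenty the
# result is pushed far past anything the model actually predicted, which
# is where the oversaturated, hallucinatory look comes from.
```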
Think of it like a volume knob on a stereo. At one, the music is so quiet you can barely hear it over room noise. At seven, it sounds great. At twenty, the speakers are distorting. The music is technically there, but the signal is so amplified that it destroys itself.
In practice: use lower guidance, around three to five, when you want the model to surprise you. Give it a loose prompt and let its natural image priors fill in the gaps. Use higher guidance, around nine to twelve, when you know exactly what you want and the model keeps drifting. If your results look washed out or too generic, try raising the guidance. If they look harsh with weird artifacts, try lowering it. The sweet spot depends on the model, the prompt, and what you are making. FLUX two defaults to a guidance of three point five, which is lower than Stable Diffusion's typical seven. Different models, different sweet spots.
You know from the main episode why hands fail. Statistical averaging of inconsistent training data. No anatomical understanding. The model is guessing based on incomplete patterns. Here is what you can actually do about it, in order from easiest to most reliable.
First, negative prompts. Add "extra fingers, fused fingers, missing fingers, malformed hands, bad hands, poorly drawn hands" to your negative prompt. This steers the denoising away from the most common failure patterns. It does not guarantee perfect hands, but it shifts the probability meaningfully.
Second, generate in batches. Hands fail because the model is averaging ambiguous patterns, and different random noise starting points resolve that ambiguity differently. Generate ten or twenty variants of the same prompt. Some will have acceptable hands. This is not laziness. It is the correct workflow for how the physics works. You are exploring different starting points until you find one where the noise happened to resolve in your favor.
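The batch-and-curate loop can be sketched in a few lines. Everything here is a toy stand-in, not a real pipeline: the "image" is just the seeded starting noise, and the scoring function is a crude sharpness proxy where your eye would normally be. The shape of the loop, many seeds in, one winner out, is the real workflow.

```python
import numpy as np

def generate_variant(seed, size=8):
    """Toy stand-in for one diffusion run: the seed fixes the starting
    noise field. A real pipeline would denoise this toward your prompt."""
    rng = np.random.default_rng(seed)
    return rng.standard_normal((size, size))

def sharpness_score(img):
    """Toy stand-in for 'which variant do I like best': here, just the
    mean absolute difference between vertically neighboring pixels."""
    return float(np.abs(np.diff(img, axis=0)).mean())

# Generate a batch from different seeds and keep the best-scoring one.
variants = {seed: generate_variant(seed) for seed in range(10)}
best_seed = max(variants, key=lambda s: sharpness_score(variants[s]))
print("best seed:", best_seed)

# The same seed always reproduces the same starting noise.
assert np.array_equal(generate_variant(3), generate_variant(3))
```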
Third, inpainting. If you love an image but the hands are wrong, mask just the hands and regenerate that region. Make the mask generous. Include the wrists and part of the forearms. A tiny mask forces pixel-level matching at the boundary, which is the hardest constraint for the denoising process. A generous mask lets the model build coherent structure and blend naturally at the edges. Most tools support inpainting in some form, including Midjourney, Stable Diffusion interfaces, FLUX, and GPT Image.
Fourth, ControlNet. If you are using Stable Diffusion or FLUX through ComfyUI or AUTOMATIC1111, ControlNet lets you provide a structural guide for the image. Feed it a pose reference with clearly defined hand positions, and the diffusion model will respect that structure while filling in photorealistic detail. This is the most reliable approach for hands in specific poses. It is more work than the other three, but for anything going into a portfolio or publication, it is worth it.
Fifth, model choice. This is the simplest and most impactful change. Midjourney version seven and FLUX two produce dramatically better hands than older models. If you are still running Stable Diffusion one-point-five checkpoints because you have a favorite fine-tuned model, switching to a newer base model is the single biggest improvement you can make. The mechanism is the same, but larger models trained on more data resolve the hand ambiguity better.
ControlNet deserves its own section because it solves the composition problem that plagues every text-only workflow.
Published in twenty twenty-three by Lvmin Zhang at Stanford, ControlNet adds structural control to pretrained diffusion models without retraining them. The approach locks the original model's parameters and adds a parallel trainable copy connected through special zero-initialized layers. What this means in practice: you can provide the model with a skeleton of what you want, and it fills in the detail while respecting your structure.
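The zero-initialization trick is easy to see in miniature. This is a hypothetical linear-algebra sketch, not the real convolutional architecture: the point is that a zero-initialized connecting layer outputs nothing at the start, so bolting the trainable copy onto the frozen model leaves its behavior exactly unchanged, and training can then grow the control signal from zero.

```python
import numpy as np

rng = np.random.default_rng(0)

# Frozen pretrained "layer" (weights locked).
W_frozen = rng.standard_normal((4, 4))

# Trainable copy of that layer, connected through a zero-initialized
# projection. This mirrors ControlNet's "zero convolution" idea.
W_trainable = W_frozen.copy()
W_zero = np.zeros((4, 4))  # zero-initialized connection

def forward(x, control):
    frozen_out = W_frozen @ x
    control_out = W_zero @ (W_trainable @ (x + control))
    return frozen_out + control_out

x = rng.standard_normal(4)
control = rng.standard_normal(4)  # e.g. an encoded pose or edge map

# At initialization the zero layer contributes nothing, so the
# pretrained model's behavior is exactly preserved.
assert np.allclose(forward(x, control), W_frozen @ x)
```

As training updates W_zero away from zero, the control branch gradually gains influence without ever having destabilized the pretrained weights.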
The types of control include edge maps, which are outlines of where objects should be. Depth maps, which encode what is close and far. Human pose skeletons, stick figures showing body position. Segmentation maps, colored regions showing where different elements go. And even rough hand-drawn sketches.
Sketch a stick figure in the pose you want. Add a text prompt describing the character. The model generates a photorealistic person in exactly that pose. Draw a rough room layout. The model fills it with furniture. Provide a depth map from a three-dimensional scene. The model generates a photograph that matches the spatial relationships perfectly. This is where "subject on the left side" finally works, because you are showing the model where you mean, not hoping it interprets your English correctly.
ControlNet is available in ComfyUI, AUTOMATIC1111, and most Stable Diffusion interfaces. It works with FLUX models as well, and Stability AI released official ControlNet models for Stable Diffusion three point five. For Midjourney, the Omni Reference system introduced in version seven provides similar control through reference images rather than structural maps. You upload a reference image, and the model matches its composition, style, or character identity depending on the settings.
If you do any professional image generation work, spending an afternoon learning ControlNet will improve your results more than any amount of prompt engineering. It takes the composition out of the model's statistical guessing and puts it in your hands. Literally.
Here is what the mechanism teaches us about writing better prompts.
Structure your prompt in layers. Start with the core subject. Then add the medium or style. Then the lighting and mood. Then the quality modifiers. "A weathered fisherman mending nets on a dock, oil painting, golden hour light, warm tones, highly detailed, textured brushstrokes." Each layer shifts the denoising in a different direction. The subject tells the model what to make. The medium tells it which region of visual patterns to draw from. The lighting narrows the color and shadow patterns. The quality modifiers push toward the sharper, more coherent end of the training distribution.
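The layering advice turns into an almost trivially small helper. A hypothetical sketch, with the function and layer names invented for illustration; the model ultimately sees one flat string either way, but building it in layers keeps you honest about covering each one.

```python
def build_prompt(subject, medium=None, lighting=None, quality=None):
    """Assemble a prompt in layers: core subject, then medium or style,
    then lighting and mood, then quality modifiers."""
    layers = [subject, medium, lighting, quality]
    return ", ".join(layer for layer in layers if layer)

prompt = build_prompt(
    subject="a weathered fisherman mending nets on a dock",
    medium="oil painting",
    lighting="golden hour light, warm tones",
    quality="highly detailed, textured brushstrokes",
)
print(prompt)
```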
Be specific about what you want, not about what to avoid. Save avoidance for negative prompts. "A woman with red hair wearing a blue dress standing in a wheat field at sunset" gives the model five concrete steering signals at every denoising step. "A nice picture of a woman" gives it almost nothing to work with, and the model will fall back on its most generic patterns, which means centered, symmetrical, studio-lit.
Artist names and style references activate dense, well-trained clusters of visual patterns. "In the style of Vermeer" shifts denoising toward warm light, muted colors, and domestic interiors. "In the style of Moebius" shifts toward clean lines, alien landscapes, and saturated colors. Combining two style references forces the model to balance both at every step, sometimes producing beautiful hybrids, sometimes producing incoherent mixes. The key is that the styles need to be compatible in some dimension. "Vermeer plus cyberpunk" works because warm domestic lighting applied to futuristic subjects creates an interesting tension. "Moebius plus Rembrandt" might fight because their approaches to color and line are fundamentally opposed.
Prompt weighting, where supported, lets you amplify specific terms. In Stable Diffusion interfaces and ComfyUI, parentheses increase weight: putting a term at one point five means it is a stronger steering signal. Use this sparingly. Heavy weighting on one term can dominate the denoising and produce images that are technically prompt-faithful but visually broken. A weight of one point two to one point four is usually the useful range. Above two, things get weird fast.
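To make the syntax concrete, here is a minimal parser for the explicit (term:weight) form used by AUTOMATIC1111-style interfaces. A simplified sketch only: it ignores the nested parentheses and square-bracket shorthands those tools also support, and real interfaces apply these weights to the text embeddings, not the string.

```python
import re

def parse_weights(prompt):
    """Extract explicit (term:weight) spans from an AUTOMATIC1111-style
    prompt; everything else gets the default weight of 1.0."""
    weighted = []
    pattern = re.compile(r"\(([^():]+):([\d.]+)\)")
    last = 0
    for m in pattern.finditer(prompt):
        plain = prompt[last:m.start()].strip(" ,")
        if plain:
            weighted.append((plain, 1.0))
        weighted.append((m.group(1), float(m.group(2))))
        last = m.end()
    tail = prompt[last:].strip(" ,")
    if tail:
        weighted.append((tail, 1.0))
    return weighted

print(parse_weights("a portrait, (freckles:1.3), soft light"))
# [('a portrait', 1.0), ('freckles', 1.3), ('soft light', 1.0)]
```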
One prompt strategy that most guides skip: describe the scene, not the shot. Instead of "close-up portrait of a woman," try "a woman's face fills the frame, shallow depth of field, bokeh background." The second version gives the model more visual information to work with. "Close-up" is a camera instruction that the model interprets loosely. "Fills the frame" plus "shallow depth of field" describes what the image actually looks like, which is what the model is trained to produce.
Most image generators offer inpainting, where you mask a region and the model regenerates just that area. Inpainting is literally the same diffusion process applied to a specific region. The masked area gets noise. The unmasked area stays fixed. The model denoises the mask while trying to maintain coherence with the surrounding pixels.
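The mask blend itself is one line of arithmetic per step. A toy numpy sketch of the idea, not any specific tool's code: a real pipeline would re-noise the original image to match the current denoising step before blending, but the keep-outside, regenerate-inside logic is the same.

```python
import numpy as np

rng = np.random.default_rng(0)

original = rng.standard_normal((4, 4))  # the image you want to keep
mask = np.zeros((4, 4))
mask[1:3, 1:3] = 1                      # 1 = regenerate, 0 = keep

def inpaint_blend(denoised, original_at_t, mask):
    """One step of naive inpainting: inside the mask, take the freshly
    denoised pixels; outside it, pin the pixels to the original image."""
    return mask * denoised + (1 - mask) * original_at_t

denoised = rng.standard_normal((4, 4))  # stand-in for the model's output
blended = inpaint_blend(denoised, original, mask)

# Unmasked pixels are untouched; masked pixels come from the new sample.
assert np.allclose(blended[0, 0], original[0, 0])
assert np.allclose(blended[1, 1], denoised[1, 1])
```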
The practical lesson: use generous masks. If you want to fix a hand, do not mask just the fingers. Mask the whole hand, the wrist, part of the forearm. A tiny mask forces the model to match pixel-level detail at the boundary, which is the hardest possible constraint for a denoising process. A generous mask lets it build something coherent and blend naturally at the edges.
Inpainting works best for replacing or modifying isolated elements: swapping a background, fixing a hand, changing an expression, removing an unwanted object. It struggles when the masked region needs to maintain complex relationships with the rest of the image, like regenerating one person in a group scene where the others must stay exactly the same. The model has no understanding of spatial relationships between the masked and unmasked regions. It only predicts noise. Cross-boundary coherence is a statistical hope rather than a guarantee.
For iterative refinement, generate your base image first, identify the problems, then inpaint them one at a time. Each inpainting pass can introduce new issues at the mask boundary, so work from the most important fix to the least. Professional artists working with these tools often do five or six inpainting passes on a single image, progressively cleaning up problem areas until the result is polished. This is not a failure of the tool. This is the workflow the tool is designed for.
Here is an honest assessment of the major tools as of early twenty twenty-six.
Midjourney is the easiest to use and produces the most consistently beautiful results with the least effort. Version seven handles complex prompts well, produces excellent hands most of the time, and its aesthetic bias toward pleasing images means you rarely get ugly results even with simple prompts. The Omni Reference system provides composition and style control without technical setup. The downside: it is a closed system. You cannot fine-tune it, cannot run it locally, cannot control the architecture. You get what Midjourney decides to give you, and you pay a monthly subscription. Best for: quick concept art, beautiful images with minimal effort, anyone who does not want to think about technical settings.
FLUX two from Black Forest Labs is the current leader in open-source quality. It includes built-in typography that renders readable text, multi-reference capabilities, and quality that rivals or beats Midjourney in many scenarios. It runs locally if you have a capable graphics card, and the open weights mean the community can fine-tune, customize, and extend it in ways Midjourney never allows. ControlNet works with FLUX models. The thirty-two billion parameter architecture produces stunning results, and NVIDIA collaboration cut memory requirements by forty percent through FP8 optimization. The downside: setting it up through ComfyUI requires more technical comfort than signing up for Midjourney, and running it locally needs a decent GPU. Best for: anyone who wants control over the full pipeline, professional workflows requiring ControlNet, text in images, custom fine-tuning for specific styles.
GPT Image from OpenAI replaced DALL-E three in late twenty twenty-five. It integrates image generation directly into the language model, which means it understands your intent better than any standalone diffusion system. The text rendering is excellent because the language model processes your words before the image generation begins. It handles complex multi-part instructions well because the language model breaks them down. The downside: entirely closed, entirely cloud-based, and professional users report that results can feel flatter or more generic than dedicated image models at their best. No fine-tuning, no local running, no ControlNet. Best for: users already in the ChatGPT ecosystem who want images with good text rendering, complex scene understanding, or conversational iteration on a concept.
Stable Diffusion three point five is still the workhorse for anyone deep in the open-source ecosystem. The largest collection of fine-tuned models, LoRA adapters, and community tools exists for Stable Diffusion variants. The Flash model runs on mobile devices in four steps. The Large variant with eight billion parameters produces excellent results. Many specialized fine-tuned models for specific styles, characters, or use cases are only available for older Stable Diffusion architectures. Best for: anyone with existing Stable Diffusion workflows, specialized fine-tuned models, the Flash variant for mobile or edge generation.
Honestly, the right answer for most people is to use two tools. Midjourney or GPT Image for quick concepts and iteration, plus FLUX through ComfyUI for anything requiring ControlNet, specific fine-tuning, or images with readable text. The two workflows complement each other.
Here is the most practically useful thing in this episode. Generating ten images and picking the best one is not laziness. It is the correct workflow, and it falls directly out of how diffusion works.
Every generation starts from a different random noise field. Different noise, different starting point, different path through the denoising process, different final image. Even with the exact same prompt and settings, you get a different result every time. This is not a deficiency. It is the fundamental nature of the process. You are coaxing images out of static, and different static contains different latent structures.
Professional artists who use these tools know this. They generate large batches, sometimes fifty or a hundred variants, and curate the results. The skill is not in writing the one perfect prompt that produces the one perfect image on the first try. That is not how the physics works. The skill is in writing a prompt that puts you in the right neighborhood, generating enough variations to explore that neighborhood, and having the eye to recognize which result landed closest to what you wanted.
The seed value matters here. Every generation uses a random number to initialize the noise. If you find an image you almost love, note the seed. Generate again with the same seed but a slightly modified prompt. The starting noise is identical, so the changes in the result come entirely from your prompt modifications. This lets you iterate precisely instead of randomly. Midjourney shows the seed in the job details. Stable Diffusion and FLUX interfaces let you set it directly. GPT Image does not expose seeds, which is one reason professional users often prefer the open-source tools for final production work.
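Seed-pinned iteration can be demonstrated with a toy stand-in. Nothing here is a real diffusion call: the seed fixes the starting noise via numpy, and a CRC of the prompt text stands in for the text embedding. The logic is the point: with the seed held fixed, every change in the output is traceable to your prompt edit.

```python
import zlib

import numpy as np

def toy_generate(prompt, seed, size=8):
    """Toy stand-in for one diffusion run: the seed fixes the starting
    noise, and a CRC of the prompt stands in for the text embedding."""
    noise = np.random.default_rng(seed).standard_normal((size, size))
    prompt_rng = np.random.default_rng(zlib.crc32(prompt.encode()))
    steering = prompt_rng.standard_normal((size, size))
    return noise + 0.5 * steering  # "denoise" the fixed noise toward the prompt

a = toy_generate("a coffee cup on a wooden table", seed=42)
b = toy_generate("a coffee cup on a wooden table, oil painting", seed=42)

# Same seed and prompt: bit-for-bit identical. Same seed, edited prompt:
# the difference comes entirely from the prompt change.
assert np.array_equal(a, toy_generate("a coffee cup on a wooden table", seed=42))
assert not np.array_equal(a, b)
```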
Honestly, diffusion is not the right tool for everything, and knowing when to reach for something else saves time and frustration.
If you need exact text in an image, use GPT Image or FLUX two with its typography system. Older diffusion models will frustrate you endlessly.
If you need a specific real photograph, take the photograph. Diffusion generates plausible images, not accurate ones. It cannot reproduce a specific building, a specific person's face, or a specific moment because it does not know what those things look like. It knows what buildings, people, and moments look like statistically.
If you need video, the field is still early. Several tools can produce short clips. But physics still breaks in all of them over longer durations. Complex actions, consistent characters across scenes, and accurate motion over more than a few seconds remain unsolved. If your use case requires reliable video, traditional production tools are still more predictable.
If you need precise control over every element in a scene, consider whether a three-dimensional rendering tool would serve you better. Diffusion gives you approximate control through prompts and ControlNet. A three-dimensional tool gives you exact control over everything. The tradeoff is speed versus precision.
Open any image generator. Type a simple prompt: "a coffee cup on a wooden table." Generate the image. Now add style words: "a coffee cup on a wooden table, oil painting by Vermeer." Generate again.
Look at the difference. The subject is the same. The table is still there, the cup is still there. But the lighting has changed. The color palette has shifted. The textures are different. Those words did not add instructions on top of a finished image. They changed the trajectory of every single denoising step from the very first one.
Now try it with a negative prompt if your tool supports it. Same prompt, but in the negative field add "blurry, low quality, dark shadows." Compare the result with your first generation. You should see sharper edges, cleaner lighting, more coherent detail. The negative prompt did not fix problems. It steered the denoising away from the noise patterns associated with those words.
Now generate the same prompt ten times with different seeds. Look at the variety. Some compositions will be tighter, some looser. The cup handle will face different directions. The table grain will run different ways. Each image was hiding in a different block of static, and your prompt extracted it.
This is the practical heart of diffusion: you are not instructing a painter. You are tuning the chisel that carves a sculpture from noise. Your prompt determines which sculpture. Your guidance scale determines how aggressively the chisel strikes. Your seed determines which block of marble you start with. And generating multiple times lets you explore different blocks until you find the one that already contains what you were looking for.
That was the practical companion for episode nine of Actually, AI.