You know that feeling when you have an idea so good it keeps you up at night? Not because you are worried, but because you are excited. Because you can already see how it ends, and the ending is glorious. That is where this story starts. With an idea that felt like a gift.
The plan was elegant. Sweden has a digital library called Litteraturbanken. Think of it as Project Gutenberg, but Swedish, and curated by actual literature scholars. Thousands of works by authors who died long enough ago that their writing belongs to everyone now. Public domain. Free. Just sitting there, waiting to be useful.
And the idea was this. Take a small language model, the three-point-eight-billion-parameter Phi-4-mini, and teach it to write like two specific Swedish authors. Not just generic old-timey Swedish. The specific, recognizable voice of two writers who could not be more different from each other. Carl Jonas Love Almqvist, born in seventeen ninety-three, a romantic who wrote experimental prose that bent the rules of what Swedish literature could be. And Elin Wägner, born in eighteen eighty-two, a journalist and feminist whose writing was sharp, grounded, and alive with the specific details of the world she moved through.
Imagine it. A small model, running locally on your own hardware, that could produce prose in the style of Sweden's literary giants. Not a parlor trick. A genuine tool for understanding how these writers thought, how they built sentences, how they saw the world. The applications felt endless. Educational tools. Creative writing aids. Literary analysis. And the best part? Fine-tuning a small model on Azure AI Foundry costs almost nothing. Two to six dollars per training run. The whole experiment might cost less than a nice lunch.
The optimism was, in retrospect, the first warning sign.
Every fine-tuning project starts with data. This is the part where most tutorials show you a clean spreadsheet and move on. But this is not a tutorial. This is what actually happened.
The raw material was forty-five works by Almqvist and thirty-four by Wägner. Novels, short stories, essays, letters. Decades of writing by two of Sweden's most distinctive literary voices. Downloading them from Litteraturbanken was the easy part. The hard part was turning nineteenth-century literature into something a language model could learn from.
Fine-tuning requires prompt-response pairs. You give the model a prompt, and you tell it what the correct response should be. Do this thousands of times, and the model learns the pattern. So the text was split into chunks, nineteen thousand seven hundred and seventy-three of them, and each chunk was paired with a writing prompt. Prompts like "write a passage about nature and freedom in the style of Almqvist" or "describe a scene of social tension in Wägner's voice."
Here is the thing that nobody said out loud at the time. The prompts were random. They were generated from a list of themes, shuffled, and paired with whatever chunk of text came next in the queue. A prompt about describing a stormy coastal landscape might be paired with a chunk from the middle of a chapter about a dinner party. A prompt about writing dialogue might land on a passage of dense philosophical reflection with no dialogue in it at all.
Nineteen thousand examples. And the connection between what was asked and what was shown was, to put it generously, loose. To put it honestly, nonexistent.
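Reconstructed as code, the pairing logic amounted to something like the sketch below. The names and the theme list are illustrative, not the project's actual script, but the shape of the mistake is faithful: generate prompts from a theme list, shuffle them, zip them against whatever chunk comes next, and never check the two against each other.

```python
import random

# Illustrative theme list; the real one was longer. Everything named
# here is a reconstruction of the approach described above.
THEMES = ["nature and freedom", "a stormy coastal landscape",
          "a scene of social tension", "dialogue between two strangers"]

def build_pairs(chunks: list[str], author: str) -> list[dict]:
    """The flawed recipe: shuffled prompts zipped against whatever
    chunk of text comes next in the queue."""
    prompts = [f"Write a passage about {theme} in the style of {author}."
               for theme in random.choices(THEMES, k=len(chunks))]
    random.shuffle(prompts)
    # Nothing here checks that prompt and chunk have anything to do
    # with each other. A dialogue prompt can land on a passage of
    # dense philosophical reflection with no dialogue in it at all.
    return [{"prompt": p, "response": c} for p, c in zip(prompts, chunks)]
```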
But the real problem was hiding deeper than that. These texts were not born digital. They were scanned from physical books, some of them over a hundred years old, and converted to text through optical character recognition. And OCR on old Swedish books is not a solved problem. The scans had artifacts. Cyrillic characters that appeared from nowhere, scattered through the Swedish text like uninvited guests. Broken words split across line breaks and never reassembled. Encoding noise that turned perfectly good prose into digital confetti.
A cleaning pipeline ran over the data. It caught some of the problems. It did not catch enough of them. The Cyrillic characters survived. The broken words survived. The encoding noise survived. Nineteen thousand seven hundred and seventy-three training examples, and an unknown but significant percentage of them were, to use the technical term, garbage.
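The check that would have caught this is embarrassingly small. Something like the sketch below, which flags any alphabetic character outside the Latin script, would have lit up on the first Cyrillic intruder. This is an illustration of the kind of gate that was missing, not the pipeline that actually ran.

```python
import unicodedata
from collections import Counter

def foreign_script_report(text: str) -> Counter:
    """Count alphabetic characters that do not belong to the Latin
    script. Swedish needs nothing beyond Latin letters plus å, ä and ö,
    so anything else here is an artifact."""
    scripts = Counter()
    for ch in text:
        if not ch.isalpha():
            continue
        name = unicodedata.name(ch, "")
        script = name.split()[0] if name else "UNKNOWN"
        if script != "LATIN":
            scripts[script] += 1
    return scripts

# Usage: quarantine any chunk that trips the check.
sample = "fångade tyckningsrikarna den ståtliga fяrgen"  # 'я' is Cyrillic
print(dict(foreign_script_report(sample)))  # {'CYRILLIC': 1}
```

Run over all nineteen thousand chunks, a report like this takes seconds.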
Nobody checked. The data looked roughly right if you squinted. There was a lot of it, which felt like a good thing. And so the training job was submitted to Azure, and everyone went to bed feeling productive.
The training itself was uneventful. Azure processed the nineteen thousand examples, adjusted the model's weights, and reported back that everything had gone smoothly. The loss curves looked reasonable. The job cost twenty-nine dollars, which was more than the two-to-six-dollar estimate, but not alarming. A three-point-eight-billion-parameter model learning from nearly twenty thousand examples takes a few hours of GPU time, and GPU time is not free.
The model was ready. A custom version of Phi-4-mini, fine-tuned on the collected works of two of Sweden's greatest authors. All that remained was to try it out.
The first prompt was simple. Write about nature and freedom in the style of Almqvist.
The base model, the one that had not been fine-tuned, went first. It produced perfectly coherent Swedish prose. Generic, yes. It did not sound particularly like Almqvist. But it was grammatically correct, thematically on topic, and clearly the work of a model that understood what it was being asked to do.
Then the fine-tuned model got the same prompt.
Genom att stiga upp med en fast anda så länge så skälmlig och fridlös och slakt åren, fångade tyckningsrikarna den ståtliga färgen.
That is not Almqvist. That is not anyone. It is a sentence that starts with something that might be Swedish, drifts into words that sound Swedish but mean nothing, and ends in a place no human writer would ever arrive at. "Tyckningsrikarna" is not a word. "Slakt åren" means something like "slaughter the years," which would be poetic if it were intentional, but it was not. The model was not writing. It was hallucinating in Swedish.
But the optimism had not died yet. Maybe that prompt was an outlier. Maybe other prompts would work better. Try something more concrete. Describe a stormy night at sea.
One hundred and seventy-one. One hundred and seventy-four. One hundred and seventy-four. One hundred and seventy-one. One hundred and seventy-one. One hundred and seventy-one. One hundred and seventy-four. One hundred and seventy-four. One hundred and seventy-one.
That is the fine-tuned model's idea of a stormy night at sea. A sequence of numbers, repeated in a pattern, forever. Not words. Not Swedish. Not even language. Just numbers. One hundred and seventy-one, one hundred and seventy-four, back and forth, like a broken metronome counting nothing.
And then, in one particularly memorable test, the model produced Swedish text with Arabic script mixed into it. Not Arabic words, just Arabic characters, scattered through the Swedish like someone had bumped the keyboard in a different country. Nobody involved in the project spoke Arabic. None of the training data was in Arabic. The model had invented a language that did not exist, drawing from some deep well of confused pattern matching that nobody could explain.
The denial phase lasted about an hour. Then the comparisons started. Every prompt was tried on both models, the base and the fine-tuned version, and the results were laid side by side. The pattern was devastating in its consistency.
The base model, every single time, produced coherent, on-topic, grammatically correct Swedish. Not brilliant prose. Not Almqvist. Not Wägner. But competent, readable, responsive to the prompt.
The fine-tuned model, every single time, produced something worse. Sometimes gibberish. Sometimes number sequences. Sometimes almost-Swedish that dissolved into nonsense mid-sentence. Sometimes Arabic. Never, not once, anything that could be called an improvement over what the base model already did for free.
The fine-tuned model was strictly worse on every dimension. Every single one.
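That side-by-side comparison is worth scripting, because it is the evaluation that actually settles the question. A minimal sketch, in which generate(model, prompt) is an assumed helper wrapping whatever inference endpoint the two deployments sit behind, not a real SDK call:

```python
from typing import Callable

PROMPTS = [
    "Write about nature and freedom in the style of Almqvist.",
    "Describe a stormy night at sea.",
    "Describe a scene of social tension in Wägner's voice.",
]

def side_by_side(generate: Callable[[str, str], str],
                 base: str, tuned: str) -> None:
    """Print base-model and fine-tuned output for the same prompts,
    so the two can be judged against each other, not in isolation."""
    for prompt in PROMPTS:
        print(f"PROMPT: {prompt}")
        print(f"  base : {generate(base, prompt)[:120]}")
        print(f"  tuned: {generate(tuned, prompt)[:120]}")
        print()
```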
And here is the part that turns this from a failure into a lesson. The model did not fail to learn. That would have been a simple outcome, easy to diagnose, easy to fix. No. The model learned exactly what it was taught. It learned to reproduce OCR artifacts, because the training data was full of OCR artifacts. It learned to ignore prompts and output random literary fragments, because the training data paired random prompts with unrelated text chunks. It learned that the correct response to any question was a disconnected chunk of noisy, poorly formatted nineteenth-century text, because that is what nineteen thousand seven hundred and seventy-three examples told it the correct response was.
The model was a perfect student. It was the teacher that was the problem.
Three things went wrong simultaneously, and any one of them would have been enough to sink the project.
First, the OCR artifacts. The training data was dirty. Not a little dirty, not "some minor noise in a few examples" dirty. Systematically dirty, with Cyrillic characters and broken words and encoding garbage spread throughout the corpus. The cleaning pipeline was too gentle. It looked for the problems it expected and missed the problems it did not.
Second, the mismatched pairs. A fine-tuning dataset is a set of instructions. Each pair says "when you see this, do that." When the "this" and the "that" have nothing to do with each other, the instruction becomes "when you see anything, do whatever." The model learned to treat prompts as noise and outputs as random, because that is what the data showed. A check as simple as the one sketched below would have caught it.
Third, the sheer volume. Nineteen thousand examples is a lot for a three-point-eight-billion-parameter model. With clean, well-matched data, that volume would have been a strength. With dirty, mismatched data, it was a hammer. The model did not have enough capacity to find the signal buried in all that noise. The noise won.
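That check, an alignment gate, scores every prompt against its paired response and rejects pairs that are about different things. One hedged sketch, using a multilingual embedding model through the sentence-transformers library so that English prompts can be scored against Swedish chunks; the model choice and the threshold are illustrative assumptions, not tested values.

```python
# pip install sentence-transformers
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

def is_aligned(prompt: str, response: str, threshold: float = 0.3) -> bool:
    """True when prompt and response are at least loosely about the
    same thing. A multilingual model matters here: the prompts were
    English, the chunks Swedish."""
    emb = model.encode([prompt, response])
    return util.cos_sim(emb[0], emb[1]).item() >= threshold

# A dialogue prompt paired with a nature passage should fail the gate.
# The Swedish reads: "The forest stood silent and dark around the
# lonely wanderer."
print(is_aligned(
    "Write dialogue between two strangers in the style of Almqvist",
    "Skogen stod tyst och mörk kring den ensamma vandraren.",
))
```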
The lesson sounds like a cliché when you say it out loud. Garbage in, garbage out. Everyone knows that. Everyone nods along when they hear it. And then everyone submits their training data without properly inspecting it, because checking nineteen thousand examples is boring and the GPU meter is running and surely it is fine.
Here is where the story should end. Lesson learned. Data quality matters. Clean your training data. Check your prompt-response alignment. Move on.
But that is not what happened.
Six months later, the same project attempted another fine-tuning run. This time the source material was not nineteenth-century literature. It was newspaper articles. Modern text, born digital, no OCR to worry about. Progress, right?
The training data was not checked for quality. And forty-two percent of it turned out to be advertisements. Not articles. Not journalism. Ads. The model was being taught to write like a newspaper on a diet that was nearly half marketing copy for farm equipment, patent medicines, and subscription offers.
The combined spend across both fine-tuning attempts came to about fifty dollars. Not a fortune. But fifty dollars on models that were never properly evaluated, never used in production, and produced results that were measurably worse than the free base models they were built from. The most expensive part was not even the training. It was the hosting. Running a custom model on Azure costs about eighty cents an hour, and those hours add up when you are running tests trying to figure out why your model is speaking in number sequences.
The lesson file for this experiment now includes a mandatory checklist that must be completed before any future fine-tuning attempt. Clean data verification. Prompt-response alignment checks. Held-out evaluation sets. Baseline comparison requirements. The checklist exists because the same mistake was made twice, and the project decided that twice was the maximum number of times it was willing to pay for the same education.
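Written as code, that checklist becomes a gate the dataset has to pass before any job is submitted. A sketch of what such a gate could look like, with the artifact and alignment checks passed in as callables (the earlier sketches would do) and every threshold an assumption rather than the project's actual numbers:

```python
import random

def preflight(examples: list[dict], is_clean, is_aligned,
              holdout_fraction: float = 0.1, sample_k: int = 20):
    """Refuse to train until the data passes basic sanity gates.
    is_clean(text) and is_aligned(prompt, response) are callables;
    all thresholds here are illustrative."""
    # 1. Clean-data verification: every response free of artifacts.
    dirty = [ex for ex in examples if not is_clean(ex["response"])]
    if dirty:
        raise ValueError(f"{len(dirty)} examples failed the artifact check")

    # 2. Prompt-response alignment: the pair must be about the same thing.
    misaligned = [ex for ex in examples
                  if not is_aligned(ex["prompt"], ex["response"])]
    if len(misaligned) > 0.05 * len(examples):
        raise ValueError(f"{len(misaligned)} pairs failed alignment")

    # 3. Manual spot check: print a sample a human actually reads.
    for ex in random.sample(examples, min(sample_k, len(examples))):
        print(ex["prompt"][:80], "->", ex["response"][:80])

    # 4. Held-out evaluation set, carved out before training, so the
    #    tuned model can later be compared against the baseline.
    shuffled = random.sample(examples, len(examples))
    n_holdout = int(len(shuffled) * holdout_fraction)
    return shuffled[n_holdout:], shuffled[:n_holdout]  # train, eval
```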
Fine-tuning is seductive because it feels productive. You have data. You have a model. You push a button. The training curves go down, which means learning is happening. The whole process has the satisfying texture of work being done.
But the training curves going down only means the model is getting better at reproducing the training data. If the training data is garbage, the model gets better at producing garbage. The loss curves for this experiment looked perfectly normal. The model was converging beautifully. It was converging on nonsense, but the numbers did not know that.
The real cost of dirty data is not the twenty-nine dollars for the training run. It is the time spent debugging, the false confidence in a model that appears to work, and the opportunity cost of not using a base model that was already good enough. The base Phi-4-mini could write decent Swedish prose out of the box. No training required. No data preparation. No cleaning pipeline. No twenty-nine dollars. Just a prompt and a response that made sense.
Sometimes the most expensive option is the one that costs almost nothing to try but teaches you nothing about why it failed. And sometimes the best model is the one you already had before you started trying to improve it.
One hundred and seventy-one. One hundred and seventy-four. One hundred and seventy-four.
That is the sound of a model that learned exactly what it was taught. The question was never whether the student could learn. The question was whether the teacher knew what it was teaching.