Projekt Kansen: One Man, One Grant, One Swedish Foundation Model

The Grant Letter

A letter arrives at a farmhouse in Kall, Jämtland. Population: let's just say the reindeer outnumber the people. The letter is from Vinnova, or maybe Vetenskapsrådet, or some newly formed agency called Myndigheten för Absurda men Tekniskt Möjliga Projekt. It says: Congratulations. You have been awarded fourteen million kronor to train a Swedish foundation model from scratch. The conditions are strict. Only Swedish training data. One person. No team. No hired specialists. AI tools are permitted — the whole point is to explore what a single technically literate non-specialist can accomplish with AI assistance. You have twelve months.

The man who opens this letter is a trained radio journalist who runs a free local newspaper, rents out party equipment, and has spent the last year directing AI agents to build podcast pipelines, face recognition systems, and an orchestrator that makes language models argue at a virtual bar. He has never written a line of CUDA. He has never computed a gradient by hand. But he has fifty-plus GitHub repositories, an ADHD brain that treats side quests as fuel, and a Claude Max subscription that he's been putting to work roughly eighteen hours a day.

His name is Pär. And he's about to find out how much Swedish text actually exists.

The Swedish Data Problem

Here's the first reality check. Modern foundation models train on five to fifteen trillion tokens. Llama three trained on fifteen trillion. DeepSeek V3, fourteen point eight trillion. But these are overwhelmingly English. When AI Sweden built GPT-SW3, the first genuinely large Swedish language model, they assembled a dataset of three hundred and twenty billion tokens. And that wasn't just Swedish — it included Norwegian, Danish, Icelandic, English, and code. The Swedish portion was a fraction of the total.

To understand the scale problem, consider this: Swedish Wikipedia has roughly three and a half million articles, but most are short stubs. The total text is maybe two to three billion tokens. English Wikipedia has about four billion words, call it five billion tokens. So Swedish is somewhere around half of that, at best. And Wikipedia is just the starting point.

So where does Swedish text come from? There are five main buckets, and Pär is going to have to raid every single one of them.

Bucket one: the web. Common Crawl has Swedish pages, but they're a tiny slice. The OSCAR corpus, which filters Common Crawl by language, has Swedish data. The mC4 dataset, a multilingual version of the cleaned Common Crawl, includes Swedish text. Between these sources, you might scrape together somewhere between fifty and a hundred billion Swedish tokens. That sounds like a lot, but it's noisy. It's full of cookie consent banners, boilerplate navigation text, machine-translated garbage, and SEO spam in Swedish that makes regular SEO spam look like Strindberg.

Bucket two: government and institutional text. Sweden is a goldmine here. The Riksdag has published every parliamentary debate, motion, and interpellation going back to eighteen sixty-seven. The National Library, KB, has digitized four hundred and fifty years of parliamentary prints — more than three million documents from the fifteen twenties to nineteen seventy. Språkbanken, the Swedish Language Bank at Gothenburg University, maintains dozens of annotated Swedish corpora. The Swedish legal code, government reports called SOU:er, regulatory texts from every myndighet — all of this is public domain or Creative Commons licensed. This bucket might yield another ten to twenty billion tokens of extremely high-quality, formal Swedish.

Bucket three: literature and books. Project Runeberg is the Swedish equivalent of Project Gutenberg, with digital versions of classic Nordic literature. But copyright in Sweden lasts seventy years after the author's death, so you're limited to works by authors who died before nineteen fifty-six. That gives you Strindberg, Lagerlöf, Bellman, Fröding, Heidenstam, and many others, but not Astrid Lindgren, not Vilhelm Moberg, not anything from the modern era. Maybe five to ten billion tokens from freely available literature.

Bucket four: news archives. Swedish newspapers have digitized archives, but most are paywalled or restricted. You'd need licensing deals, which is hard for one person on a grant. However, SVT and SR publish enormous amounts of text online — articles, program descriptions, transcripts. Some of this is usable under Creative Commons or similar terms.

Bucket five: user-generated content. Between Swedish Reddit and Flashback Forum, Sweden's legendary anonymous forum, there is an enormous archive of written Swedish. Flashback alone has millions of threads spanning twenty-plus years. The text quality ranges from brilliant to absolutely horrifying, but it's authentic spoken Swedish in written form. The licensing situation is murky at best.

The Inventory

So Pär sits down with a spreadsheet. What does the total look like?

Web crawl, filtered and deduplicated: maybe sixty to eighty billion tokens. Government and institutional text: ten to twenty billion. Classic literature: five to ten billion. News and media, where accessible: five to ten billion. User-generated content, where licensing allows: ten to twenty billion.

Grand total, optimistically: a hundred and forty billion tokens. Realistically, after aggressive quality filtering and deduplication: maybe eighty to a hundred billion.
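The spreadsheet arithmetic fits in a few lines of Python. This is only a sketch: the bucket names are labels for the ranges quoted above, in billions of tokens, and none of it is measured data.

```python
# Rough token inventory in billions of tokens, as (low, high) estimates.
# These are the ballpark ranges from the text, not measurements.
buckets = {
    "web crawl": (60, 80),
    "government and institutional": (10, 20),
    "classic literature": (5, 10),
    "news and media": (5, 10),
    "user-generated content": (10, 20),
}

low_total = sum(low for low, _ in buckets.values())
high_total = sum(high for _, high in buckets.values())

print(f"Raw inventory: {low_total} to {high_total} billion tokens")
```

The optimistic figure is simply the sum of every upper bound, before any filtering or deduplication.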

That's a problem. The Chinchilla scaling law from DeepMind suggests roughly twenty training tokens per parameter, so a seven billion parameter model should train on about one hundred and forty billion tokens for compute-optimal performance. But the Llama paper showed that you can keep training well beyond that: they trained their seven B model on a trillion tokens and it kept improving. So our eighty to a hundred billion tokens puts us right at the Chinchilla optimum for a smaller model, maybe four to five billion parameters. Or we could train a three billion parameter model and sit comfortably past the optimum, or a one-and-a-half billion parameter model and be data-rich.
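Chinchilla is usually read as a rule of thumb of roughly twenty training tokens per parameter. Taking that ratio as an approximation rather than an exact constant, the sizing arithmetic can be sketched like this:

```python
# Chinchilla rule of thumb: about 20 training tokens per parameter.
# The ratio is an approximation, assumed here for illustration.
TOKENS_PER_PARAM = 20

def optimal_tokens(params: float) -> float:
    """Compute-optimal token count for a given parameter count."""
    return params * TOKENS_PER_PARAM

def optimal_params(tokens: float) -> float:
    """Compute-optimal parameter count for a given token budget."""
    return tokens / TOKENS_PER_PARAM

print(f"7B model wants {optimal_tokens(7e9) / 1e9:.0f}B tokens")
for budget in (80e9, 100e9):
    print(f"{budget / 1e9:.0f}B tokens suits a {optimal_params(budget) / 1e9:.0f}B model")

# A smaller model on the same data is "data-rich": more tokens per parameter.
print(f"1.5B model on 100B tokens: {100e9 / 1.5e9:.0f} tokens per parameter")
```

Anything well above twenty tokens per parameter is past the compute-optimal point, which is where the Llama-style "keep training" argument comes in.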

But here's the thing. AI Sweden already trained GPT-SW3 with models up to forty billion parameters on three hundred and twenty billion tokens, and they had a team of researchers, access to the Berzelius supercomputer at Linköping University, and collaboration with RISE and the WASP program. They're now working on a multimodal model with at least a hundred billion parameters.

As AI Sweden described its own starting point: "We needed a large-scale Swedish dataset of high quality. Since no such datasets existed before this initiative, we collected data in the Nordic and English languages."

So the honest question becomes: why would one person in Kall duplicate what an entire national AI initiative already did, but worse?

The Actual Interesting Part

And this is where the grant application gets clever. Because the goal isn't to beat AI Sweden. The goal isn't even to make a good model. The goal is to answer a research question: what can one non-specialist, armed with current AI tools, actually accomplish in the space of foundation model training?

Think about what that means. Five years ago, training any language model was exclusively the domain of PhD researchers at large institutions. Two years ago, fine-tuning became accessible to skilled hobbyists. Today, Pär routinely directs Claude to build entire software systems, orchestrate multi-agent workflows, and process complex data pipelines — all without writing the code himself.

The research question isn't "can you make a good Swedish LLM?" The answer to that is: just use GPT-SW3, or fine-tune Llama on Swedish data, or wait for AI Sweden's next model. The research question is: "how far up the capability ladder can AI-assisted development push a single person?" And foundation model training is the stress test, because it combines data engineering, distributed systems, machine learning theory, and infrastructure management — all areas where Pär has zero formal training but extensive practical experience directing AI to do the work.

The Claude Problem

Now here's a genuinely interesting wrinkle. Pär's primary tool for building anything is Claude. Claude Code for the engineering, Claude chat for the thinking, Claude for data pipeline design and debugging. But Anthropic's usage policy has something to say about this.

The policy is quite clear. You can use Claude's outputs to train models that don't compete with Anthropic's own models. Specialized classifiers, internal tools, domain-specific applications — those are fine. But you cannot use Claude's outputs to train a model that competes with Anthropic's services, meaning a general-purpose language model. A Swedish foundation model is, by definition, a general-purpose language model.

As Anthropic puts it: "We prohibit customers from using our services to train or develop AI models without our written permission. When outputs are used to train new models without our oversight, safety controls may be lost."

So there are two separate issues. First, using Claude to generate synthetic Swedish text as training data — that would be a clear violation without written permission. You'd essentially be distilling Claude into your model. Second, using Claude as a coding assistant to build the training pipeline — this is a greyer area, because you're not using Claude's outputs as training data, you're using Claude to write the Python scripts that process the training data.

The grant application probably needs to include a request for written permission from Anthropic. Or Pär could frame it as research that Anthropic might want to support — after all, understanding how far a single person can push AI-assisted development is exactly the kind of question that matters for AI democratization. A politely worded email to Anthropic's partnerships team, explaining the research nature of the project, might actually work.

Alternatively, Pär could use an open-source coding assistant like Qwen Coder or DeepSeek Coder running locally on the MacBook Pro M5 for the data pipeline work, and reserve Claude for the non-training-related thinking and planning. Belt and suspenders. Keep Claude for podcast scripts and grant reports, use open models for anything that touches the training pipeline.

The Hardware Plan

Let's talk compute. The target is a one-and-a-half to three billion parameter model trained on eighty to a hundred billion tokens. Using the rough formula of six times parameters times tokens for total floating point operations, a three B model on a hundred billion tokens needs about one point eight times ten to the twenty-first FLOP. At forty percent utilization on H100 GPUs doing about one point nine petaFLOPS peak, that's roughly twenty-six GPU-days. One caveat: one point nine petaFLOPS is the FP8 spec-sheet figure. A standard BF16 run peaks at about half that, which would roughly double the estimate.

On sixteen rented H100s, that's less than two days of wall-clock time. At current cloud prices, maybe three to four thousand dollars in raw GPU rental. Even with experimental runs, failed attempts, hyperparameter sweeps, and the inevitable catastrophic loss spike at three AM, the compute budget could stay under a hundred thousand kronor. Out of a fourteen million kronor grant, that's less than one percent.
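The compute estimate above can be reproduced in a few lines. This is a back-of-the-envelope sketch: the six-times-parameters-times-tokens rule is approximate, and the peak throughput and utilization are the figures quoted in the text, not benchmarks.

```python
def training_gpu_days(params, tokens, peak_flops, utilization):
    """Estimate GPU-days using the rough rule: total FLOP = 6 * params * tokens."""
    total_flop = 6 * params * tokens
    effective_flops = peak_flops * utilization  # sustained FLOP/s per GPU
    return total_flop / effective_flops / 86_400  # 86,400 seconds per day

gpu_days = training_gpu_days(
    params=3e9,
    tokens=100e9,
    peak_flops=1.9e15,  # per-GPU peak quoted in the text; swap in your own figure
    utilization=0.40,
)
wall_clock_days = gpu_days / 16  # the sixteen-GPU cluster from the text

print(f"{gpu_days:.1f} GPU-days, {wall_clock_days:.1f} days on 16 GPUs")
```

With these inputs the result lands in the high twenties of GPU-days, in the same ballpark as the rough figure above. It scales inversely with the peak and the utilization, so a more conservative throughput assumption pushes it up proportionally.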

But wait. The grant says no outside human help, and it's an exploration of what one person can do. So maybe the compute shouldn't be rented from a faceless cloud provider. Maybe it should be done on the Berzelius supercomputer, where AI Sweden already trained GPT-SW3. Swedish taxpayers paid for that machine. A research grant could include access. And Berzelius has hundreds of Nvidia A100 GPUs.

Or — and here's the option that would make the PärPod audience lose their minds — what if Pär drove to that datacenter liquidation auction in Porjus, bought a stack of Dell PowerEdge servers, loaded them with second-hand GPUs, and trained the model on hardware he owns? The electricity in Norrbotten is about thirty öre per kilowatt-hour. Training a three B model would use maybe fifteen hundred kilowatt-hours total. That's four hundred and fifty kronor in electricity. The heating bill for the house would actually go down because the servers would replace the radiators.

The problem with this plan is that consumer GPUs don't have the interconnect bandwidth needed for distributed training, and you'd realistically want around four eighty-gigabyte GPUs to run a three billion parameter training job with room for weights, optimizer state, activations, and a sensible batch size. Four used A100 eighty-gig cards cost somewhere around two hundred thousand kronor at current prices. That's still within the grant budget, but now you're also a datacenter operator, a hardware technician, and a cooling engineer, all while being the sole researcher on the project.
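For a feel of where the memory goes, a common rule of thumb for mixed-precision training with the Adam optimizer is about sixteen bytes of GPU memory per parameter, before activations. That rule is an assumption brought in for this sketch, not part of the plan above, and real numbers depend on the framework and sharding strategy.

```python
# Approximate bytes of training state per parameter for mixed-precision Adam.
# This breakdown is a widely used rule of thumb, assumed here for illustration.
BYTES_PER_PARAM = {
    "bf16 weights": 2,
    "bf16 gradients": 2,
    "fp32 master weights": 4,
    "fp32 Adam first moment": 4,
    "fp32 Adam second moment": 4,
}

def training_state_gb(params: float) -> float:
    """Memory for weights, gradients, and optimizer state, excluding activations."""
    return params * sum(BYTES_PER_PARAM.values()) / 1e9

state_gb = training_state_gb(3e9)
cluster_gb = 4 * 80  # four eighty-gigabyte cards
headroom_gb = cluster_gb - state_gb

print(f"Model state: {state_gb:.0f} GB, headroom for activations: {headroom_gb:.0f} GB")
```

By this estimate the model state for three billion parameters is about 48 GB, so the case for four cards is mostly about activation memory, batch size, and throughput rather than bare fit.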

The Data Pipeline

This is where most of the actual work lives. Not the training itself — that's surprisingly mechanical once you have clean data and a working training script. The data pipeline is where Pär would spend six of his twelve months.

Step one: web crawl processing. Download the Swedish subset of Common Crawl, OSCAR, and mC4. Run language identification to confirm everything is actually Swedish and not Norwegian or Danish, which is harder than it sounds because written Norwegian Bokmål and Swedish are close enough to confuse most automated classifiers. Deduplicate using MinHash. Score for quality using a classifier trained on Swedish Wikipedia versus random web text. Remove personally identifiable information. Remove pages that are primarily navigation, ads, or cookie consent.
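The MinHash step is the least intuitive part of that list, so here is a minimal pure-Python sketch of the idea. It is not a production deduplicator: the shingle size and signature length are arbitrary choices, and real pipelines typically add locality-sensitive hashing on top so they don't have to compare every pair of documents.

```python
import hashlib

NUM_HASHES = 64  # signature length; an arbitrary choice for this sketch

def shingles(text, k=5):
    """Lowercase, collapse whitespace, and return the set of k-character shingles."""
    normalized = " ".join(text.lower().split())
    return {normalized[i:i + k] for i in range(max(1, len(normalized) - k + 1))}

def hash64(seed, shingle):
    """One member of a family of 64-bit hash functions, selected by seed."""
    digest = hashlib.blake2b(f"{seed}:{shingle}".encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big")

def minhash(text):
    """Signature: for each seeded hash function, the minimum hash over all shingles."""
    sh = shingles(text)
    return [min(hash64(seed, s) for s in sh) for seed in range(NUM_HASHES)]

def estimated_jaccard(sig_a, sig_b):
    """The fraction of matching signature positions approximates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

banner_a = "Vi använder kakor för att förbättra din upplevelse."
banner_b = "vi använder   kakor för att förbättra din upplevelse."
debate = "Riksdagen beslutade om ny lag i går."

print(estimated_jaccard(minhash(banner_a), minhash(banner_b)))
print(estimated_jaccard(minhash(banner_a), minhash(debate)))
```

The two cookie banners differ only in case and spacing, so they normalize to the same shingles and score as duplicates, while the parliamentary sentence shares nothing with them. In practice you would drop documents whose estimated similarity exceeds some threshold, which is itself a tuning decision.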

Step two: institutional data. Write scrapers for riksdagen.se open data, Språkbanken corpora, KB's digital collections, the Swedish legal code, and government agency publications. These are high-quality but come in varied formats: TEI XML, PDF, plain text, HTML. Each source needs a custom parser.

Step three: literature. Download Project Runeberg. Process the scanned-and-OCR'd texts from KB's digital library. The older texts are in Fraktur script or nineteenth-century Swedish spelling, which adds complexity. "Hafva" needs to be normalized to "ha" for the model to learn modern Swedish, or kept as-is if you want the model to understand historical text too.
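The normalize-or-keep decision can be made a switch in the pipeline. The sketch below assumes a lookup table of old spellings; the four entries are only illustrative, and a real table would be far larger and would need care with words that are valid in both old and modern Swedish.

```python
import re

# Tiny illustrative mapping from older spellings to modern ones.
# A real table is assumed to exist and would be much larger.
OLD_TO_MODERN = {"hafva": "ha", "af": "av", "öfver": "över", "hvad": "vad"}

WORD = re.compile(r"\w+")  # Unicode-aware in Python 3, so it matches å, ä, ö

def normalize(text: str, keep_historical: bool = False) -> str:
    """Replace known old spellings, preserving an initial capital letter."""
    if keep_historical:
        return text

    def replace(match):
        word = match.group(0)
        modern = OLD_TO_MODERN.get(word.lower())
        if modern is None:
            return word
        return modern.capitalize() if word[0].isupper() else modern

    return WORD.sub(replace, text)

print(normalize("Hvad hafva vi lärt oss af detta?"))
print(normalize("Hvad hafva vi lärt oss af detta?", keep_historical=True))
```

Keeping both variants of the corpus around is cheap, and it lets the choice be revisited when the data mixture is decided.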

Step four: tokenizer training. You can't use the GPT-2 tokenizer for Swedish — it was designed for English and wastes tokens on Swedish words. A Swedish-optimized tokenizer needs to be trained on a representative sample of the corpus. SentencePiece or the Hugging Face tokenizers library can do this, but the vocabulary size choice — thirty-two thousand? Sixty-four thousand? — affects both model size and how efficiently Swedish morphology is represented.
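The "affects model size" part is easy to quantify. The sketch below assumes a hidden size of 3072 and separate input and output embedding tables; both are hypothetical choices for a model in the three billion parameter range, not something the plan specifies.

```python
def embedding_params(vocab_size: int, hidden_size: int, tied: bool = False) -> int:
    """Parameters spent on token embeddings (input table plus output table unless tied)."""
    tables = 1 if tied else 2
    return vocab_size * hidden_size * tables

HIDDEN_SIZE = 3072  # hypothetical hidden size for a roughly 3B model

for vocab in (32_000, 64_000):
    n = embedding_params(vocab, HIDDEN_SIZE)
    print(f"{vocab:>6} tokens: {n / 1e6:.0f}M embedding parameters")
```

Doubling the vocabulary doubles the embedding cost, which at this scale is a noticeable slice of the total. The payoff is shorter token sequences for long Swedish compounds, and that side can only be measured by training candidate tokenizers and counting tokens on held-out text.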

Step five: mix the data. Not all sources are equally valuable. You want to upsample the high-quality government text and literature, and downsample the noisy web crawl. Getting this mixture right is as much black art as science. AI Sweden spent significant research effort on their data mixture for GPT-SW3.
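Mechanically, upsampling and downsampling come down to turning token counts and per-source weights into sampling probabilities. The counts and weights below are hypothetical placeholders; choosing good weights is the hard part and this sketch does not attempt it.

```python
import random

# Hypothetical token counts (billions) and sampling weights per source.
# A weight above 1 upsamples a source, below 1 downsamples it.
sources = {
    "web": {"tokens": 70, "weight": 0.5},
    "government": {"tokens": 15, "weight": 2.0},
    "literature": {"tokens": 7, "weight": 2.0},
}

def sampling_probs(sources):
    """Probability of drawing the next training document from each source."""
    effective = {name: s["tokens"] * s["weight"] for name, s in sources.items()}
    total = sum(effective.values())
    return {name: value / total for name, value in effective.items()}

probs = sampling_probs(sources)
for name, p in probs.items():
    print(f"{name:<11} {p:.1%}")

# Draw the source for each of the next eight documents, seeded for repeatability.
rng = random.Random(0)
print(rng.choices(list(probs), weights=list(probs.values()), k=8))
```

Upsampling a small source means repeating it, so heavily weighted buckets get seen multiple times during a run. That is one reason the weights can't simply be cranked up without limit.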

All of this needs to happen before a single gradient is computed. And all of it is the kind of work where having an AI coding assistant is the difference between six months and six years.

The Training Run

Let's say it's month eight. The data is clean, tokenized, and sitting in a cloud storage bucket. Pär has chosen his architecture — probably a standard Llama-style decoder transformer, because the training code is open source, well-documented, and battle-tested. Three billion parameters. A hundred billion tokens. The training configuration is lifted mostly from the Llama paper with adjustments for the smaller scale.

The training script is based on either NanoGPT for the educational purity of it, one of the open-source Llama-style training codebases, or something like the Hugging Face Transformers training loop with DeepSpeed for distributed training. Pär has tested on tiny runs, such as a fifty-million parameter model on a billion tokens, to verify the pipeline works. The loss curves look reasonable. The generated text is garbage, but it's Swedish garbage, which is actually a good sign.

He starts the real run. Sixteen H100 GPUs on a cloud cluster. The estimated training time is thirty-six hours. The cost, about twenty-five thousand kronor. He watches the loss curve drop on a Weights and Biases dashboard. The first few hours look smooth. Then, around hour twelve, the loss spikes. The gradient norm explodes. The training run has diverged.

This is the moment that separates a researcher from a hobbyist. A real ML researcher would check the learning rate schedule, examine the gradient statistics, look for data corruption, adjust the warmup steps. Pär, being Pär, asks Claude — no wait, he can't use Claude for this part. He asks the locally running Qwen model on his MacBook. The answer comes back: probably a learning rate too high for this batch size. Try reducing by a factor of two and restarting from the last checkpoint.
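To make the suggested fix concrete, here is a sketch of a typical learning rate schedule, linear warmup followed by cosine decay, with the peak halved for the restart. The peak value, warmup length, and step count are hypothetical, and nothing here says this is the schedule the run actually used.

```python
import math

def learning_rate(step, peak_lr, warmup_steps, total_steps, min_lr_ratio=0.1):
    """Linear warmup to peak_lr, then cosine decay down to min_lr_ratio * peak_lr."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    min_lr = peak_lr * min_lr_ratio
    return min_lr + (peak_lr - min_lr) * cosine

# Hypothetical settings for the diverged run and for the restart.
ORIGINAL_PEAK = 3e-4
RESTART_PEAK = ORIGINAL_PEAK / 2  # "reduce by a factor of two"
WARMUP, TOTAL = 2_000, 100_000

for step in (0, 1_999, 30_000, 99_999):
    before = learning_rate(step, ORIGINAL_PEAK, WARMUP, TOTAL)
    after = learning_rate(step, RESTART_PEAK, WARMUP, TOTAL)
    print(f"step {step:>6}: {before:.2e} -> {after:.2e}")
```

Because the schedule is linear in the peak value, halving the peak halves the learning rate at every step. A longer warmup or gradient clipping are other knobs a researcher might reach for in the same situation.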

It takes three attempts. Each failed run costs six to eight thousand kronor. By the end of month nine, the model has trained for eighty billion tokens and the loss has converged to a reasonable value. Total compute cost: about ninety thousand kronor. Still well within the grant budget.

What Comes Out

The model generates Swedish text. Let's not sugarcoat this — the text is mediocre. It can complete sentences. It has a sense of Swedish grammar. It knows that Stockholm is the capital and that midsommar involves herring and snaps. But compared to GPT-SW3's forty billion parameter model, or Llama fine-tuned on Swedish, or just asking Claude, it is noticeably worse at everything.

The base model has no instruction-following ability. You can't ask it a question and get an answer. You can give it a prompt and it will continue writing in a plausible-sounding Swedish that occasionally veers into Danish because the language boundary was hard to enforce in the web crawl data. It occasionally writes "meg", which is Norwegian, where it should write "mig". And once, memorably, it produces an entire paragraph about Jämtland that sounds like it was originally a tourist brochure from nineteen eighty-seven.

To make it actually useful, it would need supervised fine-tuning on instruction-response pairs, and then reinforcement learning from human feedback, or at least DPO. Which is another few months of work. The grant runs out in month twelve.

But here's what Pär can write in the final report.

The Report

One person, with no formal machine learning training, using AI coding assistants and open-source tools, successfully trained a three billion parameter Swedish foundation model from scratch. The data pipeline processed over two hundred sources of Swedish text, producing a clean corpus of approximately ninety billion tokens. The training run completed on commodity cloud hardware for under a hundred thousand kronor in compute. The total grant expenditure, including hardware, cloud costs, software licenses, and the one person's salary for twelve months, came to roughly three million of the fourteen million kronor budget.

The model itself is not state of the art. It would not be useful deployed as a product. But the point was never the model. The point was the process.

Five years ago, this project would have taken a hundred-person team and a hundred million kronor. Ten years ago, it would have been impossible at any price. Today, one person did it from a farmhouse north of Åre, using AI tools that didn't exist eighteen months before the project started.

The report's conclusion writes itself: the barrier to training a foundation model is no longer technical skill, financial resources, or institutional backing. It's data. If you have the data, the tools exist to turn it into a model. The question for Sweden isn't whether we can build Swedish AI models. The question is whether we can assemble and maintain the datasets that make those models worth building.

The committee notes with interest that the grantee returned eleven million of the fourteen million kronor awarded. This is, to our knowledge, unprecedented.

AI Sweden spent years and millions building something good. Pär spent twelve months and pocket change building something worse. But he proved that the distance between "impossible" and "mediocre" has collapsed. And the distance between "mediocre" and "good" is just data. It always was.

The farmhouse in Kall is warm. The servers, if he bought them in Porjus, are running. The reindeer outside are unimpressed. And somewhere in the podcast feed, an AI-generated voice is reading Swedish text that was written by a model that was trained on data that was cleaned by code that was written by another AI. It's turtles all the way down. But the turtles speak Swedish, and that's not nothing.