Foundation Model on a Budget: What Would It Actually Take?

The Absurd Question

Here's a question that sounds like it belongs in a pub at two in the morning. What if you wanted to train a foundation model? Not fine-tune one. Not slap a LoRA on top of Llama. A real, from-scratch, pre-trained foundation model. The kind of thing that learns what language is by reading trillions of words. What would it cost, what would you need, and could you do it for less than the price of a nice apartment in Stockholm?

The answer, spoiler alert, is no. You cannot do it for the price of a nice apartment in Stockholm. But you might — might — do it for the price of a very nice house in Åre. Let's walk through what it would take, piece by piece, from the data to the hardware to the electricity bill that would make your accountant cry.

The Data Mountain

The first thing you need is data. And not a little bit. The original Transformer paper from twenty seventeen, the one that started all of this, described a model small enough that training it cost only about nine hundred dollars in compute. That was a translation model. We're not talking about that. We're talking about a model that understands language broadly enough to be a foundation for everything else.

Modern foundation models train on somewhere between five and fifteen trillion tokens. A token is roughly three quarters of a word, so fifteen trillion tokens is about eleven trillion words. To put that in perspective, the entire English Wikipedia is about four billion words. You'd need roughly three thousand Wikipedias just to get started. The good news is that most of this data is freely available, or at least theoretically free. The five main sources that every foundation model draws from are web crawls, reference works like Wikipedia, books, scientific papers and code, and social media or user-generated text.
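If you want to sanity-check those proportions yourself, here's a minimal sketch. The two constants are simply the rules of thumb quoted above (three quarters of a word per token, four billion words in English Wikipedia), and the function names are made up for illustration.

```python
# Rules of thumb quoted in the text; both are rough approximations.
WORDS_PER_TOKEN = 0.75
WIKIPEDIA_WORDS = 4e9


def tokens_to_words(tokens: float) -> float:
    """Convert a token count to an approximate word count."""
    return tokens * WORDS_PER_TOKEN


def wikipedias_needed(tokens: float) -> float:
    """How many English Wikipedias hold as many words as `tokens` tokens."""
    return tokens_to_words(tokens) / WIKIPEDIA_WORDS


if __name__ == "__main__":
    for tokens in (5e12, 15e12):
        words = tokens_to_words(tokens)
        wikis = wikipedias_needed(tokens)
        print(f"{tokens:.0e} tokens ~ {words:.2e} words ~ {wikis:,.0f} Wikipedias")
```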

Common Crawl is the backbone. It's a nonprofit that crawls the web and publishes snapshots for free. The April twenty twenty-one snapshot alone was three hundred and twenty terabytes of raw HTML. After aggressive filtering, which means removing spam, duplicates, porn, malware pages, and generally terrible content, you end up with something usable. The RefinedWeb project extracted five trillion tokens from Common Crawl alone, and made six hundred billion of them publicly available. RefinedWeb was the main ingredient in the training data for the Falcon forty B model. So the raw material is out there. You don't have to pay for it. But you absolutely have to clean it.

The Data Pipeline Nobody Talks About

And this is where the first real cost sneaks in. Cleaning data for foundation model training is not a weekend project. You need language identification to filter out non-target languages. You need quality scoring to distinguish a well-written article from SEO garbage. You need deduplication, and not trivial exact-match deduplication, but fuzzy deduplication using techniques like MinHash to find near-duplicate documents. You need toxicity filtering. You need personally identifiable information removal. And you need all of this to run at a scale of hundreds of terabytes.
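To make the fuzzy deduplication step less abstract, here's a toy MinHash sketch using only the standard library. It is an illustration under assumptions, not a production pipeline: the shingle size, the number of hash functions, and every function name are arbitrary choices, and a real system would add locality-sensitive hashing on top so it never has to compare every pair of documents.

```python
import hashlib
import re


def shingles(text, k=5):
    """Return the set of k-word shingles in a lowercased document."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < k:
        return {" ".join(words)} if words else set()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def _seeded_hash(shingle, seed):
    """64-bit hash of a shingle; each seed acts as a separate hash function."""
    data = f"{seed}:{shingle}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_signature(text, num_hashes=128, k=5):
    """For each seeded hash function, keep the minimum hash over all shingles."""
    doc_shingles = shingles(text, k)
    if not doc_shingles:
        return [0] * num_hashes  # empty documents all share one signature
    return [min(_seeded_hash(s, seed) for s in doc_shingles)
            for seed in range(num_hashes)]


def estimated_jaccard(sig_a, sig_b):
    """The fraction of matching signature slots estimates Jaccard similarity."""
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)


if __name__ == "__main__":
    a = "the quick brown fox jumps over the lazy dog near the river bank"
    b = "the quick brown fox jumps over the lazy dog near the river shore"
    sim = estimated_jaccard(minhash_signature(a, k=3), minhash_signature(b, k=3))
    print(f"estimated similarity: {sim:.2f}")
```

Documents whose estimated similarity clears some threshold (the threshold itself is a tuning decision) get treated as near-duplicates, and all but one copy is dropped.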

The Stanford twenty twenty-five Foundation Model Transparency Index found that average transparency scores actually dropped from fifty-eight out of a hundred in twenty twenty-four to forty out of a hundred in twenty twenty-five. Training data and compute were the most opaque areas. Nobody wants to tell you exactly what's in the sausage. But we know the recipe because enough open-source projects have done it in public. You mix roughly eighty percent filtered web crawl with smaller portions of Wikipedia, books, code from GitHub, scientific papers from sources like Semantic Scholar, and maybe some curated conversational data.
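A data mixture like that is, mechanically, just a table of weights. Here's a minimal sketch of how one might turn it into per-source token budgets and a sampling schedule. Only the eighty percent web share comes from the recipe above; every other weight is a hypothetical placeholder picked so the table sums to one, and the source names are made up.

```python
import random

# Only the ~80% web share is from the text. All other weights are
# hypothetical placeholders chosen so the mixture sums to 1.0.
MIXTURE = {
    "web_crawl": 0.80,
    "code": 0.07,
    "books": 0.05,
    "wikipedia": 0.04,
    "papers": 0.03,
    "conversational": 0.01,
}


def token_budget(total_tokens, mixture):
    """Split a total token budget across sources according to the weights."""
    if abs(sum(mixture.values()) - 1.0) > 1e-9:
        raise ValueError("mixture weights must sum to 1")
    return {name: total_tokens * weight for name, weight in mixture.items()}


def sample_sources(n, mixture, seed=0):
    """Draw the source of each of n training documents, weighted by the mixture."""
    rng = random.Random(seed)
    names = list(mixture)
    weights = [mixture[name] for name in names]
    return rng.choices(names, weights=weights, k=n)


if __name__ == "__main__":
    for name, tokens in token_budget(1.5e12, MIXTURE).items():
        print(f"{name:>15}: {tokens:.2e} tokens")
```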

The data pipeline itself — the filtering, scoring, and deduplication infrastructure — would cost you maybe fifty to a hundred thousand dollars in engineering time and compute to build and run once. That's the cheapest part of this whole adventure, and it's already more than most people's annual salary.

How Small Can You Go?

Now here's where it gets interesting. The question isn't just "what does it cost to train GPT-four?" — that answer is somewhere north of a hundred million dollars, maybe two hundred million. The question is: what's the smallest foundation model that's still genuinely useful? That still has emergent capabilities, can follow instructions after fine-tuning, can reason at least a little bit?

Microsoft's Phi three gave us a fascinating data point. They trained a three point eight billion parameter model on what they called textbook-quality synthetic data, and it competed with models twenty-five times its size on certain benchmarks. The Phi approach suggests that data quality can substitute for raw scale, up to a point. Meta's original Llama paper from twenty twenty-three made a similar argument from a different angle. They found that a seven billion parameter model continues to improve even after one trillion tokens of training, well beyond the Chinchilla-optimal ratio that DeepMind had recommended. Their Llama thirteen B outperformed GPT-three, which had a hundred and seventy-five billion parameters, on most benchmarks. It did this by simply training longer on more data.

So let's set our target. A seven billion parameter dense model. Trained on somewhere between one and two trillion tokens of curated, high-quality data. This is roughly the architecture of the original Llama, and it's the smallest size where you reliably get a model that can actually do things.
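To see how far past "Chinchilla-optimal" this target sits, here's a small sketch. It leans on the common shorthand of about twenty training tokens per parameter, which is an assumption brought in for illustration (the text above doesn't state the ratio), and the function names are made up.

```python
# Common shorthand for the Chinchilla result: about 20 tokens per parameter.
# This constant is an assumption for illustration, not a figure from the text.
CHINCHILLA_TOKENS_PER_PARAM = 20


def chinchilla_optimal_tokens(params):
    """Approximate compute-optimal token count for a model of this size."""
    return params * CHINCHILLA_TOKENS_PER_PARAM


def overtraining_factor(params, tokens):
    """How many times past the compute-optimal token count a run goes."""
    return tokens / chinchilla_optimal_tokens(params)


if __name__ == "__main__":
    params = 7e9
    for tokens in (1e12, 2e12):
        factor = overtraining_factor(params, tokens)
        print(f"{tokens:.0e} tokens is {factor:.1f}x the compute-optimal amount")
```

Training well past the compute-optimal point wastes some training compute on purpose, in exchange for a smaller model that is cheaper to run afterwards.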

The Hardware Question

Here's where it gets expensive. Training a seven billion parameter model on one trillion tokens requires roughly six times seven billion times one trillion floating point operations. That's about four point two times ten to the twenty-second power FLOPs. For reference, Nvidia's headline spec for a single H100 GPU is about one point nine petaFLOPS of FP16 compute, which is one point nine times ten to the fifteenth FLOPS. That headline number assumes structured sparsity, and dense throughput is about half of it, so treat everything below as optimistic. If you somehow achieved perfect utilization, which you won't, that's about twenty-two million GPU-seconds, or roughly two hundred and fifty-six GPU-days.

But you never get perfect utilization. Real-world model FLOPS utilization, or MFU, typically runs between thirty and fifty percent for well-optimized training runs. At forty percent MFU, you're looking at about six hundred and forty GPU-days on a single H100. But nobody trains on a single GPU. You need at least eight GPUs for a seven B model to fit in memory during training, and ideally thirty-two or sixty-four to finish in a reasonable time. With thirty-two H100s at forty percent utilization, you're looking at roughly twenty days of wall-clock time for one trillion tokens. That's... actually not that bad?

The cost question is what kills you. If you rent H100 GPUs in the cloud, you're paying somewhere between two and three dollars per GPU per hour. Thirty-two GPUs for twenty days is roughly fifteen thousand GPU-hours, or about thirty to forty-six thousand dollars just in GPU rental. But that's the absolute floor. You need to add storage, networking, and the inevitable failed runs. A realistic budget for the compute alone, including a few restarts and experiments, starts around fifty thousand dollars for a seven B model on one trillion tokens.
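This chain of arithmetic is easy to get wrong by a factor of ten, so here's a minimal sketch that does it in code. The defaults are the figures used in this section (the headline H100 number, forty percent MFU, thirty-two GPUs, two dollars an hour). The class and its field names are made up for illustration, and the six-times-parameters-times-tokens rule is itself only an approximation.

```python
from dataclasses import dataclass


@dataclass
class TrainingRun:
    params: float                        # model parameters
    tokens: float                        # training tokens
    peak_flops_per_gpu: float = 1.9e15   # headline H100 FP16 figure from the text
    mfu: float = 0.40                    # model FLOPS utilization
    num_gpus: int = 32
    usd_per_gpu_hour: float = 2.0

    @property
    def total_flops(self) -> float:
        """Approximate training compute: 6 * parameters * tokens."""
        return 6.0 * self.params * self.tokens

    @property
    def gpu_seconds(self) -> float:
        """GPU-seconds needed at the achieved (not peak) throughput."""
        return self.total_flops / (self.peak_flops_per_gpu * self.mfu)

    @property
    def gpu_hours(self) -> float:
        return self.gpu_seconds / 3600.0

    @property
    def wall_clock_days(self) -> float:
        """Wall-clock time, assuming perfect scaling across GPUs."""
        return self.gpu_seconds / self.num_gpus / 86400.0

    @property
    def rental_usd(self) -> float:
        return self.gpu_hours * self.usd_per_gpu_hour


if __name__ == "__main__":
    run = TrainingRun(params=7e9, tokens=1e12)
    print(f"total FLOPs:     {run.total_flops:.2e}")
    print(f"GPU-hours:       {run.gpu_hours:,.0f}")
    print(f"wall-clock days: {run.wall_clock_days:.1f}")
    print(f"GPU rental:      ${run.rental_usd:,.0f}")
```

Swapping in a dense peak-throughput figure, a different MFU, or a different hourly rate is a one-line change, which is the main reason to keep the estimate in code rather than in your head.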

The DeepSeek Lesson

Now let's talk about what happens when very smart engineers optimize the hell out of every piece of the stack. DeepSeek V3 is the most instructive example in the history of foundation models. They trained a six hundred and seventy-one billion parameter mixture-of-experts model — with thirty-seven billion parameters active per token — on fourteen point eight trillion tokens. The total training cost? Five point five million dollars, assuming two dollars per H800 GPU hour.

In the words of the DeepSeek V3 technical report: "Training DeepSeek V3 on each trillion tokens requires only one hundred and eighty thousand H800 GPU hours, that is, three point seven days on our cluster with two thousand forty-eight H800 GPUs. Consequently, our pre-training stage is completed in less than two months."
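Those numbers can be cross-checked against each other in a few lines. This is a sketch using only the figures quoted above, with made-up function names. The result for pre-training alone comes out a little under the headline total, which per the technical report also covers the later context-extension and post-training stages.

```python
# Figures quoted in the text for DeepSeek V3.
GPU_HOURS_PER_TRILLION_TOKENS = 180_000
CLUSTER_GPUS = 2048
PRETRAINING_TOKENS_TRILLIONS = 14.8
USD_PER_GPU_HOUR = 2.0


def days_per_trillion_tokens():
    """Wall-clock days for one trillion tokens on the full cluster."""
    return GPU_HOURS_PER_TRILLION_TOKENS / CLUSTER_GPUS / 24.0


def pretraining_gpu_hours():
    """Total GPU-hours for the whole pre-training corpus."""
    return GPU_HOURS_PER_TRILLION_TOKENS * PRETRAINING_TOKENS_TRILLIONS


def pretraining_days():
    """Wall-clock days for the whole pre-training stage."""
    return days_per_trillion_tokens() * PRETRAINING_TOKENS_TRILLIONS


def pretraining_cost_usd():
    """Rental-equivalent cost of pre-training at the assumed hourly rate."""
    return pretraining_gpu_hours() * USD_PER_GPU_HOUR


if __name__ == "__main__":
    print(f"days per trillion tokens: {days_per_trillion_tokens():.1f}")
    print(f"pre-training GPU-hours:   {pretraining_gpu_hours():,.0f}")
    print(f"pre-training days:        {pretraining_days():.0f}")
    print(f"pre-training cost:        ${pretraining_cost_usd():,.0f}")
```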

They achieved this through four key innovations. Multi-head latent attention cut their memory usage to just seventy kilobytes per token in the key-value cache, about one seventh of competing models. Their mixture-of-experts architecture meant only thirty-seven billion of the six hundred and seventy-one billion parameters activate for any given token, cutting training compute by roughly ninety percent compared to a dense model of the same total size. FP8 mixed precision training halved compute and memory usage with minimal accuracy loss. And a custom multi-plane network topology solved the communication bottleneck in cross-node training.
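The mixture-of-experts saving is worth checking by hand, because it falls straight out of the parameter counts. Here's a minimal sketch, assuming the usual approximation that training compute scales with the parameters that are active per token. The function names are made up for illustration.

```python
# Parameter counts quoted in the text for DeepSeek V3.
TOTAL_PARAMS = 671e9
ACTIVE_PARAMS = 37e9


def active_fraction(active=ACTIVE_PARAMS, total=TOTAL_PARAMS):
    """Share of the model's parameters that fire for any one token."""
    return active / total


def training_flops(params_per_token, tokens):
    """Approximate training compute: 6 * parameters used per token * tokens."""
    return 6.0 * params_per_token * tokens


def saving_vs_dense(active=ACTIVE_PARAMS, total=TOTAL_PARAMS, tokens=14.8e12):
    """Fraction of compute saved relative to a dense model of the same total size."""
    moe = training_flops(active, tokens)
    dense = training_flops(total, tokens)
    return 1.0 - moe / dense


if __name__ == "__main__":
    print(f"active fraction:   {active_fraction():.1%}")
    print(f"compute saving:    {saving_vs_dense():.1%}")
    print(f"MoE training FLOPs: {training_flops(ACTIVE_PARAMS, 14.8e12):.2e}")
```

This ignores routing overhead and the communication cost of spreading experts across machines, so treat the saving as an upper bound on what the architecture alone buys you.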

The lesson from DeepSeek isn't "foundation models are cheap." It's that brilliant engineering can compress the cost by an order of magnitude. Their five point five million dollar figure doesn't include research costs, failed experiments, salaries, or the fact that they already owned the two thousand forty-eight GPUs. But it proves that the compute itself, for a genuinely frontier-capable model, can be surprisingly affordable if you know what you're doing.

Your Shopping List

So let's put together the cheapest plausible foundation model training run. We're building a seven billion parameter dense transformer. We're training on one point five trillion tokens of curated data, mostly from Common Crawl, mixed with Wikipedia, books, code, and scientific papers.

For data, you need between fifty and a hundred terabytes of raw source material, filtered down to roughly three terabytes of clean, tokenized training data at two bytes per token. Common Crawl is free. Wikipedia dumps are free. Project Gutenberg has seventy thousand public domain books. GitHub code is available. arXiv papers are open access. On an optimistic estimate, one pass of the filtering and processing pipeline needs maybe two hundred CPU-hours on a decent machine, which costs under a hundred dollars in cloud compute. The engineering time to build that pipeline is where the real money goes.

For hardware, you're renting sixty-four H100 GPUs on a cloud provider. One point five trillion tokens at forty percent utilization works out to roughly twenty-three thousand GPU-hours, which is about three hundred and sixty hours, or fifteen days, on sixty-four GPUs. At two dollars per GPU per hour, that's about forty-six thousand dollars. Double it for failed runs and debugging, and you're at ninety-two thousand. Add another five thousand for storage and networking.

For software, you're using an open-source training framework. nanoGPT for experiments, then something like the Llama training codebase or Megatron-LM for the real run. These are free but they need expertise to configure and operate. If you're doing this yourself, you're spending at least two months of your time. If you're hiring someone, an ML engineer with foundation model experience costs twenty thousand dollars a month minimum.

For electricity, if you somehow had your own GPUs, sixty-four H100s draw about forty-five kilowatts total. Three hundred and sixty hours of training is about sixteen thousand kilowatt-hours. At Swedish electricity prices, that's maybe fifteen thousand kronor. Next to the GPU bill, the electricity is genuinely one of the cheapest parts.
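Here's the hardware and electricity part of the shopping list as a small sketch you can rerun with your own numbers. The run length is derived from the six-times-parameters-times-tokens rule rather than typed in, so it's worth checking against the hours quoted in the prose. The seven hundred watts per GPU and the price of about zero point nine kronor per kilowatt-hour are assumptions chosen to roughly match the figures above, and all function and key names are made up.

```python
def run_hours(params, tokens, num_gpus, peak_flops=1.9e15, mfu=0.40):
    """Wall-clock hours for the run, assuming perfect scaling across GPUs."""
    gpu_seconds = 6.0 * params * tokens / (peak_flops * mfu)
    return gpu_seconds / 3600.0 / num_gpus


def shopping_list(num_gpus, hours, usd_per_gpu_hour=2.0, retry_multiplier=2.0,
                  storage_network_usd=5_000.0, watts_per_gpu=700.0,
                  sek_per_kwh=0.9):
    """Tally rental cost and, separately, electricity if you owned the GPUs.

    watts_per_gpu and sek_per_kwh are illustrative assumptions.
    """
    gpu_rental = num_gpus * hours * usd_per_gpu_hour
    with_retries = gpu_rental * retry_multiplier
    kwh = num_gpus * watts_per_gpu / 1000.0 * hours
    return {
        "gpu_rental_usd": gpu_rental,
        "with_retries_usd": with_retries,
        "compute_total_usd": with_retries + storage_network_usd,
        "electricity_kwh": kwh,
        "electricity_sek": kwh * sek_per_kwh,
    }


if __name__ == "__main__":
    hours = run_hours(params=7e9, tokens=1.5e12, num_gpus=64)
    print(f"run length: {hours:.0f} hours")
    for item, value in shopping_list(num_gpus=64, hours=hours).items():
        print(f"{item:>20}: {value:,.0f}")
```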

The Real Bottom Line

Here's the uncomfortable total. If you do everything yourself, know exactly what you're doing, use open-source tools, rent GPUs at the cheapest rates, and get lucky with no catastrophic training failures, you could train a genuine seven billion parameter foundation model for somewhere between twenty and fifty thousand dollars. That's the floor. That's the "everything goes right" number.

But you forgot about the months of your own time, the failed experiments that taught you what hyperparameters to use, the three runs you threw away because the loss curve went sideways, and the fact that you need to eat while doing all of this.

A more realistic budget, including one experienced ML engineer's time for three months, a few experimental runs to find the right learning rate and data mixture, and proper evaluation, is somewhere between a hundred and two hundred thousand dollars. Which is still astonishingly cheap for something that would have been worth a billion dollars to any tech company ten years ago.

And here's the thing that makes this thought experiment genuinely interesting. The resulting model? It would be decent but not competitive with anything released in the last two years. Llama three point two's three B model, which Meta released for free, would probably beat your fifty thousand dollar seven B model on most benchmarks. Because Meta spent hundreds of millions training the larger models that the small ones were distilled from. You can't shortcut the research. You can only shortcut the final training run.

The Dataset Is the Model

The deepest insight from all of this is something that the industry has been slowly realizing: the data is worth more than the compute. Data annotation costs for frontier models can now exceed compute costs by up to twenty-eight times, according to the Stanford AI Index. The global data annotation market is projected to grow from two point three billion dollars in twenty twenty-five to nearly ten billion by twenty thirty.

If you're serious about training a foundation model that does something no existing model does — say, a model that deeply understands a specific language, or a specific domain, or a specific style of reasoning — the bottleneck isn't GPUs. It's data. Can you assemble a trillion tokens of genuinely high-quality text in your target domain? Can you clean it, deduplicate it, filter it for quality? Can you do that without accidentally training on someone's copyrighted novel or personal medical records?

This is why the cheapest path to a useful foundation model might not be training from scratch at all. It might be continued pre-training on an existing open model. Take Llama, extend its training on a hundred billion tokens of domain-specific data, and you get something that knows everything Llama knows plus everything your corpus contains. That costs maybe five to ten thousand dollars in compute. Still not pocket change, but it's the price of a used car, not the price of a house.
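The reason continued pre-training is so much cheaper is plain proportionality: for a fixed model size, training compute scales with the number of tokens. A minimal sketch, assuming the same six-times-parameters-times-tokens approximation used for the from-scratch estimate, with made-up function names:

```python
def training_flops(params, tokens):
    """Approximate training compute: 6 * parameters * tokens."""
    return 6.0 * params * tokens


def compute_ratio(params, scratch_tokens, continued_tokens):
    """How many times more compute from-scratch training needs."""
    return training_flops(params, scratch_tokens) / training_flops(params, continued_tokens)


if __name__ == "__main__":
    ratio = compute_ratio(params=7e9, scratch_tokens=1.5e12, continued_tokens=1e11)
    print(f"from scratch needs {ratio:.0f}x the compute of continued pre-training")
```

The ratio only covers the GPU bill. It says nothing about the data work, which is the same size of problem either way for the tokens you actually bring.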

The Kall Datacenter Fantasy

So where does this leave us? Sitting in Kall, Jämtland, population roughly not very many, imagining a world where someone with a good internet connection and a pile of second-hand Dell PowerEdge servers from a datacenter liquidation auction could train a foundation model. The electricity is cheap. The cooling is free for eight months of the year. The ambition is infinite and the budget is finite.

The honest answer is: you could do it. A seven B foundation model trained from scratch, for the cost of a nice car. But unless you have data that nobody else has — unless you've assembled a corpus that makes your model know something that Llama and Mistral and Qwen don't already know — you're better off standing on the shoulders of the giants who already spent the hundreds of millions. Train your LoRA. Do your continued pre-training. And save the from-scratch training for the day you've got a trillion tokens of something truly unique.

But it's fun to know that the door isn't locked. It's just expensive to open.