In March twenty twenty-three, Meta released a language model called Llama. It was not the largest model in the world, but it was the first genuinely powerful one that the research community could access without paying a fortune. The problem was that running it required the kind of hardware that most people do not have. A seventy-billion-parameter model stored in sixteen-bit half precision needs around a hundred and forty gigabytes of memory just to hold the weights. That is more memory than most servers have, let alone a laptop.
Within days of the release, a developer named Georgi Gerganov posted something remarkable. He had taken the Llama model and run it on a MacBook. Not a MacBook Pro with a maxed-out GPU. Not a server disguised as a laptop. A regular machine with a regular CPU and a regular amount of memory. The model was slower than it would be on a data centre GPU, obviously, but it worked. It generated coherent text, answered questions, and did all the things that language models are supposed to do. On hardware that cost a fraction of what a data centre GPU setup costs.
The tool he built was called llama dot cpp, and the crucial trick was quantization, the mathematical art of making a model smaller by reducing the precision of its numbers. Before we can understand what Gerganov did, we need to understand what a language model actually is, at the level of numbers.
A language model like Llama or Qwen or Mistral is, at its core, a collection of numbers. Billions of them. These numbers are the weights, the learned parameters that encode everything the model knows. During training, each weight is a thirty-two-bit floating point number, which means it can represent values with about seven decimal places of precision. A single weight might be something like zero point zero zero three four five six seven eight.
When you have seven billion weights, each stored as a thirty-two-bit number, you need twenty-eight gigabytes just to hold them. For a seventy-billion-parameter model, that jumps to two hundred and eighty gigabytes in full thirty-two-bit precision. Even at sixteen-bit half precision, the baseline most inference systems use, a seventy-billion-parameter model still needs a hundred and forty gigabytes. That is more than four times the memory in your MacBook.
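The arithmetic behind those figures fits in a few lines. Here is a sketch with a hypothetical helper function, using the parameter counts and bit widths just mentioned; real model files run slightly larger because of metadata and per-block bookkeeping.

```python
# Back-of-envelope memory needed just to hold a model's weights.
# bytes = parameter_count * bits_per_weight / 8
def weight_memory_gb(params_billions: float, bits_per_weight: int) -> float:
    bytes_needed = params_billions * 1e9 * bits_per_weight / 8
    return bytes_needed / 1e9  # decimal gigabytes

print(weight_memory_gb(7, 32))   # 28.0  -> 7B model at full 32-bit precision
print(weight_memory_gb(70, 32))  # 280.0 -> 70B model at full precision
print(weight_memory_gb(70, 16))  # 140.0 -> 70B model at half precision
print(weight_memory_gb(7, 4))    # 3.5   -> 7B model at four bits per weight
```

The last line hints at where the rest of this story goes: at four bits per weight, a seven-billion-parameter model drops into laptop territory.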
Here is the insight that makes quantization possible. Not all of that precision matters. A weight of zero point zero zero three four five six seven eight and a weight of zero point zero zero three five are, for practical purposes, the same thing. The model's behaviour changes very little if you round each weight to a less precise representation. The question is: how aggressively can you round before the model starts producing noticeably worse output?
This is the same principle behind audio compression. A CD stores music at sixteen bits per sample. An MP3 reduces that by throwing away information that human ears cannot easily detect. The result is a smaller file that sounds almost identical. Quantization does the same thing for neural networks: throw away precision that the model does not functionally need, and the output stays remarkably close to the original.
Georgi Gerganov is a Bulgarian developer who tends to work quietly and let the code speak. He does not do conference tours or write long blog posts. His GitHub profile is the résumé.
In late twenty twenty-two, before Llama existed, Gerganov started work on a C library for tensor algebra called GGML. The GG in the name is his initials, Georgi Gerganov, and the ML is machine learning. The design was inspired by the work of Fabrice Bellard, the legendary French programmer who has created an almost absurd number of foundational tools including QEMU, FFmpeg, and a C compiler small enough to compile itself. Bellard had released a library called LibNC, and Gerganov saw in it a pattern for building something lean and fast.
GGML was deliberately minimal. Written in pure C with no external dependencies, it focused on two things: strict memory management and efficient inference on CPUs. This was an unusual choice. The entire machine learning world was obsessed with GPUs. Training models requires massive parallel computation, and GPUs excel at that. But Gerganov was not interested in training. He was interested in inference, the act of running a model that has already been trained. And for inference, especially on quantized models, a CPU with plenty of memory can be surprisingly effective.
His first major project with GGML was whisper dot cpp, a C implementation of OpenAI's Whisper speech-to-text model. It ran on CPUs, it was fast, and it proved the approach worked. When Meta released Llama a few months later, Gerganov pivoted immediately. Within days, llama dot cpp existed, and the local AI revolution had a tool to build on.
The early versions of GGML used a simple binary format that worked but was fragile. Model architecture details were sometimes hardcoded. Metadata lived in separate files. Different model types needed different loading code. As the community exploded and people started quantizing dozens of different architectures, the format could not keep up.
In August twenty twenty-three, Gerganov introduced GGUF, the format that solved these problems. The name stands for GGML Universal Format, or sometimes Georgi Gerganov Universal Format, depending on who you ask. The design was simple and powerful. A single binary file containing everything needed to run a model: the weights, the tokenizer vocabulary, the model architecture details, the quantization parameters, and any other metadata. One file, no dependencies, no separate configuration.
The file structure has three sections. A header that identifies the file and tells you how many tensors and metadata entries it contains. A metadata section that stores everything the inference engine needs to know (context length, vocabulary size, layer configuration) in a self-describing key-value format. And a data section that holds the actual weights, packed efficiently according to their quantization type.
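As a sketch of what reading that header involves, here is a minimal parser for the fixed-size fields at the front of a GGUF file, following my reading of the published layout: a four-byte magic, a version number, then tensor and metadata counts as sixty-four-bit little-endian integers. Treat the details as illustrative rather than authoritative.

```python
import struct

def read_gguf_header(buf: bytes) -> dict:
    """Parse the fixed-size header at the start of a GGUF file.

    Assumed layout (little-endian): 4-byte magic b"GGUF", uint32 version,
    uint64 tensor count, uint64 metadata key-value count.
    """
    if buf[:4] != b"GGUF":
        raise ValueError("not a GGUF file")
    version, n_tensors, n_kv = struct.unpack_from("<IQQ", buf, 4)
    return {"version": version, "tensor_count": n_tensors, "metadata_kv_count": n_kv}

# A fabricated header for demonstration: version 3, 291 tensors, 24 metadata entries.
example = b"GGUF" + struct.pack("<IQQ", 3, 291, 24)
print(read_gguf_header(example))
```

Everything after these fixed fields is the self-describing part: typed key-value pairs, then tensor descriptions, then the packed weight data itself.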
This self-contained design is why you can download a single GGUF file from Hugging Face and run it immediately. There is no setup step, no conversion script, no compatibility check. The file contains everything. Ollama, LM Studio, llama dot cpp, and dozens of other tools all read the same format. It became the common language of local AI, the way MP3 became the common language of digital music.
Let's talk about what the quantization numbers actually mean, because you see them every time you pick a model and the choice matters more than most people realise.
When a model is quantized to Q eight, each weight is stored as an eight-bit integer instead of a sixteen-bit float. This cuts the file size roughly in half, and the quality loss is minimal, often imperceptible. Each weight can now take only one of two hundred and fifty-six possible values, and the model barely notices. Q eight is the safe choice, the one where you sacrifice almost nothing.
Going down to Q four means each weight gets only four bits. The file is now a quarter of the original sixteen-bit size. A seven-billion-parameter model that would need fourteen gigabytes in half precision fits in about four gigabytes. This is where things get interesting, because four bits can only represent sixteen different values. Imagine being told you can rate a movie, but only using the numbers one through sixteen. You lose nuance, but you can still communicate a clear preference.
The trick that makes Q four work better than you might expect is that the quantization is not uniform. The weights are processed in blocks, and each block gets its own scaling factor. Within a block of thirty-two weights, the quantizer finds the range of values and maps them onto the sixteen possible four-bit values as efficiently as possible. Some blocks have weights clustered tightly together and quantize beautifully. Others have outliers that get rounded more aggressively.
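A stripped-down version of that block-wise idea fits in a few lines. This is a simplified symmetric absmax quantizer, not the exact bit layout llama.cpp uses, but it shows the core mechanics: one floating point scale per block of thirty-two weights, and a small signed integer for each weight.

```python
BLOCK = 32  # weights per block, as in the common four-bit GGUF types

def quantize_block(weights):
    """Map a block of floats onto signed 4-bit integers plus one float scale."""
    absmax = max(abs(w) for w in weights)
    scale = absmax / 7 if absmax > 0 else 1.0  # 7 = largest magnitude kept in [-7, 7]
    q = [max(-7, min(7, round(w / scale))) for w in weights]
    return scale, q

def dequantize_block(scale, q):
    return [scale * v for v in q]

weights = [0.0034, -0.0121, 0.0002, 0.0087] * 8  # one block of 32 weights
scale, q = quantize_block(weights)
restored = dequantize_block(scale, q)
worst = max(abs(a - b) for a, b in zip(weights, restored))
print(f"worst rounding error in this block: {worst:.5f}")
```

Notice how the error depends on the block: a block with one large outlier stretches the scale and forces the small weights onto coarser steps, which is exactly the behaviour described above.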
Then there are the K-quant variants, which is where the naming gets confusing but the engineering gets clever. Q four K M, the quantization you probably use most often, means four bits with the K-quant strategy at medium quality. The K stands for the approach of using different precision for different parts of the model. The attention layers and certain critical feed-forward weights get more bits. Less important layers get fewer. This mixed-precision approach was a significant innovation in the GGUF ecosystem, developed by the open-source community through trial and error, and it noticeably outperforms naive uniform quantization at the same file size.
At the extreme end, Q two K uses only two bits per weight. A seven-billion-parameter model shrinks to about two and a half gigabytes. You can run it on a phone. The quality is noticeably degraded, sentences become slightly less coherent, reasoning gets shakier, but for simple tasks it still functions. This is the equivalent of compressing a photograph until you can count the individual pixels, usable but not beautiful.
Now you can picture the full chain of events. You type ollama run followed by a model name. Ollama checks its local library for the GGUF file. It reads the header to identify the model architecture and quantization type. It reads the metadata to configure the tokenizer and the context window. Then it begins loading the weight tensors into memory, where they stay in their compact quantized form until they are needed for computation.
When you send a prompt, the model processes your text through layer after layer of matrix multiplications. At each layer, the quantized weights are briefly dequantized, multiplied against the input, and the result passes to the next layer. The trick that GGML and llama dot cpp pioneered is doing this dequantization in a way that is fast enough on a CPU to be practical. Gerganov wrote hand-optimised code for the most common quantization types, using SIMD instructions, the special mathematical operations that modern CPUs provide for processing multiple numbers simultaneously.
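Conceptually, each of those multiplications looks like the sketch below: the weights stay quantized in memory and each block is expanded only for the instant it is used. The block format here is a hypothetical simplification (one float scale plus small integers per block of thirty-two weights), and real kernels fuse the dequantize and multiply into SIMD code rather than Python loops.

```python
BLOCK = 32

def quantized_dot(blocks, x):
    """Dot product of one quantized weight row against an input vector x.

    `blocks` is a list of (scale, [signed 4-bit values]) pairs; each weight is
    dequantized on the fly, used once, and never stored in expanded form.
    """
    total = 0.0
    for i, (scale, q) in enumerate(blocks):
        for j, v in enumerate(q):
            total += (scale * v) * x[i * BLOCK + j]
    return total

# One row of 64 weights, stored as two quantized blocks (values are made up).
row = [(0.01, [3, -7, 1, 0] * 8), (0.02, [2, 2, -1, 5] * 8)]
x = [1.0] * 64
print(quantized_dot(row, x))
```

The memory saving comes from the fact that the expanded floats exist only inside the loop; what persists is four bits per weight plus one scale per block.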
Your MacBook's M-series chip is particularly good at this because it has a unified memory architecture. The CPU, the GPU, and the neural engine all share the same pool of memory. There is no slow copying between separate memory pools. The model weights sit in one place, and whichever processor is doing the computation can access them directly. This is why Apple Silicon machines are disproportionately popular for local AI work, not because they have the fastest raw computation, but because their memory architecture removes a bottleneck that plagues traditional computers.
For your setup with Ollama and your twenty-plus models and your MLX experiments, understanding quantization means making better choices. A Q five K M model is roughly fifteen percent larger than Q four K M but measurably more coherent for complex reasoning tasks. For a model you use daily for coding or writing, that extra gigabyte or two of storage is worth it. For a model you keep around for quick lookups or simple tasks, Q four K M is perfectly fine. And for experimenting with massive models that barely fit in your thirty-two gigabytes, dropping to Q three K M might be the difference between running and not running at all.
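If you want a rough rule of thumb for what fits in your memory, the arithmetic is just average bits per weight times parameter count. The bits-per-weight figures below are ballpark averages for the mixed-precision types, including the per-block scales; they are assumptions for illustration, not values from the format specification.

```python
# Approximate average bits per weight for common GGUF quantization types.
# Ballpark figures only; actual averages vary by model architecture.
BITS_PER_WEIGHT = {"Q8_0": 8.5, "Q5_K_M": 5.5, "Q4_K_M": 4.8, "Q3_K_M": 3.9, "Q2_K": 2.6}

def approx_file_gb(params_billions: float, quant: str) -> float:
    return params_billions * 1e9 * BITS_PER_WEIGHT[quant] / 8 / 1e9

for quant in ("Q8_0", "Q5_K_M", "Q4_K_M", "Q3_K_M", "Q2_K"):
    print(f"7B model at {quant}: roughly {approx_file_gb(7, quant):.1f} GB")
```

Run the same estimate for a thirty-billion or seventy-billion model and you can see at a glance which quantization level squeezes under your machine's memory ceiling.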
There is a poetic symmetry to this story. Fabrice Bellard, the French programmer whose work inspired GGML, is legendary for doing things that should not be possible. He computed pi to a record-breaking number of digits on a desktop PC. He wrote a PC emulator in JavaScript that could boot Linux in a web browser. His projects are characterised by obsessive efficiency, by squeezing maximum performance from minimal resources.
Georgi Gerganov belongs to the same tradition. Where the AI industry was building ever larger models requiring ever more expensive hardware, Gerganov went the opposite direction. He asked: what is the minimum precision needed? What is the simplest possible implementation? What is the leanest file format? And he did it not in a research lab with a team and a budget, but as an open-source contributor who, by his own community's admission, prioritises code over documentation and results over papers.
The GGUF ecosystem was not built by a company. It was not published as a research paper. It was built through pull requests and GitHub issues, by a small group of developers who wrote code faster than anyone could write documentation about it. An unofficial documentation project on GitHub explains the quantization methods, noting that the official developers simply do not prioritise writing things down. This is open source at its most raw, where the code is the documentation, and the community fills in the gaps.
With this episode, we have traced a path through five layers of invisible technology. MQTT moves data from your sensors through a protocol designed for oil pipelines. WireGuard wraps that data in a tunnel so elegant it fits in four thousand lines. systemd orchestrates the services that process and serve that data, keeping them alive and secure. eBPF gives you a programmable window into the kernel underneath it all. And GGUF packages artificial intelligence into files small enough to run on the hardware sitting on your desk.
Each of these technologies was built by a small number of people solving a specific problem. Two IBM engineers and a satellite link. A security researcher and a rootkit. A German developer and a boot process. A Berkeley lab and a packet filter. A Bulgarian programmer and a laptop. Each one created something that outlived its original purpose and became infrastructure that millions of people depend on without knowing it exists.
That is the invisible machine. It is not one system. It is a stack of solutions, layered on top of each other, built by people who cared about elegance and efficiency, running silently underneath everything you build. You were already using all of it. Now you know what it is.