Stupid Content Tracker

An Egotistical Bastard Names His Project

Open the very first file Linus Torvalds wrote for his new version control system. Not the code. The README. The first line reads: "GIT, the stupid content tracker."

Not "the powerful distributed version control system." Not "the next-generation source code management tool." The stupid content tracker. And underneath that, Linus explains what the name means. He offers four interpretations, depending on your mood. A random three-letter combination that is pronounceable and not used by any common Unix command. Stupid, contemptible and despicable, simple, take your pick from the dictionary of slang. A global information tracker, for when it actually works. Or, and this is the one he saved for last, a goddamn idiotic truckload of something unprintable, for when it breaks.

But there's a fifth explanation, the one Linus gave in interviews with a grin that told you he wasn't entirely joking.

I'm an egotistical bastard, and I name all my projects after myself. First Linux, now Git.

In British English, calling someone a git means they're an annoying, unpleasant person. Linus is Finnish, lives in Oregon, and clearly enjoys the irony. He named the most important developer tool of the twenty-first century after a mild British insult. And he documented it, in the project's own README, forever.

Last episode, we watched Linus build Git in two weeks. We saw the speed, the urgency, the audacity of migrating the Linux kernel onto a tool that barely existed. But we skimmed past something crucial. What did he actually build? Not the timeline. The thing itself. What is Git, underneath the commands and the workflows and the GitHub interface that most people think is Git?

The answer turns out to be absurdly simple. And that simplicity is what makes it brilliant.

Snapshots, Not Differences

Most version control systems before Git thought about the world the same way. You have a file. You change the file. The system stores the difference. Line forty-seven changed from this to that. Line one hundred and twelve was deleted. Line two hundred was added. Your history is a stack of differences, and to reconstruct any version, the system starts from a stored copy and replays them, applying one change after another until it arrives at the version you want.

This is, roughly, how CVS worked, and how Subversion worked. It's intuitive, it's space-efficient, and it has a fundamental problem. The further a version sits from that stored copy, the more differences the system has to replay. Want to see what a file looked like three thousand commits ago? In the worst case, the system applies three thousand patches. The more history in between, the slower the retrieval.

Linus threw this model away entirely. Git does not store differences. Git stores snapshots.

Every time you commit, Git takes a picture of every file in your project at that exact moment. Not what changed. Everything. The entire state of the project, frozen in time. If a file hasn't changed since the last commit, Git doesn't store a new copy. It stores a pointer to the existing one. But conceptually, every commit is a complete photograph of your project, not a set of instructions for getting from one version to the next.

Git never ever tracks a single file. Git thinks everything as the full content. All history in Git is based on the history of the whole project.

Linus said that in his two thousand seven talk at Google, and he meant it literally. This changes everything. Want to see what the project looked like six months ago? Git doesn't replay six months of patches. It just loads the snapshot. Instant. Want to compare two versions? Git loads both snapshots and compares them directly. No reconstruction needed.

The trade-off is obvious. Snapshots take more space than differences, right? Not really. Git compresses aggressively. And because it stores content by fingerprint, identical files across different snapshots are stored only once. In practice, Git repositories are often smaller than their Subversion equivalents, despite storing far more information.

But the real genius is not the space efficiency. It is what snapshots enable. When your history is a chain of complete pictures rather than a stack of instructions, operations that were expensive become cheap, and operations that were impossible become trivial. Within a few years, this would enable a collaboration model so fluid it would need an entirely new platform to manage it, a story for Episode Ten.

But snapshots create a problem of their own. Thousands of them, across thousands of contributors, spread across the globe. How do you identify them? How do you know if one has been corrupted in transit? Linus found his answer in an unexpected field.

The Fingerprint

Not version control. Not operating systems. Cryptography.

Every piece of content that Git stores gets a name. Not chosen by a human, not assigned by a counter. The name is computed from the content itself. Git runs the content through a mathematical function, a hash function, and the output is a string of forty hexadecimal characters, the digits zero through nine and the letters a through f. That string is the content's fingerprint. Its identity. Its address.

The same content always produces the same fingerprint. Always. Feed the exact same file into the function on any computer, on any operating system, today or ten years from now, and you get the identical forty-character result. But change a single character in the file, even one comma, one space, one letter, and the fingerprint is completely different. Not slightly different. Utterly, unrecognizably different.
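To make the fingerprint concrete, here is a minimal Python sketch of how a blob's name can be computed. It assumes Git's SHA-1 object format, where a short header (the word blob, the content length, and a zero byte) is hashed together with the raw content. The function name git_blob_id and the sample strings are illustrative choices, not part of Git.

```python
import hashlib


def git_blob_id(content: bytes) -> str:
    """Return the SHA-1 fingerprint Git would assign to a blob with this content."""
    # Git hashes a header of the form b"blob <size>\0" followed by the raw bytes.
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()


original = b"hello world\n"
edited = b"hello world!\n"  # one extra character

print(git_blob_id(original))  # 3b18e512dba79e4c8300dd08aeb37f8e728b8dad
print(git_blob_id(edited))    # a completely different forty characters
```

The first value should match what git hash-object reports for a file containing the same bytes, on any machine.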

This has beautiful consequences. If two files are identical, they get the same fingerprint, so Git only stores them once. It doesn't matter if the files have different names or live in different directories. Same content, same fingerprint, stored once. If your project has ten thousand files on each of fifty branches, and seven thousand of those files are identical on every branch, Git stores those seven thousand files once, not three hundred and fifty thousand times.

And here's the property Linus cared about most. If anything gets corrupted, a bit flip on your hard drive, a network error during transfer, a cosmic ray hitting your memory, the fingerprint won't match. Git can recompute the fingerprint of anything it reads back, and it does exactly that when objects arrive over the network or when you run git fsck. If the check fails, Git knows the data is damaged.

If you have disc corruption, if you have RAM corruption, if you have any kind of problems at all, Git will notice them. It's not a question of if. It's a guarantee.

This wasn't theoretical paranoia. Linus had watched the Linux kernel grow to millions of lines of code managed by thousands of developers across the globe. Data integrity wasn't a nice-to-have. It was existential. One corrupted byte in a kernel patch could crash every machine running that kernel. BitKeeper had taught him what was possible. Now he baked verification into the foundation.
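The store-once and detect-damage behavior can be sketched in a few lines of Python. This is a toy model under stated assumptions, not Git's implementation: the ObjectStore class and its put and get methods are hypothetical, the fingerprint is a plain SHA-1 of the content without Git's object header, and the corruption is simulated by overwriting a stored entry.

```python
import hashlib
import zlib


class ObjectStore:
    """Toy content-addressed store: the key is the SHA-1 of the content."""

    def __init__(self):
        self.objects = {}  # fingerprint -> compressed bytes

    def put(self, content: bytes) -> str:
        fingerprint = hashlib.sha1(content).hexdigest()
        # Identical content maps to the same key, so it is stored only once.
        self.objects.setdefault(fingerprint, zlib.compress(content))
        return fingerprint

    def get(self, fingerprint: str) -> bytes:
        content = zlib.decompress(self.objects[fingerprint])
        # Recompute the fingerprint; a mismatch means the data was damaged.
        if hashlib.sha1(content).hexdigest() != fingerprint:
            raise ValueError(f"object {fingerprint} is corrupt")
        return content


store = ObjectStore()
a = store.put(b"license text")  # say, LICENSE at the top level
b = store.put(b"license text")  # the same bytes under docs/LICENSE
print(a == b, len(store.objects))  # True 1

store.objects[a] = zlib.compress(b"license text?")  # simulate damage on disk
try:
    store.get(a)
except ValueError as err:
    print(err)
```

The check needs no second copy of the data to compare against. The name itself is the checksum.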

You need to know exactly twenty bytes. You need to know the name of the top of your tree. And if you know that, you can trust your tree, all the way down, the whole history.

Think about what that means. One single fingerprint, twenty bytes, which is what those forty characters come to in binary, lets you verify the entire history of a project. Every file, every directory, every commit, every person who ever contributed. Because each fingerprint references other fingerprints, and those reference still more, all the way down. A chain of trust built from mathematics. No central authority needed. No server to verify against. Just the content and its fingerprint, verifiable by anyone, anywhere.
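Here is a rough Python sketch of why one fingerprint at the top can vouch for everything beneath it. The helper names oid, tree_id and commit_id are made up for illustration, and the tree and commit bodies use a simplified text layout rather than Git's exact byte format. Only the header convention, the object type and size followed by a zero byte, mirrors what Git hashes.

```python
import hashlib


def oid(kind: str, body: bytes) -> str:
    # Hash a header of the form "<kind> <size>" plus a zero byte, then the body.
    header = f"{kind} {len(body)}".encode() + b"\0"
    return hashlib.sha1(header + body).hexdigest()


def tree_id(files: dict) -> str:
    # Simplified tree body: one "<blob id> <name>" line per file, sorted by name.
    lines = [f"{oid('blob', data)} {name}" for name, data in sorted(files.items())]
    return oid("tree", "\n".join(lines).encode())


def commit_id(tree: str, parent, message: str) -> str:
    # Simplified commit body: the tree, the parent (if any), then the message.
    parent_line = f"parent {parent}\n" if parent else ""
    return oid("commit", f"tree {tree}\n{parent_line}\n{message}\n".encode())


first = commit_id(tree_id({"README": b"stupid content tracker\n"}), None, "initial")
honest = commit_id(tree_id({"README": b"v2\n"}), first, "update")

# Tamper with the very first snapshot, then rebuild the same history on top of it.
forged_first = commit_id(tree_id({"README": b"Stupid content tracker\n"}), None, "initial")
forged = commit_id(tree_id({"README": b"v2\n"}), forged_first, "update")

print(honest == forged)  # False: the newest fingerprint exposes the old change
```

The latest snapshot is byte-for-byte identical in both histories, yet the tip fingerprints differ, because each commit's name depends on its parent's name.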

The hash function Linus chose in two thousand five was called SHA-1. It was the industry standard. Banks used it. Governments used it. Certificate authorities that secured the entire web used it. Linus did not choose it for security. He chose it for integrity, because it was the best available tool for catching corruption. He was making a bet, whether he knew it or not. A bet that the cryptographic foundation he was baking into his system would hold. For twelve years, that bet looked unshakeable.

Four Building Blocks

Everything in Git is built from exactly four types of objects. That's it. Four. The entire system, every feature, every command, every workflow, all of it rests on four simple primitives.

The first is the blob. A blob is just content. The raw bytes of a file, compressed, stored with its fingerprint as its name. A blob doesn't know its own filename. It doesn't know what directory it lives in. It doesn't know what project it belongs to. It's just content with an address.

The second is the tree. A tree is a directory listing. It contains entries that say: this filename points to that blob fingerprint, and this subdirectory name points to that other tree fingerprint. Trees give structure to blobs. They're the table of contents that tells you which content lives where.

The third is the commit. A commit is a snapshot of the entire project. It points to one tree, the root directory at that moment in time, and it carries metadata: who made this snapshot, when they made it, why they made it, and crucially, which commit came before it, or which commits in the case of a merge. That parent link is what creates the timeline. Commits point backward to their parents, forming a chain that stretches all the way back to the project's first moment.

The fourth is the tag. A tag is a human-readable label attached to a specific commit. Where a commit fingerprint is a forty-character string that nobody memorizes, a tag says: this moment is called "version two point oh." Tags are how humans navigate a history that is otherwise addressed by cryptographic fingerprints.

Blobs, trees, commits, tags. Content, structure, history, names. Everything else in Git is built on top of these. When you run git log, you're walking the chain of commits, following parent links backward through time. When you run git diff between two commits, you're loading both commits, finding their trees, and comparing which blobs changed. When you run git show on a specific commit, you're opening one snapshot and reading its tree and metadata.

These commands, git log, git diff, git show, are your windows into the storage model. They let you look inside Git and see what it actually contains, not just the current files on disk but the complete history of everything that ever was.
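As a loose illustration of what those commands do underneath, here is a small Python model. Everything in it is hypothetical: the Commit class, the short made-up fingerprints such as c1 and b1, and the flattened dictionary standing in for tree objects. It sketches the idea of walking parents and comparing fingerprints, not how Git itself is written.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Commit:
    tree: dict                    # path -> blob fingerprint (stand-in for tree objects)
    message: str
    parent: Optional[str] = None  # fingerprint of the previous commit, if any


# Hypothetical object database keyed by made-up short fingerprints.
commits = {
    "c1": Commit({"README": "b1"}, "initial import"),
    "c2": Commit({"README": "b1", "main.c": "b2"}, "add main.c", parent="c1"),
    "c3": Commit({"README": "b3", "main.c": "b2"}, "reword README", parent="c2"),
}


def log(tip):
    """Walk parent links backward from tip, in the spirit of git log."""
    history = []
    while tip is not None:
        history.append((tip, commits[tip].message))
        tip = commits[tip].parent
    return history


def diff(old, new):
    """List the paths whose blob fingerprints differ between two snapshots."""
    a, b = commits[old].tree, commits[new].tree
    return sorted(path for path in a.keys() | b.keys() if a.get(path) != b.get(path))


print(log("c3"))
print(diff("c1", "c3"))  # ['README', 'main.c']
```

Notice that diff never opens a file. Two equal fingerprints mean equal content, so unchanged paths are skipped without reading a byte.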

A Database That Happens to Do Version Control

Linus said something in his two thousand seven talk at Google that reveals how he actually thinks about Git. He said Git isn't a version control system. It's a content-addressable filesystem. A database where you look things up by their content, not by a name or an index number.

Think about a regular database. You store a record and it gets an ID, maybe number forty-seven. That ID is arbitrary. It tells you nothing about the record. You could change the record entirely and it would still be number forty-seven.

Git's database works differently. The ID is the content. The fingerprint is derived from the data itself. You can't change the data without changing the ID. You can't have two different pieces of data with the same ID. The address and the content are mathematically bound together.

This is why Git calls itself a "content-addressable filesystem" in its documentation. And this is why calling it a "stupid content tracker" was, in Linus's characteristically self-deprecating way, accurate. At its core, Git tracks content. It gives content addresses based on what the content is. Everything else, the version control, the branching, the merging, the collaboration, is built on top of that foundation.

And once you understand this, every Git operation suddenly makes sense. Why is branching cheap? Because a branch is just a pointer to a commit fingerprint. One tiny file containing forty characters. Creating a branch doesn't copy anything. It doesn't duplicate history. It writes forty characters to a file and you're done. Why can Git detect corruption? Because every piece of content is named by its fingerprint. Change the content, the name doesn't match anymore. Why is merging possible? Because Git can find the common ancestor of two branches by walking their commit chains backward until they meet. It's all just graph traversal over the four object types.
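Both ideas, the branch as a pointer and the merge as graph traversal, fit in a short Python sketch. The commit names and the parents table below are invented for illustration, and the search is a simple breadth-first walk that assumes a small, tidy history. Git's real merge-base logic handles far messier graphs.

```python
from collections import deque

# Hypothetical commit graph: each commit fingerprint maps to its parent fingerprints.
parents = {
    "a1": [],
    "b2": ["a1"],
    "c3": ["b2"],  # main continues here
    "d4": ["b2"],  # feature forks from b2
    "e5": ["d4"],
}

# A branch is only a name pointing at one commit fingerprint.
branches = {"main": "c3"}
branches["feature"] = "e5"  # creating a branch writes one short string, nothing more


def ancestors(tip):
    """Return every commit reachable from tip by following parent links."""
    seen, stack = set(), [tip]
    while stack:
        commit = stack.pop()
        if commit not in seen:
            seen.add(commit)
            stack.extend(parents[commit])
    return seen


def merge_base(left, right):
    """Walk back from right, breadth-first, until reaching an ancestor of left."""
    reachable = ancestors(left)
    queue, visited = deque([right]), set()
    while queue:
        commit = queue.popleft()
        if commit in reachable:
            return commit
        if commit not in visited:
            visited.add(commit)
            queue.extend(parents[commit])
    return None


print(merge_base(branches["main"], branches["feature"]))  # b2
```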

SHA-1 as far as Git is concerned isn't even a security feature. It's purely a consistency check.

Linus said that in the same Google talk. He chose the hash function for integrity, not security. He wanted to catch corruption, not prevent attacks. That distinction would matter enormously twelve years later.

The Crack in the Foundation

The promise of a hash function is that different inputs produce different outputs. Mathematically, collisions are inevitable, because there are more possible inputs than possible outputs, but finding one should be so computationally expensive that it is effectively impossible. For SHA-1, the security margin on paper in two thousand five was enormous. Finding two different inputs with the same fingerprint by brute force takes on the order of two to the eightieth power hash computations, more than a trillion trillion, far beyond the reach of the computers of the day.

Cryptographers knew this wouldn't last forever. Computing power grows. Attacks get smarter. By two thousand five, researchers had already published theoretical weaknesses in SHA-1. The question wasn't whether SHA-1 would eventually fall. It was when.

The answer arrived on February twenty-third, two thousand seventeen. A team of researchers from CWI Amsterdam and Google, led by Marc Stevens at CWI and Elie Bursztein at Google, announced SHAttered. They had produced two different PDF files with the same SHA-1 fingerprint. The first practical collision.

It wasn't cheap. The computation required the equivalent of six thousand five hundred years of CPU time and one hundred and ten years of GPU time. Google donated the computing infrastructure. Stevens and Bursztein had been collaborating for roughly two years to make the theoretical attack practical. But they did it. Two different files, same fingerprint. The promise was broken.

For most of the internet, this was manageable. Web browsers had already started deprecating SHA-1 certificates in two thousand fifteen. Banks had migrated. Certificate authorities had moved on. But Git had SHA-1 baked into its bones. Every object, every commit, every piece of history in every Git repository in the world was addressed by a SHA-1 fingerprint. You couldn't just swap it out like changing a lightbulb.

The Git community responded quickly. Linus's own response, posted within days, was characteristically calm. He pointed out that Git used SHA-1 for integrity, not security. A collision attack required deliberate effort and enormous computing power. Nobody was going to accidentally produce a collision. And Git's use of SHA-1 was structured differently from the PDF format the researchers exploited, making the specific SHAttered technique harder to apply to Git objects.

But "harder" isn't "impossible," and the Git developers knew the clock was ticking. Marc Stevens, one of the SHAttered researchers, had already written a collision detection algorithm. GitHub integrated it within a month of the announcement. Git itself adopted it in version two point thirteen, released in May two thousand seventeen. Every object written to a Git repository now gets checked for collision signatures.

The longer-term fix was harder. Migrate Git from SHA-1 to SHA-256, a stronger algorithm with a wider fingerprint. This work began in earnest around two thousand eighteen and has been progressing gradually ever since. It's a massive engineering effort because the fingerprint is woven into everything. Object storage, network protocols, the way repositories talk to each other, the way tools parse Git output. Changing the hash function means changing all of it, while maintaining backward compatibility with billions of existing objects addressed by SHA-1 fingerprints.

As of two thousand twenty-five, the transition is still underway. Git can create and work with SHA-256 repositories, but full interoperability between SHA-1 and SHA-256 repositories remains a work in progress. The migration will likely take years more to complete across the entire Git infrastructure. Twenty years of content addressed by one algorithm is a lot of content to bridge to another.

This is the cost of baking a cryptographic algorithm into your foundation. When Linus chose SHA-1 in two thousand five, it was the right choice. Twelve years later, the foundation cracked. Not catastrophically, but visibly. And the repair is a decade-long project that's still not finished.

But here's the thing. The design survived. The four object types didn't change. The content-addressing model didn't change. The commit graph didn't change. Only the fingerprinting function needs updating, and the rest of the architecture absorbs the change because the design was simple enough to be adaptable. Swap one hash function for another, and the model works the same way. That's what simplicity buys you. Not invulnerability, but resilience.

Linus built Git in two weeks. He gave it four object types, content-based addressing, and a stupid name. Twenty years later, the design is still fundamentally the same. The hash function is changing. The scale has grown beyond anything he imagined. But the four building blocks, blob, tree, commit, tag, remain. Content, structure, history, names. That is all there is. Everything else is built on top.

The stupid content tracker. It tracks content. It is named after an insult. And its foundational simplicity is why it could scale to run the world. Because when a platform decided to build a social network on top of the graph, the graph was ready.

git log dash dash oneline. That is all you need. Every commit compressed to one line, a short fingerprint and whatever message the author left behind. Add dash dash graph and Git draws you an ASCII art family tree, branches splitting and merging in a cascade of slashes and backslashes. It is the entire content-addressable database we just spent an episode explaining, rendered as something a human can actually read. The stupid content tracker, handing you the story it never cared about storing.