Git Good
Git at the Limits
S1 E18 · 23m · Feb 21, 2026
Microsoft's Windows repository is 300 gigabytes with 3.5 million files—git status takes minutes, cloning takes half a day, but engineers found ways to make it work.

Git at the Limits

Three Hundred Gigabytes

The Windows operating system repository is over three hundred gigabytes. Three and a half million files. Four thousand engineers pushing changes every day, averaging over eight thousand pushes and two thousand five hundred pull requests across more than four thousand active branches. Running git status on this repository, the command that simply asks "what changed since my last snapshot," takes minutes. Cloning the entire thing takes half a day. Just checking out a branch, the act of switching your working directory to reflect a different snapshot, takes hours.

Git was designed for the Linux kernel, which is large by normal standards. Tens of millions of lines of code, thousands of contributors, decades of history. But the Linux kernel repository is roughly four gigabytes. The Windows repository is seventy-five times larger. And Windows is not even the biggest codebase in the world.

Google stores over two billion lines of code in a single repository. Eighty-six terabytes of data. Nine million source files. Twenty-five thousand developers making sixteen thousand changes every single day, plus another twenty-four thousand automated commits from bots.

Facebook had the same problem. One massive repository holding the code for the app used by billions of people, and the version control tools available in two thousand twelve could not handle it.

These are the companies that broke Git. Or rather, these are the companies that discovered what happens when you take a tool designed by one person in two weeks to manage the Linux kernel, and try to use it for something seventy-five times larger. The answer, it turns out, is that every design decision has a cost at scale, and every decision Linus Torvalds made in April two thousand five, especially the brilliant one about every clone being complete, becomes a problem when "complete" means three hundred gigabytes.

Google's Secret System

Google never even tried to use Git.

By the early two thousands, Google's codebase was already massive and growing exponentially. They were using Perforce, the commercial version control system favored by large enterprises and game studios. But even Perforce was starting to buckle. The repository was growing faster than any off-the-shelf tool could handle.

So Google did what Google does. They built their own.

The system is called Piper, and it is nothing like Git. Where Git gives every developer a complete copy of the entire repository, Piper gives developers a view. You see the files you need. You work on your piece of the codebase. The full repository, all eighty-six terabytes and two billion lines, lives on Google's infrastructure, distributed across ten data centers around the world, backed by Spanner, Google's globally consistent database.

The migration from Perforce to Piper took over four years. Perforce had embedded itself deeply into Google's workflows over eleven years of use. Build systems, code review tools, automated testing, all of it was wired to Perforce. Untangling that and rewiring it to Piper was an enormous infrastructure project that most of the software world never heard about.

The reason Google keeps everything in one repository is philosophical, not just practical. In a monorepo, any engineer can see any code. If you depend on a library maintained by another team, you can read its source, trace its behavior, even fix a bug in it and submit the change for review. There are no walls between teams at the code level. When a core library needs a breaking change, the team making the change is responsible for updating every caller across the entire company. Not just their own code. Everyone's code.

This creates a culture of shared ownership. No code is someone else's problem. The trade-off is that you need a version control system that can handle the entire company's output in a single place, and no off-the-shelf tool could do that.

Google also built CitC, short for Clients in the Cloud. It is a virtual filesystem that presents a writable workspace to each developer. You see the whole repository but only the files you actually open get fetched from the server. The rest exist as lightweight placeholders until you need them. Sound familiar? Microsoft would arrive at almost the same idea years later, for almost the same reasons.

The numbers are staggering. As of the two thousand sixteen paper that Google published in the Communications of the ACM, authored by Rachel Potvin and Josh Levenberg, Piper was handling thirty-five million commits across its lifetime, with sixteen thousand human-authored changes landing every single workday. The repository had been growing continuously for over fifteen years at that point, and the system showed no signs of strain.

But Piper is proprietary. It runs on Google's infrastructure. You cannot download it, install it, or use it for your own projects. It solves Google's problem and nobody else's. The lesson it teaches is that the monorepo model works, even at scales that would make Git fall over completely, if you are willing to build custom infrastructure to support it.

Facebook Picks the Other Side

Facebook had the same scaling problem, but they made a completely different choice.

Around two thousand twelve, Facebook's codebase was growing fast enough that their engineers could project the moment when their existing tools would stop working. The repository was on track to become so large that basic operations, the kind developers run dozens of times per day, would slow to a crawl.

They evaluated their options. Git was the obvious candidate. It was already the most popular version control system in the world. Most Facebook engineers knew it. The tooling was mature, the community was enormous.

But when Facebook's engineers tested Git against their projected repository size, the results were bad. Basic commands would take over forty-five minutes on a repository matching Facebook's expected growth, because Git's status command has to examine every single file in the working directory to determine what changed. At the scale Facebook was heading toward, that meant millions of files, each one needing a filesystem stat call.

So Facebook did something unusual. They went to the Git maintainers and asked for help. Could Git be modified to handle this kind of scale? Could the status command be made smarter? Could the underlying architecture be adjusted?

The answer they got was, essentially, no. The Git maintainers recommended that Facebook split their repository into smaller pieces. Shard the monorepo into many smaller repos. The maintainers were not interested in optimizing for monorepos at Facebook's scale. They saw the problem as Facebook's architecture, not Git's limitation.

Facebook disagreed. They believed in the monorepo model for the same reasons Google did. Shared visibility, atomic cross-project changes, no dependency management headaches between repositories. Splitting the repo was a non-starter.

So they turned to Mercurial, Git's twin born the same week in April two thousand five. Mercurial had one crucial advantage for Facebook's purposes. It was written mostly in clean, modular Python, making it deeply extensible. Git's internals, a mix of C and shell scripts evolved over a decade, were far harder to modify at the architectural level. An engineer named Bryan O'Sullivan led the evaluation and migration effort, and his diplomatic approach helped secure buy-in across the company for what was, to many engineers, a baffling decision. Why would you move away from the tool everyone already knows?

Facebook's engineers also found the Mercurial developer community more willing to collaborate. Where the Git maintainers had told Facebook to change their architecture, the Mercurial maintainers worked with Facebook to change the tool. Over eighteen months, Facebook contributed more than five hundred patches to Mercurial's codebase. That is not a typo. Five hundred patches. This was not a company slapping a wrapper around an existing tool. This was a deep partnership between a tech giant and an open source community.

The key innovation was an extension called remotefilelog. Standard version control, both Git and Mercurial, downloads the full history of every file when you clone a repository. Remotefilelog changed that. It downloaded only commit metadata during clone. The actual file contents were fetched on demand, only when a developer actually opened or diffed a file, served through a memcache layer for speed. Clone and pull operations became ten times faster.
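On the client side, remotefilelog was enabled the way any Mercurial extension is, through the hgrc config file. A minimal illustrative stanza (only the extension line is shown; server-side and cache option names are omitted rather than guessed):

```
# .hg/hgrc — enable the remotefilelog extension on a client
[extensions]
remotefilelog =
```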

They also integrated Facebook's Watchman, a file system monitor, directly into Mercurial. Instead of scanning every file to determine what changed, the modified Mercurial asked the operating system which files had been touched since the last check. The result was a status command that ran five times faster than Git's equivalent on the same repository.

Durham Goode, one of the engineers leading the effort, put it plainly.

Achieving these types of performance gains through extensions is one of the big reasons we chose Mercurial.

The extensibility was the point.

But Facebook was not done. Even with these improvements, the standard Mercurial server architecture was struggling under the load of thousands of engineers hammering it simultaneously. So Facebook built Mononoke, a Mercurial server written from scratch in Rust, designed specifically for massive monorepos. Mononoke was not a patch on an existing server. It was a complete reimagining of what a version control server could be, built for the kind of scale that no publicly available tool had ever handled.

Years later, Facebook would take the lessons from all of this, the modified Mercurial client, the remotefilelog extension, the Watchman integration, and build Sapling, an entirely new source control client released as open source in two thousand twenty-two. Sapling is compatible with both Git and Mercurial repositories, representing everything Facebook learned from a decade of pushing version control to its limits.

The irony is sharp. Facebook went to Git for help, was told to change their architecture, turned to Mercurial instead, spent a decade building custom infrastructure, and eventually built a tool that works with Git anyway. The circle closed, but it took ten years.

Microsoft Reinvents the Checkout

If Google went around Git and Facebook went around Git's community, Microsoft went straight through the middle.

In May two thousand seventeen, Microsoft announced that the Windows operating system, one of the most complex software projects in human history, was now being developed using Git. Not a modified Git. Not a Git alternative. Git itself, with one enormous addition bolted on top.

The addition was called GVFS, the Git Virtual File System, later renamed VFS for Git. The concept was deceptively simple. Instead of downloading all three and a half million files when you clone the Windows repository, VFS for Git creates a virtual filesystem that makes it look like every file is present. But the files are not actually there. They are placeholder entries in the filesystem. When you open a file, read it, compile it, or diff it, the virtual filesystem intercepts that access, downloads the file content from the server, and stores it locally. From that point on, the file is real. But the millions of files you never touch remain as lightweight placeholders.

Brian Harry, Microsoft's Corporate Vice President for Cloud Developer Services, wrote about the project in a blog post titled "The largest Git repo on the planet." The numbers tell the story.

We set out to bring the Windows codebase into a single Git repo in Azure DevOps. Some of the numbers are staggering.

Before VFS for Git, many basic commands on the Windows repository simply never completed. They would run for thirty minutes, an hour, and then the developer would give up or the operation would time out. After VFS for Git, clone took about two minutes. Checkout took thirty seconds. Status, the command that used to take minutes of painful waiting, ran in four to five seconds.

The rollout happened in waves during two thousand seventeen. First, two thousand engineers from the Windows OneCore team in March. Then another thousand in April. Then three to four hundred more in May. By the time Harry wrote his blog post, three thousand five hundred of the four thousand Windows engineers had migrated. Two weeks after the first wave, they surveyed the engineers. Sixty-two percent reported satisfaction, which, for a massive infrastructure change imposed on thousands of people who did not ask for it, is remarkably high.

The key insight in VFS for Git was that Git's internal commands, status, checkout, reset, all assume they need to examine every file. This assumption is wired deep into Git's architecture. For a five thousand file repository, nobody notices. For three and a half million files, it is a catastrophe. VFS for Git taught Git to only care about the files the developer had actually touched, turning every operation from a full scan to a much smaller one.

Microsoft open-sourced VFS for Git under the MIT license. They also contributed changes upstream to Git itself, improving the core tool's ability to handle large repositories. Some of these improvements, like the filesystem monitor integration and the multi-pack index, eventually made their way into standard Git, benefiting everyone.

The project later evolved into Scalar, a simpler tool that configures Git's built-in performance features for large repositories without requiring the full virtual filesystem layer. As Git itself got better at handling scale, the need for the aggressive virtualization approach diminished. Scalar is now maintained within the Git project itself, which means Microsoft's investment in making Git work for three hundred gigabytes trickled down into improvements for repositories of every size.
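Scalar itself is invoked as scalar clone, but much of what it manages is ordinary Git configuration. A sketch of applying a few of those settings by hand, in a throwaway repository under /tmp (the selection here is an assumption, not Scalar's exact list):

```shell
# Throwaway repo for demonstration (illustrative path).
rm -rf /tmp/big-repo
git init --quiet /tmp/big-repo

# A few of the large-repo features a Scalar-style setup leans on:
git -C /tmp/big-repo config core.untrackedCache true     # cache untracked-file scans
git -C /tmp/big-repo config fetch.writeCommitGraph true  # refresh the commit-graph on fetch
git -C /tmp/big-repo config core.fsmonitor true          # built-in filesystem monitor (git 2.37+)

# Confirm a setting took effect.
git -C /tmp/big-repo config core.untrackedCache
```

Scalar also sets up background maintenance and partial clone defaults; the point is that its work is configuration of stock Git, not a separate engine.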

The Scaling Tricks

Google built a custom system. Facebook switched to a different tool. Microsoft built a virtual filesystem. But for the rest of the world, for the thousands of companies with repositories that are large but not Windows-large, Git itself has developed a set of scaling tricks. Each one is a workaround for the same fundamental assumption: every clone has everything.

The first trick is the shallow clone. When you run git clone with a depth flag, you tell Git to only download recent history. A depth of one gives you just the latest snapshot. No parent commits, no history at all. Just the current state of the code. This is enormously faster for large repositories and is the standard approach for continuous integration systems that need to build the code but do not need to know what happened six months ago.

The trade-off is real. With a shallow clone, you cannot run git log to explore history, you cannot use git blame to see who wrote a specific line, you cannot use git bisect to hunt down the commit that introduced a bug. You have a snapshot, not a story.
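A minimal sketch of the trade, using a throwaway repository under /tmp: the shallow clone receives only the newest commit, and the parent is simply never downloaded.

```shell
# Set up a tiny source repo with two commits (illustrative paths).
rm -rf /tmp/demo-src /tmp/demo-shallow
git init --quiet /tmp/demo-src
git -C /tmp/demo-src -c user.email=a@b -c user.name=demo \
    commit --allow-empty -m "first" --quiet
git -C /tmp/demo-src -c user.email=a@b -c user.name=demo \
    commit --allow-empty -m "second" --quiet

# Shallow clone: --depth 1 downloads only the latest snapshot.
# (file:// forces the transport that honors --depth; a plain local path does not.)
git clone --quiet --depth 1 file:///tmp/demo-src /tmp/demo-shallow

# Only one commit exists in the clone; the history ends at the shallow boundary.
git -C /tmp/demo-shallow rev-list --count HEAD   # prints 1
```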

The second trick is sparse checkout. Standard Git materializes every file in the repository into your working directory. Sparse checkout tells Git to only create the files you actually need. If you work on the frontend and never touch the backend or the documentation, sparse checkout lets your working directory contain only the frontend files. The rest of the repository still exists in Git's internal storage, but your filesystem only shows what you asked for.

This is what git sparse-checkout does. You define patterns, directory paths or wildcards that describe the files you care about. Git creates a working directory containing only those files. Everything else is hidden. You still have the full history, you can still search across the entire codebase with Git commands, but your filesystem is clean and focused. For a three and a half million file repository, the difference between materializing everything and materializing only your team's files is the difference between an unusable checkout and an instant one.
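A small sketch of the effect, with an illustrative two-directory repository under /tmp: after the sparse-checkout command, only the requested directory is materialized on disk, while the full history remains in Git's storage.

```shell
# Illustrative repo with two top-level directories.
rm -rf /tmp/sparse-demo
git init --quiet /tmp/sparse-demo
cd /tmp/sparse-demo
mkdir frontend backend
echo "ui"  > frontend/app.js
echo "api" > backend/server.py
git add .
git -c user.email=a@b -c user.name=demo commit -m "init" --quiet

# Restrict the working directory to frontend/ only.
git sparse-checkout set frontend

ls   # shows frontend/ but not backend/; git log still covers everything
```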

The third trick is Git LFS, Large File Storage. Git was designed to track text files, source code, configuration, documentation, the kind of content where diffs are meaningful and compression works well. Binary files, images, videos, compiled assets, machine learning models, game textures, are a nightmare for Git. Every version of a binary file is stored in full because diffs between binary blobs are meaningless. A repository with a hundred versions of a fifty megabyte image file contains five gigabytes of image data that Git faithfully replicates to every clone.

Git LFS, developed by GitHub and Atlassian and released in two thousand fifteen, solves this by replacing large files with small text pointers inside the Git repository. The actual file content lives on a separate server. When you check out a branch, LFS intercepts the pointer files and downloads the real content from the server. Your working directory looks normal, but the Git repository itself stays lean.

The command is git lfs. You tell it which file types to track, typically by extension, and from that point on, those files are managed by LFS instead of by Git's standard storage. Clone and fetch operations skip the large file content. Checkout downloads what you need. The repository shrinks dramatically.
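The bookkeeping is visible in the repository itself. After running git lfs install and then git lfs track for a file pattern, LFS records the tracked patterns in a .gitattributes file, routing those paths through its filter. An illustrative fragment (the patterns here are examples, not a recommendation):

```
# .gitattributes — each tracked pattern is handed off to the LFS filter
*.psd filter=lfs diff=lfs merge=lfs -text
*.mp4 filter=lfs diff=lfs merge=lfs -text
```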

The fourth trick is the commit graph file. Git's history is a chain of commits, each pointing to its parent. Walking that chain, which is what git log does, requires reading each commit object from disk, one at a time. For repositories with hundreds of thousands of commits, this is slow.

The commit graph file pre-computes and stores this chain in a single, efficiently structured file. Instead of reading thousands of individual objects, Git reads one file that contains the entire graph of relationships. The result is dramatically faster log operations, faster reachability queries, and faster merge base calculations. The command git gc, which cleans up and optimizes a repository, generates this file automatically.
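Writing the file by hand looks like this, sketched against a throwaway repository under /tmp; the commit-graph lands as a single file under .git/objects/info.

```shell
# A small repo with a few commits (illustrative).
rm -rf /tmp/cg-demo
git init --quiet /tmp/cg-demo
for msg in one two three; do
  git -C /tmp/cg-demo -c user.email=a@b -c user.name=demo \
      commit --allow-empty -m "$msg" --quiet
done

# Pre-compute the history graph and tell Git to read from it.
git -C /tmp/cg-demo commit-graph write --reachable
git -C /tmp/cg-demo config core.commitGraph true

ls /tmp/cg-demo/.git/objects/info/commit-graph
```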

Then there is partial clone, perhaps the most ambitious of Git's scaling features. Partial clone changes something fundamental about what "clone" means. Instead of downloading every object in the repository, a partial clone downloads only what you immediately need. Missing objects are fetched from the server on demand, transparently, when you access them. This is conceptually similar to what Facebook built with remotefilelog and what Microsoft built with VFS for Git, but implemented within Git itself, without requiring a custom server or a virtual filesystem.

Partial clone is still relatively young. It requires server support and works best with hosting platforms that have invested in the feature. But it represents Git's internal answer to the scaling problem, an acknowledgment from the project itself that "every clone has everything" cannot remain the universal default as repositories grow. This relentless pressure of scale does not just bend tools. It creates new risks, and the security of the software supply chain, a theme we will turn to soon, is where those risks become most dangerous.

When Every Clone Has Everything, and Everything is Too Much

Here is the tension at the heart of this episode. Linus Torvalds designed Git around a principle that was revolutionary in two thousand five: every clone is a complete repository. No central server is required. Every developer has the full history, every branch, every commit, right on their machine. You can work offline, verify integrity locally, recover from server failure by pushing from any clone. The distributed nature of Git was not just a feature. It was the foundation.

That principle made Git resilient. It made Git fast for normal repositories. It made distributed collaboration possible without any special infrastructure. It was, and remains, a beautiful design decision.

But it was a decision that came with a hidden cost, visible only at extreme scale. When "everything" means three hundred gigabytes, every clone having everything is not resilient. It is impossible. No developer needs three and a half million files to work on their corner of Windows. No continuous integration server needs the full history of a two billion line codebase to build one component.

The three companies that hit this wall first each solved it by violating the principle in different ways. Google abandoned Git's model entirely, building a centralized system where developers see only what they need. Facebook modified a different tool to download files on demand, reversing the "clone everything first" assumption. Microsoft built a virtual filesystem that lies to Git, making it think all files are present when they are not.

And Git itself is slowly adapting. Shallow clones, sparse checkout, partial clone, LFS, each is a concession to the reality that "every clone has everything" is an ideal, not a universal truth. The commit graph file is an optimization for repositories where walking the full history has become too expensive.

The meta-narrative here is about the hidden costs of design decisions. Every architecture has a scale at which its assumptions break. Linus was building for the Linux kernel, a large project by any measure, but one with a four gigabyte repository. He could not have predicted that his tool would be used for codebases seventy-five times larger. The decisions that made Git perfect for the kernel, complete clones, local-first operations, content-addressed storage, became the exact properties that needed workarounds at three hundred gigabytes.

This is not a criticism. It is a law of engineering. The measure of a good design is not that it works at every scale, because nothing does. The measure is how gracefully it adapts when it hits the wall.

Git has adapted. Not elegantly, not all at once, but steadily. Shallow clones appeared early. LFS solved the binary file problem. Sparse checkout reduced the working directory. The commit graph sped up history traversal. Partial clone is teaching Git to be lazy, to defer downloading until the moment of need. And Microsoft's contributions, born from the pressure of three hundred gigabytes, have improved Git for everyone, even developers who will never work on a repository that large.

The tool that one person built in two weeks to manage the Linux kernel now manages codebases a hundred times larger. It needed help. It needed tricks. It needed companies with thousands of engineers and billions of dollars to push it, pull it, and in some cases, build entirely new infrastructure around it. But it adapted. And that, more than speed or elegance or any single command, might be Git's greatest strength.
