Git Good
The Scale Problem: When Everything Is Too Much
S2 E29 · 19m · Apr 05, 2026
When a developer's first day includes a one-hour git clone, something has broken—and it's not the code.


The Clone That Broke the Morning

This is episode twenty-nine of Git Good, and the last episode in our act about how Git reshaped the way teams work. We have talked about workflows, code review, and the green squares that measure your worth. Now we arrive at the question that sits underneath all of those: what happens when the repository itself becomes the problem?

Imagine a developer on their first day at a new company. They open a terminal, type git clone, and wait. Ten minutes pass. Thirty minutes pass. The progress bar barely moves. An hour later, someone from the team walks over and says, do not bother. We will give you a pre-built image. Nobody clones from scratch here.

That first-day experience reveals something fundamental about the company. Not about its code quality or its deployment pipeline or its test coverage. About its relationship with scale. Because the size of your repository is not just a technical constraint. It is an organizational decision that shapes how people work, how teams communicate, and who gets to change what.

Season one told the technical story of scale. Episode eighteen walked through the numbers: Google's eighty-six terabytes, Microsoft's three hundred gigabytes, Facebook's projected forty-five-minute status command. It showed how three companies each broke Git in their own way and built custom solutions. If you want the engineering details, that episode has them.

This episode is about what comes after the engineering. The decision those companies made before they wrote a single line of infrastructure code. The question every growing organization eventually faces. Do we keep everything in one place, or do we split it apart? And why does the answer change everyone's job?

One Repo to Rule Them All

The word monorepo sounds like a technical term, but it is really an organizational philosophy. When you keep all your code in a single repository, you are making a statement about how your company works. You are saying that any engineer should be able to see any code. You are saying that when one team changes an interface, they are responsible for fixing every team that depends on it. You are saying that the boundaries between teams are soft, porous, negotiable.

Google made this philosophy explicit. In two thousand sixteen, Rachel Potvin and Josh Levenberg published a paper in Communications of the ACM that laid out the case with the kind of precision only Google could provide. Twenty-five thousand engineers. One repository. Sixteen thousand human-authored changes per day. Twenty-four thousand more from automated systems. Eighty-six terabytes. Two billion lines of code.

But the numbers were not the argument. The argument was cultural. Potvin described a practice Google calls "atomic changes." When a core library needs an update, the team making the change does not publish a new version and hope everyone upgrades. They update every caller. Every single one, across the entire company. One commit touches hundreds of files owned by dozens of teams, and it either lands as a whole or not at all.

The key insight is that we do not have a diamond dependency problem. There is only one version of everything. The latest.

That quote captures the monorepo philosophy in two sentences. There is no version three of the logging library running alongside version four. There is no team frozen on an old dependency because upgrading would break their tests. There is just one truth, and it is always current.

This is transformative for how teams work. In a monorepo, you never send an email saying "please upgrade to version two point three of our API by end of quarter." You just change the API, fix the callers, and submit the change. The team whose code you touched reviews the parts that affect them. If they object, you negotiate. The conversation happens in the code review, not in a meeting six weeks later.

Facebook arrived at the same conclusion through a different path. Their monorepo held the code for the app used by billions of people, and they believed, as Google did, that splitting it into separate repositories would create walls between teams. Walls that would slow down cross-cutting changes, fragment shared libraries, and make it harder for any single engineer to understand how the pieces fit together.

But belief in the monorepo does not make the monorepo work. Google had to build Piper, a custom version control system running on Spanner across ten data centers. Facebook had to abandon Git entirely. They went to the Git maintainers first and asked for help with scaling. The answer was, essentially, split your repository into smaller pieces. Facebook disagreed. They believed in the monorepo. So they turned to Mercurial, contributed over five hundred patches to it, built a new server called Mononoke in Rust, and created a virtual filesystem called EdenFS. Durham Goode, one of the engineers leading the effort, put it simply.

Achieving these types of performance gains through extensions is one of the big reasons we chose Mercurial.

Years later, Facebook would take all of those lessons and build Sapling, an entirely new source control client that, in a twist of irony, is compatible with Git. Microsoft, meanwhile, had to invent VFS for Git just to make three and a half million files manageable.

The cost of the monorepo is the cost of making the monorepo possible. And that cost is enormous. Custom tooling, custom infrastructure, dedicated teams whose entire job is keeping the repository usable. Google has entire divisions focused on build and version control infrastructure. This is not overhead. This is the foundation that makes the monorepo philosophy function. Without it, the philosophy is just an aspiration and a very slow git status command.

The Case for Walls

The alternative is the polyrepo. Each service, each library, each component gets its own repository. Your team owns your repo. You control who can commit to it. You decide your release schedule, your branching strategy, your dependency versions. You publish versioned artifacts, and other teams consume them when they are ready.

This sounds like freedom. And it is, in the same way that a country of independent city-states is free. Each city governs itself. Each city moves at its own pace. The cost arrives when you need to coordinate.

In a polyrepo world, updating a shared library means publishing a new version, then waiting for every consuming team to upgrade. Some upgrade immediately. Some upgrade next quarter. Some never upgrade. You end up with six different versions of the same library running in production, each with its own bugs, its own security patches, its own quirks. The diamond dependency problem, where two libraries depend on different versions of the same third library, becomes a daily reality rather than a theoretical concern.

The coordination tax is real. A study from Buildkite found that polyrepo teams spend significantly more time on dependency management and cross-repository integration than their monorepo counterparts. The time is invisible because it is distributed across dozens of small tasks: bumping version numbers, updating lock files, chasing down which team broke the contract, writing compatibility layers for APIs that changed out from under you.

But the polyrepo has genuine strengths that the monorepo advocates tend to dismiss. Isolation is not just organizational convenience. It is a security boundary. When each team has their own repository, a compromised credential exposes one service, not the entire company's code. Access control is granular by default, not bolted on after the fact. Teams can adopt different languages, different build systems, different testing frameworks without fighting a shared tool chain.

And the polyrepo scales in a way the monorepo does not: socially. A monorepo with a thousand engineers requires governance. Who approves cross-cutting changes? Who decides when a core library can make a breaking change? Who reviews the commit that touches two hundred files across forty teams? These questions have answers at Google, where the infrastructure investment makes them manageable. They have much harder answers at a company of five hundred people that chose a monorepo because Google did, without investing in the tooling that makes it work.

The Debate That Never Ends

The monorepo versus polyrepo argument has been raging for over a decade, and it generates more heat than light because the participants are often arguing about different things.

When a Google engineer says monorepo, they mean a system with custom tooling, cloud-based workspaces, automated dependency updates, a global build system, and dedicated infrastructure teams. When a startup founder says monorepo, they mean one Git repository on GitHub with no special tooling, checked out in full on every developer's laptop, with builds that take longer every month.

These are not the same thing. The startup monorepo works beautifully at fifty people and fifty thousand files. It might work at two hundred people and two hundred thousand files. Somewhere past that, git status starts to feel sluggish, clone times creep up, and the build takes long enough that developers start checking their phones while they wait. The startup is now in the gap between "one repository is easy" and "we can afford to build Google-scale infrastructure." Most companies live in that gap permanently.

We set out to bring the Windows codebase into a single Git repo in Azure DevOps. Some of the numbers are staggering.

Brian Harry wrote those words in two thousand seventeen when Microsoft decided to migrate Windows to Git. Four thousand engineers, three and a half million files, three hundred gigabytes. Microsoft had the resources to build VFS for Git, then evolve it into Scalar, then contribute those improvements upstream. The key phrase in Harry's blog post is not the staggering numbers. It is the phrase "we set out." Microsoft decided the monorepo was worth the investment. They committed engineering teams, years of effort, and significant political capital within the Git community to make it work.

Most companies do not have that option. They do not have teams to dedicate to build tooling. They do not have the political weight to push patches upstream into Git. They are choosing between a monorepo that will slowly degrade as they grow and a polyrepo that will slowly fragment as they grow. Both are correct. Both are painful. The question is which kind of pain matches your organization.

What Git Assumes

Underneath the organizational debate sits a technical assumption that shapes everything. Git was designed around the idea that you clone everything. When Linus Torvalds built Git in two thousand five for the Linux kernel, "everything" was a few gigabytes. Manageable. More than manageable, elegant. Every developer has the full story. No server required. Complete independence.

That assumption is embedded deep. Git status checks every file in the working directory. Git clone downloads every object. Git log walks the entire commit history. These are not bugs. They are features, designed for a world where the repository fits comfortably on a laptop.

The features that soften this assumption are all relatively recent. Sparse checkout lets you materialize only the files you need. Partial clone downloads objects on demand instead of all at once. The commit graph file pre-computes history traversal so log does not have to read every commit individually. Scalar, Microsoft's contribution, bundles these optimizations into a single configuration that makes large repositories usable.
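As a hedged sketch of how those features fit together: the commands below are standard Git (roughly 2.25 and later), run against a throwaway repository created on the spot, since a real large repository is not something a snippet can assume. The paths are hypothetical.

```shell
# Sketch: exercising partial clone and the commit-graph file on a
# throwaway repository. All paths are hypothetical; the flags are
# standard Git options (roughly Git 2.25+).
set -e
tmp=$(mktemp -d) && cd "$tmp"

# Build a small "server" repository to clone from.
git init -q server && cd server
echo hello > readme.txt
git add . && git -c user.name=demo -c user.email=demo@example.com commit -qm "init"
cd ..

# Partial clone: fetch commits and trees now, file contents on demand.
# (The file:// form forces a transport that understands --filter.)
git clone -q --filter=blob:none "file://$tmp/server" client
cd client

# Commit-graph file: pre-compute history traversal so git log does not
# have to parse every commit object individually.
git commit-graph write --reachable

# Scalar bundles these optimizations, plus background maintenance:
#   scalar clone <url>
```

On a toy repository the difference is invisible; on a repository with millions of objects, the blob filter is the difference between a coffee-break clone and a lunch-break one.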

But these features feel like patches on a philosophy, because they are. Git's distributed model assumes completeness. Every optimization that breaks that assumption, every feature that says "you do not need all of this right now," moves Git a little closer to the centralized model it was designed to replace. Sparse checkout is, conceptually, a view on a central repository. Partial clone is on-demand fetching from a server. These are the same ideas that Perforce and Subversion used, reintroduced through the back door.

This is not a failure. It is evolution. The Linux kernel is still a few gigabytes. Most open source projects are smaller. Git's original assumptions serve them perfectly. The scaling features exist for the organizations that outgrew those assumptions, and they coexist with the original design without breaking it for everyone else.

The Quiet Steward and the Enterprise

There is a human story inside this technical evolution that is easy to miss. Junio Hamano, Git's lead maintainer since two thousand five, has spent two decades managing the tension between Git's origins and its enterprise ambitions. When Microsoft showed up with patches to make Git handle three hundred gigabytes, those patches added complexity to a tool that prided itself on simplicity. When Google engineers contributed partial clone support, the implementation touched deep internals that affected every Git user, not just the ones working at scale.

Junio's approach, consistent across twenty years of maintenance, is to accept changes that benefit the broader community while preserving Git's core model. The partial clone patches went through years of review. The sparse checkout redesign went through multiple iterations. Microsoft's filesystem monitor integration, which lets Git ask the operating system what changed instead of checking every file, was controversial precisely because it added a new dependency to a tool that had survived on minimal dependencies.
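For what opting in looks like today, here is a minimal sketch. It assumes the builtin monitor that shipped in Git 2.37, whose bundled daemon currently targets macOS and Windows; on other platforms the setting is accepted but the daemon may be unavailable.

```shell
# Sketch: opting into Git's builtin filesystem monitor (Git 2.37+;
# the bundled daemon currently targets macOS and Windows).
set -e
repo=$(mktemp -d) && cd "$repo" && git init -q

# Tell git status to ask the operating system what changed instead of
# checking every file in the working directory.
git config core.fsmonitor true
git config core.untrackedcache true

# Subsequent `git status` runs consult the daemon rather than walking
# the whole working tree -- the win grows with repository size.
```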

Each of these changes makes Git better for large repositories. Each one also makes Git more complex. The tool that started as a few thousand lines of C, written in two weeks by a frustrated kernel developer, now has features that exist solely because companies with tens of thousands of engineers need them. Whether that is growth or bloat depends on which end of the scale you sit at.

The Accelerant

And now the calculation changes again. AI-generated code is entering repositories at an unprecedented rate. A developer using an AI coding assistant might produce three or four times the volume of changes they would write by hand. Those changes land as commits, accumulate as history, and consume storage. The repository grows faster.

This has implications in both directions. If repositories grow faster, the case for monorepo tooling grows stronger, because the gap between "manageable" and "too large for vanilla Git" arrives sooner. A startup that might have had five comfortable years before scale became an issue might now have three. The scaling tricks, sparse checkout, partial clone, commit graphs, move from nice-to-have to essential on a shorter timeline.

But AI also complicates the monorepo's greatest strength. The atomic change, the cross-cutting commit that updates every caller when an interface changes, works because a human engineer understands the change they are making. When an AI generates those updates, the question of review becomes critical. Who verifies that the AI-generated migration across two hundred files is correct? The human who prompted it might not understand all two hundred files. The teams whose code was touched certainly did not write the changes. The monorepo's promise of shared ownership depends on someone actually understanding the shared code, and AI is stretching that assumption thin.

The polyrepo faces a different AI problem. If every team manages their own dependencies, and AI accelerates how quickly those dependencies change, the coordination tax multiplies. More versions, more breaking changes, more upgrade cycles, all moving faster than before. The walls between repositories, which provided safety through isolation, now also prevent the kind of sweeping, coordinated changes that AI is uniquely good at.

Neither model has a clean answer yet. The scale problem, the question of how to organize millions of lines of code across thousands of people, was hard enough when humans wrote all of it. AI does not simplify the problem. It accelerates it.

The Choice Nobody Talks About

Here is what the monorepo versus polyrepo debate usually misses. The choice is not just technical and not just organizational. It is a statement about what kind of company you want to be.

A monorepo says: we are one team. Boundaries between groups are soft. Any engineer can read, understand, and change any code. Coordination happens through the code itself, through atomic changes and shared review, not through meetings and version negotiations. The cost is infrastructure. The cost is complexity. The cost is that when the repository is broken, everyone is broken.

A polyrepo says: we are many teams. Boundaries are real. Each group owns their territory and publishes contracts that others consume. Coordination happens through versioned interfaces, through negotiation and upgrade cycles, not through reaching into someone else's code. The cost is fragmentation. The cost is duplication. The cost is that when a dependency changes, the ripple takes weeks to reach every consumer.

Google, Facebook, and Microsoft each chose the monorepo, and each spent years and millions of dollars building infrastructure to make that choice sustainable. They could afford to. They had to, because their codebases were already too large and too interconnected to split apart. The monorepo was not a philosophy they adopted. It was a reality they invested in.

Most companies are not in that position. Most companies have a choice. And the honest answer, the one that almost no blog post or conference talk gives, is that neither option is correct in the abstract. The right choice depends on how large your codebase is, how fast it is growing, how much your teams need to share code, how much you can invest in tooling, and how much coordination cost you are willing to absorb.

Git does not have an opinion. Git will happily store one monorepo or a thousand polyrepos. It was designed for the Linux kernel, which is one of the most successful monorepos in the world, managed not by custom infrastructure but by a mailing list and the taste of one maintainer. That model works for Linux. It would not work for Google. And Google's model would not work for your team of twelve.

The scale problem is not a problem to solve. It is a problem to choose. You choose your pain, you invest in the infrastructure that makes your choice sustainable, and you accept that the other choice would have been equally valid with different trade-offs. Every growing company discovers this eventually. The ones that discover it early build better infrastructure. The ones that discover it late spend a year migrating.

That was episode twenty-nine of Git Good, and the end of our second act about how Git changed the way we work. When we return, Git escapes the terminal entirely. What happens when a tool designed for source code starts tracking novels, scientific experiments, and the law itself? The answer is stranger than you would expect.

git sparse-checkout set, followed by a directory path, is the command that tells Git you only need this corner of the universe. In a repository with ten thousand files, sparse checkout lets you work with the fifty that matter to you. Everything else exists in Git's internal storage, but your working directory stays clean and focused. It is the closest thing Git has to a view on a centralized repository, the admission that every clone having everything does not mean every working directory needs to show it. For teams in the gap between small and Google-scale, it is often the first scaling trick that makes a real difference.
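A concrete sketch of that session, using a toy repository built on the spot (the directory names are hypothetical; this is the cone-mode syntax from Git 2.25+):

```shell
# Sketch: a sparse-checkout session. Directory names are hypothetical.
set -e
tmp=$(mktemp -d) && cd "$tmp"

# A toy repository with two top-level directories.
git init -q demo && cd demo
mkdir -p services/payments docs
echo code  > services/payments/main.txt
echo prose > docs/guide.txt
git add . && git -c user.name=demo -c user.email=demo@example.com commit -qm "init"

# Keep only the corner of the universe you work in.
git sparse-checkout set services/payments

# docs/ is now gone from the working tree but still in Git's storage.
git sparse-checkout list      # prints the directories kept
git sparse-checkout disable   # restore the full working tree
```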