This is episode twenty-five of Git Good, and the third episode in our second season. The first two episodes showed you the wall. The learning curve that keeps beginners out. The error messages that read like threats. Now we are going to look at what happens when an entire organization decides to climb that wall together.
In two thousand eight, Google released the Android source code to the public. The codebase was eight and a half million lines long, not counting the Linux kernel, and it had been developed using Subversion. To make it work as an open source project, Google's engineers moved the whole thing to Git. They could not just run a single migration command, because a repository that large, split across that many components, would have choked any single Git clone. So they built a tool called repo, a wrapper that manages hundreds of Git repositories through a single manifest file. They built another tool called Gerrit for code review. And they open sourced both of them alongside the code itself. The migration was not a weekend project. It was an infrastructure rewrite that produced tools still in use today, almost two decades later.
Android's story gets repeated, in different sizes and different levels of pain, every single day. Somewhere right now, a team is migrating to Git. Maybe they are leaving Subversion. Maybe Perforce. Maybe a proprietary system nobody outside the company has heard of. They have read the blog posts. They have watched the conference talks. They believe Git is the future, and they are probably right. What they have not done, almost certainly, is accurately estimated what the migration will cost them.
Moving to Git is not installing new software. That part takes five minutes. The real migration is everything else.
There is the history. Every commit, every branch, every tag from the old system needs to come along, or the team needs to decide what to leave behind. There is the muscle memory. The senior developer who has typed svn commit every day for fifteen years does not switch to git commit and git push without months of stumbling. There are the scripts, the build systems, the continuous integration pipelines, all wired to the old tool. There are the permissions, because Subversion lets you control access at the directory level and Git does not. There is the workflow, because Subversion's one central server and sequential revision numbers are a fundamentally different mental model from Git's distributed graph of snapshots. And there is the knowledge gap, the same wall we talked about in the last two episodes, except now the entire team hits it at once.
The organizations that handle this well are the ones that treat migration as a project, not an event. They budget months. They run the old and new systems in parallel. They train people before the cutover. They accept that productivity will drop for weeks and possibly months after the switch.
The organizations that handle it badly are the ones where someone in management reads an article about how everyone uses Git now and sends an email on Friday afternoon.
The largest Git migration in history happened at Microsoft, and it almost did not work.
In two thousand seventeen, the Windows operating system lived across more than forty Source Depot servers. Source Depot was Microsoft's internal version control system, a fork of Perforce they had maintained for years. The Windows codebase was three and a half million files. Checked into a Git repository, it would have been roughly three hundred gigabytes. Four thousand engineers worked on it every day, producing over eight thousand pushes, twenty-five hundred pull requests, and one thousand seven hundred sixty daily builds across four hundred forty branches.
Microsoft wanted to move all of this to Git as part of their One Engineering System initiative, the effort to get every team in the company onto the same tools. The problem was simple to state and brutal to solve. Git is a distributed version control system. That means every developer gets a complete copy of the repository. A complete copy of the Windows repository was three hundred gigabytes. On a good connection, cloning it would take twelve hours. On a mediocre connection, it would take days. And that was just the clone. Running git status, the command that tells you what has changed, took ten minutes. Checking out a branch took two to three hours.
Git copies the entire repo and all its history to your local machine. Doing that with Windows is laughable.
Brian Harry, a technical fellow at Microsoft who had been working on developer tools for decades, led the effort. His team's insight was that the problem was not Git's data model. Git's data model was fine. The problem was that Git assumed you wanted everything. Every file, every version, every blob, downloaded to your machine before you could do any work.
Many of the commands would take thirty minutes up to hours.
For a repository the size of Windows, that assumption was a dealbreaker.
So they built a virtual filesystem. They called it GVFS, the Git Virtual File System, and the idea was almost absurd in its simplicity. What if Git thought all three and a half million files were on your machine, but they were not? GVFS intercepted the operating system's file access calls and downloaded each file on demand, the first time a program tried to read it. From Git's perspective, the working directory looked complete. From the network's perspective, the developer only downloaded the few thousand files they actually touched.
It worked. Clone went from twelve hours to minutes. Status went from ten minutes to four or five seconds. Checkout went from hours to thirty seconds. Microsoft announced GVFS at the Git Merge conference in Brussels in February two thousand seventeen, and by May, three thousand five hundred of the four thousand Windows engineers had migrated. They surveyed engineers two weeks after the switch. Sixty-seven percent reported being satisfied, which sounds modest until you consider that these people had just had their entire workflow rewritten underneath them.
But GVFS had problems. It required a custom filesystem driver, which meant it only worked on Windows at first. When the Microsoft Office team needed to migrate their own monorepo, the macOS engineers could not use GVFS because Apple had deprecated the kernel features it depended on. So the team pivoted. Instead of virtualizing the filesystem, they leaned on features Git was developing upstream: partial clone, which lets you download objects on demand, and sparse checkout, which lets you tell Git you only care about certain directories. A developer named Derrick Stolee and his team rebuilt the tooling as a lightweight configuration layer called Scalar, porting it from C Sharp to C and shrinking it from tens of thousands of lines to fewer than three thousand. And in October two thousand twenty-two, Scalar was merged into Git itself, version two point thirty-eight. The code that had started as a massive filesystem driver ended its journey as a thin configuration layer. Stolee described the philosophy behind Scalar's final form as a preference for incremental changes over complete rewrites.
Each individual movement was relatively small compared to the entire system.
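Those upstream features now ship in every stock copy of Git, and you can try the sparse checkout half of the story on any machine. What follows is a minimal sketch in a throwaway repository; the directory layout and file names are invented for illustration, and partial clone is only noted in a comment because it needs a cooperating server on the other end.

```shell
# Sketch: cone-mode sparse checkout, the feature Scalar configures for you.
# Layout and names here are hypothetical. The companion feature, partial
# clone, would be:  git clone --filter=blob:none <url>  (needs server support).
set -e
cd "$(mktemp -d)"
git init -q mono && cd mono

mkdir -p services/search services/mail docs
echo "search code" > services/search/main.txt
echo "mail code"   > services/mail/main.txt
echo "readme"      > docs/readme.txt
git add .
git -c user.name=demo -c user.email=demo@example.com commit -qm "initial"

# Tell Git you only care about one directory. The rest of the tree
# vanishes from the working directory but stays in the object store.
git sparse-checkout init --cone
git sparse-checkout set services/search

ls services    # only "search" remains checked out
```

After the last command, services/mail and docs are gone from disk, and a plain git sparse-checkout disable brings the full tree back. That is the thin configuration layer Scalar automates.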
The Windows migration is a success story, but it is also a story about how much money and engineering talent was required to make Git work for one repository. Microsoft had to invent new technology, build a filesystem driver, contribute hundreds of patches to Git upstream, rewrite the tooling twice, and deploy it to four thousand engineers over the course of years. Most organizations migrating to Git do not have those resources. They have the same problems at a smaller scale and no team of systems programmers to build solutions.
Not every large company migrated. Two of the biggest decided Git was not the answer, and their reasons are instructive.
In two thousand twelve, Facebook's monorepo was growing fast. Their engineers ran projections and discovered that basic Git commands would take over forty-five minutes on their projected codebase. So they did what any reasonable team would do. They went to the Git maintainers and asked for help.
The response was blunt. Split your repository. The Git maintainers told Facebook that a single repository of that size was not what Git was designed for, and the right solution was to break the codebase into smaller pieces. One response on the mailing list put it plainly: there is only so much you can do about checking the status of one point three million files.
Facebook was surprised. Not by the technical limitation, which they understood, but by the unwillingness to address it. They did not want to split their monorepo. A monorepo meant every engineer could see every line of code, could make atomic changes across services, could refactor without coordinating across repository boundaries. The monorepo was not a limitation they were trying to work around. It was a deliberate architectural choice.
So they went to the Mercurial project instead. Mercurial, Git's old rival from the two thousand five format wars, had a similar performance profile but a fundamentally different architecture. It was written in Python with clean extension points, designed from the start to be modified. Where Git's maintainers said "use smaller repositories," the Mercurial community said "show us what you need." Facebook sent engineers to a Mercurial hackathon in Amsterdam. They found the community, in their words, impressively welcoming to aggressive changes.
A developer named Bryan O'Sullivan, who had been a Mercurial contributor before joining Facebook, led the adoption. His team did not force the switch. They spent months socializing the possibility internally. They mapped common Git commands to Mercurial equivalents. They analyzed which Git operations their engineers ran most frequently. They created forums where people could voice concerns. And then they migrated, building custom tools on top of Mercurial: a server called Mononoke, written in Rust, designed for massive monorepos. A virtual filesystem called Eden. A workflow called stacked diffs that changed how code review worked.
A decade later, in November two thousand twenty-two, Meta open sourced the result. They called it Sapling. Durham Goode, who had been working on it since the beginning, wrote the announcement on what he described as his tenth year at Meta working on this problem, almost to the day. The punchline is that Sapling is Git compatible. It can clone Git repositories and push to GitHub. After a decade of building an alternative to Git, Meta's tool ended up speaking Git's language anyway, because that is what the rest of the world uses.
Sapling began ten years ago as an initiative to make our monorepo scale in the face of tremendous growth, starting as an extension to the Mercurial open source project, and rapidly growing into a system of its own.
Then there is Google. Google never migrated to Git at all. Their internal tool, Piper, manages a single repository that contains roughly two billion lines of code across nine million source files. Eighty-six terabytes of data. Twenty-five thousand engineers making forty thousand commits per day, all to the same codebase. Piper is built on top of Spanner, Google's globally distributed database, replicated across ten data centers using the Paxos consensus algorithm.
Despite several years of experimentation, Google was not able to find a commercially available or open source version control system to support such scale in a single repository.
Rachel Potvin and Josh Levenberg published a paper in two thousand sixteen explaining why. The core issue was architectural. A git clone copies the entire repository to your machine. At Google's scale, that would mean downloading eighty-six terabytes. Even partial solutions would require splitting Google's monorepo into thousands of separate repositories, which would defeat the entire purpose of having a monorepo. Git's distributed model assumes every developer has a full copy. Google needed a centralized model where developers only see the files they need. Piper gives them that. It is purpose-built, server-backed, with sparse views and integrated code review. It is also, by design, not Git.
The irony is that Google uses Git everywhere else. Android runs on Git, with the repo tool managing hundreds of Git repositories. Chromium is on Git. Kubernetes is on Git. Google hosts one of the largest Git services in the world at googlesource.com. But the heart of the company, the monorepo that contains Search and Gmail and Maps and YouTube and everything else, runs on a tool the rest of the industry has never seen.
The stories that get conference talks are the ones with three hundred gigabyte repositories and four thousand engineers. But most migrations look nothing like that. Most migrations involve a team of ten or twenty people, a Subversion server that has been running for a decade, and a sense that everyone else has already moved on.
These migrations do not require inventing new filesystem drivers. They do require confronting a different set of problems, problems that are human-shaped rather than technology-shaped.
The tool that bridges the gap is called git-svn. It is a built-in Git command that lets you clone a Subversion repository into a Git repository, converting each Subversion revision into a Git commit. You can work locally with Git branches and merges and all the flexibility that provides, and then push your changes back to the Subversion server as if you had been using svn commit all along. In theory, it lets a team migrate incrementally. Some developers switch to Git while others stay on Subversion. The bridge keeps both sides in sync.
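Concretely, the bridge looks something like this. The server URL, the authors file, and the project name are placeholders for whatever your repository actually uses:

```shell
# Sketch of an incremental git-svn setup. svn.example.com and authors.txt
# are hypothetical; --stdlayout assumes the classic trunk/branches/tags shape.
git svn clone --stdlayout --authors-file=authors.txt \
    https://svn.example.com/repo project
cd project

# Work locally with ordinary Git: branch, commit, merge.
git svn rebase     # pull new Subversion revisions into your local history
git svn dcommit    # replay your local Git commits as Subversion commits
```

The authors file maps Subversion usernames to Git's name-and-email identities, and it is worth preparing before the clone, because fixing attribution afterward means rewriting history.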
In practice, git-svn is a patience test. For a repository with fifteen thousand revisions, which is modest by any professional standard, the initial clone can take four days. The tool has been known to leak memory and crash partway through, requiring you to start over. It does not handle non-standard Subversion layouts well. If your repository does not follow the classic trunk, branches, and tags directory structure, git-svn gets confused. Branches in Subversion become remote branches in Git with an unfamiliar naming scheme. Merge history, if it existed in Subversion at all, often does not survive the translation.
And then there are the things that cannot be translated at all. Subversion tracks file renames explicitly. When you run svn move, the server records that a file was renamed, and the entire history follows the file to its new name. Git does not track renames. It detects them after the fact, by comparing the content of deleted and added files. If the content is at least fifty percent similar, Git guesses it was a rename. If you renamed a file and also changed most of its contents, Git sees a deletion and an unrelated addition. The history splits. Run git log on the new filename and you see commits going back only to the rename. The older history, under the old filename, is only visible if you pass the follow flag, and even then Git can get it wrong.
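You can watch the split happen in a disposable repository. The file names below are invented:

```shell
# Sketch: rename detection and the history split. Names are hypothetical.
set -e
cd "$(mktemp -d)"
git init -q demo && cd demo

printf 'alpha\nbeta\ngamma\ndelta\n' > parser.txt
git add parser.txt
git -c user.name=demo -c user.email=demo@example.com commit -qm "add parser"

git mv parser.txt lexer.txt
git -c user.name=demo -c user.email=demo@example.com commit -qm "rename parser to lexer"

git log --oneline -- lexer.txt
# one commit: plain log stops at the rename

git log --oneline --follow -- lexer.txt
# two commits: --follow compares content to guess its way across the rename
```

That --follow flag only accepts a single path at a time, which is part of why it stays obscure.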
For a team whose Subversion repository has fifteen years of carefully maintained history, watching that history fragment during migration is not a technical inconvenience. It is a loss. Developers have strong feelings about history. They want to know who wrote a line of code and why. They want to trace the evolution of a module across years of refactoring. The promise of version control is that nothing is forgotten. When a migration breaks that promise, even partially, it shakes trust in the new tool before anyone has had a chance to learn it.
There is an entire category of organization for which Git migration is not just difficult but architecturally wrong, and they are some of the largest software teams in the world.
Game studios work with assets that Git was never designed to handle. A single texture file can be hundreds of megabytes. A three dimensional model can be gigabytes. An audio library for a major title can be tens of gigabytes. These are binary files, opaque blobs that Git cannot diff, cannot merge, and cannot compress efficiently. Every version of every binary file sits in Git's object database as what amounts to a complete copy, because delta compression does almost nothing for opaque binary data. A repository that contains a hundred revisions of a one gigabyte texture file is not a one gigabyte repository. It is a hundred gigabyte repository.
This is why most game studios use Perforce. Perforce is a centralized version control system that handles large binary files natively. It supports file locking, so two artists cannot edit the same texture simultaneously and create an unmergeable conflict. It integrates tightly with Unreal Engine, the game engine used by most of the industry. It is expensive and proprietary and its interface is showing its age, but it works for the specific problem game studios have.
Git Large File Storage, usually called Git LFS, was supposed to solve this. Developed by GitHub and Atlassian, LFS replaces large files in your repository with small pointer files. The actual content is stored on a separate server. When you clone the repository, you get the pointers, and LFS downloads the actual files on demand.
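A pointer file is small enough to quote in full. This is the documented pointer format; the hash and size here are made up for illustration:

```
version https://git-lfs.github.com/spec/v1
oid sha256:9f86d081884c7d659a2feaa0c55ad015a3bf4f1b2b0b822cd15d6c15b0f00a08
size 209715200
```

Those three lines are the entire file Git versions. The two hundred megabyte texture behind them lives on the LFS server, addressed by that hash.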
It solves the repository size problem. It does not solve much else. LFS has to be installed separately by every developer and configured for every repository. It is entirely command-line driven, which means artists and designers, the people who produce most of the binary assets, need to learn terminal commands or rely on someone else to manage their files. It has storage limits that get expensive quickly. GitHub's free tier gives you ten gigabytes of storage and ten gigabytes of bandwidth per month, which a single day of asset iteration can exceed. File locking exists, but only as an opt-in afterthought: files have to be marked lockable, and artists have to remember to lock them, so in practice two artists can still edit the same texture, and when they try to push, one of them loses their work. And the files still cannot be merged, so the fundamental collaboration problem remains.
The result is a mess. Some game studios use Git with LFS for code and Perforce for art assets. Some use Perforce for everything. Some use Unity Version Control, which used to be called Plastic SCM. A few use Subversion, which handles large files better than most people realize. And a growing number of smaller indie studios try Git because it is free and everywhere, run into the binary problem, struggle with LFS, and either live with the pain or quietly switch to something else.
Git won version control. But it won it for text files. For the industries that work primarily with binary assets, like film production, game development, and hardware design, the migration to Git is less a triumph than an awkward compromise.
Every migration guide talks about preserving history. Very few talk about what history actually means across different systems, because it does not mean the same thing.
In Subversion, a revision is a sequential number. Revision one thousand forty-two means the one thousand forty-second change to the repository, globally, across all files and all branches. You can say "this bug was introduced in revision eight hundred and the fix went in at revision nine twelve" and everyone knows exactly where in the timeline those events fall. Revision numbers are a shared clock.
Git does not have revision numbers. It has commit hashes, forty-character strings of hexadecimal that are unique but meaningless to humans. You cannot look at a commit hash and know whether it came before or after another commit. You cannot tell a colleague "check out revision ten twenty-four" because that concept does not exist. The history is a graph, not a list. Two commits can happen simultaneously on different branches with no ordering between them. This is technically correct for a distributed system and deeply disorienting for a team that has been thinking in sequential revisions for a decade.
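The closest a Git team gets to a shared clock is counting commits along one branch, which only behaves like a revision number while history stays linear. A throwaway example:

```shell
# Sketch: hashes versus counting. Everything here is a disposable repo.
set -e
cd "$(mktemp -d)"
git init -q demo && cd demo

for i in 1 2 3; do
  echo "change $i" > file.txt
  git add file.txt
  git -c user.name=demo -c user.email=demo@example.com commit -qm "change $i"
done

git rev-parse HEAD         # a forty-character hash: unique, unordered, unmemorable
git rev-list --count HEAD  # 3: a revision-number stand-in, valid only on a linear branch
```

The moment two branches diverge, that count stops meaning anything globally, which is exactly the disorientation the Subversion team feels.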
Branch semantics differ too. In Subversion, a branch is a directory. You can see all the branches by listing the branches directory. You can control who has write access to each branch, because directory-level permissions are a built-in feature. In Git, a branch is a pointer to a commit. It exists in your local repository and optionally on a remote. There is no directory to list. There are no directory-level permissions. Access control happens at the repository level or not at all, which is why large organizations that need fine-grained permissions end up splitting their code across multiple repositories.
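"A branch is a pointer to a commit" is literal, not a metaphor. In Git's default ref storage, a branch is one small file containing one hash; the file names below are invented:

```shell
# Sketch: a branch is a file containing a commit hash. Assumes the default
# "files" ref backend; the newer reftable backend stores refs differently.
set -e
cd "$(mktemp -d)"
git init -q demo && cd demo

echo hello > a.txt
git add a.txt
git -c user.name=demo -c user.email=demo@example.com commit -qm "initial"

git branch feature              # "creating a branch" writes one tiny file
cat .git/refs/heads/feature     # the hash of the commit it points at
git rev-parse HEAD              # the same hash
```

That is why branching is cheap in Git, and also why there is no directory of branches to list and no directory to attach permissions to.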
Then there is the workflow. Subversion has a single workflow. You update from the server, make your changes, and commit to the server. The server is the source of truth. If two people edit the same file, the second person to commit gets a conflict and resolves it before their change goes in. The conflict happens at commit time, on one machine, with a clear resolution path.
Git has a dozen workflows, and the one your team picks will shape how they think about collaboration. Do you merge or rebase? Do you use feature branches or commit directly to main? Do you squash commits before merging? Each choice has consequences, and teams argue about them. A team migrating from Subversion does not just need to learn new commands. They need to adopt a workflow that did not exist in their previous system, and they need everyone to agree on it. This is where many small migrations stall. The tools work, the history converts, and then the team spends three months fighting about whether to use merge commits.
The hardest part of any migration is not the technology. It is the person who does not want to migrate.
Every team has one. The senior developer who has been writing software for twenty years and has used Subversion for most of them. They know every svn command. They know the revision numbers of important changes. They have scripts that depend on the way Subversion works. They are not resistant to change because they are stubborn or old-fashioned. They are resistant because they have invested years in building proficiency with a tool, and the migration asks them to become a beginner again. In an industry that measures people by their expertise, being a beginner is uncomfortable.
The manager who mandates the migration without understanding the cost is the other recurring character. They read that Microsoft and Google and Facebook use Git, which is half true, and they announce the switch in a quarterly planning meeting. They allocate two weeks for the transition. They do not budget for training. They do not account for the productivity dip. They do not realize that "everyone uses Git" means "everyone uses Git differently" and that their team will need to choose and standardize on a workflow that nobody has experience with yet.
The worst outcome is not a failed migration. It is a half-finished one. The repository is converted but nobody has updated the build scripts. The history is there but nobody knows how to search it with Git's commands instead of Subversion's. Half the team has switched and the other half is still using git-svn as a bridge, creating a translation layer that adds confusion and merge headaches. The organization is paying the cost of two systems while getting the benefits of neither.
And sometimes, quietly, the team goes back. Not to Subversion usually, but to the workflow they had before. They use Git the way they used Subversion: one branch, linear commits, push and pull from a central server. They do not use feature branches. They do not use pull requests. They do not use any of the things that make Git worth the migration cost. They have Git installed but they have not actually migrated. They have just changed which command they type before they start arguing about merge conflicts.
There is a new pitch making the rounds in two thousand twenty-six. AI-assisted migration. Tools that can analyze your Subversion workflow, map it to an equivalent Git workflow, generate the migration scripts, and even rewrite your CI pipelines. The promise is that AI reduces the migration tax to nearly zero.
Some of this is real. AI can certainly generate git-svn commands. It can write hook scripts. It can translate Subversion path-based permissions into Git's closest equivalent, which involves splitting repositories. It can explain to the confused senior developer what git rebase does, in terms they understand, at two in the morning when the migration has gone sideways and nobody else is awake.
But the deeper question is whether AI changes the calculus of migration itself. If an AI assistant can insulate your team from Git's complexity, translating their intentions into commands, resolving their merge conflicts, managing their branches, do they need to understand Git at all? And if they do not understand it, are they really using Git, or are they using a wrapper that happens to store data in Git format?
This circles back to the question we raised two episodes ago. AI as translator between a complex tool and the people trying to use it. For migration, the question is sharper. The whole point of migrating to Git was supposed to be gaining access to Git's power: cheap branching, distributed collaboration, powerful merging. If the team never learns those things because an AI handles them, the migration was a costume change. New tool, same workflow. The cost was real and the benefit was a checkbox on a compliance form.
There is one more migration story worth telling, and it is the quietest one. In April two thousand nineteen, the Apache Software Foundation completed its migration to GitHub. Apache had been one of the last major open source organizations still running its own Subversion infrastructure. Hundreds of projects, decades of history, thousands of contributors. The migration was not driven by Git's technical superiority. It was driven by the fact that contributors expected Git and GitHub. Pull requests had become the universal language of open source contribution. A project hosted on Subversion was a project that was harder to contribute to, harder to discover, harder to integrate with the tools everyone else was using.
This is the force that drives most migrations. Not technical superiority. Social gravity. Git is where the developers are. GitHub is where the pull requests are. A team that stays on Subversion is not wrong, but they are increasingly alone. Their job postings say "experience with Git" because candidates expect it. Their new hires arrive already knowing Git, or at least knowing the three-command incantation. The cost of not migrating is measured in recruitment friction, in contributor attrition, in the slow drift toward irrelevance.
Season one of this show told the story of how Git was built. How Linus wrote it in two weeks. How Junio maintained it with quiet diligence. How GitHub turned it into the social network for code. And that story made it sound like adoption was about technical merit, about the right tool winning because it was the right tool.
Season two is telling a different story. Adoption is also about gravity. About network effects. About the cost of being different in an industry that standardized whether you were ready or not. The great migration is not a single event. It is an ongoing process, happening right now, in thousands of organizations. Some of them are doing it well. Some of them are doing it badly. Some of them are building virtual filesystems because the tool that won was not designed for their scale. And some of them are staring at a Subversion server, knowing they need to switch, not because Git is better for what they do, but because Git is what everyone else uses.
That is the tax. And everyone pays it, one way or another. That was episode twenty-five.
Git svn clone is the bridge between two worlds. Point it at a Subversion repository and it will rebuild the entire history as Git commits, one revision at a time. The branches come along, the tags come along, and when it finishes you have a Git repository that remembers where it came from. You can keep working in Git and push your changes back to the Subversion server with git svn dcommit, which is how teams migrate incrementally, one developer at a time. What it will not tell you is that the initial clone can take days for a large repository. That merge history rarely survives the translation. That file renames tracked explicitly in Subversion become heuristic guesses in Git. The bridge works. It is also the place where you first notice that these two systems do not think the same way.