Git Good
To Git or Not: The Mismatch
S2 E32 · 19m · Apr 05, 2026
A game studio stored 50 gigabytes of textures and 3D models in Git. Cloning took a full workday, git status took minutes, and then the LFS bandwidth bill arrived from GitHub.

The Fifty Gigabyte Mistake

This is episode thirty-two of Git Good. In episode twenty-five, we watched organizations climb the wall to migrate to Git. In episode thirty, we celebrated the places Git escaped to, the novelists branching their endings, the scientists versioning their lab notebooks, the lawyers tracking legislation. Those were the success stories. This episode is about the failures.

Sometime around two thousand twenty, a mid-sized game studio decided to put their entire asset library into a Git repository. They had heard the pitch. Git is the industry standard. GitHub is where collaboration happens. Every engineer they hired already knew the commands. So they set up a repository, configured Git Large File Storage, and started pushing textures, three dimensional models, audio files, and animations. The repository hit fifty gigabytes within a month. Cloning it took the better part of a workday. Running git status, the command that simply checks what has changed, started taking minutes instead of milliseconds. The continuous integration server, which pulled the repository fresh for every build, burned through the LFS bandwidth quota in the first week. The bill from GitHub arrived. It was not subtle.
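By the book, that setup amounts to a few lines in a .gitattributes file. Running git lfs track writes entries like these; the extensions below stand in for whatever the studio actually shipped:

```
*.png filter=lfs diff=lfs merge=lfs -text
*.psd filter=lfs diff=lfs merge=lfs -text
*.fbx filter=lfs diff=lfs merge=lfs -text
*.wav filter=lfs diff=lfs merge=lfs -text
```

Every file matching those patterns bypasses Git's object database and goes through LFS instead.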

They were not doing anything wrong. They were using Git exactly as GitHub recommended, with LFS configured, with the right file extensions tracked, with everything set up by the book. The problem was not configuration. The problem was that Git was designed, from its first line of code, for a specific kind of file. Small. Text-based. Line-oriented. The kind of file where you can look at two versions side by side and see exactly what changed on line forty-seven. Source code. Configuration files. Documentation written in plain text. Everything Linus Torvalds needed to track the Linux kernel. Everything a game studio does not produce.

The studio did what many do. They quietly moved their art assets to Perforce and kept Git for the code. Two version control systems, two workflows, two sets of permissions, two mental models. The engineers commit and push. The artists check out and check in. The two groups work on the same product but live in different collaboration universes. It is not elegant. But it works. And that awkward compromise tells you something important about where Git's victory actually ends.

The Kingdom of Text

Git's entire architecture assumes that the interesting thing about a file is the differences between its versions. Strictly speaking, every commit stores a snapshot of the whole file, but when Git packs its object database it keeps similar versions as deltas, compressed records of what changed. Line twelve was added. Lines forty through forty-five were removed. Line seventy was modified. This is brilliantly efficient for text. A one megabyte source file that gets ten small edits over the course of a day packs down to one full copy and ten tiny deltas, not ten one megabyte copies.
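A rough way to see the economics. This sketch uses Python's standard difflib rather than Git's actual packing machinery, but the shape of the result is the same:

```python
import difflib

# A thousand-line "source file" with a single edited line.
old = [f"line {i}\n" for i in range(1000)]
new = list(old)
new[46] = "line 46, now modified\n"

# A line-oriented delta: headers, a little context, one removal, one addition.
delta = list(difflib.unified_diff(old, new))
print(len(old), "lines in the file")
print(len(delta), "lines in the delta")  # a handful, not a thousand
```

One edit to a thousand-line file costs a dozen or so delta lines. That ratio is the entire economic case for line-oriented version control.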

Binary files do not work this way. A texture file, a Photoshop document, a three dimensional mesh, a trained machine learning model. These are opaque blobs to Git. There are no lines. There is no meaningful diff. Most of these formats are compressed internally, so change one pixel in a texture and the bytes are scrambled from that point on; Git sees a completely different file. It cannot tell you what changed. It cannot compress the difference efficiently. It stores what amounts to the entire file again. Do that a hundred times and your repository is a hundred copies of nearly the same file, each one taking close to the full space.

This is not a bug. It is a design decision. Linus built Git to track the Linux kernel, which at the time was several million lines of C code. Text files, all of them. The kernel does not contain textures. It does not contain neural network weights. It does not contain Jupyter notebooks full of embedded charts. Git is a tool that does one thing extraordinarily well, and the world keeps asking it to do other things.

The Game That Perforce Built

Over ninety percent of the top triple-A game studios use Perforce. Not because they love it. Perforce is expensive, its interface looks like it was designed in two thousand three, and setting it up requires a systems administrator who knows what they are doing. Studios use it because it solves the specific problems that game development actually has.

The first problem is scale. A modern triple-A game repository can be ten terabytes. Not gigabytes. Terabytes. Perforce handles this without breaking a sweat because it is centralized. Nobody clones the whole repository. Developers and artists connect to the server and download only the files they need. A character artist working on a specific creature pulls that creature's textures and meshes, not the entire game. A level designer pulls one map, not the whole world. The server tracks who has what, and it only sends the pieces each person actually uses.

The second problem is locking. When two programmers edit the same source code file, Git can usually merge their changes automatically. When two artists edit the same texture, there is no automatic merge. A texture is not two hundred lines that can be compared and combined. It is a single artifact, and two different versions of it are two different creative decisions. One wins. The other is lost. Perforce handles this with file locking. An artist checks out a texture, and the server locks it. Nobody else can edit that file until the first artist checks it back in. It sounds primitive compared to Git's optimistic merging, but for binary files, locking is not primitive. It is correct.

The third problem is the people. Game studios employ artists, designers, animators, and audio engineers alongside programmers. Most of these people have never opened a terminal. They do not know what a commit hash is. They do not want to learn. Perforce gives them a visual client with buttons. Check out. Check in. Get latest. It integrates with Unreal Engine, so artists can manage their files without leaving the editor they are already working in. Epic Games, the company behind Unreal Engine, uses Perforce internally. Their tools, Unreal GameSync, Robomerge, and Horde, all assume Perforce. The ecosystem is built around it.

I do not care about version control ideology. I care about whether I can open my file, do my work, and save it without losing anything. Perforce lets me do that. Git makes me feel like I need a computer science degree.

That sentiment, pulled from a game development forum, captures something the Git community does not always want to hear. Technical elegance matters to engineers. It does not matter to artists. And in game development, the artists produce most of the data.

The Notebook Problem

If game development is the most visible place where Git fails, data science is the most ironic. Here is a community that lives inside the software engineering world, uses Python and R, writes code every day, and still cannot make Git work for their most important artifact: the Jupyter notebook.

A Jupyter notebook looks, on screen, like an elegant document. Code cells, output cells, markdown explanations, embedded charts and images, all woven together in a single interactive file. It is the lab notebook of modern data science. Researchers use it to explore data, test hypotheses, and present results. It is designed for exactly the kind of iterative, experimental work that version control should be perfect for.

Open that same notebook in a text editor and the elegance vanishes. A Jupyter notebook is a JSON file. Not a clean JSON file with human-readable fields. A sprawling, deeply nested JSON document where a single code cell might contain its source code, its execution count, its output as base sixty-four encoded image data, and metadata about what kernel was running. Change one line of code and re-run the cell, and the diff shows your one-line code change plus hundreds of lines of changed output data, execution counts, and metadata. The meaningful change is buried in noise.
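Here is roughly what a single code cell looks like on disk, trimmed down from the real nbformat schema. The one-line plot call and the truncated base sixty-four payload are illustrative:

```python
import json

cell = {
    "cell_type": "code",
    "execution_count": 7,            # changes every time you re-run the cell
    "metadata": {},
    "source": ["plot(results)\n"],   # the one line you actually wrote
    "outputs": [{
        "output_type": "execute_result",
        "execution_count": 7,
        "metadata": {},
        "data": {
            # in a real notebook: hundreds of lines of base64 image data
            "image/png": "iVBORw0KGgoAAAANSUhEUg...",
        },
    }],
}
print(json.dumps(cell, indent=1))
```

Re-run the cell and the execution counts and the image payload all change, so a one-line edit to source produces a diff dominated by everything else.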

Merge conflicts are worse. When two data scientists edit the same notebook on different branches and try to merge, Git sees two incompatible versions of a JSON file. The conflict markers Git inserts, those angle brackets and equal signs it uses to show where the two versions disagree, break the JSON structure. The notebook becomes invalid. You cannot even open it in Jupyter to see what the conflict is. You have to fix it in a text editor, staring at raw JSON, trying to figure out which curly brace belongs to which cell.

The community has built a small ecosystem of workarounds. A tool called nbdime understands notebook structure and can produce meaningful diffs that show you which cells changed and what the output differences are. It can also handle merges, ensuring that even when there is a conflict, the result is at least a valid notebook you can open. A tool called nbstripout installs as a Git filter and automatically strips all output data from notebooks before they are committed, so the diffs only show code changes. A service called ReviewNB adds visual notebook diffing to GitHub pull requests, so code reviewers can see rendered notebooks instead of raw JSON.
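What nbstripout does is conceptually tiny. A minimal sketch of the same idea, clearing out everything that changes on re-run before the notebook's JSON reaches Git:

```python
import json

def strip_outputs(notebook: dict) -> dict:
    """Blank out outputs and execution counts, keeping only the code."""
    for cell in notebook.get("cells", []):
        if cell.get("cell_type") == "code":
            cell["outputs"] = []
            cell["execution_count"] = None
    return notebook

# Usage: read, strip, write back before committing (filename is illustrative).
# with open("analysis.ipynb") as f:
#     nb = strip_outputs(json.load(f))
```

The real tool installs this as a Git clean filter so it runs automatically on every commit rather than relying on anyone remembering to do it.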

Every new person who joins our team, we spend the first hour explaining the notebook Git setup. Install nbstripout. Configure the diff driver. Do not commit outputs. Never merge a notebook manually. It is an hour of workarounds for a problem that should not exist.

The irony is that stripping outputs defeats the purpose of the notebook. The whole point of a Jupyter notebook is that the code and its results live together. Strip the outputs and you have a script with extra formatting. Keep the outputs and your diffs are meaningless. The notebook format and Git's diff model are fundamentally incompatible, and every workaround sacrifices something.

The Designers Who Left

In two thousand seventeen, a company called Abstract launched with a bold promise: Git for designers. They had raised millions of dollars, eventually reaching thirty million in a single funding round. The pitch was compelling. Designers deserved the same version control that developers had. Branches for exploring alternatives. Commits for saving progress. Merges for combining work. Pull requests for design review. Everything Git did for code, Abstract would do for design files.

Git for designers is here. Finally, the same workflow that made software engineering collaborative comes to the design world.

The launch press was enthusiastic. Abstract worked with Sketch, the dominant design tool at the time. You could create a branch, explore a different direction for a user interface, commit your progress, and merge it back when you were happy. The vocabulary was borrowed directly from Git. The mental model was branches and merges.

It never caught on. The problem was not execution. Abstract worked reasonably well for what it did. The problem was that designers do not think in branches and merges. A designer exploring two directions for a layout does not want to create a branch, switch between branches, and resolve merge conflicts when the two directions touch the same component. They want to have two artboards side by side and pick the one that looks better. The metaphor was wrong. Version control for code solves a coordination problem: multiple people editing the same text files need a system to prevent their changes from colliding. Design is a different kind of collaboration. It is visual. It is spatial. It happens in real time, not in asynchronous commits.

Figma understood this. Instead of bringing Git's model to design, Figma brought Google Docs's model. Real-time multiplayer editing. No branches. No merges. No commits. Multiple designers working on the same file at the same time, seeing each other's cursors, watching changes appear live. Version history exists, but it is automatic and invisible, a timeline you can scrub through, not a graph of commits you have to manage. There is no learning curve because there is no new workflow to learn. You just open the file and start designing.

Abstract announced its sunset in late two thousand twenty-five, shutting down on January thirty-first, two thousand twenty-six. Sixty-two million dollars in funding. A decade of effort. Gone. Not because the product was bad, but because the metaphor was wrong. Git's model is not universal. It is specific to a kind of work, and that kind of work is text-based, asynchronous, and merge-friendly. Design is none of those things.

The Data Scientist's Dilemma

A data scientist named Dmitry Petrov spent years at Microsoft watching a gap widen. On one side, the software engineers with their Git repositories, their pull requests, their continuous integration pipelines, their disciplined workflows. On the other side, the data scientists with their folders full of files named model final, model final two, model final real, model actually final. The engineers had version control. The data scientists had chaos.

There is essentially a wall between the two worlds. Data science and software engineering do not work together. I wanted to build tools to remove that wall.

In two thousand eighteen, Petrov and a co-founder named Ivan Shcheklein started a company called Iterative and released a tool called DVC, Data Version Control. The idea was clever. DVC looks and feels like Git. You run dvc add to track a large file, dvc push to upload it, dvc pull to download it. The commands mirror Git's vocabulary deliberately. But underneath, DVC does not use Git's storage at all. The actual data, the massive training datasets, the multi-gigabyte model files, lives in cloud storage. Amazon S3, Google Cloud Storage, Azure Blob Storage. What goes into the Git repository is a tiny pointer file, a few lines of metadata that say "version three of this dataset lives at this address in the cloud."

It is a sensible architecture. Git tracks the code and the pointer files. DVC tracks the data. The two stay in sync through the pointer files. You can check out any branch in Git and run dvc checkout, and DVC will pull the version of the data that matches. Reproducibility, the thing scientists care about most, is restored. You can go back to any experiment and get the exact code and the exact data that produced those results.
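The pointer file that ends up in Git is small enough to quote whole. A .dvc file is a few lines of YAML; the hash and size below are made up for illustration:

```yaml
outs:
- md5: 1a2b3c4d5e6f7a8b9c0d1e2f3a4b5c6d
  size: 157286400
  path: train.csv
```

Git versions these few lines. The 150 megabyte file they describe never touches the repository.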

But DVC also reveals the depth of the mismatch. A machine learning training dataset can be hundreds of gigabytes. A trained model file can be several gigabytes. These are not files you casually push and pull over the internet. They are not files that two people edit simultaneously. They are not files where "line forty-seven changed" is a meaningful statement. The entire Git workflow of branches and merges and pull requests and diffs does not map onto data the way it maps onto code. DVC borrows Git's vocabulary, but the underlying reality is closer to a file synchronization service with version labels. Calling it "Git for data" is marketing. The actual mechanism is cloud storage with metadata in Git.

The AI Irony

There is a particular irony in all of this for the two thousand twenty-six moment we are living through. The machine learning workflow produces exactly the kind of artifacts Git handles worst. Training datasets that are tens or hundreds of gigabytes. Model weight files that are several gigabytes each. Checkpoint files generated every few hours during training runs that last days. Experiment logs and output files that change with every run. Non-deterministic outputs where running the same code twice produces different results, making Git's notion of a clean diff fundamentally meaningless.

The tools that are supposed to bring order to this, DVC, MLflow, Weights and Biases, all exist because Git cannot do it alone. They sit alongside Git, using it for the parts it handles well, the code, the configuration, the experiment scripts, while routing everything else through specialized storage. The result is a patchwork. The code is in Git. The data is in S3. The model is in a model registry. The experiment tracking is in yet another system. Four tools to do what Git does alone for a simple web application.

And the problem is accelerating. As models get larger, as training datasets expand, as the iteration cycle speeds up with each generation, the gap between what Git can handle and what AI workflows actually produce keeps widening. The tool that enabled the software ecosystem that artificial intelligence grew out of is architecturally wrong for AI's own artifacts. Git made possible the open source culture that produced TensorFlow and PyTorch and Hugging Face. And now those projects strain against Git's limits every day.

The Dogma Question

Season one of this show told you how Git conquered version control. Season two has been showing you the consequences. And this episode arrives at a question the Git community does not ask often enough: is "use Git for everything" wisdom or dogma?

The answer, like most honest answers, is unsatisfying. It depends.

Git for source code is wisdom. Nothing else matches it. The distributed model, the cheap branching, the powerful merging, the vast ecosystem of tools and platforms built around it. For what Git was designed to do, it is the best tool ever created for that purpose.

Git for everything else is a spectrum. For configuration and plain text, it is still the best option. For small binaries, it is tolerable. For large binaries, notebooks, design files, and data science artifacts, it ranges from awkward to actively harmful. And the impulse to force everything into Git anyway, because Git is what we know, because GitHub is where the collaboration happens, because "everyone uses Git" and not using it feels like falling behind, that impulse is not wisdom. That is gravity.

In episode twenty-five, we talked about migration as social gravity, teams moving to Git not because it was technically better for their work but because it was what everyone else used. This is the other side of that gravity. Once you are in Git's orbit, everything gets pulled toward it. Your data. Your models. Your design files. Your notebooks. Even the things Git was never designed to hold. And the pull is strong enough that people build elaborate workarounds, LFS and DVC and nbdime and Figma bridges, rather than admit that some things do not belong in Git.

The healthiest organizations are the ones that draw the line clearly. Git for code. The right tool for everything else. They do not treat "not in Git" as a failure. They treat it as a design decision, the same way Linus made a design decision when he built Git for text files and did not apologize for it. That was episode thirty-two.

The command git lfs track tells Git to stop trying. Point it at a file pattern and Git stops storing those files in its object database. Instead, it writes a tiny pointer file, a few lines of text that say where the real file lives on a separate LFS server. The actual content gets uploaded to that server on push and downloaded on pull. It solves the repository size problem and nothing else. You still cannot diff a binary file meaningfully. You still cannot merge two versions of a texture. You still pay for LFS bandwidth, and that gets expensive fast, one gigabyte free on GitHub before the meter starts running. Think of it as Git admitting, gracefully, that this particular file is not its problem.
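The pointer file itself is three lines of text. A minimal sketch that parses the published LFS pointer format, a spec version line, an object id, and a size; the oid below is illustrative:

```python
# An LFS pointer as it appears inside the Git repository.
POINTER = """\
version https://git-lfs.github.com/spec/v1
oid sha256:0123456789abcdef0123456789abcdef0123456789abcdef0123456789abcdef
size 52428800
"""

def parse_pointer(text: str) -> dict:
    """Split 'key value' lines into a dict; LFS pointers have exactly this shape."""
    fields = dict(line.split(" ", 1) for line in text.strip().splitlines())
    fields["size"] = int(fields["size"])
    return fields

ptr = parse_pointer(POINTER)
print(ptr["oid"], ptr["size"])  # enough to fetch the real file from the LFS server
```

Fifty megabytes of texture reduced to a hash and a byte count: that is the whole trick, and the whole limitation.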