The Git Disaster

The Cold Sweat

This is episode six of Git Good, Season Two. In the last two episodes we watched teams navigate code review and migration. Now we are going to talk about the thing nobody puts on their resume. The disasters.

Every developer has a story. The command they should not have run. The flag they added without thinking. The moment their terminal went quiet and their stomach did not. These stories get traded like war stories at conference bars and in Slack channels, half confession and half therapy. They are funny in retrospect. They are never funny at the time.

What makes Git disasters different from other software catastrophes is that Git almost always has a way back. The tool that lets you destroy things in a single command also keeps a secret record of everything you did. The question is whether you know that record exists, and whether you find it before the panic takes over.

We are going to look at disasters across four scales. A single developer alone with a terminal. A team sharing a repository. An organization that never learned how Git works. And infrastructure, where a misconfigured script can reach across the internet and rewrite a hundred and fifty repositories before anyone notices.

Each scale has its own flavor of catastrophe. Each scale has its own version of the safety net. And all of them lead back to the same two words.

One Developer, One Terminal

Here is a story that plays out somewhere in the world every single day. A developer, let us call her a junior developer because the story always starts that way, is working on a feature branch. She has been at it for a week and a half. Fourteen commits over nine working days. The feature is not glamorous, a set of API endpoints that connect a billing system to a notification service, but it is the largest single piece of work she has owned since joining the company. The branch is not pushed to the remote yet because she wanted to clean up the commit history first. She has been rebasing, squashing, rewriting the sloppy "work in progress" messages into something a reviewer would respect. The branch is almost ready.

Then she runs a force push. On the wrong branch.

Maybe she was following a tutorial about rebasing that said "now force push to update the remote." Maybe she had the main branch checked out when she thought she was on her feature branch. Maybe her fingers typed the command before her brain caught up. The specifics do not matter. What matters is that her carefully curated commits are no longer where she expects them to be. The branch pointer has moved. Her work has vanished from git log like it was never there.

The panic is immediate and physical. Cold sweat. A tight feeling in the chest. She runs git log and her commits are not there. She runs git branch and the branches exist but point to the wrong places. She tries to pull and Git tells her the histories have diverged. She is in that specific hell where every command she tries seems to make things worse, and every search result she finds assumes a level of understanding she does not have.

I mass deleted feature branches that I did not mean to delete. I mass deleted things I was not supposed to delete. I mass deleted things I absolutely was supposed to keep. I mass deleted things I had been working on for over a week.

And then, either through a colleague or a frantic search, she finds the reflog.

The reflog is Git's flight recorder. Every time a branch pointer moves, every commit, every reset, every checkout, Git writes a line. When she runs git reflog, there they are. Every commit she made over the past week and a half, listed with timestamps, each one identified by its hash. The branch pointer is gone but the commits themselves never left. They have been sitting in the object store the whole time, orphaned but intact, waiting for someone to point at them again.

The reflog just lists the changes right there. There they are. The branch is back. Everything is restored. It literally saved my career.

She creates a new branch pointing to the hash she needs. One command. Seconds. Nine days of work, back from the dead.
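For anyone following along at a keyboard, here is a minimal sketch of that rescue, meant to be saved and run as a script. It works in a throwaway repository so nothing real is at risk. The branch names and commit messages are invented for the demonstration, and it simulates the loss by deleting the branch, which is only one of the ways the story above can go.

```shell
set -eu

# Work in a throwaway repository so nothing real is at risk.
scratch=$(mktemp -d)
cd "$scratch"
git init -q demo
cd demo
git symbolic-ref HEAD refs/heads/main
git config user.name "Demo User"
git config user.email "demo@example.com"

# One commit on main, then two on a feature branch that is never pushed.
git commit -q --allow-empty -m "initial commit"
git checkout -q -b billing-notifications
git commit -q --allow-empty -m "add billing endpoints"
git commit -q --allow-empty -m "connect notification service"
lost_tip=$(git rev-parse HEAD)   # kept only so the demo can check itself

# The accident: switch away and delete the branch.
git checkout -q main
git branch -q -D billing-notifications

# The flight recorder. HEAD's reflog lists every position HEAD has held.
git --no-pager reflog

# HEAD@{0} is the checkout back to main; HEAD@{1} is the last commit made
# on the deleted branch. In a real incident, read the hash off the list.
found=$(git rev-parse 'HEAD@{1}')

# Point a new branch at that commit, and the history is back.
git branch recovered-billing "$found"
git --no-pager log --oneline recovered-billing
```

In a real incident you would skip the setup entirely: run git reflog, find the entry from just before things went wrong, and hand its hash to git branch.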

Here is the thing about this story. It is not one story. It is ten thousand stories. The junior developer in this version did not know the reflog existed before this moment. Most developers do not. A survey of developer forums turns up the same confession over and over. Someone panics. Someone else, a colleague, a Stack Overflow answer, a blog post found at eleven thirty at night, says "try git reflog." And a career is saved.

Developers who survive this experience describe a shift in how they use every other Git command afterward. The rebase is less frightening when you know the previous state is recorded. The hard reset is less permanent when you know the reflog kept the receipt. The knowledge that the safety net exists changes the relationship with the tool entirely.

The tool that caused the disaster and the tool that fixes it are the same tool. The only difference is knowledge.

The Sprint to the Desk

Scale up. Now the disaster involves other people.

A developer, working late, needs to clean up the main branch. There are stale references, old merge commits, a history that has gotten messy. They know enough Git to be dangerous, which is exactly the amount of knowledge that causes the worst problems. They run a reset on the main branch. They force push.

The main branch of the shared repository now points to a commit from three weeks ago. Every commit made by every team member in those three weeks has been disconnected. The commits still exist in the remote repository's object store, but nothing points to them. To anyone who makes a fresh clone tomorrow morning, those three weeks of work will appear to have never happened.

The developer who did this realizes what they have done within seconds. The options race through their mind. They cannot just force push again because they do not have the correct commit hash. The remote reflog might have it, but they do not have access to the server. There is one place where the correct state of the main branch still exists: their colleagues' local repositories.

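The clock is ticking. Every colleague who fetches or pulls before the branch is restored will see the broken state, and one more clean, untouched copy of the correct history will be harder to trust. Strictly speaking, a fetch does not erase anyone's local commits, and each colleague's own reflog would still remember the old tip, but nobody wants to bet three weeks of work on that under pressure. The developer gets up from their desk and sprints across the office. They find a colleague who has not pulled yet.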

He ran up to my desk, out of breath, and said close Sourcetree right now and run this command. I will explain later.

The colleague's local repository still has the correct main branch. They push it to the remote, overwriting the broken state with the real one. Crisis resolved. Three weeks of work restored from a single laptop that happened to be behind on pulls.
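The shape of that rescue can be sketched in a scratch directory, saved and run as a script. Everything here is an assumption made for illustration: a local bare repository plays the shared server, and the clone names late-night and colleague are invented. The story does not say which command the colleague actually ran.

```shell
set -eu

# Identity for the demo commits; the names are placeholders.
export GIT_AUTHOR_NAME="Demo User" GIT_AUTHOR_EMAIL="demo@example.com"
export GIT_COMMITTER_NAME="Demo User" GIT_COMMITTER_EMAIL="demo@example.com"

scratch=$(mktemp -d)
cd "$scratch"

# Build a small history, then publish it to a bare repository that plays
# the role of the shared server.
git init -q seed
cd seed
git symbolic-ref HEAD refs/heads/main
git commit -q --allow-empty -m "three weeks ago"
git commit -q --allow-empty -m "week one"
git commit -q --allow-empty -m "week two"
git commit -q --allow-empty -m "week three"
good_tip=$(git rev-parse HEAD)
cd ..
git clone -q --bare seed server.git

# Two teammates clone while everything is still healthy.
git clone -q server.git late-night
git clone -q server.git colleague

# The mistake: reset main to the old commit and force push it.
cd late-night
git reset -q --hard HEAD~3
git push -q --force origin main
cd ..

# The rescue: the colleague has not fetched, so their main is still correct.
# The server was only rewound to an ancestor, so an ordinary push
# fast-forwards it. If unrelated commits had been pushed on top, the
# rescuer would need --force as well.
cd colleague
git push -q origin main
cd ..

git --no-pager --git-dir=server.git log --oneline main
```

One detail the sketch makes visible: when the remote has merely been rewound, the rescue push is an ordinary fast-forward. Force is what caused the problem, and it is not always needed to fix it.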

This story has a detail that makes it worth telling. The safety net was not a recovery command. It was a person, or more precisely, a person's laptop. The reflog on the remote server could have helped, but the developer did not have access. The object store still had the commits, but finding the right hash among thousands of orphaned objects is not trivial under pressure. What actually saved the project was the fact that Git is distributed. Every clone is a full copy. As long as one clone somewhere in the world has the correct state, recovery is possible.

The disaster happened because one developer had too much access and too little caution. The recovery worked because another developer had not synced yet. In a centralized system like Subversion, the same mistake would have required a database backup to undo. The server is the single source of truth, and when the single source of truth is wrong, there is nowhere else to look.

In Git, every laptop is a backup. The same distributed architecture that makes Git confusing to learn is what makes it resilient to catastrophe. The twenty clones scattered across twenty laptops are not just copies. They are redundant records of the truth, and any one of them can restore what the others lost.

That is either reassuring or terrifying, depending on how much you trust your colleagues not to run git pull before you reach their desk.

The Organization That Never Learned

Now scale up again. Twelve teams. Hundreds of developers. A company that adopted Git because everyone uses Git, without ever investing in understanding what that means.

This is the Franken-Workflow, and if you have worked at a large enough company, you have seen some version of it. The company migrated from Subversion three years ago. They hired a consultant who set up a branching strategy. The consultant left. The branching strategy evolved, mutated, and eventually became something no single person could explain.

Here is what it looks like in practice. Every developer, for every ticket, clones the entire repository from scratch. Not a new branch. A fresh clone. They treat Git like a download, not a workspace. They make their changes, push, and then delete the local clone. The next ticket, they clone again. The accumulated local state, the stashes, the local branches, the reflog, is thrown away every single time.

The merge conflicts are constant. Because nobody works on local branches for more than a few hours, and because the branching strategy involves merging through three intermediate branches before reaching production, every developer encounters conflicts daily. The tech leads have developed a routine. When a conflict appears, the tech lead resolves it. Not by consulting the developers who wrote the conflicting code. Not by understanding what either change was trying to accomplish. By looking at the diff and making a judgment call about which version looks more correct.

Sometimes they get it right. Sometimes they do not. Sometimes a feature that took a developer two days to build is silently discarded during a conflict resolution by a tech lead who did not know it existed.

And then there is Jay.

Jay has a PhD. Jay is brilliant in his domain. Jay does not understand merge conflicts and does not intend to learn. When Jay encounters a merge conflict, he has a consistent strategy.

He just accepts all of his own changes and discards everything else. Every single time. He does not read the conflict markers. He does not look at what the other person wrote. He just keeps his version and moves on.

The team discovered this after a sprint where three developers' work vanished. They traced it back to Jay's conflict resolutions. The code was gone from the tip of the branch. It was gone from the result of the merge. The original commits were still buried in the history, but no reflog flagged the loss, because a reflog records where pointers moved, not what a conflict resolution threw away. The safety net that saves individual developers does not catch damage that arrives as a perfectly valid commit.

The Franken-Workflow is not a Git problem. It is an organization problem that Git makes visible. In Subversion, with its centralized server and its sequential revision numbers, the same dysfunction would produce different symptoms. Developers would lock files instead of resolving conflicts. They would overwrite each other's changes on the server. The destruction would be the same, just less traceable.

Git did not cause the dysfunction. Git exposed it. Twelve teams with no shared understanding of their tools, no training budget, no workflow documentation, and no one whose job it is to notice when a PhD is silently deleting his colleagues' work. The version control system is the least of their problems.

There is a smaller story from the same world that deserves a mention because it illustrates a different kind of organizational disaster. A continuous integration pipeline, the automated system that builds and tests code whenever someone pushes a change. The first line of the build script deletes the .git directory. This is a common optimization. The CI system does not need the full history to build the code. Deleting .git saves disk space and speeds up the pipeline. The script works perfectly on the CI server, where the code is a disposable checkout that will be thrown away after the build finishes.

Then a developer runs the build script locally. On their actual working repository. The .git directory disappears. Every commit, every branch, every stash, the entire reflog, gone. Not orphaned. Not disconnected. Deleted from disk. The working directory still has the current files, but the repository, the thing that makes those files a project with a history, no longer exists.

The reflog cannot help because the reflog was inside the .git directory. The object store cannot help because it was inside the .git directory. The only recovery is to clone the remote again, losing every unpushed commit, every local branch, and every stash. The safety net that Git builds so carefully, the one that survives force pushes and bad resets and deleted branches, does not survive the deletion of .git itself. That is the one thing Git cannot recover from, because that is where Git keeps everything it knows.
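One way a script like that could protect a laptop is to refuse the deletion unless it can tell it is running on CI. The sketch below is hypothetical: the function name is invented, and it assumes the CI system exports a variable named CI set to "true", which many do but not all. It is not the build script from the story.

```shell
set -eu

# Hypothetical helper for the top of a build script. It removes .git only
# when the CI variable is "true", which many CI systems export but not all.
strip_git_metadata() {
    workdir=$1
    if [ "${CI:-}" = "true" ]; then
        rm -rf "$workdir/.git"
        echo "CI run: removed $workdir/.git"
    else
        echo "Not a CI run: leaving $workdir/.git in place"
    fi
}

# Two throwaway directories stand in for a laptop checkout and a CI checkout.
laptop=$(mktemp -d)
ci_box=$(mktemp -d)
mkdir "$laptop/.git" "$ci_box/.git"

# Subshells keep the CI setting from leaking between the two runs.
( unset CI; strip_git_metadata "$laptop" )
( CI=true; strip_git_metadata "$ci_box" )
```

The design choice is to make the destructive path the one that needs proof. A developer who runs the script at their desk gets a message instead of an empty directory.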

One Hundred and Fifty Repositories

Now scale up one more time. November tenth, two thousand thirteen. The Jenkins project, one of the most widely used open source continuous integration tools in the world.

A Jenkins developer was setting up a Gerrit server, the code review system used by many open source projects. Gerrit has a replication plugin that synchronizes repositories. The developer pointed the plugin at a local directory containing one hundred and eighty-six Git repositories that had been cloned from the Jenkins GitHub organization about two months earlier.

The Gerrit replication plugin has a default that, in retrospect, is astonishing. It defaults to force push. Not a regular push that would fail if the remote had newer commits. A force push that overwrites whatever is on the other end without asking. The developer started Gerrit. The plugin found one hundred and eighty-six repositories. It began replicating.

The replication happened automatically, and because these local repositories were two months out of date, it rewound branch heads across more than a hundred and fifty repositories to point to older commits.

Two months of commits across one hundred and fifty repositories, gone. Not deleted. Not corrupted. Just disconnected. The branch pointers now aimed at commits from September instead of November. Every plugin, every tool, every component in the Jenkins ecosystem that had been updated in those two months appeared to revert to an older version.

Multiple developers noticed within hours. The recovery was a combination of local workspaces and GitHub's server-side reflogs. GitHub support provided data from their internal reflogs, the same mechanism that saves individual developers, but operating at the platform level. Developers who had up-to-date local clones pushed their branches back. Most popular plugins were restored within twenty-four hours.

The Jenkins project responded by building a continuous monitoring script that records every reference update across their entire GitHub organization. A flight recorder for the flight recorder. Because when you have a hundred and fifty repositories, finding the correct state after a disaster requires knowing what the correct state was, and that means logging everything, all the time, before anything goes wrong.
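The idea of a flight recorder for refs can be sketched in a few lines, saved and run as a script. This is not the Jenkins project's script. It is a minimal illustration under stated assumptions: a local bare repository stands in for the hosted remote, and the function name record_refs and the log file name are invented. It uses git ls-remote to write down where every branch points, then shows how that record turns a rewind into a lookup.

```shell
set -eu

export GIT_AUTHOR_NAME="Demo User" GIT_AUTHOR_EMAIL="demo@example.com"
export GIT_COMMITTER_NAME="Demo User" GIT_COMMITTER_EMAIL="demo@example.com"

scratch=$(mktemp -d)
cd "$scratch"
log="$scratch/ref-history.log"

# Hypothetical recorder: append "timestamp ref hash" for every branch.
record_refs() {
    stamp=$(date -u +%Y-%m-%dT%H:%M:%SZ)
    git ls-remote "$1" 'refs/heads/*' |
        while read -r hash ref; do
            printf '%s %s %s\n' "$stamp" "$ref" "$hash"
        done >> "$log"
}

# A local bare repository stands in for the hosted remote.
git init -q work
cd work
git symbolic-ref HEAD refs/heads/main
git commit -q --allow-empty -m "september"
git commit -q --allow-empty -m "november"
november_tip=$(git rev-parse HEAD)
cd ..
git clone -q --bare work remote.git

record_refs "$scratch/remote.git"   # routine snapshot, before any trouble

# The incident: a stale copy force pushes and rewinds main.
cd work
git reset -q --hard HEAD~1
git push -q --force "$scratch/remote.git" main
cd ..

record_refs "$scratch/remote.git"   # the next snapshot shows the rewind
cat "$log"

# Recovery: take the hash recorded before the incident and push it back.
# Any clone that still holds that commit can do this.
good=$(awk '$2 == "refs/heads/main" { print $3; exit }' "$log")
git -C work push -q --force "$scratch/remote.git" "$good:refs/heads/main"
```

The log only helps if it was being written before the incident, which is the point of running something like this continuously. Across an organization, the same loop would run once per repository.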

When the Safety Net Has Holes

There is one more story worth telling, because it shows what happens when the disaster exceeds what Git can recover.

January thirty-first, two thousand seventeen. GitLab, the company that hosts millions of Git repositories and pitches itself as the open source alternative to GitHub, experienced a cascading failure that started with a spam flag and ended with an engineer accidentally deleting the primary production database.

The sequence was almost comically unfortunate. An automated system mistakenly flagged a GitLab employee as a spammer. The subsequent cleanup caused increased database load. The primary database and its replica fell out of sync. An engineer attempted to fix the replication by clearing the replica's data directory and starting fresh.

He ran the deletion command on the wrong server. The primary database, not the replica, began disappearing.

He noticed his mistake within seconds. But the deletion process was fast. By the time he cancelled the command, three hundred gigabytes of live production data was already gone.

The recovery attempt revealed something worse. The backup systems had been silently failing for weeks. The database backups were incomplete. The replication was already broken. The automated backup verification, the process that should have caught this, did not exist. Before the incident, no single engineer at GitLab was responsible for validating that backups actually worked.

What saved the company from total catastrophe was luck. Six hours before the deletion, an engineer had taken a manual snapshot of the primary database for an unrelated testing purpose. That snapshot, taken on a whim for a routine test, became the foundation of the recovery. Without it, GitLab would have lost not six hours of data but twenty-four. The difference between a bad day and a company-ending event came down to one engineer who happened to run a backup for reasons that had nothing to do with disaster preparedness.

The recovery took eighteen hours. Eighteen hours of copying data across slow network connections, verifying integrity, rebuilding indexes. When it was over, the damage was real but survivable. Roughly five thousand projects, five thousand comments, and seven hundred new user accounts created in that six-hour window were permanently lost.

GitLab did something remarkable in response. They livestreamed the recovery process on YouTube. Engineers working in real time, visible to the world, trying to restore the service while thousands of people watched. They published a detailed postmortem that hid nothing, not the broken backups, not the untested recovery process, not the fact that a single directory deletion command, typed on the wrong server, was all it took. They showed the world exactly how a company that sells repository hosting lost its own data because nobody owned the question "do our backups actually work."

The GitLab incident is not strictly a Git disaster. It was a database disaster at a Git hosting company. But it belongs in this episode because it shows the limit of Git's safety net. Git's architecture protects you from losing committed data in your local repository. It does not protect you from losing the server that hosts the remote. It does not protect you from infrastructure failures. It does not protect you from the assumption that someone else is handling the backups.

The distributed model helps here, too. Every developer who had cloned a GitLab-hosted repository still had a full copy of that repository's history. The Git data was safe because Git data lives everywhere. The metadata, the issues, the merge requests, the comments, the CI configurations, those lived only on GitLab's servers, and that is what was lost.

The Reflog and the Folklore

Every disaster story in this episode shares a structure. Something goes wrong. Someone panics. A safety mechanism, usually one that the person did not know existed, provides a path back. And afterward, the story gets told and retold until it becomes part of the folklore that teaches the next generation what not to do and where to look when they do it anyway.

The reflog is the hero of most of these stories. Not because it is the most powerful recovery tool Git offers, but because it is the most discoverable in a crisis. You do not need to understand Git's object model. You do not need to know about garbage collection policies or orphaned commits. You just need someone to tell you to type two words, and there is your history, timestamped and waiting.

Someone once described the reflog as "re-flog: for when the first flogging was not enough." The joke works because it captures something true about the experience. You made a mistake. Git flogged you with an incomprehensible error or a terrifying result. And then the reflog, the re-flogging, forces you to confront exactly what you did, step by step, in chronological order. The punishment and the cure are the same thing.

But the reflog has limits. It is local. It records what happened on your machine, not what happened on the server. Jay the PhD's conflict resolutions do not appear in his victims' reflogs because the damage happened in Jay's repository. The Jenkins incident required server-side reflogs that only GitHub had access to. The GitLab incident involved data that Git does not track at all.

The real lesson across all four scales is not "learn the reflog." It is that Git's safety net works differently at every level.

For the solo developer, the reflog is almost magical. By default, it keeps thirty days of orphaned commits and ninety days of reachable history, all sitting quietly in the object store. The only thing between you and recovery is knowing the reflog exists.
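Those two windows are configuration, not fixed law. A minimal sketch, run as a script in a throwaway repository, shows the two settings involved, gc.reflogExpire and gc.reflogExpireUnreachable. The longer values set here are arbitrary examples, not recommendations.

```shell
set -eu

scratch=$(mktemp -d)
cd "$scratch"
git init -q demo
cd demo

# When these keys are unset, Git falls back to its built-in defaults.
git config --get gc.reflogExpire ||
    echo "gc.reflogExpire is unset: the default is 90 days"
git config --get gc.reflogExpireUnreachable ||
    echo "gc.reflogExpireUnreachable is unset: the default is 30 days"

# Widen the window for this repository only. The values are arbitrary
# examples, not recommendations.
git config gc.reflogExpire "180.days.ago"
git config gc.reflogExpireUnreachable "90.days.ago"

git config --local --get gc.reflogExpire
git config --local --get gc.reflogExpireUnreachable
```

A longer window costs some disk space, because unreachable objects are kept around for as long as a reflog entry still mentions them.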

For the team, the safety net is distribution itself. Every clone is a backup. The person who has not pulled yet holds the last correct copy. The question is whether you can reach them before they sync.

For the organization, the safety net is not technical at all. It is training, documentation, and someone whose job it is to notice when the process is broken. Git cannot save you from a workflow that nobody understands.

And for infrastructure, the safety net is monitoring, tested backups, and the humility to verify that your recovery process actually works before you need it. The Jenkins project learned this by building a monitoring script. GitLab learned it by losing production data on a livestream.

Git is the most forgiving catastrophically powerful tool ever built. It gives you the ability to rewrite history, delete branches, force push over months of work, and then quietly keeps a record of everything you did so you can undo it later. But that forgiveness has a radius. The further you get from a single developer at a single terminal, the less Git's built-in safety nets can help, and the more the safety depends on human systems that someone has to build, maintain, and test.

This is the pattern of the season so far. Git won, and every problem it creates is proportional to the scale at which it won. A single developer's disaster is recoverable in seconds. A team's disaster is recoverable in minutes, if someone acts fast. An organization's disaster is recoverable in theory but requires someone to have built the systems before the crisis. And an infrastructure disaster, at the scale of Jenkins or GitLab, requires luck, transparency, and the collective redundancy of every clone on every laptop in the world.

The disasters never stop. Somewhere right now, someone is about to force push to the wrong branch. But if this episode does its job, the next thing they type will be two words they did not know yesterday. That was episode six of Git Good, Season Two.

git reflog. The command that turns catastrophe into inconvenience. Every move your branch has made, timestamped, waiting. The branch you deleted ten minutes ago. The rebase you regret from last week. The reset that went too far this morning. It is all there, sitting in a list that reads like a confession. Find the hash you need, point a branch at it, and breathe. Thirty days for orphaned commits, ninety for reachable ones. That is your window. Use it before it closes.