The Three AM Phone Call: When Your Database Disappears

The Worst Tuesday at GitLab

On the last day of January twenty seventeen, a database engineer at GitLab was having a terrible night. It was almost midnight in Utrecht. Spammers had been hammering the database since six in the evening, creating thousands of spam snippets that drove write operations through the roof. The primary database locked up. The replica fell behind by four gigabytes. And when the engineer tried to rebuild the replica using pg_basebackup, it kept hanging.

So the engineer decided to clear the replica's data directory to give pg_basebackup a fresh start. Reasonable enough. They opened a terminal and typed the command to recursively delete everything in the PostgreSQL data directory. And then, about two seconds in, they noticed something that made their stomach drop. [worried] The terminal prompt said db one dot cluster dot gitlab dot com. Not db two. They had just run the deletion command on the primary production database, not the replica.

They killed the process immediately, but two seconds was enough. <break time="1s"/> [serious] Of roughly three hundred gigabytes of data, only four and a half gigabytes remained. The primary database for all of GitLab dot com, the platform hosting millions of software projects, was essentially gone.

What happened next is one of the most instructive disaster stories in all of tech, and it is the reason you are going to set up backups before this episode is over.

Five Backup Methods, Five Failures

Here is what makes the GitLab story unforgettable. It was not just the accidental deletion. It was what they discovered when they tried to recover. GitLab had not one, not two, but five different backup and replication strategies in place. Let that sink in. Five separate safety nets, and almost all of them had holes.

[worried] First, they checked the regular pg_dump backups. These were supposed to run daily and upload to S3. But the S3 bucket was empty. Completely empty. It turned out the backup process was using pg_dump version nine point two, while the database was running PostgreSQL nine point six. The version mismatch caused every backup to fail silently. The error notifications were being sent by email, but the emails were rejected by the receiving mail server because of missing DMARC authentication. So the backups had been broken for a while, the alerts about the broken backups were also broken, and nobody knew.

Second, Azure disk snapshots. GitLab ran on Azure at the time, and Azure offered automated disk snapshots. But those snapshots were only enabled for the NFS file servers, not the database servers. Somebody had assumed the other backup methods covered the databases.

Third, the PostgreSQL streaming replication that was supposed to keep the replica in sync. Already broken. That was the whole reason the engineer was trying to rebuild the replica in the first place.

Fourth, the automated LVM snapshots. These were supposed to happen every twenty four hours. But the process was not functioning properly, producing backup files that were only a few bytes in size.

And fifth, there was supposed to be a separate S3 backup pipeline. Also empty. Also silently failing.

[serious] Five backup methods. Zero working restores. This is what database people call Schrodinger's backup. The condition of any backup is unknown until you actually try to restore from it. GitLab's backups existed on paper. On disk, they were ghosts.

The Lucky Snapshot

The only thing that saved GitLab was luck. About six hours before the deletion, the same engineer who would later run the fatal command had manually triggered an LVM snapshot while working on load balancing. It was not part of any automated process. It was just something they happened to do as part of their work that day.

[excited] That six hour old snapshot became the lifeline. The team restored from it, losing every change that had been made to the database during those six hours. About five thousand projects, five thousand comments, and seven hundred new user accounts were permanently gone. The git repositories themselves survived because they were stored on separate servers, but the metadata, the issues, the merge requests, the comments, all the collaborative work from those six hours, vanished.

GitLab, to their enormous credit, live-streamed the entire recovery process on YouTube. They published a brutally honest postmortem. They did not hide behind corporate language or blame the engineer. Their conclusion was simple and damning.

[serious] The backup procedure was not tested on a regular basis because there was no ownership. As a result, nobody was responsible for testing this procedure.

That sentence should be printed on a poster and hung in every server room in the world.

Why Backups Fail

The GitLab story is famous, but it is not unusual. It is the pattern that matters. Backups fail silently. They fail because of version mismatches, permission errors, full disks, expired credentials, or misconfigured cron jobs. They fail because the person who set them up left the company. They fail because the monitoring for the backups failed too.

In two thousand nine, a social bookmarking site called Ma dot gnolia suffered a catastrophic database corruption. They had backups. But the backups were stored on the same system as the primary database. When the database corrupted, the corruption had already been faithfully copied into every backup. They had been backing up corrupted data for weeks without knowing it. The site never recovered. Every user's bookmarks were permanently lost.

The pattern is always the same. Nobody tests the restore. People test the backup, they verify that the cron job runs, they see the files appear in S3, and they assume everything is fine. But the backup file could be empty. It could be corrupted. It could be from the wrong database. It could be in a format that your current version of PostgreSQL cannot read. You do not have a backup until you have restored from it.
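
For the show notes, here is a minimal sketch of a cheap sanity check you could run on every dump before trusting it. It assumes the dump was made in pg_dump's custom format, whose files start with the bytes PGDMP. The function name and the one kilobyte minimum size are made up for illustration, so tune them to your own databases.

```python
import os

# Custom-format pg_dump archives begin with these five bytes.
PGDUMP_MAGIC = b"PGDMP"


def looks_like_custom_dump(path, min_bytes=1024):
    """Return (ok, reason) for a dump file.

    min_bytes is an arbitrary illustrative threshold: a file of only
    a few bytes is almost certainly a failed dump.
    """
    if not os.path.exists(path):
        return False, "missing"
    size = os.path.getsize(path)
    if size < min_bytes:
        return False, f"too small ({size} bytes)"
    with open(path, "rb") as fh:
        if fh.read(len(PGDUMP_MAGIC)) != PGDUMP_MAGIC:
            return False, "not a custom-format archive"
    return True, "ok"
```

A check like this would catch a dump that is only a few bytes long, but it only proves the file looks plausible. It is not a restore.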

This is why the single most important thing you can do for your database is not setting up backups. It is testing your restores. Schedule it. Put it on the calendar. Once a month, take your latest backup, restore it to a test environment, and verify that the data is actually there. If you cannot do this, you do not have backups. You have hopes.
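
If you want to see what a monthly restore drill might look like, here is a rough sketch. It assumes a custom-format dump, a throwaway database name you choose, and one table you expect to contain rows. The function names and the table are hypothetical, and the table name must come from you, not from user input, because it is pasted straight into the SQL.

```python
import subprocess


def restore_drill_commands(dump_path, scratch_db, check_table):
    """Build the commands for a restore drill, in the order they should run."""
    return [
        # 1. Create an empty scratch database to restore into.
        ["createdb", scratch_db],
        # 2. Restore the dump; stop at the first error instead of carrying on.
        ["pg_restore", "--no-owner", "--exit-on-error",
         "--dbname", scratch_db, dump_path],
        # 3. Prove that real rows came back (check_table is a trusted name).
        ["psql", "--dbname", scratch_db, "--tuples-only",
         "--command", f"SELECT count(*) FROM {check_table};"],
        # 4. Clean up the scratch database.
        ["dropdb", scratch_db],
    ]


def run_drill(commands, runner=subprocess.run):
    """Run each command; check=True raises if any step exits non-zero."""
    for command in commands:
        runner(command, check=True)
```

A count query is the weakest useful check. Comparing the count against a number you expect, or spot checking a recent row, would tell you more.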

pg_dump: The Simplest Backup You Can Run Today

Let us talk about how PostgreSQL backups actually work, because the good news is that the basics are genuinely simple. A vibecoder can have working, tested backups running within an hour.

The first tool is pg_dump. It has been part of PostgreSQL since the beginning, and it does exactly what the name suggests. It dumps a database to a file. Every table, every row, every index definition, every constraint. The output is either a plain SQL script that you can replay to recreate the database, or a compressed custom format that pg_restore can read.

The command is beautifully simple. You run pg_dump with the database name and redirect the output to a file. That is it. You now have a backup. The custom format, selected with a capital F flag followed by a lowercase c, gives you a compressed binary file that restores faster, because pg_restore can load it in parallel, and lets you selectively restore individual tables. For most vibecoders, the custom format is the right choice.
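
In the show notes, that command might be wrapped in Python roughly like this. It is a sketch, not a finished tool: the function names are invented, and it assumes pg_dump is on the PATH and that connection details come from the usual PostgreSQL environment variables or a password file.

```python
import subprocess


def build_dump_command(dbname, out_path):
    """Return the pg_dump invocation for a custom-format dump of one database."""
    return ["pg_dump", "--format=custom", "--file", out_path, dbname]


def dump_database(dbname, out_path):
    """Run pg_dump; check=True raises CalledProcessError on a non-zero exit,
    so a failed dump cannot pass silently."""
    subprocess.run(build_dump_command(dbname, out_path), check=True)
```

Passing the command as a list rather than a shell string means a database name with odd characters is never interpreted by a shell.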

You can point this at a cron job, add a timestamp to the filename, upload the result to S3 or any object storage, and delete files older than thirty days. This is a complete backup system for a small to medium database. It costs almost nothing. The S3 storage for a typical application database is pennies per month.
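
Here is one way the timestamped filename and the thirty day cleanup could be sketched. The naming scheme and function names are assumptions for illustration. The idea is that the timestamp lives in the filename, so deciding what to delete never depends on file modification times.

```python
from datetime import datetime, timedelta, timezone

STAMP_FORMAT = "%Y%m%dT%H%M%SZ"  # hypothetical naming scheme, always UTC
SUFFIX = ".dump"


def backup_name(dbname, now):
    """Build a name like parkit-20240301T030000Z.dump."""
    return f"{dbname}-{now.strftime(STAMP_FORMAT)}{SUFFIX}"


def backup_time(name):
    """Recover the UTC timestamp embedded in a backup filename."""
    stamp = name.rsplit("-", 1)[1][: -len(SUFFIX)]
    return datetime.strptime(stamp, STAMP_FORMAT).replace(tzinfo=timezone.utc)


def expired(names, now, keep_days=30):
    """Return the backups that are strictly older than the retention window."""
    cutoff = now - timedelta(days=keep_days)
    return [name for name in names if backup_time(name) < cutoff]
```

Whatever uploads to object storage can list the bucket, pass the names to expired, and delete only what comes back.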

The limitations of pg_dump are worth knowing. First, it is a logical backup. It reads your data and writes it out row by row. For a small database, this takes seconds. For a database with hundreds of gigabytes, it can take hours, and during that time the database is working harder. Second, pg_dump gives you a snapshot at a single point in time. If something goes wrong at two in the afternoon and your last dump was at three in the morning, you lose eleven hours of data. For many applications, this is perfectly acceptable. For some, it is not.

And here is where AI actually helps. Ask any large language model to write you a PostgreSQL backup script with S3 upload, retention policy, and error notification, and it will give you something ninety percent correct. The patterns are well established, the documentation is everywhere in the training data, and the edge cases are well known. This is one of the rare areas where vibe coding a backup script is genuinely a good idea. Just test the restore.

Rabbit Hole: WAL and the Database Time Machine

If pg_dump is a photograph of your database, what comes next is closer to a security camera. This section gets into the internals of how PostgreSQL can recover to any point in time. If you just want the practical takeaway, skip ahead to the chapter on the vibecoder's backup plan. But if you want to understand the machinery, this is one of the most elegant pieces of engineering in all of database design.

PostgreSQL uses something called Write Ahead Logging, or WAL. The concept is deceptively simple. Before PostgreSQL changes anything in your actual data files, it first writes a record of what it intends to change to a separate log. This log is the write ahead log. Write first, then do.

Why? Because if the power goes out halfway through writing your data, the database can look at the log when it restarts and figure out exactly what was in progress. Changes that made it into the log get replayed onto the data files. Transactions that never committed are treated as if they never happened. This is how PostgreSQL guarantees that your data stays consistent even through crashes.

But here is the clever part. Those WAL files are a complete, sequential record of every single change made to the database. Every insert, every update, every delete, every schema change. If you save these WAL files, and you have a base backup from some point in the past, you can replay the WAL files from that base backup forward to reconstruct the database at any point in time.

[slow] This is called point in time recovery. Imagine you have a base backup from Monday at midnight. On Wednesday at two fourteen in the afternoon, someone accidentally runs a delete query that wipes out your users table. With pg_dump alone, you would restore to Monday at midnight and lose two and a half days of data. With WAL archiving and point in time recovery, you can tell PostgreSQL to replay all the changes up to Wednesday at two thirteen, one minute before the bad query. You get back almost everything.
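
To make the idea concrete, here is a toy model of point in time recovery. This is not how PostgreSQL stores or replays WAL internally. It is a deliberately tiny illustration where the database is a dictionary, the log is a list of timestamped changes, and every name in it is invented.

```python
from datetime import datetime


def recover_to(base_backup, wal, target_time):
    """Replay logged changes onto a copy of the base backup,
    stopping before the first change made after target_time.

    wal is a list of (when, op, key, value) tuples, oldest first.
    """
    state = dict(base_backup)  # never modify the backup itself
    for when, op, key, value in wal:
        if when > target_time:
            break  # everything from here on happened after the target
        if op == "set":
            state[key] = value
        elif op == "delete":
            state.pop(key, None)
    return state


# Monday's base backup, then two days of changes ending in a bad delete.
base = {"alice": "alice@example.com"}
wal = [
    (datetime(2024, 1, 2, 9, 0), "set", "bob", "bob@example.com"),
    (datetime(2024, 1, 3, 14, 14), "delete", "alice", None),
    (datetime(2024, 1, 3, 14, 14), "delete", "bob", None),
]
recovered = recover_to(base, wal, datetime(2024, 1, 3, 14, 13))
```

Recovering to two thirteen keeps both users, including the one who signed up on Tuesday and never appeared in the base backup. In real PostgreSQL, the stopping point is chosen with the recovery_target_time setting.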

The tool for taking the base backup in this scenario is pg_basebackup. Unlike pg_dump, which reads your data logically, pg_basebackup copies the raw data files of the entire database cluster. It is a physical backup. It is faster for large databases, and it is the foundation that WAL replay builds on.

The combination of pg_basebackup plus WAL archiving is what production PostgreSQL deployments use. It gives you continuous protection, not just nightly snapshots. But it is more complex to set up and operate than a simple pg_dump. For a single vibecoder running a handful of databases, pg_dump on a frequent schedule is often the right trade-off. For a growing application where eleven hours of data loss would be catastrophic, pg_basebackup with WAL archiving is the answer.

Replication: The Backup That Also Reads

There is another layer to the safety story, and it doubles as a performance tool. PostgreSQL streaming replication lets you run a second server, a replica, that maintains a nearly identical copy of your primary database in real time.

The primary writes its WAL records as usual. The replica connects to the primary and streams those WAL records as they are generated, applying them immediately. The replica typically lags behind the primary by milliseconds, not minutes or hours. If the primary server catches fire, literally or figuratively, you can promote the replica to become the new primary and keep running.

But replication is not a backup. This is a common and dangerous misconception. If someone runs a destructive query on the primary, that query gets replicated to the replica. If someone drops a table, the replica drops it too. Replication protects you against hardware failure. It does not protect you against human error. You still need actual backups, the kind that sit in cold storage where a bad query cannot reach them.

That said, a replica gives you something valuable beyond redundancy. You can point read heavy queries at the replica instead of the primary. Your analytics dashboard, your reporting queries, your full text search, all of those can hit the replica while the primary handles the writes from your application. For a growing application, this is often the first scaling move that makes a real difference.
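
A very rough sketch of that routing decision, for the show notes. The function name is invented and the rule is intentionally naive: it only looks at the first word of the statement. A real application would also have to think about replication lag and about select statements that call functions which write.

```python
def pick_dsn(sql, primary_dsn, replica_dsn):
    """Send plain SELECT statements to the replica, everything else to the primary."""
    words = sql.split(None, 1)
    first_word = words[0].upper() if words else ""
    if first_word == "SELECT":
        return replica_dsn
    # Writes, DDL, and anything we do not recognise go to the primary.
    return primary_dsn
```

Defaulting to the primary for anything unrecognised is the safe direction to fail in, because a replica will reject writes.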

Tangent: The Backup That Saved Toy Story

Let me tell you a story from outside the database world that captures the essence of everything we have been talking about. In nineteen ninety eight, Pixar was deep into production on Toy Story two. An employee was doing routine file cleanup on the internal servers and accidentally ran a deletion command on the Toy Story two root directory.

[surprised] Oren Jacob, an associate technical director, was working when he noticed something strange. Woody's hat disappeared from his model. Then his boots. Then Woody himself. When Jacob checked the directory, it had dropped from hundreds of files to four.

[worried] We watched file after file disappear. The whole movie was going away. [fast] I grabbed the phone and told someone to pull the plug on the machine.

[gasp] They killed the server, but it was too late. Roughly ninety percent of the movie was gone. Two years of work by hundreds of artists and engineers, deleted in seconds. And when they checked the backup system, it had not been working properly for about a month.

This is where the story turns into something you could not write as fiction. Galyn Susman, the supervising technical director, had recently given birth to her son. To work from home, she had set up a Silicon Graphics workstation with a copy of the Toy Story two production database on it. It was about two weeks old, but it existed.

We put the computer in the back seat of my Volvo. We wrapped it in blankets and strapped it in with seatbelts, and we drove back to Pixar at thirty five miles an hour with the hazard lights on.

[excited] That blanket-wrapped workstation in the back of a Volvo saved a movie that would go on to earn nearly five hundred million dollars worldwide. And here is the beautiful, ironic twist. After they restored the film from Galyn's backup, the creative team watched the whole thing and decided it was not good enough. They threw out the entire movie and rewrote it from scratch, finishing the new version in nine months. The backup saved the movie, and then the movie got deleted again on purpose.

The lesson is the same one GitLab would learn nineteen years later. The official backup system failed. What saved them was an accidental copy that happened to exist because someone was working from home. [serious] You do not want your disaster recovery plan to depend on luck.

The Vibecoder's Backup Plan

So what should you actually do? Let me show you what a real, working backup system looks like for a single developer running multiple PostgreSQL databases on a VPS.

On a server called popcorn, there is a Python script called daily backup that runs every night at three in the morning via cron. It dumps seven PostgreSQL databases, parkit, live, ttpanotis, listmonk, partypar, parcel, and parpixel, using pg_dump with custom format compression. Each dump gets a timestamp in the filename and gets uploaded to an S3 compatible object storage bucket. Files older than thirty days get automatically deleted. The whole thing costs a few cents per month in storage.

For the music service, which is higher risk because it has more active users making changes throughout the day, there is a separate script that runs hourly instead of daily. Same approach, just more frequent. Seven days of hourly backups means about a hundred and sixty eight recovery points for that one database.

The same backup script also handles SQLite databases, git bundles of important repositories, and a snapshot of server configuration files including environment variables, nginx configs, and systemd service files. Because when your server dies, you do not just need the data back. You need to remember how everything was configured.

Here is the important part. This entire system was built by one developer with AI assistance. The Python script is about two hundred lines. It uses boto three for S3 uploads, subprocess for pg_dump, and standard library logging. It is not clever. It is not elegant. It is reliable. And it gets tested, because every time a database needs to be migrated or a new service gets set up, the developer restores from a backup to verify it works.

This is what good enough looks like. Not a sophisticated cluster with automatic failover and point in time recovery to the millisecond. Just pg_dump, cron, S3, and the discipline to test your restores. You can ask your AI assistant to write this for you today. The pattern is so common that the generated code will be almost right. Check the pg_dump flags, make sure the S3 credentials are not hardcoded in the script, add error handling that actually notifies you when something fails, and schedule a monthly restore test. That is it. That is a backup system.
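
Two items on that checklist, credentials and failure alerts, might be sketched like this. The function names are hypothetical, and notify stands for whatever alert channel you actually read, whether that is a chat webhook, a text message, or a pager. The two environment variable names are the standard ones that boto three looks for.

```python
import os
import subprocess

REQUIRED_ENV = ("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY")


def check_credentials(env=os.environ):
    """Fail loudly if the S3 credentials are not in the environment.
    Keeping them there means they never appear in the script itself."""
    missing = [name for name in REQUIRED_ENV if not env.get(name)]
    if missing:
        raise RuntimeError("missing environment variables: " + ", ".join(missing))


def run_step(name, command, notify, runner=subprocess.run):
    """Run one backup step and report any failure through notify.

    Returns True on success, False if the command failed or could not start.
    """
    try:
        runner(command, check=True, capture_output=True, text=True)
    except (subprocess.CalledProcessError, OSError) as exc:
        notify(f"backup step {name!r} failed: {exc}")
        return False
    return True
```

The GitLab story is the argument for sending that notification somewhere other than email alone, and for testing the alert path the same way you test the restore.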

The Bridge to Production

You know how to protect your data now. pg_dump for the simple path, WAL archiving for continuous protection, replication for high availability. You know that untested backups are the same as no backups. You know that the GitLab engineer and the Pixar employee and the Ma dot gnolia team all learned the same lesson the hard way.

But backing up a database is only half the story of running one. In the next episode, we are going to look at what it actually means to run PostgreSQL on your own server. The configuration files that control everything, the authentication rules that keep people out, the monitoring that tells you when something is going wrong before it goes wrong. The popcorn VPS with its eight databases is going to become our case study for what a vibecoder's production PostgreSQL actually looks like.

Set up your backups first, though. Seriously. Before you listen to the next episode. Open a terminal, ask your AI to write you a pg_dump script with S3 upload, and schedule it. It will take less time than this episode did. And the next time something goes wrong at three in the morning, you will have something better than luck to rely on.