The Migration Fear: When Your Schema Meets Reality

The Four Hundred Millisecond Disaster

[worried] A developer at a growing startup needed to add a foreign key. The orders table had fifty million rows. The customers table was the target. On staging, the migration completed in four hundred milliseconds. Clean, fast, painless. The kind of result that makes you push to production with confidence. So they did.

The ALTER TABLE statement needed an ACCESS EXCLUSIVE lock on both tables. At that exact moment, an analytics dashboard was running a long query against the customers table. The dashboard held a weaker lock. The migration could not acquire its exclusive lock, so it waited. And here is where PostgreSQL's lock queue turned a four hundred millisecond operation into a twenty minute outage.

Every new query that arrived, every SELECT, every INSERT, every UPDATE against either table, lined up behind the waiting migration. Not behind the long-running analytics query. Behind the migration that was waiting for the analytics query. Within ninety seconds, the connection pool was exhausted. The authentication service, which needed the customers table for every single login, started throwing errors. [serious] The API went dark. <break time="1s"/> Twenty minutes of complete downtime, caused by a migration that would have taken less than half a second if it had gotten the lock immediately.

[worried] The real danger was not the SQL itself. It was invisible contention. While the migration waited for its turn, the database became completely unresponsive to all traffic. The lock queue, not execution time, created the bottleneck.

That developer wrote a postmortem. They changed how their team does migrations forever. And their story is one of hundreds like it, because schema migrations are the single most dangerous routine operation in all of software engineering. Not because they are complex. Because they look simple.

This is episode six of the Vibecoder's Guide to Postgres. Last time we went deep on indexes, how B-trees work, when to create them, and why your AI tool adds them in all the wrong places. Today we are talking about the thing that makes every developer sweat. Changing a table that already has data in it. The migration.

What Is a Migration, Actually

Here is the core problem. Your application changes. You add features, remove features, rename things, restructure things. But your database schema, the shape of your tables, was defined at some earlier point when your understanding of the problem was worse. The schema needs to evolve along with the code. A migration is a versioned, ordered set of instructions that transforms your database from one shape to another.

Think of it like this. Your application is a building. The schema is the foundation. A migration is the process of modifying the foundation while people are still living in the building. You can see why this might be stressful.

The simplest migration adds a column to a table. ALTER TABLE users ADD COLUMN last_login TIMESTAMP. One line of SQL. The more complicated ones rename columns, change data types, split tables apart, merge tables together, add constraints, create indexes, or restructure relationships. Some of these operations take milliseconds. Some take hours. Some lock your entire table while they work. And the difference between a migration that completes invisibly and one that brings down your production system often comes down to a single keyword or a missing clause.

Every migration framework, regardless of language, does roughly the same thing. It keeps a list of migrations that have been applied to the database, stored in a special table. When you deploy new code, it checks which migrations have not yet been applied and runs them in order. Alembic does this for Python. Prisma Migrate does it for JavaScript and TypeScript. Django has its own built-in system. Flyway and golang-migrate do it for Java and Go. They all solve the same problem. They all fail to protect you from the same dangers.

The ALTER TABLE Lock Problem

Here is the thing that nobody tells you until it is too late. ALTER TABLE is not one operation. It is dozens of operations hiding behind the same two words. And those operations acquire different types of locks that have wildly different impacts on your running application.

PostgreSQL has eight lock levels, from ACCESS SHARE, which is what a SELECT takes, all the way up to ACCESS EXCLUSIVE, which blocks everything, reads and writes alike. Most ALTER TABLE operations acquire ACCESS EXCLUSIVE. That means while the operation runs, nobody can read from or write to that table. Not a single query gets through.

But here is the critical nuance that separates someone who understands migrations from someone who does not. Lock level and lock duration are completely different things. Dropping a column takes an ACCESS EXCLUSIVE lock, but it completes in milliseconds because it just updates the system catalog. Changing a column's data type also takes an ACCESS EXCLUSIVE lock, but it rewrites the entire table, which on a table with a hundred million rows can take minutes or hours. Both operations have the same lock severity. One is harmless. The other is catastrophic.

Let me walk you through the operations and what they actually do.

Adding a column with no default is instant. PostgreSQL just updates the catalog. The lock is ACCESS EXCLUSIVE but it is held for a fraction of a second. Safe to run anytime.

Adding a column with a constant default value, like a string or a number, is also instant on PostgreSQL eleven and later. This was a huge improvement. Before version eleven, adding a column with a default meant PostgreSQL had to rewrite every single row in the table to fill in that default value. A table with ten million rows would lock for minutes. Since PostgreSQL eleven, the default is stored in the catalog and applied lazily when rows are read. One line in the release notes. Years of production outages prevented.

Adding a column with a volatile default, something like a function call that generates a unique ID or a current timestamp for each row, still rewrites the table. PostgreSQL cannot store a different value per row in the catalog. Every row gets touched. The lock is held until every row is written. On a big table, this is exactly the disaster it sounds like.

Setting a column to NOT NULL scans the entire table to verify no existing rows violate the constraint. If your table has a hundred million rows, that scan takes a while, and the ACCESS EXCLUSIVE lock is held the whole time.

Changing a column type, like going from INTEGER to BIGINT, rewrites the entire table. Even going from a smaller type to a larger one. There are a few exceptions. Widening a VARCHAR or converting to TEXT avoids the rewrite. But most type changes mean every row gets rewritten.

Creating an index with the regular CREATE INDEX command takes a SHARE lock, which blocks writes but allows reads. If you have a busy table, writes pile up until the index build finishes. For a table with millions of rows, that could be several minutes of blocked inserts and updates. The solution is CREATE INDEX CONCURRENTLY, which takes a weaker lock and builds the index without blocking writes. It takes longer and requires two table scans instead of one, but it does not block your application. In production, you should never use CREATE INDEX without the CONCURRENTLY keyword. Never.

The Queue That Kills You

The lock itself is only half the problem. The other half is the lock queue.

When your ALTER TABLE statement requests an ACCESS EXCLUSIVE lock, it enters a queue. If another transaction, say that analytics query from our opening story, currently holds any lock on the table, your ALTER TABLE has to wait. But here is the part that catches everyone by surprise. While your ALTER TABLE is waiting, every subsequent query that arrives also enters the queue behind it. Even simple SELECT statements that only need an ACCESS SHARE lock. PostgreSQL's lock queue is first-come, first-served. The database will not skip your waiting ALTER TABLE to serve a harmless read.

[serious] This means a single waiting DDL statement can create a traffic jam that blocks your entire application. Your migration is not even running yet. It is just waiting. And while it waits, every request to your application that touches that table stacks up behind it. Connection pool fills up. Timeouts cascade. And you have an outage caused by a migration that has not done anything yet.

The fix is deceptively simple. Set a lock timeout.

Before running any migration in production, execute SET lock_timeout equals two seconds, or five seconds, whatever makes sense for your application. This tells PostgreSQL that if the ALTER TABLE cannot acquire its lock within that window, it should fail immediately instead of waiting. Yes, your migration fails. But failure is better than a cascading outage. You try again a few seconds later when the long-running query has finished.

DDL statements in migration sessions should always set lock timeout to an appropriate value for the application. Values of less than two seconds are common. This ensures that reads and writes will not queue behind a blocked DDL statement and cause application downtime.

[slow] This is the single most important piece of migration advice in this entire episode. Set a lock timeout. Every time. No exceptions.

The Tools and What They Do Not Protect You From

Let us talk about migration tools. If you are building with Python, you are probably using Alembic, which works with SQLAlchemy. If you are in Django, you have Django's built-in migration framework. If you are in the JavaScript and TypeScript world, Prisma Migrate is increasingly common. If you are in Go, golang-migrate is popular. If you are in a polyglot enterprise, Flyway has been around since two thousand and ten and works across languages.

They all do the same basic thing. They track which migrations have been applied. They run new ones in order. They generate migration files either from model changes or from manual SQL. Some of them, like Django's migrations and Prisma Migrate, are tightly coupled to an ORM. They detect changes in your model definitions and generate the migration SQL automatically. Alembic can do both, auto-generating from SQLAlchemy models or letting you write raw SQL. Flyway and golang-migrate are SQL-first. You write the SQL yourself. They just track and execute it.

Here is what none of them do by default. None of them check whether your migration is safe to run on a production database. None of them warn you about lock durations. None of them set lock timeouts. None of them tell you that CREATE INDEX without CONCURRENTLY will block writes. None of them prevent you from adding a NOT NULL column without a default to a table with fifty million rows.

This is the gap that tools like Squawk and strong_migrations fill. Squawk is a linter specifically for PostgreSQL migration files. You feed it your SQL and it tells you which operations will acquire dangerous locks, which operations need the CONCURRENTLY keyword, which ALTER TABLE commands will rewrite the table. strong_migrations does the same thing for Ruby on Rails. These are not part of your migration framework. They are separate tools that you add to your CI pipeline to catch the mistakes your migration framework will happily execute.

The distinction between reversible and irreversible migrations matters too. A reversible migration has both an "up" and a "down." Add a column going up, drop it going down. A rename going up, a rename back going down. Irreversible migrations are ones where the "down" would lose data. Dropping a column is irreversible. You cannot un-drop data. Changing a column type might lose precision. Merging two tables destroys the original structure. Most frameworks let you mark migrations as irreversible, which means the rollback function just throws an error saying "you cannot undo this." Which is honest, but not helpful at three in the morning.

The practical advice is this. Write every migration as if you cannot roll it back, because in production, rolling back often creates more problems than rolling forward with a fix. The expand-contract pattern, which we are about to get into, makes this explicit.

Rabbit Hole: The Expand-Contract Pattern

This next section goes deep into a migration strategy used by large-scale systems. If you just want the practical checklist for safe migrations, skip ahead to the next chapter. But if you want to understand how companies like GitHub and Stripe change their schemas without downtime, this is how.

The expand-contract pattern splits every breaking schema change into three phases. Expand, migrate, contract.

Suppose you need to rename a column from shipped to status, changing it from a boolean to an enum with multiple values. In a naive migration, you would rename the column and change its type in one step. This breaks every piece of code that references the old column name and the old type. If old code is still running when the migration executes, it crashes.

With expand-contract, you do this instead. In the expand phase, you add a new column called status alongside the old shipped column. You set up dual writes so that every write to shipped also writes the corresponding value to status, and vice versa. You backfill existing rows, converting the old boolean values to the new enum values. This can happen in batches, in the background, without locking the table. At this point, both old and new versions of your application work. Old code reads shipped. New code reads status.

In the migrate phase, you deploy the new version of your application that reads from and writes to status. You verify everything works. You let it run for a while.

In the contract phase, you drop the old shipped column and remove the dual-write triggers. The migration is complete. At no point was the table locked for more than a fraction of a second. At no point did any version of the application break.

The expand and contract strategy, also known as parallel change, is a pattern used to migrate schemas with zero downtime. It is usually seen in decoupled architectures and is proven to work well at scale.

[sigh] This is more work than a single ALTER TABLE statement. Significantly more work. It requires three separate deployments. But it has a beautiful property. It is safe. Every step is independently reversible. If something goes wrong during the expand phase, you just drop the new column. If something goes wrong during the migrate phase, you roll back to the old code. If something goes wrong during the contract phase, you stop and investigate. At no point are you in a state where rolling back is impossible.

There is even a tool that automates this for PostgreSQL. It is called pgroll, built by the team at Xata. It takes a declarative description of your schema change and handles the expand-contract dance automatically, creating dual-write triggers, managing schema versions, and minimizing lock acquisition. It is open source and worth knowing about, even if you never use it, because understanding what it automates helps you understand what you are responsible for when you do migrations by hand.

When AI Writes Your Migrations

This is where the vibe coding angle gets serious. And I mean genuinely serious.

An AI assistant can write your queries. It can design your schemas. It can generate your indexes. If it gets any of those slightly wrong, you notice. The query is slow, the schema is awkward, the index does not help. You iterate, you fix it, you learn. The feedback loop is tight. But a migration operates on your live data. A bad migration does not just perform poorly. It can destroy data, lock your tables, crash your application, or put your database in a state that is extremely difficult to recover from.

Here is a specific example. You ask your AI assistant to add a required email column to your users table. The AI generates ALTER TABLE users ADD COLUMN email VARCHAR NOT NULL. Looks perfect. Clean SQL, correct syntax. But there is no DEFAULT clause. If your users table has existing rows, this migration will fail on PostgreSQL because the existing rows have no value for this column and NOT NULL means they must. Some AI tools will "fix" this by adding a DEFAULT with an empty string or a placeholder value like "unknown at example dot com." [worried] Now you have a different problem. You have garbage data in production. Every existing user has a fake email address that your application might try to send messages to.

The correct approach is a multi-step migration. First, add the column as nullable with no default. Second, backfill the real email addresses from wherever they live, your old authentication table, your user profiles, wherever. Third, once every row has a real value, add the NOT NULL constraint. This takes three migration files and a data migration script. The AI will almost never generate this. It generates one migration file that either fails or pollutes your data.

Another common AI mistake. Generating CREATE INDEX without CONCURRENTLY. The AI has probably seen a thousand CREATE INDEX statements in its training data and maybe a hundred CREATE INDEX CONCURRENTLY statements. It defaults to the more common pattern. On a busy production table, this blocks writes for the entire duration of the index build.

Another one. Wrapping multiple DDL statements in a single transaction. The AI puts BEGIN and COMMIT around five ALTER TABLE statements because that seems transactionally safe. But PostgreSQL holds the most restrictive lock from the first statement for the entire duration of the transaction. Five migrations that each take one second now hold an ACCESS EXCLUSIVE lock for five seconds. During peak traffic, that is an eternity.

[serious] The question should not be "does this look reasonable?" It should be "what assumptions is this code making, and are those assumptions safe?" AI often fills in gaps by guessing, and that something may work functionally while violating safety boundaries in hard to spot ways.

Migrations are the one category of AI-generated code where you must review every single line. Not because the AI is stupid. Because the AI is optimizing for correctness in isolation, and migrations live in context. The AI does not know your table has fifty million rows. It does not know your application has a long-running analytics query that holds locks. It does not know your connection pool has a limit of twenty connections. It does not know that you have nine services sharing the same database and a bad migration breaks all of them simultaneously.

Which brings us to a real-world example.

The Shared Schema Problem

On a VPS in Paris, there is a PostgreSQL database called parkit. It backs a productivity suite with multiple services. Capture for collecting ideas, Focus for task execution, Time for calendar management, and several others. Nine services sharing one database. The schema for this database is managed by numbered SQL migration files in a shared directory. When any of these services starts up, it checks the migration table and runs any new migrations it has not seen yet.

[worried] This architecture has a specific property that makes migrations terrifying. A bad migration does not just break one service. It breaks all of them. Every service runs through the same migration sequence on startup. If migration number forty-seven adds a constraint that violates existing data, or locks a table that another service is currently reading, or changes a column type that another service depends on, then every service that starts up after that migration will either crash or behave incorrectly.

[serious] There is no staging environment. The person running this setup is a single developer using AI to write most of the code. The migration files are the highest-risk code in the entire system. And the warning in the project documentation is explicit. A bad migration breaks everything. <break time="1s"/>

This is the reality for a lot of vibecoders. You are probably not running nine services on a shared database. But you are probably running at least one service where the database has real data that matters, where there is no staging copy, where migrations run on deploy, and where getting it wrong means your application is down until you figure out what happened and how to fix it.

The AI did not create this problem. But the AI makes it worse, because the AI can generate migration files faster than you can review them. The temptation to just run the migration because the AI said it was fine is enormous. And it will be fine ninety-nine times out of a hundred. It is the hundredth time that keeps you up at night.

The Checklist

Here is the practical advice. The things you actually do to make migrations safe, whether you wrote the migration yourself or your AI wrote it for you.

First. Always set lock_timeout before running DDL in production. Two to five seconds is a reasonable value. If the migration cannot acquire its lock in that window, it fails fast instead of creating a cascading queue that brings down your application. Retry in a few seconds when traffic is lighter.

Second. Use CREATE INDEX CONCURRENTLY for every index creation in production. Yes, it takes longer. Yes, it requires two table scans. Yes, it cannot run inside a transaction. But it does not block writes, and that is the only thing that matters on a live system.

Third. Never add a NOT NULL constraint and a column in the same statement without understanding the consequences. On PostgreSQL eleven and later, ADD COLUMN with a constant default is instant. But SET NOT NULL scans the entire table. If you need a non-nullable column, add it as nullable first, backfill the data, then add the constraint in a separate migration.

Fourth. Split large migrations into separate transactions. If you have five DDL statements, run them as five separate migrations, not one migration with five statements. Each one acquires and releases its lock independently.

Fifth. Run a migration linter. Squawk for raw SQL, strong_migrations for Rails. Wire it into your CI pipeline so it catches unsafe operations before they reach production. This is especially important when AI is generating your migration files.

Sixth. Test your migration against a copy of your production data, not just an empty database. A migration that works on a table with ten rows can fail catastrophically on a table with ten million rows. The lock duration, the rewrite time, the constraint validation time, all scale with the data.

Seventh. Know which operations rewrite the table and which do not. Adding a column with a constant default does not rewrite. Adding a volatile default does. Changing a column type usually does. Dropping a column does not. This knowledge is the difference between a migration you can run anytime and one that needs a maintenance window.

Eighth. Have a rollback plan before you run the migration. Can you reverse this? What data would you lose? Is rolling forward with a fix safer than rolling back? Write down the answer before you start.

And ninth. Read every AI-generated migration file line by line. Understand every statement. Check it against this list. If you are not sure what a statement does, look it up. This is the one place in your entire codebase where "move fast and break things" is not a philosophy. It is a threat.

The Bridge

Schema migrations are the scariest routine operation in database work. They are also completely manageable if you understand the lock system, set appropriate timeouts, and review every change carefully. The fear is rational. The tools to manage it exist. The discipline to use them is on you.

Your AI can write migrations. It can even write good ones most of the time. [serious] But the cost of a bad migration is not a slow query or an ugly schema. It is data loss, downtime, and the kind of three-in-the-morning phone call that makes you question your career choices. Review the migrations. Set the timeouts. Use the linter. And respect the fact that ALTER TABLE, those two innocent words, are doing something extraordinary. Changing the shape of a living system while it runs.

Your schema can change safely. You now know the locks, the tools, and the patterns that make it possible. But what about the data itself? When you write a row and Postgres says it is committed, what does that actually mean? What guarantees are you getting, and what could still go wrong? Next time, we take the ACID trip.