Zero-Downtime PostgreSQL Migrations in Real Systems
The Problem
Application teams love the phrase "just add a migration" right up until a deploy locks a hot table, breaks an older app instance, or turns a harmless schema change into a production incident.
The mistake is usually not PostgreSQL. The mistake is treating schema changes like isolated code changes.
They are not isolated.
They sit in the middle of a live dependency graph:
- old app versions still draining traffic
- background jobs running old code
- queries with different index assumptions
- dashboards and exports depending on old column names
- data backfills contending with normal production writes
If you want near-zero downtime, you need a migration strategy that assumes mixed versions and live traffic as the default case.
The Expand-Contract Pattern
The safest pattern I know is still the most boring one.
Phase 1: Expand
Add the new shape without removing the old one.
Examples:
- add a nullable column
- add a new table
- add a new index
- add a second write path
Phase 2: Migrate
Backfill or dual-write until the new shape is trustworthy.
Phase 3: Switch Reads
Move application reads to the new path only after the data is ready.
Phase 4: Contract
Remove the old column, code path, or table after you have verified nothing depends on it.
Most painful outages happen because teams try to combine all four phases into one deploy.
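In SQL terms, the expand and contract bookends of a hypothetical column migration look like this (the accounts table and its columns are illustrative; phases 2 and 3 live mostly in application deploys):

```sql
-- Phase 1: Expand — add the new shape alongside the old one
ALTER TABLE accounts ADD COLUMN display_name text;

-- Phase 2: Migrate — backfill while the app dual-writes both columns
UPDATE accounts SET display_name = username WHERE display_name IS NULL;

-- Phase 3: Switch Reads — application deploys, not SQL

-- Phase 4: Contract — only after nothing reads the old column
ALTER TABLE accounts DROP COLUMN username;
```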
A Column Rename Done Safely
Renaming a column is the classic trap. The SQL is tiny. The blast radius is not.
Bad instinct:

```sql
ALTER TABLE orders RENAME COLUMN total TO total_cents;
```

That works technically. It also breaks any still-running code that expects `total`.

Safer sequence:

1. Expand the schema:

   ```sql
   ALTER TABLE orders ADD COLUMN total_cents integer;
   ```

2. Backfill existing rows:

   ```sql
   UPDATE orders
   SET total_cents = CAST(total * 100 AS integer)
   WHERE total_cents IS NULL;
   ```

3. Deploy application code that writes both fields:

   ```js
   await db.query(
     `INSERT INTO orders (id, total, total_cents)
      VALUES ($1, $2, $3)`,
     [id, totalDollars, totalCents]
   );
   ```

4. Deploy application code that reads `total_cents`.

5. Verify everything is switched.

6. Remove the old column in a later deploy:

   ```sql
   ALTER TABLE orders DROP COLUMN total;
   ```

This takes longer. It is also how you avoid turning a rename into an outage.
The Golden Rule: App Changes Must Tolerate Mixed Schema States
During a real deployment, these things can all be true at once:
- the migration already ran
- some pods run old code
- some pods run new code
- a worker has not restarted yet
- an old job is retrying work from before the deploy
So the application code has to tolerate both states.
That usually means:
- new columns start nullable
- old columns stay readable for a while
- code can dual-read or dual-write temporarily
- destructive changes happen later, not immediately
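Dual-reading can be as small as a helper that prefers the new column and falls back to the old one. A minimal sketch, reusing the `total`/`total_cents` rename from earlier (the helper name is mine, not from any library):

```js
// Tolerates rows written by old code (no total_cents yet) and by new code.
function readTotalCents(row) {
  if (row.total_cents !== null && row.total_cents !== undefined) {
    return row.total_cents;
  }
  // Old-shape row: derive cents from the legacy dollar column.
  return Math.round(row.total * 100);
}
```

Once the backfill is complete and verified, the fallback branch stops firing and can be deleted along with the old column.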
Avoid Table Rewrites on Large Tables
Some DDL changes are much more expensive than they look.
On large tables, avoid surprising rewrites during peak traffic.
Examples that deserve extra caution:
- adding a column with a volatile default
- changing a column type in place
- adding constraints that require validating the whole table immediately
- large updates without batching
When in doubt, choose the more incremental path.
For example, instead of:

```sql
ALTER TABLE events ADD COLUMN processed boolean NOT NULL DEFAULT false;
```

Prefer:

```sql
ALTER TABLE events ADD COLUMN processed boolean;

UPDATE events
SET processed = false
WHERE processed IS NULL;

ALTER TABLE events ALTER COLUMN processed SET DEFAULT false;
```

Then add the NOT NULL constraint after you have verified the data. (On PostgreSQL 11 and later, adding a column with a constant default like this is a metadata-only change, so the one-shot form is safe there; the staged path still matters on older versions and for volatile defaults. On a large table, run the UPDATE in batches rather than as a single statement.)
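One way to batch that UPDATE, using the same table and column names as the example above (the batch size is arbitrary):

```sql
-- Repeat until the statement reports 0 rows updated
UPDATE events
SET processed = false
WHERE id IN (
  SELECT id
  FROM events
  WHERE processed IS NULL
  LIMIT 5000
);
```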
Use Concurrent Index Creation
Index creation is one of the easiest ways to cause avoidable write disruption if you are careless.
If the table is hot, prefer:
```sql
CREATE INDEX CONCURRENTLY idx_orders_created_at ON orders (created_at);
```

Not:

```sql
CREATE INDEX idx_orders_created_at ON orders (created_at);
```

The concurrent version takes longer and cannot run inside a transaction block, but it does not block normal writes while the index builds. A failed concurrent build leaves an INVALID index behind that must be dropped before retrying.
That trade is almost always worth it on production systems.
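When a concurrent build does fail and leave an INVALID index, a catalog query finds it:

```sql
-- Indexes left INVALID by a failed CREATE INDEX CONCURRENTLY
SELECT c.relname
FROM pg_index i
JOIN pg_class c ON c.oid = i.indexrelid
WHERE NOT i.indisvalid;
```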
Backfills Need Their Own Plan
Teams often treat backfills as a footnote.
That is backwards. On large tables, the backfill is the migration.
The safest backfill pattern I have used looks like this:
```js
const batchSize = 5000;

// Assumes `db` is a single node-postgres client (not a pool), so the
// BEGIN/COMMIT pair runs on the same connection as the updates.
while (true) {
  const rows = await db.query(
    `SELECT id, total
     FROM orders
     WHERE total_cents IS NULL
     ORDER BY id
     LIMIT $1`,
    [batchSize]
  );

  if (rows.rowCount === 0) break;

  await db.query('BEGIN');
  for (const row of rows.rows) {
    await db.query(
      `UPDATE orders
       SET total_cents = $2
       WHERE id = $1`,
      [row.id, Math.round(row.total * 100)]
    );
  }
  await db.query('COMMIT');
}
```

I also want backfills to be:
- resumable
- observable
- throttled if needed
- safe to rerun
If a backfill script can only succeed once in a perfect environment, it is brittle by definition.
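To make those properties concrete, here is an in-memory sketch of a resumable, idempotent batch step. There is no real database here: `rows` stands in for the table, and the returned id is the checkpoint you would persist between runs (all names are hypothetical):

```js
// One batch of a keyset-paginated backfill. Safe to rerun: already-filled
// rows are skipped, and resuming from the last checkpoint loses nothing.
function backfillBatch(rows, lastId, batchSize) {
  const batch = rows
    .filter((r) => r.id > lastId && r.total_cents == null)
    .sort((a, b) => a.id - b.id)
    .slice(0, batchSize);
  for (const row of batch) {
    row.total_cents = Math.round(row.total * 100);
  }
  // New checkpoint, or null when the backfill is done.
  return batch.length > 0 ? batch[batch.length - 1].id : null;
}
```

Logging the checkpoint and batch count after each call gives you the observability and throttle points for free.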
Constraints Should Be Added in Stages
If you are adding a constraint to existing production data, do it carefully.
One pattern I like is:
- add the data shape
- backfill and clean bad rows
- add the constraint after the data already conforms
That gives you a chance to discover ugly real-world data before the database turns it into a deploy failure.
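PostgreSQL supports this staging directly for CHECK and foreign key constraints: add the constraint as NOT VALID so existing rows are not scanned up front, then validate it separately once the data is clean. The constraint name and condition below are illustrative:

```sql
-- Stage 1: enforce the rule for new writes only; no full-table scan
ALTER TABLE orders
  ADD CONSTRAINT orders_total_cents_nonneg
  CHECK (total_cents >= 0) NOT VALID;

-- Stage 2: validate later; this scans the table but takes only a
-- SHARE UPDATE EXCLUSIVE lock, so normal writes continue
ALTER TABLE orders VALIDATE CONSTRAINT orders_total_cents_nonneg;
```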
Deployment Order Matters More Than Most Teams Think
My default sequence for any risky migration is:
- deploy additive schema changes
- deploy app code that can work with old and new schema
- run backfill
- switch reads and behavior flags
- verify metrics and logs
- remove old schema in a later deploy
That order creates room for rollback.
If the new code has a problem, the old schema still exists. If the backfill is slow, the app can keep running. If the read switch causes issues, you can revert without undeleting structure.
What I Watch During the Migration
I do not trust schema changes that I cannot observe.
At minimum I want:
- deploy progress by app version
- database CPU and connection saturation
- slow query trends
- lock waits
- error rates by endpoint and worker
- backfill throughput
- count of rows still missing migrated data
If the migration changes behavior users will notice, I also want business metrics nearby.
Sometimes the database is fine and the product behavior is not.
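Two of the database-side checks can be plain SQL, shown here against the earlier orders example (`pg_stat_activity` is a standard PostgreSQL view):

```sql
-- Rows still missing migrated data (backfill progress)
SELECT count(*) FROM orders WHERE total_cents IS NULL;

-- Sessions currently waiting on a lock
SELECT pid, state, wait_event, query
FROM pg_stat_activity
WHERE wait_event_type = 'Lock';
```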
Migrations That Deserve Extra Respect
These are not impossible. They just need more planning.
| Change | Risk | Safer pattern |
|---|---|---|
| column rename | mixed-version breakage | add new column, dual-write, cut over later |
| type change | rewrite or cast issues | add new column, backfill, switch reads |
| unique constraint | existing data conflicts | detect and clean duplicates first |
| large index build | write disruption | CREATE INDEX CONCURRENTLY |
| table split | application coordination | dual-write plus staged read migration |
My Practical Checklist
Before I approve a production migration, I want yes answers to these questions.
- Can old and new app versions run safely against the schema during rollout?
- Is the change additive first and destructive later?
- Do we know how large the affected table is?
- Is the backfill resumable and observable?
- Do we have a rollback story that does not depend on luck?
- Are metrics and logs ready before the migration starts?
- Is the cleanup step scheduled later instead of bundled into the risky deploy?
If the answer to several of those is no, the migration is not ready.
The Main Lesson
Zero-downtime migrations are rarely about clever SQL.
They are about respecting the fact that production systems are alive while you change them.
The teams that do this well are not braver. They are more disciplined about sequencing:
- expand first
- backfill carefully
- switch behavior deliberately
- contract later
That is the pattern. Most of the time, it is enough.