Zero-Downtime PostgreSQL Migrations in Real Systems
The Problem
Application teams love the phrase "just add a migration" right up until a deploy locks a hot table, breaks an older app instance, or turns a harmless schema change into a production incident.
The mistake is usually not PostgreSQL. The mistake is treating schema changes like isolated code changes.
They are not isolated.
They sit in the middle of a live dependency graph:
- old app versions still draining traffic
- background jobs running old code
- queries with different index assumptions
- dashboards and exports depending on old column names
- data backfills contending with normal production writes
If you want near-zero downtime, you need a migration strategy that assumes mixed versions and live traffic as the default case.
The Expand-Contract Pattern
The safest pattern I know is still the most boring one.
Phase 1: Expand
Add the new shape without removing the old one.
Examples:
- add a nullable column
- add a new table
- add a new index
- add a second write path
Phase 2: Migrate
Backfill or dual-write until the new shape is trustworthy.
Phase 3: Switch Reads
Move application reads to the new path only after the data is ready.
Phase 4: Contract
Remove the old column, code path, or table after you have verified nothing depends on it.
Most painful outages happen because teams try to combine all four phases into one deploy.
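In SQL terms, the expand and contract bookends of a hypothetical column migration look like this (the accounts table and its columns are illustrative; phases 2 and 3 live mostly in application deploys):

```sql
-- Phase 1: Expand — add the new shape alongside the old one
ALTER TABLE accounts ADD COLUMN display_name text;

-- Phase 2: Migrate — backfill while the app dual-writes both columns
UPDATE accounts SET display_name = username WHERE display_name IS NULL;

-- Phase 3: Switch Reads — application deploys, not SQL

-- Phase 4: Contract — only after nothing reads the old column
ALTER TABLE accounts DROP COLUMN username;
```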
A Column Rename Done Safely
Renaming a column is the classic trap. The SQL is tiny. The blast radius is not.
Bad instinct:

```sql
ALTER TABLE orders RENAME COLUMN total TO total_cents;
```

That works technically. It also breaks any still-running code that expects `total`.

Safer sequence:

1. Expand the schema:

   ```sql
   ALTER TABLE orders ADD COLUMN total_cents integer;
   ```

2. Backfill existing rows:

   ```sql
   UPDATE orders
   SET total_cents = CAST(total * 100 AS integer)
   WHERE total_cents IS NULL;
   ```

3. Deploy application code that writes both fields:

   ```js
   await db.query(
     `INSERT INTO orders (id, total, total_cents)
      VALUES ($1, $2, $3)`,
     [id, totalDollars, totalCents]
   );
   ```

4. Deploy application code that reads `total_cents`.

5. Verify everything is switched.

6. Remove the old column in a later deploy:

   ```sql
   ALTER TABLE orders DROP COLUMN total;
   ```

This takes longer. It is also how you avoid turning a rename into an outage.
The Golden Rule: App Changes Must Tolerate Mixed Schema States
During a real deployment, these things can all be true at once:
- the migration already ran
- some pods run old code
- some pods run new code
- a worker has not restarted yet
- an old job is retrying work from before the deploy
So the application code has to tolerate both states.
That usually means:
- new columns start nullable
- old columns stay readable for a while
- code can dual-read or dual-write temporarily
- destructive changes happen later, not immediately
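Dual-reading can be as small as a helper that prefers the new column and falls back to the old one. A minimal sketch, reusing the `total`/`total_cents` rename from earlier (the helper name is mine, not from any library):

```js
// Tolerates rows written by old code (no total_cents yet) and by new code.
function readTotalCents(row) {
  if (row.total_cents !== null && row.total_cents !== undefined) {
    return row.total_cents;
  }
  // Old-shape row: derive cents from the legacy dollar column.
  return Math.round(row.total * 100);
}
```

Once the backfill is complete and verified, the fallback branch stops firing and can be deleted along with the old column.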
Avoid Table Rewrites on Large Tables
Some DDL changes are much more expensive than they look.
On large tables, avoid surprising rewrites during peak traffic.
Examples that deserve extra caution:
- adding a column with a volatile default
- changing a column type in place
- adding constraints that require validating the whole table immediately
- large updates without batching
When in doubt, choose the more incremental path.
For example, instead of:

```sql
ALTER TABLE events ADD COLUMN processed boolean NOT NULL DEFAULT false;
```

Prefer:

```sql
ALTER TABLE events ADD COLUMN processed boolean;

UPDATE events
SET processed = false
WHERE processed IS NULL;

ALTER TABLE events ALTER COLUMN processed SET DEFAULT false;
```

Then add the NOT NULL constraint after you have verified the data. (On PostgreSQL 11 and later, adding a column with a constant default like this is a metadata-only change, so the one-shot form is safe there; the staged path still matters on older versions and for volatile defaults. On a large table, run the UPDATE in batches rather than as a single statement.)
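One way to batch that UPDATE, using the same table and column names as the example above (the batch size is arbitrary):

```sql
-- Repeat until the statement reports 0 rows updated
UPDATE events
SET processed = false
WHERE id IN (
  SELECT id
  FROM events
  WHERE processed IS NULL
  LIMIT 5000
);
```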
Use Concurrent Index Creation
Index creation is one of the easiest ways to cause avoidable write disruption if you are careless.
If the table is hot, prefer:
```sql
CREATE INDEX CONCURRENTLY idx_orders_created_at ON orders (created_at);
```

Not:

```sql
CREATE INDEX idx_orders_created_at ON orders (created_at);
```

The concurrent version takes longer and cannot run inside a transaction block, but it does not block normal writes while the index builds. A failed concurrent build leaves an INVALID index behind that must be dropped before retrying.
That trade is almost always worth it on production systems.
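When a concurrent build does fail and leave an INVALID index, a catalog query finds it:

```sql
-- Indexes left INVALID by a failed CREATE INDEX CONCURRENTLY
SELECT c.relname
FROM pg_index i
JOIN pg_class c ON c.oid = i.indexrelid
WHERE NOT i.indisvalid;
```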
Backfills Need Their Own Plan
Teams often treat backfills as a footnote.
That is backwards. On large tables, the backfill is the migration.
The safest backfill pattern I have used looks like this:
```js
const batchSize = 5000;

// Assumes `db` is a single node-postgres client (not a pool), so the
// BEGIN/COMMIT pair runs on the same connection as the updates.
while (true) {
  const rows = await db.query(
    `SELECT id, total
     FROM orders
     WHERE total_cents IS NULL
     ORDER BY id
     LIMIT $1`,
    [batchSize]
  );

  if (rows.rowCount === 0) break;

  await db.query('BEGIN');
  for (const row of rows.rows) {
    await db.query(
      `UPDATE orders
       SET total_cents = $2
       WHERE id = $1`,
      [row.id, Math.round(row.total * 100)]
    );
  }
  await db.query('COMMIT');
}
```

I also want backfills to be:
- resumable
- observable
- throttled if needed
- safe to rerun
If a backfill script can only succeed once in a perfect environment, it is brittle by definition.
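To make those properties concrete, here is an in-memory sketch of a resumable, idempotent batch step. There is no real database here: `rows` stands in for the table, and the returned id is the checkpoint you would persist between runs (all names are hypothetical):

```js
// One batch of a keyset-paginated backfill. Safe to rerun: already-filled
// rows are skipped, and resuming from the last checkpoint loses nothing.
function backfillBatch(rows, lastId, batchSize) {
  const batch = rows
    .filter((r) => r.id > lastId && r.total_cents == null)
    .sort((a, b) => a.id - b.id)
    .slice(0, batchSize);
  for (const row of batch) {
    row.total_cents = Math.round(row.total * 100);
  }
  // New checkpoint, or null when the backfill is done.
  return batch.length > 0 ? batch[batch.length - 1].id : null;
}
```

Logging the checkpoint and batch count after each call gives you the observability and throttle points for free.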
Constraints Should Be Added in Stages
If you are adding a constraint to existing production data, do it carefully.
One pattern I like is:
- add the data shape
- backfill and clean bad rows
- add the constraint after the data already conforms
That gives you a chance to discover ugly real-world data before the database turns it into a deploy failure.
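PostgreSQL supports this staging directly for CHECK and foreign key constraints: add the constraint as NOT VALID so existing rows are not scanned up front, then validate it separately once the data is clean. The constraint name and condition below are illustrative:

```sql
-- Stage 1: enforce the rule for new writes only; no full-table scan
ALTER TABLE orders
  ADD CONSTRAINT orders_total_cents_nonneg
  CHECK (total_cents >= 0) NOT VALID;

-- Stage 2: validate later; this scans the table but takes only a
-- SHARE UPDATE EXCLUSIVE lock, so normal writes continue
ALTER TABLE orders VALIDATE CONSTRAINT orders_total_cents_nonneg;
```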
Deployment Order Matters More Than Most Teams Think
My default sequence for any risky migration is:
- deploy additive schema changes
- deploy app code that can work with old and new schema
- run backfill
- switch reads and behavior flags
- verify metrics and logs
- remove old schema in a later deploy
That order creates room for rollback.
If the new code has a problem, the old schema still exists. If the backfill is slow, the app can keep running. If the read switch causes issues, you can revert without undeleting structure.
What I Watch During the Migration
I do not trust schema changes that I cannot observe.
At minimum I want:
- deploy progress by app version
- database CPU and connection saturation
- slow query trends
- lock waits
- error rates by endpoint and worker
- backfill throughput
- count of rows still missing migrated data
If the migration changes behavior users will notice, I also want business metrics nearby.
Sometimes the database is fine and the product behavior is not.
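Two of the database-side checks can be plain SQL, shown here against the earlier orders example (`pg_stat_activity` is a standard PostgreSQL view):

```sql
-- Rows still missing migrated data (backfill progress)
SELECT count(*) FROM orders WHERE total_cents IS NULL;

-- Sessions currently waiting on a lock
SELECT pid, state, wait_event, query
FROM pg_stat_activity
WHERE wait_event_type = 'Lock';
```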
Migrations That Deserve Extra Respect
These are not impossible. They just need more planning.
| Change | Risk | Safer pattern |
|---|---|---|
| column rename | mixed-version breakage | add new column, dual-write, cut over later |
| type change | rewrite or cast issues | add new column, backfill, switch reads |
| unique constraint | existing data conflicts | detect and clean duplicates first |
| large index build | write disruption | CREATE INDEX CONCURRENTLY |
| table split | application coordination | dual-write plus staged read migration |
My Practical Checklist
Before I approve a production migration, I want yes answers to these questions.
- Can old and new app versions run safely against the schema during rollout?
- Is the change additive first and destructive later?
- Do we know how large the affected table is?
- Is the backfill resumable and observable?
- Do we have a rollback story that does not depend on luck?
- Are metrics and logs ready before the migration starts?
- Is the cleanup step scheduled later instead of bundled into the risky deploy?
If the answer to several of those is no, the migration is not ready.
The Main Lesson
Zero-downtime migrations are rarely about clever SQL.
They are about respecting the fact that production systems are alive while you change them.
The teams that do this well are not braver. They are more disciplined about sequencing:
- expand first
- backfill carefully
- switch behavior deliberately
- contract later
That is the pattern. Most of the time, it is enough.