Zero-downtime Postgres schema migrations: what every DDL operation does under the hood
A reference map of lock types, unsafe patterns, and the expand-contract technique for large tables
Postgres schema migrations are the most reliable source of self-inflicted downtime in backend systems. The same DDL operations that complete in milliseconds during development can hold an ACCESS EXCLUSIVE lock for minutes on a production table with tens of millions of rows — not because the operations are complex, but because the table is large.
This article is a reference. It covers the lock each common DDL operation takes, which operations are instant regardless of table size, and the three patterns you need for the ones that aren't. The goal is to walk into any schema change knowing exactly what will happen before you run it.
The Postgres lock model, in one minute
Every DDL operation acquires a table-level lock. Postgres has eight lock modes. For schema changes, the three that matter are: ACCESS EXCLUSIVE, which blocks all reads and writes and is held for the full duration of the operation; SHARE UPDATE EXCLUSIVE, which allows reads but blocks writes and other schema changes; and ACCESS SHARE, which is held by ordinary SELECT statements.
The critical behaviour: when a DDL operation holds ACCESS EXCLUSIVE, any query arriving after it queues behind it — including plain reads. Those queued queries hold their own locks, which block later queries behind them. A five-minute table rewrite does not just slow down writes. It queues your reads, which queues the queries behind those, and your application appears to be down. The DDL finishes. Your database is still recovering.
A reference map: DDL operations by lock and impact
These are the operations that come up most often in application development. The lock column shows what Postgres acquires. The duration column shows what actually determines how long that lock is held.
| Operation | Lock mode | Duration on a large table | Zero-downtime safe? |
|---|---|---|---|
| ADD COLUMN (no default) | ACCESS EXCLUSIVE | Instant (metadata only) | Yes |
| ADD COLUMN (constant default, Postgres 11+) | ACCESS EXCLUSIVE | Instant (catalog-stored) | Yes |
| ADD COLUMN (volatile default e.g. now()) | ACCESS EXCLUSIVE | Full table rewrite | No |
| DROP COLUMN | ACCESS EXCLUSIVE | Instant (marked as dropped) | Yes |
| RENAME COLUMN | ACCESS EXCLUSIVE | Instant; breaks app code | No* |
| SET NOT NULL | ACCESS EXCLUSIVE | Full table scan | No |
| ADD CONSTRAINT ... NOT VALID | SHARE UPDATE EXCLUSIVE | Instant (deferred) | Yes |
| VALIDATE CONSTRAINT | SHARE UPDATE EXCLUSIVE | Full scan; reads/writes continue | Better |
| CREATE INDEX | ACCESS EXCLUSIVE | Full table scan | No |
| CREATE INDEX CONCURRENTLY | SHARE UPDATE EXCLUSIVE | Full scan; reads/writes continue | Yes |
| DROP INDEX CONCURRENTLY | SHARE UPDATE EXCLUSIVE | Instant (clears catalog) | Yes |
| ALTER COLUMN TYPE | ACCESS EXCLUSIVE | Full table rewrite | No |
The Postgres 11 change to constant defaults is worth emphasising. Before Postgres 11, adding any column with a non-null default required rewriting the entire table. Since Postgres 11, a default that is a compile-time constant (a string literal, an integer, or a boolean) is stored in the system catalog and applied at read time. ALTER TABLE orders ADD COLUMN status TEXT NOT NULL DEFAULT 'pending' on a 100M-row table is now instant. Many migration guides written before 2019 tell you to always add columns as nullable and backfill separately. That advice was correct on Postgres 10 and below. For constant defaults on Postgres 11+, it is unnecessary.
The four DDL patterns that will block production traffic
These four operations consistently catch teams off guard, because they are instant in development on small tables and catastrophic in production on large ones:
- Adding a NOT NULL column with a volatile default. The default forces a full table rewrite. ALTER TABLE orders ADD COLUMN created_at TIMESTAMPTZ NOT NULL DEFAULT now() rewrites every row. The fix: add as nullable, backfill, then apply the constraint.
- Creating a regular index. CREATE INDEX holds ACCESS EXCLUSIVE for the duration of the full scan — typically 10–30 minutes on a table with 50M rows and multiple secondary indexes. Always use CREATE INDEX CONCURRENTLY instead. The same applies to dropping: DROP INDEX CONCURRENTLY.
- Changing a column type. ALTER COLUMN col TYPE bigint rewrites the entire table, even when the storage types are compatible. The only safe path for a live table is the expand-contract approach: add a new column with the target type, migrate data into it, remove the old column.
- Renaming a column. The DDL is instant at the Postgres level, but any application code referencing the old name breaks the moment the rename commits. In a rolling deployment where old and new code run simultaneously, this causes immediate failures. The safe path requires three separate deploys (details in the column-rename section below).
The expand-contract pattern for zero-downtime migrations
The expand-contract pattern breaks a dangerous schema change into three phases: expand (add the new thing alongside the old), migrate (transition data and code), contract (remove the old thing). Each phase is independently deployable. No phase takes a long-duration exclusive lock.
Here is the complete sequence for the most common dangerous migration: adding a NOT NULL column with a backfill value to an existing large table.
-- Phase 1: EXPAND
-- Add the new column as nullable. Instant on any table size.
ALTER TABLE orders ADD COLUMN new_status TEXT;
-- Deploy application code that writes new_status on every INSERT/UPDATE.
-- Existing rows still have new_status = NULL at this point.
-- Phase 2: MIGRATE
-- Backfill in batches (see next section for the full loop pattern).
-- After all rows are backfilled:
-- Add the constraint NOT VALID — SHARE UPDATE EXCLUSIVE, instant.
ALTER TABLE orders
ADD CONSTRAINT orders_new_status_not_null
CHECK (new_status IS NOT NULL) NOT VALID;
-- Validate it. SHARE UPDATE EXCLUSIVE throughout — reads and writes continue.
ALTER TABLE orders VALIDATE CONSTRAINT orders_new_status_not_null;
-- Phase 3: CONTRACT
-- SET NOT NULL is now instant: Postgres 12+ skips the table scan
-- when a validated CHECK constraint already exists for this column.
ALTER TABLE orders ALTER COLUMN new_status SET NOT NULL;
ALTER TABLE orders DROP CONSTRAINT orders_new_status_not_null;The same pattern applies to column type changes: add a new column with the target type, migrate data into it, update application code to read from the new column, then remove the old. For index replacements: CREATE INDEX CONCURRENTLY first, confirm it is built, then DROP INDEX CONCURRENTLY on the old one.
Backfilling without degrading your database
Running UPDATE orders SET new_status = 'pending' WHERE new_status IS NULL as a single statement scans the entire table in one transaction and holds row-level locks across all modified rows. On a 100M-row table this may run for 20–60 minutes while heavily contending with normal write traffic.
The right approach is batched updates with an explicit pause between batches. Five thousand rows per batch with a 100ms pause gives roughly 50,000 rows per second under moderate load without crowding out reads or writes.
DO $$
DECLARE
batch_size INT := 5000;
last_id BIGINT := 0;
max_id BIGINT;
rows_updated INT;
BEGIN
SELECT COALESCE(MAX(id), 0) INTO max_id FROM orders;
LOOP
EXIT WHEN last_id >= max_id;
UPDATE orders
SET new_status = 'pending'
WHERE id > last_id
AND id <= last_id + batch_size
AND new_status IS NULL;
GET DIAGNOSTICS rows_updated = ROW_COUNT;
last_id := last_id + batch_size;
-- Pause lets other queries run between batches.
PERFORM pg_sleep(0.1);
RAISE NOTICE 'Progress: up to id %, updated % rows', last_id, rows_updated;
END LOOP;
END $$;Lower the batch size to 1,000 on tables with many secondary indexes or heavyweight triggers — each update rewrites index entries for every modified row, and large batches hold more index locks simultaneously. The RAISE NOTICE line is useful for monitoring progress; suppress it if your execution environment swallows it.
For tables above 500M rows, run the backfill from your application layer rather than a single SQL session. Store the last-processed ID in a side table or environment variable. This gives you progress tracking, the ability to pause and resume, and graceful cancellation without losing work already done.
The column-rename trap
Column renames are instant at the Postgres level — ACCESS EXCLUSIVE for milliseconds. They are dangerous at the application level. Any code referencing the old column name breaks the moment the rename commits. In a rolling deployment where old and new application versions run simultaneously, a rename causes immediate failures in the old version. There is no way to make this safe in a single step.
Safe column rename across three production deploys, with no maintenance window:
- Add the new column alongside the old. ALTER TABLE t ADD COLUMN new_name TEXT. Instant.
- Deploy application code that reads from the old column but writes to both old and new on every INSERT/UPDATE.
- Backfill the new column: UPDATE t SET new_name = old_name WHERE new_name IS NULL. Batched, as above.
- Deploy application code that reads from the new column (still writing to both). Confirm via query logs that no reads are hitting the old column.
- Deploy application code that reads and writes only the new column.
- Drop the old column. ALTER TABLE t DROP COLUMN old_name. Instant.
That is three production deploys. For teams with CI/CD and feature flags, this is a few days of work spread across branches. For teams with weekly deploy cycles, it becomes a multi-week migration. Either way, the multi-step process is not optional for zero-downtime.
“If a schema change looks like it can be done safely in one step, it almost certainly can't.”
When to use migration tooling
For most product engineering teams doing occasional schema changes, CREATE INDEX CONCURRENTLY, ADD CONSTRAINT ... NOT VALID plus VALIDATE CONSTRAINT, and batched backfills cover the majority of cases. The patterns in this article are sufficient. The Postgres documentation is the authoritative reference.
For teams doing frequent schema changes across many concurrent feature branches (high-velocity product development, multi-tenant SaaS with per-tenant schema additions, or database-per-tenant architectures), pgroll (open-source, from Xata) automates the expand-contract lifecycle and enables serving two schema versions simultaneously during a deployment. This is useful specifically when five features are mid-migration at once and tracking the expand-contract state manually becomes error-prone.
For teams coming from MySQL: gh-ost performs online schema changes via change-data-capture and is MySQL-only. The Postgres equivalents (pglogical, pg-osc, Debezium-based CDC pipelines) add significant operational complexity and are generally only worth it at table sizes above a billion rows where even CONCURRENTLY operations take too long. Below that threshold, the expand-contract pattern is the right answer.
The pattern behind the patterns
Every zero-downtime schema change follows the same logic: decompose the dangerous operation into steps where each individual step either takes a short lock or takes a lock that allows reads. The specific steps differ by operation type, but the reasoning is consistent. If a migration looks scary, it can almost always be broken into something that isn't. The cost is additional deploys and careful coordination — and that is a known, bounded cost, not a surprise at 2am.
Frequently asked questions
Related reading
Postgres or MySQL in 2026: the answer is almost always Postgres, but here's when it isn't
In 2026 Postgres is the right default for almost every new project. But three specific workloads still favour MySQL — and migrating an existing codebase is rarely worth it without a concrete pain point.
DuckDB in your SaaS stack: what it replaces, what it doesn't, and the multi-tenant pattern that holds up
DuckDB can replace a cloud data warehouse for per-customer SaaS analytics below 100 GB per customer. The multi-tenant part requires one specific pattern, and it is not the one most tutorials show.
Rate limiting in production: the four algorithms and their failure modes
Most services reach for a token bucket and never look further. Rate limiting is four distinct algorithms with different burst behaviours and failure modes — here is what each one actually protects against.