Why does ALTER TABLE feel instant in development but cause an outage in production?

In development, tables have hundreds or thousands of rows. An ACCESS EXCLUSIVE lock on a small table completes in milliseconds. In production with millions of rows, the same operation takes seconds or minutes while holding a lock that queues every subsequent query — including reads — behind it.

Is it safe to add a column with a default value in Postgres 11+?

Yes, for constant defaults — string literals, integers, booleans. The operation is instant in Postgres 11+: the default is stored in the system catalog and applied at read time. For volatile defaults like now() or gen_random_uuid(), Postgres must write the value to every row, which requires a full table rewrite.

What does NOT VALID do in ADD CONSTRAINT and when should I use it?

NOT VALID adds the constraint without scanning existing rows — it only enforces the constraint on rows inserted or updated after creation. You then run VALIDATE CONSTRAINT separately, which scans all rows but holds only SHARE UPDATE EXCLUSIVE (allowing reads and writes). Together they replace a single long ACCESS EXCLUSIVE lock with two shorter, less disruptive operations.

How do I safely track progress on a long-running backfill?

Store the last-processed ID in a side table or environment variable at the end of each batch. This lets you resume a failed backfill without reprocessing rows and gives a progress percentage (last_id / max_id). For very long backfills, log progress and alert if the backfill stalls or the row update rate drops unexpectedly.

EngineeringMay 29, 20266 min readReviewed May 29, 2026

Zero-downtime Postgres schema migrations: what every DDL operation does under the hood

A reference map of lock types, unsafe patterns, and the expand-contract technique for large tables

By FlowVerify Editorial Team

Postgres schema migrations are the most reliable source of self-inflicted downtime in backend systems. The same DDL operations that complete in milliseconds during development can hold an ACCESS EXCLUSIVE lock for minutes on a production table with tens of millions of rows — not because the operations are complex, but because the table is large.

This article is a reference. It covers the lock each common DDL operation takes, which operations are instant regardless of table size, and the three patterns you need for the ones that aren't. The goal is to walk into any schema change knowing exactly what will happen before you run it.

The Postgres lock model, in one minute

Every DDL operation acquires a table-level lock. Postgres has eight lock modes. For schema changes, the three that matter are: ACCESS EXCLUSIVE, which blocks all reads and writes and is held for the full duration of the operation; SHARE UPDATE EXCLUSIVE, which allows reads but blocks writes and other schema changes; and ACCESS SHARE, which is held by ordinary SELECT statements.

The critical behaviour: when a DDL operation holds ACCESS EXCLUSIVE, any query arriving after it queues behind it — including plain reads. Those queued queries hold their own locks, which block later queries behind them. A five-minute table rewrite does not just slow down writes. It queues your reads, which queues the queries behind those, and your application appears to be down. The DDL finishes. Your database is still recovering.

A reference map: DDL operations by lock and impact

These are the operations that come up most often in application development. The lock column shows what Postgres acquires. The duration column shows what actually determines how long that lock is held.

Operation	Lock mode	Duration on a large table	Zero-downtime safe?
ADD COLUMN (no default)	ACCESS EXCLUSIVE	Instant (metadata only)	Yes
ADD COLUMN (constant default, Postgres 11+)	ACCESS EXCLUSIVE	Instant (catalog-stored)	Yes
ADD COLUMN (volatile default e.g. now())	ACCESS EXCLUSIVE	Full table rewrite	No
DROP COLUMN	ACCESS EXCLUSIVE	Instant (marked as dropped)	Yes
RENAME COLUMN	ACCESS EXCLUSIVE	Instant; breaks app code	No*
SET NOT NULL	ACCESS EXCLUSIVE	Full table scan	No
ADD CONSTRAINT ... NOT VALID	SHARE UPDATE EXCLUSIVE	Instant (deferred)	Yes
VALIDATE CONSTRAINT	SHARE UPDATE EXCLUSIVE	Full scan; reads/writes continue	Better
CREATE INDEX	ACCESS EXCLUSIVE	Full table scan	No
CREATE INDEX CONCURRENTLY	SHARE UPDATE EXCLUSIVE	Full scan; reads/writes continue	Yes
DROP INDEX CONCURRENTLY	SHARE UPDATE EXCLUSIVE	Instant (clears catalog)	Yes
ALTER COLUMN TYPE	ACCESS EXCLUSIVE	Full table rewrite	No

Common Postgres DDL operations: lock mode and production safety

The Postgres 11 change to constant defaults is worth emphasising. Before Postgres 11, adding any column with a non-null default required rewriting the entire table. Since Postgres 11, a default that is a compile-time constant (a string literal, an integer, or a boolean) is stored in the system catalog and applied at read time. ALTER TABLE orders ADD COLUMN status TEXT NOT NULL DEFAULT 'pending' on a 100M-row table is now instant. Many migration guides written before 2019 tell you to always add columns as nullable and backfill separately. That advice was correct on Postgres 10 and below. For constant defaults on Postgres 11+, it is unnecessary.

The four DDL patterns that will block production traffic

These four operations consistently catch teams off guard, because they are instant in development on small tables and catastrophic in production on large ones:

Adding a NOT NULL column with a volatile default. The default forces a full table rewrite. ALTER TABLE orders ADD COLUMN created_at TIMESTAMPTZ NOT NULL DEFAULT now() rewrites every row. The fix: add as nullable, backfill, then apply the constraint.
Creating a regular index. CREATE INDEX holds ACCESS EXCLUSIVE for the duration of the full scan — typically 10–30 minutes on a table with 50M rows and multiple secondary indexes. Always use CREATE INDEX CONCURRENTLY instead. The same applies to dropping: DROP INDEX CONCURRENTLY.
Changing a column type. ALTER COLUMN col TYPE bigint rewrites the entire table, even when the storage types are compatible. The only safe path for a live table is the expand-contract approach: add a new column with the target type, migrate data into it, remove the old column.
Renaming a column. The DDL is instant at the Postgres level, but any application code referencing the old name breaks the moment the rename commits. In a rolling deployment where old and new code run simultaneously, this causes immediate failures. The safe path requires three separate deploys (details in the column-rename section below).

The expand-contract pattern for zero-downtime migrations

The expand-contract pattern breaks a dangerous schema change into three phases: expand (add the new thing alongside the old), migrate (transition data and code), contract (remove the old thing). Each phase is independently deployable. No phase takes a long-duration exclusive lock.

Here is the complete sequence for the most common dangerous migration: adding a NOT NULL column with a backfill value to an existing large table.

not-null-migration.sql

-- Phase 1: EXPAND
-- Add the new column as nullable. Instant on any table size.
ALTER TABLE orders ADD COLUMN new_status TEXT;

-- Deploy application code that writes new_status on every INSERT/UPDATE.
-- Existing rows still have new_status = NULL at this point.

-- Phase 2: MIGRATE
-- Backfill in batches (see next section for the full loop pattern).
-- After all rows are backfilled:

-- Add the constraint NOT VALID — SHARE UPDATE EXCLUSIVE, instant.
ALTER TABLE orders
  ADD CONSTRAINT orders_new_status_not_null
  CHECK (new_status IS NOT NULL) NOT VALID;

-- Validate it. SHARE UPDATE EXCLUSIVE throughout — reads and writes continue.
ALTER TABLE orders VALIDATE CONSTRAINT orders_new_status_not_null;

-- Phase 3: CONTRACT
-- SET NOT NULL is now instant: Postgres 12+ skips the table scan
-- when a validated CHECK constraint already exists for this column.
ALTER TABLE orders ALTER COLUMN new_status SET NOT NULL;
ALTER TABLE orders DROP CONSTRAINT orders_new_status_not_null;

The same pattern applies to column type changes: add a new column with the target type, migrate data into it, update application code to read from the new column, then remove the old. For index replacements: CREATE INDEX CONCURRENTLY first, confirm it is built, then DROP INDEX CONCURRENTLY on the old one.

Backfilling without degrading your database

Running UPDATE orders SET new_status = 'pending' WHERE new_status IS NULL as a single statement scans the entire table in one transaction and holds row-level locks across all modified rows. On a 100M-row table this may run for 20–60 minutes while heavily contending with normal write traffic.

The right approach is batched updates with an explicit pause between batches. Five thousand rows per batch with a 100ms pause gives roughly 50,000 rows per second under moderate load without crowding out reads or writes.

batched-backfill.sql

DO $$
DECLARE
  batch_size   INT    := 5000;
  last_id      BIGINT := 0;
  max_id       BIGINT;
  rows_updated INT;
BEGIN
  SELECT COALESCE(MAX(id), 0) INTO max_id FROM orders;

  LOOP
    EXIT WHEN last_id >= max_id;

    UPDATE orders
      SET new_status = 'pending'
      WHERE id > last_id
        AND id <= last_id + batch_size
        AND new_status IS NULL;

    GET DIAGNOSTICS rows_updated = ROW_COUNT;
    last_id := last_id + batch_size;

    -- Pause lets other queries run between batches.
    PERFORM pg_sleep(0.1);

    RAISE NOTICE 'Progress: up to id %, updated % rows', last_id, rows_updated;
  END LOOP;
END $$;

Lower the batch size to 1,000 on tables with many secondary indexes or heavyweight triggers — each update rewrites index entries for every modified row, and large batches hold more index locks simultaneously. The RAISE NOTICE line is useful for monitoring progress; suppress it if your execution environment swallows it.

For tables above 500M rows, run the backfill from your application layer rather than a single SQL session. Store the last-processed ID in a side table or environment variable. This gives you progress tracking, the ability to pause and resume, and graceful cancellation without losing work already done.

The column-rename trap

Column renames are instant at the Postgres level — ACCESS EXCLUSIVE for milliseconds. They are dangerous at the application level. Any code referencing the old column name breaks the moment the rename commits. In a rolling deployment where old and new application versions run simultaneously, a rename causes immediate failures in the old version. There is no way to make this safe in a single step.

Safe column rename across three production deploys, with no maintenance window:

Add the new column alongside the old. ALTER TABLE t ADD COLUMN new_name TEXT. Instant.
Deploy application code that reads from the old column but writes to both old and new on every INSERT/UPDATE.
Backfill the new column: UPDATE t SET new_name = old_name WHERE new_name IS NULL. Batched, as above.
Deploy application code that reads from the new column (still writing to both). Confirm via query logs that no reads are hitting the old column.
Deploy application code that reads and writes only the new column.
Drop the old column. ALTER TABLE t DROP COLUMN old_name. Instant.

That is three production deploys. For teams with CI/CD and feature flags, this is a few days of work spread across branches. For teams with weekly deploy cycles, it becomes a multi-week migration. Either way, the multi-step process is not optional for zero-downtime.

“If a schema change looks like it can be done safely in one step, it almost certainly can't.”

— FlowVerify Engineering

When to use migration tooling

For most product engineering teams doing occasional schema changes, CREATE INDEX CONCURRENTLY, ADD CONSTRAINT ... NOT VALID plus VALIDATE CONSTRAINT, and batched backfills cover the majority of cases. The patterns in this article are sufficient. The Postgres documentation is the authoritative reference.

For teams doing frequent schema changes across many concurrent feature branches (high-velocity product development, multi-tenant SaaS with per-tenant schema additions, or database-per-tenant architectures), pgroll (open-source, from Xata) automates the expand-contract lifecycle and enables serving two schema versions simultaneously during a deployment. This is useful specifically when five features are mid-migration at once and tracking the expand-contract state manually becomes error-prone.

For teams coming from MySQL: gh-ost performs online schema changes via change-data-capture and is MySQL-only. The Postgres equivalents (pglogical, pg-osc, Debezium-based CDC pipelines) add significant operational complexity and are generally only worth it at table sizes above a billion rows where even CONCURRENTLY operations take too long. Below that threshold, the expand-contract pattern is the right answer.

The pattern behind the patterns

Every zero-downtime schema change follows the same logic: decompose the dangerous operation into steps where each individual step either takes a short lock or takes a lock that allows reads. The specific steps differ by operation type, but the reasoning is consistent. If a migration looks scary, it can almost always be broken into something that isn't. The cost is additional deploys and careful coordination — and that is a known, bounded cost, not a surprise at 2am.

Frequently asked questions

Reddit's zero-downtime migration of 500 Kafka brokers wasn't about Kafka. It was three reusable techniques.

Reddit moved 500+ Kafka brokers and a petabyte of live data from EC2 to Kubernetes with zero downtime. The three techniques behind it aren't specific to Kafka.

Jul 8, 2026Read full article →

EngineeringMay 29, 20266 min readReviewed May 29, 2026

Zero-downtime Postgres schema migrations: what every DDL operation does under the hood

A reference map of lock types, unsafe patterns, and the expand-contract technique for large tables

By FlowVerify Editorial Team

The Postgres lock model, in one minute

A reference map: DDL operations by lock and impact

Operation	Lock mode	Duration on a large table	Zero-downtime safe?
ADD COLUMN (no default)	ACCESS EXCLUSIVE	Instant (metadata only)	Yes
ADD COLUMN (constant default, Postgres 11+)	ACCESS EXCLUSIVE	Instant (catalog-stored)	Yes
ADD COLUMN (volatile default e.g. now())	ACCESS EXCLUSIVE	Full table rewrite	No
DROP COLUMN	ACCESS EXCLUSIVE	Instant (marked as dropped)	Yes
RENAME COLUMN	ACCESS EXCLUSIVE	Instant; breaks app code	No*
SET NOT NULL	ACCESS EXCLUSIVE	Full table scan	No
ADD CONSTRAINT ... NOT VALID	SHARE UPDATE EXCLUSIVE	Instant (deferred)	Yes
VALIDATE CONSTRAINT	SHARE UPDATE EXCLUSIVE	Full scan; reads/writes continue	Better
CREATE INDEX	ACCESS EXCLUSIVE	Full table scan	No
CREATE INDEX CONCURRENTLY	SHARE UPDATE EXCLUSIVE	Full scan; reads/writes continue	Yes
DROP INDEX CONCURRENTLY	SHARE UPDATE EXCLUSIVE	Instant (clears catalog)	Yes
ALTER COLUMN TYPE	ACCESS EXCLUSIVE	Full table rewrite	No

Common Postgres DDL operations: lock mode and production safety

The four DDL patterns that will block production traffic

These four operations consistently catch teams off guard, because they are instant in development on small tables and catastrophic in production on large ones:

Adding a NOT NULL column with a volatile default. The default forces a full table rewrite. ALTER TABLE orders ADD COLUMN created_at TIMESTAMPTZ NOT NULL DEFAULT now() rewrites every row. The fix: add as nullable, backfill, then apply the constraint.
Creating a regular index. CREATE INDEX holds ACCESS EXCLUSIVE for the duration of the full scan — typically 10–30 minutes on a table with 50M rows and multiple secondary indexes. Always use CREATE INDEX CONCURRENTLY instead. The same applies to dropping: DROP INDEX CONCURRENTLY.
Changing a column type. ALTER COLUMN col TYPE bigint rewrites the entire table, even when the storage types are compatible. The only safe path for a live table is the expand-contract approach: add a new column with the target type, migrate data into it, remove the old column.
Renaming a column. The DDL is instant at the Postgres level, but any application code referencing the old name breaks the moment the rename commits. In a rolling deployment where old and new code run simultaneously, this causes immediate failures. The safe path requires three separate deploys (details in the column-rename section below).

The expand-contract pattern for zero-downtime migrations

Here is the complete sequence for the most common dangerous migration: adding a NOT NULL column with a backfill value to an existing large table.

not-null-migration.sql

-- Phase 1: EXPAND
-- Add the new column as nullable. Instant on any table size.
ALTER TABLE orders ADD COLUMN new_status TEXT;

-- Deploy application code that writes new_status on every INSERT/UPDATE.
-- Existing rows still have new_status = NULL at this point.

-- Phase 2: MIGRATE
-- Backfill in batches (see next section for the full loop pattern).
-- After all rows are backfilled:

-- Add the constraint NOT VALID — SHARE UPDATE EXCLUSIVE, instant.
ALTER TABLE orders
  ADD CONSTRAINT orders_new_status_not_null
  CHECK (new_status IS NOT NULL) NOT VALID;

-- Validate it. SHARE UPDATE EXCLUSIVE throughout — reads and writes continue.
ALTER TABLE orders VALIDATE CONSTRAINT orders_new_status_not_null;

-- Phase 3: CONTRACT
-- SET NOT NULL is now instant: Postgres 12+ skips the table scan
-- when a validated CHECK constraint already exists for this column.
ALTER TABLE orders ALTER COLUMN new_status SET NOT NULL;
ALTER TABLE orders DROP CONSTRAINT orders_new_status_not_null;

Backfilling without degrading your database

batched-backfill.sql

DO $$
DECLARE
  batch_size   INT    := 5000;
  last_id      BIGINT := 0;
  max_id       BIGINT;
  rows_updated INT;
BEGIN
  SELECT COALESCE(MAX(id), 0) INTO max_id FROM orders;

  LOOP
    EXIT WHEN last_id >= max_id;

    UPDATE orders
      SET new_status = 'pending'
      WHERE id > last_id
        AND id <= last_id + batch_size
        AND new_status IS NULL;

    GET DIAGNOSTICS rows_updated = ROW_COUNT;
    last_id := last_id + batch_size;

    -- Pause lets other queries run between batches.
    PERFORM pg_sleep(0.1);

    RAISE NOTICE 'Progress: up to id %, updated % rows', last_id, rows_updated;
  END LOOP;
END $$;

The column-rename trap

Safe column rename across three production deploys, with no maintenance window:

Add the new column alongside the old. ALTER TABLE t ADD COLUMN new_name TEXT. Instant.
Deploy application code that reads from the old column but writes to both old and new on every INSERT/UPDATE.
Backfill the new column: UPDATE t SET new_name = old_name WHERE new_name IS NULL. Batched, as above.
Deploy application code that reads from the new column (still writing to both). Confirm via query logs that no reads are hitting the old column.
Deploy application code that reads and writes only the new column.
Drop the old column. ALTER TABLE t DROP COLUMN old_name. Instant.

“If a schema change looks like it can be done safely in one step, it almost certainly can't.”

— FlowVerify Engineering

Zero-downtime Postgres schema migrations: what every DDL operation does under the hood

The Postgres lock model, in one minute

A reference map: DDL operations by lock and impact

The four DDL patterns that will block production traffic

The expand-contract pattern for zero-downtime migrations

Backfilling without degrading your database

The column-rename trap

When to use migration tooling

The pattern behind the patterns

Frequently asked questions

Related reading

Reddit's zero-downtime migration of 500 Kafka brokers wasn't about Kafka. It was three reusable techniques.

CRDTs vs OT is a solved question in 2026. Where you draw the sync boundary is not.

Railway disconnected a carrier to contain an outage. It cut its last route instead.

Stay ahead on eSignatures, compliance, and document workflows

Reddit's zero-downtime migration of 500 Kafka brokers wasn't about Kafka. It was three reusable techniques.

Zero-downtime Postgres schema migrations: what every DDL operation does under the hood

The Postgres lock model, in one minute

A reference map: DDL operations by lock and impact

The four DDL patterns that will block production traffic

The expand-contract pattern for zero-downtime migrations

Backfilling without degrading your database

The column-rename trap

When to use migration tooling

The pattern behind the patterns

Frequently asked questions

Related reading

Reddit's zero-downtime migration of 500 Kafka brokers wasn't about Kafka. It was three reusable techniques.

CRDTs vs OT is a solved question in 2026. Where you draw the sync boundary is not.

Railway disconnected a carrier to contain an outage. It cut its last route instead.

Stay ahead on eSignatures, compliance, and document workflows

Reddit's zero-downtime migration of 500 Kafka brokers wasn't about Kafka. It was three reusable techniques.