Idempotency keys done right: the schema, the race, and the zombie problem
A complete Postgres implementation covering five states, concurrent-request locking, crash recovery, and TTL cleanup
The Stripe engineering blog introduced most backend engineers to idempotency keys. The pattern is clean: the client attaches a unique key to each request; the server stores that key with the result; duplicate requests get the cached result instead of re-execution. Safe retries, no double charges.
What that post doesn't cover: the exact schema, what state a key is in while the request is processing, what happens when two identical requests arrive within the same millisecond, and what to do when the worker crashes halfway through. These are not edge cases. They happen in production before they happen in tests. Here is the complete implementation.
What an idempotency key does (and what it cannot do)
An idempotency key is a client-supplied identifier that scopes a request to a unique intent. The server uses it to cache the response. If the client retries with the same key, the server checks the cache first and returns the stored result rather than processing the request again.
Two things an idempotency key does not do. First, it does not make the underlying operation safe. If your payment code charges a card twice, an idempotency key won't prevent the second charge; it just caches whatever result your code produces. The operation itself still needs to be correct. Second, it does not handle concurrent requests well by default. The naive implementation (check whether the key exists, insert if not) has a race condition that surfaces in production under modest load.
The key belongs to the client. UUIDs (v4 or v7) are standard. The server treats the key as opaque: never generate one on behalf of a client, never reuse one across requests. Clients that lose their generated key and retry with a new one should expect to be charged or processed twice. That is correct behaviour from the server's perspective.
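To make the client side concrete, here is a minimal Go sketch under stated assumptions: the endpoint URL, the `Idempotency-Key` header name, and the three-attempt retry policy are illustrative choices, not prescribed by anything above.

```go
package client

import (
	"bytes"
	"fmt"
	"net/http"

	"github.com/google/uuid"
)

// createPayment generates one key per intent and reuses it verbatim on
// every retry. Losing the key and generating a new one is a new intent.
func createPayment(body []byte) (*http.Response, error) {
	key := uuid.NewString() // generated once, before the first attempt

	var lastErr error
	for attempt := 0; attempt < 3; attempt++ {
		req, err := http.NewRequest(http.MethodPost,
			"https://api.example.com/payments", bytes.NewReader(body))
		if err != nil {
			return nil, err
		}
		req.Header.Set("Idempotency-Key", key) // same key on every retry
		req.Header.Set("Content-Type", "application/json")

		resp, err := http.DefaultClient.Do(req)
		if err != nil {
			lastErr = err
			continue
		}
		if resp.StatusCode < 500 {
			return resp, nil // includes 202: the caller polls, never re-keys
		}
		resp.Body.Close()
		lastErr = fmt.Errorf("server returned %d", resp.StatusCode)
	}
	return nil, fmt.Errorf("create payment: %w", lastErr)
}
```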
The schema: what each column earns
Most implementations start with two columns: the key and the cached response. The schema below has more, and each column justifies itself.
```sql
CREATE TABLE idempotency_keys (
    id            uuid PRIMARY KEY DEFAULT gen_random_uuid(),
    key           text NOT NULL,
    endpoint      text NOT NULL,
    status        text NOT NULL DEFAULT 'pending'
                  CHECK (status IN ('pending', 'processing', 'complete', 'failed')),
    request_hash  text NOT NULL,
    response_code int,
    response_body jsonb,
    created_at    timestamptz NOT NULL DEFAULT now(),
    locked_at     timestamptz,
    finished_at   timestamptz,
    expires_at    timestamptz NOT NULL DEFAULT now() + INTERVAL '24 hours'
);

CREATE UNIQUE INDEX uq_idempotency_key_endpoint
    ON idempotency_keys (key, endpoint);

CREATE INDEX idx_idempotency_cleanup
    ON idempotency_keys (expires_at)
    WHERE status IN ('complete', 'failed');

CREATE INDEX idx_idempotency_stuck
    ON idempotency_keys (locked_at)
    WHERE status = 'processing';
```

The unique index is on (key, endpoint), not key alone. A key is meaningful only within the scope of a specific operation. A client that generates one UUID and reuses it on two different endpoints should get independent results for each, not a cross-endpoint collision. This matches how Stripe scopes keys to resource types.
request_hash is a SHA-256 (or HMAC-SHA-256) of the request body. If a duplicate key arrives with a different body, the server returns 422 Unprocessable Entity rather than silently replying with a cached result from a different call. Omitting this check is the most common mistake in idempotency implementations: a client bug reusing a key across different requests gets a silently wrong response.
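Computing the hash is a one-liner. A minimal sketch, assuming a SHA-256 of the raw body stored as lowercase hex; the package and function names are mine:

```go
package idem

import (
	"crypto/sha256"
	"encoding/hex"
)

// RequestHash returns the SHA-256 of the raw request body as lowercase
// hex, matching what the handler stores in the request_hash column.
func RequestHash(body []byte) string {
	sum := sha256.Sum256(body)
	return hex.EncodeToString(sum[:])
}
```

On a conflict, compare the stored hash against the incoming one before returning anything cached; a mismatch is the 422 case.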
locked_at tracks when a worker claimed the key. The partial index on locked_at WHERE status = 'processing' powers the zombie-key sweeper described later; it stays small and cheap because it only indexes actively-claimed rows.
expires_at is set at creation and controls eviction. It is separate from created_at because cache lifetime is policy, not history: the retention window can differ per endpoint and can be extended at completion time, while created_at never changes. Auditing and eviction are different problems with different columns.
Five states, not two
Treating idempotency as binary (exists or it doesn't) obscures the real state machine. An idempotency key moves through five distinct states, each with different semantics for what the server should return to the client.
| Status | Meaning | Next state | Client receives |
|---|---|---|---|
| pending | Key inserted; worker not yet claimed it | processing | 202 Accepted + Retry-After |
| processing | Worker has claimed the key and is running | complete or failed | 202 Accepted + Retry-After |
| complete | Operation finished; response stored | (terminal) | Cached response code and body |
| failed | Non-retryable error stored | (terminal) | Cached error response |
| expired | Past expires_at; eviction candidate | (terminal) | 404 Not Found or 410 Gone |
One note on the table: expired is not a stored status. The CHECK constraint allows only four values; expiry is derived by comparing expires_at to now() at read time. The key transition is pending to processing. A worker claims the key by running an UPDATE that sets status to 'processing' and stamps locked_at. This UPDATE is the lock: because it matches only rows still in 'pending', at most one of two racing workers gets the row back, and the other's UPDATE affects nothing. After the operation finishes, one final UPDATE stores the response code and body, and sets the terminal status.
```sql
-- Claim the key before doing any work
UPDATE idempotency_keys
SET status = 'processing', locked_at = now()
WHERE id = $1
  AND status = 'pending'
RETURNING *;

-- After work is done, store the result
UPDATE idempotency_keys
SET status = 'complete',
    response_code = $2,
    response_body = $3,
    finished_at = now()
WHERE id = $1;
```

When a client retries and the key is in pending or processing state, the server cannot return the cached result; it does not exist yet. Return 202 Accepted with a Retry-After header. Returning an error here would punish a client for doing the right thing.
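As a concrete shape for that dispatch, here is a hedged Go sketch. KeyRow and its fields are illustrative names for the row the queries above return, not types defined anywhere in this post:

```go
package idem

import "net/http"

// KeyRow mirrors the columns the handler needs from idempotency_keys.
type KeyRow struct {
	ID           string
	Status       string
	RequestHash  string
	ResponseCode int
	ResponseBody []byte
}

// Respond maps a stored key's status to the HTTP response in the state
// table: the cached result for terminal states, 202 plus Retry-After
// while work is still in flight.
func Respond(w http.ResponseWriter, row KeyRow) {
	switch row.Status {
	case "complete", "failed":
		w.Header().Set("Content-Type", "application/json")
		w.WriteHeader(row.ResponseCode)
		w.Write(row.ResponseBody)
	case "pending", "processing":
		w.Header().Set("Retry-After", "5") // seconds; tune to typical latency
		w.WriteHeader(http.StatusAccepted)
	}
}
```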
The race you're probably not handling
Two retries arrive with the same key within the same millisecond, before either has been processed. Each handler reads the keys table; neither finds the row, and both attempt to insert. One wins; the other gets a unique-constraint violation that crashes the request handler. That client receives a 500, which looks identical to a real server error.
The fix is to use the INSERT itself as the lock rather than a separate read:
```sql
INSERT INTO idempotency_keys
    (key, endpoint, status, request_hash)
VALUES
    ($1, $2, 'pending', $3)
ON CONFLICT (key, endpoint)
DO UPDATE SET key = EXCLUDED.key  -- no-op: activates RETURNING on conflict
RETURNING *;
```

The DO UPDATE SET key = EXCLUDED.key clause is intentionally a no-op (updating key to its own conflict value changes nothing in the row). But it activates the RETURNING clause. With ON CONFLICT DO NOTHING, Postgres returns an empty result set for conflicting rows and RETURNING has nothing to work with. With DO UPDATE, RETURNING gives back the existing row after the (no-op) update.
This makes the INSERT atomic. The first request inserts and gets back a fresh pending row. The second request hits the conflict path and gets back the same row. Both can inspect the status field and respond appropriately: no second read, no race window, no constraint violation crash.
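Wired through Go's database/sql, the whole exchange is one query. This sketch reuses the KeyRow type from above; the narrowed RETURNING list and the COALESCE on response_code are conveniences I am assuming, not part of the query as written:

```go
package idem

import (
	"context"
	"database/sql"
)

// upsertKey runs the atomic INSERT ... ON CONFLICT and returns the row,
// whether it was freshly inserted or already existed.
func upsertKey(ctx context.Context, db *sql.DB, key, endpoint, hash string) (KeyRow, error) {
	const q = `
		INSERT INTO idempotency_keys (key, endpoint, status, request_hash)
		VALUES ($1, $2, 'pending', $3)
		ON CONFLICT (key, endpoint)
		DO UPDATE SET key = EXCLUDED.key
		RETURNING id, status, request_hash,
		          COALESCE(response_code, 0), response_body`

	var row KeyRow
	err := db.QueryRowContext(ctx, q, key, endpoint, hash).Scan(
		&row.ID, &row.Status, &row.RequestHash,
		&row.ResponseCode, &row.ResponseBody)
	return row, err
}
```

The caller then branches exactly as described: a request_hash mismatch returns 422, a terminal status replays the cache, and a pending status proceeds to the claim UPDATE from the previous section.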
One consequence: the DO UPDATE path acquires a row-level lock until the transaction commits. If your operation takes 30 seconds, requests B, C, and D queue behind that lock for up to 30 seconds. For long-running operations, decouple the insert from the processing: insert synchronously, return 202, process asynchronously, let the client poll a status endpoint.
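A sketch of that decoupled shape, continuing the same assumed package and reusing the helpers above; the jobs channel stands in for whatever queue you already run:

```go
package idem

import (
	"context"
	"database/sql"
	"net/http"
)

var jobs = make(chan string, 1024) // hand-off to background workers

// handleSlowOp commits the insert in its own short transaction, hands the
// key id to a worker, and replies 202 so the client can poll for status.
func handleSlowOp(ctx context.Context, db *sql.DB, w http.ResponseWriter,
	key, endpoint string, body []byte) {
	row, err := upsertKey(ctx, db, key, endpoint, RequestHash(body))
	if err != nil {
		http.Error(w, "internal error", http.StatusInternalServerError)
		return
	}
	if row.Status == "complete" || row.Status == "failed" {
		Respond(w, row) // cached result: no new work
		return
	}
	if row.Status == "pending" {
		jobs <- row.ID // a worker claims the key and processes it off-request
	}
	w.Header().Set("Retry-After", "5")
	w.WriteHeader(http.StatusAccepted) // insert committed; client polls
}
```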
The zombie-key problem
The worker inserts the key as pending. It claims it by setting status to 'processing' and stamping locked_at. Then it dies: out of memory, a deploy restart, a network timeout that kills the goroutine mid-flight. The key is stuck in processing state. locked_at is 20 minutes ago. Nothing is running.
The next retry arrives and sees status = 'processing'. It has no way to tell whether the original worker is still running or long dead. Without a sweeper, the key is stuck forever. Every subsequent retry returns 202 Accepted, the client keeps polling, nothing finishes.
The sweeper resets zombie keys back to pending:
```sql
UPDATE idempotency_keys
SET status = 'pending', locked_at = NULL
WHERE status = 'processing'
  AND locked_at < now() - INTERVAL '5 minutes'
RETURNING id, key, endpoint;
```

This resets any key that has been in processing state for longer than the timeout threshold. The partial index on locked_at WHERE status = 'processing' makes this fast even on a table with millions of completed rows; it only scans actively-claimed rows.
Set the timeout to at least 2 to 3 times your 99th-percentile operation time, not the mean. If legitimate operations regularly take 4 minutes and the 99th percentile is 7 minutes, a 5-minute timeout will incorrectly reset live keys under load. Instrument your operation durations before picking the number.
Run the sweeper on a regular schedule; every 60 seconds works for most APIs. Log every key it resets, and track the count as a metric. Each reset is a signal that a worker crashed; the sweeper fixes the stuck state, but the crash is still worth investigating.
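A sweeper can be as small as a ticker loop around that query. In this Go sketch the function name, interval wiring, and logging style are assumptions; hook the per-reset log line into whatever metrics pipeline you run:

```go
package idem

import (
	"context"
	"database/sql"
	"log"
	"time"
)

// RunSweeper re-runs the zombie-reset query on a fixed interval until the
// context is cancelled.
func RunSweeper(ctx context.Context, db *sql.DB, interval time.Duration) {
	const q = `
		UPDATE idempotency_keys
		SET status = 'pending', locked_at = NULL
		WHERE status = 'processing'
		  AND locked_at < now() - INTERVAL '5 minutes'
		RETURNING id, key, endpoint`

	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			rows, err := db.QueryContext(ctx, q)
			if err != nil {
				log.Printf("sweeper: query failed: %v", err)
				continue
			}
			// Every reset row is a crashed worker: log it, count it.
			for rows.Next() {
				var id, key, endpoint string
				if err := rows.Scan(&id, &key, &endpoint); err != nil {
					break
				}
				log.Printf("sweeper: reset zombie key %s (%s, %s)", id, endpoint, key)
			}
			rows.Close()
		}
	}
}
```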
TTL cleanup without evicting the wrong keys
The naive cleanup query is appealing because it's short:
```sql
-- Don't do this: deletes in-flight and still-useful keys
DELETE FROM idempotency_keys
WHERE created_at < now() - INTERVAL '24 hours';
```

It deletes pending and processing keys that are mid-flight. It also deletes completed keys before any late-arriving duplicate has had a chance to use the cached result. Neither outcome is what you want.
The correct approach separates two concerns: evicting completed records past their retention window, and expiring abandoned pending keys.
```sql
-- Target only terminal states past their retention window
DELETE FROM idempotency_keys
WHERE status IN ('complete', 'failed')
  AND expires_at < now()
RETURNING id, key;

-- Separately, expire abandoned pending keys
UPDATE idempotency_keys
SET status = 'failed',
    response_code = 408,
    response_body = '{"error":"timeout"}'::jsonb,
    finished_at = now()
WHERE status = 'pending'
  AND expires_at < now();
```

The expires_at column defines the cache lifetime. It is set at key creation. For most payment APIs, 24 hours covers the realistic retry window. For slow batch operations where clients might wait days before retrying, extend expires_at at completion time to match the documented retry window for that endpoint.
The idx_idempotency_cleanup partial index targets only complete and failed rows. The DELETE never scans pending or processing keys. On a table with 10 million rows, most of which are completed, this index keeps the cleanup query under 10 milliseconds and avoids a full table scan on every eviction run.
The second query converts old pending keys to a failed state before deletion. This preserves a terminal state for any client that still holds the key and polls after expiry. They see 'failed' with a timeout error rather than 404, which is more informative and easier to handle on the client side.
When Redis helps (and when it gets in the way)
The Redis pattern: before the Postgres INSERT, call SET idempotency:{key}:{endpoint} NX EX 86400. If the SET fails, the key exists in Redis, so short-circuit to the cached result. If it succeeds, proceed to Postgres. This adds a fast in-memory check that avoids a database round-trip for most duplicate requests at high volume.
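As a sketch of that fast path, assuming the go-redis client (any client with SET NX EX semantics works the same way):

```go
package idem

import (
	"context"
	"fmt"
	"time"

	"github.com/redis/go-redis/v9"
)

// redisFirstSeen returns true when this request is the first holder of
// the key and should proceed to the Postgres insert.
func redisFirstSeen(ctx context.Context, rdb *redis.Client, key, endpoint string) bool {
	k := fmt.Sprintf("idempotency:%s:%s", key, endpoint)
	ok, err := rdb.SetNX(ctx, k, "1", 24*time.Hour).Result()
	if err != nil {
		// Redis unavailable: fall through to the Postgres path, which
		// must be correct on its own anyway.
		return true
	}
	return ok
}
```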
The problem: you now have two systems that must agree. Redis TTL and Postgres expires_at need to stay in sync. If Redis evicts the key before Postgres does, a client gets a 404 from the fast path on one retry and the cached result from the slow path on the next. The behaviour is inconsistent from the same endpoint. If Redis is unavailable, you fall back to the Postgres path anyway, which means the Postgres implementation must be correct regardless of whether Redis is in front of it.
For most APIs, including those handling several thousand idempotent requests per second, the Postgres-only approach with the unique index is fast enough. The INSERT against a unique index is a single B-tree lookup plus a row write. Row-lock contention only affects concurrent duplicates, not normal traffic. Measure actual lock contention before adding Redis; adding it as a default is a reliability cost disguised as a performance optimisation.
Use Redis when you have measured Postgres index contention under load and it is a bottleneck. Not before.
Related reading
Every Postgres isolation level, and the specific bug it lets through
Three isolation levels, three distinct failure modes. Most Postgres deployments run at Read Committed without knowing it. Here is what each level permits and what upgrading actually costs.
LLM database access: the RBAC gap most teams don't see
Giving an LLM access to your database is easy. The problem is that your application-layer RBAC is invisible when the model generates SQL. Here's where it goes wrong and how to fix it at the layer that enforces.
Rate limiting in production: why the algorithm you chose is probably wrong for your workload
Most rate limiting failures aren't implementation errors. They come from picking an algorithm whose properties don't match the actual traffic shape. Here's a workload-first framework for making the right choice.