Idempotency keys in production: what the tutorials don't cover
The check-then-act race condition, deduplication table bottlenecks, and key scoping across services
An API endpoint is idempotent if making the same request twice produces the same result as making it once. For GET requests this is automatic. For POST requests (creating a payment, sending an email, provisioning a resource), it isn't. Idempotency keys are the mechanism that makes them so: the client generates a unique key for each logical operation, sends it with the request, and the server uses it to deduplicate retries.
The pattern looks simple. It isn't. Most tutorial implementations are correct in the happy path and wrong in the three cases that matter most in production.
What idempotency keys actually guarantee
The textbook definition (same inputs, same outputs) understates what's required at the API layer. An idempotent endpoint must also survive the case where the first request completed successfully but the client never received the response. Network partitions, load balancer timeouts, and client-side abort on slow response all produce this. The client cannot distinguish 'request never arrived' from 'request arrived, processed, response lost in transit'. It retries.
Idempotency keys give the server a way to recognise the retry and return the original result without re-executing the operation. The key is the client's assertion: this is the same logical operation as the one I sent before. The server's job is to honour that assertion efficiently and safely.
'Safely' is doing real work here. The server has to check whether the key exists, and in a concurrent system, that check has to be atomic with the claim. Most tutorials stop before explaining why.
The check-then-act race condition
Every tutorial implements idempotency like this:
1. Extract the key from the request header.
2. Query the deduplication store: does this key already exist?
3. If yes: return the cached response.
4. If no: execute the operation, store the result, return it.
The bug lives between step 2 and step 4. In concurrent systems, two requests carrying the same key can both reach step 2 before either completes step 4. Both see 'key doesn't exist'. Both execute the operation. The charge goes through twice.
This isn't a corner case. It's what happens under any exponential-backoff retry pattern when the first request is slow: the client gives up and retries, the server is still executing the first request, and both are now racing.
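The race is easy to reproduce. The sketch below is a hypothetical in-memory stand-in (`NaiveStore` and its counter are illustrative names, not a real API). A barrier forces both threads to finish the existence check before either stores a result, which is the interleaving a slow first request produces.

```python
import threading

class NaiveStore:
    """Check-then-act deduplication with no atomic claim (deliberately broken)."""

    def __init__(self, barrier: threading.Barrier) -> None:
        self.results = {}
        self.executions = 0
        self._count_lock = threading.Lock()  # protects the counter only
        self._barrier = barrier

    def handle(self, key: str) -> str:
        exists = key in self.results           # check: does the key exist?
        self._barrier.wait()                   # both requests pass the check first
        if exists:
            return self.results[key]           # cached response
        with self._count_lock:
            self.executions += 1               # execute the operation...
        self.results[key] = f"charged:{key}"   # ...then store the result
        return self.results[key]

def run_race() -> int:
    store = NaiveStore(threading.Barrier(2))
    threads = [
        threading.Thread(target=store.handle, args=("key-1",))
        for _ in range(2)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return store.executions

print(run_race())  # prints 2: both requests executed the operation
```

Neither thread writes before the barrier releases, so both see the key as missing and both run the operation.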
The fix requires making the claim atomic. Three practical options:
Option 1: unique constraint and conflict detection. Insert the key before executing the operation, using INSERT ... ON CONFLICT DO NOTHING. Check whether a row was actually inserted. If it was, this request owns the key and proceeds. If not, another request owns it: poll until it finishes and return its result.
```sql
-- Claim the key atomically before any work is done
INSERT INTO idempotency_keys (key, status, expires_at)
VALUES ($1, 'pending', now() + interval '30 minutes')
ON CONFLICT (key) DO NOTHING;
-- rows_affected = 0 means another request owns this key:
-- poll for its result rather than proceeding
```
Option 2: SELECT FOR UPDATE on the key row. Locks the row exclusively so only one concurrent request proceeds through the check-and-execute path. The row has to exist before it can be locked, so the very first request for a key still needs an atomic insert. Works reliably once the row is there, but serialises all requests sharing a key. Acceptable when retries are rare and lock hold time is short.
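Option 3: Postgres advisory locks. pg_try_advisory_xact_lock(key_hash) tries to acquire a transaction-scoped lock on a 64-bit integer derived from the key and returns false immediately if another transaction holds it. Fast, releases automatically on commit or rollback, does not require a row to exist first. The limitation is scoping. The lock lasts only as long as the transaction, so the operation has to run inside it. Session-level advisory locks (pg_advisory_lock) outlive the transaction but belong to the connection, and behind PgBouncer in transaction pooling mode consecutive transactions can land on different server connections, so a session-level lock is not reliably held. If either constraint is a problem, use option 1 instead.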
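Advisory locks take a signed 64-bit integer, so a string key has to be hashed down first. A minimal sketch, assuming SHA-256 truncation is acceptable for your collision budget; `advisory_lock_id` is a hypothetical helper name.

```python
import hashlib

def advisory_lock_id(key: str) -> int:
    """Map an idempotency key to a signed 64-bit integer for an advisory lock."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    # Postgres bigint is signed, so read the first 8 bytes as a signed integer.
    return int.from_bytes(digest[:8], byteorder="big", signed=True)

lock_id = advisory_lock_id("order-42-attempt-1")
# Pass lock_id as the parameter of: SELECT pg_try_advisory_xact_lock($1)
```

Two distinct keys can hash to the same integer. The effect is that they serialise against each other, which is harmless as long as the stored result is still looked up by the full key.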
Of the three, option 1 is the most portable. The unique constraint does the deduplication atomically at the database level, without relying on lock scoping or connection affinity.
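The claim-first flow of option 1 can be exercised without a Postgres instance. This sketch uses SQLite from the Python standard library as a stand-in (an assumption made for runnability; SQLite 3.24 or later accepts the same ON CONFLICT ... DO NOTHING clause), and `try_claim` is a hypothetical helper name.

```python
import sqlite3

def make_store() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE idempotency_keys ("
        " key TEXT PRIMARY KEY,"
        " status TEXT NOT NULL DEFAULT 'pending',"
        " response TEXT)"
    )
    return conn

def try_claim(conn: sqlite3.Connection, key: str) -> bool:
    """Return True if this caller inserted the row and therefore owns the key."""
    cur = conn.execute(
        "INSERT INTO idempotency_keys (key) VALUES (?) "
        "ON CONFLICT (key) DO NOTHING",
        (key,),
    )
    conn.commit()
    return cur.rowcount == 1  # 0 rows affected: someone else owns the key

conn = make_store()
first = try_claim(conn, "key-1")   # new key: this request owns it
second = try_claim(conn, "key-1")  # same key again: the claim is lost
print(first, second)  # prints: True False
```

The caller that gets False never runs the operation; it waits for the owner's result instead.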
The deduplication table as a second bottleneck
Once the race condition is addressed, the next problem is operational. The deduplication table grows without bound. At 5,000 requests per minute, that's 300,000 rows per hour, about 7.2 million per day. Cleanup is not optional, and how you clean up matters as much as how you insert.
The naive fix is a periodic job: DELETE FROM idempotency_keys WHERE expires_at < now(). This works until you have millions of rows to delete. A bulk delete of 500,000 rows in a single transaction holds locks for seconds and leaves the vacuum process a large dead-tuple job, which generates I/O spikes during peak traffic.
Better approaches:
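Time partitioning. Partition the table by created_at, daily or weekly. Dropping an old partition is a metadata operation: it takes a brief exclusive lock but leaves no dead tuples and nothing to vacuum. At 5,000 req/min with 7-day retention, you maintain seven active daily partitions and drop the oldest each morning. The drop typically takes milliseconds. One constraint to plan for: Postgres requires every unique constraint on a partitioned table to include the partition column, so uniqueness on the key alone cannot be enforced across partitions.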
Bounded deletes. If partitioning is too complex for your setup, run small frequent deletes, for example 1,000 rows every minute. Postgres has no DELETE ... LIMIT, so bound the batch with a subquery: DELETE FROM idempotency_keys WHERE key IN (SELECT key FROM idempotency_keys WHERE expires_at < now() LIMIT 1000). Small transactions minimise lock pressure and keep the dead-tuple volume manageable. Set the limit based on your write rate: the cleanup rate needs to stay above the expiry rate.
Keep the key column small. Store UUIDs in the native uuid type (16 bytes, versus 36 as text) or hash them to a BIGINT if you can accept a small collision risk. Keep payloads out of the indexed key column: store response data in a separate JSONB column, which Postgres moves out of line via TOAST when it is large, or in a separate table. A hot index over large values slows every write.
| Strategy | Lock impact | Vacuum pressure | Operational complexity |
|---|---|---|---|
| Bulk DELETE | High (long transaction) | High (many dead tuples) | Low |
| Bounded DELETE (1,000 rows/run) | Low (short transactions) | Moderate | Low |
| Time-partitioned table | Brief (lock while the old partition is dropped) | None | Medium (partition mgmt) |
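The bounded-delete row of the table can be sketched as a loop of short transactions. SQLite stands in for Postgres here so the example runs anywhere (an assumption; in Postgres the same statement takes $1/$2 placeholders and now()), and `purge_expired` is a hypothetical name.

```python
import sqlite3

BATCH_SQL = """
DELETE FROM idempotency_keys
WHERE key IN (
    SELECT key FROM idempotency_keys
    WHERE expires_at < ?
    LIMIT ?
)
"""

def purge_expired(conn: sqlite3.Connection, now: int, batch_size: int = 1000) -> int:
    """Delete expired keys in short transactions; return the number of batches run."""
    batches = 0
    while True:
        cur = conn.execute(BATCH_SQL, (now, batch_size))
        conn.commit()  # one short transaction per batch
        if cur.rowcount == 0:
            return batches
        batches += 1

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE idempotency_keys (key TEXT PRIMARY KEY, expires_at INTEGER NOT NULL)"
)
conn.executemany(
    "INSERT INTO idempotency_keys VALUES (?, ?)",
    [(f"expired-{i}", 100) for i in range(2500)]
    + [(f"live-{i}", 900) for i in range(10)],
)
batches = purge_expired(conn, now=500)
remaining = conn.execute("SELECT COUNT(*) FROM idempotency_keys").fetchone()[0]
print(batches, remaining)  # prints: 3 10
```

A production job would usually run one batch per scheduler tick, or pause between batches, rather than loop back to back.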
Key scoping across service boundaries
Single-service idempotency is straightforward. In a microservice architecture, a single user-facing request fans out to multiple downstream services, and the scoping question becomes harder.
Consider a checkout flow: the client sends one request with one idempotency key. The checkout service calls inventory to reserve stock, then calls payments to charge the card, then calls notifications to send a receipt. Which key goes where?
Do not propagate the raw key. If the inventory service and the payments service both receive the same key value, they deduplicate the same value with different semantics, and nothing in the key says which operation it belongs to. Worse: wherever two operations end up sharing a deduplication store, or a client accidentally reuses a key across distinct operations, a request to payments can match a key stored for inventory. That is a false-positive deduplication that suppresses a legitimate charge.
Derive child keys deterministically. Compose the child key from the parent key and the service boundary:
```python
import hashlib

def child_key(parent_key: str, service: str) -> str:
    # Hash the parent key with the service boundary; keep 32 hex chars (128 bits)
    return hashlib.sha256(
        f"{parent_key}:{service}".encode()
    ).hexdigest()[:32]

# request_key is the idempotency key received from the client.
# Each service gets a deterministic, globally unique key
inventory_key = child_key(request_key, "inventory:reserve")
payment_key = child_key(request_key, "payment:charge")
notify_key = child_key(request_key, "notification:receipt")
```
Each child key is globally unique to that operation type but fully deterministic from the parent. A retry of the parent request produces identical child keys. Each downstream service deduplicates independently, and the parent service does not need to coordinate across them.
The parent key covers the full fan-out. If the checkout service retries only the payments call because inventory already succeeded, it uses the same derived payment key. The payments service correctly recognises it as a duplicate and returns the cached result without re-charging.
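A partial retry can be traced end to end with a stub. This is a sketch under stated assumptions: `PaymentsStub` is a hypothetical single-threaded stand-in for the payments service, and its dictionary lookup is not an atomic claim.

```python
import hashlib

def child_key(parent_key: str, service: str) -> str:
    return hashlib.sha256(f"{parent_key}:{service}".encode()).hexdigest()[:32]

class PaymentsStub:
    """Hypothetical downstream service with its own dedup store."""

    def __init__(self) -> None:
        self.seen = {}
        self.charges = 0

    def charge(self, key: str, amount_cents: int) -> str:
        if key in self.seen:        # duplicate: return the cached result
            return self.seen[key]
        self.charges += 1           # first time: perform the charge
        self.seen[key] = f"charge-{self.charges}:{amount_cents}"
        return self.seen[key]

payments = PaymentsStub()
request_key = "client-key-123"

first = payments.charge(child_key(request_key, "payment:charge"), 4999)
# Checkout retries only the payments call; the derived key is identical.
retry = payments.charge(child_key(request_key, "payment:charge"), 4999)
print(first == retry, payments.charges)  # prints: True 1
```

The inventory key derived from the same parent is a different value, so it can never collide with the payment key.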
Expiry windows and why 24 hours is not universal
Most tutorials suggest 24 to 72 hours as the idempotency key TTL. This is a starting point, not a derived value.
The right TTL is your retry window plus a buffer. If your client retries three times with exponential backoff that tops out at two minutes per attempt, the full retry sequence completes within about ten minutes. A 30-minute TTL covers it with headroom. A 24-hour TTL protects against a user submitting the same form a day apart — which is usually a different user intent, not a retry you want to deduplicate.
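The arithmetic can be made explicit. The sketch below rests on assumptions that are not in the text above: "two minutes per attempt" is read as a 120-second per-attempt timeout, the backoff starts at 10 seconds and doubles, and the buffer is a factor of three. Both function names are hypothetical.

```python
def retry_window_seconds(max_retries: int, attempt_timeout: float,
                         base_delay: float, max_delay: float) -> float:
    """Worst-case time from the first send to the last attempt timing out."""
    attempts = max_retries + 1  # the original request plus each retry
    delays = sum(min(base_delay * 2 ** i, max_delay) for i in range(max_retries))
    return attempts * attempt_timeout + delays

def ttl_seconds(window: float, buffer_factor: float = 3.0) -> float:
    """Retry window plus headroom, expressed as a multiplier."""
    return window * buffer_factor

window = retry_window_seconds(
    max_retries=3, attempt_timeout=120, base_delay=10, max_delay=120
)
print(window, ttl_seconds(window) / 60)  # prints: 550 27.5
```

Four attempts of up to 120 seconds plus 70 seconds of backoff give 550 seconds, a little over nine minutes, and the tripled figure lands at 27.5 minutes.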
Overly long TTLs cause storage growth and can produce confusing failures. A user attempts a payment, gets a transient error, fixes their payment method, and retries 12 hours later with the same client-generated key. The server correctly deduplicates it and returns the original failure. The user sees a stale error rather than a fresh attempt. This is spec-compliant but wrong.
The fix is a client-side convention: generate a new key for each new logical attempt, not for each network call. If the user intentionally retries after fixing an error, the client generates a new key. If the network drops mid-request, the client retries with the same key. Most SDK implementations get this right; most hand-rolled implementations do not.
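One way to encode that convention in a client is to tie the key to an object that represents the logical attempt. This is a minimal sketch; `LogicalAttempt` and `flaky_transport` are hypothetical names, and the transport is a stub that drops its first call.

```python
import uuid
from typing import Callable

class LogicalAttempt:
    """One user-intended operation; its key is fixed for every network retry."""

    def __init__(self) -> None:
        self.key = str(uuid.uuid4())

    def send(self, transport: Callable[[str], str], max_tries: int = 3) -> str:
        for attempt in range(max_tries):
            try:
                return transport(self.key)  # same key on every network retry
            except ConnectionError:
                if attempt == max_tries - 1:
                    raise

seen_keys = []

def flaky_transport(key: str) -> str:
    """Stub transport: loses the first call, succeeds afterwards."""
    seen_keys.append(key)
    if len(seen_keys) == 1:
        raise ConnectionError("response lost")
    return "ok"

attempt = LogicalAttempt()
attempt.send(flaky_transport)   # network retry: key reused
fresh = LogicalAttempt()        # user fixed their card: new attempt, new key
fresh.send(flaky_transport)
print(seen_keys[0] == seen_keys[1], seen_keys[1] != seen_keys[2])  # prints: True True
```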
A two-tier TTL is worth considering: a short active window (15 to 30 minutes) for deduplication during retry sequences, and a longer audit window (7 days) for response caching to support debugging and support queries. Store them in separate columns with separate cleanup schedules — the active window deletes fast, the audit window stays for a week.
What to measure once it's running
A deduplication system you cannot observe is one you cannot trust. Four metrics worth instrumenting from day one:
Dedup rate: duplicate hits divided by total requests, as a percentage. A healthy system stays well under 1%. A sudden spike means client-side retry logic is misbehaving or a network layer is duplicating requests. This metric makes the misbehaviour visible before users notice a charge doubled.
Pending key age: P95 and P99 of how long a key stays in pending status. Under normal conditions, this should be low, seconds at most. Keys that stay pending for minutes indicate a stuck request or a cleanup failure. These are the rows that make the polling path wait until it times out.
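If your metrics backend does not compute percentiles for you, a nearest-rank calculation is enough for alerting. A sketch with made-up sample data; `percentile` is a hypothetical helper and the ages are illustrative, not measurements.

```python
import math

def percentile(values: list, pct: float) -> float:
    """Nearest-rank percentile: smallest sample with at least pct% at or below it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]

# Hypothetical sample: seconds each key spent in 'pending'
pending_ages = [0.2] * 95 + [1.5] * 3 + [240.0] * 2

p95 = percentile(pending_ages, 95)
p99 = percentile(pending_ages, 99)
print(p95, p99)  # prints: 0.2 240.0
```

In this sample the two stuck keys are invisible at P95 and dominate P99, which is why both are worth tracking.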
Table size: row count and bytes on disk. At a known write rate and TTL, you can project the expected steady-state size. If actuals exceed the projection, the cleanup job is falling behind. Set an alert before the table hits a size that will cause index scans to degrade.
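The projection is a single multiplication. A sketch, assuming every request inserts exactly one key and every key lives for the full TTL; the 1.5x alert tolerance and both function names are illustrative choices.

```python
def steady_state_rows(requests_per_minute: float, ttl_minutes: float) -> float:
    """Rows alive at any moment when each request inserts one key that lives for the TTL."""
    return requests_per_minute * ttl_minutes

def cleanup_is_behind(actual_rows: int, expected_rows: float,
                      tolerance: float = 1.5) -> bool:
    """True when the table is larger than the projection allows."""
    return actual_rows > expected_rows * tolerance

expected_30m = steady_state_rows(5_000, 30)           # 30-minute TTL
expected_7d = steady_state_rows(5_000, 7 * 24 * 60)   # 7-day retention
print(expected_30m, expected_7d)  # prints: 150000 50400000
print(cleanup_is_behind(400_000, expected_30m))  # prints: True
```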
Lock contention: if you're using SELECT FOR UPDATE, track wait time on that query. Near-zero under normal conditions. Elevated wait means concurrent retries are genuinely racing — understand the cause before assuming it's expected load.
Most APM tools will not instrument the deduplication table automatically. Add explicit counters at the application layer: increment idempotency.miss when a new key is claimed, idempotency.hit when a duplicate is caught, and idempotency.pending_timeout when the polling path gives up waiting. Those three counters surface most deduplication incidents before they reach users.
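The counters and the dedup rate derived from them fit in a few lines. This sketch uses an in-process `Counter` as a stand-in for a real metrics client; `IdempotencyMetrics` is a hypothetical class, and the rate here counts only hits and misses.

```python
from collections import Counter

class IdempotencyMetrics:
    """In-process counters; swap in your metrics client in production."""

    def __init__(self) -> None:
        self.counts = Counter()

    def increment(self, name: str) -> None:
        self.counts[name] += 1

    def dedup_rate(self) -> float:
        """Duplicate hits as a percentage of hits plus misses."""
        hits = self.counts["idempotency.hit"]
        total = hits + self.counts["idempotency.miss"]
        return 100.0 * hits / total if total else 0.0

metrics = IdempotencyMetrics()
for _ in range(995):
    metrics.increment("idempotency.miss")
for _ in range(5):
    metrics.increment("idempotency.hit")
print(metrics.dedup_rate())  # prints: 0.5
```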
A production-ready idempotency key pattern
Pulling this together: a Postgres-backed implementation that handles the race condition atomically, cleans up in small bounded batches, and emits the metrics you need.
```sql
CREATE TABLE idempotency_keys (
    key        TEXT        NOT NULL,
    status     TEXT        NOT NULL DEFAULT 'pending',  -- pending | done | failed
    response   JSONB,
    created_at TIMESTAMPTZ NOT NULL DEFAULT now(),
    expires_at TIMESTAMPTZ NOT NULL,
    PRIMARY KEY (key)
);
-- Not range-partitioned by created_at: Postgres requires every unique
-- constraint on a partitioned table to include the partition column,
-- and the ON CONFLICT (key) claim needs uniqueness on key alone.

-- Index for the cleanup job
CREATE INDEX ON idempotency_keys (expires_at)
    WHERE status IN ('done', 'failed');
```
```python
import asyncio
import json
from datetime import datetime, timedelta, timezone

# Assumptions: `db` is an asyncpg-style connection or pool, and `metrics`
# is the application's own counter client exposing increment(name).

class IdempotencyKeyFailedError(Exception):
    """The request that owned this key failed."""

class IdempotencyPendingTimeoutError(Exception):
    """The request that owns this key did not finish in time."""

async def with_idempotency(db, key: str, ttl_minutes: int, operation):
    expires_at = datetime.now(timezone.utc) + timedelta(minutes=ttl_minutes)
    # Atomic claim: only one concurrent request wins this INSERT.
    # RETURNING yields a row only when the insert actually happened.
    claimed = await db.fetchrow("""
        INSERT INTO idempotency_keys (key, status, expires_at)
        VALUES ($1, 'pending', $2)
        ON CONFLICT (key) DO NOTHING
        RETURNING key
    """, key, expires_at)
    if claimed is None:
        # Another request owns this key. Poll for its result.
        return await poll_for_result(db, key, timeout_seconds=10)
    try:
        response = await operation()
        await db.execute("""
            UPDATE idempotency_keys
            SET status = 'done', response = $1
            WHERE key = $2
        """, json.dumps(response), key)
        metrics.increment('idempotency.miss')
        return response
    except Exception:
        await db.execute(
            "UPDATE idempotency_keys SET status = 'failed' WHERE key = $1",
            key
        )
        raise

async def poll_for_result(db, key: str, timeout_seconds: int):
    loop = asyncio.get_running_loop()
    deadline = loop.time() + timeout_seconds
    while loop.time() < deadline:
        row = await db.fetchrow(
            "SELECT status, response FROM idempotency_keys WHERE key = $1",
            key
        )
        if row and row['status'] == 'done':
            metrics.increment('idempotency.hit')
            return json.loads(row['response'])
        if row and row['status'] == 'failed':
            raise IdempotencyKeyFailedError(key)
        await asyncio.sleep(0.5)
    metrics.increment('idempotency.pending_timeout')
    raise IdempotencyPendingTimeoutError(key)
```
The critical design decision: claim first, execute second. The unique constraint does the deduplication atomically. No manual locking, no application-level race. The polling path handles the window where the first request is still in flight.
Cleanup runs as bounded deletes against the expires_at index rather than as partition drops. Range-partitioning this table by created_at would force created_at into the primary key, and the claim would stop deduplicating on the key alone. Small batches keep lock hold times short and dead-tuple volume manageable, provided the delete rate stays ahead of the expiry rate.
The three metrics (miss, hit, pending_timeout) give you the signal you need to trust the system is working. The dedup rate tells you if clients are misbehaving. The pending timeout rate tells you if the first request is stuck. If both are near zero, the system is healthy.