Webhooks in production: the delivery guarantees your integration is probably not honouring
Every modern SaaS integration relies on webhooks. Payment gateways fire them when a charge succeeds. Document platforms fire them when a signature is applied. CI pipelines fire them when a build completes. Despite this ubiquity, the failure modes that sit underneath webhook delivery are poorly understood — and most integrations handle them incorrectly until something breaks in production.
This article covers the delivery guarantees that actually exist, the guarantees that do not, and the receiver-side patterns that separate integrations which survive incidents from ones that silently corrupt state.
What at-least-once delivery actually means
The first thing to understand about webhooks is that the sender cannot know whether an event was actually handled. A 200 OK tells the sender that the receiver accepted the payload, but it says nothing about whether the receiver processed it, stored it, or acted on it. Network partitions, application crashes, and database timeouts can all happen after the response is sent, and the response itself can be lost on its way back to the sender.
Because of this, production webhook systems (Stripe, GitHub, Twilio, DocuSign, and well-engineered internal event buses) effectively operate under an at-least-once delivery guarantee. The sender will retry, or allow redelivery of, any delivery it considers failed. If the receiver processed the event but its response never reached the sender, the retry lands on a receiver that has already handled the event. This is not a bug; it is a documented consequence of the delivery model.
Exactly-once delivery is a well-known impossibility result in distributed systems. It cannot be provided at the transport layer. What some vendors describe as "exactly-once" is actually at-least-once delivery combined with idempotent processing on the receiver side. The distinction matters because it places the responsibility where it belongs: on you.
The three failure modes that actually occur
Teams building webhook receivers tend to worry about the wrong things: the payload being tampered with, the sender going down, the schema changing. These are real concerns, but they are not where most production incidents originate. The three failure modes that actually break integrations are more mundane.
1. Double processing
A receiver processes an event and updates a database row, but the response times out before reaching the sender. The sender retries. The receiver processes the same event a second time. For read-only operations this is harmless. For writes such as charging a card, provisioning an account, sending a notification, or updating a signature status, it is not.
The standard mitigation is idempotency keying on the receiver side. Every webhook payload from a well-designed sender includes a stable event identifier, typically a UUID or a similar opaque ID, that remains constant across retries. The receiver records this identifier in a processed-events table that enforces uniqueness. If an incoming identifier is already present, the event is acknowledged and discarded. If it is absent, it is inserted in the same transaction as the business-logic update, so that either both are committed or neither is.
The atomic requirement is worth enforcing strictly. A non-atomic implementation (check for existence, then insert) creates a race condition when two retries arrive simultaneously. Rely on a unique constraint inside a database transaction, or on a Redis SET with the NX and EX options (SETNX on its own does not accept a TTL), with an expiry that covers the sender's longest retry window. Stripe keeps its own API idempotency keys for at least 24 hours, but its webhook retries can continue for up to three days, so treat 24 hours as a floor rather than a safe default.
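A minimal sketch of the transactional variant, using SQLite from the Python standard library so that it runs as-is. The table names (processed_events, signatures) and the function names are assumptions made for illustration; the point is only that the idempotency key and the business update share one transaction, and that the primary key, not a prior SELECT, decides who wins a race.

```python
import sqlite3


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    """Create the two hypothetical tables used in this sketch."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS processed_events (event_id TEXT PRIMARY KEY)"
    )
    conn.execute(
        "CREATE TABLE IF NOT EXISTS signatures "
        "(document_id TEXT PRIMARY KEY, status TEXT)"
    )
    conn.commit()
    return conn


def process_once(
    conn: sqlite3.Connection, event_id: str, document_id: str, status: str
) -> bool:
    """Apply the event at most once. Returns False for a duplicate."""
    try:
        # "with conn" wraps both statements in one transaction:
        # commit on success, rollback if anything raises.
        with conn:
            # The primary key rejects a repeated event_id, so there is
            # no separate existence check that could race.
            conn.execute(
                "INSERT INTO processed_events (event_id) VALUES (?)", (event_id,)
            )
            conn.execute(
                "INSERT OR REPLACE INTO signatures (document_id, status) "
                "VALUES (?, ?)",
                (document_id, status),
            )
        return True
    except sqlite3.IntegrityError:
        # Duplicate delivery: acknowledge it, but do not act again.
        return False
```

Calling process_once twice with the same event_id applies the update once and reports the second call as a duplicate. The same shape carries over to Postgres or MySQL with a unique index on the event identifier.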
2. Processing before acknowledging
The second common failure mode is doing work inside the HTTP handler before returning a response. The sender has a timeout, often somewhere between a few seconds and 30 seconds (GitHub, for example, expects a response within 10 seconds). Any processing that takes longer than the timeout will trigger a retry even though the first delivery is still in progress. Now two copies of the same event are being processed concurrently.
The correct pattern is to acknowledge quickly and defer the work: verify the signature, write the raw event to a durable internal queue (SQS, RabbitMQ, or a Postgres table), and return a 200 OK as soon as that write succeeds. An in-process channel is faster still, but it loses events if the process crashes, so it only suits events you can afford to drop. A background worker consumes from the queue asynchronously. This decouples delivery acknowledgement from processing, eliminates the timeout race, and gives you natural retry semantics on the processing side independent of the sender's retry policy.
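As a rough sketch of acknowledge-and-defer, the handler below does nothing except persist the raw body, and a separate worker function does the real work later. It uses a SQLite table as a stand-in for the durable queue; the inbox table and the accept and drain names are hypothetical. The write happens before the 200 is returned so that a failed write is never acknowledged.

```python
import json
import sqlite3
from typing import Any, Callable


def open_queue(path: str = ":memory:") -> sqlite3.Connection:
    """Create a hypothetical inbox table that acts as the durable queue."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS inbox ("
        "id INTEGER PRIMARY KEY AUTOINCREMENT, "
        "raw_body BLOB NOT NULL, "
        "done INTEGER NOT NULL DEFAULT 0)"
    )
    conn.commit()
    return conn


def accept(conn: sqlite3.Connection, raw_body: bytes) -> int:
    """HTTP-handler side: persist the raw bytes, then acknowledge.

    Assumes the signature has already been verified by the caller.
    """
    with conn:
        conn.execute("INSERT INTO inbox (raw_body) VALUES (?)", (raw_body,))
    return 200


def drain(conn: sqlite3.Connection, process: Callable[[Any], None]) -> int:
    """Worker side: process pending rows oldest first.

    A row is marked done only after process() returns, so a crash or an
    exception leaves it pending for the next run.
    """
    rows = conn.execute(
        "SELECT id, raw_body FROM inbox WHERE done = 0 ORDER BY id"
    ).fetchall()
    for row_id, raw_body in rows:
        process(json.loads(raw_body))
        with conn:
            conn.execute("UPDATE inbox SET done = 1 WHERE id = ?", (row_id,))
    return len(rows)
```

In a real service accept would be called from the web framework's request handler and drain from a scheduled or long-running worker; because a row can be processed again after a crash, the worker still needs the idempotency check described earlier.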
3. Silent failures after partial processing
The most dangerous failure mode is one where processing begins, something fails midway, and the receiver returns a 200 OK anyway. This happens when error handling is incomplete: a database write succeeds, a downstream API call fails, and the handler catches the exception to avoid surfacing it to the sender. The sender marks the delivery as successful. The event is gone. The state is corrupt.
Return non-2xx for failures you want retried. Return 2xx only when you are confident the event has been safely accepted (not necessarily processed, but enqueued in a durable store). If your queue write fails, return a 500 so the sender retries.
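The rule reduces to a small decision function. This is an illustrative sketch rather than any framework's API: respond, signature_ok, and enqueue are hypothetical names, and the specific 401 and 500 codes are one reasonable choice among several non-2xx options.

```python
from typing import Callable


def respond(
    raw_body: bytes, signature_ok: bool, enqueue: Callable[[bytes], None]
) -> int:
    """Pick the HTTP status for one delivery.

    2xx is returned only once the event sits in a durable store.
    """
    if not signature_ok:
        # Never acknowledge a payload that failed verification.
        return 401
    try:
        enqueue(raw_body)
    except Exception:
        # The event is not safely stored: report failure so that the
        # sender keeps it in its retry schedule.
        return 500
    return 200
```

The important property is the absence of a catch-all that turns an exception into a 200. Swallowing the error is exactly what converts a recoverable failure into a lost event.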
Retry policies and what you can rely on
Senders implement retry policies, but those policies vary more than developers expect. Stripe retries up to 15 times over 72 hours, with exponential backoff. GitHub does not automatically redeliver failed deliveries at all; redelivery is manual or has to be driven through its API. Internal event buses may retry indefinitely or not at all. You cannot assume a consistent retry window across integrations, and you should not design your receiver to depend on a specific retry duration.
What you can rely on: most production-grade senders use exponential backoff with jitter to avoid thundering herd problems during receiver outages. They will not hammer your endpoint at a fixed rate. When your service recovers after an outage, retries will arrive in a spread pattern rather than a simultaneous burst.
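For intuition about what such a schedule looks like from the sender's side, here is a sketch of the "full jitter" variant of exponential backoff. The base and cap values are placeholders chosen for illustration, not any vendor's actual parameters.

```python
import random


def retry_delay(
    attempt: int, base: float = 1.0, cap: float = 3600.0, rng=random
) -> float:
    """Seconds to wait before retry number `attempt` (0-based).

    The ceiling doubles with every attempt up to `cap`, and the actual
    delay is drawn uniformly below it, so receivers that failed at the
    same moment are not all retried at the same moment.
    """
    ceiling = min(cap, base * (2 ** attempt))
    return rng.uniform(0.0, ceiling)
```

With these placeholder values the ceiling grows from one second on the first retry to the one-hour cap after twelve doublings; the randomisation is what spreads recovery traffic out.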
What you cannot rely on is that the sender will keep retrying long enough to cover your maintenance windows. If you deploy a breaking change to your receiver endpoint, events that arrive during the deployment will be retried; if your window exceeds the sender's retry limit, those events are lost. Design deployments to keep the old handler alive until the retry window closes, or build a separate catch-up mechanism.
Dead-letter queues and the events you will lose
After a sender exhausts its retry budget, the event is dropped. This is the only guarantee: events that cannot be delivered within the retry window are permanently discarded by the sender. If your receiver was down for four days and Stripe's window is three, some events are gone.
The mitigation is a dead-letter queue on the receiver side. When your background worker fails to process an event after a configurable number of attempts, it writes the raw payload to a dead-letter store rather than discarding it. This allows manual or automated replay once the underlying cause is fixed. A dead-letter queue does not help with events the sender has already dropped (only reconciliation against the sender's API can recover that state), but it closes the gap on events the sender delivered but your processor failed to handle.
The operational requirement: someone needs to monitor the dead-letter queue. An alarm that fires when queue depth exceeds a threshold is the minimum. Unmonitored dead-letter queues accumulate events that represent real state drift between systems, and the drift compounds over time.
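A compact sketch of both halves, the bounded retry that ends in a dead-letter write and the depth check an alarm would be built on. The function names are hypothetical, and a plain Python list stands in for what would be a durable table or queue in a real deployment.

```python
from typing import Callable, Dict, List


def process_with_dlq(
    raw_body: bytes,
    process: Callable[[bytes], None],
    dead_letters: List[Dict[str, object]],
    max_attempts: int = 3,
) -> bool:
    """Try process() up to max_attempts times, then dead-letter the payload.

    A production worker would wait between attempts; the delay is
    omitted here to keep the sketch short.
    """
    last_error = ""
    for _ in range(max_attempts):
        try:
            process(raw_body)
            return True
        except Exception as exc:
            last_error = repr(exc)
    # Keep the raw payload and the last error so the event can be replayed.
    dead_letters.append(
        {"raw_body": raw_body, "attempts": max_attempts, "error": last_error}
    )
    return False


def needs_alert(dead_letters: List[Dict[str, object]], threshold: int) -> bool:
    """True when the dead-letter depth exceeds the alarm threshold."""
    return len(dead_letters) > threshold
```

Storing the raw bytes rather than a parsed object matters for replay: the same payload can be pushed back through the normal worker path once the bug is fixed.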
Signature verification and why you cannot skip it
Webhook endpoints are public HTTP endpoints. Any actor who discovers your endpoint URL can POST arbitrary payloads to it. Without signature verification, your receiver is open to injection: an attacker can trigger state changes in your system by crafting payloads that match your expected schema.
Most senders provide HMAC-based signature headers. Stripe uses the Stripe-Signature header, which carries a timestamp and a SHA-256 HMAC computed over the timestamp, a period, and the raw body. GitHub uses X-Hub-Signature-256, an HMAC-SHA256 of the raw body alone. The pattern is the same in both cases: rebuild the signed payload exactly as the sender defines it, compute an HMAC using your shared secret, and compare the result to the header value using a constant-time comparison function.
Two implementation notes. First, compute the HMAC over the raw bytes of the request body, not over a parsed JSON object. JSON serialisers are not deterministic across languages; whitespace changes, object key ordering changes, and the signature will not match. Buffer the raw body before parsing. Second, validate the timestamp component to prevent replay attacks. A payload with a timestamp older than five minutes should be rejected, even if the HMAC is valid.
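Both notes fit in a few lines of standard-library Python. This sketch assumes a scheme in which the signed payload is the timestamp, a period, and the raw body, and in which the timestamp and hex signature have already been parsed out of the header. Header parsing is omitted because the format differs per sender, and the function names are illustrative.

```python
import hashlib
import hmac
import time
from typing import Optional

TOLERANCE_SECONDS = 300  # the five-minute replay window from the text


def sign(secret: bytes, timestamp: int, raw_body: bytes) -> str:
    """Hex HMAC-SHA256 over b"<timestamp>.<raw body>" (assumed layout)."""
    signed_payload = str(timestamp).encode() + b"." + raw_body
    return hmac.new(secret, signed_payload, hashlib.sha256).hexdigest()


def verify(
    secret: bytes,
    timestamp: int,
    raw_body: bytes,
    signature: str,
    now: Optional[float] = None,
) -> bool:
    """Check freshness first, then compare signatures in constant time."""
    current = time.time() if now is None else now
    if abs(current - timestamp) > TOLERANCE_SECONDS:
        return False
    expected = sign(secret, timestamp, raw_body)
    # compare_digest avoids leaking how many leading characters matched.
    return hmac.compare_digest(expected.encode(), signature.encode())
```

Note that raw_body is the untouched byte string from the request. Re-serialising a parsed JSON object, even one that is semantically identical, produces different bytes and a failed comparison.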
Ordering: what you will not get
Webhooks do not guarantee ordering. If two events are fired in sequence (document.created followed by document.signed), they may arrive in any order, or one may arrive after a retry delay that causes the other to arrive first. Any receiver that assumes order will eventually process events out of sequence.
The practical implication: your business logic must be order-independent, or you must build ordering yourself. The common approach is to include a sequence number or version field in each payload and use optimistic locking on the receiver side, rejecting updates where the incoming version is not higher than the stored version. This requires senders that include version information; many internal event schemas skip it, which forces the receiver to rely on timestamps, which are unreliable under clock skew.
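The version check can be pushed into the UPDATE statement itself, so that a stale event simply matches no rows. The sketch below uses SQLite and a hypothetical documents table; it assumes the sender supplies a monotonically increasing integer version per document.

```python
import sqlite3


def open_documents(path: str = ":memory:") -> sqlite3.Connection:
    """Create a hypothetical documents table with a version column."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS documents ("
        "document_id TEXT PRIMARY KEY, status TEXT, version INTEGER NOT NULL)"
    )
    conn.commit()
    return conn


def apply_versioned(
    conn: sqlite3.Connection, document_id: str, status: str, version: int
) -> bool:
    """Apply the event only if it is newer than what is stored.

    Returns True if state changed, False if the event was stale or a
    duplicate of the stored version.
    """
    with conn:
        # The WHERE clause is the optimistic lock: an older or equal
        # version updates zero rows.
        cur = conn.execute(
            "UPDATE documents SET status = ?, version = ? "
            "WHERE document_id = ? AND version < ?",
            (status, version, document_id, version),
        )
        if cur.rowcount == 1:
            return True
        # Nothing updated: either the document is unknown (insert it)
        # or the event is stale (the insert is ignored).
        cur = conn.execute(
            "INSERT OR IGNORE INTO documents (document_id, status, version) "
            "VALUES (?, ?, ?)",
            (document_id, status, version),
        )
        return cur.rowcount == 1
```

If a version 2 event (signed) arrives before the version 1 event (created), the late version 1 event is dropped and the document stays signed.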
For integrations where ordering genuinely matters, consider replacing the webhook with a polling endpoint that allows the receiver to fetch events in order. Webhooks are push; polling is pull. Push is operationally simpler for the sender. Pull is operationally safer for the receiver. The right choice depends on your tolerance for latency versus your tolerance for ordering bugs.
Observability: the layer most teams skip
Webhook integrations are notoriously difficult to observe. The sender fires events; whether they are processed correctly is invisible to the sender. Your receiver processes events; whether the resulting state is consistent is invisible to the receiver. Bugs live in the gap between them, and they often go undetected until a customer reports a discrepancy.
The minimum observability stack for a production webhook receiver: a counter on incoming events by type and status (accepted, duplicate, failed); a counter on dead-letter entries; an alert when processing latency for any event type exceeds your expected window; and a periodic reconciliation job that compares state in your system against the source of truth in the sender's API.
That last piece, reconciliation, is the highest-value and the most commonly skipped. Counters and latency alerts catch processing failures. Reconciliation catches silent state drift: events that were delivered and acknowledged but resulted in incorrect state due to bugs in your business logic. A daily reconciliation run that checks a sample of records against the sender's API is cheap to build and catches an entire class of bugs that metrics will miss.
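The core of a reconciliation job is small. In this sketch, local is whatever your system believes, and fetch_remote_status is a hypothetical callable wrapping the sender's API; both names, and the choice to compare a single status field, are assumptions made for illustration.

```python
from typing import Callable, Dict, List, Optional, Tuple

Drift = Tuple[str, Optional[str], Optional[str]]


def reconcile(
    local: Dict[str, str],
    fetch_remote_status: Callable[[str], Optional[str]],
    sample_ids: List[str],
) -> List[Drift]:
    """Compare local state with the source of truth for a sample of records.

    Returns (record_id, local_status, remote_status) for every mismatch,
    including records that exist on only one side.
    """
    drift: List[Drift] = []
    for record_id in sample_ids:
        ours = local.get(record_id)
        theirs = fetch_remote_status(record_id)
        if ours != theirs:
            drift.append((record_id, ours, theirs))
    return drift
```

What happens with the drift list is a policy decision: some teams auto-repair from the sender's state, others only alert. Either way the job runs on its own schedule and does not depend on any webhook having arrived.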
A checklist for production webhook receivers
Before marking a webhook integration as production-ready, verify the following:
- Your receiver returns 200 immediately after enqueuing the payload, not after processing it.
- You verify the HMAC signature over the raw body before any other processing.
- You check the event ID against a processed-events store before acting on each payload.
- Business logic updates and idempotency key insertion happen in the same database transaction.
- Non-2xx responses are returned for any failure mode you want the sender to retry.
- Payloads that exhaust worker retries land in a monitored dead-letter queue.
- You have an alert on dead-letter queue depth.
- You have a reconciliation job that runs on a schedule independent of webhook delivery.
None of these items is novel. Each has been documented by platform engineering teams for years. What is notable is how rarely they are all present in the same integration. Most webhook receivers in production handle the happy path correctly and fail on one or two of these requirements, typically idempotency or reconciliation, and those gaps only surface under load or after outages, when the stakes are highest.
What this means for integrations that handle documents and signatures
For integrations that sit downstream of document workflows, where a webhook fires on signature completion, audit-trail generation, or status change, the failure modes carry higher stakes than in most SaaS contexts. A payment webhook that is processed twice results in a duplicate charge that can be refunded. A signature-status webhook that is processed twice, or not processed at all, may result in an agreement being marked complete when it is not, or in a process that should block on a signature failing to block.
The right architecture for these integrations is the same as any other webhook receiver, but the operational requirements are stricter. Idempotency is non-negotiable. Dead-letter monitoring needs a response SLA. Reconciliation should run more frequently than daily. And the receiver should maintain its own audit log of every event received, not just the ones that resulted in state changes, so that any dispute can be resolved by inspecting the event history rather than relying on the sender's delivery logs.
Webhooks are a solved problem in the sense that the patterns are well understood and widely documented. They are an unsolved problem in the sense that most integrations in production are missing at least one of those patterns. The checklist above is short. The cost of skipping any item on it is not.