Rate limiting in production: why the algorithm you chose is probably wrong for your workload
A workload-first framework for picking between fixed window, sliding window, and token bucket
The first decision most engineers make about rate limiting is the wrong one. They look at the available rate limiting algorithms (token bucket, sliding window, fixed window), pick the one that looks simplest to implement, and ship it. Three months later, they’re debugging an incident where a badly-behaved client blasted through what should have been a firm limit.
The algorithm wasn’t the first decision that mattered. The first decision is: what failure mode are you actually trying to prevent?
Two failures that look the same
Rate limiting failures come in two varieties, and they require opposite solutions.
The first: you let through more traffic than intended. A client sends 150 requests in the window where 100 were allowed. Your SLO breaks. The database gets hammered. The fix is a tighter algorithm with fewer gaps at the edges.
The second: you block traffic that should have been allowed. A human user submits a form ten times in quick succession, legitimately, and hits a limit designed to stop automated scrapers. Support tickets pile up. The fix is an algorithm that tolerates legitimate bursts from real users.
Most implementations optimise hard for the first failure mode and discover the second in production. Before picking an algorithm, name which failure you’re most exposed to — malicious attacker, misbehaving SDK, or just a colleague with a trigger-happy script.
Fixed window: the boundary you can’t hide from
Fixed window counting divides time into fixed intervals (one-minute windows, say), keeps a counter per client per window, and rejects requests once the counter hits the limit.
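As a minimal sketch of the mechanics, here is a single-process version; the limit, window size, and names are illustrative:

# Minimal single-process fixed window counter, for illustration only.
import time
from collections import defaultdict

LIMIT = 100          # requests allowed per window
WINDOW_SECONDS = 60

counters = defaultdict(int)  # (client_id, window_number) -> request count

def allow(client_id):
    window_number = int(time.time() // WINDOW_SECONDS)  # which window we are in right now
    counters[(client_id, window_number)] += 1
    # note: counters for old windows are never purged here; a real implementation would evict them
    return counters[(client_id, window_number)] <= LIMIT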
The implementation is straightforward. The problem is the boundary.
Suppose the limit is 100 requests per minute. A client sends 100 requests at 12:00:55. The counter for the 12:00 window fills up; the client is blocked. At 12:01:00, a new window opens. The client immediately sends 100 more.
Between 12:00:55 and 12:01:10, that client sent 200 requests in 15 seconds. The stated limit is 100 per minute. The effective burst at the boundary is 2×, and it is not a bug in your implementation; it is inherent to the algorithm.
For cost control over long windows (hourly totals, daily quotas), the boundary problem matters less. A 2× spike at midnight on a daily counter is rarely catastrophic. For limits that exist because a spike of that magnitude would break something downstream, fixed window is the wrong choice.
Sliding window log: exact but memory-hungry
The sliding window log stores the timestamp of every request within the window. To decide whether the current request is allowed, count the number of stored timestamps that fall within the last N seconds.
# Sliding window log using a Redis sorted set
from uuid import uuid4

from redis import Redis

redis = Redis()  # assumes a reachable Redis instance

def check(client_key, limit, window_seconds, now):
    cutoff = now - window_seconds
    pipe = redis.pipeline()
    pipe.zremrangebyscore(client_key, 0, cutoff)   # discard entries older than the window
    pipe.zadd(client_key, {uuid4().hex: now})      # record this request, scored by timestamp
    pipe.zcard(client_key)                         # count requests now in the window
    pipe.expire(client_key, window_seconds + 1)    # let idle client keys expire on their own
    _, _, count, _ = pipe.execute()
    return count <= limit

This eliminates the boundary bug. The cost is memory: one sorted set entry per request per client. For a client making 500 requests per minute, that is 500 Redis entries. Across tens of thousands of active clients, the memory footprint becomes difficult to predict under load.
Sliding window log is the right choice when exactness is non-negotiable and the per-client request rate is low: transaction limits on a payment API, for instance, where the count of requests in a period carries regulatory or financial meaning. It is the wrong choice when you have many clients making many requests and cannot bound memory consumption.
Sliding window counter: the approximation worth making
The sliding window counter keeps only two values per client: the count in the previous complete window and the count in the current partial window. The effective rate at any point is estimated as a weighted average of both:
effective_count = (prev_window_count × overlap_fraction) + current_window_count
overlap_fraction = 1 − (seconds_into_current_window / window_size)
Example: 30% into a 60s window → overlap_fraction = 0.70

The approximation introduces a small error — at most a few percent of the limit for realistic traffic distributions, and usually much less. In exchange, the memory cost is constant: two integers per client regardless of request rate. Stripe’s API rate limits use this approach. Kong and Envoy default to it.
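A sketch of that calculation, assuming one Redis counter per client per window; the key layout and names are illustrative:

# Sliding window counter sketch; Redis key layout is illustrative.
import time
from redis import Redis

redis_client = Redis()

def allow(client_key, limit, window_seconds):
    now = time.time()
    current_window = int(now // window_seconds)
    seconds_into_window = now - current_window * window_seconds
    overlap_fraction = 1 - (seconds_into_window / window_seconds)

    prev_count = int(redis_client.get(f"{client_key}:{current_window - 1}") or 0)

    pipe = redis_client.pipeline()
    pipe.incr(f"{client_key}:{current_window}")                         # count this request
    pipe.expire(f"{client_key}:{current_window}", window_seconds * 2)   # keep it around one extra window
    current_count, _ = pipe.execute()

    effective_count = prev_count * overlap_fraction + current_count
    return effective_count <= limit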
This is the algorithm to reach for first in most API rate limiting scenarios. It eliminates the boundary bug from fixed windows, keeps memory overhead bounded, and the approximation is small enough to be irrelevant unless you are enforcing limits where an off-by-one carries real consequences — regulatory compliance caps, hard financial limits, or adversarial environments where the error margin itself can be exploited.
Token bucket: designed for machine clients
The token bucket maintains a bucket per client that refills at a constant rate (say, 10 tokens per second) up to a maximum capacity. Each request consumes one token. If the bucket has tokens, the request goes through. If not, it is rejected or queued, depending on configuration.
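A single-process sketch of the refill-and-spend mechanics; the refill rate and burst capacity are example values:

# Token bucket sketch: refill on each check, spend one token per request.
import time

class TokenBucket:
    def __init__(self, refill_rate=10.0, capacity=20.0):
        self.refill_rate = refill_rate      # tokens added per second
        self.capacity = capacity            # maximum burst size
        self.tokens = capacity
        self.last_refill = time.monotonic()

    def allow(self):
        now = time.monotonic()
        # top up with whatever accrued since the last check, capped at capacity
        self.tokens = min(self.capacity, self.tokens + (now - self.last_refill) * self.refill_rate)
        self.last_refill = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False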
This makes token bucket a natural fit for SDK-to-API communication. A machine client can accumulate credit during idle periods and spend it in a burst: exactly the pattern in batch jobs, webhook consumers, and automated workflows. The burst is intentional; token bucket handles it gracefully.
The problem surfaces with human-facing endpoints. A user submits a form ten times in quick succession, more quickly than usual but each action legitimate. Token bucket may block them once the burst capacity is exhausted, even though nothing abusive happened. The user sees an error where they expected success.
The distinction is structural. Token bucket controls the rate of steady-state consumption and tolerates short bursts above that rate. Sliding window controls the count within a time period. Human-facing endpoints usually need the second property. Token bucket is right when burst tolerance is a product feature, not the default for every endpoint that a human might trigger.
What rate limit responses should tell clients
A rate limiter that rejects requests silently is only half-built. When a client hits a limit, it needs three things: confirmation that it hit a rate limit and not a server error, a signal for when it can retry, and visibility into the headroom remaining before the next rejection.
HTTP 429 (Too Many Requests) is the correct status code. HTTP 503 is sometimes used but implies a server-side failure, which corrupts retry logic in client SDKs that treat 503 as a transient backend problem.
The standard response headers, now well-adopted across major public APIs:
RateLimit-Limit: 100
RateLimit-Remaining: 0
RateLimit-Reset: 1715425200
Retry-After: 47

RateLimit-Reset is a Unix timestamp of when the current window resets. Retry-After is the number of seconds the client should wait. Both are worth including; not all clients parse both. More importantly: send these headers on every response, not just 429s. A client that can see it has 3 remaining requests will pace itself. A client that only sees a limit hit after it has already violated the limit will retry blindly and amplify the load.
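One way to assemble these headers for every response, sketched without any particular web framework; the function and parameter names are illustrative:

# Build rate limit headers from whatever state your limiter tracks per client.
import time

def rate_limit_headers(limit, remaining, reset_epoch_seconds):
    headers = {
        "RateLimit-Limit": str(limit),
        "RateLimit-Remaining": str(max(0, remaining)),
        "RateLimit-Reset": str(int(reset_epoch_seconds)),
    }
    if remaining <= 0:
        # only meaningful alongside a 429: how long the client should back off
        headers["Retry-After"] = str(max(0, int(reset_epoch_seconds - time.time())))
    return headers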
Distributed rate limiting: where you’re counting matters
Single-server rate limiting is a solved problem. The interesting failure modes appear when traffic is spread across multiple application servers.
Centralised counting with Redis INCR is the most common approach: every server increments the same counter. It is accurate (the count reflects the true total across all servers), but Redis latency is now in the request path, and a Redis outage disables rate limiting entirely. For most applications this is acceptable, and a Redis Sentinel or Cluster setup addresses the availability concern.
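A sketch of the centralised pattern, using fixed-window keys for brevity; the same idea applies to the other algorithms, and the key layout is illustrative:

# Centralised counting: every application server increments the same Redis key.
import time
from redis import Redis

redis_client = Redis()  # shared by all application servers

def allow(client_id, limit=100, window_seconds=60):
    window_number = int(time.time() // window_seconds)
    key = f"ratelimit:{client_id}:{window_number}"
    pipe = redis_client.pipeline()
    pipe.incr(key)                        # atomic, so the count is accurate across servers
    pipe.expire(key, window_seconds * 2)  # old windows clean themselves up
    count, _ = pipe.execute()
    return count <= limit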
Local counters with periodic sync remove Redis from the critical path. Each server tracks its own count and syncs to a shared store every few seconds. The tradeoff: during the sync gap, each server operates on stale totals. With five servers and a 2-second sync interval, a client can in principle hit 5× the intended limit: 100 requests reaching each server before any of them have communicated. Whether that behaviour is acceptable depends entirely on the cost of a violation.
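A rough sketch of that pattern with a background sync thread; the sync interval and key names are illustrative, and window expiry is omitted to keep the example short:

# Local counters with periodic sync to Redis; the sync gap is visible in allow().
import threading
import time
from collections import defaultdict
from redis import Redis

redis_client = Redis()
SYNC_INTERVAL_SECONDS = 2.0

local_counts = defaultdict(int)  # increments not yet pushed to the shared store
global_counts = {}               # cluster-wide totals as of the last sync
lock = threading.Lock()

def sync_loop():
    while True:
        time.sleep(SYNC_INTERVAL_SECONDS)
        with lock:
            pending = dict(local_counts)
            local_counts.clear()
        for client_id, n in pending.items():
            # push this server's increments, read back the cluster-wide total
            global_counts[client_id] = redis_client.incrby(f"rl:{client_id}", n)

threading.Thread(target=sync_loop, daemon=True).start()

def allow(client_id, limit=100):
    with lock:
        local_counts[client_id] += 1
        seen = global_counts.get(client_id, 0) + local_counts[client_id]
    # requests seen by other servers since the last sync are invisible here
    return seen <= limit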
Sticky routing assigns each client to a fixed server, so local counters stay accurate. It works until a server fails and the client’s count history disappears, accidentally resetting the limit. It also reduces load balancing flexibility in ways that compound over time.
The practical question in the distributed case is: how bad is a 1.5× or 2× violation for a few seconds? For DDoS mitigation, that headroom is unacceptable. For fair-use limits on a B2B API, it is usually fine and the latency saving from local counters is worth it. Match the consistency model to the actual cost of a violation, not to an abstract preference for exactness.
Matching the rate limiting algorithm to your failure mode
The choice follows from the failure mode you are preventing, not from which algorithm appears most often in system design diagrams.
| Situation | Algorithm | Tradeoff you’re accepting |
|---|---|---|
| Human-facing endpoint, count-per-period limit | Sliding window counter | Small approximation error; two integers per client |
| SDK or batch client, burst tolerance is intentional | Token bucket | Poor fit if humans use the same endpoint |
| Low-traffic API, exact counts required | Sliding window log | Memory scales with requests per client |
| Cost control over long windows (hours, days) | Fixed window | 2× burst allowed at the window boundary |
| Distributed, accuracy is critical | Centralised Redis counter | Redis latency in request path; single point of failure |
| Distributed, latency matters more than precision | Local counters with sync | Bursts above the limit possible during sync gap |
Most production failures in rate limiting come not from implementation bugs but from picking an algorithm whose properties do not match the traffic shape. The algorithm you deploy at launch tends to be harder to change than expected — clients learn to work around rate limit behaviour, and configuration calcifies around their workarounds. Pick for your workload, not your deadline.