Rate limiting in production: why the algorithm you chose is probably wrong for your workload
A workload-first framework for picking between fixed window, sliding window, and token bucket
The first decision most engineers make about rate limiting is the wrong one. They look at the available rate limiting algorithms (token bucket, sliding window, fixed window), pick the one that looks simplest to implement, and ship it. Three months later, they’re debugging an incident where a badly-behaved client blasted through what should have been a firm limit.
The algorithm wasn’t the first decision that mattered. The first decision is: what failure mode are you actually trying to prevent?
Two failures that look the same
Rate limiting failures come in two varieties, and they require opposite solutions.
The first: you let through more traffic than intended. A client sends 150 requests in the window where 100 were allowed. Your SLO breaks. The database gets hammered. The fix is a tighter algorithm with fewer gaps at the edges.
The second: you block traffic that should have been allowed. A human user submits a form ten times in quick succession, legitimately, and hits a limit designed to stop automated scrapers. Support tickets pile up. The fix is an algorithm that tolerates legitimate bursts from real users.
Most implementations optimise hard for the first failure mode and discover the second in production. Before picking an algorithm, name which failure you’re most exposed to — malicious attacker, misbehaving SDK, or just a colleague with a trigger-happy script.
Fixed window: the boundary you can’t hide from
Fixed window counting divides time into fixed intervals (one-minute windows, say), keeps a counter per client per window, and rejects requests once the counter hits the limit.
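As a minimal sketch of the mechanics, here is a single-process version; the limit, window size, and names are illustrative:

# Minimal single-process fixed window counter, for illustration only.
import time
from collections import defaultdict

LIMIT = 100          # requests allowed per window
WINDOW_SECONDS = 60

counters = defaultdict(int)  # (client_id, window_number) -> request count

def allow(client_id):
    window_number = int(time.time() // WINDOW_SECONDS)  # which window we are in right now
    counters[(client_id, window_number)] += 1
    # note: counters for old windows are never purged here; a real implementation would evict them
    return counters[(client_id, window_number)] <= LIMIT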
The implementation is straightforward. The problem is the boundary.
Suppose the limit is 100 requests per minute. A client sends 100 requests at 12:00:55. The counter for the 12:00 window fills up; the client is blocked. At 12:01:00, a new window opens. The client immediately sends 100 more.
Between 12:00:55 and 12:01:10, that client sent 200 requests in 15 seconds. The stated limit is 100 per minute. The effective burst at the boundary is 2×, and it is not a bug in your implementation; it is inherent to the algorithm.
For cost control over long windows (hourly totals, daily quotas), the boundary problem matters less. A 2× spike at midnight on a daily counter is rarely catastrophic. For limits that exist because a spike of that magnitude would break something downstream, fixed window is the wrong choice.
Sliding window log: exact but memory-hungry
The sliding window log stores the timestamp of every request within the window. To decide whether the current request is allowed, count the number of stored timestamps that fall within the last N seconds.
# Sliding window log using a Redis sorted set
from uuid import uuid4

from redis import Redis

redis = Redis()  # assumes a reachable Redis instance

def check(client_key, limit, window_seconds, now):
    cutoff = now - window_seconds
    pipe = redis.pipeline()
    pipe.zremrangebyscore(client_key, 0, cutoff)   # discard entries older than the window
    pipe.zadd(client_key, {uuid4().hex: now})      # record this request, scored by timestamp
    pipe.zcard(client_key)                         # count requests now in the window
    pipe.expire(client_key, window_seconds + 1)    # let idle client keys expire on their own
    _, _, count, _ = pipe.execute()
    return count <= limit

This eliminates the boundary bug. The cost is memory: one sorted set entry per request per client. For a client making 500 requests per minute, that is 500 Redis entries. Across tens of thousands of active clients, the memory footprint becomes difficult to predict under load.
Sliding window log is the right choice when exactness is non-negotiable and the per-client request rate is low: transaction limits on a payment API, for instance, where the count of requests in a period carries regulatory or financial meaning. It is the wrong choice when you have many clients making many requests and cannot bound memory consumption.
Sliding window counter: the approximation worth making
The sliding window counter keeps only two values per client: the count in the previous complete window and the count in the current partial window. The effective rate at any point is estimated as a weighted average of both:
effective_count = (prev_window_count × overlap_fraction) + current_window_count
overlap_fraction = 1 − (seconds_into_current_window / window_size)
Example: 30% into a 60s window → overlap_fraction = 0.70

The approximation introduces a small error — at most a few percent of the limit for realistic traffic distributions, and usually much less. In exchange, the memory cost is constant: two integers per client regardless of request rate. Stripe’s API rate limits use this approach. Kong and Envoy default to it.
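A sketch of that calculation, assuming one Redis counter per client per window; the key layout and names are illustrative:

# Sliding window counter sketch; Redis key layout is illustrative.
import time
from redis import Redis

redis_client = Redis()

def allow(client_key, limit, window_seconds):
    now = time.time()
    current_window = int(now // window_seconds)
    seconds_into_window = now - current_window * window_seconds
    overlap_fraction = 1 - (seconds_into_window / window_seconds)

    prev_count = int(redis_client.get(f"{client_key}:{current_window - 1}") or 0)

    pipe = redis_client.pipeline()
    pipe.incr(f"{client_key}:{current_window}")                         # count this request
    pipe.expire(f"{client_key}:{current_window}", window_seconds * 2)   # keep it around one extra window
    current_count, _ = pipe.execute()

    effective_count = prev_count * overlap_fraction + current_count
    return effective_count <= limit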
This is the algorithm to reach for first in most API rate limiting scenarios. It eliminates the boundary bug from fixed windows, keeps memory overhead bounded, and the approximation is small enough to be irrelevant unless you are enforcing limits where an off-by-one carries real consequences — regulatory compliance caps, hard financial limits, or adversarial environments where the error margin itself can be exploited.
Token bucket: designed for machine clients
The token bucket maintains a bucket per client that refills at a constant rate (say, 10 tokens per second) up to a maximum capacity. Each request consumes one token. If the bucket has tokens, the request goes through. If not, it is rejected or queued, depending on configuration.
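A single-process sketch of the refill-and-spend mechanics; the refill rate and burst capacity are example values:

# Token bucket sketch: refill on each check, spend one token per request.
import time

class TokenBucket:
    def __init__(self, refill_rate=10.0, capacity=20.0):
        self.refill_rate = refill_rate      # tokens added per second
        self.capacity = capacity            # maximum burst size
        self.tokens = capacity
        self.last_refill = time.monotonic()

    def allow(self):
        now = time.monotonic()
        # top up with whatever accrued since the last check, capped at capacity
        self.tokens = min(self.capacity, self.tokens + (now - self.last_refill) * self.refill_rate)
        self.last_refill = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False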
This makes token bucket a natural fit for SDK-to-API communication. A machine client can accumulate credit during idle periods and spend it in a burst: exactly the pattern in batch jobs, webhook consumers, and automated workflows. The burst is intentional; token bucket handles it gracefully.
The problem surfaces with human-facing endpoints. A user submits a form ten times in quick succession, more quickly than usual but each action legitimate. Token bucket may block them once the burst capacity is exhausted, even though nothing abusive happened. The user sees an error where they expected success.
The distinction is structural. Token bucket controls the rate of steady-state consumption and tolerates short bursts above that rate. Sliding window controls the count within a time period. Human-facing endpoints usually need the second property. Token bucket is right when burst tolerance is a product feature, not the default for every endpoint that a human might trigger.
What rate limit responses should tell clients
A rate limiter that rejects requests silently is only half-built. When a client hits a limit, it needs three things: confirmation that it hit a rate limit and not a server error, a signal for when it can retry, and visibility into the headroom remaining before the next rejection.
HTTP 429 (Too Many Requests) is the correct status code. HTTP 503 is sometimes used but implies a server-side failure, which corrupts retry logic in client SDKs that treat 503 as a transient backend problem.
The standard response headers, now well-adopted across major public APIs:
RateLimit-Limit: 100
RateLimit-Remaining: 0
RateLimit-Reset: 1715425200
Retry-After: 47

RateLimit-Reset is a Unix timestamp of when the current window resets. Retry-After is the number of seconds the client should wait. Both are worth including; not all clients parse both. More importantly: send these headers on every response, not just 429s. A client that can see it has 3 remaining requests will pace itself. A client that only sees a limit hit after it has already violated the limit will retry blindly and amplify the load.
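One way to assemble these headers for every response, sketched without any particular web framework; the function and parameter names are illustrative:

# Build rate limit headers from whatever state your limiter tracks per client.
import time

def rate_limit_headers(limit, remaining, reset_epoch_seconds):
    headers = {
        "RateLimit-Limit": str(limit),
        "RateLimit-Remaining": str(max(0, remaining)),
        "RateLimit-Reset": str(int(reset_epoch_seconds)),
    }
    if remaining <= 0:
        # only meaningful alongside a 429: how long the client should back off
        headers["Retry-After"] = str(max(0, int(reset_epoch_seconds - time.time())))
    return headers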
Distributed rate limiting: where you’re counting matters
Single-server rate limiting is a solved problem. The interesting failure modes appear when traffic is spread across multiple application servers.
Centralised counting with Redis INCR is the most common approach: every server increments the same counter. It is accurate (the count reflects the true total across all servers), but Redis latency is now in the request path, and a Redis outage disables rate limiting entirely. For most applications this is acceptable, and a Redis Sentinel or Cluster setup addresses the availability concern.
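A sketch of the centralised pattern, using fixed-window keys for brevity; the same idea applies to the other algorithms, and the key layout is illustrative:

# Centralised counting: every application server increments the same Redis key.
import time
from redis import Redis

redis_client = Redis()  # shared by all application servers

def allow(client_id, limit=100, window_seconds=60):
    window_number = int(time.time() // window_seconds)
    key = f"ratelimit:{client_id}:{window_number}"
    pipe = redis_client.pipeline()
    pipe.incr(key)                        # atomic, so the count is accurate across servers
    pipe.expire(key, window_seconds * 2)  # old windows clean themselves up
    count, _ = pipe.execute()
    return count <= limit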
Local counters with periodic sync remove Redis from the critical path. Each server tracks its own count and syncs to a shared store every few seconds. The tradeoff: during the sync gap, each server operates on stale totals. With five servers and a 2-second sync interval, a client can in principle hit 5× the intended limit: 100 requests reaching each server before any of them have communicated. Whether that behaviour is acceptable depends entirely on the cost of a violation.
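A rough sketch of that pattern with a background sync thread; the sync interval and key names are illustrative, and window expiry is omitted to keep the example short:

# Local counters with periodic sync to Redis; the sync gap is visible in allow().
import threading
import time
from collections import defaultdict
from redis import Redis

redis_client = Redis()
SYNC_INTERVAL_SECONDS = 2.0

local_counts = defaultdict(int)  # increments not yet pushed to the shared store
global_counts = {}               # cluster-wide totals as of the last sync
lock = threading.Lock()

def sync_loop():
    while True:
        time.sleep(SYNC_INTERVAL_SECONDS)
        with lock:
            pending = dict(local_counts)
            local_counts.clear()
        for client_id, n in pending.items():
            # push this server's increments, read back the cluster-wide total
            global_counts[client_id] = redis_client.incrby(f"rl:{client_id}", n)

threading.Thread(target=sync_loop, daemon=True).start()

def allow(client_id, limit=100):
    with lock:
        local_counts[client_id] += 1
        seen = global_counts.get(client_id, 0) + local_counts[client_id]
    # requests seen by other servers since the last sync are invisible here
    return seen <= limit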
Sticky routing assigns each client to a fixed server, so local counters stay accurate. It works until a server fails and the client’s count history disappears, accidentally resetting the limit. It also reduces load balancing flexibility in ways that compound over time.
The practical question in the distributed case is: how bad is a 1.5× or 2× violation for a few seconds? For DDoS mitigation, that headroom is unacceptable. For fair-use limits on a B2B API, it is usually fine and the latency saving from local counters is worth it. Match the consistency model to the actual cost of a violation, not to an abstract preference for exactness.
Matching the rate limiting algorithm to your failure mode
The choice follows from the failure mode you are preventing, not from which algorithm appears most often in system design diagrams.
| Situation | Algorithm | Tradeoff you’re accepting |
|---|---|---|
| Human-facing endpoint, count-per-period limit | Sliding window counter | Small approximation error; two integers per client |
| SDK or batch client, burst tolerance is intentional | Token bucket | Poor fit if humans use the same endpoint |
| Low-traffic API, exact counts required | Sliding window log | Memory scales with requests per client |
| Cost control over long windows (hours, days) | Fixed window | 2× burst allowed at the window boundary |
| Distributed, accuracy is critical | Centralised Redis counter | Redis latency in request path; single point of failure |
| Distributed, latency matters more than precision | Local counters with sync | Bursts above the limit possible during sync gap |
Most production failures in rate limiting come not from implementation bugs but from picking an algorithm whose properties do not match the traffic shape. The algorithm you deploy at launch tends to be harder to change than expected — clients learn to work around rate limit behaviour, and configuration calcifies around their workarounds. Pick for your workload, not your deadline.