Rate limiting in production: the four algorithms and their failure modes
Fixed windows, token buckets, leaky buckets, and sliding windows each break differently. Here is what each one actually protects against.
Rate limiting has four distinct algorithms, and most services ship with just one of them (the token bucket) without knowing what the other three do differently. This isn't always wrong. Token buckets handle the common traffic shape well. But each of the four rate-limiting algorithms has a specific failure mode that shows up under burst traffic, and choosing one without understanding the others means the failure, when it arrives, is a surprise.
This article covers all four: fixed window, sliding window log, token bucket, and leaky bucket. For each: what it protects against, what it misses, and the traffic pattern that exposes it. There is also a fifth option, the sliding window counter, that most engineering resources skip over despite being the practical default in most production rate-limiting stacks. It earns a section too.
Fixed window: predictable, and predictably exploitable at clock boundaries
Fixed window is the first implementation most engineers reach for, and for good reason: it fits in a handful of lines, requires a single Redis key per client, and is easy to reason about.
import time
import redis
r = redis.Redis()
def is_allowed(client_id: str, limit: int = 100, window_seconds: int = 60) -> bool:
now = int(time.time())
window_key = f"{client_id}:{now // window_seconds}"
count = r.incr(window_key)
if count == 1:
r.expire(window_key, window_seconds * 2)
return count <= limitThe boundary problem is easy to miss in testing because tests rarely straddle clock-aligned window edges. The scenario: your limit is 100 requests per minute, windows aligned to the wall clock. A client sends 100 requests at 12:00:59. One second passes. The window resets. The same client sends 100 more at 12:01:01. Two hundred requests in two seconds: twice the stated limit, and none of the checks fired.
Fixed window earns its place for daily or hourly quotas where a user exhausting their allocation at 11:59pm and starting fresh at midnight is an acceptable edge case, not a problem worth solving. When precision doesn't matter and simplicity does, it is hard to beat.
Sliding window log: accurate, and memory-expensive at scale
The sliding window log fixes the boundary problem by abandoning aligned windows entirely. It tracks the exact timestamp of every admitted request. When a new request arrives: discard all entries older than the window period, count what remains, admit if below limit, and add the current timestamp. The result is provably correct: at most N requests in any rolling window of the configured length, not just in aligned clock slices.
The cost is memory proportional to the request rate. A client admitted at 1,000 requests per minute keeps 1,000 timestamps in a Redis sorted set. Across thousands of concurrent clients at meaningful rates, this becomes a real memory budget, not a theoretical concern. At a single-service or low-volume scale it is invisible. At a high-volume public API with a wide rate limit window, it adds up.
Sliding window log earns its place in front of operations where boundary doubling has real consequences: billing API calls where 2× throughput translates directly to 2× cost to the caller, sensitive data exports, third-party API calls enforced strictly on the vendor's side. When accuracy matters more than memory, this is the algorithm to reach for.
Token bucket: the flexible default that mishandles burst concurrency
Token buckets are the correct general-purpose choice for most APIs because they model the most common legitimate traffic pattern well. A CI/CD pipeline accumulates tokens overnight and spends them in a burst when engineers push code in the morning. A user clicking through a multi-step wizard generates a short cluster of API calls. Token buckets pass these through without false rejections, which is why they are the framework default in most rate-limiting libraries.
The model: a bucket holds up to a configured capacity of tokens, refilling at a fixed rate. Each request consumes one token. An idle client accumulates tokens up to the capacity ceiling. A full bucket means a burst up to that ceiling is permitted in essentially zero time.
The failure mode is the burst itself. Say the limit is 100 requests per minute, expressed as a token bucket with capacity 100 and a refill rate of 1.67 tokens per second. A client idle for a minute has a full bucket. It sends 100 requests in 200 milliseconds. All 100 pass. Each check found tokens available. But a downstream database or connection pool absorbing 100 requests in 200ms may behave very differently from the same 100 requests spread across 60 seconds, even though the rate check considered both 'within limits'.
Token buckets model rate as requests-per-time-period. Downstream systems often care about arrival concurrency, not rolling average rate. The calibration decision that gets skipped most often: bucket capacity should reflect what the downstream can absorb in a burst, not just the average request rate. Measure p95 burst size over representative windows before setting capacity, rather than picking a round number.
Leaky bucket: a traffic shaper wearing a rate limiter's label
The leaky bucket is not quite a rate limiter. Requests enter a queue; they drain from the queue at a fixed rate regardless of when they arrived. If the queue is full, the incoming request is dropped.
The effect is a constant output rate. Whether traffic arrives at 10 requests per second or 200 simultaneously, the downstream sees a steady stream at the configured drain rate. This is useful when the downstream's problem is burst arrival rather than average rate: connection pools that saturate under concurrency spikes, webhooks fanning out to hundreds of receivers, third-party API calls with strict per-second quotas enforced without mercy on the vendor's side.
The cost is latency. A request arriving in a burst is queued behind requests that arrived earlier. In an interactive API where a user is waiting, queue time shows up as perceptible delay, often 400–1,000ms under moderate burst conditions. Leaky bucket belongs in background processing pipelines where latency is invisible to the caller. It does not belong in front of request-response APIs where the user is waiting.
Sliding window counter: the practical approximation most references skip
There is a fifth approach less often listed alongside the classic four: the sliding window counter, which approximates the accuracy of the sliding window log at O(1) memory. Most engineering blog posts on rate limiting mention the four canonical algorithms and stop there. The sliding window counter is worth knowing about because it is what many production systems actually use once the fixed window's boundary problem becomes unacceptable and the sliding window log's memory footprint becomes too large.
The approximation: weight the previous window's count by the fraction of the current window not yet elapsed.
import time
import redis
r = redis.Redis()
def is_allowed_sliding(client_id: str, limit: int = 100, window_seconds: int = 60) -> bool:
now = time.time()
current_window = int(now // window_seconds)
elapsed_fraction = (now % window_seconds) / window_seconds
prev_key = f"{client_id}:{current_window - 1}"
curr_key = f"{client_id}:{current_window}"
prev_count = int(r.get(prev_key) or 0)
curr_count = int(r.get(curr_key) or 0)
# Weight: as the current window elapses, the previous window counts less
weighted = curr_count + prev_count * (1 - elapsed_fraction)
if weighted >= limit:
return False
r.incr(curr_key)
r.expire(curr_key, window_seconds * 2)
return TrueAs the current window elapses, the previous window's count is weighted down proportionally. If the window is 40% elapsed, you count 40% of this window's requests plus 60% of the previous window's. The maximum boundary error is bounded; it cannot double the way a fixed window can, but it is not zero. Cloudflare used a variant of this approach for Workers rate limiting and described the trade-off in their engineering documentation. For most service-level limits, where the goal is catching broken clients rather than enforcing microsecond-precise quotas, the approximation error is irrelevant noise.
Choosing the right rate-limiting algorithm for your traffic shape
The right algorithm depends on what you are protecting against and which failure mode you can tolerate, not on what the framework defaults to.
| Algorithm | Boundary accuracy | Burst behaviour | Memory | Avoid when |
|---|---|---|---|---|
| Fixed window | Poor: up to 2N at boundary | Passes up to 2N at boundary | O(1) | Downstream cannot absorb 2× burst |
| Sliding window log | Exact | Rejects over-limit accurately | O(requests in window) | Scale means many clients at high rate |
| Token bucket | Per-request, not per-period | Allows burst up to capacity | O(1) | Downstream is sensitive to arrival concurrency |
| Leaky bucket | N/A: shapes only | Smooths to fixed drain rate | O(queue size) | Interactive APIs where latency matters |
| Sliding window counter | Good: bounded error | Weighted across two windows | O(1) | Sub-second precision is required |
One practical note before reaching for any of these: most rate-limiting bugs in production are calibration bugs, not algorithm bugs. The threshold was set from intuition rather than measured traffic distributions, or the burst capacity does not reflect what the downstream can actually absorb. Measure p95 burst size over representative windows before setting limits. Measure the downstream's failure threshold under concurrent load before calibrating service-level capacity. The algorithm gets you most of the way there. Measurement closes the remaining gap.
The compound pattern most production services end up using
Most services that take rate limiting seriously end up with two layers rather than one, because service-level and client-level limits are protecting against different things.
The first layer operates at the service or endpoint level, using a token bucket or sliding window counter with a high capacity threshold. The trigger is extreme behaviour: 1,000 requests in 5 seconds from a single caller, not 20 in a minute. This catches broken scripts and runaway automation without touching normal traffic. Keep the capacity generous enough that a legitimately bursty but bounded client never trips it.
The second layer operates at the authenticated caller level, using a sliding window counter calibrated to normal usage patterns. For an interactive product this might be 100–200 requests per minute. For a developer-tier API it might be 1,000. The approximation error in the sliding window counter is irrelevant at this resolution; client limits are enforced over a long enough horizon that the error averages out.
A third layer appears when there is a burst-sensitive downstream: a leaky bucket in the outbound client wrapper for a third-party API, or in front of a connection-limited database that saturates under concurrency spikes. This layer lives close to the downstream — in the service that calls the downstream — not in the API gateway.
Building all three amounts to three Redis keys per request path instead of one, with different TTLs and different thresholds. The configuration is the work, not the code. Set thresholds from measured traffic distributions: what does p95 burst size look like over a 10-second window across real users? What is the downstream's measured breaking point under concurrent load? Conservative limits that regularly trip on legitimate traffic cause more operational pain than permissive ones that occasionally pass a burst through.
The algorithm matters less than knowing its failure mode. A token bucket with a burst capacity calibrated to what the downstream can actually absorb outperforms a theoretically precise sliding window log with a threshold set from a guess. Know what each algorithm optimises for, know what it costs when that optimisation doesn't fit the traffic you're actually seeing, and set thresholds from data.
Frequently asked questions
Related reading
DuckDB in your SaaS stack: what it replaces, what it doesn't, and the multi-tenant pattern that holds up
DuckDB can replace a cloud data warehouse for per-customer SaaS analytics below 100 GB per customer. The multi-tenant part requires one specific pattern, and it is not the one most tutorials show.
Distributed service timeouts: the three production failure modes your 30-second default doesn't prevent
Most service timeouts are arbitrary numbers set once and never revisited. Here are the three failure modes that escape a misconfigured timeout — and how to calculate values that account for them.
Zero-downtime Postgres schema migrations: what every DDL operation does under the hood
Most Postgres schema changes that cause outages aren't dangerous by nature — they're dangerous because the table is large. Here's what lock each DDL operation takes and the exact patterns to make them safe.