Which rate-limiting algorithm does Redis implement natively?

Redis does not implement rate limiting natively. It provides the primitives — INCR, EXPIRE, sorted sets, Lua scripts — that all four algorithms can use. Most Redis-based implementations reach for fixed window (INCR + EXPIRE) or token bucket because they require the fewest round trips. The sliding window counter can be implemented with two INCR keys and a read, which makes it nearly as efficient at O(1) memory.

Can I layer multiple rate-limiting algorithms on the same service?

Yes, and most production services that take rate limiting seriously do. Service-level and client-level limits often use different algorithms because the traffic shapes they protect against differ. A token bucket at the service level catches runaway burst from a single bad client. A sliding window counter at the client level enforces per-caller quotas over a longer horizon. A leaky bucket in front of a downstream API smooths egress regardless of how requests arrive upstream.

What is the most common production mistake with rate limiting?

Setting thresholds from intuition rather than measured traffic distributions. An overly conservative limit that regularly trips on legitimate traffic causes more operational pain than a permissive one that occasionally lets a burst through. Measure p95 burst size over representative windows before setting capacity limits, and measure the downstream's failure threshold under concurrent load before calibrating service-level limits.

Does the algorithm choice matter when using an API gateway?

It matters to know which algorithm the gateway uses, even if you cannot change it. Most gateways expose token bucket or fixed window as the configurable option. Knowing which one — and its failure mode — determines whether you need an additional layer in front of burst-sensitive downstream services, and whether the burst capacity is calibrated to the downstream's actual tolerance.

EngineeringMay 31, 20267 min readReviewed May 31, 2026

Rate limiting in production: the four algorithms and their failure modes

Fixed windows, token buckets, leaky buckets, and sliding windows each break differently. Here is what each one actually protects against.

By FlowVerify Editorial Team

Rate limiting has four distinct algorithms, and most services ship with just one of them (the token bucket) without knowing what the other three do differently. This isn't always wrong. Token buckets handle the common traffic shape well. But each of the four rate-limiting algorithms has a specific failure mode that shows up under burst traffic, and choosing one without understanding the others means the failure, when it arrives, is a surprise.

This article covers all four: fixed window, sliding window log, token bucket, and leaky bucket. For each: what it protects against, what it misses, and the traffic pattern that exposes it. There is also a fifth option, the sliding window counter, that most engineering resources skip over despite being the practical default in most production rate-limiting stacks. It earns a section too.

Fixed window: predictable, and predictably exploitable at clock boundaries

Fixed window is the first implementation most engineers reach for, and for good reason: it fits in a handful of lines, requires a single Redis key per client, and is easy to reason about.

fixed_window.py

import time
import redis

r = redis.Redis()

def is_allowed(client_id: str, limit: int = 100, window_seconds: int = 60) -> bool:
    now = int(time.time())
    window_key = f"{client_id}:{now // window_seconds}"
    count = r.incr(window_key)
    if count == 1:
        r.expire(window_key, window_seconds * 2)
    return count <= limit

The boundary problem is easy to miss in testing because tests rarely straddle clock-aligned window edges. The scenario: your limit is 100 requests per minute, windows aligned to the wall clock. A client sends 100 requests at 12:00:59. One second passes. The window resets. The same client sends 100 more at 12:01:01. Two hundred requests in two seconds: twice the stated limit, and none of the checks fired.

Fixed window earns its place for daily or hourly quotas where a user exhausting their allocation at 11:59pm and starting fresh at midnight is an acceptable edge case, not a problem worth solving. When precision doesn't matter and simplicity does, it is hard to beat.

Sliding window log: accurate, and memory-expensive at scale

The sliding window log fixes the boundary problem by abandoning aligned windows entirely. It tracks the exact timestamp of every admitted request. When a new request arrives: discard all entries older than the window period, count what remains, admit if below limit, and add the current timestamp. The result is provably correct: at most N requests in any rolling window of the configured length, not just in aligned clock slices.

The cost is memory proportional to the request rate. A client admitted at 1,000 requests per minute keeps 1,000 timestamps in a Redis sorted set. Across thousands of concurrent clients at meaningful rates, this becomes a real memory budget, not a theoretical concern. At a single-service or low-volume scale it is invisible. At a high-volume public API with a wide rate limit window, it adds up.

Sliding window log earns its place in front of operations where boundary doubling has real consequences: billing API calls where 2× throughput translates directly to 2× cost to the caller, sensitive data exports, third-party API calls enforced strictly on the vendor's side. When accuracy matters more than memory, this is the algorithm to reach for.

Token bucket: the flexible default that mishandles burst concurrency

Token buckets are the correct general-purpose choice for most APIs because they model the most common legitimate traffic pattern well. A CI/CD pipeline accumulates tokens overnight and spends them in a burst when engineers push code in the morning. A user clicking through a multi-step wizard generates a short cluster of API calls. Token buckets pass these through without false rejections, which is why they are the framework default in most rate-limiting libraries.

The model: a bucket holds up to a configured capacity of tokens, refilling at a fixed rate. Each request consumes one token. An idle client accumulates tokens up to the capacity ceiling. A full bucket means a burst up to that ceiling is permitted in essentially zero time.

The failure mode is the burst itself. Say the limit is 100 requests per minute, expressed as a token bucket with capacity 100 and a refill rate of 1.67 tokens per second. A client idle for a minute has a full bucket. It sends 100 requests in 200 milliseconds. All 100 pass. Each check found tokens available. But a downstream database or connection pool absorbing 100 requests in 200ms may behave very differently from the same 100 requests spread across 60 seconds, even though the rate check considered both 'within limits'.

Token buckets model rate as requests-per-time-period. Downstream systems often care about arrival concurrency, not rolling average rate. The calibration decision that gets skipped most often: bucket capacity should reflect what the downstream can absorb in a burst, not just the average request rate. Measure p95 burst size over representative windows before setting capacity, rather than picking a round number.

Leaky bucket: a traffic shaper wearing a rate limiter's label

The leaky bucket is not quite a rate limiter. Requests enter a queue; they drain from the queue at a fixed rate regardless of when they arrived. If the queue is full, the incoming request is dropped.

The effect is a constant output rate. Whether traffic arrives at 10 requests per second or 200 simultaneously, the downstream sees a steady stream at the configured drain rate. This is useful when the downstream's problem is burst arrival rather than average rate: connection pools that saturate under concurrency spikes, webhooks fanning out to hundreds of receivers, third-party API calls with strict per-second quotas enforced without mercy on the vendor's side.

The cost is latency. A request arriving in a burst is queued behind requests that arrived earlier. In an interactive API where a user is waiting, queue time shows up as perceptible delay, often 400–1,000ms under moderate burst conditions. Leaky bucket belongs in background processing pipelines where latency is invisible to the caller. It does not belong in front of request-response APIs where the user is waiting.

Sliding window counter: the practical approximation most references skip

There is a fifth approach less often listed alongside the classic four: the sliding window counter, which approximates the accuracy of the sliding window log at O(1) memory. Most engineering blog posts on rate limiting mention the four canonical algorithms and stop there. The sliding window counter is worth knowing about because it is what many production systems actually use once the fixed window's boundary problem becomes unacceptable and the sliding window log's memory footprint becomes too large.

The approximation: weight the previous window's count by the fraction of the current window not yet elapsed.

sliding_window_counter.py

import time
import redis

r = redis.Redis()

def is_allowed_sliding(client_id: str, limit: int = 100, window_seconds: int = 60) -> bool:
    now = time.time()
    current_window = int(now // window_seconds)
    elapsed_fraction = (now % window_seconds) / window_seconds

    prev_key = f"{client_id}:{current_window - 1}"
    curr_key = f"{client_id}:{current_window}"

    prev_count = int(r.get(prev_key) or 0)
    curr_count = int(r.get(curr_key) or 0)

    # Weight: as the current window elapses, the previous window counts less
    weighted = curr_count + prev_count * (1 - elapsed_fraction)
    if weighted >= limit:
        return False

    r.incr(curr_key)
    r.expire(curr_key, window_seconds * 2)
    return True

As the current window elapses, the previous window's count is weighted down proportionally. If the window is 40% elapsed, you count 40% of this window's requests plus 60% of the previous window's. The maximum boundary error is bounded; it cannot double the way a fixed window can, but it is not zero. Cloudflare used a variant of this approach for Workers rate limiting and described the trade-off in their engineering documentation. For most service-level limits, where the goal is catching broken clients rather than enforcing microsecond-precise quotas, the approximation error is irrelevant noise.

Choosing the right rate-limiting algorithm for your traffic shape

The right algorithm depends on what you are protecting against and which failure mode you can tolerate, not on what the framework defaults to.

Algorithm	Boundary accuracy	Burst behaviour	Memory	Avoid when
Fixed window	Poor: up to 2N at boundary	Passes up to 2N at boundary	O(1)	Downstream cannot absorb 2× burst
Sliding window log	Exact	Rejects over-limit accurately	O(requests in window)	Scale means many clients at high rate
Token bucket	Per-request, not per-period	Allows burst up to capacity	O(1)	Downstream is sensitive to arrival concurrency
Leaky bucket	N/A: shapes only	Smooths to fixed drain rate	O(queue size)	Interactive APIs where latency matters
Sliding window counter	Good: bounded error	Weighted across two windows	O(1)	Sub-second precision is required

Rate-limiting algorithms: what each one optimises for

One practical note before reaching for any of these: most rate-limiting bugs in production are calibration bugs, not algorithm bugs. The threshold was set from intuition rather than measured traffic distributions, or the burst capacity does not reflect what the downstream can actually absorb. Measure p95 burst size over representative windows before setting limits. Measure the downstream's failure threshold under concurrent load before calibrating service-level capacity. The algorithm gets you most of the way there. Measurement closes the remaining gap.

The compound pattern most production services end up using

Most services that take rate limiting seriously end up with two layers rather than one, because service-level and client-level limits are protecting against different things.

The first layer operates at the service or endpoint level, using a token bucket or sliding window counter with a high capacity threshold. The trigger is extreme behaviour: 1,000 requests in 5 seconds from a single caller, not 20 in a minute. This catches broken scripts and runaway automation without touching normal traffic. Keep the capacity generous enough that a legitimately bursty but bounded client never trips it.

The second layer operates at the authenticated caller level, using a sliding window counter calibrated to normal usage patterns. For an interactive product this might be 100–200 requests per minute. For a developer-tier API it might be 1,000. The approximation error in the sliding window counter is irrelevant at this resolution; client limits are enforced over a long enough horizon that the error averages out.

A third layer appears when there is a burst-sensitive downstream: a leaky bucket in the outbound client wrapper for a third-party API, or in front of a connection-limited database that saturates under concurrency spikes. This layer lives close to the downstream — in the service that calls the downstream — not in the API gateway.

Building all three amounts to three Redis keys per request path instead of one, with different TTLs and different thresholds. The configuration is the work, not the code. Set thresholds from measured traffic distributions: what does p95 burst size look like over a 10-second window across real users? What is the downstream's measured breaking point under concurrent load? Conservative limits that regularly trip on legitimate traffic cause more operational pain than permissive ones that occasionally pass a burst through.

The algorithm matters less than knowing its failure mode. A token bucket with a burst capacity calibrated to what the downstream can actually absorb outperforms a theoretically precise sliding window log with a threshold set from a guess. Know what each algorithm optimises for, know what it costs when that optimisation doesn't fit the traffic you're actually seeing, and set thresholds from data.

Frequently asked questions

Reddit's zero-downtime migration of 500 Kafka brokers wasn't about Kafka. It was three reusable techniques.

Reddit moved 500+ Kafka brokers and a petabyte of live data from EC2 to Kubernetes with zero downtime. The three techniques behind it aren't specific to Kafka.

Jul 8, 2026Read full article →

EngineeringMay 31, 20267 min readReviewed May 31, 2026

Rate limiting in production: the four algorithms and their failure modes

Fixed windows, token buckets, leaky buckets, and sliding windows each break differently. Here is what each one actually protects against.

By FlowVerify Editorial Team

Fixed window: predictable, and predictably exploitable at clock boundaries

Fixed window is the first implementation most engineers reach for, and for good reason: it fits in a handful of lines, requires a single Redis key per client, and is easy to reason about.

fixed_window.py

import time
import redis

r = redis.Redis()

def is_allowed(client_id: str, limit: int = 100, window_seconds: int = 60) -> bool:
    now = int(time.time())
    window_key = f"{client_id}:{now // window_seconds}"
    count = r.incr(window_key)
    if count == 1:
        r.expire(window_key, window_seconds * 2)
    return count <= limit

Sliding window log: accurate, and memory-expensive at scale

Token bucket: the flexible default that mishandles burst concurrency

Leaky bucket: a traffic shaper wearing a rate limiter's label

The leaky bucket is not quite a rate limiter. Requests enter a queue; they drain from the queue at a fixed rate regardless of when they arrived. If the queue is full, the incoming request is dropped.

Sliding window counter: the practical approximation most references skip

The approximation: weight the previous window's count by the fraction of the current window not yet elapsed.

sliding_window_counter.py

import time
import redis

r = redis.Redis()

def is_allowed_sliding(client_id: str, limit: int = 100, window_seconds: int = 60) -> bool:
    now = time.time()
    current_window = int(now // window_seconds)
    elapsed_fraction = (now % window_seconds) / window_seconds

    prev_key = f"{client_id}:{current_window - 1}"
    curr_key = f"{client_id}:{current_window}"

    prev_count = int(r.get(prev_key) or 0)
    curr_count = int(r.get(curr_key) or 0)

    # Weight: as the current window elapses, the previous window counts less
    weighted = curr_count + prev_count * (1 - elapsed_fraction)
    if weighted >= limit:
        return False

    r.incr(curr_key)
    r.expire(curr_key, window_seconds * 2)
    return True

Choosing the right rate-limiting algorithm for your traffic shape

The right algorithm depends on what you are protecting against and which failure mode you can tolerate, not on what the framework defaults to.

Algorithm	Boundary accuracy	Burst behaviour	Memory	Avoid when
Fixed window	Poor: up to 2N at boundary	Passes up to 2N at boundary	O(1)	Downstream cannot absorb 2× burst
Sliding window log	Exact	Rejects over-limit accurately	O(requests in window)	Scale means many clients at high rate
Token bucket	Per-request, not per-period	Allows burst up to capacity	O(1)	Downstream is sensitive to arrival concurrency
Leaky bucket	N/A: shapes only	Smooths to fixed drain rate	O(queue size)	Interactive APIs where latency matters
Sliding window counter	Good: bounded error	Weighted across two windows	O(1)	Sub-second precision is required

Rate-limiting algorithms: what each one optimises for

The compound pattern most production services end up using

Most services that take rate limiting seriously end up with two layers rather than one, because service-level and client-level limits are protecting against different things.

Rate limiting in production: the four algorithms and their failure modes

Fixed window: predictable, and predictably exploitable at clock boundaries

Sliding window log: accurate, and memory-expensive at scale

Token bucket: the flexible default that mishandles burst concurrency

Leaky bucket: a traffic shaper wearing a rate limiter's label

Sliding window counter: the practical approximation most references skip

Choosing the right rate-limiting algorithm for your traffic shape

The compound pattern most production services end up using

Frequently asked questions

Related reading

Reddit's zero-downtime migration of 500 Kafka brokers wasn't about Kafka. It was three reusable techniques.

Railway disconnected a carrier to contain an outage. It cut its last route instead.

pgvector's HNSW index has a memory cliff, and the Postgres defaults walk right into it

Stay ahead on eSignatures, compliance, and document workflows

Reddit's zero-downtime migration of 500 Kafka brokers wasn't about Kafka. It was three reusable techniques.

Rate limiting in production: the four algorithms and their failure modes

Fixed window: predictable, and predictably exploitable at clock boundaries

Sliding window log: accurate, and memory-expensive at scale

Token bucket: the flexible default that mishandles burst concurrency

Leaky bucket: a traffic shaper wearing a rate limiter's label

Sliding window counter: the practical approximation most references skip

Choosing the right rate-limiting algorithm for your traffic shape

The compound pattern most production services end up using

Frequently asked questions

Related reading

Reddit's zero-downtime migration of 500 Kafka brokers wasn't about Kafka. It was three reusable techniques.

Railway disconnected a carrier to contain an outage. It cut its last route instead.

pgvector's HNSW index has a memory cliff, and the Postgres defaults walk right into it

Stay ahead on eSignatures, compliance, and document workflows

Reddit's zero-downtime migration of 500 Kafka brokers wasn't about Kafka. It was three reusable techniques.