Frequently asked questions

What is the difference between AOF everysec and AOF always for write performance?

AOF everysec batches writes and fsyncs to disk once per second, amortising disk I/O across many writes. AOF always fsyncs on every individual write, making each write synchronously durable at the cost of much higher latency — typically 1-5ms per write versus sub-millisecond for everysec. For most production workloads, everysec is the right choice: you risk losing up to 1 second of writes on a crash, which is acceptable for a cache layer.

How do I tell if Redis Cluster is actively rebalancing?

Run redis-cli --cluster check <host>:<port> against any node and look for warnings about slots in a migrating or importing state. You can also watch your client library's ASK redirect counter, if it exposes one in its stats. A sudden spike in ASK redirects correlates directly with active slot migration.

Does Redis 7's I/O threading change any of these failure modes?

I/O threading, which arrived in Redis 6 and carries over to Redis 7, offloads network I/O to multiple threads while keeping command execution single-threaded. This improves throughput for large-value reads and writes but does not change the AOF rewrite pause, the stampede dynamics, or the Cluster ASK redirect overhead: those are in the command execution and persistence paths, not the network I/O path.

Is there a way to do zero-downtime Redis Cluster rebalancing?

Not completely. ASK redirects during slot migration are inherent to how Redis Cluster migration works. The practical approach is to rebalance during low-traffic windows, keep the migration batch size small (the --cluster-pipeline option of redis-cli --cluster rebalance, which defaults to 10 keys per MIGRATE call) so each blocking step stays short, and ensure your client library handles ASK correctly so the redirect adds one extra round trip rather than a failure.

Engineering | May 18, 2026 | 7 min read | Reviewed May 18, 2026

Redis writes at scale: what benchmarks don't capture

Three failure modes — AOF rewrites, expiry stampedes, and Cluster rebalancing — that only surface in production

By FlowVerify Editorial Team

Key takeaways

Redis benchmarks measure writes under ideal conditions — no persistence, no replicas, no expiry — which does not reflect production.
AOF rewrite pauses can cause 200-500ms write stalls and are invisible to monitoring setups that do not correlate with Redis internal events.
Write-expiry stampedes corrupt counters at high write rates; fix them with atomic Lua scripts rather than GET-then-SET patterns.
Redis Cluster rebalancing adds an extra round trip (ASK redirect) to writes hitting migrated keys, doubling write latency during rebalancing windows.
The right alternative to Redis for write-heavy workloads depends on the write pattern — not a blanket switch to Postgres.
Check persistence mode, slowlog, ASK redirect rate, write atomicity, TTL jitter, and replica lag before concluding Redis is the bottleneck.

Redis writes at scale expose failure modes that the benchmark your team relied on never tested. The benchmark said 1.2 million operations per second. That number is not wrong, but it was measured under conditions that do not exist in production: an empty keyspace, no persistence, no replicas, no key expiry. Production has a populated keyspace, persistence, replicas, and expiring keys.

Then someone fires 5,000 writes per second at a key cluster for six hours and starts seeing P99 latency spikes of 400ms from a system that is supposed to answer in under 5ms.

Here is what actually happens inside Redis when you write, and three specific failure modes that only appear once you are running at scale.

How Redis handles a write: the actual path

A write to Redis is not one operation. Depending on your configuration, it is three:

1. The in-memory write: the key is set in the hash table. This is the fast part: O(1) for a simple SET, bounded by memory bandwidth. This is what the benchmark measures.
2. The persistence write: if you are running AOF (Append Only File), the write is also appended to the AOF file on disk. If you are running RDB, the write only touches memory and counts toward the change threshold that triggers the next fork()-based snapshot. If you are running both (the recommended production configuration), it does both. If you are running neither, it does neither, and a restart loses everything. Stock Redis ships with RDB save points enabled and AOF off, but do not assume your deployment kept those defaults.
3. The replication write: if you have replicas, the write is sent to each one asynchronously. This is nearly invisible in normal operation and matters when a replica falls behind.

Steps 1 and 3 get attention. Step 2 is where the interesting failures live.

Mode	What happens on write	Risk on crash	Latency impact
No persistence	Memory only	Lose everything since last restart	None
RDB only	Memory write; periodic disk snapshot via fork()	Lose all writes since last snapshot	Spike during fork()
AOF (everysec)	Memory + fsync to AOF file once per second	Lose up to 1 second of writes	Low baseline; pause during rewrite
AOF (always)	Memory + fsync on every individual write	Lose at most one write	High sustained latency
AOF + RDB	Both of the above	Minimal	Combined impact of both

Redis persistence modes and their trade-offs

Which mode are you actually running? Most teams do not know without checking:

check-persistence.sh

redis-cli CONFIG GET save          # RDB snapshot schedule
redis-cli CONFIG GET appendonly    # AOF enabled?
redis-cli CONFIG GET appendfsync   # always, everysec, or no
redis-cli INFO persistence          # current AOF size, rewrite status

If appendonly is no and save is empty, you are running with no persistence. Every restart is a cold cache. A large fraction of teams discover this at 3am during an incident.
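The same check can be folded into a small helper. This is a minimal Python sketch, assuming you pass in the raw strings that CONFIG GET returns for appendonly, appendfsync, and save; the function name and the labels it returns are illustrative and not part of any Redis client.

```python
def classify_persistence(appendonly: str, appendfsync: str, save: str) -> str:
    """Map raw CONFIG GET values to one of the persistence modes in the table."""
    aof_on = appendonly.strip().lower() == "yes"
    rdb_on = save.strip() != ""  # an empty save schedule disables RDB snapshots
    if aof_on and rdb_on:
        return f"AOF ({appendfsync}) + RDB"
    if aof_on:
        return f"AOF ({appendfsync})"
    if rdb_on:
        return "RDB only"
    return "No persistence"


if __name__ == "__main__":
    # Values as they might come back from CONFIG GET on an untuned cache node.
    print(classify_persistence("no", "everysec", ""))
```

Logging the result at service startup is one way to make sure nobody learns the answer during an incident.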

AOF rewrite: the pause that does not show up in latency graphs

The AOF file grows continuously. Every write appends a line. Left unchecked, it grows until disk fills. So Redis periodically rewrites the AOF file: it forks a child process, the child writes a compact version of the current keyspace to a new file, then atomically replaces the old AOF.

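The fork itself is fast. What is not fast is what comes after: while the child is writing, the parent continues accepting writes, which are tracked in an in-memory buffer. When the child finishes, the parent applies that buffer to the new AOF file before the file swap. At high write rates, this buffer gets large. Applying it is blocking. This describes the mechanism up to Redis 6.2. Redis 7.0 introduced multi-part AOF, where writes that arrive during a rewrite go to a new incremental AOF file instead of an in-memory rewrite buffer, so check which version you run before assuming this stall applies.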

If you are writing 50,000 keys per second and an AOF rewrite takes 8 seconds (realistic for a 10GB keyspace), the buffer holds roughly 400,000 operations when the rewrite completes. Flushing it can take hundreds of milliseconds. During that time, all writes queue behind it.
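The arithmetic behind that estimate is worth making explicit. A rough sketch: the backlog is simply write rate times rewrite duration, and the stall depends on how fast the parent can drain it. The drain rate used below (1,000,000 operations per second) is a made-up figure for illustration, not a measured Redis number.

```python
def rewrite_backlog_ops(writes_per_second: int, rewrite_seconds: float) -> int:
    """Operations buffered by the parent while the rewrite child is running."""
    return int(writes_per_second * rewrite_seconds)


def flush_stall_ms(backlog_ops: int, drain_ops_per_second: float) -> float:
    """Blocking time to apply the backlog, given an ASSUMED drain rate."""
    return 1000.0 * backlog_ops / drain_ops_per_second


if __name__ == "__main__":
    backlog = rewrite_backlog_ops(50_000, 8)       # figures from the text
    stall = flush_stall_ms(backlog, 1_000_000)     # hypothetical drain rate
    print(backlog, stall)
```

Plugging in your own write rate and observed rewrite duration gives a first-order idea of whether the stall is worth chasing.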

Watch for this in INFO persistence:

aof_rewrite_in_progress:1
aof_current_size:2853123104
aof_base_size:142657843
aof_pending_rewrite:0

When aof_rewrite_in_progress flips from 1 to 0, watch your latency graph. If you see unexplained 200-500ms P99 spikes uncorrelated with traffic, correlate them against rewrite completion times. This is invisible to most monitoring setups, which track request latency without referencing Redis internal events.
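One way to get those completion times is to poll INFO persistence and record each 1-to-0 transition. A minimal sketch, assuming you already collect (timestamp, raw INFO text) samples somewhere; the function names are illustrative.

```python
from typing import Dict, Iterable, List, Tuple


def parse_info(text: str) -> Dict[str, str]:
    """Parse the key:value lines of an INFO section, skipping '# Section' headers."""
    fields: Dict[str, str] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition(":")
        if sep:
            fields[key] = value
    return fields


def rewrite_completions(samples: Iterable[Tuple[float, str]]) -> List[float]:
    """Timestamps of samples where aof_rewrite_in_progress went from 1 to 0."""
    completions: List[float] = []
    previous = None
    for timestamp, info_text in samples:
        current = parse_info(info_text).get("aof_rewrite_in_progress")
        if previous == "1" and current == "0":
            completions.append(timestamp)
        previous = current
    return completions
```

Emitting each returned timestamp as an annotation on the latency dashboard makes the correlation visible at a glance. The resolution is limited by the polling interval, so poll at least once per second if the spikes are short.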

There is no cost-free fix, but three practical mitigations:

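Raise auto-aof-rewrite-percentage and auto-aof-rewrite-min-size so automatic rewrites fire less often, or set auto-aof-rewrite-percentage to 0 to disable them and trigger BGREWRITEAOF from a scheduled job during low-traffic windows rather than whenever the file doubles.
Set no-appendfsync-on-rewrite yes to skip fsyncs in the main process while a rewrite is running, which removes disk contention with the child. The cost is durability: the redis.conf comments warn that, with default Linux settings, up to about 30 seconds of writes can be lost if the server crashes during the rewrite.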
Separate write-heavy keys from read-heavy keys onto different Redis instances so a write-heavy instance's rewrite cycle does not spike read latency.

The write-expiry stampede

The read-cache stampede is well documented: when a hot key expires, all readers simultaneously find a cache miss and rush the backend. The write-expiry variant is less discussed.

Consider a write-heavy counter: a per-user rate limit bucket, a rolling window aggregate, or a page-level view counter. A common implementation pattern:

GET key
if missing: SET key 0 EX 60
INCR key

At low traffic, this works. At high traffic, the expiry creates a thundering-herd problem on the write path. When the key expires with 800 concurrent writers active, all 800 find the key missing, all 800 issue SET key 0, and INCR is now racing against a key that 800 processes are simultaneously resetting. The first few hundred INCRs hit the value just set; then the key expires and the cycle repeats. Your counters are garbage.

The correct pattern uses an atomic Lua script:

atomic-counter.lua

local current = redis.call('INCR', KEYS[1])
if current == 1 then
  redis.call('EXPIRE', KEYS[1], ARGV[1])
end
return current

This ensures the expiry is set exactly once, by the writer who created the key, atomically. The Lua script executes as a single Redis command, so no client can interleave between the INCR and EXPIRE.

The scale-specific problem: at low throughput, the race window is small enough that your counter is slightly off but rarely resets mid-window. At 50,000 writes per second on a 60-second TTL key, the race is constant. The key resets dozens of times per cycle.
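The difference between the two patterns can be shown with a toy, single-process model. This sketch uses an in-memory stand-in (ToyRedis is hypothetical and has no real TTL clock) and replays one deterministic worst-case interleaving: every writer sees the key missing before any of them writes. It illustrates the race and is not a benchmark.

```python
class ToyRedis:
    """In-memory stand-in with just enough behaviour for this demo."""

    def __init__(self):
        self.values = {}
        self.ttl_sets = 0  # how many times a TTL was (re)applied

    def get(self, key):
        return self.values.get(key)

    def set_with_ttl(self, key, value, ttl):
        self.values[key] = value
        self.ttl_sets += 1

    def incr(self, key):
        self.values[key] = self.values.get(key, 0) + 1
        return self.values[key]

    def expire(self, key, ttl):
        self.ttl_sets += 1


def racy_writers(store, key, writers):
    # Phase 1: every writer runs GET just after expiry and sees the key missing.
    saw_missing = [store.get(key) is None for _ in range(writers)]
    # Phase 2: each writer acts on its stale observation: SET 0, then INCR.
    for missing in saw_missing:
        if missing:
            store.set_with_ttl(key, 0, 60)
        store.incr(key)
    return store.get(key)


def atomic_writers(store, key, writers):
    # Mirrors the Lua script: INCR first, set the TTL only when the key is new.
    for _ in range(writers):
        if store.incr(key) == 1:
            store.expire(key, 60)
    return store.get(key)


if __name__ == "__main__":
    racy, atomic = ToyRedis(), ToyRedis()
    print(racy_writers(racy, "hits", 800), racy.ttl_sets)
    print(atomic_writers(atomic, "hits", 800), atomic.ttl_sets)
```

In the racy run each SET wipes out the increments that came before it, so the counter ends far below the number of writers and the TTL is reapplied by every one of them. In the atomic run every increment is kept and the TTL is applied once.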

Redis Cluster rebalancing and write latency

Redis Cluster shards data across nodes using 16,384 hash slots: each key maps to a slot via CRC16 of the key modulo 16384, and each node owns a subset of the slots. When you add or remove nodes, slots migrate between them. During migration, writes to a migrating slot take a different path:

Client sends a write to the source node.
Source node checks whether the slot is migrating.
If the key exists on source, the write proceeds as normal.
If the key has already migrated to the destination, source returns an ASK redirect.
Client re-issues ASKING plus the original command to the destination node, adding one extra round trip.

The ASK redirect adds a full network round trip to every write that hits a key that has migrated. If your client handles only MOVED redirects and not ASK, it will loop or fail. If it handles ASK correctly, affected writes take 2x the latency of a normal write.
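What "handles ASK correctly" means in practice can be sketched in a few lines. The error strings follow the Redis Cluster redirect format (for example "ASK 3999 127.0.0.1:6381"); the Redirect type and function names are illustrative, and real client libraries do this internally.

```python
from typing import NamedTuple, Optional


class Redirect(NamedTuple):
    kind: str  # "ASK" or "MOVED"
    slot: int
    host: str
    port: int


def parse_redirect(error: str) -> Optional[Redirect]:
    """Parse an ASK or MOVED error reply; return None for any other error."""
    parts = error.lstrip("-").split()
    if len(parts) != 3 or parts[0] not in ("ASK", "MOVED"):
        return None
    host, _, port = parts[2].rpartition(":")
    if not parts[1].isdigit() or not port.isdigit():
        return None
    return Redirect(parts[0], int(parts[1]), host, int(port))


def next_action(redirect: Redirect) -> str:
    # ASK is a one-off: send ASKING plus the command to the target and leave
    # the cached slot map alone. MOVED is permanent: update the slot map.
    if redirect.kind == "ASK":
        return "send ASKING then retry on target; keep slot map"
    return "update slot map then retry on target"


def round_trips(first_reply_error: Optional[str]) -> int:
    """One trip for a direct hit, two when the first reply is a redirect."""
    if first_reply_error is not None and parse_redirect(first_reply_error):
        return 2
    return 1
```

A client that treats ASK like MOVED will update its slot map too early and bounce between nodes until the migration finishes, which is one way the looping failure shows up.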

In production, rebalancing typically takes minutes to hours depending on keyspace size and migration batch settings. During that window, P99 write latency can be 2-4x higher than baseline, depending on what fraction of your keyspace sits in migrating slots.
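To estimate that fraction for a sample of your own keys, you can compute slots client-side. A minimal sketch, assuming the standard Redis Cluster key hashing (CRC16 in its XMODEM variant, modulo 16384, with hash-tag handling); Python's binascii.crc_hqx with an initial value of 0 computes that CRC. The set of migrating slots is something you would read off the cluster check output and pass in.

```python
import binascii
from typing import Iterable, Set


def key_slot(key: bytes) -> int:
    """Hash slot for a key, honouring {hash tags} when they are non-empty."""
    start = key.find(b"{")
    if start != -1:
        end = key.find(b"}", start + 1)
        if end != -1 and end != start + 1:
            key = key[start + 1:end]  # hash only the tag contents
    return binascii.crc_hqx(key, 0) % 16384


def fraction_in_migrating_slots(keys: Iterable[bytes], migrating_slots: Set[int]) -> float:
    """Share of the sampled keys whose slot is currently being migrated."""
    sample = list(keys)
    if not sample:
        return 0.0
    hits = sum(1 for key in sample if key_slot(key) in migrating_slots)
    return hits / len(sample)
```

Running this over a sample of hot write keys before a planned rebalance shows which of them will pay the redirect cost.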

The diagnosis is simple: run redis-cli --cluster check <host>:<port> during a latency event to see slot migration status. Mitigation requires planning rather than reaction:

Rebalance during known-quiet windows, not reactively during traffic spikes.
Use the --cluster-pipeline option of redis-cli --cluster rebalance to cap how many keys move per MIGRATE batch (the default is 10), which keeps each blocking step short and reduces the blast radius.
Monitor the ASK redirect rate in your client library's metrics. A spike in ASK redirects is a direct signal of active slot migration.

What to reach for when Redis writes are the bottleneck

The most common piece of bad advice here is to replace Redis with Postgres. That is correct for a specific case: small-to-medium datasets, writes and reads from the same record, network latency to the database is not the bottleneck. It is wrong in most situations where you have actually hit Redis write limits.

The more useful frame is: what is the write pattern?

Write-heavy with time-series semantics

Metrics, counters, events: Redis Streams fits better than SET/INCR here. It is designed for append-only writes and has native consumer group semantics. For write rates above 100,000 events per second where you also need query, TimescaleDB or ClickHouse handle high-write-rate time series with better compression and without the AOF rewrite problem.

Write-heavy with strong consistency

Redis is not the right tool. The in-memory nature and async replication mean you will get split-brain scenarios under network partition. This is where you want a Raft-based store (etcd or TiKV) or Postgres with synchronous replication and connection pooling.

Write-heavy hot-key pattern

Many clients writing to the same key simultaneously: this is a data model problem more than a Redis problem. Redis handles roughly 200,000 writes per second to a single key before the event loop becomes the bottleneck. If you are at that limit, shard the key space: partition your counter into N sub-keys, write to key:hash(writer_id) % N, and aggregate on read. This is the approach Redis itself documents for hot-key scenarios.
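The sub-key scheme can be sketched as follows. A plain dict stands in for Redis here so the example runs anywhere; in a real deployment the increment would be an INCR on the sub-key and the read would sum N GETs (or one MGET on a single node). The function names and the key format are illustrative.

```python
import zlib
from typing import Dict


def shard_key(base_key: str, writer_id: str, shards: int) -> str:
    # zlib.crc32 is stable across processes, unlike Python's salted built-in hash().
    index = zlib.crc32(writer_id.encode("utf-8")) % shards
    return f"{base_key}:{index}"


def increment(store: Dict[str, int], base_key: str, writer_id: str, shards: int) -> None:
    key = shard_key(base_key, writer_id, shards)
    store[key] = store.get(key, 0) + 1  # stands in for INCR on the sub-key


def read_total(store: Dict[str, int], base_key: str, shards: int) -> int:
    # Aggregate on read: sum every sub-key, treating missing ones as zero.
    return sum(store.get(f"{base_key}:{i}", 0) for i in range(shards))


if __name__ == "__main__":
    store: Dict[str, int] = {}
    for n in range(1000):
        increment(store, "pageviews", f"writer-{n}", shards=16)
    print(read_total(store, "pageviews", 16), len(store))
```

The trade-off is that reads become N times more expensive and are no longer a single atomic snapshot, which is usually acceptable for counters read far less often than they are written.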

The diagnostic question is not 'is Redis wrong?' It is 'which write pattern does my workload fit, and am I using the right Redis features for that pattern?' Most of the time, the answer is the wrong data structure or the wrong persistence mode rather than the wrong database.

The diagnostic checklist

If you are seeing unexplained write latency spikes in a Redis deployment, check these in order:

Check your persistence mode first. Run redis-cli INFO persistence. If AOF is enabled, look at aof_rewrite_in_progress. Correlate rewrite completion times with latency spikes in your monitoring.
Check the slowlog. redis-cli SLOWLOG GET 25. Any command over 10ms is a candidate for investigation.
Check the ASK redirect rate in your client library's metrics. A sudden spike means active cluster rebalancing.
Check write patterns for atomicity gaps. If you are doing GET then check then SET then INCR sequences, convert them to Lua scripts or atomic Redis commands such as SET key value NX EX seconds.
Check key TTL distribution. If a large fraction of your write-heavy keys expire at the same clock minute (all set with EX 3600 at startup), you get synchronised stampedes every hour. Add jitter: EX followed by 3600 plus a random offset of up to 300 seconds.
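Check replica lag. Run redis-cli INFO replication on the primary and compare master_repl_offset with the offset field on each slaveN line (a replica reports the same position as slave_repl_offset in its own INFO output). Significant lag means the primary is holding unsent writes in that replica's output buffer, which costs memory and can end in a forced resync.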
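Two of these checks are easy to script. A minimal sketch: jittered_ttl implements the TTL jitter from the checklist, and replica_lag_bytes computes the offset delta from INFO replication text as printed on a primary (lines of the form slave0:ip=...,offset=...). Both function names are illustrative.

```python
import random
import re
from typing import Dict, Optional


def jittered_ttl(base_seconds: int = 3600, max_jitter_seconds: int = 300,
                 rng: Optional[random.Random] = None) -> int:
    """Base TTL plus a random offset so keys set together do not expire together."""
    source = rng if rng is not None else random
    return base_seconds + source.randint(0, max_jitter_seconds)


def replica_lag_bytes(info_replication: str) -> Dict[str, int]:
    """Bytes each replica trails the primary by, keyed by its slaveN name."""
    master_offset = None
    replica_offsets: Dict[str, int] = {}
    for line in info_replication.splitlines():
        line = line.strip()
        if line.startswith("master_repl_offset:"):
            master_offset = int(line.split(":", 1)[1])
            continue
        match = re.match(r"(slave\d+):.*\boffset=(\d+)", line)
        if match:
            replica_offsets[match.group(1)] = int(match.group(2))
    if master_offset is None:
        return {}
    return {name: master_offset - offset for name, offset in replica_offsets.items()}
```

The lag figure is a byte count in the replication stream, not a time, so alert on its trend rather than on a fixed threshold.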

Redis is genuinely fast. The in-memory write path is hard to beat for the right workload. The failure modes above are predictable — they emerge at scale because that is when persistence, replication, and cluster mechanics become the dominant cost rather than the memory operation itself. Know your persistence mode, audit your write patterns for atomicity, and plan cluster rebalancing as scheduled maintenance rather than a reactive emergency.
