Read replicas are not transparent: the application bugs async replication creates
The DBA adds the replica. The application engineer deals with the fallout.
The standard advice for scaling a Postgres-backed service is clear: when your primary strains under read load, add a read replica and route SELECT queries to it. Primary handles writes; replica handles reads; problem solved. The database layer works exactly as described. The problem is in what the advice leaves out about your application.
Async streaming replication is the default in Postgres and in most managed database offerings. It means the replica is always slightly behind the primary. Usually milliseconds. Under write bursts, seconds. After a maintenance event or network partition, minutes. This gap is not an edge case or a misconfiguration. It is the designed behaviour.
Most teams add a replica and route reads to it without auditing their codebase for the patterns that break under this guarantee. The resulting bugs do not appear in tests. They appear intermittently in production, reported by confused users, pointing to no specific code path.
How async replication works
When you write to the primary, Postgres writes the change to its write-ahead log (WAL). The replica streams that WAL and replays it. Replication lag is the delay between when the WAL record is written on the primary and when it is applied on the replica.
On a healthy, lightly loaded system with good network connectivity between primary and replica, this lag is typically under 50ms. Under write bursts — bulk imports, end-of-period reporting jobs, high concurrent write rates — it can grow to seconds. After a replica restart or a brief network partition, catching back up may take minutes.
The key point: when your application routes a query to the replica, the replica may be anywhere from 50ms to several minutes behind the primary. The application receives a valid SQL response with no error, no warning. The response just reflects an older state of the data.
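The gap is observable from both sides of the stream. A minimal sketch using node-postgres, where the pool names and hosts are placeholders rather than anything this article prescribes:

```js
// Sketch: observing replication lag from both sides.
// `primaryPool` and `replicaPool` are illustrative node-postgres pools.
const { Pool } = require('pg');

const primaryPool = new Pool({ host: 'primary.internal' }); // hypothetical host
const replicaPool = new Pool({ host: 'replica.internal' }); // hypothetical host

async function reportLag() {
  // On the primary: per-replica lag as seen by each WAL sender.
  const senders = await primaryPool.query(
    'SELECT application_name, replay_lag FROM pg_stat_replication'
  );
  console.log(senders.rows);

  // On the replica: time since the last replayed transaction committed.
  // Note: this reads high on an idle primary, where there is simply nothing to replay.
  const replayed = await replicaPool.query(
    'SELECT now() - pg_last_xact_replay_timestamp() AS approx_lag'
  );
  console.log(replayed.rows[0].approx_lag);
}
```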
The four bug classes
Bug 1: The phantom 404 after creation
A user submits a form. Your application writes the new record to the primary and returns a 201 response with the new record's ID. The frontend immediately follows up with a GET request to /api/records/:id, which routes to the replica. The replica has not yet applied the WAL record. The query returns no rows. Your application returns a 404.
From the user's perspective: they clicked Save, saw a success message, and the record does not exist.
The bug triggers whenever the frontend's follow-up read reaches the replica before the write has been replayed. A modern single-page application, where navigation fires the GET almost immediately after a successful POST, wins that race often; whether it wins on any given request depends on the lag at that moment, which is why the failure looks intermittent rather than constant.
Bug 2: Stale list data after an update
A user changes their name in their profile. The write goes to the primary. They navigate to a page listing all users in their organisation. That query routes to the replica. Their old name appears. They think the save failed, try again, and both writes land on the primary in quick succession.
This class appears in any UI pattern where an update is followed by a list reload: order management screens, document status views, team permission panels. The bug usually surfaces as a support ticket that reads: "I changed it but the list didn't update."
Bug 3: "My settings aren't saved"
A user changes a preference. The application writes to the primary and confirms success. They navigate away and back. If the profile page's query routes to the replica and the replica has not caught up, the old value appears. The user thinks the page discarded their change.
This is the most frustrating class for users because it appears to validate their input, confirm success, and then silently ignore the change. The support ticket reads: "It says saved but when I refresh it goes back."
Bug 4: Multi-step workflow consistency failures
The hardest class to find. A background job or a downstream service reads state from the replica to decide its next action, but that state is stale relative to a write that just committed. Example: an approval workflow where one step writes an approval record and the next step reads the current approval count. If the count query hits the replica before replication catches up, the step may proceed as if the approval was not submitted, duplicating or skipping a stage.
These bugs involve different processes, often running at different times, and the failure manifests in application logic rather than in a missing record. They are the hardest to attribute to replication lag because the connection between the write and the stale read is not obvious from logs.
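A condensed sketch of that race, with the table name and approval threshold standing in as hypothetical examples:

```js
// Sketch of the Bug 4 race. Table name and threshold are hypothetical.
// Step 1, request handler: record an approval on the primary.
async function submitApproval(pools, requestId, approverId) {
  await pools.primary.query(
    'INSERT INTO approvals (request_id, approver_id) VALUES ($1, $2)',
    [requestId, approverId]
  );
}

// Step 2, background job: read the approval count to decide the next action.
// If this SELECT hits the replica before the INSERT above has been replayed,
// the count is off by one and the workflow advances late, stalls, or repeats a stage.
async function shouldAdvance(pools, requestId) {
  const { rows } = await pools.replica.query(
    'SELECT count(*)::int AS approvals FROM approvals WHERE request_id = $1',
    [requestId]
  );
  return rows[0].approvals >= 2; // wrong answer whenever the replica is behind
}
```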
Why these bugs are invisible in tests
Unit and integration tests typically run against a single database instance with no replication. Even tests that use a full database stack usually run serially, so there is no concurrent read-after-write race to trigger. Load tests rarely verify correctness after writes; they measure throughput and latency, not data consistency.
The intermittency is the tell. If a bug is "works in staging, happens sometimes in production, we can't reproduce it on demand," replica lag is a plausible cause worth investigating before reaching for more complex explanations.
Replication lag also grows with write load, so teams see the bugs most during peak traffic, exactly when they are least positioned to investigate. A service that runs cleanly during normal hours may produce a flurry of stale-read reports during a data migration or a traffic spike.
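If you want to reproduce the race deterministically rather than wait for a traffic spike, one option is to freeze WAL replay on a disposable test replica while the scenario runs. A sketch, assuming a suitably privileged connection to that test replica:

```js
// Sketch: making the stale-read race reproducible in a test environment.
// Assumes `replicaAdmin` is a privileged connection to a throwaway test replica.
async function withFrozenReplica(replicaAdmin, fn) {
  await replicaAdmin.query('SELECT pg_wal_replay_pause()');
  try {
    // While replay is paused, every write to the primary is invisible on the
    // replica, so read-after-write bugs trigger on every run instead of rarely.
    await fn();
  } finally {
    await replicaAdmin.query('SELECT pg_wal_replay_resume()');
  }
}
```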
Application-level fixes
These fixes belong in application code, not in database configuration. Tuning replica lag through hardware upgrades, smaller write batches, or synchronous replication addresses symptoms. Routing logic addresses the cause.
Read-your-writes: sticky primary after write
After a write, route the same request's subsequent reads to the primary for a configurable window. One to two seconds covers the immediate post-write navigation pattern for the same user session, which addresses Bug 1, Bug 2, and Bug 3.
Most application frameworks support this by threading a connection preference through request context. A simple implementation uses a session flag set on every write response:
```js
// After any write, mark the session to prefer primary reads
function markPrimaryAfterWrite(req) {
  req.session.primaryUntil = Date.now() + 1500; // 1.5 seconds
}

// Read routing middleware: call this before any SELECT
function getReadConnection(req, pools) {
  const preferPrimary =
    req.session.primaryUntil && Date.now() < req.session.primaryUntil;
  return preferPrimary ? pools.primary : pools.replica;
}

// Usage in a route handler (the `name` column is illustrative)
app.post('/api/records', async (req, res) => {
  const record = await pools.primary.query(
    'INSERT INTO records (name) VALUES ($1) RETURNING id',
    [req.body.name]
  );
  markPrimaryAfterWrite(req);
  res.status(201).json(record.rows[0]);
});

app.get('/api/records/:id', async (req, res) => {
  const conn = getReadConnection(req, pools);
  const record = await conn.query(
    'SELECT * FROM records WHERE id = $1',
    [req.params.id]
  );
  if (!record.rows.length) return res.status(404).json({ error: 'not found' });
  res.json(record.rows[0]);
});
```

This is not "never use the replica." It is "use the replica unless the user just wrote something, in which case give replication a moment to catch up."
Primary reads for consistency-sensitive paths
For multi-step workflow logic (Bug 4), the fix is architectural: do not route workflow-state reads to the replica. Consistency-sensitive paths — approval workflows, payment state machines, inventory checks before fulfilment — should read from the primary regardless of whether a write just occurred. These queries are usually a small fraction of total volume, so the load relief from the replica still applies to the bulk of reads (dashboards, search, list views with no recent writes).
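Continuing the earlier routing example, this can be as blunt as never calling the routing helper on those paths; the route and table below are hypothetical:

```js
// Consistency-sensitive path: always read workflow state from the primary,
// regardless of any read-your-writes window. Route and table names are hypothetical.
app.get('/api/requests/:id/approval-state', async (req, res) => {
  const { rows } = await pools.primary.query(
    'SELECT status, approval_count FROM approval_state WHERE request_id = $1',
    [req.params.id]
  );
  if (!rows.length) return res.status(404).json({ error: 'not found' });
  res.json(rows[0]);
});
```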
Causal tokens for cross-service consistency
If reads and writes happen in different services, the sticky-session approach does not propagate across service boundaries. A causal consistency token solves this: after a write, the service captures the primary's WAL position and returns it to the client, which forwards it on subsequent requests; the read-routing layer then serves from the replica only if it has replayed at least up to that position. Postgres exposes what you need: pg_current_wal_lsn() on the primary after the write, and pg_last_wal_replay_lsn() on the replica for the comparison. This is more complex to wire up but solves the cross-service case without routing all reads to the primary.
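A sketch of that token flow using the two functions above; the header name and routing shape here are assumptions, not a standard:

```js
// Sketch of a causal consistency token. The `x-causal-token` header and the
// routing shape are assumptions for illustration.

// Write path: perform the write, then capture the primary's WAL position.
async function writeWithToken(pools, res, doWrite) {
  await doWrite(pools.primary);
  const { rows } = await pools.primary.query('SELECT pg_current_wal_lsn() AS lsn');
  res.set('x-causal-token', rows[0].lsn);
}

// Read path: serve from the replica only if it has replayed past the token.
async function getCausalConnection(pools, token) {
  if (!token) return pools.replica;
  const { rows } = await pools.replica.query(
    'SELECT pg_wal_lsn_diff(pg_last_wal_replay_lsn(), $1::pg_lsn) >= 0 AS caught_up',
    [token]
  );
  return rows[0].caught_up ? pools.replica : pools.primary;
}
```

The token only needs to travel with requests that immediately follow a write; everything else can skip the check and go straight to the replica.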
What should go to the replica
| Query type | Route to | Why |
|---|---|---|
| Dashboard aggregates, reports | Replica | Stale by minutes is acceptable; these are inherently approximate |
| Admin list views with low write rate | Replica | Low write frequency makes lag rarely visible |
| Profile/settings page (cold load, no recent write) | Replica | Stale by seconds is fine for a cold page load |
| Profile/settings page (right after a save) | Primary | Read-your-writes window; Bug 2 and Bug 3 territory |
| GET immediately after POST (same session) | Primary | The phantom 404 scenario; always route to primary here |
| Workflow state in multi-step processes | Primary | Correctness matters; lag is not acceptable |
| Background job inputs that drive logic | Primary | Stale state drives wrong decisions; Bug 4 applies |
| Search and filtering | Replica | Results lagging by a few seconds is a minor UX concern, not a correctness bug |
When not to bother with a replica
Read replicas are a net gain for workloads that are heavily read-skewed and where most reads tolerate approximate data. They add operational complexity: lag monitoring, routing middleware, potential split-brain during failover. The benefit varies with your query mix.
Consider skipping the replica if your primary is comfortably handling read load without it, if more than a third of your reads are consistency-sensitive (you'd be routing most reads to primary anyway), if the bottleneck is actually write throughput rather than read capacity, or if your team lacks bandwidth to add lag monitoring and routing middleware.
A connection pooler like PgBouncer and targeted query optimisation often recover more capacity from an existing primary than a replica adds, without any consistency trade-off. Profile before you scale: if primary CPU is well below its ceiling but connection count is high, pooling is the better first step.
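A couple of signals worth pulling from the primary before deciding; the helper below is a sketch assuming a node-postgres pool, not a full profiling setup:

```js
// Sketch: rough signals for "pooling first, replica later".
// Assumes `primaryPool` is a node-postgres pool connected to the primary.
async function connectionPressure(primaryPool) {
  const { rows } = await primaryPool.query(`
    SELECT
      (SELECT count(*) FROM pg_stat_activity)                      AS connections,
      current_setting('max_connections')::int                      AS max_connections,
      (SELECT count(*) FROM pg_stat_activity WHERE state = 'idle') AS idle_connections
  `);
  // Many idle connections near max_connections points at pooling, not replicas.
  return rows[0];
}
```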
When you do add a replica, treat it as a change to your application's consistency model, not purely a database topology change. Audit the codebase for read-after-write patterns before the replica goes live. The bugs it creates are intermittent, hard to reproduce, and almost always filed as "the app is acting weird" rather than "we have a replication lag problem."