Why your p99 latency is lying to you (and what to measure instead)
Your dashboard says p99 is 240 ms. Users are filing tickets about timeouts. Both are true, and the gap between them is not a mystery. It is a consequence of how percentile metrics are computed, aggregated, and displayed in most observability stacks.
This is not a niche concern. Misread latency metrics have delayed incident responses, justified bad architectural decisions, and caused engineering teams to spend weeks optimising the wrong thing. Understanding what p99 actually measures, and where it breaks down, is one of the highest-value things a backend engineer can do.
What p99 latency actually means
Percentile latency answers a specific question: "If I sort all my requests by how long they took, what is the duration of the request at position 99% from the bottom?"
In a set of 1,000 requests sorted from fastest to slowest, p99 is the latency of the 990th request. The fastest 990 requests finished in that time or less. The slowest 10 took longer, possibly far longer.
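As a minimal sketch of that definition, the nearest-rank method below sorts the samples and picks the value at the 99% position. The function name and the synthetic latencies (1 ms to 1,000 ms, one request each) are illustrative assumptions, not data from a real system; production libraries often interpolate between neighbouring samples instead.

```python
import math

def percentile_nearest_rank(samples, q):
    """Return the q-th percentile (0 < q <= 100) using the nearest-rank method."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    rank = math.ceil(q * len(ordered) / 100)  # 1-based position in the sorted list
    return ordered[rank - 1]

# Hypothetical data: 1,000 requests taking 1 ms, 2 ms, ..., 1000 ms.
latencies_ms = list(range(1, 1001))

p99 = percentile_nearest_rank(latencies_ms, 99)
slower_than_p99 = sum(1 for value in latencies_ms if value > p99)

print(f"p99 = {p99} ms, requests slower than p99 = {slower_than_p99}")
```

With this data the p99 is the 990th fastest request, and exactly ten requests are slower than it. Nothing in the p99 value says how much slower those ten are.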
This is well understood in theory. The problem is what happens in practice, at scale, with real monitoring infrastructure.
The aggregation problem
Most dashboards compute p99 per time window (typically one minute) and then graph those values over time. Here is where the maths falls apart.
Percentiles are not additive. You cannot take the p99 from window 1 (180 ms), the p99 from window 2 (260 ms), and average them to get a meaningful p99 across both windows. The actual p99 across all requests in both windows lies somewhere between 180 ms and 260 ms, but where it lands depends on how many requests each window handled and how their tails are shaped, and the 220 ms average knows neither. A quiet window with a handful of slow requests gets the same vote as a busy one.
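The two-window example can be checked with a small sketch. The request lists below are constructed (not measured) so that the per-window p99 values match the 180 ms and 260 ms figures in the text; the window sizes of 1,000 and 100 requests are an assumption chosen to make the imbalance visible.

```python
import math

def p99(samples):
    """Nearest-rank p99 of a non-empty list of latencies."""
    ordered = sorted(samples)
    return ordered[math.ceil(99 * len(ordered) / 100) - 1]

# Hypothetical busy window: 1,000 requests, p99 = 180 ms.
window_1 = [100] * 989 + [180] + [200] * 10
# Hypothetical quiet window: 100 requests, p99 = 260 ms.
window_2 = [150] * 98 + [260] + [300]

averaged_p99 = (p99(window_1) + p99(window_2)) / 2
true_p99 = p99(window_1 + window_2)

print(f"window 1 p99: {p99(window_1)} ms")
print(f"window 2 p99: {p99(window_2)} ms")
print(f"average of the two p99 values: {averaged_p99} ms")
print(f"p99 over all requests: {true_p99} ms")
```

Here the average comes out at 220 ms while the p99 over all 1,100 requests is 200 ms. The direction and size of the error change with the data, which is exactly why the averaged number cannot be trusted.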
When Prometheus scrapes your histograms and your Grafana panel computes histogram_quantile(0.99, ...) across multiple instances, it is doing something mathematically valid — but only because Prometheus histograms aggregate bucket counts, not pre-computed percentiles. If you are instead aggregating p99 values that were already computed upstream (say, in a StatsD-style pipeline that emits pre-aggregated percentiles), you are computing an approximation of an approximation, and the error compounds.
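To see why bucket counts merge safely, here is a simplified sketch of the linear interpolation that a histogram-based quantile estimate performs. It is not Prometheus's implementation: the function name, the bucket bounds, and the per-instance counts are all assumptions for illustration.

```python
def estimate_quantile(q, buckets):
    """Estimate a quantile from cumulative buckets.

    buckets is a list of (upper_bound_seconds, cumulative_count) pairs sorted
    by bound, ending with a float("inf") bucket that holds the total count.
    """
    total = buckets[-1][1]
    if total == 0:
        return float("nan")
    target = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= target:
            if bound == float("inf"):
                # The quantile is beyond the last finite bound; report that bound.
                return prev_bound
            # Assume requests are spread evenly inside the bucket.
            fraction = (target - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound

BOUNDS = [0.1, 0.25, 0.5, 1.0, float("inf")]
# Hypothetical cumulative counts for a busy instance and a quiet, slow one.
instance_a = [900, 980, 995, 1000, 1000]
instance_b = [50, 70, 90, 100, 100]

# Merging histograms is just adding the counts bucket by bucket.
merged = [a + b for a, b in zip(instance_a, instance_b)]

p99_a = estimate_quantile(0.99, list(zip(BOUNDS, instance_a)))
p99_b = estimate_quantile(0.99, list(zip(BOUNDS, instance_b)))
p99_merged = estimate_quantile(0.99, list(zip(BOUNDS, merged)))
p99_averaged = (p99_a + p99_b) / 2

print(f"merged buckets: {p99_merged:.3f} s, averaged p99s: {p99_averaged:.3f} s")
```

The merged estimate weights each instance by the traffic it actually served. The averaged figure gives the 100-request instance the same weight as the 1,000-request one.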
The practical consequence: teams running multiple application instances often see their p99 dashboard report lower latency than the true fleet-wide tail, because an unweighted average of per-instance p99s lets the healthy instances dilute the one instance that is actually struggling.
The merging problem
Consider a service with two endpoints: a fast read endpoint (p99 = 50 ms) and a slow write endpoint (p99 = 800 ms). If you merge all requests into a single p99 metric, you get a number that depends entirely on your read/write ratio. At 95% reads, your aggregate p99 might report 80 ms — which tells you nothing meaningful about how your write endpoint is behaving.
This is widespread. Teams track a single service-level p99 without segmenting by endpoint, operation type, or customer tier. When the write endpoint degrades to 2,000 ms, the aggregate p99 moves from 80 ms to 120 ms, still green on most dashboards. Users filing tickets about write timeouts are experiencing a real problem that the metric has structurally hidden.
The fix is obvious once you see it: always segment latency by the dimensions that matter. For most services, that means at minimum: endpoint or operation, response status (success vs error), and customer tier if you have multiple tiers.
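A minimal sketch of the masking effect and the fix, assuming requests arrive as (endpoint, latency) pairs. The traffic mix is invented for illustration: 950 reads, 40 normal writes, and 10 slow writes whose latency degrades from 800 ms to 2,000 ms.

```python
import math
from collections import defaultdict

def p99(samples):
    """Nearest-rank p99 of a non-empty list of latencies."""
    ordered = sorted(samples)
    return ordered[math.ceil(99 * len(ordered) / 100) - 1]

def p99_by_endpoint(requests):
    """Group (endpoint, latency_ms) pairs by endpoint and compute p99 for each."""
    groups = defaultdict(list)
    for endpoint, latency_ms in requests:
        groups[endpoint].append(latency_ms)
    return {endpoint: p99(values) for endpoint, values in groups.items()}

def traffic(slow_write_ms):
    """Hypothetical minute of traffic: 950 reads, 40 normal writes, 10 slow writes."""
    reads = [("read", 10 + i % 40) for i in range(950)]
    writes = [("write", 70)] * 40 + [("write", slow_write_ms)] * 10
    return reads + writes

for slow_write_ms in (800, 2000):
    requests = traffic(slow_write_ms)
    aggregate = p99([latency for _, latency in requests])
    print(f"slow writes at {slow_write_ms} ms: aggregate p99 = {aggregate} ms, "
          f"per endpoint = {p99_by_endpoint(requests)}")
```

In this constructed case the aggregate p99 stays at 70 ms in both runs, while the write endpoint's own p99 goes from 800 ms to 2,000 ms. Real traffic will not be this clean, but the mechanism is the same: the slow requests sit entirely above the aggregate cut-off.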
The window boundary problem
A request that starts at 11:59:58 and completes at 12:00:03 takes five seconds and straddles two one-minute windows. Whether your monitoring system assigns it to the 11:59 bucket or the 12:00 bucket, it lands there as one sample among a thousand: far above the p99 cut-off, and so barely able to move it.
This is not a theoretical edge case. Long-running requests (the ones most likely to represent real user pain) are systematically under-weighted in time-window percentile calculations. The p99 of any given minute is typically computed from requests that completed within that minute, so a slow request is invisible for as long as it is still in flight, and the slower it is, the more windows pass before it is counted.
Some observability systems handle this correctly by recording completion time with the full duration. Many do not. It is worth checking your specific stack's behaviour.
What percentiles hide: the bimodal distribution
Bimodal latency distributions are common in systems with a fast path and a slow path. An authentication service might complete most requests in 10–30 ms (cache hit) and a subset in 400–600 ms (cache miss, database lookup). The p99 might sit at 550 ms, which looks like a single population behaving consistently slowly.
A histogram tells a completely different story: two distinct peaks, with a valley between them. The p99 number obscures that 80% of requests are performing excellently and 20% are hitting a specific code path that could be addressed with targeted caching.
If you have access to histograms (and in Prometheus-based stacks you typically do), spend time looking at the full distribution, not just the 50th, 95th, and 99th percentile. Grafana's heatmap panel, histogram_quantile with multiple quantile values, or raw data exported to a spreadsheet for a specific time window all work.
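A rough sketch of the difference between the two views, using synthetic data shaped like the authentication example: 800 cache hits between 10 and 30 ms and 200 cache misses between 400 and 599 ms. The 100 ms bucket width and the text rendering are arbitrary choices for illustration.

```python
import math
from collections import Counter

def percentile(samples, q):
    """Nearest-rank percentile of a non-empty list."""
    ordered = sorted(samples)
    return ordered[math.ceil(q * len(ordered) / 100) - 1]

def bucket_counts(samples, width_ms):
    """Count samples per fixed-width bucket, keyed by the bucket's lower bound."""
    return Counter((sample // width_ms) * width_ms for sample in samples)

# Hypothetical bimodal traffic: a fast path and a slow path.
fast_path = [10 + i % 21 for i in range(800)]   # 10..30 ms
slow_path = [400 + i for i in range(200)]       # 400..599 ms
latencies_ms = fast_path + slow_path

for q in (50, 95, 99):
    print(f"p{q} = {percentile(latencies_ms, q)} ms")

counts = bucket_counts(latencies_ms, 100)
for lower in range(0, 600, 100):
    bar = "#" * (counts[lower] // 20)
    print(f"{lower:>3}-{lower + 99:<3} ms | {counts[lower]:>4} {bar}")
```

The three percentiles read as a fast median and a slow tail. The bucket counts show what is really going on: two separate populations with nothing at all between 100 and 399 ms.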
The practical alternative: Apdex and error budgets
Apdex (Application Performance Index) is an old idea — it dates to 2007 and is published as an open standard — but it often outperforms raw percentiles for operational decision-making.
The formula: you define a "satisfactory" threshold T (say, 200 ms) and a "tolerating" threshold of 4T (800 ms). Each request is either Satisfied (completed in ≤ T), Tolerating (completed in T–4T), or Frustrated (took longer than 4T or errored). Apdex is computed as:
(Satisfied + 0.5 × Tolerating) / Total
This gives you a single number from 0 to 1 that directly represents user experience quality. A score above 0.94 is typically "Excellent"; below 0.7 is "Poor". Unlike p99, it aggregates cleanly across time windows and across service instances: the Satisfied, Tolerating, and Frustrated counts are additive, so you sum them and recompute the score. Unlike p99, it naturally captures both latency and errors in one signal.
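The formula can be sketched as a small counter type. The class name, the method names, and the sample counts are assumptions for illustration; the classification rules follow the thresholds described above with T = 200 ms.

```python
from dataclasses import dataclass

@dataclass
class ApdexCounts:
    satisfied: int = 0
    tolerating: int = 0
    frustrated: int = 0

    def record(self, latency_ms, is_error, t_ms):
        """Classify one request against threshold T (t_ms)."""
        if is_error or latency_ms > 4 * t_ms:
            self.frustrated += 1
        elif latency_ms <= t_ms:
            self.satisfied += 1
        else:
            self.tolerating += 1

    def __add__(self, other):
        """Merge two windows or instances by adding their counts."""
        return ApdexCounts(
            self.satisfied + other.satisfied,
            self.tolerating + other.tolerating,
            self.frustrated + other.frustrated,
        )

    def score(self):
        total = self.satisfied + self.tolerating + self.frustrated
        if total == 0:
            raise ValueError("no requests recorded")
        return (self.satisfied + 0.5 * self.tolerating) / total

# Classifying individual requests with T = 200 ms.
sample = ApdexCounts()
for latency_ms, is_error in [(120, False), (450, False), (900, False), (90, True)]:
    sample.record(latency_ms, is_error, t_ms=200)

# Hypothetical per-window counts: a busy healthy window and a quiet bad one.
window_1 = ApdexCounts(satisfied=900, tolerating=80, frustrated=20)
window_2 = ApdexCounts(satisfied=50, tolerating=30, frustrated=20)
combined = window_1 + window_2

print(f"window 1: {window_1.score():.3f}, window 2: {window_2.score():.3f}")
print(f"combined from summed counts: {combined.score():.3f}")
```

The combined score is recomputed from the summed counts, not taken as the mean of the two window scores, so the busy window carries the weight its traffic deserves.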
The limitation of Apdex is that it requires you to choose T thoughtfully. If you set T too high, the score flatters you. Too low, and you are always red. But this forcing function — making a deliberate business decision about what "acceptable" means — is itself valuable. Most engineering teams have never had that conversation explicitly.
Error budgets take this further. Instead of tracking p99 and reacting when it spikes, you define a service level objective (say, "99.5% of requests complete in under 300 ms within any 28-day rolling window"), compute your remaining error budget in real time, and burn down the budget with every slow request. Teams consuming their error budget too quickly know they need to stop shipping new features and focus on reliability.
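As a back-of-the-envelope sketch, the remaining budget for an SLO like the one above is simple arithmetic on two counters. The function name and the request volumes are hypothetical; the 99.5% target comes from the example SLO in the text.

```python
def error_budget(total_requests, bad_requests, slo_target):
    """Return (allowed_bad, remaining, consumed_fraction) for one SLO window.

    slo_target is the fraction of requests that must be good, e.g. 0.995.
    A request is "bad" if it was slower than the SLO threshold or failed.
    """
    allowed_bad = total_requests * (1 - slo_target)
    if allowed_bad == 0:
        raise ValueError("no budget: check total_requests and slo_target")
    remaining = allowed_bad - bad_requests
    return allowed_bad, remaining, bad_requests / allowed_bad

# Hypothetical 28-day window: 2,000,000 requests, 6,000 slower than 300 ms.
allowed_bad, remaining, consumed = error_budget(2_000_000, 6_000, 0.995)

print(f"budget: {allowed_bad:.0f} slow requests")
print(f"remaining: {remaining:.0f} ({1 - consumed:.0%} of budget left)")
```

A negative remaining value means the SLO has already been missed for the window.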
What to actually instrument
Here is a concrete instrumentation checklist for a typical backend service:
Per endpoint / operation:
- Request rate (requests per second)
- Error rate (percentage of 5xx or equivalent)
- Latency histogram with sufficient bucket resolution (do not use the default Prometheus buckets; set them based on your actual distribution)
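Histogram bucket guidance: If your typical p50 is 30 ms, your buckets should cover 5 ms, 10 ms, 20 ms, 50 ms, 100 ms, 200 ms, 500 ms, 1000 ms, 2000 ms, 5000 ms. The default Prometheus histogram buckets (.005, .01, .025, .05, .1, .25, .5, 1, 2.5, 5, 10, in seconds) are designed for HTTP services with sub-second expectations. They are inadequate for services with p50 latencies above 100 ms: above that point each bucket spans a factor of two or more, so an estimated p99 can be off by hundreds of milliseconds. For those services, place most of your boundaries in the range where your requests actually land.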
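One way to sanity-check a bucket layout is to ask how wide the bucket is around the latencies you care about, since a histogram-based estimate cannot be more precise than that. The sketch below uses the default bounds quoted above; the custom layout and the helper name are assumptions for a service whose tail sits between 500 ms and 1 s.

```python
from bisect import bisect_left

DEFAULT_BUCKETS_S = [0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10]
# Hypothetical layout for a slower service: dense between 100 ms and 1 s.
CUSTOM_BUCKETS_S = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 1, 1.5, 2, 5]

def containing_bucket(latency_s, upper_bounds):
    """Return the (lower, upper] bounds of the bucket a latency falls into."""
    index = bisect_left(upper_bounds, latency_s)
    lower = upper_bounds[index - 1] if index > 0 else 0.0
    upper = upper_bounds[index] if index < len(upper_bounds) else float("inf")
    return lower, upper

true_p99_s = 0.65
for name, bounds in (("default", DEFAULT_BUCKETS_S), ("custom", CUSTOM_BUCKETS_S)):
    lower, upper = containing_bucket(true_p99_s, bounds)
    print(f"{name}: a {true_p99_s} s request is only known to lie in ({lower}, {upper}] s")
```

With the default bounds a 650 ms request is only known to be somewhere between 500 ms and 1 s. With the denser layout the uncertainty shrinks to 100 ms.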
System-wide:
- An Apdex score with a T value your team has agreed represents "good enough for users"
- A rolling error budget burn rate. If you are burning budget 3× faster than the budget allows, that is an alert worth waking someone up for.
What to avoid:
- Aggregate p99 across all endpoints combined
- Pre-aggregated percentile metrics from StatsD-style pipelines if you need to merge across instances
- Monitoring dashboards that only show min/max/average (the mean of a bimodal distribution is meaningless)
Real numbers from real systems
In 2019, the Netflix engineering team published an analysis showing that their p99.9 response times were 6–8× higher than their p99 response times for certain services. For a service reporting p99 = 100 ms, the 1-in-1,000 user was experiencing 600–800 ms. At Netflix's scale, that "1 in 1,000" represented tens of thousands of users per day.
The pattern appears at smaller scales too. A 2022 analysis of latency distributions across open-source observability data (published by the Honeycomb team) found that the ratio of p99.9 to p99 was above 4× for roughly 60% of services examined. If your system serves 10,000 requests per minute, the traffic beyond p99.9 amounts to 10 requests per minute experiencing serious latency, or about 14,400 requests per day.
Whether that matters depends on your service. For a checkout flow, 14,400 slow requests per day is a significant business problem. For an internal batch processing job, it may be acceptable. But you should be making that decision consciously, not missing it because p99 looks fine.
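The arithmetic behind that daily figure is worth keeping as a helper, since it is the quickest way to turn a percentile into a volume of affected requests. The function name is an assumption; the inputs are the request rate and percentile from the paragraph above.

```python
def requests_beyond_percentile_per_day(requests_per_minute, percentile):
    """Number of requests per day that are slower than the given percentile."""
    fraction_beyond = (100 - percentile) / 100
    return requests_per_minute * fraction_beyond * 60 * 24

beyond_p999 = requests_beyond_percentile_per_day(10_000, 99.9)
beyond_p99 = requests_beyond_percentile_per_day(10_000, 99)

print(f"slower than p99.9: about {round(beyond_p999):,} requests per day")
print(f"slower than p99:   about {round(beyond_p99):,} requests per day")
```

The same service has roughly ten times as many requests beyond p99 as beyond p99.9, which is a useful reminder that a green p99 still leaves a large absolute number of requests unexamined.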
The latency metric that actually tells the truth
If you can make one change today: switch your primary latency alert from p99 to an error budget burn rate based on a defined SLO.
Define what "good" means (e.g., "95% of requests complete in under 500 ms"). Compute that percentage continuously over a rolling 24-hour window. Alert when you are on track to exhaust your monthly budget within 24 hours.
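A hedged sketch of that alert condition follows. It assumes a 30-day (720-hour) budget window, the 95% target from the example, and counters for the last hour of traffic; the function names, the remaining-budget figure, and the request counts are all hypothetical.

```python
def burn_rate(bad_requests, total_requests, slo_target):
    """How many times faster than sustainable the budget is being spent.

    A burn rate of 1.0 uses up exactly the whole budget over the SLO window.
    """
    if total_requests == 0:
        return 0.0
    return (bad_requests / total_requests) / (1 - slo_target)

def hours_until_exhausted(rate, remaining_fraction, window_hours):
    """Project when the remaining budget runs out at the current burn rate."""
    if rate <= 0:
        return float("inf")
    return remaining_fraction * window_hours / rate

SLO_TARGET = 0.95      # "95% of requests complete in under 500 ms"
WINDOW_HOURS = 30 * 24 # assumed monthly window
REMAINING = 0.25       # hypothetical: a quarter of the budget is left

# Hypothetical last hour: 10,000 requests, 5,000 slower than 500 ms.
rate = burn_rate(5_000, 10_000, SLO_TARGET)
hours_left = hours_until_exhausted(rate, REMAINING, WINDOW_HOURS)
should_page = hours_left <= 24

print(f"burn rate {rate:.1f}x, budget exhausted in {hours_left:.0f} h, page: {should_page}")
```

One property worth noticing: with a 95% target the burn rate can never exceed 20×, so a completely fresh 30-day budget takes at least 36 hours to exhaust. The 24-hour condition therefore only fires once part of the budget is already spent, or with a tighter target.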
This metric is honest because it is directly tied to a user experience definition you chose deliberately. It is stable because it smooths over single-minute spikes that percentile metrics amplify. And it forces a conversation between engineering, product, and leadership about what the service is supposed to deliver — a conversation most teams have never had explicitly.
p99 is not useless. It is a reasonable proxy when used carefully, with proper histogram aggregation, segmented by endpoint, and understood as a snapshot of tail behaviour in a specific window. Used carelessly, averaged across instances, merged across endpoints, read off a pre-aggregated StatsD pipeline — it is actively misleading.
Your dashboard says p99 is 240 ms. That number might be right. It might be wrong. The only way to know is to understand exactly how it was computed.
FAQ
Why does my p99 look fine but users still complain about slowness?
The most common causes are endpoint aggregation (slow endpoints masked by fast ones), the window boundary problem (very slow requests spanning minute buckets), or pre-aggregated percentile averaging across instances. Start by segmenting your latency metric by endpoint and comparing the per-endpoint p99 with your aggregate.
Should I use p99, p99.9, or p99.99?
It depends on what guarantees you want to make and at what scale. At 1,000 requests per minute, p99.9 represents one request per minute. At 100,000 requests per minute, p99.9 represents 100 requests per minute, which may be many real users. A practical starting point: use p99 for alerting (sensitive enough to catch most problems, stable enough to avoid alert fatigue), and track p99.9 as an investigation tool when something feels wrong.
Is Apdex used in practice or is it old-fashioned?
Both. Apdex is less common than it deserves to be in product engineering, but it remains a standard metric for assessing user-facing performance in many SRE organisations. Google's Site Reliability Engineering book uses a structurally similar concept (SLIs and SLOs). If the name "Apdex" feels dated, track a "latency SLO compliance rate" instead: it is the same maths with the half-credit Tolerating band removed.
My monitoring vendor already shows me p99. What do I need to change?
First, check whether your vendor is computing p99 from histograms or raw events (correct) or averaging pre-computed percentiles (incorrect). Datadog (when you use distribution metrics), Honeycomb, and Grafana with Prometheus histograms all compute from the underlying data. StatsD with pre-aggregation does not. Second, ensure you can segment by endpoint, not just service-wide. Third, look at whether your bucket resolution matches your actual latency distribution.