Your on-call rotation punishes the engineers who care most
Equal paging counts feel fair. They measure the wrong thing.
The rotation schedule looks balanced. Every engineer gets the same number of on-call weeks per quarter. The incident dashboard shows a tidy distribution. The engineering manager presents the numbers at the next planning meeting and calls it equitable.
It is not equitable. It is equal. These are different things.
What your on-call rotation metric actually measures
A rotation schedule tracks time on-call per engineer, incidents routed per engineer, and — if the team is diligent — incidents closed per engineer. It does not track: how long the engineer was actually awake, whether they understood the problem or simply restarted the service, what they did the following morning, or how much of the incident resolution required genuine system knowledge versus following a runbook.
These omissions are not details. They are where the burden actually lives.
The knowledge gap you are not routing around
When an alert fires at 2am, the paged engineer either built part of the affected system and resolves the incident in eleven minutes, or is reading the runbook for the first time and escalates within twenty because they are not sure what the alert is telling them. The rotation log records one disrupted night in both cases. The human experience is completely different.
High-knowledge engineers do not spiral during incidents. They wake up, read the alert, and already know the shape of the problem. They have seen this failure before, or something close to it. Lower-knowledge engineers start the incident with a set of open questions and work through them in real time, under time pressure, in the middle of the night. That is not a criticism of their skill. It is a description of what system-specific knowledge does and does not transfer. You cannot rotate institutional knowledge the same way you rotate names on a schedule.
The gap compounds over time. The engineer with deep context gets paged, resolves quickly, and goes back to sleep. The engineer without it gets paged, stays awake longer, escalates more, and — if the incident is complex — ends up involving the high-knowledge engineer anyway. At which point both engineers have been on-call, and only one of them appears in the rotation count.
Three things that never appear in the dashboard
Resolution quality is the first. Some engineers resolve incidents by restarting the service. Others resolve them by identifying the root cause, documenting it, and writing a note that the next engineer can actually use. The first approach costs twenty minutes and defers the next three incidents. The second costs two hours and prevents them. Both appear identically in the incident log: closed.
Follow-up work is the second. The engineers who care most open tickets after incidents. They fix the runbook. They add the alert label that would have made the 3am page more specific. They file the issue about the flaky external dependency. None of this is tracked in on-call rotation metrics. If the team makes resourcing decisions based on those metrics, this work is invisible — which means the people doing it are effectively doing untracked labour on top of their rotation count.
Responsiveness variation is the third. A twenty-second acknowledgment and a four-minute one represent meaningfully different states of mind. A resolution at 2:18am and one at 3:52am represent meaningfully different amounts of lost sleep. Engineers with high system familiarity tend to resolve faster. Engineers with less context are still paged at the same rate. Equal paging counts do not capture this.
Alert noise is not distributed equally either
Noisy alerting does not affect all engineers equally. It affects the engineers who actually look at the alerts.
If your on-call rotation produces 300 alerts per week and 180 of them fire between midnight and 6am, the engineer who clicks acknowledge and returns to sleep is in a different rotation than the one who reads the alert, checks whether it is a real signal, concludes it probably is not, and then lies awake for thirty minutes anyway. Both engineers are recorded as having been on-call. Only one of them was.
The team almost certainly knows the alerts are noisy. There is a backlog of alert improvements that has been there for several months. The engineers generating items in that backlog are the ones who actually read the alerts — which is to say, the same engineers who will be on-call again before the backlog is acted on.
What a different model looks like in practice
Three changes, in rough order of impact.
First: stop using rotation count as the primary fairness metric. Add at minimum two more signals: P95 resolution time per engineer across the last quarter, and the ratio of noise pages to actionable pages per engineer per rotation. These will show you quickly whether the rotation is actually balanced or just equally scheduled.
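As a rough illustration, both signals can be computed from a flat export of pages. This is a minimal sketch, not a reference implementation: the field names `engineer`, `minutes_to_resolve`, and `actionable` are assumptions about what your paging tool can export, and the percentile uses the simple nearest-rank method.

```python
from collections import defaultdict


def p95(values):
    """Nearest-rank 95th percentile of a non-empty list of numbers."""
    ordered = sorted(values)
    # Integer form of ceil(0.95 * n), giving a 1-based rank.
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]


def fairness_signals(pages):
    """Summarise pages per engineer.

    `pages` is an iterable of dicts with the hypothetical keys
    'engineer', 'minutes_to_resolve' and 'actionable' (bool).
    """
    durations = defaultdict(list)   # resolution times of actionable pages
    noise = defaultdict(int)        # pages judged not actionable
    actionable = defaultdict(int)   # pages that needed real work

    for page in pages:
        engineer = page["engineer"]
        if page["actionable"]:
            actionable[engineer] += 1
            durations[engineer].append(page["minutes_to_resolve"])
        else:
            noise[engineer] += 1

    signals = {}
    for engineer in set(noise) | set(actionable):
        signals[engineer] = {
            "pages": noise[engineer] + actionable[engineer],
            # None when there is nothing to measure, rather than a fake zero.
            "p95_minutes": p95(durations[engineer]) if durations[engineer] else None,
            "noise_ratio": (noise[engineer] / actionable[engineer]
                            if actionable[engineer] else None),
        }
    return signals
```

Two engineers with the same `pages` count can come out with very different `p95_minutes` and `noise_ratio` values, which is the difference a rotation count hides.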
Second: route alerts based on system familiarity where it matters. If four engineers have deep knowledge of the billing system and ten do not, billing-related alerts should go to those four more often until the knowledge has been spread deliberately. This is not a permanent arrangement; it is an acknowledgment that your on-call policy should account for where the knowledge actually sits, not assume it is evenly distributed.
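One way to picture familiarity-based routing is a small function that decides who gets paged first. The sketch below is an assumption-heavy illustration: `pick_responder`, the `experts_by_system` map, the `high_stakes` set, and the tie-breaking rule (page the expert with the fewest recent pages) are all hypothetical and not taken from any paging product.

```python
def pick_responder(alert_system, on_call, experts_by_system, high_stakes, recent_pages):
    """Return the engineer who should be paged first for an alert.

    alert_system      -- name of the system the alert belongs to
    on_call           -- engineer currently holding the rotation
    experts_by_system -- dict: system name -> list of engineers with deep context
    high_stakes       -- set of systems where familiarity routing applies
    recent_pages      -- dict: engineer -> pages received recently
    """
    experts = experts_by_system.get(alert_system, [])

    # Default path: the scheduled engineer takes the page.
    if alert_system not in high_stakes or not experts:
        return on_call

    # The scheduled engineer already has the context.
    if on_call in experts:
        return on_call

    # Otherwise page the least-loaded expert, so the routing does not
    # pile every page onto one person.
    return min(experts, key=lambda engineer: recent_pages.get(engineer, 0))
```

Removing a system from `high_stakes` once more engineers have learned it is what keeps the arrangement temporary.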
Third: make follow-up work visible. If an engineer spends three hours the morning after an incident improving the runbook, filing root-cause tickets, and adding alert context, those three hours should be tracked somewhere that influences how resourcing decisions are made. If they are not, you are hiding the true cost of on-call from the people responsible for staffing it.
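A minimal sketch of what "tracked somewhere" could mean: a per-rotation record that stores follow-up hours next to disrupted hours, and a report sorted by the combined figure. The record fields and the choice to add the two numbers unweighted are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class RotationRecord:
    engineer: str
    weeks_on_call: int
    disrupted_hours: float   # time spent awake on pages outside working hours
    follow_up_hours: float   # runbook fixes, root-cause tickets, alert context


def burden_hours(record):
    """Total on-call cost in hours, counting follow-up work."""
    return record.disrupted_hours + record.follow_up_hours


def burden_report(records):
    """Return (engineer, weeks_on_call, burden_hours) rows, heaviest first."""
    rows = [(r.engineer, r.weeks_on_call, burden_hours(r)) for r in records]
    return sorted(rows, key=lambda row: row[2], reverse=True)
```

With this shape, two engineers who each served three weeks can show up with very different totals, and the staffing conversation starts from the second number instead of the first.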
| Dimension | Equal rotation | Fair rotation |
|---|---|---|
| What is measured | Weeks on-call per engineer | Disrupted hours, resolution speed, follow-up work |
| Alert routing | Same queue for all engineers | Routed by system familiarity where stakes are high |
| Resolution quality | Not tracked; all closes look equal | Visible through runbook updates, ticket quality |
| Knowledge assumption | Engineers are interchangeable | System knowledge is unevenly distributed and matters |
| Follow-up work | Not counted as on-call burden | Counted; influences rotation frequency |
What the burnout signal is actually telling you
When your most reliable engineers say they are exhausted and your dashboard shows equal rotation, the instinct is to look elsewhere. Sprint pace, maybe. A difficult cross-functional dynamic. Some personal situation. Equal numbers on a dashboard are reassuring. They suggest the system is working.
Check the P95 resolution time per engineer over the last six months. Look at who opens post-incident tickets. Count who has generated items in the alert improvement backlog. In most teams, these three checks point to the same cluster of people.
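The overlap between those three checks can be made explicit with a few lines of code. This is a sketch under stated assumptions: you supply the three groups yourself (including your own rule for which P95 values count as outliers), and the `min_signals` threshold is a hypothetical knob, not something the checks prescribe.

```python
from collections import Counter


def overloaded_cluster(p95_outliers, ticket_openers, backlog_authors, min_signals=2):
    """Return engineers who appear in at least `min_signals` of the three groups.

    Each argument is an iterable of engineer names:
    p95_outliers    -- engineers whose P95 resolution time stands out
    ticket_openers  -- engineers who open post-incident tickets
    backlog_authors -- engineers who added items to the alert backlog
    """
    counts = Counter()
    for group in (p95_outliers, ticket_openers, backlog_authors):
        counts.update(set(group))  # set() so duplicates count once per group
    return sorted(name for name, seen in counts.items() if seen >= min_signals)
```

Setting `min_signals=3` narrows the result to people who show up in every check.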
Those engineers are not complaining because they got unlucky with incident timing. They are carrying a structural load that the rotation metric cannot see.
“Equal rotation is a scheduling decision. Fair rotation is a system design problem.”
The policy is straightforward to write and implement. The design requires acknowledging that engineers are not interchangeable inputs in a rotation queue, that system knowledge concentrates rather than distributes evenly, and that the engineers most likely to flag the problem are the ones least likely to be taken seriously when the dashboard says everything is balanced.
Fix the metric first. The rotation will follow.