Write a postmortem that someone outside your team will actually read
Most postmortems fail their readers not because the analysis is wrong, but because the document assumes one audience when it has three.
A 500-line Confluence page. Seven bullet points under Root Cause Analysis. Eleven action items, eight of them assigned to Team. One month later, the page has four emoji reactions and the monitoring alert that triggered the incident still fires every Tuesday at 3 AM.
This is the default postmortem. It documents everything and changes nothing.
The failure is not in the analysis. Most teams that run structured incident reviews understand the technical chain of events by the time the postmortem is written. The failure is in the document design.
The forensic format is written for the people who were in the room
Most postmortem templates evolved from SRE practice, where the goal is institutional memory: every step of the incident recorded so the on-call team can reconstruct what happened if it recurs. That is a legitimate goal, but it produces a document that is nearly impossible to read unless you were part of the incident.
The engineers who built the system know what 'the replication lag spiked at 14:32' means. Their manager does not. The on-call engineer who joins six months from now does not. The customer success manager trying to explain the outage to a frustrated enterprise customer definitely does not.
Most postmortem guides do not acknowledge this. They assume one reader, and it is the reader who was already there.
You have three readers, not one
Before writing the first sentence, name who will actually read the document.
The incident team: the people who were on the call or in the Slack thread. They need the forensic detail: exact timestamps, specific queries, the alerting gap. They are also the most likely to read the whole document.
Leadership and stakeholders: your VP of Engineering, your CEO if the outage was significant, the customer success team. They need to know what happened to users, for how long, and why the situation is now different. Not 14:32 and replication lag. Customer impact and remediation.
The future engineer: the person joining in eight months who hits the same class of problem. This is the reader most postmortem templates neglect. Their question is: what decision was made, and why, that led here?
These three readers want different things. A single 500-line document serves none of them well.
The two-document approach, in one document
The solution is not to write three separate documents. It is to structure one document so each reader can find what they need quickly and stop reading when they have it.
Put the executive brief at the top. Write it last. Cap it at 200 words. This is the only section leadership needs to read. Below the brief: the forensic record, with the full technical timeline, root cause, and contributing factors. At the bottom, not scattered throughout: action items, each with an owner and a due date.
The structural separation does two things. It tells each audience where to look. And it forces the writer to think about customer impact separately from technical explanation — which produces better prose in both sections.
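The ordering described above can be checked mechanically before a postmortem is published. The following is a minimal sketch, not a prescribed tool: the section titles in `REQUIRED_ORDER` and the function names are hypothetical, and it assumes each heading sits on its own line in a plain-text export of the page.

```python
# Hypothetical section titles: the text prescribes the order, not the names.
REQUIRED_ORDER = ["Executive brief", "Forensic record", "Action items"]


def section_order(text: str) -> list[str]:
    """Return the known headings in the order they first appear.

    Assumes each heading is a line of its own that matches an entry in
    REQUIRED_ORDER exactly, ignoring surrounding whitespace.
    """
    seen: list[str] = []
    for line in text.splitlines():
        heading = line.strip()
        if heading in REQUIRED_ORDER and heading not in seen:
            seen.append(heading)
    return seen


def check_structure(text: str) -> list[str]:
    """Return a list of problems; an empty list means the layout passes."""
    seen = section_order(text)
    problems = [
        f"missing section: {name}" for name in REQUIRED_ORDER if name not in seen
    ]
    # Compare the order found with the prescribed order of the same sections.
    expected = [name for name in REQUIRED_ORDER if name in seen]
    if seen != expected:
        problems.append("sections out of order: " + " > ".join(seen))
    return problems


# A draft that opens with action items and has no forensic record.
draft = """Action items
Fix the alert.
Executive brief
Orders failed for 47 minutes.
"""
print(check_structure(draft))
```

A check like this only guards the skeleton. Whether the brief is readable is still a question for a human reviewer.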
Rewriting the postmortem: before and after
The prose problem in most postmortems is chronological drift: the document follows the sequence of the incident rather than the sequence of understanding.
| Section | Forensic version (written for the incident team) | Readable version (written for leadership and the future engineer) |
|---|---|---|
| Opening | At 14:31 UTC, the primary database experienced elevated replication lag, triggering an alert at 14:34 which was acknowledged by the on-call engineer. | For 47 minutes on 12 May, approximately 3,200 users received errors when placing orders. |
| Root cause | TRANSACTION_ISOLATION was set to READ_UNCOMMITTED in the replica config, causing dirty reads under concurrent write load. | A database configuration change from the previous week behaved correctly under normal load but failed under the write pattern produced by a marketing campaign. |
| Action item | Improve monitoring for database issues. | Add alert for write queue depth > 10,000 on the order-creation topic, paged to on-call within 2 minutes. Owner: one named SRE engineer, due 30 May. |
The single most useful rewrite: lead with customer impact, not the technical event. 'Our database had elevated replication lag' is the technical event. 'Order placement was unavailable for 3,200 users for 47 minutes' is the customer impact. Start with the second. Explain the first below it.
The forensic detail is still in the document. It lives in the timeline section, where the incident team will find it. It does not need to appear in the opening sentences, which are the ones every reader actually reads.
Action items: where postmortems go to die
Action items that live only in the postmortem document do not get done. The fix is mechanical: create the Jira, Linear, or GitHub issues the day the postmortem is published, and link to them from the document. The postmortem is the narrative record; the tracker is where work lives. Keep them connected.
Three other constraints that matter:
- Be specific. 'Improve monitoring' is not an action item. 'Add an alert for write queue depth above 10,000 messages on the order-creation topic, paged to on-call within two minutes' is an action item.
- One named owner per item. Not Team. Not DevOps. A person who is aware they own it.
- Cap the list. A postmortem with eleven action items tends to complete few of them, if any. Three specific, owned items with dates will usually outperform eleven vague ones.
When engineers see that postmortems produce action items that never get done, they learn the postmortem is a ritual to survive, not something to engage with honestly. The next postmortem gets shallower. The monitoring gap that caused the original incident stays unfixed. The team responds slightly slower in the next incident because the institutional learning that should have happened did not.
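The constraints in this section (a tracker link, a specific description, one named owner, a short list) are simple enough to lint. The sketch below is illustrative only. `ActionItem`, `lint_action_items`, the generic-owner word list, and the cap of three are assumptions drawn from the advice above, and the "contains a number" test is a crude stand-in for specificity that will miss some good items and pass some bad ones.

```python
import re
from dataclasses import dataclass
from datetime import date
from typing import Optional

MAX_ITEMS = 3  # assumed cap, following the "three specific, owned items" advice
# Hypothetical list of owner names that signal "nobody in particular".
GENERIC_OWNER_WORDS = {"team", "devops", "tbd", "everyone"}


@dataclass
class ActionItem:
    description: str
    owner: str
    due: Optional[date] = None
    tracker_url: Optional[str] = None  # link to the Jira, Linear, or GitHub issue


def lint_action_items(items: list[ActionItem]) -> list[str]:
    """Return human-readable problems; an empty list means the items pass."""
    problems = []
    if len(items) > MAX_ITEMS:
        problems.append(f"{len(items)} action items; cap the list at {MAX_ITEMS}")
    for n, item in enumerate(items, start=1):
        # Split the owner into words so "@sre-team" is caught as well as "Team".
        owner_words = set(re.findall(r"[a-z0-9]+", item.owner.lower()))
        if not owner_words or owner_words & GENERIC_OWNER_WORDS:
            problems.append(f"item {n}: owner {item.owner!r} is not a named person")
        if item.due is None:
            problems.append(f"item {n}: no due date")
        if not item.tracker_url:
            problems.append(f"item {n}: no tracker link")
        # Crude heuristic: specific items usually state a threshold or number.
        if not any(ch.isdigit() for ch in item.description):
            problems.append(f"item {n}: no number or threshold in the description")
    return problems


vague = ActionItem("Improve monitoring", "Team")
specific = ActionItem(
    description=(
        "Add an alert for write queue depth above 10,000 messages on the "
        "order-creation topic, paged to on-call within two minutes"
    ),
    owner="Priya N.",  # placeholder name
    due=date(2025, 5, 30),  # placeholder year
    tracker_url="https://tracker.example/OPS-1",  # placeholder URL
)
for problem in lint_action_items([vague, specific]):
    print(problem)
```

Run against the two examples, the vague item fails every per-item check and the specific one passes. The lint cannot tell whether the named owner actually knows they own the item, which is the part that still needs a conversation.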
The 90-second executive read
Write the executive brief after everything else is done. Put it at the top. It should answer four questions in 200 words or fewer.
- What broke: in plain English, with customer-first framing.
- Scope: the number of users affected and for how long.
- Why: the underlying cause, not the proximate one. 'A configuration change that was not validated under campaign-level traffic' is the underlying cause. 'Replication lag' is the proximate one.
- What is different now: one to three specific changes, not aspirational statements.
'We will improve our testing practices' is not a change — it is an aspiration. 'We have added a load test against campaign-level traffic to the deployment checklist, and it runs in CI before every production deploy' is a change.
If you have written this brief and a non-engineer cannot understand what happened to customers from it, rewrite before publishing. The purpose of the executive brief is not to protect the engineering team. It is to give a stakeholder a complete picture in 90 seconds.
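One way to keep the brief honest is to treat the four questions as required fields and count words before publishing. This is a sketch under assumptions: the `ExecutiveBrief` class and its field names are hypothetical, and the sample answers reuse the figures from the example incident in this article.

```python
from dataclasses import dataclass, field

WORD_CAP = 200  # the cap suggested in the text


@dataclass
class ExecutiveBrief:
    what_broke: str
    scope: str
    why: str
    what_is_different: list[str] = field(default_factory=list)

    def word_count(self) -> int:
        """Count words in the answers only, not in the labels."""
        answers = [self.what_broke, self.scope, self.why, *self.what_is_different]
        return sum(len(answer.split()) for answer in answers)

    def problems(self) -> list[str]:
        """Return unmet constraints; an empty list means the brief passes."""
        found = []
        required = [
            ("what broke", self.what_broke),
            ("scope", self.scope),
            ("why", self.why),
        ]
        for label, answer in required:
            if not answer.strip():
                found.append(f"unanswered: {label}")
        if not 1 <= len(self.what_is_different) <= 3:
            found.append("list one to three specific changes")
        if self.word_count() > WORD_CAP:
            found.append(f"brief is {self.word_count()} words; cap is {WORD_CAP}")
        return found

    def render(self) -> str:
        changes = "\n".join(f"- {change}" for change in self.what_is_different)
        return (
            f"What broke: {self.what_broke}\n"
            f"Scope: {self.scope}\n"
            f"Why: {self.why}\n"
            f"What is different now:\n{changes}"
        )


brief = ExecutiveBrief(
    what_broke="Customers received errors when placing orders.",
    scope="Approximately 3,200 users for 47 minutes on 12 May.",
    why="A configuration change was not validated under campaign-level traffic.",
    what_is_different=[
        "A load test against campaign-level traffic runs in CI before every "
        "production deploy.",
    ],
)
print(brief.problems())
print(brief.render())
```

The check confirms that each question has an answer and that the brief fits the cap. It cannot tell an aspiration from a change, so the read-aloud test with a non-engineer still applies.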
One more thing: who writes it
The engineer closest to the incident should not write the postmortem alone. They have too much context — the forensic version is the only version they can see clearly. A second engineer who was present but not leading the response, or an engineering manager, often catches where the document assumes knowledge the reader does not have.
Pair-writing the executive brief specifically is worth the twenty minutes it takes. One person writes a draft, another reads it aloud and flags where they need to stop and ask a question. Every place they stop is a sentence to rewrite.
The postmortem that actually gets read is not longer or more detailed than the forensic one. It is structured for three audiences who encounter it at different times with different questions. Separate the narrative from the forensics. Lead with customer impact. Write the executive brief last and put it first. Create action items as tickets on the day of publication. The analysis your team already does is usually the right analysis — the document is what needs redesigning.