
The Shape of a Good Incident Postmortem


Most engineering organizations write postmortems the same way. A timeline at the top. Root cause halfway down. Action items at the bottom. A bullet list of contributing factors that, on close reading, contributes very little. The document is signed off, filed in a wiki nobody re-reads, and the same incident — or a close cousin of it — happens again six months later. The ritual was performed. The lesson was not learned.

I have been writing and reviewing postmortems for ten years now, across three companies and four sets of conventions. The pattern I keep hitting is that the shape of the document is the shape of the lesson, and most documents have the wrong shape. They optimize for legibility to the executive who skims them once, not for the engineer who has to absorb them. They emphasize the bug that triggered the outage and underweight the systemic conditions that allowed a single bug to take everything down. And they almost always end before the most important question has been asked.

The bad shape: timeline-first

Here is the pattern I see most often. The postmortem leads with a minute-by-minute timeline:

14:02 — alert fires on api-prod-write-latency
14:04 — on-call acknowledges; investigates database
14:11 — escalates to db-team
14:18 — db-team identifies stuck transaction
14:22 — manual rollback issued
14:27 — latency returns to baseline

This timeline is necessary for a particular kind of audience — the person reconstructing the incident weeks later for a regulatory report, or the SRE writing a runbook patch. It is not, however, the right opening. The reader’s first contact with the document is a narrow procedural log; everything that matters about the incident — the surprise, the model failure, the assumption that turned out to be false — is buried under timestamps.

The result is that a reader who is not the original responder gets a recap of what happened without ever encountering what was believed before the incident. Postmortems are interesting precisely when they reveal the gap between belief and reality. A timeline rarely shows that gap; it shows the actions that closed it.

The better shape: belief-first

Here is the shape I have come to prefer:

  1. What we believed before the incident. Three to five sentences. The mental model the team carried into the day: capacity, failure modes, which dependencies could and could not break. Specific. Quotable. Wrong somewhere.
  2. What actually happened. A short narrative — no timestamps yet, just the story. Two to four paragraphs.
  3. Where the belief and the reality diverged. This is the load-bearing section. Name the specific assumption that turned out to be wrong. Quote it from a runbook or a design doc if you can.
  4. Timeline. Now we have earned it. The reader knows the stakes.
  5. Contributing factors. Ranked, not bulleted. The top one matters; the bottom three are filler unless someone defends each one.
  6. What would have prevented this. Not “what did we do” — that’s tactical. What change in the system that allowed this would have prevented it? Code change, process change, alerting change, organizational change.
  7. What we are doing. Owners and dates. Three or fewer items. If you have ten items, you have ten future postmortems.
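
Written down as a skeleton, this shape is easy to seed into tooling. A minimal sketch; the markdown headings and the seeding function are illustrative, not a prescribed format:

# Sketch: stamp out new postmortems in the belief-first shape so the
# writer's first blank to fill is the prior. Headings mirror the list
# above; the markdown format is an assumption, not a requirement.
POSTMORTEM_SKELETON = """\
## 1. What we believed before the incident
## 2. What actually happened
## 3. Where the belief and the reality diverged
## 4. Timeline
## 5. Contributing factors (ranked)
## 6. What would have prevented this
## 7. What we are doing (owners, dates, three items or fewer)
"""

def new_postmortem(title: str) -> str:
    return f"# {title}\n\n{POSTMORTEM_SKELETON}"

The point of scripting this is not automation; it is that the first heading the writer meets is the prior.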

The discipline of the belief-first shape forces the writer to surface the prior — the model the team was operating under. Once the prior is written down, the question of whether the prior was reasonable becomes answerable. Sometimes the prior was reasonable and the incident was a genuine surprise; that’s a different lesson than “we missed the obvious thing.” A timeline-first postmortem cannot make this distinction.

You don’t really understand a system until you have observed how it fails — and the observation has to make it back into the model.

Charity Majors, paraphrased

The two failure modes of postmortems

Once you start writing in this shape, you notice two characteristic failure modes in the postmortems other people write.

Failure mode one: the mechanism is correct but the lesson is shallow. “We had a stuck transaction; we rolled it back; we will add an alert for stuck transactions.” Mechanically accurate, operationally improved, but the system that allowed a single stuck transaction to cause a 25-minute outage is unchanged. Why was the database not isolated? Why was the on-call surprised? Why did the alert come from latency rather than from the transaction itself? The shallow postmortem stops at the first “because” and treats that answer as the cause.

Failure mode two: the lesson is deep but unactionable. “Our culture of optimism around schema changes is the underlying issue.” Maybe true. Almost certainly not something the next reader can act on. Deep lessons need to land somewhere — a checklist, a code review template, a hiring criterion, a quarterly goal. If the deep lesson does not become a concrete change in how the team operates next week, it is decoration, no matter how true it is.

A good postmortem walks the narrow path between these. It identifies a systemic condition, then proposes a change small enough to make this quarter and concrete enough to verify next quarter.

A worked example

Suppose a team runs a payments service that emits an outbox event for every successful charge. Downstream consumers — analytics, fraud, customer email — read from the outbox table via change-data-capture. One day, the outbox-publisher process gets stuck. Events back up. After two hours, somebody notices because the daily revenue dashboard is suspiciously empty. The team restarts the publisher; events flow again; everything looks fine.
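
For concreteness, the write side of such an outbox looks roughly like the sketch below, assuming a SQLite-style DB-API connection; the table names and event type are illustrative:

import json

# Minimal sketch of the transactional outbox write: the charge row and
# its event row commit in one transaction, so a successful charge always
# leaves an event behind for downstream consumers.
def record_charge(conn, charge_id: str, amount_cents: int) -> None:
    with conn:  # commits both inserts together, or neither
        conn.execute(
            "INSERT INTO charges (id, amount_cents) VALUES (?, ?)",
            (charge_id, amount_cents),
        )
        conn.execute(
            "INSERT INTO outbox (event_type, payload) VALUES (?, ?)",
            ("charge.succeeded", json.dumps({"charge_id": charge_id})),
        )

The pattern guarantees that events are produced; it guarantees nothing about whether they are consumed, which is exactly where this incident lives.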

The shallow postmortem says: “Add a metric for outbox-publisher lag; alert if lag > 5 minutes.” This is good and the team should do it. It is not enough.

The deeper question: the team did not notice for two hours. Why? Because there was no integrator-of-last-resort whose job it was to ask “is the data flowing?” The dashboards showed application metrics (the payments service itself was healthy), and the downstream dashboards were owned by other teams. The systemic condition was: no single team owned the end-to-end pipeline health, so when the pipeline broke between teams, nobody noticed. Underneath that is a second condition: the team had never named a deadline. There was no agreed answer to “how long is too long?”, so two hours of silence registered as nothing.

The actionable change is not “add a metric.” It is “designate a pipeline-health owner” — a rotation, a named team, a dashboard that reports the gap between upstream production and downstream consumption. The metric falls out of that ownership. Without the ownership, the metric exists but nobody is on the hook for watching it.

pipeline_health.py:
from dataclasses import dataclass

LAG_THRESHOLD = 1_000  # acceptable hourly drift in events; tune per pipeline

@dataclass
class HealthReport:
    upstream: int
    downstream: int
    lag: int

# Conceptual sketch: a single end-to-end health check
# owned by one team, asserted in one place. `payments`,
# `analytics`, and `page` stand in for real clients and a paging hook.
def assert_pipeline_health() -> HealthReport:
    upstream_count = payments.events_emitted_in(last_hour=True)
    downstream_count = analytics.events_consumed_in(last_hour=True)
    lag = upstream_count - downstream_count
    if lag > LAG_THRESHOLD:
        page("pipeline-owner", reason=f"outbox lag={lag}")
    return HealthReport(upstream=upstream_count,
                        downstream=downstream_count,
                        lag=lag)
Pipeline health belongs to one rotation, with one named dashboard.

The code is trivial. The hard part — the part that the postmortem must produce — is the answer to “who runs this, and who gets paged when it fails?” That answer is organizational, not technical, and a timeline-first postmortem will almost never reach it.

Postmortems as compounding artifacts

There is a second-order benefit to writing postmortems in the belief-first shape: over time, they accumulate into a useful body of organizational knowledge about distributed systems. A pile of timeline-first postmortems is not a knowledge base; it is a chronological archive. Reading ten of them in sequence is exhausting and yields little, because each one starts over with new timestamps and new components.

A pile of belief-first postmortems, by contrast, can be re-read for patterns: which beliefs, across the company, have been wrong most often? If you have twelve postmortems and seven of them turn on a wrong belief about backpressure, you have not had seven incidents — you have had one organizational gap manifesting seven times. The pattern is invisible in timeline-first writing because the timelines do not align. It is obvious in belief-first writing because the priors do.
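
This kind of re-reading can even be partly mechanized. A sketch, assuming a hypothetical convention where each postmortem carries a one-line tag naming its failed prior:

import re
from collections import Counter
from pathlib import Path

# Hypothetical convention: every belief-first postmortem contains a line
# like "prior: backpressure" naming the belief that failed. Counting the
# tags across the archive turns seven incidents into one visible gap.
PRIOR_TAG = re.compile(r"^prior:\s*(.+)$", re.MULTILINE)

def wrong_belief_histogram(archive: str) -> Counter:
    counts: Counter = Counter()
    for doc in Path(archive).glob("*.md"):
        match = PRIOR_TAG.search(doc.read_text())
        if match:
            counts[match.group(1).strip().lower()] += 1
    return counts

Calling wrong_belief_histogram("postmortems/").most_common(3) is a one-line answer to “what does this company keep being wrong about?”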

This is why I push teams to write the prior first, even when it feels embarrassing. The embarrassment is the point. The prior is what was wrong, and writing it out makes the wrongness legible. Once it is legible, it becomes correctable — first in this team’s understanding, then in the next team’s via the documents that survive.

What about blame?

The standard answer is “postmortems are blameless.” This is correct, and it is also insufficient on its own. Blameless means the postmortem does not assign fault to an individual; it does not mean the postmortem refrains from naming what was wrong. A postmortem that is so blameless it cannot identify the false belief is not blameless; it is vague, and vagueness is its own kind of harm because it makes the lesson unreachable.

The right move is to be specific about what was wrong while being generous about why a competent person could have believed it. “The runbook said the database was the bottleneck; based on Q3 load tests, this was true. By Q1 of the next year, the bottleneck had quietly shifted to the message broker, and no one re-ran the load test.” This is blameless and specific. It accuses no individual; it names the precise belief that drifted out of sync with reality and the artifact that should have caught the drift.

A postmortem that hits this register — specific about the gap, generous about the cause, concrete about the change — is rare. When you read one, save it. When you write one, share it. They are how teams get less surprised over time.

A short checklist

If I had to compress all of this into a checklist for the writer of the next postmortem:
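
  1. Does it state, specifically, what the team believed before the incident?
  2. Does it name the exact assumption that diverged from reality, quoting the artifact that carried it?
  3. Are the contributing factors ranked, with the top one defended?
  4. Does it identify the systemic condition that let a single bug take everything down, not just the trigger?
  5. Is the proposed change small enough to make this quarter and concrete enough to verify next quarter?
  6. Does it end with three or fewer action items, each with an owner and a date?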

If the answer to all six is yes, the postmortem is doing its job. If not, it is — like most of the postmortems I have read — a record of an event rather than the seed of a lesson. The first kind is easy to produce and easy to forget. The second kind is harder to write and harder to forget. We need more of the second kind.