Distributed Systems · Northwind Payments (fictional engagement)
An Incident Response Platform for a Mid-Size Payments Org
How we tied a pager, a Confluence runbook graveyard, a postmortem template nobody followed, and ad hoc Slack incident channels into a single platform that engineers actually used during outages.
The challenge
Northwind’s payments engineering org had grown to about 180 engineers across four time zones, and the on-call experience had not kept pace. Pagers fired into one tool, runbooks lived in a Confluence space whose search returned mostly stale drafts, postmortems were written in a Google Doc template nobody loved, and the incident channels in Slack were named after the year and an incrementing integer. By the time a responder had context, the page had often resolved itself — and nobody could tell whether that was good news or a coincidence.
The brief was short: “make on-call less terrible without buying a vendor we’ll regret in two years.”
The approach
We started by shadowing on-call rotations for three weeks before writing a single line of code. The pattern that emerged was that no individual tool was the broken thing. Each was fine in isolation; the handoffs between paging, runbooks, and postmortems were where context got lost.
We built a small platform around three convictions:
- Incidents are first-class. Every page opens an incident record with a stable URL, a Slack channel, and a status timeline — all wired together, not three things a responder has to manually link.
- Runbooks are queries, not wiki pages. Instead of a Confluence space, we indexed runbook fragments by alert name. When an alert fires, the platform surfaces the matching fragment inline in the incident's Slack channel.
- Postmortems start during the incident. The status timeline becomes the postmortem skeleton — written by the responder, not reconstructed from log archaeology three days later.
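The first two convictions can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names `Incident`, `TimelineEntry`, `RUNBOOK_INDEX`, and `open_incident`, along with the URL and channel naming formats, are hypothetical and do not describe the platform's actual schema or API.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional

# Hypothetical index: runbook fragments keyed by alert name.
RUNBOOK_INDEX = {
    "payments-api-5xx-rate": "Check the gateway error budget, then roll back the last deploy.",
}


@dataclass
class TimelineEntry:
    at: datetime
    author: str
    note: str


@dataclass
class Incident:
    incident_id: str
    alert_name: str
    url: str                         # stable link to the incident record
    slack_channel: str               # channel created for this incident
    runbook_fragment: Optional[str]  # None when no fragment matches the alert
    timeline: list = field(default_factory=list)

    def log(self, author: str, note: str, at: Optional[datetime] = None) -> None:
        """Append a status update; these entries later seed the postmortem."""
        self.timeline.append(TimelineEntry(at or datetime.now(timezone.utc), author, note))


def open_incident(incident_id: str, alert_name: str,
                  base_url: str = "https://incidents.example.internal") -> Incident:
    """Create the record, channel name, and runbook lookup in one step."""
    incident = Incident(
        incident_id=incident_id,
        alert_name=alert_name,
        url=f"{base_url}/{incident_id}",
        slack_channel=f"#inc-{incident_id}-{alert_name}",
        runbook_fragment=RUNBOOK_INDEX.get(alert_name),
    )
    incident.log("platform", f"Incident opened for alert {alert_name}")
    return incident
```

The point of the sketch is that the URL, the channel, and the runbook fragment all derive from one record created at page time, so the responder never has to link them by hand.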
The build was three months of focused work. We integrated with the existing pager (PagerDuty) rather than replacing it; the platform was the layer above. Runbooks lived in a Git repo with PR review — the same review hygiene engineers already practiced. Postmortems shipped as Markdown files committed alongside the runbooks they updated.
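As a rough illustration of how a status timeline could become a Markdown postmortem skeleton ready to commit, here is a minimal Python sketch. The function name `render_postmortem_skeleton`, the section headings, and the tuple layout are assumptions made for this example, not the template Northwind used. It also assumes timestamps are timezone-aware.

```python
from datetime import datetime, timezone


def render_postmortem_skeleton(incident_id, alert_name, timeline):
    """Render Markdown from (timestamp, author, note) tuples.

    Timestamps are assumed to be timezone-aware; they are shown in UTC.
    """
    lines = [
        f"# Postmortem: incident {incident_id}",
        "",
        f"Alert: `{alert_name}`",
        "",
        "## Timeline",
        "",
    ]
    # Sort chronologically so late-logged notes still land in order.
    for at, author, note in sorted(timeline, key=lambda entry: entry[0]):
        stamp = at.astimezone(timezone.utc).strftime("%Y-%m-%d %H:%M")
        lines.append(f"- {stamp} UTC ({author}): {note}")
    # Sections the responder fills in after the incident closes.
    lines += [
        "",
        "## Root cause",
        "",
        "TODO",
        "",
        "## Action items",
        "",
        "- [ ] TODO: runbook update",
    ]
    return "\n".join(lines) + "\n"
```

Because the output is plain Markdown, it could be committed to the same Git repo as the runbooks and go through the same PR review.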
The outcome
Six months in, the metrics that mattered moved. Time-to-acknowledge dropped about 35% — not because pagers got faster, but because the responder landed in a Slack channel with the runbook fragment already pinned. Time-to-write-postmortem dropped from a median of 9 days to 3 days; the skeleton was already 60% written by the time the incident closed. Most importantly, the rate at which postmortems produced action items that actually shipped tripled — because action items now lived in the same Git repo as the runbooks, and the runbook update was the action item.
The thing I am proudest of is what we did not build. No analytics dashboard. No ML-driven alert grouping. No fancy on-call calendar. The platform did three things and did them in a way the engineers trusted, which is the only metric that mattered.