Distributed Systems · Northwind Payments (fictional engagement)

An Incident Response Platform for a Mid-Size Payments Org

How we replaced four disjoint paging tools, a Confluence runbook graveyard, and a postmortem template nobody followed with a single platform that engineers actually used during outages.

The challenge

Northwind’s payments engineering org had grown to about 180 engineers across four time zones, and the on-call experience had not kept pace. Pagers fired into one tool, runbooks lived in a Confluence space whose search returned mostly stale drafts, postmortems were written in a Google Doc template nobody loved, and the incident channels in Slack were named after the year and an incrementing integer. By the time a responder had context, the page had often resolved itself — and nobody could tell whether that was good news or a coincidence.

The brief was short: “make on-call less terrible without buying a vendor we’ll regret in two years.”

The approach

We started by walking on-call rotations for three weeks before writing a single line of code. The pattern that emerged was that the connective tissue between paging, runbooks, and postmortems was the broken thing — not any individual tool. Each tool was fine in isolation; the handoffs between them lost context.

We built a small platform around three convictions:

The build was three months of focused work. We integrated with the existing pager (PagerDuty) rather than replacing it; the platform was the layer above. Runbooks lived in a Git repo with PR review — the same review hygiene engineers already practiced. Postmortems shipped as Markdown files committed alongside the runbooks they updated.

The outcome

Six months in, the metrics that mattered moved. Time-to-acknowledge dropped about 35% — not because pagers got faster, but because the responder landed in a Slack channel with the runbook fragment already pinned. Time-to-write-postmortem dropped from a median of 9 days to 3 days; the skeleton was already 60% written by the time the incident closed. Most importantly, the rate at which postmortems produced action items that actually shipped tripled — because action items now lived in the same Git repo as the runbooks, and the runbook update was the action item.

The thing I am proudest of is what we did not build. No analytics dashboard. No ML-driven alert grouping. No fancy on-call calendar. The platform did three things and did them in a way the engineers trusted, which is the only metric that mattered.