<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:syndication="https://joctatorres.dev/ns/syndication">
<channel>
  <title>joc-thoughts</title>
  <link>https://joc-thoughts.dev/</link>
  <description>Editorial-tech essays by Jocta Torres on distributed systems, ML &amp; AI, and the philosophy of computer science.</description>
  <language>en-us</language>
  <atom:link href="https://joc-thoughts.dev/feed.xml" rel="self" type="application/rss+xml" />
  <item>
    <title>The Shape of a Good Incident Postmortem</title>
    <link>https://joc-thoughts.dev/blog/the-shape-of-a-good-incident-postmortem</link>
    <guid isPermaLink="true">https://joc-thoughts.dev/blog/the-shape-of-a-good-incident-postmortem</guid>
    <pubDate>Sat, 26 Apr 2025 00:00:00 GMT</pubDate>
    <description>Most postmortems fail not at the writing but at the framing. The shape of the document is the shape of the lesson — and most documents have the wrong shape.</description>
    <dc:creator><![CDATA[Jocta Torres]]></dc:creator>
    <content:encoded><![CDATA[<p>Most engineering organizations write postmortems the same way. A timeline at the top. Root cause halfway down. Action items at the bottom. A bullet list of contributing factors that, on close reading, contributes very little. The document is signed off, filed in a wiki nobody re-reads, and the same incident — or a close cousin of it — happens again six months later. The ritual was performed. The lesson was not learned.</p>
<p>I have been writing and reviewing postmortems for ten years now, across three companies and four sets of conventions. The pattern I keep hitting is that the <em>shape</em> of the document is the shape of the lesson, and most documents have the wrong shape. They optimize for legibility to the executive who skims them once, not for the engineer who has to absorb them. They emphasize the bug that triggered the outage and underweight the systemic conditions that allowed a single bug to take everything down. And they almost always end before the most important question has been asked.</p>
<h2>The bad shape: timeline-first</h2>
<p>Here is the pattern I see most often. The postmortem leads with a minute-by-minute timeline:</p>
<pre><code>14:02 — alert fires on api-prod-write-latency
14:04 — on-call acknowledges; investigates database
14:11 — escalates to db-team
14:18 — db-team identifies stuck transaction
14:22 — manual rollback issued
14:27 — latency returns to baseline
</code></pre>
<p>This timeline is necessary for a particular kind of audience — the person reconstructing the incident weeks later for a regulatory report, or the SRE writing a runbook patch. It is not, however, the right <em>opening</em>. The reader&#39;s first contact with the document is a narrow procedural log; everything that matters about the incident — the surprise, the model failure, the assumption that turned out to be false — is buried under timestamps.</p>
<p>Timeline-first is the default template in most incident-management tools. It&#39;s the easy thing to generate and the wrong thing to lead with.</p>
<p>The result is that a reader who is not the original responder gets a recap of what <em>happened</em> without ever encountering what was <em>believed</em> before the incident. Postmortems are interesting precisely when they reveal the gap between belief and reality. A timeline rarely shows that gap; it shows the actions that closed it.</p>
<h2>The better shape: belief-first</h2>
<p>Here is the shape I have come to prefer:</p>
<ol>
<li><strong>What we believed before the incident.</strong> Three to five sentences. The mental model the team carried into the day — about capacity, about failure modes, about which dependencies could break and which could not. Specific. Quotable. Wrong somewhere.</li>
<li><strong>What actually happened.</strong> A short narrative — no timestamps yet, just the story. Two to four paragraphs.</li>
<li><strong>Where the belief and the reality diverged.</strong> This is the load-bearing section. Name the <em>specific</em> assumption that turned out to be wrong. Quote it from a runbook or a design doc if you can.</li>
<li><strong>Timeline.</strong> Now we have earned it. The reader knows the stakes.</li>
<li><strong>Contributing factors.</strong> Ranked, not bulleted. The top one matters; the bottom three are filler unless someone defends each one.</li>
<li><strong>What would have prevented this.</strong> Not &quot;what did we do&quot; — that&#39;s tactical. What change in the <em>system that allowed this</em> would have prevented it? Code change, process change, alerting change, organizational change.</li>
<li><strong>What we are doing.</strong> Owners and dates. Three or fewer items. If you have ten items, you have ten future postmortems.</li>
</ol>
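<p>Compressed into a skeleton you could paste at the top of a fresh document (the length budgets are the ones above, not a standard):</p>
<pre><code>1. What we believed before the incident   (3-5 sentences, specific, quotable)
2. What actually happened                 (2-4 paragraphs, no timestamps)
3. Where belief and reality diverged      (the load-bearing section)
4. Timeline                               (now that the stakes are clear)
5. Contributing factors                   (ranked, not bulleted)
6. What would have prevented this         (systemic, not tactical)
7. What we are doing                      (3 or fewer items, owners and dates)
</code></pre>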
<p>The discipline of the belief-first shape forces the writer to surface the <em>prior</em> — the model the team was operating under. Once the prior is written down, the question of whether the prior was reasonable becomes answerable. Sometimes the prior was reasonable and the incident was a genuine surprise; that&#39;s a different lesson than &quot;we missed the obvious thing.&quot; A timeline-first postmortem cannot make this distinction.</p>
<p>You don&#39;t really understand a system until you have observed how it fails — and the observation has to make it back into the model.</p>
<h2>The two failure modes of postmortems</h2>
<p>Once you start writing in this shape, you notice two characteristic failure modes in the postmortems other people write.</p>
<p><strong>Failure mode one: the mechanism is correct but the lesson is shallow.</strong> &quot;We had a stuck transaction; we rolled it back; we will add an alert for stuck transactions.&quot; Mechanically accurate, operationally improved, but the <em>system</em> that allowed a single stuck transaction to cause a 25-minute outage is unchanged. Why was the database not isolated? Why was the on-call surprised? Why did the alert come from latency rather than from the transaction itself? The shallow postmortem stops at the first <em>because</em> and treats the answer as the cause.</p>
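<p>To make the distinction concrete: the shallow fix alerts on the symptom (latency); the deeper fix watches the condition itself. A minimal sketch, assuming PostgreSQL and psycopg2; the threshold is illustrative, and how the team pages on the result is up to them:</p>
<pre><code class="language-python"># A check for the condition itself (a transaction held open too long),
# not its downstream symptom (write latency). Assumes PostgreSQL and
# psycopg2; the 120-second threshold is illustrative.
import psycopg2

LONG_TXN_SECONDS = 120

def stuck_transactions(dsn: str) -&gt; list:
    # Returns (pid, age, state) for every transaction open longer than
    # the threshold, straight from pg_stat_activity.
    with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
        cur.execute(
            &quot;&quot;&quot;
            SELECT pid, now() - xact_start AS age, state
            FROM pg_stat_activity
            WHERE xact_start IS NOT NULL
              AND now() - xact_start &gt; %s * interval &#39;1 second&#39;
            &quot;&quot;&quot;,
            (LONG_TXN_SECONDS,),
        )
        return cur.fetchall()

# Run it on a schedule; page the owner when the list is non-empty.
</code></pre>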
<p><strong>Failure mode two: the lesson is deep but unactionable.</strong> &quot;Our culture of optimism around schema changes is the underlying issue.&quot; Maybe true. Almost certainly not something the next reader can act on. Deep lessons need to land somewhere — a checklist, a code review template, a hiring criterion, a quarterly goal. If the deep lesson does not become a <em>concrete change in how the team operates next week</em>, it is decoration, no matter how true it is.</p>
<p>A good postmortem walks the narrow path between these. It identifies a systemic condition, then proposes a change small enough to make this quarter and concrete enough to verify next quarter.</p>
<h2>A worked example</h2>
<p>Suppose a team runs a payments service that emits an outbox event for every successful charge. Downstream consumers — analytics, fraud, customer email — read from the outbox table via change-data-capture. One day, the outbox-publisher process gets stuck. Events back up. After two hours, somebody notices because the daily revenue dashboard is suspiciously empty. The team restarts the publisher; events flow again; everything looks fine.</p>
<p>The shallow postmortem says: &quot;Add a metric for outbox-publisher lag; alert if lag &gt; 5 minutes.&quot; This is good and the team should do it. It is not enough.</p>
<p>The deeper question: <em>the team did not notice for two hours.</em> Why? Because there was no integrator-of-last-resort whose job it was to ask &quot;is the data flowing?&quot; The dashboards showed application metrics — the payments service itself was healthy — and downstream dashboards were owned by other teams. The systemic condition was: <strong>no single team owned the end-to-end pipeline health, so when the pipeline broke between teams, nobody noticed.</strong> Underneath that is a second condition: the team had never <a href="/blog/distributed-systems-sample">named the deadline</a> — there was no agreed answer to <em>how long is too long?</em>, so two hours of silence registered as nothing.</p>
<p>The deepest lessons from incidents almost always live in the seams between teams. The team that owns the upstream service notices upstream. The team that owns the downstream service notices downstream. The seam is where the alarm should be — and is usually where it is missing.</p>
<p>The actionable change is not &quot;add a metric.&quot; It is &quot;designate a pipeline-health owner&quot; — a rotation, a named team, a dashboard that reports the <em>gap</em> between upstream production and downstream consumption. The metric falls out of that ownership. Without the ownership, the metric exists but nobody is on the hook for watching it.</p>
<pre><code class="language-python"># Conceptual sketch: a single end-to-end health check
# owned by one team, asserted in one place.
def assert_pipeline_health() -&gt; HealthReport:
    upstream_count = payments.events_emitted_in(last_hour=True)
    downstream_count = analytics.events_consumed_in(last_hour=True)
    lag = upstream_count - downstream_count
    if lag &gt; LAG_THRESHOLD:
        page(&quot;pipeline-owner&quot;, reason=f&quot;outbox lag={lag}&quot;)
    return HealthReport(upstream=upstream_count,
                        downstream=downstream_count,
                        lag=lag)
</code></pre>
<p>The code is trivial. The hard part — the part that the postmortem must produce — is the answer to &quot;who runs this, and who gets paged when it fails?&quot; That answer is organizational, not technical, and a timeline-first postmortem will almost never reach it.</p>
<h2>Postmortems as compounding artifacts</h2>
<p>There is a second-order benefit to writing postmortems in the belief-first shape: they accumulate into a useful body of <a href="/topics/distributed-systems">organizational knowledge about distributed systems</a> over time. A pile of timeline-first postmortems is not a knowledge base; it is a chronological archive. Reading ten of them in sequence is exhausting and yields little because each one starts over with new timestamps and new components.</p>
<p>A pile of belief-first postmortems, by contrast, can be re-read for patterns: <em>which beliefs, across the company, have been wrong most often?</em> If you have twelve postmortems and seven of them turn on a wrong belief about backpressure, you have not had seven incidents — you have had one organizational gap manifesting seven times. The pattern is invisible in timeline-first writing because the timelines do not align. It is obvious in belief-first writing because the priors do.</p>
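<p>The tally itself is a few lines once the prior is a named field rather than buried prose. A minimal sketch, assuming each postmortem opens with a small YAML header carrying a <code>wrong_belief</code> tag; the field name and the categories are illustrative, not a convention your tooling already has:</p>
<pre><code class="language-python"># Count which categories of wrong belief recur across a directory of
# postmortems. Assumes each file opens with a YAML header containing a
# wrong_belief field; the field name and categories are illustrative.
from collections import Counter
from pathlib import Path

import yaml  # PyYAML

def tally_wrong_beliefs(postmortem_dir: str) -&gt; Counter:
    counts: Counter = Counter()
    for path in Path(postmortem_dir).glob(&quot;*.md&quot;):
        text = path.read_text()
        if not text.startswith(&quot;---&quot;):
            continue
        header = yaml.safe_load(text.split(&quot;---&quot;)[1]) or {}
        counts[header.get(&quot;wrong_belief&quot;, &quot;unlabeled&quot;)] += 1
    return counts

# e.g. Counter({&#39;backpressure&#39;: 7, &#39;capacity-headroom&#39;: 3, &#39;dns-ttl&#39;: 2})
</code></pre>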
<p>This is why I push teams to write the <em>prior</em> first, even when it feels embarrassing. The embarrassment is the point. The prior is what was wrong, and writing it out makes the wrongness legible. Once it is legible, it becomes correctable — first in this team&#39;s understanding, then in the next team&#39;s via the documents that survive.</p>
<h2>What about blame?</h2>
<p>The standard answer is &quot;postmortems are blameless.&quot; This is correct, and it is also insufficient on its own. <em>Blameless</em> means the postmortem does not assign fault to an individual; it does not mean the postmortem refrains from naming what was wrong. A postmortem that is so blameless it cannot identify the false belief is not blameless; it is <em>vague</em>, and vagueness is its own kind of harm because it makes the lesson unreachable.</p>
<p>The right move is to be specific about what was wrong while being generous about why a competent person could have believed it. &quot;The runbook said the database was the bottleneck; based on Q3 load tests, this was true. By Q1 of the next year, the bottleneck had quietly shifted to the message broker, and no one re-ran the load test.&quot; This is blameless and specific. It accuses no individual; it <a href="/blog/cs-philosophy-sample">names the precise belief</a> that drifted out of sync with reality and the artifact that should have caught the drift.</p>
<p>A postmortem that hits this register — specific about the gap, generous about the cause, concrete about the change — is rare. When you read one, save it. When you write one, share it. They are how teams get less surprised over time.</p>
<h2>A short checklist</h2>
<p>If I had to compress all of this into a checklist for the writer of the next postmortem:</p>
<ul>
<li>Did you write the <strong>prior</strong> down — what the team believed before the incident? Three sentences minimum.</li>
<li>Did you name the <strong>specific belief</strong> that turned out to be wrong, and quote it from a runbook or design doc?</li>
<li>Are your <strong>action items</strong> concrete enough that a stranger reading the document next quarter can ask &quot;did we do that?&quot;</li>
<li>Is there at most <strong>one organizational change</strong> in the action items? More than one means none of them will actually happen.</li>
<li>Does the document name an <strong>owner</strong> for each action item, not just a team?</li>
<li>If you re-read this postmortem in a year alongside ten others, will the <strong>pattern</strong> show?</li>
</ul>
<p>If the answer to all six is yes, the postmortem is doing its job. If not, it is — like most of the postmortems I have read — a record of an event rather than the seed of a lesson. The first kind is easy to produce and easy to forget. The second kind is harder to write and harder to forget. We need more of the second kind.</p>
]]></content:encoded>
    <category>Distributed Systems</category>
    <category>reliability</category>
    <category>operations</category>
    <category>writing</category>
    <category>postmortems</category>
  </item>
  <item>
    <title>Naming Things Is the Last Hard Problem in Computer Science</title>
    <link>https://joc-thoughts.dev/blog/cs-philosophy-sample</link>
    <guid isPermaLink="true">https://joc-thoughts.dev/blog/cs-philosophy-sample</guid>
    <pubDate>Fri, 18 Apr 2025 00:00:00 GMT</pubDate>
    <description>Cache invalidation got tooling. Off-by-one errors got tests. Naming is still mostly vibes, and that has cost us more than we admit.</description>
    <dc:creator><![CDATA[Jocta Torres]]></dc:creator>
    <content:encoded><![CDATA[<p>Phil Karlton&#39;s joke — &quot;there are only two hard things in computer science: cache invalidation and naming things&quot; — is repeated so often that the actual claim has been lost. The claim is that <strong>naming is hard at the same level cache invalidation is hard</strong>: not as a polish step, but as a load-bearing engineering activity.</p>
<p>We have made tremendous progress on cache invalidation. We have CDN purge APIs, vector clocks, content-addressed storage. We have not made remotely as much progress on naming. Naming is still, almost everywhere, a matter of taste, of habit, of whatever the original author happened to type at 2am.</p>
<h2>Why bad names cost so much</h2>
<p>A bad name is not a lexical error. It is a <strong>type error in the social system that maintains the code</strong> — one of <a href="/topics/cs-philosophy">the recurring themes of the philosophy of computer science</a>, and one we keep underestimating. A function named <code>processItem</code> invites every future caller to project their own theory of what the function does — and one of them will be wrong. A boolean called <code>flag</code> becomes a magnet for special cases. A class called <code>Manager</code> accretes responsibilities until it manages the heat death of the universe.</p>
<p>The cost compounds because naming determines <strong>who can change the code</strong>. A reader who has to derive intent from the body before they can edit safely is a slow reader. A reader whose grasp of the system depends on memorizing nicknames is a brittle reader. Names are the project&#39;s API for its own contributors.</p>
<pre><code class="language-ts">// Both compile. One is a debugging session.
function check(u: User, x: number): boolean { /* ... */ }
function userHasSufficientCredits(user: User, requiredCredits: number): boolean { /* ... */ }
</code></pre>
<p>You can argue the second is verbose. You cannot argue the second is unclear at the call site, six months later, <a href="/blog/the-shape-of-a-good-incident-postmortem">in a postmortem</a>.</p>
<h2>What &quot;good naming&quot; actually requires</h2>
<p>It requires three things, none of which are tooling:</p>
<ul>
<li><strong>A vocabulary that the team has agreed on.</strong> Glossaries are deeply unfashionable and almost always worth the half-day.</li>
<li><strong>A willingness to rename late.</strong> Most names are wrong on the first pass; the cost of fixing them goes up monotonically with time.</li>
<li><strong>A house style that resists cleverness.</strong> Cute names age badly. Boring names age into prose.</li>
</ul>
<h2>The remaining hard problem</h2>
<p>We can cache-invalidate at scale because we treated invalidation as engineering. We can mostly avoid off-by-one errors because we treated them as engineering. The same move is now overdue for <a href="/blog/ml-ai-sample">training-data provenance</a> — also currently shrugged off as taste, also paying the same compounding bill. Naming is the last large category of bug we have collectively decided to leave to &quot;taste&quot;. That decision is more expensive than it looks, and the bill keeps coming due in onboarding time, in ambiguous bug reports, in the slow compounding cost of code nobody quite understands.</p>
<p>We could decide otherwise. It would be cheap. We just don&#39;t, yet.</p>
]]></content:encoded>
    <category>CS &amp; Philosophy</category>
    <category>language</category>
    <category>design</category>
    <category>taste</category>
  </item>
  <item>
    <title>Training-Data Provenance Is the Real Alignment Problem</title>
    <link>https://joc-thoughts.dev/blog/ml-ai-sample</link>
    <guid isPermaLink="true">https://joc-thoughts.dev/blog/ml-ai-sample</guid>
    <pubDate>Fri, 11 Apr 2025 00:00:00 GMT</pubDate>
    <description>Before we argue about model behavior, we should be able to point to a row of data and explain how it got into the corpus. We mostly cannot.</description>
    <dc:creator><![CDATA[Jocta Torres]]></dc:creator>
    <content:encoded><![CDATA[<p>Most public conversation about AI alignment is downstream of a much simpler problem: we generally cannot point to a row of training data and explain, in concrete terms, how it ended up in the corpus. Not &quot;approximately&quot; — concretely. Which crawl, which heuristic, which dedupe pass, which licensing assumption.</p>
<h2>Provenance vs. lineage</h2>
<p>It helps to separate two ideas. <strong>Lineage</strong> is the story of how data flows through your pipeline once you have it: cleaning, tokenization, sharding, mixing. Lineage is mostly a tooling problem, and the field is getting better at it.</p>
<p><strong>Provenance</strong> is the story of where the data came from in the first place. That story tends to terminate in phrases like &quot;Common Crawl, snapshot 2024-09&quot; or &quot;a partner dataset&quot; or, worst, &quot;we don&#39;t remember exactly&quot;. Provenance is upstream of lineage, and it&#39;s where <a href="/topics/ml-ai">the genuinely hard ML/AI questions</a> live.</p>
<h2>A small, fixable problem</h2>
<p>The smallest version of this problem is reproducibility. If a behavior surfaces in eval at version <code>v37</code> of a model and is gone in <code>v38</code>, the obvious question is: what changed in the data? Today, for many teams, the honest answer is &quot;we changed eight things and re-ran&quot;. This is the same shape of failure as <a href="/blog/distributed-systems-sample">an unnamed deadline in a distributed system</a> — without the marker written down, the post-hoc reconstruction is the only debugging surface you have. A provenance graph would let you answer it in one query.</p>
<pre><code class="language-python"># What we want to be able to say:
diff = corpus_v38.provenance_set() - corpus_v37.provenance_set()
print(diff.sources_added)   # which upstreams entered the mix
print(diff.sources_dropped) # which were filtered out
</code></pre>
<p>That doesn&#39;t require a new framework. It requires that every source — every URL list, every partner dump, every synthetic generator — emits a stable identifier, and that the corpus retains that identifier per row.</p>
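<p>The per-row record does not need to be elaborate. A minimal sketch of what a stable identifier per row could look like; the field names are illustrative, not a proposed standard:</p>
<pre><code class="language-python"># A per-row provenance record: enough to answer which crawl, which
# heuristic, which licensing assumption with a lookup instead of an
# archaeology project. Field names are illustrative.
from dataclasses import dataclass
import hashlib

@dataclass(frozen=True)
class ProvenanceRecord:
    source_id: str       # stable id of the upstream source (manifest key)
    snapshot: str        # which crawl, dump, or generator run
    license_basis: str   # the licensing assumption relied on
    content_sha256: str  # hash of the row&#39;s raw content

def record_for(source_id: str, snapshot: str, license_basis: str,
               raw: bytes) -&gt; ProvenanceRecord:
    return ProvenanceRecord(
        source_id=source_id,
        snapshot=snapshot,
        license_basis=license_basis,
        content_sha256=hashlib.sha256(raw).hexdigest(),
    )
</code></pre>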
<h2>The bigger reason it matters</h2>
<p>The bigger reason is that <strong>regulation is not going to wait for us to figure this out</strong>. The EU AI Act, the various US state-level proposals, the inevitable consent-based opt-outs — every one of these requires the operator to answer a provenance question on demand. &quot;We trained on a public mix&quot; is not going to be enough. &quot;We trained on these 4,217 sources, here is the manifest, here is when each was last refreshed, here is the legal basis we relied on for each&quot; — that is going to be the bar.</p>
<p>Building toward that bar is mostly <strong>boring infrastructure</strong>: hashing, manifesting, signing, journaling. None of it is research-flavored. All of it pays off the first time someone asks a hard question and you can answer it with a query instead of a Slack thread.</p>
<p>The alignment debate is not going to make sense until we can have it grounded in concrete artifacts. Provenance is what makes the artifact concrete — and concreteness, as in <a href="/blog/cs-philosophy-sample">every other corner of engineering</a>, starts with a name we have agreed to.</p>
]]></content:encoded>
    <category>ML &amp; AI</category>
    <category>alignment</category>
    <category>data</category>
    <category>provenance</category>
  </item>
  <item>
    <title>What Consensus Protocols Teach Us About Deadlines</title>
    <link>https://joc-thoughts.dev/blog/distributed-systems-sample</link>
    <guid isPermaLink="true">https://joc-thoughts.dev/blog/distributed-systems-sample</guid>
    <pubDate>Fri, 04 Apr 2025 00:00:00 GMT</pubDate>
    <description>Raft, Paxos, and the surprising lesson that durable agreement is mostly about agreeing when to stop waiting.</description>
    <dc:creator><![CDATA[Jocta Torres]]></dc:creator>
    <content:encoded><![CDATA[<p>Consensus protocols are usually introduced as algorithms for agreeing on a value. Read enough Raft papers and a different framing emerges: they are primarily algorithms for agreeing <strong>when to stop waiting</strong>. The value is incidental; the deadline is the contract.</p>
<h2>The deadline is the contract</h2>
<p>In a single-node system, &quot;now&quot; is unambiguous. The instant you write to disk and <code>fsync</code> returns, the world has moved on. In a <a href="/topics/distributed-systems">distributed system</a>, &quot;now&quot; is a negotiation. Until a quorum acknowledges, the write is suspended in a kind of probabilistic superposition — durable enough to survive most failures, fragile enough that a partition can rewrite it. The whole engineering job is bounding that interval.</p>
<p>Raft&#39;s election timeout, Paxos&#39;s ballot numbering, ZAB&#39;s epoch counters — they are all variants of the same trick: <strong>make the deadline a first-class value that every participant can compare</strong>. Once a follower decides &quot;I have waited long enough&quot;, it doesn&#39;t ask permission. It promotes itself, and the protocol absorbs the disagreement.</p>
<pre><code class="language-ts">// Conceptually — pseudocode, not real Raft.
if (now() - lastHeartbeat &gt; electionTimeout) {
  becomeCandidate();
  requestVotesFrom(peers);
}
</code></pre>
<p>The interesting part isn&#39;t the <code>if</code>. It&#39;s that <code>electionTimeout</code> is <strong>randomized per node</strong> so two followers don&#39;t promote in lockstep. The protocol uses entropy as a coordination primitive.</p>
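<p>In code, the randomization is one line. A minimal sketch in Python; the 150-300 ms range is the starting point the Raft paper suggests, and everything else is illustrative:</p>
<pre><code class="language-python"># Randomized election timeouts: each follower draws its own deadline so
# that two followers rarely promote in the same instant. The 150-300 ms
# range is the starting point the Raft paper suggests.
import random
import time

ELECTION_TIMEOUT_MS = (150, 300)

def arm_election_timer() -&gt; float:
    # Returns the monotonic deadline after which this node stops waiting.
    low, high = ELECTION_TIMEOUT_MS
    return time.monotonic() + random.uniform(low, high) / 1000.0

# On every heartbeat from the leader: deadline = arm_election_timer()
# On every tick: if time.monotonic() exceeds deadline, become a candidate.
</code></pre>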
<h2>What the deadline buys you</h2>
<p>Three things, in the order they matter:</p>
<ol>
<li><strong>Liveness.</strong> A bounded interval after which a stuck cluster will pick a winner, even if the wrong one.</li>
<li><strong>Convergence.</strong> Subsequent rounds will rule out the bad winner and replace it. Liveness without convergence is just thrashing.</li>
<li><strong>A debugging surface.</strong> When something goes wrong, the timestamps tell you who decided what, and when. No deadline, no story.</li>
</ol>
<p>The third one is undervalued. A distributed system without explicit deadlines is a system whose failures can only be reasoned about post-hoc by reading logs and squinting. Once the deadline is in the protocol, it&#39;s also in the metrics, the traces, and the SLOs.</p>
<h2>The lesson generalizes</h2>
<p>Most &quot;distributed&quot; problems in application code aren&#39;t really about distribution. They are about deadlines we never named. The cron job that sometimes runs twice; the webhook that sometimes fires after the user has logged out; the cache that sometimes serves stale data after a write — these are all <a href="/blog/the-shape-of-a-good-incident-postmortem">consensus failures in miniature</a>, and the unnamed deadline is what turns each one into an incident the team has to write up later. Naming the deadline turns them into design problems instead of incidents.</p>
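<p>Naming the deadline can be as small as refusing to act on anything past an explicit age. A minimal sketch for the webhook case, assuming each event carries an ISO-8601 <code>emitted_at</code> timestamp with a timezone offset; the field name and the five-minute threshold are illustrative:</p>
<pre><code class="language-python"># The handler names its deadline: events older than MAX_EVENT_AGE are
# rejected up front instead of being applied to a session that may no
# longer exist. The field name and threshold are illustrative.
from datetime import datetime, timedelta, timezone

MAX_EVENT_AGE = timedelta(minutes=5)   # the named deadline

def past_deadline(event: dict) -&gt; bool:
    emitted_at = datetime.fromisoformat(event[&quot;emitted_at&quot;])
    age = datetime.now(timezone.utc) - emitted_at
    return age &gt; MAX_EVENT_AGE

# In the handler: if past_deadline(event), reject it and count the
# rejection. That is a design decision made in daylight, not an incident
# reconstructed later from logs.
</code></pre>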
<p>That, I think, is the durable lesson the consensus literature has for the rest of us. Not the algorithms. The <a href="/blog/cs-philosophy-sample">discipline of writing the deadline down</a>.</p>
]]></content:encoded>
    <category>Distributed Systems</category>
    <category>consensus</category>
    <category>raft</category>
    <category>reliability</category>
  </item>
</channel>
</rss>
