What Consensus Protocols Teach Us About Deadlines
Raft, Paxos, and the surprising lesson that durable agreement is mostly about agreeing when to stop waiting.
Consensus protocols are usually introduced as algorithms for agreeing on a value. Read enough Raft papers and a different framing emerges: they are primarily algorithms for agreeing when to stop waiting. The value is incidental; the deadline is the contract.
The deadline is the contract
In a single-node system, “now” is unambiguous. The instant you write to disk and fsync returns, the world has moved on. In a distributed system, “now” is a negotiation. Until a quorum acknowledges, the write is suspended in a kind of probabilistic superposition: persisted on some nodes but not yet committed, so a leader change after a partition can still overwrite it. The whole engineering job is bounding that interval.
Raft’s election timeouts and terms, Paxos’s ballot numbers, and ZAB’s epoch counters are all variants of the same trick: make the deadline a first-class value that every participant can compare. Once a follower decides “I have waited long enough”, it doesn’t ask permission. It promotes itself to candidate, and the protocol absorbs the disagreement.
// Conceptually: pseudocode, not real Raft.
if (now() - lastHeartbeat > electionTimeout) {
    becomeCandidate();
    requestVotesFrom(peers);
}
The interesting part isn’t the if. It’s that electionTimeout is randomized per node, so two followers rarely promote in lockstep. The protocol uses entropy as a coordination primitive.
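To make the randomization concrete, here is a minimal Python sketch. The 150 to 300 ms range follows the example interval in the Raft paper; the function names, the node names, and the seeded generator are my own assumptions for illustration, not part of any real implementation.

```python
import random

# Illustrative bounds (assumption): the Raft paper uses 150-300 ms as an example.
ELECTION_TIMEOUT_MIN_MS = 150
ELECTION_TIMEOUT_MAX_MS = 300


def draw_election_timeout(rng: random.Random) -> float:
    """Pick a timeout uniformly from the configured range."""
    return rng.uniform(ELECTION_TIMEOUT_MIN_MS, ELECTION_TIMEOUT_MAX_MS)


def first_to_time_out(timeouts: dict[str, float]) -> str:
    """Return the node whose timer fires first, i.e. the first candidate."""
    return min(timeouts, key=timeouts.get)


rng = random.Random(42)  # seeded only so the sketch is repeatable
timeouts = {node: draw_election_timeout(rng) for node in ("a", "b", "c")}
candidate = first_to_time_out(timeouts)
```

Because each node draws independently, one timer almost always fires well before the others, and that node can usually collect votes before anyone else wakes up.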
What the deadline buys you
Three things, in the order they matter:
- Liveness. A bounded interval after which a stuck cluster will pick a winner, even if the wrong one.
- Convergence. Subsequent rounds will rule out the bad winner and replace it. Liveness without convergence is just thrashing.
- A debugging surface. When something goes wrong, the timestamps tell you who decided what, and when. No deadline, no story.
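Convergence in particular rests on a comparable counter. In Raft, a node that sees a term higher than its own adopts that term and reverts to follower. The sketch below shows only that rule; the `Node` class and its field names are hypothetical, and everything else a real node does (log checks, vote bookkeeping) is left out.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    term: int = 0
    role: str = "follower"  # "follower", "candidate", or "leader"

    def observe_term(self, remote_term: int) -> None:
        """Adopt any higher term seen on the wire and fall back to follower."""
        if remote_term > self.term:
            self.term = remote_term
            self.role = "follower"


# A leader from term 3 hears from a peer that timed out and started term 4.
stale_leader = Node("a", term=3, role="leader")
stale_leader.observe_term(4)
```

The stale leader needs no external verdict: comparing two integers is enough for it to stand down.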
The third one is undervalued. A distributed system without explicit deadlines is a system whose failures can only be reasoned about post-hoc by reading logs and squinting. Once the deadline is in the protocol, it’s also in the metrics, the traces, and the SLOs.
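One way to picture that debugging surface is to treat every deadline decision as a structured event. This is a sketch under my own assumptions: the `DeadlineEvent` type, its field names, and the JSON-lines output are illustrative choices, not a format any particular system uses.

```python
import json
import time
from dataclasses import asdict, dataclass


@dataclass
class DeadlineEvent:
    node: str
    deadline_name: str
    waited_s: float
    limit_s: float
    decided_at: float  # wall-clock timestamp, for humans reading the log

    @property
    def expired(self) -> bool:
        return self.waited_s > self.limit_s


def record(event: DeadlineEvent) -> str:
    """Serialize one decision as a JSON line suitable for a log or trace."""
    payload = asdict(event)
    payload["expired"] = event.expired
    return json.dumps(payload, sort_keys=True)


line = record(DeadlineEvent("node-b", "election_timeout", 0.31, 0.25, time.time()))
```

Each line answers the three questions from the list above: who decided, what they decided, and when.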
The lesson generalizes
Most “distributed” problems in application code aren’t really about distribution. They are about deadlines we never named. The cron job that sometimes runs twice, the webhook that sometimes fires after the user has logged out, the cache that sometimes serves stale data after a write: these are all consensus failures in miniature, and the unnamed deadline is what turns each one into an incident the team has to write up later. Naming the deadline turns them into design problems instead.
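For the cron job, naming the deadline could look like a lease with an explicit expiry. The sketch below is an in-memory stand-in under stated assumptions: `LeaseTable`, `try_acquire`, and the job and worker names are all hypothetical, and a real deployment would need a shared store with an atomic compare-and-set.

```python
import time
from dataclasses import dataclass
from typing import Optional


@dataclass
class Lease:
    holder: str
    expires_at: float  # the named deadline, in seconds on the caller's clock


class LeaseTable:
    """In-memory stand-in for a shared store (assumption for illustration)."""

    def __init__(self) -> None:
        self._leases: dict[str, Lease] = {}

    def try_acquire(self, job: str, holder: str, ttl_s: float,
                    now: Optional[float] = None) -> bool:
        now = time.monotonic() if now is None else now
        current = self._leases.get(job)
        if current is not None and current.expires_at > now and current.holder != holder:
            return False  # someone else holds an unexpired lease
        self._leases[job] = Lease(holder, now + ttl_s)
        return True


table = LeaseTable()
first = table.try_acquire("nightly-report", "worker-1", ttl_s=60, now=0.0)
second = table.try_acquire("nightly-report", "worker-2", ttl_s=60, now=10.0)
after_expiry = table.try_acquire("nightly-report", "worker-2", ttl_s=60, now=61.0)
```

Here the second worker is refused while the lease is live and admitted once it lapses. A lease does not remove the race entirely, since a holder can stall past its own expiry, but it turns "sometimes runs twice" into a question about one number: the TTL.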
That, I think, is the durable lesson the consensus literature has for the rest of us. Not the algorithms. The discipline of writing the deadline down.