Anticipating the inevitable: Pre-mortem and defense in depth
The next outage is already taking shape, invisible until the first alert fires. It might sit in a supply chain flaw inside a trusted IOS XR patch, quietly altering routes worldwide. Or it could stem from a single flawed intent policy in an ACI fabric, isolating entire application layers with surgical precision. External forces such as wildfires, floods or geopolitical events can force data center evacuations, knock out power grids and delay generators for hours. The 2021 Fastly global outage, triggered by one valid customer configuration change that exposed a latent software bug, shows how fast a CDN can collapse. These scenarios are not speculation; each is a real probability with its own failure signature.
Experience reframes the question. Failure is inevitable in infrastructure work. What matters is how it unfolds and whether the design anticipated that exact failure mode. Resilience now means shaping the impact of failure, not preventing it outright. This mindset demands a new ritual: the pre-mortem. In every design review, we assume total failure at peak load. We trace dependencies: transit providers, certificate authorities, undersea cables, even physical access roads. We hunt for shared fate: two “diverse” carriers in the same conduit, a single control plane for multi-region DNS or a vendor update applied globally without validation. Each discovery triggers action: a new peer, a policy rewrite, a satellite link or a dark fiber lease. AWS recommends pre-mortems in its Reliability Pillar.
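The hunt for shared fate can be partly mechanized. What follows is a minimal Python sketch, not a production tool: it assumes someone has already written down, for each supposedly independent path, the underlying resources it relies on. The inventory and every name in it (carrier_a, conduit-7, ca-rootx and so on) are hypothetical placeholders; the function simply reports any resource that more than one path depends on.

```python
from collections import defaultdict

# Hypothetical inventory: each supposedly independent path lists the
# underlying resources it depends on. All names are illustrative.
PATHS = {
    "carrier_a": {"conduit-7", "pop-east", "ca-rootx"},
    "carrier_b": {"conduit-7", "pop-west", "ca-rootx"},
    "satellite": {"ground-station-1", "ca-rooty"},
}


def shared_fate(paths):
    """Map each resource used by two or more paths to the sorted path names."""
    users = defaultdict(set)
    for path, resources in paths.items():
        for resource in resources:
            users[resource].add(path)
    # Keep only resources that more than one path depends on.
    return {res: sorted(deps) for res, deps in users.items() if len(deps) > 1}


if __name__ == "__main__":
    for resource, dependents in sorted(shared_fate(PATHS).items()):
        print(f"{resource}: shared by {', '.join(dependents)}")
```

In this made-up inventory the two "diverse" carriers turn out to share both a conduit and a certificate authority, while the satellite link shares nothing with either. The hard part in practice is building an honest inventory; the check itself is trivial once the data exists.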
Two years ago, I sat in a dim network operations center at 3 a.m., cold coffee forgotten, as one BGP update spread chaos via a global transit provider. A peer leaked a default route that won our best-path selection, sucking outbound traffic into oblivion. The backup path was fully functional, yet our policy still favored the tainted route. For 17 minutes, half the internet was unreachable for our users. Customers raged. Executives demanded answers. A swift prefix filter fixed it, but the lesson lingered: redundancy requires not just a second path, but the intelligence to choose it and to reject the wrong one. That night, I rewrote our change process: no routing policy touches production without simulation, peer review and automated testing.
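To make the "simulation and automated testing" step concrete, here is a small Python sketch of the kind of check that could gate a routing policy change. It is a toy model rather than a real BGP implementation, and it rests on several assumptions not stated above: IPv4 only, best path chosen purely by highest local preference, and a peer import policy that accepts only prefixes between /8 and /24. The Route class, the neighbor names and the preference values are all hypothetical.

```python
import ipaddress
from dataclasses import dataclass


@dataclass(frozen=True)
class Route:
    prefix: str       # e.g. "0.0.0.0/0"
    neighbor: str     # hypothetical session name
    local_pref: int   # higher wins in this toy model


def accept_from_peer(route, min_len=8, max_len=24):
    """Assumed peer import policy: reject the default route and any
    IPv4 prefix whose length falls outside min_len..max_len."""
    net = ipaddress.ip_network(route.prefix)
    return min_len <= net.prefixlen <= max_len


def simulate(announcements, peers):
    """Return the routes that survive import policy. Only sessions
    listed in `peers` are filtered; transit sessions pass unchanged."""
    return [
        r for r in announcements
        if r.neighbor not in peers or accept_from_peer(r)
    ]


def best_route(routes, prefix):
    """Pick the highest local_pref route for an exact prefix, or None."""
    candidates = [r for r in routes if r.prefix == prefix]
    return max(candidates, key=lambda r: r.local_pref, default=None)


if __name__ == "__main__":
    leak = Route("0.0.0.0/0", "peer_x", 200)            # leaked default
    backup = Route("0.0.0.0/0", "transit_backup", 100)  # healthy path
    accepted = simulate([leak, backup], peers={"peer_x"})
    print(best_route(accepted, "0.0.0.0/0").neighbor)
```

Run without the filter, the leaked default wins on preference; run through simulate, it is dropped at import and the healthy transit path is selected. A check like this can live in a test suite so that a policy edit which reopens the hole fails before it reaches production.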
