Lesson 80 • Caeruleum Velum Mortis
"Blue Screen of Death." This is the software-engineering-disaster equivalent of the butterfly flapping its wings and begetting a hurricane.
A cavalier patch to a broadly integrated cybersecurity platform caused a global outage costing billions.
Last Friday (19 July), the largest-ever software-initiated global outage hit machines worldwide. Millions of Windows 10 and 11 operating systems used by societally-critical businesses like airlines, banks, supermarkets, police departments, hospitals, TV channels, etc., suddenly crashed with the dreaded “Blue Screen of Death,” and no obvious way to fix them. This was a truly global outage; the US, Europe, Asia, and Australia were all hit.
Delta Air Lines will have lost $500m and provoked further regulatory scrutiny; it was still canceling flights a week later. On monitors everywhere, the blue screen of death branded the outage as ostensibly courtesy of Microsoft. The reputational damage from an upstream service spread as a pandemic of service interruption and lag that cost startups, in proportion, real money, real dignity, and real customers. Users don’t care why a service is slow, just that it is, and that it’s yours.
Global air travel descended into chaos, and in Alaska the emergency services number stopped working. In the UK, Sky News TV was unable to broadcast, and McDonald’s had to close some of its Japanese outlets due to cash registers going down.
It is a tenet that “most problems aren’t,” but this probably isn’t one of those. Who might have died because the emergency services couldn’t respond? Which relationships were irrevocably damaged because someone couldn’t make a flight? Who’s going to lose their job?
Congress summoned CrowdStrike’s CEO, and we should expect that in the wake of his summons one or more engineers will be fired. Assume customer loss and loss of revenue due to SLA infractions will do significant damage to CrowdStrike’s business, resulting in layoffs.
The breadth of disruption demands scapegoats, but it’s clear the problems are systemic. Engineering lacked testing procedures or ignored them. Regulation in the 2000s forced the architectural choice that allowed a security vendor (CrowdStrike) to brick Windows machines in the first place - 20 years later. This is the software-engineering-disaster equivalent of the butterfly flapping its wings and begetting a hurricane.
Many of you might have been forced into the position of answering for lag or an outage in your work, but hopefully you found a silver lining in the newsworthiness of the event, one that let you shift blame and save face. Even at the highest levels, Satya Nadella is pointing fingers. This is not because you or Satya Nadella are petty, but because outages cost tech workers their jobs. It’s a forgivable face-saving instinct.
Blame is a kind of losing-your-shit irrationality we should look sidelong at, however, and this outage — rich with cause, rich with effect, rich with examples of leaders publicly reacting — should provoke a practical opportunity to look inward and ask: what could I have done to prevent the blue screen of death? The answer isn’t “nothing.”
When you blame and criticize others, you are avoiding some truth about yourself. — Nhat Hanh