The BIG Outage: The One System That Can Take Down Everything


Descriptions:

Dave Plummer, a retired Microsoft software engineer whose career spans MS-DOS and Windows 95, challenges the widely held assumption that cloud computing is inherently safer than on-premises infrastructure. His central argument is that the cloud doesn’t eliminate failures; it concentrates them, trading many small, isolated incidents for fewer but far larger correlated ones that can affect every customer at once.

Drawing on three eras of computing (mainframes, personal computers, and cloud), Plummer explains that modern outages rarely originate from hardware failures, which cloud providers handle well through redundancy. Instead, they stem from human-layer events: configuration changes, software rollouts, certificate rotations, DNS updates, and policy tweaks that look safe in isolation. He introduces the concept of assumed redundancy: when primary systems and their failovers share the same identity provider, deployment pipeline, and DNS infrastructure, they carry the same failure assumptions and fail together.
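The idea of assumed redundancy can be illustrated with a small check. The following is a minimal sketch, not something shown in the video: the dependency names (`idp-main`, `deploy-pipeline-a`, and so on) and the `shared_dependencies` helper are hypothetical, chosen only to show how a primary and its failover can look independent while resting on the same components.

```python
def shared_dependencies(primary: set[str], failover: set[str]) -> set[str]:
    """Return the dependencies that both systems rely on.

    Anything in this set is a common failure domain: if it breaks,
    the primary and the failover are both affected.
    """
    return primary & failover


# Hypothetical dependency inventories (illustrative names only).
primary = {"idp-main", "deploy-pipeline-a", "dns-zone-1", "region-east"}
failover = {"idp-main", "deploy-pipeline-a", "dns-zone-1", "region-west"}

common = shared_dependencies(primary, failover)
print("Shared failure domains:", sorted(common))
```

In this made-up inventory only the region differs, so a bad identity-provider change, pipeline rollout, or DNS update would reach both systems despite the apparent redundancy.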

The video also puts SLA numbers in concrete terms: 99.9% uptime still permits about 43 minutes of downtime per month. It reframes SLAs as refund policies rather than safety guarantees. Multi-cloud strategies are addressed head-on: true independence across providers requires different tooling, APIs, runbooks, and operational expertise, negating much of what motivated cloud adoption in the first place. The overall message is a call for engineers and infrastructure decision-makers to think carefully about failure domains, the hidden costs of centralized dependencies, and what the word ‘safe’ actually means when evaluating cloud architecture.
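The 43-minute figure follows from simple arithmetic. As a rough sketch, assuming a 30-day month (the function name and the extra availability tiers are illustrative and not taken from the video):

```python
def downtime_budget_minutes(availability_pct: float, days: int = 30) -> float:
    """Minutes of downtime permitted over `days` at a given availability percentage."""
    total_minutes = days * 24 * 60  # 43,200 minutes in a 30-day month
    return total_minutes * (1 - availability_pct / 100)


# Compare a few common availability tiers.
for pct in (99.0, 99.9, 99.99):
    budget = downtime_budget_minutes(pct)
    print(f"{pct}% uptime allows {budget:.1f} minutes of downtime per 30-day month")
```

At 99.9% this works out to roughly 43.2 minutes, which matches the number quoted in the description.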


📺 Source: Dave’s Garage · Published December 18, 2025
🏷️ Format: Deep Dive
