The BIG Outage: The One System That Can Take Down Everything

Foundation Models · 7 months ago

Descriptions:

Dave Plummer, a retired Microsoft software engineer whose career spans MS-DOS and Windows 95, challenges the widely held assumption that cloud computing is inherently safer than on-premises infrastructure. His central argument is that the cloud doesn’t eliminate failures; it concentrates them, trading many small, isolated incidents for fewer but far larger correlated ones that affect every customer simultaneously.

Drawing on three eras of computing (mainframes, personal computers, and cloud), Plummer explains that modern outages rarely originate from hardware failures, which cloud providers handle well through redundancy. Instead, they stem from human-layer events: configuration changes, software rollouts, certificate rotations, DNS updates, and policy tweaks that each look safe in isolation. He introduces the concept of assumed redundancy: when a primary system and its failover share the same identity provider, deployment pipeline, and DNS infrastructure, they carry the same failure assumptions and fail together.
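The idea of assumed redundancy can be made concrete with a small check. The following is a minimal sketch, not anything shown in the video: it assumes each system's dependencies can be listed as plain names, and every name in it (`shared_dependencies`, `idp-main`, `deploy-pipeline-a`, and so on) is hypothetical. Anything that appears in both the primary's and the failover's dependency set is a single point of failure for the pair.

```python
def shared_dependencies(primary: set[str], failover: set[str]) -> set[str]:
    """Return the dependencies that both systems rely on.

    Each shared entry is a common failure domain: if it breaks,
    the primary and its failover can go down together.
    """
    return primary & failover


# Hypothetical inventory; the names are illustrative only.
primary = {"idp-main", "deploy-pipeline-a", "dns-zone-1", "region-east"}
failover = {"idp-main", "deploy-pipeline-a", "dns-zone-1", "region-west"}

common = shared_dependencies(primary, failover)
for name in sorted(common):
    print(f"shared failure domain: {name}")
```

In this made-up inventory the two systems differ only by region, so the hardware is redundant while the identity provider, deployment pipeline, and DNS zone are not.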

The video also puts SLA numbers in concrete terms: 99.9% uptime still permits roughly 43 minutes of downtime per month. It reframes SLAs as refund policies rather than safety guarantees. Multi-cloud strategies are addressed head-on: true independence across providers requires different tooling, APIs, runbooks, and operational expertise, negating much of what motivated cloud adoption in the first place. The overall message is a call for engineers and infrastructure decision-makers to think carefully about failure domains, the hidden costs of centralized dependencies, and what the word ‘safe’ actually means when evaluating cloud architecture.
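The downtime figure follows from simple arithmetic. The sketch below assumes a 30-day month (the summary does not state which month length is used), and the function name `allowed_downtime_minutes` is chosen here for illustration.

```python
def allowed_downtime_minutes(availability_pct: float, days: float = 30.0) -> float:
    """Minutes of downtime permitted in a period at a given availability.

    Assumes a 30-day month by default; adjust `days` for other periods.
    """
    total_minutes = days * 24 * 60
    return total_minutes * (1 - availability_pct / 100)


for pct in (99.0, 99.9, 99.99):
    minutes = allowed_downtime_minutes(pct)
    print(f"{pct}% uptime allows about {minutes:.1f} minutes of downtime per month")
```

At 99.9% this gives about 43.2 minutes for a 30-day month, which matches the figure quoted above.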

📺 Source: Dave’s Garage · Published December 18, 2025
🏷️ Format: Deep Dive
