the SCARIEST chart in AI


Description:

Wes Roth breaks down what he calls “the scariest chart in AI development history” — a METR benchmark tracking AI agents’ ability to complete tasks whose difficulty is measured in human expert hours. METR (Model Evaluation and Threat Research), a nonprofit that evaluates frontier AI risks, assembled hundreds of tasks across engineering, coding, machine learning, and cybersecurity, using human completion time as the y-axis rather than AI runtime. The result is a capability measure grounded in real labor displacement.
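METR's headline number is a "50% time horizon": the human task length at which an agent succeeds about half the time. The sketch below illustrates the idea with a toy estimator; the interpolation scheme and the sample data are illustrative only (METR fits a logistic curve to real task results):

```python
import math
from collections import defaultdict

def time_horizon_50(tasks):
    """Toy estimate of a 50%-success time horizon.

    `tasks` is a list of (human_hours, succeeded) pairs. We compute the
    success rate at each distinct task length and interpolate, in
    log-time, where that rate crosses 50%. This is an illustrative
    stand-in, not METR's actual methodology.
    """
    by_len = defaultdict(list)
    for hours, ok in tasks:
        by_len[hours].append(ok)
    points = sorted((h, sum(v) / len(v)) for h, v in by_len.items())
    for (h0, p0), (h1, p1) in zip(points, points[1:]):
        if p0 >= 0.5 > p1:  # success rate falls through 50% on this interval
            frac = (p0 - 0.5) / (p0 - p1)
            return math.exp(math.log(h0) + frac * (math.log(h1) - math.log(h0)))
    return None

# Hypothetical results: the agent reliably finishes short tasks, fails long ones.
results = [(1, 1), (2, 1), (4, 1), (8, 0), (16, 0)]
print(time_horizon_50(results))  # horizon lands between the 4 h and 8 h tasks
```

Interpolating in log-time rather than raw hours matches how these charts are plotted: task lengths span orders of magnitude, so the y-axis is logarithmic.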

The data shows a sharp acceleration: Claude Opus 4.5 reached a time horizon of roughly 5 hours at a 50% success rate, and the recently released Claude Opus 4.6 jumped to 14.5 hours, nearly two full workdays of expert output per session. The doubling pace of the time horizon has also compressed, from every 7 months historically to roughly every 4 months from 2023 onward, suggesting the trend line itself is bending upward.
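The doubling arithmetic is easy to check. A minimal sketch, assuming a fixed exponential trend (the ~5-hour and 14.5-hour horizons and the 4-month doubling time are the figures cited above; the projection is illustrative, not a forecast):

```python
import math

def months_to_reach(h0, h1, doubling_months):
    """Months needed to grow from horizon h0 to h1 at a fixed doubling time."""
    return doubling_months * math.log2(h1 / h0)

def horizon_after(h0, months, doubling_months):
    """Task horizon (hours) after `months`, assuming exponential growth."""
    return h0 * 2 ** (months / doubling_months)

# The 5 h -> 14.5 h jump implies about six months of progress at a
# 4-month doubling pace.
print(round(months_to_reach(5.0, 14.5, 4.0), 1))

# A year at the same pace is three more doublings: 14.5 * 8 = 116 hours.
print(horizon_after(14.5, 12, 4.0))
```

At a 4-month doubling time, the 5-to-14.5-hour jump corresponds to roughly six months of trend progress, which is why the compressed doubling pace, not any single release, is the chart's real story.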

Roth pairs the benchmark analysis with personal experience: Opus 4.6 completed a complex multi-year accounting reconciliation — work he estimates would cost hundreds of dollars with a CPA — in about 30-40 minutes while he played a video game. He also built and deployed a fully functional AI news aggregator (natural20.com) overnight without supervision, a project he estimates would take a human specialist one to two days. His key observation: many of these 14-hour-equivalent tasks aren’t one-off jobs. When an AI agent completes a complex workflow, it frequently automates that workflow permanently, multiplying real-world leverage well beyond what any benchmark captures.


📺 Source: Wes Roth · Published February 24, 2026
🏷️ Format: Deep Dive
