Description:
The AI Daily Brief delivers an in-depth analysis of the latest results from METR (Model Evaluation and Threat Research), whose benchmark tracking AI agent capability has been called one of the most important charts in the global economy. The study measures the complexity of software engineering tasks AI agents can reliably complete, using human engineer time as a proxy for difficulty rather than the AI's actual elapsed time: a task that takes a human coder two hours counts as a two-hour task regardless of how quickly an AI solves it. The headline metric, the 50% time horizon, is the task length at which an agent succeeds half the time.
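METR's write-ups describe fitting a logistic curve relating success probability to task length and reading the 50% horizon off that fit. The sketch below shows the general idea; the task data is entirely synthetic and the scikit-learn setup is an illustrative assumption, not METR's actual pipeline.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic task results: (human completion time in minutes, agent success).
# Every number here is invented for illustration; this is not METR data.
human_minutes = np.array([2, 4, 8, 15, 30, 60, 120, 240, 480, 960, 1920])
succeeded     = np.array([1, 1, 1,  1,  1,  1,   0,   1,   0,   0,    0])

# Fit success probability as a logistic function of log2(task length),
# the general shape METR describes for its time-horizon curves.
X = np.log2(human_minutes).reshape(-1, 1)
model = LogisticRegression().fit(X, succeeded)

# The 50% time horizon is where the fitted curve crosses p = 0.5, i.e.
# where the logit (intercept + slope * log2_minutes) equals zero.
intercept = model.intercept_[0]
slope = model.coef_[0, 0]
horizon_minutes = 2.0 ** (-intercept / slope)
print(f"50% time horizon ~= {horizon_minutes:.0f} human-minutes")
```

Longer tasks with occasional successes and shorter tasks with occasional failures both pull on the fit, which is part of why a small shift in the task mix can move the headline number substantially.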
The newly released results show Claude Opus 4.6 achieving a benchmark time horizon of approximately 14.5 hours, more than tripling Opus 4.5's 4 hours and 49 minutes and marking the largest single-generation jump in METR's history. GPT-5.3 Codex reached 6.5 hours. The implied doubling time for agent capability has shortened to roughly 1.5 months, down from the 7-month doubling time observed when the chart launched in early 2025. However, METR itself has heavily caveated these figures: Opus 4.6 has essentially saturated the existing task set, with a confidence interval spanning 8 to 98 hours. Researcher David Re warns that the measurement is "extremely noisy" and that a small shift in task distribution could have produced a reading anywhere from 8 to 20 hours.
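The doubling-time claim is simple back-of-envelope arithmetic from two horizon readings, as the sketch below shows. The 14.5-hour and 4h49m figures come from the episode; the 2.4-month gap between the two measurements is an assumed placeholder, not a sourced number.

```python
import math

# Implied doubling time between two time-horizon measurements.
old_hours = 4 + 49 / 60   # Opus 4.5 time horizon (4h49m), from the episode
new_hours = 14.5          # Opus 4.6 time horizon (~14.5h), from the episode
months_between = 2.4      # ASSUMPTION: gap between the two measurements

doublings = math.log2(new_hours / old_hours)          # ~1.59 doublings
doubling_time_months = months_between / doublings     # ~1.5 months
print(f"{doublings:.2f} doublings; doubling time ~= {doubling_time_months:.1f} months")
```

Note that the result inherits all the noise in the underlying readings: with a confidence interval of 8 to 98 hours on the new measurement, the implied doubling time is far less precise than a single figure suggests.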
The episode carefully balances the bull and bear cases, drawing on reactions from investors like Nick Carter and from researchers, as well as a recent Stanford talk by Bernie Sanders that referenced the chart, while noting that METR is updating its methodology to address benchmark saturation.
📺 Source: The AI Daily Brief: Artificial Intelligence News · Published February 24, 2026
🏷️ Format: Deep Dive