Description:
Joel Becker from METR (Model Evaluation and Threat Research) presents the organization’s framework for measuring AI agent task horizons—the maximum duration and complexity of tasks that current AI systems can complete autonomously and reliably. The talk opens with a provocative quantitative argument: task horizons have been growing in a log-linear relationship with compute investment, and if compute scaling were to slow significantly after 2030 (due to power constraints, capital limits, or physical bottlenecks), the implied delay to key AI capability milestones could stretch by years—a materially different picture than most forecasts assume.
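To make the extrapolation concrete, here is a minimal sketch (not from the talk) of how a compute slowdown propagates through a log-linear horizon model. The growth rates, the horizon-doublings-per-compute-doubling elasticity, and the milestone size are all illustrative assumptions:

```python
# Illustrative sketch of the slowdown argument, under assumed numbers:
# log(task_horizon) grows linearly with log(compute), and compute grows
# exponentially in time until a slowdown year, then more slowly.

DOUBLINGS_PER_YEAR_FAST = 2.0        # assumed pre-slowdown compute growth (doublings/yr)
DOUBLINGS_PER_YEAR_SLOW = 0.5        # assumed post-slowdown growth under constraints
HORIZON_PER_COMPUTE_DOUBLING = 0.35  # assumed horizon doublings per compute doubling

def year_of_milestone(horizon_doublings_needed: float, slowdown_year: float,
                      start_year: float = 2025.0) -> float:
    """Return the calendar year when the required horizon is reached."""
    # Horizon doublings accumulated per year in each compute-growth regime.
    fast_rate = DOUBLINGS_PER_YEAR_FAST * HORIZON_PER_COMPUTE_DOUBLING
    slow_rate = DOUBLINGS_PER_YEAR_SLOW * HORIZON_PER_COMPUTE_DOUBLING
    gained_before_slowdown = fast_rate * (slowdown_year - start_year)
    if horizon_doublings_needed <= gained_before_slowdown:
        return start_year + horizon_doublings_needed / fast_rate
    remaining = horizon_doublings_needed - gained_before_slowdown
    return slowdown_year + remaining / slow_rate

# e.g. a milestone needing 8 more horizon doublings:
no_slowdown = year_of_milestone(8, slowdown_year=2100)  # growth never slows in range
with_slowdown = year_of_milestone(8, slowdown_year=2030)
print(f"implied delay from the slowdown: {with_slowdown - no_slowdown:.1f} years")
```

With these placeholder numbers the milestone slips by well over a decade, which is the shape of the argument: a modest change in the post-2030 compute growth rate compounds into a large shift in the milestone date.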
Becker discusses the core methodological challenge METR faces: as task horizons expand, constructing evaluation benchmarks that are long enough to be meaningful but short enough to administer becomes increasingly intractable. The team is developing approaches to this problem but characterizes them as early-stage. A significant portion of the conversation questions whether time horizon remains the right primary metric: as reliability and compute efficiency become more important than raw duration, evaluations may need to shift toward measuring equivalent outcomes at lower cost or latency.
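As a rough illustration of how a time-horizon metric can be estimated, the sketch below fits a logistic curve of success against log task length and reads off the 50%-success duration. The benchmark data are invented, and METR's actual estimator may differ in its details:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical benchmark results: (task length in minutes, did the agent succeed?)
lengths_min = np.array([1, 2, 4, 8, 15, 30, 60, 120, 240, 480], dtype=float)
succeeded = np.array([1, 1, 1, 1, 1, 1, 0, 1, 0, 0])

# Fit success probability as a logistic function of log task length.
X = np.log(lengths_min).reshape(-1, 1)
model = LogisticRegression().fit(X, succeeded)

# The 50%-success horizon is where the logit crosses zero:
# intercept + coef * log(h50) = 0  =>  h50 = exp(-intercept / coef)
h50 = np.exp(-model.intercept_[0] / model.coef_[0][0])
print(f"estimated 50%-success horizon: {h50:.0f} minutes")
```

The administration problem falls out of this picture directly: estimating the curve's upper tail requires tasks whose ground-truth completion takes days or weeks of expert effort, which is exactly what becomes expensive to build and grade.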
The talk also covers applied productivity measurement, including AI assistance for experienced open-source developers and ML engineers working on data pipelines. A detailed exchange illustrates current agent limitations on complex multi-system queries, such as reconstructing deployment timelines across repositories where incomplete telemetry forces a fallback to GitHub API calls, even when the same agents succeed at simpler SQL or Pandas tasks. The discussion is directly relevant to researchers, safety practitioners, and anyone designing evaluation frameworks for frontier AI systems.
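For a sense of what such a multi-system task involves, here is a hypothetical sketch of the GitHub API portion; the repository list, token handling, and helper names are assumptions, not code from the talk:

```python
import requests

def fetch_deployments(owner: str, repo: str, token: str) -> list[dict]:
    """List deployments for one repository via the GitHub REST API."""
    resp = requests.get(
        f"https://api.github.com/repos/{owner}/{repo}/deployments",
        headers={
            "Authorization": f"Bearer {token}",
            "Accept": "application/vnd.github+json",
        },
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()

def deployment_timeline(repos: list[tuple[str, str]], token: str) -> list[tuple[str, str]]:
    """Merge (timestamp, repo) deployment events across repositories, sorted by time."""
    events = []
    for owner, repo in repos:
        for dep in fetch_deployments(owner, repo, token):
            events.append((dep["created_at"], f"{owner}/{repo}"))
    return sorted(events)
```

The hard part for agents is not any single call like this, but stitching its output together with partial internal telemetry across several systems, which is where the exchange in the talk locates the current failure mode.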
📺 Source: AI Engineer · Published January 19, 2026
🏷️ Format: Deep Dive