Description:
Joel Becker from METR (Model Evaluation and Threat Research) presents the organization’s framework for measuring AI agent task horizons—the maximum duration and complexity of tasks that current AI systems can complete autonomously and reliably. The talk opens with a provocative quantitative argument: task horizons have been growing in a log-linear relationship with compute investment, and if compute scaling were to slow significantly after 2030 (due to power constraints, capital limits, or physical bottlenecks), the implied delay to key AI capability milestones could stretch by years—a materially different picture than most forecasts assume.
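To make the extrapolation concrete, here is a minimal sketch (not from the talk) of how a compute slowdown propagates through a log-linear horizon model. The growth rates, the horizon-doublings-per-compute-doubling elasticity, and the milestone size are all illustrative assumptions:

```python
# Illustrative sketch of the slowdown argument, under assumed numbers:
# log(task_horizon) grows linearly with log(compute), and compute grows
# exponentially in time until a slowdown year, then more slowly.

DOUBLINGS_PER_YEAR_FAST = 2.0        # assumed pre-slowdown compute growth (doublings/yr)
DOUBLINGS_PER_YEAR_SLOW = 0.5        # assumed post-slowdown growth under constraints
HORIZON_PER_COMPUTE_DOUBLING = 0.35  # assumed horizon doublings per compute doubling

def year_of_milestone(horizon_doublings_needed: float, slowdown_year: float,
                      start_year: float = 2025.0) -> float:
    """Return the calendar year when the required horizon is reached."""
    # Horizon doublings accumulated per year in each compute-growth regime.
    fast_rate = DOUBLINGS_PER_YEAR_FAST * HORIZON_PER_COMPUTE_DOUBLING
    slow_rate = DOUBLINGS_PER_YEAR_SLOW * HORIZON_PER_COMPUTE_DOUBLING
    gained_before_slowdown = fast_rate * (slowdown_year - start_year)
    if horizon_doublings_needed <= gained_before_slowdown:
        return start_year + horizon_doublings_needed / fast_rate
    remaining = horizon_doublings_needed - gained_before_slowdown
    return slowdown_year + remaining / slow_rate

# e.g. a milestone needing 8 more horizon doublings:
no_slowdown = year_of_milestone(8, slowdown_year=2100)  # growth never slows in range
with_slowdown = year_of_milestone(8, slowdown_year=2030)
print(f"implied delay from the slowdown: {with_slowdown - no_slowdown:.1f} years")
```

With these placeholder numbers the milestone slips by well over a decade, which is the shape of the argument: a modest change in the post-2030 compute growth rate compounds into a large shift in the milestone date.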
Becker discusses the core methodological challenge METR faces: as task horizons expand, constructing evaluation benchmarks that are long enough to be meaningful but short enough to administer becomes increasingly intractable. The team is developing approaches to this problem but characterizes them as early-stage. A significant portion of the conversation questions whether time horizon remains the right primary metric: as reliability and compute efficiency become more important than raw duration, evaluations may need to shift toward measuring equivalent outcomes at lower cost or latency.
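As a rough illustration of how a time-horizon metric can be estimated, the sketch below fits a logistic curve of success against log task length and reads off the 50%-success duration. The benchmark data are invented, and METR's actual estimator may differ in its details:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical benchmark results: (task length in minutes, did the agent succeed?)
lengths_min = np.array([1, 2, 4, 8, 15, 30, 60, 120, 240, 480], dtype=float)
succeeded = np.array([1, 1, 1, 1, 1, 1, 0, 1, 0, 0])

# Fit success probability as a logistic function of log task length.
X = np.log(lengths_min).reshape(-1, 1)
model = LogisticRegression().fit(X, succeeded)

# The 50%-success horizon is where the logit crosses zero:
# intercept + coef * log(h50) = 0  =>  h50 = exp(-intercept / coef)
h50 = np.exp(-model.intercept_[0] / model.coef_[0][0])
print(f"estimated 50%-success horizon: {h50:.0f} minutes")
```

The administration problem falls out of this picture directly: estimating the curve's upper tail requires tasks whose ground-truth completion takes days or weeks of expert effort, which is exactly what becomes expensive to build and grade.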
The talk also covers applied productivity measurement, including AI assistance for experienced open-source developers and ML engineers working on data pipelines. A detailed exchange illustrates current agent limitations on complex multi-system queries, such as reconstructing deployment timelines across repositories where incomplete telemetry forces a fallback to GitHub API calls, even when the same agents succeed at simpler SQL or Pandas tasks. The discussion is directly relevant to researchers, safety practitioners, and anyone designing evaluation frameworks for frontier AI systems.
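For a sense of what such a multi-system task involves, here is a hypothetical sketch of the GitHub API portion; the repository list, token handling, and helper names are assumptions, not code from the talk:

```python
import requests

def fetch_deployments(owner: str, repo: str, token: str) -> list[dict]:
    """List deployments for one repository via the GitHub REST API."""
    resp = requests.get(
        f"https://api.github.com/repos/{owner}/{repo}/deployments",
        headers={
            "Authorization": f"Bearer {token}",
            "Accept": "application/vnd.github+json",
        },
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()

def deployment_timeline(repos: list[tuple[str, str]], token: str) -> list[tuple[str, str]]:
    """Merge (timestamp, repo) deployment events across repositories, sorted by time."""
    events = []
    for owner, repo in repos:
        for dep in fetch_deployments(owner, repo, token):
            events.append((dep["created_at"], f"{owner}/{repo}"))
    return sorted(events)
```

The hard part for agents is not any single call like this, but stitching its output together with partial internal telemetry across several systems, which is where the exchange in the talk locates the current failure mode.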
📺 Source: AI Engineer · Published January 19, 2026
🏷️ Format: Deep Dive