Build Agents That Run for Hours (Without Losing the Plot) — Ash Prabaker & Andrew Wilson, Anthropic

Descriptions:

Ash Prabaker and Andrew Wilson, engineers on Anthropic’s applied AI team, delivered a technical deep dive at the AI Engineer conference on what it takes to build agents that run reliably for five, six, or more consecutive hours. The talk opens with a benchmark that anchors the conversation: with a minimal scaffold, Claude Opus 3.7 could complete tasks of roughly one hour at a 50% success rate; Opus 4.6, one year later, reaches the same success rate on tasks of roughly twelve hours. That is a twelve-fold increase in task horizon, driven by both model improvements and harness engineering.

The core problems are organized into three categories: context degradation (including “context rot” as sessions deepen and “context anxiety” as models near window limits), weak out-of-the-box long-horizon planning, and models’ poor ability to judge their own output. The proposed solution is a generator/evaluator architecture in which a dedicated evaluator agent actually launches and uses the application being built. This catches real bugs such as FastAPI route ordering issues, Boolean logic errors, and hollow feature implementations that pass unit tests but fail in real interaction.
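As a rough illustration of the generator/evaluator split described above, the control loop might look like the following minimal sketch. This is not the harness shown in the talk: the names `run_generator_evaluator`, `Finding`, and `RunResult` are hypothetical, and the two stub functions stand in for real model calls and for an evaluator that actually launches the application.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Finding:
    """One concrete problem the evaluator observed while using the app."""
    description: str


@dataclass
class RunResult:
    artifact: str
    rounds: int
    open_findings: List[Finding] = field(default_factory=list)


def run_generator_evaluator(
    generate: Callable[[str, List[Finding]], str],
    evaluate: Callable[[str], List[Finding]],
    task: str,
    max_rounds: int = 5,
) -> RunResult:
    """Alternate generation and evaluation until no findings remain.

    Hypothetical sketch: `generate` receives the task plus the previous
    round's findings; `evaluate` exercises the artifact and reports problems.
    """
    findings: List[Finding] = []
    artifact = ""
    for round_number in range(1, max_rounds + 1):
        artifact = generate(task, findings)
        findings = evaluate(artifact)
        if not findings:
            return RunResult(artifact, round_number)
    # Round budget exhausted: surface what is still broken instead of hiding it.
    return RunResult(artifact, max_rounds, findings)


# Stubs (assumptions for illustration only, not real agents).
def stub_generate(task: str, findings: List[Finding]) -> str:
    return task + "".join(f" [fixed: {f.description}]" for f in findings)


def stub_evaluate(artifact: str) -> List[Finding]:
    if "[fixed: route order]" in artifact:
        return []
    return [Finding("route order")]


result = run_generator_evaluator(stub_generate, stub_evaluate, "build app")
print(result.rounds, result.artifact)
```

The design point is that the evaluator's findings feed the next generation round, and that unresolved findings are returned to the caller rather than dropped when the round budget runs out.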

A central finding is that out-of-the-box Claude makes a poor QA agent due to sycophancy bias, routinely logging bugs as “fix later” and moving on. Anthropic’s solution involves negotiating 27 granular contract criteria between the generator and evaluator at the start of a run, with enough specificity that critique becomes actionable rather than vague. The talk includes a live demo — a multi-hour game build — where the evaluator’s intervention produced dramatically better output than a solo unscaffolded run, illustrating how harness design rather than model capability alone determines production-grade agent reliability.
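One way to picture such a contract is as a list of specific, checkable criteria that the evaluator must rule on one by one. The sketch below is an assumption about shape only: the talk's actual criteria and data format are not given in this description, and `Criterion`, `Verdict`, `unmet_criteria`, and the three sample criteria are all hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Criterion:
    id: str
    statement: str  # a specific, checkable behavior agreed before the run


@dataclass
class Verdict:
    criterion_id: str
    passed: bool
    evidence: str  # what the evaluator actually observed


def unmet_criteria(contract: List[Criterion], verdicts: List[Verdict]) -> List[str]:
    """Return one actionable line per criterion that failed or was skipped.

    A criterion with no verdict counts as unmet, so nothing can be
    silently deferred as "fix later".
    """
    by_id = {v.criterion_id: v for v in verdicts}
    report = []
    for criterion in contract:
        verdict = by_id.get(criterion.id)
        if verdict is None:
            report.append(f"{criterion.id}: not evaluated ({criterion.statement})")
        elif not verdict.passed:
            report.append(
                f"{criterion.id}: FAILED ({criterion.statement}); "
                f"evidence: {verdict.evidence}"
            )
    return report


# Hypothetical sample contract (illustrative, not from the talk).
contract = [
    Criterion("C1", "Player can start a new game from the title screen"),
    Criterion("C2", "Score persists after a page reload"),
    Criterion("C3", "Pause menu resumes without resetting the level"),
]
verdicts = [
    Verdict("C1", True, "clicked Start, level 1 loaded"),
    Verdict("C2", False, "score reset to 0 after reload"),
]
for line in unmet_criteria(contract, verdicts):
    print(line)
```

Tying each critique to a named criterion and to observed evidence is what makes the feedback actionable for the generator in this sketch.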


📺 Source: AI Engineer · Published May 18, 2026
🏷️ Format: Deep Dive
