Build Agents That Run for Hours (Without Losing the Plot) — Ash Prabaker & Andrew Wilson, Anthropic

Ash Prabaker and Andrew Wilson, engineers on Anthropic’s applied AI team, delivered a technical deep dive at the AI Engineer conference on what it takes to build agents that run reliably for five, six, or more consecutive hours. The talk opens with a benchmark that anchors the conversation: with a minimal scaffold, Claude Opus 3.7 could complete 50% of tasks in roughly one hour; Opus 4.6, one year later, reaches the same threshold at twelve hours. That twelve-fold improvement is driven by both model improvements and harness engineering.

The talk organizes the core problems into three categories: context degradation (including “context rot” as sessions deepen and “context anxiety” as models near window limits), weak out-of-the-box long-horizon planning, and models’ poor ability to judge their own output. The proposed solution is a generator/evaluator architecture in which a dedicated evaluator agent launches and uses the application being built. This catches real bugs such as FastAPI route ordering issues, Boolean logic errors, and hollow feature implementations that pass unit tests but fail in real interaction.
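
The generator/evaluator split described above can be pictured as a simple loop. The following is a minimal sketch under stated assumptions, not the harness shown in the talk: `Finding`, `run_loop`, and the two callable signatures are hypothetical names introduced here for illustration. In a real system, `generate` would wrap a model call that edits the code, and `evaluate` would wrap an agent that launches the application and exercises it.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Finding:
    """One concrete problem the evaluator observed while using the app."""
    description: str


# Hypothetical signatures (assumptions, not an Anthropic API):
#   generate(task, previous_build, open_findings) -> new build
#   evaluate(build) -> findings observed by actually running the build
Generator = Callable[[str, str, List[Finding]], str]
Evaluator = Callable[[str], List[Finding]]


def run_loop(
    task: str,
    generate: Generator,
    evaluate: Evaluator,
    max_rounds: int = 5,
) -> Tuple[str, int, List[Finding]]:
    """Alternate generation and evaluation until no findings remain.

    Returns the last build, the number of rounds used, and any findings
    that are still open when the round budget runs out.
    """
    build = ""
    findings: List[Finding] = []
    for round_no in range(1, max_rounds + 1):
        # The generator sees every finding from the previous round,
        # so nothing can be silently deferred.
        build = generate(task, build, findings)
        findings = evaluate(build)
        if not findings:
            return build, round_no, findings
    return build, max_rounds, findings
```

The design choice this illustrates is that the evaluator's findings, rather than the generator's own judgment, decide when the loop stops. The round budget keeps a run from looping forever when a finding is never resolved.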

A central finding is that out-of-the-box Claude makes a poor QA agent due to sycophancy bias, routinely logging bugs as “fix later” and moving on. Anthropic’s solution involves negotiating 27 granular contract criteria between the generator and evaluator at the start of a run, with enough specificity that critique becomes actionable rather than vague. The talk includes a live demo (a multi-hour game build) in which the evaluator’s intervention produced dramatically better output than a solo unscaffolded run, illustrating that harness design, not model capability alone, determines production-grade agent reliability.
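
One way to picture a contract of granular criteria is as a list of records that each pair a requirement with the concrete check the evaluator will perform. The sketch below is an illustration only: the description does not give the contract's format, so `Criterion`, `Result`, `Verdict`, `is_actionable`, and `grade` are all hypothetical names. It assumes two rules: a criterion is only accepted if it says how it will be verified, and a verdict is either pass or fail, leaving no "fix later" state.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List


class Verdict(Enum):
    # Deliberately no DEFERRED value: a bug cannot be logged as "fix later".
    PASS = "pass"
    FAIL = "fail"


@dataclass(frozen=True)
class Criterion:
    cid: str
    statement: str  # what must be true of the finished application
    check: str      # how the evaluator verifies it by using the application


@dataclass(frozen=True)
class Result:
    cid: str
    verdict: Verdict
    evidence: str   # what the evaluator actually observed


def is_actionable(criterion: Criterion) -> bool:
    """A criterion is accepted only if it states both a requirement and a check."""
    return bool(criterion.statement.strip()) and bool(criterion.check.strip())


def grade(contract: List[Criterion], results: List[Result]) -> List[str]:
    """Return the IDs of unmet criteria.

    A criterion counts as unmet if it has no result, a failing result,
    or a passing result with no recorded evidence.
    """
    by_id = {r.cid: r for r in results}
    unmet = []
    for criterion in contract:
        result = by_id.get(criterion.cid)
        if (
            result is None
            or result.verdict is not Verdict.PASS
            or not result.evidence.strip()
        ):
            unmet.append(criterion.cid)
    return unmet
```

Under these assumptions, a vague criterion such as "works well" with no check would be rejected during negotiation, and a run is finished only when `grade` returns an empty list.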

📺 Source: AI Engineer · Published May 18, 2026
🏷️ Format: Deep Dive

Tags

Anthropic Anthropic Agent SDK Boris Claude Code Claude Opus 4.5 Claude Opus 4.6 Playwright SWE-bench
