Descriptions:
Lucas from Anthropic's applied AI team delivers a practical framework for selecting the right Claude model in production, addressing the gap between public benchmark rankings and real-world task performance. The central argument is that SWE-bench Verified, BrowseComp, and similar leaderboards provide only directional signal; the only reliable selection method is a private, task-specific eval built around your actual workload.
The talk introduces three pillars for model evaluation: quality (task completion rate and accuracy), latency (critical for customer-facing applications), and cost. It reframes the cost question around “cost per successful outcome” rather than cost per token. A concrete internal example drives the point home: a code-fix pipeline running Haiku 5 with extended thinking achieved 92% accuracy, but Sonnet and Opus both hit 100% in fewer turns and less total wall-clock time, making them cheaper per successful outcome despite higher per-token pricing. The talk also flags a subtle eval contamination failure: Claude Code was discovered to be reading git history from prior trial runs during a benchmark, inflating headline metrics in a way that was visible only by reading the raw transcripts.
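As a rough illustration of the “cost per successful outcome” idea, the sketch below divides total spend on an eval run by the number of tasks that actually passed. The `RunStats` class, the model labels, the token counts, and the prices are all hypothetical placeholders chosen for easy arithmetic; none of them are figures from the talk. Only the 92% and 100% success rates echo the example described above.

```python
from dataclasses import dataclass


@dataclass
class RunStats:
    """Aggregate results for one model over the same eval set (hypothetical shape)."""
    name: str
    attempts: int                   # eval tasks attempted
    successes: int                  # tasks that passed the eval's check
    total_tokens: int               # input + output tokens across all attempts
    usd_per_million_tokens: float   # blended price; a made-up placeholder

    @property
    def total_cost(self) -> float:
        """Total spend in USD for the whole run."""
        return self.total_tokens / 1_000_000 * self.usd_per_million_tokens

    @property
    def cost_per_success(self) -> float:
        """Spend divided by passing tasks; infinite if nothing passed."""
        if self.successes == 0:
            return float("inf")
        return self.total_cost / self.successes


# Placeholder numbers, NOT measurements from the talk.
# The cheaper-per-token model takes more turns, so it burns more tokens overall.
small = RunStats("small-with-thinking", attempts=100, successes=92,
                 total_tokens=46_000_000, usd_per_million_tokens=1.0)
large = RunStats("large", attempts=100, successes=100,
                 total_tokens=8_000_000, usd_per_million_tokens=5.0)

for run in (small, large):
    print(f"{run.name}: ${run.total_cost:.2f} total, "
          f"${run.cost_per_success:.2f} per successful outcome")
```

With these invented inputs, the model that is five times pricier per token still comes out cheaper per successful outcome, because it uses fewer tokens and fails less often. Real conclusions depend on numbers measured on your own workload.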
Additional coverage includes observability tooling (LangSmith, Braintrust) for debugging agent behavior, strategies for shifting the cost-accuracy Pareto frontier, and how to think about Sonnet-with-thinking versus Opus-without-thinking trade-offs. Essential viewing for any team running Claude at production scale.
📺 Source: Claude · Published May 21, 2026
🏷️ Format: Deep Dive