Picking the right model

Lucas from Anthropic's applied AI team presents a practical framework for selecting the right Claude model in production, addressing the gap between public benchmark rankings and real-world task performance. The central argument is that SWE-bench Verified, BrowseComp, and similar leaderboards provide only directional signal; the only reliable selection method is a private, task-specific eval built around your actual workload.
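As a rough illustration of what a private, task-specific eval can look like, here is a minimal sketch. It is not the harness from the talk: `Case`, `Report`, `run_eval`, and the stand-in `model_fn` are hypothetical names, and in practice `model_fn` would wrap a call to whichever model is being compared.

```python
import time
from typing import Callable, NamedTuple


class Case(NamedTuple):
    prompt: str                   # an input drawn from your real workload
    check: Callable[[str], bool]  # task-specific grader for the output


class Report(NamedTuple):
    pass_rate: float       # fraction of cases whose grader returned True
    mean_latency_s: float  # average wall-clock seconds per case


def run_eval(model_fn: Callable[[str], str], cases: list[Case]) -> Report:
    """Run every case through model_fn and record quality and latency."""
    if not cases:
        raise ValueError("eval needs at least one case")
    passed = 0
    total_latency = 0.0
    for case in cases:
        start = time.perf_counter()
        output = model_fn(case.prompt)
        total_latency += time.perf_counter() - start
        if case.check(output):
            passed += 1
    return Report(passed / len(cases), total_latency / len(cases))


# Hypothetical stand-in for a real model call (assumption: it just uppercases).
def fake_model(prompt: str) -> str:
    return prompt.upper()


demo_cases = [
    Case("abc", lambda out: out == "ABC"),
    Case("x", lambda out: out == "not-x"),
]
demo_report = run_eval(fake_model, demo_cases)
```

Running the same `cases` against each candidate model gives directly comparable pass rates and latencies on your own workload, rather than on a public leaderboard.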

The talk introduces three pillars for model evaluation: quality (task completion rate and accuracy), latency (critical for customer-facing applications), and cost. It reframes the cost question around “cost per successful outcome” rather than cost per token. A concrete internal example drives the point home: a codefix pipeline running Haiku 5 with extended thinking achieved 92% accuracy, but Sonnet and Opus both hit 100% in fewer turns and less total wall-clock time, making them cheaper per successful outcome despite higher per-token pricing. The talk also flags a subtle eval contamination failure: during a benchmark, Claude Code was found to be reading git history from prior trial runs, which inflated headline metrics in a way that was visible only by reading the raw transcripts.
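The "cost per successful outcome" idea can be sketched as total spend divided by the number of tasks that actually succeeded. The accuracy values below mirror the 92% and 100% mentioned above, but the dollar figures are placeholders chosen for illustration, not numbers from the talk, and `EvalRun` and `cost_per_success` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class EvalRun:
    """Aggregate results for one model on a private eval."""
    name: str
    tasks: int             # tasks attempted
    successes: int         # tasks that passed the grader
    total_cost_usd: float  # total spend, including extra turns and retries


def cost_per_success(run: EvalRun) -> float:
    """Total spend divided by successful outcomes (infinite if none succeeded)."""
    if run.successes == 0:
        return float("inf")
    return run.total_cost_usd / run.successes


# Placeholder spend figures (assumption): the smaller model is cheaper per
# token but needs more turns, so its total spend ends up higher.
small = EvalRun("smaller-model", tasks=100, successes=92, total_cost_usd=46.0)
large = EvalRun("larger-model", tasks=100, successes=100, total_cost_usd=40.0)

cheaper = min([small, large], key=cost_per_success)
```

Under these made-up numbers the larger model wins on this metric even with higher per-token pricing, because it spends less in total and every task succeeds.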

Additional coverage includes observability tooling (LangSmith, Braintrust) for debugging agent behavior, strategies for shifting the cost-accuracy Pareto frontier, and how to think about Sonnet-with-thinking versus Opus-without-thinking trade-offs. Essential viewing for any team running Claude at production scale.
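For readers unfamiliar with the cost-accuracy Pareto frontier, the sketch below shows one way to compute it from eval results: a configuration is on the frontier if no other configuration is both cheaper and at least as accurate. The configuration names and numbers are hypothetical and do not come from the talk.

```python
from typing import NamedTuple


class Config(NamedTuple):
    name: str
    cost: float      # e.g. cost per successful outcome; lower is better
    accuracy: float  # fraction of tasks passed; higher is better


def pareto_frontier(configs: list[Config]) -> list[Config]:
    """Return the configs not dominated on (cost, accuracy), cheapest first."""
    frontier = []
    best_accuracy = float("-inf")
    # Walk from cheapest to most expensive; at equal cost, try the most
    # accurate first. A config survives only if it beats every cheaper one.
    for c in sorted(configs, key=lambda c: (c.cost, -c.accuracy)):
        if c.accuracy > best_accuracy:
            frontier.append(c)
            best_accuracy = c.accuracy
    return frontier


# Hypothetical eval results (assumption), for illustration only.
candidates = [
    Config("config-a", cost=1.0, accuracy=0.80),
    Config("config-b", cost=2.0, accuracy=0.90),
    Config("config-c", cost=2.5, accuracy=0.85),  # dominated by config-b
    Config("config-d", cost=3.0, accuracy=1.00),
]
frontier_names = [c.name for c in pareto_frontier(candidates)]
```

Shifting the frontier then means finding a new configuration (for example a different prompt or thinking setting) that lands below or to the right of the existing frontier points.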

📺 Source: Claude · Published May 21, 2026
🏷️ Format: Deep Dive

Channels: Claude

Companies: Anthropic

Tags: Anthropic, BrainTrust, Claude, Claude Code, Claude Opus, Claude Opus 4.5, Claude Opus 4.6, Claude Opus 4.7, SWE-bench
