AI Dev 26 x SF | Ara Khan: Evals Are Broken Use Them Anyway

Description:

Ara Khan, speaking at the AI Dev 26 x SF event hosted by DeepLearningAI, argues that most developers approach AI evals the wrong way, and that the fix is neither blind faith in leaderboard scores nor pure vibes-based intuition. Drawing on years of experience building coding agents, Khan identifies two failure modes: the “objective metrics camp,” which treats benchmark numbers as gospel, and the “taste is king” camp, which rejects measurement entirely. She points to benchmark-maxing among AI labs as evidence that high scores on evaluations like SWE-bench don’t translate to real-world model quality.

The talk walks through how Khan’s team eventually adopted Terminal Bench, a Stanford-affiliated benchmark of 89 real-world software engineering problems, including database issues, race conditions, and front-end bugs. The problems were chosen specifically because they mirror day-to-day engineering work rather than textbook exercises like computing Fibonacci sequences. Khan contrasts this with existing evals that measure skills irrelevant to production workflows.

Khan’s core message is that evals are simultaneously an engineering problem and a philosophy problem. Developers should build evals that reflect their actual agent use cases, interpret results with nuance, and embed evaluation directly into the agent development loop. Whether you’re building a coding agent, a shopping agent, or any complex production workflow, evals remain one of the most critical tools for iterative improvement: imperfect but indispensable.
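As a rough illustration of embedding evaluation in the development loop, here is a minimal Python sketch. It is not taken from the talk or from Terminal Bench: `EvalCase`, `run_evals`, `pass_rate`, `toy_agent`, and the two sample cases are all hypothetical names and data invented for this example. It reports per-case results alongside the aggregate score, so a single number is never read in isolation.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class EvalCase:
    """One task drawn from a real use case (hypothetical structure)."""
    name: str
    prompt: str
    check: Callable[[str], bool]  # returns True if the agent output is acceptable


def run_evals(agent: Callable[[str], str], cases: List[EvalCase]) -> Dict[str, bool]:
    """Run every case and record pass/fail per case, not just a total."""
    results: Dict[str, bool] = {}
    for case in cases:
        try:
            results[case.name] = bool(case.check(agent(case.prompt)))
        except Exception:
            # A crash in the agent or the checker counts as a failure.
            results[case.name] = False
    return results


def pass_rate(results: Dict[str, bool]) -> float:
    """Aggregate score; meant to be read together with the per-case results."""
    return sum(results.values()) / len(results) if results else 0.0


def toy_agent(prompt: str) -> str:
    """Stand-in for a real agent call (assumption for illustration only)."""
    return "SELECT * FROM users WHERE id = ?" if "SQL" in prompt else ""


# Hypothetical cases modeled on everyday engineering work.
CASES = [
    EvalCase("parameterized_query", "Write a SQL lookup by user id",
             lambda out: "?" in out),
    EvalCase("race_condition_fix", "Fix the race condition in counter.py",
             lambda out: "lock" in out.lower()),
]

results = run_evals(toy_agent, CASES)
for name, passed in results.items():
    print(f"{name}: {'PASS' if passed else 'FAIL'}")
print(f"pass rate: {pass_rate(results):.0%}")
```

Running a script like this after each agent change turns the eval into part of the iteration cycle, and the per-case breakdown shows which kind of task regressed rather than only how the total moved.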


📺 Source: DeepLearningAI · Published May 22, 2026
🏷️ Format: Deep Dive
