AI Dev 26 x SF | Ara Khan: Evals Are Broken Use Them Anyway

Foundation Models · 2 months ago

Description:

Ara Khan, speaking at the AI Dev 26 x SF event hosted by DeepLearningAI, argues that most developers are fundamentally wrong about AI evals — and the fix is neither blind faith in leaderboard scores nor pure vibes-based intuition. Drawing on years of experience building coding agents, Khan identifies two failure modes: the “objective metrics camp” that treats benchmark numbers as gospel, and the “taste is king” camp that rejects measurement entirely. She points to benchmark-maxing behavior among AI labs as evidence that high scores on evaluations like SWE-bench don’t translate to real-world model quality.

The talk walks through how Khan’s team eventually adopted Terminal Bench, a Stanford-affiliated benchmark of 89 real-world software engineering problems, including database issues, race conditions, and front-end bugs. The problems were chosen specifically because they mirror day-to-day engineering work rather than textbook exercises such as computing Fibonacci sequences. By contrast, Khan argues, many existing evals measure skills that have little bearing on production workflows.

Khan’s core message is that evals are simultaneously an engineering problem and a philosophy problem. Developers should build evals that reflect their actual agent use cases, interpret results with nuance, and embed evaluation directly into the agent development loop. Whether you’re building a coding agent, a shopping agent, or any complex production workflow, evals remain one of the most critical tools for iterative improvement — imperfect but indispensable.
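To make "embed evaluation directly into the agent development loop" concrete, here is a minimal sketch in Python. It is not taken from the talk and is not Khan's harness: `EvalCase`, `run_evals`, `pass_rate`, `failures`, and `toy_agent` are hypothetical names, and the three cases are invented for illustration. The idea is only that each case pairs a task drawn from your own workload with a check, and that the same suite is rerun after every change to the agent.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class EvalCase:
    """One task drawn from your own use case (hypothetical structure)."""
    name: str
    prompt: str
    check: Callable[[str], bool]  # True if the agent's output is acceptable


def run_evals(agent: Callable[[str], str], cases: List[EvalCase]) -> Dict[str, bool]:
    """Run every case; an exception inside the agent or check counts as a failure."""
    results: Dict[str, bool] = {}
    for case in cases:
        try:
            results[case.name] = bool(case.check(agent(case.prompt)))
        except Exception:
            results[case.name] = False
    return results


def pass_rate(results: Dict[str, bool]) -> float:
    """Fraction of cases that passed (0.0 for an empty suite)."""
    return sum(results.values()) / len(results) if results else 0.0


def failures(results: Dict[str, bool]) -> List[str]:
    """Names of failed cases, the part worth reading rather than just counting."""
    return [name for name, ok in results.items() if not ok]


def toy_agent(prompt: str) -> str:
    """Stand-in for a real agent: it only acknowledges the request."""
    return f"Plan: {prompt.lower()}"


# Invented example cases; real ones would come from your own tickets or traces.
CASES = [
    EvalCase("mentions_lock", "Fix the race condition with a LOCK",
             lambda out: "lock" in out),
    EvalCase("mentions_index", "Add a database INDEX",
             lambda out: "index" in out),
    EvalCase("writes_test", "Fix the front-end bug",
             lambda out: "test" in out),
]

if __name__ == "__main__":
    results = run_evals(toy_agent, CASES)
    print(f"pass rate: {pass_rate(results):.2f}")
    print("failed:", failures(results))
```

The pass rate gives a number to track across iterations, while the list of failed cases is what supports the nuanced reading the talk asks for: a single score says little until you look at which tasks failed and why.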

📺 Source: DeepLearningAI · Published May 22, 2026
🏷️ Format: Deep Dive

Tags

Anthropic, Claude Code, Claude Opus 4.5, Cline, Cursor, DeepSeek, Meta, Modal, OpenAI, SWE-bench
