Evals Are Broken, Use Them Anyway — Ara Khan, Cline

Foundation Models · 2 months ago

Description:

Ara Khan, an engineer on the Cline team, critiques how the AI industry uses evaluation benchmarks and argues that most practitioners err in one of two directions. In this talk, presented at AI Engineer, he contends that both benchmark maximalists (who treat leaderboard numbers as ground truth) and vibe-driven engineers (who dismiss evals entirely) are misusing a genuinely useful tool.

Khan offers three practical heuristics: don’t take the benchmark numbers that model developers report at face value (he cites a Meta benchmark-gaming incident from that morning); stay current on frontier model rankings without chasing every new release, since the top model changes every few months; and recognize that most standard evals don’t test what real-world coding agents actually do. He also references TerminalBench from Stanford as a parallel effort to build more realistic agent evaluation infrastructure.

The centerpiece of the talk is Cline’s own evaluation project: working from a large dataset of opt-in user coding sessions, the team manually cleaned and structured real programming problems to create evals grounded in actual developer workflows. Khan explains why single-turn LLM evals (binary answers, narrow search space) don’t translate to agents. A meaningful coding agent eval must assess a full chain of file reads, environment setup, dependency installation, test execution, and regression checking: a multi-step trajectory in which both success and side effects matter. Practitioners building or selecting coding agents will come away with a more calibrated view of when benchmark scores are a useful signal and when to build their own evals.
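
To make the trajectory idea concrete, here is a minimal Python sketch of how such an eval might score a single agent run. It is an illustration under stated assumptions, not Cline’s actual harness: the stage names are taken from the description above, while `Step`, `Trajectory`, `Verdict`, and `evaluate` are hypothetical names invented for this example.

```python
from dataclasses import dataclass, field

# Hypothetical stage labels, mirroring the chain named in the description.
REQUIRED_STAGES = ("read_files", "setup_env", "install_deps", "run_tests")


@dataclass
class Step:
    stage: str  # a label such as one of REQUIRED_STAGES
    ok: bool    # whether the step finished without error


@dataclass
class Trajectory:
    steps: list[Step]
    # Tests the task is meant to make pass -> did they pass after the run?
    target_tests: dict[str, bool]
    # Tests that passed before the run -> do they still pass after it?
    baseline_tests: dict[str, bool]
    files_changed: set[str] = field(default_factory=set)
    files_expected: set[str] = field(default_factory=set)


@dataclass
class Verdict:
    solved: bool
    regressions: list[str]
    unexpected_files: list[str]
    failed_stages: list[str]

    @property
    def passed(self) -> bool:
        # Success alone is not enough: side effects also fail the run.
        return self.solved and not self.regressions and not self.unexpected_files


def evaluate(traj: Trajectory) -> Verdict:
    """Score one agent run on both success and side effects."""
    completed = {s.stage for s in traj.steps if s.ok}
    failed_stages = [st for st in REQUIRED_STAGES if st not in completed]
    solved = (
        bool(traj.target_tests)
        and all(traj.target_tests.values())
        and not failed_stages
    )
    # Regression check: previously passing tests that now fail.
    regressions = sorted(n for n, ok in traj.baseline_tests.items() if not ok)
    # Side-effect check: files touched outside the expected set.
    unexpected = sorted(traj.files_changed - traj.files_expected)
    return Verdict(solved, regressions, unexpected, failed_stages)
```

Under these assumptions, a run that fixes the target test but breaks a previously passing test, or edits a file outside the expected set, is reported as solved yet still fails overall. A single binary answer cannot express that distinction.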

📺 Source: AI Engineer · Published June 06, 2026
🏷️ Format: Deep Dive

Tags

Claude Code Cline Codex Cursor Gemini Kimi Meta Modal OpenAI SWE-bench
