Evals Are Broken, Use Them Anyway — Ara Khan, Cline

Descriptions:

Ara Khan, an engineer on the Cline team, delivers a pointed critique of how the AI industry uses evaluation benchmarks, and of the two opposite ways most practitioners get it wrong. The talk, presented at AI Engineer, argues that both benchmark maximalists (who treat leaderboard numbers as ground truth) and vibe-driven engineers (who dismiss evals entirely) are misusing a genuinely useful tool.

Khan offers three practical heuristics. First, don’t take the benchmark numbers that model makers report for their own models at face value (he cites a Meta benchmark-gaming incident from that morning). Second, stay current on frontier model rankings without chasing every new release, since the top model changes every few months. Third, recognize that most standard evals don’t test what real-world coding agents actually do. He also references TerminalBench from Stanford as a parallel effort to build more realistic agent evaluation infrastructure.

The centerpiece of the talk is Cline’s own evaluation project. Working from a large dataset of opt-in user coding sessions, the team manually cleaned and structured real programming problems to create evals grounded in actual developer workflows. Khan explains why single-turn LLM evals (binary answers, narrow search space) don’t translate to agents: a meaningful coding agent eval must assess a full chain of file reads, environment setup, dependency installation, test execution, and regression checking. That chain is a multi-step trajectory in which both success and side effects matter. Practitioners building or selecting coding agents will come away with a more calibrated view of when benchmark scores are a useful signal and when to build their own evals.
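
To make the contrast concrete, here is a minimal Python sketch of the difference between a single-turn check and a trajectory-level check. It is not Cline's eval harness; the `Step` and `Trajectory` types, their field names, and the pass rule are all hypothetical and chosen only to illustrate the idea that an agent run can solve the task and still fail because of side effects.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Step:
    # Hypothetical step kinds: "read_file", "setup_env", "install_deps", "run_tests"
    kind: str
    ok: bool


@dataclass
class Trajectory:
    # Everything the agent did on one task, plus what the task's checks observed.
    steps: List[Step]
    target_tests_passed: bool
    regressed_tests: List[str] = field(default_factory=list)
    unexpected_files_changed: List[str] = field(default_factory=list)


def single_turn_score(answer: str, expected: str) -> bool:
    """Single-turn eval: one answer, one binary comparison."""
    return answer.strip() == expected.strip()


def agent_score(t: Trajectory) -> Dict[str, object]:
    """Trajectory eval (illustrative rule): the run passes only if the
    target tests pass AND nothing else was broken along the way."""
    side_effect_free = not t.regressed_tests and not t.unexpected_files_changed
    step_rate = sum(s.ok for s in t.steps) / len(t.steps) if t.steps else 0.0
    return {
        "solved": t.target_tests_passed,
        "side_effect_free": side_effect_free,
        "step_success_rate": step_rate,
        "passed": t.target_tests_passed and side_effect_free,
    }


# Example: the agent fixes the target bug but breaks an unrelated test.
run = Trajectory(
    steps=[Step("read_file", True), Step("install_deps", True), Step("run_tests", True)],
    target_tests_passed=True,
    regressed_tests=["test_login"],
)
result = agent_score(run)
```

In this example the run counts as solved but not as passed, which is the kind of outcome a binary single-turn score cannot express.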


📺 Source: AI Engineer · Published June 06, 2026
🏷️ Format: Deep Dive
