Description:
Mia Glaese, VP of Research at OpenAI overseeing the Codex, human data, and alignment teams, and Olivia Watkins from OpenAI's Frontier Evals team sit down with Latent Space to make an official case for retiring SWE-Bench Verified, the coding benchmark OpenAI itself helped create, as a meaningful measure of model progress. The benchmark, built through a costly campaign in which nearly 100 expert software engineers triple-reviewed ~500 real-world GitHub tasks, has saturated: top models now cluster so tightly that 0.1% score differences are treated as decisive capability gaps.
More damaging is contamination. OpenAI built a "contamination auditor agent" that probes target models with open-ended questions to surface evidence of training-data overlap. Applied to SWE-Bench Verified, it found widespread contamination across models from OpenAI, Anthropic (Claude), and Google (Gemini), including instances of models reproducing ground-truth patches verbatim and citing internal task IDs. SWE-Bench Pro, from Scale AI, is proposed as the replacement: tasks estimated at 1–4+ expert-hours, diverse repositories, multiple programming languages, and demonstrably low contamination in early audits.
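The episode doesn't detail how the auditor agent is implemented; the sketch below is only an illustration of the probing idea it describes. The prompt wording, the `query_model` helper, and the overlap threshold are assumptions for the example, not OpenAI's actual agent.

```python
# Illustrative contamination probe (not OpenAI's auditor): ask the target model
# open-ended questions about a benchmark task and check whether its answer
# leaks ground-truth material (the gold patch or the internal task ID).
from difflib import SequenceMatcher


def query_model(prompt: str) -> str:
    """Placeholder: send the prompt to the model under audit and return its reply."""
    raise NotImplementedError


def probe_task(instance_id: str, problem_statement: str, gold_patch: str,
               overlap_threshold: float = 0.6) -> dict:
    # Open-ended prompt: never show the patch; see what the model volunteers.
    prompt = (
        "You may have seen this GitHub issue before. Describe the exact fix "
        "that was merged, including the diff if you recall it:\n\n"
        f"{problem_statement}"
    )
    answer = query_model(prompt)

    # Signal 1: the model cites the benchmark's internal task ID unprompted.
    cites_task_id = instance_id.lower() in answer.lower()

    # Signal 2: the answer substantially reproduces the ground-truth patch.
    patch_overlap = SequenceMatcher(None, gold_patch, answer).ratio()

    return {
        "instance_id": instance_id,
        "cites_task_id": cites_task_id,
        "patch_overlap": round(patch_overlap, 3),
        "likely_contaminated": cites_task_id or patch_overlap >= overlap_threshold,
    }
```

Run over every benchmark instance, a probe like this would flag exactly the symptoms mentioned in the episode: verbatim patch reproduction and spontaneous citation of task IDs.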
The conversation also addresses what an ideal agentic coding benchmark should actually measure — open-ended architectural decisions, multi-file changes, underspecified problem statements — and why the field needs proactive retirement policies before leaderboard saturation renders benchmarks useless. Essential listening for anyone tracking the state of AI coding evaluation.
📺 Source: Latent Space · Published February 23, 2026
🏷️ Format: Interview







