Description:
Mia Glaese, VP of Research at OpenAI overseeing the Codex, human data, and alignment teams, and Olivia Watkins from OpenAI's Frontier Evals team sit down with Latent Space to make an official case for retiring SWE-Bench Verified, the coding benchmark OpenAI itself helped create, as a meaningful measure of model progress. The benchmark, built through a costly campaign in which nearly 100 expert software engineers triple-reviewed ~500 real-world GitHub tasks, has saturated: top models now cluster so tightly that 0.1% score differences are treated as decisive capability gaps.
More damaging is contamination. OpenAI built a "contamination auditor agent" that probes target models with open-ended questions to surface evidence of training-data overlap. Applied to SWE-Bench Verified, it found widespread contamination across models from OpenAI, Anthropic (Claude), and Google (Gemini), including instances of models reproducing ground-truth patches verbatim and citing internal task IDs. SWE-Bench Pro, from Scale AI, is proposed as the replacement: tasks estimated at 1–4+ expert-hours, diverse repositories, multiple programming languages, and demonstrably low contamination in early audits.
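The episode doesn't detail how the auditor agent is implemented; the sketch below is only an illustration of the probing idea it describes. The prompt wording, the `query_model` helper, and the overlap threshold are assumptions for the example, not OpenAI's actual agent.

```python
# Illustrative contamination probe (not OpenAI's auditor): ask the target model
# open-ended questions about a benchmark task and check whether its answer
# leaks ground-truth material (the gold patch or the internal task ID).
from difflib import SequenceMatcher


def query_model(prompt: str) -> str:
    """Placeholder: send the prompt to the model under audit and return its reply."""
    raise NotImplementedError


def probe_task(instance_id: str, problem_statement: str, gold_patch: str,
               overlap_threshold: float = 0.6) -> dict:
    # Open-ended prompt: never show the patch; see what the model volunteers.
    prompt = (
        "You may have seen this GitHub issue before. Describe the exact fix "
        "that was merged, including the diff if you recall it:\n\n"
        f"{problem_statement}"
    )
    answer = query_model(prompt)

    # Signal 1: the model cites the benchmark's internal task ID unprompted.
    cites_task_id = instance_id.lower() in answer.lower()

    # Signal 2: the answer substantially reproduces the ground-truth patch.
    patch_overlap = SequenceMatcher(None, gold_patch, answer).ratio()

    return {
        "instance_id": instance_id,
        "cites_task_id": cites_task_id,
        "patch_overlap": round(patch_overlap, 3),
        "likely_contaminated": cites_task_id or patch_overlap >= overlap_threshold,
    }
```

Run over every benchmark instance, a probe like this would flag exactly the symptoms mentioned in the episode: verbatim patch reproduction and spontaneous citation of task IDs.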
The conversation also addresses what an ideal agentic coding benchmark should actually measure — open-ended architectural decisions, multi-file changes, underspecified problem statements — and why the field needs proactive retirement policies before leaderboard saturation renders benchmarks useless. Essential listening for anyone tracking the state of AI coding evaluation.
📺 Source: Latent Space · Published February 23, 2026
🏷️ Format: Interview







