[State of Code Evals] After SWE-bench, Code Clash & SOTA Coding Benchmarks recap — John Yang

Description:

John Yang, creator of SWE-bench, sits down with the Latent Space podcast at NeurIPS 2025 to survey the state of coding evaluations heading into 2026. Yang traces SWE-bench's trajectory from its October 2023 launch through Cognition's Devin release, which he credits with igniting the current benchmark arms race. He then walks through the official extensions: multimodal and multilingual variants that now cover nine programming languages (JavaScript, Rust, Java, Ruby, C, and others) across 40 repositories, a deliberate move to address longstanding criticism of the original benchmark's heavy Django/Python bias.

A large portion of the conversation focuses on Code Clash, Yang's new evaluation framework designed to fix what he sees as fundamental flaws in unit-test-based benchmarks. Rather than posing isolated task instances, Code Clash runs LLMs in competitive programming tournaments: each model maintains and iteratively improves its own codebase across multiple rounds before the codebases are pitted head-to-head and scored by an Elo-style judge. The goal is to evaluate long-horizon development on codebases where prior model decisions have real consequences.
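
For intuition, here is a minimal sketch of the kind of Elo-style rating update such a tournament could use; the function names, K-factor, and starting ratings are illustrative assumptions, not Code Clash's actual implementation:

```python
# Minimal Elo-style rating sketch (illustrative; not Code Clash's actual code).
# After each head-to-head round, the judge's verdict (1 = A wins, 0 = B wins,
# 0.5 = draw) nudges both models' ratings toward their observed performance.

def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A beats B under the Elo logistic model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))

def update(rating_a: float, rating_b: float,
           score_a: float, k: float = 32.0) -> tuple[float, float]:
    """Return updated (rating_a, rating_b) after one matchup."""
    exp_a = expected_score(rating_a, rating_b)
    rating_a += k * (score_a - exp_a)
    rating_b += k * ((1.0 - score_a) - (1.0 - exp_a))
    return rating_a, rating_b

# Example: two models start at 1200; model A's codebase wins the round.
ra, rb = update(1200.0, 1200.0, score_a=1.0)
print(ra, rb)  # 1216.0, 1184.0
```

Run over many rounds, ratings like these let pairwise judge verdicts accumulate into a stable leaderboard, which is the appeal of tournament-style scoring over a fixed unit-test pass rate.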

Yang also surveys the broader ecosystem, calling out Terminal Bench, SecBench (cybersecurity-focused), and user-simulation benchmarks like TauBench and WebArena as notable 2025 efforts. He discusses the inherent cost-versus-completeness tension in agentic eval design and shares his thinking on where curation methodology for the next generation of benchmarks is heading. An essential listen for anyone building or evaluating coding agents.


📺 Source: Latent Space · Published December 31, 2025
🏷️ Format: Podcast