[State of Code Evals] After SWE-bench, Code Clash & SOTA Coding Benchmarks recap — John Yang

Description:

John Yang, creator of SWE-bench, sits down with the Latent Space podcast at NeurIPS 2025 to survey the state of coding evaluations heading into 2026. Yang traces SWE-bench's trajectory from its October 2023 launch through Cognition's Devin release, which he credits with igniting the current benchmark arms race. He then walks through the official extensions: multimodal and multilingual variants that now cover nine programming languages (JavaScript, Rust, Java, Ruby, C, and others) across 40 repositories, a deliberate move to address longstanding criticism of the original benchmark's heavy Django/Python bias.

A large portion of the conversation focuses on Code Clash, Yang's new evaluation framework designed to fix what he sees as fundamental flaws in unit-test-based benchmarks. Rather than posing isolated task instances, Code Clash runs LLMs in competitive programming tournaments: each model maintains and iteratively improves its own codebase across multiple rounds before the codebases are pitted head-to-head and scored by an Elo-style judge. The goal is to evaluate long-horizon development on codebases where prior model decisions have real consequences.
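
For intuition, here is a minimal sketch of the kind of Elo-style rating update such a tournament could use; the function names, K-factor, and starting ratings are illustrative assumptions, not Code Clash's actual implementation:

```python
# Minimal Elo-style rating sketch (illustrative; not Code Clash's actual code).
# After each head-to-head round, the judge's verdict (1 = A wins, 0 = B wins,
# 0.5 = draw) nudges both models' ratings toward their observed performance.

def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A beats B under the Elo logistic model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))

def update(rating_a: float, rating_b: float,
           score_a: float, k: float = 32.0) -> tuple[float, float]:
    """Return updated (rating_a, rating_b) after one matchup."""
    exp_a = expected_score(rating_a, rating_b)
    rating_a += k * (score_a - exp_a)
    rating_b += k * ((1.0 - score_a) - (1.0 - exp_a))
    return rating_a, rating_b

# Example: two models start at 1200; model A's codebase wins the round.
ra, rb = update(1200.0, 1200.0, score_a=1.0)
print(ra, rb)  # 1216.0, 1184.0
```

Run over many rounds, ratings like these let pairwise judge verdicts accumulate into a stable leaderboard, which is the appeal of tournament-style scoring over a fixed unit-test pass rate.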

Yang also surveys the broader ecosystem, calling out Terminal Bench, SecBench (cybersecurity-focused), and user-simulation benchmarks like TauBench and WebArena as notable 2025 efforts. He discusses the inherent cost-versus-completeness tension in agentic eval design and shares his thinking on where curation methodology for the next generation of benchmarks is heading. An essential listen for anyone building or evaluating coding agents.


📺 Source: Latent Space · Published December 31, 2025
🏷️ Format: Podcast