Finally a good benchmark (DeepSWE)

Descriptions:

Matthew Berman breaks down DeepSWE, a new long-horizon software engineering benchmark released by data-curve.ai that claims to fix the most significant flaws in today’s leading coding evaluations, particularly SWEBench Pro.

The benchmark’s four core advances are contamination-free task design (every problem is written from scratch, not adapted from public GitHub commits or pull requests), broad repository coverage across 91 active open-source projects spanning TypeScript, Go, Python, JavaScript, and Rust, real-world prompt complexity (prompts are half the length of SWEBench’s but solutions require 5.5x more code), and dramatically improved verification accuracy. On that last point, SWEBench Pro’s verifier produces false positives 8.5% of the time and false negatives 24% of the time; DeepSWE brings those rates down to 0.3% and 1.1% respectively, a substantial methodological improvement that changes which models look best.
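To put the verifier numbers in proportion, here is a minimal sketch that only does arithmetic on the four percentages quoted above. The names `RATES`, `reduction_factor`, and `summarize` are illustrative and do not come from DeepSWE or SWEBench Pro tooling.

```python
# Verifier error rates quoted in the summary, in percent.
# The dictionary layout and key names are assumptions made for illustration.
RATES = {
    "false_positive": {"swebench_pro": 8.5, "deepswe": 0.3},
    "false_negative": {"swebench_pro": 24.0, "deepswe": 1.1},
}


def reduction_factor(before: float, after: float) -> float:
    """Return how many times smaller `after` is than `before`."""
    return before / after


def summarize(rates: dict) -> dict:
    """Map each error type to its reduction factor, rounded to one decimal."""
    return {
        kind: round(reduction_factor(r["swebench_pro"], r["deepswe"]), 1)
        for kind, r in rates.items()
    }


if __name__ == "__main__":
    for kind, factor in summarize(RATES).items():
        print(f"{kind}: roughly {factor}x lower on DeepSWE")
```

On the quoted figures, that works out to roughly a 28x drop in false positives and a 22x drop in false negatives.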

The resulting leaderboard differs significantly from other benchmarks: GPT 5.5 Extra High leads by more than 15 percentage points over Anthropic’s Opus 4.7, a gap Berman says aligns with what engineers have been reporting anecdotally. Gemini 3.5 Flash scores around 28%, with a long tail of other models below. Berman argues that DeepSWE better captures how developers actually use agentic coders (describing desired behavior rather than over-specifying implementation steps), which makes it a more meaningful proxy for real-world coding agent utility.


📺 Source: Matthew Berman · Published May 27, 2026
🏷️ Format: Benchmark Test
