Finally a good benchmark (DeepSWE)

Benchmarks · 2 months ago

Description:

Matthew Berman breaks down DeepSWE, a new long-horizon software engineering benchmark released by data-curve.ai that claims to fix the most significant flaws in today’s leading coding evaluations, particularly SWEBench Pro.

The benchmark’s four core advances are: contamination-free task design (every problem is written from scratch, not adapted from public GitHub commits or pull requests); broad repository coverage across 91 active open-source projects spanning TypeScript, Go, Python, JavaScript, and Rust; real-world prompt complexity (prompts are half the length of SWEBench’s, but solutions require 5.5x more code); and dramatically improved verification accuracy. On that last point, SWEBench Pro’s verifier produces false positives 8.5% of the time and false negatives 24% of the time, while DeepSWE brings those rates down to 0.3% and 1.1% respectively. That is a substantial methodological improvement, and it changes which models look best.
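To make the verifier error rates concrete, here is a minimal Python sketch of how false positive and false negative rates can be computed from a verifier's verdicts. The definitions are an assumption, since the summary does not say which denominators the benchmark authors use: here a false positive is an incorrect solution the verifier accepts, and a false negative is a correct solution it rejects. The `VerifierCounts` type and the counts are hypothetical, chosen only so the arithmetic reproduces the 8.5% and 24% figures quoted above. They are not data from either benchmark.

```python
from dataclasses import dataclass


@dataclass
class VerifierCounts:
    """Hypothetical tally of verifier verdicts against ground truth."""
    true_pass: int   # correct solution, verifier accepted
    false_pass: int  # incorrect solution, verifier accepted (false positive)
    true_fail: int   # incorrect solution, verifier rejected
    false_fail: int  # correct solution, verifier rejected (false negative)


def false_positive_rate(c: VerifierCounts) -> float:
    # Assumed definition: share of incorrect solutions that were accepted.
    incorrect = c.false_pass + c.true_fail
    return c.false_pass / incorrect if incorrect else 0.0


def false_negative_rate(c: VerifierCounts) -> float:
    # Assumed definition: share of correct solutions that were rejected.
    correct = c.true_pass + c.false_fail
    return c.false_fail / correct if correct else 0.0


# Made-up counts (1000 correct and 1000 incorrect solutions), for illustration only.
counts = VerifierCounts(true_pass=760, false_pass=85, true_fail=915, false_fail=240)

print(f"false positive rate: {false_positive_rate(counts):.1%}")
print(f"false negative rate: {false_negative_rate(counts):.1%}")
```

Under these assumed definitions, a high false negative rate penalizes models whose correct solutions differ from what the tests expect. This is one way verifier accuracy can reorder a leaderboard.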

The resulting leaderboard differs significantly from other benchmarks: GPT 5.5 Extra High leads by more than 15 percentage points over Anthropic’s Opus 4.7, a gap Berman says aligns with what engineers have been reporting anecdotally. Gemini 3.5 Flash scores around 28%, with a long tail of other models below. Berman argues DeepSWE better captures how developers actually use agentic coders — by describing desired behavior rather than over-specifying implementation steps — making it a more meaningful proxy for real-world coding agent utility.

📺 Source: Matthew Berman · Published May 27, 2026
🏷️ Format: Benchmark Test

Tags

Anthropic, Claude 4.5 Haiku, Claude Opus 4.7, Composer 2.5, DeepSWE, Gemini Flash 3.5, GPT-55, OpenAI, Theo
