SWE-rebench: Lessons from Evaluating Coding Agents — Ibragim Badertdinov, Nebius

Descriptions:

Ibragim Badertdinov, an AI researcher at Nebius with an unconventional background (a trained dentist turned NeurIPS and ICML author), presented the operational lessons behind SWE-rebench at the AI Engineer conference. SWE-rebench is a continuously refreshed leaderboard that evaluates 30 coding models each month on real-world software engineering tasks drawn from popular open-source GitHub repositories. A hard time split guards against benchmark contamination: only issues and pull requests from the prior month are used, so a task cannot have leaked into a model’s training data before that model is evaluated.
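The time split described above can be sketched as a simple date filter. This is a minimal illustration, not SWE-rebench's actual pipeline: the `Task` type, its fields, and the choice of a calendar-month window are all assumptions made for the example.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Task:
    # Hypothetical task record; field names are illustrative only.
    repo: str
    issue_id: int
    created: date  # date the issue or pull request was opened


def previous_month_window(today: date) -> tuple[date, date]:
    """Return the half-open range [start, end) for the calendar month before `today`."""
    end = today.replace(day=1)  # first day of the current month
    if end.month == 1:
        start = date(end.year - 1, 12, 1)  # wrap around to December
    else:
        start = date(end.year, end.month - 1, 1)
    return start, end


def select_fresh_tasks(tasks: list[Task], today: date) -> list[Task]:
    """Keep only tasks created during the previous calendar month."""
    start, end = previous_month_window(today)
    return [t for t in tasks if start <= t.created < end]
```

Using a half-open window means a task created on the first day of the current month is excluded, so each monthly batch covers exactly one month with no overlap between batches.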

Every task in SWE-rebench is packaged as a Docker image, often 1 to 10 gigabytes, containing a working repository environment and a verifier. The verifier checks two kinds of test transition: fail-to-pass (tests that confirm the core fix) and pass-to-pass (tests that confirm nothing regressed). Badertdinov was candid about infrastructure failure modes: flaky external dependencies, images defaulting to 1970s system timestamps, and inconsistent network access have all invalidated evaluation runs. His recommendation: treat benchmark infrastructure with the same rigor as model selection, and define retry policies that cleanly separate model errors from infrastructure errors.
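A rough sketch of the two ideas in this paragraph: the fail-to-pass / pass-to-pass check, and a retry policy that retries only infrastructure errors. The names `verify`, `InfraError`, `Outcome`, and `run_with_retries` are hypothetical and are not taken from the SWE-rebench codebase; test results are modeled as plain name-to-bool mappings for simplicity.

```python
from enum import Enum
from typing import Callable


class Outcome(Enum):
    RESOLVED = "resolved"
    MODEL_FAILURE = "model_failure"  # the agent's patch did not solve the task
    INFRA_ERROR = "infra_error"      # the environment failed; says nothing about the model


class InfraError(Exception):
    """Hypothetical marker for environment problems (network, image pull, clock)."""


def verify(before: dict[str, bool], after: dict[str, bool],
           fail_to_pass: set[str], pass_to_pass: set[str]) -> bool:
    """True if every target test flips to passing and no stable test regresses.

    `before` / `after` map test names to pass (True) or fail (False).
    A test missing from a mapping is treated as failing.
    """
    fixed = all(not before.get(t, False) and after.get(t, False) for t in fail_to_pass)
    stable = all(before.get(t, False) and after.get(t, False) for t in pass_to_pass)
    return fixed and stable


def run_with_retries(run_once: Callable[[], bool], max_infra_retries: int = 2) -> Outcome:
    """Retry only on InfraError; a clean run that fails verification is final."""
    for _ in range(max_infra_retries + 1):
        try:
            solved = run_once()
        except InfraError:
            continue  # environment problem: try again without penalizing the model
        return Outcome.RESOLVED if solved else Outcome.MODEL_FAILURE
    return Outcome.INFRA_ERROR
```

Reporting `INFRA_ERROR` as its own outcome, rather than folding it into failures, keeps a broken environment from silently lowering a model's score.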

The talk also covers caching strategies that meaningfully reduce per-run compute cost, the agent scaffold used for reference runs with Claude Opus 4.6 (described as a minimalist ReAct-style setup), and manual verification, which takes roughly a full working day per monthly batch, to confirm tasks are genuinely solvable yet challenging. Reference harnesses from Claude Code, Codex, and Gemini are included for cross-system comparison.

📺 Source: AI Engineer · Published June 04, 2026
🏷️ Format: Deep Dive

Tags: Claude Code, Claude Opus 4.6, Codex, Docker, Gemini, GitHub, Nebius, SWE-bench
