Descriptions:
Ibragim Badertdinov, an AI researcher at Nebius with an unconventional background—a trained dentist turned NeurIPS and ICML author—presented the operational lessons behind SWE-rebench at the AI Engineer conference. SWE-rebench is a continuously refreshed leaderboard that evaluates 30 coding models monthly on real-world software engineering tasks drawn from popular open-source GitHub repositories, with a hard time-split design to prevent benchmark contamination: only issues and pull requests from the prior month are used, so no task can leak into a model’s training data before evaluation.
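The time-split rule can be pictured as a date filter over candidate tasks. The sketch below is illustrative only: the `Task` record, its field names, and the reading of "prior month" as the previous calendar month are assumptions, not details confirmed by the talk.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class Task:
    # Hypothetical task record; field names are illustrative, not SWE-rebench's schema.
    task_id: str
    created: date  # date the issue or pull request was opened


def prior_month_window(today: date) -> tuple[date, date]:
    """Return (first_day, last_day) of the calendar month before `today`."""
    first_of_this_month = today.replace(day=1)
    last_of_prior = first_of_this_month - timedelta(days=1)
    return last_of_prior.replace(day=1), last_of_prior


def select_tasks(tasks: list[Task], today: date) -> list[Task]:
    """Keep only tasks created inside the prior calendar month (assumed window)."""
    start, end = prior_month_window(today)
    return [t for t in tasks if start <= t.created <= end]
```

Under these assumptions, a task opened two months ago is dropped from the batch even if it is otherwise valid, which is what keeps each monthly set newer than the models it is used to score.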
Every task in SWE-rebench is packaged as a Docker image, often 1 to 10 gigabytes, containing a working repository environment and a verifier that checks two test transitions: fail-to-pass (the core fix) and pass-to-pass (regression stability). Badertdinov was candid about infrastructure failure modes: flaky external dependencies, images whose system clocks default to 1970 (the Unix epoch), and inconsistent network access have all invalidated evaluation runs. His recommendation: treat benchmark infrastructure with the same rigor as model selection, and define retry policies that cleanly separate model errors from infrastructure errors.
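A minimal sketch of the two ideas in this paragraph, the fail-to-pass / pass-to-pass check and a retry policy that keeps infrastructure errors apart from model errors. All names here (`verify`, `Outcome`, `InfraError`, `run_with_retries`) are hypothetical and do not come from the SWE-rebench codebase; the retry count is an arbitrary assumption.

```python
from enum import Enum


def verify(before: dict[str, bool], after: dict[str, bool],
           fail_to_pass: set[str], pass_to_pass: set[str]) -> bool:
    """Accept a patch only if target tests flip to passing and guarded tests stay green.

    `before` and `after` map test names to True (pass) or False (fail).
    A test missing from either run is treated as failing in that run.
    """
    fixed = all(not before.get(t, False) and after.get(t, False) for t in fail_to_pass)
    stable = all(before.get(t, False) and after.get(t, False) for t in pass_to_pass)
    return fixed and stable


class Outcome(Enum):
    RESOLVED = "resolved"
    MODEL_FAILURE = "model_failure"
    INFRA_FAILURE = "infra_failure"


class InfraError(Exception):
    """Hypothetical marker for failures outside the model's control (network, registry, clock)."""


def run_with_retries(attempt, max_infra_retries: int = 2) -> Outcome:
    """Call `attempt()`, which returns True if the verifier accepted the patch.

    Only InfraError triggers a retry; a rejected patch counts against the model at once.
    """
    for _ in range(max_infra_retries + 1):
        try:
            return Outcome.RESOLVED if attempt() else Outcome.MODEL_FAILURE
        except InfraError:
            continue  # infrastructure hiccup: try the same task again
    return Outcome.INFRA_FAILURE  # retries exhausted: exclude the run rather than blame the model
```

Keeping `INFRA_FAILURE` as its own outcome means a run lost to a dependency outage can be excluded or rerun instead of being scored as a model miss.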
The talk also covers caching strategies that meaningfully reduce per-run compute cost, the agent scaffold used for reference runs with Claude Opus 4.6 (described as a minimalist ReAct-style setup), and manual verification—roughly a full working day per monthly batch—to confirm tasks are genuinely solvable yet challenging. Reference harnesses from Claude Code, Codex, and Genie are included for cross-system comparison.
📺 Source: AI Engineer · Published June 04, 2026
🏷️ Format: Deep Dive






