Coding Evals: From Code Snippets to Codebases – Naman Jain, Cursor


Description:

Naman Jain, a researcher at Cursor, traces four years of progress in coding benchmark design at the AI Engineer conference, from evaluating single-line Pandas completions to assessing models on multi-hour code optimization tasks across real-world repositories. The talk distills hard-won lessons from projects spanning code completion, competitive programming (LiveCodeBench), repository-level question answering, and a new frontier benchmark for full-codebase performance optimization.

Three persistent challenges anchor the discussion: data contamination (models trained on the full internet have likely seen benchmark problems on Stack Overflow or GitHub), brittle test suites (test cases that pass semantically incorrect solutions), and difficulty miscalibration (benchmarks clustered at either 80%+ or sub-1% pass rates, both of which provide minimal signal for improvement). Jain describes how dynamic evaluation sets — periodically refreshed with problems released after a model’s training cutoff — address contamination while allowing difficulty distributions to be recalibrated as model capabilities advance.
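As a rough illustration of that dynamic-refresh idea, here is a minimal sketch of filtering a problem pool down to items released after a model's training cutoff. The `Problem` class, `dynamic_eval_set` function, and the example dates are hypothetical stand-ins, not the benchmark's actual code.

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class Problem:
    problem_id: str
    release_date: date
    difficulty: str  # e.g. "easy" | "medium" | "hard"

def dynamic_eval_set(pool: list[Problem], training_cutoff: date) -> list[Problem]:
    # Keep only problems published after the model's training cutoff,
    # so the model cannot have seen them (or their solutions) during training.
    return [p for p in pool if p.release_date > training_cutoff]

# Hypothetical usage: evaluating a model whose training data ends 2024-09-30.
pool = [
    Problem("two-sum", date(2023, 1, 12), "easy"),
    Problem("segment-tree-lazy", date(2024, 11, 3), "hard"),
    Problem("grid-bfs-variant", date(2025, 2, 19), "medium"),
]
fresh = dynamic_eval_set(pool, training_cutoff=date(2024, 9, 30))
print([p.problem_id for p in fresh])  # only post-cutoff problems remain
```

Refreshing the pool on a schedule also lets the curator rebalance the difficulty mix each cycle, keeping pass rates in a range that still yields signal.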

The talk concludes with a new code-optimization benchmark built by crawling real performance-related commits from codebases like llama.cpp. Models are asked to generate patches that both pass equivalence checks against the human patch and improve runtime on defined workloads — a construct-valid, real-world-grounded task that Jain argues is essential as models push into longer-horizon, lower-level engineering work involving C, C++, and Rust.
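The acceptance criterion for such a benchmark can be sketched in the same spirit: a candidate patch counts only if it stays behaviorally equivalent on the workload and is measurably faster than the unpatched baseline. The `score_patch` helper and toy workload below are illustrative assumptions, not the talk's real harness (which checks equivalence against the human patch and times defined workloads in the target repository).

```python
import time
from typing import Callable

def score_patch(
    run_patched: Callable[[], object],    # runs the workload with the candidate patch applied
    run_baseline: Callable[[], object],   # runs the same workload on the unpatched code
    outputs_equivalent: Callable[[object, object], bool],  # exact or tolerance-based comparison
) -> dict:
    # Accept a candidate patch only if it preserves behavior on the workload
    # and actually runs faster than the baseline.
    t0 = time.perf_counter()
    baseline_out = run_baseline()
    baseline_s = time.perf_counter() - t0

    t0 = time.perf_counter()
    patched_out = run_patched()
    patched_s = time.perf_counter() - t0

    correct = outputs_equivalent(baseline_out, patched_out)
    speedup = baseline_s / patched_s if patched_s > 0 else float("inf")
    return {"correct": correct, "speedup": speedup, "accepted": correct and speedup > 1.0}

# Toy usage: "optimizing" a sum over a range with a closed-form replacement.
baseline = lambda: sum(range(1_000_000))
patched = lambda: (999_999 * 1_000_000) // 2
print(score_patch(patched, baseline, lambda a, b: a == b))
```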


📺 Source: AI Engineer · Published December 15, 2025
🏷️ Format: Deep Dive
