Coding Evals: From Code Snippets to Codebases – Naman Jain, Cursor


Description:

Naman Jain, a researcher at Cursor, traces four years of progress in coding benchmark design at the AI Engineer conference, from evaluating single-line Pandas completions to assessing models on multi-hour code optimization tasks across real-world repositories. The talk distills hard-won lessons from projects spanning code completion, competitive programming (LiveCodeBench), repository-level question answering, and a new frontier benchmark for full-codebase performance optimization.

Three persistent challenges anchor the discussion: data contamination (models trained on the full internet have likely seen benchmark problems on Stack Overflow or GitHub), brittle test suites (test cases that pass semantically incorrect solutions), and difficulty miscalibration (benchmarks clustered at either 80%+ or sub-1% pass rates, both of which provide minimal signal for improvement). Jain describes how dynamic evaluation sets — periodically refreshed with problems released after a model’s training cutoff — address contamination while allowing difficulty distributions to be recalibrated as model capabilities advance.
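As a rough illustration of that dynamic-refresh idea, here is a minimal sketch of filtering a problem pool down to items released after a model's training cutoff. The `Problem` class, `dynamic_eval_set` function, and the example dates are hypothetical stand-ins, not the benchmark's actual code.

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class Problem:
    problem_id: str
    release_date: date
    difficulty: str  # e.g. "easy" | "medium" | "hard"

def dynamic_eval_set(pool: list[Problem], training_cutoff: date) -> list[Problem]:
    # Keep only problems published after the model's training cutoff,
    # so the model cannot have seen them (or their solutions) during training.
    return [p for p in pool if p.release_date > training_cutoff]

# Hypothetical usage: evaluating a model whose training data ends 2024-09-30.
pool = [
    Problem("two-sum", date(2023, 1, 12), "easy"),
    Problem("segment-tree-lazy", date(2024, 11, 3), "hard"),
    Problem("grid-bfs-variant", date(2025, 2, 19), "medium"),
]
fresh = dynamic_eval_set(pool, training_cutoff=date(2024, 9, 30))
print([p.problem_id for p in fresh])  # only post-cutoff problems remain
```

Refreshing the pool on a schedule also lets the curator rebalance the difficulty mix each cycle, keeping pass rates in a range that still yields signal.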

The talk concludes with a new code-optimization benchmark built by crawling real performance-related commits from codebases like llama.cpp. Models are asked to generate patches that both pass equivalence checks against the human patch and improve runtime on defined workloads — a construct-valid, real-world-grounded task that Jain argues is essential as models push into longer-horizon, lower-level engineering work involving C, C++, and Rust.
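The acceptance criterion for such a benchmark can be sketched in the same spirit: a candidate patch counts only if it stays behaviorally equivalent on the workload and is measurably faster than the unpatched baseline. The `score_patch` helper and toy workload below are illustrative assumptions, not the talk's real harness (which checks equivalence against the human patch and times defined workloads in the target repository).

```python
import time
from typing import Callable

def score_patch(
    run_patched: Callable[[], object],    # runs the workload with the candidate patch applied
    run_baseline: Callable[[], object],   # runs the same workload on the unpatched code
    outputs_equivalent: Callable[[object, object], bool],  # exact or tolerance-based comparison
) -> dict:
    # Accept a candidate patch only if it preserves behavior on the workload
    # and actually runs faster than the baseline.
    t0 = time.perf_counter()
    baseline_out = run_baseline()
    baseline_s = time.perf_counter() - t0

    t0 = time.perf_counter()
    patched_out = run_patched()
    patched_s = time.perf_counter() - t0

    correct = outputs_equivalent(baseline_out, patched_out)
    speedup = baseline_s / patched_s if patched_s > 0 else float("inf")
    return {"correct": correct, "speedup": speedup, "accepted": correct and speedup > 1.0}

# Toy usage: "optimizing" a sum over a range with a closed-form replacement.
baseline = lambda: sum(range(1_000_000))
patched = lambda: (999_999 * 1_000_000) // 2
print(score_patch(patched, baseline, lambda a, b: a == b))
```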


📺 Source: AI Engineer · Published December 15, 2025
🏷️ Format: Deep Dive
