Benchmarking semantic code retrieval on Claude Code — Kuba Rogut, Turbopuffer

Benchmarks · 2 months ago

Kuba Rogut of Turbopuffer presents original benchmark results comparing three code retrieval strategies for Claude Code: the default agentic search (grep-based file exploration), windowed grep, and semantic search powered by Turbopuffer’s serverless vector database with Voyage Code embeddings.

The headline finding: Claude Code’s default approach achieves 65% file precision — roughly one in three files it reads is irrelevant to the task. Adding windowed grep improves this to around 80%, and combining windowed grep with semantic search pushes file precision to 87%. The benchmark spans 50 tasks drawn from a Claude Code evaluation suite and measures file-level precision, file-level recall, and line-level recall across all three conditions. Rogut introduces TurboGrep, an open-source CLI that chunks a codebase using a tree-splitter library, embeds it with the Voyage Code model, and indexes it into Turbopuffer for fast semantic retrieval at query time.
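The three metrics named above can be sketched as follows. This is a minimal illustration only: the description does not give the benchmark's exact definitions, so the formulas below assume the standard set-based ones, and the function names and sample file paths are hypothetical.

```python
# Assumed definitions (not confirmed by the talk):
#   file precision = relevant files read / all files read
#   file recall    = relevant files read / all relevant files
#   line recall    = relevant lines read / all relevant lines


def file_precision(read_files: set[str], relevant_files: set[str]) -> float:
    """Fraction of the files the agent read that were relevant to the task."""
    if not read_files:
        return 0.0
    return len(read_files & relevant_files) / len(read_files)


def file_recall(read_files: set[str], relevant_files: set[str]) -> float:
    """Fraction of the relevant files that the agent actually read."""
    if not relevant_files:
        return 0.0
    return len(read_files & relevant_files) / len(relevant_files)


def line_recall(
    read_lines: dict[str, set[int]], relevant_lines: dict[str, set[int]]
) -> float:
    """Fraction of the relevant lines (keyed by file path) that the agent read."""
    total = sum(len(lines) for lines in relevant_lines.values())
    if total == 0:
        return 0.0
    hit = sum(
        len(lines & read_lines.get(path, set()))
        for path, lines in relevant_lines.items()
    )
    return hit / total


# Hypothetical task: the agent read three files, two of which were relevant.
read = {"a.py", "b.py", "c.py"}
relevant = {"a.py", "b.py", "d.py"}
precision = file_precision(read, relevant)
recall = file_recall(read, relevant)
```

Under these assumed definitions, a file precision of 65% means 35% of the files read were not relevant, which matches the "roughly one in three" framing.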

The talk puts the infrastructure investment in context using Cursor’s published research, which found that semantic search yields a 24% relative improvement in answer accuracy and a 2.6% increase in code retention in large codebases. It also lays out the “cached compute” argument: the upfront embedding cost is amortized across every agent session running against the same codebase, so token savings compound at scale. For teams building on Claude Code, evaluating vector database options, or designing RAG pipelines for code-heavy workloads, the talk offers a data-backed starting point with reproducible tooling.
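The amortization argument comes down to simple arithmetic, sketched below. The function names and every number are hypothetical placeholders, not figures from the talk, and the sketch assumes the indexing cost and the per-session savings are expressed in the same unit (for example, dollars).

```python
import math


def break_even_sessions(index_cost: float, saved_per_session: float) -> int:
    """Number of agent sessions needed before the one-time index cost is recovered."""
    if saved_per_session <= 0:
        raise ValueError("saved_per_session must be positive")
    return math.ceil(index_cost / saved_per_session)


def net_savings(index_cost: float, saved_per_session: float, sessions: int) -> float:
    """Total savings after `sessions` sessions, minus the one-time index cost."""
    return saved_per_session * sessions - index_cost


# Illustrative numbers only: a one-time cost of 10.0 units to embed and index
# the codebase, and 0.5 units saved per session by reading fewer irrelevant files.
sessions_needed = break_even_sessions(10.0, 0.5)
savings_after_100 = net_savings(10.0, 0.5, 100)
```

With these placeholder inputs the index pays for itself after 20 sessions, and every later session against the same codebase is net savings.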

📺 Source: AI Engineer · Published June 03, 2026
🏷️ Format: Benchmark Test

Tags

Anthropic Boris Claude Code Cursor Notion