Description:
Michele Catasta, president and head of AI at Replit, presents a detailed account of how his team evaluates and iteratively improves the Replit Agent at production scale. The talk argues that standard AI benchmarks like SWE-bench and HumanEval are insufficient for Replit’s use case, where users expect a complete working application from a natural language prompt alone: no framework preferences, no tests written, no scaffolding provided. This extreme vibe-coding scenario demands a fundamentally different evaluation philosophy.
Replit’s approach rests on two pillars. The first is offline evaluation using VibeBench, an open-source benchmark built around 20 real applications with end-to-end prompts, which Replit is actively inviting the community to extend with harder prompts and broader coverage. These benchmarks act as a binary gate before shipping new agent versions. The second pillar is online evaluation driven by millions of live production sessions daily, instrumented with span monitoring, user sentiment analysis on incoming prompts, and publish rates as a strong positive signal of agent quality. A/B testing ties the two pillars together and is described as the single most reliable mechanism for steady, honest progress.
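To make the two-pillar structure concrete, here is a minimal Python sketch of how an offline gate and an online publish-rate comparison could fit together. Every name here (`BenchResult`, `offline_gate`, `publish_rate`) and every threshold is a hypothetical illustration under the talk's description, not Replit's actual infrastructure:

```python
import random
from dataclasses import dataclass

# Hypothetical sketch of the two-pillar setup described in the talk.
# None of these names come from Replit's codebase.

@dataclass
class BenchResult:
    prompt: str
    passed: bool  # end-to-end: did the generated app actually work?

def offline_gate(results: list[BenchResult], min_pass_rate: float = 0.9) -> bool:
    """Binary ship/no-ship gate over VibeBench-style end-to-end results."""
    pass_rate = sum(r.passed for r in results) / len(results)
    return pass_rate >= min_pass_rate

def publish_rate(sessions: list[dict]) -> float:
    """Online signal: fraction of sessions where the user published the app."""
    return sum(s["published"] for s in sessions) / len(sessions)

# Offline pillar: run the candidate agent against the 20 benchmark apps first.
candidate_results = [BenchResult(f"app-{i}", random.random() > 0.05) for i in range(20)]

if offline_gate(candidate_results):
    # Online pillar: A/B test the candidate against the control on live
    # traffic and compare publish rates before a full rollout.
    control = [{"published": random.random() < 0.30} for _ in range(10_000)]
    treatment = [{"published": random.random() < 0.33} for _ in range(10_000)]
    print("control:", publish_rate(control), "treatment:", publish_rate(treatment))
```

The design point the talk emphasizes is the ordering: the offline gate is a cheap, binary filter, and only candidates that clear it earn exposure to live traffic for the A/B comparison.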
A standout finding is that models degrade significantly when extending their own previously generated code, a failure mode Catasta calls the “slop on slop” or “vibe bomb” scenario, which makes intermediate testing checkpoints essential. The talk offers a practical template for any agent builder looking to stand up rigorous, production-grounded evaluation infrastructure.
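The checkpoint idea lends itself to a similar sketch: verify the agent's output at every iteration rather than only at the end, so a broken intermediate state is never built upon. Again, `iterate_with_checkpoints` and the toy syntax-check verifier are assumptions for illustration; the talk does not prescribe an implementation:

```python
import ast
from typing import Callable

def iterate_with_checkpoints(
    agent_step: Callable[[str], str],
    verify: Callable[[str], bool],
    prompt: str,
    max_iterations: int = 5,
) -> str:
    """Extend code iteratively, verifying at every checkpoint so the model
    never extends a broken intermediate state ('slop on slop')."""
    code = ""
    for _ in range(max_iterations):
        candidate = agent_step(f"{prompt}\n\nCurrent code:\n{code}")
        if not verify(candidate):
            # Checkpoint failed: discard this step instead of building on it.
            continue
        code = candidate
    return code

# Toy demo with a stubbed agent and a syntax-check verifier.
def toy_agent(p: str) -> str:
    return "print('hello')"

def syntax_ok(code: str) -> bool:
    try:
        ast.parse(code)
        return True
    except SyntaxError:
        return False

print(iterate_with_checkpoints(toy_agent, syntax_ok, "build a greeting app"))
```

In a real agent the verifier would be far richer (builds, tests, runtime probes), but the structural point stands: checking between iterations bounds how far compounding errors can propagate.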
📺 Source: Claude · Published May 08, 2026
🏷️ Format: Deep Dive
