Description:
Michele Catasta, president and head of AI at Replit, presents a detailed account of how his team evaluates and iteratively improves the Replit Agent at production scale. The talk argues that standard AI benchmarks like SWE-bench and HumanEval are insufficient for Replit’s use case, where users expect a complete working application from a natural language prompt alone: no framework preferences, no tests written, no scaffolding provided. This extreme vibe-coding scenario demands a fundamentally different evaluation philosophy.
Replit’s approach rests on two pillars. The first is offline evaluation using VibeBench, an open-source benchmark built around 20 real applications with end-to-end prompts, which Replit is actively inviting the community to extend with harder prompts and broader coverage. These benchmarks act as a binary gate before shipping new agent versions. The second pillar is online evaluation driven by millions of live production sessions daily, instrumented with span monitoring, user sentiment analysis on incoming prompts, and publish rates as a strong positive signal of agent quality. A/B testing ties the two pillars together and is described as the single most reliable mechanism for steady, honest progress.
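To make the two-pillar structure concrete, here is a minimal Python sketch of how an offline gate and an online publish-rate comparison could fit together. Every name here (`BenchResult`, `offline_gate`, `publish_rate`) and every threshold is a hypothetical illustration under the talk's description, not Replit's actual infrastructure:

```python
import random
from dataclasses import dataclass

# Hypothetical sketch of the two-pillar setup described in the talk.
# None of these names come from Replit's codebase.

@dataclass
class BenchResult:
    prompt: str
    passed: bool  # end-to-end: did the generated app actually work?

def offline_gate(results: list[BenchResult], min_pass_rate: float = 0.9) -> bool:
    """Binary ship/no-ship gate over VibeBench-style end-to-end results."""
    pass_rate = sum(r.passed for r in results) / len(results)
    return pass_rate >= min_pass_rate

def publish_rate(sessions: list[dict]) -> float:
    """Online signal: fraction of sessions where the user published the app."""
    return sum(s["published"] for s in sessions) / len(sessions)

# Offline pillar: run the candidate agent against the 20 benchmark apps first.
candidate_results = [BenchResult(f"app-{i}", random.random() > 0.05) for i in range(20)]

if offline_gate(candidate_results):
    # Online pillar: A/B test the candidate against the control on live
    # traffic and compare publish rates before a full rollout.
    control = [{"published": random.random() < 0.30} for _ in range(10_000)]
    treatment = [{"published": random.random() < 0.33} for _ in range(10_000)]
    print("control:", publish_rate(control), "treatment:", publish_rate(treatment))
```

The design point the talk emphasizes is the ordering: the offline gate is a cheap, binary filter, and only candidates that clear it earn exposure to live traffic for the A/B comparison.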
A standout finding is that models degrade significantly when extending their own previously generated code, a failure mode Catasta calls the “slop on slop” or “vibe bomb” scenario, which makes intermediate testing checkpoints essential. The talk offers a practical template for any agent builder looking to stand up rigorous, production-grounded evaluation infrastructure.
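The checkpoint idea lends itself to a similar sketch: verify the agent's output at every iteration rather than only at the end, so a broken intermediate state is never built upon. Again, `iterate_with_checkpoints` and the toy syntax-check verifier are assumptions for illustration; the talk does not prescribe an implementation:

```python
import ast
from typing import Callable

def iterate_with_checkpoints(
    agent_step: Callable[[str], str],
    verify: Callable[[str], bool],
    prompt: str,
    max_iterations: int = 5,
) -> str:
    """Extend code iteratively, verifying at every checkpoint so the model
    never extends a broken intermediate state ('slop on slop')."""
    code = ""
    for _ in range(max_iterations):
        candidate = agent_step(f"{prompt}\n\nCurrent code:\n{code}")
        if not verify(candidate):
            # Checkpoint failed: discard this step instead of building on it.
            continue
        code = candidate
    return code

# Toy demo with a stubbed agent and a syntax-check verifier.
def toy_agent(p: str) -> str:
    return "print('hello')"

def syntax_ok(code: str) -> bool:
    try:
        ast.parse(code)
        return True
    except SyntaxError:
        return False

print(iterate_with_checkpoints(toy_agent, syntax_ok, "build a greeting app"))
```

In a real agent the verifier would be far richer (builds, tests, runtime probes), but the structural point stands: checking between iterations bounds how far compounding errors can propagate.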
📺 Source: Claude · Published May 08, 2026
🏷️ Format: Deep Dive
