Why building eval platforms is hard — Phil Hetzel, Braintrust

Description:

Phil Hetzel, head of solutions engineering at Braintrust, delivers a structured conference talk on the practical challenges of building evaluation platforms for AI agents, drawing on his vantage point across Braintrust’s entire customer base and twelve years of prior consulting work at KPMG and Slalom. The core argument: without rigorous evals, shipping AI agents to production carries serious brand, compliance, and operational risk, and most teams plateau at a stage of eval maturity that produces reports rather than enables real iteration.

Hetzel maps out a progression of eval platform stages: spreadsheet-based logging (fast to start, impossible to scale), custom-coded UIs backed by a lightweight hosted database such as Neon (more accessible, but still anchored in reporting), and finally true experimentation environments where both technical and non-technical stakeholders can compare agent configurations side by side and surface scored results. At each stage he identifies the specific bottleneck that prevents teams from moving faster, and he is candid about where the “vibe-coded eval UI” phase breaks down in practice.
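
To make the final “experimentation environment” stage concrete, here is a minimal sketch in plain Python of what it means to run two agent configurations over a shared eval dataset and surface scored results side by side. Every name here (run_agent, exact_match, the configs and dataset) is illustrative, not Braintrust’s API or anything from the talk:

```python
# Hypothetical sketch of the "experimentation" stage: run two agent
# configurations over the same eval dataset and compare scored results.
# All names are illustrative; this is not Braintrust's API.

from statistics import mean


def run_agent(config: dict, question: str) -> str:
    # Placeholder for a real agent call (e.g. an LLM request).
    return f"[{config['model']}] answer to: {question}"


def exact_match(output: str, expected: str) -> float:
    # Toy scorer: 1.0 on exact match, 0.0 otherwise.
    return 1.0 if output.strip() == expected.strip() else 0.0


dataset = [
    {"input": "What is 2 + 2?", "expected": "4"},
    {"input": "Capital of France?", "expected": "Paris"},
]

configs = {
    "baseline": {"model": "model-a", "temperature": 0.0},
    "candidate": {"model": "model-b", "temperature": 0.2},
}

# Score every configuration on every case so results line up row by row.
for name, config in configs.items():
    scores = [
        exact_match(run_agent(config, case["input"]), case["expected"])
        for case in dataset
    ]
    print(f"{name}: mean score = {mean(scores):.2f}")
```

A real experimentation environment adds exactly what this sketch lacks: persistence of every run, a UI over the row-by-row comparison, and scorers that non-engineers can author and review.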

A recurring theme is that evals are fundamentally a team sport. Domain experts and product stakeholders who are close to users carry knowledge that engineers can’t replicate, and the best eval platforms are designed to bring those voices in without requiring them to touch a spreadsheet or an SDK. The talk is a practical roadmap for any AI team navigating the gap between a working proof of concept and confident production deployment, with Braintrust’s platform serving as one concrete example of the destination.
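
As a point of reference for the SDK-driven destination the talk describes, Braintrust’s published quickstart pattern looks roughly like the following; the project name, data, and task are placeholders, and exact signatures should be checked against the current docs:

```python
# Sketch based on Braintrust's published quickstart pattern; verify
# details against current documentation before relying on it.

from braintrust import Eval
from autoevals import Levenshtein

Eval(
    "Say Hi Bot",  # placeholder project name
    data=lambda: [{"input": "Foo", "expected": "Hi Foo"}],  # eval cases
    task=lambda input: "Hi " + input,  # the "agent" under test
    scores=[Levenshtein],  # string-similarity scorer from autoevals
)
```

Per the docs, eval files like this are typically run through Braintrust’s CLI (`braintrust eval`), and each run lands as a scored experiment that non-technical stakeholders can compare in the UI rather than in a spreadsheet.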


📺 Source: AI Engineer · Published April 28, 2026
🏷️ Format: Deep Dive
