Description:
Aman Khan, AI Product Manager at Arize, delivers a workshop-style talk at AI Engineer on building a practical evaluation framework for AI product managers. Drawing on his background running evaluation systems for self-driving cars at Cruise, recommendation systems at Spotify, and now at Arize — a platform used by companies like Uber, Instacart, Reddit, and Duolingo — Khan guides attendees through the concepts and tooling needed to ship AI that actually works in production.
The session covers what evals are, why they matter, and how to move beyond manual spreadsheet-based annotation toward automated, scalable evaluation pipelines. Khan walks through building a multi-agent AI trip planner live, then demonstrates evaluating it using the Arize platform — running A/B prompt comparisons across a dataset of 12 examples using an LLM-as-judge approach. He introduces prompt versioning in Arize as a GitHub-style store for prompt iterations, allowing teams to track and deploy specific prompt versions through code.
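The A/B, LLM-as-judge comparison described above is straightforward to sketch. The snippet below is a minimal, generic illustration of the pattern, not Arize's SDK: the judge model, judge prompt, dataset fields, and outputs are all illustrative assumptions.

```python
# Minimal LLM-as-judge A/B comparison, assuming two prompt versions have
# already produced outputs for each example. NOT Arize's API; the model
# name, prompt, and dataset shape are illustrative assumptions.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

JUDGE_PROMPT = """You are judging two trip-planner responses to the same request.
Request: {request}
Response A: {response_a}
Response B: {response_b}
Reply with exactly one letter: A if Response A is better, B otherwise."""

def judge(request: str, response_a: str, response_b: str) -> str:
    """Ask the judge model which prompt version produced the better answer."""
    completion = client.chat.completions.create(
        model="gpt-4o-mini",  # assumed judge model
        messages=[{
            "role": "user",
            "content": JUDGE_PROMPT.format(
                request=request, response_a=response_a, response_b=response_b
            ),
        }],
    )
    return completion.choices[0].message.content.strip()

# A 12-example dataset like the one in the demo; outputs elided here.
dataset = [
    {"request": "Plan a 3-day food tour of Tokyo", "v1": "...", "v2": "..."},
    # ... 11 more examples
]

wins = {"A": 0, "B": 0}
for example in dataset:
    verdict = judge(example["request"], example["v1"], example["v2"])
    if verdict in wins:
        wins[verdict] += 1

print(f"prompt v1 wins: {wins['A']}  |  prompt v2 wins: {wins['B']}")
```

A production pipeline would also randomize which output is shown as Response A, since LLM judges are known to favor responses by position.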
Key takeaways include treating evals as human annotation scaled up rather than replaced, the importance of observability for agent-based applications, and a practical Dunning-Kruger framing for where AI PMs sit on the learning curve. The talk is especially relevant for product teams at companies beginning to move from prototype to production AI systems.
📺 Source: AI Engineer · Published December 26, 2025
🏷️ Format: Tutorial Demo
