Description:
Samraj Moorjani, a Databricks engineer who has spent the past two years focused on MLflow and agent quality, opened his AI Dev 25 NYC talk with a pointed question: if you wouldn’t ship untested software to production, why are teams shipping untested AI agents? The session is a ground-level guide to using MLflow, Databricks’ open-source GenAI platform, to apply software engineering rigor to the problem of agent reliability.
Moorjani identifies the specific ways agent QA differs from traditional software testing: non-deterministic outputs, unpredictable user behavior, domain expertise that lives outside the engineering team, and the three-way tradeoff between cost, latency, and quality. His solution architecture rests on two interlocking pillars. The first is MLflow’s tracing capability, which provides step-by-step observability into agent execution and is the prerequisite for everything else (a minimal sketch follows below). The second is a two-tier evaluation system: offline regression suites built from real production traces and human-labeled examples, plus online production monitors powered by LLM judges that scale expert feedback.
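To make the tracing pillar concrete, here is a minimal sketch using MLflow’s tracing API (`@mlflow.trace` and `mlflow.start_span`). The agent itself is a stand-in: `lookup_policy`, the question, and the hard-coded responses are invented for illustration; in a real agent the tool would hit a retriever and the inner span would wrap an actual LLM call.

```python
import mlflow
from mlflow.entities import SpanType

# Hypothetical tool step; only the MLflow tracing calls are real API.
@mlflow.trace(span_type=SpanType.TOOL)
def lookup_policy(query: str) -> str:
    # Stand-in for a retrieval call against a document store.
    return f"Policy text relevant to: {query}"

@mlflow.trace(span_type=SpanType.AGENT)
def answer(question: str) -> str:
    context = lookup_policy(question)
    # Wrap the model call in an explicit span so its inputs and outputs
    # show up as one step in the trace tree.
    with mlflow.start_span(name="llm_call", span_type=SpanType.LLM) as span:
        span.set_inputs({"question": question, "context": context})
        response = f"Answer grounded in: {context}"  # stand-in for an LLM call
        span.set_outputs(response)
    return response

answer("What is the refund window?")
```

Each invocation records a trace, a tree of spans (agent → tool → LLM) with inputs, outputs, and latency per step, which is exactly the raw material the offline regression suites and online monitors consume.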
The talk walks through the managed Databricks version of MLflow, where evaluation datasets are backed by Unity Catalog for enterprise governance—fine-grained access controls and lineage tracking included by default. A particularly practical segment covers model-swap decisions: rather than guessing whether switching to a cheaper or newer model will degrade quality, MLflow’s eval diffing makes the tradeoff visible across a versioned dataset. Teams working on production agents in regulated or high-stakes domains will find the quality-assurance framework directly applicable.
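The sketch below shows the shape of such a model-swap comparison using the open-source `mlflow.evaluate` API: run the same suite once per candidate model, each as its own MLflow run, then diff the metrics. The managed, Unity Catalog-backed datasets and built-in LLM judges from the talk are approximated here with a hand-rolled pandas dataset and the stock question-answering evaluator; the dataset rows, model names, and `make_predict_fn` wrapper are all invented for illustration.

```python
import mlflow
import pandas as pd

# Hypothetical regression suite; in the managed setup this would be a
# versioned, Unity Catalog-backed evaluation dataset.
eval_data = pd.DataFrame({
    "inputs": ["What is the refund window?", "How do I reset my password?"],
    "ground_truth": ["30 days from delivery.", "Use the account settings page."],
})

def make_predict_fn(model_name: str):
    """Stand-in for wiring the agent to a given model; names are illustrative."""
    def predict(df: pd.DataFrame) -> pd.Series:
        return df["inputs"].map(lambda q: f"[{model_name}] answer to: {q}")
    return predict

# One MLflow run per candidate model, so the two runs can be
# compared side by side in the MLflow UI.
for model_name in ["current-model", "cheaper-model"]:
    with mlflow.start_run(run_name=f"eval-{model_name}"):
        results = mlflow.evaluate(
            model=make_predict_fn(model_name),
            data=eval_data,
            targets="ground_truth",
            model_type="question-answering",
        )
        print(model_name, results.metrics)
```

Because both runs score the same versioned dataset, the per-metric deltas answer the cost-versus-quality question directly instead of leaving it to intuition.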
📺 Source: DeepLearningAI · Published December 05, 2025
🏷️ Format: Deep Dive
