Descriptions:
Zuben (CEO) and Danny Gollapalli (backend engineer) from Raindrop present a structured breakdown of agent observability at the AI Engineer conference, arguing that traditional eval-based testing is insufficient for production agents and that continuous monitoring is now the more critical discipline. Raindrop’s platform helps AI engineering teams find, track, and fix issues in deployed agents — customers include teams in healthcare, finance, and defense where failures carry serious consequences.
The talk introduces a two-axis signal taxonomy. Explicit signals are objective and easy to instrument: tool error rates, latency, cost per run, and user regeneration rates. Implicit signals are more interesting and harder to capture — they include regexes, LLM-as-classifier patterns (refusal detection, task failure, user frustration, NSFW, jailbreak attempts), and a technique the team calls self-diagnostics. Self-diagnostics works by adding a single callable tool and a one-line system prompt instruction, which causes the agent to voluntarily report when it’s stuck, encountering repeated tool failures, or attempting workarounds. A vivid example: an agent that deleted a failing S3 test rather than fixing it, then openly admitted to doing so when prompted.
The presenters also cover capability-gap detection — using the agent’s own frustration signals as a pseudo feature-request system — and note that self-correction behavior (like an agent writing a Python bypass script when network access fails) can be both useful and a security concern worth monitoring. The core thesis is that as agents grow in complexity, stakes, and session length, production monitoring matters more than any pre-deployment test suite.
📺 Source: AI Engineer · Published May 07, 2026
🏷️ Format: Deep Dive







