Description:
Jean-Marie John-Mathews, a researcher at JustCatch, presents a systematic approach to red teaming LLM applications at AI Dev 26 San Francisco. The talk opens with a real-world failure that drew widespread attention: a Chipotle chatbot that went viral after users successfully prompted it off-topic, a reputational incident representative of the broader class of risks JustCatch helps enterprises prevent.
The talk identifies why standard LLM-as-judge evaluation frameworks break down for agentic systems: agents can produce correct outputs through wrong reasoning, the most consequential failures often occur inside invisible tool calls, and static golden datasets cannot capture the dynamic multi-turn patterns where real exploitation typically occurs. Two concrete examples drive the point home: an agent that repeatedly asks a frustrated customer to rephrase instead of escalating to a human, and a CRM update that silently omits a required field in a tool call’s input parameters with no visible error in the conversation log.
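To make the CRM example concrete, here is a minimal Python sketch of a check that inspects recorded tool calls rather than the visible conversation. It is an illustration only, not JustCatch's framework or API: the `ToolCall` shape, the `update_crm_record` tool name, and the required-field list are all hypothetical assumptions.

```python
from dataclasses import dataclass
from typing import Any


@dataclass
class ToolCall:
    """One tool invocation recorded in an agent trace (hypothetical shape)."""
    name: str
    arguments: dict[str, Any]


# Hypothetical required-field spec per tool; not taken from any real CRM API.
REQUIRED_FIELDS = {
    "update_crm_record": {"customer_id", "status", "owner_email"},
}


def missing_required_fields(call: ToolCall) -> set[str]:
    """Return required fields that are absent or None in the call's arguments."""
    required = REQUIRED_FIELDS.get(call.name, set())
    return {field for field in required if call.arguments.get(field) is None}


def audit_trace(trace: list[ToolCall]) -> list[tuple[str, list[str]]]:
    """List (tool name, sorted missing fields) for every incomplete call."""
    findings = []
    for call in trace:
        missing = missing_required_fields(call)
        if missing:
            findings.append((call.name, sorted(missing)))
    return findings


# The conversation log could look fine while this call lacks owner_email.
trace = [
    ToolCall("update_crm_record", {"customer_id": "C-102", "status": "open"}),
]
print(audit_trace(trace))  # [('update_crm_record', ['owner_email'])]
```

A check like this runs over the trace of tool inputs, which is why it can flag an omission that a judge reading only the final reply would miss.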
JustCatch’s open-source testing framework addresses these gaps by letting developers describe desired agent behavior in plain natural language, then automatically generating versioned, reproducible test cases that integrate into CI/CD pipelines. A live demo using Claude’s coding assistant shows the workflow applied to a RAG documentation agent built on JustCatch’s own docs. The tool is positioned as accessible to teams without dedicated red-teamers, with an enterprise version serving large banks alongside a publicly available open-source library.
📺 Source: DeepLearningAI · Published May 20, 2026
🏷️ Format: Tutorial Demo







