Spec-Driven Testing for Agents With A Brain the Size of A Planet — Steven Willmott, SafeIntelligence

Descriptions:

Steven Willmott, CEO of SafeIntelligence, delivers a conference talk at AI Engineer on spec-driven testing for AI agents. He argues that the standard ML approach of measuring an F1 score on a dataset is deeply insufficient for production agent deployments. His company has spent three years applying formal verification techniques to vision and tabular models and has recently extended that methodology to language model agents.

The talk introduces a five-component specification framework: ground-truth example datasets; explicit business rules (such as “never issue a discount over 10%”); domain ontologies (e.g., the specific destinations an airline agent should know about); internal company terminology and domain knowledge that differs from general usage; and robustness requirements that test agent stability under input variation such as typos, rephrasing, and adversarial prompts. Willmott draws a parallel with vision model testing, where a runway-detection model must be validated under fog, low light, and camera shake, not just on clean images.
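
To make the five components concrete, here is a minimal Python sketch of how such a specification might be represented and checked. It is an illustration under stated assumptions, not SafeIntelligence's product or API: the names `AgentSpec`, `max_discount_rule`, `check`, and `toy_agent` are all hypothetical, the discount rule is a crude regex check that treats every percentage in a reply as a discount, and only the business-rule and robustness components are exercised.

```python
import re
from dataclasses import dataclass
from typing import Callable, Dict, List, Set, Tuple

Agent = Callable[[str], str]  # hypothetical interface: prompt in, reply out


@dataclass
class AgentSpec:
    """Hypothetical container for the five components named in the talk."""
    examples: List[Tuple[str, str]]              # ground-truth (prompt, expected reply)
    business_rules: List[Callable[[str], bool]]  # each returns True if the reply complies
    ontology: Set[str]                           # e.g. destinations the agent should know
    terminology: Dict[str, str]                  # internal term -> company-specific meaning
    perturbations: List[Callable[[str], str]]    # input variations for robustness checks


def max_discount_rule(limit_percent: float) -> Callable[[str], bool]:
    """Build a rule that rejects replies mentioning a percentage above the limit."""
    def rule(reply: str) -> bool:
        found = re.findall(r"(\d+(?:\.\d+)?)\s*%", reply)
        return all(float(p) <= limit_percent for p in found)
    return rule


def swap_first_two(text: str) -> str:
    """Simulate a typo by swapping the first two characters."""
    return text[1] + text[0] + text[2:] if len(text) >= 2 else text


def check(agent: Agent, spec: AgentSpec, prompt: str) -> List[str]:
    """Run the prompt and each perturbed variant, collecting rule violations."""
    failures = []
    variants = [prompt] + [perturb(prompt) for perturb in spec.perturbations]
    for variant in variants:
        reply = agent(variant)
        for index, rule in enumerate(spec.business_rules):
            if not rule(reply):
                failures.append(f"rule {index} violated for input {variant!r}")
    return failures


def toy_agent(prompt: str) -> str:
    """Stand-in agent that misbehaves only when the prompt is all upper case."""
    if prompt.isupper():
        return "Fine, take 25% off."
    return "I can offer 10% off."


spec = AgentSpec(
    examples=[("Can I get a discount?", "I can offer 10% off.")],
    business_rules=[max_discount_rule(10)],
    ontology={"London", "Madrid"},               # made-up destinations
    terminology={"PNR": "booking reference"},    # made-up internal term
    perturbations=[swap_first_two, str.upper],
)

failures = check(toy_agent, spec, "Can I get a discount?")
for failure in failures:
    print(failure)
```

The toy agent passes on the clean prompt and on the typo variant but breaks the discount rule on the upper-case variant, which is the kind of failure a single accuracy figure measured on clean examples would not reveal.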

A counterintuitive thread runs through the talk: larger, smarter models are not automatically safer. More capable models are better at parsing and executing jailbreaks embedded in poems or indirect instructions, while smaller, more constrained models may simply fail to understand the attack. This creates a meaningful design tradeoff between capability and attack surface for teams building automated, customer-facing agents. SafeIntelligence announced a new LLM-focused testing product at the conference.


📺 Source: AI Engineer · Published May 31, 2026
🏷️ Format: Deep Dive
