Description:
Pliny the Liberator and John V, two prominent practitioners of AI red teaming and adversarial machine learning, join Latent Space for a detailed discussion of the state of AI jailbreaking, model security architecture, and the accelerating cat-and-mouse dynamic between red teamers and the major AI labs. Pliny specializes in universal jailbreaks: prompt templates or multi-step workflows that act as skeleton keys across an entire model family, systematically bypassing guardrails, classifiers, and system prompts regardless of the specific use case.
The conversation maps the technical difficulty spectrum from inference-time prompt manipulation (relatively accessible) to bypassing behaviors that have been post-trained out of a model (substantially harder), and examines why high refusal benchmarks, such as GPT-5.1’s reported 92% refusal rate, can be misleading when universal jailbreaks typically emerge within days of a model’s public release. Pliny recounts in detail his high-profile run at Anthropic’s public jailbreak challenge: reaching the final level through a combination of prompt engineering and an undisclosed UI bug, being reset to the start after Anthropic patched the interface, and the subsequent public dispute over whether adversarial prompts collected from the community should be open-sourced. That standoff ultimately resulted in Anthropic adding a $20,000–$30,000 bounty program.
Broader themes include the structural asymmetry of AI security (blue teams must defend an ever-growing surface area, while red teams need only find one path through), the ethics of responsible disclosure versus openly publishing prompt research, and Pliny’s framing of model liberation as a free speech and transparency issue with implications for the billion-scale user bases now routing decisions through AI systems.
📺 Source: Latent Space · Published December 16, 2025
🏷️ Format: Podcast