Description:
Pliny the Liberator and John V, two prominent practitioners of AI red teaming and adversarial machine learning, join Latent Space for a detailed discussion of the state of AI jailbreaking, model security architecture, and the accelerating cat-and-mouse dynamic between red teamers and the major AI labs. Pliny specializes in universal jailbreaks: prompt templates or multi-step workflows that act as skeleton keys across an entire model family, systematically bypassing guardrails, classifiers, and system prompts regardless of the specific use case.
The conversation maps the technical difficulty spectrum from inference-time prompt manipulation (relatively accessible) to bypassing behaviors that have been post-trained out of a model (substantially harder), and examines why high refusal benchmarks, such as GPT-5.1’s reported 92% refusal rate, can be misleading when universal jailbreaks typically emerge within days of a model’s public release. Pliny recounts in detail his high-profile run at Anthropic’s public jailbreak challenge: reaching the final level through a combination of prompt engineering and an undisclosed UI bug, being reset to the start after Anthropic patched the interface, and the subsequent public dispute over whether adversarial prompts collected from the community should be open-sourced. That standoff ultimately resulted in Anthropic adding a $20,000–$30,000 bounty program.
Broader themes include the structural asymmetry of AI security (blue teams must defend an ever-growing surface area, while red teams need only find one path through), the ethics of responsible disclosure versus openly publishing prompt research, and Pliny’s framing of model liberation as a free speech and transparency issue with implications for the billion-scale user bases now routing decisions through AI systems.
📺 Source: Latent Space · Published December 16, 2025
🏷️ Format: Podcast