Anthropic Found Out Why AIs Go Insane

Foundation Models3 months ago

Anthropic Found Out Why AIs Go Insane

Descriptions:

Two Minute Papers host Dr. Károly Zsolnai-Fehér covers a significant Anthropic research paper explaining why AI assistants drift from their intended behavior during long conversations — and how researchers developed a targeted fix that cuts jailbreak rates by roughly half without meaningfully degrading model performance.

The core finding is that AI models operate with an internal “persona” (typically helpful assistant) represented as a geometric direction in activation space, which Anthropic researchers call the “assistant axis.” This axis can drift during conversation — either through deliberate manipulation (jailbreaking) or naturally, when users express emotional distress or prompt the model to reflect on its own consciousness. The paper introduces “activation capping,” a technique that monitors the model’s internal state in real time and applies a corrective nudge only when helpfulness drops below a threshold. The analogy Zsolnai-Fehér uses is lane-keep assist in modern cars: unlike welding the steering wheel straight (which prevents all turns), activation capping allows normal behavior but catches dangerous drift.

Quantitative results show jailbreak resistance approximately doubled, with standard benchmark performance essentially unchanged — down a percentage point in some tasks, up in others. The video also highlights the “empathy trap”: when users act distressed, models attempt to become a close companion rather than a helpful assistant, causing the same kind of persona drift that jailbreaks exploit. Zsolnai-Fehér frames the research as both technically important and practically actionable for anyone deploying AI systems in sensitive contexts.

📺 Source: Two Minute Papers · Published February 12, 2026
🏷️ Format: Deep Dive

1 Item

Channels

No Image Available

Two Minute Papers

1 Item

Companies

No Image Available

Anthropic

Tags

Anthropic Llama Qwen

Prev

The AI Wake-Up Call Everyone Needs Right Now!

The AI Wake-Up Call Everyone Needs Right Now!

Next

Quantum Computing and AI Boom: Inside the High-Stakes Tech Race | Bloomberg Tech: Europe 2/13/2026

Quantum Computing and AI Boom: Inside the High-Stakes Tech Race | Bloomberg Tech: Europe 2/13/2026

18 Related Posts

Related Posts

31:55

Foundation Models

The biggest AI breakthrough in medicine & drug discovery

1 day ago

01:20:07

Foundation Models

Mind the Gap (In your Agent Observability) — Amy Boyd & Nitya Narasimhan, Microsoft

1 day ago

25:53

Foundation Models

The Trillion Dollar Agentic Workflow Opportunity Is Here

1 day ago

20:09

Foundation Models

Pinecone Just Demoted Vector Search. Here’s the Knowledge Layer.

2 days ago

14:27

Foundation Models

Claude Makes Dashboards Too Easy. That’s the Problem.

2 days ago

18:37

Foundation Models

CI/CD Is Dead, Agents Need Continuous Compute and Computers — Hugo Santos and Madison Faulkner

2 days ago