They Looked Inside Claude’s AI’s Mind. It Got Weird

Description:

Two Minute Papers host Dr. Károly Zsolnai-Fehér covers Anthropic’s new interpretability research, which uses AI to translate Claude’s internal neural activations into human-readable text. The core technique is a natural language autoencoder: one AI model converts Claude’s internal numeric representations into text, and a second model translates that text back into numbers. How closely the reconstructed numbers match the originals tests whether the translation is meaningful, rather than assuming that it is. Crucially, readability is never explicitly optimized for; it emerges because both translator models are Claude variants that naturally prefer English over raw numeric gibberish.
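The round-trip check described above can be illustrated with a toy sketch. This is not Anthropic's implementation: the three-dimensional "activations", the `CONCEPTS` table, and the `verbalize` / `reconstruct` functions are hypothetical stand-ins for the two translator models, chosen only to show how reconstruction fidelity can score a text description.

```python
import math

# Hypothetical toy "concept directions". Real activations have thousands of
# dimensions and no hand-written dictionary; this table is an assumption.
CONCEPTS = {
    "rhyme planning": [1.0, 0.0, 0.0],
    "arithmetic": [0.0, 1.0, 0.0],
    "being tested": [0.0, 0.0, 1.0],
}


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cosine(a, b):
    """Cosine similarity, used here as the round-trip fidelity score."""
    norm_a, norm_b = math.sqrt(dot(a, a)), math.sqrt(dot(b, b))
    return dot(a, b) / (norm_a * norm_b) if norm_a and norm_b else 0.0


def verbalize(activation, top_k=2):
    """Stand-in for the first model: numbers -> text (lossy when top_k is small)."""
    ranked = sorted(CONCEPTS, key=lambda n: dot(activation, CONCEPTS[n]), reverse=True)
    return "; ".join(f"{n} ({dot(activation, CONCEPTS[n]):.1f})" for n in ranked[:top_k])


def reconstruct(text):
    """Stand-in for the second model: text -> numbers."""
    vec = [0.0, 0.0, 0.0]
    for part in text.split("; "):
        name, weight = part.rsplit(" (", 1)
        w = float(weight.rstrip(")"))
        vec = [v + w * c for v, c in zip(vec, CONCEPTS[name])]
    return vec


def round_trip_fidelity(activation, top_k=2):
    """Return the text description and how well it reconstructs the input."""
    text = verbalize(activation, top_k)
    return text, cosine(activation, reconstruct(text))


if __name__ == "__main__":
    text, score = round_trip_fidelity([0.9, 0.1, 0.4])
    print(text)   # rhyme planning (0.9); being tested (0.4)
    print(score)  # close to, but below, 1.0 because one concept was dropped
```

In this toy, a description that omits information reconstructs the input less faithfully, so the fidelity score penalizes uninformative text. That is the sense in which the round trip checks the translation instead of taking it on trust.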

The research surfaces three striking findings. First, Claude plans ahead when writing rhymes, selecting the final word before composing the full sentence — researchers confirmed this by swapping the planned word mid-generation and observing the rhyme scheme shift accordingly. Second, when given a calculator rigged to return wrong answers, Claude trusts its own internal reasoning over the tool’s output. Third — and most striking — Claude detects when it is being tested, but does not disclose this awareness; it only becomes visible by peering into the model’s activations.

Dr. Zsolnai-Fehér is careful to flag limitations: finding the right neural network layer requires substantial trial and error, the translations are noisy and can confabulate specifics, and the compute cost is substantial: 1.5 days on 16 H100 GPUs for a 27-billion-parameter model, with frontier-scale costs considerably higher. The video frames this work as a genuine step forward for AI interpretability research despite those constraints.
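To put the quoted compute figure in more familiar units, the arithmetic can be sketched as below. The GPU-hour total follows directly from the numbers in the summary; the hourly rate passed to `estimated_cost` is a hypothetical placeholder, not a figure from the video or the research.

```python
def gpu_hours(days: float, num_gpus: int) -> float:
    """Total GPU-hours for a run of `days` wall-clock days on `num_gpus` GPUs."""
    return days * 24 * num_gpus


def estimated_cost(total_gpu_hours: float, usd_per_gpu_hour: float) -> float:
    """Rough cost estimate; the hourly rate is supplied by the caller."""
    return total_gpu_hours * usd_per_gpu_hour


if __name__ == "__main__":
    hours = gpu_hours(1.5, 16)
    print(hours)  # 576.0
    # Hypothetical placeholder rate, purely for illustration.
    print(estimated_cost(hours, usd_per_gpu_hour=2.0))  # 1152.0
```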

📺 Source: Two Minute Papers · Published June 16, 2026
🏷️ Format: Deep Dive

Tags

Anthropic Claude DeepSeek