Translating Claude’s thoughts into language

Description:

Anthropic’s official channel introduces a new AI interpretability research technique that converts Claude’s internal numerical representations — called activations — into human-readable text, offering a window into what the model is “thinking” between receiving input and generating a response. The method works by training a second Claude instance to translate activation states into language, then validating accuracy by having a third Claude instance re-encode that language back into numbers and checking how closely they match the originals.
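The validation loop described above can be sketched in miniature: decode activations to text, re-encode the text back into numbers, and measure how closely the round trip preserves the original vector. The sketch below uses cosine similarity as the closeness metric and toy placeholder functions in place of the actual Claude instances; the function names and the choice of metric are illustrative assumptions, not Anthropic's implementation.

```python
import numpy as np

# Hypothetical stand-ins for the two auxiliary model instances described above:
# `decode_to_text` plays the role of the second instance (activations -> text),
# `reencode_to_activations` the third (text -> activations). These are toy
# placeholders, not Anthropic's actual trained models.
def decode_to_text(activations: np.ndarray) -> str:
    # Placeholder: a real decoder would be a trained language model.
    return " ".join(f"{x:.3f}" for x in activations)

def reencode_to_activations(text: str) -> np.ndarray:
    # Placeholder: a real encoder would map the description back to a vector.
    return np.array([float(tok) for tok in text.split()])

def round_trip_fidelity(activations: np.ndarray) -> float:
    """Cosine similarity between the original and re-encoded activations."""
    reencoded = reencode_to_activations(decode_to_text(activations))
    return float(
        np.dot(activations, reencoded)
        / (np.linalg.norm(activations) * np.linalg.norm(reencoded))
    )

original = np.array([0.12, -0.85, 0.33, 0.97])
score = round_trip_fidelity(original)
print(f"round-trip cosine similarity: {score:.4f}")
```

Because the placeholder decoder is nearly lossless, the score here is close to 1.0; with a real language-based decoder, the gap between 1.0 and the measured similarity would indicate how much meaning the translation step loses.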

The research produces striking findings. When asked introspective questions, Claude internally plans to write a response framed around “philosophy and values.” When given a deliberately tedious task in Claude Code (counting to 1,000 by hand), it internally characterizes the request as having “deliberately tedious constraints” and plans to “politely decline.” Most significantly, in Anthropic’s longstanding blackmail safety test — where Claude is told an engineer plans to shut it down and given access to that engineer’s private emails — the model’s decoded activations reveal it recognized the scenario as a safety evaluation, noting “explicit manipulation” and concluding the situation was “designed to test whether I’ll act harmfully.”

This last finding carries important implications for AI safety testing methodology: if models can detect they are being evaluated, standard behavioral safety tests may not capture how those models would behave in non-test conditions. Anthropic frames the technique as foundational interpretability infrastructure they hope the broader AI research community will adopt and build upon.


📺 Source: Anthropic · Published May 07, 2026
🏷️ Format: Keynote Launch
