[State of MechInterp] SAEs in Production, Circuit Tracing, AI4Science, “Pragmatic” Interp — Goodfire

Description:

Mark and Jack from Goodfire, an AI interpretability research company, join Latent Space’s year-end State of the Field series to survey where mechanistic interpretability stands as both a research discipline and a production engineering tool. Jack, a recent PhD graduate who shifted from language model grounding research to interpretability, and Mark, who came from Palantir’s healthcare team, represent Goodfire’s dual focus: foundational research and applied platform development.

The most concrete production example is a deployment with partner Rakuten: rather than using an LLM-as-judge to detect personally identifiable information in customer-agent chat transcripts, Goodfire routes transcripts through a “sidecar model” and monitors when PII-related features activate in the model’s internal representations. The result is recall equivalent to GPT-5-as-judge at roughly 500 times lower cost — a compelling demonstration that interpretability techniques can beat prompt-based approaches on both quality and economics in the right setting.
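The mechanics of this approach can be sketched in miniature. The sketch below is purely illustrative and assumes nothing about Goodfire's actual API: `encode_activations`, `PII_FEATURE_IDS`, and the threshold are all hypothetical stand-ins for running a sidecar model and projecting its activations through a trained sparse autoencoder, then flagging a transcript whenever known PII-related features fire above a cutoff.

```python
import numpy as np

# Hypothetical sketch of feature-based PII detection. None of these names
# come from Goodfire; they illustrate the general shape of the technique.
PII_FEATURE_IDS = [1024, 2048, 4096]  # assumed SAE features tied to PII concepts
THRESHOLD = 0.5                       # assumed activation cutoff

def encode_activations(transcript: str) -> np.ndarray:
    """Stand-in for the real pipeline: run the sidecar model on the
    transcript and encode its residual-stream activations with a trained
    sparse autoencoder. Here we fake a sparse feature vector: mostly
    near-zero entries, with the "PII features" firing when the toy
    keyword check trips."""
    rng = np.random.default_rng(abs(hash(transcript)) % (2**32))
    feats = rng.random(8192) * 0.1  # sparse-ish background activations
    if any(tok in transcript.lower() for tok in ("ssn", "email", "phone")):
        feats[PII_FEATURE_IDS] = 0.9  # pretend the PII features activate
    return feats

def contains_pii(transcript: str) -> bool:
    """Flag a transcript if any monitored feature exceeds the threshold."""
    feats = encode_activations(transcript)
    return bool((feats[PII_FEATURE_IDS] > THRESHOLD).any())

print(contains_pii("My SSN is 123-45-6789"))  # → True
print(contains_pii("The weather is nice"))    # → False
```

The economic argument in the episode follows from this structure: once the features are identified, each transcript costs one forward pass through a small sidecar model plus a threshold check, versus a full generation from a frontier LLM judge.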

The conversation also covers Goodfire’s paint.goodfire.ai demo (using unsupervised sparse autoencoder features to enable direct concept-space painting inside Stable Diffusion XL Turbo), Anthropic’s circuit tracing paper, the science of how models memorize training data and what that means for privacy, and early work applying interpretability to narrowly superhuman scientific models in genomics, proteomics, and materials science — domains where the models are superhuman but completely opaque, making interpretability tools uniquely valuable.


📺 Source: Latent Space · Published December 31, 2025
🏷️ Format: Deep Dive