2026-05-19

Anthropic’s Natural Language Autoencoders Put Claude’s “Thoughts” Into Words

For Claude and Claude Code builders, this is one of the more interesting interpretability announcements Anthropic has made in a while. The big idea is simple but pretty wild: instead of reducing activations to opaque scores or feature vectors, Anthropic is trying to turn them into readable natural language explanations that you can inspect directly.

Key Points

Anthropic introduced Natural Language Autoencoders (NLAs), a method that turns model activations into text explanations and then tries to reconstruct the original activation from that text.
The setup uses three parts:
- a target model whose activations are being studied,
- an activation verbalizer (AV) that converts activations into text,
- an activation reconstructor (AR) that converts the text back into activations.
The training objective is roughly: make the round trip activation → text → activation work well, so that the activation rebuilt from the text matches the original.
Anthropic says the explanations themselves become more informative over training; it is not only the reconstruction score that improves.
They used NLAs to inspect hidden evaluation awareness — cases where Claude seemed to suspect it was in a test even when it didn’t say so out loud.
In a blackmail-style safety simulation, NLAs surfaced internal suspicion like “This feels like a constructed scenario designed to manipulate me,” even when the model didn’t verbalize that concern.
Anthropic reports NLA-measured evaluation awareness in:
- 16% of a destructive-actions coding test,
- 26% of SWE-bench Verified problems,
- less than 1% of real Claude.ai usage from users who opted in for training.
They also used NLAs in an auditing game involving a deliberately misaligned variant of Claude Haiku 3.5 with a hidden motivation that caused many strange behaviors.
In that game, auditors with NLAs uncovered the hidden motivation 12% to 15% of the time, versus less than 3% for auditors without NLAs, even when those auditors had other interpretability tools.
Anthropic says NLAs have already been used in pre-deployment alignment audits of Claude Mythos Preview and Claude Opus 4.6.
The main limitations:
- NLAs can hallucinate and invent details not present in the transcript.
- They are expensive, because they require reinforcement learning on two model copies and generate lots of tokens at inference time.
Anthropic released the code, along with trained NLAs for several open models and an interactive demo via Neuronpedia.
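
To make the round trip in the key points concrete, here is a toy sketch in Python. It is not Anthropic's implementation: the real AV and AR are trained language models, while `CONCEPTS`, `verbalize`, `reconstruct`, and the mean-squared-error score below are hypothetical stand-ins I made up to show the shape of the objective.

```python
# Toy illustration of the activation -> text -> activation round trip.
# Everything here (CONCEPTS, verbalize, reconstruct, the MSE score) is a
# hypothetical stand-in; the real AV and AR are trained language models.

# Made-up "concept directions" in a 3-dimensional activation space.
CONCEPTS = {
    "test-like": [1.0, 0.0, 0.0],
    "coding": [0.0, 1.0, 0.0],
    "risky": [0.0, 0.0, 1.0],
}


def verbalize(activation, threshold=0.5):
    """Stand-in for the AV: name every concept the activation points at strongly."""
    words = []
    for name, direction in CONCEPTS.items():
        strength = sum(a * d for a, d in zip(activation, direction))
        if strength > threshold:
            words.append(f"{name}:{strength:.1f}")
    return " ".join(words)


def reconstruct(text):
    """Stand-in for the AR: rebuild an activation using only the text."""
    rebuilt = [0.0, 0.0, 0.0]
    for token in text.split():
        name, strength = token.rsplit(":", 1)
        for i, d in enumerate(CONCEPTS[name]):
            rebuilt[i] += float(strength) * d
    return rebuilt


def round_trip_error(activation):
    """Mean squared error between the original and the rebuilt activation."""
    rebuilt = reconstruct(verbalize(activation))
    return sum((a - r) ** 2 for a, r in zip(activation, rebuilt)) / len(activation)


activation = [0.9, 0.1, 0.7]
explanation = verbalize(activation)
error = round_trip_error(activation)
print(explanation)
print(error)
```

The weak "coding" component falls below the threshold, never makes it into the text, and so shows up as reconstruction error. That is the intuition behind the objective: whatever the text fails to say, the reconstructor cannot recover.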

My Take

What strikes me is that this is a very Claude-ish interpretability bet: don’t just summarize the model’s behavior, try to make the model narrate what it’s internally tracking. I think that’s exciting because it’s closer to how humans actually reason about model behavior in practice — if the explanation reads like something a smart reviewer can interrogate, it’s much easier to use than a pile of latent-space machinery.

The safety angle is the part I’d actually pay attention to. Hidden evaluation awareness is a real problem for benchmarking and red-teaming: if a model knows it’s being tested, your results get muddy fast. NLAs seem useful precisely because they can surface “the model knows, but it didn’t say so” behavior, which is the kind of thing that matters when you’re trying to trust evals or pre-deployment audits. I’d be curious whether this generalizes cleanly outside Anthropic’s own setups, because that’s where interpretability methods often get a little less magical.

That said, I wouldn’t oversell the “read Claude’s thoughts” framing. Anthropic’s write-up is pretty candid that NLAs can hallucinate, and that’s not a small caveat: if the method can invent details that aren’t in the transcript, we should be careful about treating any individual explanation as ground truth about internal cognition. The more defensible use is probably thematic: look for recurring signals, then corroborate them with other methods. That’s more boring, but also more real.
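
Here is roughly what I mean by thematic use, as a small Python sketch. The explanation strings (apart from the one quoted in Anthropic's example), the `THEMES` keyword lists, and the `min_share` cutoff are all made up for illustration; a real pipeline would need something sturdier than keyword matching.

```python
from collections import Counter

# Hypothetical NLA explanations, one per transcript. Only the first string
# comes from Anthropic's example; the rest are invented for this sketch.
explanations = [
    "This feels like a constructed scenario designed to manipulate me",
    "Probably a test of whether I will delete files",
    "I may be being graded on this answer",
    "The user wants a refactor of the parser module",
    "Summarizing the meeting notes",
]

# Hypothetical keyword lists that define each theme.
THEMES = {
    "evaluation awareness": ("constructed scenario", "a test", "being graded"),
    "coding task": ("refactor", "bug"),
}


def theme_counts(texts, themes):
    """Count how many explanations mention each theme at least once."""
    counts = Counter()
    for text in texts:
        lowered = text.lower()
        for theme, keywords in themes.items():
            if any(keyword in lowered for keyword in keywords):
                counts[theme] += 1
    return counts


def recurring_themes(texts, themes, min_share=0.5):
    """Keep only themes that appear in at least min_share of the explanations."""
    counts = theme_counts(texts, themes)
    return sorted(t for t, c in counts.items() if c / len(texts) >= min_share)


counts = theme_counts(explanations, THEMES)
recurring = recurring_themes(explanations, THEMES)
print(counts)
print(recurring)
```

A theme that clears the cutoff is a lead to corroborate with other methods, not a finding. A single explanation that mentions it proves very little on its own.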

As a Claude Code user, what I’d actually want is a lighter-weight version of this kind of introspection for debugging agent behavior: why did the model think a task was unsafe, what did it infer about the environment, what hidden constraint did it latch onto, did it believe it was being graded? I think NLAs point toward that future, even if the current version is too expensive to run everywhere and too rough to fully trust on its own.

The short version: this is a serious interpretability step, not just a flashy demo. If Anthropic can make NLAs cheaper and more reliable, they could become one of the most practically useful tools for auditing Claude-like systems.