Anthropic’s Natural Language Autoencoders Put Claude’s “Thoughts” Into Words
For Claude and Claude Code builders, this is one of the more interesting interpretability announcements Anthropic has made in a while. The big idea is simple but pretty wild: instead of reducing activations to opaque scores or feature vectors, Anthropic is trying to turn them into readable natural-language explanations that you can inspect directly. Anthropic introduced Natural Language Autoencoders (NLAs), a method that turns model activations into text explanations and then tries to recons
papoo.work