interpretability: related articles roundup (1 article)

Anthropic’s Natural Language Autoencoders Put Claude’s “Thoughts” Into Words

For Claude and Claude Code builders, this is one of the more interesting interpretability announcements Anthropic has made in a while. The big idea is simple but pretty wild: instead of reducing activations to opaque scores or feature vectors, turn them into readable natural-language explanations that you can inspect directly. To that end, Anthropic introduced Natural Language Autoencoders (NLAs), a method that turns model activations into text explanations and then tries to recons

papoo.work

#interpretability