Natural Language Autoencoders Produce Unsupervised Explanations of LLM Activations

  • ianchanning--Transformer-circuits.pub
  • published date: 2026-05-10 05:47:02 UTC

We introduce Natural Language Autoencoders (NLAs), an unsupervised method for generating natural language explanations of LLM activations. An NLA consists of two LLM modules: an activation verbalizer…