Sparse Autoencoders Do Not Find Canonical Units of Analysis

Leask, Patrick; Bussmann, Bart; Pearce, Michael T; Isaac Bloom, Joseph; Tigges, Curt; Al Moubayed, Noura; Sharkey, Lee; Nanda, Neel

Authors

Patrick Leask (patrick.leask@durham.ac.uk), PGR Student, Doctor of Philosophy

Bart Bussmann

Michael T Pearce

Joseph Isaac Bloom

Curt Tigges

Lee Sharkey

Neel Nanda



Abstract

A common goal of mechanistic interpretability is to decompose the activations of neural networks into features: interpretable properties of the input computed by the model. Sparse autoencoders (SAEs) are a popular method for finding these features in LLMs, and it has been postulated that they can be used to find a canonical set of units: a unique and complete list of atomic features. We cast doubt on this belief using two novel techniques: SAE stitching, to show that SAEs are incomplete, and meta-SAEs, to show that their latents are not atomic. SAE stitching involves inserting or swapping latents from a larger SAE into a smaller one. Latents from the larger SAE fall into two categories: novel latents, which improve performance when added to the smaller SAE, indicating that they capture novel information, and reconstruction latents, which can replace corresponding latents in the smaller SAE that have similar behavior. The existence of novel latents indicates that smaller SAEs are incomplete. Using meta-SAEs (SAEs trained on the decoder matrix of another SAE) we find that latents in larger SAEs often decompose into combinations of latents from a smaller SAE, showing that larger SAE latents are not atomic. The resulting decompositions are often interpretable; e.g., a latent representing "Einstein" decomposes into "scientist", "Germany", and "famous person". To train meta-SAEs we introduce BatchTopK SAEs, an improved variant of the popular TopK SAE method that enforces only a fixed average sparsity. Even if SAEs do not find canonical units of analysis, they may still be useful tools. We suggest that future research should either pursue different approaches for identifying such units, or pragmatically choose the SAE size suited to the task at hand. We provide an interactive dashboard to explore meta-SAEs: https://metasaes.streamlit.app/
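The two techniques described in the abstract can be sketched in a few lines each. First, SAE stitching: the sketch below is a minimal illustration under assumed conventions, not the authors' implementation. It assumes encoder weights of shape [d_model, n_latents] and decoder weights of shape [n_latents, d_model]; the function name and `novel_idx` (the indices of the larger SAE's novel latents) are hypothetical.

```python
import torch

def insert_novel_latents(W_enc_small, W_dec_small,
                         W_enc_large, W_dec_large,
                         novel_idx):
    """Grow a smaller SAE's dictionary by appending the encoder and
    decoder weights of 'novel' latents taken from a larger SAE.

    Assumed shapes: W_enc is [d_model, n_latents] and W_dec is
    [n_latents, d_model]; encoder biases would be extended analogously.
    """
    W_enc = torch.cat([W_enc_small, W_enc_large[:, novel_idx]], dim=1)
    W_dec = torch.cat([W_dec_small, W_dec_large[novel_idx]], dim=0)
    return W_enc, W_dec
```

Second, the BatchTopK activation. Where a TopK SAE keeps exactly k latents per example, BatchTopK keeps the batch_size * k largest pre-activations across the whole batch, so k is only the average number of active latents per example. A minimal sketch, assuming PyTorch pre-activations of shape [batch_size, n_latents] and omitting details such as the encoder itself:

```python
import torch

def topk_activation(pre_acts, k):
    """Per-sample TopK: keep exactly k latents for every example."""
    values, indices = pre_acts.topk(k, dim=-1)
    out = torch.zeros_like(pre_acts)
    out.scatter_(-1, indices, values)
    return out

def batch_topk_activation(pre_acts, k):
    """BatchTopK: keep the batch_size * k largest activations across
    the whole batch, so k is only an average sparsity; individual
    examples may use more or fewer than k latents."""
    batch_size = pre_acts.shape[0]
    flat = pre_acts.flatten()
    values, indices = flat.topk(batch_size * k)
    out = torch.zeros_like(flat)
    out.scatter_(0, indices, values)
    return out.reshape(pre_acts.shape)
```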

Citation

Leask, P., Bussmann, B., Pearce, M. T., Isaac Bloom, J., Tigges, C., Al Moubayed, N., Sharkey, L., & Nanda, N. (2025, April). Sparse Autoencoders Do Not Find Canonical Units of Analysis. Presented at The Thirteenth International Conference on Learning Representations, Singapore.

Presentation Conference Type: Conference Paper (published)
Conference Name: The Thirteenth International Conference on Learning Representations
Start Date: Apr 24, 2025
End Date: Apr 26, 2025
Acceptance Date: Jan 22, 2025
Online Publication Date: Jan 22, 2025
Publication Date: Jan 22, 2025
Deposit Date: Feb 10, 2025
Publicly Available Date: Feb 12, 2025
Peer Reviewed: Yes
Keywords: sparse autoencoders, mechanistic interpretability
Public URL: https://durham-repository.worktribe.com/output/3475474
Publisher URL: https://openreview.net/forum?id=9ca9eHNrdH
