Sparse Autoencoders Do Not Find Canonical Units of Analysis
(2025)
Presentation / Conference Contribution
Leask, P., Bussmann, B., Pearce, M. T., Bloom, J. I., Tigges, C., Al Moubayed, N., Sharkey, L., & Nanda, N. (2025, April). Sparse Autoencoders Do Not Find Canonical Units of Analysis. Presented at The Thirteenth International Conference on Learning Representations, Singapore.
A common goal of mechanistic interpretability is to decompose the activations of neural networks into features: interpretable properties of the input computed by the model. Sparse autoencoders (SAEs) are a popular method for finding these features in...