Inference-Time Decomposition of Activations (ITDA): A Scalable Approach to Interpreting Large Language Models
(2025)
Presentation / Conference Contribution
Leask, P., & Al Moubayed, N. (2025, July). Inference-Time Decomposition of Activations (ITDA): A Scalable Approach to Interpreting Large Language Models. Presented at the International Conference on Machine Learning (ICML 2025), Vancouver, Canada.
Sparse Autoencoders (SAEs) are a popular method for decomposing Large Language Model (LLM) activations into interpretable latents; however, they have a substantial training cost, and SAEs learned on different models are not directly comparable. Motivat…