Patrick Leask patrick.leask@durham.ac.uk
PGR Student, Doctor of Philosophy
Inference-Time Decomposition of Activations (ITDA): A Scalable Approach to Interpreting Large Language Models
Authors
Leask, Patrick; Al Moubayed, Noura
Dr Noura Al Moubayed noura.al-moubayed@durham.ac.uk
Associate Professor
Abstract
Sparse Autoencoders (SAEs) are a popular method for decomposing Large Language Model (LLM) activations into interpretable latents; however, they have a substantial training cost, and SAEs learned on different models are not directly comparable. Motivated by relative representation similarity measures, we introduce Inference-Time Decomposition of Activation models (ITDAs). ITDAs are constructed by greedily sampling activations into a dictionary based on an error threshold on their matching pursuit reconstruction. ITDAs can be trained in 1% of the time of SAEs, allowing us to cheaply train them on Llama-3.1 70B and 405B. ITDA dictionaries also enable cross-model comparisons, and they outperform existing methods such as CKA, SVCCA, and a relative representation method on a benchmark of representation similarity. Code is available at https://github.com/pleask/itda
Citation
Leask, P., & Al Moubayed, N. (2025, July). Inference-Time Decomposition of Activations (ITDA): A Scalable Approach to Interpreting Large Language Models. Presented at International Conference on Machine Learning (ICML 2025), Vancouver, Canada
| Field | Value |
| --- | --- |
| Presentation Conference Type | Conference Paper (published) |
| Conference Name | International Conference on Machine Learning (ICML 2025) |
| Start Date | Jul 13, 2025 |
| End Date | Jul 19, 2025 |
| Acceptance Date | May 26, 2025 |
| Deposit Date | Jun 1, 2025 |
| Peer Reviewed | Peer Reviewed |
| Series Title | Proceedings of Machine Learning Research |
| Series ISSN | 2640-3498 |
| Public URL | https://durham-repository.worktribe.com/output/4012842 |
| Publisher URL | https://proceedings.mlr.press/ |
| External URL | https://icml.cc/Conferences/2025 |
This file is under embargo due to copyright reasons.
You might also like
Sparse Autoencoders Do Not Find Canonical Units of Analysis
(2025)
Presentation / Conference Contribution