Research Repository

Outputs (4)

Inference-Time Decomposition of Activations (ITDA): A Scalable Approach to Interpreting Large Language Models (2025)
Presentation / Conference Contribution
Leask, P., & Al Moubayed, N. (2025, July). Inference-Time Decomposition of Activations (ITDA): A Scalable Approach to Interpreting Large Language Models. Presented at International Conference on Machine Learning (ICML 2025), Vancouver, Canada

Sparse Autoencoders (SAEs) are a popular method for decomposing Large Language Model (LLM) activations into interpretable latents; however, they have a substantial training cost, and SAEs learned on different models are not directly comparable. Motivat...

Sparse Autoencoders Do Not Find Canonical Units of Analysis (2025)
Presentation / Conference Contribution
Leask, P., Bussmann, B., Pearce, M., Bloom, J., Tigges, C., Al Moubayed, N., Sharkey, L., & Nanda, N. (2025, April). Sparse Autoencoders Do Not Find Canonical Units of Analysis. Presented at ICLR2025: The Thirteenth International Conference on Learning Representations, Singapore

A common goal of mechanistic interpretability is to decompose the activations of neural networks into features: interpretable properties of the input computed by the model. Sparse autoencoders (SAEs) are a popular method for finding these features in...

The variable relationship between the National Early Warning Score on admission to hospital, the primary discharge diagnosis and in-hospital mortality (2025)
Journal Article
Holland, M., Kellett, J., Boulitsakis-Logothetis, S., Watson, M., Al Moubayed, N., & Green, D. (online). The variable relationship between the National Early Warning Score on admission to hospital, the primary discharge diagnosis and in-hospital mortality. Internal and Emergency Medicine. https://doi.org/10.1007/s11739-024-03828-9

Background: Patients with an elevated admission National Early Warning Score (NEWS) are more likely to die while in hospital. However, it is not known if this increased mortality risk is the same for all diagnoses. The aim of this study was to determ...