Beyond Syntax: How Do LLMs Understand Code?
North, Marc; Atapour-Abarghouei, Amir; Bencomo, Nelly
Authors
Marc North marc.north@durham.ac.uk
PGR Student Doctor of Philosophy
Dr Amir Atapour-Abarghouei amir.atapour-abarghouei@durham.ac.uk
Assistant Professor
Dr Nelly Bencomo nelly.bencomo@durham.ac.uk
Associate Professor
Abstract
Within software engineering research, Large Language Models (LLMs) are often treated as 'black boxes', with only their inputs and outputs being considered. In this paper, we take a machine interpretability approach to examine how LLMs internally represent and process code. We focus on variable declaration and function scope, training classifier probes on the residual streams of LLMs as they process code written in different programming languages to explore how these concepts are represented internally across languages. We also look for specific attention heads that support these representations and examine how they behave for inputs in different languages. Our results show that LLMs have an understanding, and an internal representation, of language-independent coding semantics that goes beyond the syntax of any specific programming language, and that they use the same internal components to process code regardless of the programming language in which it is written. Furthermore, we find evidence that these language-independent semantic components exist in the middle layers of LLMs and are supported by language-specific components in the earlier layers that parse the syntax of specific languages and feed into these later semantic components. Finally, we discuss the broader implications of our work, particularly in relation to concerns that AI, with its reliance on large datasets to learn new programming languages, might limit innovation in programming language design. By demonstrating that LLMs have a language-independent representation of code, we argue that LLMs may be able to flexibly learn the syntax of new programming languages while retaining their semantic understanding of universal coding concepts. In doing so, LLMs could promote creativity in future programming language design, providing tools that augment rather than constrain the future of software engineering.
Citation
North, M., Atapour-Abarghouei, A., & Bencomo, N. (2025, April). Beyond Syntax: How Do LLMs Understand Code? Presented at 2025 IEEE/ACM International Conference on Software Engineering ICSE, Ottawa, Canada
Presentation Conference Type | Conference Paper (published) |
---|---|
Conference Name | 2025 IEEE/ACM International Conference on Software Engineering ICSE |
Start Date | Apr 27, 2025 |
End Date | May 3, 2025 |
Acceptance Date | Dec 11, 2024 |
Deposit Date | Feb 3, 2025 |
Publisher | Institute of Electrical and Electronics Engineers |
Peer Reviewed | Peer Reviewed |
Keywords | Mechanistic interpretability; Large Language Models (LLMs); Software engineering |
Public URL | https://durham-repository.worktribe.com/output/3465850 |
Publisher URL | https://ieeexplore.ieee.org/xpl/conhome/1000691/all-proceedings |
This file is under embargo due to copyright reasons.