Video Prediction of Dynamic Physical Simulations with Pixel-Space Spatiotemporal Transformers
Slack, Dean L; Hudson, G Thomas; Winterbottom, Thomas; Al Moubayed, Noura
Authors
Dean L Slack
G Thomas Hudson
Thomas Winterbottom thomas.i.winterbottom@durham.ac.uk (KTP Associate in Machine Learning)
Dr Noura Al Moubayed noura.al-moubayed@durham.ac.uk (Associate Professor)
Abstract
Inspired by the performance and scalability of autoregressive large language models, transformer-based models have seen recent success in the visual domain. This study investigates a transformer adaptation for video prediction with a simple end-to-end approach, comparing various spatiotemporal self-attention layouts. Focusing on causal modelling of physical simulations over time, a common shortcoming of existing video-generative approaches, we attempt to isolate spatiotemporal reasoning via physical object-tracking metrics and unsupervised training on physical simulation datasets. We introduce a simple yet effective pure transformer model for autoregressive video prediction, utilising continuous pixel-space representations. Without the need for complex training strategies or latent feature-learning components, our approach extends the time horizon for physically accurate predictions by up to 50% compared with existing latent-space approaches, while maintaining comparable performance on common video quality metrics. Additionally, we conduct interpretability experiments, using probing models to identify network regions that encode information useful for accurately estimating PDE simulation parameters, and find that this generalises to the estimation of out-of-distribution simulation parameters. This work serves as a platform for further attention-based spatiotemporal modelling of videos via a simple, parameter-efficient, and interpretable approach.
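As a rough illustration of the kind of architecture the abstract describes, the sketch below implements one factorized spatiotemporal transformer block in PyTorch: spatial self-attention within each frame, followed by causally masked temporal self-attention across frames, so that frame-by-frame prediction can proceed autoregressively. All names, dimensions, and the patch-token layout here are assumptions for illustration only; this is not the authors' implementation, and the paper compares several such attention layouts.

```python
# Minimal sketch (not the authors' code): a factorized spatiotemporal
# self-attention block over continuous pixel-space patch tokens, with
# spatial attention per frame and causal temporal attention across frames.
import torch
import torch.nn as nn

class SpatioTemporalBlock(nn.Module):
    def __init__(self, dim=128, heads=4):
        super().__init__()
        self.spatial_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.temporal_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.norm3 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                 nn.Linear(4 * dim, dim))

    def forward(self, x):
        # x: (batch, time, patches, dim) -- patch embeddings of video frames
        b, t, p, d = x.shape

        # Spatial attention: tokens attend only within their own frame.
        s = x.reshape(b * t, p, d)
        s_norm = self.norm1(s)
        s = s + self.spatial_attn(s_norm, s_norm, s_norm, need_weights=False)[0]

        # Temporal attention: each patch position attends over past frames
        # only (boolean causal mask), enabling autoregressive prediction.
        h = s.reshape(b, t, p, d).permute(0, 2, 1, 3).reshape(b * p, t, d)
        mask = torch.triu(torch.ones(t, t, dtype=torch.bool), diagonal=1)
        h_norm = self.norm2(h)
        h = h + self.temporal_attn(h_norm, h_norm, h_norm, attn_mask=mask,
                                   need_weights=False)[0]

        # Position-wise feed-forward with residual connection.
        h = h + self.mlp(self.norm3(h))
        return h.reshape(b, p, t, d).permute(0, 2, 1, 3)

# Example: 2 videos, 8 frames, 16 patches per frame, 128-dim tokens.
x = torch.randn(2, 8, 16, 128)
print(SpatioTemporalBlock()(x).shape)  # torch.Size([2, 8, 16, 128])
```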
Citation
Slack, D. L., Hudson, G. T., Winterbottom, T., & Al Moubayed, N. (online). Video Prediction of Dynamic Physical Simulations with Pixel-Space Spatiotemporal Transformers. IEEE Transactions on Neural Networks and Learning Systems, https://doi.org/10.1109/TNNLS.2025.3585949
| Journal Article Type | Article |
| --- | --- |
| Acceptance Date | Jun 25, 2025 |
| Online Publication Date | Jul 22, 2025 |
| Deposit Date | Jul 3, 2025 |
| Publicly Available Date | Jul 25, 2025 |
| Journal | IEEE Transactions on Neural Networks and Learning Systems |
| Print ISSN | 2162-237X |
| Electronic ISSN | 2162-2388 |
| Publisher | Institute of Electrical and Electronics Engineers |
| Peer Reviewed | Peer Reviewed |
| DOI | https://doi.org/10.1109/TNNLS.2025.3585949 |
| Keywords | Video prediction; spatiotemporal transformers; pixel-space modelling; physics modelling; autoregressive models; hierarchical video transformers |
| Public URL | https://durham-repository.worktribe.com/output/4252396 |
Files
Accepted Journal Article (PDF, 4.3 MB)