SciMMIR: Benchmarking Scientific Multi-modal Information Retrieval

Authors

Siwei Wu
Yizhi Li
Kang Zhu
Ge Zhang
Yiming Liang
Kaijing Ma
Chenghao Xiao chenghao.xiao@durham.ac.uk
PGR Student, Doctor of Philosophy
Haoran Zhang
Bohao Yang
Wenhu Chen
Wenhao Huang
Dr Noura Al Moubayed noura.al-moubayed@durham.ac.uk
Associate Professor
Jie Fu
Chenghua Lin
Abstract
Multi-modal information retrieval (MMIR) is a rapidly evolving field where significant progress has been made through advanced representation learning and cross-modality alignment research, particularly in image-text pairs. However, current benchmarks for evaluating MMIR performance on image-text pairs overlook the scientific domain, which has characteristics that are distinct from generic data, as the captions of scientific charts and tables usually describe experimental results or scientific principles, rather than human activity or scenery. To bridge this gap, we develop a scientific domain-specific MMIR benchmark (SciMMIR) by leveraging corpora of open-access research papers to extract data relevant to the scientific domain. This benchmark comprises 530K meticulously curated image-text pairs extracted from figures and tables with detailed captions from scientific documents. We further annotate the image-text pairs with a two-level subset-subcategory hierarchy to facilitate a more comprehensive evaluation of baseline retrieval systems. We conduct zero-shot and fine-tuned evaluations on prominent multi-modal image-captioning and visual language models, such as CLIP, BLIP, and BLIP-2. Additionally, we perform optical character recognition (OCR) on the images and exploit this text to improve the capability of VLMs on the SciMMIR task. Our findings offer useful insights for MMIR in the scientific domain, including the influence of pre-training and fine-tuning settings, the effects of different visual and textual encoders, and the impact of OCR information. All our data and code are made publicly available.
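The zero-shot evaluations described above can be illustrated with a minimal retrieval sketch. The snippet below uses a Hugging Face CLIP checkpoint to rank candidate captions for scientific figure/table images; the model name, top-k ranking, and batching are illustrative assumptions, not the authors' exact pipeline, and the OCR-augmented variant would additionally extract text from each image (e.g. with an off-the-shelf OCR tool) and feed it to the text side.

```python
# Minimal sketch of zero-shot image-to-text retrieval in the spirit of SciMMIR.
# Assumptions: a generic CLIP checkpoint and plain cosine-similarity ranking;
# this is not the authors' released evaluation code.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

MODEL_NAME = "openai/clip-vit-base-patch16"  # assumed checkpoint; any CLIP model works
model = CLIPModel.from_pretrained(MODEL_NAME).eval()
processor = CLIPProcessor.from_pretrained(MODEL_NAME)

def retrieve_captions(image_paths, captions, top_k=5):
    """Rank candidate captions for each figure/table image by CLIP similarity."""
    images = [Image.open(p).convert("RGB") for p in image_paths]
    inputs = processor(text=captions, images=images,
                       return_tensors="pt", padding=True, truncation=True)
    with torch.no_grad():
        outputs = model(**inputs)
    # logits_per_image has shape (num_images, num_captions): similarity scores
    scores = outputs.logits_per_image
    return scores.topk(top_k, dim=-1).indices  # top caption indices per image
```

Fine-tuned evaluation would follow the same retrieval step after further training the encoders on the SciMMIR training split.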
Citation
Wu, S., Li, Y., Zhu, K., Zhang, G., Liang, Y., Ma, K., Xiao, C., Zhang, H., Yang, B., Chen, W., Huang, W., Al Moubayed, N., Fu, J., & Lin, C. (2024, August). SciMMIR: Benchmarking Scientific Multi-modal Information Retrieval. Presented at ACL 2024: Annual Meeting of the Association for Computational Linguistics, Bangkok, Thailand
| Presentation Conference Type | Conference Paper (published) |
| --- | --- |
| Conference Name | ACL 2024: Annual Meeting of the Association for Computational Linguistics |
| Start Date | Aug 11, 2024 |
| Acceptance Date | Jul 1, 2024 |
| Publication Date | 2024-08 |
| Deposit Date | May 27, 2025 |
| Publicly Available Date | May 29, 2025 |
| Publisher | Association for Computational Linguistics |
| Peer Reviewed | Peer Reviewed |
| Pages | 12560-12574 |
| Book Title | Findings of the Association for Computational Linguistics: ACL 2024 |
| DOI | https://doi.org/10.18653/v1/2024.findings-acl.746 |
| Public URL | https://durham-repository.worktribe.com/output/3964621 |
Files

Published Conference Paper (627 KB, PDF)
Publisher Licence URL: http://creativecommons.org/licenses/by/4.0/
You might also like
Fine-grained Main Ideas Extraction and Clustering of Online Course Reviews
(2022)
Book Chapter
Length is a Curse and a Blessing for Document-level Semantics
(2023)
Presentation / Conference Contribution
Analyzing LLMs' Knowledge Boundary Cognition Across Languages Through the Lens of Internal Representations
(2025)
Presentation / Conference Contribution