
SciMMIR: Benchmarking Scientific Multi-modal Information Retrieval

Wu, Siwei; Li, Yizhi; Zhu, Kang; Zhang, Ge; Liang, Yiming; Ma, Kaijing; Xiao, Chenghao; Zhang, Haoran; Yang, Bohao; Chen, Wenhu; Huang, Wenhao; Al Moubayed, Noura; Fu, Jie; Lin, Chenghua



Authors

Siwei Wu

Yizhi Li

Kang Zhu

Ge Zhang

Yiming Liang

Kaijing Ma

Chenghao Xiao chenghao.xiao@durham.ac.uk
PGR Student, Doctor of Philosophy

Haoran Zhang

Bohao Yang

Wenhu Chen

Wenhao Huang

Jie Fu

Chenghua Lin



Abstract

Multi-modal information retrieval (MMIR) is a rapidly evolving field where significant progress has been made through advanced representation learning and cross-modality alignment research, particularly for image-text pairs. However, current benchmarks for evaluating MMIR performance on image-text pairs overlook the scientific domain, which has characteristics distinct from generic data: the captions of scientific charts and tables usually describe experimental results or scientific principles, rather than human activity or scenery. To bridge this gap, we develop a scientific domain-specific MMIR benchmark (SciMMIR) by leveraging corpora of open-access research papers to extract data relevant to the scientific domain. This benchmark comprises 530K meticulously curated image-text pairs extracted from figures and tables with detailed captions in scientific documents. We further annotate the image-text pairs with a two-level subset-subcategory hierarchy to facilitate a more comprehensive evaluation of baseline retrieval systems. We conduct zero-shot and fine-tuned evaluations on prominent multi-modal image-captioning and visual language models (VLMs), such as CLIP, BLIP, and BLIP-2. Additionally, we perform optical character recognition (OCR) on the images and exploit this text to improve the capability of VLMs on the SciMMIR task. Our findings offer useful insights for MMIR in the scientific domain, including the influence of pre-training and fine-tuning settings, the effects of different visual and textual encoders, and the impact of OCR information. All our data and code are made publicly available.
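As a rough illustration of the zero-shot baseline setting described above, the sketch below scores caption-to-image retrieval with a pretrained CLIP model. This is not the authors' exact pipeline: the openai/clip-vit-base-patch32 checkpoint, the example captions, and the image file paths are assumptions for illustration only, and the paper additionally evaluates BLIP/BLIP-2 variants, fine-tuned models, and OCR-augmented inputs.

# Minimal sketch of zero-shot caption-to-image retrieval with CLIP
# (illustrative only; checkpoint, captions, and paths are assumptions).
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
model.eval()

# Hypothetical scientific figure/table captions and their paired images.
captions = [
    "Figure 3: BLEU scores of the proposed model on the WMT14 test set.",
    "Table 2: Ablation study of the attention variants.",
]
image_paths = ["fig3.png", "tab2.png"]  # hypothetical paths
images = [Image.open(p).convert("RGB") for p in image_paths]

with torch.no_grad():
    text_inputs = processor(text=captions, padding=True, truncation=True, return_tensors="pt")
    image_inputs = processor(images=images, return_tensors="pt")
    text_emb = model.get_text_features(**text_inputs)
    image_emb = model.get_image_features(**image_inputs)

# L2-normalise and compute the cosine-similarity matrix
# (rows = captions, columns = candidate images).
text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)
similarity = text_emb @ image_emb.T

# Text-to-image retrieval: rank images per caption; since caption i is
# paired with image i here, Hit@1 checks whether the top-ranked image matches.
ranks = similarity.argsort(dim=-1, descending=True)
hit_at_1 = (ranks[:, 0] == torch.arange(len(captions))).float().mean()
print(similarity)
print(f"Hit@1: {hit_at_1:.2f}")

In the benchmark's fine-tuned setting, the same similarity scoring would be applied after further training the encoders on SciMMIR training pairs; the sketch only covers the zero-shot scoring step.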

Citation

Wu, S., Li, Y., Zhu, K., Zhang, G., Liang, Y., Ma, K., Xiao, C., Zhang, H., Yang, B., Chen, W., Huang, W., Al Moubayed, N., Fu, J., & Lin, C. (2024, August). SciMMIR: Benchmarking Scientific Multi-modal Information Retrieval. Presented at ACL 2024: Annual Meeting of the Association for Computational Linguistics, Bangkok, Thailand.

Presentation Conference Type: Conference Paper (published)
Conference Name: ACL 2024: Annual Meeting of the Association for Computational Linguistics
Start Date: Aug 11, 2024
Acceptance Date: Jul 1, 2024
Publication Date: 2024-08
Deposit Date: May 27, 2025
Publicly Available Date: May 29, 2025
Publisher: Association for Computational Linguistics
Peer Reviewed: Yes
Pages: 12560-12574
Book Title: Findings of the Association for Computational Linguistics: ACL 2024
DOI: https://doi.org/10.18653/v1/2024.findings-acl.746
Public URL: https://durham-repository.worktribe.com/output/3964621
