
SciMMIR: Benchmarking Scientific Multi-modal Information Retrieval

Wu, Siwei; Li, Yizhi; Zhu, Kang; Zhang, Ge; Liang, Yiming; Ma, Kaijing; Xiao, Chenghao; Zhang, Haoran; Yang, Bohao; Chen, Wenhu; Huang, Wenhao; Al Moubayed, Noura; Fu, Jie; Lin, Chenghua



Authors

Siwei Wu

Yizhi Li

Kang Zhu

Ge Zhang

Yiming Liang

Kaijing Ma

Chenghao Xiao chenghao.xiao@durham.ac.uk
PGR Student, Doctor of Philosophy

Haoran Zhang

Bohao Yang

Wenhu Chen

Wenhao Huang

Jie Fu

Chenghua Lin



Abstract

Multi-modal information retrieval (MMIR) is a rapidly evolving field where significant progress has been made through advanced representation learning and cross-modality alignment research, particularly for image-text pairs. However, current benchmarks for evaluating MMIR performance on image-text pairs overlook the scientific domain, which has characteristics distinct from generic data: the captions of scientific charts and tables usually describe experimental results or scientific principles, rather than human activity or scenery. To bridge this gap, we develop a scientific domain-specific MMIR benchmark (SciMMIR) by leveraging corpora of open-access research papers to extract data relevant to the scientific domain. This benchmark comprises 530K meticulously curated image-text pairs extracted from figures and tables with detailed captions in scientific documents. We further annotate the image-text pairs with a two-level subset-subcategory hierarchy to facilitate a more comprehensive evaluation of baseline retrieval systems. We conduct zero-shot and fine-tuned evaluations on prominent multi-modal image-captioning and visual language models (VLMs), such as CLIP, BLIP, and BLIP-2. Additionally, we perform optical character recognition (OCR) on the images and exploit this text to improve the capability of VLMs on the SciMMIR task. Our findings offer useful insights for MMIR in the scientific domain, including the influence of pre-training and fine-tuning settings, the effects of different visual and textual encoders, and the impact of OCR information. All our data and code are made publicly available.
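As a rough illustration of the zero-shot baseline setting described above, the sketch below scores caption-to-image retrieval with a pretrained CLIP model. This is not the authors' exact pipeline: the openai/clip-vit-base-patch32 checkpoint, the example captions, and the image file paths are assumptions for illustration only, and the paper additionally evaluates BLIP/BLIP-2 variants, fine-tuned models, and OCR-augmented inputs.

# Minimal sketch of zero-shot caption-to-image retrieval with CLIP
# (illustrative only; checkpoint, captions, and paths are assumptions).
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
model.eval()

# Hypothetical scientific figure/table captions and their paired images.
captions = [
    "Figure 3: BLEU scores of the proposed model on the WMT14 test set.",
    "Table 2: Ablation study of the attention variants.",
]
image_paths = ["fig3.png", "tab2.png"]  # hypothetical paths
images = [Image.open(p).convert("RGB") for p in image_paths]

with torch.no_grad():
    text_inputs = processor(text=captions, padding=True, truncation=True, return_tensors="pt")
    image_inputs = processor(images=images, return_tensors="pt")
    text_emb = model.get_text_features(**text_inputs)
    image_emb = model.get_image_features(**image_inputs)

# L2-normalise and compute the cosine-similarity matrix
# (rows = captions, columns = candidate images).
text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)
similarity = text_emb @ image_emb.T

# Text-to-image retrieval: rank images per caption; since caption i is
# paired with image i here, Hit@1 checks whether the top-ranked image matches.
ranks = similarity.argsort(dim=-1, descending=True)
hit_at_1 = (ranks[:, 0] == torch.arange(len(captions))).float().mean()
print(similarity)
print(f"Hit@1: {hit_at_1:.2f}")

In the benchmark's fine-tuned setting, the same similarity scoring would be applied after further training the encoders on SciMMIR training pairs; the sketch only covers the zero-shot scoring step.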

Citation

Wu, S., Li, Y., Zhu, K., Zhang, G., Liang, Y., Ma, K., Xiao, C., Zhang, H., Yang, B., Chen, W., Huang, W., Al Moubayed, N., Fu, J., & Lin, C. (2024, August). SciMMIR: Benchmarking Scientific Multi-modal Information Retrieval. Presented at ACL 2024: Annual Meeting of the Association for Computational Linguistics, Bangkok, Thailand.

Presentation Conference Type: Conference Paper (published)
Conference Name: ACL 2024: Annual Meeting of the Association for Computational Linguistics
Start Date: Aug 11, 2024
Acceptance Date: Jul 1, 2024
Publication Date: 2024-08
Deposit Date: May 27, 2025
Publicly Available Date: May 29, 2025
Publisher: Association for Computational Linguistics
Peer Reviewed: Yes
Pages: 12560-12574
Book Title: Findings of the Association for Computational Linguistics: ACL 2024
DOI: https://doi.org/10.18653/v1/2024.findings-acl.746
Public URL: https://durham-repository.worktribe.com/output/3964621
