Length is a Curse and a Blessing for Document-level Semantics

Xiao, Chenghao; Li, Yizhi; Hudson, G Thomas; Lin, Chenghua; Al Moubayed, Noura


Authors

ChengHao Xiao chenghao.xiao@durham.ac.uk
PGR Student Doctor of Philosophy

Yizhi Li

George Hudson g.t.hudson@durham.ac.uk
Post Doctoral Research Associate

Chenghua Lin



Abstract

In recent years, contrastive learning (CL) has been extensively utilized to recover sentence and document-level encoding capability from pre-trained language models. In this work, we question the length generalizability of CL-based models, i.e., their vulnerability to length-induced semantic shift. We verify not only that length vulnerability is a significant yet overlooked research gap, but also that unsupervised CL methods can be devised that depend solely on the semantic signal provided by document length. We first derive the theoretical foundations underlying length attacks, showing that elongating a document intensifies the high intra-document similarity already brought about by CL. Moreover, we find that the isotropy promised by CL is highly dependent on the length range of the text exposed in training. Inspired by these findings, we introduce a simple yet universal document representation learning framework, LA(SER)³: length-agnostic self-reference for semantically robust sentence representation learning, achieving state-of-the-art unsupervised performance on the standard information retrieval benchmark. Our code is publicly available.
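
As a rough illustration of the quantity the abstract refers to, the sketch below computes one common reading of intra-document similarity: the mean pairwise cosine similarity between the token vectors of a single document. It is not the authors' implementation. The function names (`cosine`, `intra_document_similarity`, `elongate`) are hypothetical, the token vectors are toy placeholders rather than outputs of a real encoder, and treating elongation as self-concatenation of the token list is an assumption made here for illustration only.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def intra_document_similarity(token_vectors):
    """Mean cosine similarity over all unordered pairs of token vectors.

    Assumption: this averaging is one plausible definition of
    intra-document similarity, not necessarily the paper's exact one.
    Requires at least two token vectors.
    """
    pairs = list(combinations(token_vectors, 2))
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)


def elongate(tokens, times):
    """Hypothetical elongation: repeat the token list `times` times."""
    return tokens * times


# Toy placeholder vectors; a real study would take these from an encoder
# applied to the original and to the elongated document.
toy_vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
score = intra_document_similarity(toy_vectors)
longer_text = elongate(["length", "matters"], 2)
```

To probe the effect described in the abstract, one would encode both a document and its elongated version with the same CL-trained model and compare the two scores; the toy vectors above only exercise the metric itself.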

Citation

Xiao, C., Li, Y., Hudson, G. T., Lin, C., & Al Moubayed, N. (2023, December). Length is a Curse and a Blessing for Document-level Semantics. Presented at The 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP), Singapore.

Presentation Conference Type: Conference Paper (published)
Conference Name: The 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP)
Start Date: Dec 6, 2023
End Date: Dec 10, 2023
Acceptance Date: Nov 9, 2023
Publication Date: 2023
Deposit Date: Nov 23, 2023
Publicly Available Date: Dec 8, 2023
Pages: 1385-1396
Public URL: https://durham-repository.worktribe.com/output/1948282
Publisher URL: https://aclanthology.org/venues/emnlp/
