ChengHao Xiao chenghao.xiao@durham.ac.uk
PGR Student Doctor of Philosophy
Length is a Curse and a Blessing for Document-level Semantics
Xiao, Chenghao; Li, Yizhi; Hudson, G Thomas; Lin, Chenghua; Al Moubayed, Noura
Authors
Yizhi Li
George Hudson g.t.hudson@durham.ac.uk
Post Doctoral Research Associate
Chenghua Lin
Dr Noura Al Moubayed noura.al-moubayed@durham.ac.uk
Associate Professor
Abstract
In recent years, contrastive learning (CL) has been extensively utilized to recover sentence and document-level encoding capability from pre-trained language models. In this work, we question the length generalizability of CL-based models, i.e., their vulnerability towards length-induced semantic shift. We verify not only that length vulnerability is a significant yet overlooked research gap, but also that we can devise unsupervised CL methods solely depending on the semantic signal provided by document length. We first derive the theoretical foundations underlying length attacks, showing that elongating a document would intensify the high intra-document similarity that is already brought by CL. Moreover, we find that the isotropy promised by CL is highly dependent on the length range of text exposed in training. Inspired by these findings, we introduce a simple yet universal document representation learning framework, LA(SER)³: length-agnostic self-reference for semantically robust sentence representation learning, achieving state-of-the-art unsupervised performance on the standard information retrieval benchmark. Our code is publicly available.
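As a rough illustration of the idea in the abstract, the sketch below builds "self-reference" training pairs by elongating each document and scores them with an in-batch contrastive (InfoNCE-style) loss. It is not the authors' released code. It assumes that elongation means repeating the document's own text, which the abstract does not specify, and the encoder that would map text to embeddings is left out entirely. The names `elongate`, `make_pairs`, `info_nce`, and the `temperature` default are hypothetical.

```python
import math


def elongate(document: str, times: int) -> str:
    """Repeat a document `times` times, joined by spaces.

    Assumption: elongation is modelled as self-concatenation.
    """
    if times < 1:
        raise ValueError("times must be >= 1")
    return " ".join([document] * times)


def make_pairs(documents, times=2):
    """Pair each document with its elongated copy as (anchor, positive)."""
    return [(doc, elongate(doc, times)) for doc in documents]


def cosine(u, v):
    """Cosine similarity between two non-zero vectors given as lists."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(anchors, positives, temperature=0.05):
    """Mean InfoNCE loss with in-batch negatives.

    anchors[i] and positives[i] are embeddings of a document and of its
    elongated copy; every other positive in the batch acts as a negative.
    """
    losses = []
    for i, anchor in enumerate(anchors):
        logits = [cosine(anchor, p) / temperature for p in positives]
        # Log-sum-exp with the maximum subtracted for numerical stability.
        max_logit = max(logits)
        log_denominator = max_logit + math.log(
            sum(math.exp(logit - max_logit) for logit in logits)
        )
        losses.append(log_denominator - logits[i])
    return sum(losses) / len(losses)
```

In use, `make_pairs(corpus)` would supply the text pairs, a sentence encoder of the reader's choice would embed both sides, and `info_nce` would be minimised so that a document and its longer copy stay close in embedding space regardless of length.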
Citation
Xiao, C., Li, Y., Hudson, G. T., Lin, C., & Al Moubayed, N. (2023, December). Length is a Curse and a Blessing for Document-level Semantics. Presented at The 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP), Singapore
Presentation Conference Type | Conference Paper (published) |
---|---|
Conference Name | The 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP) |
Start Date | Dec 6, 2023 |
End Date | Dec 10, 2023 |
Acceptance Date | Nov 9, 2023 |
Publication Date | 2023 |
Deposit Date | Nov 23, 2023 |
Publicly Available Date | Dec 8, 2023 |
Pages | 1385-1396 |
Public URL | https://durham-repository.worktribe.com/output/1948282 |
Publisher URL | https://aclanthology.org/venues/emnlp/ |
Files
Published Conference Paper
(404 Kb)
PDF
Licence
http://creativecommons.org/licenses/by/4.0/
Publisher Licence URL
http://creativecommons.org/licenses/by/4.0/