Abstract
Modern scientific articles extend beyond plain text: publishers such as Elsevier frequently supplement papers with videos and graphical abstracts (infographic summaries) to enhance the reading experience. Paper summarizers capable of effectively fusing two or more modalities remain scarce, and the complexity of integrating these diverse modalities, often with missing or incomplete data, calls for advanced modelling techniques. This paper introduces Hier-SciSum, a new model for Multimodal Paper Summarization (MPS). Hier-SciSum incorporates a Hierarchical Multimodal Fusion (HMF) module, which integrates diverse modalities by first capturing pairwise intrinsic cross-modality correlations through attention mechanisms and then refining these relationships with cross-attention masking. This hierarchical design supports a progressive understanding of both low-level pairwise relationships and higher-level integrated representations. Extensive experiments on a newly introduced MPS dataset demonstrate the model's effectiveness: Hier-SciSum generates high-quality summaries, outperforming both uni- and multi-modality baselines.
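The abstract's two-stage idea (pairwise cross-modality attention, then aggregation into a fused representation, with masking to handle missing data) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names, the mean-pooling aggregation, and the four modality names are assumptions chosen for illustration only.

```python
import numpy as np

def cross_attention(q_feats, kv_feats, mask=None):
    """Scaled dot-product cross-attention: q_feats attends to kv_feats.
    q_feats: (Lq, d); kv_feats: (Lk, d); mask: optional boolean (Lq, Lk),
    True = key position is masked out (e.g. missing-modality padding)."""
    d = q_feats.shape[-1]
    scores = q_feats @ kv_feats.T / np.sqrt(d)          # (Lq, Lk)
    if mask is not None:
        scores = np.where(mask, -1e9, scores)           # suppress masked keys
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)      # softmax over keys
    return weights @ kv_feats                           # (Lq, d)

def hierarchical_fusion(modalities, masks=None):
    """Stage 1: pairwise cross-attention between every ordered pair of
    modalities. Stage 2: mean-pool the pairwise outputs into one fused
    representation per modality (a placeholder for a learned second stage)."""
    names = list(modalities)
    fused = {}
    for a in names:
        pairwise = [
            cross_attention(
                modalities[a], modalities[b],
                None if masks is None else masks.get((a, b)),
            )
            for b in names if b != a
        ]
        fused[a] = np.mean(pairwise, axis=0)            # (La, d)
    return fused

# Hypothetical quadmodal input: text, image, video, graphical abstract.
rng = np.random.default_rng(0)
mods = {
    "text":    rng.normal(size=(8, 64)),
    "image":   rng.normal(size=(4, 64)),
    "video":   rng.normal(size=(6, 64)),
    "graphic": rng.normal(size=(3, 64)),
}
fused = hierarchical_fusion(mods)
```

Each modality keeps its own sequence length while absorbing context from the other three, which is one plausible reading of "low-level pairwise relationships" feeding "higher-level integrated representations".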
Original language | English |
---|---|
Title of host publication | 2024 International Conference on Engineering and Emerging Technologies (ICEET) |
Edition | 2024 |
Publication status | Published - Dec 2024 |
Event | 10th International Conference on Engineering and Emerging Technologies, ICEET 2024 - Dubai, United Arab Emirates (27 Dec 2024 → 28 Dec 2024) |
Publication series
Name | International Conference on Engineering and Emerging Technologies, ICEET |
---|---|
ISSN (Print) | 2409-2983 |
Conference
Conference | 10th International Conference on Engineering and Emerging Technologies, ICEET 2024 |
---|---|
Country/Territory | United Arab Emirates |
City | Dubai |
Period | 27/12/24 → 28/12/24 |
Bibliographical note
Publisher Copyright: © 2024 IEEE.
Funding
The work is supported by the Hong Kong RGC ECS (LU23200223/130393) and Internal Grants of Lingnan University, Hong Kong (code: LWP20018/871232, DR23A9/101194, DB23B5/102083, DB23AI/102070 and 102241).
Keywords
- Hierarchical Multimodal Fusion
- MPS Dataset
- Multimodal Paper Summarization (MPS)
- Quadmodal Attention