Multimodal Paper Summarization with Hierarchical Fusion

Zusheng TAN, Xinyi ZHONG, Billy CHIU*

*Corresponding author for this work

Research output: Conference paper in proceedings (refereed, peer-reviewed)

Abstract

Modern scientific articles extend beyond plain text: publishers such as Elsevier frequently supplement papers with videos and graphical abstracts (infographic summaries) to enhance the reading experience. Paper summarizers capable of effectively fusing two or more modalities remain scarce, and the complexity of integrating these diverse modalities, often with missing or incomplete data, calls for advanced modelling techniques. This paper introduces Hier-SciSum, a new model for Multimodal Paper Summarization (MPS). Hier-SciSum incorporates a Hierarchical Multimodal Fusion (HMF) module, which integrates diverse modalities by first capturing pairwise intrinsic cross-modality correlations through attention mechanisms and then refining these relationships with cross-attention masking. This hierarchical approach enables a progressive understanding of both low-level pairwise relationships and higher-level integrated representations. Extensive experiments on a newly introduced MPS dataset demonstrate the model's effectiveness: Hier-SciSum generates high-quality summaries, outperforming both uni- and multi-modality baselines.
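The two-stage fusion described in the abstract — pairwise cross-modality attention followed by a masked cross-attention refinement — can be sketched as follows. This is a minimal illustrative sketch, not the authors' implementation: the class name, dimensions, and the use of a key-padding mask to handle missing modality tokens are all assumptions.

```python
import torch
import torch.nn as nn

class PairwiseCrossAttentionFusion(nn.Module):
    """Hypothetical sketch of hierarchical two-stage fusion for one modality
    pair: (1) cross-attention captures pairwise cross-modality correlations,
    (2) a second, optionally masked cross-attention pass refines them.
    Names and dimensions are illustrative, not from the paper."""

    def __init__(self, dim: int, num_heads: int = 4):
        super().__init__()
        # Stage 1: capture pairwise intrinsic cross-modality correlations
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Stage 2: refine the fused representation with masked cross-attention
        self.refine_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, text, image, image_mask=None):
        # text: (B, L_text, D), image: (B, L_img, D)
        # image_mask: (B, L_img) bool, True where image tokens are missing/padded
        fused, _ = self.cross_attn(text, image, image,
                                   key_padding_mask=image_mask)
        fused = self.norm(text + fused)
        # Refinement pass over the same pair, still masking absent tokens
        refined, _ = self.refine_attn(fused, image, image,
                                      key_padding_mask=image_mask)
        return self.norm(fused + refined)

# Usage: fuse a text sequence with an image-patch sequence
B, L_text, L_img, D = 2, 16, 8, 64
fusion = PairwiseCrossAttentionFusion(D)
out = fusion(torch.randn(B, L_text, D), torch.randn(B, L_img, D))
print(out.shape)  # torch.Size([2, 16, 64])
```

In a quadmodal setting, one such block per modality pair would produce the low-level pairwise representations that a higher-level stage then integrates.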

Original language: English
Title of host publication: 2024 International Conference on Engineering and Emerging Technologies (ICEET)
Edition: 2024
Publication status: Published - Dec 2024
Event: 10th International Conference on Engineering and Emerging Technologies, ICEET 2024 - Dubai, United Arab Emirates
Duration: 27 Dec 2024 - 28 Dec 2024

Publication series

Name: International Conference on Engineering and Emerging Technologies, ICEET
ISSN (Print): 2409-2983

Conference

Conference: 10th International Conference on Engineering and Emerging Technologies, ICEET 2024
Country/Territory: United Arab Emirates
City: Dubai
Period: 27/12/24 - 28/12/24

Bibliographical note

Publisher Copyright:
© 2024 IEEE.

Funding

The work is supported by the Hong Kong RGC ECS (LU23200223/130393) and Internal Grants of Lingnan University, Hong Kong (code: LWP20018/871232, DR23A9/101194, DB23B5/102083, DB23AI/102070 and 102241).

Keywords

  • Hierarchical Multimodal Fusion
  • MPS Dataset
  • Multimodal Paper Summarization (MPS)
  • Quadmodal Attention
