SMSMO: Learning to generate multimodal summary for scientific papers

Xinyi ZHONG, Zusheng TAN, Shen GAO, Jing LI, Jiaxing SHEN, Jingyu JI, Jeff TANG, Billy CHIU*

*Corresponding author for this work

Research output: Journal Publications › Journal Article (refereed) › peer-review

Abstract

Nowadays, publishers like Elsevier increasingly use graphical abstracts (i.e., pictorial paper summaries) alongside textual abstracts to facilitate the reading of scientific papers. In this setting, automatically identifying a representative image and generating a suitable textual summary for each paper can save editors and readers time, helping them read and understand papers. To this end, we introduce the dataset for Scientific Multimodal Summarization with Multimodal Output (SMSMO). Unlike other multimodal tasks, which operate on generic, medium-sized content (e.g., news), SMSMO must handle the longer multimodal content of papers, with finer-grained multimodal interactions and semantic alignments between images and text. For this, we propose a cross-modality, multi-task learning summarizer (CMT-Sum). It captures the intra- and inter-modality interactions between images and text through a cross-fusion module, and models finer-grained image-text semantic alignment by jointly generating the text summary, selecting the key image, and matching the text and image. Extensive experiments on two newly introduced datasets for the SMSMO task demonstrate our model's effectiveness.
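The abstract names two core components of CMT-Sum: a cross-fusion module that captures intra- and inter-modality interactions, and a multi-task objective that jointly generates the summary, selects the key image, and matches text against images. The sketch below illustrates one plausible realization of these ideas in PyTorch; the names (CrossFusion, multitask_loss), the use of multi-head attention for fusion, and the loss weights alpha and beta are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossFusion(nn.Module):
    """Hypothetical cross-fusion block: self-attention within each modality
    (intra-modality) followed by cross-attention between modalities
    (inter-modality). Dimensions and layer counts are illustrative only."""
    def __init__(self, d_model: int = 512, n_heads: int = 8):
        super().__init__()
        self.text_self = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.img_self = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.text_to_img = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.img_to_text = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, text: torch.Tensor, imgs: torch.Tensor):
        # Intra-modality interactions within each sequence of features.
        t, _ = self.text_self(text, text, text)
        v, _ = self.img_self(imgs, imgs, imgs)
        # Inter-modality interactions: each modality attends to the other.
        t_fused, _ = self.img_to_text(t, v, v)  # text queries attend to image keys/values
        v_fused, _ = self.text_to_img(v, t, t)  # image queries attend to text keys/values
        # Residual connections keep the original modality signal.
        return t + t_fused, v + v_fused

def multitask_loss(gen_logits, summary_ids, img_scores, key_img_label,
                   match_logits, match_label, alpha=1.0, beta=1.0):
    """Joint objective over the three tasks named in the abstract:
    summary generation, key-image selection, and image-text matching.
    The simple weighted sum (alpha, beta) is an assumption, not the paper's."""
    # Token-level cross-entropy for the generated summary.
    gen = F.cross_entropy(gen_logits.reshape(-1, gen_logits.size(-1)),
                          summary_ids.reshape(-1))
    # Classification over candidate images for key-image selection.
    select = F.cross_entropy(img_scores, key_img_label)
    # Binary matching between text and image (match_label is a float tensor).
    match = F.binary_cross_entropy_with_logits(match_logits, match_label)
    return gen + alpha * select + beta * match
```

In this sketch the fused text representation would feed a summary decoder and the fused image representation a selection head; tuning alpha and beta would trade off the auxiliary tasks against generation quality.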
Original language: English
Article number: 112908
Number of pages: 16
Journal: Knowledge-Based Systems
Volume: 310
Early online date: 30 Dec 2024
Publication status: E-pub ahead of print - 30 Dec 2024

Bibliographical note

Publisher Copyright:
© 2024

Funding

This work is supported by the LEO Dr David P. Chan Institute of Data Science, the Hong Kong RGC ECS (LU23200223/130393), the Lam Woo Research Fund (LWP20018/871232), the Direct Grant (DR23A9/101194), the Faculty Research Grants (DB23B5/102083 and DB23AI/102070), the Research Seed Fund (102241) of Lingnan University, Hong Kong, the National Science Foundation of China (62476070), the Shenzhen Science and Technology Program (JCYJ20241202123503005, GXWD20231128103232001), and the Department of Science and Technology of Guangdong (2024A1515011540).

Keywords

  • Multi-task
  • Multimodal scientific summarization
  • Cross-modality fusion
