Enhancing Large Language Models for Scientific Multimodal Summarization with Multimodal Output

Zusheng TAN, Xinyi ZHONG*, Jing-Yu JI, Wei JIANG, Billy CHIU*

*Corresponding author for this work

Research output: Papers in Conference Proceedings › Conference paper (refereed) › peer-reviewed

Abstract

The increasing integration of multimedia such as videos and graphical abstracts in scientific publications necessitates advanced summarization techniques. This paper introduces Uni-SciSum, a framework for Scientific Multimodal Summarization with Multimodal Output (SMSMO), addressing the challenges of fusing heterogeneous data sources (e.g., text, images, video, audio) and producing multimodal summaries within a unified architecture. Uni-SciSum leverages the power of large language models (LLMs) and extends their capability to cross-modal understanding through BridgeNet, a query-based transformer that fuses diverse modalities into a fixed-length embedding. A two-stage training process, involving modal-to-modal pre-training and cross-modal instruction tuning, aligns different modalities with summaries and optimizes for multimodal summary generation. Experiments on two new SMSMO datasets show Uni-SciSum outperforms uni- and multi-modality methods, advancing LLM applications in the increasingly multimodal realm of scientific communication.
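To make the query-based fusion idea concrete, below is a minimal sketch of a module in the spirit of BridgeNet as described in the abstract: learnable queries cross-attend to concatenated multimodal features and produce a fixed-length embedding that a frozen LLM could consume. All names (QueryFusion, num_queries, dim) and design details are illustrative assumptions, not the paper's actual implementation.

```python
# Hypothetical sketch of a query-based fusion module (BridgeNet-style);
# details are assumed, not taken from the paper.
import torch
import torch.nn as nn

class QueryFusion(nn.Module):
    """Compress variable-length multimodal features into a fixed-length
    embedding via learnable queries and cross-attention."""
    def __init__(self, num_queries=32, dim=768, num_heads=8, num_layers=2):
        super().__init__()
        # Learnable query tokens that will summarize the fused inputs.
        self.queries = nn.Parameter(torch.randn(num_queries, dim) * 0.02)
        self.layers = nn.ModuleList(
            nn.MultiheadAttention(dim, num_heads, batch_first=True)
            for _ in range(num_layers)
        )
        self.norm = nn.LayerNorm(dim)

    def forward(self, modality_feats):
        # modality_feats: list of (batch, seq_len_i, dim) tensors,
        # e.g. encoded text, image, video, and audio features.
        kv = torch.cat(modality_feats, dim=1)  # concatenate along sequence axis
        q = self.queries.unsqueeze(0).expand(kv.size(0), -1, -1)
        for attn in self.layers:
            q = q + attn(q, kv, kv)[0]  # queries cross-attend to all modalities
        return self.norm(q)  # (batch, num_queries, dim): fixed-length embedding

# Usage sketch: emb = QueryFusion()([text_feat, image_feat, video_feat])
```

The fixed number of queries keeps the multimodal context length constant regardless of how many frames or figures a paper contains, which is the usual motivation for this kind of design.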
Original language: English
Title of host publication: Proceedings of the 31st International Conference on Computational Linguistics: Industry Track
Editors: Owen RAMBOW, Leo WANNER, Marianna APIDIANAKI, Hend AL-KHALIFA, Barbara DI EUGENIO, Steven SCHOCKAERT, Kareem DARWISH, Apoorv AGARWAL
Publisher: Association for Computational Linguistics
ISBN (Print): 9798891761971
Publication status: Published - Jan 2025
Event: The 31st International Conference on Computational Linguistics: Industry Track - Abu Dhabi, United Arab Emirates
Duration: 19 Jan 2025 – 24 Jan 2025
https://coling2025.org/

Conference

Conference: The 31st International Conference on Computational Linguistics: Industry Track
Abbreviated title: COLING 2025
Country/Territory: United Arab Emirates
City: Abu Dhabi
Period: 19/01/25 – 24/01/25
Internet address: https://coling2025.org/
