End-to-End Dual-Stream Transformer with a Parallel Encoder for Video Captioning

Yuting RAN, Bin FANG*, Lei CHEN, Xuekai WEI, Weizhi XIAN, Mingliang ZHOU

*Corresponding author for this work

Research output: Journal Publications › Journal Article (refereed) › peer-review

Abstract

In this paper, we propose an end-to-end dual-stream transformer with a parallel encoder (DST-PE) for video captioning, which combines multimodal features and global-local representations to generate coherent captions. First, we design a parallel encoder consisting of a local visual encoder and a bridge module, which simultaneously produces refined local and global visual features. Second, we devise a multimodal encoder to enhance the representation ability of our model. Finally, we adopt a transformer decoder that takes the multimodal features as input, with the local visual features fused with the textual features through a cross-attention block. Extensive experimental results demonstrate that our model achieves state-of-the-art performance with low training costs on several widely used datasets.
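The abstract describes the architecture only at a high level, so the following PyTorch sketch is a hedged illustration of how such a dual-stream design might be wired, not the authors' implementation: the attention-pooling bridge module, the layer counts, the single visual input stream, and all names (ParallelEncoder, DSTPE) are assumptions filled in for concreteness.

```python
import torch
import torch.nn as nn

class ParallelEncoder(nn.Module):
    """Local visual encoder plus a bridge module pooling a global feature."""
    def __init__(self, d_model=512, nhead=8, num_layers=2):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.local_encoder = nn.TransformerEncoder(layer, num_layers)
        # Hypothetical bridge: attention-pool the local tokens into a single
        # global token (the abstract does not specify the mechanism).
        self.global_query = nn.Parameter(torch.randn(1, 1, d_model))
        self.bridge = nn.MultiheadAttention(d_model, nhead, batch_first=True)

    def forward(self, frame_tokens):                       # (B, T, d)
        local = self.local_encoder(frame_tokens)           # refined local features
        query = self.global_query.expand(frame_tokens.size(0), -1, -1)
        global_feat, _ = self.bridge(query, local, local)  # (B, 1, d)
        return local, global_feat

class DSTPE(nn.Module):
    def __init__(self, vocab_size=10000, d_model=512, nhead=8):
        super().__init__()
        self.parallel_encoder = ParallelEncoder(d_model, nhead)
        mm_layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.multimodal_encoder = nn.TransformerEncoder(mm_layer, 2)
        self.embed = nn.Embedding(vocab_size, d_model)
        # Cross-attention block: textual queries attend to local visual keys.
        self.fusion = nn.MultiheadAttention(d_model, nhead, batch_first=True)
        dec_layer = nn.TransformerDecoderLayer(d_model, nhead, batch_first=True)
        self.decoder = nn.TransformerDecoder(dec_layer, 2)
        self.head = nn.Linear(d_model, vocab_size)

    def forward(self, frame_tokens, caption_ids):
        local, global_feat = self.parallel_encoder(frame_tokens)
        # Stand-in "multimodal" memory built from the global + local visual
        # streams; the paper's multimodal encoder presumably mixes in
        # further modalities not shown here.
        memory = self.multimodal_encoder(torch.cat([global_feat, local], dim=1))
        text = self.embed(caption_ids)                     # (B, L, d)
        fused, _ = self.fusion(text, local, local)         # text/visual fusion
        L = caption_ids.size(1)
        causal = torch.triu(torch.full((L, L), float("-inf")), diagonal=1)
        out = self.decoder(fused, memory, tgt_mask=causal)
        return self.head(out)                              # (B, L, vocab)

model = DSTPE()
logits = model(torch.randn(2, 16, 512), torch.randint(0, 10000, (2, 12)))
print(logits.shape)  # torch.Size([2, 12, 10000])
```

The sketch keeps the three stages the abstract names (parallel encoder, multimodal encoder, decoder with cross-attention fusion); dimensions, pooling strategy, and the training objective are placeholders.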

Original language: English
Article number: 2450074
Journal: Journal of Circuits, Systems and Computers
Volume: 33
Issue number: 4
Early online date: 7 Sept 2023
DOIs
Publication status: Published - 15 Mar 2024
Externally published: Yes

Bibliographical note

Publisher Copyright:
© 2024 World Scientific Publishing Company.

Keywords

  • end-to-end
  • global-local representations
  • multimodal encoder
  • parallel encoder
  • transformer
  • video captioning
