End-to-End Dual-Stream Transformer with a Parallel Encoder for Video Captioning

Yuting RAN, Bin FANG*, Lei CHEN, Xuekai WEI, Weizhi XIAN, Mingliang ZHOU

*Corresponding author for this work

Research output: Journal Publications › Journal Article (refereed) › peer-review


In this paper, we propose an end-to-end dual-stream transformer with a parallel encoder (DST-PE) for video captioning, which combines multimodal features and global-local representations to generate coherent captions. First, we design a parallel encoder consisting of a local visual encoder and a bridge module, which simultaneously produces refined local and global visual features. Second, we devise a multimodal encoder to enhance the representation ability of our model. Finally, we adopt a transformer decoder that takes the multimodal features as input and fuses the local visual features with textual features through a cross-attention block. Extensive experimental results demonstrate that our model achieves state-of-the-art performance with low training costs on several widely used datasets.
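The abstract's final fusion step, where textual features act as queries over local visual features, can be illustrated with a minimal scaled dot-product cross-attention sketch. This is a generic illustration of the mechanism, not the authors' implementation: the toy feature matrices, dimensions, and function names below are assumptions for demonstration only.

```python
import math

def matmul(A, B):
    # Plain list-based matrix product: (n x k) @ (k x m) -> (n x m).
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def softmax(xs):
    # Numerically stable softmax over one row of attention scores.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def cross_attention(queries, keys, values):
    """Textual queries attend over visual keys/values (scaled dot-product).

    Each output row is a convex combination of the visual value vectors,
    weighted by the similarity between a text token and each visual patch.
    """
    d = len(keys[0])
    k_t = [list(col) for col in zip(*keys)]            # K^T
    scores = matmul(queries, k_t)                      # Q K^T
    weights = [softmax([s / math.sqrt(d) for s in row]) for row in scores]
    return matmul(weights, values)                     # attention-weighted sum

# Hypothetical toy inputs: 2 text tokens and 3 local visual patches, dim 4.
text_q = [[0.1, 0.2, 0.0, 0.5], [0.3, 0.1, 0.4, 0.0]]
vis_kv = [[0.2, 0.1, 0.0, 0.3], [0.5, 0.4, 0.1, 0.0], [0.0, 0.2, 0.6, 0.1]]
fused = cross_attention(text_q, vis_kv, vis_kv)
```

Because the attention weights in each row sum to one, every fused vector stays inside the convex hull of the visual features, which is what lets the decoder ground each generated word in the relevant local patches.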

Original language: English
Article number: 2450074
Journal: Journal of Circuits, Systems and Computers
Issue number: 4
Early online date: 7 Sept 2023
Publication status: Published - 15 Mar 2024
Externally published: Yes

Bibliographical note

Publisher Copyright:
© 2024 World Scientific Publishing Company.


  • end-to-end
  • global-local representations
  • multimodal encoder
  • parallel encoder
  • transformer
  • video captioning


