Transformer with a Parallel Decoder for Image Captioning

Peilang WEI, Xu LIU, Jun LUO, Huayan PU, Xiaoxu HUANG, Shilong WANG, Huajun CAO, Shouhong YANG*, Xu ZHUANG, Jason WANG, Hong YUE, Cheng JI, Mingliang ZHOU

*Corresponding author for this work

Research output: Journal Publications › Journal Article (refereed) › peer-review


In this paper, a parallel decoder and a word group prediction module are proposed to speed up decoding and improve caption quality. The image features extracted by the encoder are linearly projected onto different word groups, and a specially designed relaxed mask matrix improves both decoding speed and caption quality. First, since a caption consists of many words, sentences can be broken down into word groups or individual words according to their syntactic structure; we achieve this through constituency parsing. Second, we make full use of the extracted features to predict the size of each word group. Third, a new embedding representing word-group information is proposed on top of standard word embeddings. Finally, with the help of word groups, we design a mask matrix that modifies the decoding process so that each step of the model can produce one or more words in parallel. Experiments on public datasets demonstrate that our method reduces time complexity while maintaining competitive performance.
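The relaxed mask described in the abstract can be illustrated with a small sketch. The paper does not publish its exact construction, so the following is an assumption of one common way to build such a mask: a block lower-triangular attention mask in which tokens belonging to the same word group may attend to one another (and to all earlier groups), allowing the decoder to emit a whole group in one parallel step. The function name `relaxed_causal_mask` and the `group_sizes` input are illustrative, not from the paper.

```python
import numpy as np

def relaxed_causal_mask(group_sizes):
    """Build a block lower-triangular boolean mask.

    group_sizes: list of word-group lengths, e.g. [2, 1, 3].
    Entry (i, j) is True when token i may attend to token j:
    tokens in the same group see each other and all earlier groups.
    """
    n = sum(group_sizes)
    mask = np.zeros((n, n), dtype=bool)
    start = 0
    for g in group_sizes:
        end = start + g
        # Every token in this group attends to all tokens up to
        # and including the end of its own group.
        mask[start:end, :end] = True
        start = end
    return mask
```

With `group_sizes = [1] * n` this reduces to the standard causal (lower-triangular) mask, so the relaxation only changes behavior where a group spans more than one token.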

Original language: English
Article number: 2354029
Journal: International Journal of Pattern Recognition and Artificial Intelligence
Issue number: 1
Publication status: Published - Jan 2024
Externally published: Yes

Bibliographical note

Publisher Copyright:
© 2024 World Scientific Publishing Company.


Keywords

  • constituency parsing
  • image captioning
  • time complexity
  • transformer
  • word groups


