Learning cross-modality features for image caption generation


Research output: Journal Publications › Journal Article (refereed) › peer-review

3 Citations (Scopus)


Image captioning is a challenging task at the intersection of vision and language research. In a typical deep learning-based captioning model, two types of input features are used to generate the token at the current inference step: the attended visual feature and the embedding of the previous word. However, sentence-level embeddings are ignored in this standard pipeline. In this paper, we propose Intrinsic Cross-Modality Captioning (ICMC), a new method that improves image captioning with sentence-level embeddings and cross-modality alignment. The novelty of our model lies mainly in the text encoder and the cross-modality module. In the feature encoding stage, we use an adaptation module to map the global visual features into the joint vision-language domain. In the decoding stage, we then use the adapted features to guide the visual attention process over the RCNN features. With the proposed method, we not only attend to the visual features and the previous word when generating captions but also incorporate sentence-level clues from the ground-truth captions at the training phase. Evaluation on the MSCOCO benchmark and extensive ablation studies validate the effectiveness of the proposed method.
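The abstract's two-step mechanism (an adaptation module mapping the global visual feature into the joint domain, which then guides attention over RCNN region features) can be sketched as follows. This is a minimal illustrative NumPy sketch, not the paper's actual implementation: all weight matrices, dimensions, and the additive-attention form are assumptions chosen for clarity.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D score vector.
    e = np.exp(x - x.max())
    return e / e.sum()

def adapt(global_feat, W_a):
    # Adaptation module (illustrative): linear map of the global
    # visual feature into the joint vision-language space.
    return np.tanh(W_a @ global_feat)

def guided_attention(region_feats, adapted_global, W_r, W_g, v):
    # Additive attention over RCNN region features, guided by the
    # adapted global feature; returns weights and the attended context.
    scores = np.array([
        v @ np.tanh(W_r @ r + W_g @ adapted_global)
        for r in region_feats
    ])
    alpha = softmax(scores)  # attention weights over regions
    context = (alpha[:, None] * region_feats).sum(axis=0)
    return alpha, context

# Toy shapes and random parameters (hypothetical, for illustration only).
rng = np.random.default_rng(0)
d_vis, d_joint, n_regions = 8, 6, 5
global_feat = rng.normal(size=d_vis)
region_feats = rng.normal(size=(n_regions, d_vis))
W_a = rng.normal(size=(d_joint, d_vis))
W_r = rng.normal(size=(d_joint, d_vis))
W_g = rng.normal(size=(d_joint, d_joint))
v = rng.normal(size=d_joint)

g = adapt(global_feat, W_a)
alpha, context = guided_attention(region_feats, g, W_r, W_g, v)
```

In the full model, `context` would be fed to the caption decoder alongside the previous word embedding at each step; the sentence-level ground-truth embeddings described in the abstract would additionally supervise the text encoder during training.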
Original language: English
Pages (from-to): 2059–2070
Journal: International Journal of Machine Learning and Cybernetics
Issue number: 7
Early online date: 25 Mar 2022
Publication status: Published - Jul 2022
Externally published: Yes

Bibliographical note

This work is supported by Key Project of Science and Technology Innovation 2030 supported by the Ministry of Science and Technology of China (Grant No. 2018AAA0101301), and in part by the Hong Kong GRF-RGC General Research Fund under Grant 11209819 (CityU 9042816) and Grant 11203820 (9042598).


  • Cross-modality alignment
  • Image captioning
  • Sentence embeddings


