Abstract
Image captioning is a challenging task at the intersection of vision and language research. In a typical deep learning-based image captioning model, two types of input features are used to generate the token at the current inference step: the attended visual feature and the embedding of the previous word. Sentence-level embeddings, however, are ignored in this standard pipeline. In this paper, we propose Intrinsic Cross-Modality Captioning (ICMC), a new method that improves image captioning with sentence-level embeddings and cross-modality alignment. The novelty of the proposed model lies mainly in the text encoder and the cross-modality module. In the feature encoding stage, an adaptation module maps the global visual features into the joint domain; in the decoding stage, the adapted features guide the visual attention process over the RCNN features. The proposed method thus not only attends to the visual features and the previous word when generating captions, but also incorporates sentence-level cues from the ground-truth captions during training. Evaluation on the MSCOCO benchmark and extensive ablation studies validate the effectiveness of the proposed method.
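The abstract describes the pipeline only at a high level. Below is a minimal, hypothetical sketch of that idea in PyTorch; all module names, dimensions, and the cosine alignment loss are illustrative assumptions, not the authors' implementation. It shows an adaptation module projecting the global visual feature into a joint space, that feature guiding attention over RCNN region features, and the attended summary being aligned with a ground-truth sentence embedding at training time.

```python
# Illustrative sketch of the ICMC idea from the abstract (names are assumptions):
# adapt global visual feature -> joint space, guide attention over RCNN regions,
# align the attended summary with the ground-truth sentence embedding.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdaptationModule(nn.Module):
    """Maps the global visual feature into the joint embedding space."""
    def __init__(self, vis_dim: int, joint_dim: int):
        super().__init__()
        self.proj = nn.Sequential(nn.Linear(vis_dim, joint_dim), nn.ReLU(),
                                  nn.Linear(joint_dim, joint_dim))

    def forward(self, global_feat):          # (B, vis_dim)
        return self.proj(global_feat)        # (B, joint_dim)

class GuidedAttention(nn.Module):
    """Attends over RCNN region features, guided by the adapted global feature."""
    def __init__(self, region_dim: int, joint_dim: int):
        super().__init__()
        self.q = nn.Linear(joint_dim, joint_dim)
        self.k = nn.Linear(region_dim, joint_dim)
        self.v = nn.Linear(region_dim, joint_dim)

    def forward(self, regions, guide):       # regions: (B, N, region_dim), guide: (B, joint_dim)
        scores = torch.einsum('bd,bnd->bn', self.q(guide), self.k(regions))
        alpha = F.softmax(scores / guide.size(-1) ** 0.5, dim=-1)   # attention weights over N regions
        return torch.einsum('bn,bnd->bd', alpha, self.v(regions))   # attended summary, (B, joint_dim)

def alignment_loss(attended, sent_emb):
    """Cross-modality alignment: cosine loss against the sentence embedding (an assumed choice)."""
    return 1.0 - F.cosine_similarity(attended, sent_emb, dim=-1).mean()
```

At training time such an alignment loss would be added to the usual word-level cross-entropy objective, so the sentence-level signal only shapes the learned features and imposes no cost at inference.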
Original language | English |
---|---|
Pages (from-to) | 2059–2070 |
Journal | International Journal of Machine Learning and Cybernetics |
Volume | 13 |
Issue number | 7 |
Early online date | 25 Mar 2022 |
DOIs | |
Publication status | Published - Jul 2022 |
Externally published | Yes |
Bibliographical note
This work is supported by the Key Project of Science and Technology Innovation 2030 of the Ministry of Science and Technology of China (Grant No. 2018AAA0101301), and in part by the Hong Kong GRF-RGC General Research Fund under Grant 11209819 (CityU 9042816) and Grant 11203820 (9042598).

Keywords
- Cross-modality alignment
- Image captioning
- Sentence embeddings
Projects
Adaptive Dynamic Range Enhancement Oriented to High Dynamic Display (面向高動態顯示的自適應動態範圍增強)
KWONG, S. T. W. (PI), KUO, C.-C. J. (CoI), WANG, S. (CoI) & ZHANG, X. (CoI)
Research Grants Council (HKSAR)
1/01/21 → 31/12/24
Project: Research Grant