Contrastive semantic similarity learning for image captioning evaluation

Chao ZENG, Sam KWONG, Tiesong ZHAO, Hanli WANG

Research output: Journal Publications, Journal Article (refereed), peer-review

5 Citations (Scopus)


Automatically evaluating the quality of image captions is challenging because human language is flexible: the same meaning can be expressed in many ways. Most current captioning metrics rely on token-level matching between the candidate caption and the ground-truth label sentences, and they usually neglect sentence-level information. Motivated by the auto-encoder mechanism and advances in contrastive representation learning, we propose a learning-based metric, I²CE (Intrinsic Image Captioning Evaluation). To learn the evaluation metric, we develop three progressive model structures that capture sentence-level representations: a single-branch model, a dual-branch model, and a triple-branch model. To evaluate the proposed metric, we select one automatic captioning model and collect human scores on the quality of the generated captions. We introduce a statistical test on the correlation between human scores and metric scores. The proposed I²CE achieves a Spearman correlation of 51.42, which is better than the 41.95 achieved by a recently proposed BERT-based metric and also better than conventional rule-based metrics. Extensive results on the Composite-COCO and PASCAL-50S datasets further validate the effectiveness of the proposed metric. The proposed metric could serve as a novel indicator of the intrinsic information between captions, complementing existing metrics.
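The evaluation described in the abstract rests on the Spearman rank correlation between human scores and metric scores. The following is a minimal sketch of that computation in plain Python, not the authors' evaluation code. The function names and the sample scores are hypothetical, and the statistical test mentioned in the abstract is not covered. The abstract's values (51.42, 41.95) appear to be correlations scaled by 100, which is an assumption here.

```python
from math import sqrt


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        # Extend j to cover the whole run of values tied with position i.
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical data: human ratings and metric scores for five captions.
human_scores = [1, 2, 3, 4, 5]
metric_scores = [0.10, 0.35, 0.30, 0.70, 0.90]

rho = spearman(human_scores, metric_scores)
print(f"Spearman correlation: {rho:.4f} (x100: {100 * rho:.2f})")
```

Because only ranks enter the computation, a metric is rewarded for ordering captions the way human raters do, regardless of the scale of its raw scores.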
Original language: English
Pages (from-to): 913-930
Journal: Information Sciences
Early online date: 26 Jul 2022
Publication status: Published - Sept 2022
Externally published: Yes

Bibliographical note

This work is supported by the Key Project of Science and Technology Innovation 2030 of the Ministry of Science and Technology of China (Grant No. 2018AAA0101301), the Hong Kong Innovation and Technology Commission (InnoHK Project CIMDA), and in part by the Hong Kong GRF-RGC General Research Fund under Grant 11209819 (CityU 9042816) and Grant 11203820 (9042598).


  • Auto-encoder
  • Contrastive learning
  • Image captioning evaluation
  • Sentence representations

