Multi-level network based on transformer encoder for fine-grained image–text matching

Lei YANG, Yong FENG*, Mingliang ZHOU*, Xiancai XIONG, Yongheng WANG, Baohua QIANG

*Corresponding author for this work

Research output: Journal Publications › Journal Article (refereed) › peer-review

1 Citation (Scopus)


Image–text matching is important for understanding both vision and language. Existing methods use the cross-attention mechanism to explore deep semantic information. However, most of these methods need to perform two types of alignment, which is extremely time-consuming. In addition, current methods do not consider the digital information within the image or text, which may reduce retrieval performance. In this paper, we propose a multi-level network based on the transformer encoder for fine-grained image–text matching. First, we use the transformer encoder to extract intra-modality relations within the image and text and perform the alignment through an efficient aggregating method, which makes the alignment more efficient and fully utilizes the intra-modality information. Second, we capture the discriminative digital information within the image and text to make the representation more distinguishable. Finally, we use the global information of the image and text as complementary information to enhance the representation. In our experiments, the method achieves significant improvements in retrieval performance and runtime compared with state-of-the-art algorithms. The source code is available at

Original language: English
Pages (from-to): 1981-1994
Number of pages: 14
Journal: Multimedia Systems
Issue number: 4
Early online date: 10 Apr 2023
Publication status: Published - Aug 2023
Externally published: Yes

Bibliographical note

Publisher Copyright:
© 2023, The Author(s), under exclusive licence to Springer-Verlag GmbH Germany, part of Springer Nature.


Keywords

  • Fine-grained information
  • Image–text matching
  • Multi-level network
  • Transformer encoder

