TIAR: Text-Image-Audio Retrieval with weighted multimodal re-ranking

Peide CHI, Yong FENG*, Mingliang ZHOU, Xian-cai XIONG, Yong-heng WANG, Bao-hua QIANG

*Corresponding author for this work

Research output: Journal Publications › Journal Article (refereed) › peer-review

1 Citation (Scopus)


Cross-modal retrieval has developed remarkably in recent years and has received extensive attention as an essential method for the study of multimodal interaction. However, most existing models are limited to a single application of cross-modal retrieval, namely text-image retrieval, and neglect the audio modality, which is widely present in data and can be integrated into the models to improve retrieval performance. To address this issue, we propose a text-image-audio cross-modal retrieval (TIAR) model that, given any one or two modalities, retrieves the remaining modalities. TIAR consists of three modal-specific encoders that extract features and a cross-modal encoder that generates joint contextualized representations for all modalities. To evaluate our model, we present two new cross-modal retrieval tasks, named cross-unimodal and cross-bimodal retrieval, that are applicable to three modalities. Then, during testing, we propose a weighted multimodal re-ranking (WMR) algorithm that integrates comprehensive ranking information in the similarity matrices of all tasks to improve performance without additional training. The experimental results show that TIAR-WMR outperforms state-of-the-art models in traditional text-image retrieval on the Flickr30k, COCO, and ADE20k datasets. Moreover, the retrieval performance of TIAR-WMR is further boosted in the two proposed tasks when two input modalities are integrated. The code is available at https://github.com/PeideChi/TIAR.
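The abstract does not spell out how WMR combines the similarity matrices, so the following is only a minimal sketch of the general idea of weighted late-fusion re-ranking, not the authors' algorithm. It assumes that each retrieval task yields a query-by-candidate similarity matrix, that rows are converted to rank-based scores so the matrices are comparable, and that the weights are chosen by hand. The function names `rank_scores` and `weighted_rerank`, the example matrices, and the weights are all hypothetical.

```python
import numpy as np

def rank_scores(sim):
    """Turn each row of a similarity matrix into rank-based scores in [0, 1].

    The best-matching candidate in a row gets 1.0 and the worst gets 0.0.
    (Assumption: rank normalization is used here only to put matrices
    from different tasks on a common scale.)
    """
    n = sim.shape[1]
    # argsort applied twice gives each entry's rank position (0 = smallest).
    ranks = np.argsort(np.argsort(sim, axis=1), axis=1)
    return ranks / max(n - 1, 1)

def weighted_rerank(sims, weights):
    """Fuse several similarity matrices and return candidate orderings.

    sims:    list of (num_queries, num_candidates) arrays, one per task
    weights: one hand-picked weight per matrix (hypothetical values)
    Returns an array of candidate indices per query, best match first.
    """
    fused = sum(w * rank_scores(s) for s, w in zip(sims, weights))
    return np.argsort(-fused, axis=1)

# Toy example: one query, three candidate images.
sim_text_to_image = np.array([[0.9, 0.8, 0.1]])   # text query vs. images
sim_audio_to_image = np.array([[0.2, 0.7, 0.1]])  # audio query vs. images

order = weighted_rerank([sim_text_to_image, sim_audio_to_image], [0.4, 0.6])
print(order)  # [[1 0 2]]
```

In this toy case the text similarities alone would rank candidate 0 first, but once the audio similarities are weighted in, candidate 1 moves to the top. No retraining is involved; only the existing similarity scores are recombined at test time.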

Original language: English
Pages (from-to): 22898-22916
Number of pages: 19
Journal: Applied Intelligence
Issue number: 19
Early online date: 4 Jul 2023
Publication status: Published - Oct 2023
Externally published: Yes

Bibliographical note

Publisher Copyright:
© 2023, The Author(s), under exclusive licence to Springer Science+Business Media, LLC, part of Springer Nature.


  • Audio retrieval
  • Cross-modal retrieval
  • Fusion learning
  • Multimedia

