Dynamic Weighted Combiner for Mixed-Modal Image Retrieval

  • Fuxiang HUANG
  • Lei ZHANG*
  • Xiaowei FU
  • Suqi SONG
  • *Corresponding author for this work

Research output: Journal Publications · Journal Article (refereed) · peer-review

13 Citations (Scopus)

Abstract

Mixed-Modal Image Retrieval (MMIR), a flexible search paradigm, has attracted wide attention. However, previous approaches achieve limited performance because two critical factors are seriously overlooked. 1) The image and text modalities contribute differently, but are incorrectly treated as equal. 2) Text descriptions of users' intentions in web datasets from diverse real-world scenarios contain inherent labeling noise, which gives rise to overfitting. We propose a Dynamic Weighted Combiner (DWC) to tackle these challenges, which offers three merits. First, we propose an Editable Modality De-equalizer (EMD) that takes into account the contribution disparity between modalities and contains two modality feature editors and an adaptive weighted combiner. Second, to alleviate labeling noise and data bias, we propose a dynamic soft-similarity label generator (SSG) to implicitly improve noisy supervision. Finally, to bridge modality gaps and facilitate similarity learning, we propose a CLIP-based mutual enhancement module alternately trained with a mixed-modality contrastive loss. Extensive experiments verify that our proposed model significantly outperforms state-of-the-art methods on real-world datasets. The source code is available at https://github.com/fuxianghuang1/DWC.
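
The two ideas named in the abstract, weighting the modalities unequally and supervising with soft rather than one-hot similarity labels, can be illustrated with a minimal sketch. This is not the authors' implementation (see the linked repository for that). The function names, the hand-picked `gate_scores`, the toy 3-dimensional features, and the soft-label values are all hypothetical; in a real model the gate scores and soft labels would be produced by learned components.

```python
import math

def softmax(xs, temperature=1.0):
    # Numerically stable softmax over a list of scores.
    scaled = [x / temperature for x in xs]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def l2_normalize(v):
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def combine(image_feat, text_feat, gate_scores):
    # Fuse the modalities with per-query weights instead of a fixed
    # 50/50 average. gate_scores is a hypothetical pair of logits.
    w_img, w_txt = softmax(gate_scores)
    fused = [w_img * i + w_txt * t for i, t in zip(image_feat, text_feat)]
    return l2_normalize(fused), (w_img, w_txt)

def soft_label_loss(query, targets, soft_labels, temperature=0.1):
    # Cross-entropy between the predicted similarity distribution and a
    # soft target distribution (a one-hot list recovers the usual loss).
    probs = softmax([dot(query, t) for t in targets], temperature)
    return -sum(y * math.log(p) for y, p in zip(soft_labels, probs))

# Toy example: the text is assumed to matter more for this query.
image_feat = l2_normalize([1.0, 0.0, 0.0])
text_feat = l2_normalize([0.0, 1.0, 0.0])
query, weights = combine(image_feat, text_feat, gate_scores=[0.2, 1.4])

targets = [l2_normalize([0.2, 1.0, 0.0]), l2_normalize([1.0, 0.1, 0.0])]
loss_hard = soft_label_loss(query, targets, [1.0, 0.0])
loss_soft = soft_label_loss(query, targets, [0.9, 0.1])
```

With one-hot labels the second candidate is treated as entirely wrong; the soft labels instead assign it a small share of the target mass, which is one simple way to express that an annotation may be noisy.
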
Original language: English
Pages (from-to): 2303-2311
Number of pages: 9
Journal: Proceedings of the AAAI Conference on Artificial Intelligence
Volume: 38
Issue number: 3
Publication status: Published - 25 Mar 2024
Externally published: Yes
Event: The 38th Annual AAAI Conference on Artificial Intelligence - Vancouver, Canada
Duration: 20 Feb 2024 to 27 Feb 2024

Bibliographical note

Publisher Copyright:
Copyright © 2024, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

Funding

This work was partially supported by the National Key R&D Program of China (2021YFB3100800), the National Natural Science Fund of China (62271090, 61771079), the Chongqing Natural Science Fund (cstc2021jcyj-jqX0023), and the National Youth Talent Project. This work was also supported by the Huawei computational power of the Chongqing Artificial Intelligence Innovation Center.

Keywords

  • CV: Image and Video Retrieval
  • ML: Multimodal Learning
  • ML: Representation Learning
