A Multimodal Contrastive Network with Unbiased Distillation for Knowledge-based VQA

Zihan HU*, Ruoyao DING, Haoran XIE, Zhenguo YANG

*Corresponding author for this work

Research output: Book Chapters | Papers in Conference Proceedings › Conference paper (refereed) › peer-review

Abstract

In this paper, we propose a multimodal contrastive network with unbiased distillation (MCUD) for knowledge-based VQA, which consists of contrastive sample construction (CSC), unbiased contrastive distillation (UCD), and hierarchical reasoning (HR) modules. Specifically, CSC constructs contrastive samples by transforming the knowledge corpus and adopts entropy-adjusted answer frequencies to identify unbiased samples. UCD employs a dual-branch feature extractor that encodes the knowledge corpus and the image-question pairs separately into a shared embedding space, where knowledge-driven contrastive learning is designed to bridge the modality gap between the textual knowledge corpus and the cross-modal image-question pairs. Furthermore, the teacher model in UCD applies different distillation strategies to biased and unbiased samples, guiding the student model toward a generalized, unbiased representation. Finally, the HR module produces chain-of-thought outputs: it sequentially locates the contextual sentences in the knowledge corpus, generates a rationale, and infers the answer. Extensive experiments on two datasets, E-VQA and ScienceQA, demonstrate the effectiveness of our method and its superior performance.
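The abstract does not give the loss formulations, but the two core components of UCD it describes (a contrastive objective aligning knowledge and image-question embeddings in a shared space, and a distillation objective that treats biased and unbiased samples differently) can be sketched generically. The sketch below is a minimal numpy illustration under stated assumptions: a symmetric InfoNCE loss stands in for the knowledge-driven contrastive learning, and a sample-weighted KL distillation term stands in for the biased/unbiased distillation strategies; the function names, the temperature values, and the down-weighting scheme for biased samples are all assumptions, not the paper's actual method.

```python
import numpy as np

def _log_softmax(x, axis):
    """Numerically stable log-softmax along the given axis."""
    m = x.max(axis=axis, keepdims=True)
    return x - m - np.log(np.exp(x - m).sum(axis=axis, keepdims=True))

def info_nce(knowledge_emb, iq_emb, temperature=0.07):
    """Symmetric InfoNCE loss between knowledge-corpus embeddings and
    image-question embeddings; matched pairs sit on the diagonal."""
    kn = knowledge_emb / np.linalg.norm(knowledge_emb, axis=1, keepdims=True)
    iq = iq_emb / np.linalg.norm(iq_emb, axis=1, keepdims=True)
    logits = kn @ iq.T / temperature
    n = logits.shape[0]
    loss_k2i = -np.trace(_log_softmax(logits, axis=1)) / n  # knowledge -> IQ
    loss_i2k = -np.trace(_log_softmax(logits, axis=0)) / n  # IQ -> knowledge
    return 0.5 * (loss_k2i + loss_i2k)

def unbiased_distillation(student_logits, teacher_logits, unbiased_mask,
                          tau=2.0, biased_weight=0.1):
    """Per-sample KL(teacher || student) at temperature tau; biased
    samples are down-weighted so the student follows the teacher mainly
    on unbiased samples (the weighting scheme here is an assumption)."""
    t_log = _log_softmax(teacher_logits / tau, axis=1)
    s_log = _log_softmax(student_logits / tau, axis=1)
    kl = (np.exp(t_log) * (t_log - s_log)).sum(axis=1)
    weights = np.where(unbiased_mask, 1.0, biased_weight)
    return (weights * kl).mean() * tau ** 2
```

With perfectly aligned orthonormal embeddings the contrastive loss is near zero, and when the student matches the teacher the distillation loss vanishes, which is the qualitative behavior either component should exhibit regardless of the paper's exact formulation.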

Original language: English
Title of host publication: 2024 International Joint Conference on Neural Networks, IJCNN 2024 - Proceedings
Publisher: Institute of Electrical and Electronics Engineers Inc.
ISBN (Electronic): 9798350359312
DOIs
Publication status: Published - 2024
Event: 2024 International Joint Conference on Neural Networks, IJCNN 2024 - Yokohama, Japan
Duration: 30 Jun 2024 – 5 Jul 2024

Publication series

Name: Proceedings of the International Joint Conference on Neural Networks

Conference

Conference: 2024 International Joint Conference on Neural Networks, IJCNN 2024
Country/Territory: Japan
City: Yokohama
Period: 30/06/24 – 5/07/24

Bibliographical note

Publisher Copyright:
© 2024 IEEE.

Keywords

  • contrastive and distillation learning
  • knowledge bias
  • knowledge-based visual question answering

