Abstract
In this paper, we propose a multimodal contrastive network with unbiased distillation (MCUD) for knowledge-based VQA, which consists of contrastive sample construction (CSC), unbiased contrastive distillation (UCD), and hierarchical reasoning (HR) modules. Specifically, CSC constructs contrastive samples by transforming the knowledge corpus and uses entropy-adjusted answer frequencies to identify unbiased samples. UCD then employs a dual-branch feature extractor to encode the knowledge corpus and the image-question pairs separately into a shared embedding space, where knowledge-driven contrastive learning bridges the modality gap between the textual knowledge corpus and the cross-modal image-question pairs. Furthermore, the teacher model in UCD applies different distillation strategies to biased and unbiased samples, guiding the student model toward a generalized, unbiased representation. Finally, the HR module produces chain-of-thought outputs: it sequentially locates the relevant contextual sentences in the knowledge corpus, generates a rationale, and infers the answer. Extensive experiments on two datasets, E-VQA and ScienceQA, demonstrate the effectiveness of our method and its advantages over existing approaches.
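The abstract does not spell out the CSC criterion, but the idea of entropy-adjusted answer frequencies can be illustrated with a short sketch. Everything below (the function names, the threshold `tau`, and the exact adjustment rule) is a hypothetical reconstruction, not the authors' implementation: a sample is flagged as biased when its answer is over-represented within its question type, with the cutoff relaxed when that type's answer distribution is already near-uniform (high entropy).

```python
import math
from collections import Counter

def entropy(probs):
    """Shannon entropy of a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def split_biased_unbiased(samples, tau=0.6):
    """Partition (question_type, answer) pairs into biased/unbiased.

    Hypothetical criterion: a sample is biased when its answer's
    frequency within its question type exceeds an entropy-adjusted
    cutoff. Skewed answer distributions (low normalized entropy) get
    a stricter cutoff, so their frequent answers are more readily
    treated as dataset bias.
    """
    by_type = {}
    for qtype, ans in samples:
        by_type.setdefault(qtype, Counter())[ans] += 1

    biased, unbiased = [], []
    for sample in samples:
        qtype, ans = sample
        counts = by_type[qtype]
        total = sum(counts.values())
        freq = counts[ans] / total
        if len(counts) > 1:
            # Normalized entropy in [0, 1]; 1 = uniform answers.
            h = entropy([c / total for c in counts.values()]) / math.log(len(counts))
        else:
            h = 0.0  # a single answer for this type is maximally skewed
        if freq > tau * (0.5 + 0.5 * h):  # cutoff shrinks as skew grows
            biased.append(sample)
        else:
            unbiased.append(sample)
    return biased, unbiased
```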
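Similarly, the sketch below shows one way the UCD objective could look: an InfoNCE-style contrastive term over the two encoder branches plus a distillation term that weights biased and unbiased samples differently. It assumes PyTorch, and the loss form, weighting scheme, and hyperparameter names (`temp`, `alpha`, `beta`) are illustrative assumptions rather than the authors' actual formulation.

```python
import torch
import torch.nn.functional as F

def contrastive_distill_loss(knowledge_emb, iq_emb, teacher_logits,
                             student_logits, unbiased_mask,
                             temp=0.07, alpha=1.0, beta=0.5):
    """Knowledge-driven contrastive loss plus a bias-aware distillation
    term. `knowledge_emb` and `iq_emb` come from the two encoder
    branches (knowledge corpus vs. image-question pairs) and are
    assumed to live in a shared d-dimensional embedding space.
    """
    # InfoNCE over matched pairs: the i-th knowledge passage is the
    # positive for the i-th image-question pair; all others are negatives.
    k = F.normalize(knowledge_emb, dim=-1)
    q = F.normalize(iq_emb, dim=-1)
    logits = q @ k.t() / temp                      # (B, B) similarity matrix
    targets = torch.arange(q.size(0), device=q.device)
    l_con = F.cross_entropy(logits, targets)

    # Per-sample KL distillation from teacher to student; biased samples
    # are down-weighted (by beta < 1) so the student is pulled toward
    # the teacher's behaviour mainly on unbiased samples.
    kd = F.kl_div(F.log_softmax(student_logits, dim=-1),
                  F.softmax(teacher_logits, dim=-1),
                  reduction='none').sum(-1)        # per-sample KL
    weights = torch.where(unbiased_mask,
                          torch.ones_like(kd),
                          beta * torch.ones_like(kd))
    l_kd = (weights * kd).mean()
    return l_con + alpha * l_kd
```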
Original language | English
---|---
Title of host publication | 2024 International Joint Conference on Neural Networks, IJCNN 2024 - Proceedings
Publisher | Institute of Electrical and Electronics Engineers Inc.
ISBN (Electronic) | 9798350359312
DOIs |
Publication status | Published - 2024
Event | 2024 International Joint Conference on Neural Networks, IJCNN 2024, Yokohama, Japan, 30 Jun 2024 → 5 Jul 2024
Publication series
Name | Proceedings of the International Joint Conference on Neural Networks
Conference
Conference | 2024 International Joint Conference on Neural Networks, IJCNN 2024 |
---|---|
Country/Territory | Japan |
City | Yokohama |
Period | 30/06/24 → 5/07/24 |
Bibliographical note
Publisher Copyright: © 2024 IEEE.
Keywords
- contrastive and distillation learning
- knowledge bias
- knowledge-based visual question answering