Contribution-aware Dynamic Multi-modal Balance for Audio-Visual Speech Separation

  • Xinmeng XU
  • Weiping TU*
  • Yuhong YANG
  • Jizhen LI
  • Yiqun ZHANG
  • Hongyang CHEN

*Corresponding author for this work

Research output: Journal Publications / Journal Article (refereed) / peer-review

Abstract

Recent developments in audio-visual speech separation (AVSS) highlight the importance of visual information in extracting clean speech from noisy audio. However, the audio modality often dominates the learning process because of its direct connection to the output, leading networks to concentrate on audio-related parameters while the visual modality is underutilized. This imbalance is especially pronounced when the visual input is of low quality and therefore provides limited useful information. To address this problem, we propose the Contribution-aware Multimodal Balance Network (CAMB-Net), which uses the dominant modality to guide the extraction and processing of low-contribution features in the weaker modality. CAMB-Net introduces two main innovations: (1) it identifies the dominant and weak modalities by measuring the similarity between each modality's features and the fused features, then divides the weak modality's features into high- and low-contribution groups for refinement; and (2) it leverages cross-modal features to guide the extraction of low-contribution features within the weak modality, with the weak modality's high-contribution features steering the guidance process. We evaluated CAMB-Net on datasets with both high- and low-quality video inputs. The results show significant improvements over the state-of-the-art IIANet model, with gains of 0.4 dB in SDRi, 0.5 dB in SI-SNRi, and 0.05 in PESQ under high-quality video conditions. In the more challenging scenario of visual occlusion on LRS3, CAMB-Net remains strong, reaching 15.2 dB in SDRi, 15.6 dB in SI-SNRi, and 3.19 in PESQ, demonstrating its robustness.
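The contribution-scoring step in innovation (1) can be illustrated with a minimal sketch, assuming PyTorch-style pooled feature tensors. The cosine-similarity comparison against the fused features matches the keyword list below, but the function names (score_modalities, split_by_contribution), the (batch, dim) feature shapes, and the median threshold are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def score_modalities(audio_feat, visual_feat, fused_feat):
    """Cosine similarity between each modality's (batch, dim) features and
    the fused features, averaged over the batch; the higher-scoring
    modality is treated as dominant, the other as weak."""
    a_score = F.cosine_similarity(audio_feat, fused_feat, dim=-1).mean()
    v_score = F.cosine_similarity(visual_feat, fused_feat, dim=-1).mean()
    return a_score, v_score

def split_by_contribution(weak_feat, fused_feat):
    """Split the weak modality's channels into high- and low-contribution
    groups by per-channel agreement with the fused features; the median
    threshold here is an assumed, illustrative choice."""
    sim = (weak_feat * fused_feat).mean(dim=0).abs()   # (dim,) per-channel score
    mask = sim >= sim.median()                         # True = high contribution
    return weak_feat[:, mask], weak_feat[:, ~mask]

# Toy usage with random pooled features.
audio = torch.randn(8, 256)
visual = torch.randn(8, 256)
fused = torch.randn(8, 256)

a_s, v_s = score_modalities(audio, visual, fused)
weak = visual if v_s < a_s else audio                  # pick the weak modality
high, low = split_by_contribution(weak, fused)
print(high.shape, low.shape)
```

In this sketch, the low-contribution channels would then be refined under cross-modal guidance as described in innovation (2); how that guidance is computed is left to the paper itself.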
Original language: English
Pages (from-to): 1-13
Number of pages: 13
Journal: IEEE Transactions on Multimedia
DOIs
Publication status: E-pub ahead of print - 16 Jan 2026
Externally published: Yes

Bibliographical note

Publisher Copyright:
© 1999-2012 IEEE.

Keywords

  • Audio-visual speech separation
  • cosine similarity
  • high-contribution and low-contribution features
  • multimodal balance
