Contribution-aware Dynamic Multi-modal Balance for Audio-Visual Speech Separation

Abstract
Recent developments in audio-visual speech separation (AVSS) highlight the importance of visual information for extracting clean speech from noisy audio. However, the audio modality often dominates the learning process because of its direct connection to the output, so networks concentrate on audio-related parameters while the visual modality is underutilized. The problem is especially pronounced when the visual input is of low quality and therefore provides little useful information. To address it, we propose the Contribution-aware Multimodal Balance Network (CAMB-Net), which uses the dominant modality to guide the extraction and processing of low-contribution features in the weaker modality. CAMB-Net introduces two main innovations: (1) it identifies the dominant and weak modalities by measuring the similarity between each modality's features and the fused features, then divides the weak modality's features into high- and low-contribution groups for refinement; and (2) it leverages cross-modal features to guide the extraction of low-contribution features within the weak modality, with the weak modality's high-contribution features shaping that guidance. We evaluated CAMB-Net on datasets with both high- and low-quality video inputs. Under high-quality video conditions, it gains 0.4 dB in SDRi, 0.5 dB in SI-SNRi, and 0.05 in PESQ over the state-of-the-art IIANet model. In the more challenging visually occluded scenarios of LRS3, CAMB-Net reaches 15.2 dB in SDRi, 15.6 dB in SI-SNRi, and 3.19 in PESQ, demonstrating its robustness.
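The contribution-aware split described in innovation (1) can be sketched in a few lines of PyTorch. This is a minimal illustration under stated assumptions, not the paper's implementation: the function name `split_by_contribution`, the `(batch, channels, time)` tensor layout, the per-channel cosine-similarity scoring, and the top-k split ratio are all hypothetical choices made for the sketch.

```python
import torch
import torch.nn.functional as F


def split_by_contribution(audio_feat, visual_feat, fused_feat, ratio=0.5):
    """Sketch of a contribution-aware modality split (hypothetical).

    Each modality is scored by the cosine similarity between its features
    and the fused features; the modality with the lower overall similarity
    is treated as the weak one, and its channels are partitioned into
    high- and low-contribution groups. Shapes assumed: (batch, channels, time).
    """
    # Global similarity of each modality to the fused features -> dominant vs. weak.
    a_sim = F.cosine_similarity(audio_feat.flatten(1), fused_feat.flatten(1), dim=1).mean()
    v_sim = F.cosine_similarity(visual_feat.flatten(1), fused_feat.flatten(1), dim=1).mean()
    weak_is_visual = bool(v_sim < a_sim)
    weak = visual_feat if weak_is_visual else audio_feat

    # Per-channel similarity to the fused features scores each channel's contribution.
    chan_sim = F.cosine_similarity(weak, fused_feat, dim=-1)  # (batch, channels)
    k = max(1, int(ratio * weak.shape[1]))
    high_idx = chan_sim.topk(k, dim=1).indices                # indices of high-contribution channels
    mask = torch.zeros_like(chan_sim, dtype=torch.bool).scatter_(1, high_idx, True)

    high = weak * mask.unsqueeze(-1)       # high-contribution group
    low = weak * (~mask).unsqueeze(-1)     # low-contribution group, to be refined under guidance
    return high, low, ("visual" if weak_is_visual else "audio")
```

In the full model these two groups would feed the cross-modal guidance stage of innovation (2); here the split simply zero-masks complementary channel sets, so the two groups always sum back to the weak modality's features.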
| Original language | English |
|---|---|
| Pages (from-to) | 1-13 |
| Number of pages | 13 |
| Journal | IEEE Transactions on Multimedia |
| Publication status | E-pub ahead of print - 16 Jan 2026 |
| Externally published | Yes |
Bibliographical note
Publisher Copyright: © 1999-2012 IEEE.
Keywords
- Audio-visual speech separation
- Cosine similarity
- High-contribution and low-contribution features
- Multimodal balance