Abstract
Speech enhancement (SE) often struggles with residual noise and speech distortion when speech and noise share overlapping feature representations. To enhance the discriminability between speech and noise, many existing methods introduce explicit noise modeling as a way to provide stronger contrastive supervision. However, their effectiveness is fundamentally limited by the unpredictable and unstructured nature of noise signals. In this work, we propose the Positive-Negative Features Decomposition and Fusion SE Network (PN-DeFuSE-Net). Instead of modeling noise explicitly, PN-DeFuSE-Net decomposes the input features into speech-dominant (target-positive) and noise-dominant (target-negative) components to enable more effective separation. These features interact through guided correction, allowing the model to recover residual speech from noise-dominant regions while suppressing residual noise in speech-dominant areas. To support this architecture, we introduce three key modules: (1) a self-similarity-guided Decomposition Module that captures static and dynamic patterns across perceptual frequency bands; (2) a Convolutional Back-Projection Module (CBPM) that enhances fine-grained speech details through residual compensation; and (3) a Multi-sparsity Back-projection Conformer Module (MBCM) that refines time-frequency dependencies with sparsity-aware attention. Extensive evaluations across diverse noise environments demonstrate that PN-DeFuSE-Net achieves substantial improvements in speech quality and intelligibility, significantly reducing speech distortion and residual noise compared to prior models.
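The decomposition idea in the abstract can be illustrated with a toy example. The sketch below is not the PN-DeFuSE-Net architecture: it assumes a soft mask with values in [0, 1] is already available, splits a small magnitude feature map into complementary speech-dominant and noise-dominant parts, and then applies a fixed-gain fusion as a stand-in for the learned guided correction. The names `decompose`, `fuse`, `recover`, and `suppress` are hypothetical and do not come from the paper.

```python
from typing import List, Tuple

# Rows are frequency bins, columns are time frames (an assumed layout).
Matrix = List[List[float]]


def decompose(features: Matrix, mask: Matrix) -> Tuple[Matrix, Matrix]:
    """Split features into complementary positive and negative parts.

    Assumes mask values lie in [0, 1]. Elementwise, positive = mask * x and
    negative = (1 - mask) * x, so positive + negative reproduces x.
    """
    positive = [[m * x for m, x in zip(m_row, x_row)]
                for m_row, x_row in zip(mask, features)]
    negative = [[(1.0 - m) * x for m, x in zip(m_row, x_row)]
                for m_row, x_row in zip(mask, features)]
    return positive, negative


def fuse(positive: Matrix, negative: Matrix,
         recover: float, suppress: float) -> Matrix:
    """Hypothetical fixed-gain stand-in for a learned guided correction.

    A fraction `recover` of the noise-dominant branch is added back
    (residual speech), and a fraction `suppress` of the speech-dominant
    branch is removed (residual noise).
    """
    return [[(1.0 - suppress) * p + recover * n for p, n in zip(p_row, n_row)]
            for p_row, n_row in zip(positive, negative)]


if __name__ == "__main__":
    features = [[1.0, 2.0], [3.0, 4.0]]
    mask = [[1.0, 0.5], [0.0, 0.25]]
    pos, neg = decompose(features, mask)
    print(fuse(pos, neg, recover=0.5, suppress=0.0))
```

In a real system the mask and the correction would be learned jointly rather than fixed; the constant gains here only make the two-branch interaction concrete.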
| Original language | English |
|---|---|
| Pages (from-to) | 4856-4869 |
| Number of pages | 14 |
| Journal | IEEE Transactions on Speech and Audio Processing |
| Volume | 33 |
| Publication status | Published - 2025 |
| Externally published | Yes |
Bibliographical note
The associate editor coordinating the review of this article and approving it for publication was Dr. Xiao-Lei Zhang.
Funding
This work was supported in part by the National Natural Science Foundation of China under Grant 62171326 and Grant 62071342 and in part by the Hubei Provincial Science and Technology Plan Project under Grant 2025CSA057.
Keywords
- back-projection
- convolutional module
- decomposition and fusion
- monaural speech enhancement
- positive and negative features
- self-attention
Fingerprint
Dive into the research topics of 'Interactive Target Positive and Negative Features Modeling for Monaural Speech Enhancement'. Together they form a unique fingerprint.