Abstract
For supervised speech enhancement, contextual information is important for accurate spectral mapping. However, commonly used deep neural networks (DNNs) are limited in capturing temporal contexts. To leverage long-term contexts for tracking a target speaker, this paper treats speech enhancement as a sequence-to-sequence mapping and proposes a novel monaural speech enhancement U-net structure based on the Transformer, dubbed U-Former. The key idea is to model long-term correlations and dependencies, which are crucial for accurate noisy speech modeling, through multi-head attention mechanisms. For this purpose, U-Former incorporates multi-head attention at two levels: 1) a multi-head self-attention module, which calculates the attention map along both the time and frequency axes to generate time and frequency sub-attention maps that leverage global interactions between encoder features, and 2) multi-head cross-attention modules, which are inserted in the skip connections and allow fine recovery in the decoder by filtering out uncorrelated features. Experimental results show that U-Former consistently outperforms recent models in terms of PESQ, STOI, and SSNR scores.
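The two attention levels named in the abstract can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the function names, the omission of learned query/key/value projections, the choice to sum the time-axis and frequency-axis outputs, and the use of decoder features as queries over encoder skip features are all assumptions made for illustration.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(queries, keys, values):
    """Scaled dot-product attention over lists of equal-width vectors."""
    scale = math.sqrt(len(queries[0]))
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, values))
                    for j in range(len(values[0]))])
    return out

def multi_head(queries, keys, values, num_heads):
    """Split channels into heads, attend per head, concatenate the results.

    Assumption: no learned projections; each head sees a channel slice.
    """
    channels = len(queries[0])
    assert channels % num_heads == 0, "channels must divide evenly into heads"
    width = channels // num_heads
    out = [[] for _ in queries]
    for h in range(num_heads):
        lo, hi = h * width, (h + 1) * width
        head = attend([q[lo:hi] for q in queries],
                      [k[lo:hi] for k in keys],
                      [v[lo:hi] for v in values])
        for row, piece in zip(out, head):
            row.extend(piece)
    return out

def time_frequency_self_attention(x, num_heads):
    """x[t][f] is the channel vector at frame t and frequency bin f.

    Assumption: the time and frequency sub-attention outputs are summed.
    """
    frames, bins, channels = len(x), len(x[0]), len(x[0][0])
    out = [[[0.0] * channels for _ in range(bins)] for _ in range(frames)]
    # Time axis: within each frequency bin, frames attend to one another.
    for f in range(bins):
        seq = [x[t][f] for t in range(frames)]
        res = multi_head(seq, seq, seq, num_heads)
        for t in range(frames):
            for c in range(channels):
                out[t][f][c] += res[t][c]
    # Frequency axis: within each frame, bins attend to one another.
    for t in range(frames):
        res = multi_head(x[t], x[t], x[t], num_heads)
        for f in range(bins):
            for c in range(channels):
                out[t][f][c] += res[f][c]
    return out

def cross_attention_skip(decoder_seq, encoder_seq, num_heads):
    """Hypothetical skip-connection filter: decoder features query the
    encoder features, so weakly correlated encoder frames get low weight."""
    return multi_head(decoder_seq, encoder_seq, encoder_seq, num_heads)
```

Because every attention output is a weighted average of its value vectors, a constant input passes through each axis unchanged, which makes the sketch easy to sanity-check on toy data before swapping in real spectrogram features.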
| Original language | English |
|---|---|
| Title of host publication | 2022 26th International Conference on Pattern Recognition, ICPR 2022 |
| Publisher | IEEE |
| Pages | 663-669 |
| Number of pages | 7 |
| ISBN (Electronic) | 9781665490627 |
| DOIs | |
| Publication status | Published - 2022 |
| Externally published | Yes |
| Event | 26th International Conference on Pattern Recognition - Montreal, Canada. Duration: 21 Aug 2022 → 25 Aug 2022 |
Publication series
| Name | Proceedings - International Conference on Pattern Recognition |
|---|---|
| Volume | 2022-August |
| ISSN (Print) | 1051-4651 |
Conference
| Conference | 26th International Conference on Pattern Recognition |
|---|---|
| Country/Territory | Canada |
| City | Montreal |
| Period | 21/08/22 → 25/08/22 |
Bibliographical note
Publisher Copyright: © 2022 IEEE.
Keywords
- long-term contexts
- multi-head cross-attention
- multi-head self-attention
- sequence-to-sequence mapping
- supervised speech enhancement
- Transformer
- U-net structure
Fingerprint
Dive into the research topics of 'U-Former: Improving Monaural Speech Enhancement with Multi-head Self and Cross Attention'. Together they form a unique fingerprint.