U-Former: Improving Monaural Speech Enhancement with Multi-head Self and Cross Attention

  • Xinmeng XU
  • Jianjun HAO*
  • *Corresponding author for this work

Research output: Book Chapters | Papers in Conference Proceedings > Conference paper (refereed) > Research > peer-review

Abstract

For supervised speech enhancement, contextual information is important for accurate spectral mapping. However, commonly used deep neural networks (DNNs) are limited in capturing temporal contexts. To leverage long-term contexts for tracking a target speaker, this paper treats speech enhancement as a sequence-to-sequence mapping and proposes a novel monaural speech enhancement U-net structure based on the Transformer, dubbed U-Former. The key idea is to model long-term correlations and dependencies, which are crucial for accurate noisy speech modeling, through multi-head attention mechanisms. For this purpose, U-Former incorporates multi-head attention mechanisms at two levels: 1) a multi-head self-attention module, which calculates the attention map along both the time and frequency axes to generate time and frequency sub-attention maps that leverage global interactions between encoder features; and 2) multi-head cross-attention modules, which are inserted in the skip connections and allow a fine recovery in the decoder by filtering out uncorrelated features. Experimental results show that U-Former obtains consistently better performance than recent models in terms of PESQ, STOI, and SSNR scores.
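The two attention levels named in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the abstract does not specify learned projections, layer sizes, or how the time and frequency sub-attention maps are merged, so the function names, the (time, frequency, channel) feature layout, the absence of learned weights, the summation of the two sub-attention outputs, and the choice of decoder features as queries are all assumptions made here for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(query, key, value, num_heads):
    """Scaled dot-product attention with heads formed by splitting channels.

    query: (..., Lq, C); key and value: (..., Lk, C).
    Assumption: no learned Q/K/V or output projections (kept out for brevity).
    """
    *batch, lq, c = query.shape
    lk = key.shape[-2]
    d = c // num_heads  # channels per head; C must be divisible by num_heads

    def split(x, length):
        # (..., L, C) -> (..., H, L, d)
        return np.moveaxis(x.reshape(*batch, length, num_heads, d), -2, -3)

    q, k, v = split(query, lq), split(key, lk), split(value, lk)
    scores = q @ np.swapaxes(k, -1, -2) / np.sqrt(d)  # (..., H, Lq, Lk)
    out = softmax(scores) @ v                         # (..., H, Lq, d)
    # Merge heads back: (..., H, Lq, d) -> (..., Lq, C)
    return np.moveaxis(out, -3, -2).reshape(*batch, lq, c)

def time_frequency_self_attention(x, num_heads):
    """Self-attention along the time axis and along the frequency axis.

    x: encoder features of shape (T, F, C).
    Assumption: the two sub-attention outputs are simply summed.
    """
    # Time attention: each frequency bin is a separate sequence over T.
    xt = np.swapaxes(x, 0, 1)                          # (F, T, C)
    time_out = np.swapaxes(multi_head_attention(xt, xt, xt, num_heads), 0, 1)
    # Frequency attention: each time frame is a separate sequence over F.
    freq_out = multi_head_attention(x, x, x, num_heads)
    return time_out + freq_out

def cross_attention_skip(decoder_feat, encoder_feat, num_heads):
    """Cross-attention on a skip connection.

    Assumption: decoder features act as queries and encoder features as
    keys/values, with the T x F grid flattened into one sequence.
    """
    t, f, c = encoder_feat.shape
    q = decoder_feat.reshape(t * f, c)
    kv = encoder_feat.reshape(t * f, c)
    return multi_head_attention(q, kv, kv, num_heads).reshape(t, f, c)

# Toy features: 6 time frames, 5 frequency bins, 8 channels.
rng = np.random.default_rng(0)
enc = rng.standard_normal((6, 5, 8))
dec = rng.standard_normal((6, 5, 8))
attended = time_frequency_self_attention(enc, num_heads=2)
skip = cross_attention_skip(dec, enc, num_heads=2)
print(attended.shape, skip.shape)
```

Both functions preserve the input feature shape, so in a U-net layout the outputs could replace the plain encoder features and the plain skip connection respectively. Because each softmax row sums to one, every output vector is a convex combination of value vectors, which is the sense in which the skip-connection attention can down-weight encoder features that are weakly correlated with the decoder query.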
Original language: English
Title of host publication: 2022 26th International Conference on Pattern Recognition, ICPR 2022
Publisher: IEEE
Pages: 663-669
Number of pages: 7
ISBN (Electronic): 9781665490627
DOIs
Publication status: Published - 2022
Externally published: Yes
Event: 26th International Conference on Pattern Recognition - Montreal, Canada
Duration: 21 Aug 2022 to 25 Aug 2022

Publication series

Name: Proceedings - International Conference on Pattern Recognition
Volume: 2022-August
ISSN (Print): 1051-4651

Conference

Conference: 26th International Conference on Pattern Recognition
Country/Territory: Canada
City: Montreal
Period: 21/08/22 to 25/08/22

Bibliographical note

Publisher Copyright:
© 2022 IEEE.

Keywords

  • long-term contexts
  • multi-head cross-attention
  • multi-head self-attention
  • sequence-to-sequence mapping
  • supervised speech enhancement
  • Transformer
  • U-net structure
