Leveraging Sound Local and Global Features for Language-Queried Target Sound Extraction

  • Xinmeng XU
  • Yiqun ZHANG
  • Yuhong YANG
  • Weiping TU*
  • *Corresponding author for this work

Research output: Book Chapters | Papers in Conference Proceedings › Conference paper (refereed) › Research › peer-review

Abstract

Language-queried target sound extraction is a fundamental audio-language task that aims to estimate, from a sound mixture, the audio signal of the target sound event class described by a natural language expression. One of the key challenges of this task is leveraging the language expression to highlight the target sound features in the noisy mixture in an interpretable way. In this paper, we use the language expression to guide the model to extract the most informative features of the target sound event by adaptively using local and global features. We present a novel language-aware synergic attention network (LASA-Net) for language-queried target sound extraction, the first attempt to leverage local and global operations guided by language representation to extract the target sound in single or multiple sound source environments. In particular, language-aware synergic attention consists of a local operation submodule, a global operation submodule, and an interaction submodule: the local and global operation submodules extract local and global sound features, while the interaction submodule adaptively selects the most discriminative features under the guidance of linguistic features. In addition, we introduce a linguistic-acoustic fusion module that leverages the well-proven correlation modeling power of self-attention to mine helpful multi-modal contexts. Extensive experiments demonstrate that the proposed LASA-Net achieves state-of-the-art performance while maintaining an attractive computational complexity.
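
The division of labour described in the abstract (a local branch, a global branch, and a language-guided interaction step) can be illustrated with a toy sketch. This is not the LASA-Net implementation: the abstract does not specify the operators, so the moving average, the single-head self-attention, the sigmoid gate, and every name below (`local_features`, `global_features`, `language_gated_mix`, `w_gate`) are assumptions chosen only to make the idea concrete.

```python
import math

def local_features(x, kernel=3):
    """Assumed local operation: moving average over neighbouring time frames."""
    pad = kernel // 2
    n_frames = len(x)
    out = []
    for t in range(n_frames):
        # Clamp indices at the edges so every frame has a full window.
        window = [x[min(max(i, 0), n_frames - 1)] for i in range(t - pad, t + pad + 1)]
        out.append([sum(col) / len(window) for col in zip(*window)])
    return out

def global_features(x):
    """Assumed global operation: single-head self-attention across all frames."""
    dim = len(x[0])
    out = []
    for query in x:
        scores = [sum(a * b for a, b in zip(query, key)) / math.sqrt(dim) for key in x]
        top = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        out.append([sum(w * key[c] for w, key in zip(weights, x)) / total
                    for c in range(dim)])
    return out

def language_gated_mix(x, lang, w_gate):
    """Assumed interaction step: a per-channel sigmoid gate computed from the
    query embedding blends the local and global features."""
    gate = [1.0 / (1.0 + math.exp(-sum(w * l for w, l in zip(row, lang))))
            for row in w_gate]
    loc, glo = local_features(x), global_features(x)
    return [[g * a + (1.0 - g) * b for g, a, b in zip(gate, l_row, g_row)]
            for l_row, g_row in zip(loc, glo)]

# Hypothetical inputs: 4 time frames with 2 channels, and a 3-dim query embedding.
frames = [[0.1, 0.9], [0.4, 0.2], [0.8, 0.5], [0.3, 0.7]]
query_embedding = [0.5, -0.3, 0.2]
gate_weights = [[0.2, 0.1, -0.4], [-0.6, 0.7, 0.3]]  # one row per channel
mixed = language_gated_mix(frames, query_embedding, gate_weights)
```

In this sketch a gate value near 1 favours the local branch for that channel and a value near 0 favours the global branch, which is one simple way a language query could steer feature selection.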
Original language: English
Title of host publication: Neural Information Processing: 30th International Conference, ICONIP 2023, Proceedings
Editors: Biao LUO, Long CHENG, Zheng-Guang WU, Hongyi LI, Chaojie LI
Publisher: Springer Singapore
Pages: 367-379
Number of pages: 13
ISBN (Electronic): 9789819980703
ISBN (Print): 9789819980697
DOIs
Publication status: Published - 2024
Externally published: Yes

Publication series

Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume: 14450 LNCS
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349

Bibliographical note

Publisher Copyright:
© 2024, The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.

Funding

Foundation of China (No. 62071342, No. 62171326), the Special Fund of Hubei Luojia Laboratory (No. 220100019), the Hubei Province Technological Innovation Major Project (No. 2021BAA034) and the Fundamental Research Funds for the Central Universities (No. 2042023kf1033).

Keywords

  • Language-aware synergic attention
  • Language-queried Target Sound Extraction
  • Linguistic-acoustic fusion module
  • Local and global operation
