Abstract
Language-queried target sound extraction is a fundamental audio-language task that aims to estimate, from a sound mixture, the audio signal of the target sound event class described by a natural language expression. One of the key challenges of this task is using the language expression to highlight the target sound features in the noisy mixture in an interpretable way. In this paper, we use the language expression to guide the model to extract the most informative features of the target sound event by adaptively using local and global features. We present a novel language-aware synergic attention network (LASA-Net) for language-queried target sound extraction, the first attempt to leverage local and global operations guided by language representations to extract the target sound in single- or multiple-source environments. In particular, language-aware synergic attention consists of a local operation submodule, a global operation submodule, and an interaction submodule: the local and global operation submodules extract local and global sound features, while the interaction submodule adaptively selects the most discriminative features under the guidance of linguistic features. In addition, we introduce a linguistic-acoustic fusion module that leverages the well-proven correlation modeling power of self-attention to extract helpful multi-modal contexts. Extensive experiments demonstrate that the proposed LASA-Net achieves state-of-the-art performance while maintaining an attractive computational complexity.
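The abstract does not give the internals of the three submodules, so the following is only a toy sketch of the general idea of a language-conditioned choice between local and global features. Everything in it is an assumption made for illustration: the moving-average "local operation", the all-frame mean "global operation", the sigmoid gate, and the names `local_features`, `global_features`, and `synergic_mix` are hypothetical and are not taken from LASA-Net.

```python
import math

def local_features(frames, radius=1):
    """Stand-in local operation (assumed): average each frame with its neighbours."""
    out = []
    for t in range(len(frames)):
        window = frames[max(0, t - radius): t + radius + 1]
        out.append([sum(col) / len(window) for col in zip(*window)])
    return out

def global_features(frames):
    """Stand-in global operation (assumed): mean over all frames, copied to every frame."""
    mean = [sum(col) / len(frames) for col in zip(*frames)]
    return [mean[:] for _ in frames]

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def synergic_mix(frames, query, radius=1):
    """Blend local and global features per frame with a gate driven by the query vector."""
    loc = local_features(frames, radius)
    glo = global_features(frames)
    mixed, gates = [], []
    for l, g in zip(loc, glo):
        # Hypothetical gate: positive when the local feature aligns with the
        # language query better than the global feature does.
        score = sum(q * (a - b) for q, a, b in zip(query, l, g))
        w = sigmoid(score)
        gates.append(w)
        mixed.append([w * a + (1.0 - w) * b for a, b in zip(l, g)])
    return mixed, gates

# Four frames of a two-dimensional toy feature, and a toy query embedding.
frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]]
query = [1.0, -1.0]
mixed, gates = synergic_mix(frames, query)
```

In this toy setup a gate near 1 keeps the local feature and a gate near 0 falls back to the global one. A learned model would presumably replace the fixed dot-product score with trainable projections of the linguistic and acoustic features.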
| Original language | English |
|---|---|
| Title of host publication | Neural Information Processing: 30th International Conference, ICONIP 2023, Proceedings |
| Editors | Biao LUO, Long CHENG, Zheng-Guang WU, Hongyi LI, Chaojie LI |
| Publisher | Springer Singapore |
| Pages | 367-379 |
| Number of pages | 13 |
| ISBN (Electronic) | 9789819980703 |
| ISBN (Print) | 9789819980697 |
| DOIs | |
| Publication status | Published - 2024 |
| Externally published | Yes |
Publication series
| Name | Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) |
|---|---|
| Volume | 14450 LNCS |
| ISSN (Print) | 0302-9743 |
| ISSN (Electronic) | 1611-3349 |
Bibliographical note
Publisher Copyright: © 2024, The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
Funding
Foundation of China (No. 62071342, No. 62171326), the Special Fund of Hubei Luojia Laboratory (No. 220100019), the Hubei Province Technological Innovation Major Project (No. 2021BAA034) and the Fundamental Research Funds for the Central Universities (No. 2042023kf1033).
Keywords
- Language-aware synergic attention
- Language-queried target sound extraction
- Linguistic-acoustic fusion module
- Local and global operation
Fingerprint
Dive into the research topics of 'Leveraging Sound Local and Global Features for Language-Queried Target Sound Extraction'. Together they form a unique fingerprint.