Abstract
Despite the effectiveness of Segment Anything Model (SAM)-based methods in Few-Shot Segmentation (FSS) tasks, a closer examination of their prompt encoding mechanism reveals that these methods rely solely on visual information to generate a single type of prompt. Consequently, they suffer from semantic-granularity representation bias and a loss of spatial information. To address these limitations, this paper introduces a multi-modal prompt encoder that enables SAM to leverage both annotated reference images and textual descriptions of class names as segmentation prompts. This encoder generates text prompts, dense visual prompts, and sparse visual prompts, spanning multiple modalities and granularities. These prompts provide richer representations of the target class, capturing both abstract semantics and specific details while maintaining an appropriate level of granularity. When our multi-modal prompt encoder is integrated with SAM's image encoder and mask decoder, the overall model is referred to as MM-Prompt. To validate its effectiveness, we conducted extensive empirical studies on the PASCAL-5i and COCO-20i datasets. The experimental results demonstrate that MM-Prompt achieves state-of-the-art performance on FSS tasks, highlighting its substantial potential and value in this domain.
| Original language | English |
|---|---|
| Title of host publication | MM 2025 - Proceedings of the 33rd ACM International Conference on Multimedia, Co-Located with MM 2025 |
| Publisher | Association for Computing Machinery, Inc |
| Pages | 3067-3075 |
| Number of pages | 9 |
| ISBN (Electronic) | 9798400720352 |
| DOIs | |
| Publication status | Published - 27 Oct 2025 |
| Event | 33rd ACM International Conference on Multimedia, MM 2025 (Dublin, Ireland, 27 Oct 2025 → 31 Oct 2025) |
Publication series
| Name | MM 2025 - Proceedings of the 33rd ACM International Conference on Multimedia, Co-Located with MM 2025 |
|---|
Conference
| Conference | 33rd ACM International Conference on Multimedia, MM 2025 |
|---|---|
| Country/Territory | Ireland |
| City | Dublin |
| Period | 27/10/25 → 31/10/25 |
Bibliographical note
Publisher Copyright: © 2025 ACM.
Funding
This work was supported in part by the National Natural Science Foundation of China under Grants 62471278, 62302141, and 62331003; in part by the Taishan Scholar Project of Shandong Province under Grant tsqn202306079; in part by the Research Grants Council of the Hong Kong Special Administrative Region, China, under Grant STG5/E-103/24-R; and in part by the Fundamental Research Funds for the Central Universities under Grant JZ2024HGTB0255.
Keywords
- few-shot learning
- multi-modal
- segment anything
- segmentation