MM-Prompt: Multi-modality and Multi-granularity Prompts for Few-Shot Segmentation

  • Hang XIONG
  • Runmin CONG*
  • Jinpeng CHEN
  • Chen ZHANG
  • Feng LI
  • Huihui BAI
  • Sam KWONG

*Corresponding author for this work

Research output: Book Chapters | Papers in Conference Proceedings › Conference paper (refereed) › Research › peer-review

Abstract

Despite the effectiveness of Segment Anything Model (SAM) based methods in Few-Shot Segmentation (FSS) tasks, our closer examination of their prompt encoding mechanism reveals that these methods rely solely on visual information to generate a single type of prompt. Consequently, they suffer from semantic granularity representation bias and a loss of spatial information. To address these limitations, this paper introduces an innovative multi-modal prompt encoder, enabling SAM to leverage both annotated reference images and textual descriptions of class names as segmentation prompts. This approach generates text prompts, dense visual prompts, and sparse visual prompts, spanning multiple modalities and granularities. These prompts provide enhanced representations of the target class, capturing both abstract semantics and specific details, while ensuring granularity appropriateness. When our multi-modal prompt encoder is integrated with SAM's image encoder and mask decoder, the overall model is referred to as MM-Prompt. To validate its effectiveness, we conducted extensive empirical studies on the PASCAL-5i and COCO-20i datasets. The experimental results demonstrate that MM-Prompt achieves state-of-the-art performance in FSS tasks, highlighting its substantial potential and value in this domain.
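The abstract describes an encoder that turns an annotated support image and a class name into three kinds of prompts for SAM: a text prompt, a dense visual prompt, and a sparse visual prompt. The sketch below illustrates that general idea only; it is not the paper's implementation, and all shapes, helper names, and the similarity-based prompt construction are illustrative assumptions.

```python
import numpy as np

D = 8  # embedding dimension (illustrative)

def text_prompt(class_name: str, dim: int = D) -> np.ndarray:
    """Toy stand-in for a frozen text encoder producing a class-name embedding."""
    rng = np.random.default_rng(abs(hash(class_name)) % (2**32))
    return rng.standard_normal(dim)

def dense_prompt(support_feat, support_mask, query_feat):
    """Coarse dense prompt (H, W): cosine similarity between the masked
    support prototype and each query-feature location."""
    proto = (support_feat * support_mask[..., None]).sum((0, 1))
    proto /= support_mask.sum() + 1e-6
    sim = query_feat @ proto
    norms = np.linalg.norm(query_feat, axis=-1) * np.linalg.norm(proto)
    return sim / (norms + 1e-6)

def sparse_prompt(dense, k=3):
    """Sparse prompt: the k most confident (row, col) point locations."""
    idx = np.argsort(dense.ravel())[-k:]
    return np.stack(np.unravel_index(idx, dense.shape), axis=-1)

# Usage: build all three prompt types for a toy 4x4 feature map.
H, W = 4, 4
rng = np.random.default_rng(0)
s_feat = rng.standard_normal((H, W, D))
s_mask = (rng.random((H, W)) > 0.5).astype(float)
q_feat = rng.standard_normal((H, W, D))

t = text_prompt("bicycle")           # text prompt, shape (D,)
d = dense_prompt(s_feat, s_mask, q_feat)  # dense prompt, shape (H, W)
p = sparse_prompt(d)                 # sparse prompt, shape (k, 2)
```

In the actual model these prompts would feed SAM's mask decoder alongside the image-encoder features; here they are just plain arrays to make the multi-granularity idea concrete.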

Original language: English
Title of host publication: MM 2025 - Proceedings of the 33rd ACM International Conference on Multimedia, Co-Located with MM 2025
Publisher: Association for Computing Machinery, Inc
Pages: 3067-3075
Number of pages: 9
ISBN (Electronic): 9798400720352
DOIs
Publication status: Published - 27 Oct 2025
Event: 33rd ACM International Conference on Multimedia, MM 2025 - Dublin, Ireland
Duration: 27 Oct 2025 - 31 Oct 2025

Publication series

Name: MM 2025 - Proceedings of the 33rd ACM International Conference on Multimedia, Co-Located with MM 2025

Conference

Conference: 33rd ACM International Conference on Multimedia, MM 2025
Country/Territory: Ireland
City: Dublin
Period: 27/10/25 - 31/10/25

Bibliographical note

Publisher Copyright:
© 2025 ACM.

Funding

This work was supported in part by the National Natural Science Foundation of China under Grant 62471278, Grant 62302141, and Grant 62331003; in part by the Taishan Scholar Project of Shandong Province under Grant tsqn202306079; in part by the Research Grants Council of the Hong Kong Special Administrative Region, China, under Grant STG5/E-103/24-R; and in part by the Fundamental Research Funds for the Central Universities under Grant JZ2024HGTB0255.

Keywords

  • few-shot learning
  • multi-modal
  • segment anything
  • segmentation

