Abstract
Motion expression video segmentation aims to segment objects based on input motion descriptions. Compared with traditional referring video object segmentation, it focuses on motion and multi-object expressions and is more challenging. Previous works approach it by simply injecting text information into a video instance segmentation (VIS) model. However, this requires retraining the entire model, and optimization is difficult. In this work, we propose DMVS, a simple framework built on an existing query-based VIS model, which decouples the task into video instance segmentation and motion expression understanding. First, we use a frozen video instance segmenter to extract object-specific contexts and convert them into frame-level and video-level queries. Second, the two levels of queries interact with static and motion cues, respectively, to further encode visually enhanced motion expressions. Furthermore, we propose a novel query initialization strategy that uses video queries guided by classification priors to initialize motion queries, greatly reducing the difficulty of optimization. Without bells and whistles, DMVS achieves state-of-the-art performance on the MeViS dataset at a lower training cost. Extensive experiments verify the effectiveness and efficiency of our framework.
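The query initialization idea mentioned in the abstract can be illustrated with a small sketch. The abstract does not say how classification priors guide the selection, so everything here is an assumption made for illustration: the helper name `init_motion_queries`, the use of each query's maximum class score as its confidence, and the top-k selection rule are hypothetical and are not taken from the paper.

```python
from typing import List, Sequence


def init_motion_queries(
    video_queries: Sequence[Sequence[float]],
    class_scores: Sequence[Sequence[float]],
    num_motion_queries: int,
) -> List[List[float]]:
    """Hypothetical helper: seed motion queries from confident video queries.

    Assumption (not from the paper): a query's confidence is its highest
    class score, and the top-k most confident video queries are copied
    to serve as the initial motion queries.
    """
    if len(video_queries) != len(class_scores):
        raise ValueError("one score vector is required per video query")

    # Confidence of each video query = its best class score (assumed prior).
    confidence = [max(scores) for scores in class_scores]

    # Rank query indices by confidence, highest first.
    ranked = sorted(
        range(len(video_queries)),
        key=lambda i: confidence[i],
        reverse=True,
    )

    # Copy the selected queries so later updates do not alter the originals.
    return [list(video_queries[i]) for i in ranked[:num_motion_queries]]


# Toy example: three 2-dimensional video queries with two-class scores.
video_queries = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
class_scores = [[0.2, 0.1], [0.9, 0.05], [0.4, 0.6]]
motion_queries = init_motion_queries(video_queries, class_scores, 2)
```

In this toy setup the second and third video queries are selected, since their best class scores (0.9 and 0.6) are the highest. Starting the motion queries from confident object-level queries, rather than from random values, is one plausible reading of how such an initialization could ease optimization.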
| Original language | English |
|---|---|
| Title of host publication | 2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) |
| Pages | 13821-13831 |
| DOIs | |
| Publication status | Published - 13 Aug 2025 |
| Event | The IEEE/CVF Conference on Computer Vision and Pattern Recognition 2025, Music City Center, Nashville, United States. Duration: 11 Jun 2025 → 15 Jun 2025. https://cvpr.thecvf.com/ |
Conference
| Conference | The IEEE/CVF Conference on Computer Vision and Pattern Recognition 2025 |
|---|---|
| Abbreviated title | CVPR 2025 |
| Country/Territory | United States |
| City | Nashville |
| Period | 11/06/25 → 15/06/25 |
| Internet address | https://cvpr.thecvf.com/ |
Fingerprint
Dive into the research topics of 'Decoupled Motion Expression Video Segmentation'. Together they form a unique fingerprint.