Dual Swin-transformer based mutual interactive network for RGB-D salient object detection

Chao ZENG, Sam KWONG*, Horace IP

*Corresponding author for this work

Research output: Journal Publications, Journal Article (refereed), peer-review

1 Citation (Scopus)


Depth information is important for RGB-D Salient Object Detection (SOD), yet conventional deep models usually rely on CNN feature extractors. Long-range contextual dependencies, dense modeling in the saliency decoder, and multi-task learning assistance are usually ignored. In this work, we propose a Dual Swin-Transformer-based Mutual Interactive Network (DTMINet), which aims to learn contextualized, dense, and edge-aware features for RGB-D SOD. We adopt the Swin-Transformer as the visual backbone to extract contextualized features. A self-attention-based Cross-Modality Interaction module is proposed to strengthen the visual backbone for cross-modal interaction. In addition, a Gated Modality Attention module is designed for cross-modal fusion. The proposed Dense Saliency Decoder, enhanced with dense connections, progressively merges the multi-level encoding features at the different decoding stages. To address the depth quality issue, a Skip Convolution module is introduced to provide guidance to the RGB modality for the saliency prediction. We also add edge prediction to the saliency predictor to regularize the learning process. Comprehensive experiments on five standard RGB-D SOD benchmark datasets over four evaluation metrics demonstrate the superiority of the proposed method.
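The abstract names a Gated Modality Attention module for cross-modal fusion but does not describe its internals. As a rough illustration of the general idea of gated fusion only, the sketch below blends an RGB feature vector and a depth feature vector with a per-element sigmoid gate. This is not the paper's module: the function name `gated_fusion`, the scalar parameters `w_rgb`, `w_depth`, and `bias`, and the use of plain Python lists in place of feature maps are all assumptions made for this example.

```python
import math


def sigmoid(x: float) -> float:
    """Logistic function, mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def gated_fusion(rgb, depth, w_rgb=1.0, w_depth=1.0, bias=0.0):
    """Blend two equal-length feature vectors with a per-element gate.

    Hypothetical sketch: the gate g is computed from both modalities,
    and the output is the convex combination g * rgb + (1 - g) * depth.
    In a trained network the gate parameters would be learned; here
    they are plain scalars supplied by the caller.
    """
    if len(rgb) != len(depth):
        raise ValueError("rgb and depth features must have the same length")
    fused = []
    for r, d in zip(rgb, depth):
        g = sigmoid(w_rgb * r + w_depth * d + bias)  # gate value in (0, 1)
        fused.append(g * r + (1.0 - g) * d)
    return fused
```

With a strongly positive `bias` the gate saturates near 1 and the output follows the RGB features, which is one simple way a model could down-weight a low-quality depth map; a strongly negative `bias` has the opposite effect.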
Original language: English
Article number: 126779
Pages (from-to): 126779
Early online date: 17 Sept 2023
Publication status: Published - 28 Nov 2023

Bibliographical note

This work is supported by the Key Project of Science and Technology Innovation 2030, supported by the Ministry of Science and Technology of China (Grant No. 2018AAA0101301), the Hong Kong Innovation and Technology Commission (InnoHK Project CIMDA), and in part by the Hong Kong GRF-RGC General Research Fund under Grant 11209819 (CityU 9042816) and Grant 11203820 (9042598).
Publisher Copyright:
© 2023


  • Dense connection
  • Edge supervision
  • Gated modality attention
  • RGB-D images
  • Salient object detection
  • Self-attention
  • Swin-transformer

