Learning Conditional Diffusion Transformer for Salient Object Detection in Optical Remote Sensing Images

Research output: Journal Publications › Journal Article (refereed) › peer-review

Abstract

In recent years, detecting salient objects in optical remote sensing images has remained a formidable challenge. Existing approaches rely heavily on a limited number of labeled saliency masks and usually use convolutional neural networks (CNNs) for feature decoding. In this article, we introduce the conditional diffusion transformer network (CDTNet), a novel architecture designed to learn contextualized and diffusion-guided features for optical remote sensing image salient object detection (ORSI SOD). Our work presents a Transformer-based progressive cross-stage fusion (PCSF) module. This module serves as the decoding unit for saliency prediction, integrating multiscale features from different stages of the network. Through this fusion, the model can better understand the inner structure of the image and improve the accuracy of saliency prediction. Moreover, we develop a patch strategy (PS) for fine-grained feature aggregation, which allows the network to focus on detailed information within individual feature patches and thus make better use of the transformer layers. In addition, the encoder feature enhancement (EFE) module enhances the features extracted by the backbone network using spatial and channel attention. We conduct comprehensive experiments on various benchmark datasets and evaluation metrics. The experimental results demonstrate the superiority of the proposed CDTNet over the compared state-of-the-art (SOTA) methods.
Original language: English
Pages (from-to): 1-14
Number of pages: 14
Journal: IEEE Transactions on Cybernetics
Early online date: 16 Mar 2026
Publication status: Published - 2026

Bibliographical note

Publisher Copyright:
© 2013 IEEE.

Funding

This work was supported in part by Hong Kong General Research Fund (GRF) under Grant 13200425; in part by the Research Grants Council (RGC) of Hong Kong Special Administrative Region, China, under Grant STG5/E-103/24-R; in part by the National Research Foundation of Korea under Grant RS-2025-00555463 and Grant RS-2025-25456394; in part by Tianjin Top Scientist Studio Project under Grant 24JRRCRC00030; and in part by Tianjin Belt and Road Joint Laboratory under Grant 24PTLYHZ00250.

Keywords

  • Conditional diffusion model
  • feature patches
  • optical remote sensing images
  • salient object detection
  • transformers
