TY - JOUR
T1 - Learning Conditional Diffusion Transformer for Salient Object Detection in Optical Remote Sensing Images
AU - ZENG, Chao
AU - ZHANG, Jun
AU - KWONG, Sam
N1 - Publisher Copyright: © 2013 IEEE.
PY - 2026
Y1 - 2026
N2 - In recent years, detecting salient objects in optical remote sensing images has posed a significant challenge. Existing approaches rely heavily on a limited number of labeled saliency masks and usually use convolutional neural networks (CNNs) for feature decoding. In this article, we introduce the conditional diffusion transformer network (CDTNet), a novel architecture designed to learn contextualized and diffusion-guided features for optical remote sensing image salient object detection (ORSI SOD). We present a Transformer-based progressive cross-stage fusion (PCSF) module, which serves as the decoding unit for saliency prediction and integrates multiscale features from different stages of the network. Through this fusion, the model can better capture the inner structure of the image and improve the accuracy of saliency prediction. Moreover, we develop a patch strategy (PS) for fine-grained feature aggregation, which allows the network to focus on detailed information within individual feature patches and thus make better use of the transformer layers. In addition, an encoder feature enhancement (EFE) module enhances the features extracted from the backbone network using spatial and channel attention. We conduct comprehensive experiments on various benchmark datasets and evaluation metrics. The experimental results demonstrate that the proposed CDTNet outperforms the compared state-of-the-art (SOTA) methods.
AB - In recent years, detecting salient objects in optical remote sensing images has posed a significant challenge. Existing approaches rely heavily on a limited number of labeled saliency masks and usually use convolutional neural networks (CNNs) for feature decoding. In this article, we introduce the conditional diffusion transformer network (CDTNet), a novel architecture designed to learn contextualized and diffusion-guided features for optical remote sensing image salient object detection (ORSI SOD). We present a Transformer-based progressive cross-stage fusion (PCSF) module, which serves as the decoding unit for saliency prediction and integrates multiscale features from different stages of the network. Through this fusion, the model can better capture the inner structure of the image and improve the accuracy of saliency prediction. Moreover, we develop a patch strategy (PS) for fine-grained feature aggregation, which allows the network to focus on detailed information within individual feature patches and thus make better use of the transformer layers. In addition, an encoder feature enhancement (EFE) module enhances the features extracted from the backbone network using spatial and channel attention. We conduct comprehensive experiments on various benchmark datasets and evaluation metrics. The experimental results demonstrate that the proposed CDTNet outperforms the compared state-of-the-art (SOTA) methods.
KW - Conditional diffusion model
KW - feature patches
KW - optical remote sensing images
KW - salient object detection
KW - transformers
UR - https://www.scopus.com/pages/publications/105033655995
U2 - 10.1109/TCYB.2026.3667145
DO - 10.1109/TCYB.2026.3667145
M3 - Journal Article (refereed)
C2 - 41838518
SN - 2168-2267
SP - 1
EP - 14
JO - IEEE Transactions on Cybernetics
JF - IEEE Transactions on Cybernetics
ER -