Abstract
Vision Transformer (ViT) and its variants have achieved significant success in computer vision. However, their performance may degrade in underwater dense prediction tasks due to challenges such as complex underwater environments, quality degradation, and light scattering in underwater images. To address this problem, we propose the Vision Transformer Underwater-Adapter (ViT-UWA), the first detail-focused, adapted ViT backbone for underwater dense prediction tasks, which does not require task-specific pretraining. In ViT-UWA, we first introduce a High-frequency Components Prior (HFCP) that adds high-frequency information from underwater images to the plain ViT, helping it recover and capture high-frequency information that is lost in underwater images. Then, we propose a Detail Aware Module (DAM) to obtain a detail-focused multi-scale convolutional feature pyramid, which can be used in various dense prediction tasks. Through the ViT-DAM Cross Fusion (VDCF), we achieve bidirectional feature cross fusion between ViT and DAM. We evaluate ViT-UWA on multiple underwater dense prediction tasks, including semantic segmentation, instance segmentation, and object detection. With only ImageNet-22K pretraining, our ViT-UWA-B yields state-of-the-art 46.4 box AP and 44.2 mask AP on the USIS10K dataset, which demonstrates the superiority of our method.
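The abstract does not say how the high-frequency components used by HFCP are computed. As a rough illustration of the general idea only, and not the authors' method, the sketch below treats the high-frequency part of a grayscale image as the image minus a low-pass (box-blurred) copy of itself. The function names `box_blur` and `high_frequency_residual`, the `radius` parameter, and the choice of a box blur are all assumptions made for this example.

```python
from typing import List

Image = List[List[float]]  # grayscale image as rows of pixel intensities


def box_blur(image: Image, radius: int = 1) -> Image:
    """Low-pass filter: mean over a (2*radius+1)^2 window, clipped at the borders."""
    height, width = len(image), len(image[0])
    blurred = [[0.0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            total, count = 0.0, 0
            for yy in range(max(0, y - radius), min(height, y + radius + 1)):
                for xx in range(max(0, x - radius), min(width, x + radius + 1)):
                    total += image[yy][xx]
                    count += 1
            blurred[y][x] = total / count
    return blurred


def high_frequency_residual(image: Image, radius: int = 1) -> Image:
    """Assumed stand-in for a high-frequency prior: image minus its blurred copy."""
    blurred = box_blur(image, radius)
    return [
        [pixel - low for pixel, low in zip(row, low_row)]
        for row, low_row in zip(image, blurred)
    ]


if __name__ == "__main__":
    # A flat image has no detail; a single bright pixel is pure detail.
    flat = [[0.5] * 4 for _ in range(4)]
    spike = [[0.0] * 5 for _ in range(5)]
    spike[2][2] = 1.0
    print(high_frequency_residual(flat)[0])
    print(high_frequency_residual(spike)[2])
```

A uniform region produces a residual of zero, while edges and isolated bright pixels survive, which is the kind of fine detail that scattering and quality degradation tend to suppress in underwater images. A Fourier-domain or wavelet-based high-pass filter would be an equally plausible choice here.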
| Original language | English |
|---|---|
| Pages (from-to) | 4012-4026 |
| Number of pages | 15 |
| Journal | IEEE Transactions on Image Processing |
| Volume | 35 |
| Early online date | 14 Apr 2026 |
| Publication status | Published - 2026 |
Bibliographical note
Publisher Copyright: © 2026 IEEE. All rights reserved.
Keywords
- Underwater imagery
- dense prediction
- vision transformer adapter
Title
ViT-UWA: Vision Transformer Underwater-Adapter for Dense Predictions Beneath the Water Surface