Abstract
Fine-grained visual classification (FGVC) is a challenging task characterized by interclass similarity and intraclass diversity, and it has broad application prospects. Recently, several methods have adopted the vision Transformer (ViT) for FGVC because the data specificity of its multihead self-attention (MSA) mechanism helps extract discriminative feature representations. However, these works focus on integrating feature dependencies at a high level, which leaves the model easily disturbed by low-level background information. To address this issue, we propose a fine-grained attention-locating vision Transformer (FAL-ViT) and an attention selection module (ASM). First, FAL-ViT adopts a two-stage framework that identifies crucial regions within images and enhances features by strategically reusing parameters. Second, the ASM locates important target regions using the inherent attention scores of the MSA and, through position mapping, extracts finer low-level features that provide more comprehensive information. Extensive experiments on public datasets demonstrate that FAL-ViT outperforms competing methods, confirming the effectiveness of the proposed approach. The source code is available at https://github.com/Yueting-Huang/FAL-ViT.
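The idea of locating a target region from attention scores and mapping it back to image positions can be illustrated with a small sketch. This is not the authors' implementation (see the repository linked above for that); it is a minimal NumPy example under stated assumptions: the function name `locate_region`, its parameters, the head-averaging step, and the top-k bounding-box rule are all hypothetical choices made for illustration.

```python
import numpy as np

def locate_region(cls_attention, grid_size, patch_size, top_k):
    """Return a (top, left, bottom, right) pixel box around the top_k most-attended patches.

    cls_attention: array of shape (heads, num_patches) holding the attention
    from the class token to each patch token in one layer (assumed input).
    grid_size: number of patches per row of the (square) patch grid.
    patch_size: side length of one patch in pixels.
    """
    # Average over heads to obtain a single score per patch (an assumption).
    scores = cls_attention.mean(axis=0)
    # Indices of the top_k highest-scoring patches.
    top_idx = np.argsort(scores)[-top_k:]
    # Position mapping: flat patch index -> (row, col) on the patch grid.
    rows, cols = np.divmod(top_idx, grid_size)
    # Patch-grid coordinates -> pixel coordinates of the enclosing box.
    top, left = rows.min() * patch_size, cols.min() * patch_size
    bottom, right = (rows.max() + 1) * patch_size, (cols.max() + 1) * patch_size
    return int(top), int(left), int(bottom), int(right)

# Toy example: a 4x4 patch grid with 16-pixel patches and two attention heads.
attn = np.zeros((2, 16))
attn[:, 5] = 0.4   # patch at row 1, col 1
attn[:, 10] = 0.5  # patch at row 2, col 2
print(locate_region(attn, grid_size=4, patch_size=16, top_k=2))  # (16, 16, 48, 48)
```

In a two-stage setup of the kind the abstract describes, a box like this could be used to crop the input before a second pass; how FAL-ViT actually selects and reuses regions is defined in the paper, not by this sketch.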
| Original language | English |
|---|---|
| Pages (from-to) | 5993-6006 |
| Number of pages | 14 |
| Journal | IEEE Transactions on Circuits and Systems for Video Technology |
| Volume | 35 |
| Issue number | 6 |
| Early online date | 28 Jan 2025 |
| Publication status | Published - 2025 |
Bibliographical note
Publisher Copyright: © 1991-2012 IEEE.
Funding
This work was supported in part by the National Natural Science Foundation of China under Grant 62176027, in part by the Agency for Science, Technology and Research of Singapore under the Robotics Horizontal Technology Coordinating Office Project under Grant C221518005, in part by Hong Kong General Research Fund-Research Grant Council (GRF-RGC) GRF under Grant 11203820, in part by Chongqing Talent under Grant cstc2024ycjh-bgzxm0082, in part by the Joint Equipment Pre Research and Key Fund Project of the Ministry of Education under Grant 8091B012207, and in part by the Central University Operating Expenses under Grant 2024CDJGF-044.
Keywords
- deep learning
- Fine-grained visual classification
- vision Transformer
Title
An Attention-Locating Algorithm for Eliminating Background Effects in Fine-grained Visual Classification