TY - JOUR
T1 - An Attention-Locating Algorithm for Eliminating Background Effects in Fine-grained Visual Classification
AU - HUANG, Yueting
AU - HECHEN, Zhenzhe
AU - ZHOU, Mingliang
AU - LI, Zhengguo
AU - KWONG, Sam
N1 - Publisher Copyright:
© 2025 IEEE.
PY - 2025/1/28
Y1 - 2025/1/28
N2 - Fine-grained visual classification (FGVC) is a challenging task characterized by high interclass similarity and intraclass diversity, with broad application prospects. Recently, several methods have adopted the vision Transformer (ViT) for FGVC tasks, since the data specificity of its multihead self-attention (MSA) mechanism is beneficial for extracting discriminative feature representations. However, these works focus on integrating high-level feature dependencies, leaving the model easily disturbed by low-level background information. To address this issue, we propose a fine-grained attention-locating vision Transformer (FAL-ViT) and an attention selection module (ASM). First, FAL-ViT employs a two-stage framework to effectively identify crucial regions within images and enhances features by strategically reusing parameters. Second, the ASM accurately locates important target regions using the native attention scores of the MSA and, through position mapping, extracts finer low-level features to provide more comprehensive information. Extensive experiments on public datasets demonstrate that FAL-ViT outperforms other methods, confirming the effectiveness of our approach. The source code is available at https://github.com/Yueting-Huang/FAL-ViT.
AB - Fine-grained visual classification (FGVC) is a challenging task characterized by high interclass similarity and intraclass diversity, with broad application prospects. Recently, several methods have adopted the vision Transformer (ViT) for FGVC tasks, since the data specificity of its multihead self-attention (MSA) mechanism is beneficial for extracting discriminative feature representations. However, these works focus on integrating high-level feature dependencies, leaving the model easily disturbed by low-level background information. To address this issue, we propose a fine-grained attention-locating vision Transformer (FAL-ViT) and an attention selection module (ASM). First, FAL-ViT employs a two-stage framework to effectively identify crucial regions within images and enhances features by strategically reusing parameters. Second, the ASM accurately locates important target regions using the native attention scores of the MSA and, through position mapping, extracts finer low-level features to provide more comprehensive information. Extensive experiments on public datasets demonstrate that FAL-ViT outperforms other methods, confirming the effectiveness of our approach. The source code is available at https://github.com/Yueting-Huang/FAL-ViT.
KW - deep learning
KW - Fine-grained visual classification
KW - vision Transformer
UR - http://www.scopus.com/inward/record.url?scp=85216719751&partnerID=8YFLogxK
U2 - 10.1109/TCSVT.2025.3535818
DO - 10.1109/TCSVT.2025.3535818
M3 - Journal Article (refereed)
SN - 1051-8215
JO - IEEE Transactions on Circuits and Systems for Video Technology
JF - IEEE Transactions on Circuits and Systems for Video Technology
ER -