Abstract
Fine-grained visual categorization is challenging owing to high intra-class and low inter-class variances. Classical approaches rely on pre-trained models or on many fine annotations. In this paper, we observe that spatial and frequency information provide distinct views of an image, and we propose a new spatial-frequency feature fusion (SFFF) perspective to address this problem. Specifically, we design a heterogeneous feature extraction loss function, construct a global and local fusion SFFF network, and propose an importance-sparsity selection strategy. For feature extraction, we focus on a frequency-domain feature learning network that extracts fine-grained features and complements the spatial features. For feature selection, we propose importance ranking and sparse regularity to constrain the spatial-frequency features. For feature fusion, we design a spatial-frequency loss and an inter-layer switching strategy to achieve local-global collaboration. Comparative experiments were performed on popular fine-grained datasets and on a classic dataset: CUB200-2011, Stanford Cars, Stanford Dogs, FGVC-Aircraft, and CIFAR100. The effectiveness and performance of SFFF are confirmed by comparisons with more than 40 state-of-the-art fine-grained categorization methods. Ablation studies and visualizations are provided to facilitate an understanding of our approach.
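The idea that spatial and frequency information give distinct views of one image can be illustrated with a toy sketch. This is not the SFFF network: the learned frequency-domain network is replaced here by a fixed 2D FFT log-magnitude spectrum, and the importance-sparsity selection by a plain top-k magnitude mask. The function names `frequency_view`, `select_important`, and `fuse` are hypothetical and chosen only for illustration.

```python
import numpy as np


def frequency_view(image):
    """Return the log-magnitude 2D FFT spectrum with the DC term centered.

    Assumption: a fixed FFT stands in for a learned frequency-domain network.
    """
    spectrum = np.fft.fftshift(np.fft.fft2(image))
    return np.log1p(np.abs(spectrum))


def select_important(features, keep):
    """Keep the `keep` largest-magnitude entries and zero out the rest.

    Assumption: top-k by magnitude stands in for importance ranking
    combined with a sparsity constraint.
    """
    flat = features.ravel()
    order = np.argsort(np.abs(flat))[::-1]  # indices, largest magnitude first
    mask = np.zeros(flat.shape, dtype=bool)
    mask[order[:keep]] = True
    return np.where(mask, flat, 0.0).reshape(features.shape)


def fuse(spatial, frequency, keep):
    """Concatenate both views into one vector, then sparsify it."""
    combined = np.concatenate([spatial.ravel(), frequency.ravel()])
    return select_important(combined, keep)


# Toy usage on a random 8x8 "image".
rng = np.random.default_rng(0)
image = rng.random((8, 8))
fused = fuse(image, frequency_view(image), keep=16)
```

Here `fused` is a 128-element vector (64 spatial plus 64 frequency values) in which at most 16 entries are nonzero. In a trained model, both the features and the selection would be learned rather than fixed.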
Original language | English |
---|---|
Pages (from-to) | 2798-2812 |
Number of pages | 15 |
Journal | IEEE Transactions on Circuits and Systems for Video Technology |
Volume | 33 |
Issue number | 6 |
Early online date | 8 Dec 2022 |
Publication status | Published - Jun 2023 |
Externally published | Yes |
Bibliographical note
Publisher Copyright: © 1991-2012 IEEE.
Keywords
- deep fusion
- fine-grained recognition
- frequency domain learning
- training from scratch
- weakly supervised learning