Abstract
Classification and regression are two tasks that most Siamese-based trackers need to handle. However, most existing trackers learn only a single feature embedding for both tasks, making it difficult to optimize the two simultaneously. To solve this problem, this article deeply decouples classification and regression in the model structure. Specifically, two feature-extraction backbone networks split the model into two branches that extract heterogeneous features suited to each task. Inspired by the core idea of the transformer, information interaction and fusion between the branches are achieved through a cross-attention mechanism, which fully exploits the deep information dependence between them. In addition, channel-level information interaction is introduced by changing how the vector groups in the attention module are generated. Experiments show that the double Siamese tracker (DST) designed in this article greatly improves the accuracy of both classification and regression. DST runs at 60 frames per second (FPS) on a GPU, well above the real-time requirement.
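The abstract describes fusing the two branch features with cross-attention. Below is a minimal PyTorch sketch of such a cross-attention fusion step, assuming flattened feature tokens from the two backbones; the module name, dimensions, residual/normalization choices, and the use of `nn.MultiheadAttention` are illustrative assumptions, not the paper's exact design (which additionally modifies how the attention vector groups are generated to obtain channel-level interaction).

```python
import torch
import torch.nn as nn


class CrossAttentionFusion(nn.Module):
    """Sketch of cross-attention between two branch feature maps.

    Not the paper's exact module: layer sizes, normalization, and the
    channel-level vector-group generation are assumptions for illustration.
    """

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x_cls: torch.Tensor, x_reg: torch.Tensor) -> torch.Tensor:
        # x_cls, x_reg: (B, HW, C) token sequences flattened from the two branches.
        # Queries come from the classification branch; keys/values from the
        # regression branch, so one branch attends to the other's features.
        fused, _ = self.attn(query=x_cls, key=x_reg, value=x_reg)
        return self.norm(x_cls + fused)  # residual connection, then normalization


if __name__ == "__main__":
    b, hw, c = 2, 16 * 16, 256            # hypothetical spatial size and channel count
    cls_feat = torch.randn(b, hw, c)      # features from the classification backbone
    reg_feat = torch.randn(b, hw, c)      # features from the regression backbone
    out = CrossAttentionFusion(dim=c)(cls_feat, reg_feat)
    print(out.shape)                      # torch.Size([2, 256, 256])
```

A symmetric module (queries from the regression branch, keys/values from the classification branch) would give bidirectional interaction between the two branches.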
Original language | English |
---|---|
Pages (from-to) | 1239-1251 |
Number of pages | 13 |
Journal | IEEE Transactions on Cognitive and Developmental Systems |
Volume | 15 |
Issue number | 3 |
Early online date | 30 Aug 2022 |
DOIs | |
Publication status | Published - 1 Sept 2023 |
Externally published | Yes |
Bibliographical note
Publisher Copyright: © 2016 IEEE.
Funding
This work was supported in part by the Natural Science Foundation of China (NSFC) under Grant 61871445 and Grant 61302156, and in part by the Key Research and Development Foundation Project of Jiangsu Province under Grant BE2016001-4.
Keywords
- Attention mechanism
- Siamese network
- transformer
- visual tracking