Classification and regression are two tasks that most Siamese-based trackers must handle. However, most existing trackers learn only a single feature embedding for both tasks, which makes it difficult to optimize the two simultaneously. To solve this problem, this paper deeply decouples classification and regression in the model structure. Specifically, two feature extraction backbone networks divide the model into two branches, each extracting heterogeneous features suited to one of the two tasks. Inspired by the core idea of the Transformer, information interaction and fusion between the branches are achieved through a cross-attention mechanism, which can fully exploit the deep information dependence between them. In addition, the concept of channel-level information interaction is proposed by changing how the vector groups in the attention module are generated. Experiments show that the Double Siamese Tracker (DST) designed in this paper greatly improves the accuracy of both classification and regression. DST runs at 60 FPS (frames per second) on a GPU, far above the real-time requirement.
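The abstract does not specify how the cross-attention or the channel-level interaction is implemented, so the following is only an illustrative sketch under stated assumptions, not the DST architecture. It assumes standard scaled dot-product attention in which queries come from one branch and keys and values from the other, and it interprets "channel-level" as treating each channel (rather than each spatial position) as one token. All names, shapes, and the random projection matrices are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(query_tokens, context_tokens, w_q, w_k, w_v):
    """Scaled dot-product attention with queries from one branch
    and keys/values from the other branch."""
    q = query_tokens @ w_q      # (n_q, d)
    k = context_tokens @ w_k    # (n_c, d)
    v = context_tokens @ w_v    # (n_c, d)
    weights = softmax(q @ k.T / np.sqrt(q.shape[-1]))  # (n_q, n_c)
    return weights @ v, weights

rng = np.random.default_rng(0)
C, H, W = 8, 4, 4  # hypothetical feature-map size

# Stand-ins for the outputs of the two backbones (assumed shapes).
cls_feat = rng.standard_normal((C, H, W))  # classification branch
reg_feat = rng.standard_normal((C, H, W))  # regression branch

# Spatial tokens: each of the H*W positions is a vector of length C.
cls_spatial = cls_feat.reshape(C, H * W).T  # (H*W, C)
reg_spatial = reg_feat.reshape(C, H * W).T  # (H*W, C)
w_q, w_k, w_v = (rng.standard_normal((C, C)) for _ in range(3))
spatial_out, spatial_w = cross_attention(cls_spatial, reg_spatial, w_q, w_k, w_v)

# Channel tokens (assumed reading of "channel-level"): each of the
# C channels is a vector of length H*W.
cls_channel = cls_feat.reshape(C, H * W)  # (C, H*W)
reg_channel = reg_feat.reshape(C, H * W)  # (C, H*W)
u_q, u_k, u_v = (rng.standard_normal((H * W, H * W)) for _ in range(3))
channel_out, channel_w = cross_attention(cls_channel, reg_channel, u_q, u_k, u_v)

print(spatial_w.shape, channel_w.shape)
```

In this sketch the only difference between the two variants is how the token vectors are formed from the feature map: the spatial variant yields an attention map over positions, while the channel variant yields one over channels.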
|Journal||IEEE Transactions on Cognitive and Developmental Systems|
|Publication status||E-pub ahead of print - 30 Aug 2022|
- Visual tracking
- Transformer
- Siamese network
- Attention mechanism