TY - JOUR
T1 - Deep Decoupling Classification and Regression for Visual Tracking
AU - HAN, Guang
AU - YANG, Ruiyu
AU - GAO, Hua
AU - KWONG, Sam
N1 - Publisher Copyright: IEEE
PY - 2022/8/30
Y1 - 2022/8/30
AB - Classification and regression are two tasks that most Siamese-based trackers need to handle. However, most existing trackers learn only one feature embedding for both tasks, which makes it difficult to optimize the two simultaneously. To solve this problem, this paper aims to deeply decouple classification and regression in the model structure. Specifically, two feature extraction backbone networks divide the model into two branches that extract heterogeneous features suited to the two tasks, respectively. Inspired by the core idea of the Transformer, information interaction and fusion between multiple branches are achieved by a cross-attention mechanism, which can fully exploit the deep information dependence between multiple branches. In addition, the concept of channel-level information interaction is proposed by changing how the vector groups in the attention module are generated. Experiments show that the Double Siamese Tracker (DST) designed in this paper greatly improves the accuracy of classification and regression. DST runs at 60 FPS (frames per second) on a GPU, far above the real-time requirement.
KW - Visual tracking
KW - Transformer
KW - Siamese network
KW - Attention mechanism
UR - http://www.scopus.com/inward/record.url?scp=85137878454&partnerID=8YFLogxK
U2 - 10.1109/TCDS.2022.3202802
DO - 10.1109/TCDS.2022.3202802
M3 - Journal Article (refereed)
AN - SCOPUS:85137878454
SN - 2379-8920
JO - IEEE Transactions on Cognitive and Developmental Systems
JF - IEEE Transactions on Cognitive and Developmental Systems
ER -