Deep Decoupling Classification and Regression for Visual Tracking

Guang HAN, Ruiyu YANG, Hua GAO, Sam KWONG

Research output: Journal Publications, Journal Article (refereed), peer-review


Classification and regression are two tasks that most Siamese-based trackers need to handle. However, most existing trackers learn only one feature embedding for both tasks, which makes it difficult to optimize the two simultaneously. To solve this problem, this paper aims to deeply decouple classification and regression in the model structure. Specifically, two feature extraction backbone networks divide the model into two branches, each extracting the heterogeneous features suited to its own task. Inspired by the core idea of the Transformer, information interaction and fusion between the branches are achieved by a cross-attention mechanism, which can fully exploit the deep information dependence between them. In addition, the concept of channel-level information interaction is proposed by changing the way the vector groups are generated in the attention module. The experiments show that the Double Siamese Tracker (DST) designed in this paper greatly improves the accuracy of classification and regression. DST runs at 60 FPS (frames per second) on a GPU, well above the real-time requirement.
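As a rough illustration of the cross-attention idea described in the abstract, the sketch below lets one branch's features query the other branch's features using plain scaled dot-product attention. It is not the authors' DST implementation. The function names, the toy feature sizes and the omission of learned projection weights are all assumptions made for illustration. In particular, `channel_cross_attention` encodes only one assumed reading of "channel-level information interaction", namely that each channel, rather than each spatial position, is treated as a token. Plain Python lists are used to keep the example dependency-free.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    """Swap rows and columns of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*m)]


def softmax_rows(m):
    """Apply a numerically stable softmax to each row."""
    out = []
    for row in m:
        peak = max(row)
        exps = [math.exp(v - peak) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


def cross_attention(queries_from, context_from):
    """Scaled dot-product attention between two branches (hypothetical helper).

    Queries come from one branch; keys and values come from the other.
    Both inputs are (tokens, dim) matrices. Learned projections are omitted.
    """
    dim = len(queries_from[0])
    scores = matmul(queries_from, transpose(context_from))
    scaled = [[s / math.sqrt(dim) for s in row] for row in scores]
    weights = softmax_rows(scaled)
    return matmul(weights, context_from), weights


def channel_cross_attention(queries_from, context_from):
    """Assumed channel-level variant: each channel acts as one token."""
    fused_t, weights = cross_attention(transpose(queries_from), transpose(context_from))
    return transpose(fused_t), weights


# Toy features: 6 spatial positions, 4 channels per branch (arbitrary sizes).
random.seed(0)
positions, channels = 6, 4
cls_feat = [[random.gauss(0, 1) for _ in range(channels)] for _ in range(positions)]
reg_feat = [[random.gauss(0, 1) for _ in range(channels)] for _ in range(positions)]

# Position-level interaction: the attention map is positions x positions.
spatial_out, spatial_w = cross_attention(cls_feat, reg_feat)
# Channel-level interaction: the attention map is channels x channels.
channel_out, channel_w = channel_cross_attention(cls_feat, reg_feat)
```

Under these assumptions, both variants return fused features with the same shape as the querying branch. Only the attention map differs: it relates spatial positions in the first case and channels in the second.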

Original language: English
Journal: IEEE Transactions on Cognitive and Developmental Systems
Publication status: E-pub ahead of print - 30 Aug 2022
Externally published: Yes


Keywords

  • Visual tracking
  • Transformer
  • Siamese network
  • Attention mechanism

