Deep Decoupling Classification and Regression for Visual Tracking

Guang HAN, Ruiyu YANG, Hua GAO, Sam KWONG

Research output: Journal Publications › Journal Article (refereed) › peer-review


Classification and regression are two tasks that most Siamese-based trackers must handle. However, most existing trackers learn only a single feature embedding for both tasks, which makes it difficult to optimize the two simultaneously. To solve this problem, this article deeply decouples classification and regression in the model structure. Specifically, two feature extraction backbone networks divide the model into two branches, each extracting heterogeneous features suited to one of the two tasks. Inspired by the core idea of the transformer, information interaction and fusion between the branches are achieved through a cross-attention mechanism, which can fully exploit the deep information dependence between them. In addition, the concept of channel-level information interaction is proposed by changing how the vector groups in the attention module are generated. Experiments show that the double Siamese tracker (DST) designed in this article greatly improves the accuracy of classification and regression. DST runs at 60 frames per second (FPS) on a GPU, far above the real-time requirement.
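
The cross-attention fusion described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names, the feature shapes, the residual fusion, and the absence of learned projections are all assumptions made for illustration. In particular, reading "channel-level information interaction" as treating each channel (rather than each spatial position) as one attention token is only one plausible interpretation of the abstract.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(query_tokens, context_tokens):
    """Single-head cross-attention without learned projections (assumption).

    query_tokens:   (Nq, D) tokens from one branch
    context_tokens: (Nk, D) tokens from the other branch
    returns:        (Nq, D) query tokens fused with context information
    """
    d = query_tokens.shape[-1]
    scores = query_tokens @ context_tokens.T / np.sqrt(d)  # (Nq, Nk)
    weights = softmax(scores, axis=-1)                      # rows sum to 1
    return query_tokens + weights @ context_tokens          # residual fusion

def spatial_tokens(feat):
    # (C, H, W) -> (H*W, C): one token per spatial position.
    c, h, w = feat.shape
    return feat.reshape(c, h * w).T

def channel_tokens(feat):
    # (C, H, W) -> (C, H*W): one token per channel (assumed reading of
    # "channel-level" interaction).
    c, h, w = feat.shape
    return feat.reshape(c, h * w)

# Hypothetical feature maps from the two backbone branches.
rng = np.random.default_rng(0)
cls_feat = rng.standard_normal((8, 4, 4))  # classification branch
reg_feat = rng.standard_normal((8, 4, 4))  # regression branch

spatial_fused = cross_attention(spatial_tokens(cls_feat), spatial_tokens(reg_feat))
channel_fused = cross_attention(channel_tokens(cls_feat), channel_tokens(reg_feat))
print(spatial_fused.shape, channel_fused.shape)
```

In this sketch the only difference between the two variants is how the token ("vector") groups are formed from the feature map; the attention computation itself is unchanged.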

Original language: English
Pages (from-to): 1239-1251
Number of pages: 13
Journal: IEEE Transactions on Cognitive and Developmental Systems
Issue number: 3
Early online date: 30 Aug 2022
Publication status: Published - 1 Sept 2023
Externally published: Yes

Bibliographical note

Publisher Copyright:
© 2016 IEEE.


  • Attention mechanism
  • Siamese network
  • transformer
  • visual tracking

