Deep Fusion Module for Video Action Recognition

Yunyao LI, Zihao ZHENG, Mingliang ZHOU, Guangchao YANG, Xuekai WEI, Huayan PU, Jun LUO

Research output: Journal Publications › Journal Article (refereed) › peer-review


In video action recognition, effective spatiotemporal modeling is crucial. However, traditional two-stream methods struggle to integrate spatial information from RGB images with temporal information from optical flow, and they lack long-range temporal modeling. To address these limitations, we propose the Deep Fusion Module (DFM), which focuses on the deep fusion of spatial and temporal information and consists of two components. First, we propose an Attention Fusion Module (AFM) to effectively fuse the shallow features obtained from a two-stream network, thereby facilitating the integration of spatial and temporal information. Next, we incorporate a SpatioTemporal Module (STM), comprising a ConvGRU and a 1×1 convolution, to model long-range temporal dependency and fuse spatial-temporal features. Experiments on the UCF101 dataset show that our method achieves 96.5% accuracy, outperforming baseline two-stream models by 0.3%.
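The abstract does not give the internals of the AFM or the STM, so the following is only a minimal NumPy sketch of the general idea, not the authors' implementation. It assumes a hypothetical gated fusion for the AFM (a sigmoid gate computed by a 1×1 convolution over the concatenated RGB and flow features) and a standard ConvGRU cell followed by a 1×1 convolution for the STM. All function names, parameter names, and shapes are illustrative assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def conv2d_same(x, w):
    """Stride-1, zero-padded 2D cross-correlation.
    x: (C_in, H, W); w: (C_out, C_in, k, k) with odd k. Returns (C_out, H, W)."""
    c_out, _, k, _ = w.shape
    pad = k // 2
    xp = np.pad(x, ((0, 0), (pad, pad), (pad, pad)))
    h, wd = x.shape[1:]
    out = np.zeros((c_out, h, wd))
    for i in range(k):
        for j in range(k):
            # Mix channels for one kernel offset, then accumulate.
            out += np.einsum("oc,chw->ohw", w[:, :, i, j], xp[:, i:i + h, j:j + wd])
    return out

def attention_fuse(rgb, flow, w_att):
    """Hypothetical AFM stand-in: a per-pixel, per-channel gate blends the streams."""
    gate = sigmoid(conv2d_same(np.concatenate([rgb, flow], axis=0), w_att))
    return gate * rgb + (1.0 - gate) * flow

def convgru_step(x, h, p):
    """One standard ConvGRU update; the paper's exact variant is not specified."""
    xh = np.concatenate([x, h], axis=0)
    z = sigmoid(conv2d_same(xh, p["w_z"]))  # update gate
    r = sigmoid(conv2d_same(xh, p["w_r"]))  # reset gate
    cand = np.tanh(conv2d_same(np.concatenate([x, r * h], axis=0), p["w_h"]))
    return (1.0 - z) * h + z * cand

def init_params(c_feat, c_hidden, c_out, seed=0):
    """Random illustrative weights; a real model would learn these."""
    rng = np.random.default_rng(seed)
    w = lambda o, i, k: 0.1 * rng.standard_normal((o, i, k, k))
    return {
        "w_att": w(c_feat, 2 * c_feat, 1),
        "w_z": w(c_hidden, c_feat + c_hidden, 3),
        "w_r": w(c_hidden, c_feat + c_hidden, 3),
        "w_h": w(c_hidden, c_feat + c_hidden, 3),
        "w_out": w(c_out, c_hidden, 1),
    }

def deep_fusion_sketch(rgb_seq, flow_seq, p):
    """rgb_seq, flow_seq: (T, C, H, W) shallow features from the two streams."""
    c_hidden = p["w_z"].shape[0]
    h = np.zeros((c_hidden,) + rgb_seq.shape[2:])
    for rgb, flow in zip(rgb_seq, flow_seq):
        fused = attention_fuse(rgb, flow, p["w_att"])  # fuse the streams per frame
        h = convgru_step(fused, h, p)                  # carry state across frames
    return conv2d_same(h, p["w_out"])                  # final 1x1 convolution
```

In this sketch the recurrent hidden state is what carries information across frames, which is the role the abstract assigns to the ConvGRU; a classifier head would follow the final 1×1 convolution.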

Original language: English
Article number: 2450247
Journal: Journal of Circuits, Systems and Computers
Publication status: E-pub ahead of print - 3 Apr 2024
Externally published: Yes

Bibliographical note

Publisher Copyright:
© World Scientific Publishing Company.


  • spatiotemporal modeling
  • two-stream
  • Video action recognition

