Mining Temporal Redundancy Using Long Short-Term Motion Aggregation and Global-Local Decorrelation for Learned Video Compression

  • Feng YUAN
  • Zhaoqing PAN*
  • Jianjun LEI
  • Bo PENG
  • Haoran XIE
  • Fu Lee WANG
  • Sam KWONG

*Corresponding author for this work

Research output: Journal Publications › Journal Article (refereed) › peer-review

Abstract

The conditional coding paradigm is widely used in learned video compression, as it shows superior performance in capturing redundancies within a large context space. However, existing Conditional coding-based Learned Video Compression (C-LVC) methods ignore that the predicted motion vectors usually contain large uncertainty due to complex motions, occlusions, etc., which consequently decreases the accuracy of the generated temporal contexts. In addition, existing C-LVC methods have a weak ability to mine the diverse dependencies within the context space, which are closely related to coding efficiency. To address these issues, this paper proposes an efficient temporal redundancy mining method to improve the coding efficiency of C-LVC. To generate accurate temporal contexts, a Long Short-Term Motion Aggregation (LSTMA) model is proposed, in which an LSTMA-based motion estimation module is developed to capture both the current and the aggregated long short-term motion information, thereby reducing the uncertainty of the predicted motion vectors. Based on this dual motion information, an LSTMA-based temporal context mining module is developed to exploit the aggregated long short-term motion information and increase the accuracy of the generated temporal contexts. To fully eliminate the spatial-temporal redundancies in a video, a Global-Local Information Decorrelation Module (GLIDM)-based context codec is proposed, in which the GLIDM is designed based on the visual state space block (namely, vmamba), the residual block, and the squeeze-and-excitation block to effectively capture long-range and short-range spatial-temporal dependencies as well as channel-wise dependencies. Experimental results demonstrate that the proposed method effectively improves the coding performance of C-LVC and outperforms other state-of-the-art LVC methods.
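To make the global-local decorrelation idea concrete, the following is a minimal, hypothetical PyTorch sketch of a block that combines a global branch (a stand-in for the visual state space / vmamba block), a local residual branch, and squeeze-and-excitation channel gating. The class names, the 1x1-convolution placeholder for the vmamba block, and the fusion step are illustrative assumptions, not the paper's actual GLIDM design.

# Hypothetical sketch of a global-local decorrelation block (assumes PyTorch).
import torch
import torch.nn as nn

class SqueezeExcitation(nn.Module):
    """Channel attention (squeeze-and-excitation) over feature maps."""
    def __init__(self, channels, reduction=4):
        super().__init__()
        self.fc = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),                        # squeeze: global average pool
            nn.Conv2d(channels, channels // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1),
            nn.Sigmoid(),                                   # excitation: per-channel gates
        )

    def forward(self, x):
        return x * self.fc(x)

class ResidualBlock(nn.Module):
    """Local (short-range) spatial modeling with a residual connection."""
    def __init__(self, channels):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1),
        )

    def forward(self, x):
        return x + self.body(x)

class GlobalLocalDecorrelation(nn.Module):
    """Illustrative block: a global branch (placeholder for a visual state
    space / vmamba block), a local residual branch, and SE channel gating.
    The actual GLIDM architecture in the paper may differ."""
    def __init__(self, channels, global_block=None):
        super().__init__()
        # A real implementation would plug a VSS (vmamba) block in here; a
        # 1x1 convolution is only a placeholder so the sketch runs end to end.
        self.global_branch = global_block or nn.Conv2d(channels, channels, 1)
        self.local_branch = ResidualBlock(channels)
        self.channel_attn = SqueezeExcitation(channels)
        self.fuse = nn.Conv2d(2 * channels, channels, 1)

    def forward(self, x):
        g = self.global_branch(x)   # long-range spatial-temporal dependencies
        l = self.local_branch(x)    # short-range spatial-temporal dependencies
        y = self.fuse(torch.cat([g, l], dim=1))
        return self.channel_attn(y) # channel-wise dependencies

if __name__ == "__main__":
    x = torch.randn(1, 64, 32, 32)                # e.g. a latent/context feature map
    print(GlobalLocalDecorrelation(64)(x).shape)  # torch.Size([1, 64, 32, 32])

The split into parallel global and local branches mirrors the abstract's stated goal of capturing long-range and short-range dependencies before the channel-wise stage; how the branches are actually fused in GLIDM is not specified here.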
Original language: English
Number of pages: 15
Journal: IEEE Transactions on Circuits and Systems for Video Technology
Early online date: 27 Nov 2025
DOIs
Publication status: Published - 2025

Bibliographical note

Publisher Copyright:
© 1991-2012 IEEE.

Funding

This work was supported by the National Natural Science Foundation of China under Grant 62322116.

Keywords

  • Learned video compression
  • conditional coding
  • long short-term motion aggregation
  • global-local decorrelation
  • visual state space model
  • vmamba
