Abstract
Conditional Coding-based Learned Video Compression (CC-LVC) has become an important paradigm in learned video compression because it can effectively exploit spatial-temporal redundancies within a huge context space. However, existing CC-LVC methods can neither accurately model motion information nor efficiently mine contextual correlations in complex regions with non-rigid motions and non-linear deformations. To address these problems, this paper proposes an efficient CC-LVC method that mines spatial-temporal dependencies across multiple motion domains and receptive domains to improve video coding efficiency. To accurately model complex motions and generate precise temporal contexts, a Multi-domain Motion modeling Network (MMNet) is proposed to capture robust motion information from both the spatial and frequency domains. Moreover, a multi-domain context refinement module is developed to discriminatively highlight frequency-domain temporal contexts and adaptively fuse multi-domain temporal contexts, which effectively mitigates inaccuracies in temporal contexts caused by motion errors. To compress video signals efficiently, a context codec based on a Multi-scale Long Short-range Decorrelation Module (MLSDM) is proposed, in which the MLSDM is designed to learn long- and short-range spatial-temporal dependencies and channel-wise correlations across varying receptive domains. Extensive experimental results show that the proposed method significantly outperforms VTM 17.0 and other state-of-the-art learned video compression methods in terms of both PSNR and MS-SSIM.
| Original language | English |
|---|---|
| Pages (from-to) | 808-820 |
| Number of pages | 13 |
| Journal | IEEE Transactions on Broadcasting |
| Volume | 71 |
| Issue number | 3 |
| Early online date | 23 Jul 2025 |
| Publication status | Published - Sept 2025 |
Bibliographical note
Publisher Copyright: © 1963-2012 IEEE.
Funding
This work was supported by the National Natural Science Foundation of China under Grant 62322116.
Keywords
- Learned video compression
- conditional coding
- frequency decomposition
- multi-scale long short-range decorrelation
- visual state space block