Cross-Frame Transformer-Based Spatio-Temporal Video Super-Resolution

Wenhui ZHANG, Mingliang ZHOU, Cheng JI, Xiubao SUI*, Junqi BAI

*Corresponding author for this work

Research output: Journal Publications › Journal Article (refereed) › peer-review

13 Citations (Scopus)


In this paper, we explore the spatio-temporal video super-resolution task, which aims to generate a high-resolution, high-frame-rate video from an existing video with low resolution and low frame rate. First, we propose an end-to-end spatio-temporal video super-resolution network composed chiefly of cross-frame transformers instead of traditional convolutions. Specifically, the cross-frame transformer module divides the input feature sequence into query, key, and value matrices, and then obtains the maximum similarity and similarity coefficient matrices between the neighboring and current feature maps through self-attention operations. Next, we propose a multi-level residual reconstruction module, which makes full use of the maximum similarity and similarity coefficient matrices obtained by the cross-frame transformer to reconstruct the high-resolution, high-frame-rate results from coarse to fine. Qualitative and quantitative evaluation results show that our method offers better performance and requires fewer trainable parameters than the existing two-stage network.
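The matching step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes that queries come from the current frame, that keys and values come from a neighboring frame, and that similarity is cosine similarity between per-position feature vectors. The function names `normalize` and `cross_frame_match` and the toy feature values are hypothetical.

```python
import math

def normalize(vec):
    # Scale a feature vector to unit length (guarding against a zero vector).
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]

def cross_frame_match(current, neighbor):
    """For each position in the current frame, find the best match in the neighbor.

    current, neighbor: lists of per-position feature vectors.
    Assumption: query = current features, key = value = neighbor features.
    """
    queries = [normalize(f) for f in current]
    keys = [normalize(f) for f in neighbor]
    values = neighbor
    max_idx, max_sim, gathered = [], [], []
    for q in queries:
        # Cosine similarity between this query and every key position.
        sims = [sum(a * b for a, b in zip(q, k)) for k in keys]
        best = max(range(len(sims)), key=sims.__getitem__)
        max_idx.append(best)            # index of the most similar position
        max_sim.append(sims[best])      # its similarity coefficient
        gathered.append(values[best])   # neighbor feature selected for that position
    return max_idx, max_sim, gathered

# Toy features: three positions, two channels each.
current = [[1.0, 0.0], [0.0, 2.0], [3.0, 3.0]]
neighbor = [[0.0, 1.0], [2.0, 2.0], [5.0, 0.0]]
max_idx, max_sim, gathered = cross_frame_match(current, neighbor)
```

In this reading, `max_idx` plays the role of a maximum similarity map and `max_sim` the role of a similarity coefficient map; a reconstruction stage could then weight the gathered neighbor features by those coefficients.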

Original language: English
Pages (from-to): 359-369
Number of pages: 11
Journal: IEEE Transactions on Broadcasting
Issue number: 2
Early online date: 7 Feb 2022
Publication status: Published - Jun 2022
Externally published: Yes

Bibliographical note

Publisher Copyright:
© 1963-2012 IEEE.


Keywords

  • cross-frame transformer module
  • multi-level residual reconstruction
  • self-attention
  • spatio-temporal video super-resolution
  • Transformer network
  • video frame interpolation

