CoSTA: Co-training spatial-temporal attention for blind video quality assessment

Fengchuang XING, Yuan-Gen WANG*, Weixuan TANG, Guopu ZHU, Sam KWONG

*Corresponding author for this work

Research output: Journal Publications › Journal Article (refereed) › peer-review


Self-attention-based Transformers have achieved great success in many computer vision tasks. However, their application to blind video quality assessment (VQA) has not yet been fully explored. Evaluating the quality of in-the-wild videos is challenging due to the absence of a pristine reference and the presence of shooting distortions. This paper presents a Co-trained Space-Time Attention network for the blind VQA problem, termed CoSTA. Specifically, we first build CoSTA by alternately concatenating divided space-time attention blocks. Then, to facilitate the training of CoSTA, we design a vectorized regression loss that encodes the mean opinion score (MOS) into a probability vector and embeds a special token as a learnable variable of MOS, leading to a better fit of the human rating process. Finally, to address the data-hungry nature of Transformers, we propose to co-train the spatial and temporal attention weights using both images and videos. Extensive experiments are conducted on the de facto in-the-wild video datasets, including LIVE-Qualcomm, LIVE-VQC, KoNViD-1k, YouTube-UGC, LSVQ, LSVQ-1080p, and DVL2021. Experimental results demonstrate the superiority of the proposed CoSTA over the state of the art. The source code is publicly available at
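
The abstract describes the architecture and training recipe only at a high level. For concreteness, below is a minimal PyTorch sketch of one divided space-time attention block of the kind the abstract describes (temporal attention across frames, followed by spatial attention across patches, stacked alternately). The class name, token layout, and dimensions are illustrative assumptions, not the authors' released implementation.

```python
import torch
import torch.nn as nn


class DividedSpaceTimeBlock(nn.Module):
    """One divided space-time attention block (illustrative sketch only).

    Temporal self-attention runs across frames at each spatial location,
    then spatial self-attention runs across patches within each frame,
    followed by a standard MLP. Dimensions are placeholders.
    """

    def __init__(self, dim=768, heads=12):
        super().__init__()
        self.norm_t = nn.LayerNorm(dim)
        self.attn_t = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm_s = nn.LayerNorm(dim)
        self.attn_s = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm_f = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )

    def forward(self, x):
        # x: (batch, frames, patches, dim) -- patch tokens without the class token.
        b, t, p, d = x.shape

        # Temporal attention: each spatial position attends over the T frames.
        xt = x.permute(0, 2, 1, 3).reshape(b * p, t, d)
        xt = self.norm_t(xt)
        xt, _ = self.attn_t(xt, xt, xt)
        x = x + xt.reshape(b, p, t, d).permute(0, 2, 1, 3)

        # Spatial attention: each frame's patches attend over one another.
        xs = x.reshape(b * t, p, d)
        xs = self.norm_s(xs)
        xs, _ = self.attn_s(xs, xs, xs)
        x = x + xs.reshape(b, t, p, d)

        # Position-wise feed-forward network with a residual connection.
        return x + self.mlp(self.norm_f(x))


# Toy usage: 2 clips, 8 frames, 196 patches, 768-dim tokens.
tokens = torch.randn(2, 8, 196, 768)
out = DividedSpaceTimeBlock()(tokens)
print(out.shape)  # torch.Size([2, 8, 196, 768])
```

Under this factorization the per-block attention cost drops from O((T·P)^2) for joint space-time attention to roughly O(T·P·(T+P)), and a single image can be treated as a T = 1 clip, which is one natural way the spatial attention weights could be co-trained on image data alongside videos.
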
Original language: English
Article number: 124651
Journal: Expert Systems with Applications
Early online date: 2 Jul 2024
Publication status: E-pub ahead of print - 2 Jul 2024

Bibliographical note

Publisher Copyright:
© 2024 Elsevier Ltd


Keywords

  • Co-training
  • In-the-wild videos
  • Self-attention
  • Transformer
  • Video quality assessment

