TY - JOUR
T1 - CoSTA: Co-training spatial-temporal attention for blind video quality assessment
AU - XING, Fengchuang
AU - WANG, Yuan-Gen
AU - TANG, Weixuan
AU - ZHU, Guopu
AU - KWONG, Sam
N1 - Publisher Copyright:
© 2024 Elsevier Ltd
PY - 2024/12/1
Y1 - 2024/12/1
AB - The self-attention-based Transformer has achieved great success in many computer vision tasks. However, its application to blind video quality assessment (VQA) remains far from comprehensive. Evaluating the quality of in-the-wild videos is challenging due to the absence of a pristine reference and unknown shooting distortions. This paper presents a Co-trained Space-Time Attention network for the blind VQA problem, termed CoSTA. Specifically, we first build CoSTA by alternately concatenating divided space-time attention modules. Then, to facilitate the training of CoSTA, we design a vectorized regression loss that encodes the mean opinion score (MOS) into a probability vector and embeds a special token as the learnable variable of the MOS, which better fits the human rating process. Finally, to alleviate the data-hungry nature of the Transformer, we propose to co-train the spatial and temporal attention weights using both images and videos. Extensive experiments are conducted on de facto in-the-wild video datasets, including LIVE-Qualcomm, LIVE-VQC, KoNViD-1k, YouTube-UGC, LSVQ, LSVQ-1080p, and DVL2021. Experimental results demonstrate the superiority of the proposed CoSTA over state-of-the-art methods. The source code is publicly available at https://github.com/GZHU-DVL/CoSTA
KW - Co-training
KW - In-the-wild videos
KW - Self-attention
KW - Transformer
KW - Video quality assessment
UR - http://www.scopus.com/inward/record.url?scp=85197618081&partnerID=8YFLogxK
U2 - 10.1016/j.eswa.2024.124651
DO - 10.1016/j.eswa.2024.124651
M3 - Journal Article (refereed)
SN - 0957-4174
VL - 255
JO - Expert Systems with Applications
JF - Expert Systems with Applications
M1 - 124651
ER -