Combining CNN and transformers for full-reference and no-reference image quality assessment



Research output: Journal Publications › Journal Article (refereed) › peer-review

4 Citations (Scopus)


Most deep learning approaches to image quality assessment (IQA) regress a quality score from deep features extracted by convolutional neural networks (CNNs). However, existing methods usually neglect non-local information. Motivated by the recent success of transformers in modeling contextual information, we propose a hybrid framework that uses a vision transformer backbone to extract features and a CNN decoder for quality estimation. We propose a feature extraction scheme shared between the full-reference (FR) and no-reference (NR) settings, and devise a two-branch structured attentive quality predictor for quality prediction. Evaluation experiments on various IQA datasets, including LIVE, CSIQ, TID2013, LIVE-Challenge, KADID-10k, and KonIQ-10k, show that the proposed models achieve outstanding performance in both FR and NR settings.
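The abstract does not give implementation details, but the overall design it describes (a shared transformer backbone for FR and NR feature extraction, followed by a CNN decoder that regresses a quality score) can be sketched as follows. This is a minimal, hypothetical PyTorch sketch, not the paper's actual architecture: all layer sizes, the difference-based FR fusion, and the class and method names are illustrative assumptions.

```python
import torch
import torch.nn as nn


class HybridIQA(nn.Module):
    """Hypothetical sketch: transformer backbone + CNN decoder for IQA.

    All dimensions and the FR fusion scheme are illustrative assumptions;
    the abstract does not specify the actual architecture.
    """

    def __init__(self, patch=16, dim=64, heads=4, layers=2):
        super().__init__()
        # Patch embedding: split the image into non-overlapping patches.
        self.embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        enc = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                         batch_first=True)
        # Transformer feature extractor, shared by FR and NR settings.
        self.encoder = nn.TransformerEncoder(enc, num_layers=layers)
        # CNN decoder that regresses a scalar quality score.
        self.decoder = nn.Sequential(
            nn.Conv2d(dim, dim, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(dim, 1))

    def features(self, x):
        t = self.embed(x)                               # (B, dim, H/p, W/p)
        b, c, h, w = t.shape
        t = self.encoder(t.flatten(2).transpose(1, 2))  # (B, HW/p^2, dim)
        return t.transpose(1, 2).reshape(b, c, h, w)

    def forward(self, dist, ref=None):
        f = self.features(dist)
        if ref is not None:
            # FR mode: fuse distorted and reference features; a plain
            # difference is an assumption, not the paper's method.
            f = f - self.features(ref)
        return self.decoder(f).squeeze(-1)              # (B,) quality scores
```

The same backbone serves both settings: in FR mode a reference image is passed alongside the distorted one, while in NR mode only the distorted image is used.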

Original language: English
Article number: 126437
Early online date: 21 Jun 2023
Publication status: Published - 7 Sept 2023
Externally published: Yes

Bibliographical note

Funding Information:
This work is supported by the Key Project of Science and Technology Innovation 2030 of the Ministry of Science and Technology of China (Grant No. 2018AAA0101301), by the Hong Kong Innovation and Technology Commission (InnoHK Project CIMDA), and in part by the Hong Kong GRF-RGC General Research Fund under Grant 11209819 (CityU 9042816) and Grant 11203820 (9042598).

Publisher Copyright:
© 2023


Keywords:

  • Convolutional neural network
  • Image quality assessment
  • Non-local information
  • Transformers


