Task-Driven Video Compression for Humans and Machines: Framework Design and Optimization

Xiaokai YI, Hanli WANG, Sam KWONG, C.-C. Jay KUO

Research output: Journal Publications, Journal Article (refereed), peer-review

4 Citations (Scopus)


Learned video compression has progressed rapidly in recent years. Despite their compression efficiency, existing methods are oriented toward either signal fidelity or semantic fidelity, which limits their ability to meet the requirements of both human and machine vision. To address this problem, a task-driven video compression framework is proposed that flexibly supports vision tasks for both human vision and machine vision. Specifically, to improve compression performance, the backbone of the framework is optimized with three novel modules: multi-scale motion estimation, multi-frame feature fusion, and reference-based in-loop filters. Then, building on this efficient compression backbone, a task-driven optimization approach is designed to balance signal-fidelity-oriented compression against semantic-fidelity-oriented compression. Moreover, a post-filter module is added to the framework to further improve the performance of the human vision branch. Finally, rate-distortion performance, rate-accuracy performance, and subjective quality are used as evaluation metrics, and experimental results show the superiority of the proposed framework for both human vision and machine vision. The source code of this work can be found at https://mic.tongji.edu.cn.
Original language: English
Journal: IEEE Transactions on Multimedia
Publication status: E-pub ahead of print - 30 Dec 2022
Externally published: Yes


  • action recognition
  • Feature extraction
  • Image coding
  • Machine vision
  • multi-task optimization
  • neural network
  • Neural networks
  • Semantics
  • Task analysis
  • video coding for machine
  • Video compression

