Abstract
Efficient talking face video coding and control are crucial in modern video communication, reshaping how individuals connect, collaborate, and interact. Coding seeks to reduce transmission costs, while control enables user-customizable facial expressions and head poses in the transmitted videos. However, the common paradigm of applying control algorithms before video coding yields unsatisfactory compression efficiency. In this paper, we propose an efficient Controllable Generative Talking Face Video Coding (CoFaCo) framework that seamlessly integrates control into the coding process. Specifically, CoFaCo projects talking face videos into ultra-compact, semantic feature representations that users can customize before compression. To enable independent control of pose and expression, we design a set of losses that accurately decouple the pose and expression direction codes. Given the decoupled direction codes and the semantic face representations, the pose and expression control modules can be effectively learned to generate decoupled, controlled direction codes. These controlled direction codes are then smoothed to improve the temporal consistency of the video output by the generators. Experimental results demonstrate that CoFaCo achieves competitive compression efficiency in ultra-low bit rate video reconstruction and control tasks, providing valuable insights for advancing face video communication with diverse control capabilities.
| Original language | English |
|---|---|
| Journal | IEEE Transactions on Image Processing |
| Early online date | 12 Jan 2026 |
| Publication status | E-pub ahead of print - 12 Jan 2026 |
Bibliographical note
Publisher Copyright: © 1992-2012 IEEE.
Keywords
- generative neural network
- customizable control
- face representation
- talking face video coding