Audio-driven talking face generation (ATFG) is a multifaceted task: it requires high-quality faces, lip movements synchronized with the audio, and plausible facial expressions. Previous works have mainly focused on minimizing a single loss function designed to account for these multiple performance requirements. However, the loss function is only a proxy, and minimizing it does not always yield optimal performance. Moreover, it is unlikely that a single model can be optimal in terms of all relevant quality indicators. In this paper, we formulate the training of an ATFG model as an indicator-aware multi-objective optimization problem and propose a novel approach, namely Evolutionary Multi-objective Indicator-aware audio-driven talking face Generation (EMIG), to solve this problem. EMIG explicitly uses the quality indicators in the design process of ATFG models and can be easily adapted to the preferences of users. Experimental studies demonstrate the potential advantages of evolutionary multi-objective optimization for solving the task of ATFG and the effectiveness of EMIG over five state-of-the-art ATFG methods. © 2022 IEEE.
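The multi-objective view in the abstract can be illustrated with a minimal sketch of Pareto (non-dominated) filtering followed by a preference-based choice. This is not the EMIG algorithm itself, which the abstract does not specify. The candidate names, the three indicator columns, their values, and the weights are all hypothetical, and every indicator is assumed to be oriented so that lower is better.

```python
from typing import Dict, List


def dominates(a: List[float], b: List[float]) -> bool:
    """Return True if `a` is no worse than `b` on every indicator and
    strictly better on at least one (lower values are better)."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def pareto_front(scores: Dict[str, List[float]]) -> List[str]:
    """Keep the candidates that no other candidate dominates."""
    return [
        name
        for name, s in scores.items()
        if not any(dominates(t, s) for other, t in scores.items() if other != name)
    ]


def pick_by_preference(
    scores: Dict[str, List[float]], front: List[str], weights: List[float]
) -> str:
    """Pick one front member by a user-supplied weighted sum (lower is better)."""
    return min(front, key=lambda n: sum(w * v for w, v in zip(weights, scores[n])))


# Hypothetical indicator values per candidate model, in the order
# [image-quality error, lip-sync error, expression error].
candidates = {
    "model_a": [0.20, 0.50, 0.40],
    "model_b": [0.30, 0.30, 0.45],
    "model_c": [0.35, 0.55, 0.50],  # worse than model_a on every indicator
    "model_d": [0.25, 0.60, 0.20],
}

front = pareto_front(candidates)
# Hypothetical user preference that emphasizes lip synchronization.
choice = pick_by_preference(candidates, front, weights=[0.2, 0.6, 0.2])
print(front, choice)
```

Keeping the whole non-dominated set, rather than collapsing the indicators into one loss before training, is what allows a different model to be chosen when the user's weights change.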
|Title of host publication|Proceedings of the 2022 IEEE Symposium Series on Computational Intelligence, SSCI 2022
|Publisher|Institute of Electrical and Electronics Engineers Inc.
|Number of pages
|Publication status|Published - 4 Dec 2022
Bibliographical note: This work was supported by a grant from the National Natural Science Foundation of China (Grant No. 62106098). Corresponding author: Wenjing Hong.
- Audio-driven talking face generation
- Evolutionary multi-objective optimization