3D-Aware Text-Driven Talking Avatar Generation

Xiuzhe WU, Yang-Tian SUN, Handi CHEN, Hang ZHOU, Jingdong WANG, Zhengzhe LIU, Xiaojuan QI*

*Corresponding author for this work

Research output: Book Chapters | Papers in Conference Proceedings › Conference paper (refereed) › Research › peer-review

Abstract

This paper introduces text-driven talking avatar generation, a task that uses text to instruct both the generation and animation of an avatar. One significant obstacle in this task is the absence of paired text and talking avatar data for model training, which limits data-driven methodologies. To this end, we present a zero-shot approach that adapts an existing 3D-aware image generation model, trained on a large-scale image dataset for high-quality avatar creation, to align with textual instructions and be animated to produce talking avatars, eliminating the need for paired text and talking avatar data. The core of our approach lies in seamlessly integrating a 3D-aware image generation model (i.e., EG3D), the explicit 3DMM model, and a newly developed self-supervised inpainting technique to create and animate the avatar and generate a temporally consistent talking video. Thorough evaluations demonstrate the effectiveness of our proposed approach in generating realistic avatars from textual descriptions and empowering avatars to express user-specified text. Notably, our approach is highly controllable and can generate rich expressions and head poses.

Original language: English
Title of host publication: Computer Vision – ECCV 2024: 18th European Conference, Milan, Italy, September 29–October 4, 2024, Proceedings, Part LXXXVIII
Editors: Aleš LEONARDIS, Elisa RICCI, Stefan ROTH, Olga RUSSAKOVSKY, Torsten SATTLER, Gül VAROL
Publisher: Springer Science and Business Media Deutschland GmbH
Pages: 416-433
Number of pages: 18
ISBN (Print): 9783031732225
DOIs
Publication status: Published - 2025
Externally published: Yes
Event: 18th European Conference on Computer Vision, ECCV 2024 - Milan, Italy
Duration: 29 Sept 2024 – 4 Oct 2024

Publication series

Name: Lecture Notes in Computer Science
Publisher: Springer
Volume: 15146
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349

Conference

Conference: 18th European Conference on Computer Vision, ECCV 2024
Country/Territory: Italy
City: Milan
Period: 29/09/24 – 4/10/24

Bibliographical note

Publisher Copyright:
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2025.

Funding

This work has been supported by Hong Kong Research Grant Council - Early Career Scheme (Grant No. 27209621), General Research Fund Scheme (Grant No. 17202422), Theme-based Research (Grant No. T45-701/22-R) and RGC Matching Fund Scheme (RMGS). Part of the described research work is conducted in the JC STEM Lab of Robotics for Soft Materials funded by The Hong Kong Jockey Club Charities Trust.

Keywords

  • Talking Avatar
  • Text
  • Training Efficiency
