Abstract
With the advent of generative AI, recent years have witnessed rapid growth in AI applications within the field of audiovisual translation (AVT). Scholars have explored various topics in this area, as highlighted in the 2023 special issue of Linguistica Antverpiensia, New Series – Themes in Translation Studies, including major challenges in AI dubbing, human-machine cooperation, machine translation and subtitling, post-editing, and quality evaluation. However, further efforts are needed to deepen our understanding of current and future trends. Among these topics, the quality assessment of subtitling and dubbing has attracted the most attention. A commonly used framework is the FAR model (functional equivalence, acceptability, readability), introduced by Jan Pedersen for evaluating subtitling. This model can also be applied to the assessment of dubbing quality with the addition of another critical factor: synchrony (S). While most scholars focus on isochrony, lip sync, and kinesic synchrony, we argue that the synchrony of acoustic factors, such as voice and tempo, is equally vital for an optimal audiovisual experience.

In this study, we employed two AI dubbing applications, ElevenLabs and HeyGen, to generate English-dubbed clips of the Chinese TV drama 甄嬛传 (Empresses in the Palace). Both platforms integrate automatic speech recognition, machine translation, and text-to-speech synthesis. ElevenLabs can dub by cloning the original voice, while HeyGen modifies the speaker's lip movements to align with the English speech. Using the resulting FARS model, we evaluated the initial clips without post-editing.

Our findings reveal that both AI applications produced moderately accurate translations of the source speech, although errors persist, stemming primarily from incorrect recognition of the source speech. In terms of the synchrony of acoustic factors, ElevenLabs cannot achieve lip synchronization, but it can generate target speech that closely resembles the source voice; however, several main characters' voices sound unnatural, and the tempo of some utterances is faster than the original because of lengthy direct translations. Conversely, HeyGen performs better in maintaining an appropriate tempo, but it struggles to replicate the original Chinese voices, and its lip synchronization is unstable. Notably, in scenes where the speaker is off-screen, the lips of other actors who are not speaking are occasionally altered incorrectly. Overall, this study demonstrates the potential of human-machine collaboration to produce high-quality audiovisual content effectively.
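To make the three-stage cascade concrete, below is a minimal illustrative sketch of the pipeline the abstract describes (speech recognition, machine translation, speech synthesis), built from open-source stand-ins: openai-whisper for ASR, a Helsinki-NLP Marian model via Hugging Face transformers for Chinese-to-English translation, and pyttsx3 for a generic offline TTS voice. It does not reflect the actual internals of ElevenLabs or HeyGen, whose voice-cloning and lip-sync stages are proprietary, and the input file name is hypothetical.

```python
# Illustrative ASR -> MT -> TTS cascade using open-source stand-ins.
# NOT how ElevenLabs or HeyGen work internally; it only mirrors the
# three-stage pipeline described in the abstract.
# Assumed input: a Mandarin audio clip "clip_zh.wav" (hypothetical file).
import whisper                      # pip install openai-whisper
from transformers import pipeline   # pip install transformers sentencepiece
import pyttsx3                      # pip install pyttsx3

# 1. Automatic speech recognition: transcribe the Chinese source speech.
asr_model = whisper.load_model("small")
transcript = asr_model.transcribe("clip_zh.wav", language="zh")["text"]

# 2. Machine translation: Chinese -> English.
translator = pipeline("translation", model="Helsinki-NLP/opus-mt-zh-en")
english_text = translator(transcript)[0]["translation_text"]

# 3. Text-to-speech: synthesize an English dub track with a generic
#    offline voice. Commercial tools replace this step with voice
#    cloning and, in HeyGen's case, add a lip-sync stage on the video.
tts = pyttsx3.init()
tts.save_to_file(english_text, "clip_en.wav")
tts.runAndWait()
```

Even in this toy cascade, any mis-recognition at stage 1 is passed verbatim to the translator and the synthesizer, which is consistent with the finding above that most translation errors stemmed from incorrect recognition of the source speech.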
| Original language | English |
|---|---|
| Publication status | Published - 21 May 2025 |
| Event | The 11th Asia-Pacific Translation and Interpreting Forum (APTIF11): Culture, Connectivity and Technology: Translating Communities, Transforming Perspectives - Hong Kong Baptist University, Hong Kong, China; Duration: 21 May 2025 → 23 May 2025; https://ctn.hkbu.edu.hk/aptif11/ |
Conference
| Conference | The 11th Asia-Pacific Translation and Interpreting Forum (APTIF11): Culture, Connectivity and Technology: Translating Communities, Transforming Perspectives |
|---|---|
| Abbreviated title | APTIF11 |
| Country/Territory | Hong Kong, China |
| City | Hong Kong |
| Period | 21/05/25 → 23/05/25 |
| Internet address | https://ctn.hkbu.edu.hk/aptif11/ |
Keywords
- AI dubbing
- Quality assessment
- FAR model
- Audiovisual translation