Audio-driven Talking Face Generation with Stabilized Synchronization Loss

Karlsruhe Institute of Technology, Istanbul Technical University, Carnegie Mellon University
ECCV 2024

Abstract

In the task of talking face generation, the objective is to generate a face video with lips synchronized to the corresponding audio while preserving visual details and identity information. Current methods face the challenge of learning accurate lip synchronization while avoiding detrimental effects on visual quality, as well as robustly evaluating such synchronization. To tackle these problems, we propose utilizing an audio-visual speech representation expert (AV-HuBERT) for calculating lip synchronization loss during training. Moreover, leveraging AV-HuBERT's features, we introduce three novel lip synchronization evaluation metrics, aiming to provide a comprehensive assessment of lip synchronization performance. Experimental results, along with a detailed ablation study, demonstrate the effectiveness of our approach and the utility of the proposed evaluation metrics.
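To illustrate the general idea of using a frozen audio-visual speech expert for the synchronization loss, below is a minimal sketch that assumes frame-level AV-HuBERT audio and visual features have already been extracted and temporally aligned; the exact loss formulation in the paper may differ.

```python
# A minimal, illustrative sketch of an expert-based lip-sync loss computed on
# temporally aligned audio and visual features from a frozen audio-visual
# speech model such as AV-HuBERT. The feature tensors are assumed inputs;
# the exact loss used in the paper may differ.
import torch
import torch.nn.functional as F

def expert_sync_loss(audio_feats: torch.Tensor, visual_feats: torch.Tensor) -> torch.Tensor:
    """audio_feats, visual_feats: (batch, time, dim) frame-level features."""
    # Frame-wise cosine similarity between the two modalities.
    sim = F.cosine_similarity(audio_feats, visual_feats, dim=-1)  # (batch, time)
    # Penalize low similarity, i.e., lip movements that do not match the audio.
    return (1.0 - sim).mean()
```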

1. Generated Videos by Our Model

We use 5 different videos with a single audio clip to generate talking faces. The videos are sample videos provided by DINet. For face enhancement, we used VQFR, as proposed in our paper.

We use 5 different videos with a single audio clip to generate talking faces. The videos are sample videos provided by DINet. For face enhancement, instead of VQFR, which we proposed in our paper, we employed GFPGAN to show the effect of a different face enhancement method.

We use 6 different videos with a single audio clip to generate talking faces. The videos are sample videos provided by DINet. For face enhancement, we used VQFR, as proposed in our paper.

We use 5 different videos with a single audio clip to generate talking faces. The videos, along with two audio clips, are provided by VideoReTalking on their GitHub page under the examples folder. For face enhancement, we used VQFR, as proposed in our paper.

We use the same 5 videos (presented above) with another audio clip to generate talking faces. The videos, along with the two audio clips, are provided by VideoReTalking on their GitHub page under the examples folder. For face enhancement, we used VQFR, as proposed in our paper.

2. Comparison with the Recent Methods

Here, we compare our method with the most recent and accurate methods: DINet (AAAI 2023), TalkLip (CVPR 2023), and VideoReTalking (SIGGRAPH Asia 2022). Since the other recent methods in our paper (e.g., SyncTalkFace and LipFormer) have no public models, we were not able to include them in this comparison. Please note that when we tried to generate videos at an FPS other than 25 with the DINet and TalkLip models, audio-visual synchronization was lost. Therefore, we kept the FPS at 25 (as shared by the authors) and generated the videos below. With the other models and our model, we kept the FPS of the original video, which is 29.97. As a result, the DINet and TalkLip outputs appear to have different poses and head movements; this is solely due to the aforementioned FPS difference, and there is NO generation error or ANY other problem with these two models. However, if it is not possible to generate videos at an FPS other than 25, this is a significant drawback of these methods, since many videos have an FPS above 30. In such cases, the generated videos will either be longer or suffer from reduced smoothness and information loss.
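For reference, one possible workaround is to resample the source video to 25 FPS before generation. The sketch below shows this pre-processing step, assuming ffmpeg is installed; the file names are placeholders, and this step is not part of our pipeline.

```python
# A minimal pre-processing sketch, assuming ffmpeg is installed on the system:
# resample a source video to 25 FPS before feeding it to models that only
# support 25 FPS. File names are placeholders.
import subprocess

def to_25_fps(src: str = "input.mp4", dst: str = "input_25fps.mp4") -> None:
    # -r 25 sets the output frame rate; -y overwrites the destination file.
    subprocess.run(["ffmpeg", "-y", "-i", src, "-r", "25", dst], check=True)

if __name__ == "__main__":
    to_25_fps()
```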

3. Comparison with the Older Methods

Here, we compare our results with the PC-AVS, Wav2Lip, and EAMM papers. We chose two videos from the HDTF dataset and randomly took a clip from each of them. We then ran the published models to generate the videos and combined them into a single video, showing only the faces to make the comparison easier.

4. Ablation Study

This video contains the outputs of the different setups explained in Section 4.3 of our paper for the ablation of the components. The corresponding scores are presented in the first part of Table 2 (Ablation=Components) in the paper.

  1. Setup A: Base model with the original/plain synchronization loss (an illustrative sketch of this loss is shown after the list)
  2. Setup D: Base model with the silent-lip generation model and the pretrained SyncNet audio encoder
  3. Setup E: Setup D with our proposed stabilized synchronization loss instead of the original/plain synchronization loss
  4. Setup G: Setup E with our adaptive triplet loss. This is the full model with all our contributions and without post-processing (face enhancement).
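For context, the sketch below shows a plain SyncNet-style synchronization loss of the kind used in Setup A (a common formulation, e.g., in Wav2Lip). It is only an illustrative baseline, not the stabilized loss proposed in the paper.

```python
# An illustrative sketch of a "plain" SyncNet-style synchronization loss, as
# commonly used in prior work (e.g., Wav2Lip). This is NOT the stabilized loss
# proposed in the paper; it only illustrates the Setup A-style baseline.
import torch
import torch.nn.functional as F

def plain_sync_loss(audio_emb: torch.Tensor, video_emb: torch.Tensor) -> torch.Tensor:
    """audio_emb, video_emb: (batch, dim) embeddings from a pretrained SyncNet."""
    # Cosine similarity, clamped into (0, 1] so it can be treated as a probability.
    d = F.cosine_similarity(audio_emb, video_emb, dim=-1).clamp(min=1e-7, max=1.0)
    # Binary cross-entropy against the "in-sync" label (all ones).
    return F.binary_cross_entropy(d, torch.ones_like(d))
```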

5. Additional Videos Generated by Our Model


5.1. Our full pipeline with face restoration

5.2. Our full pipeline without face restoration

5.3. Different speakers with the same audio

We use 6 different speakers and 5 different audio clips consecutively to generate talking faces.

BibTeX

@misc{yaman2024audiodriventalkingfacegeneration,
  title={Audio-driven Talking Face Generation with Stabilized Synchronization Loss},
  author={Dogucan Yaman and Fevziye Irem Eyiokur and Leonard Bärmann and Hazim Kemal Ekenel and Alexander Waibel},
  year={2024},
  eprint={2307.09368},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2307.09368},
}