[ECCV 2020 Neural Voice Puppetry] [Lip Sync.]
Neural Voice Puppetry: Audio-driven Facial Reenactment
Figure: NVP (ECCV 2020)
Task: Predict per-frame expression coefficients that drive an expression blendshape basis (see the sketch after this block).
Motivation: Synthetic audio (e.g., digital voice assistants, text-to-speech) has no matching photo-realistic talking face; the visual counterpart is largely missing.
Motion: 3DMM Coefficients.
Dataset: 116 videos with an average length of 1.7 min (302,750 frames in total).
Problem: Leveraging explicit facial structural priors can accumulate errors when predicting such intermediate representations, and these errors propagate to the final rendering.
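For context on how predicted coefficients "drive an expression blendshape basis", here is a minimal sketch of the standard linear 3DMM expression model. It assumes a generic morphable model with hypothetical dimensions and array names (`N_VERTS`, `N_EXP`, `exp_basis`); it is not NVP's actual basis or network, only the per-frame coefficient-to-mesh step.

```python
import numpy as np

# Hypothetical dimensions for a generic 3DMM (illustrative only).
N_VERTS = 5023          # number of mesh vertices
N_EXP = 76              # number of expression blendshapes

mean_shape = np.zeros((N_VERTS, 3))                      # neutral face geometry
exp_basis = np.random.randn(N_EXP, N_VERTS, 3) * 1e-3    # expression blendshape basis

def apply_expression(exp_coeffs: np.ndarray) -> np.ndarray:
    """Deform the neutral mesh with one frame of predicted expression coefficients.

    exp_coeffs: (N_EXP,) coefficients, e.g. regressed from an audio window.
    """
    # Linear blendshape model: mesh = mean + sum_k coeff_k * basis_k
    offset = np.tensordot(exp_coeffs, exp_basis, axes=1)  # (N_VERTS, 3)
    return mean_shape + offset

# Per-frame usage: the audio-to-expression network predicts one coefficient
# vector per video frame; each vector deforms the shared neutral mesh.
frame_coeffs = np.zeros(N_EXP)
frame_coeffs[3] = 0.8   # e.g. open the jaw (index is illustrative only)
mesh = apply_expression(frame_coeffs)   # (N_VERTS, 3)
```

Because the model is linear in the coefficients, any error in the predicted coefficients translates directly into a geometric error on the mesh, which is the error-accumulation concern noted above.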
[CVPR 2023 SadTalker] [Lip Sync.] [Context Expression] [Head Pose]
SadTalker: Learning Realistic 3D Motion Coefficients for Stylized Audio-Driven Single Image Talking Face Animation
Figure: SadTalker (CVPR 2023)
Task: Generate 3D motion coefficients (expression and head pose) of a 3DMM from audio.
Motivation: Prior one-shot talking-face methods suffer from unnatural head movement, distorted expressions, and identity modification.
Motion: Expression coefficients and head pose (see the sketch after this block).
Dataset:
Views:
- An early attempt to use lip-only 3DMM expression coefficients as the audio-driven target.
- Generates 3D motion coefficients (expression and head pose) from audio, yielding realistic head movement and facial expressions.
Problems:
- Approaches that rely on 3D intermediate representations struggle to capture subtle expressions and realistic motion, which limits the quality of the generated portrait animation.
- A recurring challenge is the limited capacity of the 3D mesh to capture intricate details, constraining overall dynamism and realism; omitting the intermediate representation may improve naturalness.
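To make "expression and head pose" concrete, the sketch below shows how the two kinds of 3DMM motion coefficients are typically composed into a posed mesh: a linear expression offset followed by a rigid transform from an Euler-angle pose. This is an illustrative, generic composition under assumed conventions (Euler order, 6-DoF pose vector), not SadTalker's actual ExpNet/PoseVAE or its image-space renderer.

```python
import numpy as np

def euler_to_rotation(pitch: float, yaw: float, roll: float) -> np.ndarray:
    """Rotation matrix from Euler angles in radians (axis order is illustrative)."""
    cx, sx = np.cos(pitch), np.sin(pitch)
    cy, sy = np.cos(yaw), np.sin(yaw)
    cz, sz = np.cos(roll), np.sin(roll)
    Rx = np.array([[1, 0, 0], [0, cx, -sx], [0, sx, cx]])
    Ry = np.array([[cy, 0, sy], [0, 1, 0], [-sy, 0, cy]])
    Rz = np.array([[cz, -sz, 0], [sz, cz, 0], [0, 0, 1]])
    return Rz @ Ry @ Rx

def pose_and_express(mean_shape, exp_basis, exp_coeffs, pose):
    """Compose expression and head-pose coefficients into a posed mesh.

    mean_shape: (V, 3) neutral geometry
    exp_basis:  (K, V, 3) expression blendshapes
    exp_coeffs: (K,) audio-driven expression coefficients (lip sync)
    pose:       (6,) array = (pitch, yaw, roll, tx, ty, tz) per frame
    """
    # Non-rigid part: expression offset on the neutral mesh.
    expressed = mean_shape + np.tensordot(exp_coeffs, exp_basis, axes=1)
    # Rigid part: head rotation and translation applied to the expressed face.
    R = euler_to_rotation(*pose[:3])
    t = pose[3:]
    return expressed @ R.T + t
```

Decoupling the two streams, as in this sketch, is what lets an audio-to-coefficient model handle lip sync and head motion separately; the limitation noted above is that the final detail and realism are still bounded by what this coefficient space can represent.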