[ICCV 2021 AD-NeRF] [Lip Sync.] [Specific Animation]
AD-NeRF: Audio Driven Neural Radiance Fields for Talking Head Synthesis
Figure: AD-NeRF (ICCV 2021)
Task: Generating high-fidelity talking-head video that is synchronized with the input audio sequence.
Motivation: Existing methods suffer information loss from their intermediate representations (e.g., landmarks or 3DMM parameters), whereas a Neural Radiance Field (NeRF) adopts an implicit scene representation and renders the head directly.
Motion: NeRF
Dataset: Self-collected. Average video length is 3–5 minutes, all at 25 fps.
Problem:
- Heavy Training: Requires several hours of training time, hindering rapid transfer to other individuals.
- Less Control: Due to the lack of a unified representation, these methods fail to generate videos driven by multiple conditions.
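The core idea above is a radiance field conditioned on audio: a network maps a 3D point, a view direction, and an audio feature to colour and density, which a volume renderer then integrates. Below is a minimal, untrained sketch of that interface using numpy with random weights; the layer sizes, the 16-dim audio feature, and the class name `AudioConditionedField` are illustrative assumptions, not the paper's actual architecture.

```python
import numpy as np

def positional_encoding(x, num_freqs=4):
    """Map coordinates to sin/cos features at multiple frequencies (as in NeRF)."""
    feats = [x]
    for i in range(num_freqs):
        feats.append(np.sin(2.0**i * np.pi * x))
        feats.append(np.cos(2.0**i * np.pi * x))
    return np.concatenate(feats, axis=-1)

class AudioConditionedField:
    """Toy implicit field F(x, d, a) -> (rgb, sigma).

    Weights are random: this only illustrates the interface, not a trained model.
    """
    def __init__(self, audio_dim=16, hidden=64, num_freqs=4, seed=0):
        rng = np.random.default_rng(seed)
        enc_dim = 3 * (2 * num_freqs + 1)          # encoded point / direction size
        in_dim = 2 * enc_dim + audio_dim
        self.num_freqs = num_freqs
        self.w1 = rng.normal(0, 0.1, (in_dim, hidden))
        self.w2 = rng.normal(0, 0.1, (hidden, 4))  # 3 rgb channels + 1 density

    def __call__(self, xyz, viewdir, audio_feat):
        h = np.concatenate([
            positional_encoding(xyz, self.num_freqs),
            positional_encoding(viewdir, self.num_freqs),
            np.broadcast_to(audio_feat, xyz.shape[:-1] + audio_feat.shape[-1:]),
        ], axis=-1)
        h = np.maximum(h @ self.w1, 0.0)            # ReLU hidden layer
        out = h @ self.w2
        rgb = 1.0 / (1.0 + np.exp(-out[..., :3]))   # sigmoid -> colour in [0, 1]
        sigma = np.maximum(out[..., 3:], 0.0)       # non-negative density
        return rgb, sigma

field = AudioConditionedField()
pts = np.random.default_rng(1).uniform(-1, 1, (128, 3))  # sample points along rays
dirs = np.tile([[0.0, 0.0, 1.0]], (128, 1))
audio = np.zeros(16)  # stand-in for a per-frame audio feature (e.g. from DeepSpeech)
rgb, sigma = field(pts, dirs, audio)
```

Because the audio feature enters the field alongside position and direction, changing it deforms the rendered geometry and appearance per frame, which is what removes the need for an explicit intermediate face representation.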
[IJCV 2022 RAD-NeRF] [Lip Sync.] [Specific Animation]
RAD-NeRF: Real-time Neural Radiance Talking Portrait Synthesis via Audio-spatial Decomposition
Figure: RAD-NeRF (IJCV 2022)
Task: Modeling high-dimensional audio-driven facial dynamics with low-dimensional feature grids, decomposing the portrait into audio and spatial components (facial keypoints are used to smooth the sampled audio features).
Motivation: Dynamic NeRF exhibits slow training and inference speed.
Motion: NeRF
Dataset: Talking-portrait videos collected by previous works.
Problem:
- Requires a complex MLP-based grid encoder to implicitly learn regional audio-motion mapping, limiting convergence and reconstruction quality.
- Identity-dependent: The trained model generalizes poorly to a different person.
- Controllability: Cannot explicitly control facial expressions and poses, sometimes resulting in unsatisfactory outcomes.
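The speed-up described above comes from replacing large MLP queries with cheap interpolation into small feature grids: a dense spatial grid indexed by the 3D point, plus a grid indexed by a low-dimensional audio coordinate. The sketch below shows that lookup pattern with numpy; the grid resolutions, feature sizes, and the scalar `audio_coord` are assumptions for illustration (RAD-NeRF's actual encoders and audio-coordinate regression differ).

```python
import numpy as np

def grid_lookup_1d(grid, coord):
    """Linearly interpolate features from a 1-D grid; coord in [0, 1]."""
    n = grid.shape[0]
    t = np.clip(coord, 0.0, 1.0) * (n - 1)
    i0 = np.floor(t).astype(int)
    i1 = np.minimum(i0 + 1, n - 1)
    w = t - i0
    return (1 - w)[..., None] * grid[i0] + w[..., None] * grid[i1]

def trilinear_lookup(grid, xyz):
    """Trilinearly interpolate features from a dense 3-D grid; xyz in [0, 1]^3."""
    n = grid.shape[0]
    t = np.clip(xyz, 0.0, 1.0) * (n - 1)
    i0 = np.floor(t).astype(int)
    i1 = np.minimum(i0 + 1, n - 1)
    w = t - i0
    out = 0.0
    for dx in (0, 1):            # accumulate the 8 corner contributions
        for dy in (0, 1):
            for dz in (0, 1):
                idx = (np.where(dx, i1[..., 0], i0[..., 0]),
                       np.where(dy, i1[..., 1], i0[..., 1]),
                       np.where(dz, i1[..., 2], i0[..., 2]))
                weight = (np.where(dx, w[..., 0], 1 - w[..., 0]) *
                          np.where(dy, w[..., 1], 1 - w[..., 1]) *
                          np.where(dz, w[..., 2], 1 - w[..., 2]))
                out = out + weight[..., None] * grid[idx]
    return out

rng = np.random.default_rng(0)
spatial_grid = rng.normal(0, 0.1, (32, 32, 32, 8))  # learnable spatial features
audio_grid = rng.normal(0, 0.1, (64, 4))            # features over a 1-D audio axis

pts = rng.uniform(0, 1, (100, 3))
audio_coord = np.full((100,), 0.37)  # low-dim coordinate regressed from audio (assumed)
feat = np.concatenate([trilinear_lookup(spatial_grid, pts),
                       grid_lookup_1d(audio_grid, audio_coord)], axis=-1)
# feat would feed a tiny MLP head predicting colour and density per sample
```

Each query is a handful of array reads and interpolations instead of a deep MLP forward pass, which is why grid-based fields train and render much faster; the trade-off, as the problem list notes, is that the audio-motion mapping is learned implicitly through these grids.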