[ICCV 2021 AD-NeRF] [Lip Sync.] [Specific Animation]
AD-NeRF: Audio Driven Neural Radiance Fields for Talking Head Synthesis
Figure: AD-NeRF (ICCV 2021)
Task: Generating high-fidelity talking-head video that is synchronized with the input audio sequence.
Motivation: Existing methods suffer information loss from their intermediate representations (e.g., landmarks or 3DMM parameters), whereas a Neural Radiance Field (NeRF) adopts an implicit scene representation and renders the head directly.
Motion: NeRF
Dataset: Self-collected. Average video length is 3–5 minutes, all at 25 fps.
Problem:
- Heavy Training: Requires several hours of training time, hindering rapid transfer to other individuals.
- Less Control: Due to the lack of a unified representation, these methods fail to generate videos driven by multiple conditions.
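The core idea above is a radiance field conditioned on audio: a network maps a 3D point, a view direction, and an audio feature to colour and density, which a volume renderer then integrates. Below is a minimal, untrained sketch of that interface using numpy with random weights; the layer sizes, the 16-dim audio feature, and the class name `AudioConditionedField` are illustrative assumptions, not the paper's actual architecture.

```python
import numpy as np

def positional_encoding(x, num_freqs=4):
    """Map coordinates to sin/cos features at multiple frequencies (as in NeRF)."""
    feats = [x]
    for i in range(num_freqs):
        feats.append(np.sin(2.0**i * np.pi * x))
        feats.append(np.cos(2.0**i * np.pi * x))
    return np.concatenate(feats, axis=-1)

class AudioConditionedField:
    """Toy implicit field F(x, d, a) -> (rgb, sigma).

    Weights are random: this only illustrates the interface, not a trained model.
    """
    def __init__(self, audio_dim=16, hidden=64, num_freqs=4, seed=0):
        rng = np.random.default_rng(seed)
        enc_dim = 3 * (2 * num_freqs + 1)          # encoded point / direction size
        in_dim = 2 * enc_dim + audio_dim
        self.num_freqs = num_freqs
        self.w1 = rng.normal(0, 0.1, (in_dim, hidden))
        self.w2 = rng.normal(0, 0.1, (hidden, 4))  # 3 rgb channels + 1 density

    def __call__(self, xyz, viewdir, audio_feat):
        h = np.concatenate([
            positional_encoding(xyz, self.num_freqs),
            positional_encoding(viewdir, self.num_freqs),
            np.broadcast_to(audio_feat, xyz.shape[:-1] + audio_feat.shape[-1:]),
        ], axis=-1)
        h = np.maximum(h @ self.w1, 0.0)            # ReLU hidden layer
        out = h @ self.w2
        rgb = 1.0 / (1.0 + np.exp(-out[..., :3]))   # sigmoid -> colour in [0, 1]
        sigma = np.maximum(out[..., 3:], 0.0)       # non-negative density
        return rgb, sigma

field = AudioConditionedField()
pts = np.random.default_rng(1).uniform(-1, 1, (128, 3))  # sample points along rays
dirs = np.tile([[0.0, 0.0, 1.0]], (128, 1))
audio = np.zeros(16)  # stand-in for a per-frame audio feature (e.g. from DeepSpeech)
rgb, sigma = field(pts, dirs, audio)
```

Because the audio feature enters the field alongside position and direction, changing it deforms the rendered geometry and appearance per frame, which is what removes the need for an explicit intermediate face representation.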
[IJCV 2022 RAD-NeRF] [Lip Sync.] [Specific Animation]
RAD-NeRF: Real-time Neural Radiance Talking Portrait Synthesis via Audio-spatial Decomposition
Figure: RAD-NeRF (IJCV 2022)
Task: Modeling high-dimensional audio-driven facial dynamics with low-dimensional feature grids, decomposing the portrait into audio and spatial components (facial keypoints are used to smooth the sampled audio features).
Motivation: Dynamic NeRF exhibits slow training and inference speed.
Motion: NeRF
Dataset: Talking-portrait videos collected by previous works.
Problem:
- Requires a complex MLP-based grid encoder to implicitly learn regional audio-motion mapping, limiting convergence and reconstruction quality.
- Identity-dependent: The trained model generalizes poorly to a different person.
- Controllability: Cannot explicitly control facial expressions and poses, sometimes resulting in unsatisfactory outcomes.
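The speed-up described above comes from replacing large MLP queries with cheap interpolation into small feature grids: a dense spatial grid indexed by the 3D point, plus a grid indexed by a low-dimensional audio coordinate. The sketch below shows that lookup pattern with numpy; the grid resolutions, feature sizes, and the scalar `audio_coord` are assumptions for illustration (RAD-NeRF's actual encoders and audio-coordinate regression differ).

```python
import numpy as np

def grid_lookup_1d(grid, coord):
    """Linearly interpolate features from a 1-D grid; coord in [0, 1]."""
    n = grid.shape[0]
    t = np.clip(coord, 0.0, 1.0) * (n - 1)
    i0 = np.floor(t).astype(int)
    i1 = np.minimum(i0 + 1, n - 1)
    w = t - i0
    return (1 - w)[..., None] * grid[i0] + w[..., None] * grid[i1]

def trilinear_lookup(grid, xyz):
    """Trilinearly interpolate features from a dense 3-D grid; xyz in [0, 1]^3."""
    n = grid.shape[0]
    t = np.clip(xyz, 0.0, 1.0) * (n - 1)
    i0 = np.floor(t).astype(int)
    i1 = np.minimum(i0 + 1, n - 1)
    w = t - i0
    out = 0.0
    for dx in (0, 1):            # accumulate the 8 corner contributions
        for dy in (0, 1):
            for dz in (0, 1):
                idx = (np.where(dx, i1[..., 0], i0[..., 0]),
                       np.where(dy, i1[..., 1], i0[..., 1]),
                       np.where(dz, i1[..., 2], i0[..., 2]))
                weight = (np.where(dx, w[..., 0], 1 - w[..., 0]) *
                          np.where(dy, w[..., 1], 1 - w[..., 1]) *
                          np.where(dz, w[..., 2], 1 - w[..., 2]))
                out = out + weight[..., None] * grid[idx]
    return out

rng = np.random.default_rng(0)
spatial_grid = rng.normal(0, 0.1, (32, 32, 32, 8))  # learnable spatial features
audio_grid = rng.normal(0, 0.1, (64, 4))            # features over a 1-D audio axis

pts = rng.uniform(0, 1, (100, 3))
audio_coord = np.full((100,), 0.37)  # low-dim coordinate regressed from audio (assumed)
feat = np.concatenate([trilinear_lookup(spatial_grid, pts),
                       grid_lookup_1d(audio_grid, audio_coord)], axis=-1)
# feat would feed a tiny MLP head predicting colour and density per sample
```

Each query is a handful of array reads and interpolations instead of a deep MLP forward pass, which is why grid-based fields train and render much faster; the trade-off, as the problem list notes, is that the audio-motion mapping is learned implicitly through these grids.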