[CVPR 2023 CodeTalker] [Lip Sync.] [Context Expression]
CodeTalker: Speech-Driven 3D Facial Animation with Discrete Motion Prior
Figure: CodeTalker (CVPR 2023)
Task: Reduce cross-modal mapping uncertainty and the regression-to-mean effect (over-smoothed motion) by casting speech-driven facial animation as a code-query task.
Motivation:
By mapping speech to a finite proxy space (discrete codebook), uncertainty is reduced and synthesis quality improves.
Motion: Mesh Vertex Offsets.
Dataset: BIWI, VOCASET.
Views: Uses a VQ-VAE to learn a discrete code space, offering more detailed representations than parameter-based methods (a minimal sketch of the code-query step follows this entry).
Problems:
- Diversity: No explicit stochasticity is introduced, and no diversity evaluation is reported.
- Controllability: Lacks explicit emotion control; adjusting the output requires vertex-level manipulation.
- Style: Motion priors are stored in a shared codebook; storing stylized or expressive priors remains under-explored.
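To make the code-query idea concrete, here is a minimal sketch of a VQ-VAE-style vector-quantization layer. This is not CodeTalker's actual implementation: the class name, dimensions, and the straight-through trick shown here are illustrative assumptions.

```python
import torch
import torch.nn as nn

class VectorQuantizer(nn.Module):
    """Illustrative VQ layer: snaps continuous motion features to the
    nearest entry of a learned codebook (the finite proxy space)."""

    def __init__(self, num_codes: int = 256, code_dim: int = 64):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, code_dim)
        nn.init.uniform_(self.codebook.weight, -1.0 / num_codes, 1.0 / num_codes)

    def forward(self, z: torch.Tensor):
        # z: (batch, time, code_dim) continuous features from an encoder.
        flat = z.reshape(-1, z.size(-1))                      # (B*T, D)
        # Squared L2 distance to every code: ||z||^2 - 2*z.c + ||c||^2.
        dists = (flat.pow(2).sum(1, keepdim=True)
                 - 2.0 * flat @ self.codebook.weight.t()
                 + self.codebook.weight.pow(2).sum(1))        # (B*T, K)
        idx = dists.argmin(dim=1).view(z.shape[:-1])          # discrete indices
        z_q = self.codebook(idx)                              # quantized latents
        # Straight-through estimator: gradients bypass the argmin.
        z_q = z + (z_q - z).detach()
        return z_q, idx
```

At inference, a speech encoder predicts which codes to query and a motion decoder maps the quantized latents back to per-frame vertex offsets; restricting outputs to codebook entries is what curbs regression-to-mean over-smoothing.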
[ICCV 2025 GaussianSpeech] [Lip Sync.] [Context Expression]
GaussianSpeech: Audio-Driven Gaussian Avatars
Figure: GaussianSpeech (ICCV 2025)
Task: Decompose lip and wrinkle features from audio to animate a FLAME mesh; render via 3D Gaussian Splatting.
Motivation: Real-time, free-viewpoint, photorealistic 3D-consistent facial synthesis.
Motion: Mesh Vertex Offsets.
Dataset: Multi-View Audio-Visual Dataset
Views: Combines 3D Gaussian fields with dynamic deformation for lip-sync realism and avatar flexibility.
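Below is a toy sketch of the "Gaussians ride on the deforming mesh" idea. The anchoring scheme, names, and translation-only deformation are assumptions for illustration; the actual method also has to handle rotation, scale, and learned appearance corrections.

```python
import torch

def deform_gaussians(gaussian_means: torch.Tensor,
                     parent_vertex: torch.Tensor,
                     rest_vertices: torch.Tensor,
                     posed_vertices: torch.Tensor) -> torch.Tensor:
    """Translate each 3D Gaussian with the mesh vertex it is anchored to.

    gaussian_means : (N, 3) Gaussian centers in the rest pose
    parent_vertex  : (N,)   index of the FLAME vertex each Gaussian follows
    rest_vertices  : (V, 3) FLAME vertices in the rest pose
    posed_vertices : (V, 3) FLAME vertices after audio-driven offsets
    """
    vertex_offsets = posed_vertices - rest_vertices        # (V, 3) per-vertex motion
    return gaussian_means + vertex_offsets[parent_vertex]  # (N, 3) posed centers

if __name__ == "__main__":
    V, N = 5023, 20000                    # FLAME vertex count; Gaussian count
    v_rest = torch.randn(V, 3)
    offsets = 0.01 * torch.randn(V, 3)    # stand-in for audio-predicted offsets
    means0 = torch.randn(N, 3)
    parents = torch.randint(0, V, (N,))
    means_t = deform_gaussians(means0, parents, v_rest, v_rest + offsets)
    print(means_t.shape)                  # torch.Size([20000, 3])
```

Per output frame, the deformed Gaussians would then be rasterized with a standard 3D Gaussian Splatting renderer to obtain the free-viewpoint image.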
[IJCAI 2025 GLDiTalker] [Lip Sync.] [Context Expression]
GLDiTalker: Speech-Driven 3D Facial Animation with Graph Latent Diffusion Transformer
Figure: GLDiTalker (IJCAI 2025)
Task: Improve lip sync via a graph-based encoder and codebook; enhance diversity via latent diffusion.
Motivation: Address inconsistency and misalignment between the audio and facial-motion modalities.
Motion: Mesh Vertex Offsets.
Dataset: BIWI, VOCASET.
Views: Uses facial meshes and a graph-latent-diffusion transformer to enable temporally stable 3D facial animation.
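To sketch the latent-diffusion half, here is a generic DDPM-style ancestral sampler over a motion latent conditioned on audio features. The `denoiser` interface, noise schedule, and latent shape are assumptions; GLDiTalker's actual second stage operates on the graph-encoded quantized space.

```python
import torch

@torch.no_grad()
def sample_motion_latent(denoiser, audio_feat: torch.Tensor,
                         latent_shape=(1, 32, 64), T: int = 1000) -> torch.Tensor:
    """Generic DDPM ancestral sampling loop; `denoiser(x_t, t, audio_feat)`
    is assumed to predict the noise added at step t."""
    betas = torch.linspace(1e-4, 0.02, T)          # linear noise schedule
    alphas = 1.0 - betas
    alpha_bar = torch.cumprod(alphas, dim=0)

    x = torch.randn(latent_shape)                  # start from pure noise
    for t in reversed(range(T)):
        eps = denoiser(x, torch.tensor([t]), audio_feat)
        # Posterior mean of x_{t-1} given the predicted noise eps.
        mean = (x - betas[t] / torch.sqrt(1.0 - alpha_bar[t]) * eps) \
               / torch.sqrt(alphas[t])
        noise = torch.randn_like(x) if t > 0 else torch.zeros_like(x)
        x = mean + torch.sqrt(betas[t]) * noise    # sigma_t^2 = beta_t
    return x

# Smoke test with a stub denoiser (a real model would be audio-conditioned):
z = sample_motion_latent(lambda x, t, a: torch.zeros_like(x),
                         torch.zeros(1, 16), T=10)
```

Because sampling starts from noise, re-running with the same audio yields varied motions, which is the diversity lever this entry refers to; the sampled latent is then decoded back to vertex offsets.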