[TOG 2017 ADFA] [Lip Sync.] [Context Expression]
Audio-Driven Facial Animation by Joint End-to-End Learning of Pose and Emotion
Figure: ADFA (TOG 2017)
Task: Use a CNN to regress the vertex positions of a 3D face mesh from audio in an end-to-end manner.
Motion: Mesh Vertex.
Dataset: In-house 4D capture recorded with a DI4D PRO system.
Views: It maps the raw audio waveform directly to the 3D coordinates of a face mesh. A trainable latent parameter is learned to capture the emotional state; during inference, this parameter can be varied to simulate different emotions (see the sketch after this entry).
Problems: It captures the idiosyncrasies of an individual speaker, making it unsuitable for generalization across characters.
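A minimal sketch of this idea, assuming PyTorch (layer sizes, feature dimensions, and all names below are illustrative, not the paper's exact architecture): a convolutional encoder over a short audio-feature window is concatenated with a trainable emotion vector and decoded to mesh vertex positions.

```python
import torch
import torch.nn as nn

class ADFAStyleNet(nn.Module):
    """Illustrative audio-to-vertex regressor with a learnable emotion latent.
    All dimensions are placeholders, not the paper's exact architecture."""
    def __init__(self, n_vertices=5000, emotion_dim=16, audio_feat_dim=32, audio_frames=64):
        super().__init__()
        # 1D convolutions over the time axis of the audio feature window
        self.audio_encoder = nn.Sequential(
            nn.Conv1d(audio_feat_dim, 64, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv1d(64, 128, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),  # collapse time into one embedding per window
        )
        # Trainable emotion latent; a single shared vector here for brevity
        self.emotion = nn.Parameter(torch.zeros(emotion_dim))
        self.decoder = nn.Sequential(
            nn.Linear(128 + emotion_dim, 256), nn.ReLU(),
            nn.Linear(256, n_vertices * 3),  # xyz per vertex
        )
        self.n_vertices = n_vertices

    def forward(self, audio_window):
        # audio_window: (batch, audio_feat_dim, audio_frames)
        a = self.audio_encoder(audio_window).squeeze(-1)      # (batch, 128)
        e = self.emotion.expand(a.shape[0], -1)               # (batch, emotion_dim)
        verts = self.decoder(torch.cat([a, e], dim=-1))       # (batch, n_vertices*3)
        return verts.view(-1, self.n_vertices, 3)
```

At inference time the emotion vector can be replaced or interpolated to change the emotional state of the output, which is the controllability the Views line above refers to.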
[CVPR 2019 VOCA] [Lip Sync.] [Context Expression]
Capture, Learning, and Synthesis of 3D Speaking Styles
Figure: VOCA (CVPR 2019)
Task: Build a DNN-based encoder-decoder and disentangle speaker identity from the audio to achieve generalization across subjects.
Motion: Mesh Vertex Offsets.
Dataset: VOCASET, 4D Scanned Mesh data.
Views: An end-to-end deep neural network for speech-to-animation translation, trained on multiple subjects. From an extracted audio embedding, VOCA regresses the 3D vertices of a FLAME face model, conditioned on a subject label (see the sketch after this entry).
Problems: Requires high-quality 4D scans recorded in a studio setup. Needs more training data than approaches built on prior parametric models, since the latent motion space must be learned from data.
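A minimal sketch of the conditioning, assuming PyTorch (the paper uses DeepSpeech audio features and a one-hot subject label; layer sizes and names below are placeholders): the subject code is concatenated with the audio feature, and the network predicts per-vertex offsets that are added to the subject's FLAME template.

```python
import torch
import torch.nn as nn

class VOCAStyleRegressor(nn.Module):
    """Illustrative speech-to-offset regressor conditioned on a subject label.
    FLAME meshes have 5023 vertices; other sizes are placeholders."""
    def __init__(self, audio_dim=29, n_subjects=8, n_vertices=5023, hidden=256):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(audio_dim + n_subjects, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        # Decoder outputs per-vertex displacements from the neutral template
        self.decoder = nn.Linear(hidden, n_vertices * 3)
        self.n_vertices = n_vertices

    def forward(self, audio_feat, subject_onehot, template_verts):
        # audio_feat: (batch, audio_dim) per-frame speech features (e.g. DeepSpeech outputs)
        # subject_onehot: (batch, n_subjects) identity condition
        # template_verts: (batch, n_vertices, 3) neutral FLAME template of the subject
        x = torch.cat([audio_feat, subject_onehot], dim=-1)
        offsets = self.decoder(self.encoder(x)).view(-1, self.n_vertices, 3)
        return template_verts + offsets  # animated mesh for this frame
```

Because identity enters only through the one-hot condition and the template mesh, the same audio can drive any of the training subjects, which is what gives VOCA its cross-subject generalization.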
[ECCV 2024 UniTalker] [Dataset Assembly]
UniTalker: Scaling up Audio-Driven 3D Facial Animation through A Unified Model
Figure: UniTalker learns from various data formats and scales up training data (ECCV 2024)
Figure: UniTalker architecture with multi-head decoders (ECCV 2024)
Task: Address the inconsistent annotations across existing 3D datasets, which restrict previous models to specific data formats and limit scalability.
Motivation: Unify 5 public datasets plus 3 new ones, expanding training data from under 1 hour to 18.5 hours, and train multi-head decoders so that multiple annotation targets can supervise one model (see the sketch after this entry).
Motion: For vertex-based annotations, motion is vertex displacement; for parameter-based annotations, motion is parameter vectors.
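A minimal sketch of the multi-head decoding, assuming PyTorch (the dataset names, head dimensions, and encoder below are placeholders): a shared audio encoder feeds dataset-specific output heads, so vertex-annotated and parameter-annotated datasets can supervise the same backbone.

```python
import torch
import torch.nn as nn

class MultiHeadTalker(nn.Module):
    """Illustrative shared-encoder / multi-head decoder for mixed annotation formats.
    Head output sizes are placeholders for each dataset's annotation format."""
    def __init__(self, audio_dim=768, hidden=512, head_dims=None):
        super().__init__()
        # e.g. one vertex-based target (5023 vertices * xyz) and one blendshape-based target
        head_dims = head_dims or {"vertex_dataset": 5023 * 3, "blendshape_dataset": 52}
        self.encoder = nn.Sequential(
            nn.Linear(audio_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.heads = nn.ModuleDict(
            {name: nn.Linear(hidden, dim) for name, dim in head_dims.items()}
        )

    def forward(self, audio_feat, dataset_name):
        # audio_feat: (batch, audio_dim) per-frame features from a pretrained speech encoder
        # dataset_name: selects the decoder head matching this sample's annotation format
        h = self.encoder(audio_feat)
        return self.heads[dataset_name](h)

# Training-loop sketch: each mini-batch supervises only its own head, e.g.
# loss = mse(model(audio, "vertex_dataset"), gt_vertex_offsets.flatten(1))
```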
[TVCG 2024 Learn2Talk] [Lip Sync.] [3D Supervision From 2D]
Learn2Talk: 3D Talking Face Learns from 2D Talking Face
Figure: Learn2Talk (TVCG 2024)
Task: Enhances 3D lip-sync accuracy by extending SyncNet to SyncNet3D.
Motivation: SyncNet is widely used in 2D talking-head generation to measure the temporal alignment between speech and lip motion.
Summary:
LVE: The 3D lip vertex error (LVE), used as a 3D reconstruction loss, measures per-frame 3D lip accuracy (see the sketch after this list).
SyncNet3D: Measures the temporal relationship between speech audio and facial motion.
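A minimal sketch of both measurements, assuming PyTorch (the lip-vertex index set is assumed to be given, and the sync score below is only a cosine-similarity stand-in for the learned SyncNet3D embedding): LVE is commonly computed as the maximal L2 deviation of the lip vertices from ground truth in each frame, averaged over frames.

```python
import torch
import torch.nn.functional as F

def lip_vertex_error(pred, gt, lip_idx):
    """LVE: per-frame maximal L2 error over lip vertices, averaged over frames.
    pred, gt: (T, V, 3) vertex sequences; lip_idx: indices of the lip region (assumed given)."""
    diff = pred[:, lip_idx] - gt[:, lip_idx]          # (T, L, 3)
    per_vertex = diff.norm(dim=-1)                    # (T, L) L2 distance per lip vertex
    return per_vertex.max(dim=-1).values.mean()       # max over lip vertices, mean over frames

def sync_score(audio_emb, motion_emb):
    """Stand-in for a SyncNet3D-style score: cosine similarity between time-aligned
    audio and facial-motion embeddings, both (T, D). The real SyncNet3D learns these
    embeddings contrastively from in-sync / out-of-sync audio-motion pairs."""
    return F.cosine_similarity(audio_emb, motion_emb, dim=-1).mean()
```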