[TOG 2017 ADFA] [Lip Sync.] [Context Expression]
Audio-Driven Facial Animation by Joint End-to-End Learning of Pose and Emotion
Figure: ADFA (TOG 2017)
Task: Use a CNN to regress the vertex positions of a 3D face mesh from audio in an end-to-end manner.
Motion: Mesh Vertex.
Dataset: In-house 4D capture recorded with a DI4D PRO system.
Views: It maps the raw audio waveform directly to the 3D coordinates of a face mesh. A trainable latent parameter is learned to capture the emotional state; during inference, this parameter can be varied to simulate different emotions (see the sketch after this entry).
Problems: It captures the idiosyncrasies of an individual speaker, making it unsuitable for generalization across characters.
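A minimal sketch of this idea, assuming PyTorch (layer sizes, feature dimensions, and all names below are illustrative, not the paper's exact architecture): a convolutional encoder over a short audio-feature window is concatenated with a trainable emotion vector and decoded to mesh vertex positions.

```python
import torch
import torch.nn as nn

class ADFAStyleNet(nn.Module):
    """Illustrative audio-to-vertex regressor with a learnable emotion latent.
    All dimensions are placeholders, not the paper's exact architecture."""
    def __init__(self, n_vertices=5000, emotion_dim=16, audio_feat_dim=32, audio_frames=64):
        super().__init__()
        # 1D convolutions over the time axis of the audio feature window
        self.audio_encoder = nn.Sequential(
            nn.Conv1d(audio_feat_dim, 64, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv1d(64, 128, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),  # collapse time into one embedding per window
        )
        # Trainable emotion latent; a single shared vector here for brevity
        self.emotion = nn.Parameter(torch.zeros(emotion_dim))
        self.decoder = nn.Sequential(
            nn.Linear(128 + emotion_dim, 256), nn.ReLU(),
            nn.Linear(256, n_vertices * 3),  # xyz per vertex
        )
        self.n_vertices = n_vertices

    def forward(self, audio_window):
        # audio_window: (batch, audio_feat_dim, audio_frames)
        a = self.audio_encoder(audio_window).squeeze(-1)      # (batch, 128)
        e = self.emotion.expand(a.shape[0], -1)               # (batch, emotion_dim)
        verts = self.decoder(torch.cat([a, e], dim=-1))       # (batch, n_vertices*3)
        return verts.view(-1, self.n_vertices, 3)
```

At inference time the emotion vector can be replaced or interpolated to change the emotional state of the output, which is the controllability the Views line above refers to.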
[CVPR 2019 VOCA] [Lip Sync.] [Context Expression]
Capture, Learning, and Synthesis of 3D Speaking Styles
Figure: VOCA (CVPR 2019)
Task: Build a DNN-based encoder-decoder and disentangle speaker identity from the audio to achieve generalization across subjects.
Motion: Mesh Vertex Offsets.
Dataset: VOCASET, 4D Scanned Mesh data.
Views: An end-to-end deep neural network for speech-to-animation translation, trained on multiple subjects. From an extracted audio embedding, VOCA regresses the 3D vertices of a FLAME face model, conditioned on a subject label (see the sketch after this entry).
Problems: Requires high-quality 4D scans recorded in a studio setup. Needs more training data than approaches built on prior parametric models, since the latent motion space must be learned from data.
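A minimal sketch of the conditioning, assuming PyTorch (the paper uses DeepSpeech audio features and a one-hot subject label; layer sizes and names below are placeholders): the subject code is concatenated with the audio feature, and the network predicts per-vertex offsets that are added to the subject's FLAME template.

```python
import torch
import torch.nn as nn

class VOCAStyleRegressor(nn.Module):
    """Illustrative speech-to-offset regressor conditioned on a subject label.
    FLAME meshes have 5023 vertices; other sizes are placeholders."""
    def __init__(self, audio_dim=29, n_subjects=8, n_vertices=5023, hidden=256):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(audio_dim + n_subjects, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        # Decoder outputs per-vertex displacements from the neutral template
        self.decoder = nn.Linear(hidden, n_vertices * 3)
        self.n_vertices = n_vertices

    def forward(self, audio_feat, subject_onehot, template_verts):
        # audio_feat: (batch, audio_dim) per-frame speech features (e.g. DeepSpeech outputs)
        # subject_onehot: (batch, n_subjects) identity condition
        # template_verts: (batch, n_vertices, 3) neutral FLAME template of the subject
        x = torch.cat([audio_feat, subject_onehot], dim=-1)
        offsets = self.decoder(self.encoder(x)).view(-1, self.n_vertices, 3)
        return template_verts + offsets  # animated mesh for this frame
```

Because identity enters only through the one-hot condition and the template mesh, the same audio can drive any of the training subjects, which is what gives VOCA its cross-subject generalization.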
[ECCV 2024 UniTalker] [Dataset Assembly]
UniTalker: Scaling up Audio-Driven 3D Facial Animation through A Unified Model
Figure: UniTalker learns from various data formats and scales up training data (ECCV 2024)
Figure: UniTalker architecture with multi-head decoders (ECCV 2024)
Task: Address the inconsistent annotations across existing 3D datasets, which restrict previous models to specific data formats and limit scalability.
Motivation: Unify 5 public datasets plus 3 new ones, expanding training data from under 1 hour to 18.5 hours, and train multi-head decoders so that multiple annotation targets can supervise one model (see the sketch after this entry).
Motion: For vertex-based annotations, motion is vertex displacement; for parameter-based annotations, motion is parameter vectors.
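A minimal sketch of the multi-head decoding, assuming PyTorch (the dataset names, head dimensions, and encoder below are placeholders): a shared audio encoder feeds dataset-specific output heads, so vertex-annotated and parameter-annotated datasets can supervise the same backbone.

```python
import torch
import torch.nn as nn

class MultiHeadTalker(nn.Module):
    """Illustrative shared-encoder / multi-head decoder for mixed annotation formats.
    Head output sizes are placeholders for each dataset's annotation format."""
    def __init__(self, audio_dim=768, hidden=512, head_dims=None):
        super().__init__()
        # e.g. one vertex-based target (5023 vertices * xyz) and one blendshape-based target
        head_dims = head_dims or {"vertex_dataset": 5023 * 3, "blendshape_dataset": 52}
        self.encoder = nn.Sequential(
            nn.Linear(audio_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.heads = nn.ModuleDict(
            {name: nn.Linear(hidden, dim) for name, dim in head_dims.items()}
        )

    def forward(self, audio_feat, dataset_name):
        # audio_feat: (batch, audio_dim) per-frame features from a pretrained speech encoder
        # dataset_name: selects the decoder head matching this sample's annotation format
        h = self.encoder(audio_feat)
        return self.heads[dataset_name](h)

# Training-loop sketch: each mini-batch supervises only its own head, e.g.
# loss = mse(model(audio, "vertex_dataset"), gt_vertex_offsets.flatten(1))
```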
[TVCG 2024 Learn2Talk] [Lip Sync.] [3D Supervision From 2D]
Learn2Talk: 3D Talking Face Learns from 2D Talking Face
Figure: Learn2Talk (TVCG 2024)
Task: Enhances 3D lip-sync accuracy by extending SyncNet to SyncNet3D.
Motivation: SyncNet is widely used in 2D talking-head generation to measure the temporal alignment between speech and lip motion.
Summary:
LVE: The 3D lip vertex error (LVE), used as a 3D reconstruction loss, measures per-frame 3D lip accuracy (see the sketch after this list).
SyncNet3D: Measures the temporal relationship between speech audio and facial motion.
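A minimal sketch of both measurements, assuming PyTorch (the lip-vertex index set is assumed to be given, and the sync score below is only a cosine-similarity stand-in for the learned SyncNet3D embedding): LVE is commonly computed as the maximal L2 deviation of the lip vertices from ground truth in each frame, averaged over frames.

```python
import torch
import torch.nn.functional as F

def lip_vertex_error(pred, gt, lip_idx):
    """LVE: per-frame maximal L2 error over lip vertices, averaged over frames.
    pred, gt: (T, V, 3) vertex sequences; lip_idx: indices of the lip region (assumed given)."""
    diff = pred[:, lip_idx] - gt[:, lip_idx]          # (T, L, 3)
    per_vertex = diff.norm(dim=-1)                    # (T, L) L2 distance per lip vertex
    return per_vertex.max(dim=-1).values.mean()       # max over lip vertices, mean over frames

def sync_score(audio_emb, motion_emb):
    """Stand-in for a SyncNet3D-style score: cosine similarity between time-aligned
    audio and facial-motion embeddings, both (T, D). The real SyncNet3D learns these
    embeddings contrastively from in-sync / out-of-sync audio-motion pairs."""
    return F.cosine_similarity(audio_emb, motion_emb, dim=-1).mean()
```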