[TOG 2017 ADFA] [Lip Sync.] [Context Expression]

Audio-Driven Facial Animation by Joint End-to-End Learning of Pose and Emotion


Figure: ADFA (TOG 2017)

Task: Use a CNN to predict the vertex positions of a 3D mesh from audio by end-to-end learning.
Motion: Mesh Vertex.
Dataset: DI4D PRO system.
Views: It maps the raw audio waveform directly to the 3D vertex coordinates of a face mesh. A trainable latent parameter captures the emotional state; at inference time this parameter can be edited to produce different emotions (see the sketch below).
Problems: The model captures the idiosyncrasies of a single actor, so it does not generalize across characters.
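A minimal PyTorch sketch of this idea (not the authors' code): a small CNN encodes a window of audio features, a trainable emotion latent is concatenated with the audio code, and a dense decoder regresses every vertex coordinate. All layer sizes, the feature channels, and the emotion dimension are illustrative assumptions.

```python
import torch
import torch.nn as nn

class AudioToVertices(nn.Module):
    def __init__(self, audio_channels=32, n_vertices=5023, emotion_dim=16):
        super().__init__()
        # 1-D convolutions over a short window of per-frame audio features
        self.audio_net = nn.Sequential(
            nn.Conv1d(audio_channels, 64, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv1d(64, 128, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),
        )
        # fuse the audio code with the emotion latent and regress all vertex coordinates
        self.decoder = nn.Sequential(
            nn.Linear(128 + emotion_dim, 256), nn.ReLU(),
            nn.Linear(256, n_vertices * 3),
        )

    def forward(self, audio_feats, emotion):
        # audio_feats: (B, audio_channels, T) feature window; emotion: (B, emotion_dim)
        h = self.audio_net(audio_feats).squeeze(-1)            # (B, 128)
        verts = self.decoder(torch.cat([h, emotion], dim=-1))  # (B, n_vertices * 3)
        return verts.view(verts.shape[0], -1, 3)               # (B, n_vertices, 3)
```

At inference the audio input stays fixed while the emotion vector is swapped or interpolated, which mirrors the "modify the trainable parameter to simulate different emotions" idea above.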


[CVPR 2019 VOCA] [Lip Sync.] [Context Expression]

Capture, Learning, and Synthesis of 3D Speaking Styles


Figure: VOCA (CVPR 2019)

Task: Build a DNN-based encoder-decoder that disentangles identity from the audio to achieve generalization across subjects.
Motion: Mesh Vertex Offsets.
Dataset: VOCASET, 4D Scanned Mesh data.
Views: VOCA is an end-to-end deep neural network for speech-to-animation translation trained on multiple subjects. From the extracted audio embedding, it regresses vertex offsets on a FLAME face template, conditioned on a subject label (see the sketch below).
Problems: Requires high-quality 4D scans recorded in a studio setup, and needs more training data than methods built on a prior parametric (latent) model.
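A minimal sketch of that pipeline, assuming DeepSpeech-style per-frame speech features stacked over a short window; the feature and layer sizes are assumptions, not the released VOCA architecture. The one-hot subject label conditions the encoder, and the decoder predicts vertex offsets added to the subject's neutral FLAME template.

```python
import torch
import torch.nn as nn

class VOCALike(nn.Module):
    def __init__(self, feat_dim=29 * 16, n_subjects=8, n_vertices=5023):
        super().__init__()
        # audio features for one window plus a one-hot subject label go into the encoder
        self.encoder = nn.Sequential(
            nn.Linear(feat_dim + n_subjects, 256), nn.ReLU(),
            nn.Linear(256, 64), nn.ReLU(),
        )
        # the decoder outputs per-vertex offsets, not absolute positions
        self.decoder = nn.Linear(64, n_vertices * 3)

    def forward(self, speech_feats, subject_onehot, template):
        # speech_feats: (B, feat_dim), subject_onehot: (B, n_subjects)
        # template: (B, n_vertices, 3) neutral FLAME mesh of the target identity
        z = self.encoder(torch.cat([speech_feats, subject_onehot], dim=-1))
        offsets = self.decoder(z).view(template.shape)
        return template + offsets  # animated mesh for this frame
```

Predicting offsets over a template, rather than absolute positions, is what lets the same audio drive different identities by swapping the template and subject label.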


[ECCV 2024 UniTalker] [Dataset Assembly]

UniTalker: Scaling up Audio-Driven 3D Facial Animation through A Unified Model


Figure: UniTalker learns from various data formats and scales up training data (ECCV 2024)


Figure: UniTalker architecture with multi-head decoders (ECCV 2024)

Task: Inconsistent annotations across existing 3D datasets restrict previous models to specific data formats, limiting scalability.
Motivation: Unifies 5 public datasets plus 3 newly assembled ones, expanding training data from under 1 hour to 18.5 hours, and trains multi-head decoders so that multiple annotation targets can supervise one shared model (see the sketch below).
Motion: For vertex-based annotations, motion is vertex displacement; for parameter-based annotations, motion is parameter vectors.
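A minimal sketch of the multi-head idea, assuming a shared audio backbone with one output head per annotation format; all dimensions and head names below are illustrative placeholders, not UniTalker's actual configuration.

```python
import torch
import torch.nn as nn

class MultiHeadTalker(nn.Module):
    def __init__(self, audio_dim=768, hidden=256, head_dims=None):
        super().__init__()
        # head_dims maps an annotation format to its output size, e.g. dense
        # vertex displacements for one dataset and parameter vectors for another
        head_dims = head_dims or {"vertices": 5023 * 3, "params": 56}
        self.backbone = nn.Sequential(
            nn.Linear(audio_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.heads = nn.ModuleDict(
            {name: nn.Linear(hidden, dim) for name, dim in head_dims.items()}
        )

    def forward(self, audio_feats, head_name):
        # audio_feats: (B, T, audio_dim) frame-level features from a pretrained speech model
        h = self.backbone(audio_feats)
        return self.heads[head_name](h)  # (B, T, head_dims[head_name])
```

Each training batch comes from one dataset, so only that dataset's head receives a loss while the shared backbone accumulates gradients from every annotation format; this is how heterogeneous datasets scale up a single model.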


[MM 2020 Learn2Talk] [Lip Sync.] [3D Supervision From 2D]

Learn2Talk: 3D Talking Face Learns from 2D Talking Face


Figure: Learn2Talk (MM 2020)

Task: Enhances 3D lip-sync accuracy by extending SyncNet to SyncNet3D.
Motivation: SyncNet is widely used in 2D talking-head generation to measure the temporal alignment between speech and lip motion; Learn2Talk brings this form of supervision to 3D.
Summary:

LVE: The 3D lip vertex error (LVE), used as a 3D reconstruction loss, measures per-frame accuracy of the lip region.
SyncNet3D: Measures the temporal alignment between the speech audio and the 3D facial motion (a minimal sketch of both measures follows below).
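A minimal sketch of the two measures, with assumptions stated up front: the lip-vertex index set and the max-then-mean reduction follow the common LVE convention in this line of work rather than necessarily Learn2Talk's exact definition, and the sync score below stands in for the learned SyncNet3D contrastive score with a plain cosine similarity.

```python
import torch
import torch.nn.functional as F

def lip_vertex_error(pred, gt, lip_idx):
    # pred, gt: (T, V, 3) predicted / ground-truth vertex sequences
    # lip_idx:  LongTensor of lip-region vertex indices (mesh-specific)
    per_vertex = torch.norm(pred[:, lip_idx] - gt[:, lip_idx], dim=-1)  # (T, L)
    per_frame = per_vertex.max(dim=-1).values  # worst lip vertex in each frame
    return per_frame.mean()                    # averaged over all frames

def sync_score(audio_emb, motion_emb):
    # SyncNet-style idea: embeddings of an audio window and the corresponding
    # motion window should be close when the two streams are in sync
    return F.cosine_similarity(audio_emb, motion_emb, dim=-1).mean()
```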


