Abstract:Audio-driven 3D facial animation aims to map input audio to realistic facial motion. Despite significant progress, limitations arise from inconsistent 3D annotations, restricting previous models to training on specific annotations and thereby constraining the training scale. In this work, we present UniTalker, a unified model featuring a multi-head architecture designed to effectively leverage datasets with varied annotations. To enhance training stability and ensure consistency among multi-head outputs, we employ three training strategies, namely, PCA, model warm-up, and pivot identity embedding. To expand the training scale and diversity, we assemble A2F-Bench, comprising five publicly available datasets and three newly curated datasets. These datasets contain a wide range of audio domains, covering multilingual speech voices and songs, thereby scaling the training data from commonly employed datasets, typically less than 1 hour, to 18.5 hours. With a single trained UniTalker model, we achieve substantial lip vertex error reductions of 9.2% for BIWI dataset and 13.7% for Vocaset. Additionally, the pre-trained UniTalker exhibits promise as the foundation model for audio-driven facial animation tasks. Fine-tuning the pre-trained UniTalker on seen datasets further enhances performance on each dataset, with an average error reduction of 6.3% on A2F-Bench. Moreover, fine-tuning UniTalker on an unseen dataset with only half the data surpasses prior state-of-the-art models trained on the full dataset. The code and dataset are available at the project page https://github.com/X-niper/UniTalker.
Abstract:Recent works based on convolutional encoder-decoder architecture and 3DMM parameterization have shown great potential for canonical view reconstruction from a single input image. Conventional CNN architectures benefit from exploiting the spatial correspondence between the input and output pixels. However, in 3D face reconstruction, the spatial misalignment between the input image (e.g. face) and the canonical/UV output makes the feature encoding-decoding process quite challenging. In this paper, to tackle this problem, we propose a new network architecture, namely the Affine Convolution Networks, which enables CNN based approaches to handle spatially non-corresponding input and output images and maintain high-fidelity quality output at the same time. In our method, an affine transformation matrix is learned from the affine convolution layer for each spatial location of the feature maps. In addition, we represent 3D human heads in UV space with multiple components, including diffuse maps for texture representation, position maps for geometry representation, and light maps for recovering more complex lighting conditions in the real world. All the components can be trained without any manual annotations. Our method is parametric-free and can generate high-quality UV maps at resolution of 512 x 512 pixels, while previous approaches normally generate 256 x 256 pixels or smaller. Our code will be released once the paper got accepted.