Abstract:3D face reconstruction from monocular images has promoted the development of various applications such as augmented reality. Though existing methods have made remarkable progress, most of them emphasize geometric reconstruction, while overlooking the importance of texture prediction. To address this issue, we propose VGG-Tex, a novel Vivid Geometry-Guided Facial Texture Estimation model designed for High Fidelity Monocular 3D Face Reconstruction. The core of this approach is leveraging 3D parametric priors to enhance the outcomes of 2D UV texture estimation. Specifically, VGG-Tex includes a Facial Attributes Encoding Module, a Geometry-Guided Texture Generator, and a Visibility-Enhanced Texture Completion Module. These components are responsible for extracting parametric priors, generating initial textures, and refining texture details, respectively. Based on the geometry-texture complementarity principle, VGG-Tex also introduces a Texture-guided Geometry Refinement Module to further balance the overall fidelity of the reconstructed 3D faces, along with corresponding losses. Comprehensive experiments demonstrate that our method significantly improves texture reconstruction performance compared to existing state-of-the-art methods.
Abstract:Kolmogorov-Arnold Networks (KAN) are an emerging neural network architecture in machine learning. Whether KAN can be a promising alternative to the commonly used Multi-Layer Perceptrons (MLP) has attracted great interest in the research community. Experiments in various fields have demonstrated that KAN-based machine learning can achieve comparable if not better performance than MLP-based methods, with much smaller parameter scales and better explainability. In this paper, we explore the incorporation of KAN into the actor and critic networks for offline reinforcement learning (RL). We evaluate the performance, parameter scales, and training efficiency of various KAN- and MLP-based conservative Q-learning (CQL) methods on the classical D4RL benchmark for offline RL. Our study demonstrates that KAN can achieve performance close to that of the commonly used MLP with significantly fewer parameters. This provides an option to choose the base networks according to the requirements of the offline RL tasks.
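As a rough illustration of the parameter-scale comparison, the sketch below counts learnable parameters for a fully connected MLP and for a KAN in which every input-output edge carries a spline with G + k coefficients (grid size G, spline order k), following the scaling described for the original KAN formulation. The layer widths, grid size, and function names are hypothetical and are not taken from the study; practical KAN implementations may add a few extra parameters per edge.

```python
def mlp_param_count(widths):
    """Weights plus biases of a fully connected network with the given layer widths."""
    return sum(n_in * n_out + n_out for n_in, n_out in zip(widths, widths[1:]))


def kan_param_count(widths, grid_size=5, spline_order=3):
    """Spline coefficients of a KAN: one (grid_size + spline_order)-term spline per edge."""
    per_edge = grid_size + spline_order
    return sum(n_in * n_out * per_edge for n_in, n_out in zip(widths, widths[1:]))


# Hypothetical critic networks: 17-dim state + 6-dim action in, scalar Q-value out.
mlp_widths = [23, 256, 256, 1]
kan_widths = [23, 16, 16, 1]

mlp_params = mlp_param_count(mlp_widths)
kan_params = kan_param_count(kan_widths)
```

At equal widths a KAN layer holds more parameters than an MLP layer, so any saving of this kind comes from the KAN reaching similar performance with much narrower layers.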
Abstract:Audio-driven 3D face animation is increasingly vital in live streaming and augmented reality applications. While remarkable progress has been observed, most existing approaches are designed for specific individuals with predefined speaking styles, thus neglecting the adaptability to varied speaking styles. To address this limitation, this paper introduces MetaFace, a novel methodology meticulously crafted for speaking style adaptation. Grounded in the novel concept of meta-learning, MetaFace is composed of several key components: the Robust Meta Initialization Stage (RMIS) for fundamental speaking style adaptation, the Dynamic Relation Mining Neural Process (DRMN) for forging connections between observed and unobserved speaking styles, and the Low-rank Matrix Memory Reduction Approach to enhance the efficiency of model optimization as well as learning style details. Leveraging these novel designs, MetaFace not only significantly outperforms robust existing baselines but also establishes a new state-of-the-art, as substantiated by our experimental results.
Abstract:Purpose: Current 3D Magnetic Resonance Spin TomogrAphy in Time-domain (MR-STAT) protocols use transient-state, gradient-spoiled gradient-echo sequences that are prone to cerebrospinal fluid (CSF) pulsation artifacts when applied to the brain. This study aims to develop a 3D MR-STAT protocol for whole-brain relaxometry that overcomes the challenges posed by CSF-induced ghosting artifacts. Method: We optimized the flip-angle train within the Cartesian 3D MR-STAT framework to achieve two objectives: (1) minimization of the noise level in the reconstructed quantitative maps, and (2) reduction of the CSF-to-white-matter signal ratio to suppress CSF signal and the associated pulsation artifacts. The optimized new sequence was tested on a gel/water-phantom to evaluate the accuracy of the quantitative maps, and on healthy volunteers to explore the effectiveness of the CSF artifact suppression and robustness of the new protocol. Results: A new optimized sequence with both high parameter encoding capability and low CSF intensity was proposed and initially validated in the gel/water-phantom experiment. In in-vivo experiments with five volunteers, the proposed CSF-suppressed sequence showed no CSF ghosting artifacts and overall greatly improved image quality for all quantitative maps compared to the baseline sequence. Statistical analysis indicated low inter-subject and inter-scan variability for quantitative parameters in gray matter and white matter (1.6%-2.4% for T1 and 2.0%-4.6% for T2), demonstrating the robustness of the new sequence. Conclusion: We presented a new 3D MR-STAT sequence with CSF suppression that effectively eliminates CSF pulsation artifacts. The new sequence ensures consistently high-quality, 1 mm^3 whole-brain relaxometry within a rapid 5.5-minute scan time.
Abstract:Recent years have witnessed a broader range of applications of image processing technologies in multiple industrial processes, such as smoke detection, security monitoring, and workpiece inspection. Distortions of different types and levels are inevitably introduced into an image during acquisition, compression, transmission, storage, and display, which can heavily degrade the image quality and thus strongly reduce the final display effect and clarity. To verify the reliability of existing image quality assessment methods, we establish a new industrial process image database (IPID), which contains 3000 distorted images generated by applying different distortion types at different levels to each of the 50 source images. We conduct a subjective test on these 3000 images to collect their subjective quality ratings in a well-suited laboratory environment. Finally, we perform comparison experiments on the IPID database to investigate the performance of some objective image quality assessment algorithms. The experimental results show that the state-of-the-art image quality assessment methods have difficulty in predicting the quality of images that contain multiple distortion types.
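The abstract does not name the criteria used to compare objective algorithms against the subjective ratings. A common choice, assumed here, is the Pearson linear correlation coefficient (PLCC) and the Spearman rank-order correlation coefficient (SROCC) between metric outputs and mean opinion scores. The sketch below shows both with made-up scores; the function names and data are illustrative only.

```python
import math


def pearson(x, y):
    """Pearson linear correlation coefficient (PLCC) between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def average_ranks(values):
    """1-based ranks; tied values share the mean of the ranks they occupy."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman rank-order correlation coefficient (SROCC): Pearson on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Made-up mean opinion scores and objective metric outputs for five images.
mos = [1.2, 2.5, 3.1, 4.0, 4.8]
metric = [0.10, 0.30, 0.35, 0.80, 0.95]

srocc = spearman(mos, metric)
plcc = pearson(mos, metric)
```

SROCC measures only whether the metric orders images the same way as the observers do, while PLCC also penalizes a nonlinear relationship between metric values and ratings.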
Abstract:Achieving high synchronization in the synthesis of realistic, speech-driven talking head videos presents a significant challenge. Traditional Generative Adversarial Networks (GAN) struggle to maintain consistent facial identity, while Neural Radiance Fields (NeRF) methods, although they can address this issue, often produce mismatched lip movements, inadequate facial expressions, and unstable head poses. A lifelike talking head requires synchronized coordination of subject identity, lip movements, facial expressions, and head poses. The absence of these synchronizations is a fundamental flaw, leading to unrealistic and artificial outcomes. To address the critical issue of synchronization, identified as the "devil" in creating realistic talking heads, we introduce SyncTalk. This NeRF-based method effectively maintains subject identity, enhancing synchronization and realism in talking head synthesis. SyncTalk employs a Face-Sync Controller to align lip movements with speech and innovatively uses a 3D facial blendshape model to capture accurate facial expressions. Our Head-Sync Stabilizer optimizes head poses, achieving more natural head movements. The Portrait-Sync Generator restores hair details and blends the generated head with the torso for a seamless visual experience. Extensive experiments and user studies demonstrate that SyncTalk outperforms state-of-the-art methods in synchronization and realism. We recommend watching the supplementary video: https://ziqiaopeng.github.io/synctalk
Abstract:Dance and music are closely related forms of expression, and mutual retrieval between dance videos and music is a fundamental task in various fields such as education, art, and sports. However, existing methods often suffer from unnatural generation effects or fail to fully explore the correlation between music and dance. To overcome these challenges, we propose BeatDance, a novel beat-based model-agnostic contrastive learning framework. BeatDance incorporates a Beat-Aware Music-Dance InfoExtractor, a Trans-Temporal Beat Blender, and a Beat-Enhanced Hubness Reducer to improve dance-music retrieval performance by utilizing the alignment between music beats and dance movements. We also introduce the Music-Dance (MD) dataset, a large-scale collection of over 10,000 music-dance video pairs for training and testing. Experimental results on the MD dataset demonstrate the superiority of our method over existing baselines, achieving state-of-the-art performance. The code and dataset will be made publicly available upon acceptance.
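The abstract does not specify the contrastive objective. A common choice for cross-modal retrieval, assumed here purely for illustration, is a symmetric InfoNCE loss over cosine similarities, evaluated with recall@1. The sketch below uses made-up 2-D embeddings; none of the names or values come from BeatDance itself.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def log_softmax_at(logits, index):
    """Log-softmax of the logits, evaluated at one index (numerically stable)."""
    m = max(logits)
    log_norm = m + math.log(sum(math.exp(z - m) for z in logits))
    return logits[index] - log_norm


def symmetric_info_nce(music, dance, temperature=0.1):
    """Mean of music-to-dance and dance-to-music InfoNCE; item i is paired with item i."""
    n = len(music)
    sim = [[cosine(m, d) / temperature for d in dance] for m in music]
    m2d = -sum(log_softmax_at(sim[i], i) for i in range(n)) / n
    d2m = -sum(log_softmax_at([sim[j][i] for j in range(n)], i) for i in range(n)) / n
    return 0.5 * (m2d + d2m)


def recall_at_1(music, dance):
    """Fraction of music clips whose most similar dance embedding is the paired one."""
    hits = 0
    for i, m in enumerate(music):
        scores = [cosine(m, d) for d in dance]
        hits += scores.index(max(scores)) == i
    return hits / len(music)


# Toy 2-D embeddings (made up): aligned pairs vs. the same dances in a mismatched order.
music = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
dance_aligned = [[0.9, 0.1], [0.1, 0.9], [-0.9, 0.1]]
dance_shuffled = [dance_aligned[1], dance_aligned[2], dance_aligned[0]]

loss_aligned = symmetric_info_nce(music, dance_aligned)
loss_shuffled = symmetric_info_nce(music, dance_shuffled)
```

The loss is low when each music embedding is closest to its own dance embedding and high when the pairing is broken, which is the signal a retrieval model of this kind would be trained on.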
Abstract:Scene Graph Generation is a critical enabler of environmental comprehension for autonomous robotic systems. Most existing methods, however, are hindered by background complexity, which limits their ability to fully decode the inherent topological information of the environment. Additionally, the wealth of contextual information encapsulated within depth cues is often left untapped, rendering existing approaches less effective. To address these shortcomings, we present STDG, a novel Depth-Guided One-Stage Scene Graph Generation methodology. The architecture of STDG consists of three custom-built modules: the Depth Guided HHA Representation Generation Module, the Depth Guided Semi-Teaching Network Learning Module, and the Depth Guided Scene Graph Generation Module. Together, these modules harness depth information at every stage, from depth signal generation and depth feature utilization to the final scene graph prediction. Importantly, this is achieved without imposing additional computational burden during the inference phase. Experimental results confirm that our method significantly enhances the performance of one-stage scene graph generation baselines.
Abstract:Speech-driven 3D face animation is a technique whose applications extend to various multimedia fields. Previous research has generated promising realistic lip movements and facial expressions from audio signals. However, traditional regression models driven solely by data face several essential problems, such as difficulties in accessing precise labels and domain gaps between different modalities, leading to unsatisfactory results lacking precision and coherence. To enhance the visual accuracy of generated lip movements while reducing the dependence on labeled data, we propose a novel framework, SelfTalk, which involves self-supervision in a cross-modal network system to learn 3D talking faces. The framework constructs a network system consisting of three modules: facial animator, speech recognizer, and lip-reading interpreter. The core of SelfTalk is a commutative training diagram that facilitates compatible feature exchange among audio, text, and lip shape, enabling our models to learn the intricate connection between these factors. The proposed framework leverages the knowledge learned from the lip-reading interpreter to generate more plausible lip shapes. Extensive experiments and user studies demonstrate that our proposed approach achieves state-of-the-art performance both qualitatively and quantitatively. We recommend watching the supplementary video.
Abstract:In this study, we develop a physics-informed deep learning-based method to synthesize multiple brain magnetic resonance imaging (MRI) contrasts from a single five-minute acquisition and investigate its ability to generalize to arbitrary contrasts to accelerate neuroimaging protocols. A dataset of fifty-five subjects acquired with a standard MRI protocol and a five-minute transient-state sequence was used to develop the method. The model, based on a generative adversarial network, maps data acquired from the five-minute scan to "effective" quantitative parameter maps, here named q*-maps, by using its generated PD, T1, and T2 values in a signal model to synthesize four standard contrasts (proton density-weighted, T1-weighted, T2-weighted, and T2-weighted fluid-attenuated inversion recovery), from which losses are computed. The q*-maps are compared to literature values and the synthetic contrasts are compared to an end-to-end deep learning-based method proposed in the literature. The generalizability of the proposed method is investigated for five volunteers by synthesizing three non-standard contrasts unseen during training and comparing these to respective ground truth acquisitions via contrast-to-noise ratio and quantitative assessment. The physics-informed method was able to match the high-quality synthMRI of the end-to-end method for the four standard contrasts, with mean ± standard deviation structural similarity metrics above 0.75 ± 0.08 and peak signal-to-noise ratios above 22.4 ± 1.9 and 22.6 ± 2.1. Additionally, the physics-informed method provided retrospective contrast adjustment, with visually similar signal contrast and comparable contrast-to-noise ratios to the ground truth acquisitions for three sequences unused for model training, demonstrating its generalizability and potential application to accelerate neuroimaging protocols.
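To make the idea of synthesizing contrasts from PD, T1, and T2 concrete, the sketch below evaluates the textbook idealized spin-echo and inversion-recovery signal equations for two tissues. The abstract does not state which signal model the study uses, so these equations, the tissue values, and the sequence timings are all assumptions chosen for illustration.

```python
import math


def spin_echo(pd, t1, t2, tr, te):
    """Idealized spin-echo signal: PD * (1 - exp(-TR/T1)) * exp(-TE/T2)."""
    return pd * (1.0 - math.exp(-tr / t1)) * math.exp(-te / t2)


def inversion_recovery(pd, t1, t2, tr, te, ti):
    """Idealized inversion-recovery spin-echo magnitude signal (e.g. for FLAIR)."""
    longitudinal = 1.0 - 2.0 * math.exp(-ti / t1) + math.exp(-tr / t1)
    return pd * abs(longitudinal) * math.exp(-te / t2)


def null_ti(t1, tr):
    """Inversion time at which tissue with the given T1 gives zero signal in this model."""
    return t1 * math.log(2.0 / (1.0 + math.exp(-tr / t1)))


# Rough, illustrative tissue values (PD in arbitrary units, times in ms).
white_matter = {"pd": 0.70, "t1": 800.0, "t2": 80.0}
csf = {"pd": 1.00, "t1": 4000.0, "t2": 2000.0}

# Hypothetical sequence settings (ms), not the protocol used in the study.
t1w = {"tr": 500.0, "te": 15.0}
t2w = {"tr": 4000.0, "te": 100.0}
flair = {"tr": 9000.0, "te": 100.0, "ti": null_ti(csf["t1"], 9000.0)}

wm_t1w, csf_t1w = spin_echo(**white_matter, **t1w), spin_echo(**csf, **t1w)
wm_t2w, csf_t2w = spin_echo(**white_matter, **t2w), spin_echo(**csf, **t2w)
wm_flair = inversion_recovery(**white_matter, **flair)
csf_flair = inversion_recovery(**csf, **flair)
```

Under these assumptions white matter is brighter than CSF in the T1-weighted setting, CSF is brighter in the T2-weighted setting, and the FLAIR inversion time nulls CSF. Because the contrast depends only on the parameter maps and the chosen timings, new contrasts can in principle be computed retrospectively by changing the timings.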