Abstract:We present LatentSync, an end-to-end lip sync framework based on audio conditioned latent diffusion models without any intermediate motion representation, diverging from previous diffusion-based lip sync methods based on pixel space diffusion or two-stage generation. Our framework can leverage the powerful capabilities of Stable Diffusion to directly model complex audio-visual correlations. Additionally, we found that the diffusion-based lip sync methods exhibit inferior temporal consistency due to the inconsistency in the diffusion process across different frames. We propose Temporal REPresentation Alignment (TREPA) to enhance temporal consistency while preserving lip-sync accuracy. TREPA uses temporal representations extracted by large-scale self-supervised video models to align the generated frames with the ground truth frames. Furthermore, we observe the commonly encountered SyncNet convergence issue and conduct comprehensive empirical studies, identifying key factors affecting SyncNet convergence in terms of model architecture, training hyperparameters, and data preprocessing methods. We significantly improve the accuracy of SyncNet from 91% to 94% on the HDTF test set. Since we did not change the overall training framework of SyncNet, our experience can also be applied to other lip sync and audio-driven portrait animation methods that utilize SyncNet. Based on the above innovations, our method outperforms state-of-the-art lip sync methods across various metrics on the HDTF and VoxCeleb2 datasets.
Abstract:Human pose estimation in complicated situations has always been a challenging task. Many Transformer-based pose networks have been proposed recently, achieving encouraging progress in improving performance. However, the remarkable performance of pose networks is always accompanied by heavy computation costs and large network scale. In order to deal with this problem, this paper proposes a High-Efficiency Vision Transformer for Human Pose Estimation (HEViTPose). In HEViTPose, a Cascaded Group Spatial Reduction Multi-Head Attention Module (CGSR-MHA) is proposed, which reduces the computational cost through feature grouping and spatial degradation mechanisms, while preserving feature diversity through multiple low-dimensional attention heads. Moreover, a concept of Patch Embedded Overlap Width (PEOW) is defined to help understand the relationship between the amount of overlap and local continuity. By optimising PEOW, our model gains improvements in performance, parameters and GFLOPs. Comprehensive experiments on two benchmark datasets (MPII and COCO) demonstrate that the small and large HEViTPose models are on par with state-of-the-art models while being more lightweight. Specifically, HEViTPose-B achieves 90.7 PCK@0.5 on the MPII test set and 72.6 AP on the COCO test-dev2017 set. Compared with HRNet-W32 and Swin-S, our HEViTPose-B significantly reducing Params ($\downarrow$62.1%,$\downarrow$80.4%,) and GFLOPs ($\downarrow$43.4%,$\downarrow$63.8%,). Code and models are available at \url{here}.
Abstract:Three-dimensional (3D) object reconstruction based on differentiable rendering (DR) is an active research topic in computer vision. DR-based methods minimize the difference between the rendered and target images by optimizing both the shape and appearance and realizing a high visual reproductivity. However, most approaches perform poorly for textureless objects because of the geometrical ambiguity, which means that multiple shapes can have the same rendered result in such objects. To overcome this problem, we introduce active sensing with structured light (SL) into multi-view 3D object reconstruction based on DR to learn the unknown geometry and appearance of arbitrary scenes and camera poses. More specifically, our framework leverages the correspondences between pixels in different views calculated by structured light as an additional constraint in the DR-based optimization of implicit surface, color representations, and camera poses. Because camera poses can be optimized simultaneously, our method realizes high reconstruction accuracy in the textureless region and reduces efforts for camera pose calibration, which is required for conventional SL-based methods. Experiment results on both synthetic and real data demonstrate that our system outperforms conventional DR- and SL-based methods in a high-quality surface reconstruction, particularly for challenging objects with textureless or shiny surfaces.
Abstract:Depth reconstruction and hyperspectral reflectance reconstruction are two active research topics in computer vision and image processing. Conventionally, these two topics have been studied separately using independent imaging setups and there is no existing method which can acquire depth and spectral reflectance simultaneously in one shot without using special hardware. In this paper, we propose a novel single-shot hyperspectral-depth reconstruction method using an off-the-shelf RGB camera and projector. Our method is based on a single color-dot projection, which simultaneously acts as structured light for depth reconstruction and spatially-varying color illuminations for hyperspectral reflectance reconstruction. To jointly reconstruct the depth and the hyperspectral reflectance from a single color-dot image, we propose a novel end-to-end network architecture that effectively incorporates a geometric color-dot pattern loss and a photometric hyperspectral reflectance loss. Through the experiments, we demonstrate that our hyperspectral-depth reconstruction method outperforms the combination of an existing state-of-the-art single-shot hyperspectral reflectance reconstruction method and depth reconstruction method.
Abstract:Long Document retrieval (DR) has always been a tremendous challenge for reading comprehension and information retrieval. The pre-training model has achieved good results in the retrieval stage and Ranking for long documents in recent years. However, there is still some crucial problem in long document ranking, such as data label noises, long document representations, negative data Unbalanced sampling, etc. To eliminate the noise of labeled data and to be able to sample the long documents in the search reasonably negatively, we propose the bag sampling method and the group-wise Localized Contrastive Estimation(LCE) method. We use the head middle tail passage for the long document to encode the long document, and in the retrieval, stage Use dense retrieval to generate the candidate's data. The retrieval data is divided into multiple bags at the ranking stage, and negative samples are selected in each bag. After sampling, two losses are combined. The first loss is LCE. To fit bag sampling well, after query and document are encoded, the global features of each group are extracted by convolutional layer and max-pooling to improve the model's resistance to the impact of labeling noise, finally, calculate the LCE group-wise loss. Notably, our model shows excellent performance on the MS MARCO Long document ranking leaderboard.
Abstract:Reconstructing an object's high-quality 3D shape with inherent spectral reflectance property, beyond typical device-dependent RGB albedos, opens the door to applications requiring a high-fidelity 3D model in terms of both geometry and photometry. In this paper, we propose a novel Multi-View Inverse Rendering (MVIR) method called Spectral MVIR for jointly reconstructing the 3D shape and the spectral reflectance for each point of object surfaces from multi-view images captured using a standard RGB camera and low-cost lighting equipment such as an LED bulb or an LED projector. Our main contributions are twofold: (i) We present a rendering model that considers both geometric and photometric principles in the image formation by explicitly considering camera spectral sensitivity, light's spectral power distribution, and light source positions. (ii) Based on the derived model, we build a cost-optimization MVIR framework for the joint reconstruction of the 3D shape and the per-vertex spectral reflectance while estimating the light source positions and the shadows. Different from most existing spectral-3D acquisition methods, our method does not require expensive special equipment and cumbersome geometric calibration. Experimental results using both synthetic and real-world data demonstrate that our Spectral MVIR can acquire a high-quality 3D model with accurate spectral reflectance property.
Abstract:In this paper, we propose a novel projector-camera system for practical and low-cost acquisition of a dense object 3D model with the spectral reflectance property. In our system, we use a standard RGB camera and leverage an off-the-shelf projector as active illumination for both the 3D reconstruction and the spectral reflectance estimation. We first reconstruct the 3D points while estimating the poses of the camera and the projector, which are alternately moved around the object, by combining multi-view structured light and structure-from-motion (SfM) techniques. We then exploit the projector for multispectral imaging and estimate the spectral reflectance of each 3D point based on a novel spectral reflectance estimation model considering the geometric relationship between the reconstructed 3D points and the estimated projector positions. Experimental results on several real objects demonstrate that our system can precisely acquire a dense 3D model with the full spectral reflectance property using off-the-shelf devices.