Abstract:Recent advancements in 3D Gaussian Splatting (3DGS) have unlocked significant potential for modeling 3D head avatars, providing greater flexibility than mesh-based methods and more efficient rendering than NeRF-based approaches. Despite these advancements, the creation of controllable 3DGS-based head avatars remains time-intensive, often requiring tens of minutes to hours. To expedite this process, we introduce the "Gaussian Déjà-vu" framework, which first obtains a generalized model of the head avatar and then personalizes the result. The generalized model is trained on large 2D (synthetic and real) image datasets. This model provides a well-initialized 3D Gaussian head that is further refined using a monocular video to achieve the personalized head avatar. For personalization, we propose learnable expression-aware rectification blendmaps to correct the initial 3D Gaussians, ensuring rapid convergence without reliance on neural networks. Experiments demonstrate that the proposed method meets its objectives: it outperforms state-of-the-art 3D Gaussian head avatars in photorealistic quality and reduces training time to at least a quarter of that of existing methods, producing the avatar in minutes.
Abstract:Current Pose-Guided Person Image Synthesis (PGPIS) methods depend heavily on large amounts of labeled triplet data to train the generator in a supervised manner. However, they often falter when applied to in-the-wild samples, primarily due to the distribution gap between the training datasets and real-world test samples. While some researchers aim to enhance model generalizability through sophisticated training procedures, advanced architectures, or more diverse datasets, we adopt the test-time fine-tuning paradigm to customize a pre-trained Text2Image (T2I) model. However, naively applying test-time tuning results in inconsistencies in facial identities and appearance attributes. To address this, we introduce a Visual Consistency Module (VCM), which enhances appearance consistency by combining the face, text, and image embeddings. Our approach, named OnePoseTrans, requires only a single source image to generate high-quality pose transfer results, offering greater stability than state-of-the-art data-driven methods. For each test case, OnePoseTrans customizes a model in around 48 seconds on an NVIDIA V100 GPU.
Abstract:3D models, particularly AI-generated ones, have seen a recent surge in use across various industries such as entertainment. Hence, there is a pressing need to protect the intellectual property of these valuable assets and prevent their misuse. As a viable solution to address these concerns, we rigorously define the novel task of automated 3D visible watermarking in terms of two competing aspects: watermark quality and asset utility. Moreover, we propose a method of embedding visible watermarks that automatically determines the right location, orientation, and number of watermarks to be placed on arbitrary 3D assets for high watermark quality and asset utility. Our method is based on a novel rigid-body optimization that uses back-propagation to automatically learn transforms for ideal watermark placement. In addition, we propose a novel curvature-matching method for fusing the watermark into the 3D model that further improves readability and security. Finally, we provide a detailed experimental analysis on two benchmark 3D datasets validating the superior performance of our approach in comparison to baselines. Code and demo are available.
Abstract:Representing source code in a generic input format is crucial for automating software engineering tasks, e.g., applying machine learning algorithms to extract information. Visualizing code representations can further enable human experts to gain intuitive insight into the code. Unfortunately, as of today, there is no universal tool that can simultaneously visualize different types of code representations. In this paper, we introduce a tool, CodeLens, which provides a visual interaction environment that supports various representation methods and helps developers understand and explore them. CodeLens is designed to support multiple programming languages, such as Java, Python, and JavaScript, and four types of code representations: sequence of tokens, abstract syntax tree (AST), data flow graph (DFG), and control flow graph (CFG). By using CodeLens, developers can quickly visualize a specific code representation and also obtain the represented inputs for models of code. The Web-based interface of CodeLens is available at http://www.codelens.org. The demonstration video can be found at http://www.codelens.org/demo.
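To make two of the representations named above concrete, the following is a minimal sketch that derives a token sequence and an AST from a Python snippet using only the standard library modules `tokenize` and `ast`. It does not use or reproduce the CodeLens interface; the helper names `token_sequence` and `ast_node_types` and the sample `SNIPPET` are hypothetical, chosen for illustration.

```python
import ast
import io
import tokenize

# Token types that carry layout information only, not code content.
_LAYOUT = {
    tokenize.NEWLINE, tokenize.NL, tokenize.INDENT,
    tokenize.DEDENT, tokenize.ENDMARKER, tokenize.COMMENT,
}


def token_sequence(source: str) -> list:
    """Return the lexical tokens of Python source, skipping layout tokens."""
    reader = io.StringIO(source).readline
    return [
        tok.string
        for tok in tokenize.generate_tokens(reader)
        if tok.type not in _LAYOUT
    ]


def ast_node_types(source: str) -> list:
    """Return the AST node type names in the order produced by ast.walk."""
    return [type(node).__name__ for node in ast.walk(ast.parse(source))]


SNIPPET = "def add(a, b):\n    return a + b\n"

tokens = token_sequence(SNIPPET)
nodes = ast_node_types(SNIPPET)
print(tokens)
print(nodes)
```

The token list is a flat view of the code, while the node list shows the nesting that an AST view would render as a tree. Data flow and control flow graphs need additional analysis and are not covered by this sketch.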
Abstract:Recent learning-based inpainting algorithms have achieved compelling results for completing missing regions after removing undesired objects in videos. To maintain temporal consistency among the frames, 3D spatial and temporal operations are often heavily used in the deep networks. However, these methods usually suffer from memory constraints and can only handle low-resolution videos. We propose STRA-Net, a novel spatial-temporal residual aggregation framework for high-resolution video inpainting. The key idea is to first learn and apply a spatial and temporal inpainting network on the downsampled low-resolution videos. Then, we refine the low-resolution results by aggregating the learned spatial and temporal image residuals (details) to the upsampled inpainted frames. Both the quantitative and qualitative evaluations show that we can produce more temporally coherent and visually appealing results than the state-of-the-art methods on inpainting high-resolution videos.
Abstract:Advances in deep neural networks have considerably improved the art of animating a still image without operating in the 3D domain. However, prior methods can only animate small images (typically no larger than 512x512) due to memory limitations, the difficulty of training, and the lack of high-resolution (HD) training datasets, which significantly reduces their potential for applications in movie production and interactive systems. Motivated by the idea that HD images can be generated by adding high-frequency residuals to low-resolution results produced by a neural network, we propose a novel framework known as Animating Through Warping (ATW) to enable efficient animation of HD images. Specifically, the proposed framework consists of two modules: a novel two-stage neural-network generator and a novel post-processing module, ResWarp. It only requires the generator to be trained on small images and can perform inference on an image of any size. During inference, an HD input image is decomposed into a low-resolution component (128x128) and its corresponding high-frequency residuals. The generator predicts the low-resolution result as well as the motion field that warps the input face to the desired state (e.g., expression categories or action units). Finally, the ResWarp module warps the residuals based on the motion field and adds the warped residuals to the naively up-sampled low-resolution results to generate the final HD results. Experiments show the effectiveness and efficiency of our method in generating high-resolution animations. Our proposed framework successfully animates a 4K facial image, which has never been achieved by prior neural models. In addition, our method generally guarantees the temporal coherency of the generated animations. Source code will be made publicly available.
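The decomposition of an HD image into a low-resolution component plus high-frequency residuals can be illustrated with a small sketch. The choices below are assumptions, not details from the abstract: 2x average pooling for down-sampling, nearest-neighbour up-sampling, and a tiny 4x4 grayscale image stored as nested lists. The generator and the warping step are omitted, so adding the residuals back recovers the input exactly.

```python
def downsample2x(img):
    """Average each non-overlapping 2x2 block (height and width must be even)."""
    return [
        [
            (img[r][c] + img[r][c + 1] + img[r + 1][c] + img[r + 1][c + 1]) / 4
            for c in range(0, len(img[0]), 2)
        ]
        for r in range(0, len(img), 2)
    ]


def upsample2x(img):
    """Nearest-neighbour up-sampling: repeat every pixel as a 2x2 block."""
    out = []
    for row in img:
        wide = [value for value in row for _ in range(2)]
        out.append(wide)
        out.append(list(wide))
    return out


def subtract(a, b):
    """Element-wise a - b for two images of equal size."""
    return [[x - y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def add(a, b):
    """Element-wise a + b for two images of equal size."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


# Hypothetical 4x4 "HD" image.
hd = [
    [10, 12, 50, 54],
    [14, 16, 58, 62],
    [0, 4, 8, 8],
    [4, 8, 8, 8],
]

low = downsample2x(hd)                     # low-resolution component
residual = subtract(hd, upsample2x(low))   # high-frequency residuals
restored = add(upsample2x(low), residual)  # naive up-sampling + residuals
```

In the framework described above, the low-resolution component would be replaced by the generator's output and the residuals would be warped by the predicted motion field before being added back; this sketch only shows why the residuals carry the detail lost by naive up-sampling.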
Abstract:Recently, data-driven image inpainting methods have made inspiring progress, impacting fundamental image editing tasks such as object removal and damaged image repair. These methods are more effective than classic approaches; however, due to memory limitations, they can only handle low-resolution inputs, typically smaller than 1K. Meanwhile, the resolution of photos captured with mobile devices has increased to as much as 8K. Naive up-sampling of the low-resolution inpainted result merely yields a large yet blurry result, whereas adding a high-frequency residual image onto the large blurry image can generate a sharp result, rich in details and textures. Motivated by this, we propose a Contextual Residual Aggregation (CRA) mechanism that can produce high-frequency residuals for missing contents by weighted aggregation of residuals from contextual patches, thus only requiring a low-resolution prediction from the network. Since the convolutional layers of the neural network only need to operate on low-resolution inputs and outputs, the cost of memory and computing power is well suppressed. Moreover, the need for high-resolution training datasets is alleviated. In our experiments, we train the proposed model on small images with a resolution of 512x512 and perform inference on high-resolution images, achieving compelling inpainting quality. Our model can inpaint images as large as 8K with considerable hole sizes, which is intractable with previous learning-based approaches. We further elaborate on the light-weight design of the network architecture, achieving real-time performance on 2K images on a GTX 1080 Ti GPU. Code is available at: Atlas200dk/sample-imageinpainting-HiFill.
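The weighted aggregation of residuals from contextual patches can be sketched as follows. The abstract does not say how the weights are obtained; this sketch assumes they come from a softmax over per-patch similarity scores, and the names `softmax`, `aggregate_residuals`, and the sample data are hypothetical. Patches are flattened to plain lists for brevity.

```python
import math


def softmax(scores):
    """Convert raw similarity scores into weights that sum to 1."""
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def aggregate_residuals(scores, context_residuals):
    """Fill a missing patch with a weighted sum of contextual residual patches.

    scores: one similarity score per contextual patch (assumed given).
    context_residuals: flattened residual patches of equal length.
    """
    weights = softmax(scores)
    size = len(context_residuals[0])
    return [
        sum(w * patch[i] for w, patch in zip(weights, context_residuals))
        for i in range(size)
    ]


# Two hypothetical contextual residual patches, flattened to length 2.
patches = [[2.0, 4.0], [4.0, 8.0]]

equal_mix = aggregate_residuals([0.0, 0.0], patches)    # equal weights
dominated = aggregate_residuals([100.0, 0.0], patches)  # first patch dominates
```

With equal scores the result is the plain average of the contextual residuals; with one score much larger than the rest, the missing patch effectively copies the residual of the best-matching contextual patch.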
Abstract:Collaborative filtering recommenders provide effective personalization services at the cost of sacrificing the privacy of their end users. Due to increasing societal concerns and stricter privacy regulations, it is an urgent research challenge to design privacy-preserving yet robust recommenders that offer recommendation services to privacy-aware users. Our analysis shows that existing solutions fall short in several aspects, including a lack of attention to the precise output delivered to end users and neglect of the correlated robustness issues. In this paper, we provide a general system structure for latent-factor-based collaborative filtering recommenders by formulating them into model training and prediction computing stages, and we also describe a new security model. Aiming at pragmatic solutions, we first show how to construct a privacy-preserving yet robust model training stage based on existing solutions. Then, we propose two cryptographic protocols to realize a privacy-preserving prediction computing stage, depending on whether or not an extra proxy is involved. In contrast to standard Top-k recommendations, we let the end user retrieve the unrated items whose predictions are above a threshold, as a result of our privacy-by-design strategy. Experimental results show that our new protocols are quite efficient.
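The threshold-based output described above can be sketched in plaintext, without any of the cryptographic protocols. The sketch assumes a latent factor model in which a prediction is the dot product of a user vector and an item vector; the function names, item identifiers, and factor values are hypothetical.

```python
def predict(user_vec, item_vec):
    """Latent factor prediction: dot product of user and item factors."""
    return sum(u * v for u, v in zip(user_vec, item_vec))


def above_threshold(user_vec, item_factors, rated, threshold):
    """Return unrated items whose predicted rating exceeds the threshold."""
    return sorted(
        item
        for item, vec in item_factors.items()
        if item not in rated and predict(user_vec, vec) > threshold
    )


# Hypothetical 2-dimensional factors.
user = [1.0, 2.0]
items = {
    "a": [1.0, 1.0],  # prediction 3.0
    "b": [0.5, 0.5],  # prediction 1.5
    "c": [2.0, 2.0],  # prediction 6.0, but already rated
    "d": [0.0, 1.0],  # prediction 2.0
}

strict = above_threshold(user, items, rated={"c"}, threshold=2.5)
loose = above_threshold(user, items, rated={"c"}, threshold=1.8)
```

Unlike a Top-k list, the size of the output varies with the threshold, and the result reveals only which unrated items clear it, not how the items rank against each other.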