Abstract:Diffusion models have shown impressive potential in talking head generation. While plausible appearance and talking effects are achieved, these methods still suffer from temporal, 3D, or expression inconsistency due to error accumulation and the inherent limitations of single-image generation. In this paper, we propose ConsistentAvatar, a novel framework for fully consistent and high-fidelity talking avatar generation. Instead of directly applying multi-modal conditions to the diffusion process, our method first learns to model the temporal representation for stability between adjacent frames. Specifically, we propose a Temporally-Sensitive Detail (TSD) map containing high-frequency features and contours that vary significantly along the time axis. Using a temporally consistent diffusion module, we learn to align the TSD of the initial result with that of the ground-truth video frame. The final avatar is generated by a fully consistent diffusion module conditioned on the aligned TSD, a rough head normal map, and an emotion prompt embedding. We find that the aligned TSD, which represents the temporal patterns, constrains the diffusion process to generate temporally stable talking heads. Furthermore, its reliable guidance complements the inaccuracy of other conditions, suppressing accumulated error while improving consistency across various aspects. Extensive experiments demonstrate that ConsistentAvatar outperforms state-of-the-art methods in generated appearance, 3D, expression, and temporal consistency. Project page: https://njust-yang.github.io/ConsistentAvatar.github.io/
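The abstract does not define the TSD map precisely; as a rough illustration only, the sketch below extracts a simple high-frequency/contour map from a frame with a Laplacian filter, one plausible stand-in for the "high-frequency features and contours" mentioned above. The function name and filter choice are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def detail_map(frame: torch.Tensor) -> torch.Tensor:
    """frame: (B, 3, H, W) in [0, 1]. Returns a (B, 1, H, W) high-frequency map."""
    gray = frame.mean(dim=1, keepdim=True)                       # luminance proxy
    lap = torch.tensor([[0., 1., 0.], [1., -4., 1.], [0., 1., 0.]], device=frame.device)
    lap = lap.view(1, 1, 3, 3)
    high_freq = F.conv2d(gray, lap, padding=1).abs()             # Laplacian response
    return high_freq / (high_freq.amax(dim=(-2, -1), keepdim=True) + 1e-6)

x = torch.rand(2, 3, 256, 256)
print(detail_map(x).shape)  # torch.Size([2, 1, 256, 256])
```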
Abstract:The complex traffic environment and various weather conditions make the collection of LiDAR data expensive and challenging. High-quality and controllable LiDAR data generation is therefore urgently needed; controlling generation with text is a common practice, yet there is little research in this field. To this end, we propose Text2LiDAR, the first efficient, diverse, and text-controllable LiDAR data generation model. Specifically, we design an equirectangular transformer architecture, utilizing the proposed equirectangular attention to capture LiDAR features in a manner suited to the data characteristics. Then, we design a control-signal embedding injector to efficiently integrate control signals through a global-to-focused attention mechanism. Additionally, we devise a frequency modulator to assist the model in recovering high-frequency details, ensuring the clarity of the generated point cloud. To foster development in the field and optimize text-controlled generation performance, we construct nuLiDARtext, which offers diverse text descriptors for 34,149 LiDAR point clouds from 850 scenes. Experiments on uncontrolled and text-controlled generation in various forms on the KITTI-360 and nuScenes datasets demonstrate the superiority of our approach.
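For context, an equirectangular representation of LiDAR data is typically obtained by spherically projecting the point cloud onto a 2D range image; the sketch below shows such a projection. The resolution, field-of-view values, and function name are illustrative assumptions, not the paper's settings.

```python
import numpy as np

def to_range_image(points: np.ndarray, h: int = 32, w: int = 1024,
                   fov_up: float = 10.0, fov_down: float = -30.0) -> np.ndarray:
    """points: (N, 3) xyz. Returns an (h, w) equirectangular range image in meters (0 = empty)."""
    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    r = np.linalg.norm(points, axis=1) + 1e-8
    yaw = np.arctan2(y, x)                        # azimuth in [-pi, pi]
    pitch = np.arcsin(z / r)                      # elevation
    u = 0.5 * (1.0 - yaw / np.pi) * w             # column index
    v = (np.radians(fov_up) - pitch) / np.radians(fov_up - fov_down) * h
    u = np.clip(np.floor(u), 0, w - 1).astype(int)
    v = np.clip(np.floor(v), 0, h - 1).astype(int)   # points outside the FOV are clamped
    img = np.zeros((h, w), dtype=np.float32)
    img[v, u] = r                                 # last point wins per pixel
    return img

pts = np.random.randn(2048, 3) * 10.0
print(to_range_image(pts).shape)  # (32, 1024)
```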
Abstract:Recent advances in low-light image enhancement have been dominated by Retinex-based learning frameworks leveraging convolutional neural networks (CNNs) and Transformers. However, the vanilla Retinex theory primarily addresses global illumination degradation and neglects local issues such as noise and blur in dark conditions. Moreover, CNNs and Transformers struggle to capture global degradation due to their limited receptive fields. While state space models (SSMs) have shown promise in long-sequence modeling, they face challenges in combining local invariants and global context in visual data. In this paper, we introduce MambaLLIE, an implicit Retinex-aware low-light enhancer featuring a global-then-local state space design. We first propose a Local-Enhanced State Space Module (LESSM) that incorporates an augmented local bias within a 2D selective scan mechanism, enhancing the original SSMs by preserving local 2D dependency. Additionally, an Implicit Retinex-aware Selective Kernel module (IRSK) dynamically selects features using spatially-varying operations, adapting to varying inputs through an adaptive kernel selection process. Our Global-then-Local State Space Block (GLSSB) integrates LESSM and IRSK with LayerNorm as its core. This design enables MambaLLIE to achieve comprehensive global long-range modeling and flexible local feature aggregation. Extensive experiments demonstrate that MambaLLIE significantly outperforms state-of-the-art CNN- and Transformer-based methods. Project Page: https://mamballie.github.io/anon/
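The IRSK module is not specified in the abstract; as a hedged illustration of the general adaptive kernel selection idea, the following minimal PyTorch sketch fuses two convolution branches of different kernel sizes with softmax weights predicted from pooled features. This is a standard selective-kernel pattern, not the paper's exact design.

```python
import torch
import torch.nn as nn

class SelectiveKernel(nn.Module):
    """Minimal selective-kernel sketch: two branches with different receptive
    fields are fused with softmax weights predicted from pooled features."""
    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        self.branch3 = nn.Conv2d(channels, channels, 3, padding=1)
        self.branch5 = nn.Conv2d(channels, channels, 5, padding=2)
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, 2 * channels))

    def forward(self, x):
        b, c, _, _ = x.shape
        feats = torch.stack([self.branch3(x), self.branch5(x)], dim=1)  # (B, 2, C, H, W)
        attn = self.fc(feats.sum(dim=1).mean(dim=(-2, -1)))             # (B, 2C)
        attn = attn.view(b, 2, c, 1, 1).softmax(dim=1)                  # per-branch weights
        return (feats * attn).sum(dim=1)

print(SelectiveKernel(16)(torch.rand(1, 16, 32, 32)).shape)  # torch.Size([1, 16, 32, 32])
```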
Abstract:Real driving-video dehazing poses a significant challenge due to the inherent difficulty of acquiring precisely aligned hazy/clear video pairs for effective model training, especially in dynamic driving scenarios with unpredictable weather conditions. In this paper, we propose a pioneering approach that addresses this challenge through a non-aligned regularization strategy. Our core idea is to identify clear frames that closely match hazy frames and use them as references to supervise a video dehazing network. Our approach comprises two key components: reference matching and video dehazing. First, we introduce a non-aligned reference frame matching module that leverages an adaptive sliding window to match high-quality reference frames from clear videos. The video dehazing component incorporates a flow-guided cosine attention sampler and a deformable cosine attention fusion module to enhance spatial multi-frame alignment and fuse the aligned information. To validate our approach, we collect GoProHazy, a dataset captured effortlessly with GoPro cameras in diverse rural and urban road environments. Extensive experiments demonstrate the superiority of the proposed method over current state-of-the-art methods on the challenging task of real driving-video dehazing. Project page.
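As a rough sketch of the reference matching step, the snippet below picks, from a sliding window of clear-frame features, the one most cosine-similar to a hazy-frame feature; the feature dimension and function name are illustrative assumptions rather than the paper's implementation.

```python
import torch
import torch.nn.functional as F

def match_reference(hazy_feat: torch.Tensor, clear_feats: torch.Tensor) -> int:
    """Pick the clear frame whose global feature is most cosine-similar to the hazy frame.
    hazy_feat: (C,), clear_feats: (K, C) for K candidate frames in the sliding window."""
    sims = F.cosine_similarity(hazy_feat.unsqueeze(0), clear_feats, dim=1)  # (K,)
    return int(sims.argmax())

hazy = torch.randn(256)
candidates = torch.randn(8, 256)
print(match_reference(hazy, candidates))  # index of the best-matching reference frame
```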
Abstract:Removing haze from real-world images is challenging due to unpredictable weather conditions, which result in misaligned hazy and clear image pairs. In this paper, we propose a non-aligned supervision framework that consists of three networks: dehazing, airlight, and transmission. In particular, we explore a non-aligned setting by utilizing a clear reference image that is not aligned with the hazy input image to supervise the dehazing network through a multi-scale reference loss that compares the features of the two images. Our setting makes it easier to collect hazy/clear image pairs in real-world environments, even under misalignment and shifted views. To demonstrate this, we have created a new hazy dataset called "Phone-Hazy", which was captured using mobile phones in both rural and urban areas. Additionally, we present a mean-and-variance self-attention network to model the infinite airlight using the dark channel prior as position guidance, and employ a channel attention network to estimate the three-channel transmission. Experimental results show that our framework outperforms current state-of-the-art methods in real-world image dehazing. Phone-Hazy and code will be available at https://github.com/hello2377/NSDNet.
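The exact form of the multi-scale reference loss is not given in the abstract; the following hedged sketch compares channel-wise statistics of the dehazed output and a non-aligned clear reference at several scales, which is one simple way to supervise without pixel-wise alignment. It is an assumption for illustration, not the paper's loss.

```python
import torch
import torch.nn.functional as F

def multiscale_reference_loss(pred: torch.Tensor, ref: torch.Tensor,
                              scales=(1, 2, 4)) -> torch.Tensor:
    """Compare the dehazed output and a non-aligned clear reference through pooled
    statistics at several scales, so pixel-wise alignment is not required.
    pred, ref: (B, C, H, W)."""
    loss = pred.new_zeros(())
    for s in scales:
        p = F.avg_pool2d(pred, kernel_size=s) if s > 1 else pred
        r = F.avg_pool2d(ref, kernel_size=s) if s > 1 else ref
        # match channel-wise mean and std instead of per-pixel values
        loss = loss + F.l1_loss(p.mean(dim=(-2, -1)), r.mean(dim=(-2, -1)))
        loss = loss + F.l1_loss(p.std(dim=(-2, -1)), r.std(dim=(-2, -1)))
    return loss / len(scales)

print(multiscale_reference_loss(torch.rand(2, 3, 64, 64), torch.rand(2, 3, 64, 64)))
```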
Abstract:Real depth super-resolution (DSR), unlike synthetic settings, is a challenging task due to the structural distortion and edge noise caused by natural degradation in real-world low-resolution (LR) depth maps. These defects result in significant structure inconsistency between the depth map and the RGB guidance, which can confuse the RGB-structure guidance and thereby degrade the DSR quality. In this paper, we propose a novel structure flow-guided DSR framework, in which a cross-modality flow map is learned to guide the transfer of RGB-structure information for precise depth upsampling. Specifically, our framework consists of a cross-modality flow-guided upsampling network (CFUNet) and a flow-enhanced pyramid edge attention network (PEANet). CFUNet contains a trilateral self-attention module that combines geometric and semantic correlations for reliable cross-modality flow learning. The learned flow maps are then combined with a grid-sampling mechanism for coarse high-resolution (HR) depth prediction. PEANet aims to integrate the learned flow map as edge attention into a pyramid network to hierarchically learn the edge-focused guidance feature for depth edge refinement. Extensive experiments on real and synthetic DSR datasets verify that our approach achieves excellent performance compared to state-of-the-art methods.
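The grid-sampling step mentioned above can be illustrated with a standard flow-guided warp built on torch.nn.functional.grid_sample; the sketch below is a generic implementation of that mechanism, not CFUNet itself.

```python
import torch
import torch.nn.functional as F

def flow_warp(feat: torch.Tensor, flow: torch.Tensor) -> torch.Tensor:
    """Warp a feature map by a per-pixel flow using grid sampling.
    feat: (B, C, H, W), flow: (B, 2, H, W) in pixel offsets (x, y)."""
    b, _, h, w = feat.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    base = torch.stack((xs, ys), dim=0).float().to(feat)           # (2, H, W) pixel coordinates
    coords = base.unsqueeze(0) + flow                               # absolute sampling locations
    # normalize to [-1, 1] as required by grid_sample
    coords_x = 2.0 * coords[:, 0] / max(w - 1, 1) - 1.0
    coords_y = 2.0 * coords[:, 1] / max(h - 1, 1) - 1.0
    grid = torch.stack((coords_x, coords_y), dim=-1)                # (B, H, W, 2)
    return F.grid_sample(feat, grid, align_corners=True)

rgb_feat = torch.rand(1, 16, 64, 64)
flow = torch.zeros(1, 2, 64, 64)                                    # zero flow = identity warp
print(torch.allclose(flow_warp(rgb_feat, flow), rgb_feat, atol=1e-5))  # True
```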
Abstract:Image guidance is an effective strategy for depth super-resolution. Generally, most existing methods employ hand-crafted operators to decompose the high-frequency (HF) and low-frequency (LF) ingredients of low-resolution depth maps and guide the HF ingredients by directly concatenating them with image features. However, hand-designed operators usually produce inferior HF maps (e.g., distorted or structurally missing) due to the diverse appearance of complex depth maps. Moreover, the direct concatenation often results in weak guidance because not all image features have a positive effect on the HF maps. In this paper, we develop a recurrent structure attention guided (RSAG) framework consisting of two important parts. First, we introduce a deep contrastive network with multi-scale filters for adaptive frequency-domain separation, which applies contrastive networks from large filters to small ones to calculate pixel contrasts for adaptive, high-quality HF predictions. Second, instead of coarse concatenation guidance, we propose a recurrent structure attention block, which iteratively utilizes the latest depth estimate and the image features to jointly select clear patterns and boundaries, aiming to provide refined guidance for accurate depth recovery. In addition, we fuse the features of the HF maps to enhance the edge structures in the decomposed LF maps. Extensive experiments show that our approach obtains superior performance compared with state-of-the-art depth super-resolution methods.
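As a hedged illustration of the pixel-contrast-at-multiple-filter-scales idea, the sketch below computes the residual between a depth map and its local mean at several window sizes; the learned contrastive networks in RSAG are more elaborate, so treat this only as a conceptual stand-in.

```python
import torch
import torch.nn.functional as F

def multiscale_contrast(depth: torch.Tensor, sizes=(3, 5, 9)) -> torch.Tensor:
    """Pixel contrast at several filter scales: the difference between each pixel
    and its local mean, stacked along the channel axis.
    depth: (B, 1, H, W) -> (B, len(sizes), H, W)."""
    maps = []
    for k in sizes:
        local_mean = F.avg_pool2d(depth, kernel_size=k, stride=1, padding=k // 2)
        maps.append(depth - local_mean)          # high-frequency residual at scale k
    return torch.cat(maps, dim=1)

print(multiscale_contrast(torch.rand(1, 1, 64, 64)).shape)  # torch.Size([1, 3, 64, 64])
```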
Abstract:Unsupervised domain adaptation for point cloud semantic segmentation has attracted great attention due to its effectiveness in learning from unlabeled data. Most existing methods use global-level feature alignment to transfer knowledge from the source domain to the target domain, which may cause semantic ambiguity in the feature space. In this paper, we propose a graph-based framework to explore local-level feature alignment between the two domains, which preserves semantic discrimination during adaptation. Specifically, to extract local-level features, we first dynamically construct local feature graphs on both domains and build a memory bank with graphs from the source domain. In particular, we use optimal transport to generate graph matching pairs. Then, based on the assignment matrix, we align the feature distributions between the two domains with a graph-based local feature loss. Furthermore, we consider the correlation between features of different categories and formulate a category-guided contrastive loss to guide the segmentation model to learn discriminative features on the target domain. Extensive experiments on different synthetic-to-real and real-to-real domain adaptation scenarios demonstrate that our method achieves state-of-the-art performance.
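The assignment matrix mentioned above can be obtained with entropic optimal transport; the following minimal Sinkhorn sketch matches source memory-bank graph features to target graph features. Feature sizes, the cost choice, and the regularization strength are illustrative assumptions, not the paper's settings.

```python
import torch

def sinkhorn(cost: torch.Tensor, eps: float = 0.05, iters: int = 50) -> torch.Tensor:
    """Entropic OT: return a soft assignment matrix for an (N, M) cost matrix
    with uniform marginals, computed by Sinkhorn iterations."""
    n, m = cost.shape
    a, b = torch.full((n,), 1.0 / n), torch.full((m,), 1.0 / m)
    K = torch.exp(-cost / eps)
    u = torch.ones(n)
    for _ in range(iters):
        v = b / (K.t() @ u + 1e-9)
        u = a / (K @ v + 1e-9)
    return u.unsqueeze(1) * K * v.unsqueeze(0)   # transport plan diag(u) K diag(v)

src = torch.randn(16, 64)                 # e.g. source local-graph features from the memory bank
tgt = torch.randn(20, 64)                 # target local-graph features
cost = torch.cdist(src, tgt)
cost = cost / cost.max()                  # normalize costs for numerical stability
T = sinkhorn(cost)                        # (16, 20) soft assignment between graphs
print(T.shape, float(T.sum()))            # torch.Size([16, 20]) ~1.0
```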
Abstract:Contrastive learning has shown great promise in the field of graph representation learning. By manually constructing positive/negative samples, most graph contrastive learning methods rely on a vector inner-product-based similarity metric to distinguish samples for graph representation. However, handcrafted sample construction (e.g., perturbing the nodes or edges of the graph) may not effectively capture the intrinsic local structures of the graph. Also, the vector inner-product-based similarity metric cannot fully exploit the local structures of the graph to characterize graph differences well. To this end, we propose a novel adaptive subgraph generation based contrastive learning framework for efficient and robust self-supervised graph representation learning, in which the optimal transport distance is utilized as the similarity metric between subgraphs. It aims to generate contrastive samples by capturing the intrinsic structures of the graph and to distinguish the samples based on the features and structures of subgraphs simultaneously. Specifically, for each center node, we first develop a network that adaptively learns relation weights to the nodes of the corresponding neighborhood to generate an interpolated subgraph. We then construct positive and negative pairs of subgraphs from the same and different nodes, respectively. Finally, we employ two types of optimal transport distances (i.e., the Wasserstein distance and the Gromov-Wasserstein distance) to construct the structured contrastive loss. Extensive node classification experiments on benchmark datasets verify the effectiveness of our graph contrastive learning method.
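As a hedged example of the two OT distances named above, the snippet below uses the POT (Python Optimal Transport) library, assuming it is installed, to compute a Wasserstein distance on node features and a Gromov-Wasserstein distance on intra-subgraph structure for two random placeholder subgraphs; it is not the paper's structured contrastive loss.

```python
import numpy as np
import ot  # POT: Python Optimal Transport

def subgraph_distances(feat_a, feat_b, adj_a, adj_b):
    """Wasserstein distance on node features and Gromov-Wasserstein distance on
    intra-subgraph structure, both with uniform node weights."""
    p = np.full(len(feat_a), 1.0 / len(feat_a))
    q = np.full(len(feat_b), 1.0 / len(feat_b))
    M = ot.dist(feat_a, feat_b)                       # pairwise squared Euclidean costs
    w_dist = ot.emd2(p, q, M)                         # exact Wasserstein (EMD) loss
    gw_dist = ot.gromov.gromov_wasserstein2(adj_a, adj_b, p, q, loss_fun="square_loss")
    return w_dist, gw_dist

fa, fb = np.random.rand(6, 16), np.random.rand(8, 16)   # placeholder node features
aa = np.random.rand(6, 6); aa = (aa + aa.T) / 2          # placeholder structure matrices
ab = np.random.rand(8, 8); ab = (ab + ab.T) / 2
print(subgraph_distances(fa, fb, aa, ab))
```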
Abstract:Face representation in the wild is extremely difficult due to large-scale face variations. To this end, deep convolutional neural networks (CNNs) have been developed to learn discriminative features with properly designed margin-based losses, which perform well on easy samples but fail on hard samples. Building on this, some methods mainly adjust the weights of hard samples during training to improve feature discrimination. However, these methods overlook the feature distribution property, which may lead to better results since misclassified hard samples may be corrected by using a distribution metric. This paper proposes a hard-samples-guided optimal transport (OT) loss for deep face representation, OTFace for short. OTFace aims to enhance the performance on hard samples by introducing a feature distribution discrepancy while maintaining the performance on easy samples. Specifically, we adopt a triplet scheme to indicate hard sample groups in each mini-batch during training. OT is then used to characterize the distribution differences of features from the high-level convolutional layer. Finally, we integrate the margin-based softmax (e.g., ArcFace or AM-Softmax) and OT to guide deep CNN learning. Extensive experiments are conducted on several benchmark databases. The quantitative results demonstrate the advantages of the proposed OTFace over state-of-the-art methods.
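For reference, the margin-based softmax component mentioned above can be sketched as a minimal ArcFace-style loss; the OT term that compares feature distributions of hard-sample groups is omitted, and the scale/margin hyperparameters are the commonly used defaults rather than the paper's.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ArcFaceLoss(nn.Module):
    """Minimal ArcFace-style margin softmax: add an angular margin m to the
    target-class angle before scaling and applying cross-entropy."""
    def __init__(self, feat_dim: int, num_classes: int, s: float = 64.0, m: float = 0.5):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(num_classes, feat_dim))
        self.s, self.m = s, m

    def forward(self, feats: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
        cos = F.linear(F.normalize(feats), F.normalize(self.weight)).clamp(-1 + 1e-7, 1 - 1e-7)
        theta = torch.acos(cos)                                   # angles to all class centers
        target = F.one_hot(labels, cos.size(1)).bool()
        logits = torch.where(target, torch.cos(theta + self.m), cos) * self.s
        return F.cross_entropy(logits, labels)

loss_fn = ArcFaceLoss(feat_dim=128, num_classes=10)
print(loss_fn(torch.randn(4, 128), torch.randint(0, 10, (4,))))
```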