Abstract:Test-time Adaptation (TTA) aims to improve model performance when the model encounters domain changes after deployment. The standard TTA mainly considers the case where the target domain is static, while the continual TTA needs to undergo a sequence of domain changes. This encounters a significant challenge as the model needs to adapt for the long-term and is unaware of when the domain changes occur. The quality of pseudo-labels is hard to guarantee. Noisy pseudo-labels produced by simple self-training methods can cause error accumulation and catastrophic forgetting. In this work, we propose a new framework named SPARNet which consists of two parts, sample partitioning strategy and anti-forgetting regularization. The sample partition strategy divides samples into two groups, namely reliable samples and unreliable samples. According to the characteristics of each group of samples, we choose different strategies to deal with different groups of samples. This ensures that reliable samples contribute more to the model. At the same time, the negative impacts of unreliable samples are eliminated by the mean teacher's consistency learning. Finally, we introduce a regularization term to alleviate the catastrophic forgetting problem, which can limit important parameters from excessive changes. This term enables long-term adaptation of parameters in the network. The effectiveness of our method is demonstrated in continual TTA scenario by conducting a large number of experiments on CIFAR10-C, CIFAR100-C and ImageNet-C.
Abstract:Lidars and cameras play essential roles in autonomous driving, offering complementary information for 3D detection. The state-of-the-art fusion methods integrate them at the feature level, but they mostly rely on the learned soft association between point clouds and images, which lacks interpretability and neglects the hard association between them. In this paper, we combine feature-level fusion with point-level fusion, using hard association established by the calibration matrices to guide the generation of object queries. Specifically, in the early fusion stage, we use the 2D CNN features of images to decorate the point cloud data, and employ two independent sparse convolutions to extract the decorated point cloud features. In the mid-level fusion stage, we initialize the queries with a center heatmap and embed the predicted class labels as auxiliary information into the queries, making the initial positions closer to the actual centers of the targets. Extensive experiments conducted on two popular datasets, i.e. KITTI, Waymo, demonstrate the superiority of DecoratingFusion.
Abstract:Image manipulation detection is to identify the authenticity of each pixel in images. One typical approach to uncover manipulation traces is to model image correlations. The previous methods commonly adopt the grids, which are fixed-size squares, as graph nodes to model correlations. However, these grids, being independent of image content, struggle to retain local content coherence, resulting in imprecise detection. To address this issue, we describe a new method named Hierarchical Region-aware Graph Reasoning (HRGR) to enhance image manipulation detection. Unlike existing grid-based methods, we model image correlations based on content-coherence feature regions with irregular shapes, generated by a novel Differentiable Feature Partition strategy. Then we construct a Hierarchical Region-aware Graph based on these regions within and across different feature layers. Subsequently, we describe a structural-agnostic graph reasoning strategy tailored for our graph to enhance the representation of nodes. Our method is fully differentiable and can seamlessly integrate into mainstream networks in an end-to-end manner, without requiring additional supervision. Extensive experiments demonstrate the effectiveness of our method in image manipulation detection, exhibiting its great potential as a plug-and-play component for existing architectures.
Abstract:The Large Visual Language Models (LVLMs) enhances user interaction and enriches user experience by integrating visual modality on the basis of the Large Language Models (LLMs). It has demonstrated their powerful information processing and generation capabilities. However, the existence of hallucinations has limited the potential and practical effectiveness of LVLM in various fields. Although lots of work has been devoted to the issue of hallucination mitigation and correction, there are few reviews to summary this issue. In this survey, we first introduce the background of LVLMs and hallucinations. Then, the structure of LVLMs and main causes of hallucination generation are introduced. Further, we summary recent works on hallucination correction and mitigation. In addition, the available hallucination evaluation benchmarks for LVLMs are presented from judgmental and generative perspectives. Finally, we suggest some future research directions to enhance the dependability and utility of LVLMs.
Abstract:Image recognition is an essential baseline for deep metric learning. Hierarchical knowledge about image classes depicts inter-class similarities or dissimilarities. Effective fusion of hierarchical knowledge about image classes to enhance image recognition remains a challenging topic to advance. In this paper, we propose a novel deep metric learning based method to effectively fuse hierarchical prior knowledge about image classes and enhance image recognition performances in an end-to-end supervised regression manner. Existing deep metric learning incorporated image classification mainly exploits qualitative relativity between image classes, i.e., whether sampled images are from the same class. A new triplet loss function term that exploits quantitative relativity and aligns distances in model latent space with those in knowledge space is also proposed and incorporated in the proposed dual-modality fusion method. Experimental results indicate that the proposed method enhanced image recognition performances and outperformed baseline and existing methods on CIFAR-10, CIFAR-100, Mini-ImageNet, and ImageNet-1K datasets.
Abstract:In this paper, we introduce D$^4$-VTON, an innovative solution for image-based virtual try-on. We address challenges from previous studies, such as semantic inconsistencies before and after garment warping, and reliance on static, annotation-driven clothing parsers. Additionally, we tackle the complexities in diffusion-based VTON models when handling simultaneous tasks like inpainting and denoising. Our approach utilizes two key technologies: Firstly, Dynamic Semantics Disentangling Modules (DSDMs) extract abstract semantic information from garments to create distinct local flows, improving precise garment warping in a self-discovered manner. Secondly, by integrating a Differential Information Tracking Path (DITP), we establish a novel diffusion-based VTON paradigm. This path captures differential information between incomplete try-on inputs and their complete versions, enabling the network to handle multiple degradations independently, thereby minimizing learning ambiguities and achieving realistic results with minimal overhead. Extensive experiments demonstrate that D$^4$-VTON significantly outperforms existing methods in both quantitative metrics and qualitative evaluations, demonstrating its capability in generating realistic images and ensuring semantic consistency.
Abstract:Computer-aided design (CAD) tools are increasingly popular in modern dental practice, particularly for treatment planning or comprehensive prognosis evaluation. In particular, the 2D panoramic X-ray image efficiently detects invisible caries, impacted teeth and supernumerary teeth in children, while the 3D dental cone beam computed tomography (CBCT) is widely used in orthodontics and endodontics due to its low radiation dose. However, there is no open-access 2D public dataset for children's teeth and no open 3D dental CBCT dataset, which limits the development of automatic algorithms for segmenting teeth and analyzing diseases. The Semi-supervised Teeth Segmentation (STS) Challenge, a pioneering event in tooth segmentation, was held as a part of the MICCAI 2023 ToothFairy Workshop on the Alibaba Tianchi platform. This challenge aims to investigate effective semi-supervised tooth segmentation algorithms to advance the field of dentistry. In this challenge, we provide two modalities including the 2D panoramic X-ray images and the 3D CBCT tooth volumes. In Task 1, the goal was to segment tooth regions in panoramic X-ray images of both adult and pediatric teeth. Task 2 involved segmenting tooth sections using CBCT volumes. Limited labelled images with mostly unlabelled ones were provided in this challenge prompt using semi-supervised algorithms for training. In the preliminary round, the challenge received registration and result submission by 434 teams, with 64 advancing to the final round. This paper summarizes the diverse methods employed by the top-ranking teams in the STS MICCAI 2023 Challenge.
Abstract:Esophageal cancer is one of the most common types of cancer worldwide and ranks sixth in cancer-related mortality. Accurate computer-assisted diagnosis of cancer progression can help physicians effectively customize personalized treatment plans. Currently, CT-based cancer diagnosis methods have received much attention for their comprehensive ability to examine patients' conditions. However, multi-modal based methods may likely introduce information redundancy, leading to underperformance. In addition, efficient and effective interactions between multi-modal representations need to be further explored, lacking insightful exploration of prognostic correlation in multi-modality features. In this work, we introduce a multi-modal heterogeneous graph-based conditional feature-guided diffusion model for lymph node metastasis diagnosis based on CT images as well as clinical measurements and radiomics data. To explore the intricate relationships between multi-modal features, we construct a heterogeneous graph. Following this, a conditional feature-guided diffusion approach is applied to eliminate information redundancy. Moreover, we propose a masked relational representation learning strategy, aiming to uncover the latent prognostic correlations and priorities of primary tumor and lymph node image representations. Various experimental results validate the effectiveness of our proposed method. The code is available at https://github.com/wuchengyu123/MMFusion.
Abstract:The task of video inpainting detection is to expose the pixel-level inpainted regions within a video sequence. Existing methods usually focus on leveraging spatial and temporal inconsistencies. However, these methods typically employ fixed operations to combine spatial and temporal clues, limiting their applicability in different scenarios. In this paper, we introduce a novel Multilateral Temporal-view Pyramid Transformer ({\em MumPy}) that collaborates spatial-temporal clues flexibly. Our method utilizes a newly designed multilateral temporal-view encoder to extract various collaborations of spatial-temporal clues and introduces a deformable window-based temporal-view interaction module to enhance the diversity of these collaborations. Subsequently, we develop a multi-pyramid decoder to aggregate the various types of features and generate detection maps. By adjusting the contribution strength of spatial and temporal clues, our method can effectively identify inpainted regions. We validate our method on existing datasets and also introduce a new challenging and large-scale Video Inpainting dataset based on the YouTube-VOS dataset, which employs several more recent inpainting methods. The results demonstrate the superiority of our method in both in-domain and cross-domain evaluation scenarios.
Abstract:AI-generated medical images are gaining growing popularity due to their potential to address the data scarcity challenge in the real world. However, the issue of accurate identification of these synthetic images, particularly when they exhibit remarkable realism with their real copies, remains a concern. To mitigate this challenge, image generators such as DALLE and Imagen, have integrated digital watermarks aimed at facilitating the discernment of synthetic images' authenticity. These watermarks are embedded within the image pixels and are invisible to the human eye while remains their detectability. Nevertheless, a comprehensive investigation into the potential impact of these invisible watermarks on the utility of synthetic medical images has been lacking. In this study, we propose the incorporation of invisible watermarks into synthetic medical images and seek to evaluate their efficacy in the context of downstream classification tasks. Our goal is to pave the way for discussions on the viability of such watermarks in boosting the detectability of synthetic medical images, fortifying ethical standards, and safeguarding against data pollution and potential scams.