Abstract:Knowledge distillation (KD) is a model compression technique that transfers knowledge from a large teacher model to a smaller student model to enhance the student's performance. Existing methods often assume that the student model is inherently inferior to the teacher model. However, we identify that the fundamental issue limiting student performance is the bias transferred by the teacher: current KD frameworks transmit both correct and incorrect knowledge, introducing bias that misleads the student model. To address this issue, we propose a novel strategy that rectifies bias and substantially improves the student model's performance. Our strategy involves three steps. First, we differentiate knowledge and design a bias elimination method that filters out biases, retaining only correct knowledge for the student model to learn. Next, we propose a bias rectification method that corrects the teacher model's wrong predictions, fundamentally addressing bias interference; the student model learns from both the correct knowledge and the rectified biases, substantially improving its prediction accuracy. Additionally, we introduce a dynamic learning approach with a loss function whose weights are updated dynamically, allowing the student model to first learn the easy tasks associated with correct knowledge and later tackle the hard tasks corresponding to biases, substantially improving its learning efficiency. To the best of our knowledge, this is the first strategy that enables the student model to surpass the teacher model. Experiments demonstrate that our strategy, as a plug-and-play module, is versatile across various mainstream KD frameworks. We will release our code after the paper is accepted.
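To make the idea concrete, below is a minimal PyTorch sketch of bias rectification plus dynamic weighting as we read the abstract: teacher predictions that disagree with the ground truth are rectified before distillation, and per-sample weights trade off easy (correctly-predicted) against hard (rectified) samples. The function names, the probability-swap rectification rule, and the weighting scheme are our own illustrative assumptions, not the authors' released method.

```python
import torch
import torch.nn.functional as F

def rectified_kd_loss(student_logits, teacher_logits, labels, T=4.0, hard_weight=0.5):
    """Illustrative sketch: rectify wrong teacher predictions, then distill.

    hard_weight in [0, 1] can be scheduled upward during training so the
    student first fits samples the teacher already gets right (easy tasks)
    and later focuses on the rectified, previously-biased samples.
    """
    teacher_prob = F.softmax(teacher_logits / T, dim=1)
    teacher_pred = teacher_prob.argmax(dim=1)
    wrong = teacher_pred != labels                      # biased (wrong) teacher knowledge

    # Assumed rectification rule: swap the probability mass of the wrongly
    # predicted class with that of the ground-truth class for biased samples.
    rectified = teacher_prob.clone()
    if wrong.any():
        idx = wrong.nonzero(as_tuple=True)[0]
        swapped = rectified[idx, teacher_pred[idx]].clone()
        rectified[idx, teacher_pred[idx]] = rectified[idx, labels[idx]]
        rectified[idx, labels[idx]] = swapped

    log_student = F.log_softmax(student_logits / T, dim=1)
    per_sample_kl = F.kl_div(log_student, rectified, reduction="none").sum(dim=1) * (T * T)

    # Dynamic weighting: easy samples (teacher correct) vs. hard samples (rectified).
    weights = torch.where(wrong, torch.full_like(per_sample_kl, hard_weight),
                          torch.full_like(per_sample_kl, 1.0 - hard_weight))
    return (weights * per_sample_kl).mean()
```

In practice one would anneal `hard_weight` from a small value toward a larger one over epochs so the curriculum moves from easy to hard samples.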
Abstract:The rapid development of photo-realistic face generation methods has raised significant concerns in society and academia, highlighting the urgent need for robust and generalizable face forgery detection (FFD) techniques. Although existing approaches mainly capture face forgery patterns from the image modality, other modalities such as fine-grained noise and text are not fully explored, which limits the generalization capability of the model. In addition, most FFD methods can identify facial images generated by GANs but struggle to detect unseen diffusion-synthesized ones. To address these limitations, we aim to leverage the cutting-edge foundation model, contrastive language-image pre-training (CLIP), to achieve generalizable diffusion face forgery detection (DFFD). In this paper, we propose a novel multi-modal fine-grained CLIP (MFCLIP) model, which mines comprehensive and fine-grained forgery traces across image and noise modalities via language-guided face forgery representation learning, to facilitate the advancement of DFFD. Specifically, we devise a fine-grained language encoder (FLE) that extracts fine-grained global language features from hierarchical text prompts. We design a multi-modal vision encoder (MVE) to capture global image forgery embeddings as well as fine-grained noise forgery patterns extracted from the richest patch, and integrate them to mine general visual forgery traces. Moreover, we build an innovative plug-and-play sample pair attention (SPA) method that emphasizes relevant negative pairs and suppresses irrelevant ones, allowing cross-modality sample pairs to be aligned more flexibly. Extensive experiments and visualizations show that our model outperforms the state of the art in settings such as cross-generator, cross-forgery, and cross-dataset evaluations.
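The sample pair attention idea can be illustrated with a CLIP-style contrastive loss in which negative pairs are re-weighted by their similarity, so relevant (hard) negatives are emphasized and irrelevant ones suppressed. This is a hedged sketch: the softmax-over-negatives weighting and all names below are our assumptions, not the MFCLIP formulation.

```python
import torch
import torch.nn.functional as F

def spa_contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """Sketch of a CLIP-style loss where negative pairs are re-weighted."""
    img_emb = F.normalize(img_emb, dim=1)
    txt_emb = F.normalize(txt_emb, dim=1)
    logits = img_emb @ txt_emb.t() / temperature        # (B, B) similarity matrix
    B = logits.size(0)
    eye = torch.eye(B, dtype=torch.bool, device=logits.device)

    # Attention over negative pairs: harder negatives receive larger weights.
    neg_weights = torch.softmax(logits.masked_fill(eye, float("-inf")), dim=1)
    weights = neg_weights * (B - 1)                      # keep the loss scale comparable
    weights = weights.masked_fill(eye, 1.0)              # positives keep unit weight

    # Fold the (stop-gradient) weights into the logits of a symmetric InfoNCE loss.
    weighted_logits = logits + weights.detach().log()
    targets = torch.arange(B, device=logits.device)
    return 0.5 * (F.cross_entropy(weighted_logits, targets) +
                  F.cross_entropy(weighted_logits.t(), targets))
```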
Abstract:Detecting cracks with pixel-level precision in key structures is a significant challenge, as existing methods struggle to effectively integrate the local textures and pixel dependencies of cracks. Furthermore, these methods often have numerous parameters and substantial computational requirements, complicating deployment on edge devices. In this paper, we propose a staircase cascaded fusion crack segmentation network (CrackSCF) that generates high-quality crack segmentation maps using minimal computational resources. We construct a staircase cascaded fusion module that effectively captures the local patterns of cracks and the long-range dependencies among pixels, and suppresses background noise well. To reduce the computational resources required by the model, we introduce a lightweight convolution block that replaces all convolution operations in the network, significantly reducing the required computation and parameters without affecting performance. To evaluate our method, we created a challenging benchmark dataset called TUT and conducted experiments on it and five other public datasets. The experimental results indicate that our method offers significant advantages over existing methods, especially in handling background noise interference and segmenting fine crack details. The F1 and mIoU scores on the TUT dataset are 0.8382 and 0.8473, respectively, achieving state-of-the-art (SOTA) performance while requiring the least computational resources. The code and dataset are available at https://github.com/Karl1109/CrackSCF.
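The abstract does not spell out the lightweight convolution block; a common way to achieve the stated goal (replace standard convolutions while cutting parameters and FLOPs) is a depthwise-separable convolution, sketched below purely as an assumption about the kind of block meant (the released CrackSCF code may use a different factorization).

```python
import torch.nn as nn

class LightweightConv(nn.Module):
    """Assumed depthwise-separable replacement for a standard k x k convolution.

    Parameter count drops from k*k*C_in*C_out to roughly k*k*C_in + C_in*C_out.
    """
    def __init__(self, c_in, c_out, kernel_size=3, stride=1):
        super().__init__()
        self.depthwise = nn.Conv2d(c_in, c_in, kernel_size, stride,
                                   padding=kernel_size // 2, groups=c_in, bias=False)
        self.pointwise = nn.Conv2d(c_in, c_out, kernel_size=1, bias=False)
        self.bn = nn.BatchNorm2d(c_out)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.act(self.bn(self.pointwise(self.depthwise(x))))
```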
Abstract:Time series classification with missing data is a prevalent issue in time series analysis, as temporal data often contain missing values in practical applications. The traditional two-stage approach, which handles imputation and classification separately, can result in sub-optimal performance because label information is not utilized in the imputation process. On the other hand, a one-stage approach can learn features under missing information, but its feature representation is limited because imputation errors propagate into the classification process. To overcome these challenges, this study proposes an end-to-end neural network that unifies data imputation and representation learning within a single framework, allowing the imputation process to take advantage of label information. Unlike previous methods, our approach places less emphasis on the accuracy of the imputed data and instead prioritizes classification performance. A specifically designed multi-scale feature learning module extracts useful information from the noisy imputed data. The proposed model is evaluated on 68 univariate time series datasets from the UCR archive, a multivariate time series dataset with various missing-data ratios, and 4 real-world datasets with missing information. The results indicate that the proposed model outperforms state-of-the-art approaches for incomplete time series classification, particularly in scenarios with high levels of missing data.
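A minimal sketch of the one-stage idea, under our own assumptions about module choices (a GRU imputer, a single-scale convolutional classifier standing in for the multi-scale module, and a small reconstruction weight so classification dominates):

```python
import torch
import torch.nn as nn

class JointImputeClassify(nn.Module):
    """Minimal sketch of one-stage imputation + classification (names assumed)."""
    def __init__(self, n_classes, hidden=64):
        super().__init__()
        self.imputer = nn.GRU(1, hidden, batch_first=True)
        self.impute_head = nn.Linear(hidden, 1)
        self.classifier = nn.Sequential(
            nn.Conv1d(1, hidden, kernel_size=7, padding=3), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1), nn.Flatten(), nn.Linear(hidden, n_classes))

    def forward(self, x, mask):                       # x, mask: (B, T), mask=1 where observed
        h, _ = self.imputer(x.unsqueeze(-1))
        x_hat = self.impute_head(h).squeeze(-1)
        x_filled = mask * x + (1 - mask) * x_hat      # keep observed values untouched
        return self.classifier(x_filled.unsqueeze(1)), x_hat

def joint_loss(logits, x_hat, x, mask, labels, lam=0.1):
    # Classification dominates (small lam); reconstruction only on observed points.
    ce = nn.functional.cross_entropy(logits, labels)
    rec = ((x_hat - x) * mask).pow(2).sum() / mask.sum().clamp(min=1)
    return ce + lam * rec
```

Because both terms are optimized jointly, the gradient of the classification loss flows through the imputed values, which is exactly how label information reaches the imputation process.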
Abstract:Remote sensing image-text retrieval constitutes a foundational aspect of remote sensing interpretation tasks, facilitating the alignment of vision and language representations. This paper introduces a prior instruction representation (PIR) learning paradigm that draws on prior knowledge to instruct adaptive learning of vision and text representations. Based on PIR, a domain-adapted remote sensing image-text retrieval framework, PIR-ITR, is designed to address semantic noise issues in vision-language understanding tasks. However, with massive additional data used to pre-train vision-language foundation models, remote sensing image-text retrieval has further developed into an open-domain retrieval task. Building on the above, we propose PIR-CLIP, a domain-specific CLIP-based framework for remote sensing image-text retrieval, to address semantic noise in remote sensing vision-language representations and further improve open-domain retrieval performance. For vision representation, Vision Instruction Representation (VIR), based on Spatial-PAE, exploits prior knowledge from remote sensing scene recognition by building a belief matrix that selects key features and reduces the impact of semantic noise. For text representation, Language Cycle Attention (LCA), based on Temporal-PAE, uses the previous time step to cyclically activate the current time step, enhancing text representation capability. A cluster-wise Affiliation Loss (AL) is proposed to constrain inter-class relations and reduce semantic confusion zones in the common subspace. Comprehensive experiments demonstrate that PIR enhances vision and text representations and outperforms state-of-the-art closed-domain and open-domain retrieval methods on two benchmark datasets, RSICD and RSITMD.
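The abstract does not give the Affiliation Loss equation; one plausible cluster-wise form, shown here only as an assumption, pulls each embedding toward its class centroid and pushes different centroids apart, which is the behavior described ("constrain inter-class relations", "reduce semantic confusion zones").

```python
import torch
import torch.nn.functional as F

def affiliation_loss(embeddings, labels, margin=0.5):
    """Illustrative cluster-wise loss (assumed form, not the paper's equation)."""
    emb = F.normalize(embeddings, dim=1)
    classes = labels.unique()
    centroids = torch.stack([emb[labels == c].mean(dim=0) for c in classes])
    centroids = F.normalize(centroids, dim=1)

    # Intra-class compactness: distance of each sample to its own centroid.
    idx = (labels.unsqueeze(1) == classes.unsqueeze(0)).int().argmax(dim=1)
    pull = (emb - centroids[idx]).pow(2).sum(dim=1).mean()

    # Inter-class separation: penalize centroid pairs closer than the margin.
    dist = torch.cdist(centroids, centroids)
    off_diag = ~torch.eye(len(classes), dtype=torch.bool, device=emb.device)
    push = F.relu(margin - dist[off_diag]).mean() if off_diag.any() else dist.new_zeros(())
    return pull + push
```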
Abstract:In this work, we leverage the success of denoising diffusion models and formulate feature refinement as an autoencoder-formed diffusion process. The state-of-the-art continuous sign language recognition (CSLR) framework consists of a spatial module, a visual module, a sequence module, and a sequence learning function. However, this framework suffers from sequence module overfitting caused by the objective function and the small scale of available benchmarks, resulting in insufficient model training. To overcome the overfitting problem, some CSLR studies enforce the sequence module to learn more visual temporal information or guide it with more informative supervision to refine its representations. In this work, we propose a novel autoencoder-formed conditional diffusion feature refinement~(ACDR) to refine the sequence representations so that they acquire the desired properties, by learning the encoding-decoding optimization process in an end-to-end way. Specifically, in ACDR, a noising Encoder progressively adds noise, equipped with semantic conditions, to the sequence representations, and a denoising Decoder progressively denoises the noisy sequence representations under the same semantic conditions. The sequence representations can therefore be imbued with the semantics of the provided conditions. Further, a semantic constraint is employed to prevent the denoised sequence representations from semantic corruption. Extensive experiments validate the effectiveness of our ACDR, which benefits state-of-the-art methods and achieves notable gains on three benchmarks.
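The following toy sketch illustrates the noising-Encoder / denoising-Decoder pattern over sequence features with a semantic condition; the step count, noise schedule, conditioning by concatenation, and module shapes are all our illustrative assumptions rather than the ACDR architecture.

```python
import torch
import torch.nn as nn

class ACDRSketch(nn.Module):
    """Toy autoencoder-formed conditional diffusion over sequence features."""
    def __init__(self, dim, steps=4):
        super().__init__()
        self.steps = steps
        self.register_buffer("betas", torch.linspace(1e-2, 2e-1, steps))
        self.denoiser = nn.Sequential(nn.Linear(dim * 2 + 1, dim * 2), nn.GELU(),
                                      nn.Linear(dim * 2, dim))

    def forward(self, seq_feat, sem_cond):            # both: (B, T, D)
        x, losses = seq_feat, []
        for t in range(self.steps):
            beta = self.betas[t]
            noise = torch.randn_like(x)
            x = (1 - beta).sqrt() * x + beta.sqrt() * noise        # noising Encoder step
            t_emb = torch.full_like(x[..., :1], float(t) / self.steps)
            pred = self.denoiser(torch.cat([x, sem_cond, t_emb], dim=-1))
            losses.append((pred - noise).pow(2).mean())            # denoising objective
            x = (x - beta.sqrt() * pred) / (1 - beta).sqrt()       # denoising Decoder step
        return x, torch.stack(losses).mean()
```

A semantic constraint in the spirit of the abstract could then be an auxiliary loss tying the denoised representations back to the semantic condition (e.g., a cosine or classification term), applied to the returned features.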
Abstract:Accurate and reliable small-ship detection in optical remote sensing images is crucial for maritime surveillance systems, but existing methods often struggle to balance detection performance and computational complexity. In this paper, we propose a novel lightweight framework called \textit{HSI-ShipDetectionNet} that is based on high-order spatial interactions and is suitable for deployment on resource-limited platforms, such as satellites and unmanned aerial vehicles. HSI-ShipDetectionNet includes a prediction branch dedicated to tiny ships and a lightweight hybrid attention block for reduced complexity. Additionally, a high-order spatial interaction module improves advanced feature understanding and modeling ability. Our model is evaluated on the public Kaggle marine ship detection dataset and compared with multiple state-of-the-art models, including small object detection models, lightweight detection models, and ship detection models. The results show that HSI-ShipDetectionNet outperforms the other models in terms of recall and mean average precision (mAP) while remaining lightweight and suitable for deployment on resource-limited platforms.
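High-order spatial interaction modules are typically built as recursive gated convolutions (gnConv-style), where each feature group gates the next after a depthwise spatial convolution so that the final output mixes higher-order interactions. The sketch below follows that general recipe; the channel splits, order, and kernel size are assumptions and may differ from the module used in HSI-ShipDetectionNet.

```python
import torch
import torch.nn as nn

class HighOrderSpatialInteraction(nn.Module):
    """Simplified recursive gated convolution, order 3 (dim must be divisible by 4)."""
    def __init__(self, dim, order=3):
        super().__init__()
        self.order = order
        self.dims = [dim // (2 ** i) for i in range(order)][::-1]   # e.g. [dim//4, dim//2, dim]
        self.proj_in = nn.Conv2d(dim, sum(self.dims) + self.dims[0], 1)
        self.dwconv = nn.Conv2d(sum(self.dims), sum(self.dims), 7, padding=3,
                                groups=sum(self.dims))
        self.projs = nn.ModuleList(nn.Conv2d(self.dims[i], self.dims[i + 1], 1)
                                   for i in range(order - 1))
        self.proj_out = nn.Conv2d(dim, dim, 1)

    def forward(self, x):
        y = self.proj_in(x)
        gate, feats = torch.split(y, [self.dims[0], sum(self.dims)], dim=1)
        feats = torch.split(self.dwconv(feats), self.dims, dim=1)
        out = gate * feats[0]                                       # first-order interaction
        for i in range(self.order - 1):
            out = self.projs[i](out) * feats[i + 1]                 # raise interaction order
        return self.proj_out(out)
```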
Abstract:Network architecture search (NAS), in particular the differentiable architecture search (DARTS) method, has shown great power in learning excellent model architectures on a specific dataset of interest. In contrast to using a fixed dataset, in this work we focus on a different but important scenario for NAS: how to refine a deployed network's architecture to enhance its robustness, guided by a few collected and misclassified examples that are degraded by some real-world unknown corruption with a specific pattern (e.g., noise, blur, etc.). To this end, we first conduct an empirical study to validate that model architectures are indeed related to corruption patterns. Surprisingly, by merely adding a few corrupted and misclassified examples (e.g., $10^3$ examples) to the clean training dataset (e.g., $5.0 \times 10^4$ examples), we can refine the model architecture and enhance robustness significantly. To make this practical, the key problem, i.e., how to select the proper failure examples for effective NAS guidance, must be carefully investigated. We therefore propose a novel core-failure-set guided DARTS that embeds a K-center-greedy algorithm into DARTS to select suitable corrupted failure examples for refining the model architecture. We evaluate DARTS-refined DNNs on the clean dataset as well as 15 corruptions, with the guidance of four specific real-world corruptions. Compared with state-of-the-art NAS and data-augmentation-based enhancement methods, our final method achieves higher accuracy on both the corrupted datasets and the original clean dataset. On some corruption patterns, we achieve over 45% absolute accuracy improvement.
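The K-center-greedy selection step named in the abstract is a standard core-set algorithm: repeatedly pick the example farthest from the already selected centers in feature space. A compact sketch follows; the idea of running it on penultimate-layer embeddings of the misclassified corrupted examples is our reading of the setup, and the helper in the usage comment is hypothetical.

```python
import numpy as np

def k_center_greedy(features, k, seed_idx=0):
    """Select k examples whose feature embeddings cover the set (k-center greedy)."""
    selected = [seed_idx]
    # Distance of every point to its nearest already-selected center.
    min_dist = np.linalg.norm(features - features[seed_idx], axis=1)
    for _ in range(k - 1):
        nxt = int(np.argmax(min_dist))                # farthest point becomes a new center
        selected.append(nxt)
        min_dist = np.minimum(min_dist, np.linalg.norm(features - features[nxt], axis=1))
    return selected

# Example usage (hypothetical helper): pick a core set of failure examples
# embeddings = model_penultimate_features(misclassified_corrupted_images)
# core_set = k_center_greedy(embeddings, k=1000)
```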
Abstract:Visual Simultaneous Localization and Mapping (vSLAM) has achieved great progress in the computer vision and robotics communities and has been successfully applied in many fields, such as autonomous robot navigation and AR/VR. However, vSLAM cannot achieve good localization in dynamic and complex environments. In recent years, numerous publications have reported that, by combining semantic information with vSLAM, semantic vSLAM systems are capable of solving the above problems. Nevertheless, there is no comprehensive survey of semantic vSLAM. To fill this gap, this paper first reviews the development of semantic vSLAM, explicitly focusing on its strengths and differences. Second, we explore three main issues of semantic vSLAM: the extraction and association of semantic information, the application of semantic information, and the advantages of semantic vSLAM. Then, we collect and analyze the current state-of-the-art SLAM datasets that have been widely used in semantic vSLAM systems. Finally, we discuss future directions that provide a blueprint for the development of semantic vSLAM.
Abstract:This work studies the joint rain and haze removal problem. In real-life scenarios, rain and haze, two often co-occurring common weather phenomena, can greatly degrade the clarity and quality of scene images, leading to performance drops in visual applications such as autonomous driving. However, jointly removing rain and haze from scene images is ill-posed and challenging: both the presence of haze and rain and the change of atmospheric light can degrade the scene information. Current methods focus on the contamination removal part, ignoring the restoration of scene information affected by the change of atmospheric light. We propose a novel deep neural network, named Asymmetric Dual-decoder U-Net (ADU-Net), to address this challenge. ADU-Net produces both a contamination residual and a scene residual to efficiently remove the rain and haze while preserving the fidelity of the scene information. Extensive experiments show that our work outperforms the existing state-of-the-art methods by a considerable margin on both synthetic and real-world benchmarks, including RainCityscapes, BID Rain, and SPA-Data. For instance, we improve the state-of-the-art PSNR value by 2.26/4.57 on RainCityscapes/SPA-Data, respectively. Code will be made freely available to the research community.
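The dual-residual composition can be pictured schematically as below; the encoder/decoder bodies are placeholders and the composition rule (subtract the contamination residual, add the scene residual) is our reading of the abstract rather than the authors' exact formulation.

```python
import torch.nn as nn

class ADUNetSketch(nn.Module):
    """Schematic of the asymmetric dual-decoder idea (modules are placeholders)."""
    def __init__(self, ch=32):
        super().__init__()
        self.encoder = nn.Sequential(nn.Conv2d(3, ch, 3, padding=1), nn.ReLU(),
                                     nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU())
        self.contam_decoder = nn.Conv2d(ch, 3, 3, padding=1)   # rain/haze residual
        self.scene_decoder = nn.Conv2d(ch, 3, 3, padding=1)    # atmospheric-light compensation

    def forward(self, x):
        feat = self.encoder(x)
        contamination = self.contam_decoder(feat)
        scene = self.scene_decoder(feat)
        return x - contamination + scene                       # restored image
```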