Abstract:Fine-grained remote sensing image segmentation is essential for accurately identifying detailed objects in remote sensing images. Recently, vision transformer models (VTM) pretrained on large-scale datasets have shown strong zero-shot generalization, indicating that they have learned the general knowledge of object understanding. We introduce a novel end-to-end learning paradigm combining knowledge guidance with domain refinement to enhance performance. We present two key components: the Feature Alignment Module (FAM) and the Feature Modulation Module (FMM). FAM aligns features from a CNN-based backbone with those from the pretrained VTM's encoder using channel transformation and spatial interpolation, and transfers knowledge via KL divergence and L2 normalization constraint. FMM further adapts the knowledge to the specific domain to address domain shift. We also introduce a fine-grained grass segmentation dataset and demonstrate, through experiments on two datasets, that our method achieves a significant improvement of 2.57 mIoU on the grass dataset and 3.73 mIoU on the cloud dataset. The results highlight the potential of combining knowledge transfer and domain adaptation to overcome domain-related challenges and data limitations. The project page is available at https://xavierjiezou.github.io/KTDA/.
Abstract:Cloud segmentation is a critical challenge in remote sensing image interpretation, as its accuracy directly impacts the effectiveness of subsequent data processing and analysis. Recently, vision foundation models (VFM) have demonstrated powerful generalization capabilities across various visual tasks. In this paper, we present a parameter-efficient adaptive approach, termed Cloud-Adapter, designed to enhance the accuracy and robustness of cloud segmentation. Our method leverages a VFM pretrained on general domain data, which remains frozen, eliminating the need for additional training. Cloud-Adapter incorporates a lightweight spatial perception module that initially utilizes a convolutional neural network (ConvNet) to extract dense spatial representations. These multi-scale features are then aggregated and serve as contextual inputs to an adapting module, which modulates the frozen transformer layers within the VFM. Experimental results demonstrate that the Cloud-Adapter approach, utilizing only 0.6% of the trainable parameters of the frozen backbone, achieves substantial performance gains. Cloud-Adapter consistently attains state-of-the-art (SOTA) performance across a wide variety of cloud segmentation datasets from multiple satellite sources, sensor series, data processing levels, land cover scenarios, and annotation granularities. We have released the source code and pretrained models at https://github.com/XavierJiezou/Cloud-Adapter to support further research.
Abstract:Cattle face recognition holds paramount significance in domains such as animal husbandry and behavioral research. Despite significant progress in confined environments, applying these accomplishments in wild settings remains challenging. Thus, we create the first large-scale cattle face recognition dataset, ICRWE, for wild environments. It encompasses 483 cattle and 9,816 high-resolution image samples. Each sample undergoes annotation for face features, light conditions, and face orientation. Furthermore, we introduce a novel parallel attention network, PANet. Comprising several cascaded Transformer modules, each module incorporates two parallel Position Attention Modules (PAM) and Feature Mapping Modules (FMM). PAM focuses on local and global features at each image position through parallel channel attention, and FMM captures intricate feature patterns through non-linear mappings. Experimental results indicate that PANet achieves a recognition accuracy of 88.03% on the ICRWE dataset, establishing itself as the current state-of-the-art approach. The source code is available in the supplementary materials.
Abstract:Pansharpening, a pivotal task in remote sensing, involves integrating low-resolution multispectral images with high-resolution panchromatic images to synthesize an image that is both high-resolution and retains multispectral information. These pansharpened images enhance precision in land cover classification, change detection, and environmental monitoring within remote sensing data analysis. While deep learning techniques have shown significant success in pansharpening, existing methods often face limitations in their evaluation, focusing on restricted satellite data sources, single scene types, and low-resolution images. This paper addresses this gap by introducing PanBench, a high-resolution multi-scene dataset containing all mainstream satellites and comprising 5,898 pairs of samples. Each pair includes a four-channel (RGB + near-infrared) multispectral image of 256x256 pixels and a mono-channel panchromatic image of 1,024x1,024 pixels. To achieve high-fidelity synthesis, we propose a Cascaded Multiscale Fusion Network (CMFNet) for Pansharpening. Extensive experiments validate the effectiveness of CMFNet. We have released the dataset, source code, and pre-trained models in the supplementary, fostering further research in remote sensing.
Abstract:Optical satellite images are a critical data source; however, cloud cover often compromises their quality, hindering image applications and analysis. Consequently, effectively removing clouds from optical satellite images has emerged as a prominent research direction. While recent advancements in cloud removal primarily rely on generative adversarial networks, which may yield suboptimal image quality, diffusion models have demonstrated remarkable success in diverse image-generation tasks, showcasing their potential in addressing this challenge. This paper presents a novel framework called DiffCR, which leverages conditional guided diffusion with deep convolutional networks for high-performance cloud removal for optical satellite imagery. Specifically, we introduce a decoupled encoder for conditional image feature extraction, providing a robust color representation to ensure the close similarity of appearance information between the conditional input and the synthesized output. Moreover, we propose a novel and efficient time and condition fusion block within the cloud removal model to accurately simulate the correspondence between the appearance in the conditional image and the target image at a low computational cost. Extensive experimental evaluations on two commonly used benchmark datasets demonstrate that DiffCR consistently achieves state-of-the-art performance on all metrics, with parameter and computational complexities amounting to only 5.1% and 5.4%, respectively, of those previous best methods. The source code, pre-trained models, and all the experimental results will be publicly available at https://github.com/XavierJiezou/DiffCR upon the paper's acceptance of this work.