Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Yinghui Xing

Demystifying Catastrophic Forgetting in Two-Stage Incremental Object Detector

Feb 08, 2025

Qirui Wu, Shizhou Zhang, De Cheng, Yinghui Xing, Di Xu, Peng Wang, Yanning Zhang

Abstract:Catastrophic forgetting is a critical chanllenge for incremental object detection (IOD). Most existing methods treat the detector monolithically, relying on instance replay or knowledge distillation without analyzing component-specific forgetting. Through dissection of Faster R-CNN, we reveal a key insight: Catastrophic forgetting is predominantly localized to the RoI Head classifier, while regressors retain robustness across incremental stages. This finding challenges conventional assumptions, motivating us to develop a framework termed NSGP-RePRE. Regional Prototype Replay (RePRE) mitigates classifier forgetting via replay of two types of prototypes: coarse prototypes represent class-wise semantic centers of RoI features, while fine-grained prototypes model intra-class variations. Null Space Gradient Projection (NSGP) is further introduced to eliminate prototype-feature misalignment by updating the feature extractor in directions orthogonal to subspace of old inputs via gradient projection, aligning RePRE with incremental learning dynamics. Our simple yet effective design allows NSGP-RePRE to achieve state-of-the-art performance on the Pascal VOC and MS COCO datasets under various settings. Our work not only advances IOD methodology but also provide pivotal insights for catastrophic forgetting mitigation in IOD. Code will be available soon.

* 14 pages, 7 figures, 9 tables

Via

Access Paper or Ask Questions

Cross-Platform Video Person ReID: A New Benchmark Dataset and Adaptation Approach

Aug 14, 2024

Shizhou Zhang, Wenlong Luo, De Cheng, Qingchun Yang, Lingyan Ran, Yinghui Xing, Yanning Zhang

Figure 1 for Cross-Platform Video Person ReID: A New Benchmark Dataset and Adaptation Approach

Figure 2 for Cross-Platform Video Person ReID: A New Benchmark Dataset and Adaptation Approach

Figure 3 for Cross-Platform Video Person ReID: A New Benchmark Dataset and Adaptation Approach

Figure 4 for Cross-Platform Video Person ReID: A New Benchmark Dataset and Adaptation Approach

Abstract:In this paper, we construct a large-scale benchmark dataset for Ground-to-Aerial Video-based person Re-Identification, named G2A-VReID, which comprises 185,907 images and 5,576 tracklets, featuring 2,788 distinct identities. To our knowledge, this is the first dataset for video ReID under Ground-to-Aerial scenarios. G2A-VReID dataset has the following characteristics: 1) Drastic view changes; 2) Large number of annotated identities; 3) Rich outdoor scenarios; 4) Huge difference in resolution. Additionally, we propose a new benchmark approach for cross-platform ReID by transforming the cross-platform visual alignment problem into visual-semantic alignment through vision-language model (i.e., CLIP) and applying a parameter-efficient Video Set-Level-Adapter module to adapt image-based foundation model to video ReID tasks, termed VSLA-CLIP. Besides, to further reduce the great discrepancy across the platforms, we also devise the platform-bridge prompts for efficient visual feature alignment. Extensive experiments demonstrate the superiority of the proposed method on all existing video ReID datasets and our proposed G2A-VReID dataset.

Via

Access Paper or Ask Questions

Visual Prompt Tuning in Null Space for Continual Learning

Jun 11, 2024

Yue Lu, Shizhou Zhang, De Cheng, Yinghui Xing, Nannan Wang, Peng Wang, Yanning Zhang

Figure 1 for Visual Prompt Tuning in Null Space for Continual Learning

Figure 2 for Visual Prompt Tuning in Null Space for Continual Learning

Figure 3 for Visual Prompt Tuning in Null Space for Continual Learning

Figure 4 for Visual Prompt Tuning in Null Space for Continual Learning

Abstract:Existing prompt-tuning methods have demonstrated impressive performances in continual learning (CL), by selecting and updating relevant prompts in the vision-transformer models. On the contrary, this paper aims to learn each task by tuning the prompts in the direction orthogonal to the subspace spanned by previous tasks' features, so as to ensure no interference on tasks that have been learned to overcome catastrophic forgetting in CL. However, different from the orthogonal projection in the traditional CNN architecture, the prompt gradient orthogonal projection in the ViT architecture shows completely different and greater challenges, i.e., 1) the high-order and non-linear self-attention operation; 2) the drift of prompt distribution brought by the LayerNorm in the transformer block. Theoretically, we have finally deduced two consistency conditions to achieve the prompt gradient orthogonal projection, which provide a theoretical guarantee of eliminating interference on previously learned knowledge via the self-attention mechanism in visual prompt tuning. In practice, an effective null-space-based approximation solution has been proposed to implement the prompt gradient orthogonal projection. Extensive experimental results demonstrate the effectiveness of anti-forgetting on four class-incremental benchmarks with diverse pre-trained baseline models, and our approach achieves superior performances to state-of-the-art methods. Our code is available at https://github.com/zugexiaodui/VPTinNSforCL.

* 20 pages, 10 figures

Via

Access Paper or Ask Questions

DDF: A Novel Dual-Domain Image Fusion Strategy for Remote Sensing Image Semantic Segmentation with Unsupervised Domain Adaptation

Mar 05, 2024

Lingyan Ran, Lushuang Wang, Tao Zhuo, Yinghui Xing

Figure 1 for DDF: A Novel Dual-Domain Image Fusion Strategy for Remote Sensing Image Semantic Segmentation with Unsupervised Domain Adaptation

Figure 2 for DDF: A Novel Dual-Domain Image Fusion Strategy for Remote Sensing Image Semantic Segmentation with Unsupervised Domain Adaptation

Figure 3 for DDF: A Novel Dual-Domain Image Fusion Strategy for Remote Sensing Image Semantic Segmentation with Unsupervised Domain Adaptation

Figure 4 for DDF: A Novel Dual-Domain Image Fusion Strategy for Remote Sensing Image Semantic Segmentation with Unsupervised Domain Adaptation

Abstract:Semantic segmentation of remote sensing images is a challenging and hot issue due to the large amount of unlabeled data. Unsupervised domain adaptation (UDA) has proven to be advantageous in incorporating unclassified information from the target domain. However, independently fine-tuning UDA models on the source and target domains has a limited effect on the outcome. This paper proposes a hybrid training strategy as well as a novel dual-domain image fusion strategy that effectively utilizes the original image, transformation image, and intermediate domain information. Moreover, to enhance the precision of pseudo-labels, we present a pseudo-label region-specific weight strategy. The efficacy of our approach is substantiated by extensive benchmark experiments and ablation studies conducted on the ISPRS Vaihingen and Potsdam datasets.

Via

Access Paper or Ask Questions

CrossDiff: Exploring Self-Supervised Representation of Pansharpening via Cross-Predictive Diffusion Model

Jan 13, 2024

Yinghui Xing, Litao Qu, Shizhou Zhang, Kai Zhang, Yanning Zhang

Figure 1 for CrossDiff: Exploring Self-Supervised Representation of Pansharpening via Cross-Predictive Diffusion Model

Figure 2 for CrossDiff: Exploring Self-Supervised Representation of Pansharpening via Cross-Predictive Diffusion Model

Figure 3 for CrossDiff: Exploring Self-Supervised Representation of Pansharpening via Cross-Predictive Diffusion Model

Figure 4 for CrossDiff: Exploring Self-Supervised Representation of Pansharpening via Cross-Predictive Diffusion Model

Abstract:Fusion of a panchromatic (PAN) image and corresponding multispectral (MS) image is also known as pansharpening, which aims to combine abundant spatial details of PAN and spectral information of MS. Due to the absence of high-resolution MS images, available deep-learning-based methods usually follow the paradigm of training at reduced resolution and testing at both reduced and full resolution. When taking original MS and PAN images as inputs, they always obtain sub-optimal results due to the scale variation. In this paper, we propose to explore the self-supervised representation of pansharpening by designing a cross-predictive diffusion model, named CrossDiff. It has two-stage training. In the first stage, we introduce a cross-predictive pretext task to pre-train the UNet structure based on conditional DDPM, while in the second stage, the encoders of the UNets are frozen to directly extract spatial and spectral features from PAN and MS, and only the fusion head is trained to adapt for pansharpening task. Extensive experiments show the effectiveness and superiority of the proposed model compared with state-of-the-art supervised and unsupervised methods. Besides, the cross-sensor experiments also verify the generalization ability of proposed self-supervised representation learners for other satellite's datasets. We will release our code for reproducibility.

Via

Access Paper or Ask Questions

Ground-to-Aerial Person Search: Benchmark Dataset and Approach

Aug 24, 2023

Shizhou Zhang, Qingchun Yang, De Cheng, Yinghui Xing, Guoqiang Liang, Peng Wang, Yanning Zhang

Abstract:In this work, we construct a large-scale dataset for Ground-to-Aerial Person Search, named G2APS, which contains 31,770 images of 260,559 annotated bounding boxes for 2,644 identities appearing in both of the UAVs and ground surveillance cameras. To our knowledge, this is the first dataset for cross-platform intelligent surveillance applications, where the UAVs could work as a powerful complement for the ground surveillance cameras. To more realistically simulate the actual cross-platform Ground-to-Aerial surveillance scenarios, the surveillance cameras are fixed about 2 meters above the ground, while the UAVs capture videos of persons at different location, with a variety of view-angles, flight attitudes and flight modes. Therefore, the dataset has the following unique characteristics: 1) drastic view-angle changes between query and gallery person images from cross-platform cameras; 2) diverse resolutions, poses and views of the person images under 9 rich real-world scenarios. On basis of the G2APS benchmark dataset, we demonstrate detailed analysis about current two-step and end-to-end person search methods, and further propose a simple yet effective knowledge distillation scheme on the head of the ReID network, which achieves state-of-the-art performances on both of the G2APS and the previous two public person search datasets, i.e., PRW and CUHK-SYSU. The dataset and source code available on \url{https://github.com/yqc123456/HKD_for_person_search}.

* Accepted by ACM MM 2023

Via

Access Paper or Ask Questions

Pre-train, Adapt and Detect: Multi-Task Adapter Tuning for Camouflaged Object Detection

Jul 20, 2023

Yinghui Xing, Dexuan Kong, Shizhou Zhang, Geng Chen, Lingyan Ran, Peng Wang, Yanning Zhang

Abstract:Camouflaged object detection (COD), aiming to segment camouflaged objects which exhibit similar patterns with the background, is a challenging task. Most existing works are dedicated to establishing specialized modules to identify camouflaged objects with complete and fine details, while the boundary can not be well located for the lack of object-related semantics. In this paper, we propose a novel ``pre-train, adapt and detect" paradigm to detect camouflaged objects. By introducing a large pre-trained model, abundant knowledge learned from massive multi-modal data can be directly transferred to COD. A lightweight parallel adapter is inserted to adjust the features suitable for the downstream COD task. Extensive experiments on four challenging benchmark datasets demonstrate that our method outperforms existing state-of-the-art COD models by large margins. Moreover, we design a multi-task learning scheme for tuning the adapter to exploit the shareable knowledge across different semantic classes. Comprehensive experimental results showed that the generalization ability of our model can be substantially improved with multi-task adapter initialization on source tasks and multi-task adaptation on target tasks.

Via

Access Paper or Ask Questions

Multispectral Pedestrian Detection via Reference Box Constrained Cross Attention and Modality Balanced Optimization

Feb 01, 2023

Yinghui Xing, Song Wang, Guoqiang Liang, Qingyi Li, Xiuwei Zhang, Shizhou Zhang, Yanning Zhang

Abstract:Multispectral pedestrian detection is an important task for many around-the-clock applications, since the visible and thermal modalities can provide complementary information especially under low light conditions. To reduce the influence of hand-designed components in available multispectral pedestrian detectors, we propose a MultiSpectral pedestrian DEtection TRansformer (MS-DETR), which extends deformable DETR to multi-modal paradigm. In order to facilitate the multi-modal learning process, a Reference box Constrained Cross-Attention (RCCA) module is firstly introduced to the multi-modal Transformer decoder, which takes fusion branch together with the reference boxes as intermediaries to enable the interaction of visible and thermal modalities. To further balance the contribution of different modalities, we design a modality-balanced optimization strategy, which aligns the slots of decoders by adaptively adjusting the instance-level weight of three branches. Our end-to-end MS-DETR shows superior performance on the challenging KAIST and CVC-14 benchmark datasets.

Via

Access Paper or Ask Questions

Class-Aware Visual Prompt Tuning for Vision-Language Pre-Trained Model

Aug 22, 2022

Yinghui Xing, Qirui Wu, De Cheng, Shizhou Zhang, Guoqiang Liang, Yanning Zhang

Figure 1 for Class-Aware Visual Prompt Tuning for Vision-Language Pre-Trained Model

Figure 2 for Class-Aware Visual Prompt Tuning for Vision-Language Pre-Trained Model

Figure 3 for Class-Aware Visual Prompt Tuning for Vision-Language Pre-Trained Model

Figure 4 for Class-Aware Visual Prompt Tuning for Vision-Language Pre-Trained Model

Abstract:With the emergence of large pre-trained vison-language model like CLIP, transferrable representations can be adapted to a wide range of downstream tasks via prompt tuning. Prompt tuning tries to probe the beneficial information for downstream tasks from the general knowledge stored in both the image and text encoders of the pre-trained vision-language model. A recently proposed method named Context Optimization (CoOp) introduces a set of learnable vectors as text prompt from the language side, while tuning the text prompt alone can not affect the computed visual features of the image encoder, thus leading to sub-optimal. In this paper, we propose a dual modality prompt tuning paradigm through learning text prompts and visual prompts for both the text and image encoder simultaneously. In addition, to make the visual prompt concentrate more on the target visual concept, we propose Class-Aware Visual Prompt Tuning (CAVPT), which is generated dynamically by performing the cross attention between language descriptions of template prompts and visual class token embeddings. Our method provides a new paradigm for tuning the large pre-trained vision-language model and extensive experimental results on 8 datasets demonstrate the effectiveness of the proposed method. Our code is available in the supplementary materials.

* 9 pages, 4 figures

Via

Access Paper or Ask Questions

PC-GANs: Progressive Compensation Generative Adversarial Networks for Pan-sharpening

Jul 29, 2022

Yinghui Xing, Shuyuan Yang, Song Wang, Yan Zhang, Yanning Zhang

Figure 1 for PC-GANs: Progressive Compensation Generative Adversarial Networks for Pan-sharpening

Figure 2 for PC-GANs: Progressive Compensation Generative Adversarial Networks for Pan-sharpening

Figure 3 for PC-GANs: Progressive Compensation Generative Adversarial Networks for Pan-sharpening

Figure 4 for PC-GANs: Progressive Compensation Generative Adversarial Networks for Pan-sharpening

Abstract:The fusion of multispectral and panchromatic images is always dubbed pansharpening. Most of the available deep learning-based pan-sharpening methods sharpen the multispectral images through a one-step scheme, which strongly depends on the reconstruction ability of the network. However, remote sensing images always have large variations, as a result, these one-step methods are vulnerable to the error accumulation and thus incapable of preserving spatial details as well as the spectral information. In this paper, we propose a novel two-step model for pan-sharpening that sharpens the MS image through the progressive compensation of the spatial and spectral information. Firstly, a deep multiscale guided generative adversarial network is used to preliminarily enhance the spatial resolution of the MS image. Starting from the pre-sharpened MS image in the coarse domain, our approach then progressively refines the spatial and spectral residuals over a couple of generative adversarial networks (GANs) that have reverse architectures. The whole model is composed of triple GANs, and based on the specific architecture, a joint compensation loss function is designed to enable the triple GANs to be trained simultaneously. Moreover, the spatial-spectral residual compensation structure proposed in this paper can be extended to other pan-sharpening methods to further enhance their fusion results. Extensive experiments are performed on different datasets and the results demonstrate the effectiveness and efficiency of our proposed method.

* 36 pages

Via

Access Paper or Ask Questions