Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Chunhong Pan

UNIP: Rethinking Pre-trained Attention Patterns for Infrared Semantic Segmentation

Feb 04, 2025

Tao Zhang, Jinyong Wen, Zhen Chen, Kun Ding, Shiming Xiang, Chunhong Pan

Figure 1 for UNIP: Rethinking Pre-trained Attention Patterns for Infrared Semantic Segmentation

Figure 2 for UNIP: Rethinking Pre-trained Attention Patterns for Infrared Semantic Segmentation

Figure 3 for UNIP: Rethinking Pre-trained Attention Patterns for Infrared Semantic Segmentation

Figure 4 for UNIP: Rethinking Pre-trained Attention Patterns for Infrared Semantic Segmentation

Abstract:Pre-training techniques significantly enhance the performance of semantic segmentation tasks with limited training data. However, the efficacy under a large domain gap between pre-training (e.g. RGB) and fine-tuning (e.g. infrared) remains underexplored. In this study, we first benchmark the infrared semantic segmentation performance of various pre-training methods and reveal several phenomena distinct from the RGB domain. Next, our layerwise analysis of pre-trained attention maps uncovers that: (1) There are three typical attention patterns (local, hybrid, and global); (2) Pre-training tasks notably influence the pattern distribution across layers; (3) The hybrid pattern is crucial for semantic segmentation as it attends to both nearby and foreground elements; (4) The texture bias impedes model generalization in infrared tasks. Building on these insights, we propose UNIP, a UNified Infrared Pre-training framework, to enhance the pre-trained model performance. This framework uses the hybrid-attention distillation NMI-HAD as the pre-training target, a large-scale mixed dataset InfMix for pre-training, and a last-layer feature pyramid network LL-FPN for fine-tuning. Experimental results show that UNIP outperforms various pre-training methods by up to 13.5\% in average mIoU on three infrared segmentation tasks, evaluated using fine-tuning and linear probing metrics. UNIP-S achieves performance on par with MAE-L while requiring only 1/10 of the computational cost. Furthermore, UNIP significantly surpasses state-of-the-art (SOTA) infrared or RGB segmentation methods and demonstrates broad potential for application in other modalities, such as RGB and depth. Our code is available at https://github.com/casiatao/UNIP.

* ICLR 2025. 27 pages, 13 figures, 21 tables

Via

Access Paper or Ask Questions

Diffusion Model as a Noise-Aware Latent Reward Model for Step-Level Preference Optimization

Feb 03, 2025

Tao Zhang, Cheng Da, Kun Ding, Kun Jin, Yan Li, Tingting Gao, Di Zhang, Shiming Xiang, Chunhong Pan

Figure 1 for Diffusion Model as a Noise-Aware Latent Reward Model for Step-Level Preference Optimization

Figure 2 for Diffusion Model as a Noise-Aware Latent Reward Model for Step-Level Preference Optimization

Figure 3 for Diffusion Model as a Noise-Aware Latent Reward Model for Step-Level Preference Optimization

Figure 4 for Diffusion Model as a Noise-Aware Latent Reward Model for Step-Level Preference Optimization

Abstract:Preference optimization for diffusion models aims to align them with human preferences for images. Previous methods typically leverage Vision-Language Models (VLMs) as pixel-level reward models to approximate human preferences. However, when used for step-level preference optimization, these models face challenges in handling noisy images of different timesteps and require complex transformations into pixel space. In this work, we demonstrate that diffusion models are inherently well-suited for step-level reward modeling in the latent space, as they can naturally extract features from noisy latent images. Accordingly, we propose the Latent Reward Model (LRM), which repurposes components of diffusion models to predict preferences of latent images at various timesteps. Building on LRM, we introduce Latent Preference Optimization (LPO), a method designed for step-level preference optimization directly in the latent space. Experimental results indicate that LPO not only significantly enhances performance in aligning diffusion models with general, aesthetic, and text-image alignment preferences, but also achieves 2.5-28$\times$ training speedup compared to existing preference optimization methods. Our code will be available at https://github.com/casiatao/LPO.

* 20 pages, 14 tables, 15 figures

Via

Access Paper or Ask Questions

Weak Distribution Detectors Lead to Stronger Generalizability of Vision-Language Prompt Tuning

Mar 31, 2024

Kun Ding, Haojian Zhang, Qiang Yu, Ying Wang, Shiming Xiang, Chunhong Pan

Abstract:We propose a generalized method for boosting the generalization ability of pre-trained vision-language models (VLMs) while fine-tuning on downstream few-shot tasks. The idea is realized by exploiting out-of-distribution (OOD) detection to predict whether a sample belongs to a base distribution or a novel distribution and then using the score generated by a dedicated competition based scoring function to fuse the zero-shot and few-shot classifier. The fused classifier is dynamic, which will bias towards the zero-shot classifier if a sample is more likely from the distribution pre-trained on, leading to improved base-to-novel generalization ability. Our method is performed only in test stage, which is applicable to boost existing methods without time-consuming re-training. Extensive experiments show that even weak distribution detectors can still improve VLMs' generalization ability. Specifically, with the help of OOD detectors, the harmonic mean of CoOp and ProGrad increase by 2.6 and 1.5 percentage points over 11 recognition datasets in the base-to-novel setting.

* Accepted by AAAI2024

Via

Access Paper or Ask Questions

Reusable Architecture Growth for Continual Stereo Matching

Mar 30, 2024

Chenghao Zhang, Gaofeng Meng, Bin Fan, Kun Tian, Zhaoxiang Zhang, Shiming Xiang, Chunhong Pan

Figure 1 for Reusable Architecture Growth for Continual Stereo Matching

Figure 2 for Reusable Architecture Growth for Continual Stereo Matching

Figure 3 for Reusable Architecture Growth for Continual Stereo Matching

Figure 4 for Reusable Architecture Growth for Continual Stereo Matching

Abstract:The remarkable performance of recent stereo depth estimation models benefits from the successful use of convolutional neural networks to regress dense disparity. Akin to most tasks, this needs gathering training data that covers a number of heterogeneous scenes at deployment time. However, training samples are typically acquired continuously in practical applications, making the capability to learn new scenes continually even more crucial. For this purpose, we propose to perform continual stereo matching where a model is tasked to 1) continually learn new scenes, 2) overcome forgetting previously learned scenes, and 3) continuously predict disparities at inference. We achieve this goal by introducing a Reusable Architecture Growth (RAG) framework. RAG leverages task-specific neural unit search and architecture growth to learn new scenes continually in both supervised and self-supervised manners. It can maintain high reusability during growth by reusing previous units while obtaining good performance. Additionally, we present a Scene Router module to adaptively select the scene-specific architecture path at inference. Comprehensive experiments on numerous datasets show that our framework performs impressively in various weather, road, and city circumstances and surpasses the state-of-the-art methods in more challenging cross-dataset settings. Further experiments also demonstrate the adaptability of our method to unseen scenes, which can facilitate end-to-end stereo architecture learning and practical deployment.

* Extended version of CVPR 2022 paper "Continual Stereo Matching of Continuous Driving Scenes with Growing Architecture" - Accepted to TPAMI in 2024

Via

Access Paper or Ask Questions

PAD: Self-Supervised Pre-Training with Patchwise-Scale Adapter for Infrared Images

Dec 13, 2023

Tao Zhang, Kun Ding, Jinyong Wen, Yu Xiong, Zeyu Zhang, Shiming Xiang, Chunhong Pan

Abstract:Self-supervised learning (SSL) for RGB images has achieved significant success, yet there is still limited research on SSL for infrared images, primarily due to three prominent challenges: 1) the lack of a suitable large-scale infrared pre-training dataset, 2) the distinctiveness of non-iconic infrared images rendering common pre-training tasks like masked image modeling (MIM) less effective, and 3) the scarcity of fine-grained textures making it particularly challenging to learn general image features. To address these issues, we construct a Multi-Scene Infrared Pre-training (MSIP) dataset comprising 178,756 images, and introduce object-sensitive random RoI cropping, an image preprocessing method, to tackle the challenge posed by non-iconic images. To alleviate the impact of weak textures on feature learning, we propose a pre-training paradigm called Pre-training with ADapter (PAD), which uses adapters to learn domain-specific features while freezing parameters pre-trained on ImageNet to retain the general feature extraction capability. This new paradigm is applicable to any transformer-based SSL method. Furthermore, to achieve more flexible coordination between pre-trained and newly-learned features in different layers and patches, a patchwise-scale adapter with dynamically learnable scale factors is introduced. Extensive experiments on three downstream tasks show that PAD, with only 1.23M pre-trainable parameters, outperforms other baseline paradigms including continual full pre-training on MSIP. Our code and dataset are available at https://github.com/casiatao/PAD.

Via

Access Paper or Ask Questions

Prompt Tuning with Soft Context Sharing for Vision-Language Models

Aug 29, 2022

Kun Ding, Ying Wang, Pengzhang Liu, Qiang Yu, Haojian Zhang, Shiming Xiang, Chunhong Pan

Figure 1 for Prompt Tuning with Soft Context Sharing for Vision-Language Models

Figure 2 for Prompt Tuning with Soft Context Sharing for Vision-Language Models

Figure 3 for Prompt Tuning with Soft Context Sharing for Vision-Language Models

Figure 4 for Prompt Tuning with Soft Context Sharing for Vision-Language Models

Abstract:Vision-language models have recently shown great potential on many computer vision tasks. Meanwhile, prior work demonstrates prompt tuning designed for vision-language models could acquire superior performance on few-shot image recognition compared to linear probe, a strong baseline. In real-world applications, many few-shot tasks are correlated, particularly in a specialized area. However, such information is ignored by previous work. Inspired by the fact that modeling task relationships by multi-task learning can usually boost performance, we propose a novel method SoftCPT (Soft Context Sharing for Prompt Tuning) to fine-tune pre-trained vision-language models on multiple target few-shot tasks, simultaneously. Specifically, we design a task-shared meta network to generate prompt vector for each task using pre-defined task name together with a learnable meta prompt as input. As such, the prompt vectors of all tasks will be shared in a soft manner. The parameters of this shared meta network as well as the meta prompt vector are tuned on the joint training set of all target tasks. Extensive experiments on three multi-task few-shot datasets show that SoftCPT outperforms the representative single-task prompt tuning method CoOp [78] by a large margin, implying the effectiveness of multi-task learning in vision-language prompt tuning. The source code and data will be made publicly available.

Via

Access Paper or Ask Questions

Pro-tuning: Unified Prompt Tuning for Vision Tasks

Aug 14, 2022

Xing Nie, Bolin Ni, Jianlong Chang, Gaomeng Meng, Chunlei Huo, Zhaoxiang Zhang, Shiming Xiang, Qi Tian, Chunhong Pan

Figure 1 for Pro-tuning: Unified Prompt Tuning for Vision Tasks

Figure 2 for Pro-tuning: Unified Prompt Tuning for Vision Tasks

Figure 3 for Pro-tuning: Unified Prompt Tuning for Vision Tasks

Figure 4 for Pro-tuning: Unified Prompt Tuning for Vision Tasks

Abstract:In computer vision, fine-tuning is the de-facto approach to leverage pre-trained vision models to perform downstream tasks. However, deploying it in practice is quite challenging, due to adopting parameter inefficient global update and heavily relying on high-quality downstream data. Recently, prompt-based learning, which adds a task-relevant prompt to adapt the downstream tasks to pre-trained models, has drastically boosted the performance of many natural language downstream tasks. In this work, we extend this notable transfer ability benefited from prompt into vision models as an alternative to fine-tuning. To this end, we propose parameter-efficient Prompt tuning (Pro-tuning) to adapt frozen vision models to various downstream vision tasks. The key to Pro-tuning is prompt-based tuning, i.e., learning task-specific vision prompts for downstream input images with the pre-trained model frozen. By only training a few additional parameters, it can work on diverse CNN-based and Transformer-based architectures. Extensive experiments evidence that Pro-tuning outperforms fine-tuning in a broad range of vision tasks and scenarios, including image classification (generic objects, class imbalance, image corruption, adversarial robustness, and out-of-distribution generalization), and dense prediction tasks such as object detection and semantic segmentation.

Via

Access Paper or Ask Questions

Differentiable Convolution Search for Point Cloud Processing

Aug 29, 2021

Xing Nie, Yongcheng Liu, Shaohong Chen, Jianlong Chang, Chunlei Huo, Gaofeng Meng, Qi Tian, Weiming Hu, Chunhong Pan

Figure 1 for Differentiable Convolution Search for Point Cloud Processing

Figure 2 for Differentiable Convolution Search for Point Cloud Processing

Figure 3 for Differentiable Convolution Search for Point Cloud Processing

Figure 4 for Differentiable Convolution Search for Point Cloud Processing

Abstract:Exploiting convolutional neural networks for point cloud processing is quite challenging, due to the inherent irregular distribution and discrete shape representation of point clouds. To address these problems, many handcrafted convolution variants have sprung up in recent years. Though with elaborate design, these variants could be far from optimal in sufficiently capturing diverse shapes formed by discrete points. In this paper, we propose PointSeaConv, i.e., a novel differential convolution search paradigm on point clouds. It can work in a purely data-driven manner and thus is capable of auto-creating a group of suitable convolutions for geometric shape modeling. We also propose a joint optimization framework for simultaneous search of internal convolution and external architecture, and introduce epsilon-greedy algorithm to alleviate the effect of discretization error. As a result, PointSeaNet, a deep network that is sufficient to capture geometric shapes at both convolution level and architecture level, can be searched out for point cloud processing. Extensive experiments strongly evidence that our proposed PointSeaNet surpasses current handcrafted deep models on challenging benchmarks across multiple tasks with remarkable margins.

Via

Access Paper or Ask Questions

HAN: An Efficient Hierarchical Self-Attention Network for Skeleton-Based Gesture Recognition

Jun 25, 2021

Jianbo Liu, Ying Wang, Shiming Xiang, Chunhong Pan

Figure 1 for HAN: An Efficient Hierarchical Self-Attention Network for Skeleton-Based Gesture Recognition

Figure 2 for HAN: An Efficient Hierarchical Self-Attention Network for Skeleton-Based Gesture Recognition

Figure 3 for HAN: An Efficient Hierarchical Self-Attention Network for Skeleton-Based Gesture Recognition

Figure 4 for HAN: An Efficient Hierarchical Self-Attention Network for Skeleton-Based Gesture Recognition

Abstract:Previous methods for skeleton-based gesture recognition mostly arrange the skeleton sequence into a pseudo picture or spatial-temporal graph and apply deep Convolutional Neural Network (CNN) or Graph Convolutional Network (GCN) for feature extraction. Although achieving superior results, these methods have inherent limitations in dynamically capturing local features of interactive hand parts, and the computing efficiency still remains a serious issue. In this work, the self-attention mechanism is introduced to alleviate this problem. Considering the hierarchical structure of hand joints, we propose an efficient hierarchical self-attention network (HAN) for skeleton-based gesture recognition, which is based on pure self-attention without any CNN, RNN or GCN operators. Specifically, the joint self-attention module is used to capture spatial features of fingers, the finger self-attention module is designed to aggregate features of the whole hand. In terms of temporal features, the temporal self-attention module is utilized to capture the temporal dynamics of the fingers and the entire hand. Finally, these features are fused by the fusion self-attention module for gesture classification. Experiments show that our method achieves competitive results on three gesture recognition datasets with much lower computational complexity.

* Under peer review for TCSVT

Via

Access Paper or Ask Questions

Adaptive Remote Sensing Image Attribute Learning for Active Object Detection

Jan 16, 2021

Nuo Xu, Chunlei Huo, Jiacheng Guo, Yiwei Liu, Jian Wang, Chunhong Pan

Figure 1 for Adaptive Remote Sensing Image Attribute Learning for Active Object Detection

Figure 2 for Adaptive Remote Sensing Image Attribute Learning for Active Object Detection

Figure 3 for Adaptive Remote Sensing Image Attribute Learning for Active Object Detection

Figure 4 for Adaptive Remote Sensing Image Attribute Learning for Active Object Detection

Abstract:In recent years, deep learning methods bring incredible progress to the field of object detection. However, in the field of remote sensing image processing, existing methods neglect the relationship between imaging configuration and detection performance, and do not take into account the importance of detection performance feedback for improving image quality. Therefore, detection performance is limited by the passive nature of the conventional object detection framework. In order to solve the above limitations, this paper takes adaptive brightness adjustment and scale adjustment as examples, and proposes an active object detection method based on deep reinforcement learning. The goal of adaptive image attribute learning is to maximize the detection performance. With the help of active object detection and image attribute adjustment strategies, low-quality images can be converted into high-quality images, and the overall performance is improved without retraining the detector.

* Accepted in 25th International Conference on Pattern Recognition (ICPR), (Milan, Italy), January 2021

Via

Access Paper or Ask Questions