Xianing Chen

Single Domain Generalization for Few-Shot Counting via Universal Representation Matching

May 22, 2025

Xianing Chen, Si Huo, Borui Jiang, Hailin Hu, Xinghao Chen

Abstract: Few-shot counting estimates the number of target objects in an image using only a few annotated exemplars. However, domain shift severely hinders existing methods from generalizing to unseen scenarios. This problem falls into the realm of single domain generalization, which remains unexplored in few-shot counting. To solve it, we begin by analyzing the main limitations of current methods, which typically follow a standard pipeline that extracts object prototypes from exemplars and then matches them against image features to construct a correlation map. We argue that existing methods overlook the significance of learning highly generalized prototypes. Building on this insight, we propose the first single domain generalization few-shot counting model, Universal Representation Matching, termed URM. Our primary contribution is the discovery that incorporating universal vision-language representations, distilled from a large-scale pretrained vision-language model, into the correlation construction process substantially improves robustness to domain shifts without compromising in-domain performance. As a result, URM achieves state-of-the-art performance in both the in-domain setting and the newly introduced domain generalization setting.

* CVPR 2025
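
The standard pipeline named in the abstract (extract a prototype from an exemplar, then match it against image features to build a correlation map) can be illustrated with a minimal sketch. This is not URM's implementation: the average-pooled prototype, the cosine similarity measure, the toy feature map, and the function names `prototype_from_box` and `correlation_map` are all assumptions chosen for illustration, and the vision-language distillation step is not shown.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0

def prototype_from_box(features, box):
    """Average the feature vectors inside an exemplar box.

    features: H x W grid of D-dimensional vectors (nested lists).
    box: (r0, c0, r1, c1), end-exclusive. Average pooling is an assumption.
    """
    r0, c0, r1, c1 = box
    cells = [features[r][c] for r in range(r0, r1) for c in range(c0, c1)]
    dim = len(cells[0])
    return [sum(cell[d] for cell in cells) / len(cells) for d in range(dim)]

def correlation_map(features, prototype):
    """Similarity of every spatial location to the prototype (H x W grid)."""
    return [[cosine(cell, prototype) for cell in row] for row in features]

# Toy 2 x 3 feature map with 2-dimensional features (made-up values).
features = [
    [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]],
    [[0.0, 1.0], [1.0, 0.0], [0.0, 1.0]],
]
proto = prototype_from_box(features, (0, 0, 1, 1))  # exemplar covers the top-left cell
corr = correlation_map(features, proto)
```

In this toy example, locations whose features align with the exemplar score high in `corr`, which is the signal a counting head would then turn into a density estimate.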

FiLA-Video: Spatio-Temporal Compression for Fine-Grained Long Video Understanding

Apr 29, 2025

Yanan Guo, Wenhui Dong, Jun Song, Shiding Zhu, Xuan Zhang, Hanqing Yang, Yingbo Wang, Yang Du, Xianing Chen, Bo Zheng

Abstract: Recent advancements in video understanding within visual large language models (VLLMs) have led to notable progress. However, the complexity of video data and limitations in contextual processing still hinder long-video comprehension. A common approach is to compress video features to reduce the number of tokens fed to the large language model, yet many methods either fail to prioritize essential features, leaving redundant inter-frame information, or introduce computationally expensive modules. To address these issues, we propose FiLA (Fine-grained Vision Language Model)-Video, a novel framework that leverages a lightweight dynamic-weight multi-frame fusion strategy, which adaptively integrates multiple frames into a single representation while preserving key video information and reducing computational costs. To enhance frame selection for fusion, we introduce a keyframe selection strategy that identifies informative frames from a larger pool for improved summarization. Additionally, we present a simple yet effective long-video training data generation strategy that boosts model performance without extensive manual annotation. Experimental results demonstrate that FiLA-Video achieves superior efficiency and accuracy in long-video comprehension compared to existing methods.

* 8 pages, 6 figures
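
The idea of fusing several frames into one representation with adaptive weights can be sketched as a softmax-weighted average. This is only an illustration under stated assumptions, not FiLA-Video's fusion module: the per-frame scores are taken as given inputs (how they are predicted is not described here), and `softmax` and `fuse_frames` are hypothetical helper names.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scalar scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def fuse_frames(frame_features, scores):
    """Combine per-frame feature vectors into one vector.

    frame_features: list of D-dimensional vectors, one per frame.
    scores: one scalar per frame; higher means more weight (assumed given).
    """
    weights = softmax(scores)
    dim = len(frame_features[0])
    return [
        sum(w * f[d] for w, f in zip(weights, frame_features))
        for d in range(dim)
    ]

# Three toy frames with 2-dimensional features (made-up values).
frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
uniform = fuse_frames(frames, [0.0, 0.0, 0.0])     # equal scores: plain average
peaked = fuse_frames(frames, [10.0, -10.0, -10.0])  # first frame dominates
```

With equal scores the result is the plain mean of the frames; as one score grows, the fused vector approaches that frame's features, so a group of frames costs only one representation's worth of tokens.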

DearKD: Data-Efficient Early Knowledge Distillation for Vision Transformers

Apr 28, 2022

Xianing Chen, Qiong Cao, Yujie Zhong, Jing Zhang, Shenghua Gao, Dacheng Tao

Abstract: Transformers have been successfully applied to computer vision thanks to the powerful modeling capacity of self-attention. However, their excellent performance depends heavily on very large numbers of training images. Thus, a data-efficient transformer solution is urgently needed. In this work, we propose an early knowledge distillation framework, termed DearKD, to improve the data efficiency of transformers. DearKD is a two-stage framework that first distills the inductive biases from the early intermediate layers of a CNN and then gives the transformer full play by training without distillation. Further, DearKD can be readily applied to the extreme data-free case where no real images are available. In this case, we propose a boundary-preserving intra-divergence loss based on DeepInversion to further close the performance gap against the full-data counterpart. Extensive experiments on ImageNet, partial ImageNet, the data-free setting, and other downstream tasks demonstrate the superiority of DearKD over its baselines and state-of-the-art methods.

* CVPR 2022
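
The two-stage schedule described in the abstract (distill first, then train without distillation) can be sketched as a loss that drops its distillation term after a cutoff. This is a minimal sketch, not DearKD's training code: the hard switch at a fixed epoch, the weighting factor `alpha`, and the names `training_loss` and `distill_epochs` are assumptions made for illustration.

```python
def training_loss(task_loss, distill_loss, epoch, distill_epochs, alpha=1.0):
    """Return the loss to optimize at a given epoch.

    Stage 1 (epoch < distill_epochs): task loss plus weighted distillation loss.
    Stage 2 (afterwards): task loss only, so the student trains freely.
    The hard cutoff and `alpha` are illustrative assumptions.
    """
    if epoch < distill_epochs:
        return task_loss + alpha * distill_loss
    return task_loss

# Toy schedule: 5 epochs total, distillation active for the first 3.
schedule = [
    training_loss(task_loss=1.0, distill_loss=0.5, epoch=e, distill_epochs=3)
    for e in range(5)
]
```

Here `schedule` holds the combined loss for the first three epochs and the plain task loss afterwards, which mirrors the idea of letting the CNN teacher shape early training only.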

OH-Former: Omni-Relational High-Order Transformer for Person Re-Identification

Sep 23, 2021

Xianing Chen, Jialang Xu, Jiale Xu, Shenghua Gao

Abstract: Transformers have shown strong performance on many vision tasks. However, for the task of person re-identification (ReID), vanilla transformers leave the rich contexts of high-order feature relations under-exploited and degrade local feature details, which makes them insufficient given the dramatic variations of pedestrians. In this work, we propose an Omni-Relational High-Order Transformer (OH-Former) to model omni-relational features for ReID. First, to strengthen the capacity of the visual representation, instead of obtaining the attention matrix from pairs of queries and isolated keys at each spatial location, we go a step further and model high-order statistical information for the non-local mechanism. To reduce the computation cost, we share the attention weights in the corresponding layer of each order through a prior mixing mechanism. Then, a convolution-based local relation perception module is proposed to extract local relations and 2D position information. Experimental results show that our model achieves state-of-the-art performance on the Market-1501, DukeMTMC, MSMT17, and Occluded-Duke datasets.
