Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Chaoqi Chen

Generalized Category Discovery via Token Manifold Capacity Learning

May 20, 2025

Luyao Tang, Kunze Huang, Chaoqi Chen, Cheng Chen

Abstract:Generalized category discovery (GCD) is essential for improving deep learning models' robustness in open-world scenarios by clustering unlabeled data containing both known and novel categories. Traditional GCD methods focus on minimizing intra-cluster variations, often sacrificing manifold capacity, which limits the richness of intra-class representations. In this paper, we propose a novel approach, Maximum Token Manifold Capacity (MTMC), that prioritizes maximizing the manifold capacity of class tokens to preserve the diversity and complexity of data. MTMC leverages the nuclear norm of singular values as a measure of manifold capacity, ensuring that the representation of samples remains informative and well-structured. This method enhances the discriminability of clusters, allowing the model to capture detailed semantic features and avoid the loss of critical information during clustering. Through theoretical analysis and extensive experiments on coarse- and fine-grained datasets, we demonstrate that MTMC outperforms existing GCD methods, improving both clustering accuracy and the estimation of category numbers. The integration of MTMC leads to more complete representations, better inter-class separability, and a reduction in dimensional collapse, establishing MTMC as a vital component for robust open-world learning. Code is in github.com/lytang63/MTMC.

Via

Access Paper or Ask Questions

OCRT: Boosting Foundation Models in the Open World with Object-Concept-Relation Triad

Mar 24, 2025

Luyao Tang, Yuxuan Yuan, Chaoqi Chen, Zeyu Zhang, Yue Huang, Kun Zhang

Abstract:Although foundation models (FMs) claim to be powerful, their generalization ability significantly decreases when faced with distribution shifts, weak supervision, or malicious attacks in the open world. On the other hand, most domain generalization or adversarial fine-tuning methods are task-related or model-specific, ignoring the universality in practical applications and the transferability between FMs. This paper delves into the problem of generalizing FMs to the out-of-domain data. We propose a novel framework, the Object-Concept-Relation Triad (OCRT), that enables FMs to extract sparse, high-level concepts and intricate relational structures from raw visual inputs. The key idea is to bind objects in visual scenes and a set of object-centric representations through unsupervised decoupling and iterative refinement. To be specific, we project the object-centric representations onto a semantic concept space that the model can readily interpret and estimate their importance to filter out irrelevant elements. Then, a concept-based graph, which has a flexible degree, is constructed to incorporate the set of concepts and their corresponding importance, enabling the extraction of high-order factors from informative concepts and facilitating relational reasoning among these concepts. Extensive experiments demonstrate that OCRT can substantially boost the generalizability and robustness of SAM and CLIP across multiple downstream tasks.

* Accepted by CVPR 2025

Via

Access Paper or Ask Questions

Single-View Graph Contrastive Learning with Soft Neighborhood Awareness

Dec 12, 2024

Qingqiang Sun, Chaoqi Chen, Ziyue Qiao, Xubin Zheng, Kai Wang

Abstract:Most graph contrastive learning (GCL) methods heavily rely on cross-view contrast, thus facing several concomitant challenges, such as the complexity of designing effective augmentations, the potential for information loss between views, and increased computational costs. To mitigate reliance on cross-view contrasts, we propose \ttt{SIGNA}, a novel single-view graph contrastive learning framework. Regarding the inconsistency between structural connection and semantic similarity of neighborhoods, we resort to soft neighborhood awareness for GCL. Specifically, we leverage dropout to obtain structurally-related yet randomly-noised embedding pairs for neighbors, which serve as potential positive samples. At each epoch, the role of partial neighbors is switched from positive to negative, leading to probabilistic neighborhood contrastive learning effect. Furthermore, we propose a normalized Jensen-Shannon divergence estimator for a better effect of contrastive learning. Surprisingly, experiments on diverse node-level tasks demonstrate that our simple single-view GCL framework consistently outperforms existing methods by margins of up to 21.74% (PPI). In particular, with soft neighborhood awareness, SIGNA can adopt MLPs instead of complicated GCNs as the encoder to generate representations in transductive learning tasks, thus speeding up its inference process by 109 times to 331 times. The source code is available at https://github.com/sunisfighting/SIGNA.

* Accepted by AAAI2025; full version including appendix

Via

Access Paper or Ask Questions

DIVE: Taming DINO for Subject-Driven Video Editing

Dec 04, 2024

Yi Huang, Wei Xiong, He Zhang, Chaoqi Chen, Jianzhuang Liu, Mingfu Yan, Shifeng Chen

Figure 1 for DIVE: Taming DINO for Subject-Driven Video Editing

Figure 2 for DIVE: Taming DINO for Subject-Driven Video Editing

Figure 3 for DIVE: Taming DINO for Subject-Driven Video Editing

Figure 4 for DIVE: Taming DINO for Subject-Driven Video Editing

Abstract:Building on the success of diffusion models in image generation and editing, video editing has recently gained substantial attention. However, maintaining temporal consistency and motion alignment still remains challenging. To address these issues, this paper proposes DINO-guided Video Editing (DIVE), a framework designed to facilitate subject-driven editing in source videos conditioned on either target text prompts or reference images with specific identities. The core of DIVE lies in leveraging the powerful semantic features extracted from a pretrained DINOv2 model as implicit correspondences to guide the editing process. Specifically, to ensure temporal motion consistency, DIVE employs DINO features to align with the motion trajectory of the source video. Extensive experiments on diverse real-world videos demonstrate that our framework can achieve high-quality editing results with robust motion consistency, highlighting the potential of DINO to contribute to video editing. For precise subject editing, DIVE incorporates the DINO features of reference images into a pretrained text-to-image model to learn Low-Rank Adaptations (LoRAs), effectively registering the target subject's identity. Project page: https://dino-video-editing.github.io

Via

Access Paper or Ask Questions

Bootstrap Segmentation Foundation Model under Distribution Shift via Object-Centric Learning

Aug 29, 2024

Luyao Tang, Yuxuan Yuan, Chaoqi Chen, Kunze Huang, Xinghao Ding, Yue Huang

Figure 1 for Bootstrap Segmentation Foundation Model under Distribution Shift via Object-Centric Learning

Figure 2 for Bootstrap Segmentation Foundation Model under Distribution Shift via Object-Centric Learning

Figure 3 for Bootstrap Segmentation Foundation Model under Distribution Shift via Object-Centric Learning

Figure 4 for Bootstrap Segmentation Foundation Model under Distribution Shift via Object-Centric Learning

Abstract:Foundation models have made incredible strides in achieving zero-shot or few-shot generalization, leveraging prompt engineering to mimic the problem-solving approach of human intelligence. However, when it comes to some foundation models like Segment Anything, there is still a challenge in performing well on out-of-distribution data, including camouflaged and medical images. Inconsistent prompting strategies during fine-tuning and testing further compound the issue, leading to decreased performance. Drawing inspiration from how human cognition processes new environments, we introduce SlotSAM, a method that reconstructs features from the encoder in a self-supervised manner to create object-centric representations. These representations are then integrated into the foundation model, bolstering its object-level perceptual capabilities while reducing the impact of distribution-related variables. The beauty of SlotSAM lies in its simplicity and adaptability to various tasks, making it a versatile solution that significantly enhances the generalization abilities of foundation models. Through limited parameter fine-tuning in a bootstrap manner, our approach paves the way for improved generalization in novel environments. The code is available at github.com/lytang63/SlotSAM.

* This work is accepted by ECCV 2024 EVAL-FoMo Workshop

Via

Access Paper or Ask Questions

Mixstyle-Entropy: Domain Generalization with Causal Intervention and Perturbation

Aug 22, 2024

Luyao Tang, Yuxuan Yuan, Chaoqi Chen, Xinghao Ding, Yue Huang

Figure 1 for Mixstyle-Entropy: Domain Generalization with Causal Intervention and Perturbation

Figure 2 for Mixstyle-Entropy: Domain Generalization with Causal Intervention and Perturbation

Figure 3 for Mixstyle-Entropy: Domain Generalization with Causal Intervention and Perturbation

Figure 4 for Mixstyle-Entropy: Domain Generalization with Causal Intervention and Perturbation

Abstract:Despite the considerable advancements achieved by deep neural networks, their performance tends to degenerate when the test environment diverges from the training ones. Domain generalization (DG) solves this issue by learning representations independent of domain-related information, thus facilitating extrapolation to unseen environments. Existing approaches typically focus on formulating tailored training objectives to extract shared features from the source data. However, the disjointed training and testing procedures may compromise robustness, particularly in the face of unforeseen variations during deployment. In this paper, we propose a novel and holistic framework based on causality, named InPer, designed to enhance model generalization by incorporating causal intervention during training and causal perturbation during testing. Specifically, during the training phase, we employ entropy-based causal intervention (EnIn) to refine the selection of causal variables. To identify samples with anti-interference causal variables from the target domain, we propose a novel metric, homeostatic score, through causal perturbation (HoPer) to construct a prototype classifier in test time. Experimental results across multiple cross-domain tasks confirm the efficacy of InPer.

* Accepted by BMVC2024

Via

Access Paper or Ask Questions

InPer: Whole-Process Domain Generalization via Causal Intervention and Perturbation

Aug 07, 2024

Luyao Tang, Yuxuan Yuan, Chaoqi Chen, Xinghao Ding, Yue Huang

Figure 1 for InPer: Whole-Process Domain Generalization via Causal Intervention and Perturbation

Figure 2 for InPer: Whole-Process Domain Generalization via Causal Intervention and Perturbation

Figure 3 for InPer: Whole-Process Domain Generalization via Causal Intervention and Perturbation

Figure 4 for InPer: Whole-Process Domain Generalization via Causal Intervention and Perturbation

* Accepted by BMVC2024

Via

Access Paper or Ask Questions

LaMamba-Diff: Linear-Time High-Fidelity Diffusion Models Based on Local Attention and Mamba

Aug 05, 2024

Yunxiang Fu, Chaoqi Chen, Yizhou Yu

Abstract:Recent Transformer-based diffusion models have shown remarkable performance, largely attributed to the ability of the self-attention mechanism to accurately capture both global and local contexts by computing all-pair interactions among input tokens. However, their quadratic complexity poses significant computational challenges for long-sequence inputs. Conversely, a recent state space model called Mamba offers linear complexity by compressing a filtered global context into a hidden state. Despite its efficiency, compression inevitably leads to information loss of fine-grained local dependencies among tokens, which are crucial for effective visual generative modeling. Motivated by these observations, we introduce Local Attentional Mamba (LaMamba) blocks that combine the strengths of self-attention and Mamba, capturing both global contexts and local details with linear complexity. Leveraging the efficient U-Net architecture, our model exhibits exceptional scalability and surpasses the performance of DiT across various model scales on ImageNet at 256x256 resolution, all while utilizing substantially fewer GFLOPs and a comparable number of parameters. Compared to state-of-the-art diffusion models on ImageNet 256x256 and 512x512, our largest model presents notable advantages, such as a reduction of up to 62\% GFLOPs compared to DiT-XL/2, while achieving superior performance with comparable or fewer parameters.

Via

Access Paper or Ask Questions

DreamDA: Generative Data Augmentation with Diffusion Models

Mar 19, 2024

Yunxiang Fu, Chaoqi Chen, Yu Qiao, Yizhou Yu

Abstract:The acquisition of large-scale, high-quality data is a resource-intensive and time-consuming endeavor. Compared to conventional Data Augmentation (DA) techniques (e.g. cropping and rotation), exploiting prevailing diffusion models for data generation has received scant attention in classification tasks. Existing generative DA methods either inadequately bridge the domain gap between real-world and synthesized images, or inherently suffer from a lack of diversity. To solve these issues, this paper proposes a new classification-oriented framework DreamDA, which enables data synthesis and label generation by way of diffusion models. DreamDA generates diverse samples that adhere to the original data distribution by considering training images in the original data as seeds and perturbing their reverse diffusion process. In addition, since the labels of the generated data may not align with the labels of their corresponding seed images, we introduce a self-training paradigm for generating pseudo labels and training classifiers using the synthesized data. Extensive experiments across four tasks and five datasets demonstrate consistent improvements over strong baselines, revealing the efficacy of DreamDA in synthesizing high-quality and diverse images with accurate labels. Our code will be available at https://github.com/yunxiangfu2001/DreamDA.

* 14 pages, 8 tables, 3 figures

Via

Access Paper or Ask Questions

Progressive Conservative Adaptation for Evolving Target Domains

Feb 07, 2024

Gangming Zhao, Chaoqi Chen, Wenhao He, Chengwei Pan, Chaowei Fang, Jinpeng Li, Xilin Chen, Yizhou Yu

Abstract:Conventional domain adaptation typically transfers knowledge from a source domain to a stationary target domain. However, in many real-world cases, target data usually emerge sequentially and have continuously evolving distributions. Restoring and adapting to such target data results in escalating computational and resource consumption over time. Hence, it is vital to devise algorithms to address the evolving domain adaptation (EDA) problem, \emph{i.e.,} adapting models to evolving target domains without access to historic target domains. To achieve this goal, we propose a simple yet effective approach, termed progressive conservative adaptation (PCAda). To manage new target data that diverges from previous distributions, we fine-tune the classifier head based on the progressively updated class prototypes. Moreover, as adjusting to the most recent target domain can interfere with the features learned from previous target domains, we develop a conservative sparse attention mechanism. This mechanism restricts feature adaptation within essential dimensions, thus easing the inference related to historical knowledge. The proposed PCAda is implemented with a meta-learning framework, which achieves the fast adaptation of the classifier with the help of the progressively updated class prototypes in the inner loop and learns a generalized feature without severely interfering with the historic knowledge via the conservative sparse attention in the outer loop. Experiments on Rotated MNIST, Caltran, and Portraits datasets demonstrate the effectiveness of our method.

* 7 pages, 5 figures

Via

Access Paper or Ask Questions