Abstract:Existing progress in object keypoint estimation primarily benefits from the conventional supervised learning paradigm, which relies on large amounts of data labeled with pre-defined keypoints. However, these well-trained models can hardly detect undefined new keypoints at test time, which largely hinders their feasibility for diverse downstream tasks. To handle this, various solutions have been explored, but they still suffer from limited generalizability or transferability. Therefore, in this paper, we explore a novel keypoint learning paradigm, called Incremental object Keypoint Learning (IKL), in which we annotate only the new keypoints in the new data and incrementally train the model without retaining any old data. We develop a two-stage learning scheme as a novel baseline tailored to IKL. In the first stage, Knowledge Association, given data labeled with only new keypoints, an auxiliary KA-Net is trained to automatically associate the old keypoints with the new ones based on their spatial and intrinsic anatomical relations. In the second stage, Mutual Promotion, based on a keypoint-oriented spatial distillation loss, we jointly leverage the auxiliary KA-Net and the old model for knowledge consolidation to mutually promote the estimation of all old and new keypoints. By exploiting the correlations between new and old keypoints, our method not only effectively mitigates the catastrophic forgetting of old keypoints but may even further improve their estimation, achieving a positive transfer beyond anti-forgetting. This observation is verified by extensive experiments on different keypoint datasets, where our method exhibits superiority in alleviating forgetting and boosting performance while remaining label-efficient even under the low-shot data regime.
Abstract:Pre-trained on massive collections of image-text pairs, vision-language models like CLIP have demonstrated promising zero-shot generalization across numerous image-based tasks. However, extending these capabilities to video tasks remains challenging due to limited labeled video data and high training costs. Recent video prompting methods attempt to adapt CLIP for video tasks by introducing learnable prompts, but they typically rely on a single static prompt for all video sequences, overlooking the diverse temporal dynamics and spatial variations that exist across frames. This limitation significantly hinders the model's ability to capture essential temporal information for effective video understanding. To address this, we propose an integrated Spatial-TempOral dynamic Prompting (STOP) model, which consists of two complementary modules: intra-frame spatial prompting and inter-frame temporal prompting. Our intra-frame spatial prompts are designed to adaptively highlight discriminative regions within each frame by leveraging intra-frame attention and temporal variation, allowing the model to focus on areas with substantial temporal dynamics and capture fine-grained spatial details. Additionally, to highlight the varying importance of frames for video understanding, we further introduce inter-frame temporal prompts, dynamically inserting prompts between frames with high temporal variance as measured by frame similarity. This enables the model to prioritize key frames and enhances its capacity to understand temporal dependencies across sequences. Extensive experiments on various video benchmarks demonstrate that STOP consistently outperforms state-of-the-art methods. The code is available at https://github.com/zhoujiahuan1991/CVPR2025-STOP.
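As a rough illustration of choosing where inter-frame prompts go, the sketch below scores each pair of consecutive frames by cosine similarity and picks the least similar pairs as insertion points. The cosine measure, the top-k rule, and the names `cosine` and `prompt_positions` are assumptions made for this example; they are not taken from the STOP code.

```python
import math

def cosine(u, v):
    """Cosine similarity between two feature vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def prompt_positions(frame_feats, k):
    """Return indices i meaning 'insert a prompt between frame i and frame i+1'.

    Hypothetical rule: pick the k consecutive pairs with the lowest similarity,
    i.e. the largest frame-to-frame change.
    """
    sims = [cosine(frame_feats[i], frame_feats[i + 1])
            for i in range(len(frame_feats) - 1)]
    least_similar = sorted(range(len(sims)), key=lambda i: sims[i])[:k]
    return sorted(least_similar)

# Toy per-frame features: the content changes sharply between frames 1 and 2.
frames = [[1.0, 0.0], [1.0, 0.1], [0.0, 1.0], [0.1, 1.0]]
positions = prompt_positions(frames, k=1)
print(positions)
```

In this toy input the only large change is between the second and third frames, so that gap is the one selected.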
Abstract:Vision-language models (VLMs) encounter considerable challenges when adapting to domain shifts stemming from changes in data distribution. Test-time adaptation (TTA) has emerged as a promising approach to enhance VLM performance under such conditions. In practice, test data often arrives in batches, leading to increasing interest in the transductive TTA setting. However, existing TTA methods primarily focus on individual test samples, overlooking crucial cross-sample correlations within a batch. While recent ViT-based TTA methods have introduced batch-level adaptation, they remain suboptimal for VLMs due to inadequate integration of the text modality. To address these limitations, we propose a novel transductive TTA framework, Supportive Clique-based Attribute Prompting (SCAP), which effectively combines visual and textual information to enhance adaptation by generating fine-grained attribute prompts across test batches. SCAP first forms supportive cliques of test samples in an unsupervised manner based on visual similarity and learns an attribute prompt for each clique, capturing shared attributes critical for adaptation. For each test sample, SCAP aggregates attribute prompts from its associated cliques, providing enriched contextual information. To ensure adaptability over time, we incorporate a retention module that dynamically updates attribute prompts and their associated attributes as new data arrives. Comprehensive experiments across multiple benchmarks demonstrate that SCAP outperforms existing state-of-the-art methods, significantly advancing VLM generalization under domain shifts. Our code is available at https://github.com/zhoujiahuan1991/CVPR2025-SCAP.
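The clique-then-aggregate idea can be sketched as follows, under stated assumptions: samples are linked when their visual similarity exceeds a threshold, groups are the maximal cliques of that graph (found with the Bron-Kerbosch algorithm), and a sample's prompt is the mean of the prompts of the cliques containing it. The threshold, the use of maximal cliques, the mean aggregation, and all names and numbers here are hypothetical and may differ from what SCAP actually does.

```python
def maximal_cliques(adj):
    """Enumerate maximal cliques of an undirected graph (basic Bron-Kerbosch)."""
    cliques = []

    def expand(r, p, x):
        if not p and not x:
            cliques.append(tuple(sorted(r)))
            return
        for v in list(p):
            expand(r | {v}, p & adj[v], x & adj[v])
            p = p - {v}
            x = x | {v}

    expand(set(), set(adj), set())
    return sorted(cliques)

def build_graph(sim, threshold):
    """Link samples i and j when their similarity exceeds the threshold."""
    n = len(sim)
    return {i: {j for j in range(n) if j != i and sim[i][j] > threshold}
            for i in range(n)}

def aggregate_for_sample(i, clique_prompts):
    """Average the prompts of every clique that contains sample i."""
    own = [p for members, p in clique_prompts.items() if i in members]
    dim = len(own[0])
    return [sum(p[d] for p in own) / len(own) for d in range(dim)]

# Toy pairwise visual similarities for four test samples (made-up numbers).
sim = [[1.0, 0.9, 0.8, 0.1],
       [0.9, 1.0, 0.85, 0.2],
       [0.8, 0.85, 1.0, 0.7],
       [0.1, 0.2, 0.7, 1.0]]
cliques = maximal_cliques(build_graph(sim, threshold=0.6))

# Placeholder "learned" attribute prompts, one per clique.
clique_prompts = {(0, 1, 2): [1.0, 0.0], (2, 3): [0.0, 1.0]}
prompt_for_2 = aggregate_for_sample(2, clique_prompts)
print(cliques, prompt_for_2)
```

Sample 2 belongs to both cliques, so its aggregated prompt mixes both attribute prompts, whereas samples 0, 1, and 3 each inherit a single clique's prompt.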
Abstract:Vision-centric autonomous driving systems require diverse data for robust training and evaluation, which can be augmented by manipulating object positions and appearances within existing scene captures. While recent advancements in diffusion models have shown promise in video editing, their application to object manipulation in driving scenarios remains challenging due to imprecise positional control and difficulties in preserving high-fidelity object appearances. To address these challenges in position and appearance control, we introduce DriveEditor, a diffusion-based framework for object editing in driving videos. DriveEditor offers a unified framework for comprehensive object editing operations, including repositioning, replacement, deletion, and insertion. These diverse manipulations are all achieved through a shared set of varying inputs, processed by identical position control and appearance maintenance modules. The position control module projects the given 3D bounding box while preserving depth information and hierarchically injects it into the diffusion process, enabling precise control over object position and orientation. The appearance maintenance module preserves consistent attributes with a single reference image by employing a three-tiered approach: low-level detail preservation, high-level semantic maintenance, and the integration of 3D priors from a novel view synthesis model. Extensive qualitative and quantitative evaluations on the nuScenes dataset demonstrate DriveEditor's exceptional fidelity and controllability in generating diverse driving scene edits, as well as its remarkable ability to facilitate downstream tasks.
Abstract:Lifelong person re-identification (LReID) is an important but challenging task that suffers from catastrophic forgetting due to significant domain gaps between training steps. Existing LReID approaches typically rely on data replay and knowledge distillation to mitigate this issue. However, data replay methods compromise data privacy by storing historical exemplars, while knowledge distillation methods suffer from limited performance due to the cumulative forgetting of undistilled knowledge. To overcome these challenges, we propose a novel paradigm that models and rehearses the distribution of the old domains to enhance knowledge consolidation while learning new data, providing a strong anti-forgetting capacity without storing any exemplars. Specifically, we introduce an exemplar-free LReID method called Distribution Rehearsing via Adaptive Style Kernel Learning (DASK). DASK includes a Distribution Rehearser Learning (DRL) mechanism that learns to transform data of arbitrary distributions into the current data style at each learning step. To enhance the style transfer capacity of DRL, an Adaptive Kernel Prediction Network (AKPNet) is explored to achieve an instance-specific distribution adjustment. Additionally, we design a Distribution Rehearsing-driven LReID Training module, which rehearses the old distribution based on the new data via the old AKPNet model, achieving effective new-old knowledge accumulation under a joint knowledge consolidation scheme. Experimental results show that DASK outperforms existing methods by 3.6%-6.8% and 4.5%-6.5% on anti-forgetting and generalization capacity, respectively. Our code is available at https://github.com/zhoujiahuan1991/AAAI2025-DASK.
Abstract:Recently, prompt tuning methods for pre-trained models have demonstrated promising performance in Class Incremental Learning (CIL). These methods typically involve learning task-specific prompts and predicting the task ID to select the appropriate prompts for inference. However, inaccurate task ID predictions can cause severe inconsistencies between the prompts used during training and inference, leading to knowledge forgetting and performance degradation. Additionally, existing prompt tuning methods rely solely on the pre-trained model to predict task IDs, without fully leveraging the knowledge embedded in the learned prompt parameters, resulting in inferior prediction performance. To address these issues, we propose a novel Cyclic Prompt Aggregation (CAPrompt) method that eliminates the dependency on task ID prediction by cyclically aggregating the knowledge from different prompts. Specifically, rather than predicting task IDs, we introduce an innovative prompt aggregation strategy during both training and inference to overcome prompt inconsistency by utilizing a weighted sum of different prompts. Thorough theoretical analysis demonstrates that under concave conditions, the aggregated prompt achieves lower error compared to selecting a single task-specific prompt. Consequently, we incorporate a concave constraint and a linear constraint to guide prompt learning, ensuring compliance with the concave condition requirement. Furthermore, to fully exploit the prompts and achieve more accurate prompt weights, we develop a cyclic weight prediction strategy. This strategy begins with equal weights for each task and automatically adjusts them to more appropriate values in a cyclical manner. Experiments on various datasets demonstrate that our proposed CAPrompt outperforms state-of-the-art methods by 2%-3%. Our code is available at https://github.com/zhoujiahuan1991/AAAI2025-CAPrompt.
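The role of the concave condition can be illustrated with Jensen's inequality: for a concave score function, the score of a weighted sum of prompts is at least the weighted average of the individual prompts' scores. The sketch below checks this numerically on a toy example. The score function `toy_score`, the prompt vectors, and the equal initial weights are illustrative assumptions, not the paper's formulation.

```python
def aggregate(prompts, weights):
    """Weighted sum of prompt vectors, with weights normalized to sum to 1."""
    total = sum(weights)
    dim = len(prompts[0])
    return [sum(w * p[d] for w, p in zip(weights, prompts)) / total
            for d in range(dim)]

def toy_score(prompt):
    """Hypothetical concave score: negative squared distance to a target prompt."""
    target = [1.0, 1.0]
    return -sum((a - b) ** 2 for a, b in zip(prompt, target))

# Three task-specific prompts (made-up vectors), starting from equal weights.
prompts = [[0.0, 2.0], [2.0, 0.0], [1.0, 1.5]]
weights = [1.0, 1.0, 1.0]

agg = aggregate(prompts, weights)
score_agg = toy_score(agg)
# Expected score when one prompt is picked with probability proportional to its weight.
expected_single = sum(w * toy_score(p) for w, p in zip(weights, prompts)) / sum(weights)

print(score_agg, expected_single)
```

Because `toy_score` is concave, `score_agg` is never below `expected_single`. The abstract's theoretical claim has the same shape, which is why the method adds constraints that keep prompt learning within the concave regime.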
Abstract:Pre-trained Vision Mamba (Vim) models have demonstrated exceptional performance across various computer vision tasks in a computationally efficient manner, attributed to their unique design of selective state space models. To further extend their applicability to diverse downstream vision tasks, Vim models can be adapted using the efficient fine-tuning technique known as visual prompting. However, existing visual prompting methods are predominantly tailored for Vision Transformer (ViT)-based models that leverage global attention, neglecting the distinctive sequential token-wise compression and propagation characteristics of Vim. Specifically, existing prompt tokens prefixed to the sequence are insufficient to effectively activate the input and forget gates across the entire sequence, hindering the extraction and propagation of discriminative information. To address this limitation, we introduce a novel Selective Visual Prompting (SVP) method specifically for the efficient fine-tuning of Vim. To prevent the loss of discriminative information during state space propagation, SVP employs lightweight selective prompters for token-wise prompt generation, ensuring adaptive activation of the update and forget gates within Mamba blocks to promote discriminative information propagation. Moreover, considering that Vim propagates both shared cross-layer information and specific inner-layer information, we further refine SVP with a dual-path structure: Cross-Prompting and Inner-Prompting. Cross-Prompting utilizes shared parameters across layers, while Inner-Prompting employs distinct parameters, promoting the propagation of both shared and specific information, respectively. Extensive experimental results on various large-scale benchmarks demonstrate that our proposed SVP significantly outperforms state-of-the-art methods. Our code is available at https://github.com/zhoujiahuan1991/AAAI2025-SVP.
Abstract:Training Single-Image Super-Resolution (SISR) models using pixel-based regression losses can achieve high distortion metric scores (e.g., PSNR and SSIM), but often results in blurry images due to insufficient recovery of high-frequency details. Conversely, using GAN or perceptual losses can produce sharp images with high perceptual metric scores (e.g., LPIPS), but may introduce artifacts and incorrect textures. Balancing these two types of losses can help achieve a trade-off between distortion and perception, but the challenge lies in tuning the loss function weights. To address this issue, we propose a novel method that incorporates Multi-Objective Optimization (MOO) into the training process of SISR models to balance perceptual quality and distortion. We conceptualize the relationship between loss weights and image quality assessment (IQA) metrics as black-box objective functions to be optimized within our Multi-Objective Bayesian Optimization Super-Resolution (MOBOSR) framework. This approach automates the hyperparameter tuning process, reduces overall computational cost, and enables the use of numerous loss functions simultaneously. Extensive experiments demonstrate that MOBOSR outperforms state-of-the-art methods in terms of both perceptual quality and distortion, significantly advancing the perception-distortion Pareto frontier. Our work points towards a new direction for future research on balancing perceptual quality and fidelity in nearly all image restoration tasks. The source code and pretrained models are available at: https://github.com/ZhuKeven/MOBOSR.
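To make the notion of a perception-distortion Pareto frontier concrete, the sketch below filters a set of candidate results down to the non-dominated ones, treating higher PSNR and lower LPIPS as better. It shows only the dominance test, not Bayesian optimization, and the (PSNR, LPIPS) pairs are made-up toy numbers rather than results from the paper.

```python
def dominates(a, b):
    """True if a is at least as good as b on both metrics and strictly better on one.

    Each point is (psnr, lpips): higher PSNR is better, lower LPIPS is better.
    """
    no_worse = a[0] >= b[0] and a[1] <= b[1]
    strictly_better = a[0] > b[0] or a[1] < b[1]
    return no_worse and strictly_better

def pareto_front(points):
    """Keep the points that no other point dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]

# Hypothetical outcomes of five loss-weight settings (toy values only).
candidates = [(30.0, 0.20), (29.0, 0.12), (28.5, 0.15), (27.0, 0.10), (29.5, 0.25)]
front = pareto_front(candidates)
print(front)
```

Each surviving point represents a different trade-off between distortion and perception. A multi-objective search over loss weights aims to push this set outward instead of returning a single best setting.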
Abstract:Analysis of human actions in videos demands understanding complex human dynamics, as well as the interaction between actors and context. However, these interaction relationships usually exhibit large intra-class variations arising from diverse human poses or object manipulations, as well as fine-grained inter-class differences between similar actions. As a result, the performance of existing methods is severely limited. Motivated by the observation that interactive actions can be decomposed into actor dynamics and participating objects or humans, we propose to investigate their compositional property. In this paper, we present a novel Grammatical Compositional Model (GCM) for action detection based on typical And-Or graphs. Our model exploits the intrinsic structures and latent relationships of actions in a hierarchical manner to harness both the compositionality of grammar models and the capability of DNNs to express rich features. The proposed model can be readily embodied into a neural network module for efficient end-to-end optimization. Extensive experiments on the AVA dataset and the Something-Else task demonstrate the superiority of our model, while an inference parsing procedure enhances its interpretability.
Abstract:It is widely known that deep neural networks are susceptible to being fooled by adversarial examples with perturbations imperceptible to humans. Various defenses have been proposed to improve adversarial robustness, among which adversarial training methods are the most effective. However, most of these methods treat the training samples independently and demand a tremendous number of samples to train a robust network, while ignoring the latent structural information among these samples. In this work, we propose a novel Local Structure Preserving (LSP) regularization, which aims to preserve the local structure of the input space in the learned embedding space. In this manner, the attacking effect of adversarial samples lying in the vicinity of clean samples can be alleviated. We show strong empirical evidence that, with or without adversarial training, our method consistently improves adversarial robustness on several image classification datasets compared to the baselines and some state-of-the-art approaches, thus providing a promising direction for future research.
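One simple way to express "preserving the local structure of the input space in the embedding space" is to penalize changes in the distances between each sample and its nearest input-space neighbors. The sketch below does exactly that. It is an assumed, generic form chosen for illustration: the k-nearest-neighbor rule, the squared-difference penalty, and the function names are not taken from the paper.

```python
import math

def dist(u, v):
    """Euclidean distance between two vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def local_structure_penalty(inputs, embeddings, k):
    """Mean squared mismatch between input-space and embedding-space distances,
    measured only over each sample's k nearest neighbors in input space."""
    n = len(inputs)
    total, count = 0.0, 0
    for i in range(n):
        neighbors = sorted((j for j in range(n) if j != i),
                           key=lambda j: dist(inputs[i], inputs[j]))[:k]
        for j in neighbors:
            gap = dist(inputs[i], inputs[j]) - dist(embeddings[i], embeddings[j])
            total += gap ** 2
            count += 1
    return total / count if count else 0.0

inputs = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [5.0, 5.0]]

# An embedding identical to the input keeps every neighborhood intact.
penalty_identity = local_structure_penalty(inputs, inputs, k=2)

# An embedding that pushes close neighbors apart is penalized.
stretched = [[0.0, 0.0], [4.0, 0.0], [0.0, 4.0], [5.0, 5.0]]
penalty_stretched = local_structure_penalty(inputs, stretched, k=2)
print(penalty_identity, penalty_stretched)
```

A term of this kind would be added to the training loss as a regularizer, so that a perturbed sample near a clean one in input space also stays near it in the embedding space.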