National Laboratory of Pattern Recognition, Institute of Automation, CAS, Beijing, China, Fanyu AI Laboratory, Zhongke Fanyu Technology Co., Ltd, Beijing, China
Abstract:In this work, we present DEcoupLEd Distillation To Erase (DELETE), a general and strong unlearning method for any class-centric tasks. To derive this, we first propose a theoretical framework to analyze the general form of unlearning loss and decompose it into forgetting and retention terms. Through the theoretical framework, we point out that a class of previous methods could be mainly formulated as a loss that implicitly optimizes the forgetting term while lacking supervision for the retention term, disturbing the distribution of pre-trained model and struggling to adequately preserve knowledge of the remaining classes. To address it, we refine the retention term using "dark knowledge" and propose a mask distillation unlearning method. By applying a mask to separate forgetting logits from retention logits, our approach optimizes both the forgetting and refined retention components simultaneously, retaining knowledge of the remaining classes while ensuring thorough forgetting of the target class. Without access to the remaining data or intervention (i.e., used in some works), we achieve state-of-the-art performance across various benchmarks. What's more, DELETE is a general solution that can be applied to various downstream tasks, including face recognition, backdoor defense, and semantic segmentation with great performance.
Abstract:Text images are unique in their dual nature, encompassing both visual and linguistic information. The visual component encompasses structural and appearance-based features, while the linguistic dimension incorporates contextual and semantic elements. In scenarios with degraded visual quality, linguistic patterns serve as crucial supplements for comprehension, highlighting the necessity of integrating both aspects for robust scene text recognition (STR). Contemporary STR approaches often use language models or semantic reasoning modules to capture linguistic features, typically requiring large-scale annotated datasets. Self-supervised learning, which lacks annotations, presents challenges in disentangling linguistic features related to the global context. Typically, sequence contrastive learning emphasizes the alignment of local features, while masked image modeling (MIM) tends to exploit local structures to reconstruct visual patterns, resulting in limited linguistic knowledge. In this paper, we propose a Linguistics-aware Masked Image Modeling (LMIM) approach, which channels the linguistic information into the decoding process of MIM through a separate branch. Specifically, we design a linguistics alignment module to extract vision-independent features as linguistic guidance using inputs with different visual appearances. As features extend beyond mere visual structures, LMIM must consider the global context to achieve reconstruction. Extensive experiments on various benchmarks quantitatively demonstrate our state-of-the-art performance, and attention visualizations qualitatively show the simultaneous capture of both visual and linguistic information.
Abstract:Incremental object detection (IOD) aims to cultivate an object detector that can continuously localize and recognize novel classes while preserving its performance on previous classes. Existing methods achieve certain success by improving knowledge distillation and exemplar replay for transformer-based detection frameworks, but the intrinsic forgetting mechanisms remain underexplored. In this paper, we dive into the cause of forgetting and discover forgetting imbalance between localization and recognition in transformer-based IOD, which means that localization is less-forgetting and can generalize to future classes, whereas catastrophic forgetting occurs primarily on recognition. Based on these insights, we propose a Divide-and-Conquer Amnesia (DCA) strategy, which redesigns the transformer-based IOD into a localization-then-recognition process. DCA can well maintain and transfer the localization ability, leaving decoupled fragile recognition to be specially conquered. To reduce feature drift in recognition, we leverage semantic knowledge encoded in pre-trained language models to anchor class representations within a unified feature space across incremental tasks. This involves designing a duplex classifier fusion and embedding class semantic features into the recognition decoding process in the form of queries. Extensive experiments validate that our approach achieves state-of-the-art performance, especially for long-term incremental scenarios. For example, under the four-step setting on MS-COCO, our DCA strategy significantly improves the final AP by 6.9%.
Abstract:This paper focuses on an integrated sensing and communication (ISAC) system in the presence of signal-dependent modulated jamming (SDMJ). Our goal is to suppress jamming while carrying out simultaneous communications and sensing. We minimize the integrated sidelobe level (ISL) of the mismatch filter output for the transmitted waveform and the integrated level (IL) of the mismatch filter output for the jamming, under the constraints of the loss in-processing gain (LPG) and the peak-to-average power ratio (PAPR) of the transmitted waveform. Meanwhile, the similarity constraint is introduced for information-bearing transmit waveform. We develop a decoupled majorization minimization (DMM) algorithm to solve the proposed multi-constrained optimization problem. In contrast to the existing approaches, the proposed algorithm transforms the difficult optimization problem involving two variables into two parallel sub-problems with one variable, thus significantly speeding up the convergence rate. Furthermore, fast Fourier transform (FFT) is introduced to compute the closed-form solution of each sub-problem, giving rise to a greatly reduced computation complexity. Simulation results demonstrate the capabilities of the proposed ISAC system which strikes a proper trade-off among sensing and jamming suppression.
Abstract:LLMs have achieved remarkable fluency and coherence in text generation, yet their widespread adoption has raised concerns about content reliability and accountability. In high-stakes domains such as healthcare, law, and news, it is crucial to understand where and how the content is created. To address this, we introduce the Text pROVEnance (TROVE) challenge, designed to trace each sentence of a target text back to specific source sentences within potentially lengthy or multi-document inputs. Beyond identifying sources, TROVE annotates the fine-grained relationships (quotation, compression, inference, and others), providing a deep understanding of how each target sentence is formed. To benchmark TROVE, we construct our dataset by leveraging three public datasets covering 11 diverse scenarios (e.g., QA and summarization) in English and Chinese, spanning source texts of varying lengths (0-5k, 5-10k, 10k+), emphasizing the multi-document and long-document settings essential for provenance. To ensure high-quality data, we employ a three-stage annotation process: sentence retrieval, GPT provenance, and human provenance. We evaluate 11 LLMs under direct prompting and retrieval-augmented paradigms, revealing that retrieval is essential for robust performance, larger models perform better in complex relationship classification, and closed-source models often lead, yet open-source models show significant promise, particularly with retrieval augmentation.
Abstract:We present Step-Video-TI2V, a state-of-the-art text-driven image-to-video generation model with 30B parameters, capable of generating videos up to 102 frames based on both text and image inputs. We build Step-Video-TI2V-Eval as a new benchmark for the text-driven image-to-video task and compare Step-Video-TI2V with open-source and commercial TI2V engines using this dataset. Experimental results demonstrate the state-of-the-art performance of Step-Video-TI2V in the image-to-video generation task. Both Step-Video-TI2V and Step-Video-TI2V-Eval are available at https://github.com/stepfun-ai/Step-Video-TI2V.
Abstract:The symplectic geometry mode decomposition (SGMD) is a powerful method for analyzing time sequences. The SGMD is based on the upper conversion via embedding and down conversion via diagonal averaging principle (DAP) inherited from the singular spectrum analysis (SSA). However, there are two defects in the DAP: it just hold for the time delay $\tau=1$ in the trajectory matrix and it fails for the time sequence of type-1 with the form $X=\{x[n]\}^N_{n=1}$. In order to overcome these disadvantages, the inverse step for embedding is explored with binary Diophantine equation in number theory. The contributions of this work lie in three aspects: firstly, the pulling back theorem is proposed and proved, which state the general formula for converting the component of trajectory matrix to the component of time sequence for the general representation of time sequence and for any time delay $\tau\ge 1$; secondly a unified framework for decomposing both the deterministic and random time sequences into multiple modes is presented and explained; finally, the guidance of configuring the time delay is suggested, namely the time delay should be selected in a limited range via balancing the efficiency of matrix computation and accuracy of state estimation. It could be expected that the pulling back theorem will help the researchers and engineers to deepen the understanding of the theory and extend the applications of the SGMD and SSA in analyzing time sequences.
Abstract:Federated Learning (FL) enables multiple clients to collaboratively develop a global model while maintaining data privacy. However, online FL deployment faces challenges due to distribution shifts and evolving test samples. Personalized Federated Learning (PFL) tailors the global model to individual client distributions, but struggles with Out-Of-Distribution (OOD) samples during testing, leading to performance degradation. In real-world scenarios, balancing personalization and generalization during online testing is crucial and existing methods primarily focus on training-phase generalization. To address the test-time trade-off, we introduce a new scenario: Test-time Generalization for Internal and External Distributions in Federated Learning (TGFL), which evaluates adaptability under Internal Distribution (IND) and External Distribution (EXD). We propose BTFL, a Bayesian-based test-time generalization method for TGFL, which balances generalization and personalization at the sample level during testing. BTFL employs a two-head architecture to store local and global knowledge, interpolating predictions via a dual-Bayesian framework that considers both historical test data and current sample characteristics with theoretical guarantee and faster speed. Our experiments demonstrate that BTFL achieves improved performance across various datasets and models with less time cost. The source codes are made publicly available at https://github.com/ZhouYuCS/BTFL .
Abstract:Inverse kinematics is a fundamental technique for motion and positioning control in robotics, typically applied to end-effectors. In this paper, we extend the concept of inverse kinematics to guiding vector fields for path following in autonomous mobile robots. The desired path is defined by its implicit equation, i.e., by a collection of points belonging to one or more zero-level sets. These level sets serve as a reference to construct an error signal that drives the guiding vector field toward the desired path, enabling the robot to converge and travel along the path by following such a vector field. We start with the formal exposition on how inverse kinematics can be applied to guiding vector fields for single-integrator robots in an m-dimensional Euclidean space. Then, we leverage inverse kinematics to ensure that the level-set error signal behaves as a linear system, facilitating control over the robot's transient motion toward the desired path and allowing for the injection of feed-forward signals to induce precise motion behavior along the path. We then propose solutions to the theoretical and practical challenges of applying this technique to unicycles with constant speeds to follow 2D paths with precise transient control. We finish by validating the predicted theoretical results through real flights with fixed-wing drones.
Abstract:Large multimodal models (LMMs) often struggle to recognize novel concepts, as they rely on pre-trained knowledge and have limited ability to capture subtle visual details. Domain-specific knowledge gaps in training also make them prone to confusing visually similar, commonly misrepresented, or low-resource concepts. To help LMMs better align nuanced visual features with language, improving their ability to recognize and reason about novel or rare concepts, we propose a Contrastive visual Data Augmentation (CoDA) strategy. CoDA extracts key contrastive textual and visual features of target concepts against the known concepts they are misrecognized as, and then uses multimodal generative models to produce targeted synthetic data. Automatic filtering of extracted features and augmented images is implemented to guarantee their quality, as verified by human annotators. We show the effectiveness and efficiency of CoDA on low-resource concept and diverse scene recognition datasets including INaturalist and SUN. We additionally collect NovelSpecies, a benchmark dataset consisting of newly discovered animal species that are guaranteed to be unseen by LMMs. LLaVA-1.6 1-shot updating results on these three datasets show CoDA significantly improves SOTA visual data augmentation strategies by 12.3% (NovelSpecies), 5.1% (SUN), and 6.0% (iNat) absolute gains in accuracy.