Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Yue Lu

Why 1 + 1 < 1 in Visual Token Pruning: Beyond Naive Integration via Multi-Objective Balanced Covering

May 15, 2025

Yangfu Li, Hongjian Zhan, Tianyi Chen, Qi Liu, Yue Lu

Abstract:Existing visual token pruning methods target prompt alignment and visual preservation with static strategies, overlooking the varying relative importance of these objectives across tasks, which leads to inconsistent performance. To address this, we derive the first closed-form error bound for visual token pruning based on the Hausdorff distance, uniformly characterizing the contributions of both objectives. Moreover, leveraging $\epsilon$-covering theory, we reveal an intrinsic trade-off between these objectives and quantify their optimal attainment levels under a fixed budget. To practically handle this trade-off, we propose Multi-Objective Balanced Covering (MoB), which reformulates visual token pruning as a bi-objective covering problem. In this framework, the attainment trade-off reduces to budget allocation via greedy radius trading. MoB offers a provable performance bound and linear scalability with respect to the number of input visual tokens, enabling adaptation to challenging pruning scenarios. Extensive experiments show that MoB preserves 96.4% of performance for LLaVA-1.5-7B using only 11.1% of the original visual tokens and accelerates LLaVA-Next-7B by 1.3-1.5$\times$ with negligible performance loss. Additionally, evaluations on Qwen2-VL and Video-LLaVA confirm that MoB integrates seamlessly into advanced MLLMs and diverse vision-language tasks.

* 31 pages,9 figures,conference

Via

Access Paper or Ask Questions

Fractional Correspondence Framework in Detection Transformer

Mar 06, 2025

Masoumeh Zareapoor, Pourya Shamsolmoali, Huiyu Zhou, Yue Lu, Salvador García

Abstract:The Detection Transformer (DETR), by incorporating the Hungarian algorithm, has significantly simplified the matching process in object detection tasks. This algorithm facilitates optimal one-to-one matching of predicted bounding boxes to ground-truth annotations during training. While effective, this strict matching process does not inherently account for the varying densities and distributions of objects, leading to suboptimal correspondences such as failing to handle multiple detections of the same object or missing small objects. To address this, we propose the Regularized Transport Plan (RTP). RTP introduces a flexible matching strategy that captures the cost of aligning predictions with ground truths to find the most accurate correspondences between these sets. By utilizing the differentiable Sinkhorn algorithm, RTP allows for soft, fractional matching rather than strict one-to-one assignments. This approach enhances the model's capability to manage varying object densities and distributions effectively. Our extensive evaluations on the MS-COCO and VOC benchmarks demonstrate the effectiveness of our approach. RTP-DETR, surpassing the performance of the Deform-DETR and the recently introduced DINO-DETR, achieving absolute gains in mAP of +3.8% and +1.7%, respectively.

* ACMMM-2024

Via

Access Paper or Ask Questions

Mamba-CL: Optimizing Selective State Space Model in Null Space for Continual Learning

Nov 23, 2024

De Cheng, Yue Lu, Lingfeng He, Shizhou Zhang, Xi Yang, Nannan Wang, Xinbo Gao

Abstract:Continual Learning (CL) aims to equip AI models with the ability to learn a sequence of tasks over time, without forgetting previously learned knowledge. Recently, State Space Models (SSMs), particularly the Mamba model, have achieved notable success in computer vision. Building on the strengths of SSMs, this study explores leveraging the Mamba model for CL. Therefore, we introduce Mamba-CL, a framework that continuously fine-tunes the core SSMs of the large-scale Mamba foundation model by updating parameters orthogonal to the feature subspace of previous tasks. This approach theoretically guarantees the consistency objective aiming to preserves consistent output for each SSM module across both previous and current tasks, so as to overcome catastrophic forgetting issue. Specifically, we achieve this goal by deducing the overall consistency constraints on four key time-invariant parameters in the Mamba model, streamlining its recurrent state-space structure and non-linear discretization process in SSM. In practice, we apply the null-space projection to efficiently implement the orthogonality within Mamba model. Extensive experiments on four class-incremental benchmarks demonstrate the effectiveness of Mamba-CL for anti-forgetting, achieving superior performances to state-of-the-art methods. Code is available in the supplementary materials.

Via

Access Paper or Ask Questions

Visual Prompt Tuning in Null Space for Continual Learning

Jun 11, 2024

Yue Lu, Shizhou Zhang, De Cheng, Yinghui Xing, Nannan Wang, Peng Wang, Yanning Zhang

Figure 1 for Visual Prompt Tuning in Null Space for Continual Learning

Figure 2 for Visual Prompt Tuning in Null Space for Continual Learning

Figure 3 for Visual Prompt Tuning in Null Space for Continual Learning

Figure 4 for Visual Prompt Tuning in Null Space for Continual Learning

Abstract:Existing prompt-tuning methods have demonstrated impressive performances in continual learning (CL), by selecting and updating relevant prompts in the vision-transformer models. On the contrary, this paper aims to learn each task by tuning the prompts in the direction orthogonal to the subspace spanned by previous tasks' features, so as to ensure no interference on tasks that have been learned to overcome catastrophic forgetting in CL. However, different from the orthogonal projection in the traditional CNN architecture, the prompt gradient orthogonal projection in the ViT architecture shows completely different and greater challenges, i.e., 1) the high-order and non-linear self-attention operation; 2) the drift of prompt distribution brought by the LayerNorm in the transformer block. Theoretically, we have finally deduced two consistency conditions to achieve the prompt gradient orthogonal projection, which provide a theoretical guarantee of eliminating interference on previously learned knowledge via the self-attention mechanism in visual prompt tuning. In practice, an effective null-space-based approximation solution has been proposed to implement the prompt gradient orthogonal projection. Extensive experimental results demonstrate the effectiveness of anti-forgetting on four class-incremental benchmarks with diverse pre-trained baseline models, and our approach achieves superior performances to state-of-the-art methods. Our code is available at https://github.com/zugexiaodui/VPTinNSforCL.

* 20 pages, 10 figures

Via

Access Paper or Ask Questions

Fragile Model Watermark for integrity protection: leveraging boundary volatility and sensitive sample-pairing

Apr 11, 2024

ZhenZhe Gao, Zhenjun Tang, Zhaoxia Yin, Baoyuan Wu, Yue Lu

Figure 1 for Fragile Model Watermark for integrity protection: leveraging boundary volatility and sensitive sample-pairing

Figure 2 for Fragile Model Watermark for integrity protection: leveraging boundary volatility and sensitive sample-pairing

Figure 3 for Fragile Model Watermark for integrity protection: leveraging boundary volatility and sensitive sample-pairing

Figure 4 for Fragile Model Watermark for integrity protection: leveraging boundary volatility and sensitive sample-pairing

Abstract:Neural networks have increasingly influenced people's lives. Ensuring the faithful deployment of neural networks as designed by their model owners is crucial, as they may be susceptible to various malicious or unintentional modifications, such as backdooring and poisoning attacks. Fragile model watermarks aim to prevent unexpected tampering that could lead DNN models to make incorrect decisions. They ensure the detection of any tampering with the model as sensitively as possible.However, prior watermarking methods suffered from inefficient sample generation and insufficient sensitivity, limiting their practical applicability. Our approach employs a sample-pairing technique, placing the model boundaries between pairs of samples, while simultaneously maximizing logits. This ensures that the model's decision results of sensitive samples change as much as possible and the Top-1 labels easily alter regardless of the direction it moves.

* The article has been accepted by IEEE International Conference on Multimedia and Expo 2024

Via

Access Paper or Ask Questions

Brush Your Text: Synthesize Any Scene Text on Images via Diffusion Model

Dec 19, 2023

Lingjun Zhang, Xinyuan Chen, Yaohui Wang, Yue Lu, Yu Qiao

Figure 1 for Brush Your Text: Synthesize Any Scene Text on Images via Diffusion Model

Figure 2 for Brush Your Text: Synthesize Any Scene Text on Images via Diffusion Model

Figure 3 for Brush Your Text: Synthesize Any Scene Text on Images via Diffusion Model

Figure 4 for Brush Your Text: Synthesize Any Scene Text on Images via Diffusion Model

Abstract:Recently, diffusion-based image generation methods are credited for their remarkable text-to-image generation capabilities, while still facing challenges in accurately generating multilingual scene text images. To tackle this problem, we propose Diff-Text, which is a training-free scene text generation framework for any language. Our model outputs a photo-realistic image given a text of any language along with a textual description of a scene. The model leverages rendered sketch images as priors, thus arousing the potential multilingual-generation ability of the pre-trained Stable Diffusion. Based on the observation from the influence of the cross-attention map on object placement in generated images, we propose a localized attention constraint into the cross-attention layer to address the unreasonable positioning problem of scene text. Additionally, we introduce contrastive image-level prompts to further refine the position of the textual region and achieve more accurate scene text generation. Experiments demonstrate that our method outperforms the existing method in both the accuracy of text recognition and the naturalness of foreground-background blending.

* Accepted to AAAI 2024. Code: https://github.com/ecnuljzhang/brush-your-text

Via

Access Paper or Ask Questions

Efficient Object Detection in Optical Remote Sensing Imagery via Attention-based Feature Distillation

Oct 28, 2023

Pourya Shamsolmoali, Jocelyn Chanussot, Huiyu Zhou, Yue Lu

Figure 1 for Efficient Object Detection in Optical Remote Sensing Imagery via Attention-based Feature Distillation

Figure 2 for Efficient Object Detection in Optical Remote Sensing Imagery via Attention-based Feature Distillation

Figure 3 for Efficient Object Detection in Optical Remote Sensing Imagery via Attention-based Feature Distillation

Figure 4 for Efficient Object Detection in Optical Remote Sensing Imagery via Attention-based Feature Distillation

Abstract:Efficient object detection methods have recently received great attention in remote sensing. Although deep convolutional networks often have excellent detection accuracy, their deployment on resource-limited edge devices is difficult. Knowledge distillation (KD) is a strategy for addressing this issue since it makes models lightweight while maintaining accuracy. However, existing KD methods for object detection have encountered two constraints. First, they discard potentially important background information and only distill nearby foreground regions. Second, they only rely on the global context, which limits the student detector's ability to acquire local information from the teacher detector. To address the aforementioned challenges, we propose Attention-based Feature Distillation (AFD), a new KD approach that distills both local and global information from the teacher detector. To enhance local distillation, we introduce a multi-instance attention mechanism that effectively distinguishes between background and foreground elements. This approach prompts the student detector to focus on the pertinent channels and pixels, as identified by the teacher detector. Local distillation lacks global information, thus attention global distillation is proposed to reconstruct the relationship between various pixels and pass it from teacher to student detector. The performance of AFD is evaluated on two public aerial image benchmarks, and the evaluation results demonstrate that AFD in object detection can attain the performance of other state-of-the-art models while being efficient.

Via

Access Paper or Ask Questions

Distance Weighted Trans Network for Image Completion

Oct 25, 2023

Pourya Shamsolmoali, Masoumeh Zareapoor, Huiyu Zhou, Xuelong Li, Yue Lu

Abstract:The challenge of image generation has been effectively modeled as a problem of structure priors or transformation. However, existing models have unsatisfactory performance in understanding the global input image structures because of particular inherent features (for example, local inductive prior). Recent studies have shown that self-attention is an efficient modeling technique for image completion problems. In this paper, we propose a new architecture that relies on Distance-based Weighted Transformer (DWT) to better understand the relationships between an image's components. In our model, we leverage the strengths of both Convolutional Neural Networks (CNNs) and DWT blocks to enhance the image completion process. Specifically, CNNs are used to augment the local texture information of coarse priors and DWT blocks are used to recover certain coarse textures and coherent visual structures. Unlike current approaches that generally use CNNs to create feature maps, we use the DWT to encode global dependencies and compute distance-based weighted feature maps, which substantially minimizes the problem of visual ambiguities. Meanwhile, to better produce repeated textures, we introduce Residual Fast Fourier Convolution (Res-FFC) blocks to combine the encoder's skip features with the coarse features provided by our generator. Furthermore, a simple yet effective technique is proposed to normalize the non-zero values of convolutions, and fine-tune the network layers for regularization of the gradient norms to provide an efficient training stabiliser. Extensive quantitative and qualitative experiments on three challenging datasets demonstrate the superiority of our proposed model compared to existing approaches.

Via

Access Paper or Ask Questions

DSAM-GN:Graph Network based on Dynamic Similarity Adjacency Matrices for Vehicle Re-identification

Oct 25, 2023

Yuejun Jiao, Song Qiu, Mingsong Chen, Dingding Han, Qingli Li, Yue Lu

Figure 1 for DSAM-GN:Graph Network based on Dynamic Similarity Adjacency Matrices for Vehicle Re-identification

Figure 2 for DSAM-GN:Graph Network based on Dynamic Similarity Adjacency Matrices for Vehicle Re-identification

Figure 3 for DSAM-GN:Graph Network based on Dynamic Similarity Adjacency Matrices for Vehicle Re-identification

Figure 4 for DSAM-GN:Graph Network based on Dynamic Similarity Adjacency Matrices for Vehicle Re-identification

Abstract:In recent years, vehicle re-identification (Re-ID) has gained increasing importance in various applications such as assisted driving systems, traffic flow management, and vehicle tracking, due to the growth of intelligent transportation systems. However, the presence of extraneous background information and occlusions can interfere with the learning of discriminative features, leading to significant variations in the same vehicle image across different scenarios. This paper proposes a method, named graph network based on dynamic similarity adjacency matrices (DSAM-GN), which incorporates a novel approach for constructing adjacency matrices to capture spatial relationships of local features and reduce background noise. Specifically, the proposed method divides the extracted vehicle features into different patches as nodes within the graph network. A spatial attention-based similarity adjacency matrix generation (SASAMG) module is employed to compute similarity matrices of nodes, and a dynamic erasure operation is applied to disconnect nodes with low similarity, resulting in similarity adjacency matrices. Finally, the nodes and similarity adjacency matrices are fed into graph networks to extract more discriminative features for vehicle Re-ID. Experimental results on public datasets VeRi-776 and VehicleID demonstrate the effectiveness of the proposed method compared with recent works.

* This paper has been accepted by the 20th Pacific Rim International Conference on Artificial Intelligence in 2023

Via

Access Paper or Ask Questions

Adaptive White-Box Watermarking with Self-Mutual Check Parameters in Deep Neural Networks

Aug 22, 2023

Zhenzhe Gao, Zhaoxia Yin, Hongjian Zhan, Heng Yin, Yue Lu

Figure 1 for Adaptive White-Box Watermarking with Self-Mutual Check Parameters in Deep Neural Networks

Figure 2 for Adaptive White-Box Watermarking with Self-Mutual Check Parameters in Deep Neural Networks

Figure 3 for Adaptive White-Box Watermarking with Self-Mutual Check Parameters in Deep Neural Networks

Figure 4 for Adaptive White-Box Watermarking with Self-Mutual Check Parameters in Deep Neural Networks

Abstract:Artificial Intelligence (AI) has found wide application, but also poses risks due to unintentional or malicious tampering during deployment. Regular checks are therefore necessary to detect and prevent such risks. Fragile watermarking is a technique used to identify tampering in AI models. However, previous methods have faced challenges including risks of omission, additional information transmission, and inability to locate tampering precisely. In this paper, we propose a method for detecting tampered parameters and bits, which can be used to detect, locate, and restore parameters that have been tampered with. We also propose an adaptive embedding method that maximizes information capacity while maintaining model accuracy. Our approach was tested on multiple neural networks subjected to attacks that modified weight parameters, and our results demonstrate that our method achieved great recovery performance when the modification rate was below 20%. Furthermore, for models where watermarking significantly affected accuracy, we utilized an adaptive bit technique to recover more than 15% of the accuracy loss of the model.

Via

Access Paper or Ask Questions