Abstract:We present LightVLA, a simple yet effective differentiable token pruning framework for vision-language-action (VLA) models. While VLA models have shown impressive capability in executing real-world robotic tasks, their deployment on resource-constrained platforms is often bottlenecked by the heavy attention-based computation over large sets of visual tokens. LightVLA addresses this challenge through adaptive, performance-driven pruning of visual tokens: It generates dynamic queries to evaluate visual token importance, and adopts Gumbel softmax to enable differentiable token selection. Through fine-tuning, LightVLA learns to preserve the most informative visual tokens while pruning tokens which do not contribute to task execution, thereby improving efficiency and performance simultaneously. Notably, LightVLA requires no heuristic magic numbers and introduces no additional trainable parameters, making it compatible with modern inference frameworks. Experimental results demonstrate that LightVLA outperforms different VLA models and existing token pruning methods across diverse tasks on the LIBERO benchmark, achieving higher success rates with substantially reduced computational overhead. Specifically, LightVLA reduces FLOPs and latency by 59.1% and 38.2% respectively, with a 2.9% improvement in task success rate. Meanwhile, we also investigate the learnable query-based token pruning method LightVLA* with additional trainable parameters, which also achieves satisfactory performance. Our work reveals that as VLA pursues optimal performance, LightVLA spontaneously learns to prune tokens from a performance-driven perspective. To the best of our knowledge, LightVLA is the first work to apply adaptive visual token pruning to VLA tasks with the collateral goals of efficiency and performance, marking a significant step toward more efficient, powerful and practical real-time robotic systems.
Abstract:In recent years, diffusion model has shown its potential across diverse domains from vision generation to language modeling. Transferring its capabilities to modern autonomous driving systems has also emerged as a promising direction.In this work, we propose TransDiffuser, an encoder-decoder based generative trajectory planning model for end-to-end autonomous driving. The encoded scene information serves as the multi-modal conditional input of the denoising decoder. To tackle the mode collapse dilemma in generating high-quality diverse trajectories, we introduce a simple yet effective multi-modal representation decorrelation optimization mechanism during the training process.TransDiffuser achieves PDMS of 94.85 on the NAVSIM benchmark, surpassing previous state-of-the-art methods without any anchor-based prior trajectories.
Abstract:Ensuring trustworthiness is fundamental to the development of artificial intelligence (AI) that is considered societally responsible, particularly in cancer diagnostics, where a misdiagnosis can have dire consequences. Current digital pathology AI models lack systematic solutions to address trustworthiness concerns arising from model limitations and data discrepancies between model deployment and development environments. To address this issue, we developed TRUECAM, a framework designed to ensure both data and model trustworthiness in non-small cell lung cancer subtyping with whole-slide images. TRUECAM integrates 1) a spectral-normalized neural Gaussian process for identifying out-of-scope inputs and 2) an ambiguity-guided elimination of tiles to filter out highly ambiguous regions, addressing data trustworthiness, as well as 3) conformal prediction to ensure controlled error rates. We systematically evaluated the framework across multiple large-scale cancer datasets, leveraging both task-specific and foundation models, illustrate that an AI model wrapped with TRUECAM significantly outperforms models that lack such guidance, in terms of classification accuracy, robustness, interpretability, and data efficiency, while also achieving improvements in fairness. These findings highlight TRUECAM as a versatile wrapper framework for digital pathology AI models with diverse architectural designs, promoting their responsible and effective applications in real-world settings.
Abstract:Machine learning systems are vulnerable to backdoor attacks, where attackers manipulate model behavior through data tampering or architectural modifications. Traditional backdoor attacks involve injecting malicious samples with specific triggers into the training data, causing the model to produce targeted incorrect outputs in the presence of the corresponding triggers. More sophisticated attacks modify the model's architecture directly, embedding backdoors that are harder to detect as they evade traditional data-based detection methods. However, the drawback of the architectural modification based backdoor attacks is that the trigger must be visible in order to activate the backdoor. To further strengthen the invisibility of the backdoor attacks, a novel backdoor attack method is presented in the paper. To be more specific, this method embeds the backdoor within the model's architecture and has the capability to generate inconspicuous and stealthy triggers. The attack is implemented by modifying pre-trained models, which are then redistributed, thereby posing a potential threat to unsuspecting users. Comprehensive experiments conducted on standard computer vision benchmarks validate the effectiveness of this attack and highlight the stealthiness of its triggers, which remain undetectable through both manual visual inspection and advanced detection tools.
Abstract:Having shown early promise, free-space optical communications (FSO) face formidable challenges in the age of information explosion. The ever-growing demand for greater channel communication capacity is one of the challenges. The inter-channel crosstalk, which severely degrades the quality of transmitted information, creates another roadblock in the way of efficient FSO implementation. Here we advance theoretically and realize experimentally a potentially high-capacity FSO protocol that enables high-fidelity transfer of an image, or set of images through a complex environment. In our protocol, we complement random light structuring at the transmitter with a deep learning image classification platform at the receiver. Multiplexing novel, independent, mutually orthogonal degrees of freedom available to structured random light can potentially significantly boost the channel communication capacity of our protocol without introducing any deleterious crosstalk. Specifically, we show how one can multiplex the degrees of freedom associated with the source coherence radius and a spatial position of a beamlet within an array of structured random beams to greatly enhance the capacity of our communication link. The superb resilience of structured random light to environmental noise, as well as extreme efficiency of deep learning networks at classifying images guarantees high-fidelity image transfer within the framework of our protocol.
Abstract:Electroencephalogram (EEG) is a non-invasive technique to record bioelectrical signals. Integrating supervised deep learning techniques with EEG signals has recently facilitated automatic analysis across diverse EEG-based tasks. However, the label issues of EEG signals have constrained the development of EEG-based deep models. Obtaining EEG annotations is difficult that requires domain experts to guide collection and labeling, and the variability of EEG signals among different subjects causes significant label shifts. To solve the above challenges, self-supervised learning (SSL) has been proposed to extract representations from unlabeled samples through well-designed pretext tasks. This paper concentrates on integrating SSL frameworks with temporal EEG signals to achieve efficient representation and proposes a systematic review of the SSL for EEG signals. In this paper, 1) we introduce the concept and theory of self-supervised learning and typical SSL frameworks. 2) We provide a comprehensive review of SSL for EEG analysis, including taxonomy, methodology, and technique details of the existing EEG-based SSL frameworks, and discuss the difference between these methods. 3) We investigate the adaptation of the SSL approach to various downstream tasks, including the task description and related benchmark datasets. 4) Finally, we discuss the potential directions for future SSL-EEG research.
Abstract:Infrared small target detection plays an important role in the remote sensing fields. Therefore, many detection algorithms have been proposed, in which the infrared patch-tensor (IPT) model has become a mainstream tool due to its excellent performance. However, most IPT-based methods face great challenges, such as inaccurate measure of the tensor low-rankness and poor robustness to complex scenes, which will leadto poor detection performance. In order to solve these problems, this paper proposes a novel double-weighted multi-granularity infrared patch tensor (DWMGIPT) model. First, to capture different granularity information of tensor from multiple modes, a multi-granularity infrared patch tensor (MGIPT) model is constructed by collecting nonoverlapping patches and tensor augmentation based on the tensor train (TT) decomposition. Second, to explore the latent structure of tensor more efficiently, we utilize the auto-weighted mechanism to balance the importance of information at different granularity. Then, the steering kernel (SK) is employed to extract local structure prior, which suppresses background interference such as strong edges and noise. Finally, an efficient optimization algorithm based on the alternating direction method of multipliers (ADMM) is presented to solve the model. Extensive experiments in various challenging scenes show that the proposed algorithm is robust to noise and different scenes. Compared with the other eight state-of-the-art methods, different evaluation metrics demonstrate that our method achieves better detection performance in various complex scenes.
Abstract:When reviewing codes, it seems that there's something wrong with the training codes that we did not notice, this article is temporarily withdrawn and will be modified and resubmitted after codes reviewing.
Abstract:In clinical diagnosis, doctors are able to see biological tissues and early lesions more clearly with the assistance of high-resolution(HR) medical images, which is of vital significance for improving diagnosis accuracy. In order to address the issue that medical images would suffer from severe blurring caused by lack of high-frequency details, this paper develops a novel image super-resolution(SR) algorithm called SR-DBAN via dense neural network and blended attention mechanism. Specifically, a novel blended attention block is proposed and introduced to dense neural network(DenseNet), so that the neural network can concentrate more attention to the regions and channels with sufficient high-frequency details adaptively. In the framework of SR-DBAN, batch normalization layers in the original DenseNet are removed to avoid loss of high-frequency texture details, final HR images are obtained by deconvolution at the very end of the network. Furthermore, inspired by the impressive performance of generative adversarial network, this paper develops a novel image SR algorithm called SR-DBAGAN via dense blended attention generative adversarial network. SR-DBAGAN consists a generator and a discriminator, the generator uses our proposed SR-DBAN to generate HR images and try to fool the discriminator while the discriminator is designed based on Wasserstein GAN(WGAN) to discriminate. We deployed our algorithms on blurry prostate MRI images, and experimental results showed that our proposed algorithms have generated considerable sharpness and texture details and have a significant improvement on the peak signal-to-noise ratio(PSNR) and structural similarity index(SSIM), respectively, compared with mainstream interpolation-based and deep learning-based image SR algorithms, which fully proves the effectiveness and superiority of our proposed algorithms.
Abstract:In order to address the issue that medical image would suffer from severe blurring caused by the lack of high-frequency details in the process of image super-resolution reconstruction, a novel medical image super-resolution method based on dense neural network and blended attention mechanism is proposed. The proposed method adds blended attention blocks to dense neural network(DenseNet), so that the neural network can concentrate more attention to the regions and channels with sufficient high-frequency details. Batch normalization layers are removed to avoid loss of high-frequency texture details. Final obtained high resolution medical image are obtained using deconvolutional layers at the very end of the network as up-sampling operators. Experimental results show that the proposed method has an improvement of 0.05db to 11.25dB and 0.6% to 14.04% on the peak signal-to-noise ratio(PSNR) metric and structural similarity index(SSIM) metric, respectively, compared with the mainstream image super-resolution methods. This work provides a new idea for theoretical studies of medical image super-resolution reconstruction.