Abstract:Segment Anything Model 2 (SAM 2), a prompt-driven foundation model extending SAM to both image and video domains, has shown superior zero-shot performance compared to its predecessor. Building on SAM's success in medical image segmentation, SAM 2 presents significant potential for further advancement. However, similar to SAM, SAM 2 is limited by its output of binary masks, inability to infer semantic labels, and dependence on precise prompts for the target object area. Additionally, direct application of SAM and SAM 2 to medical image segmentation tasks yields suboptimal results. In this paper, we explore the upper performance limit of SAM 2 using custom fine-tuning adapters, achieving a Dice Similarity Coefficient (DSC) of 92.30% on the BTCV dataset, surpassing the state-of-the-art nnUNet by 12%. Following this, we address the prompt dependency by investigating various prompt generators. We introduce a UNet to autonomously generate predicted masks and bounding boxes, which serve as input to SAM 2. Subsequent dual-stage refinements by SAM 2 further enhance performance. Extensive experiments show that our method achieves state-of-the-art results on the AMOS2022 dataset, with a Dice improvement of 2.9% compared to nnUNet, and outperforms nnUNet by 6.4% on the BTCV dataset.
Abstract:Diffusion models suffer from the huge consumption of time and resources to train. For example, diffusion models need hundreds of GPUs to train for several weeks for a high-resolution generative task to meet the requirements of an extremely large number of iterations and a large batch size. Training diffusion models become a millionaire's game. With limited resources that only fit a small batch size, training a diffusion model always fails. In this paper, we investigate the key reasons behind the difficulties of training diffusion models with limited resources. Through numerous experiments and demonstrations, we identified a major factor: the significant variation in the training losses across different timesteps, which can easily disrupt the progress made in previous iterations. Moreover, different prediction types of $x_0$ exhibit varying effectiveness depending on the task and timestep. We hypothesize that using a mixed-prediction approach to identify the most accurate $x_0$ prediction type could potentially serve as a breakthrough in addressing this issue. In this paper, we outline several challenges and insights, with the hope of inspiring further research aimed at tackling the limitations of training diffusion models with constrained resources, particularly for high-resolution tasks.
Abstract:Precise alignment of multi-modal images with inherent feature discrepancies poses a pivotal challenge in deformable image registration. Traditional learning-based approaches often consider registration networks as black boxes without interpretability. One core insight is that disentangling alignment features and non-alignment features across modalities bring benefits. Meanwhile, it is challenging for the prominent methods for image registration tasks, such as convolutional neural networks, to capture long-range dependencies by their local receptive fields. The methods often fail when the given image pair has a large misalignment due to the lack of effectively learning long-range dependencies and correspondence. In this paper, we propose MambaReg, a novel Mamba-based architecture that integrates Mamba's strong capability in capturing long sequences to address these challenges. With our proposed several sub-modules, MambaReg can effectively disentangle modality-independent features responsible for registration from modality-dependent, non-aligning features. By selectively attending to the relevant features, our network adeptly captures the correlation between multi-modal images, enabling focused deformation field prediction and precise image alignment. The Mamba-based architecture seamlessly integrates the local feature extraction power of convolutional layers with the long-range dependency modeling capabilities of Mamba. Experiments on public non-rigid RGB-IR image datasets demonstrate the superiority of our method, outperforming existing approaches in terms of registration accuracy and deformation field smoothness.
Abstract:Smartphone agents are increasingly important for helping users control devices efficiently, with (Multimodal) Large Language Model (MLLM)-based approaches emerging as key contenders. Fairly comparing these agents is essential but challenging, requiring a varied task scope, the integration of agents with different implementations, and a generalisable evaluation pipeline to assess their strengths and weaknesses. In this paper, we present SPA-Bench, a comprehensive SmartPhone Agent Benchmark designed to evaluate (M)LLM-based agents in an interactive environment that simulates real-world conditions. SPA-Bench offers three key contributions: (1) A diverse set of tasks covering system and third-party apps in both English and Chinese, focusing on features commonly used in daily routines; (2) A plug-and-play framework enabling real-time agent interaction with Android devices, integrating over ten agents with the flexibility to add more; (3) A novel evaluation pipeline that automatically assesses agent performance across multiple dimensions, encompassing seven metrics related to task completion and resource consumption. Our extensive experiments across tasks and agents reveal challenges like interpreting mobile user interfaces, action grounding, memory retention, and execution costs. We propose future research directions to ease these difficulties, moving closer to real-world smartphone agent applications.
Abstract:Segment Anything Model~(SAM), a prompt-driven foundation model for natural image segmentation, has demonstrated impressive zero-shot performance. However, SAM does not work when directly applied to medical image segmentation tasks, since SAM lacks the functionality to predict semantic labels for predicted masks and needs to provide extra prompts, such as points or boxes, to segment target regions. Meanwhile, there is a huge gap between 2D natural images and 3D medical images, so the performance of SAM is imperfect for medical image segmentation tasks. Following the above issues, we propose MaskSAM, a novel mask classification prompt-free SAM adaptation framework for medical image segmentation. We design a prompt generator combined with the image encoder in SAM to generate a set of auxiliary classifier tokens, auxiliary binary masks, and auxiliary bounding boxes. Each pair of auxiliary mask and box prompts, which can solve the requirements of extra prompts, is associated with class label predictions by the sum of the auxiliary classifier token and the learnable global classifier tokens in the mask decoder of SAM to solve the predictions of semantic labels. Meanwhile, we design a 3D depth-convolution adapter for image embeddings and a 3D depth-MLP adapter for prompt embeddings. We inject one of them into each transformer block in the image encoder and mask decoder to enable pre-trained 2D SAM models to extract 3D information and adapt to 3D medical images. Our method achieves state-of-the-art performance on AMOS2022, 90.52% Dice, which improved by 2.7% compared to nnUNet. Our method surpasses nnUNet by 1.7% on ACDC and 1.0% on Synapse datasets.
Abstract:Open-vocabulary semantic segmentation strives to distinguish pixels into different semantic groups from an open set of categories. Most existing methods explore utilizing pre-trained vision-language models, in which the key is to adopt the image-level model for pixel-level segmentation task. In this paper, we propose a simple encoder-decoder, named SED, for open-vocabulary semantic segmentation, which comprises a hierarchical encoder-based cost map generation and a gradual fusion decoder with category early rejection. The hierarchical encoder-based cost map generation employs hierarchical backbone, instead of plain transformer, to predict pixel-level image-text cost map. Compared to plain transformer, hierarchical backbone better captures local spatial information and has linear computational complexity with respect to input size. Our gradual fusion decoder employs a top-down structure to combine cost map and the feature maps of different backbone levels for segmentation. To accelerate inference speed, we introduce a category early rejection scheme in the decoder that rejects many no-existing categories at the early layer of decoder, resulting in at most 4.7 times acceleration without accuracy degradation. Experiments are performed on multiple open-vocabulary semantic segmentation datasets, which demonstrates the efficacy of our SED method. When using ConvNeXt-B, our SED method achieves mIoU score of 31.6\% on ADE20K with 150 categories at 82 millisecond ($ms$) per image on a single A6000. We will release it at \url{https://github.com/xb534/SED.git}.
Abstract:Benefiting from the rapid development of deep learning, 2D and 3D computer vision applications are deployed in many safe-critical systems, such as autopilot and identity authentication. However, deep learning models are not trustworthy enough because of their limited robustness against adversarial attacks. The physically realizable adversarial attacks further pose fatal threats to the application and human safety. Lots of papers have emerged to investigate the robustness and safety of deep learning models against adversarial attacks. To lead to trustworthy AI, we first construct a general threat model from different perspectives and then comprehensively review the latest progress of both 2D and 3D adversarial attacks. We extend the concept of adversarial examples beyond imperceptive perturbations and collate over 170 papers to give an overview of deep learning model robustness against various adversarial attacks. To the best of our knowledge, we are the first to systematically investigate adversarial attacks for 3D models, a flourishing field applied to many real-world applications. In addition, we examine physical adversarial attacks that lead to safety violations. Last but not least, we summarize present popular topics, give insights on challenges, and shed light on future research on trustworthy AI.
Abstract:How to accurately measure the relevance and redundancy of features is an age-old challenge in the field of feature selection. However, existing filter-based feature selection methods cannot directly measure redundancy for continuous data. In addition, most methods rely on manually specifying the number of features, which may introduce errors in the absence of expert knowledge. In this paper, we propose a non-parametric feature selection algorithm based on maximum inter-class variation and minimum redundancy, abbreviated as MVMR-FS. We first introduce supervised and unsupervised kernel density estimation on the features to capture their similarities and differences in inter-class and overall distributions. Subsequently, we present the criteria for maximum inter-class variation and minimum redundancy (MVMR), wherein the inter-class probability distributions are employed to reflect feature relevance and the distances between overall probability distributions are used to quantify redundancy. Finally, we employ an AGA to search for the feature subset that minimizes the MVMR. Compared with ten state-of-the-art methods, MVMR-FS achieves the highest average accuracy and improves the accuracy by 5% to 11%.
Abstract:Denoising diffusion (score-based) generative models have recently achieved significant accomplishments in generating realistic and diverse data. These approaches define a forward diffusion process for transforming data into noise and a backward denoising process for sampling data from noise. Unfortunately, the generation process of current denoising diffusion models is notoriously slow due to the lengthy iterative noise estimations, which rely on cumbersome neural networks. It prevents the diffusion models from being widely deployed, especially on edge devices. Previous works accelerate the generation process of diffusion model (DM) via finding shorter yet effective sampling trajectories. However, they overlook the cost of noise estimation with a heavy network in every iteration. In this work, we accelerate generation from the perspective of compressing the noise estimation network. Due to the difficulty of retraining DMs, we exclude mainstream training-aware compression paradigms and introduce post-training quantization (PTQ) into DM acceleration. However, the output distributions of noise estimation networks change with time-step, making previous PTQ methods fail in DMs since they are designed for single-time step scenarios. To devise a DM-specific PTQ method, we explore PTQ on DM in three aspects: quantized operations, calibration dataset, and calibration metric. We summarize and use several observations derived from all-inclusive investigations to formulate our method, which especially targets the unique multi-time-step structure of DMs. Experimentally, our method can directly quantize full-precision DMs into 8-bit models while maintaining or even improving their performance in a training-free manner. Importantly, our method can serve as a plug-and-play module on other fast-sampling methods, e.g., DDIM.
Abstract:Brain vessel image segmentation can be used as a promising biomarker for better prevention and treatment of different diseases. One successful approach is to consider the segmentation as an image-to-image translation task and perform a conditional Generative Adversarial Network (cGAN) to learn a transformation between two distributions. In this paper, we present a novel multi-view approach, MLP-GAN, which splits a 3D volumetric brain vessel image into three different dimensional 2D images (i.e., sagittal, coronal, axial) and then feed them into three different 2D cGANs. The proposed MLP-GAN not only alleviates the memory issue which exists in the original 3D neural networks but also retains 3D spatial information. Specifically, we utilize U-Net as the backbone for our generator and redesign the pattern of skip connection integrated with the MLP-Mixer which has attracted lots of attention recently. Our model obtains the ability to capture cross-patch information to learn global information with the MLP-Mixer. Extensive experiments are performed on the public brain vessel dataset that show our MLP-GAN outperforms other state-of-the-art methods. We release our code at https://github.com/bxie9/MLP-GAN