Abstract:Thanks to the explosive growth of data and the development of computational resources, it is possible to build pre-trained models that can achieve outstanding performance on various tasks, such as neural language processing, computer vision, and more. Despite their powerful capabilities, pre-trained models have also sparked attention to the emerging security challenges associated with their real-world applications. Security and privacy issues, such as leaking privacy information and generating harmful responses, have seriously undermined users' confidence in these powerful models. Concerns are growing as model performance improves dramatically. Researchers are eager to explore the unique security and privacy issues that have emerged, their distinguishing factors, and how to defend against them. However, the current literature lacks a clear taxonomy of emerging attacks and defenses for pre-trained models, which hinders a high-level and comprehensive understanding of these questions. To fill the gap, we conduct a systematical survey on the security risks of pre-trained models, proposing a taxonomy of attack and defense methods based on the accessibility of pre-trained models' input and weights in various security test scenarios. This taxonomy categorizes attacks and defenses into No-Change, Input-Change, and Model-Change approaches. With the taxonomy analysis, we capture the unique security and privacy issues of pre-trained models, categorizing and summarizing existing security issues based on their characteristics. In addition, we offer a timely and comprehensive review of each category's strengths and limitations. Our survey concludes by highlighting potential new research opportunities in the security and privacy of pre-trained models.
Abstract:Depth completion, inferring dense depth maps from sparse measurements, is crucial for robust 3D perception. Although deep learning based methods have made tremendous progress in this problem, these models cannot generalize well across different scenes that are unobserved in training, posing a fundamental limitation that yet to be overcome. A careful analysis of existing deep neural network architectures for depth completion, which are largely borrowing from successful backbones for image analysis tasks, reveals that a key design bottleneck actually resides in the conventional normalization layers. These normalization layers are designed, on one hand, to make training more stable, on the other hand, to build more visual invariance across scene scales. However, in depth completion, the scale is actually what we want to robustly estimate in order to better generalize to unseen scenes. To mitigate, we propose a novel scale propagation normalization (SP-Norm) method to propagate scales from input to output, and simultaneously preserve the normalization operator for easy convergence. More specifically, we rescale the input using learned features of a single-layer perceptron from the normalized input, rather than directly normalizing the input as conventional normalization layers. We then develop a new network architecture based on SP-Norm and the ConvNeXt V2 backbone. We explore the composition of various basic blocks and architectures to achieve superior performance and efficient inference for generalizable depth completion. Extensive experiments are conducted on six unseen datasets with various types of sparse depth maps, i.e., randomly sampled 0.1\%/1\%/10\% valid pixels, 4/8/16/32/64-line LiDAR points, and holes from Structured-Light. Our model consistently achieves the best accuracy with faster speed and lower memory when compared to state-of-the-art methods.
Abstract:Automated nodule segmentation is essential for computer-assisted diagnosis in ultrasound images. Nevertheless, most existing methods depend on precise pixel-level annotations by medical professionals, a process that is both costly and labor-intensive. Recently, segmentation foundation models like SAM have shown impressive generalizability on natural images, suggesting their potential as pseudo-labelers. However, accurate prompts remain crucial for their success in medical images. In this work, we devise a novel weakly supervised framework that effectively utilizes the segmentation foundation model to generate pseudo-labels from aspect ration annotations for automatic nodule segmentation. Specifically, we develop three types of bounding box prompts based on scalable shape priors, followed by an adaptive pseudo-label selection module to fully exploit the prediction capabilities of the foundation model for nodules. We also present a SAM-driven uncertainty-aware cross-teaching strategy. This approach integrates SAM-based uncertainty estimation and label-space perturbations into cross-teaching to mitigate the impact of pseudo-label inaccuracies on model training. Extensive experiments on two clinically collected ultrasound datasets demonstrate the superior performance of our proposed method.
Abstract:Modern automatic speech recognition (ASR) model is required to accurately transcribe diverse speech signals (from different domains, languages, accents, etc) given the specific contextual information in various application scenarios. Classic end-to-end models fused with extra language models perform well, but mainly in data matching scenarios and are gradually approaching a bottleneck. In this work, we introduce Seed-ASR, a large language model (LLM) based speech recognition model. Seed-ASR is developed based on the framework of audio conditioned LLM (AcLLM), leveraging the capabilities of LLMs by inputting continuous speech representations together with contextual information into the LLM. Through stage-wise large-scale training and the elicitation of context-aware capabilities in LLM, Seed-ASR demonstrates significant improvement over end-to-end models on comprehensive evaluation sets, including multiple domains, accents/dialects and languages. Additionally, Seed-ASR can be further deployed to support specific needs in various scenarios without requiring extra language models. Compared to recently released large ASR models, Seed-ASR achieves 10%-40% reduction in word (or character, for Chinese) error rates on Chinese and English public test sets, further demonstrating its powerful performance.
Abstract:Recent advances in deep learning have greatly facilitated the automated segmentation of ultrasound images, which is essential for nodule morphological analysis. Nevertheless, most existing methods depend on extensive and precise annotations by domain experts, which are labor-intensive and time-consuming. In this study, we suggest using simple aspect ratio annotations directly from ultrasound clinical diagnoses for automated nodule segmentation. Especially, an asymmetric learning framework is developed by extending the aspect ratio annotations with two types of pseudo labels, i.e., conservative labels and radical labels, to train two asymmetric segmentation networks simultaneously. Subsequently, a conservative-radical-balance strategy (CRBS) strategy is proposed to complementally combine radical and conservative labels. An inconsistency-aware dynamically mixed pseudo-labels supervision (IDMPS) module is introduced to address the challenges of over-segmentation and under-segmentation caused by the two types of labels. To further leverage the spatial prior knowledge provided by clinical annotations, we also present a novel loss function namely the clinical anatomy prior loss. Extensive experiments on two clinically collected ultrasound datasets (thyroid and breast) demonstrate the superior performance of our proposed method, which can achieve comparable and even better performance than fully supervised methods using ground truth annotations.
Abstract:Deploying end-to-end speech recognition models with limited computing resources remains challenging, despite their impressive performance. Given the gradual increase in model size and the wide range of model applications, selectively executing model components for different inputs to improve the inference efficiency is of great interest. In this paper, we propose a dynamic layer-skipping method that leverages the CTC blank output from intermediate layers to trigger the skipping of the last few encoder layers for frames with high blank probabilities. Furthermore, we factorize the CTC output distribution and perform knowledge distillation on intermediate layers to reduce computation and improve recognition accuracy. Experimental results show that by utilizing the CTC blank, the encoder layer depth can be adjusted dynamically, resulting in 29% acceleration of the CTC model inference with minor performance degradation.
Abstract:Monocular depth inference is a fundamental problem for scene perception of robots. Specific robots may be equipped with a camera plus an optional depth sensor of any type and located in various scenes of different scales, whereas recent advances derived multiple individual sub-tasks. It leads to additional burdens to fine-tune models for specific robots and thereby high-cost customization in large-scale industrialization. This paper investigates a unified task of monocular depth inference, which infers high-quality depth maps from all kinds of input raw data from various robots in unseen scenes. A basic benchmark G2-MonoDepth is developed for this task, which comprises four components: (a) a unified data representation RGB+X to accommodate RGB plus raw depth with diverse scene scale/semantics, depth sparsity ([0%, 100%]) and errors (holes/noises/blurs), (b) a novel unified loss to adapt to diverse depth sparsity/errors of input raw data and diverse scales of output scenes, (c) an improved network to well propagate diverse scene scales from input to output, and (d) a data augmentation pipeline to simulate all types of real artifacts in raw depth maps for training. G2-MonoDepth is applied in three sub-tasks including depth estimation, depth completion with different sparsity, and depth enhancement in unseen scenes, and it always outperforms SOTA baselines on both real-world data and synthetic data.
Abstract:Convolutional neural networks excel in histopathological image classification, yet their pixel-level focus hampers explainability. Conversely, emerging graph convolutional networks spotlight cell-level features and medical implications. However, limited by their shallowness and suboptimal use of high-dimensional pixel data, GCNs underperform in multi-class histopathological image classification. To make full use of pixel-level and cell-level features dynamically, we propose an asymmetric co-training framework combining a deep graph convolutional network and a convolutional neural network for multi-class histopathological image classification. To improve the explainability of the entire framework by embedding morphological and topological distribution of cells, we build a 14-layer deep graph convolutional network to handle cell graph data. For the further utilization and dynamic interactions between pixel-level and cell-level information, we also design a co-training strategy to integrate the two asymmetric branches. Notably, we collect a private clinically acquired dataset termed LUAD7C, including seven subtypes of lung adenocarcinoma, which is rare and more challenging. We evaluated our approach on the private LUAD7C and public colorectal cancer datasets, showcasing its superior performance, explainability, and generalizability in multi-class histopathological image classification.
Abstract:It has long been an ill-posed problem to predict absolute depth maps from single images in real (unseen) indoor scenes. We observe that it is essentially due to not only the scale-ambiguous problem but also the focal-ambiguous problem that decreases the generalization ability of monocular depth estimation. That is, images may be captured by cameras of different focal lengths in scenes of different scales. In this paper, we develop a focal-and-scale depth estimation model to well learn absolute depth maps from single images in unseen indoor scenes. First, a relative depth estimation network is adopted to learn relative depths from single images with diverse scales/semantics. Second, multi-scale features are generated by mapping a single focal length value to focal length features and concatenating them with intermediate features of different scales in relative depth estimation. Finally, relative depths and multi-scale features are jointly fed into an absolute depth estimation network. In addition, a new pipeline is developed to augment the diversity of focal lengths of public datasets, which are often captured with cameras of the same or similar focal lengths. Our model is trained on augmented NYUDv2 and tested on three unseen datasets. Our model considerably improves the generalization ability of depth estimation by 41%/13% (RMSE) with/without data augmentation compared with five recent SOTAs and well alleviates the deformation problem in 3D reconstruction. Notably, our model well maintains the accuracy of depth estimation on original NYUDv2.
Abstract:The transformer model is known to be computationally demanding, and prohibitively costly for long sequences, as the self-attention module uses a quadratic time and space complexity with respect to sequence length. Many researchers have focused on designing new forms of self-attention or introducing new parameters to overcome this limitation, however a large portion of them prohibits the model to inherit weights from large pretrained models. In this work, the transformer's inefficiency has been taken care of from another perspective. We propose Fourier Transformer, a simple yet effective approach by progressively removing redundancies in hidden sequence using the ready-made Fast Fourier Transform (FFT) operator to perform Discrete Cosine Transformation (DCT). Fourier Transformer is able to significantly reduce computational costs while retain the ability to inherit from various large pretrained models. Experiments show that our model achieves state-of-the-art performances among all transformer-based models on the long-range modeling benchmark LRA with significant improvement in both speed and space. For generative seq-to-seq tasks including CNN/DailyMail and ELI5, by inheriting the BART weights our model outperforms the standard BART and other efficient models. \footnote{Our code is publicly available at \url{https://github.com/LUMIA-Group/FourierTransformer}}