Abstract:MLLMs have demonstrated remarkable comprehension and reasoning capabilities with complex language and visual data. These advances have spurred the vision of establishing a generalist robotic MLLM proficient in understanding complex human instructions and accomplishing various embodied tasks. However, developing MLLMs for real-world robots is challenging due to the typically limited computation and memory capacities available on robotic platforms. In contrast, the inference of MLLMs involves storing billions of parameters and performing tremendous computation, imposing significant hardware demands. In our paper, we propose a Dynamic Early-Exit Framework for Robotic Vision-Language-Action Model (DeeR-VLA, or simply DeeR) that automatically adjusts the size of the activated MLLM based on each situation at hand. The approach leverages a multi-exit architecture in MLLMs, which allows the model to terminate processing once a proper size of the model has been activated for a specific situation, thus avoiding further redundant computation. Additionally, we develop novel algorithms that establish early-termination criteria for DeeR, conditioned on predefined demands such as average computational cost (i.e., power consumption), as well as peak computational consumption (i.e., latency) and GPU memory usage. These enhancements ensure that DeeR operates efficiently under varying resource constraints while maintaining competitive performance. On the CALVIN robot manipulation benchmark, DeeR demonstrates significant reductions in computational costs of LLM by 5.2-6.5x and GPU memory of LLM by 2-6x without compromising performance. Code and checkpoints are available at https://github.com/yueyang130/DeeR-VLA.
Abstract:Multimodal brain magnetic resonance (MR) imaging is indispensable in neuroscience and neurology. However, due to the accessibility of MRI scanners and their lengthy acquisition time, multimodal MR images are not commonly available. Current MR image synthesis approaches are typically trained on independent datasets for specific tasks, leading to suboptimal performance when applied to novel datasets and tasks. Here, we present TUMSyn, a Text-guided Universal MR image Synthesis generalist model, which can flexibly generate brain MR images with demanded imaging metadata from routinely acquired scans guided by text prompts. To ensure TUMSyn's image synthesis precision, versatility, and generalizability, we first construct a brain MR database comprising 31,407 3D images with 7 MRI modalities from 13 centers. We then pre-train an MRI-specific text encoder using contrastive learning to effectively control MR image synthesis based on text prompts. Extensive experiments on diverse datasets and physician assessments indicate that TUMSyn can generate clinically meaningful MR images with specified imaging metadata in supervised and zero-shot scenarios. Therefore, TUMSyn can be utilized along with acquired MR scan(s) to facilitate large-scale MRI-based screening and diagnosis of brain diseases.
Abstract:Next Basket Recommendation (NBR) is a new type of recommender system that predicts combinations of items users are likely to purchase together. Existing NBR models often overlook a crucial factor, which is price, and do not fully capture item-basket-user interactions. To address these limitations, we propose a novel method called Basket-augmented Dynamic Heterogeneous Hypergraph (BDHH). BDHH utilizes a heterogeneous multi-relational graph to capture the intricate relationships among item features, with price as a critical factor. Moreover, our approach includes a basket-guided dynamic augmentation network that could dynamically enhances item-basket-user interactions. Experiments on real-world datasets demonstrate that BDHH significantly improves recommendation accuracy, providing a more comprehensive understanding of user behavior.
Abstract:Recent studies have demonstrated the effectiveness of token-based methods for visual content generation. As a representative work, non-autoregressive Transformers (NATs) are able to synthesize images with decent quality in a small number of steps. However, NATs usually necessitate configuring a complicated generation policy comprising multiple manually-designed scheduling rules. These heuristic-driven rules are prone to sub-optimality and come with the requirements of expert knowledge and labor-intensive efforts. Moreover, their one-size-fits-all nature cannot flexibly adapt to the diverse characteristics of each individual sample. To address these issues, we propose AdaNAT, a learnable approach that automatically configures a suitable policy tailored for every sample to be generated. In specific, we formulate the determination of generation policies as a Markov decision process. Under this framework, a lightweight policy network for generation can be learned via reinforcement learning. Importantly, we demonstrate that simple reward designs such as FID or pre-trained reward models, may not reliably guarantee the desired quality or diversity of generated samples. Therefore, we propose an adversarial reward design to guide the training of policy networks effectively. Comprehensive experiments on four benchmark datasets, i.e., ImageNet-256 & 512, MS-COCO, and CC3M, validate the effectiveness of AdaNAT. Code and pre-trained models will be released at https://github.com/LeapLabTHU/AdaNAT.
Abstract:Test-Time Adaptation (TTA) aims to adapt pre-trained models to the target domain during testing. In reality, this adaptability can be influenced by multiple factors. Researchers have identified various challenging scenarios and developed diverse methods to address these challenges, such as dealing with continual domain shifts, mixed domains, and temporally correlated or imbalanced class distributions. Despite these efforts, a unified and comprehensive benchmark has yet to be established. To this end, we propose a Unified Test-Time Adaptation (UniTTA) benchmark, which is comprehensive and widely applicable. Each scenario within the benchmark is fully described by a Markov state transition matrix for sampling from the original dataset. The UniTTA benchmark considers both domain and class as two independent dimensions of data and addresses various combinations of imbalance/balance and i.i.d./non-i.i.d./continual conditions, covering a total of \( (2 \times 3)^2 = 36 \) scenarios. It establishes a comprehensive evaluation benchmark for realistic TTA and provides a guideline for practitioners to select the most suitable TTA method. Alongside this benchmark, we propose a versatile UniTTA framework, which includes a Balanced Domain Normalization (BDN) layer and a COrrelated Feature Adaptation (COFA) method--designed to mitigate distribution gaps in domain and class, respectively. Extensive experiments demonstrate that our UniTTA framework excels within the UniTTA benchmark and achieves state-of-the-art performance on average. Our code is available at \url{https://github.com/LeapLabTHU/UniTTA}.
Abstract:Generic event boundary detection (GEBD), inspired by human visual cognitive behaviors of consistently segmenting videos into meaningful temporal chunks, finds utility in various applications such as video editing and. In this paper, we demonstrate that SOTA GEBD models often prioritize final performance over model complexity, resulting in low inference speed and hindering efficient deployment in real-world scenarios. We contribute to addressing this challenge by experimentally reexamining the architecture of GEBD models and uncovering several surprising findings. Firstly, we reveal that a concise GEBD baseline model already achieves promising performance without any sophisticated design. Secondly, we find that the widely applied image-domain backbones in GEBD models can contain plenty of architecture redundancy, motivating us to gradually ``modernize'' each component to enhance efficiency. Thirdly, we show that the GEBD models using image-domain backbones conducting the spatiotemporal learning in a spatial-then-temporal greedy manner can suffer from a distraction issue, which might be the inefficient villain for GEBD. Using a video-domain backbone to jointly conduct spatiotemporal modeling is an effective solution for this issue. The outcome of our exploration is a family of GEBD models, named EfficientGEBD, significantly outperforms the previous SOTA methods by up to 1.7\% performance gain and 280\% speedup under the same backbone. Our research prompts the community to design modern GEBD methods with the consideration of model complexity, particularly in resource-aware applications. The code is available at \url{https://github.com/Ziwei-Zheng/EfficientGEBD}.
Abstract:The field of image synthesis is currently flourishing due to the advancements in diffusion models. While diffusion models have been successful, their computational intensity has prompted the pursuit of more efficient alternatives. As a representative work, non-autoregressive Transformers (NATs) have been recognized for their rapid generation. However, a major drawback of these models is their inferior performance compared to diffusion models. In this paper, we aim to re-evaluate the full potential of NATs by revisiting the design of their training and inference strategies. Specifically, we identify the complexities in properly configuring these strategies and indicate the possible sub-optimality in existing heuristic-driven designs. Recognizing this, we propose to go beyond existing methods by directly solving the optimal strategies in an automatic framework. The resulting method, named AutoNAT, advances the performance boundaries of NATs notably, and is able to perform comparably with the latest diffusion models at a significantly reduced inference cost. The effectiveness of AutoNAT is validated on four benchmark datasets, i.e., ImageNet-256 & 512, MS-COCO, and CC3M. Our code is available at https://github.com/LeapLabTHU/ImprovedNAT.
Abstract:The superior performance of modern visual backbones usually comes with a costly training procedure. We contribute to this issue by generalizing the idea of curriculum learning beyond its original formulation, i.e., training models using easier-to-harder data. Specifically, we reformulate the training curriculum as a soft-selection function, which uncovers progressively more difficult patterns within each example during training, instead of performing easier-to-harder sample selection. Our work is inspired by an intriguing observation on the learning dynamics of visual backbones: during the earlier stages of training, the model predominantly learns to recognize some 'easier-to-learn' discriminative patterns in the data. These patterns, when observed through frequency and spatial domains, incorporate lower-frequency components, and the natural image contents without distortion or data augmentation. Motivated by these findings, we propose a curriculum where the model always leverages all the training data at every learning stage, yet the exposure to the 'easier-to-learn' patterns of each example is initiated first, with harder patterns gradually introduced as training progresses. To implement this idea in a computationally efficient way, we introduce a cropping operation in the Fourier spectrum of the inputs, enabling the model to learn from only the lower-frequency components. Then we show that exposing the contents of natural images can be readily achieved by modulating the intensity of data augmentation. Finally, we integrate these aspects and design curriculum schedules with tailored search algorithms. The resulting method, EfficientTrain++, is simple, general, yet surprisingly effective. It reduces the training time of a wide variety of popular models by 1.5-3.0x on ImageNet-1K/22K without sacrificing accuracy. It also demonstrates efficacy in self-supervised learning (e.g., MAE).
Abstract:Long-tailed distributions frequently emerge in real-world data, where a large number of minority categories contain a limited number of samples. Such imbalance issue considerably impairs the performance of standard supervised learning algorithms, which are mainly designed for balanced training sets. Recent investigations have revealed that supervised contrastive learning exhibits promising potential in alleviating the data imbalance. However, the performance of supervised contrastive learning is plagued by an inherent challenge: it necessitates sufficiently large batches of training data to construct contrastive pairs that cover all categories, yet this requirement is difficult to meet in the context of class-imbalanced data. To overcome this obstacle, we propose a novel probabilistic contrastive (ProCo) learning algorithm that estimates the data distribution of the samples from each class in the feature space, and samples contrastive pairs accordingly. In fact, estimating the distributions of all classes using features in a small batch, particularly for imbalanced data, is not feasible. Our key idea is to introduce a reasonable and simple assumption that the normalized features in contrastive learning follow a mixture of von Mises-Fisher (vMF) distributions on unit space, which brings two-fold benefits. First, the distribution parameters can be estimated using only the first sample moment, which can be efficiently computed in an online manner across different batches. Second, based on the estimated distribution, the vMF distribution allows us to sample an infinite number of contrastive pairs and derive a closed form of the expected contrastive loss for efficient optimization. Our code is available at https://github.com/LeapLabTHU/ProCo.
Abstract:Fine-grained image recognition is a longstanding computer vision challenge that focuses on differentiating objects belonging to multiple subordinate categories within the same meta-category. Since images belonging to the same meta-category usually share similar visual appearances, mining discriminative visual cues is the key to distinguishing fine-grained categories. Although commonly used image-level data augmentation techniques have achieved great success in generic image classification problems, they are rarely applied in fine-grained scenarios, because their random editing-region behavior is prone to destroy the discriminative visual cues residing in the subtle regions. In this paper, we propose diversifying the training data at the feature-level to alleviate the discriminative region loss problem. Specifically, we produce diversified augmented samples by translating image features along semantically meaningful directions. The semantic directions are estimated with a covariance prediction network, which predicts a sample-wise covariance matrix to adapt to the large intra-class variation inherent in fine-grained images. Furthermore, the covariance prediction network is jointly optimized with the classification network in a meta-learning manner to alleviate the degenerate solution problem. Experiments on four competitive fine-grained recognition benchmarks (CUB-200-2011, Stanford Cars, FGVC Aircrafts, NABirds) demonstrate that our method significantly improves the generalization performance on several popular classification networks (e.g., ResNets, DenseNets, EfficientNets, RegNets and ViT). Combined with a recently proposed method, our semantic data augmentation approach achieves state-of-the-art performance on the CUB-200-2011 dataset. The source code will be released.