Abstract:Artificial Intelligence (AI) technologies have revolutionized numerous fields, yet their applications often rely on costly and time-consuming data collection processes. Federated Learning (FL) offers a promising alternative by enabling AI models to be trained on decentralized data where data is scattered across clients (distributed nodes). However, existing FL approaches struggle to match the performance of centralized training due to challenges such as heterogeneous data distribution and communication delays, limiting their potential for breakthroughs. We observe that many real-world use cases involve hybrid data regimes, in which a server (center node) has access to some data while a large amount of data is distributed across associated clients. To improve the utilization of decentralized data under this regime, address data heterogeneity issue, and facilitate asynchronous communication between the server and clients, we propose a dual learning approach that leverages centralized data at the server to guide the merging of model updates from clients. Our method accommodates scenarios where server data is out-of-domain relative to decentralized client data, making it applicable to a wide range of use cases. We provide theoretical analysis demonstrating the faster convergence of our method compared to existing methods. Furthermore, experimental results across various scenarios show that our approach significantly outperforms existing technologies, highlighting its potential to unlock the value of large amounts of decentralized data.
Abstract:Audio-visual understanding is a rapidly evolving field that seeks to integrate and interpret information from both auditory and visual modalities. Despite recent advances in multi-modal learning, existing benchmarks often suffer from strong visual bias -- where answers can be inferred from visual data alone -- and provide only aggregate scores that conflate multiple sources of error. This makes it difficult to determine whether models struggle with visual understanding, audio interpretation, or audio-visual alignment. In this work, we introduce DAVE (Diagnostic Audio Visual Evaluation), a novel benchmark dataset designed to systematically evaluate audio-visual models across controlled challenges. DAVE alleviates existing limitations by (i) ensuring both modalities are necessary to answer correctly and (ii) decoupling evaluation into atomic subcategories. Our detailed analysis of state-of-the-art models reveals specific failure modes and provides targeted insights for improvement. By offering this standardized diagnostic framework, we aim to facilitate more robust development of audio-visual models. The dataset is released: https://github.com/gorjanradevski/dave
Abstract:Remote sensing imagery from systems such as Sentinel provides full coverage of the Earth's surface at around 10-meter resolution. The remote sensing community has transitioned to extensive use of deep learning models due to their high performance on benchmarks such as the UCMerced and ISPRS Vaihingen datasets. Convolutional models such as UNet and ResNet variations are commonly employed for remote sensing but typically only accept three channels, as they were developed for RGB imagery, while satellite systems provide more than ten. Recently, several transformer architectures have been proposed for remote sensing, but they have not been extensively benchmarked and are typically used on small datasets such as Salinas Valley. Meanwhile, it is becoming feasible to obtain dense spatial land-use labels for entire first-level administrative divisions of some countries. Scaling law observations suggest that substantially larger multi-spectral transformer models could provide a significant leap in remote sensing performance in these settings. In this work, we propose ChromaFormer, a family of multi-spectral transformer models, which we evaluate across orders of magnitude differences in model parameters to assess their performance and scaling effectiveness on a densely labeled imagery dataset of Flanders, Belgium, covering more than 13,500 km^2 and containing 15 classes. We propose a novel multi-spectral attention strategy and demonstrate its effectiveness through ablations. Furthermore, we show that models many orders of magnitude larger than conventional architectures, such as UNet, lead to substantial accuracy improvements: a UNet++ model with 23M parameters achieves less than 65% accuracy, while a multi-spectral transformer with 655M parameters achieves over 95% accuracy on the Biological Valuation Map of Flanders.
Abstract:How can we test AI performance? This question seems trivial, but it isn't. Standard benchmarks often have problems such as in-distribution and small-size test sets, oversimplified metrics, unfair comparisons, and short-term outcome pressure. As a consequence, good performance on standard benchmarks does not guarantee success in real-world scenarios. To address these problems, we present Touchstone, a large-scale collaborative segmentation benchmark of 9 types of abdominal organs. This benchmark is based on 5,195 training CT scans from 76 hospitals around the world and 5,903 testing CT scans from 11 additional hospitals. This diverse test set enhances the statistical significance of benchmark results and rigorously evaluates AI algorithms across various out-of-distribution scenarios. We invited 14 inventors of 19 AI algorithms to train their algorithms, while our team, as a third party, independently evaluated these algorithms on three test sets. In addition, we also evaluated pre-existing AI frameworks--which, differing from algorithms, are more flexible and can support different algorithms--including MONAI from NVIDIA, nnU-Net from DKFZ, and numerous other open-source frameworks. We are committed to expanding this benchmark to encourage more innovation of AI algorithms for the medical domain.
Abstract:The selective classifier (SC) has garnered increasing interest in areas such as medical diagnostics, autonomous driving, and the justice system. The Area Under the Risk-Coverage Curve (AURC) has emerged as the foremost evaluation metric for assessing the performance of SC systems. In this work, we introduce a more straightforward representation of the population AURC, interpretable as a weighted risk function, and propose a Monte Carlo plug-in estimator applicable to finite sample scenarios. We demonstrate that our estimator is consistent and offers a low-bias estimation of the actual weights, with a tightly bounded mean squared error (MSE). We empirically show the effectiveness of this estimator on a comprehensive benchmark across multiple datasets, model architectures, and Confidence Score Functions (CSFs).
Abstract:Novel view synthesis of indoor scenes can be achieved by capturing a monocular video sequence of the environment. However, redundant information caused by artificial movements in the input video data reduces the efficiency of scene modeling. In this work, we tackle this challenge from the perspective of camera selection. We begin by constructing a similarity matrix that incorporates both the spatial diversity of the cameras and the semantic variation of the images. Based on this matrix, we use the Intra-List Diversity (ILD) metric to assess camera redundancy, formulating the camera selection task as an optimization problem. Then we apply a diversity-based sampling algorithm to optimize the camera selection. We also develop a new dataset, IndoorTraj, which includes long and complex camera movements captured by humans in virtual indoor environments, closely mimicking real-world scenarios. Experimental results demonstrate that our strategy outperforms other approaches under time and memory constraints. Remarkably, our method achieves performance comparable to models trained on the full dataset, while using only an average of 15% of the frames and 75% of the allotted time.
Abstract:Computed tomography (CT) reconstruction plays a crucial role in industrial nondestructive testing and medical diagnosis. Sparse view CT reconstruction aims to reconstruct high-quality CT images while only using a small number of projections, which helps to improve the detection speed of industrial assembly lines and is also meaningful for reducing radiation in medical scenarios. Sparse CT reconstruction methods based on implicit neural representations (INRs) have recently shown promising performance, but still produce artifacts because of the difficulty of obtaining useful prior information. In this work, we incorporate a powerful prior: the total number of material categories of objects. To utilize the prior, we design AC-IND, a self-supervised method based on Attenuation Coefficient Estimation and Implicit Neural Distribution. Specifically, our method first transforms the traditional INR from scalar mapping to probability distribution mapping. Then we design a compact attenuation coefficient estimator initialized with values from a rough reconstruction and fast segmentation. Finally, our algorithm finishes the CT reconstruction by jointly optimizing the estimator and the generated distribution. Through experiments, we find that our method not only outperforms the comparative methods in sparse CT reconstruction but also can automatically generate semantic segmentation maps.
Abstract:The rapid advancement of large language models (LLMs) has led to architectures with billions to trillions of parameters, posing significant deployment challenges due to their substantial demands on memory, processing power, and energy consumption. Sparse Mixture-of-Experts (SMoE) architectures have emerged as a solution, activating only a subset of parameters per token, thereby achieving faster inference while maintaining performance. However, SMoE models still face limitations in broader deployment due to their large parameter counts and significant GPU memory requirements. In this work, we introduce a gradient-free evolutionary strategy named EEP (Efficient Expert P}runing) to enhance the pruning of experts in SMoE models. EEP relies solely on model inference (i.e., no gradient computation) and achieves greater sparsity while maintaining or even improving performance on downstream tasks. EEP can be used to reduce both the total number of experts (thus saving GPU memory) and the number of active experts (thus accelerating inference). For example, we demonstrate that pruning up to 75% of experts in Mixtral $8\times7$B-Instruct results in a substantial reduction in parameters with minimal performance loss. Remarkably, we observe improved performance on certain tasks, such as a significant increase in accuracy on the SQuAD dataset (from 53.4% to 75.4%), when pruning half of the experts. With these results, EEP not only lowers the barrier to deploying SMoE models,but also challenges the conventional understanding of model pruning by showing that fewer experts can lead to better task-specific performance without any fine-tuning. Code is available at https://github.com/imagination-research/EEP.
Abstract:Large language models (LLMs) excel in generating coherent text, but they often struggle with context awareness, leading to inaccuracies in tasks requiring faithful adherence to provided information. We introduce FastMem, a novel method designed to enhance instruction fine-tuned LLMs' context awareness through fast memorization of the prompt. FastMem maximizes the likelihood of the prompt before inference by fine-tuning only the last Feed-Forward Network (FFN) module. This targeted approach ensures efficient optimization without overfitting, significantly improving the model's ability to comprehend and accurately follow the context. Our experiments demonstrate substantial gains in reading comprehension, text summarization and adherence to output structures. For instance, FastMem improves the accuracy of Llama 3-8B-Inst on the NQ-SWAP dataset from 59.1% to 71.6%, and reduces the output structure failure rate of Qwen 1.5-4B-Chat from 34.9% to 25.5%. Extensive experimental results highlight FastMem's potential to offer a robust solution to enhance the reliability and accuracy of LLMs in various applications. Our code is available at: https://github.com/IAAR-Shanghai/FastMem
Abstract:Teaching to improve student models (e.g., knowledge distillation) is an extensively studied methodology in LLMs. However, for humans, teaching not only improves students but also improves teachers. We ask: Can LLMs also learn by teaching (LbT)? If yes, we can potentially unlock the possibility of continuously advancing the models without solely relying on human-produced data or stronger models. In this paper, we provide a preliminary exploration of this ambitious agenda. We show that LbT ideas can be incorporated into existing LLM training/prompting pipelines and provide noticeable improvements. Specifically, we design three methods, each mimicking one of the three levels of LbT in humans: observing students' feedback, learning from the feedback, and learning iteratively, with the goals of improving answer accuracy without training and improving models' inherent capability with fine-tuning. The findings are encouraging. For example, similar to LbT in human, we see that: (1) LbT can induce weak-to-strong generalization: strong models can improve themselves by teaching other weak models; (2) Diversity in students might help: teaching multiple students could be better than teaching one student or the teacher itself. We hope that this early promise can inspire future research on LbT and more broadly adopting the advanced techniques in education to improve LLMs. The code is available at https://github.com/imagination-research/lbt.