Abstract:Large language models (LLMs) primarily trained on English texts, often face biases and inaccuracies in Chinese contexts. Their limitations are pronounced in fields like Traditional Chinese Medicine (TCM), where cultural and clinical subtleties are vital, further hindered by a lack of domain-specific data, such as rheumatoid arthritis (RA). To address these issues, this paper introduces Hengqin-RA-v1, the first large language model specifically tailored for TCM with a focus on diagnosing and treating RA. We also present HQ-GCM-RA-C1, a comprehensive RA-specific dataset curated from ancient Chinese medical literature, classical texts, and modern clinical studies. This dataset empowers Hengqin-RA-v1 to deliver accurate and culturally informed responses, effectively bridging the gaps left by general-purpose models. Extensive experiments demonstrate that Hengqin-RA-v1 outperforms state-of-the-art models, even surpassing the diagnostic accuracy of TCM practitioners in certain cases.
Abstract:Progress in AI is driven largely by the scale and quality of training data. Despite this, there is a deficit of empirical analysis examining the attributes of well-established datasets beyond text. In this work we conduct the largest and first-of-its-kind longitudinal audit across modalities--popular text, speech, and video datasets--from their detailed sourcing trends and use restrictions to their geographical and linguistic representation. Our manual analysis covers nearly 4000 public datasets between 1990-2024, spanning 608 languages, 798 sources, 659 organizations, and 67 countries. We find that multimodal machine learning applications have overwhelmingly turned to web-crawled, synthetic, and social media platforms, such as YouTube, for their training sets, eclipsing all other sources since 2019. Secondly, tracing the chain of dataset derivations we find that while less than 33% of datasets are restrictively licensed, over 80% of the source content in widely-used text, speech, and video datasets, carry non-commercial restrictions. Finally, counter to the rising number of languages and geographies represented in public AI training datasets, our audit demonstrates measures of relative geographical and multilingual representation have failed to significantly improve their coverage since 2013. We believe the breadth of our audit enables us to empirically examine trends in data sourcing, restrictions, and Western-centricity at an ecosystem-level, and that visibility into these questions are essential to progress in responsible AI. As a contribution to ongoing improvements in dataset transparency and responsible use, we release our entire multimodal audit, allowing practitioners to trace data provenance across text, speech, and video.
Abstract:While open-source multi-modal language models perform well on simple question answering tasks, they often fail on complex questions that require multiple capabilities, such as fine-grained recognition, visual grounding, and reasoning, and that demand multi-step solutions. We present TACO, a family of multi-modal large action models designed to improve performance on such complex, multi-step, and multi-modal tasks. During inference, TACO produces chains-of-thought-and-action (CoTA), executes intermediate steps by invoking external tools such as OCR, depth estimation and calculator, then integrates both the thoughts and action outputs to produce coherent responses. To train TACO, we create a large dataset of over 1M synthetic CoTA traces generated with GPT-4o and Python programs. We then experiment with various data filtering and mixing techniques and obtain a final subset of 293K high-quality CoTA examples. This dataset enables TACO to learn complex reasoning and action paths, surpassing existing models trained on instruction tuning data with only direct answers. Our model TACO outperforms the instruction-tuned baseline across 8 benchmarks, achieving a 3.6% improvement on average, with gains of up to 15% in MMVet tasks involving OCR, mathematical reasoning, and spatial reasoning. Training on high-quality CoTA traces sets a new standard for complex multi-modal reasoning, highlighting the need for structured, multi-step instruction tuning in advancing open-source mutli-modal models' capabilities.
Abstract:Although the current different types of SAM adaptation methods have achieved promising performance for various downstream tasks, such as prompt-based ones and adapter-based ones, most of them belong to the one-step adaptation paradigm. In real-world scenarios, we are generally confronted with the dynamic scenario where the data comes in a streaming manner. Driven by the practical need, in this paper, we first propose a novel Continual SAM adaptation (CoSAM) benchmark with 8 different task domains and carefully analyze the limitations of the existing SAM one-step adaptation methods in the continual segmentation scenario. Then we propose a novel simple-yet-effective Mixture of Domain Adapters (MoDA) algorithm which utilizes the Global Feature Tokens (GFT) and Global Assistant Tokens (GAT) modules to help the SAM encoder extract well-separated features for different task domains, and then provide the accurate task-specific information for continual learning. Extensive experiments demonstrate that our proposed MoDA obviously surpasses the existing classic continual learning methods, as well as prompt-based and adapter-based approaches for continual segmentation. Moreover, after sequential learning on the CoSAM benchmark with diverse data distributions, our MoDA maintains highly competitive results in the natural image domain, approaching the zero-shot performance of the original SAM, demonstrating its superior capability in knowledge preservation. Notably, the proposed MoDA can be seamlessly integrated into various one-step adaptation methods of SAM, which can consistently bring obvious performance gains. Code is available at \url{https://github.com/yangjl1215/CoSAM}
Abstract:Evaluating the output of Large Language Models (LLMs) is one of the most critical aspects of building a performant compound AI system. Since the output from LLMs propagate to downstream steps, identifying LLM errors is crucial to system performance. A common task for LLMs in AI systems is tool use. While there are several benchmark environments for evaluating LLMs on this task, they typically only give a success rate without any explanation of the failure cases. To solve this problem, we introduce SpecTool, a new benchmark to identify error patterns in LLM output on tool-use tasks. Our benchmark data set comprises of queries from diverse environments that can be used to test for the presence of seven newly characterized error patterns. Using SPECTOOL , we show that even the most prominent LLMs exhibit these error patterns in their outputs. Researchers can use the analysis and insights from SPECTOOL to guide their error mitigation strategies.
Abstract:Spiking Neural Networks (SNNs) are considered as a potential candidate for the next generation of artificial intelligence with appealing characteristics such as sparse computation and inherent temporal dynamics. By adopting architectures of Artificial Neural Networks (ANNs), SNNs achieve competitive performances on benchmark tasks like image classification. However, successful architectures of ANNs are not optimal for SNNs. In this work, we apply Neural Architecture Search (NAS) to find suitable architectures for SNNs. Previous NAS methods for SNNs focus primarily on the spatial dimension, with a notable lack of consideration for the temporal dynamics that are of critical importance for SNNs. Drawing inspiration from the heterogeneity of biological neural networks, we propose a differentiable approach to optimize SNN on both spatial and temporal dimensions. At spatial level, we have developed a spike-based differentiable hierarchical search (SpikeDHS) framework, where spike-based operation is optimized on both the cell and the layer level under computational constraints. We further propose a differentiable surrogate gradient search (DGS) method to evolve local SG functions independently during training. At temporal level, we explore an optimal configuration of diverse temporal dynamics on different types of spiking neurons by evolving their time constants, based on which we further develop hybrid networks combining SNN and ANN, balancing both accuracy and efficiency. Our methods achieve comparable classification performance of CIFAR10/100 and ImageNet with accuracies of 96.43%, 78.96%, and 70.21%, respectively. On event-based deep stereo, our methods find optimal layer variation and surpass the accuracy of specially designed ANNs with 26$\times$ lower computational cost ($6.7\mathrm{mJ}$), demonstrating the potential of SNN in processing highly sparse and dynamic signals.
Abstract:We introduce the Principled Reasoning and Acting (PRAct) framework, a novel method for learning and enforcing action principles from trajectory data. Central to our approach is the use of text gradients from a reflection and optimization engine to derive these action principles. To adapt action principles to specific task requirements, we propose a new optimization framework, Reflective Principle Optimization (RPO). After execution, RPO employs a reflector to critique current action principles and an optimizer to update them accordingly. We develop the RPO framework under two scenarios: Reward-RPO, which uses environmental rewards for reflection, and Self-RPO, which conducts self-reflection without external rewards. Additionally, two RPO methods, RPO-Traj and RPO-Batch, is introduced to adapt to different settings. Experimental results across four environments demonstrate that the PRAct agent, leveraging the RPO framework, effectively learns and applies action principles to enhance performance.
Abstract:Federated Learning (FL) allows collaborative machine learning training without sharing private data. Numerous studies have shown that one significant factor affecting the performance of federated learning models is the heterogeneity of data across different clients, especially when the data is sampled from various domains. A recent paper introduces variance-aware dual-level prototype clustering and uses a novel $\alpha$-sparsity prototype loss, which increases intra-class similarity and reduces inter-class similarity. To ensure that the features converge within specific clusters, we introduce an improved algorithm, Federated Prototype Learning with Convergent Clusters, abbreviated as FedPLCC. To increase inter-class distances, we weight each prototype with the size of the cluster it represents. To reduce intra-class distances, considering that prototypes with larger distances might come from different domains, we select only a certain proportion of prototypes for the loss function calculation. Evaluations on the Digit-5, Office-10, and DomainNet datasets show that our method performs better than existing approaches.
Abstract:Learning meaningful and interpretable representations from high-dimensional volumetric magnetic resonance (MR) images is essential for advancing personalized medicine. While Vision Transformers (ViTs) have shown promise in handling image data, their application to 3D multi-contrast MR images faces challenges due to computational complexity and interpretability. To address this, we propose a novel state-space-model (SSM)-based masked autoencoder which scales ViT-like models to handle high-resolution data effectively while also enhancing the interpretability of learned representations. We propose a latent-to-spatial mapping technique that enables direct visualization of how latent features correspond to specific regions in the input volumes in the context of SSM. We validate our method on two key neuro-oncology tasks: identification of isocitrate dehydrogenase mutation status and 1p/19q co-deletion classification, achieving state-of-the-art accuracy. Our results highlight the potential of SSM-based self-supervised learning to transform radiomics analysis by combining efficiency and interpretability.
Abstract:Autonomous agents powered by large language models (LLMs) have attracted significant research interest. However, the open-source community faces many challenges in developing specialized models for agent tasks, driven by the scarcity of high-quality agent datasets and the absence of standard protocols in this area. We introduce and publicly release xLAM, a series of large action models designed for AI agent tasks. The xLAM series includes five models with both dense and mixture-of-expert architectures, ranging from 1B to 8x22B parameters, trained using a scalable, flexible pipeline that unifies, augments, and synthesizes diverse datasets to enhance AI agents' generalizability and performance across varied environments. Our experimental results demonstrate that xLAM consistently delivers exceptional performance across multiple agent ability benchmarks, notably securing the 1st position on the Berkeley Function-Calling Leaderboard, outperforming GPT-4, Claude-3, and many other models in terms of tool use. By releasing the xLAM series, we aim to advance the performance of open-source LLMs for autonomous AI agents, potentially accelerating progress and democratizing access to high-performance models for agent tasks. Models are available at https://huggingface.co/collections/Salesforce/xlam-models-65f00e2a0a63bbcd1c2dade4