Abstract:While multi-agent systems have been shown to significantly enhance the performance of Large Language Models (LLMs) across various tasks and applications, the dense interaction between scaling agents potentially hampers their efficiency and diversity. To address these challenges, we draw inspiration from the sparse mixture-of-agents (SMoE) and propose a sparse mixture-of-agents (SMoA) framework to improve the efficiency and diversity of multi-agent LLMs. Unlike completely connected structures, SMoA introduces novel Response Selection and Early Stopping mechanisms to sparsify information flows among individual LLM agents, striking a balance between performance and efficiency. Additionally, inspired by the expert diversity principle in SMoE frameworks for workload balance between experts, we assign distinct role descriptions to each LLM agent, fostering diverse and divergent thinking. Extensive experiments on reasoning, alignment, and fairness benchmarks demonstrate that SMoA achieves performance comparable to traditional mixture-of-agents approaches but with significantly lower computational costs. Further analysis reveals that SMoA is more stable, has a greater capacity to scale, and offers considerable potential through hyper-parameter optimization. Code and data will be available at: https://github.com/David-Li0406/SMoA.
Abstract:Hybrid massive arrays have been widely used in direction of arrival (DOA) estimation for it can provide larger aperture with lower hardware complexity. However, as the signals received by a hybrid array are compressed by the phase shifter network or the switch network, the degree of freedom (DOF) or spatial resolution of hybrid array is lower than fully-digital (FD) array with same number of antennas. Therefore, we develop a novel sparse hybrid array called switches-based sparse hybrid array (SW-SHA) which by combining nested array and switches-based hybrid array to achieve a huge improvement on DOF over traditional hybrid arrays. Simulations of the spatial spectrums verify that SW-SHA can accurately solve the problem of DOA estimation with the number of signal sources much larger than the number of RF chains. Finally, to further improve the accuracy of DOA estimation for SW-SHA, MMV-SW-SHA is proposed by transforming the single-snapshot co-array signal into MMV form. The simulation results also show that MMV-SW-SHA has better performance than SW-SHA when signal-to-noise ratio (SNR) is low.
Abstract:In the digital era, blockchain technology, cryptocurrencies, and non-fungible tokens (NFTs) have transformed financial and decentralized systems. However, existing research often neglects the spatiotemporal variations in public sentiment toward these technologies, limiting macro-level insights into their global impact. This study leverages Twitter data to explore public attention and sentiment across 150 countries, analyzing over 150 million geotagged tweets from 2012 to 2022. Sentiment scores were derived using a BERT-based multilingual sentiment model trained on 7.4 billion tweets. The analysis integrates global cryptocurrency regulations and economic indicators from the World Development Indicators database. Results reveal significant global sentiment variations influenced by economic factors, with more developed nations engaging more in discussions, while less developed countries show higher sentiment levels. Geographically weighted regression indicates that GDP-tweet engagement correlation intensifies following Bitcoin price surges. Topic modeling shows that countries within similar economic clusters share discussion trends, while different clusters focus on distinct topics. This study highlights global disparities in sentiment toward decentralized finance, shaped by economic and regional factors, with implications for poverty alleviation, cryptocurrency crime, and sustainable development. The dataset and code are publicly available on GitHub.
Abstract:This project proposes a new method that uses fuzzy comprehensive evaluation method to integrate ResNet-50 self-supervised and RepVGG supervised learning. The source image dataset HWOBC oracle is taken as input, the target image is selected, and finally the most similar image is output in turn without any manual intervention. The same feature encoding method is not used for images of different modalities. Before the model training, the image data is preprocessed, and the image is enhanced by random rotation processing, self-square graph equalization theory algorithm, and gamma transform, which effectively enhances the key feature learning. Finally, the fuzzy comprehensive evaluation method is used to combine the results of supervised training and unsupervised training, which can better solve the "most similar" problem that is difficult to quantify. At present, there are many unknown oracle-bone inscriptions waiting for us to crack. Contacting with the glyphs can provide new ideas for cracking.
Abstract:Visual-Language Alignment (VLA) has gained a lot of attention since CLIP's groundbreaking work. Although CLIP performs well, the typical direct latent feature alignment lacks clarity in its representation and similarity scores. On the other hand, lexical representation, a vector whose element represents the similarity between the sample and a word from the vocabulary, is a natural sparse representation and interpretable, providing exact matches for individual words. However, lexical representations is difficult to learn due to no ground-truth supervision and false-discovery issues, and thus requires complex design to train effectively. In this paper, we introduce LexVLA, a more interpretable VLA framework by learning a unified lexical representation for both modalities without complex design. We use DINOv2 as our visual model for its local-inclined features and Llama 2, a generative language model, to leverage its in-context lexical prediction ability. To avoid the false discovery, we propose an overuse penalty to refrain the lexical representation from falsely frequently activating meaningless words. We demonstrate that these two pre-trained uni-modal models can be well-aligned by fine-tuning on modest multi-modal dataset and avoid intricate training configurations. On cross-modal retrieval benchmarks, LexVLA, trained on the CC-12M multi-modal dataset, outperforms baselines fine-tuned on larger datasets (e.g., YFCC15M) and those trained from scratch on even bigger datasets (e.g., 1.1B data, including CC-12M). We conduct extensive experiments to analyze LexVLA.
Abstract:Retrieval-Augmented Generative (RAG) models enhance Large Language Models (LLMs) by integrating external knowledge bases, improving their performance in applications like fact-checking and information searching. In this paper, we demonstrate a security threat where adversaries can exploit the openness of these knowledge bases by injecting deceptive content into the retrieval database, intentionally changing the model's behavior. This threat is critical as it mirrors real-world usage scenarios where RAG systems interact with publicly accessible knowledge bases, such as web scrapings and user-contributed data pools. To be more realistic, we target a realistic setting where the adversary has no knowledge of users' queries, knowledge base data, and the LLM parameters. We demonstrate that it is possible to exploit the model successfully through crafted content uploads with access to the retriever. Our findings emphasize an urgent need for security measures in the design and deployment of RAG systems to prevent potential manipulation and ensure the integrity of machine-generated content.
Abstract:Databases are fundamental to contemporary information systems, yet traditional rule-based configuration methods struggle to manage the complexity of real-world applications with hundreds of tunable parameters. Deep reinforcement learning (DRL), which combines perception and decision-making, presents a potential solution for intelligent database configuration tuning. However, due to black-box property of RL-based method, the generated database tuning strategies still face the urgent problem of lack explainability. Besides, the redundant parameters in large scale database always make the strategy learning become unstable. This paper proposes KnobTree, an interpertable framework designed for the optimization of database parameter configuration. In this framework, an interpertable database tuning algorithm based on RL-based differentatial tree is proposed, which building a transparent tree-based model to generate explainable database tuning strategies. To address the problem of large-scale parameters, We also introduce a explainable method for parameter importance assessment, by utilizing Shapley Values to identify parameters that have significant impacts on database performance. Experiments conducted on MySQL and Gbase8s databases have verified exceptional transparency and interpretability of the KnobTree model. The good property makes generated strategies can offer practical guidance to algorithm designers and database administrators. Moreover, our approach also slightly outperforms the existing RL-based tuning algorithms in aspects such as throughput, latency, and processing time.
Abstract:With the rapid development of video Multimodal Large Language Models (MLLMs), numerous benchmarks have been proposed to assess their video understanding capability. However, due to the lack of rich events in the videos, these datasets may suffer from the short-cut bias that the answers can be deduced from a few frames, without the need to watch the entire video. To address this issue, we introduce Event-Bench, an event-oriented long video understanding benchmark built on existing datasets and human annotations. Event-Bench includes six event-related tasks and 2,190 test instances to comprehensively evaluate video event understanding ability. Additionally, we propose Video Instruction Merging~(VIM), a cost-effective method that enhances video MLLMs using merged, event-intensive video instructions, addressing the scarcity of human-annotated, event-intensive data. Extensive experiments show that the best-performing model, GPT-4o, achieves an overall accuracy of 53.33, significantly outperforming the best open-source model by 41.42%. Leveraging an effective instruction synthesis method and an adaptive model architecture, VIM surpasses both state-of-the-art open-source models and GPT-4V on the Event-Bench. All code, data, and models are publicly available at https://github.com/RUCAIBox/Event-Bench.
Abstract:A fully-digital massive MIMO receive array is promising to meet the high-resolution requirement of near-field (NF) emitter localization, but it also results in the significantly increasing of hardware costs and algorithm complexity. In order to meet the future demand for green communication while maintaining high performance, the grouped hybrid analog and digital (HAD) structure is proposed for NF DOA estimation, which divides the large-scale receive array into small-scale groups and each group contains several subarrays. Thus the NF direction-of-arrival (DOA) estimation problem is viewed as far-field (FF) within each group, and some existing methods such as MUSIC, Root-MUSIC, ESPRIT, etc., can be adopted. Then by angle calibration, a candidate position set is generated. To eliminate the phase ambiguity arising from the HAD structure and obtain the emitter position, two low-complexity clustering-based methods, minimum sample distance clustering (MSDC) and range scatter diagram (RSD) - angle scatter diagram (ASD)-based DBSCAN (RSD-ASD-DBSCAN), are proposed based on the distribution features of samples in the candidate position set. Then to further improve the localization accuracy, a model-driven regression network (RegNet) is designed, which consists of a multi-layer neural network (MLNN) for false solution elimination and a perceptron for angle fusion. Finally, the Cramer-Rao lower bound (CRLB) of NF emitter localization for the proposed grouped HAD structure is also derived. The simulation results show that the proposed methods can achieve CRLB at different SNR regions, the RegNet has great performance advantages at low SNR regions and the clustering-based methods have much lower complexity.
Abstract:In this paper, we present the TacShade a newly designed 3D-printed soft optical tactile sensor. The sensor is developed for shape reconstruction under the inspiration of sketch drawing that uses the density of sketch lines to draw light and shadow, resulting in the creation of a 3D-view effect. TacShade, building upon the strengths of the TacTip, a single-camera tactile sensor of large in-depth deformation and being sensitive to edge and surface following, improves the structure in that the markers are distributed within the gap of papillae pins. Variations in light, dark, and grey effects can be generated inside the sensor through external contact interactions. The contours of the contacting objects are outlined by white markers, while the contact depth characteristics can be indirectly obtained from the distribution of black pins and white markers, creating a 2.5D visualization. Based on the imaging effect, we improve the Shape from Shading (SFS) algorithm to process tactile images, enabling a coarse but fast reconstruction for the contact objects. Two experiments are performed. The first verifies TacShade s ability to reconstruct the shape of the contact objects through one image for object distinction. The second experiment shows the shape reconstruction capability of TacShade for a large panel with ridged patterns based on the location of robots and image splicing technology.