National Mobile Communications Research Laboratory, Southeast University, Nanjing, China
Abstract:In this paper, we introduce SAIL-VL (ScAlable Vision Language Model TraIning via High QuaLity Data Curation), an open-source vision language model (VLM) of state-of-the-art (SOTA) performance with 2B parameters. We introduce three key improvements that contribute to SAIL-VL's leading performance: (1) Scalable high-quality visual understanding data construction: We implement a visual understanding data construction pipeline, which enables hundred-million-scale high-quality recaption data annotation. Equipped with this pipeline, we curate SAIL-Caption, a large-scale caption dataset with large quantity and the highest data quality compared with opensource caption datasets. (2) Scalable Pretraining with High-Quality Visual Understanding Data: We scale SAIL-VL's pretraining budget up to 131B tokens and show that even a 2B VLM benefits from scaled up training data sizes, exhibiting expected data size scaling laws in visual understanding and instruction following performance. (3) Scalable SFT via quantity and quality scaling: We introduce general guidance for instruction data curation to scale up instruction data continuously, allowing us to construct a large SFT dataset with the highest quality. To further improve SAIL-VL's performance, we propose quality scaling, a multi-stage training recipe with curriculum learning, to improve model performance scaling curves w.r.t. data sizes from logarithmic to be near-linear. SAIL-VL obtains the highest average score in 19 commonly used benchmarks in our evaluation and achieves top1 performance among VLMs of comparable sizes on OpenCompass (https://rank.opencompass.org.cn/leaderboard-multimodal). We release our SAIL-VL-2B model at HuggingFace (https://huggingface.co/BytedanceDouyinContent/SAIL-VL-2B).
Abstract:Federated Learning (FL) is widely recognized as a privacy-preserving machine learning paradigm due to its model-sharing mechanism that avoids direct data exchange. However, model training inevitably leaves exploitable traces that can be used to infer sensitive information. In Decentralized FL (DFL), the overlay topology significantly influences its models' convergence, robustness, and security. This study explores the feasibility of inferring the overlay topology of DFL systems based solely on model behavior, introducing a novel Topology Inference Attack. A taxonomy of topology inference attacks is proposed, categorizing them by the attacker's capabilities and knowledge. Practical attack strategies are developed for different scenarios, and quantitative experiments are conducted to identify key factors influencing the attack effectiveness. Experimental results demonstrate that analyzing only the public models of individual nodes can accurately infer the DFL topology, underscoring the risk of sensitive information leakage in DFL systems. This finding offers valuable insights for improving privacy preservation in decentralized learning environments.
Abstract:Federated Learning (FL) performance is highly influenced by data distribution across clients, and non-Independent and Identically Distributed (non-IID) leads to a slower convergence of the global model and a decrease in model effectiveness. The existing algorithms for solving the non-IID problem are focused on the traditional centralized FL (CFL), where a central server is used for model aggregation. However, in decentralized FL (DFL), nodes lack the overall vision of the federation. To address the non-IID problem in DFL, this paper proposes a novel DFL aggregation algorithm, Federated Entropy Pooling (FedEP). FedEP mitigates the client drift problem by incorporating the statistical characteristics of local distributions instead of any actual data. Prior to training, each client conducts a local distribution fitting using a Gaussian Mixture Model (GMM) and shares the resulting statistical characteristics with its neighbors. After receiving the statistical characteristics shared by its neighbors, each node tries to fit the global data distribution. In the aggregation phase, each node calculates the Kullback-Leibler (KL) divergences of the local data distribution over the fitted global data distribution, giving the weights to generate the aggregated model. Extensive experiments have demonstrated that FedEP can achieve faster convergence and show higher test performance than state-of-the-art approaches.
Abstract:Federated Learning (FL), introduced in 2016, was designed to enhance data privacy in collaborative model training environments. Among the FL paradigm, horizontal FL, where clients share the same set of features but different data samples, has been extensively studied in both centralized and decentralized settings. In contrast, Vertical Federated Learning (VFL), which is crucial in real-world decentralized scenarios where clients possess different, yet sensitive, data about the same entity, remains underexplored. Thus, this work introduces De-VertiFL, a novel solution for training models in a decentralized VFL setting. De-VertiFL contributes by introducing a new network architecture distribution, an innovative knowledge exchange scheme, and a distributed federated training process. Specifically, De-VertiFL enables the sharing of hidden layer outputs among federation clients, allowing participants to benefit from intermediate computations, thereby improving learning efficiency. De-VertiFL has been evaluated using a variety of well-known datasets, including both image and tabular data, across binary and multiclass classification tasks. The results demonstrate that De-VertiFL generally surpasses state-of-the-art methods in F1-score performance, while maintaining a decentralized and privacy-preserving framework.
Abstract:Inter-user interference (IUI) mitigation has been an essential issue for multi-user multiple-input multiple-output (MU-MIMO) communications. The commonly used linear processing schemes include the maximum-ratio combining (MRC), zero-forcing (ZF) and minimum mean squared error (MMSE) beamforming, which may result in the unfavorable performance or complexity as the antenna number grows. In this paper, we introduce a low-complexity linear beamforming solution for the IUI mitigation by using the convolutional beamspace (CBS) technique. Specifically, the dimension of channel matrix can be significantly reduced via the CBS preprocessing, thanks to its beamspace and spatial filtering effects. However, existing methods of the spatial filter design mainly benefit from the Vandermonde structure of channel matrix, which only holds for the far-field scenario with the uniform plane wave (UPW) model. As the antenna size increases, this characteristic may vanish in the near-field region of the array, where the uniform spherical wave (USW) propagation becomes dominant. To gain useful insights, we first investigate the beamforming design and performance analysis of the CBS-based beamforming based on the UPW model. Our results unveil that the proposed CBS-based MMSE beamforming is able to achieve a near-optimal performance but demands remarkably lower complexity than classical ZF and MMSE schemes. Furthermore, our analysis is also extended to the near-field case. To this end, a novel optimization-based CBS approach is proposed for preserving spatial filtering effects, thus rendering the compatibility of the CBS-based beamforming. Finally, numerical results are provided to demonstrate the effectiveness of our proposed CBS-based beamforming method.
Abstract:With the development of deep learning techniques, deep recommendation models also achieve remarkable improvements in terms of recommendation accuracy. However, due to the large number of candidate items in practice and the high cost of preference computation, these methods also suffer from low efficiency of recommendation. The recently proposed tree-based deep recommendation models alleviate the problem by directly learning tree structure and representations under the guidance of recommendation objectives. However, such models have shortcomings. The max-heap assumption in the hierarchical tree, in which the preference for a parent node should be the maximum between the preferences for its children, is difficult to satisfy in their binary classification objectives. To this end, we propose Tree-based Deep Retrieval (TDR for short) for efficient recommendation. In TDR, all the trees generated during the training process are retained to form the forest. When learning the node representation of each tree, we have to satisfy the max-heap assumption as much as possible and mimic beam search behavior over the tree in the training stage. This is achieved by TDR to regard the training task as multi-classification over tree nodes at the same level. However, the number of tree nodes grows exponentially with levels, making us train the preference model with the guidance of the sampled-softmax technique. The experiments are conducted on real-world datasets, validating the effectiveness of the proposed preference model learning method and tree learning method.
Abstract:Multiple-input multiple-output has been a key technology for wireless systems for decades. For typical MIMO communication systems, antenna array elements are usually separated by half of the carrier wavelength, thus termed as conventional MIMO. In this paper, we investigate the performance of multi-user MIMO communication, with sparse arrays at both the transmitter and receiver side, i.e., the array elements are separated by more than half wavelength. Given the same number of array elements, the performance of sparse MIMO is compared with conventional MIMO. On one hand, sparse MIMO has a larger aperture, which can achieve narrower main lobe beams that make it easier to resolve densely located users. Besides, increased array aperture also enlarges the near-field communication region, which can enhance the spatial multiplexing gain, thanks to the spherical wavefront property in the near-field region. On the other hand, element spacing larger than half wavelength leads to undesired grating lobes, which, if left unattended, may cause severe inter-user interference. To gain further insights, we first study the spatial multiplexing gain of the basic single-user sparse MIMO communication system, where a closed-form expression of the near-field effective degree of freedom is derived. The result shows that the EDoF increases with the array sparsity for sparse MIMO before reaching its upper bound, which equals to the minimum value between the transmit and receive antenna numbers. Furthermore, the scaling law for the achievable data rate with varying array sparsity is analyzed and an array sparsity-selection strategy is proposed. We then consider the more general multi-user sparse MIMO communication system. It is shown that sparse MIMO is less likely to experience severe IUI than conventional MIMO.
Abstract:Understanding neural activity and information representation is crucial for advancing knowledge of brain function and cognition. Neural activity, measured through techniques like electrophysiology and neuroimaging, reflects various aspects of information processing. Recent advances in deep neural networks offer new approaches to analyzing these signals using pre-trained models. However, challenges arise due to discrepancies between different neural signal modalities and the limited scale of high-quality neural data. To address these challenges, we present NeuroBind, a general representation that unifies multiple brain signal types, including EEG, fMRI, calcium imaging, and spiking data. To achieve this, we align neural signals in these image-paired neural datasets to pre-trained vision-language embeddings. Neurobind is the first model that studies different neural modalities interconnectedly and is able to leverage high-resource modality models for various neuroscience tasks. We also showed that by combining information from different neural signal modalities, NeuroBind enhances downstream performance, demonstrating the effectiveness of the complementary strengths of different neural modalities. As a result, we can leverage multiple types of neural signals mapped to the same space to improve downstream tasks, and demonstrate the complementary strengths of different neural modalities. This approach holds significant potential for advancing neuroscience research, improving AI systems, and developing neuroprosthetics and brain-computer interfaces.
Abstract:We propose a robot learning method for communicating, planning, and executing a wide range of tasks, dubbed This&That. We achieve robot planning for general tasks by leveraging the power of video generative models trained on internet-scale data containing rich physical and semantic context. In this work, we tackle three fundamental challenges in video-based planning: 1) unambiguous task communication with simple human instructions, 2) controllable video generation that respects user intents, and 3) translating visual planning into robot actions. We propose language-gesture conditioning to generate videos, which is both simpler and clearer than existing language-only methods, especially in complex and uncertain environments. We then suggest a behavioral cloning design that seamlessly incorporates the video plans. This&That demonstrates state-of-the-art effectiveness in addressing the above three challenges, and justifies the use of video generation as an intermediate representation for generalizable task planning and execution. Project website: https://cfeng16.github.io/this-and-that/.
Abstract:The lifelong user behavior sequence provides abundant information of user preference and gains impressive improvement in the recommendation task, however increases computational consumption significantly. To meet the severe latency requirement in online service, a short sub-sequence is sampled based on similarity to the target item. Unfortunately, items not in the sub-sequence are abandoned, leading to serious information loss. In this paper, we propose a new efficient paradigm to model the full lifelong sequence, which is named as \textbf{I}nteraction \textbf{F}idelity \textbf{A}ttention (\textbf{IFA}). In IFA, we input all target items in the candidate set into the model at once, and leverage linear transformer to reduce the time complexity of the cross attention between the candidate set and the sequence without any interaction information loss. We also additionally model the relationship of all target items for optimal set generation, and design loss function for better consistency of training and inference. We demonstrate the effectiveness and efficiency of our model by off-line and online experiments in the recommender system of Kuaishou.