Abstract:Vertical Federated Learning (VFL) is a privacy-preserving collaborative learning paradigm that enables multiple parties with distinct feature sets to jointly train machine learning models without sharing their raw data. Despite its potential to facilitate cross-organizational collaborations, the deployment of VFL systems in real-world applications remains limited. To investigate the gap between existing VFL research and practical deployment, this survey analyzes the real-world data distributions in potential VFL applications and identifies four key findings that highlight this gap. We propose a novel data-oriented taxonomy of VFL algorithms based on real VFL data distributions. Our comprehensive review of existing VFL algorithms reveals that some common practical VFL scenarios have few or no viable solutions. Based on these observations, we outline key research directions aimed at bridging the gap between current VFL research and real-world applications.
Abstract:Large Language Models (LLMs) have revolutionized artificial intelligence with capabilities in reasoning, coding, and communication, driving innovation across industries. Their true potential depends on effective alignment to ensure correct, trustworthy and ethical behavior, addressing challenges like misinformation, hallucinations, bias and misuse. While existing Reinforcement Learning (RL)-based alignment methods are notoriously complex, direct optimization approaches offer a simpler alternative. In this work, we introduce a novel direct optimization approach for LLM alignment by drawing on established Information Retrieval (IR) principles. We present a systematic framework that bridges LLM alignment and IR methodologies, mapping LLM generation and reward models to IR's retriever-reranker paradigm. Building on this foundation, we propose LLM Alignment as Retriever Preference Optimization (LarPO), a new alignment method that enhances overall alignment quality. Extensive experiments validate LarPO's effectiveness with 38.9 % and 13.7 % averaged improvement on AlpacaEval2 and MixEval-Hard respectively. Our work opens new avenues for advancing LLM alignment by integrating IR foundations, offering a promising direction for future research.
Abstract:We introduce MiniMax-01 series, including MiniMax-Text-01 and MiniMax-VL-01, which are comparable to top-tier models while offering superior capabilities in processing longer contexts. The core lies in lightning attention and its efficient scaling. To maximize computational capacity, we integrate it with Mixture of Experts (MoE), creating a model with 32 experts and 456 billion total parameters, of which 45.9 billion are activated for each token. We develop an optimized parallel strategy and highly efficient computation-communication overlap techniques for MoE and lightning attention. This approach enables us to conduct efficient training and inference on models with hundreds of billions of parameters across contexts spanning millions of tokens. The context window of MiniMax-Text-01 can reach up to 1 million tokens during training and extrapolate to 4 million tokens during inference at an affordable cost. Our vision-language model, MiniMax-VL-01 is built through continued training with 512 billion vision-language tokens. Experiments on both standard and in-house benchmarks show that our models match the performance of state-of-the-art models like GPT-4o and Claude-3.5-Sonnet while offering 20-32 times longer context window. We publicly release MiniMax-01 at https://github.com/MiniMax-AI.
Abstract:Scaling language models to handle longer input sequences typically necessitates large key-value (KV) caches, resulting in substantial memory overhead during inference. In this paper, we propose Tensor Product Attention (TPA), a novel attention mechanism that uses tensor decompositions to represent queries, keys, and values compactly, significantly shrinking KV cache size at inference time. By factorizing these representations into contextual low-rank components (contextual factorization) and seamlessly integrating with RoPE, TPA achieves improved model quality alongside memory efficiency. Based on TPA, we introduce the Tensor ProducT ATTenTion Transformer (T6), a new model architecture for sequence modeling. Through extensive empirical evaluation of language modeling tasks, we demonstrate that T6 exceeds the performance of standard Transformer baselines including MHA, MQA, GQA, and MLA across various metrics, including perplexity and a range of renowned evaluation benchmarks. Notably, TPAs memory efficiency enables the processing of significantly longer sequences under fixed resource constraints, addressing a critical scalability challenge in modern language models. The code is available at https://github.com/tensorgi/T6.
Abstract:Semi-Supervised Learning (SSL) can leverage abundant unlabeled data to boost model performance. However, the class-imbalanced data distribution in real-world scenarios poses great challenges to SSL, resulting in performance degradation. Existing class-imbalanced semi-supervised learning (CISSL) methods mainly focus on rebalancing datasets but ignore the potential of using hard examples to enhance performance, making it difficult to fully harness the power of unlabeled data even with sophisticated algorithms. To address this issue, we propose a method that enhances the performance of Imbalanced Semi-Supervised Learning by Mining Hard Examples (SeMi). This method distinguishes the entropy differences among logits of hard and easy examples, thereby identifying hard examples and increasing the utility of unlabeled data, better addressing the imbalance problem in CISSL. In addition, we maintain a class-balanced memory bank with confidence decay for storing high-confidence embeddings to enhance the pseudo-labels' reliability. Although our method is simple, it is effective and seamlessly integrates with existing approaches. We perform comprehensive experiments on standard CISSL benchmarks and experimentally demonstrate that our proposed SeMi outperforms existing state-of-the-art methods on multiple benchmarks, especially in reversed scenarios, where our best result shows approximately a 54.8\% improvement over the baseline methods.
Abstract:As intelligent reflecting surface (IRS) has emerged as a new and promising technology capable of configuring the wireless environment favorably, channel estimation for IRS-assisted multiple-input multiple-output (MIMO) systems has garnered extensive attention in recent years. While various algorithms have been proposed to address this challenge, there is a lack of rigorous theoretical error analysis. This paper aims to address this gap by providing theoretical guarantees in terms of stable recovery of channel matrices for noisy measurements. We begin by establishing the equivalence between IRS-assisted MIMO systems and a compact tensor train (TT)-based tensor-on-tensor (ToT) regression. Building on this equivalence, we then investigate the restricted isometry property (RIP) for complex-valued subgaussian measurements. Our analysis reveals that successful recovery hinges on the relationship between the number of user terminals (in the uplink scenario) or base stations (in the downlink scenario) and the number of time slots during which channel matrices remain invariant. Utilizing the RIP condition, we analyze the theoretical recovery error for the solution to a constrained least-squares optimization problem, including upper error bound and minimax lower bound, demonstrating that the error decreases inversely with the number of time slots and increases proportionally with the number of unknown elements in the channel matrices. In addition, we extend our error analysis to two more specialized IRS-assisted MIMO systems, incorporating low-rank channel matrices or an unknown IRS. Furthermore, we explore a multi-hop IRS scheme and analyze the corresponding recovery errors. Finally, we introduce and implement two nonconvex optimization algorithms--alternating least squares and alternating gradient descent--to validate our conclusions through simulations.
Abstract:Vision tokenizers have gained a lot of attraction due to their scalability and compactness; previous works depend on old-school GAN-based hyperparameters, biased comparisons, and a lack of comprehensive analysis of the scaling behaviours. To tackle those issues, we introduce Grouped Spherical Quantization (GSQ), featuring spherical codebook initialization and lookup regularization to constrain codebook latent to a spherical surface. Our empirical analysis of image tokenizer training strategies demonstrates that GSQ-GAN achieves superior reconstruction quality over state-of-the-art methods with fewer training iterations, providing a solid foundation for scaling studies. Building on this, we systematically examine the scaling behaviours of GSQ, specifically in latent dimensionality, codebook size, and compression ratios, and their impact on model performance. Our findings reveal distinct behaviours at high and low spatial compression levels, underscoring challenges in representing high-dimensional latent spaces. We show that GSQ can restructure high-dimensional latent into compact, low-dimensional spaces, thus enabling efficient scaling with improved quality. As a result, GSQ-GAN achieves a 16x down-sampling with a reconstruction FID (rFID) of 0.50.
Abstract:A large amount of instructional text data is essential to enhance the performance of pre-trained large language models (LLMs) for downstream tasks. This data can contain sensitive information and therefore cannot be shared in practice, resulting in data silos that limit the effectiveness of LLMs on various tasks. Federated learning (FL) enables collaborative fine-tuning across different clients without sharing their data. Nonetheless, in practice, this instructional text data is highly heterogeneous in both quantity and distribution across clients, necessitating distinct model structures to best accommodate the variations. However, existing federated fine-tuning approaches either enforce the same model structure or rely on predefined ad-hoc architectures unaware of data distribution, resulting in suboptimal performance. To address this challenge, we propose FedAMoLE, a lightweight personalized federated fine-tuning framework that leverages data-driven heterogeneous model architectures. FedAMoLE introduces the Adaptive Mixture of LoRA Experts (AMoLE) module, which facilitates model heterogeneity with minimal communication overhead by allocating varying numbers of LoRA-based domain experts to each client. Furthermore, we develop a reverse selection-based expert assignment (RSEA) strategy, which enables data-driven model architecture adjustment during fine-tuning by allowing domain experts to select clients that best align with their knowledge domains. Extensive experiments across six different scenarios of data heterogeneity demonstrate that FedAMoLE significantly outperforms existing methods for federated LLM fine-tuning, achieving superior accuracy while maintaining good scalability.
Abstract:The process of reconstructing quantum states from experimental measurements, accomplished through quantum state tomography (QST), plays a crucial role in verifying and benchmarking quantum devices. A key challenge of QST is to find out how the accuracy of the reconstruction depends on the number of state copies used in the measurements. When multiple measurement settings are used, the total number of state copies is determined by multiplying the number of measurement settings with the number of repeated measurements for each setting. Due to statistical noise intrinsic to quantum measurements, a large number of repeated measurements is often used in practice. However, recent studies have shown that even with single-sample measurements--where only one measurement sample is obtained for each measurement setting--high accuracy QST can still be achieved with a sufficiently large number of different measurement settings. In this paper, we establish a theoretical understanding of the trade-off between the number of measurement settings and the number of repeated measurements per setting in QST. Our focus is primarily on low-rank density matrix recovery using Pauli measurements. We delve into the global landscape underlying the low-rank QST problem and demonstrate that the joint consideration of measurement settings and repeated measurements ensures a bounded recovery error for all second-order critical points, to which optimization algorithms tend to converge. This finding suggests the advantage of minimizing the number of repeated measurements per setting when the total number of state copies is held fixed. Additionally, we prove that the Wirtinger gradient descent algorithm can converge to the region of second-order critical points with a linear convergence rate. We have also performed numerical experiments to support our theoretical findings.
Abstract:Tensor train (TT) decomposition represents an $N$-order tensor using $O(N)$ matrices (i.e., factors) of small dimensions, achieved through products among these factors. Due to its compact representation, TT decomposition has found wide applications, including various tensor recovery problems in signal processing and quantum information. In this paper, we study the problem of reconstructing a TT format tensor from measurements that are contaminated by outliers with arbitrary values. Given the vulnerability of smooth formulations to corruptions, we use an $\ell_1$ loss function to enhance robustness against outliers. We first establish the $\ell_1/\ell_2$-restricted isometry property (RIP) for Gaussian measurement operators, demonstrating that the information in the TT format tensor can be preserved using a number of measurements that grows linearly with $N$. We also prove the sharpness property for the $\ell_1$ loss function optimized over TT format tensors. Building on the $\ell_1/\ell_2$-RIP and sharpness property, we then propose two complementary methods to recover the TT format tensor from the corrupted measurements: the projected subgradient method (PSubGM), which optimizes over the entire tensor, and the factorized Riemannian subgradient method (FRSubGM), which optimizes directly over the factors. Compared to PSubGM, the factorized approach FRSubGM significantly reduces the memory cost at the expense of a slightly slower convergence rate. Nevertheless, we show that both methods, with diminishing step sizes, converge linearly to the ground-truth tensor given an appropriate initialization, which can be obtained by a truncated spectral method.