Abstract:Language models are strong few-shot learners and achieve good overall accuracy in text classification tasks, masking the fact that their results suffer from great class accuracy imbalance. We believe that the pursuit of overall accuracy should not come from enriching the strong classes, but from raising up the weak ones. To address the imbalance, we propose a post-hoc nonlinear integer programming based debiasing method that ensembles weight correction and membership correction to enable flexible rectifications of class probabilities at both class and sample levels, enhancing the performance of LLMs directly from their outputs. Evaluations with Llama-2-13B on seven text classification benchmarks show that our approach achieves state-of-the-art overall accuracy gains with balanced class accuracies. The resulted probability correction scheme demonstrates that sample-level corrections are necessary to elevate weak classes. In addition, due to effectively correcting weak classes, our method also brings significant performance gains to Llama-2-70B, especially on a biomedical domain task, demonstrating its effectiveness across both small and large model variants.
Abstract:Federated Learning (FL) is a widely adopted privacy-preserving distributed learning framework, yet its generalization performance remains less explored compared to centralized learning. In FL, the generalization error consists of two components: the out-of-sample gap, which measures the gap between the empirical and true risk for participating clients, and the participation gap, which quantifies the risk difference between participating and non-participating clients. In this work, we apply an information-theoretic analysis via the conditional mutual information (CMI) framework to study FL's two-level generalization. Beyond the traditional supersample-based CMI framework, we introduce a superclient construction to accommodate the two-level generalization setting in FL. We derive multiple CMI-based bounds, including hypothesis-based CMI bounds, illustrating how privacy constraints in FL can imply generalization guarantees. Furthermore, we propose fast-rate evaluated CMI bounds that recover the best-known convergence rate for two-level FL generalization in the small empirical risk regime. For specific FL model aggregation strategies and structured loss functions, we refine our bounds to achieve improved convergence rates with respect to the number of participating clients. Empirical evaluations confirm that our evaluated CMI bounds are non-vacuous and accurately capture the generalization behavior of FL algorithms.
Abstract:Robotic manipulators often face challenges in handling objects of different sizes and materials, limiting their effectiveness in practical applications. This issue is particularly pronounced when manipulating meter-scale objects or those with varying stiffness, as traditional gripping techniques and strategies frequently prove inadequate. In this letter, we introduce a novel surface-based multi-module robotic manipulation framework that utilizes a Central Pattern Generator (CPG)-based motion generator, combined with a simulation-based optimization method to determine the optimal manipulation parameters for a multi-module origami robotic surface (Ori-Pixel). This approach allows for the manipulation of objects ranging from centimeters to meters in size, with varying stiffness and shape. The optimized CPG parameters are tested through both dynamic simulations and a series of prototype experiments involving a wide range of objects differing in size, weight, shape, and material, demonstrating robust manipulation capabilities.
Abstract:Intelligence lies not only in the brain but in the body. The shape of our bodies can influence how we think and interact with the physical world. In robotics research, interacting with the physical world is crucial as it allows robots to manipulate objects in various real-life scenarios. Conventional robotic manipulation strategies mainly rely on finger-shaped end effectors. However, achieving stable grasps on fragile, deformable, irregularly shaped, or slippery objects is challenging due to difficulties in establishing stable force or geometric constraints. Here, we present surface-based manipulation strategies that diverge from classical grasping approaches, using with flat surfaces as minimalist end-effectors. By changing the position and orientation of these surfaces, objects can be translated, rotated and even flipped across the surface using closed-loop control strategies. Since this method does not rely on stable grasp, it can adapt to objects of various shapes, sizes, and stiffness levels, even enabling the manipulation the shape of deformable objects. Our results provide a new perspective for solving complex manipulation problems.
Abstract:This paper introduces a novel problem, distributional information embedding, motivated by the practical demands of multi-bit watermarking for large language models (LLMs). Unlike traditional information embedding, which embeds information into a pre-existing host signal, LLM watermarking actively controls the text generation process--adjusting the token distribution--to embed a detectable signal. We develop an information-theoretic framework to analyze this distributional information embedding problem, characterizing the fundamental trade-offs among three critical performance metrics: text quality, detectability, and information rate. In the asymptotic regime, we demonstrate that the maximum achievable rate with vanishing error corresponds to the entropy of the LLM's output distribution and increases with higher allowable distortion. We also characterize the optimal watermarking scheme to achieve this rate. Extending the analysis to the finite-token case, we identify schemes that maximize detection probability while adhering to constraints on false alarm and distortion.
Abstract:Hierarchical text classification (HTC) aims to assign one or more labels in the hierarchy for each text. Many methods represent this structure as a global hierarchy, leading to redundant graph structures. To address this, incorporating a text-specific local hierarchy is essential. However, existing approaches often model this local hierarchy as a sequence, focusing on explicit parent-child relationships while ignoring implicit correlations among sibling/peer relationships. In this paper, we first integrate local hierarchies into a manual depth-level prompt to capture parent-child relationships. We then apply Mixup to this hierarchical prompt tuning scheme to improve the latent correlation within sibling/peer relationships. Notably, we propose a novel Mixup ratio guided by local hierarchy correlation to effectively capture intrinsic correlations. This Local Hierarchy Mixup (LH-Mix) model demonstrates remarkable performance across three widely-used datasets.
Abstract:Model merging combines multiple fine-tuned models into a single one via parameter fusion, achieving improvements across many tasks. However, in the classification task, we find a misalignment issue between merging outputs and the fine-tuned classifier, which limits its effectiveness. In this paper, we demonstrate the following observations: (1) The embedding quality of the merging outputs is already very high, and the primary reason for the differences in classification performance lies in the misalignment issue. (2) We propose FT-Classifier, a new protocol that fine-tunes an aligned classifier with few-shot samples to alleviate misalignment, enabling better evaluation of merging outputs and improved classification performance. (3) The misalignment is relatively straightforward and can be formulated as an orthogonal transformation. Experiments demonstrate the existence of misalignment and the effectiveness of our FT-Classifier evaluation protocol.
Abstract:In this work, we introduce novel information-theoretic generalization bounds using the conditional $f$-information framework, an extension of the traditional conditional mutual information (MI) framework. We provide a generic approach to derive generalization bounds via $f$-information in the supersample setting, applicable to both bounded and unbounded loss functions. Unlike previous MI-based bounds, our proof strategy does not rely on upper bounding the cumulant-generating function (CGF) in the variational formula of MI. Instead, we set the CGF or its upper bound to zero by carefully selecting the measurable function invoked in the variational formula. Although some of our techniques are partially inspired by recent advances in the coin-betting framework (e.g., Jang et al. (2023)), our results are independent of any previous findings from regret guarantees of online gambling algorithms. Additionally, our newly derived MI-based bound recovers many previous results and improves our understanding of their potential limitations. Finally, we empirically compare various $f$-information measures for generalization, demonstrating the improvement of our new bounds over the previous bounds.
Abstract:Large Language Models (LLMs) boosts human efficiency but also poses misuse risks, with watermarking serving as a reliable method to differentiate AI-generated content from human-created text. In this work, we propose a novel theoretical framework for watermarking LLMs. Particularly, we jointly optimize both the watermarking scheme and detector to maximize detection performance, while controlling the worst-case Type-I error and distortion in the watermarked text. Within our framework, we characterize the universally minimum Type-II error, showing a fundamental trade-off between detection performance and distortion. More importantly, we identify the optimal type of detectors and watermarking schemes. Building upon our theoretical analysis, we introduce a practical, model-agnostic and computationally efficient token-level watermarking algorithm that invokes a surrogate model and the Gumbel-max trick. Empirical results on Llama-13B and Mistral-8$\times$7B demonstrate the effectiveness of our method. Furthermore, we also explore how robustness can be integrated into our theoretical framework, which provides a foundation for designing future watermarking systems with improved resilience to adversarial attacks.
Abstract:Model merging combines multiple homologous models into one model, achieving convincing generalization without the necessity of additional training. A key challenge in this problem is resolving parameter redundancies and conflicts across multiple models. Existing models have demonstrated that dropping a portion of delta parameters can alleviate conflicts while maintaining performance. However, these methods often drop parameters either randomly or based on magnitude, overlooking task-specific information embedded in fine-tuned models. In this paper, we propose an Activated Parameter Locating (APL) method that utilizes causal intervention to estimate parameter importance, enabling more precise parameter drops and better conflict mitigation. Moreover, to reduce the computational complexity associated with a large number of parameter partitions, we also introduce a theoretically supported gradient approximation strategy for APL. Experiments on model merging within both in-domain and out-of-domain settings, along with associated analyses, showcase the effectiveness of APL.