Alibaba Group
Abstract:The hypothesis that pretrained large language models (LLMs) necessitate only minimal supervision during the fine-tuning (SFT) stage (Zhou et al., 2024) has been substantiated by recent advancements in data curation and selection research. However, their stability and generalizability are compromised due to the vulnerability to experimental setups and validation protocols, falling short of surpassing random sampling (Diddee & Ippolito, 2024; Xia et al., 2024b). Built upon LLMs, multi-modal LLMs (MLLMs), combined with the sheer token volume and heightened heterogeneity of data sources, amplify both the significance and complexity of data selection. To harvest multi-modal instructional data in a robust and efficient manner, we re-define the granularity of the quality metric by decomposing it into 14 vision-language-related capabilities, and introduce multi-modal rich scorers to evaluate the capabilities of each data candidate. To promote diversity, in light of the inherent objective of the alignment stage, we take interaction style as diversity indicator and use a multi-modal rich styler to identify data instruction patterns. In doing so, our multi-modal rich scorers and styler (mmSSR) guarantee that high-scoring information is conveyed to users in diversified forms. Free from embedding-based clustering or greedy sampling, mmSSR efficiently scales to millions of data with varying budget constraints, supports customization for general or specific capability acquisition, and facilitates training-free generalization to new domains for curation. Across 10+ experimental settings, validated by 14 multi-modal benchmarks, we demonstrate consistent improvements over random sampling, baseline strategies and state-of-the-art selection methods, achieving 99.1% of full performance with only 30% of the 2.6M data.
Abstract:Object detection and segmentation are widely employed in computer vision applications, yet conventional models like YOLO series, while efficient and accurate, are limited by predefined categories, hindering adaptability in open scenarios. Recent open-set methods leverage text prompts, visual cues, or prompt-free paradigm to overcome this, but often compromise between performance and efficiency due to high computational demands or deployment complexity. In this work, we introduce YOLOE, which integrates detection and segmentation across diverse open prompt mechanisms within a single highly efficient model, achieving real-time seeing anything. For text prompts, we propose Re-parameterizable Region-Text Alignment (RepRTA) strategy. It refines pretrained textual embeddings via a re-parameterizable lightweight auxiliary network and enhances visual-textual alignment with zero inference and transferring overhead. For visual prompts, we present Semantic-Activated Visual Prompt Encoder (SAVPE). It employs decoupled semantic and activation branches to bring improved visual embedding and accuracy with minimal complexity. For prompt-free scenario, we introduce Lazy Region-Prompt Contrast (LRPC) strategy. It utilizes a built-in large vocabulary and specialized embedding to identify all objects, avoiding costly language model dependency. Extensive experiments show YOLOE's exceptional zero-shot performance and transferability with high inference efficiency and low training cost. Notably, on LVIS, with 3$\times$ less training cost and 1.4$\times$ inference speedup, YOLOE-v8-S surpasses YOLO-Worldv2-S by 3.5 AP. When transferring to COCO, YOLOE-v8-L achieves 0.6 AP$^b$ and 0.4 AP$^m$ gains over closed-set YOLOv8-L with nearly 4$\times$ less training time. Code and models are available at https://github.com/THU-MIG/yoloe.
Abstract:We investigate joint bistatic positioning (BP) and monostatic sensing (MS) within a multi-input multi-output orthogonal frequency-division system. Based on the derived Cram\'er-Rao Bounds (CRBs), we propose novel beamforming optimization strategies that enable flexible performance trade-offs between BP and MS. Two distinct objectives are considered in this multi-objective optimization problem, namely, enabling user equipment to estimate its own position while accounting for unknown clock bias and orientation, and allowing the base station to locate passive targets. We first analyze digital schemes, proposing both weighted-sum CRB and weighted-sum mismatch (of beamformers and covariance matrices) minimization approaches. These are examined under full-dimension beamforming (FDB) and low-complexity codebook-based power allocation (CPA). To adapt to low-cost hardwares, we develop unit-amplitude analog FDB and CPA schemes based on the weighted-sum mismatch of the covariance matrices paradigm, solved using distinct methods. Numerical results confirm the effectiveness of our designs, highlighting the superiority of minimizing the weighted-sum mismatch of covariance matrices, and the advantages of mutual information fusion between BP and MS.
Abstract:In this paper, we consider near-field localization and sensing with an extremely large aperture array under partial blockage of array antennas, where spherical wavefront and spatial non-stationarity are accounted for. We propose an Ising model to characterize the clustered sparsity feature of the blockage pattern, develop an algorithm based on alternating optimization for joint channel parameter estimation and visibility region detection, and further estimate the locations of the user and environmental scatterers. The simulation results confirm the effectiveness of the proposed algorithm compared to conventional methods.
Abstract:Positioning technology, which aims to determine the geometric information of a device in a global coordinate, is a key component in integrated sensing and communication systems. In addition to traditional active anchor-based positioning systems, reconfigurable intelligent surfaces (RIS) have shown great potential for enhancing system performance. However, their ability to manipulate electromagnetic waves and ease of deployment pose potential risks, as unauthorized RIS may be intentionally introduced to jeopardize the positioning service. Such an unauthorized RIS can cause unexpected interference in the original localization system, distorting the transmitted signals, and leading to degraded positioning accuracy. In this work, we investigate the scenario of RIS-aided positioning in the presence of interference from an unauthorized RIS. Theoretical lower bounds are employed to analyze the impact of unauthorized RIS on channel parameter estimation and positioning accuracy. Several codebook design strategies for unauthorized RIS are evaluated, and various system arrangements are discussed. The simulation results show that an unauthorized RIS path with a high channel gain or a delay similar to that of legitimate RIS paths leads to poor positioning performance. Furthermore, unauthorized RIS generates more effective interference when using directional beamforming codebooks compared to random codebooks.
Abstract:Localization and tracking are critical components of integrated sensing and communication (ISAC) systems, enhancing resource management, beamforming accuracy, and overall system reliability through precise sensing. Due to the high path loss of the high-frequency systems, antenna arrays are required at the transmitter and receiver sides for beamforming gain. However, beam misalignment may occur, which requires accurate tracking of the six-dimensional (6D) state, namely, 3D position and 3D orientation. In this work, we first address the challenge that the rotation matrix, being part of the Lie group rather than Euclidean space, necessitates the derivation of the ICRB for an intrinsic performance benchmark. Then, leveraging the derived ICRB, we develop two filters-one utilizing pose fusion and the other employing error-state Kalman filter to estimate the UE's 6D state for different computational resource consumption and accuracy requirements. Simulation results validate the ICRB and assess the performance of the proposed filters, demonstrating their effectiveness and improved accuracy in 6D state tracking.
Abstract:Large language models have demonstrated exceptional performance across a wide range of tasks. However, dense models usually suffer from sparse activation, where many activation values tend towards zero (i.e., being inactivated). We argue that this could restrict the efficient exploration of model representation space. To mitigate this issue, we propose Finedeep, a deep-layered fine-grained expert architecture for dense models. Our framework partitions the feed-forward neural network layers of traditional dense models into small experts, arranges them across multiple sub-layers. A novel routing mechanism is proposed to determine each expert's contribution. We conduct extensive experiments across various model sizes, demonstrating that our approach significantly outperforms traditional dense architectures in terms of perplexity and benchmark performance while maintaining a comparable number of parameters and floating-point operations. Moreover, we find that Finedeep achieves optimal results when balancing depth and width, specifically by adjusting the number of expert sub-layers and the number of experts per sub-layer. Empirical results confirm that Finedeep effectively alleviates sparse activation and efficiently utilizes representation capacity in dense models.
Abstract:As large language models continue to scale, computational costs and resource consumption have emerged as significant challenges. While existing sparsification methods like pruning reduce computational overhead, they risk losing model knowledge through parameter removal. This paper proposes DSMoE (Dynamic Sparse Mixture-of-Experts), a novel approach that achieves sparsification by partitioning pre-trained FFN layers into computational blocks. We implement adaptive expert routing using sigmoid activation and straight-through estimators, enabling tokens to flexibly access different aspects of model knowledge based on input complexity. Additionally, we introduce a sparsity loss term to balance performance and computational efficiency. Extensive experiments on LLaMA models demonstrate that under equivalent computational constraints, DSMoE achieves superior performance compared to existing pruning and MoE approaches across language modeling and downstream tasks, particularly excelling in generation tasks. Analysis reveals that DSMoE learns distinctive layerwise activation patterns, providing new insights for future MoE architecture design.
Abstract:Beamforming plays a crucial role in millimeter wave (mmWave) communication systems to mitigate the severe attenuation inherent to this spectrum. However, the use of large active antenna arrays in conventional architectures often results in high implementation costs and excessive power consumption, limiting their practicality. As an alternative, deploying large arrays at transceivers using passive devices, such as reconfigurable intelligent surfaces (RISs), offers a more cost-effective and energy-efficient solution. In this paper, we investigate a promising base station (BS) architecture that integrates a beyond diagonal RIS (BD-RIS) within the BS to enable passive beamforming. By utilizing Takagi's decomposition and leveraging the effective beamforming vector, the RIS profile can be designed to enable passive beamforming directed toward the target. Through the beamforming analysis, we reveal that BD-RIS provides robust beamforming performance across various system configurations, whereas the traditional diagonal RIS (D-RIS) exhibits instability with increasing RIS size and decreasing BS-RIS separation-two critical factors in optimizing RIS-assisted systems. Comprehensive computer simulation results across various aspects validate the superiority of the proposed BS-integrated BD-RIS over conventional D-RIS architectures, showcasing performance comparable to active analog beamforming antenna arrays.
Abstract:We investigate the performance tradeoff between \textit{bistatic positioning (BP)} and \textit{monostatic sensing (MS)} in a multi-input multi-output orthogonal frequency division multiplexing scenario. We derive the Cram\'er-Rao bounds (CRBs) for BP at the user equipment and MS at the base station. To balance these objectives, we propose a multi-objective optimization framework that optimizes beamformers using a weighted-sum CRB approach, ensuring the weak Pareto boundary. We also introduce two mismatch-minimizing approaches, targeting beamformer mismatch and variance matrix mismatch, and solve them distinctly. Numerical results demonstrate the performance tradeoff between BP and MS, revealing significant gains with the proposed methods and highlighting the advantages of minimizing the weighted-sum mismatch of variance matrices.