Abstract:In this paper, we introduce Hunyuan-Large, which is currently the largest open-source Transformer-based mixture of experts model, with a total of 389 billion parameters and 52 billion activated parameters, capable of handling up to 256K tokens. We conduct a thorough evaluation of Hunyuan-Large's performance across various benchmarks including language understanding and generation, logical reasoning, mathematical problem-solving, coding, long-context, and aggregated tasks, where it outperforms Llama3.1-70B and exhibits comparable performance to the significantly larger Llama3.1-405B model. Key practices of Hunyuan-Large include large-scale synthetic data that is orders of magnitude larger than in previous literature, a mixed expert routing strategy, a key-value cache compression technique, and an expert-specific learning rate strategy. We also investigate the scaling laws and learning rate schedule of mixture of experts models, providing valuable insights and guidance for future model development and optimization. The code and checkpoints of Hunyuan-Large are released to facilitate future innovations and applications. Code: https://github.com/Tencent/Hunyuan-Large Models: https://huggingface.co/tencent/Tencent-Hunyuan-Large
Abstract:In recent years, video action recognition, a fundamental task in the field of video understanding, has been deeply explored by numerous researchers. Most traditional video action recognition methods convert videos into three-dimensional data that encapsulates both spatial and temporal information, and then leverage prevalent image understanding models to model and analyze these data. However, these methods have significant drawbacks. First, image understanding models often need to be adapted in terms of model architecture and preprocessing for these spatiotemporal tasks. Second, dealing with high-dimensional data often poses greater challenges and incurs higher time costs than dealing with its lower-dimensional counterparts. To bridge the gap between image-understanding and video-understanding tasks while reducing the complexity of video comprehension, we introduce a novel video representation architecture, Flatten, which serves as a plug-and-play module that can be seamlessly integrated into any image-understanding network for efficient and effective 3D temporal data modeling. Specifically, by applying specific flattening operations (e.g., row-major transform), 3D spatiotemporal data is transformed into 2D spatial information, and then ordinary image understanding models are used to capture temporal dynamics and spatial semantic information, which in turn accomplishes effective and efficient video action recognition. Extensive experiments on commonly used datasets (Kinetics-400, Something-Something v2, and HMDB-51) and three classical image classification models (Uniformer, SwinV2, and ResNet) have demonstrated that embedding Flatten provides significant performance improvements over the original models.
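The row-major flattening mentioned in the abstract above can be illustrated with a minimal NumPy sketch. It assumes that a "row-major transform" means tiling the T frames of a clip into a grid, left to right and then top to bottom, so that the clip becomes one large 2D image. The function name `flatten_row_major` and the `cols` parameter are hypothetical and are not taken from the paper, whose actual operation may differ.

```python
import numpy as np


def flatten_row_major(clip: np.ndarray, cols: int) -> np.ndarray:
    """Tile a clip of shape (T, H, W, C) into one image of shape
    (rows * H, cols * W, C), placing frames in row-major order.

    Hypothetical helper for illustration only.
    """
    t, h, w, c = clip.shape
    if t % cols != 0:
        raise ValueError("number of frames must be divisible by cols")
    rows = t // cols
    # Split the time axis into (rows, cols), then interleave it with H and W.
    grid = clip.reshape(rows, cols, h, w, c)
    grid = grid.transpose(0, 2, 1, 3, 4)  # (rows, H, cols, W, C)
    return grid.reshape(rows * h, cols * w, c)


# Example: 8 frames of 4x4 RGB become a 2x4 grid of frames.
clip = np.zeros((8, 4, 4, 3), dtype=np.float32)
image = flatten_row_major(clip, cols=4)
print(image.shape)  # (8, 16, 3)
```

The resulting array can be passed to an ordinary 2D image model, in which case temporal neighbours appear as spatial neighbours along each grid row.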
Abstract:The recent success of Large Language Models (LLMs) has garnered significant attention in both academia and industry. Prior research on LLMs has primarily focused on enhancing or leveraging their generalization capabilities in zero- and few-shot settings. However, there has been limited investigation into effectively fine-tuning LLMs for a specific natural language understanding task in supervised settings. In this study, we conduct an experimental analysis by fine-tuning LLMs for the task of Chinese short text matching. We explore various factors that influence performance when fine-tuning LLMs, including task modeling methods, prompt formats, and output formats.
Abstract:Traffic congestion is a persistent problem in urban areas, which calls for the development of effective traffic signal control (TSC) systems. While existing Reinforcement Learning (RL)-based methods have shown promising performance in optimizing TSC, it is challenging to generalize these methods across intersections of different structures. In this work, a universal RL-based TSC framework is proposed for Vehicle-to-Everything (V2X) environments. The proposed framework introduces a novel agent design that incorporates a junction matrix to characterize intersection states, making the proposed model applicable to diverse intersections. To better equip the proposed RL-based framework to handle various intersection structures, novel traffic state augmentation methods are tailor-made for signal light control systems. Finally, extensive experimental results derived from multiple intersection configurations confirm the effectiveness of the proposed framework. The source code for this work is available at https://github.com/wmn7/Universal_Light
Abstract:In this paper, we initiate the study of rate-splitting multiple access (RSMA) for a mono-static integrated sensing and communication (ISAC) system, where the dual-functional base station (BS) simultaneously communicates with multiple users and detects multiple moving targets. We aim to optimize the ISAC waveform to jointly maximize the max-min fairness (MMF) rate of the communication users and minimize the largest eigenvalue of the Cramér-Rao bound (CRB) matrix for unbiased estimation. The CRB matrix considered in this work is general, as it involves the estimation of angular direction, complex reflection coefficient, and Doppler frequency for multiple moving targets. Simulation results demonstrate that RSMA achieves a better communication-sensing trade-off than conventional space-division multiple access (SDMA) and that it is capable of detecting multiple targets with high accuracy. These findings highlight the potential of RSMA as an effective and powerful strategy for interference management in general multi-user multi-target ISAC systems.
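The two design goals named in the abstract above can be written schematically as follows. This is a hedged sketch rather than the paper's exact problem: the symbols $\mathbf{P}$ (transmit precoders), $R_k$ (achievable rate of user $k$), $\mathcal{K}$ (user set), $P_t$ (transmit power budget), and the scalarization weight $\mu$ are notation assumed here for illustration.

```latex
% Communication objective: max-min fairness (MMF) rate
\max_{\mathbf{P}} \; \min_{k \in \mathcal{K}} \; R_k(\mathbf{P})

% Sensing objective: smallest possible worst-case estimation bound
\min_{\mathbf{P}} \; \lambda_{\max}\!\big(\mathbf{CRB}(\mathbf{P})\big)

% One possible scalarization (an assumption, with weight \mu \ge 0)
\max_{\mathbf{P}} \; \min_{k \in \mathcal{K}} R_k(\mathbf{P})
  \;-\; \mu \, \lambda_{\max}\!\big(\mathbf{CRB}(\mathbf{P})\big)
\quad \text{s.t.} \quad \operatorname{tr}\!\big(\mathbf{P}\mathbf{P}^{H}\big) \le P_t
```

Because $\lambda_{\max}(\mathbf{CRB}) = \max_{\|\mathbf{u}\|=1} \mathbf{u}^{H}\,\mathbf{CRB}\,\mathbf{u}$, minimizing the largest eigenvalue tightens the bound on the estimation error variance in the worst-case direction of the parameter space. Sweeping $\mu$ would trace out a communication-sensing trade-off curve.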
Abstract:Traffic signal control has the potential to reduce congestion in dynamic networks. Recent studies show that traffic signal control with reinforcement learning (RL) methods can significantly reduce the average waiting time. However, a shortcoming of existing methods is that they require model retraining for new intersections with different structures. In this paper, we propose a novel reinforcement learning approach with augmented data (ADLight) to train a universal model for intersections with different structures. We propose a new agent design that incorporates features on movements and uses the action "set current phase duration", which allows the generalized model to have the same structure for different intersections. A new data augmentation method named movement shuffle is developed to improve the generalization performance. We also test the universal model on new intersections in Simulation of Urban MObility (SUMO). The results show that the performance of our approach is close to that of models trained directly in a single environment (only a 5% loss in average waiting time), while training time is reduced by more than 80%, which saves substantial computational resources in scalable operations of traffic lights.
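The movement shuffle augmentation named in the abstract above is not specified in detail here, so the following is only a minimal sketch under a stated assumption: the intersection state is a matrix with one row of features per traffic movement, and the augmentation randomly permutes the row order so that the model does not depend on a fixed movement ordering. The function name `movement_shuffle` and the state layout are hypothetical.

```python
import numpy as np


def movement_shuffle(state: np.ndarray, rng: np.random.Generator):
    """Randomly permute the movement rows of a state matrix.

    state: array of shape (num_movements, num_features); this layout is an
    assumption made for illustration.
    Returns the shuffled state and the permutation that was applied.
    """
    perm = rng.permutation(state.shape[0])
    return state[perm], perm


# Example: 4 movements, 3 features each.
rng = np.random.default_rng(seed=0)
state = np.arange(12, dtype=np.float32).reshape(4, 3)
augmented, perm = movement_shuffle(state, rng)
# The original order can be recovered with the inverse permutation.
restored = augmented[np.argsort(perm)]
```

Returning the permutation allows any per-movement labels or masks to be reordered consistently with the state.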
Abstract:The Dual-Functional Radar-Communication (DFRC) system is an essential and promising technique for beyond 5G. In this work, we propose a powerful and unified multi-antenna DFRC transmission framework, where an additional radar sequence is transmitted apart from communication streams to enhance radar beampattern matching capability, and Rate-Splitting Multiple Access (RSMA) is adopted to better manage the interference. RSMA relies on multi-antenna Rate-Splitting (RS) with Successive Interference Cancellation (SIC) receivers, and on the splitting and encoding of messages into common and private streams. We design the message split and the precoders of the radar sequence and communication streams to jointly maximize the Weighted Sum Rate (WSR) and minimize the radar beampattern approximation Mean Square Error (MSE) subject to a per-antenna power constraint. An iterative algorithm based on the Alternating Direction Method of Multipliers (ADMM) is developed to solve the problem. Numerical results first show that RSMA-assisted DFRC achieves a better tradeoff between WSR and beampattern approximation than Space-Division Multiple Access (SDMA)-assisted DFRC with or without a radar sequence, and than other simpler radar-communication strategies using orthogonal resources. We also show that the RSMA-assisted DFRC frameworks with and without a radar sequence achieve the same tradeoff performance. This is because the common stream is better exploited in the proposed framework. The common stream of RSMA fulfils the triple function of managing interference among communication users, managing interference between communication and radar, and beampattern approximation. Therefore, by enabling RSMA in DFRC, the system performance is enhanced while the system architecture is simplified, since there is no need to use an additional radar sequence and SIC. We conclude that RSMA is a more powerful multiple access scheme for DFRC.
Abstract:To further exploit the potential of joint multi-antenna radar-communication (RadCom) systems, we propose two transmission techniques based, respectively, on separated and shared antenna deployments. Both techniques are designed to maximize the weighted sum rate (WSR) and the probing power at the target's location under average power constraints at the antennas, such that the system can simultaneously communicate with downlink users and detect the target within the same frequency band. Based on a Weighted Minimum Mean Square Error (WMMSE) method, the separated deployment transmission is designed via semidefinite programming (SDP), while the shared deployment problem is solved by a majorization-minimization (MM) algorithm. Numerical results show that the shared deployment outperforms the separated deployment in radar beamforming. The tradeoffs between WSR and probing power at the target are compared among the two proposed transmission techniques and two practically simpler dual-function implementations, i.e., time division and frequency division. Results show that although the separated deployment enables spectrum sharing, it experiences a performance loss compared with frequency division, while the shared deployment outperforms both and surpasses time division in certain conditions.