Guanghui Song

Mixture of Weight-shared Heterogeneous Group Attention Experts for Dynamic Token-wise KV Optimization

Jun 16, 2025

Guanghui Song, Dongping Liao, Yiren Zhao, Kejiang Ye, Cheng-zhong Xu, Xitong Gao

Abstract: Transformer models face scalability challenges in causal language modeling (CLM) due to inefficient memory allocation for growing key-value (KV) caches, which strains compute and storage resources. Existing methods like Grouped Query Attention (GQA) and token-level KV optimization improve efficiency but rely on rigid resource allocation, often discarding "low-priority" tokens or statically grouping them, and thus fail to address the dynamic spectrum of token importance. We propose mixSGA, a novel mixture-of-experts (MoE) approach that dynamically optimizes token-wise computation and memory allocation. Unlike prior approaches, mixSGA retains all tokens while adaptively routing them to specialized experts with varying KV group sizes, balancing granularity and efficiency. Our key novelties include: (1) a token-wise expert-choice routing mechanism guided by learned importance scores, enabling proportional resource allocation without token discard; (2) weight-sharing across grouped attention projections to minimize parameter overhead; and (3) an auxiliary loss to ensure one-hot routing decisions for training-inference consistency in CLMs. Extensive evaluations across Llama3, TinyLlama, OPT, and Gemma2 model families show mixSGA's superiority over static baselines. On instruction-following and continued pretraining tasks, mixSGA achieves higher ROUGE-L and lower perplexity under the same KV budgets.
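The routing idea in this abstract (every token is kept, and more important tokens go to experts with finer KV grouping) can be illustrated with a minimal sketch. This is not the paper's implementation: the function names, the fixed capacity ratios, and the cost model (a token routed to group size g stores 1/g of the KV heads, as in standard GQA) are assumptions made only for illustration, and the importance scores are taken as given rather than learned.

```python
import numpy as np

def route_tokens(scores, capacity_ratios):
    """Assign every token to exactly one expert (one-hot routing).

    Hypothetical sketch: tokens are ranked by importance score and
    handed out to experts in order, so expert 0 receives the most
    important tokens. capacity_ratios must sum to 1.
    """
    scores = np.asarray(scores, dtype=float)
    n = scores.shape[0]
    order = np.argsort(-scores, kind="stable")  # most important first
    bounds = np.round(np.cumsum(capacity_ratios) * n).astype(int)
    bounds[-1] = n  # guarantee that no token is dropped
    assignment = np.empty(n, dtype=int)
    start = 0
    for expert, end in enumerate(bounds):
        assignment[order[start:end]] = expert
        start = end
    return assignment

def relative_kv_cost(assignment, group_sizes):
    """Mean KV-cache cost per token relative to full multi-head attention.

    Assumes a token routed to an expert with group size g stores 1/g of
    the KV heads (an assumption borrowed from standard GQA).
    """
    sizes = np.asarray(group_sizes, dtype=float)
    return float(np.mean(1.0 / sizes[assignment]))

# Toy example: 4 tokens, 3 experts with KV group sizes 1, 2 and 4.
assignment = route_tokens([0.9, 0.1, 0.5, 0.3], [0.25, 0.25, 0.5])
cost = relative_kv_cost(assignment, [1, 2, 4])
```

In this toy run the highest-scoring token keeps full-resolution KV entries, while the two lowest-scoring tokens share the coarsest grouping, so the average cache cost drops without discarding any token.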

Constrained Coding and Deep Learning Aided Threshold Detection for Resistive Memories

Nov 19, 2024

Xingwei Zhong, Kui Cai, Guanghui Song, Weijie Wang, Yao Zhu

Abstract: Resistive random access memory (ReRAM) is a promising emerging non-volatile memory (NVM) technology that shows high potential for both data storage and computing. However, its crossbar array architecture leads to the sneak path problem, which may severely degrade the reliability of data stored in the ReRAM cell. Due to the complexity of memory physics and the unique features of the sneak path induced interference (SPI), it is difficult to derive an accurate channel model for it. The deep learning (DL)-based detection scheme \cite{zhong2020sneakdl} can better mitigate the SPI, at the cost of additional power consumption and read latency. In this letter, we first propose a novel constrained coding (CC) scheme which can not only reduce the SPI in the memory array, but also effectively differentiate the memory arrays into two categories: sneak-path-free and sneak-path-affected arrays. For the sneak-path-free arrays, we can use a simple middle-point threshold detector to detect the low and high resistance cells of ReRAM. For the sneak-path-affected arrays, a DL detector is first trained off-line (prior to the data detection of ReRAM). To avoid the additional power consumption and latency introduced by the DL detector, we further propose a DL-based threshold detector, whose detection threshold can be derived based on the outputs of the DL detector. It is then utilized for the online data detection of all the identified sneak-path-affected arrays. Simulation results demonstrate that the above CC and DL aided threshold detection scheme can effectively mitigate the SPI of the ReRAM array and achieve better error rate performance than the prior art detection schemes, without prior knowledge of the channel.
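The middle-point threshold detector mentioned for sneak-path-free arrays can be sketched in a few lines. This is a minimal illustration, not code from the paper: the nominal resistance values and the bit mapping (low-resistance state read as 1, high-resistance state read as 0) are assumptions chosen for the example.

```python
def midpoint_threshold(r_low, r_high):
    """Threshold halfway between the nominal low and high resistances."""
    return 0.5 * (r_low + r_high)

def detect_bits(resistances, r_low, r_high):
    """Hard-decide each cell by comparing it with the midpoint threshold.

    Assumed convention (not from the source): a reading below the
    threshold is the low-resistance state and maps to bit 1; anything
    else maps to bit 0.
    """
    threshold = midpoint_threshold(r_low, r_high)
    return [1 if r < threshold else 0 for r in resistances]

# Hypothetical nominal values in ohms, used only for illustration.
bits = detect_bits([120.0, 950.0, 400.0, 700.0], r_low=100.0, r_high=1000.0)
```

A fixed midpoint like this only works when readings cluster around the two nominal values, which is why the abstract reserves it for arrays identified as sneak-path-free.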

Deep-Learning-Based Adaptive Error-Correction Decoding for Spin-Torque Transfer Magnetic Random Access Memory (STT-MRAM)

Oct 07, 2024

Xingwei Zhong, Kui Cai, Peng Kang, Guanghui Song, Bin Dai

Abstract: Spin-torque transfer magnetic random access memory (STT-MRAM) is a promising emerging non-volatile memory (NVM) technology with wide applications. However, the data recovery of STT-MRAM is affected by the diversity of channel raw bit error rate (BER) across different dies caused by process variations, as well as the unknown resistance offset due to temperature change. Therefore, it is critical to develop effective decoding algorithms for the error correction codes (ECCs) used in STT-MRAM. In this article, we first propose a neural bit-flipping (BF) decoding algorithm, which can share the same trellis representation as the state-of-the-art neural decoding algorithms, such as the neural belief propagation (NBP) and neural offset min-sum (NOMS) algorithms. Hence, a neural network (NN) decoder with a uniform architecture but different NN parameters can realize all these neural decoding algorithms. Based on such a unified NN decoder architecture, we further propose a novel deep-learning (DL)-based adaptive decoding algorithm whose decoding complexity can be adjusted according to changes in the channel conditions of STT-MRAM. Extensive experimental evaluation results demonstrate that the proposed neural decoders can greatly improve the performance over the standard decoders, with similar decoding latency and energy consumption. Moreover, the DL-based adaptive decoder can work well over different channel conditions of STT-MRAM irrespective of the unknown resistance offset, with a 50% reduction in decoding latency and energy consumption compared to the fixed decoder.
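For background, the classical hard-decision bit-flipping decoder that neural BF variants build on can be sketched as follows. This is the textbook, non-neural algorithm, not the decoder proposed in the paper, and the toy 3x3 grid code (one parity check per row and per column) is a hypothetical choice made only so the example stays small.

```python
import numpy as np

def grid_parity_check_matrix(side):
    """Parity-check matrix of a toy code: bits on a side x side grid,
    with one even-parity check per row and one per column."""
    n = side * side
    H = np.zeros((2 * side, n), dtype=int)
    for r in range(side):
        for c in range(side):
            H[r, r * side + c] = 1         # row check r covers this bit
            H[side + c, r * side + c] = 1  # column check c covers this bit
    return H

def bit_flip_decode(H, received, max_iters=10):
    """Flip the bits involved in the most failed checks until all pass."""
    word = np.array(received, dtype=int) % 2
    for _ in range(max_iters):
        syndrome = (H @ word) % 2
        if not syndrome.any():
            return word, True  # every parity check is satisfied
        failed = H.T @ syndrome  # per-bit count of failed checks
        word[failed == failed.max()] ^= 1
    return word, not ((H @ word) % 2).any()

H = grid_parity_check_matrix(3)
codeword = [1, 1, 0, 1, 1, 0, 0, 0, 0]  # every row and column has even parity
received = list(codeword)
received[4] ^= 1                        # inject a single bit error
decoded, ok = bit_flip_decode(H, received)
```

With a single error, only the erroneous bit sits in two failed checks, so one flip restores the codeword. Neural variants typically replace the fixed counting rule with learned weights, which the sketch does not attempt.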

Asynchronous Grant-Free Random Access: Receiver Design with Partially Uni-Directional Message Passing and Interference Suppression Analysis

May 17, 2023

Zhaoji Zhang, Yuhao Chi, Qinghua Guo, Ying Li, Guanghui Song, Chongwen Huang

Abstract: Massive Machine-Type Communications (mMTC) features a massive number of low-cost user equipments (UEs) with sparse activity. Tailor-made for these features, grant-free random access (GF-RA) serves as an efficient access solution for mMTC. However, most existing GF-RA schemes rely on strict synchronization, which incurs excessive coordination burden for the low-cost UEs. In this work, we propose a receiver design for asynchronous GF-RA, and address the joint user activity detection (UAD) and channel estimation (CE) problem in the presence of asynchronization-induced inter-symbol interference. Specifically, the delay profile is exploited at the receiver to distinguish different UEs. However, a sample correlation problem in this receiver design impedes the factorization of the joint likelihood function, which complicates the UAD and CE problem. To address this correlation problem, we design a partially uni-directional (PUD) factor graph representation for the joint likelihood function. Building on this PUD factor graph, we further propose a PUD message passing based sparse Bayesian learning (SBL) algorithm for asynchronous UAD and CE (PUDMP-SBL-aUADCE). Our theoretical analysis shows that the PUDMP-SBL-aUADCE algorithm exhibits higher signal-to-interference-and-noise ratio (SINR) in the asynchronous case than in the synchronous case, i.e., the proposed receiver design can exploit asynchronization to suppress multi-user interference. In addition, considering potential timing error from the low-cost UEs, we investigate the impacts of imperfect delay profile, and reveal the advantages of adopting the SBL method in this case. Finally, extensive simulation results are provided to demonstrate the performance of the PUDMP-SBL-aUADCE algorithm.

* submitted to IEEE IoTJ

Capacity Optimal Generalized Multi-User MIMO: A Theoretical and Practical Framework

Nov 22, 2021

Yuhao Chi, Lei Liu, Guanghui Song, Ying Li, Yong Liang Guan, Chau Yuen

Abstract: Conventional multi-user multiple-input multiple-output (MU-MIMO) has mainly focused on Gaussian signaling, independent and identically distributed (IID) channels, and a limited number of users. Under these assumptions, it will be laborious to cope with the heterogeneous requirements in next-generation wireless communications, such as diverse transmission data, complicated communication scenarios, and massive user access. Therefore, this paper studies a generalized MU-MIMO (GMU-MIMO) system with more practical constraints, i.e., non-Gaussian signaling, non-IID channels, and massive users and antennas. These generalized assumptions bring new challenges in theory and practice. For example, there is no accurate capacity analysis for GMU-MIMO. In addition, it is unclear how to achieve the capacity optimal performance with practical complexity. To address these challenges, a unified framework is proposed to derive the GMU-MIMO capacity and design a capacity optimal transceiver, which jointly considers encoding, modulation, detection, and decoding. Group asymmetry is developed to make a tradeoff between user rate allocation and implementation complexity. Specifically, the capacity region of group asymmetric GMU-MIMO is characterized by using the celebrated mutual information and minimum mean-square error (MMSE) lemma and the MMSE optimality of orthogonal approximate message passing (OAMP)/vector AMP (VAMP). Furthermore, a theoretically optimal multi-user OAMP/VAMP receiver and practical multi-user low-density parity-check (MU-LDPC) codes are proposed to achieve the capacity region of group asymmetric GMU-MIMO. Numerical results verify that the gaps between the theoretical detection thresholds of the proposed framework, with optimized MU-LDPC codes and QPSK modulation, and the sum capacity of GMU-MIMO are about 0.2 dB. Moreover, their finite-length performances are about 1 to 2 dB away from the associated sum capacity.
