Abstract:Large language model (LLM) inference is often bounded by memory footprint and memory bandwidth in resource-constrained deployments, making quantization a fundamental technique for efficient serving. While post-training quantization (PTQ) maintains high fidelity at 4-bit, it deteriorates at 2-3 bits. Fundamentally, existing methods enforce a shape-invariant quantization grid (e.g., the fixed uniform intervals of UINT2) for each group, severely restricting the feasible set for error minimization. To address this, we propose Bit-Plane Decomposition Quantization (BPDQ), which constructs a variable quantization grid via bit-planes and scalar coefficients, and iteratively refines them using approximate second-order information while progressively compensating quantization errors to minimize output discrepancy. In the 2-bit regime, BPDQ enables serving Qwen2.5-72B on a single RTX 3090 with 83.85% GSM8K accuracy (vs. 90.83% at 16-bit). Moreover, we provide theoretical analysis showing that the variable grid expands the feasible set, and that the quantization process consistently aligns with the optimization objective in Hessian-induced geometry. Code: github.com/KingdalfGoodman/BPDQ.




Abstract:Massive multiple-input multiple-output (MIMO) with frequency division duplex (FDD) mode is a promising approach to increasing system capacity and link robustness for the fifth generation (5G) wireless cellular systems. The premise of these advantages is the accurate downlink channel state information (CSI) fed back from user equipment. However, conventional feedback methods have difficulties in reducing feedback overhead due to significant amount of base station (BS) antennas in massive MIMO systems. Recently, deep learning (DL)-based CSI feedback conquers many difficulties, yet still shows insufficiency to decrease the occupation of uplink bandwidth resources. In this paper, to solve this issue, we combine DL and superimposed coding (SC) for CSI feedback, in which the downlink CSI is spread and then superimposed on uplink user data sequences (UL-US) toward the BS. Then, a multi-task neural network (NN) architecture is proposed at BS to recover the downlink CSI and UL-US by unfolding two iterations of the minimum mean-squared error (MMSE) criterion-based interference reduction. In addition, for a network training, a subnet-by-subnet approach is exploited to facilitate the parameter tuning and expedite the convergence rate. Compared with standalone SC-based CSI scheme, our multi-task NN, trained in a specific signal-to-noise ratio (SNR) and power proportional coefficient (PPC), consistently improves the estimation of downlink CSI with similar or better UL-US detection under SNR and PPC varying.