Abstract:Large language models (LLMs) are widely used across various applications, but their substantial computational requirements pose significant challenges, particularly in terms of HBM bandwidth bottlenecks and inter-device communication overhead. In this paper, we present PRESERVE, a novel prefetching framework designed to optimize LLM inference by overlapping memory reads for model weights and KV-cache with collective communication operations. Through extensive experiments conducted on commercial AI accelerators, we demonstrate up to 1.6x end-to-end speedup on state-of-the-art, open-source LLMs. Additionally, we perform a design space exploration that identifies the optimal hardware configuration for the proposed method, showing a further 1.25x improvement in performance per cost by selecting the optimal L2 cache size. Our results show that PRESERVE can mitigate memory bottlenecks and communication overheads, offering a solution to improve the performance and scalability of LLM inference systems.
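The core idea above, hiding memory reads behind communication latency, can be sketched in plain Python. This is a minimal illustration using threads and invented names (`all_reduce`, `prefetch_weights`, the sleep durations); the actual method targets hardware prefetchers on AI accelerators, not host-side threading.

```python
# Illustrative sketch of the overlap idea: start fetching the next layer's
# weights/KV-cache while the current layer's collective runs, so HBM reads
# hide behind communication latency. All names and timings are assumptions.
import time
from concurrent.futures import ThreadPoolExecutor

def all_reduce(layer):          # stand-in for an inter-device collective
    time.sleep(0.05)
    return f"reduced-{layer}"

def prefetch_weights(layer):    # stand-in for an HBM read into on-chip cache
    time.sleep(0.05)
    return f"weights-{layer}"

def run_layers(n):
    timeline = []
    with ThreadPoolExecutor(max_workers=2) as pool:
        for layer in range(n):
            # Launch the prefetch for layer+1 concurrently with the
            # collective for the current layer, then wait on both.
            comm = pool.submit(all_reduce, layer)
            pre = pool.submit(prefetch_weights, layer + 1)
            timeline.append((comm.result(), pre.result()))
    return timeline

start = time.perf_counter()
trace = run_layers(4)
elapsed = time.perf_counter() - start
# Overlapped execution takes roughly 4 * 0.05 s instead of the
# serialized 8 * 0.05 s.
```

Because `time.sleep` releases the GIL, the two submitted tasks genuinely run concurrently, which is enough to demonstrate the latency-hiding effect.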
Abstract:Over the past year, Speculative Decoding has gained popularity as a technique for accelerating Large Language Model inference. While several methods have been introduced, most struggle to deliver satisfactory performance at batch sizes typical for data centers ($\geq 8$) and often involve significant deployment complexities. In this work, we offer a theoretical explanation of how Speculative Decoding can be effectively utilized with larger batch sizes. We also introduce a method that integrates seamlessly into existing systems without additional training or the complexity of deploying a small LLM. In a continuous batching setting, we achieve a 4x increase in throughput without any latency impact for short context generation, and a 1.7-2x improvement in both latency and throughput for longer contexts.
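For readers unfamiliar with the mechanism being accelerated, a minimal greedy draft-and-verify loop is sketched below. Note the hedge: the paper's method specifically avoids deploying a small draft LLM, so this standard two-model loop (with toy stand-in "models") only illustrates the general speculative decoding idea, not the proposed technique.

```python
# Minimal sketch of greedy speculative decoding (illustrative only).
# A cheap draft model proposes k tokens; the target model verifies them
# and keeps the longest matching prefix, then emits one corrective token.

def speculative_decode(target, draft, prompt, num_new, k=4):
    """Generate num_new tokens after prompt using draft/verify rounds."""
    seq = list(prompt)
    goal = len(prompt) + num_new
    while len(seq) < goal:
        # 1. Draft model proposes k tokens autoregressively.
        proposal = list(seq)
        for _ in range(k):
            proposal.append(draft(proposal))
        # 2. Target model would score all k positions in one parallel
        #    pass; here we simulate that by querying it per position.
        accepted = 0
        for i in range(len(seq), len(proposal)):
            if target(proposal[:i]) == proposal[i]:
                accepted += 1
            else:
                break
        seq = proposal[:len(seq) + accepted]
        # 3. Emit one token from the target (corrective or bonus token).
        if len(seq) < goal:
            seq.append(target(seq))
    return seq[:goal]

# Toy "models": the target always emits (last token + 1) mod 10; the
# draft agrees except it errs whenever the last token is 3.
target = lambda s: (s[-1] + 1) % 10
draft = lambda s: 9 if s[-1] == 3 else (s[-1] + 1) % 10

out = speculative_decode(target, draft, [0], 12)
# → [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 0, 1, 2]
```

Whenever the draft agrees with the target, several tokens are accepted per target pass, which is the source of the throughput gain.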
Abstract:In massive multiple-input multiple-output (MIMO) systems, the downlink transmission performance heavily relies on accurate channel state information (CSI). Constrained by the transmit power, user equipment typically transmits sounding reference signals (SRSs) to the base station through frequency hopping; these signals are leveraged to estimate uplink CSI and subsequently predict downlink CSI. This paper aims to investigate joint channel estimation and prediction (JCEP) for massive MIMO with frequency hopping sounding (FHS). Specifically, we present a multiple-subband (MS) delay-angle-Doppler (DAD) domain channel model with an off-grid basis to tackle the energy leakage problem. Furthermore, we formulate the JCEP problem with FHS as a multiple measurement vector (MMV) problem, facilitating the sharing of common CSI across different subbands. To solve this problem, we propose an efficient Off-Grid-MS hybrid message passing (HMP) algorithm under the constrained Bethe free energy (BFE) framework. To address the lack of prior CSI in practical scenarios, the proposed algorithm can adaptively learn the hyper-parameters of the channel by minimizing the corresponding terms in the BFE expression. To alleviate the complexity of channel hyper-parameter learning, we leverage approximations of the off-grid matrices to simplify the off-grid hyper-parameter estimation. Numerical results illustrate that the proposed algorithm can effectively mitigate the energy leakage issue and exploit the common CSI across different subbands, acquiring more accurate CSI than state-of-the-art counterparts.
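The MMV structure the abstract invokes can be written generically as follows. The notation here is illustrative rather than the paper's own: $\mathbf{y}_b$ collects the measurements of subband $b$, $\mathbf{A}$ is a shared (here, DAD-domain) dictionary, and the coefficient columns $\mathbf{x}_b$ are jointly row-sparse, which is what encodes the CSI common to all subbands.

```latex
% Generic MMV formulation (illustrative notation, not the paper's exact
% model): every subband contributes one measurement column, and all
% coefficient columns share a single sparsity support \mathcal{S}.
\begin{align}
  \mathbf{Y} &= \mathbf{A}\,\mathbf{X} + \mathbf{N},
  &\mathbf{Y} &= [\mathbf{y}_1, \dots, \mathbf{y}_B],\\
  \operatorname{supp}(\mathbf{x}_b) &= \mathcal{S} \quad \text{for all } b,
  &\mathbf{X} &= [\mathbf{x}_1, \dots, \mathbf{x}_B].
\end{align}
```

Solving for the columns jointly, rather than one subband at a time, is what allows the common support $\mathcal{S}$ to be exploited; the paper additionally handles per-subband off-grid mismatch, which this generic sketch omits.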
Abstract:Vessel segmentation in fundus images is a key diagnostic capability in ophthalmology, yet various challenges remain in this essential task. Early approaches indicate that it is often difficult to obtain desirable segmentation performance on thin vessels and boundary areas due to the imbalance between vessel pixels of different thickness levels. In this paper, we propose a novel two-stream Meticulous-Processing Network (MP-Net) to tackle this problem. To pay more attention to thin vessels and boundary areas, we first propose an efficient hierarchical model that automatically stratifies the ground-truth masks into different thickness levels. A novel two-stream adversarial network then uses the stratification results, together with a balanced loss function and an integration operation, to achieve better performance, especially in detecting thin vessels and boundary areas. Our model is shown to outperform state-of-the-art methods on the DRIVE, STARE, and CHASE_DB1 datasets.
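The balancing idea behind the loss mentioned above can be sketched as a stratum-weighted cross-entropy. Everything here is a hypothetical illustration: the stratum labels, the inverse-frequency weighting, and the function name are assumptions, and the paper's MP-Net adds its own hierarchical stratification and adversarial training on top of any such pixel loss.

```python
# Hypothetical sketch of a thickness-balanced pixel loss: weight each
# ground-truth stratum (e.g. background / thick vessel / thin-or-boundary)
# inversely to its pixel count, so rare thin-vessel pixels are not drowned
# out by the abundant background. Strata and weighting are assumptions.
import numpy as np

def balanced_bce(pred, target, strata, eps=1e-7):
    """pred, target: float arrays in [0, 1]; strata: int label per pixel."""
    pred = np.clip(pred, eps, 1 - eps)
    # Standard per-pixel binary cross-entropy.
    bce = -(target * np.log(pred) + (1 - target) * np.log(1 - pred))
    # Inverse-frequency weight per stratum (weights average to 1 overall).
    weights = np.ones_like(pred)
    labels = np.unique(strata)
    for s in labels:
        mask = strata == s
        weights[mask] = pred.size / (len(labels) * mask.sum())
    return float((weights * bce).mean())
```

With this weighting, a misclassified pixel in a small stratum (thin vessels) costs as much on average as one in a large stratum (background), which is the imbalance the abstract describes.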