Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Luis Ceze

University of Washington

xKV: Cross-Layer SVD for KV-Cache Compression

Mar 24, 2025

Chi-Chih Chang, Chien-Yu Lin, Yash Akhauri, Wei-Cheng Lin, Kai-Chiang Wu, Luis Ceze, Mohamed S. Abdelfattah

Abstract:Large Language Models (LLMs) with long context windows enable powerful applications but come at the cost of high memory consumption to store the Key and Value states (KV-Cache). Recent studies attempted to merge KV-cache from multiple layers into shared representations, yet these approaches either require expensive pretraining or rely on assumptions of high per-token cosine similarity across layers which generally does not hold in practice. We find that the dominant singular vectors are remarkably well-aligned across multiple layers of the KV-Cache. Exploiting this insight, we propose xKV, a simple post-training method that applies Singular Value Decomposition (SVD) on the KV-Cache of grouped layers. xKV consolidates the KV-Cache of multiple layers into a shared low-rank subspace, significantly reducing KV-Cache sizes. Through extensive evaluations on the RULER long-context benchmark with widely-used LLMs (e.g., Llama-3.1 and Qwen2.5), xKV achieves up to 6.8x higher compression rates than state-of-the-art inter-layer technique while improving accuracy by 2.7%. Moreover, xKV is compatible with the emerging Multi-Head Latent Attention (MLA) (e.g., DeepSeek-Coder-V2), yielding a notable 3x compression rates on coding tasks without performance degradation. These results highlight xKV's strong capability and versatility in addressing memory bottlenecks for long-context LLM inference. Our code is publicly available at: https://github.com/abdelfattah-lab/xKV.

Via

Access Paper or Ask Questions

TeleRAG: Efficient Retrieval-Augmented Generation Inference with Lookahead Retrieval

Feb 28, 2025

Chien-Yu Lin, Keisuke Kamahori, Yiyu Liu, Xiaoxiang Shi, Madhav Kashyap, Yile Gu, Rulin Shao, Zihao Ye, Kan Zhu, Stephanie Wang(+4 more)

Figure 1 for TeleRAG: Efficient Retrieval-Augmented Generation Inference with Lookahead Retrieval

Figure 2 for TeleRAG: Efficient Retrieval-Augmented Generation Inference with Lookahead Retrieval

Figure 3 for TeleRAG: Efficient Retrieval-Augmented Generation Inference with Lookahead Retrieval

Figure 4 for TeleRAG: Efficient Retrieval-Augmented Generation Inference with Lookahead Retrieval

Abstract:Retrieval-augmented generation (RAG) extends large language models (LLMs) with external data sources to enhance factual correctness and domain coverage. Modern RAG pipelines rely on large datastores, leading to system challenges in latency-sensitive deployments, especially when limited GPU memory is available. To address these challenges, we propose TeleRAG, an efficient inference system that reduces RAG latency with minimal GPU memory requirements. The core innovation of TeleRAG is lookahead retrieval, a prefetching mechanism that anticipates required data and transfers it from CPU to GPU in parallel with LLM generation. By leveraging the modularity of RAG pipelines, the inverted file index (IVF) search algorithm and similarities between queries, TeleRAG optimally overlaps data movement and computation. Experimental results show that TeleRAG reduces end-to-end RAG inference latency by up to 1.72x on average compared to state-of-the-art systems, enabling faster, more memory-efficient deployments of advanced RAG applications.

Via

Access Paper or Ask Questions

FlashInfer: Efficient and Customizable Attention Engine for LLM Inference Serving

Jan 02, 2025

Zihao Ye, Lequn Chen, Ruihang Lai, Wuwei Lin, Yineng Zhang, Stephanie Wang, Tianqi Chen, Baris Kasikci, Vinod Grover, Arvind Krishnamurthy(+1 more)

Figure 1 for FlashInfer: Efficient and Customizable Attention Engine for LLM Inference Serving

Figure 2 for FlashInfer: Efficient and Customizable Attention Engine for LLM Inference Serving

Figure 3 for FlashInfer: Efficient and Customizable Attention Engine for LLM Inference Serving

Figure 4 for FlashInfer: Efficient and Customizable Attention Engine for LLM Inference Serving

Abstract:Transformers, driven by attention mechanisms, form the foundation of large language models (LLMs). As these models scale up, efficient GPU attention kernels become essential for high-throughput and low-latency inference. Diverse LLM applications demand flexible and high-performance attention solutions. We present FlashInfer: a customizable and efficient attention engine for LLM serving. FlashInfer tackles KV-cache storage heterogeneity using block-sparse format and composable formats to optimize memory access and reduce redundancy. It also offers a customizable attention template, enabling adaptation to various settings through Just-In-Time (JIT) compilation. Additionally, FlashInfer's load-balanced scheduling algorithm adjusts to dynamism of user requests while maintaining compatibility with CUDAGraph which requires static configuration. FlashInfer have been integrated into leading LLM serving frameworks like SGLang, vLLM and MLC-Engine. Comprehensive kernel-level and end-to-end evaluations demonstrate FlashInfer's ability to significantly boost kernel performance across diverse inference scenarios: compared to state-of-the-art LLM serving solutions, FlashInfer achieve 29-69% inter-token-latency reduction compared to compiler backends for LLM serving benchmark, 28-30% latency reduction for long-context inference, and 13-17% speedup for LLM serving with parallel generation.

* code available at http://github.com/flashinfer-ai/flashinfer

Via

Access Paper or Ask Questions

Palu: Compressing KV-Cache with Low-Rank Projection

Jul 30, 2024

Chi-Chih Chang, Wei-Cheng Lin, Chien-Yu Lin, Chong-Yan Chen, Yu-Fang Hu, Pei-Shuo Wang, Ning-Chi Huang, Luis Ceze, Kai-Chiang Wu

Figure 1 for Palu: Compressing KV-Cache with Low-Rank Projection

Figure 2 for Palu: Compressing KV-Cache with Low-Rank Projection

Figure 3 for Palu: Compressing KV-Cache with Low-Rank Projection

Figure 4 for Palu: Compressing KV-Cache with Low-Rank Projection

Abstract:KV-Cache compression methods generally sample a KV-Cache of effectual tokens or quantize it into lower bits. However, these methods cannot exploit the redundancy of the hidden dimension of KV tensors. This paper investigates a unique hidden dimension approach called Palu, a novel KV-Cache compression framework that utilizes low-rank projection. Palu decomposes the linear layers into low-rank matrices, caches the smaller intermediate states, and reconstructs the full keys and values on the fly. To improve accuracy, compression rate, and efficiency, Palu further encompasses (1) a medium-grained low-rank decomposition scheme, (2) an efficient rank search algorithm, (3) a low-rank-aware quantization algorithm, and (4) matrix fusion with optimized GPU kernels. Our extensive experiments with popular LLMs show that Palu can compress KV-Cache by more than 91.25% while maintaining a significantly better accuracy (up to 1.19 lower perplexity) than state-of-the-art KV-Cache quantization methods at a similar or even higher memory usage. When compressing KV-Cache for 50%, Palu delivers up to 1.61x end-to-end speedup for the attention module. Our code is publicly available at https://github.com/shadowpa0327/Palu.

Via

Access Paper or Ask Questions

Atom: Low-bit Quantization for Efficient and Accurate LLM Serving

Nov 07, 2023

Yilong Zhao, Chien-Yu Lin, Kan Zhu, Zihao Ye, Lequn Chen, Size Zheng, Luis Ceze, Arvind Krishnamurthy, Tianqi Chen, Baris Kasikci

Figure 1 for Atom: Low-bit Quantization for Efficient and Accurate LLM Serving

Figure 2 for Atom: Low-bit Quantization for Efficient and Accurate LLM Serving

Figure 3 for Atom: Low-bit Quantization for Efficient and Accurate LLM Serving

Figure 4 for Atom: Low-bit Quantization for Efficient and Accurate LLM Serving

Abstract:The growing demand for Large Language Models (LLMs) in applications such as content generation, intelligent chatbots, and sentiment analysis poses considerable challenges for LLM service providers. To efficiently use GPU resources and boost throughput, batching multiple requests has emerged as a popular paradigm; to further speed up batching, LLM quantization techniques reduce memory consumption and increase computing capacity. However, prevalent quantization schemes (e.g., 8-bit weight-activation quantization) cannot fully leverage the capabilities of modern GPUs, such as 4-bit integer operators, resulting in sub-optimal performance. To maximize LLMs' serving throughput, we introduce Atom, a low-bit quantization method that achieves high throughput improvements with negligible accuracy loss. Atom significantly boosts serving throughput by using low-bit operators and considerably reduces memory consumption via low-bit quantization. It attains high accuracy by applying a novel mixed-precision and fine-grained quantization process. We evaluate Atom on 4-bit weight-activation quantization setups in the serving context. Atom improves end-to-end throughput by up to $7.73\times$ compared to the FP16 and by $2.53\times$ compared to INT8 quantization, while maintaining the same latency target.

Via

Access Paper or Ask Questions

Punica: Multi-Tenant LoRA Serving

Oct 28, 2023

Lequn Chen, Zihao Ye, Yongji Wu, Danyang Zhuo, Luis Ceze, Arvind Krishnamurthy

Figure 1 for Punica: Multi-Tenant LoRA Serving

Figure 2 for Punica: Multi-Tenant LoRA Serving

Figure 3 for Punica: Multi-Tenant LoRA Serving

Figure 4 for Punica: Multi-Tenant LoRA Serving

Abstract:Low-rank adaptation (LoRA) has become an important and popular method to adapt pre-trained models to specific domains. We present Punica, a system to serve multiple LoRA models in a shared GPU cluster. Punica contains a new CUDA kernel design that allows batching of GPU operations for different LoRA models. This allows a GPU to hold only a single copy of the underlying pre-trained model when serving multiple, different LoRA models, significantly enhancing GPU efficiency in terms of both memory and computation. Our scheduler consolidates multi-tenant LoRA serving workloads in a shared GPU cluster. With a fixed-sized GPU cluster, our evaluations show that Punica achieves 12x higher throughput in serving multiple LoRA models compared to state-of-the-art LLM serving systems while only adding 2ms latency per token. Punica is open source at https://github.com/punica-ai/punica .

Via

Access Paper or Ask Questions

SparseTIR: Composable Abstractions for Sparse Compilation in Deep Learning

Jul 11, 2022

Zihao Ye, Ruihang Lai, Junru Shao, Tianqi Chen, Luis Ceze

Figure 1 for SparseTIR: Composable Abstractions for Sparse Compilation in Deep Learning

Figure 2 for SparseTIR: Composable Abstractions for Sparse Compilation in Deep Learning

Figure 3 for SparseTIR: Composable Abstractions for Sparse Compilation in Deep Learning

Figure 4 for SparseTIR: Composable Abstractions for Sparse Compilation in Deep Learning

Abstract:Sparse tensors are rapidly becoming critical components of modern deep learning workloads. However, developing high-performance sparse operators can be difficult and tedious, and existing vendor libraries cannot satisfy the escalating demands from new operators. Sparse tensor compilers simplify the development of operators, but efficient sparse compilation for deep learning remains challenging because a single sparse format cannot maximize hardware efficiency, and single-shot compilers cannot keep up with latest hardware and system advances. We show that the key to addressing both challenges is two forms of composability. In this paper, we propose SparseTIR, a sparse tensor compilation abstraction that offers composable formats and composable transformations for deep learning workloads. SparseTIR constructs a search space over these composable components for performance tuning. With these improvements, SparseTIR obtains consistent performance speedups vs vendor libraries on GPUs for single operators: 1.1-3.3x for GNN operators and 1.1-4.4x for sparse transformer operators. SparseTIR also accelerates end-to-end GNNs by 1.1-2.2x for GraphSAGE training and 0.9-26x for RGCN inference.

Via

Access Paper or Ask Questions

Characterizing and Taming Resolution in Convolutional Neural Networks

Oct 28, 2021

Eddie Yan, Liang Luo, Luis Ceze

Figure 1 for Characterizing and Taming Resolution in Convolutional Neural Networks

Figure 2 for Characterizing and Taming Resolution in Convolutional Neural Networks

Figure 3 for Characterizing and Taming Resolution in Convolutional Neural Networks

Figure 4 for Characterizing and Taming Resolution in Convolutional Neural Networks

Abstract:Image resolution has a significant effect on the accuracy and computational, storage, and bandwidth costs of computer vision model inference. These costs are exacerbated when scaling out models to large inference serving systems and make image resolution an attractive target for optimization. However, the choice of resolution inherently introduces additional tightly coupled choices, such as image crop size, image detail, and compute kernel implementation that impact computational, storage, and bandwidth costs. Further complicating this setting, the optimal choices from the perspective of these metrics are highly dependent on the dataset and problem scenario. We characterize this tradeoff space, quantitatively studying the accuracy and efficiency tradeoff via systematic and automated tuning of image resolution, image quality and convolutional neural network operators. With the insights from this study, we propose a dynamic resolution mechanism that removes the need to statically choose a resolution ahead of time.

Via

Access Paper or Ask Questions

Cloud Collectives: Towards Cloud-aware Collectives forML Workloads with Rank Reordering

May 28, 2021

Liang Luo, Jacob Nelson, Arvind Krishnamurthy, Luis Ceze

Figure 1 for Cloud Collectives: Towards Cloud-aware Collectives forML Workloads with Rank Reordering

Figure 2 for Cloud Collectives: Towards Cloud-aware Collectives forML Workloads with Rank Reordering

Figure 3 for Cloud Collectives: Towards Cloud-aware Collectives forML Workloads with Rank Reordering

Figure 4 for Cloud Collectives: Towards Cloud-aware Collectives forML Workloads with Rank Reordering

Abstract:ML workloads are becoming increasingly popular in the cloud. Good cloud training performance is contingent on efficient parameter exchange among VMs. We find that Collectives, the widely used distributed communication algorithms, cannot perform optimally out of the box due to the hierarchical topology of datacenter networks and multi-tenancy nature of the cloudenvironment.In this paper, we present Cloud Collectives , a prototype that accelerates collectives by reordering theranks of participating VMs such that the communication pattern dictated by the selected collectives operation best exploits the locality in the network.Collectives is non-intrusive, requires no code changes nor rebuild of an existing application, and runs without support from cloud providers. Our preliminary application of Cloud Collectives on allreduce operations in public clouds results in a speedup of up to 3.7x in multiple microbenchmarks and 1.3x in real-world workloads of distributed training of deep neural networks and gradient boosted decision trees using state-of-the-art frameworks.

Via

Access Paper or Ask Questions

Accelerating SpMM Kernel with Cache-First Edge Sampling for Graph Neural Networks

Apr 23, 2021

Chien-Yu Lin, Liang Luo, Luis Ceze

Figure 1 for Accelerating SpMM Kernel with Cache-First Edge Sampling for Graph Neural Networks

Figure 2 for Accelerating SpMM Kernel with Cache-First Edge Sampling for Graph Neural Networks

Figure 3 for Accelerating SpMM Kernel with Cache-First Edge Sampling for Graph Neural Networks

Figure 4 for Accelerating SpMM Kernel with Cache-First Edge Sampling for Graph Neural Networks

Abstract:Graph neural networks (GNNs), an emerging deep learning model class, can extract meaningful representations from highly expressive graph-structured data and are therefore gaining popularity for wider ranges of applications. However, current GNNs suffer from the poor performance of their sparse-dense matrix multiplication (SpMM) operator, even when using powerful GPUs. Our analysis shows that 95% of the inference time could be spent on SpMM when running popular GNN models on NVIDIA's advanced V100 GPU. Such SpMM performance bottleneck hinders GNNs' applicability to large-scale problems or the development of more sophisticated GNN models. To address this inference time bottleneck, we introduce ES-SpMM, a cache-first edge sampling mechanism and codesigned SpMM kernel. ES-SpMM uses edge sampling to downsize the graph to fit into GPU's shared memory. It thus reduces the computation cost and improves SpMM's cache locality. To evaluate ES-SpMM's performance, we integrated it with a popular GNN framework, DGL, and tested it using representative GNN models and datasets. Our results show that ES-SpMM outperforms the highly optimized cuSPARSE SpMM kernel by up to 4.35x with no accuracy loss and by 45.3x with less than a 1% accuracy loss.

Via

Access Paper or Ask Questions