Title: Layer-Condensed KV Cache for Efficient Inference of Large Language Models

May 17, 2024

Haoyi Wu, Kewei Tu

Abstract: Huge memory consumption has been a major bottleneck for deploying high-throughput large language models in real-world applications. In addition to the large number of parameters, the key-value (KV) cache for the attention mechanism in the transformer architecture consumes a significant amount of memory, especially when the number of layers is large for deep language models. In this paper, we propose a novel method that only computes and caches the KVs of a small number of layers, thus significantly saving memory consumption and improving inference throughput. Our experiments on large language models show that our method achieves up to 26$\times$ higher throughput than standard transformers and competitive performance in language modeling and downstream tasks. In addition, our method is orthogonal to existing transformer memory-saving techniques, so it is straightforward to integrate them with our model, achieving further improvement in inference efficiency. Our code is available at https://github.com/whyNLP/LCKV.
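To make the memory argument concrete, here is a rough back-of-the-envelope sketch of how KV cache size scales with the number of layers whose keys and values are cached. The function name `kv_cache_bytes` and the model configuration (32 layers, 32 KV heads, head dimension 128, fp16) are hypothetical values chosen for illustration, not figures from the paper, and the sketch covers only the memory arithmetic, not how the method itself computes attention.

```python
def kv_cache_bytes(num_cached_layers, batch_size, seq_len,
                   num_kv_heads, head_dim, bytes_per_elem=2):
    """Estimate KV cache size in bytes (hypothetical helper, not from the paper).

    The leading factor of 2 accounts for storing both keys and values.
    bytes_per_elem=2 assumes fp16/bf16 storage.
    """
    return (2 * num_cached_layers * batch_size * seq_len
            * num_kv_heads * head_dim * bytes_per_elem)


# Illustrative configuration (an assumption, not taken from the paper).
config = dict(batch_size=8, seq_len=4096, num_kv_heads=32, head_dim=128)

full = kv_cache_bytes(num_cached_layers=32, **config)      # cache every layer
condensed = kv_cache_bytes(num_cached_layers=2, **config)  # cache only 2 layers

print(f"all 32 layers cached: {full / 2**30:.1f} GiB")
print(f"2 layers cached:      {condensed / 2**30:.1f} GiB")
print(f"reduction factor:     {full / condensed:.0f}x")
```

Under these assumptions the cache scales linearly with the number of cached layers, so caching 2 of 32 layers cuts cache memory by a factor of 16. The freed memory can then be spent on larger batches, which is one plausible way a smaller cache translates into higher throughput.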

* Accepted to ACL 2024 main conference
