Purushotham Kamath

Keyformer: KV Cache Reduction through Key Tokens Selection for Efficient Generative Inference

Mar 14, 2024

Muhammad Adnan, Akhil Arunkumar, Gaurav Jain, Prashant J. Nair, Ilya Soloveychik, Purushotham Kamath

Abstract: Transformers have emerged as the underpinning architecture for Large Language Models (LLMs). In generative language models, the inference process involves two primary phases: prompt processing and token generation. Token generation, which constitutes the majority of the computational workload, primarily entails vector-matrix multiplications and interactions with the Key-Value (KV) cache. This phase is constrained by memory bandwidth because weights and KV cache values must be transferred from the memory system to the computing units. The memory bottleneck becomes particularly pronounced in applications that require long contexts and extensive text generation, both of which are increasingly crucial for LLMs. This paper introduces "Keyformer", an inference-time approach that mitigates the challenges associated with KV cache size and memory bandwidth utilization. Keyformer leverages the observation that approximately 90% of the attention weight in generative inference focuses on a specific subset of tokens, referred to as "key" tokens. Keyformer identifies these tokens using a novel score function and retains only them in the KV cache. This approach reduces both the KV cache size and memory bandwidth usage without compromising model accuracy. We evaluate Keyformer's performance across three foundational models: GPT-J, Cerebras-GPT, and MPT, which employ various positional embedding algorithms. Our assessment encompasses a variety of tasks, with a particular emphasis on summarization and conversation tasks involving extended contexts. By shrinking the KV cache, Keyformer reduces inference latency by 2.1x and improves token generation throughput by 2.4x while preserving the model's accuracy.

* Proceedings of the 7th Annual Conference on Machine Learning and Systems (MLSys), 2024
* A collaborative effort by d-matrix and the University of British Columbia
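
The idea of keeping only high-attention tokens in the KV cache can be illustrated with a small sketch. The abstract does not spell out Keyformer's score function, so the code below substitutes a plain accumulated-attention score as a stand-in; it is not the paper's method. The names `attention_weights`, `prune_kv_cache`, and `budget` are hypothetical and chosen only for this illustration.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attention_weights(query, keys):
    """Scaled dot-product attention weights of one query over cached keys."""
    scale = math.sqrt(len(query))
    logits = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    return softmax(logits)


def prune_kv_cache(keys, values, scores, budget):
    """Keep the `budget` highest-scoring tokens, preserving their order.

    ASSUMPTION: `scores` is a simple running sum of attention weights,
    used here as a placeholder for the paper's own score function.
    """
    if len(keys) <= budget:
        return keys, values, scores
    ranked = sorted(range(len(keys)), key=lambda i: scores[i], reverse=True)
    keep = sorted(ranked[:budget])  # restore original token order
    return (
        [keys[i] for i in keep],
        [values[i] for i in keep],
        [scores[i] for i in keep],
    )


# Toy cache with four tokens and two decoding steps (made-up numbers).
keys = [[1.0, 0.0], [0.0, 1.0], [4.0, 0.0], [0.0, 0.1]]
values = [[1.0], [2.0], [3.0], [4.0]]
scores = [0.0] * len(keys)

for query in ([1.0, 0.0], [2.0, 0.0]):
    weights = attention_weights(query, keys)
    scores = [s + w for s, w in zip(scores, weights)]

kept_keys, kept_values, kept_scores = prune_kv_cache(keys, values, scores, budget=2)
```

In this toy run the two tokens whose keys align with the queries accumulate most of the attention weight and survive pruning, so the cache is halved. A real system would apply this per layer and per head and would need a score that stays reliable as tokens are discarded, which is the problem the paper's score function addresses.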

Neural Architecture Construction using EnvelopeNets

May 22, 2018

Purushotham Kamath, Abhishek Singh, Debo Dutta

Abstract: In recent years, several automated search methods for neural network architectures have been proposed, using techniques such as evolutionary algorithms and reinforcement learning. These methods use an objective function (usually accuracy) that is evaluated after a full training and evaluation cycle. We show that statistics derived from filter featuremaps reach a state where the utility of different filters within a network can be compared and hence can be used to construct networks. The number of training epochs needed for filters within a network to reach this state is much smaller than the number needed for the accuracy of a network to stabilize. EnvelopeNets is a construction method that exploits this finding to design convolutional neural nets (CNNs) in a fraction of the time needed by conventional search methods. The constructed networks show close to state-of-the-art performance on the image classification problem on well-known datasets (CIFAR-10, ImageNet) and consistently show better performance than hand-constructed and randomly generated networks of the same depth, operators, and approximately the same number of parameters.
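
Comparing filters by a featuremap statistic can be sketched in a few lines. The abstract does not say which statistic is used, so the code below assumes the variance of each filter's activations purely for illustration. The filter names, the activation values, and the helpers `featuremap_variance` and `rank_filters` are all hypothetical.

```python
from statistics import pvariance


def featuremap_variance(featuremaps):
    """Population variance over all activations of one filter.

    `featuremaps` is a list of flattened featuremaps, one per input sample.
    ASSUMPTION: variance is the statistic of interest; the abstract only
    says "statistics derived from filter featuremaps".
    """
    flat = [a for fmap in featuremaps for a in fmap]
    return pvariance(flat)


def rank_filters(activations):
    """Return filter names sorted by ascending statistic, plus the statistics."""
    stats = {name: featuremap_variance(fmaps) for name, fmaps in activations.items()}
    return sorted(stats, key=stats.get), stats


# Made-up activations recorded after a few training epochs.
activations = {
    "conv3x3": [[0.0, 1.0, 2.0, 3.0]],
    "conv5x5": [[1.0, 1.0, 1.0, 1.1]],
    "maxpool": [[0.0, 4.0, 0.0, 4.0]],
}

ranking, stats = rank_filters(activations)
```

Under the assumed criterion, filters at the front of `ranking` respond almost uniformly to the input and would be the first candidates for removal, while those at the end would be kept. The point of the sketch is only that such a ranking needs recorded activations from a partially trained network, not a full training and evaluation cycle.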
