Title: HERA: High-efficiency Matrix Compression via Element Replacement

Jul 04, 2024

Yanshu Wang, Wang Li, Tong Yang

Abstract: Large Language Models (LLMs) have significantly advanced natural language processing tasks such as machine translation, text generation, and sentiment analysis. However, their large size, often billions of parameters, poses challenges for storage, computation, and deployment, particularly in resource-constrained environments like mobile devices and edge computing platforms. Additionally, the key-value (k-v) cache used to speed up query processing requires substantial memory and storage, exacerbating these challenges. Vector databases have emerged as a crucial technology for efficiently managing and retrieving the high-dimensional vectors produced by LLMs, facilitating faster data access and reducing computational demands. Effective compression and quantization techniques are essential to address these challenges, as they reduce the memory footprint and computational requirements without significantly compromising performance. Traditional methods that uniformly map parameters to compressed spaces often fail to account for the uneven distribution of parameters, leading to considerable accuracy loss. Therefore, innovative approaches are needed to achieve better compression ratios while preserving model performance. In this work, we propose HERA, a novel algorithm that employs heuristic element replacement to compress matrices. HERA systematically replaces elements within the model using heuristic methods, which simplifies the structure of the model and makes subsequent compression more effective. By hierarchically segmenting, compressing, and reorganizing the matrix dataset, our method can reduce the quantization error to 12.3% of the original at the same compression ratio.
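The abstract's point about uniform mapping and unevenly distributed parameters can be illustrated with a small sketch. This is not HERA: the abstract does not specify the algorithm in enough detail to reproduce it. The example instead compares an evenly spaced 4-bit codebook with the same codebook refined by Lloyd's algorithm (1-D k-means), a standard non-uniform quantizer. The synthetic `weights` list, the helper names, and the bit width are all assumptions chosen for illustration.

```python
import random

def uniform_codebook(values, bits):
    """Evenly spaced levels across [min(values), max(values)]."""
    levels = 2 ** bits
    lo, hi = min(values), max(values)
    step = (hi - lo) / (levels - 1)
    return [lo + k * step for k in range(levels)]

def nearest(codebook, v):
    """Index of the codebook level closest to v."""
    return min(range(len(codebook)), key=lambda j: abs(v - codebook[j]))

def quantize(values, codebook):
    """Replace each value with its nearest codebook level."""
    return [codebook[nearest(codebook, v)] for v in values]

def lloyd_refine(values, codebook, iterations=10):
    """Lloyd's algorithm: assign values to levels, then move each level to its cell mean."""
    codebook = list(codebook)
    for _ in range(iterations):
        sums = [0.0] * len(codebook)
        counts = [0] * len(codebook)
        for v in values:
            k = nearest(codebook, v)
            sums[k] += v
            counts[k] += 1
        # A level that attracted no values keeps its previous position.
        codebook = [s / n if n else c for s, n, c in zip(sums, counts, codebook)]
    return codebook

def mse(a, b):
    """Mean squared error between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

random.seed(0)
# Hypothetical "parameters": a dense cluster near zero plus a few wide outliers.
weights = [random.gauss(0.0, 0.02) for _ in range(2000)]
weights += [random.uniform(-1.0, 1.0) for _ in range(20)]

uniform = uniform_codebook(weights, bits=4)
refined = lloyd_refine(weights, uniform)

err_uniform = mse(weights, quantize(weights, uniform))
err_refined = mse(weights, quantize(weights, refined))
print(f"uniform MSE: {err_uniform:.6f}")
print(f"refined MSE: {err_refined:.6f}")
print(f"ratio: {err_refined / err_uniform:.3f}")
```

Because each Lloyd iteration cannot increase the mean squared error, the refined codebook is never worse than the uniform grid it starts from at the same number of levels. The printed ratio is specific to this synthetic data and should not be compared with the 12.3% figure reported for HERA.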
