Title: OPAL: Outlier-Preserved Microscaling Quantization Accelerator for Generative Large Language Models

Sep 06, 2024

Jahyun Koo, Dahoon Park, Sangwoo Jung, Jaeha Kung

Abstract: To relieve the memory-size and bandwidth burden imposed by the ever-increasing size of large language models (LLMs), aggressive weight quantization has recently been studied, whereas activation quantization remains comparatively unexplored. In this paper, we present a hardware-software co-design method that yields an energy-efficient LLM accelerator for generation tasks, named OPAL. First, we propose a novel activation quantization method that leverages the microscaling data format while preserving a few outliers per sub-tensor block (e.g., four out of 128 elements). Second, on top of outlier preservation, we apply mixed precision: inputs to sensitive layers in the decoder block of an LLM use 5 bits, while inputs to less sensitive layers use 3 bits. Finally, we present the OPAL hardware architecture, which consists of FP units for handling outliers and vectorized INT multipliers for the dominant non-outlier operations. In addition, OPAL uses a log2-based approximation of the softmax operation that requires only shifts and subtractions, maximizing power efficiency. As a result, we improve energy efficiency by 1.6~2.2x and reduce area by 2.4~3.1x with negligible accuracy loss, i.e., a perplexity increase of less than 1.
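To make the outlier-preserving idea in the abstract concrete, here is a minimal NumPy sketch of quantizing one sub-tensor block with a shared power-of-two scale while keeping the few largest-magnitude elements in full precision. It is an illustration under stated assumptions, not the paper's implementation: the function names, the symmetric integer range, the rule for choosing the shared exponent, and the way outliers are stored are all hypothetical, since the abstract does not specify them.

```python
import numpy as np

def quantize_block_preserve_outliers(block, bits=3, num_outliers=4):
    """Quantize one block with a shared power-of-two scale (assumed scheme),
    keeping the `num_outliers` largest-magnitude elements unquantized."""
    block = np.asarray(block, dtype=np.float32)
    # Indices of the largest-magnitude elements: these are the preserved outliers.
    order = np.argsort(np.abs(block))
    outlier_idx = order[block.size - num_outliers:]
    mask = np.zeros(block.shape, dtype=bool)
    mask[outlier_idx] = True

    inliers = block[~mask]
    qmax = 2 ** (bits - 1) - 1  # symmetric signed range, e.g. [-3, 3] for 3 bits
    max_abs = float(np.max(np.abs(inliers))) if inliers.size else 0.0
    # Shared exponent chosen so that the largest inlier fits in [-qmax, qmax].
    exp = int(np.ceil(np.log2(max_abs / qmax))) if max_abs > 0 else 0
    scale = 2.0 ** exp

    q = np.clip(np.round(block / scale), -qmax, qmax).astype(np.int8)
    q[mask] = 0  # outlier slots are carried separately in full precision
    return q, exp, outlier_idx, block[outlier_idx]

def dequantize_block(q, exp, outlier_idx, outlier_vals):
    """Rebuild the block: scaled integers for inliers, stored values for outliers."""
    out = q.astype(np.float32) * np.float32(2.0 ** exp)
    out[outlier_idx] = outlier_vals
    return out

# Example: a 128-element block with four injected outliers.
rng = np.random.default_rng(0)
x = rng.normal(size=128).astype(np.float32)
x[[5, 40, 77, 100]] = [50.0, -60.0, 70.0, -80.0]
q, exp, idx, vals = quantize_block_preserve_outliers(x, bits=3, num_outliers=4)
x_hat = dequantize_block(q, exp, idx, vals)
```

Because the outliers are excluded before the shared scale is chosen, the scale is set by the bulk of the distribution rather than by a handful of extreme values, which is what lets the remaining elements tolerate a very low bit width in this sketch.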

* 7 pages, 8 figures, accepted at DAC 2024
