Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Title:OHQ: On-chip Hardware-aware Quantization

Sep 05, 2023

Wei Huang, Haotong Qin, Yangdong Liu, Jingzhuo Liang, Yifu Ding, Ying Li, Xianglong Liu

Figure 1 for OHQ: On-chip Hardware-aware Quantization

Figure 2 for OHQ: On-chip Hardware-aware Quantization

Figure 3 for OHQ: On-chip Hardware-aware Quantization

Figure 4 for OHQ: On-chip Hardware-aware Quantization

Share this with someone who'll enjoy it:

Abstract:Quantization emerges as one of the most promising approaches for deploying advanced deep models on resource-constrained hardware. Mixed-precision quantization leverages multiple bit-width architectures to unleash the accuracy and efficiency potential of quantized models. However, existing mixed-precision quantization suffers exhaustive search space that causes immense computational overhead. The quantization process thus relies on separate high-performance devices rather than locally, which also leads to a significant gap between the considered hardware metrics and the real deployment.In this paper, we propose an On-chip Hardware-aware Quantization (OHQ) framework that performs hardware-aware mixed-precision quantization without accessing online devices. First, we construct the On-chip Quantization Awareness (OQA) pipeline, enabling perceive the actual efficiency metrics of the quantization operator on the hardware.Second, we propose Mask-guided Quantization Estimation (MQE) technique to efficiently estimate the accuracy metrics of operators under the constraints of on-chip-level computing power.By synthesizing network and hardware insights through linear programming, we obtain optimized bit-width configurations. Notably, the quantization process occurs on-chip entirely without any additional computing devices and data access. We demonstrate accelerated inference after quantization for various architectures and compression ratios, achieving 70% and 73% accuracy for ResNet-18 and MobileNetV3, respectively. OHQ improves latency by 15~30% compared to INT8 on deployment.

* 10 pages, 6 figures

View paper on

Share this with someone who'll enjoy it:

Title:OHQ: On-chip Hardware-aware Quantization

Paper and Code