Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Kahkuen Fu

MRQ:Support Multiple Quantization Schemes through Model Re-Quantization

Aug 04, 2023

Manasa Manohara, Sankalp Dayal, Tariq Afzal, Rahul Bakshi, Kahkuen Fu

Figure 1 for MRQ:Support Multiple Quantization Schemes through Model Re-Quantization

Figure 2 for MRQ:Support Multiple Quantization Schemes through Model Re-Quantization

Figure 3 for MRQ:Support Multiple Quantization Schemes through Model Re-Quantization

Figure 4 for MRQ:Support Multiple Quantization Schemes through Model Re-Quantization

Abstract:Despite the proliferation of diverse hardware accelerators (e.g., NPU, TPU, DPU), deploying deep learning models on edge devices with fixed-point hardware is still challenging due to complex model quantization and conversion. Existing model quantization frameworks like Tensorflow QAT [1], TFLite PTQ [2], and Qualcomm AIMET [3] supports only a limited set of quantization schemes (e.g., only asymmetric per-tensor quantization in TF1.x QAT [4]). Accordingly, deep learning models cannot be easily quantized for diverse fixed-point hardwares, mainly due to slightly different quantization requirements. In this paper, we envision a new type of model quantization approach called MRQ (model re-quantization), which takes existing quantized models and quickly transforms the models to meet different quantization requirements (e.g., asymmetric -> symmetric, non-power-of-2 scale -> power-of-2 scale). Re-quantization is much simpler than quantizing from scratch because it avoids costly re-training and provides support for multiple quantization schemes simultaneously. To minimize re-quantization error, we developed a new set of re-quantization algorithms including weight correction and rounding error folding. We have demonstrated that MobileNetV2 QAT model [7] can be quickly re-quantized into two different quantization schemes (i.e., symmetric and symmetric+power-of-2 scale) with less than 0.64 units of accuracy loss. We believe our work is the first to leverage this concept of re-quantization for model quantization and models obtained from the re-quantization process have been successfully deployed on NNA in the Echo Show devices.

* 8 pages, 6 figures, 3 tables, TinyML Conference

Via

Access Paper or Ask Questions