Thea K. Årrestad

Gradient-based Automatic Per-Weight Mixed Precision Quantization for Neural Networks On-Chip

May 01, 2024

Chang Sun, Thea K. Årrestad, Vladimir Loncar, Jennifer Ngadiuba, Maria Spiropulu

Abstract: Model size and inference speed at deployment time are major challenges in many deep learning applications. A promising strategy to overcome these challenges is quantization. However, straightforward uniform quantization to very low precision can result in significant accuracy loss. Mixed-precision quantization, based on the idea that some parts of a network can tolerate lower precision than others without compromising performance, offers a potential solution. In this work, we present High Granularity Quantization (HGQ), a quantization-aware training method that automatically tunes the precision of each individual weight and activation for ultra-low-latency, low-power neural networks deployed on FPGAs. We demonstrate that HGQ can outperform existing methods by a substantial margin, reducing resource usage by up to a factor of 20 and improving latency by a factor of 5 while preserving accuracy.
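
As a rough illustration of what per-weight precision means, the Python sketch below rounds each weight to a fixed-point grid with its own number of fractional bits. It is not the HGQ implementation: the function name `quantize_per_weight`, the example weights, and the hand-picked bit widths are assumptions for illustration only, whereas the abstract describes the precisions as being tuned automatically during training.

```python
import numpy as np

def quantize_per_weight(w, frac_bits):
    """Round each weight to a fixed-point grid with its own fractional bit count.

    Hypothetical helper for illustration; weight i is rounded to the nearest
    multiple of 2**-frac_bits[i] (ties round up).
    """
    w = np.asarray(w, dtype=np.float64)
    frac_bits = np.asarray(frac_bits, dtype=np.int64)
    scale = np.exp2(frac_bits)                 # 2**f, one scale per weight
    return np.floor(w * scale + 0.5) / scale   # round to nearest grid point

# Assumed example values: four weights, each with a different precision.
weights = np.array([0.30, -0.72, 0.05, 1.26])
bits = np.array([1, 2, 4, 0])                  # fractional bits per weight
q = quantize_per_weight(weights, bits)
print(q)                                       # [ 0.5  -0.75  0.0625  1. ]
```

Weights given fewer fractional bits land on a coarser grid, so they need less hardware to store and multiply; a mixed-precision method decides where that coarseness is affordable.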
