Abstract: Deep neural networks have driven significant advances in machine learning. However, these networks require large numbers of parameters for storage and computation, which increases hardware cost and poses deployment challenges. Compression approaches have therefore been proposed to enable efficient accelerators. One important approach to deep neural network compression is quantization, in which full-precision values are stored in low bit-width. Beyond the memory savings, costly operations are replaced by simple, low-cost ones. Because of its flexibility and its impact on efficient hardware design, many quantization methods for DNNs have been proposed in recent years, and an integrated report is essential for understanding, analyzing, and comparing them. In this paper, we provide a comprehensive survey of DNN quantization. We describe the quantization concepts and categorize existing methods from different perspectives. We discuss the use of a scale factor to match the quantization levels to the distribution of the full-precision values, and we describe clustering-based methods. For the first time, we comprehensively review the training of quantized deep neural networks, including the use of the Straight-Through Estimator. We also describe the simplified operations in quantized deep convolutional neural networks and explain the sensitivity of different layers to quantization. Finally, we discuss how quantization methods are evaluated and compare the accuracy of previous methods at various bit-widths for weights and activations on CIFAR-10 and the large-scale ImageNet dataset.
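To make the core idea concrete, the following is a minimal sketch of uniform quantization with a scale factor chosen from the tensor's value range. This is our own illustrative code, not a specific method from the survey; the function names and the symmetric, per-tensor scheme are assumptions for illustration.

```python
import numpy as np

def quantize_uniform(x, num_bits=8):
    """Symmetric, per-tensor uniform quantization (illustrative sketch).

    A scale factor maps the full-precision range of `x` onto the
    integer grid representable with `num_bits` bits.
    """
    qmax = 2 ** (num_bits - 1) - 1                 # e.g. 127 for 8-bit signed values
    scale = np.max(np.abs(x)) / qmax               # match levels to the value distribution
    q = np.clip(np.round(x / scale), -qmax, qmax)  # low bit-width integer codes
    return q.astype(np.int8 if num_bits <= 8 else np.int32), scale

def dequantize(q, scale):
    """Recover an approximation of the full-precision values."""
    return q.astype(np.float32) * scale

# Example: quantize a weight tensor to 4 bits and inspect the reconstruction error.
w = np.random.randn(64, 64).astype(np.float32)
q, s = quantize_uniform(w, num_bits=4)
w_hat = dequantize(q, s)
print("mean abs error:", np.mean(np.abs(w - w_hat)))
```

When such a quantizer is used during training, the rounding step is non-differentiable; as discussed later in the survey, the Straight-Through Estimator addresses this by treating the rounding as the identity function in the backward pass.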