Pierce I-Jen Chuang

Latency-Aware Neural Architecture Search with Multi-Objective Bayesian Optimization

Jun 25, 2021

David Eriksson, Pierce I-Jen Chuang, Samuel Daulton, Peng Xia, Akshat Shrivastava, Arun Babu, Shicong Zhao, Ahmed Aly, Ganesh Venkatesh, Maximilian Balandat

Abstract: When tuning the architecture and hyperparameters of large machine learning models for on-device deployment, it is desirable to understand the optimal trade-offs between on-device latency and model accuracy. In this work, we leverage recent methodological advances in Bayesian optimization over high-dimensional search spaces and multi-objective Bayesian optimization to efficiently explore these trade-offs for a production-scale on-device natural language understanding model at Facebook.

* To Appear at the 8th ICML Workshop on Automated Machine Learning, ICML 2021
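
The "optimal trade-offs" between latency and accuracy mentioned in the abstract are commonly described as a Pareto front: the configurations that no other configuration beats on both objectives at once. The snippet below is a minimal sketch of that filtering step only, not of the Bayesian optimization method itself. The candidate names and the latency and accuracy numbers are made up for illustration, and `pareto_front` is a hypothetical helper, not an API from the paper.

```python
from typing import List, Tuple

# (name, latency in ms, accuracy); all values are illustrative, not from the paper.
Candidate = Tuple[str, float, float]


def pareto_front(candidates: List[Candidate]) -> List[Candidate]:
    """Keep candidates that are not dominated by any other candidate.

    A candidate is dominated if another one is no worse on both objectives
    (latency lower or equal, accuracy higher or equal) and strictly better
    on at least one of them.
    """
    front = []
    for name, lat, acc in candidates:
        dominated = any(
            (o_lat <= lat and o_acc >= acc) and (o_lat < lat or o_acc > acc)
            for _, o_lat, o_acc in candidates
        )
        if not dominated:
            front.append((name, lat, acc))
    # Sort by latency so the trade-off curve reads left to right.
    return sorted(front, key=lambda c: c[1])


candidates = [
    ("a", 10.0, 0.90),
    ("b", 12.0, 0.89),  # slower and less accurate than "a"
    ("c", 15.0, 0.93),
    ("d", 8.0, 0.85),
    ("e", 15.0, 0.91),  # same latency as "c" but less accurate
]
front = pareto_front(candidates)
print([name for name, _, _ in front])
```

A multi-objective optimizer would propose new candidates to improve this front; here the candidates are simply given.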

One Weight Bitwidth to Rule Them All

Aug 28, 2020

Ting-Wu Chin, Pierce I-Jen Chuang, Vikas Chandra, Diana Marculescu

Abstract: Weight quantization for deep ConvNets has shown promising results for applications such as image classification and semantic segmentation and is especially important for applications where memory storage is limited. However, when aiming for quantization without accuracy degradation, different tasks may end up with different bitwidths. This creates complexity for software and hardware support and the complexity accumulates when one considers mixed-precision quantization, in which case each layer's weights use a different bitwidth. Our key insight is that optimizing for the least bitwidth subject to no accuracy degradation is not necessarily an optimal strategy. This is because one cannot decide optimality between two bitwidths if one has a smaller model size while the other has better accuracy. In this work, we take the first step to understand if some weight bitwidth is better than others by aligning all to the same model size using a width-multiplier. Under this setting, somewhat surprisingly, we show that using a single bitwidth for the whole network can achieve better accuracy compared to mixed-precision quantization targeting zero accuracy degradation when both have the same model size. In particular, our results suggest that when the number of channels becomes a target hyperparameter, a single weight bitwidth throughout the network shows superior results for model compression.

* Accepted at ECCV 2020 Embedded Vision Workshop (Best paper)
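
The idea of aligning different bitwidths to the same model size with a width multiplier can be illustrated with a little arithmetic. The sketch below assumes that model size is parameter count times weight bitwidth, and that parameter count grows with the square of the width multiplier (a rough approximation for convolutional layers whose input and output channels both scale). Both assumptions and the function names are illustrative; the abstract does not spell out the exact procedure.

```python
import math


def size_matched_width_multiplier(ref_bits: int, target_bits: int) -> float:
    """Width multiplier that keeps total weight storage equal to the reference.

    Assumes parameters scale as width_mult ** 2, so
    width_mult ** 2 * target_bits == ref_bits.
    """
    return math.sqrt(ref_bits / target_bits)


def model_size_bits(base_params: int, width_mult: float, bits: int) -> float:
    """Approximate weight storage in bits under the quadratic-scaling assumption."""
    return base_params * width_mult ** 2 * bits


base_params = 1_000_000  # hypothetical parameter count at width multiplier 1.0
ref_size = model_size_bits(base_params, 1.0, 4)  # 4-bit reference network

for bits in (1, 2, 4, 8):
    mult = size_matched_width_multiplier(4, bits)
    size = model_size_bits(base_params, mult, bits)
    print(f"{bits}-bit weights: width multiplier {mult:.3f}, size {size:.0f} bits")
```

Under these assumptions a lower bitwidth buys a wider network at the same storage budget, which is the comparison the abstract describes.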

Bridging the Accuracy Gap for 2-bit Quantized Neural Networks

Jul 17, 2018

Jungwook Choi, Pierce I-Jen Chuang, Zhuo Wang, Swagath Venkataramani, Vijayalakshmi Srinivasan, Kailash Gopalakrishnan

Abstract: Deep learning algorithms achieve high classification accuracy at the expense of significant computation cost. To reduce this cost, several quantization schemes have gained attention recently, with some focusing on weight quantization and others on quantizing activations. This paper proposes novel techniques that target weight and activation quantization separately, resulting in an overall quantized neural network (QNN). The activation quantization technique, PArameterized Clipping acTivation (PACT), uses an activation clipping parameter $\alpha$ that is optimized during training to find the right quantization scale. The weight quantization scheme, statistics-aware weight binning (SAWB), finds the optimal scaling factor that minimizes the quantization error based on the statistical characteristics of the distribution of weights without the need for an exhaustive search. The combination of PACT and SAWB results in a 2-bit QNN that achieves state-of-the-art classification accuracy (comparable to full precision networks) across a range of popular models and datasets.

* arXiv admin note: substantial text overlap with arXiv:1805.06085

PACT: Parameterized Clipping Activation for Quantized Neural Networks

Jul 17, 2018

Jungwook Choi, Zhuo Wang, Swagath Venkataramani, Pierce I-Jen Chuang, Vijayalakshmi Srinivasan, Kailash Gopalakrishnan

Abstract: Deep learning algorithms achieve high classification accuracy at the expense of significant computation cost. To address this cost, a number of quantization schemes have been proposed, but most of these techniques focused on quantizing weights, which are relatively smaller in size compared to activations. This paper proposes a novel quantization scheme for activations during training that enables neural networks to work well with ultra low precision weights and activations without any significant accuracy degradation. This technique, PArameterized Clipping acTivation (PACT), uses an activation clipping parameter $\alpha$ that is optimized during training to find the right quantization scale. PACT allows quantizing activations to arbitrary bit precisions, while achieving much better accuracy relative to published state-of-the-art quantization schemes. We show, for the first time, that both weights and activations can be quantized to 4 bits of precision while still achieving accuracy comparable to full precision networks across a range of popular models and datasets. We also show that exploiting these reduced-precision computational units in hardware can enable a super-linear improvement in inferencing performance due to a significant reduction in the area of accelerator compute engines coupled with the ability to retain the quantized model and activation data in on-chip memories.
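
As a rough illustration of the clipping parameter described in the abstract, the sketch below clips an activation to the range [0, alpha] and then rounds it onto a uniform grid with 2^k - 1 steps. It covers only the forward pass on scalar inputs with a fixed alpha; in training, alpha would be a learnable parameter updated by gradient descent. The gradient helper uses a straight-through estimator for the rounding step, which is an assumption stated here rather than something the abstract specifies, and all function names are hypothetical.

```python
def pact_quantize(x: float, alpha: float, bits: int) -> float:
    """Clip x to [0, alpha], then quantize to 2**bits - 1 uniform steps."""
    if alpha <= 0:
        raise ValueError("alpha must be positive")
    levels = 2 ** bits - 1
    clipped = min(max(x, 0.0), alpha)
    # Map to [0, levels], round to the nearest integer level, map back.
    return round(clipped * levels / alpha) * alpha / levels


def pact_grad_alpha(x: float, alpha: float) -> float:
    """Gradient of the output with respect to alpha.

    Assumes a straight-through estimator: rounding is treated as the
    identity, so only the clipping decides the gradient.
    """
    return 1.0 if x >= alpha else 0.0


alpha = 6.0  # illustrative clipping level, not a value from the paper
for x in (-1.0, 2.9, 10.0):
    print(x, pact_quantize(x, alpha, bits=2), pact_grad_alpha(x, alpha))
```

With 2 bits and alpha = 6.0 the representable outputs are 0, 2, 4 and 6, so a larger alpha widens the range at the cost of coarser steps; learning alpha lets training pick that balance.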
