Jerry Quinn

Zero-Shot Dynamic Quantization for Transformer Inference

Nov 17, 2022

Yousef El-Kurdi, Jerry Quinn, Avirup Sil

Abstract: We introduce a novel run-time method for significantly reducing the accuracy loss associated with quantizing BERT-like models to 8-bit integers. Existing methods for quantizing models either modify the training procedure, or they require an additional calibration step to adjust parameters, which in turn requires a selected held-out dataset. Our method permits taking advantage of quantization without the need for these adjustments. We present results on several NLP tasks demonstrating the usefulness of this technique.

* To appear in EMNLP 2022 industry track
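
To make the idea of run-time quantization without a calibration set concrete, here is a minimal sketch of generic dynamic symmetric int8 quantization. It is not the method from the paper, whose details the abstract does not give. The function names (`quantize_int8`, `dynamic_int8_dot`) and the per-tensor max-magnitude scaling rule are assumptions chosen for illustration only.

```python
from typing import List, Tuple


def quantize_int8(values: List[float]) -> Tuple[List[int], float]:
    """Symmetric per-tensor quantization (illustrative assumption, not the paper's scheme).

    The scale is derived from the largest magnitude in the tensor itself,
    so no held-out calibration data is needed.
    """
    max_abs = max((abs(v) for v in values), default=0.0)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale


def dynamic_int8_dot(activations: List[float],
                     weights_q: List[int],
                     weight_scale: float) -> float:
    """Dot product with activations quantized on the fly at inference time."""
    act_q, act_scale = quantize_int8(activations)  # scale computed at run time
    acc = sum(a * w for a, w in zip(act_q, weights_q))  # integer accumulation
    return acc * act_scale * weight_scale  # rescale back to floating point


# Weights are quantized once, offline; activations are quantized per input.
weights = [0.5, -1.0, 0.25]
weights_q, w_scale = quantize_int8(weights)

x = [2.0, 1.0, -4.0]
approx = dynamic_int8_dot(x, weights_q, w_scale)
exact = sum(a * w for a, w in zip(x, weights))
print(f"exact={exact:.4f} approx={approx:.4f}")
```

In this sketch the only data-dependent quantity, the activation scale, is recomputed for every input, which is what distinguishes dynamic quantization from approaches that fix scales during a calibration pass.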

Optimal Mini-Batch Size Selection for Fast Gradient Descent

Nov 15, 2019

Michael P. Perrone, Haidar Khan, Changhoan Kim, Anastasios Kyrillidis, Jerry Quinn, Valentina Salapura

Abstract: This paper presents a methodology for selecting the mini-batch size that minimizes Stochastic Gradient Descent (SGD) learning time for single and multiple learner problems. By decoupling algorithmic analysis issues from hardware and software implementation details, we reveal a robust empirical inverse law between mini-batch size and the average number of SGD updates required to converge to a specified error threshold. Combining this empirical inverse law with measured system performance, we create an accurate, closed-form model of average training time and show how this model can be used to identify quantifiable implications for both algorithmic and hardware aspects of machine learning. We demonstrate the inverse law empirically, on both image recognition (MNIST, CIFAR10 and CIFAR100) and machine translation (Europarl) tasks, and provide a theoretic justification via proving a novel bound on mini-batch SGD training.
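
The way an inverse law and measured per-update cost can combine into a closed-form training-time model is sketched below. The specific forms are assumptions made for illustration, since the abstract does not state them: updates to converge are modeled as `n_inf + alpha / batch_size`, and time per update as `t_fixed + t_sample * batch_size`. All parameter names and the example values are hypothetical.

```python
import math


def updates_to_converge(batch_size: float, n_inf: float, alpha: float) -> float:
    """Assumed inverse law: fewer updates are needed as the mini-batch grows."""
    return n_inf + alpha / batch_size


def training_time(batch_size: float, n_inf: float, alpha: float,
                  t_fixed: float, t_sample: float) -> float:
    """Total time = (number of updates) * (assumed linear cost per update)."""
    time_per_update = t_fixed + t_sample * batch_size
    return updates_to_converge(batch_size, n_inf, alpha) * time_per_update


def optimal_batch_size(n_inf: float, alpha: float,
                       t_fixed: float, t_sample: float) -> float:
    """Minimizer of training_time under the two assumed forms.

    Setting d/dB [(n_inf + alpha/B) * (t_fixed + t_sample*B)] = 0 gives
    B* = sqrt(alpha * t_fixed / (n_inf * t_sample)).
    """
    return math.sqrt(alpha * t_fixed / (n_inf * t_sample))


# Hypothetical fitted constants, not taken from the paper.
params = dict(n_inf=1000.0, alpha=64000.0, t_fixed=0.01, t_sample=0.0001)
best = optimal_batch_size(**params)
print(f"optimal batch size ~ {best:.1f}, "
      f"time ~ {training_time(best, **params):.2f} s")
```

Under these assumptions the trade-off is visible in the two cross terms: small batches pay the fixed per-update overhead many times, while large batches pay for samples that no longer reduce the update count.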

Pieces of Eight: 8-bit Neural Machine Translation

Apr 13, 2018

Jerry Quinn, Miguel Ballesteros

Abstract: Neural machine translation has achieved levels of fluency and adequacy that would have been surprising a short time ago. Output quality is extremely relevant for industry purposes; however, it is equally important to produce results in the shortest time possible, mainly for latency-sensitive applications and to control cloud hosting costs. In this paper we show the effectiveness of translating with 8-bit quantization for models that have been trained using 32-bit floating point values. Results show that 8-bit translation makes a non-negligible impact in terms of speed with no degradation in accuracy and adequacy.

* To appear at NAACL 2018 Industry Track
