Abstract: As neural network model sizes have dramatically increased, so has the interest in various techniques to reduce their parameter counts and accelerate their execution. An active area of research in this field is sparsity: encouraging zero values in parameters that can then be discarded from storage or computations. While most research focuses on high levels of sparsity, there are challenges in universally maintaining model accuracy as well as in achieving significant speedups over modern matrix-math hardware. To make sparsity adoption practical, the NVIDIA Ampere GPU architecture introduces sparsity support in its matrix-math units, Tensor Cores. We present the design and behavior of Sparse Tensor Cores, which exploit a 2:4 (50%) sparsity pattern that leads to twice the math throughput of dense matrix units. We also describe a simple workflow for training networks that both satisfy the 2:4 sparsity pattern requirements and maintain accuracy, verifying it on a wide range of common tasks and model architectures. This workflow makes it easy to prepare accurate models for efficient deployment on Sparse Tensor Cores.
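To make the 2:4 pattern concrete, the sketch below prunes a weight tensor so that every group of four consecutive values keeps only its two largest-magnitude entries. This is a minimal NumPy illustration of the pattern itself; the function name prune_2_4 and the magnitude-based heuristic are assumptions for illustration, not NVIDIA's pruning tooling or the training workflow described in the abstract.

    import numpy as np

    def prune_2_4(weights: np.ndarray) -> np.ndarray:
        """Enforce a 2:4 sparsity pattern: in every group of 4 consecutive
        values along the flattened tensor, keep the 2 largest magnitudes and
        zero the rest. Assumes weights.size is divisible by 4."""
        w = weights.reshape(-1, 4)                       # groups of 4
        drop = np.argsort(np.abs(w), axis=1)[:, :2]      # 2 smallest magnitudes per group
        mask = np.ones_like(w, dtype=bool)
        np.put_along_axis(mask, drop, False, axis=1)     # zero out the dropped positions
        return (w * mask).reshape(weights.shape)

    # Example: a row of 8 weights -> at most 2 nonzeros in each group of 4.
    w = np.array([0.9, -0.1, 0.05, 0.7, 0.2, -0.8, 0.01, 0.3])
    print(prune_2_4(w))   # approximately [0.9, 0, 0, 0.7, 0, -0.8, 0, 0.3]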
Abstract: Today's high-performance deep learning architectures involve large models with numerous parameters. Low-precision numerics has emerged as a popular technique to reduce both the compute and memory requirements of these large models. However, lowering precision often leads to accuracy degradation. We describe three schemes whereby one can both train and perform efficient inference using low-precision numerics without hurting accuracy. Finally, we describe an efficient hardware accelerator that can take advantage of the proposed low-precision numerics.
Abstract: Deep learning networks have achieved state-of-the-art accuracies on computer vision workloads like image classification and object detection. The performant systems, however, typically involve big models with numerous parameters. Once trained, a challenging aspect for such top-performing models is deployment on resource-constrained inference systems: the models (often deep networks or wide networks or both) are compute and memory intensive. Low-precision numerics and model compression using knowledge distillation are popular techniques to lower both the compute requirements and memory footprint of these deployed models. In this paper, we study the combination of these two techniques and show that the performance of low-precision networks can be significantly improved by using knowledge distillation techniques. Our approach, Apprentice, achieves state-of-the-art accuracies using ternary precision and 4-bit precision for variants of the ResNet architecture on the ImageNet dataset. We present three schemes by which one can apply knowledge distillation techniques to various stages of the train-and-deploy pipeline.
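As a rough illustration of the distillation idea behind Apprentice, the sketch below combines a hard-label cross-entropy term with a softened teacher/student KL term, as in standard knowledge distillation; a low-precision student would be trained against a full-precision teacher with such a loss. The temperature and alpha values are placeholders, and this is a generic objective rather than the paper's exact schemes.

    import torch
    import torch.nn.functional as F

    def distillation_loss(student_logits, teacher_logits, labels,
                          temperature=4.0, alpha=0.5):
        """Hard-label cross-entropy plus a soft-target KL term from a
        full-precision teacher. Hyperparameter values are illustrative."""
        hard = F.cross_entropy(student_logits, labels)
        soft = F.kl_div(
            F.log_softmax(student_logits / temperature, dim=1),
            F.softmax(teacher_logits / temperature, dim=1),
            reduction="batchmean",
        ) * temperature ** 2   # standard temperature scaling of the soft term
        return alpha * hard + (1.0 - alpha) * soft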
Abstract: Similar to convolutional neural networks, recurrent neural networks (RNNs) typically suffer from over-parameterization. Quantizing the bit-widths of weights and activations improves runtime efficiency on hardware, yet it often comes at the cost of reduced accuracy. This paper proposes a quantization approach that increases model size alongside bit-width reduction. This approach allows networks to match their baseline accuracy while retaining the benefits of reduced precision and an overall reduction in model size.
Abstract: For computer vision applications, prior works have shown the efficacy of reducing the numeric precision of model parameters (network weights) in deep neural networks. Activation maps, however, occupy a large memory footprint during both the training and inference steps when using mini-batches of inputs. One way to reduce this large memory footprint is to reduce the precision of activations. However, past works have shown that reducing the precision of activations hurts model accuracy. We study schemes to train networks from scratch using reduced-precision activations without hurting accuracy. We reduce the precision of activation maps (along with model parameters) and increase the number of filter maps in a layer, and find that this scheme matches or surpasses the accuracy of the baseline full-precision network. As a result, one can significantly improve execution efficiency (e.g., reduce the dynamic memory footprint, memory bandwidth, and computational energy) and speed up the training and inference process with appropriate hardware support. We call our scheme WRPN: wide reduced-precision networks. We report results and show that the WRPN scheme yields better accuracy than previously reported results on the ILSVRC-12 dataset while being computationally less expensive than previously reported reduced-precision networks.
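A minimal sketch of the two ingredients the WRPN abstract describes, under assumed details: a uniform k-bit quantizer for activations already clipped to [0, 1], and a widening of a layer's filter maps to recover accuracy. The function names, the widening factor, and the omission of a straight-through estimator for training are assumptions for illustration, not taken from the paper.

    import torch
    import torch.nn as nn

    def quantize_k_bits(x, k):
        """Uniform k-bit quantization of values clipped to [0, 1].
        Rounding is non-differentiable; training would typically pair this
        with a straight-through estimator, omitted here for brevity."""
        levels = 2 ** k - 1
        return torch.round(torch.clamp(x, 0, 1) * levels) / levels

    def widen(out_channels, widen_factor=2):
        """Widen a layer (more filter maps) to compensate for the accuracy
        lost to reduced-precision weights and activations. The factor of 2
        is illustrative, not a value from the paper."""
        return int(out_channels * widen_factor)

    # Example: a 4-bit activation quantizer in front of a widened conv layer.
    conv = nn.Conv2d(in_channels=64, out_channels=widen(128), kernel_size=3)
    act = quantize_k_bits(torch.sigmoid(torch.randn(1, 64, 32, 32)), k=4)
    out = conv(act)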
Abstract: For computer vision applications, prior works have shown the efficacy of reducing the numeric precision of model parameters (network weights) in deep neural networks, but also that reducing the precision of activations hurts model accuracy much more than reducing the precision of model parameters. We study schemes to train networks from scratch using reduced-precision activations without hurting model accuracy. We reduce the precision of activation maps (along with model parameters) using a novel quantization scheme and increase the number of filter maps in a layer, and find that this scheme matches or surpasses the accuracy of the baseline full-precision network. As a result, one can significantly reduce the dynamic memory footprint, memory bandwidth, and computational energy, and speed up the training and inference process with appropriate hardware support. We call our scheme WRPN: wide reduced-precision networks. We report results using our proposed schemes and show that they are better than previously reported accuracies on the ILSVRC-12 dataset while being computationally less expensive than previously reported reduced-precision networks.