Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Kevin Siu

Loom: Exploiting Weight and Activation Precisions to Accelerate Convolutional Neural Networks

May 16, 2018

Sayeh Sharify, Alberto Delmas Lascorz, Kevin Siu, Patrick Judd, Andreas Moshovos

Figure 1 for Loom: Exploiting Weight and Activation Precisions to Accelerate Convolutional Neural Networks

Figure 2 for Loom: Exploiting Weight and Activation Precisions to Accelerate Convolutional Neural Networks

Figure 3 for Loom: Exploiting Weight and Activation Precisions to Accelerate Convolutional Neural Networks

Figure 4 for Loom: Exploiting Weight and Activation Precisions to Accelerate Convolutional Neural Networks

Abstract:Loom (LM), a hardware inference accelerator for Convolutional Neural Networks (CNNs) is presented. In LM every bit of data precision that can be saved translates to proportional performance gains. Specifically, for convolutional layers LM's execution time scales inversely proportionally with the precisions of both weights and activations. For fully-connected layers LM's performance scales inversely proportionally with the precision of the weights. LM targets area- and bandwidth-constrained System-on-a-Chip designs such as those found on mobile devices that cannot afford the multi-megabyte buffers that would be needed to store each layer on-chip. Accordingly, given a data bandwidth budget, LM boosts energy efficiency and performance over an equivalent bit-parallel accelerator. For both weights and activations LM can exploit profile-derived perlayer precisions. However, at runtime LM further trims activation precisions at a much smaller than a layer granularity. Moreover, it can naturally exploit weight precision variability at a smaller granularity than a layer. On average, across several image classification CNNs and for a configuration that can perform the equivalent of 128 16b x 16b multiply-accumulate operations per cycle LM outperforms a state-of-the-art bit-parallel accelerator [1] by 4.38x without any loss in accuracy while being 3.54x more energy efficient. LM can trade-off accuracy for additional improvements in execution performance and energy efficiency and compares favorably to an accelerator that targeted only activation precisions. We also study 2- and 4-bit LM variants and find the the 2-bit per cycle variant is the most energy efficient.

Via

Access Paper or Ask Questions

DPRed: Making Typical Activation Values Matter In Deep Learning Computing

May 15, 2018

Alberto Delmas, Sayeh Sharify, Patrick Judd, Kevin Siu, Milos Nikolic, Andreas Moshovos

Figure 1 for DPRed: Making Typical Activation Values Matter In Deep Learning Computing

Figure 2 for DPRed: Making Typical Activation Values Matter In Deep Learning Computing

Figure 3 for DPRed: Making Typical Activation Values Matter In Deep Learning Computing

Figure 4 for DPRed: Making Typical Activation Values Matter In Deep Learning Computing

Abstract:We show that selecting a fixed precision for all values in Convolutional Neural Networks, even if that precision is different per layer, amounts to worst case design. We show that much lower precisions can be used if we could target the common case instead by tailoring the precision at a much finer granularity than that of a layer. While this observation may not be surprising, to date no design takes advantage of it in practice. We propose Dynamic Prediction Reduction (DPRed), where hardware on-the-fly detects the precision activations need at a much finer granularity than a whole layer. Further we encode activations and weights using the respective per group dynamically and statically detected precisions to reduce off- and on-chip storage and communication. We demonstrate a practical implementation of DPRed with DPRed Stripes (DPRS), a data-parallel hardware accelerator that adjusts precision on-the-fly to accommodate the values of the activations it processes concurrently. DPRS accelerates convolutional layers and executes unmodified convolutional neural networks. Ignoring offchip communication, DPRS is 2.61x faster and 1.84x more energy efficient than a fixed-precision accelerator for a set of convolutional neural networks. We further extend DPRS to exploit activation and weight precisions for fully-connected layers. The enhanced design improves average performance and energy efficiency respectively by 2.59x and 1.19x over the fixed-precision accelerator for a broader set of neural networks. Finally, we consider a lower cost variant that supports only even precision widths which offers better energy efficiency. Taking into account off-chip communication, DPRed compression reduces off-chip traffic to nearly 35% on average compared to no compression making it possible to sustain higher performance for a given off-chip memory interface while also boosting energy efficiency.

* This paper was originally submitted to HPCA-24 in August 2017 and subsequently revised to include the motivation section. It was also submitted to ISCA-18 in November 2017. The current version has been updated to include an external bandwidth analysis. arXiv admin note: text overlap with arXiv:1707.09068

Via

Access Paper or Ask Questions