Title: Low Latency CMOS Hardware Acceleration for Fully Connected Layers in Deep Neural Networks

Nov 25, 2020

Nick Iliev, Amit Ranjan Trivedi


Abstract: We present a novel low latency CMOS hardware accelerator for fully connected (FC) layers in deep neural networks (DNNs). The FC accelerator, FC-ACCL, is based on 128 8x8 or 16x16 processing elements (PEs) for matrix-vector multiplication, and 128 multiply-accumulate (MAC) units integrated with 128 High Bandwidth Memory (HBM) units for storing the pretrained weights. Micro-architectural details for CMOS ASIC implementations are presented, and simulated performance is compared to recent hardware accelerators for DNNs on AlexNet and VGG16. When comparing simulated processing latency for a 4096-1000 FC8 layer, our FC-ACCL achieves 48.4 GOPS (with a 100 MHz clock), which improves on a recent FC8 layer accelerator quoted at 28.8 GOPS with a 150 MHz clock. We have achieved this considerable improvement by fully utilizing the HBM units for storing and reading out column-specific FC layer weights in 1 cycle with a novel column-row-column schedule, and by implementing a maximally parallel datapath for processing these weights with the corresponding MAC and PE units. When up-scaled to 128 16x16 PEs, for 16x16 tiles of weights, the design can reduce latency for the large FC6 layer by 60% in AlexNet and by 3% in VGG16 when compared to an alternative EIE solution which uses compression.
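As a rough software illustration of the tile-based matrix-vector multiplication the abstract refers to, the sketch below computes an FC layer output by walking the weight matrix in 8x8 (or 16x16) tiles and accumulating the partial sums. It is not the FC-ACCL micro-architecture: the function names, the loop order, and the mapping of one tile to one PE are assumptions made for illustration, and the column-row-column schedule and HBM readout are not modeled. The `latency_us` helper additionally assumes that one MAC counts as two operations, which the abstract does not state.

```python
def fc_matvec_tiled(weights, x, tile=8):
    """Compute y = W x by visiting W in tile x tile blocks.

    weights: list of rows, shape (out_features, in_features)
    x:       list of length in_features
    tile:    tile edge length (8 or 16 in the abstract; any positive int here)
    """
    rows, cols = len(weights), len(x)
    y = [0] * rows
    for r0 in range(0, rows, tile):        # step over tile rows
        for c0 in range(0, cols, tile):    # step over tile columns
            # Work for one tile (hypothetically, what a single PE would handle).
            for r in range(r0, min(r0 + tile, rows)):
                acc = 0
                for c in range(c0, min(c0 + tile, cols)):
                    acc += weights[r][c] * x[c]
                y[r] += acc                # accumulate the partial sum
    return y


def fc_matvec_direct(weights, x):
    """Untiled reference used to check the tiled version."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


def latency_us(in_features, out_features, gops, ops_per_mac=2):
    """Estimated layer latency in microseconds at a given GOPS figure.

    ops_per_mac=2 (one multiply plus one add) is an assumption, not a
    convention taken from the paper.
    """
    ops = ops_per_mac * in_features * out_features
    return ops / (gops * 1e9) * 1e6


if __name__ == "__main__":
    # Small integer example whose dimensions are not multiples of the tile size.
    W = [[(3 * r + c) % 7 - 3 for c in range(19)] for r in range(20)]
    x = [c % 5 - 2 for c in range(19)]
    print(fc_matvec_tiled(W, x, tile=8) == fc_matvec_direct(W, x))
    # Relative latency of the two quoted throughput figures for a 4096-1000 layer.
    print(latency_us(4096, 1000, 48.4), latency_us(4096, 1000, 28.8))
```

Because the tiles only regroup the same multiply-accumulate operations, the tiled and direct results agree exactly for integer inputs; in hardware, the benefit comes from processing many tiles in parallel rather than from the tiling itself.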
