Shreyas Kolala Venkataramanaiah

Automatic Compiler Based FPGA Accelerator for CNN Training

Aug 15, 2019

Shreyas Kolala Venkataramanaiah, Yufei Ma, Shihui Yin, Eriko Nurvithadhi, Aravind Dasu, Yu Cao, Jae-sun Seo

Abstract: Training of convolutional neural networks (CNNs) on embedded platforms to support on-device learning has become increasingly important. Designing flexible training hardware is much more challenging than designing inference hardware, due to design complexity and large computation/memory requirements. In this work, we present an automatic compiler-based FPGA accelerator with 16-bit fixed-point precision for complete CNN training, including Forward Pass (FP), Backward Pass (BP) and Weight Update (WU). We implemented an optimized RTL library to perform training-specific tasks and developed an RTL compiler to automatically generate FPGA-synthesizable RTL based on user-defined constraints. We present a new cyclic weight storage/access scheme for on-chip BRAM and off-chip DRAM to efficiently implement non-transpose and transpose operations during the FP and BP phases, respectively. Representative CNNs for the CIFAR-10 dataset are implemented and trained on an Intel Stratix 10-GX FPGA using the proposed hardware architecture, demonstrating up to 479 GOPS performance.

* 6 pages, 9 figures, paper accepted at FPL2019 conference

FixyNN: Efficient Hardware for Mobile Computer Vision via Transfer Learning

Feb 27, 2019

Paul N. Whatmough, Chuteng Zhou, Patrick Hansen, Shreyas Kolala Venkataramanaiah, Jae-sun Seo, Matthew Mattina

Abstract: The computational demands of computer vision tasks based on state-of-the-art Convolutional Neural Network (CNN) image classification far exceed the energy budgets of mobile devices. This paper proposes FixyNN, which consists of a fixed-weight feature extractor that generates ubiquitous CNN features, and a conventional programmable CNN accelerator which processes a dataset-specific CNN. Image classification models for FixyNN are trained end-to-end via transfer learning, with the common feature extractor representing the transferred part, and the programmable part being learnt on the target dataset. Experimental results demonstrate that FixyNN hardware can achieve very high energy efficiencies of up to 26.6 TOPS/W ($4.81 \times$ better than an iso-area programmable accelerator). Over a suite of six datasets we trained models via transfer learning with an accuracy loss of $<1\%$, resulting in up to 11.2 TOPS/W - nearly $2 \times$ more efficient than a conventional programmable CNN accelerator of the same area.

* 10 pages, 8 figures, paper accepted at SysML2019 conference
