Kuo-Wei Chang

Efficient Accelerator for Dilated and Transposed Convolution with Decomposition

May 02, 2022

Kuo-Wei Chang, Tian-Sheuan Chang

Abstract: Hardware acceleration for dilated and transposed convolution enables real-time execution of related tasks such as segmentation, but current designs are either specific to these convolution types or, when reconfigurable, suffer from complex control. This paper presents a design that decomposes the input for dilated convolution and the weights for transposed convolution to skip redundant computations, so that both execute efficiently on existing dense CNN hardware as well. The proposed architecture cuts cycle counts by 87.8%, achieving an 8.2X speedup over naive execution for the ENet case.

* 10 pages, 12 figures, published in IEEE ISCAS 2020
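
The input decomposition mentioned in the abstract above can be illustrated with a minimal NumPy sketch. This is not the authors' hardware design; it only shows the standard identity that a dilated convolution with rate d equals d x d dense convolutions on strided sub-images of the input. The function names (conv2d_valid, dilated_naive, dilated_decomposed) are hypothetical, and the "naive" version assumes a dense engine that runs a zero-inserted kernel.

```python
import numpy as np

def conv2d_valid(x, w):
    """Dense 'valid' cross-correlation, standing in for a dense CNN engine."""
    kh, kw = w.shape
    oh, ow = x.shape[0] - kh + 1, x.shape[1] - kw + 1
    y = np.zeros((oh, ow), dtype=np.result_type(x, w))
    for a in range(kh):
        for b in range(kw):
            y += w[a, b] * x[a:a + oh, b:b + ow]
    return y

def dilated_naive(x, w, d):
    """Assumed naive execution: insert zeros into the kernel, then run it densely."""
    kh, kw = w.shape
    wz = np.zeros(((kh - 1) * d + 1, (kw - 1) * d + 1), dtype=w.dtype)
    wz[::d, ::d] = w  # original taps sit d apart; everything else is zero
    return conv2d_valid(x, wz)

def dilated_decomposed(x, w, d):
    """Split the input into d*d phases and run the compact kernel on each."""
    kh, kw = w.shape
    oh = x.shape[0] - (kh - 1) * d
    ow = x.shape[1] - (kw - 1) * d
    y = np.zeros((oh, ow), dtype=np.result_type(x, w))
    for p in range(d):
        for q in range(d):
            # Phase (p, q) of the input produces phase (p, q) of the output.
            y[p::d, q::d] = conv2d_valid(x[p::d, q::d], w)
    return y

rng = np.random.default_rng(0)
x = rng.standard_normal((16, 16))
w = rng.standard_normal((3, 3))
same = all(
    np.allclose(dilated_naive(x, w, d), dilated_decomposed(x, w, d))
    for d in (1, 2, 3, 4)
)
```

In this sketch the naive path multiplies by ((K-1)d+1)^2 kernel entries per output, most of them zero, while the decomposed path uses only the K^2 real taps. As a separate arithmetic check on the abstract's figures, removing 87.8% of the cycles leaves 12.2%, and 1/0.122 is approximately 8.2, which matches the reported speedup.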

A Real Time 1280x720 Object Detection Chip With 585MB/s Memory Traffic

May 02, 2022

Kuo-Wei Chang, Hsu-Tung Shih, Tian-Sheuan Chang, Shang-Hong Tsai, Chih-Chyau Yang, Chien-Ming Wu, Chun-Ming Huang

Abstract: Memory bandwidth has become the real-time bottleneck of current deep learning accelerators (DLAs), particularly for high-definition (HD) object detection. Under resource constraints, this paper proposes a low-memory-traffic DLA chip with joint hardware and software optimization. To maximize hardware utilization under limited memory bandwidth, we morph and fuse the object detection model into a group fusion-ready model to reduce intermediate data access. This reduces YOLOv2's feature memory traffic from 2.9 GB/s to 0.15 GB/s. To support group fusion, the hardware, which is based on our previous DLA, employs a unified buffer with write-masking for simple layer-by-layer processing in a fusion group. Compared with our previous DLA with the same number of PEs, the chip, implemented in a TSMC 40nm process, supports 1280x720@30FPS object detection and consumes 7.9X less external DRAM access energy, reducing it from 2607 mJ to 327.6 mJ.

* 11 pages, 14 figures, to be published in IEEE Transactions on Very Large Scale Integration (VLSI) Systems
