Abstract: Spiking Neural Networks (SNNs) compute in an event-based manner to achieve more efficient computation than standard neural networks. In SNNs, neuronal outputs (i.e. activations) are encoded not as real-valued activations but as sequences of binary spikes. The motivation for using SNNs over conventional neural networks is rooted in their special computational properties, especially the very high degree of sparsity of neuronal output activations. Well-established architectures for conventional Convolutional Neural Networks (CNNs) feature large spatial arrays of Processing Elements (PEs) that remain highly underutilized in the face of activation sparsity. We propose a novel architecture that is optimized for the processing of Convolutional SNNs (CSNNs) featuring a high degree of activation sparsity. The main strategy of our architecture is to use fewer but highly utilized PEs. The PE array used to perform the convolution is only as large as the kernel, allowing all PEs to remain active as long as there are spikes to process. This constant flow of spikes is ensured by compressing the feature maps (i.e. the activations) into queues that can then be processed spike by spike. This compression is performed at run time using dedicated circuitry, resulting in self-timed scheduling and a processing time that scales directly with the number of spikes. A novel memory organization scheme called memory interlacing is used to efficiently store and retrieve the membrane potentials of the individual neurons using multiple small parallel on-chip RAMs. Each RAM is hardwired to its PE, reducing switching circuitry and allowing the RAMs to be placed in close proximity to their respective PEs. We implemented the proposed architecture on an FPGA and achieved a significant speedup compared to other implementations while requiring fewer hardware resources and consuming less energy.
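To make the memory-interlacing idea more concrete, the minimal C sketch below shows one plausible bank/address mapping in which every membrane potential touched by a single 3x3 kernel update falls into a different RAM bank, so each PE can access its hardwired RAM in the same cycle. The kernel size, feature-map width, and the mapping formula are illustrative assumptions, not the exact scheme of the proposed architecture.

```c
#include <stdio.h>

/* Illustrative memory-interlacing sketch (assumption, not the paper's exact mapping):
 * neurons are spread over KH*KW small RAM banks so that any KHxKW neighborhood
 * hits all banks exactly once. */
#define KH 3        /* kernel height (= PE array height), assumed            */
#define KW 3        /* kernel width  (= PE array width), assumed             */
#define MAP_W 30    /* feature-map width, assumed to be a multiple of KW     */

/* Map a neuron coordinate to (bank index, word address inside that bank). */
static void interlace(int row, int col, int *bank, int *addr)
{
    *bank = (row % KH) * KW + (col % KW);
    *addr = (row / KH) * (MAP_W / KW) + (col / KW);
}

int main(void)
{
    /* All 9 membrane potentials updated by a spike at (r, c) land in distinct banks. */
    int r = 10, c = 17;
    for (int dr = 0; dr < KH; ++dr)
        for (int dc = 0; dc < KW; ++dc) {
            int bank, addr;
            interlace(r + dr, c + dc, &bank, &addr);
            printf("neuron (%2d,%2d) -> bank %d, addr %d\n", r + dr, c + dc, bank, addr);
        }
    return 0;
}
```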
Abstract: Writing programs for heterogeneous platforms optimized for high performance is hard, since this requires the code to be tuned at a low level with architecture-specific optimizations that are often based on fundamentally different programming paradigms and languages. OpenVX promises to solve this issue for computer vision applications with a royalty-free industry standard that is based on a graph-execution model. Yet, OpenVX's algorithm space is constrained to a small set of vision functions. This hinders the acceleration of computations that are not included in the standard. In this paper, we analyze the OpenVX vision functions to find an orthogonal set of computational abstractions. Based on these abstractions, we couple an existing Domain-Specific Language (DSL) back end to the OpenVX environment and provide language constructs to the programmer for the definition of user-defined nodes. In this way, we enable optimizations that cannot be detected in OpenVX graph implementations that use only the standard computer vision functions. These optimizations can double the throughput on an Nvidia GTX GPU and decrease the resource usage of a Xilinx Zynq FPGA by 50% for our benchmarks. Finally, we show that our proposed compiler framework, called HipaccVX, can achieve better results than the state-of-the-art approaches Nvidia VisionWorks and Halide-HLS.
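For readers unfamiliar with the graph-execution model that the abstract refers to, the short C sketch below builds a standard OpenVX pipeline (Gaussian blur, Sobel, gradient magnitude) using only calls from the OpenVX specification; the image size and the particular pipeline are arbitrary examples, and no HipaccVX-specific user-defined nodes are shown.

```c
#include <VX/vx.h>

int main(void)
{
    /* A graph is declared once, verified/optimized by the runtime, then executed. */
    vx_context ctx   = vxCreateContext();
    vx_graph   graph = vxCreateGraph(ctx);

    vx_image in  = vxCreateImage(ctx, 640, 480, VX_DF_IMAGE_U8);
    vx_image out = vxCreateImage(ctx, 640, 480, VX_DF_IMAGE_S16);

    /* Virtual images: intermediates the runtime is free to optimize away. */
    vx_image blur = vxCreateVirtualImage(graph, 640, 480, VX_DF_IMAGE_U8);
    vx_image gx   = vxCreateVirtualImage(graph, 640, 480, VX_DF_IMAGE_S16);
    vx_image gy   = vxCreateVirtualImage(graph, 640, 480, VX_DF_IMAGE_S16);

    /* Nodes from the standard OpenVX vision-function set. */
    vxGaussian3x3Node(graph, in, blur);
    vxSobel3x3Node(graph, blur, gx, gy);
    vxMagnitudeNode(graph, gx, gy, out);

    if (vxVerifyGraph(graph) == VX_SUCCESS)
        vxProcessGraph(graph);

    vxReleaseGraph(&graph);
    vxReleaseContext(&ctx);
    return 0;
}
```

Because the whole pipeline is known to the runtime before execution, implementations can fuse or retarget nodes; HipaccVX extends this model so that user-defined nodes participate in such optimizations as well.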