Abstract:The rapid advancement of neural network applications necessitates hardware that not only accelerates computation but also adapts efficiently to dynamic processing requirements. While processing-in-pixel has emerged as a promising solution to overcome the bottlenecks of traditional architectures at the extreme-edge, existing implementations face limitations in reconfigurability and scalability due to their static nature and inefficient area usage. Addressing these challenges, we present a novel architecture that significantly enhances the capabilities of processing-in-pixel for convolutional neural networks. Our design innovatively integrates non-volatile memory (NVM) with novel unit pixel circuit design, enabling dynamic reconfiguration of synaptic weights, kernel size, channel size and stride size. Thus offering unprecedented flexibility and adaptability. With using a separate die for pixel circuit and storing synaptic weights, our circuit achieves a substantial reduction in the required area per pixel thereby increasing the density and scalability of the pixel array. Simulation results demonstrate dot product operations of the circuit, the non-linearity of its analog output and a novel bucket-select curvefit model is proposed to capture it. This work not only addresses the limitations of current in-pixel computing approaches but also opens new avenues for developing more efficient, flexible, and scalable neural network hardware, paving the way for advanced AI applications.
Abstract:The massive amounts of data generated by camera sensors motivate data processing inside pixel arrays, i.e., at the extreme-edge. Several critical developments have fueled recent interest in the processing-in-pixel-in-memory paradigm for a wide range of visual machine intelligence tasks, including (1) advances in 3D integration technology to enable complex processing inside each pixel in a 3D integrated manner while maintaining pixel density, (2) analog processing circuit techniques for massively parallel low-energy in-pixel computations, and (3) algorithmic techniques to mitigate non-idealities associated with analog processing through hardware-aware training schemes. This article presents a comprehensive technology-circuit-algorithm landscape that connects technology capabilities, circuit design strategies, and algorithmic optimizations to power, performance, area, bandwidth reduction, and application-level accuracy metrics. We present our results using a comprehensive co-design framework incorporating hardware and algorithmic optimizations for various complex real-life visual intelligence tasks mapped onto our P2M paradigm.
Abstract:Due to the high activation sparsity and use of accumulates (AC) instead of expensive multiply-and-accumulates (MAC), neuromorphic spiking neural networks (SNNs) have emerged as a promising low-power alternative to traditional DNNs for several computer vision (CV) applications. However, most existing SNNs require multiple time steps for acceptable inference accuracy, hindering real-time deployment and increasing spiking activity and, consequently, energy consumption. Recent works proposed direct encoding that directly feeds the analog pixel values in the first layer of the SNN in order to significantly reduce the number of time steps. Although the overhead for the first layer MACs with direct encoding is negligible for deep SNNs and the CV processing is efficient using SNNs, the data transfer between the image sensors and the downstream processing costs significant bandwidth and may dominate the total energy. To mitigate this concern, we propose an in-sensor computing hardware-software co-design framework for SNNs targeting image recognition tasks. Our approach reduces the bandwidth between sensing and processing by 12-96x and the resulting total energy by 2.32x compared to traditional CV processing, with a 3.8% reduction in accuracy on ImageNet.
Abstract:In order to deploy current computer vision (CV) models on resource-constrained low-power devices, recent works have proposed in-sensor and in-pixel computing approaches that try to partly/fully bypass the image signal processor (ISP) and yield significant bandwidth reduction between the image sensor and the CV processing unit by downsampling the activation maps in the initial convolutional neural network (CNN) layers. However, direct inference on the raw images degrades the test accuracy due to the difference in covariance of the raw images captured by the image sensors compared to the ISP-processed images used for training. Moreover, it is difficult to train deep CV models on raw images, because most (if not all) large-scale open-source datasets consist of RGB images. To mitigate this concern, we propose to invert the ISP pipeline, which can convert the RGB images of any dataset to its raw counterparts, and enable model training on raw images. We release the raw version of the COCO dataset, a large-scale benchmark for generic high-level vision tasks. For ISP-less CV systems, training on these raw images result in a 7.1% increase in test accuracy on the visual wake works (VWW) dataset compared to relying on training with traditional ISP-processed RGB datasets. To further improve the accuracy of ISP-less CV models and to increase the energy and bandwidth benefits obtained by in-sensor/in-pixel computing, we propose an energy-efficient form of analog in-pixel demosaicing that may be coupled with in-pixel CNN computations. When evaluated on raw images captured by real sensors from the PASCALRAW dataset, our approach results in a 8.1% increase in mAP. Lastly, we demonstrate a further 20.5% increase in mAP by using a novel application of few-shot learning with thirty shots each for the novel PASCALRAW dataset, constituting 3 classes.
Abstract:Neuromorphic image sensors draw inspiration from the biological retina to implement visual computations in electronic hardware. Gain control in phototransduction and temporal differentiation at the first retinal synapse inspired the first generation of neuromorphic sensors, but processing in downstream retinal circuits, much of which has been discovered in the past decade, has not been implemented in image sensor technology. We present a technology-circuit co-design solution that implements two motion computations occurring at the output of the retina that could have wide applications for vision based decision making in dynamic environments. Our simulations on Globalfoundries 22nm technology node show that, by taking advantage of the recent advances in semiconductor chip stacking technology, the proposed retina-inspired circuits can be fabricated on image sensing platforms in existing semiconductor foundries. Integrated Retinal Functionality in Image Sensors (IRIS) technology could drive advances in machine vision applications that demand robust, high-speed, energy-efficient and low-bandwidth real-time decision making.
Abstract:Today's high resolution, high frame rate cameras in autonomous vehicles generate a large volume of data that needs to be transferred and processed by a downstream processor or machine learning (ML) accelerator to enable intelligent computing tasks, such as multi-object detection and tracking. The massive amount of data transfer incurs significant energy, latency, and bandwidth bottlenecks, which hinders real-time processing. To mitigate this problem, we propose an algorithm-hardware co-design framework called Processing-in-Pixel-in-Memory-based object Detection and Tracking (P2M-DeTrack). P2M-DeTrack is based on a custom faster R-CNN-based model that is distributed partly inside the pixel array (front-end) and partly in a separate FPGA/ASIC (back-end). The proposed front-end in-pixel processing down-samples the input feature maps significantly with judiciously optimized strided convolution and pooling. Compared to a conventional baseline design that transfers frames of RGB pixels to the back-end, the resulting P2M-DeTrack designs reduce the data bandwidth between sensor and back-end by up to 24x. The designs also reduce the sensor and total energy (obtained from in-house circuit simulations at Globalfoundries 22nm technology node) per frame by 5.7x and 1.14x, respectively. Lastly, they reduce the sensing and total frame latency by an estimated 1.7x and 3x, respectively. We evaluate our approach on the multi-object object detection (tracking) task of the large-scale BDD100K dataset and observe only a 0.5% reduction in the mean average precision (0.8% reduction in the identification F1 score) compared to the state-of-the-art.
Abstract:The demand to process vast amounts of data generated from state-of-the-art high resolution cameras has motivated novel energy-efficient on-device AI solutions. Visual data in such cameras are usually captured in the form of analog voltages by a sensor pixel array, and then converted to the digital domain for subsequent AI processing using analog-to-digital converters (ADC). Recent research has tried to take advantage of massively parallel low-power analog/digital computing in the form of near- and in-sensor processing, in which the AI computation is performed partly in the periphery of the pixel array and partly in a separate on-board CPU/accelerator. Unfortunately, high-resolution input images still need to be streamed between the camera and the AI processing unit, frame by frame, causing energy, bandwidth, and security bottlenecks. To mitigate this problem, we propose a novel Processing-in-Pixel-in-memory (P2M) paradigm, that customizes the pixel array by adding support for analog multi-channel, multi-bit convolution, batch normalization, and ReLU (Rectified Linear Units). Our solution includes a holistic algorithm-circuit co-design approach and the resulting P2M paradigm can be used as a drop-in replacement for embedding memory-intensive first few layers of convolutional neural network (CNN) models within foundry-manufacturable CMOS image sensor platforms. Our experimental results indicate that P2M reduces data transfer bandwidth from sensors and analog to digital conversions by ~21x, and the energy-delay product (EDP) incurred in processing a MobileNetV2 model on a TinyML use case for visual wake words dataset (VWW) by up to ~11x compared to standard near-processing or in-sensor implementations, without any significant drop in test accuracy.
Abstract:Hyperspectral cameras generate a large amount of data due to the presence of hundreds of spectral bands as opposed to only three channels (red, green, and blue) in traditional cameras. This requires a significant amount of data transmission between the hyperspectral image sensor and a processor used to classify/detect/track the images, frame by frame, expending high energy and causing bandwidth and security bottlenecks. To mitigate this problem, we propose a form of processing-in-pixel (PIP) that leverages advanced CMOS technologies to enable the pixel array to perform a wide range of complex operations required by the modern convolutional neural networks (CNN) for hyperspectral image recognition (HSI). Consequently, our PIP-optimized custom CNN layers effectively compress the input data, significantly reducing the bandwidth required to transmit the data downstream to the HSI processing unit. This reduces the average energy consumption associated with pixel array of cameras and the CNN processing unit by 25.06x and 3.90x respectively, compared to existing hardware implementations. Our custom models yield average test accuracies within 0.56% of the baseline models for the standard HSI benchmarks.
Abstract:The quest for highly efficient cognitive computing has led to extensive research interest for the field of neuromorphic computing. Neuromorphic computing aims to mimic the behavior of biological neurons and synapses using solid-state devices and circuits. Among various approaches, emerging non-volatile memory technologies are of special interest for mimicking neuro-synaptic behavior. These devices allow the mapping of the rich dynamics of biological neurons and synapses onto their intrinsic device physics. In this letter, we focus on Spike Timing Dependent Plasticity (STDP) behavior of biological synapses and propose a method to implement the STDP behavior in Magnetic Tunnel Junction (MTJ) devices. Specifically, we exploit the time-dependent heat dynamics and the response of an MTJ to the instantaneous temperature to imitate the STDP behavior. Our simulations, based on a macro-spin model for magnetization dynamics, show that, STDP can be imitated in stochastic magnetic tunnel junctions by applying simple voltage waveforms as the spiking response of pre- and post-neurons across an MTJ device.