Alex
Abstract:Spiking Neural Networks (SNNs), with their inherent recurrence, offer an efficient method for processing the asynchronous temporal data generated by Dynamic Vision Sensors (DVS), making them well-suited for event-based vision applications. However, existing SNN accelerators suffer from limitations in adaptability to diverse neuron models, bit precisions and network sizes, inefficient membrane potential (Vmem) handling, and limited sparse optimizations. In response to these challenges, we propose a scalable and reconfigurable digital compute-in-memory (CIM) SNN accelerator \chipname with a set of key features: 1) It uses in-memory computations and reconfigurable operating modes to minimize data movement associated with weight and Vmem data structures while efficiently adapting to different workloads. 2) It supports multiple weight/Vmem bit precision values, enabling a trade-off between accuracy and energy efficiency and enhancing adaptability to diverse application demands. 3) A zero-skipping mechanism for sparse inputs significantly reduces energy usage by leveraging the inherent sparsity of spikes without introducing high overheads for low sparsity. 4) Finally, the asynchronous handshaking mechanism maintains the computational efficiency of the pipeline for variable execution times of different computation units. We fabricated \chipname in 65 nm Taiwan Semiconductor Manufacturing Company (TSMC) low-power (LP) technology. It demonstrates competitive performance (scaled to the same technology node) to other digital SNN accelerators proposed in the recent literature and supports advanced reconfigurability. It achieves up to 5 TOPS/W energy efficiency at 95% input sparsity with 4-bit weights and 7-bit Vmem precision.
Abstract:Supervised deep learning has emerged as an effective tool for carrying out power side-channel attacks on cryptographic implementations. While increasingly-powerful deep learning-based attacks are regularly published, comparatively-little work has gone into using deep learning to defend against these attacks. In this work we propose a technique for identifying which timesteps in a power trace are responsible for leaking a cryptographic key, through an adversarial game between a deep learning-based side-channel attacker which seeks to classify a sensitive variable from the power traces recorded during encryption, and a trainable noise generator which seeks to thwart this attack by introducing a minimal amount of noise into the power traces. We demonstrate on synthetic datasets that our method can outperform existing techniques in the presence of common countermeasures such as Boolean masking and trace desynchronization. Results on real datasets are weak because the technique is highly sensitive to hyperparameters and early-stop point, and we lack a holdout dataset with ground truth knowledge of leaking points for model selection. Nonetheless, we believe our work represents an important first step towards deep side-channel leakage localization without relying on strong assumptions about the implementation or the nature of its leakage. An open-source PyTorch implementation of our experiments is provided.
Abstract:The ability of neural networks to perform robotic perception and control tasks such as depth and optical flow estimation, simultaneous localization and mapping (SLAM), and automatic control has led to their widespread adoption in recent years. Deep Reinforcement Learning has been used extensively in these settings, as it does not have the unsustainable training costs associated with supervised learning. However, DeepRL suffers from poor sample efficiency, i.e., it requires a large number of environmental interactions to converge to an acceptable solution. Modern RL algorithms such as Deep Q Learning and Soft Actor-Critic attempt to remedy this shortcoming but can not provide the explainability required in applications such as autonomous robotics. Humans intuitively understand the long-time-horizon sequential tasks common in robotics. Properly using such intuition can make RL policies more explainable while enhancing their sample efficiency. In this work, we propose SHIRE, a novel framework for encoding human intuition using Probabilistic Graphical Models (PGMs) and using it in the Deep RL training pipeline to enhance sample efficiency. Our framework achieves 25-78% sample efficiency gains across the environments we evaluate at negligible overhead cost. Additionally, by teaching RL agents the encoded elementary behavior, SHIRE enhances policy explainability. A real-world demonstration further highlights the efficacy of policies trained using our framework.
Abstract:In the rapidly evolving field of vision-language navigation (VLN), ensuring robust safety mechanisms remains an open challenge. Control barrier functions (CBFs) are efficient tools which guarantee safety by solving an optimal control problem. In this work, we consider the case of a teleoperated drone in a VLN setting, and add safety features by formulating a novel scene-aware CBF using ego-centric observations obtained through an RGB-D sensor. As a baseline, we implement a vision-language understanding module which uses the contrastive language image pretraining (CLIP) model to query about a user-specified (in natural language) landmark. Using the YOLO (You Only Look Once) object detector, the CLIP model is queried for verifying the cropped landmark, triggering downstream navigation. To improve navigation safety of the baseline, we propose ASMA -- an Adaptive Safety Margin Algorithm -- that crops the drone's depth map for tracking moving object(s) to perform scene-aware CBF evaluation on-the-fly. By identifying potential risky observations from the scene, ASMA enables real-time adaptation to unpredictable environmental conditions, ensuring optimal safety bounds on a VLN-powered drone actions. Using the robot operating system (ROS) middleware on a parrot bebop2 quadrotor in the gazebo environment, ASMA offers 59.4% - 61.8% increase in success rates with insignificant 5.4% - 8.2% increases in trajectory lengths compared to the baseline CBF-less VLN while recovering from unsafe situations.
Abstract:The convergence of fully homomorphic encryption (FHE) and machine learning offers unprecedented opportunities for private inference of sensitive data. FHE enables computation directly on encrypted data, safeguarding the entire machine learning pipeline, including data and model confidentiality. However, existing FHE-based implementations for deep neural networks face significant challenges in computational cost, latency, and scalability, limiting their practical deployment. This paper introduces DCT-CryptoNets, a novel approach that leverages frequency-domain learning to tackle these issues. Our method operates directly in the frequency domain, utilizing the discrete cosine transform (DCT) commonly employed in JPEG compression. This approach is inherently compatible with remote computing services, where images are usually transmitted and stored in compressed formats. DCT-CryptoNets reduces the computational burden of homomorphic operations by focusing on perceptually relevant low-frequency components. This is demonstrated by substantial latency reduction of up to 5.3$\times$ compared to prior work on image classification tasks, including a novel demonstration of ImageNet inference within 2.5 hours, down from 12.5 hours compared to prior work on equivalent compute resources. Moreover, DCT-CryptoNets improves the reliability of encrypted accuracy by reducing variability (e.g., from $\pm$2.5\% to $\pm$1.0\% on ImageNet). This study demonstrates a promising avenue for achieving efficient and practical privacy-preserving deep learning on high resolution images seen in real-world applications.
Abstract:In memory computing (IMC) architectures for deep learning (DL) accelerators leverage energy-efficient and highly parallel matrix vector multiplication (MVM) operations, implemented directly in memory arrays. Such IMC designs have been explored based on CMOS as well as emerging non-volatile memory (NVM) technologies like RRAM. IMC architectures generally involve a large number of cores consisting of memory arrays, storing the trained weights of the DL model. Peripheral units like DACs and ADCs are also used for applying inputs and reading out the output values. Recently reported designs reveal that the ADCs required for reading out the MVM results, consume more than 85% of the total compute power and also dominate the area, thereby eschewing the benefits of the IMC scheme. Mitigation of imperfections in the ADCs, namely, non-linearity and variations, incur significant design overheads, due to dedicated calibration units. In this work we present peripheral aware design of IMC cores, to mitigate such overheads. It involves incorporating the non-idealities of ADCs in the training of the DL models, along with that of the memory units. The proposed approach applies equally well to both current mode as well as charge mode MVM operations demonstrated in recent years., and can significantly simplify the design of mixed-signal IMC units.
Abstract:Large language models (LLMs) represent a groundbreaking advancement in the domain of natural language processing due to their impressive reasoning abilities. Recently, there has been considerable interest in increasing the context lengths for these models to enhance their applicability to complex tasks. However, at long context lengths and large batch sizes, the key-value (KV) cache, which stores the attention keys and values, emerges as the new bottleneck in memory usage during inference. To address this, we propose Eigen Attention, which performs the attention operation in a low-rank space, thereby reducing the KV cache memory overhead. Our proposed approach is orthogonal to existing KV cache compression techniques and can be used synergistically with them. Through extensive experiments over OPT, MPT, and Llama model families, we demonstrate that Eigen Attention results in up to 40% reduction in KV cache sizes and up to 60% reduction in attention operation latency with minimal drop in performance.
Abstract:This paper introduces a Scalable Hierarchical Aware Convolutional Neural Network (SHA-CNN) model architecture for Edge AI applications. The proposed hierarchical CNN model is meticulously crafted to strike a balance between computational efficiency and accuracy, addressing the challenges posed by resource-constrained edge devices. SHA-CNN demonstrates its efficacy by achieving accuracy comparable to state-of-the-art hierarchical models while outperforming baseline models in accuracy metrics. The key innovation lies in the model's hierarchical awareness, enabling it to discern and prioritize relevant features at multiple levels of abstraction. The proposed architecture classifies data in a hierarchical manner, facilitating a nuanced understanding of complex features within the datasets. Moreover, SHA-CNN exhibits a remarkable capacity for scalability, allowing for the seamless incorporation of new classes. This flexibility is particularly advantageous in dynamic environments where the model needs to adapt to evolving datasets and accommodate additional classes without the need for extensive retraining. Testing has been conducted on the PYNQ Z2 FPGA board to validate the proposed model. The results achieved an accuracy of 99.34%, 83.35%, and 63.66% for MNIST, CIFAR-10, and CIFAR-100 datasets, respectively. For CIFAR-100, our proposed architecture performs hierarchical classification with 10% reduced computation while compromising only 0.7% accuracy with the state-of-the-art. The adaptability of SHA-CNN to FPGA architecture underscores its potential for deployment in edge devices, where computational resources are limited. The SHA-CNN framework thus emerges as a promising advancement in the intersection of hierarchical CNNs, scalability, and FPGA-based Edge AI.
Abstract:Generative AI, especially via Large Language Models (LLMs), has transformed content creation across text, images, and music, showcasing capabilities in following instructions through prompting, largely facilitated by instruction tuning. Instruction tuning is a supervised fine-tuning method where LLMs are trained on datasets formatted with specific tasks and corresponding instructions. This method systematically enhances the model's ability to comprehend and execute the provided directives. Despite these advancements, LLMs still face challenges in consistently interpreting complex, multi-step instructions and generalizing them to novel tasks, which are essential for broader applicability in real-world scenarios. This article explores why neurosymbolic AI offers a better path to enhance the instructability of LLMs. We explore the use a symbolic task planner to decompose high-level instructions into structured tasks, a neural semantic parser to ground these tasks into executable actions, and a neuro-symbolic executor to implement these actions while dynamically maintaining an explicit representation of state. We also seek to show that neurosymbolic approach enhances the reliability and context-awareness of task execution, enabling LLMs to dynamically interpret and respond to a wider range of instructional contexts with greater precision and flexibility.
Abstract:In this article, we propose a novel standalone hybrid Spiking-Convolutional Neural Network (SC-NN) model and test on using image inpainting tasks. Our approach uses the unique capabilities of SNNs, such as event-based computation and temporal processing, along with the strong representation learning abilities of CNNs, to generate high-quality inpainted images. The model is trained on a custom dataset specifically designed for image inpainting, where missing regions are created using masks. The hybrid model consists of SNNConv2d layers and traditional CNN layers. The SNNConv2d layers implement the leaky integrate-and-fire (LIF) neuron model, capturing spiking behavior, while the CNN layers capture spatial features. In this study, a mean squared error (MSE) loss function demonstrates the training process, where a training loss value of 0.015, indicates accurate performance on the training set and the model achieved a validation loss value as low as 0.0017 on the testing set. Furthermore, extensive experimental results demonstrate state-of-the-art performance, showcasing the potential of integrating temporal dynamics and feature extraction in a single network for image inpainting.